Seek freedom and become captive of your desires, seek discipline and find your liberty. — Frank Herbert, Dune
“Negative freedom is freedom from constraint, that is, permission to do things; Positive freedom is empowerment, that is, ability to do things... Negative and positive freedoms, it might seem, are two different descriptions of the same thing. No! Life is not so simple. There is reason to think that constraints (prohibitions, if you like) can actually help people to do things better. Constraints can enhance ability...” - Angus Sibley, “Two Kinds of Freedom”
“...filesystem people should aim to make “badly written” code “just work” unless people are really really unlucky. Because like it or not, that’s what 99% of all code is... Crying that it’s an application bug is like crying over the speed of light: you should deal with *reality*, not what you wish reality was.” - Linus Torvalds, on a slightly different topic (but I like the sentiment)
Years ago I thought the lack of restrictions were a sign of simple and clean design to be held up as a badge of honor compared to more limited operating systems. Now that I am responsible for production shell scripts I am a firm supporter of your view that filenames should be UTF-8 with no control characters. Other troublesome filenames you pointed out such as those with leading and trailing spaces and leading hyphens should probably be prohibited too. - Doug Quale, email dated 2016-10-04
Traditionally, Unix/Linux/POSIX pathnames and filenames can be almost any sequence of bytes. A pathname lets you select a particular file, and may include zero or more “/” characters. Each pathname component (separated by “/”) is a filename; filenames cannot contain “/”. Neither filenames nor pathnames can contain the ASCII NUL character (\0), because that is the terminator.
This lack of limitations is flexible, but it also creates a legion of unnecessary problems. In particular, this lack of limitations makes it unnecessarily difficult to write correct programs (enabling many security flaws). It also makes it impossible to consistently and accurately display filenames, causes portability problems, and confuses users.
This article will try to convince you that adding some tiny limitations on legal Unix/Linux/POSIX filenames would be an improvement. Many programs already presume these limitations, the POSIX standard already permits such limitations, and many Unix/Linux filesystems already embed such limitations, so it’d be better to make these (reasonable) assumptions true in the first place. This article will discuss, in particular, the three biggest problems: control characters in filenames (including newline, tab, and escape), leading dashes in filenames, and the lack of a standard character encoding scheme (instead of using UTF-8). These three problems impact programs written in any language on any Unix/Linux/POSIX system.

There are other problems, of course. Spaces in filenames can cause problems; it’s probably hopeless to ban them outright, but resolving some of the other issues will simplify handling spaces in filenames. For example, when using a Bourne shell, you can use an IFS trick (using IFS=`printf '\n\t'`) to eliminate some problems with spaces. Similarly, special metacharacters in filenames cause some problems; I suspect few if any metacharacters could be forbidden on all POSIX systems, but it’d be great if administrators could locally configure systems so that they could prevent or escape such filenames when they want to. I then discuss some other tricks that can help.
After limiting filenames slightly, creating completely-correct programs is much easier, and some vulnerabilities in existing programs disappear. This article then notes some others’ opinions; I knew that some people wouldn’t agree with me, but I’m heartened that many do agree that something should be done. Finally, I briefly discuss some methods for solving this long-term; these include forbidding creation of such names (hiding them if they already exist on the underlying filesystem), implementing escaping mechanisms, or changing how tools work so that these are no longer problems (e.g., when globbing/scanning, have the libraries prefix “./” to any filename beginning with “-”). Solving this is not easy, and I suspect that several solutions will be needed. In fact, this paper became long over time because I kept finding new problems that needed explaining (new “worms under the rocks”). If I’ve convinced you that this needs improving, I’d like your help in figuring out how to best do it!
Filename problems affect programs written in any programming language. However, they can be especially tricky to deal with when using Bourne shells (including bash and dash). If you just want to write shell programs that can handle filenames correctly, you should see the short companion article “Filenames and Pathnames in Shell: How to do it correctly”.
That said, I'll note that the POSIX standard committee has, at least so far, decided to not implement such restrictions. They have decided to permit bytes 1 through 31 inclusive, including newline, in filenames, and instead have decided to add a few mechanisms to make it easier to handle such filenames (such as find -print0, xargs -0, and read -d ""). These POSIX additions were accepted in 2023 and are included in the 2024 release of POSIX. This is only a partial measure, but it is an improvement.
Imagine that you don’t know Unix/Linux/POSIX (I presume you really do), and that you’re trying to do some simple tasks. For our purposes we will primarily show simple scripts on the command line (using a Bourne shell) for these tasks. However, many of the underlying problems affect any program, as we'll show by demonstrating the same problems in Python3.
For example, let’s try to print out the contents of all files in the current directory, putting the contents into a file in the parent directory:
cat * > ../collection # WRONG
The list doesn’t include “hidden” files (filenames beginning with “.”), but often that’s what you want anyway, so that’s not unreasonable. The problem with this approach is that although this usually works, filenames could begin with “-” (e.g., “-n”). So if there’s a file named “-n”, and you’re using GNU cat, all of a sudden your output will be numbered! Oops; that means on every command we have to disable option processing.
Some earlier readers thought that this was a shell-specific problem, even though I repeatedly said otherwise. Their “solution” was to use another language like Python... except the problem doesn't go away. Let's write the same thing in Python3:
#!/usr/bin/env python3
# WRONG
import subprocess, os
subprocess.run(['cat'] + os.listdir('.'), stdout=open('../collection', 'w'))

Exactly the same problem happens in Python3 and in any other language: if there is a filename beginning with “-”, the receiving program will typically see that as an option flag (not a file) and mishandle it. Notice that this invocation of subprocess.run does not use a shell (there are options like shell=True that would do that, but we aren't using any of them). So the illusion that “this is just a shell problem” is proven false. It's true that you would not normally run cat from within Python, but it's also rare to run cat from a shell. Instead, cat is here as a trivial demo showing that safely invoking other programs is harder than it should be. Programs written in any language often do need to invoke other programs... and here we see the danger of doing so.
The “obvious” way to resolve this problem is to litter command invocations with “--” before the filename(s). You will find many people recommending this. But it turns out that this doesn’t really work, because not all commands support “--” (ugh!). For example, the widely-used “echo” command is not required to support “--”. What’s worse, echo does support at least one dash option, so we need to escape leading-dash values somehow. POSIX recommends that you use printf(1) instead of echo(1), but some old systems do not include printf(1). Many other programs that handle options do not understand “--” either, so this is not a robust solution.
In my opinion, a much better solution is to prefix globs like this with “./”. In other words, you should do this instead:
cat ./* > ../collection # CORRECT
Prefixing relative globs with “./” always solves the “leading dash” problem, but it sure isn’t obvious. In fact, many shell books and guides completely omit this information, or don’t explain it until far later in the book (which many people never read). Even people who know this will occasionally forget to do it. After all, people tend to do things the “easy way” that seems to work, resulting in millions of programs that have subtle bugs (which sometimes lead to exploits). Complaining that people must rewrite all of their programs to use a non-obvious (and ugly) construct is unrealistic. Most people who write cat * do not intend for the filenames to be used as command options (as noted in The Unix-haters Handbook, page 27).
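Programs that build argument lists themselves, rather than relying on shell globbing, can apply the same “./” trick in code. The following is a minimal Python3 sketch, not a definitive implementation; the helper names dash_safe and cat_directory are hypothetical (they are not from any library), and the sketch assumes the names are relative filenames such as those returned by os.listdir.

```python
import os
import subprocess

def dash_safe(names):
    """Return names with "./" prepended to any that begin with "-".

    Hypothetical helper: a name such as "-n" becomes "./-n", which a
    command treats as a file operand rather than as an option.
    """
    return ["./" + n if n.startswith("-") else n for n in names]

def cat_directory(directory, out_path):
    """Concatenate the regular, non-hidden files of directory into out_path.

    out_path is opened by this process, so a relative out_path is
    relative to the caller's current directory, not to directory.
    """
    names = sorted(
        n for n in os.listdir(directory)
        if not n.startswith(".") and os.path.isfile(os.path.join(directory, n))
    )
    if not names:
        return  # cat with no operands would read standard input instead
    with open(out_path, "wb") as out:
        # cwd=directory keeps the names relative, so the "./" prefix is valid.
        subprocess.run(["cat"] + dash_safe(names), cwd=directory,
                       stdout=out, check=True)
```

With this sketch, cat_directory(".", "../collection") would behave roughly like the corrected shell command above, even if a file named “-n” exists.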
In many cases globbing isn’t what we want. We probably don’t want the “cat *” command to examine directories, and glob patterns like “*” won’t recursively descend into subdirectories either. It is often the case that we want to handle a large collection of files spread across directories, and we may want to record information about those files (such as their names) for processing later.
The primary tool for walking POSIX filesystems in shell is the “find” command, and many languages have a built-in library to recursively walk directories. In theory, we could just replace the “*” with something that computes the list of such file names (which will also include the hidden files):
cat `find . -type f` > ../collection # WRONG
This construct doesn’t fail because of leading dashes; find always prefixes filenames with the starting directory, so all of the filenames in this example will start with “./”. This construct does have trouble with scale: if the list is really long, you risk an “argument list too long” error, and even if it works, the system has to build up a complete list all at once (which is slow and resource-consuming if the list is long). Even if the list of files is short, this construct has many other problems. One problem (among several!) is that if filenames can contain spaces, their names will be split (file “a b” will be incorrectly parsed as two files, “a” and “b”).
Okay, so let’s use a “for” loop, which is better at scaling up to large sets of files and complicated processing of the results. When using shell you need to use set -f to deal with filenames containing glob characters (like asterisk), but you can do that. Problem is, the “obvious” for loop won’t work either, for the same reason; it breaks up filenames that contain spaces, newlines or tabs:
( set -f ; for file in `find . -type f` ; do # WRONG
    cat "$file"
  done ) > ../collection

How about using find with a “while read” loop? Let’s try this:
( find . -type f | # WRONG
  while read filename ; do cat "$filename" ; done ) > ../collection

This is widely used, but still wrong. It works if a filename has spaces in the middle, but it won’t work correctly if the filename begins or ends with whitespace (they will get chopped off). Also, if a filename includes “\”, it’ll get corrupted; in particular, if it ends in “\”, it will be combined with the next filename (trashing both). Okay, maybe that’s just a perversity of the defaults of shell’s “read”, but there are other problems as we’ll see in a moment.
Now at this point, some of you may suggest using xargs, like this:
( find . -type f | xargs cat ) > ../collection # WRONG, WAY WRONG
Yet this is wrong on many levels. By default, xargs’ input is parsed, so space characters (as well as newlines) separate arguments, and the backslash, apostrophe, double-quote, and ampersand characters are used for quoting. According to the POSIX standard, underscore may have a special meaning (it will stop processing) if you omit the -E option, too! So even though this “simple” use of xargs works on some filenames, it fails on many characters that are allowed in filenames. The xargs quoting convention isn’t even consistent with the shell. Using xargs while limiting yourself to the POSIX standard is an exercise in pain, if you are trying to create actually-correct programs, because it requires substitutions to work around xargs quoting.
So let’s “fix” handling filenames with spaces by combining find (which can output filenames a line at a time) with a “while” loop (using read -r and IFS), a “for” loop, xargs with quoting and -E, or xargs using a non-standard GNU extension “-d” (the extension makes xargs more useful):
# WRONG:
( find . -type f |
  while IFS="" read -r filename ; do cat "$filename" ; done ) > ../collection
# OR WRONG:
# Split filenames only on newline, not space or tab. (A literal newline is
# used here because command substitution strips trailing newlines, so
# `printf '\n'` inside backquotes would yield an empty IFS.)
IFS='
'
( for filename in `find . -type f` ; do
    cat "$filename"
  done ) > ../collection
# OR WRONG, yet portable; space/backslash/apostrophe/quotes ok in filenames:
( find . -type f | sed -e 's/[^[:alnum:]]/\\&/g' | xargs -E "" cat ) > ../collection
# OR WRONG _and_ NON-STANDARD (uses a GNU extension):
( find . -type f | xargs -d "\n" cat ) > ../collection
Whups, all four of these don’t work correctly either. All of these create a list of filenames, with each filename terminated by a newline (just like the previous version of “while”). But filenames can include newlines!
Handling filenames with all possible characters (including newlines) can be hard to do portably. You can use find ... -exec ... {}, which is portable, but this gets ugly fast if the command being executed is nontrivial. It can also be slow, because this has to start a new process for every file, and the new process cannot trivially set a variable that can be used afterwards (the variable value disappears when the process goes away). POSIX has more recently extended find so that find -exec ... {} + (plus-at-end) creates sets of filenames that are passed to other programs (similar to how xargs works); this is faster, but it still creates new processes, making tracking-while-processing very inconvenient. I believe that some versions of find have not yet implemented this more recent addition, which is another negative to using it (but it is standard so I expect that problem to go away over time). In any case, both of these forms get ugly fast if what you’re exec-ing is nontrivial:
# These are CORRECT but have many downsides:
( find . -type f -exec cat {} \; ) > ../collection
# OR
( find . -type f -exec cat {} + ) > ../collection

Is this a problem just for shell? Not at all. Other languages do have libraries for safely walking directory structures, and typically they handle this correctly... but that is not the only situation. It's quite common to want to make a list of files that are stored somewhere, typically in a file, for reuse later. This is commonly done by storing a list of filenames where each name is terminated by a newline. Why? Because lots of tools easily handle that format, and it is the "obvious" thing to do. For example, here's how you might do this (incorrectly) in Python3:
#!/usr/bin/env python3
# WRONG
with open('filelist.txt') as fl:
    for filename in fl:
        # do something with filename, e.g., open it
        pass

You can use options to separate filenames with \0 instead; this has been widely implemented for many years, and has been formally blessed by the 2024 edition of POSIX for a few cases:
# Simple approach, find ... xargs, POSIX 2024 compliant:
( find . -type f -print0 | xargs -0 cat ) > ../collection
# Using find and a shell loop, for more complex situations,
# this is supported by POSIX 2024:
find . -print0 |
while IFS="" read -r -d "" file ; do
  ... # Use "$file" not $file everywhere.
done
Using \0 as a filename separator definitely helps, but it requires that you use such options. The option names to use this convention (when available) are jarringly inconsistent (perl has -0, while GNU tools have sort -z, find -print0, xargs -0, and grep either -Z or --null). POSIX 2024 did formally add such support in a few cases, but not all. POSIX 2024 supports find, xargs, and read... but lacks support for this format in other tools like grep and sort. This format is also more difficult to view and modify (in part because fewer tools support it), compared to the line-at-a-time format that is widely supported. You can’t even pass such null-separated lists back to the shell via command substitution; cat `find . -print0` and similar “for” loops don’t work.
The problem hits other languages, too. Many applications, regardless of their implementation language, store information using one filename per line (with an unencoded filename) because so many tools support that format. The only problem is that it's wrong when newlines can occur in filenames.
This is silly; processing lines of text files is well-supported, and filenames are an extremely common data value, but you can’t easily combine these constructs?
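For programs that must store a list of arbitrary filenames today, the \0 convention can be used directly. The following is a minimal Python3 sketch under stated assumptions: the function names encode_name_list and decode_name_list are hypothetical, and names are handled as bytes (for example, as produced by os.fsencode) so that no character encoding is assumed.

```python
def encode_name_list(names):
    """Serialize filenames (bytes) with each name terminated by a NUL byte.

    This matches the layout produced by "find -print0".
    """
    out = bytearray()
    for name in names:
        if b"\0" in name:
            raise ValueError("a filename cannot contain NUL")
        out += name + b"\0"
    return bytes(out)

def decode_name_list(data):
    """Split NUL-terminated data back into a list of filenames (bytes)."""
    parts = data.split(b"\0")
    # Every name ends with NUL, so the piece after the final NUL is empty.
    if parts and parts[-1] == b"":
        parts.pop()
    return parts
```

Because NUL is the one byte that can never occur in a pathname, names containing newlines, leading spaces, or backslashes survive the round trip unchanged. The cost is the one described above: the result is no longer a plain line-oriented text file.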
Oh, and don’t display filenames. Filenames could contain control characters that control the terminal (and X-windows), causing nasty side-effects on display. Displaying filenames can even cause a security vulnerability, and who expects printing a filename to be a vulnerability?!? In addition, you have no way of knowing for certain what the filename’s character encoding is, so if you got a filename from someone else who uses non-ASCII characters, you’re likely to end up with garbage mojibake.
Again, this is not just a shell issue. Merely displaying filenames in any language can be dangerous, and there is no guarantee that the encoding of the filename is the same as the encoding used by standard output. So this is an example of an incorrect and potentially dangerous Python3 program:
#!/usr/bin/env python3
# WRONG - control characters and encoding issue
import os
for filename in os.listdir('.'):
    print(filename)

Ugh, lots of annoying problems, caused not because we don’t have enough flexibility, but because we have too much. Many documents describe the complicated mechanisms that can be used to deal with this problem, such as BashFAQ’s discussion on handling newlines in filenames. Many of the suggestions posted on the web are wrong; for example, many people recommend the incorrect while read line as the correct solution. In fact, I found that the BashFAQ’s 2009-03-29 entry didn’t walk files correctly either (one of their examples used for file in *.mp3; do mv "$file" ..., but this fails if a filename begins with “-”; yes, I fixed it). If the “obvious” approaches to common tasks don’t work correctly, and require complicated mechanisms instead, I think there is a problem.
In a well-designed system, simple things should be simple, and the “obvious easy” way to do simple common tasks should be the correct way. I call this goal “no sharp edges”; to use an analogy, if you’re designing a wrench, don’t put razor blades on the handles. Typical Unix/Linux filesystems fail this test: they do have sharp edges. Because it’s hard to do things the “right” way, many Unix/Linux programs simply assume that “filenames are reasonable”, even though the system doesn’t guarantee that this is true. This leads to programs with occasional errors that are sometimes hard to solve.
In some cases, these errors can even be security vulnerabilities. My “Secure Programming for Linux and Unix HOWTO” has a section dedicated to vulnerabilities caused by filenames. Similarly, CERT’s “Secure Coding” item MSC09-C (Character Encoding: Use Subset of ASCII for Safety) specifically discusses the vulnerabilities due to filenames. The Common Weakness Enumeration (CWE) includes 3 weaknesses related to this (CWE 78, CWE 73, and CWE 116), all of which are in the 2009 CWE/SANS Top 25 Most Dangerous Programming Errors. Vulnerability CVE-2011-1155 (logrotate) and CVE-2013-7085 (uscan in devscripts, which allowed remote attackers to delete arbitrary files via a whitespace character in a filename) are a few examples of the many vulnerabilities that can be triggered by malicious filenames.
These types of vulnerabilities occasionally get rediscovered, too. For example, Leon Juranic released in 2014 an essay titled “Back to the Future: Unix Wildcards Gone Wild”, which demonstrates some of the problems that can be caused because filenames can begin with a hyphen (which are then expanded by wildcards). I am really glad that Juranic is making more people aware of the problem! However, this is not new information; these types of vulnerabilities have been known for decades. Kucan comments on this, noting that this particular vulnerability can be countered by always beginning wildcards with “./”. This is true, and for many years I have recommended prefixing globs with “./”. I still recommend it as part of a solution that works today. However, we’ve been trying to teach people to do this for decades, and the teaching is not working. People do things the easy way, even if it creates vulnerabilities.
It would be better if the system actually did guarantee that filenames were reasonable; then already-written programs would be correct. For example, if you could guarantee that filenames don’t include control characters and don’t start with “-”, the following script patterns would always work correctly:
#!/bin/sh
# CORRECT if files can't contain control chars and can't start with "-":
set -eu # Always put this in Bourne shell scripts
IFS="`printf '\n\t'`" # Always put this in Bourne shell scripts

# This presumes filenames can't include control characters:
for file in `find .` ; do
  ... command "$file" ...
done

# This presumes filenames can't begin with "-":
for file in * ; do
  ... command "$file" ...
done

# You can print filenames if they're always UTF-8 & can't inc. control chars
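The three guarantees assumed above (no control characters, no leading “-”, and valid UTF-8) are easy to state precisely in code. The following Python3 sketch is illustrative only; the function name filename_problems and its diagnostic strings are hypothetical, and it takes a single filename (not a whole pathname) as bytes.

```python
def filename_problems(name):
    """Return a list of reasons why name (bytes) is not a "reasonable" filename.

    The rules checked are the three discussed in this article; an empty
    list means the name passes all of them.
    """
    problems = []
    # Control characters: byte values 1 through 31, plus 127 (DEL).
    # (Byte 0 cannot occur in a filename, but is flagged too if present.)
    if any(b < 0x20 or b == 0x7F for b in name):
        problems.append("control character")
    if name.startswith(b"-"):
        problems.append("leading dash")
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems
```

A script could run such a check over os.listdir(os.fsencode(directory)) to find names that would break the simple shell patterns shown above.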
I comment on a number of problems that filenames cause the Bourne shell, specifically, because anything that causes problems with Bourne shell scripts interferes with use of Unix/Linux systems. The Bourne shell is not going away; it is built into POSIX, it is directly used by nearly every Unix-like system for starting it up, and most GNU/Linux users use Bourne shells for interactive command line use. What’s more, the leading contender, C shells (csh), are loathed by many (for an explanation, see “Csh Programming Considered Harmful” by Tom Christiansen).

Now, it’s true that some issues are innate to the Bourne shell, and cannot be fixed by limiting filenames. The Bourne shell is actually a nice programming language for what it is for, but as noted by Bourne himself, its design requirements led to compromises that can sometimes be irksome. In particular, in most cases Bourne shell scripts will still need to double-quote variable references, even if filenames are limited to more reasonable values. For those who don’t know, when using a variable value, you usually need to write "$file" and not $file in Bourne shells (due to quirks in the language that make it easy to use interactively). You don’t need to double-quote values in certain cases (e.g., if they can only contain letters and digits), but those are special cases. Since variables can store information other than filenames, many Bourne shell programmers get into the habit of adding double-quotes around all variables anyway unless they want a special effect, and that effectively resolves the issue. But as shown above, that’s not the only issue; it can be difficult to handle all filenames correctly in the Bourne shell even when you use double-quotes correctly.
Filename problems tend to happen in any language; they are not specific to any particular language. For example, if a filename begins with “-”, and another command is invoked with that filename as its parameter, that command will see an option flag... no matter what computer languages are being used. Similarly, it’s more awkward to pass lists of filenames between programs in different languages when newlines can be part of the filename. Practically every language gracefully handles line-at-a-time processing; it’d be nice to be able to easily use that with filenames.
The problem of awkward filenames is so bad that there are programs like detox and Glindra that try to fix “bad” filenames. The POSIX standard includes pathchk; this lets you determine that a filename is bad. But the real problem is that bad filenames were allowed in the first place and aren’t prevented or escaped by the system; cleaning them up later is a second-best approach.
Lots of programs presume “bad” filenames can’t happen, and fail to handle them. For example, many programs fail to handle filenames with newlines in them, because it’s harder to write programs that handle such filenames correctly. In several cases, developers have specifically stated that there’s no point in supporting such filenames! For example:
There are a few programs that do try to handle all cases. According to user proski, “One of the reasons git replaced many shell scripts with C code was support for weird file names. C is better at handling them. In absence of such issues, many commands would have remained shell scripts, which are easier to improve”. But such exceptions prove the rule: many developers would not be willing to re-write working programs, in a different language, just to handle bad filenames.
Failure to handle “bad” filenames can lead to mysterious failures and even security problems... but only if they can happen at all. If “bad” filenames can’t occur, the problems they cause go away too!
The POSIX standard defines what a “portable filename” is; this definition implies that many filenames are not portable and thus do not need to be supported by POSIX systems. For all the details, see the Austin Common Standards Revision Group web page. To oversimplify, the POSIX.1-2008 specification is simultaneously released as both The Open Group’s Base Specifications Issue 7 and IEEE Std 1003.1(TM)-2008. I’ll emphasize the Open Group’s version, since it is available at no charge via the Internet (good job!!). Its “base definitions” document section 4.7 (“Filename Portability”) says:
For a filename to be portable across implementations conforming to POSIX.1-2008, it shall consist only of the portable filename character set as defined in Portable Filename Character Set.
Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments.
I then examined the Portable Filename Character Set, defined in 3.276 (“Portable Filename Character Set”); this turns out to be just A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen> (aka the dash character). So it’s perfectly okay for a POSIX system to reject a non-portable filename due to it having “odd” characters or a leading hyphen.
In fact, the POSIX.1-2008 spec includes a standard shell program called “pathchk”, which can be used to determine if a proposed pathname (filename) is portable. Its “-p” option writes a diagnostic if the pathname is too long (more than {_POSIX_PATH_MAX} bytes or contains any component longer than {_POSIX_NAME_MAX} bytes), or contains any character that is not in the portable filename character set. Its “-P” option writes a diagnostic if the pathname is empty or contains a component beginning with a hyphen. GNU, and many others, include pathchk. (My thanks to Ralph Corderoy for reminding me of pathchk.) So not only does the POSIX standard note that some filenames aren’t portable... it even specifically includes tools to help identify bad filenames (such as ones that include control characters or have a leading hyphen in a component).
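The checks described above for pathchk’s “-p” and “-P” options can be approximated in a few lines. This is only a rough Python3 sketch that combines both options into one function, not a reimplementation of pathchk; the name check_portable and the diagnostic wording are hypothetical, and the limits 256 and 14 are, to my understanding, the values POSIX assigns to {_POSIX_PATH_MAX} and {_POSIX_NAME_MAX}.

```python
import string

# The Portable Filename Character Set: A-Z, a-z, 0-9, period, underscore, hyphen.
PORTABLE_CHARS = frozenset(string.ascii_letters + string.digits + "._-")
POSIX_PATH_MAX = 256  # assumed value of {_POSIX_PATH_MAX}
POSIX_NAME_MAX = 14   # assumed value of {_POSIX_NAME_MAX}

def check_portable(pathname):
    """Return a list of diagnostics; an empty list means no problem was found."""
    diags = []
    if pathname == "":
        diags.append("empty pathname")
    if len(pathname.encode("utf-8")) > POSIX_PATH_MAX:
        diags.append("pathname too long")
    for component in pathname.split("/"):
        if component == "":
            continue  # produced by a leading, trailing, or doubled slash
        if len(component.encode("utf-8")) > POSIX_NAME_MAX:
            diags.append(f"component too long: {component!r}")
        if not set(component) <= PORTABLE_CHARS:
            diags.append(f"non-portable character in: {component!r}")
        if component.startswith("-"):
            diags.append(f"leading hyphen in: {component!r}")
    return diags
```

For instance, check_portable("src/main.c") reports nothing, while a name containing a space or a component such as “-n” produces a diagnostic.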
Indeed, existing POSIX systems already reject some filenames. A common reason is that many POSIX systems mount local or remote filesystems that have additional rules, e.g., for Microsoft Windows. Wikipedia’s entry on Filenames reports on these rules in more detail. For example, the Microsoft Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) in filenames, so any such filenames can’t be shared with Windows users, and they’re not supposed to be stored on their filesystems. I wrote some code and found that the Linux msdos module (which supports one of the Windows filesystems) already rejects some “bad” filenames, returning the EINVAL error message instead.
The Plan 9 operating system was developed by many Unix luminaries; its filenames can only contain printable characters (that is, any character outside hexadecimal 00-1F and 80-9F) and cannot include either slash or blank (per intro(5)). Tom Duff explains why Plan 9 filenames with spaces are a pain for many reasons, in particular, that they mess up scripts. Duff said, “When I was working on the plan 9 shell, I did a survey of all the file names on all the unix machines that I could conveniently look at, and discovered, unsurprisingly, that characters other than letters, digits, underscore, minus, plus and dot were so little used that forbidding them would not impact any important use of the system. Obviously people stick to those characters to avoid colliding with the shell’s syntax characters. I suggested (or at least considered) formalizing the restriction, specifically to make file names easier to find by programs like awk. Probably rob took the more liberal road of forbidding del, space and controls, the first because it is particularly hard to type, and the rest because, as Russ noted, they confound the usual line- and field-breaking rules.”
So some application developers already assume that filenames aren’t “unreasonable”, the existing standard (POSIX) already permits operating systems to reject certain kinds of filenames, and existing POSIX and POSIX-like systems already reject certain filenames in some circumstances. In that case, what kinds of limitations could we add to filenames that would help users and software developers?
First: Why the heck are the ASCII control characters (byte values 1 through 31, as well as 127) permitted in filenames? The point of filenames is to create human-readable names for collections of information, but since these characters aren’t readable, the whole point of having filenames is lost. There’s no advantage to keeping these as legal characters, and the problems are legion: they can’t be reasonably displayed, many are troublesome to enter (especially in GUIs!), and they cause nothing but nasty side-effects. They also cause portability problems, since filesystems for Microsoft Windows can’t contain bytes 1 through 31 anyway.
One of the nastiest permitted control characters is the newline character. Many programs work a line-at-a-time, with a filename as the content or part of the content; this is great, except it fails when a newline can be in the filename. Many programs simply ignore the problem, and presume that there are no newlines in filenames. But this creates a subtle bug, possibly even a vulnerability; it’d be better to make the no-newline assumption true in the first place! I know of no program that legitimately requires the ability to insert newlines in a filename. Indeed, it’s not hard to find comments like “ban newlines in filenames”. GNU’s “find” and “xargs” make it possible to work around this by inserting byte 0 between each filename... but few other programs support this convention (even “ls” normally doesn’t, and most shells cannot do word-splitting on \0). Using byte 0 as the separator is a pain to use anyway; who wants to read the intermediate output of this? Even if the only character that is forbidden is newline, that would still help. For example, if newlines can’t happen in filenames, you can use a standard (POSIX) feature of xargs (which disables various quoting problems of xargs by escaping each character with a backslash) (lwn forgot the -E option, which I have added):
find . -type f | sed -e 's/./\\&/g' | xargs -E "" somecommand
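A program that generates input for xargs itself can apply the same escaping that the sed command performs. The Python3 sketch below is an illustration under stated assumptions rather than a tested recipe for every xargs implementation: the function name escape_for_xargs is hypothetical, filenames are taken as str, and names containing a newline are rejected because this line-oriented format cannot represent them.

```python
def escape_for_xargs(filenames):
    """Return text for xargs' standard input, one escaped filename per line.

    Mirrors sed -e 's/./\\&/g': every character gets a backslash in
    front of it, so xargs treats blanks and quote characters literally.
    """
    lines = []
    for name in filenames:
        if "\n" in name:
            raise ValueError(f"newline in filename: {name!r}")
        lines.append("".join("\\" + ch for ch in name))
    return "".join(line + "\n" for line in lines)
```

The result would be written to the standard input of a command such as xargs -E "" somecommand. The explicit newline check makes the hidden assumption visible, which the shell pipeline cannot do.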
The “tab” character is another control character that makes no sense; if tabs are never in filenames, then it’s a great character to use as a “column separator” for multi-column data output, especially since many programs already use this convention. But the tab character isn’t safe to use (easily) if it can be part of a filename.
Some control characters, particularly the escape (ESC) character, can causeall sorts of display problems, including security problems.Terminals (like xterm, gnome-terminal, the Linux console, etc.) implementcontrol sequences.Most software developers don’t understand that merelydisplayingfilenames can cause security problems if they can contain control characters.The GNU ls program tries to protect users from this effect by default(see the -N option), but many people display filenameswithout getting filtered by ls — and the problem returns.H. D. Moore’s “Terminal Emulator Security Issues” (2003)summarizes some of the security issues;modern terminal emulators try to disable the most dangerous ones, butthey can still cause trouble.A filename with embedded control characters can (when displayed) causefunction keys to be renamed, set X atoms, change displays in misleadingways, and so on.To counter this, some programs modify control characters(such as find and ls) — making it even harder to correctlyhandle files with such names.
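A script that prints filenames to a terminal can filter them itself rather than hoping ls is in the path of the data. This is only a sketch under stated assumptions: safe_display is a hypothetical helper name, and the choice of “?” as the replacement simply mirrors what many tools show for unprintable bytes.

```shell
#!/bin/sh
# safe_display NAME (hypothetical helper): print NAME with every ASCII
# control byte replaced by '?', so a terminal never sees ESC sequences.
safe_display() {
  # '[?*]' is the POSIX tr notation for "repeat ? as often as needed".
  printf '%s' "$1" | LC_ALL=C tr '\001-\037\177' '[?*]'
  printf '\n'
}

# Usage: the ESC byte that would start a color sequence is neutralized.
safe_display "$(printf 'a\033[31mb')"
```

The filtered name is for display only; the original bytes are still needed to open the file, which is exactly the extra bookkeeping that forbidding control characters would make unnecessary.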
In any case, filenames with control characters aren’t portable. POSIX.1-2008 doesn’t include control characters in the “portable filename character set”, implying that such filenames aren’t portable per the POSIX standard. Wikipedia’s entry on Filenames notes that the Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F), so any such filenames can’t be shared with Windows users, and they’re not supposed to be stored on their filesystems.
A few people noted that they used the filesystem as a keystore, and found it handy to use filenames as arbitrary-value keys. That’s fine, but filesystems already impose naming limitations; you can’t use \0 in them, and you can’t use ‘/’ as a key value in the same way, even on a traditional Unix filesystem. And as noted above, many filesystems impose more restrictions anyway. So even people who use the filesystem as a keystore, with arbitrary key values, must do some kind of encoding of filenames. Since you have to encode anyway, you can use an encoding that is easier to work with and less likely to cause subtle problems... like one that forbids control characters. Many programs, like git, use the filesystem as a keystore yet do not require control characters in filenames.
In contrast, if control characters are forbidden when created and/or escaped when returned, you can safely use control characters like TAB and NEWLINE as filename separators, and the security risks of displaying unfiltered control characters in filenames go away. As noted above, software developers make these assumptions anyway; it’d be great if it were safe to do so.
The “leading dash” (aka leading hyphen) problem is an ancient problem in Unix/Linux/POSIX. This is another example of the general problem that overly-flexible filenames interact badly with other system components (particularly option flags and shell scripts).
The Unix-haters handbook page 27 (PDF page 67) notes problems these decisions cause: “By convention, programs accept their options as their first argument, usually preceded by a dash... Finally, Unix filenames can contain most characters, including nonprinting ones. This is flaw #3. These architectural choices interact badly. The shell lists files alphabetically when expanding “*” [and] the dash (-) comes first in the lexicographic caste system. Therefore, filenames that begin with a dash (-) appear first when “*” is used. These filenames become options to the invoked program, yielding unpredictable, surprising, and dangerous behavior... [e.g., “rm *” will expand filenames beginning with dash, and use those as options to rm]... We’ve known several people who have made a typo while renaming a file that resulted in a filename that began with a dash: “% mv file1 -file2” Now just try to name it back... Doesn’t it seem a little crazy that a filename beginning with a hyphen, especially when that dash is the result of a wildcard match, is treated as an option list?” Indeed, people repeatedly ask how to ignore leading dashes in filenames; yes, you can prepend “./”, but why do you need to know this at all?
Similarly, in 1991 Larry Wall (of perl fame) stated: “Just don’t create a file called -rf. :-)” in a discussion about the difficulties in handling filenames well.
The list of problems that “leading dash filenames” create is seemingly endless. You can’t safely run “cat *”, because there might be a file with a leading dash; if there’s a file named “-n”, then suddenly all the output is numbered if you use GNU cat. Not all programs support the “--” convention, so you can’t simply say “precede all command lists with --”, and in any case, people forget to do this in real life. Even the POSIX folks, who are experts, make mistakes due to leading dashes; bug 192 identifies a case where examples in POSIX failed to operate correctly when filenames begin with dash.
You could prefix the name or glob with “./”, e.g., “cat ./*”. Prefixing the filename is a good solution, but people often don’t know or forget to do this. The result: many programs break (or are vulnerable) when filenames have components beginning with dash. Users of “find” get this prefixing essentially for free, but then they get troubled by newlines, tabs, and spaces in filenames (as discussed elsewhere).
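To make the “./” prefix concrete, here is a small demonstration in a scratch directory. It is a sketch, assuming mktemp -d is available (it is common but not required by POSIX); the file names are made up for the example.

```shell
#!/bin/sh
# Demonstration in a scratch directory (created here, removed at the end).
dir=$(mktemp -d) || exit 1
printf 'hello\n' > "$dir/-n"
printf 'world\n' > "$dir/plain"

# Inside $dir, "cat *" would expand to "cat -n plain", so cat would see
# -n as an option. With the "./" prefix every expanded word starts with
# "./" and can only be taken as a filename.
out=$(cd "$dir" && cat ./*)

rm -rf "$dir"
printf '%s\n' "$out"
```

Both files are read as files; nothing is interpreted as an option, and no “--” is needed.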
POSIX.1-2008’s “base definitions” document section 4.7 (“Filename Portability”) specifically says “Portable filenames shall not have the <hyphen> character as the first character since this may cause problems when filenames are passed as command line arguments”. So filenames with leading hyphens are already specifically identified as non-portable in the POSIX standard.
There’s no reason that a filesystem must permit filenames to begin with a dash. If such filenames were forbidden, then writing safe shell scripts would be much simpler: if a parameter begins with a “-”, then it’s an option and there is no other possibility.
If the filesystem must include filenames with leading dashes, one alternative would be to modify underlying tools and libraries so that whenever globbing or directory scanning is done, they prepend “./” to any filename beginning with “-”. This would be done by glob(3), scandir(3), readdir(3), and shells that implement globbing themselves. Then, “cat *” would become “cat ./-n” if “-n” was in the directory. This would be a silent change that would quietly cause bad code to work correctly. There are reasons to be wary of these kinds of hacks, but if these kinds of filenames must exist, it would at least reduce their trouble. I will say more about solutions later in this paper.
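The rewriting rule itself is tiny. As an illustration only (this is a userspace sketch, not a patch to glob(3) or any shell, and dash_safe is a hypothetical name), the transformation such tools would apply looks like this:

```shell
#!/bin/sh
# dash_safe NAME (hypothetical helper): print NAME unchanged unless it
# begins with "-", in which case print it with "./" prepended.
dash_safe() {
  case "$1" in
    -*) printf './%s\n' "$1" ;;
    *)  printf '%s\n' "$1" ;;
  esac
}

dash_safe "-n"       # becomes ./-n
dash_safe "notes"    # left alone
```

Because “./-n” and “-n” name the same file, the rewrite changes how the name is parsed by commands without changing which file is meant.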
With today’s march towards globalization, computers must support the sharing of information using many different languages. Given that, it’s crazy that there’s no standard encoding for filenames across all Unix/Linux/POSIX systems. At the beginnings of Unix, everyone assumed that filenames could only be English text, but that hasn’t been true for a long time. Yet because you can’t know the character encoding of a given filename, in theory you can’t display filenames at all today. Why? Because then you don’t know how to translate the bytes of a filename into displayable characters (!). This is true for GUIs, and even for the command line. Yet you must be able to display filenames, so you need to make some determination... and it will be wrong.
The traditional POSIX approach is to use environment variables that declare the filename character encoding (such as LC_ALL, LC_CTYPE, LC_COLLATE, and LANG). But as soon as you start working with other people (say, by receiving a tarball or sharing a filesystem), the single environment variable approach fails. That’s because the single-environment-variable approach assumes that the entire filesystem uses the same encoding (as specified in the environment variable), but once there’s file sharing, different parts of the filesystem can use different encoding systems. Should you interpret the bytes in a filename as ISO-8859-1? One of the other ISO-8859-* encodings? KOI8-* (for Cyrillic)? EUC-JP or Shift-JIS (both popular in Japan)? In short, this is too flexible! Since people routinely share information around the world, this incompatibility is awful. The Austin Group even had a discussion about this in 2009. This failure to standardize the encoding leads to confusion, which can lead to mistakes and even vulnerabilities.
Yet this flexibility is actually not flexible enough, because the current filesystem requirements don’t permit arbitrary encodings. If you want to store arbitrary international text, you need to use Unicode/ISO-10646. But the other common encodings of Unicode/ISO-10646 (UTF-16 and UTF-32) must be able to store byte 0; since you can’t use byte 0 in a filename, they don’t work at all. The filesystem is also not flexible in another way: there’s no mechanism to find out what encoding is used on a given filesystem. If one person uses ISO-8859-1 for a given filename, there’s no obvious way to find out what encoding they used. In theory, you could store the encoding system with the filename, and then use multiple system calls to find out what encoding was used for each name... but really, who needs that kind of complexity?!?
If you want to store arbitrary language characters in filenames using today’s Unix/Linux/POSIX filesystem, the only widely-used answer that “simply works” for all languages is UTF-8. Wikipedia’s UTF-8 entry and Markus Kuhn’s UTF-8 and Unicode FAQ have more information about UTF-8. UTF-8 was developed by Unix luminaries Ken Thompson and Rob Pike, specifically to support arbitrary language characters on Unix-like systems, and it’s widely acknowledged to have a great design.
When filenames are sent to and from the kernel using UTF-8, then all languages are supported, and there are no encoding interoperability problems. Any other approach would require nonstandard additions, like adding some sort of “character encoding” value with the filesystem, which would then require user programs to examine and use this encoding value. And they won’t. Users and software developers don’t need more complexity; they want less. If people simply agreed that “all filenames will be sent in/out of the kernel in UTF-8 format”, then all programs would work correctly. In particular, programs could simply retrieve a filename and print it, knowing that the filename is in UTF-8. (Other encodings like UTF-7 and punycode do exist. But these are designed for cases where you can’t have byte values more than 127, which is not true for Unix/Linux/POSIX filesystems. Which is why people do not use them for filesystems.) Plan 9 already did this, and showed that you could do this on a POSIX-like system. The IETF specifically mandates that all protocol text must support UTF-8, while all other encodings are optional.
Another advantage of UTF-8 filenames is that they are very robust. The chance of a random 4-byte sequence of bytes being valid UTF-8, and not pure ASCII, is only 0.026%, and the chances drop even further as more bytes are added. Thus, systems that use UTF-8 filenames will almost certainly detect when someone tries to import non-ASCII filenames that use the “wrong” encoding, eliminating filename mojibake.
UTF-8 is already supported by practically everything. Some filesystems store filenames in other formats, but at least on Linux, all of them have mount options to translate in/out of UTF-8 for userspace. In fact, some filesystems require a specific encoding on-disk for filenames, but to do this correctly, the kernel has to know which encoding is being used for the data sent in and out (e.g., with iocharset). But not all filesystems can do this conversion, and how do you find out which options are used where?!? Again, the simple answer is “use UTF-8 everywhere”.
There’s also another reason to use UTF-8 in filenames: normalization. Some symbols have more than one Unicode representation (e.g., a character might be followed by accent 1 then accent 2, or by accent 2 then accent 1). They’d look the same, but they would be considered different when compared byte-for-byte, and there’s more than one normalization system (programs written for Linux normally use NFC, as recommended by the W3C, but Darwin and MacOS X normally use NFD). If you have a filename in a non-Unicode encoding, then it’s ambiguous how you “should” translate these to Unicode, making simple questions like “is this file already there” tricky. But if you store the name as UTF-8 encoded Unicode, then there’s no trouble; you can just use the filename using whatever normalization convention was used when the file was created (presuming that the on-disk representation also uses some Unicode encoding).
To be fair, what I’m proposing here doesn’t solve some other Unicode issues. Many characters in Unicode look identical to each other, and in many cases there’s more than one way to represent a given character. But these problems already exist, and they don’t go away if the status quo continues. If we agreed that the userspace filename API was always in UTF-8, we’d at least have won half the battle.
Andrew Tridgell, Samba’s lead developer, has identified yet another reason to use UTF-8: case handling. Efficiently implementing Windows’ filesystem semantics, where uppercase and lowercase are considered identical, requires that you be able to know what is “uppercase” and what is “lowercase”. This is only practical if you know what the filename encoding is in the first place. (Granted, total upper and lower case handling is in theory locale-specific, but there are ways to address that sensibly that handle the cases people care about... and that’s outside the scope of this article.) Again, a single character encoding system for all filenames, from the application point of view, is almost required to make this efficient.
User “epa” on LWN notes that Python 3 “got tripped up by filenames that are not valid UTF-8”. Python 3 moved to a very clean system where there are “string” types that handle internationalized text and “bytes” that contain arbitrary data. You would think that filenames would be string types, but currently POSIX filenames are really just binary blobs! Python 3’s “what’s new” discusses what they had to do in trying to paper this over, but as epa says, this situation interferes with implementing filenames “as Unicode strings [to] cleanly allow international characters”. Eventually, Python 3.1 implemented the more-complicated PEP 383 proposal, specifically to address the problem that some “character” interfaces (like filenames) don’t provide only characters. In PEP 383, on POSIX systems, “Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions... To convert non-decodable bytes, a new error handler “surrogateescape” is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables”.
The result is that many applications end up being far more complicated than necessary to deal with the lack of an encoding standard. Python PEP 383 bluntly states that the Unix/Linux/POSIX lack of enforced encoding is a design error: “Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data [and not arbitrary bytes]”. Zooko O’Whielacronx posted some comments on Python PEP 383 relating to the Tahoe project. He commented separately to me that “Tahoe could simplify its design and avoid costly storage of ‘which encoding was allegedly used’ next to *every* filename if we instead required utf-8b for all filenames on Linux.” (Sidebar: Tahoe is an interesting project; here is Zooko smashing a laptop with an axe as part of his Tahoe presentation.)
Converting existing systems or filesystems to UTF-8 isn’t that painful either. The program “convmv” can do mass conversions of filenames into UTF-8. This program was designed to be “very handy when one wants to switch over from old 8-bit locales to UTF-8 locales”. It’s taken years to get some programs converted to support UTF-8, but nowadays almost all modern POSIX systems support UTF-8.
Again, let’s look at the POSIX.1-2008 spec. Its “Portable Filename Character Set” (defined in 3.276) is only A-Z, a-z, 0-9, <period>, <underscore>, and <hyphen>. Note that this is a very restrictive list; few international speakers would accept this limited list, since it would mean they must only use English filenames. That’s ridiculous; most computer users don’t even know English. So why is this standard so restrictive? That’s because there’s no standard encoding; since you don’t know if a filename is UTF-8 or something else, there’s no way to portably share filenames with non-English characters. If we did agree that UTF-8 encoding is used, the set of portable characters could include all languages. In other words, the lack of a standard creates arbitrary and unreasonable limitations.
Linux distributions are already moving towards storing filenames in UTF-8, for this very reason. Fedora’s packaging guidelines require that “filenames that contain non-ASCII characters must be encoded as UTF-8. Since there’s no way to note which encoding the filename is in, using the same encoding for all filenames is the best way to ensure users can read the filenames properly.” OpenSuSE 9.1 has already switched to using UTF-8 as the default system character set (“lang_LANG.UTF-8”). Ubuntu recommends using UTF-8, saying “A good rule is to choose utf-8 locales”, and provides a UTF-8 migration tool as part of its UTF-8 by default feature.
Filename permissiveness is not just a command-line problem. It’s actually worse for the GUIs, because if filenames can truly be anything, then GUIs have no way to actually display filenames. The major POSIX GUI suites GNOME and KDE have already moved towards UTF-8 as the required filename encoding format:
The GUI toolkit Qt (the basis of KDE), since Qt 4, has “removed the hacks they had in QString to allow malformed Unicode data in its QString constructor. What this means is that the old trick of just reading a filename from the OS and making a QString out of it is impossible in general since there are filenames which are not valid ASCII, Latin-1, or UTF-8. Qt does provide a way to convert from the ‘local 8-bit’ filename-encoding to and from QString, but this depends on there being one, and only one, defined filename-encoding (unless the application wishes to roll its own conversion). This has effectively caused KDE to mandate users use UTF-8 for filenames if they want them to show up in the file manager, be able to be passed around on DBus interfaces, etc.”
NFSv4 requires that all filenames be exchanged using UTF-8 over the wire. The NFSv4 specification, RFC 3530, says that filenames should be UTF-8 encoded in section 1.4.3: “In a slight departure, file and directory names are encoded with UTF-8 to deal with the basics of internationalization.” The same text is also found in the newer NFS 4.1 RFC (RFC 5661) section 1.7.3. The current Linux NFS client simply passes filenames straight through, without any conversion from the current locale to and from UTF-8. Using non-UTF-8 filenames could be a real problem on a system using a remote NFSv4 system; any NFS server that follows the NFS specification is supposed to reject non-UTF-8 filenames. So if you want to ensure that your files can actually be stored from a Linux client to an NFS server, you must currently use UTF-8 filenames. In other words, although some people think that Linux doesn’t force a particular character encoding on filenames, in practice it already requires UTF-8 encoding for filenames in certain cases.
UTF-8 is a longer-term approach. Systems have to support UTF-8 as well as the many older encodings, giving people time to switch to UTF-8. To use “UTF-8 everywhere”, all tools need to be updated to support UTF-8. Years ago, this was a big problem, but as of 2011 this is essentially a solved problem, and I think the trajectory is very clear for those few trailing systems.
Not all byte sequences are legal UTF-8, and you don’t want to have to figure out how to display them. If the kernel enforces these restrictions, ensuring that only UTF-8 filenames are allowed, then there’s no problem... all the filenames will be legal UTF-8. Markus Kuhn’s utf8_check C function can quickly determine if a sequence is valid UTF-8.
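From a shell script, a rough stand-in for such a check is to ask iconv to convert the name from UTF-8 to UTF-8 and look at its exit status. This is a sketch, not Kuhn’s function: is_valid_utf8 is a hypothetical helper name, and it assumes an iconv that is installed and exits non-zero on an invalid or incomplete sequence (GNU iconv behaves this way).

```shell
#!/bin/sh
# is_valid_utf8 NAME (hypothetical helper): succeed if the bytes of NAME
# form valid UTF-8. Assumes iconv fails on malformed input.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

good=$(printf 'caf\303\251')          # "café" encoded as UTF-8
bad=$(printf 'caf\351 au lait')       # same word in ISO-8859-1: 0xE9 alone

is_valid_utf8 "$good" && echo "good name accepted"
is_valid_utf8 "$bad"  || echo "bad name rejected"
```

The ISO-8859-1 name is rejected because 0xE9 announces a multi-byte sequence that never arrives, which is the robustness property that lets UTF-8 systems notice names in the “wrong” encoding.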
The filesystem should require that filenames meet some standard, not because of some evil need to control people, but simply so that the names can always be displayed correctly at a later time. The lack of standards makes things harder for users, not easier. Yet the filesystem doesn’t force filenames to be UTF-8, so it can easily have garbage.
We have a good solution that is already in wide use: UTF-8.So let’s use it!
It’d be easier and cleaner to write fully-correct shell scripts if filenames couldn’t include any kind of whitespace. There’s no reason anyone needs tab or newline in filenames, as noted above, so that leaves us with the space character.
There are a lot of existing Unix/Linux shell scripts that presume there are no space characters in filenames. Many RPM spec files’ shell scripts make this assumption, for example (this can be enforced in their constrained environment, but not in general). Spaces in filenames are particularly a problem because the default setting of the Bourne shell “IFS” variable (which determines how substitution results are split up) includes space as a delimiter. This means that, by default, invoking “find” via `...` or $(...) will fail to handle filenames with spaces (they will break single filenames into multiple filenames at the spaces). Any variable use with a space-containing filename will be split or corrupted if the programmer forgets to surround it with double-quotes (unquoted variable uses can also cause trouble if the filename contains newline, tab, “*”, “?”, or “]”, but these are less common than filenames with spaces). Reading filenames using read will also fail (by default) if a filename begins or ends with a space. Many programs, like xargs, also split on spaces by default. The result: lots of Unix/Linux/POSIX programs don’t work correctly on filenames with spaces.
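The splitting is easy to see by counting how many arguments a command actually receives. A minimal sketch, assuming IFS still has its default value; count_words is a hypothetical helper that just reports its argument count.

```shell
#!/bin/sh
# count_words (hypothetical helper): report how many arguments arrived.
count_words() { echo "$#"; }

name="my file.txt"
unquoted=$(count_words $name)      # unquoted: split at the space
quoted=$(count_words "$name")      # double quotes suppress the splitting
echo "unquoted=$unquoted quoted=$quoted"
```

With the default IFS the unquoted use delivers two arguments (“my” and “file.txt”), while the quoted use delivers the single intended filename.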
In some dedicated-use systems, you could enforce a “no spaces” rule; this would make some common programming errors no longer an error, slightly reducing the risk of security vulnerabilities. From a functional viewpoint, other characters like “_” could be used instead of space. As noted above, some operating systems like Plan 9 expressly forbid spaces in filenames, so there is even some precedent for having an operating system forbid spaces in filenames.
Unfortunately, a lot of people do have filenames with embedded spaces (spaces that are not at the beginning or end of a filename), so a “no spaces” rule would be hard to enforce in general. In particular, you essentially cannot handle typical Windows and MacOS filenames without handling filenames with an embedded space, because many filenames from those systems use the space character. So if you exchange files with them (via archives, shared storage, and so on), a “no spaces” rule is often impractical. Windows’ equivalent of “/usr/bin” is “\Program Files”, and Windows’ historical equivalent of “/home” is “\Documents and Settings”, so you must deal with embedded spaces if you deal directly with Windows’ primary filesystem from a POSIX system. (Windows Vista and later use “\Users” instead of the awful default “\Documents and Settings”, copying the more sensible Unix approach of using short names without spaces, but the problem still remains overall.) (To be fair, Windows has other problems too. Windows internally passes arguments as an unstructured string, making escaping and its complications necessary.)
However, there are variations that might be more palatable to many: “no leading spaces” and/or “no trailing spaces”. Such filenames are a lot of trouble, especially filenames with trailing spaces; these often confuse users (especially GUI users).
If leading spaces, trailing spaces, newline, and tab can’t be in filenames, then a Bourne shell construct already in common use actually becomes correct. A “while” loop using read -r file works for filenames if spaces are always between other characters, but by default it subtly fails when filenames have leading or trailing spaces (because space is by default part of the IFS). But if leading spaces, trailing spaces, newline, and tab cannot occur in filenames, the following works all the time with the default value of IFS:
# CORRECT IF filenames can't include leading/trailing space, newline, tab,
# even though IFS is left as its default value
find . -print |
while read -r file ; do
  command "$file" ...
done
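The subtle failure is easy to reproduce. The sketch below feeds one made-up name with leading and trailing spaces to read twice: once with the default IFS, and once with IFS emptied for that single read, which is the usual way today’s scripts preserve such names.

```shell
#!/bin/sh
name='  padded name  '

# Default IFS: read strips the leading and trailing spaces.
stripped=$(printf '%s\n' "$name" | { read -r got; printf '[%s]' "$got"; })

# IFS emptied for this one command: the line is kept byte for byte.
kept=$(printf '%s\n' "$name" | { IFS= read -r got; printf '[%s]' "$got"; })

printf '%s\n%s\n' "$stripped" "$kept"
```

The first result is “[padded name]” and the second is “[  padded name  ]”; if leading and trailing spaces were forbidden, the two forms would always agree and the simpler loop would be correct.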
There are a few arguments that leading spaces should be accepted. barryn informs me that “There is a use for leading spaces: They force files to appear earlier than usual in a lexicographic sort. (For instance, a program might create a menu at run time in lexicographic order based on the contents of a directory, or you may want to force a file to appear near the beginning of a listing.) This is especially common in the Mac world....”. They are even used by some people with Mac OS X.
But it’s hard to argue that trailing spaces are useful. Trailing spaces are worse than leading ones; in many user interfaces, a leading space will at least cause a visible indent, but there’s no indication at all of trailing spaces... leading to rampant confusion. I understand that in Microsoft Windows (or at least some of its key components), the space (and the period) are not allowed as the final character of a filename. So preventing a space as a final character improves portability, and is rather unlikely to be required for interoperability.
If trailing spaces are forbidden, then filenames with only spaces in them become forbidden as well. And that’s a good thing; filenames with only spaces in them are really confusing to users. Years ago my co-workers set up a directory full of filenames with only spaces in them, briefly stumping our Sun representative.
So banning trailing spaces in a component might be a plausible broad rule. It’s not as important as getting rid of newlines in filenames, but it’s worth considering, because it would get rid of some confusion. Banning both leading and trailing spaces is also plausible; doing so would make while read -r correct in Bourne shell scripts.
James K. Lowden proposed an interesting alternative for spaces: “Spaces could be transparently handled (no pun intended) with U+00A0, a non-breaking space, which in fact it is. Really. If the system is presented with a filename containing U+0020, it could just replace it unilaterally with the non-breaking space [Unicode U+00A0, represented in UTF-8 by the hex sequence 0xC2 0xA0]. Permanently, no questions asked.”
This idea is interesting, because by default Bourne shells only break on U+0020, so they would consider the filename as one long unbreakable string. Filenames really aren’t intended to be broken up, so that’s actually a defensible representation. He claims “For most purposes, that will be just fine. GUIs won’t mind. Shells won’t mind; most scripts will be happier.”
He does note that constructs like
if [ "$name" = "my nice name" ]
will fail, but he and I suspect that such code is rare. He says, “scripts won’t typically contain hard-coded comparisons to filenames with spaces”.
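Both effects, the name staying in one piece and the hard-coded comparison failing, can be tried out in userspace. This is only a simulation of the idea, not the filesystem change Lowden describes; to_nbsp and count_words are hypothetical helper names, and the sketch assumes the default IFS.

```shell
#!/bin/sh
nbsp=$(printf '\302\240')     # UTF-8 bytes for U+00A0, written in octal

# to_nbsp NAME (hypothetical helper): swap each U+0020 for U+00A0.
to_nbsp() {
  printf '%s\n' "$1" | sed "s/ /$nbsp/g"
}

# count_words (hypothetical helper): report how many arguments arrived.
count_words() { echo "$#"; }

name=$(to_nbsp "my nice name")
count_words $name     # unquoted on purpose: still one word

if [ "$name" = "my nice name" ]; then
  echo "comparison matched"
else
  echo "comparison failed"
fi
```

The unquoted expansion stays a single word because U+00A0 is not in IFS, and the comparison against the literal with ordinary spaces fails, exactly as he warns.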
I’m guessing that the filesystem would internally always store spaces, but the API would always get unbreakable spaces. This could cause problems if other systems stored filenames in directories which only differed between the use of unbreakable spaces and regular spaces, but users would generally think that’s pretty evil in the first place.
I’m not sure how I feel about this one idea, but it’s certainly an interesting approach that’s worth thinking about. One reason I hesitate is that if other things are fixed, the difficulties of handling spaces in filenames diminish anyway, as I’ll explain next.
One reader of this essay suggested that GUIs should transparently convert spaces to underscores when creating a file, reversing this when displaying a filename. It’s an interesting idea. However, I fear that some evil person will create multiple files in one directory which only differ because one uses spaces and the other uses underscores. That might look okay, but would create opportunity for confusion in the future. Thus, I haven’t recommended this approach.
Having spaces in filenames is no disaster, though, particularly if other problems are fixed.
First, it’s worth noting that many “obvious” shell programs already work correctly, today, even if filenames have spaces and you make no special settings. For example, glob expansions like “cat ./*” work correctly, even if some filenames have spaces, because file glob expansion occurs after splitting (more about this in a moment). The POSIX specification specifically requires this, and this is implemented correctly by lots of shells (I’ve checked bash, dash, zsh, ksh, and even busybox’s shell). The find command’s “-exec” option can work with arbitrary filenames (even ones with control characters), though I find that if the exec command gets long, the script starts to get very confusing:
# This is straightforward:
find . -type f -exec somecommand {} \;
# As these get long, I scream (example from "explodingferret"):
find . -type f -exec sh -c 'if true; then somecommand "$1"; fi' -- {} \;
Once newlines and tabs cannot happen in filenames, programs can safely use newlines and tabs as delimiters between filenames. Having safe delimiters makes spaces in filenames much easier to handle. In particular, programs can then safely do what many already do: they can use programs like ‘find’ to create a list of filenames (one per line), and then process the filenames a line at a time.
However, if we stopped here, spaces in filenames still cause problems for Bourne shell scripts. If you invoke programs like find via command substitution, such as “for file in `find .`”, then by default the shell will break up filenames on the spaces, corrupting the results. This is one of the reasons that many shell scripts don’t handle spaces-in-files correctly. Yet the “obvious” way to process files is to create a loop through the results of a command substitution with find! We can make it much easier to write correct shell scripts by using a poorly-documented trick.
Writers of (Bourne-like) shell scripts can use an additional trick to make spaces-in-filenames easier to handle, as long as newlines and tabs can’t be in filenames. The trick: set the “IFS” variable to be just newline and tab.
IFS (the “input field separator”) is an ancient, very standard, but not well-known capability of Bourne shells. After almost all substitutions, including command substitution `...` and variable substitution ${...}, the characters in IFS are used to split up any substitution results into multiple values (unless the results are inside double-quotes). Normally, IFS is set to space, tab, and newline, which means that by default, after almost all substitutions, spaces are interpreted as separating the substituted values into different values. This default IFS setting is very bad if file lists are produced through substitutions like command substitution and variable substitution, because filenames with spaces will get split into multiple filenames at the spaces (oops!). And processing filenames is really common.
Changing the IFS variable to include only newline and tab makes lists of filenames much easier to deal with, because then filenames with spaces are trivially handled. Once you set IFS this way, instead of having to create a “while read...” loop, you can place a `...` file-listing command in the “usual place” of a file list, and filenames with spaces will then work correctly. And if filenames can’t include tabs and newlines, you can correctly handle all filenames.
A quick clarification, if you’re not familiar with IFS: even when the space character is removed from IFS, you can still use space in shell scripts as a separator in commands or the ‘in’ part of for loops. IFS only affects the splitting of unquoted values that are substituted by the shell. So you can still do this, even when IFS doesn’t include space:
for name in one two three ; do echo "$name" done
I recommend using this portable constructnear the beginning of your (Bourne-like) shell scripts:
IFS="`printf '\n\t'`"
If you have a really old system that doesn’t include the POSIX-required printf(1), you could use this instead (my thanks to Ralph Corderoy for pointing out this issue, though I’ve tweaked his solution somewhat):
IFS="`echo nt | tr nt '\012\011'`"
It’s quite plausible to imagine that in the future, the standard “prologue” of a shell script would be:
#!/bin/sh
set -eu
IFS="`printf '\n\t'`"
An older version of this paper suggested setting IFS to tab followed by newline. Unfortunately, it can be slightly awkward to set IFS to just tab and newline, in that order, using only standard POSIX shell capabilities. The problem is that when you do command substitution in the shell with `...` or $(...), trailing newline characters are removed before the result is used (see POSIX shell & utilities, section 2.6.3). Removing trailing newlines is almost always what you want, but not if the last character you wanted is newline. You can also include a newline in a variable by starting a quote and inserting a newline directly, but this is easy to screw up; any other white space could be silently inserted there, text-transformation tools might insert \r\n at the end, and people might “help” by indenting your code and quietly ruining it. There’s also the problem that the POSIX standard’s “echo” is almost featureless, but you can just use “printf” instead. In an older version of this paper I suggested doing IFS="`printf '\t\nX'`" ; IFS="${IFS%X}". However, on LWN.net, Explodingferret pointed out a much better portable approach: just reverse their order. This doesn’t have exactly the same result as my original approach (parameters are now joined by newline instead of tab when they are joined), but I think it’s actually slightly better, and it’s definitely simpler. I thought his actual code was harder to read, so I tweaked it (as shown above) to make it clearer.
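The trailing-newline stripping is easy to see by measuring the length of each result. This is just a small sketch with hypothetical variable names (lost, kept, reversed), not part of the original text:

```shell
# Command substitution strips trailing newlines, so the newline is lost:
lost=$(printf '\t\n')
lost_len=${#lost}            # only the tab survives

# The sentinel trick: end with X, then remove the X afterwards.
kept=$(printf '\t\nX') ; kept=${kept%X}
kept_len=${#kept}            # tab and newline both survive

# Reversing the order avoids the problem: the newline is no longer last.
reversed=$(printf '\n\t')
reversed_len=${#reversed}    # newline and tab both survive
```

This is why putting the newline first is the simpler choice: there is no trailing newline to strip.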
A slightly more pleasant approach in Bourne-like shells is to use the $'...' extension. This isn’t standard, but it’s widely supported, including by the bash, ksh (korn shell), and zsh shells. In these shells you can just say IFS=$'\n\t' and you’re done. As the korn shell documentation says, the purpose of $'...' is to ‘solve the problem of entering special characters in scripts [using] ANSI-C rules to translate the string... It would have been cleaner to have all “...” strings handle ANSI-C escapes, but that would not be backwards compatible.’ It might even be more efficient; some shells might implement `printf ...` by invoking a separate process, which would have nontrivial overhead (though shells can optimize this away, since printf is typically a builtin). But this $'...' extension isn’t supported by some Bourne-like shells, including dash (the default /bin/sh in Ubuntu) and the busybox shell, and the portable version isn’t too bad. I’d like to see $'...' added to a future POSIX standard and these other shells, as it’s a widely implemented and useful extension. I think $'...' will be in the next version of the POSIX specification (you can blame me for proposing it).
If filenames can’t include newline or tab, and IFS is set to just newline and tab, you can safely do this kind of thing to correctly handle all filenames:
for file in `find . -type f` ; do
  some_command "$file" ...
done
This for loop is a better construct for file-at-a-time processing than the while read -r file construct listed earlier. This for loop isn’t in a separate subprocess, so you can set variables inside the loop and have their values persist outside the loop. The for loop has direct, easy access to standard input (the while loop uses standard input for the list of filenames). It’s shorter and easier to understand, and it’s less likely to go wrong (it’s easy to forget the “-r” option to read).
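Here is a small self-contained sketch of the “variables persist” point. It is an illustration under assumptions: it uses mktemp -d (widely available, but not required by POSIX) to build a scratch directory, and the names tmpdir and for_count are hypothetical.

```shell
# Build a scratch directory with two files, one with a space in its name.
# ASSUMPTION: mktemp -d is available (common, but not in POSIX).
tmpdir=$(mktemp -d)
: > "$tmpdir/plain.txt"
: > "$tmpdir/with space.txt"

IFS=$(printf '\n\t')

# The loop runs in the current shell, so for_count is still set afterwards.
for_count=0
for file in $(find "$tmpdir" -type f) ; do
  for_count=$((for_count + 1))
done

rm -rf "$tmpdir"
```

After the loop, for_count holds the number of files found, and the name with a space was counted once, not twice. A “find ... | while read -r file” pipeline in many shells would have incremented a copy of the variable in a subshell instead.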
Some people like to build up a sequence of options and filenames in a variable, using the space character as the separator, and then call a program later with all the options and filenames built up. That general approach still works, but if the space character is not in IFS, then you can’t easily use it as a separator. Nor should you: if filenames can contain spaces, then you must not use the space as a separator. The solution is trivial; just use newlines or tabs as the separator instead. The usual shell tricks still apply (for example, if variable x leads with separators, then $x without quotes will cause the variable to get split using IFS and the leading separators will be thrown away). This is easiest to show by example:
# DO NOT DO THIS when the space character is NOT part of IFS:
x="-option1 -option2 filename1"
x="$x filename2" # Build up $x
run_command $x
# Do this instead:
t=`printf "\t"` # Newline is tricky to portably set; use tab as separator
x="-option1${t}-option2${t}filename1"
x="$x${t}filename2" # Build up $x.
run_command $x
# Or do this (do NOT give printf a leading dash, that's not portable):
x=`printf "%s\n%s\n%s" "-option1" "-option2" "filename1"`
x=`printf "%s\n%s" "$x" "filename2"` # Build up $x.
run_command $x
Do not use plain “read” in Bourne shells; use “read -r”. This is true regardless of the IFS setting. The problem is that plain “read” treats a backslash as an escape character: it removes the backslash, and a backslash at the end of a line merges that line with the next one, unless you undo that with “-r”. Notice that once you remove space from IFS, read stops corrupting filenames with spaces, but you still need to use the -r option with read to correctly handle backslash.
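A quick sketch of the “read” versus “read -r” difference, using a hypothetical name that contains a backslash (the variable names line, plain, and raw are mine, for illustration only):

```shell
# Hypothetical filename containing a backslash.
line='dir\name'

# Plain read treats the backslash as an escape and drops it:
plain=$(printf '%s\n' "$line" | { read value ; printf '%s' "$value" ; })

# read -r keeps the line exactly as written:
raw=$(printf '%s\n' "$line" | { read -r value ; printf '%s' "$value" ; })
```

Here plain ends up as dirname (the backslash silently vanished), while raw still contains the original dir\name.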
Of course, there are times when it’s handy to have IFS set to a different value, including its traditional default value. One solution is straightforward: Set IFS to the value you need, when you need it... that’s what it’s there for. So feel free to do this when appropriate:
#!/bin/sh
set -eu
traditionalIFS="$IFS"
IFS="`printf '\n\t'`"
...
IFS="$traditionalIFS"
# WARNING: You usually want "read -r", not plain "read":
while read -r a b c
do
  echo "a=$a, b=$b, c=$c"
done
IFS="`printf '\n\t'`"
Setting IFS to a value that ends in newline is a little tricky. If you just want to temporarily restore IFS to its default value, just save its original value for later use (as shown above). If you need IFS set to some other value with newline at the end, this kind of sequence does the trick:
IFS="`printf '\t\nX'`"
IFS="${IFS%X}"
Setting IFS to newline and tab is best if programs use newline or tab (not space) as their default data separator. If the data format is under your control, you could change the format to use newline or tab as the separator. It turns out that many programs (like GNU seq) already use these separators anyway, and the POSIX definition of IFS makes this essentially automatic for built-in shell commands (the first character of IFS is used as the separator for variables like $*). Once IFS is reset like this, filenames with spaces become much simpler to handle.
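As a small illustration of the “first character of IFS” rule (a sketch with hypothetical values, not from the original text), joining the positional parameters with "$*" produces newline-separated output once IFS begins with newline:

```shell
IFS=$(printf '\n\t')

# Hypothetical values; the first one contains a space.
set -- 'one two' three

# "$*" joins the parameters using the first character of IFS (newline here).
joined="$*"

# What we expect: the two values on separate lines, the space untouched.
expected=$(printf 'one two\nthree')
```

So a list built this way can be split again later on newlines, and values with spaces survive the round trip.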
Characters that must be escaped in a shell before they can be used as an ordinary character are termed “shell metacharacters”. If filenames cannot contain some or all shell metacharacters, then some security vulnerabilities due to programming errors would go away.
I doubt all POSIX systems would forbid shell metacharacters, but it’d be nice if administrators could configure specific systems to prevent such filenames on higher-value systems, as sort of a belt-and-suspenders approach to counter errors in important programs. Many systems are dedicated to specific tasks; on such systems, a filename with unusual characters can only occur as part of an attack. To make this possible, software on such systems must not require that filenames have metacharacters, but that’s almost never a problem: Filenames with shell metacharacters are very rare, and these characters aren’t part of the POSIX portable filename character set anyway.
Here I’ll discuss a few options. One option is to just forbid the glob characters (*, ?, and [); this can eliminate many errors due to forgetting to double-quote a variable reference in the Bourne shell. You could forbid the XML/HTML special characters “<”, “>”, “&”, and “"”, which would eliminate many errors caused by incorrectly escaping filenames. You could forbid the backslash character; this would eliminate a less-common error (forgetting the -r option of Bourne shell read). Finally, you could forbid all or nearly all shell metacharacters, which can eliminate errors due to failing to escape metacharacters where required in many circumstances.
All the Bourne shell programming books tell you that you’re supposed to double-quote all references to variables with filenames, e.g., cat "$file". Without special filesystem rules, you definitely need to! In fact, correctly-written shell programs must be absolutely infested with double-quotes, since they have to surround almost every variable use. But I find that real people (even smart ones!) make mistakes and sometimes fail to include those quotation marks... leading to nasty bugs.
Although shell programming books don’t note it, you can actually omit the double quotes around variable references containing filenames if (1) IFS contains only newline and tab (not a space, as discussed above), and (2) tab, newline, and the shell globbing metacharacters (namely “*”, “?”, and “[”) can’t be in the filename. (The other shell metacharacters don’t matter, due to the POSIX-specified substitution order of Bourne shells.) This means that cat $file would work correctly in such cases, even if $file contains a space and other shell metacharacters. From a shell programming point of view, it’d be neat if such control and globbing characters could never show up in filenames... then correct shell scripts could be much cleaner (they wouldn’t require all that quoting).
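The two conditions can be checked with a tiny experiment. This is only a sketch: count_args and the sample names are hypothetical, and it assumes mktemp -d is available (it is common, but not in POSIX).

```shell
IFS=$(printf '\n\t')

# Report how many arguments were received.
count_args() { echo "$#" ; }

# Condition (1): with space removed from IFS, an unquoted name that
# contains a space is still passed as ONE argument.
file='my file.txt'
spaces_count=$(count_args $file)

# Condition (2) matters because globbing still happens on unquoted values:
# in a directory holding files "a" and "b", a name of "*" becomes TWO arguments.
tmpdir=$(mktemp -d)
: > "$tmpdir/a" ; : > "$tmpdir/b"
glob_count=$(cd "$tmpdir" && file='*' && count_args $file)
rm -rf "$tmpdir"
```

That is, changing IFS fixes the space problem, but only forbidding the glob characters in filenames (or quoting) fixes the second one.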
I doubt there can be widespread agreement on forbidding all the globbing metacharacters across all Unix-like systems. But if local systems reject or rename such names, then when someone accidentally forgets to quote a variable reference with a filename (it happens all the time), the error cannot actually cause a problem. And that’s a great thing, especially for high-value servers (where you could impose more stringent naming rules). Older versions of this article mistakenly omitted the glob character issues; my thanks to explodingferret for correcting that. Similarly, if you also forbid spaces in filenames, as well as these other characters, then even without changing IFS, scripts which accidentally didn’t double-quote the variables would still work correctly. (Even if glob metacharacters can be in filenames, there are still good reasons to remove the space character from IFS, as noted in the section on spaces in filenames.)
So, by forbidding a few more characters (at least locally on high-value systems) you eliminate a whole class of programming errors that sometimes become security vulnerabilities. You will still need to put double-quotes around variables that contain values other than filenames, so this doesn’t eliminate the general need to surround variables with double-quotes in Bourne-like shells. But by forbidding certain characters in filenames, you decrease the likelihood that a common programming error can turn into an attack; in some cases that’s worth it.
You could forbid the XML/HTML special characters “<”, “>”, “&”, and “"”, which would eliminate many errors caused by incorrectly escaping filenames for XML/HTML.
This would also get rid of some nasty side-effects for shell and Perl programs. The < and > symbols perform file redirection, for both shell and Perl. This can be especially nasty for Perl, where filenames that begin with < or > can cause side-effects when open()ed; see “man perlopentut” for more information. Indeed, if you use Perl, see “man perlopentut” for other gotchas when opening files in Perl.
You could forbid the backslash character. This would eliminate one error: forgetting the -r option of Bourne shell read.
Of course, you could go further and forbid all (or nearly all) shell metacharacters.
Sometimes it’s useful to write out programs and run them later. For example, shell programs can be flattened into single long strings. Although filenames are supposed to be escaped if they have unusual characters, it’s not at all unusual for a program to fail to escape something correctly. If filenames never had characters that needed to be escaped, there’d be one less operation that could fail.
A useful starting-point list of shell metacharacters is “*?:[]"<>|(){}&'!\;$” (this is Glindra’s “safe” list with ampersand, single-quote, bang, backslash, semicolon, and dollar-sign added). The colon causes trouble with Windows and MacOS systems, and although opening such a filename isn’t a problem on most Unix/Linux systems, the colon causes problems because it’s a directory separator in many directory or file lists (including PATH, bash CDPATH, gcc COMPILER_PATH, and gcc LIBRARY_PATH), and it has a special meaning in a URL/URI. Note that < and > and & and " are on the list; this eliminates many HTML/XML problems! I’d need to go through a complete analysis of all characters for a final list; for security, you want to identify everything that is permissible, and disallow everything else, but its manifestation can be either way as long as you’ve considered all possible cases.
In fact, for portability’s sake, you already don’t want to create filenames with weird characters either. MacOS and Windows XP also forbid certain characters/names. Some MacOS filesystems and interfaces forbid “:” in a name (it’s the directory separator). Microsoft Windows’ Explorer interface won’t let you begin filenames with a space or dot, and Windows also restricts these characters:
: * ? " < > |
Also, in Windows, \ and / are both interpreted as directory name separators, and according to that page there are some issues with “.”, “[”, “]”, “;”, “=”, and “,”.
In the end, you're safer if filenames are limited to the characters that are never misused. In a system where security is at a premium, I can see configuring it to only permit filenames with characters in the set A-Za-z0-9_-, with the additional rule that it must not begin with a dash. These display everywhere, are unambiguous, and this limitation cuts off many attack avenues.
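A strict rule like this is simple enough to check in plain POSIX shell. The following is just a sketch of one way to do it; the function name is_strict_name is hypothetical, and I spell the allowed characters out explicitly because bracket ranges like A-Z can behave differently depending on the locale:

```shell
# Succeed only if the name is non-empty, does not begin with a dash,
# and contains only the characters A-Z, a-z, 0-9, underscore, and dash.
is_strict_name() {
  case $1 in
    '' | -* ) return 1 ;;
    *[!ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789_-]* )
      return 1 ;;
    * ) return 0 ;;
  esac
}
```

For example, is_strict_name 'report_2016-10' succeeds, while names such as '-rf', 'a b', and 'x*' are rejected. Note that this rule as stated also rejects the period, so real systems would probably want to extend the allowed set a little.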
For more info, see Wikipedia’s entry on Filenames. Windows’ NTFS rules are actually complicated, according to Wikipedia:
Windows kernel forbids the use of characters in range 1-31 (i.e., 0x01-0x1F) and characters " * : < > ? \ / |. Although NTFS allows each path component (directory or filename) to be 255 characters long and paths up to about 32767 characters long, the Windows kernel only supports paths up to 259 characters long. Additionally, Windows forbids the use of the MS-DOS device names AUX, CLOCK$, COM1, COM2, COM3, COM4, COM5, COM6, COM7, COM8, COM9, CON, LPT1, LPT2, LPT3, LPT4, LPT5, LPT6, LPT7, LPT8, LPT9, NUL and PRN, as well as these names with any extension (for example, AUX.txt), except when using Long UNC paths (ex. \\.\C:\nul.txt or \\?\D:\aux\con). (In fact, CLOCK$ may be used if an extension is provided.) These restrictions only apply to Windows; Linux, for example, allows use of " * : < > ? \ | even in NTFS. [The source also included “/” in the last list, but Wheeler believes that is incorrect and has removed it.]
Microsoft Windows also makes some terrible mistakes with its filesystem naming; the section on Windows filename problems briefly discusses this.
Beware of other assumptions about filenames. In particular, filenames that appear different may be considered the same by the operating system, particularly on Mac OS X, Windows, and remote filesystems (e.g., via NFS).
The git developers fixed a critical vulnerability in late 2014 (CVE-2014-9390) due to filenames. GitHub has an interesting post about it. Mercurial had the same problem (they notified the git developers about it). In particular, filenames that appear different are considered the same:
Thus, filtering based on filenames is tricky and potentially dangerous. This is in addition to the Windows-specific filenames (e.g., NUL) as discussed above.
Microsoft Windows has a whole host of other nasty tricks involving filenames. Normally periods and spaces at the end of a filename are silently stripped, e.g., "hello .. " is the same filename as "hello". You can also add various other selectors, e.g., "file1::$DATA" is the same as "file1", but the stripping does not happen so "file1...::$DATA" is not the same as "file1". Short 8+3 filenames can refer to longer names. There are other issues too, but this is not primarily an essay about Windows filenames; I just thought it important to note.
There are lots of tricks we can use in Bourne-like shells to work correctly, or at least not fail catastrophically, with nasty filenames. We’ve already noted a key approach: Set IFS early in a script to prevent breaking up filenames-with-spaces in the wrong place:
IFS="`printf '\n\t'`"
The problem has been around for a long time, and I can’t possibly catalog all the techniques. Indeed, that’s the problem; we need too many techniques.
I guess I should mention a few other techniques for either handling arbitrary filenames, or filtering out “bad” filenames. I think they’ll show why people often don’t do it “correctly” in the first place. In Bourne shell, you must double-quote variable references for many other kinds of variables anyway, so let’s look beyond that. I will focus on using shell globbing and “find”, since those are where filenames often come from, and the ways for doing it aren’t always obvious. This BashFAQ answer gives some suggestions; indeed, there’s a lot of stuff out there on how to work around these misfeatures.
Shell globbing is great when you just want to look at a list of files in a specific directory and ignore its “hidden” files (files beginning with “.”), particularly if you just want ones with a specific extension. Globbing doesn’t let you easily recurse down a tree of files, though; for that, use “find” (below). Problem is, globs happily return filenames that begin with a dash.
When globbing, make sure that your globs cannot return anything beginning with “-”, for example, prefix globs with “./” if they start in the current directory. This eliminates the “leading dash” problem in a simple and clean way. Of course, this only works on POSIX; if you can get Windows filenames of the form C:\Users, you’ll need to consider drive: as well. When you glob using this pattern, you will quietly hide any leading dashes, skip hidden files (as expected), and you can use any filename (even with control characters and other junk):
for file in ./*.jpg ; do
  ...
  command "$file"
done
Making globbing safe for all filenames is actually not complicated: just prefix them with “./”. Problem is, nobody knows (or remembers) to prefix globs with “./”, leading to widespread problems with filenames starting with “-”. If we can’t even get people to do that simple prefixing task, then expecting them to do complicated things with “find” is silly.
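To see what the prefix buys you, here is a short sketch that creates one awkwardly named file and globs it both ways. The file name -n.jpg and the variable names are hypothetical, and mktemp -d is assumed to be available (it is common, but not in POSIX):

```shell
# Scratch directory with a single file whose name begins with a dash.
tmpdir=$(mktemp -d)
: > "$tmpdir/-n.jpg"

# A bare glob returns the name as-is, so it looks like an option to commands:
bare=$(cd "$tmpdir" && for f in *.jpg ; do printf '%s\n' "$f" ; done)

# Prefixing with ./ means the result can never begin with a dash:
prefixed=$(cd "$tmpdir" && for f in ./*.jpg ; do printf '%s\n' "$f" ; done)

rm -rf "$tmpdir"
```

The bare glob yields -n.jpg, which many commands would try to parse as options; the prefixed glob yields ./-n.jpg, which is unambiguous.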
Bash has an extension that can limit filenames, GLOBIGNORE, though setting it to completely deal with all these cases (while still being usable) is very tricky. Here’s a GLOBIGNORE pattern so that globs will ignore filenames with control characters or leading dashes, as well as traditional hidden files (names beginning with “.”), yet accept reasonable patterns (including those beginning with “./” and “../” and even multiple “../”):
GLOBIGNORE=`printf '.[!/.]*:..[!/]*:*/.[!/.]*:*/..[!/]*:*[\001-\037\177]*:-*'`
By the way, a special thanks to Eric Wald for this complicated GLOBIGNORE pattern, which resolves the GLOBIGNORE problems I mentioned in earlier versions of this article. With this pattern, if you remember to always prefix globs with “./” or similar (as you should), then you’ll safely get filenames that begin with dash (because they will appear as “./-NAME”). But when you forget to correctly prefix globs (and you will), then leading-dash filenames will be skipped (which isn’t ideal, but it’s generally far safer than silently changing command options). Yes, this GLOBIGNORE pattern is hideously complicated, but that’s my point: Safely traversing filenames is difficult, and it should be easy.
Globbing can’t express UTF-8, so you can’t filter out non-UTF-8 filenames with globbing. Again, you probably need a separate program to filter out those filenames.
How can we use find correctly? Thankfully, “find” always prefixes filenames with its first parameter, so as long as the first parameter doesn’t begin with a dash (it’s often “.”), we don’t have the “leading dash” problem. (If you’re starting from a directory that begins with “-” inside your current directory, you can always prefix its name with “./”).
It’s worth noting that if you want to handle fully-arbitrary filenames, use “find . ... -exec” when you can; that’s 100% portable, and can handle arbitrarily-awkward filenames. The more-recent POSIX addition to find of -exec ... {} + can help too. So where you can, do this kind of thing:
# This is correct and portable; painful if "command" gets long:
find . ... -exec command {} \;
# This is correct and standard; some systems don't implement this:
find . ... -exec command {} +
When you can’t do that, using find ... -print0 | xargs -0 is the common suggestion; that works, but those require non-standard extensions (though they are common), the resulting program can get really clumsy if what you want to do with the file isn’t simple, and the results don’t easily feed into shell command substitutions if you plan to pass in \0-separated results.
If you don’t mind using bash extensions, here’s one of the better ways to directly implement a shell loop that takes “find”-created filenames. In short, you use a while loop with ‘read’ and have read delimit only on the \0 (the IFS= setting is needed or filenames containing leading/trailing IFS characters will get corrupted; the -d '' option switches to \0 as the separator, and the -r option disables backslash processing). Here’s a way that at least works in simple cases:
# This handles all filenames, but uses bash-specific extensions:
find . -print0 |
while IFS="" read -r -d "" file ; do
  ... # Use "$file" not $file everywhere.
done
This approach does handle all filenames, but because we use a pipe, each of the processes will be in a subshell. Thus, if any variables are set inside the “while” loop, their values will disappear once we exit the loop (because the loop’s subshell will disappear). To solve that problem, we’ll need to use another nonstandard bash extension, process substitution (which doesn’t even work on all systems with bash):
# This handles all filenames, but uses bash-specific extensions:
while IFS="" read -r -d "" file ; do
  ... # Use "$file" not $file everywhere.
  # You can set variables, and they'll stay set.
done < <(find . -print0)
We can now loop through all the filenames, and retain any variable values we set, but this construct is hideously ugly and non-portable. Also, this approach means we can’t read the original standard input, which in many programs would be a problem. You can work around that by using other file descriptors, but that causes even more complications, leading to hideous results. Is there any wonder nobody actually does this correctly?!?
Notice that you can’t portably use this construct in “for” loops or as a command substitution, due to limitations in current shells (you can’t portably say “split input on \0”).
Oh, and while carefully using the find command can process filenames with embedded control characters (like newline and escape), what happens afterwards can be “interesting”. In GNU find, if you use -print (directly or implicitly) to a teletype, it will silently change the filenames to prevent some attacks and problems. But once piped, there’s no way to distinguish between filenames-with-newlines and newlines-between-filenames (without additional options like the nonstandard -print0). And those later commands must be careful; merely printing a filename via those later commands is dangerous (since it may have terminal escape codes) and can go badly wrong (because the filename encoding need not match the environment variable settings).
Can you use the ‘find’ command in a portable way so it will filter out bad filenames, and have a simpler life from there on? Yes! If you have to write secure programs on systems with potentially bad filenames, this may be the way to go: by filtering out the bad filenames, you at least prevent your program from getting affected by them. Here’s the simplest portable (POSIX-compliant) approach I’ve found which filters out filenames with embedded ASCII control characters (including newline and tab); that way, newlines can separate filenames, displaying filenames is less dangerous (though we still have character encoding issues), and the results are easy to use in a command substitution (including a Bourne shell “for” loop) and with line-processing filters:
# This is correct and portable; it skips filenames with control chars:
IFS="`printf '\n\t'`" # Remove spaces so spaces-in-filenames still work
controlchars=`printf '*[\001-\037\177]*'`
for file in `find . ! -name "$controlchars"` ; do
  command "$file" ...
done
Unfortunately, UTF-8 can’t really be expressed with traditional globs, because globs can’t express a repetition of particular patterns. The standard find only supports globs, so it can’t do UTF-8 matching by itself. In the long term, I hope “find” grows a simple option to determine if a filename is UTF-8. Full regular expressions are able to represent UTF-8, thankfully. So in the short term, if you want to only accept filenames that are UTF-8, you’ll need to filter the filename list through a regex (rejecting names that fail to meet UTF-8 requirements). (GNU find has “-regex” as an extension, which could do this, but obviously that wouldn’t port to other implementations of find.) Or you could write a small C program that filters them out (along with other bad patterns).
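As a stand-in for the regex or small C program, one possible sketch is to ask the POSIX iconv utility to convert a name from UTF-8 to UTF-8 and see whether it complains. To be clear, this is a swapped-in technique, not the regex approach described above; the function name is_utf8 is hypothetical, and it assumes your iconv exits with a nonzero status on invalid input (the common implementations I know of do, but check yours):

```shell
# Succeed only if the argument is a valid UTF-8 byte sequence.
# ASSUMPTION: iconv returns nonzero when it meets an invalid sequence.
is_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical sample names, built from octal escapes so that this
# example does not depend on how this script itself is encoded.
good_name=$(printf 'caf\303\251')   # "cafe" with e-acute, encoded as UTF-8
bad_name=$(printf 'caf\351')        # the same word encoded as Latin-1
```

Inside a loop over filenames you could then write something like: is_utf8 "$file" || continue. It runs one process per name, so it is slow on large trees, which is one more reason a built-in “find” option would be nicer.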
Of course, if filenames are clean (at least, can’t have control characters), this can become far simpler, and that’s the point of this article:
IFS="`printf '\n\t'`" # Remove spaces so spaces-in-filenames will work
...
# This is correct if filenames can't have control characters:
for file in `find .` ; do
  ...
done
# This will fail if scaled to very large lists, but it is correct for
# smaller lists if filenames can't have control characters:
cat `find . -type f`
Why do I need to add odd coding mechanisms that say “don’t send me garbage”, and constantly work around the garbage other programs copy to me? There are many conventions out there to try to deal with garbage, but it’s just too easy to write programs that fail to do so. Shouldn’t the system keep out the garbage in the first place?!?
Yes, I need to filter inputs provided by untrusted programs. Fine. But the operating system kernel shouldn’t be one of the untrusted programs I must protect myself against (grin).
Using the techniques discussed above, you can count how many filenames include control characters 1-31 or 127 in the entire system’s filesystem:
badfile=`printf '*[\\x01-\\x1f\\x7f]*'`
find / -name "$badfile" -exec echo 1 \; | wc -l
For most systems, the answer is “0”. Which means this capability to store weird filenames isn’t really necessary. This “capability” costs a lot of development time, and causes many bugs; yet in return we get no real benefit.
So does limiting filenames, even in small ways, actually make things better? Yes! Let me focus on eliminating control characters (at least newline and tab), probably the worst offenders, and how things like a better IFS setting can improve things in a very public historical complaint about Unix.
The Unix-haters handbook page 167 (PDF page 205) begins Jamie Zawinski’s multi-page description of his frustrated 1992 effort to simply “find all .el files in a directory tree that didn’t have a corresponding .elc file. That should be easy.” After much agony (described over multiple pages), he found that the “perversity of the task had pulled me in, preying on my morbid fascination”. He ended up writing this horror, which is both horribly complicated and still doesn’t correctly handle all filenames:
find . -name '*.el' -print \
 | sed 's/^/FOO=/' | \
 sed 's/$/; if [ ! -f \ ${FOO}c ]; then \
 echo \ $FOO ; fi/' | sh
Zawinski’s script fails when filenames have spaces, tabs, or newlines. In fact, just about any shell metacharacter in a filename will cause catastrophic effects, because they will be executed (unescaped!) by another shell.
Paul Dunne’s review of the “Unix Hater’s Handbook” (here and here) proposes a solution, but his solution is both wrong and complicated. Dunne’s solution is wrong because it only examines the directories that are the immediate children of the current directory; it fails to examine the current directory and it fails to examine deeper directories. Whups! In addition, his solution is quite complicated; he uses a loop inside another loop to do it, and has to show it in steps (presumably because it’s too complicated to show at once). Dunne’s solution also fails to handle filenames with spaces in them, and it even fails if there are empty directories. Dunne does note those last two weaknesses, to be fair. Dunne doesn’t even show the full, actual code; he only shows a code outline, and you have to fill in the pieces before it would actually run. (If it’s so complicated that you can only show an outline, it’s too complicated.) This is all part of the problem: if it’s too hard to write good examples of easy tasks that do the job correctly, then the system is making it too hard to do the job correctly!
Here’s my alternative; this one is simple, clear, and actually correct:
# This is correct if filenames can't include control characters:
IFS="`printf '\n\t'`"
for file in `find . -name '*.el'` ; do
  if [ ! -f "${file}c" ] ; then
    echo "$file"
  fi
done
This approach (above) just sets IFS to the value it should normally have anyway, followed by a single bog-standard loop over the result of “find”. This alternative is much simpler and clearer than either solution, it actually handles the entire tree as Zawinski wanted (unlike Dunne’s), and it handles spaces-in-filenames correctly (as neither of the above do). It also handles empty directories, which Dunne’s doesn’t, and it handles metacharacters in filenames, which Zawinski’s doesn’t. It works on all filenames (including those with spaces), presuming that filenames can’t contain control characters. The find loop presumes that filenames cannot include newline or tab; the later “echo” that prints the filename presumes that the filename cannot contain control characters (since if it did, the echo of control characters might cause a security vulnerability). If we also required that filenames be UTF-8, then we could be certain that the displayed characters would be sensible instead of mojibake. This particular program works even when file components begin with “-”, because “find” will prefix the filenames with “./”, but preventing such filenames is still a good idea for many other programs (the call to echo would fail and possibly be dangerous if the filename had been acquired via a glob like *). My approach also avoids piping its results to another shell to run, something that Zawinski’s approach does. A variation could use “set -f” but this one does not need it. There’s nothing wrong with having a shell run a program generated by another program (it’s a powerful technique), but if you use this technique, small errors can have catastrophic effects (in Zawinski’s example, a filename with metacharacters could cause disaster). So it’s best to use the “run generated code” approach only when necessary. This is a trivial problem; such powerful grenade-like techniques should not be necessary! Most importantly, it’s easy to generalize this approach to arbitrary file processing.
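If you want to try the simple loop without touching real files, here is a self-contained sketch that builds a tiny scratch tree first. The directory layout and names (tmpdir, missing, “sub dir”, “b c.el”) are hypothetical, and mktemp -d is assumed to be available (common, but not in POSIX):

```shell
# Scratch tree: a.el has a matching a.elc; "sub dir/b c.el" does not.
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/sub dir"
: > "$tmpdir/a.el"
: > "$tmpdir/a.elc"
: > "$tmpdir/sub dir/b c.el"

IFS=$(printf '\n\t')

# Collect every .el file that lacks a corresponding .elc file.
missing=$(cd "$tmpdir" && for file in $(find . -name '*.el') ; do
  if [ ! -f "${file}c" ] ; then
    printf '%s\n' "$file"
  fi
done)

rm -rf "$tmpdir"
```

The only name reported is ./sub dir/b c.el: the loop descended into the subdirectory and kept both spaces intact.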
That’s my point: Adding small limits to filenames makes it much easier to create completely-correct programs. Especially since most software developers act as if these limitations were already being enforced.
Peter Moulder sent me a shorter solution for this particular problem (he accidentally omitted -print, which I added):
    # Works on all filenames, but requires a non-standard extension, and there
    # are security problems with some versions of find when printing filenames:
    find . -name '*.el' \! -exec test -e '{}c' \; -print

However, Moulder’s solution uses an implementation-defined (non-standard) extension; as noted by the Single UNIX specification version 3 section on find, “If a utility_name or argument string contains the two characters "{}", but not just the two characters "{}", it is implementation-defined whether find replaces those two characters or uses the string without change”. My thanks to Davide Brini who pointed out that this is implementation-defined, and also suggested this standard-conforming solution instead:

    # This is correct for all filenames, and portable, but hideously ugly; it can
    # cause security vulnerabilities b/c it prints filenames with control chars:
    find . -name "*.el" -exec sh -c '[ ! -f "$1"c ] && printf "%s\n" "$1"' sh {} \;

This version (with find) can process files with newlines, but if files have embedded newlines, the output is ambiguous. In addition, if the files can have terminal escapes or a different character encoding, beware: this code is a security vulnerability waiting to happen. In any case, as file processing gets more complicated, stuffing logic into “find” gets very painful. I believe that the simple for-loop is easier to understand and more easily scales to more complicated file processing.
Similarly, here is a little script called mklowercase, which renames all filenames to lowercase recursively from the current directory (“.”) down. Again, this script is pretty simple to write if we can assume that filenames don’t include newline or tab. This one can handle filenames with spaces and initial dash (again, because find can handle them):

    #!/bin/sh
    # mklowercase - change all filenames to lowercase recursively from "." down.
    # Will prompt if there's an existing file of that name (mv -i)

    set -eu
    IFS="`printf '\n\t'`"

    for file in `find . -depth` ; do
      [ "." = "$file" ] && continue   # Skip "." entry.
      dir=`dirname "$file"`
      base=`basename "$file"`
      oldname="$dir/$base"
      newbase=`printf "%s" "$base" | tr A-Z a-z`
      newname="$dir/$newbase"
      if [ "$oldname" != "$newname" ] ; then
        mv -i "$file" "$newname"
      fi
    done
Do not assume that filename issues are limited to Unix/POSIX/Linux systems; that is simply the focus of this particular paper. Windows also has serious filenaming issues, which in some ways are more serious than Unix/POSIX/Linux.
Windows forbids control characters in filenames, so it doesn’t have that problem, and it forces an encoding, so they can be displayed unambiguously. But that isn’t the only problem.
However, Windows has very arbitrary interpretations of filenames, which can make it dangerous. In particular, it interprets certain filename sequences specially. For example, if there is a directory called “c:\temp”, and you run the following command from Windows’ “cmd”:

    mkdir c:\temp
    echo hi > c:\temp\Com1.txt

You might think that this sequence creates a file named “c:\temp\Com1.txt”. You would be wrong; it doesn’t create a file at all. Instead, this writes the text to the serial port. In fact, there are a vast number of special filenames, and even extensions don’t help. Since filenames are often generated from attacker data, this can be a real problem. I’ve confirmed this example with Windows XP, but I believe it’s true for many versions of Windows.
One solution is to prefix filenames with “\\?\” and then the full pathname; few people will do that consistently, leading to disaster. Web applications can protect themselves by only using filenames based on hashes, or forcing a prefix that makes the filename not a device name. (I have not been able to authoritatively confirm that only the usual lists of special names can be special, which makes this worrisome.) But this shows that Windows has its own serious filename issues.
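As a rough illustration of the kind of check an application might make before using an attacker-influenced name, here is a minimal sketch in POSIX shell. The function name is_windows_device_name is hypothetical, and the list it checks (CON, PRN, AUX, NUL, COM1 to COM9, LPT1 to LPT9) is only the usual list of device names; as noted above, that list may not be complete, so treat this as a sketch rather than a guarantee.

```shell
#!/bin/sh
# Sketch only: is_windows_device_name is a hypothetical helper.
# Assumption: the reserved names are the commonly listed ones, matched
# case-insensitively, and any extension is ignored (Com1.txt is still COM1).
is_windows_device_name() {
  # Drop everything from the first "." onward, then uppercase the rest.
  base=$(printf '%s' "${1%%.*}" | tr 'a-z' 'A-Z')
  case "$base" in
    CON|PRN|AUX|NUL|COM[1-9]|LPT[1-9]) return 0 ;;
    *) return 1 ;;
  esac
}

if is_windows_device_name "Com1.txt"; then
  echo "Com1.txt would be treated as a device name"
fi
```

A caller that gets a match could refuse the name or add a fixed prefix, which is the "forcing a prefix" defense mentioned above.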
The lesson here is not for POSIX to copy Windows; that would be a mistake. Instead, the goal is to have simple rules that make it easy to avoid common mistakes. Developers need systems that are neither “everything is permissible” nor “capricious, hard-to-follow rules that don’t help users”.
I’ve received some interesting commentary on this article, both via email and via comments about it at lwn.net. My thanks to all the commenters. Not everyone agrees with this essay (I expected that!), but many did. Below are some comments that I found particularly interesting.
Ed Avis said via email, “Hi, I read your fixing filenames essay - great work! I hope this longstanding problem is finally sorted out.” He also suggested that “A patch to lkml would at least get discussion moving, even if it has no chance of being accepted”.
On LWN, epa said:
I thoroughly agree. If using a single character for end-of-line was the best design decision in UNIX, then allowing any character sequence in filenames (while at the same time including a shell and scripting environment that’s easily tripped up by them) was the worst.
Look at the recent Python version that got tripped up by filenames that are not valid UTF-8. Currently on a Unix-like system you cannot assume anything more about filenames than that they’re a string of bytes. This frustrates efforts to treat them as Unicode strings and cleanly allow international characters.
Or look at the whole succession of security holes in shell scripts and even other languages caused by control characters in filenames. My particular favorite is the way many innocuous-looking perl programs (containing ‘while (<>)’) can be induced to overwrite random files by making filenames beginning ‘>’.
Richard Neill said via email:
I agree that spaces in filenames are evil. But I suspect that we won’t be able to stamp them out widely enough to matter, because there are too many systems that absolutely require them. The Windows XP equivalent of “/home” is “\Documents and Settings” (notice the spaces!), and Windows’ equivalent of “/usr/bin” is “\Program Files”, so if you ever have to deal with Windows filesystems, trouble handling the space character is a real problem. (Vista and later use “\Users” instead of “\Documents and Settings”, which is a more sensible Unix approach, but the problem still remains.) People do use both underscore and space in full pathnames, so making them effectively the same probably won’t work. If we can stamp out other problems, spaces in filenames become much easier to deal with, and I can live with that. That’s particularly true if we can get people to move to using an “IFS” setting without the space character; then they’re really easy to handle.
On LWN, jreiser said:
Keep those filename rules out of my filesystems, please. Some of my programs use such “bad” filenames systematically on purpose, and achieve strictly greater utility and efficiency than would be possible without them.
But while jreiser may get “greater utility and efficiency”, lots of other people have programs that subtly fail, possibly with security vulnerabilities, because of this leniency. I’d rather have “slower and working” than “faster but not working”. Such programs aren’t portable anyway; not all filesystems permit such names, and the POSIX standard doesn’t guarantee them either.
Interestingly, epa replies with:
Can you give an example [where ‘bad’ filenames are needed]? There is a certain old-school appeal in just being able to use the filesystem as a key-value store with no restrictions on what bytes can appear in the key. But it’s spoiled a bit by the prohibition of NUL and / characters, and trivially you can adapt such code to base64-encode the key into a sanitized filename. It may look a bit uglier, but if only application-specific programs and the OS access the files anyway, that does not matter.
On LWN, nix was not so sure about this approach, and said:
I use the filename as a key-value store for a system (not yet released) which implements an object model of sorts in the shell... I pondered a \n-prepended filename because it’s even harder to trip over by mistake, but decided that it would look too odd in directory listings of object directories when debugging... David’s proposed constraints on filenames are constraints which can never be imposed by default, at the very least...
The first part proves my point. Even for a key-value store, nix decided to avoid \n filenames because they cause trouble. If they cause trouble, then let’s stop. I actually agree that some of these constraints cannot be imposed by default, but some can, so let’s deal with those.
But I agree that these rules all make sense for parts of the filesystem that users might manipulate with arbitrary programs, as opposed to those that are using part of the FS as a single-program-managed datastore. What I think we need is an analogue of pathconf() (setpathconf()?) combined with extra fs metadata, such that constraints of this sort can be imposed and relaxed for *specific directories* (inherited by subdirectories on mkdir(), presumably). *That* might stand a chance of not breaking the entire world in the name of fixing it.
This might be the kernel (ha!) of a good idea. In fact, there’s already a mechanism in the Linux kernel that might do this job: the extended attributes managed with getfattr(1)/setfattr(1). You could set extended attributes on directories, which would control what kinds of filenames could be created inside them. If this approach were implemented this way, I’d suggest that by default, directories would “prevent bad filenames” (e.g., control chars and leading “-”). You could then use “setfattr” on directories to permit badness, or perhaps enforce additional requirements. I would make those “trusted extended attributes”; you’d have to be CAP_SYS_ADMIN (typically superuser) to be able to make directories that permitted bad filenames. I’d like new directories to inherit the attributes of their parent directory, too; I’ll need to look into that. I’m sure there are many other variations; much would depend on the viewpoint of kernel writers. This might give people the flexibility they want: Those who want reasonable filename limits can get them without rewriting the kernel, and those who want weird names can get them too.
On LWN, ajb said:
I think this is a sensible idea. It should be possible to make the transition relatively painless:
- add a new inheritable process capability, ‘BADFILENAMES’, without which processes can’t see or create files with bad names.
- add a command ‘access_bad_filenames’ which creates a shell with the capability.
- /bin/ls also needs the capability, but should not display bad filenames unless an additional option is passed.
That way, most processes can run happily in the ignorance of any bad filenames. If you need to access one, you run the commands you need to access it with under the special shell.
I’m not sure about using capabilities this way, but it’s certainly an interesting approach.
The basic notion of making this inheritable to processes is interesting. In fact, you could do inheritable shrouding of bad filenames solely from userspace, without the kernel. Simply define a special environment variable (e.g., HIDE_BAD_FILENAMES), and then modify programs so that bad filenames aren’t found by programs that walk directories. You could probably just modify readdir(3)’s implementation, since I suspect other C routines, shells, and find(1) simply call that when they look for filenames. If not, I suspect the number of routines that need to be changed would be remarkably small. One trouble is that this might be too good at hiding bad filenames; you might not realize they exist, even when you need to find them, and attackers might intentionally create “hidden” files (e.g., so they can hide malware). Also, invoking setuid programs would erase this environment variable, and privileged programs are sometimes the programs you most want to protect from bad filenames. Which makes me worry; it’d be better to not have bad filenames in the first place.
You could also try to prevent creating such bad filenames from userspace, but here it gets dodgy. I suspect many programs invoke the kernel open() interface directly, and thus aren’t quite as easy to intercept. And if we can’t keep them from existing, they’ll keep popping up as problems.
On LWN, mrshiny said:
You can pry my spaces from my filenames out of my cold dead fingers. But frankly spaces are no different than other shell meta-characters. If a filename is properly handled for spaces, doesn’t it automatically work for all the other chars? If not, it should be easy enough to fix the SHELLS in this case.
Well, spaces are actually different than other meta-characters in shells. The problem is that the default IFS value includes space, as well as tab and newline. As I discuss in the article, you CAN change IFS to remove space. If you do that, and ensure that filenames can’t include newline or tab, then a lot of common shell script patterns actually become correct.
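To make the IFS point concrete, here is a minimal sketch in POSIX shell. The helper count_fields is a hypothetical name used only for illustration; it counts how many fields the shell produces when a list is expanded unquoted.

```shell
#!/bin/sh
# Sketch only: count_fields is a hypothetical helper, not a standard tool.
count_fields() {
  # $1 is deliberately unquoted so the shell splits it using IFS.
  set -- $1
  echo "$#"
}

# Two filenames, one per line; the first contains a space.
list=$(printf 'my file.txt\nother.txt')

# Narrow IFS to newline and tab only. Tab comes last because command
# substitution strips trailing newlines.
IFS=$(printf '\n\t')

count_fields "$list"   # prints 2: the space no longer splits the first name
```

With the default IFS the same list would split into three fields, which is exactly the failure that the narrowed IFS avoids.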
I agree with you, it’s too late to forbid spaces-in-filenames on most systems. I thought I made that clear, sorry if I didn’t. My point was that since most of us are probably stuck with them, let’s get rid of some of the other junk like control chars in filenames; without them, spaces would be way easier to deal with.
Mr. Wheeler makes a mistake in the article as well. Windows has no problem with files starting with a dot. It’s only Explorer and a handful of other tools that have problems. Otherwise Cygwin would be pretty annoying to use.
You’re right, the Windows kernel has no trouble with filenames beginning with dot. I was quoting something else, and didn’t quite quote it correctly. Fixed. It’s worth noting that to a lot of users, if the file Explorer has trouble, they have trouble. I’m an avid fan of Cygwin, BTW.
Overall, however, I like the idea of restricting certain things, especially the character encoding. The sooner the other encodings can die, the sooner I can be happy.
Glad you liked the rest!
On LWN, njs said:
I pretty much agree with all dwheeler’s points (not sure about banning shell metacharacters).
The section on Unicode-in-the-filesystem seemed quite incomplete. We know this can work, since the most widely used Unix *already* does it. OS X basically extends POSIX to say “all those char * pathnames you give me, those are UTF-8”. However, there are a lot of complexities not mentioned here -- you need to worry about Unicode normalization (whether or not to allow different files to have names containing the same characters but with different bytestring representations), if there is any normalization then you need a new API to say “hey filesystem, what did you actually call that file I just opened?” (OS X has this, but it’s very well hidden), and so on.
But these problems all exist now, they’re just overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets.
I did note that most people wouldn’t be able to ban metacharacters.
Yes, I know about the issues with normalization. But my point is what you just noted: these problems are “overshadowed by the terrible awful even worse problems caused by filenames all being in random unguessable charsets”. If you know that filenames will be handed to you in UTF-8 and won’t have nasty stuff like control characters, many problems either go away or become more manageable.
On LWN, kenjennings said:
If you had a petition I’d sign it. I agree with all six of your fixes at the end of your article.
Having been working with computers since 1979 and subject to the various limitations of dozens of file systems, I automatically exercise self-restraint and never put any of those characters into filenames.
People should not be using filenames as data storage.
On Apr 15, 2010, Derek Martin sent me a lengthy and interesting email; here are some highlights (it’s really long, so I don’t include all of it):
I came across your article regarding Unix filenames. I mostly agree with a lot of your points, including that spaces in filenames are bad... As you point out, that’s a hard one to get around, because spaces are allowed on a lot of other filesystems, and interoperation should be a goal of any system (ideally).
I avoid using spaces in filenames, just as I avoid using control and meta characters in them. But I want to point out a couple of other issues, playing devil’s advocate for a bit. The basic premise is, there’s actually nothing wrong with the Unix filesystem allowing arbitrary character strings in filenames; the real problem is in the shell, and maybe a few of the standard command-line utilities.
I’ll start by pointing out that in part, Unix is where it is for the same reason Windows still honors a lot of MS-DOS brain damage: it simply was always that way...
In the bad-old-days, there was no such thing as Unicode...[people could] specify their own language/encoding via environment variables, and [the kernel allowed] any sequence of bytes in filenames. This way, the implementers of the kernel don’t need to be familiar with every character set in use in every language and culture...
With Unicode, we don’t really need to continue this practice. However, interestingly, using UTF-8 is not a complete solution to this problem, either...there are a few rarely used (but still used!) Korean syllabic characters, a number of Japanese-only characters (mostly typographical/graphical in nature), and a selection of uncommon Chinese characters that are not available in UTF-8, which are available in one or more of those languages’ native encodings... UTF-8 contains enough of those languages’ characters that any native user won’t have trouble communicating; but some well-educated people may find their expressivity hampered.
Also, I must point out that most of the problems you’ve sited [sic] in your article are specific to the Unix shell... They are not problems inherent to Unix as a whole. Most other programming languages (C for example) have no trouble handling file names with odd characters in them. By and large, it just works (though displaying them or manipulating them in certain ways may still be an issue, if you can’t identify their encoding). And where the use of GUI shells is now becoming common, even on Unix and Linux, this fact reduces the severity of some of the issues you outline. The GUI shells can handle those files just fine, for the most part. But back to the (Bourne-like) Unix shell, since that’s what your article focuses on.
It should be (and is) possible to make a number of enhancements to the shell to allow better handling of such odd filenames. For example, something like the following could/should be possible:

    $ var="\006\007xyz"
    $ echo $var
    \006\007xyz
    $ echo "$var"
    ^F^Gxyz

One improvement: If unquoted, the shell could treat a variable containing spaces and control characters just as a C program would: i.e. they’re not special....
My last point is filenames that start with a ‘-’ character. That one is a little trickier, since a lot of tools don’t have a way to handle it. There are tricks to do it... like specifying ‘./-n’ instead of just ‘-n’ in your command. But, it must be pointed out that the magic ‘--’ argument, while not implemented everywhere, IS defined in the POSIX standard. This is probably the best solution; sadly not everyone who writes programs is aware of and/or pays attention to standards. You can’t blame that on Unix.
So, as a practical matter, since we don’t currently have any of these things, I do still agree with you, mostly. But from a technical standpoint, the problems you outline are, I think, much more caused by the shell’s poor handling of these special cases, than by the fact that they’re allowed in the first place.
As far as Unicode/UTF-8 goes, Derek Martin is right, there is the problem that some very rarely-used characters aren’t encoded in Unicode (and thus have no UTF-8 value). But that is almost never a significant problem, and this problem is slowly going away while these extremely rare characters are added to Unicode. More importantly, the world is different now. Today, people do exchange data across many locales, and it is simply unreasonable to expect that people can stay isolated in their local locales. Most people expect to be able to display filenames at any time, even though they receive data from around the world. We need a single standard for all characters, worldwide, and a standard encoding for them in filenames. There is really only one answer, so let’s start moving there.
Martin notes that handling filenames beginning with “-” is tricky. Martin points out that the “magic ‘--’ argument, while not implemented everywhere, IS defined in the POSIX standard. This is probably the best solution; sadly not everyone who writes programs is aware of and/or pays attention to standards. You can’t blame that on Unix”. Actually, yes, I can blame the standard. If a standard is too difficult to follow, maybe the problem is the standard. More importantly, even if programs implemented “--” everywhere, users would typically fail to use it everywhere. This is just like putting barbed wire on a tool handle; if a tool is difficult to use safely and correctly, perhaps the tool needs to be fixed. Anyway, the formal POSIX standard specifically states that you do not need to support filenames beginning with “-”; the problem is that many implementations permit them anyway. So we don’t need to fix the standard; we just need to fix implementations in a way that complies with standards.
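For readers who want to see what the “./” trick mentioned by Martin looks like in a script, here is a minimal sketch in POSIX shell. The function name safe_path is hypothetical; it simply prefixes “./” to any name that begins with “-” so that a later command cannot mistake it for an option.

```shell
#!/bin/sh
# Sketch only: safe_path is a hypothetical helper.
# It rewrites a relative name that begins with "-" as "./-name";
# every other name is printed unchanged.
safe_path() {
  case "$1" in
    -*) printf './%s\n' "$1" ;;
    *)  printf '%s\n' "$1" ;;
  esac
}

safe_path "-n"          # prints ./-n
safe_path "notes.txt"   # prints notes.txt
```

This only helps when every script remembers to apply it, which is the point made above: a rule that must be applied by hand everywhere will sometimes be forgotten.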
Martin says, “If unquoted, the shell could treat a variable containing spaces and control characters just as a C program would: i.e. they’re not special....” Setting the IFS variable in the shell does make it possible to make the space, tab, and/or newline character nonspecial, so you don’t even need to rewrite shells. I specifically recommend removing the space character from IFS, and that helps. That doesn’t deal with the other characters, though.
Martin concludes, “So, as a practical matter, since we don’t currently have any of these things, I do still agree with you, mostly.” So we may disagree a little on their causes, but he still mostly agrees that something should be done.
Derek Martin claims that “most of the problems you’ve [cited] in your article are specific to the Unix shell...”. I do talk about the problems specific to the shell, but the biggest problems with filenames are not specific to the shell or to command-line interfaces. The biggest problems are control characters, leading dashes, and non-UTF-8 encoding. Control characters are a problem for all languages, because essentially all programming languages have constructs that process lines at a time and handle tab-separated fields; control characters ruin that. Leading dashes interfere with invoking other programs, which is something that programs in any language sometimes need to do. The lack of a standard filename encoding means you can’t reasonably display filenames, regardless of programming language or user interface. Certainly a number of other problems are unique to the shell, but that doesn’t make them non-issues; the shell is so baked into the system, and used so widely (including via other programming languages), that they cause endless problems (including security problems). So let’s fix them.
| Few people really believe that filenames should have this junk, and you can prove that just by observing their actions... Their programs... are littered with assumptions that filenames are “reasonable”... By changing a few lines of kernel code, millions of lines of existing code will work correctly in all cases, and many vulnerabilities will evaporate. |
In sum: It’d be far better if filenames were more limited so that they would be safer and easier to use. This would eliminate a whole class of errors and vulnerabilities in programs that “look correct” but subtly fail when unusual filenames are created (possibly by attackers). The problems of filenames in Unix/Linux/POSIX are particularly jarring in part because there are so many other things in POSIX systems that are well-designed. In contrast, Microsoft Windows has a legion of design problems, often caused by its legacy, that will probably be harder to fix over time. These include its irregular filesystem rules (so that “c:\stuff\com1.txt” refers to the COM1 serial port, not to a file), its distinction between binary and text files*, its monolithic design, and the Windows registry. Any real-world system has some problems, but the POSIX/Linux filename issues can be fixed without major costs. The main reason that things are the way they are is because “we’ve always done it that way”, and that is not a compelling argument when there are so many easily-demonstrated problems. So let’s fix the problem!
In general, kernels should emphasize mechanism not policy. The problem is that currently there’s no mechanism for enforcing any policy. Yet it’s often easy for someone to create filenames that trigger file-processing errors in others’ programs (including system programs), leading to foul-ups and exploits. Let administrators determine policies like which bytes must never occur in filenames, which bytes must not be prefixes, which bytes must not be suffixes, and whether or not to enforce UTF-8. All that’s needed in the kernel is a mechanism to enforce such a policy. After all, the problem is so bad that there are programs like detox and Glindra to fix bad filenames.
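To show what such a policy might look like when written down, here is a minimal userspace sketch in POSIX shell. It is not a kernel mechanism, and the function name is_bad_filename and the particular rules are assumptions chosen for illustration: control bytes are forbidden everywhere, “-” is forbidden as a prefix, and space is forbidden as a prefix or suffix. It does not check for valid UTF-8.

```shell
#!/bin/sh
# Sketch only: is_bad_filename is a hypothetical helper applying one
# example policy to a single filename component.
# Assumed policy: no control bytes anywhere, no leading "-",
# no leading or trailing space. UTF-8 validity is NOT checked.
is_bad_filename() {
  case "$1" in
    -*|' '*|*' ') return 0 ;;
  esac
  # Delete every byte that is not a control byte (printable ASCII and
  # bytes 0x80-0xFF), then count what is left over.
  count=$(printf '%s' "$1" | LC_ALL=C tr -d '\040-\176\200-\377' | wc -c)
  [ "$count" -gt 0 ]
}

if is_bad_filename "$(printf 'a\tb')"; then
  echo "rejected: contains a control character"
fi
```

An administrator-chosen policy would replace the hard-coded patterns here; the structure (forbidden anywhere, forbidden as prefix, forbidden as suffix) follows the categories listed above.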
So what steps could be taken to clean this up slowly, over time, without causing undue burdens to anyone? Here are some ideas:
Merely forbidding their creation might be enough for a lot of purposes. On many systems, files are only created via the local operating system, and not by mounting local or remotely-controlled filesystems. On the other hand, if you also hide any such filenames that do exist, you have a complete solution: applications on that system can then trust that such “bad” filenames do not exist, and thus hiding such files essentially treats bad filenames like data corruption. I think that if you hide files with “bad” filenames, then you should reject all requests to open a bad filename, whether you’re creating it or not. (One risk of hiding is that this creates an opportunity for malicious users to “hide” data in bad filenames, such as malware or data that isn’t supposed to be there.) Administrators could decide if they want to hide bad filenames or not, so there would be enforcement settings. Here is one possible scheme: One setting would determine whether or not to permit creation of files with bad filenames. Another would determine how they should be viewed if they are already there (e.g., in directories): as-is, hidden (not viewed at all), or escaped (see the next point). Another would determine if they can be opened if the bad filename is used to open it (yes or no); obviously this would only have effect if bad filenames had been created in the first place. There would also be the issue of escaped filenames; if there is a fixed escaping mechanism, you configure which file wins if the escaped name equals the name of another file.
If bad filenames cannot be viewed (because they are escaped or hidden), then you have a complete solution. That is, at that point, all application programs that assumed that filenames are reasonable will suddenly work correctly in all cases. At least on that system, bad filenames can no longer cause mysterious problems and bugs.
Let’s talk about how this could be implemented in Linux, specifically. It could be a small capability built into the kernel itself. Josh Stone shows how filesystem rules could be implemented using SystemTap. However, the most obvious approach is to create a small Linux Security Module, now that LSM supports stacking multiple LSM modules. People typically already have a big LSM module installed, and there’s more than one used by different distros, but with stacking you can simply add a focused capability (e.g., to limit creation of filenames). Another option is to create a special pass-through filesystem, but the additional complexity such a filesystem would add doesn’t seem necessary.
For more information on my early Linux kernel module, see the LWN.net article “Safename: restricting "dangerous" file names” by Jake Edge.
James K. Lowden informs me that “enforcement could be effected on NetBSD using a layered filesystem. [It] Would make a nice [Summer of Code] SoC project, too, as a proof of concept.”
There are many possible designs for a renaming system; here’s a sample one:
Let’s examine various options; it turns out that there are many options for this, making it a little complicated.
A common approach would implement an escape character (or escape sequence) that is used when the underlying filename is bad. This would also be a complete solution: users and developers could then truly trust that “bad” filenames can’t happen (directory lists and so on would not produce them). The administrator could configure the specific policy of what filenames are “bad” for their system, using the same approaches described above (e.g., bytes forbidden everywhere, bytes forbidden as an initial character, bytes forbidden as a trailing character, bytes to be renamed everywhere/initially/trailing, as well as whether or not to enforce UTF-8).
I presume that a file is stored in its “bad” form (if it’s bad), is escaped (renamed) before being returned to userspace, and that any filename from userspace with the escape mechanism is automatically renamed back to the “bad” form when it is stored. The encoding character/sequence should itself be encoded, so that you do not have to worry about having two different files with the same user-visible name. This kind of “rename on create” isn’t what most POSIX systems do, but MacOS already does this in some cases (it normalizes filenames with non-ASCII characters), and most application programs don’t seem to care.
You’re probably better off minimizing the number of filenames that will be renamed into a different sequence of bytes internally; this has implications on the encoding. For example, some encoding systems double the encoding character to encode the encoding character (so if “=” starts an encoding, then “==” can encode an “=”). I had earlier suggested using doubling to encode the encoding character, but this violates the rule of minimizing the renaming. In particular, if you use “=” in a filename at all, using “==” isn’t unlikely (e.g., filenames like “==Attention==”). Unix/Linux filenames tend to have mostly or all lower case letters, so mandating that hexadecimal digits only be recognized if they are uppercase can help reduce unintentional renames too. To reduce the likelihood of unintentional encoding, I suggest having the kernel accept filenames and convert only filename components which have the encoding character followed by two hexadecimal digits, and where they are letters they must be upper case. Otherwise, any userspace “encoding” is not translated when brought to kernelspace, and conversely, an encoding from kernelspace is only used when it is necessary. Thus, if “=” starts an encoding, a literal “=” that is followed by two uppercase hex digits would be encoded as “=3D”, but filenames like “==Attention==” would not need to be renamed at all.
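The rule in the preceding paragraph can be sketched in POSIX shell as follows. This is a userspace illustration only, not kernel code; the function names encode_name, emit_byte, and is_upper_hex are hypothetical, the set of “bad” bytes is assumed to be just the ASCII control characters, and the script assumes the default IFS.

```shell
#!/bin/sh
# Sketch only: all function names are hypothetical. Assumes default IFS.
# Assumed policy: "bad" bytes are the ASCII control bytes (0x00-0x1F, 0x7F).

# True if the hex pair names an ASCII digit 0-9 or an uppercase letter A-F.
is_upper_hex() {
  case "$1" in
    3[0-9]|4[1-6]) return 0 ;;
    *) return 1 ;;
  esac
}

# Print the single byte whose value is given as a two-digit hex pair.
emit_byte() {
  printf "\\$(printf '%03o' "0x$1")"
}

encode_name() {
  # od prints one lowercase hex pair per byte, e.g. " 66 6f 6f 0a".
  set -- $(printf '%s' "$1" | od -An -v -t x1)
  while [ "$#" -gt 0 ]; do
    case "$1" in
      [01]?|7f)
        # Control byte: escape as "=" plus two uppercase hex digits.
        printf '=%s' "$(printf '%s' "$1" | tr 'a-f' 'A-F')" ;;
      3d)
        # A literal "=" is escaped only when the next two bytes would
        # make it look like an escape sequence.
        if is_upper_hex "${2-}" && is_upper_hex "${3-}"; then
          printf '=3D'
        else
          printf '='
        fi ;;
      *)
        emit_byte "$1" ;;
    esac
    shift
  done
  printf '\n'
}

encode_name "==Attention=="   # printed unchanged
```

The design choice being illustrated is that names such as “==Attention==” pass through untouched, while a name like “a=3Db” has its “=” escaped so that decoding stays unambiguous.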
One of the problems with renaming systems is thatmany programs won’t be prepared for low-level encoding and/or might permitfilenames that can cause trouble later.At the very least, the kernel should not accept encodings forbyte 0x00 nor byte 0x2F (“/”).It might be a good idea to forbid encoding byte 0x2E (“.”).You only accept encodings that are currently forbidden, but that couldwould make it hard to change the rules later.Perhaps there should be a list of bytes which are translated from userspace,and all other “encodings” are ignored.
The same filename could appear different among differentsystems if it could be viewed at different times with and without encoding(e.g., perhaps it is stored on a memory stick, with the filename storedinside a separate file).This problem could be mostly alleviated by allowing programs toopen or create files using unencoded bad names (including names that haveencoding errors), while returning the encodednames when file lists are created later.
The administrator could also decide if the system allowed “bad” filenames to be created (effectively renaming them on creation, from the point of view of user applications), or forbade their creation. This would mean you could see existing data, but not create new problems. The rules could even be different (e.g., some “bad” filenames are so bad that they may not be created... others are okay, but will be escaped when viewed in a directory). But that becomes rather complicated. If useful, two simple settings could be added: should “bad” filenames be acceptable when creating files, and should “bad” filenames be acceptable when opening existing files. These settings might not be necessary, though; once renaming is automatic, bad filenames cannot cause that system any problem.
Finding a “good” escape character / escape sequence and notation is tricky; there seem to be problems with all characters and notations.
The “=” character is a particularly reasonable escape character; it is relatively uncommon in filenames, most programs don’t consider a leading “=” as starting options, and it doesn’t have special interactions with the shell. There’s a lot of experience with this kind of thing; the quoted-printable format uses “=” as the escape character too. Basically, any “=” is then followed by two hexadecimal digits (uppercase for letters) which indicate the replaced byte value. You could encode the “=” sign itself as “==” or “=3D” or both; I suggest using “==” (doubling the =) as the preferred way to escape it, since that would be easier to read when someone did use an “=” in a filename. Then “foo\nbar” would become “foo=0Abar”. This could also be used to escape names that aren’t valid UTF-8 names. It could even be used to escape metacharacters and spaces, though I don’t think everyone would want that :-).
An alternative would be “%” as the escape character, again followed by “%” if the original character was “%”, and a 2-digit hex value if it was any other forbidden character value. Then “foo\nbar” would become “foo%0Abar”. One problem: using % is also the convention for URLs, and since URLs are often mapped directly to filenames, there might be interference. I think I prefer “=” over “%”.
The “+” character is reasonable, though a few programs do use “+” as an option flag, and the built-in directory lost+found of filesystems would be renamed (making it slightly less good). One advantage to using “+” is that you could then use UTF-7 to encode the characters that need to be escaped; UTF-7 is at least widely implemented.
Some escape characters are especially bad. I suggest not using “\” (this is an escape character for C, Python, and shell) or “&” (an escape character for HTML/XML), because combining them could be very confusing. Avoid the main glob characters (“*”, “?”, and “[”); that way, accidentally omitting shell quotes is less likely to be painful.
An alternative would be to use a rarely-used UTF-8 character as the escape character; the escape character would take more bytes, and on some systems it might be harder to type, but that would reduce the “it’s already being used” objection. Unfortunately, what is rarely used by one person might be important to someone else.
You could use an illegal UTF-8 prefix as the escape character, such as 0xFD, 0x81, or 0x90. This could be followed by two ASCII bytes that give the hexadecimal value of the bad byte. After all, if we have a bad character in the filename, it would be sensible to not produce a legal UTF-8 sequence at all. Then we can handle all legal UTF-8 sequences as filenames, since our escape mechanism can’t be confused with legal UTF-8 sequences. The byte 0xFD is reasonable, since it is not legal in UTF-8 (it begins a 6-byte UTF-8 sequence, but more recent rulings such as RFC 3629 forbid 6-byte sequences). But using 0xFD more-or-less assumes that you will use UTF-8 filenames, since in many other encodings this may step on an existing character. The bytes 0x81 and 0x90 have some additional interesting properties: Not only are they illegal as a UTF-8 first byte, but these bytes are also not included in many Windows code pages (such as Windows-1252) and are not in many ISO/IEC code pages (such as ISO/IEC 8859-1). Thus, many people could use 0x81 or 0x90 as the escape sequence prefix to escape bad bytes in filenames (like control characters and leading dashes), even if they did not want to switch to UTF-8 filenames. Yet those using UTF-8 filenames could use exactly the same prefix. This means that we would not need to configure the prefix value, at least in many cases, and that makes many things easier. Between the two, I think I would pick 0x81, simply because it is the first value with the right properties. One negative of this is that programs and programming languages will forever have to deal with illegal UTF-8 sequences in filenames, enshrining them instead of slowly getting rid of them.
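To illustrate the 0x81 variant, here is a small Python sketch. The function names and the policy for what counts as “bad” (control characters and a leading dash) are assumptions made for the example, not a fixed proposal.

```python
PREFIX = 0x81  # assumed escape prefix; 0x90 would work the same way


def is_utf8_continuation(byte):
    """True for bytes of the form 10xxxxxx, which can never start a character."""
    return byte & 0xC0 == 0x80


def is_valid_utf8(data):
    """True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def escape_with_prefix(stored):
    """Escape bad bytes as PREFIX followed by two uppercase hex digits."""
    out = bytearray()
    for i, byte in enumerate(stored):
        # Hypothetical policy: control characters and a leading dash are bad.
        bad = byte < 0x20 or byte == 0x7F or (i == 0 and byte == 0x2D)
        if bad:
            out.append(PREFIX)
            out += b"%02X" % byte
        else:
            out.append(byte)
    return bytes(out)
```

For example, `escape_with_prefix(b"-rf\n")` yields the byte 0x81, then `2Drf`, then 0x81, then `0A`. Because 0x81 is a continuation byte sitting where a character should start, the escaped name never parses as legal UTF-8, so it cannot collide with a good UTF-8 filename.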
You could also escape the bad byte as an overlong UTF-8 sequence, e.g., store the control characters 1-31 as two bytes instead of one. Then, if we receive a UTF-8 sequence that is overlong, we convert it back to the original byte before storing it (while not allowing \0 and slash in the stored filename). One nice property of this is that display systems are more likely to display these correctly, if there is a way to display them at all (e.g., they may display leading dash as leading dash). Again, this more-or-less assumes you are using UTF-8 filenames for external (user-level) representation. A big problem is that some programming language libraries may read these overlong sequences in and convert them to ordinary Unicode characters; then, when they are written out, they could be written as ordinary (non-overlong) characters, changing their meaning. So, while at first I thought this made sense, now I think this is a bad idea.
A different approach would be something similar to Python PEP 383 encoding (though encoding non-slash bad bytes 1-127 as well). In short, encode each bad byte (other than ASCII NUL \0 and slash) to U+DCxx (the low-surrogate code points), then encode that with UTF-8. This would include encoding bytes that are not valid UTF-8 in the underlying filesystem. The advantage of this approach is that PEP 383 encoding doesn’t interfere with good filenames at all; it only renames bad filenames. Bad filenames would get a little longer (each bad byte becomes 3 bytes), but there shouldn’t be many bad filenames in the first place, and many bad filenames only have a few bad bytes (unless they are due to encoding mismatch). Thus, newline \n (0x0A, aka U+000A) would become Unicode “character” U+DC0A, encoding to UTF-8 0xED 0xB0 0x8A. Similarly, a leading dash (ASCII 0x2D) becomes Unicode U+DC2D, encoding to UTF-8 0xED 0xB0 0xAD. The largest possible bad byte is 0xFF, becoming Unicode U+DCFF, encoding to UTF-8 0xED 0xB3 0xBF. When the kernel gets a filename from userspace that includes UTF-8-encoded U+DCxx characters, they would be decoded back to the original bytes (except for encodings of \0 and “/”, which would be ignored). If the filename stored on disk already has a UTF-8 encoding of U+DCxx, it would be encoded again (so that when it is decoded later we end up with the original filename). Enabling this in some sense requires that filenames normally be UTF-8 (ASCII is a valid subset of UTF-8), since many other encodings would permit 0xED as a valid character, but it would work as an intermediate stage; if a filename uses a different encoding, it can still be found and then renamed to UTF-8. These might display in an ugly way, but that is often true even without encoding, and display systems could be taught to display these with “?” or some such. Such filenames are likely to be considered legal UTF-8, and thus programs that expect UTF-8 will like these filenames.
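The byte values above can be checked with a short Python sketch. This is only an illustration under assumptions: the function names and the “bad byte” policy (control characters and a leading dash) are mine, and for brevity it does not handle bytes that are invalid UTF-8, nor the re-encoding of U+DCxx sequences already stored on disk. Python’s `surrogatepass` error handler is used because strict UTF-8 encoding refuses lone surrogates.

```python
import re

# UTF-8 encodings of U+DC00..U+DCFF run from ED B0 80 to ED B3 BF.
ESCAPED = re.compile(rb"\xED[\xB0-\xB3][\x80-\xBF]")
NEVER_ESCAPED = (0x00, 0x2F)  # NUL and "/" are never encoded or decoded


def escape_byte(byte):
    """Encode one bad byte as the UTF-8 form of U+DCxx (three bytes)."""
    if byte in NEVER_ESCAPED:
        raise ValueError("NUL and '/' must not be escaped")
    return chr(0xDC00 + byte).encode("utf-8", "surrogatepass")


def escape_name(stored):
    """Escape bad bytes in a stored filename component."""
    out = bytearray()
    for i, byte in enumerate(stored):
        # Hypothetical policy: control characters and a leading dash are bad.
        bad = 0 < byte < 0x20 or byte == 0x7F or (i == 0 and byte == 0x2D)
        out += escape_byte(byte) if bad else bytes([byte])
    return bytes(out)


def unescape_name(shown):
    """Decode U+DCxx sequences in a name supplied by userspace."""
    def replace(match):
        code_point = ord(match.group(0).decode("utf-8", "surrogatepass"))
        byte = code_point - 0xDC00
        if byte in NEVER_ESCAPED:
            return match.group(0)  # encodings of NUL and "/" are ignored
        return bytes([byte])
    return ESCAPED.sub(replace, shown)
```

Here `escape_byte(0x0A)` produces the bytes ED B0 8A, `escape_byte(0x2D)` produces ED B0 AD, and `escape_byte(0xFF)` produces ED B3 BF, matching the values worked out by hand.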
The specific escape sequence could be an administrator setting. Unfortunately, if it can be set, that will tend to make things more complicated, and we don’t need more complications... it would be nicer to have a fixed escape sequence that we could count on.
One challenge is what to do about filenames that are so long thatthey “can’t” be expanded; at that point, it may be better to simplynot show the filename at all (and let specialized tools recover it).In practice, this is unlikely to be a problem.
My thanks to Adam Spragg, who convinced me to expand the description on doing renaming in the kernel.
One problem with this approach is that programs that have extra privileges are exactly the programs that you most don’t want fooled by misleading filenames. A sneaky alternative, which again could be a configuration option, might be that only privileged programs could create bad filenames, but only unprivileged programs could see them later. I think that alternative is amusing but a bad idea; better to forbid or escape bad filenames outright. In general, I think this is overly complicated, but I mention it for completeness.
Shells are especially important, because you want their command substitution, variable substitution, and “read” operations to be able to easily do splitting on only \0. You’d need to modify all the Bourne-like shells so that they could use \0 as a field splitting character, either by default, new syntax, or via some easy setting (e.g., a glob setting). The zsh shell can include \0 inside variables but many shells cannot. It would probably be easier to get wider acceptance if mechanisms that all shells could easily support were created, even if the shell does not support \0 (null byte aka NUL) inside variables. This is especially important for the for loop, as this is the easy way to loop over returned filenames. The idea would be to make something like IFS="" for x in `find . -print0` or some such to split cleanly on \0 (it doesn’t today on most shells). No, IFS=$'\0' doesn’t work in bash 3.2.39 (unsurprisingly; C programs often don’t like multicharacter strings that contain \0). It really should be the default; if it’s not, devising good syntax for this is tricky! One possibility is to specially interpret a 0-length IFS value in the for loop as splitting on \0. Another possibility would be to devise another special setting or syntax that meant “when splitting, ignore IFS and split on \0 instead”. Then modify the “for” loop syntax to be "for name [ [using word [ in word ] ] ; do list ; done"; the “using” stuff could set the \0 setting or an IFS value, which would apply only to the in word part. Another variant would be to allow null in or zero in instead of in, meaning to split on \0 instead of using IFS. Hopefully someone will come up with something better! The Bourne-like shells’ “read” command will also need to be able to easily read \0-delimited values; you already can do this in bash 3.2.39 using the -d option, e.g., IFS="" read -r -d '' (“don’t use IFS to split this up at all, don’t interpret backslash specially, and the delimiter is \0”). But not only is this nonstandard, don’t you see how complicated that is? It has to be really easy to use, like “read -0”, or people will forget to use it.
It’s not just shells; there are a lot of other tools that might need to generate or accept filename lists, and they’d all need to be modified to handle \0 as a separator. The programs “find”, “xargs”, and “sort” are obvious, but almost anything that does line-at-a-time processing might need to support \0 as a possible separator instead, and almost anything that might generate or use filename lists will need to be modified so it can use \0 as the separator instead of newline (or whatever it normally uses). And you can’t change just one implementation of a command like “sort”; you’d have to modify GNU’s sort and BSD’s sort and busybox’s sort and so on. Of course, after modifying all these infrastructure utilities, you’d have to modify every program that processes filenames (!) to actually use these new abilities. You could try to use an environment variable that means “default to using \0 separators”, but turning on such a variable will probably mess up many programs that do work correctly, so I have little hope for that. And after all this, displaying filenames is still dangerous (due to terminal escapes) and inconsistent (due to a lack of standard encoding).
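To show what “handle \0 as a separator” means for an individual tool, here is a rough Python sketch of the reading side. The helper name `split_nul` is made up for this example; it assumes the input is a list of NUL-terminated names of the kind that `find -print0` writes.

```python
def split_nul(data):
    """Split a byte string of NUL-terminated filenames into a list of names."""
    if not data:
        return []
    names = data.split(b"\0")
    # A well-formed list ends with NUL, which leaves one empty trailing item.
    if names[-1] == b"":
        names.pop()
    return names


# Two filenames: one contains a newline, the other a leading dash and a space.
listing = b"a\nb\0-c d\0"

by_nul = split_nul(listing)      # the two real filenames
by_line = listing.splitlines()   # line-at-a-time processing gets this wrong
```

Splitting on \0 recovers exactly the two names, while splitting on newlines cuts the first name in half and glues the rest together. The catch is the one described above: every program in the pipeline has to do this, not just one of them.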
Few people really believe that filenames should have this junk, and you can prove that just by observing their actions. Their programs, when you read them, are littered with assumptions that filenames are “reasonable”. They assume that newlines and tabs aren’t in filenames, that filenames don’t start with “-”, that you can meaningfully and safely print filenames, and so on. Actions speak louder than words; unless it is easy, people will not do it. Continuing to allow filenames to contain almost anything makes it very complicated to have correctly-working secure systems. I’m happy to help with the “make it easier” stuff, but in the long run, I don’t think that is enough. By changing a few lines of kernel code, millions of lines of existing code will work correctly in all cases, and many vulnerabilities will evaporate.
This won’t happen overnight; many programs will still have to handle “bad” filenames as this transition occurs. But we can start making bad filenames impossible now, so that future software developers won’t have to deal with them.
What is “bad”, though? Even if they aren’t universal, it’d be useful to have a common list so that software developers could avoid creating “non-portable” filenames. Some restrictions are easier to convince people of than others; administrators of a locked-down system might be interested in a longer list of rules. Here are possible rules, in order of importance (I’d do the first two right away, the third as consensus can be achieved, and the later ones would probably only apply to individual systems):
In particular, ensuring that filenames had no control characters, no leading dashes, and used UTF-8 encoding would make a lot of software development simpler. This is a long-term effort, but the journey of a thousand miles starts with the first step.
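Those three checks are simple enough to sketch in a few lines of Python. The function name and the wording of the messages are just assumptions for illustration; a real policy would live in the kernel or filesystem, not in a script.

```python
def filename_problems(name):
    """Return a list of reasons why a filename component (bytes) is 'bad'."""
    problems = []
    # Control characters: bytes 1-31 and DEL (127); NUL cannot appear at all.
    if any(byte < 0x20 or byte == 0x7F for byte in name):
        problems.append("control character")
    if name.startswith(b"-"):
        problems.append("leading dash")
    try:
        name.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        problems.append("not valid UTF-8")
    return problems
```

An empty list means the name passes all three checks; for instance, `filename_problems(b"-rf")` reports only the leading dash.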
Feel free to see my home page at https://dwheeler.com. You may also want to look at my paper Why OSS/FS? Look at the Numbers! and my book on how to develop secure programs.
(C) Copyright 2009 David A. Wheeler.