UTF-8 and the problem of over-long characters
What is the problem?

Developers with an interest in IT security have been aware of concerns about "over-long" UTF-8 character encoding for at least ten years, but the problem seems to have resurfaced recently. This article describes what an over-long encoding is, and discusses whether developers should be concerned about it. However, to understand the problem properly, we must first know something about UTF-8 encoding, and how it works.
How UTF-8 works
The Unicode standard assigns a numerical value (called a code point) to each character in a wide range of alphabets. The first 128 code points correspond loosely to the traditional ASCII character set, so the symbol "A" is number 65 in both ASCII and Unicode. There are some technical complications regarding the meanings of code points with values less than 32 (the ASCII "control characters") but they need not concern us here.
The majority of written symbols in common use have code points that will fit into a 16-bit number (0-65535) but, if we want to represent less common scripts such as Takri or Cuneiform, larger numbers are necessary. The largest code point in current use is about 125 000; the Unicode specification allows for 1 112 064 code points in total. Even the largest of these will certainly fit into a 32-bit number, but representing every symbol in a text as a 32-bit number is likely to be hopelessly inefficient where there are large amounts of text.
Encoding is the process of writing the numeric value of a Unicode code point in some agreed format. UTF-32 encoding is the nearest thing we have to simply writing out the code point in binary. UTF-32 is easy to read, write, and understand; but since the majority of characters in common use will fit into 16 bits, at least 50% of the storage or transmission capacity is wasted with most alphabets. UTF-16 is an encoding of at least 16 bits but, again, for Western alphabets, most of which are based on Latin characters, it's still rather inefficient.
UTF-8 encoding uses a simple 8-bit representation for the symbols that are the same as traditional ASCII (code points below 128 -- most characters in Western alphabets), and a variable-length encoding for all other code points. Broadly speaking, the length of a UTF-8 character encoding increases with the code point number; no code point needs more than four bytes, and most characters in Western text need only one. It is thus a very storage-efficient encoding, but one that is fiddly to encode and decode.
In a UTF-8-encoded character, if the first (most significant) bit of the first (most significant) byte is a zero, then this will be a one-byte sequence, interpreted broadly the same as ASCII. Otherwise the first few bits of the first byte indicate the length of the sequence (that's almost true -- see below). For example, if the first byte begins 110, then this is a two-byte sequence; 1110 indicates a three-byte sequence, and so on. This first byte is called the leading byte. All other bytes (continuation bytes) start with binary 10.
It follows that the number of bits available to encode the character will vary according to the leading byte, and will never (except in the simple ASCII case) amount to more than six bits in each byte. It is this irregular distribution of data bits between framing bits that makes UTF-8 so fiddly to encode and decode.
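To make the bit layout concrete, here is a minimal sketch in C (my own function name, not part of any library) of the first step a decoder takes: working out the sequence length from the leading byte.

#include <stdio.h>
#include <stdint.h>

/* Work out the sequence length implied by a UTF-8 leading byte.
   Returns 0 for a byte that cannot begin a sequence -- that is, a
   continuation byte (10xxxxxx) or an invalid value. Sketch only. */
static int utf8_seq_length (uint8_t lead)
  {
  if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx -- one byte, plain ASCII */
  if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx -- two bytes */
  if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx -- three bytes */
  if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx -- four bytes */
  return 0;
  }

int main (void)
  {
  printf ("%d %d %d\n",
    utf8_seq_length (0x41),   /* 1 -- "A" */
    utf8_seq_length (0xC3),   /* 2 -- leading byte of a two-byte sequence */
    utf8_seq_length (0xE2));  /* 3 -- leading byte of a three-byte sequence */
  return 0;
  }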
Nevertheless, the storage and transmission efficiency has made UTF-8 almost ubiquitous on the Web, and on Linux; so developers need to be aware of the problems that arise when it isn't handled carefully.
An example -- and why the problem exists
Consider the symbol "A", code point 65, or 0x41 in hexadecimal. Because this is a number less than 128, it can be represented as a one-byte sequence, just as in ASCII. This is the natural, and preferred, representation.
Now consider an alternative way of encoding this symbol in UTF-8. It turns out that it can be encoded as the sequence 0xC1, 0x81. Why?
In binary, this sequence is:
1100 0001 1000 0001
(I'm implicitly assuming a big-endian way of writing binary, here. Apologies in advance to those who like to crack their boiled eggs at the other end).
The first binary byte starts with 110, so this is a two-byte sequence. The second byte begins with 10, marking it as a continuation byte. These framing bits are removed when the sequence is decoded, leaving
00001 000001
These bits are interpreted as four-bit groups, starting from the right (least significant bit) end of the bit pattern. Rewriting with this grouping gives us:
000 0100 0001
So we have 0100 (4) and 0001 (1), giving 0x41, or code point 65 -- the symbol "A" encoded using two bytes, rather than one.
It gets worse -- the same symbol can be written as three bytes: 0xE0, 0x81, 0x81. Again in binary we have:
(1110) [0000] (10) [000001] (10) [000001]
I've put round brackets () around the framing bits, which are ignored, and grouped the actual data using square brackets. The number encoded here is 0000 0000 0100 0001, or 0x0041 or, again, 65. There isn't any difference between 0x0041 and 0x41, or even 0x0000000041 for that matter -- they are all 65.
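The point is easy to demonstrate in code. The following sketch is a deliberately lenient decoder -- my own illustration, not something to copy into real code -- which strips the framing bits and assembles the data bits without checking that the shortest form was used:

#include <stdio.h>
#include <stdint.h>

/* A deliberately lenient decoder: it strips the framing bits and
   assembles the data bits, but does not reject over-long forms. */
static uint32_t decode_leniently (const uint8_t *s, int len)
  {
  static const uint8_t lead_mask[] = { 0, 0x7F, 0x1F, 0x0F, 0x07 };
  uint32_t cp = s[0] & lead_mask[len];
  for (int i = 1; i < len; i++)
    cp = (cp << 6) | (s[i] & 0x3F);   /* six data bits per continuation byte */
  return cp;
  }

int main (void)
  {
  const uint8_t one[]   = { 0x41 };               /* shortest form of "A"   */
  const uint8_t two[]   = { 0xC1, 0x81 };         /* over-long, two bytes   */
  const uint8_t three[] = { 0xE0, 0x81, 0x81 };   /* over-long, three bytes */
  printf ("%X %X %X\n",
    (unsigned) decode_leniently (one, 1),
    (unsigned) decode_leniently (two, 2),
    (unsigned) decode_leniently (three, 3));      /* prints "41 41 41" */
  return 0;
  }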
The problem of over-long encoding
Problems arise when an application processes UTF-8 data from some untrusted source, and the developer assumes that the encoding will always be in the shortest form. Consider a situation where the data specifies a filename, and the application wants to ensure that this filename is constrained to be within some particular directory and its subdirectories. This problem commonly arises in web servers, where we want to ensure that the user cannot specify a URL that maps to some arbitrary filename.
In a case like this, we will want to remove any characters that the operating system will interpret as "higher-level directory." If we allow a UTF-8 string like this as input:
/etc/passwd
there could well be trouble. We need at least to remove the leading "/". This is easy enough if we can be sure that the characters will always be in shortest form -- it's just a matter of looking for the byte 0x2F. You might wonder whether this byte could legitimately appear in a UTF-8 multi-byte sequence; the answer is 'no', because only one-byte encodings have a zero in their most significant bit. So this is a safe thing to do.
Unfortunately, "/" can also be encoded as the over-long sequence0xC0,0xAF. This is harder to spot in a string of bytesthan 0x2F, particularly as this sequence might conceivablyoccur legitimately in some input. So the UTF-8 input might remainun-sanitized, and the directory separator "/" eventually passed to the operating system in a filename.
Worse, there are many other potential over-long encodings of the "/" character, and of other characters for which we might need to check. The uncomfortable reality is that sanitizing UTF-8 is much more difficult than it might first appear, if we don't somehow reject over-long encodings first.
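One way to do that rejection is to decode each sequence and then check whether the resulting code point actually needed that many bytes. Here is a minimal sketch (my own helper names; it assumes the sequence length and framing bits have already been checked):

#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

/* Smallest code point that genuinely needs a sequence of each length:
   two bytes are needed from U+0080, three from U+0800, four from U+10000. */
static const uint32_t min_for_length[] = { 0, 0x00, 0x80, 0x800, 0x10000 };

/* True if the code point 'cp' was encoded in more bytes than it needs.
   (A full validator must also reject surrogates and values above 0x10FFFF.) */
static bool is_overlong (uint32_t cp, int seq_length)
  {
  return cp < min_for_length[seq_length];
  }

int main (void)
  {
  /* 0xC0 0xAF decodes to 0x2F ("/"), but 0x2F needs only one byte. */
  printf ("%s\n", is_overlong (0x2F, 2) ? "over-long" : "ok");  /* over-long */
  printf ("%s\n", is_overlong (0xE9, 2) ? "over-long" : "ok");  /* ok: U+00E9 really needs two bytes */
  return 0;
  }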
The scope of the problem
So how likely is it that over-long characters might be used to create a security weakness?
In C/C++
The problem I described above arose because we treated UTF-8 data as if it were ASCII. We are likely to do this -- if we do -- because, in many cases, UTF-8 can be manipulated like ASCII. Because the null (0) byte cannot occur in a UTF-8 string except when encoding the null character, many of the same C functions that work on ASCII work equally well on UTF-8. These functions all rely on a text string being represented as a sequence of bytes terminated with a null. We can copy UTF-8 strings in memory using strcpy(), compare them (badly) using strcmp(), and so on. We can even use strlen() to give the length of the string, in terms of the amount of storage it occupies. It won't tell us the number of characters in a UTF-8 string, because some characters will take more than one byte -- but it's a common (incorrect) programming practice that actually works a lot of the time. This kind of sloppy programming -- which is all it is -- doesn't arise when handling encodings like UTF-16, because in practice the data is full of zero bytes. These zeros are treated as null terminators by the C library functions that handle strings, and so the program just fails completely at an early stage. With UTF-8, however, we can continue to treat the data like ASCII a lot of the time without noticing a problem.
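For instance, strlen() reports the storage length of a UTF-8 string -- the byte count -- which looks right until multi-byte characters appear:

#include <stdio.h>
#include <string.h>

int main (void)
  {
  /* "é" (U+00E9) occupies two bytes in UTF-8 (0xC3 0xA9), so this
     four-character word occupies five bytes. */
  const char *word = "caf\xC3\xA9";
  printf ("%zu bytes\n", strlen (word));   /* prints "5 bytes" */
  return 0;
  }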
Manipulating UTF-8 as if it were ASCII is a very common (bad) thing to do in C/C++ programming, because C does not have any standard, low-level representation of a Unicode string. Various C++ libraries exist that do provide such representations, but they are only helpful if developers actually use them, and are not tempted to fall back on the familiar built-in routines that don't know the difference between UTF-8 and ASCII.
In code that we write ourselves, it's easy to be careful about how we process UTF-8; it's less obvious what to do when using somebody else's code, or a library. Suppose you are given some code that implements a function that removes certain character sequences from the beginning of a string:
char *removeAllAtStart (char *inputString, char *substring);
You want to call it like this, to remove any "/" at the start of your filename:
char *sanitizedFilename = removeAllAtStart (filename, "/");
Even if the function is fully UTF-8 aware (and how can you be sure of this?), it's still going to fail here. Why? Because you have specified the search string as "/", which the compiler will treat as null-terminated ASCII, not Unicode. The function removeAllAtStart would have to be very clever indeed to search for all the possible over-long representations, not only in the input to be modified, but also of the substring the programmer specified.
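To make the failure concrete, here is a minimal stand-in for such a function (my own sketch, not the real removeAllAtStart), fed an input that starts with the over-long form of "/":

#include <stdio.h>
#include <string.h>

/* A minimal stand-in for the kind of function described above: it strips
   repeated copies of 'substring' from the start of 'inputString'. */
static char *removeAllAtStart (char *inputString, char *substring)
  {
  size_t n = strlen (substring);
  while (n > 0 && strncmp (inputString, substring, n) == 0)
    inputString += n;
  return inputString;
  }

int main (void)
  {
  /* The over-long encoding of "/" (0xC0 0xAF) at the start of the input. */
  char filename[] = "\xC0\xAF" "etc/passwd";
  char *result = removeAllAtStart (filename, "/");
  /* 0xC0 is not the byte 0x2F, so nothing is removed. */
  printf ("first byte after sanitizing: 0x%02X\n",
    (unsigned char) result[0]);            /* prints "0xC0" */
  return 0;
  }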
In short, in any situation where UTF-8 strings are processed using functions designed for ASCII, there is a possibility that over-long characters can create a security weakness. The only solution is to remove all over-long characters from the input, using a processor that is fully aware of the subtleties of UTF-8 encoding.
In Java
Problems like the above are far less likely to arise in Java, because Unicode strings and characters are first-class language elements. In Java, when I write
String remove = "/";
both the literal "/" and the String object to which it is assigned are Unicode. Consequently, even naive use of Java is unlikely to expose a weakness.
It's not impossible, however. Suppose we are using an XML parser that exposes the low-level byte stream for pre-processing. We want to filter out and replace certain characters or substrings ("/", "..", etc.) as part of this pre-processing. Again, if the input is UTF-8, and if we assume wrongly that each byte can be checked independently, then we have the same problem that the C/C++ programmer has.
The solution is simple, at least in principle -- never, ever work on the byte representations of text strings in Java unless you are really, 100% sure you know what you are doing. Once the text has been converted to Java's internal representation, it's perfectly safe to use Java's built-in APIs to process it.
Other languages
As we have seen, over-long UTF-8 encoding can be a problem in any development scenario where we need to sanitize text input, and we use methods that are appropriate for ASCII strings, rather than Unicode. Any language is potentially at risk, but languages that treat text strings as byte arrays are most vulnerable. This includes PHP and Perl -- although these languages do have support for Unicode, there is an implicit assumption that strings are one byte per character unless special care is taken. Python, like Java, mostly uses Unicode strings natively.
Even in a programming language that has native Unicode support, we still have to be careful about the use of external libraries, particularly those written in C.
Conclusion
Some time back, over-long UTF-8 encodings were a real problem for web browsers and web applications. The problem was solved simply by removing them and replacing them with an error marker. This caused further processing to fail, but in a relatively safe way. There is not, and probably never has been, a legitimate use of over-long encodings, so stripping them is not usually a hostile thing to do.
Recently, over-long encodings have become a cause for concern again, perhaps because enough time has elapsed for developers to forget the risk they pose. The risk is smallest in languages like Java, which make it quite difficult to manipulate the individual bytes in a text string, but all languages are potentially vulnerable if care is not taken.
The only definitive solution, when working with UTF-8 input, is to convert it to a reliable Unicode representation at the earliest possible stage, and never work on the underlying bytes.