Alyssa Coghlan's Python Notes

Python 3 and ASCII Compatible Binary Protocols

Last Updated: 6th January, 2014

If you pay any attention to the Twittersphere (and likely several other environments), you may have noticed various web framework developers having a few choice words regarding the Unicode handling design in Python 3.

They actually have good reason to be upset with python-dev: we broke their world. Not only did we break it, but we did it on purpose.

What did we break?

What we broke is a very specific thing: many of the previously idiomatic techniques for transparently accepting both Unicode text and text in an ASCII compatible binary encoding no longer work in Python 3. Given that the web (along with network protocols in general) is built on the concept of ASCII compatible binary encodings, this is causing web framework developers an understandable amount of grief as they start making their first serious efforts at supporting Python 3.

The key thing that changed is that it is no longer easy to write text manipulation algorithms that can work transparently on either actual text (i.e. 2.x unicode objects) or text encoded to binary using an ASCII compatible encoding (i.e. some instances of 2.x str objects).

There are a few essential changes in Python 3 which make this no longer practical:

  • In 2.x, when unicode and str meet, the latter is automatically promoted to unicode (usually assuming a default ascii encoding). In 3.x, this changes such that when str (now always Unicode text) meets bytes (the new binary data type) you get an exception. Significantly, this means you can no longer share literal values between the two algorithm variants (in 2.x, you could just use str literals and rely on the automatic promotion to cover the unicode case).

  • Iterating over a string produces a series of length 1 strings. Iterating over a 3.x bytes object, on the other hand, produces a series of integers. Similarly, indexing a bytes object produces an integer - you need to use slicing syntax if you want a length 1 bytes object.

  • The encode() and decode() convenience methods no longer support the text->text and binary->binary transforms, instead being limited to the actual text->binary and binary->text encodings. The codecs.encode and codecs.decode functions need to be used instead in order to handle these transforms in addition to the regular text encodings (these functions are available as far back as Python 2.4, so they’re usable in the common subset of Python 2 and Python 3).

  • In 2.x, the unicode type supported the buffer API, allowing direct access to the raw multi-byte characters (stored as UCS4 in wide builds, and a UCS2/UTF-16 hybrid in narrow builds). In 3.x, the only way to access this data directly is via the Python C API. At the Python level, you only have access to the code point data, not the individual bytes.
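The first two changes in the list above are easy to observe directly. The following is a minimal Python 3 sketch; the variable names are purely illustrative and do not come from any particular framework:

```python
# Python 3: text (str) and binary data (bytes) no longer mix implicitly.
text = "Content-Type"
data = b"Content-Type"

mixing_error = None
try:
    text + data
except TypeError as exc:
    # Python 3 refuses to guess an encoding and raises instead
    mixing_error = exc

# Iterating over text yields length 1 strings...
text_items = list("abc")
# ...while iterating over bytes yields integers.
byte_items = list(b"abc")

# Indexing bytes gives an integer; slicing gives a length 1 bytes object.
first_as_int = data[0]
first_as_bytes = data[0:1]
```

Any algorithm that relies on comparing individual elements against a literal such as ":" therefore needs a different literal, and often a different comparison, depending on whether it was handed text or binary data.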

The recommended approach to handling both binary and text inputs to an API without duplicating code is to explicitly decode any binary data on input and encode it again on output, using one of two options:

  1. ascii in strict mode (for true 7-bit ASCII data)

  2. ascii in surrogateescape mode (to allow any ASCII compatible encoding)

However, it’s important to be very careful with the latter approach - when applied to an ASCII incompatible encoding, manipulations that assume ASCII compatibility may still cause data corruption, even with explicit decoding and encoding steps. It can be better to assume strict ASCII-only data for implicit conversions, and require external conversion to Unicode for other ASCII compatible encodings (e.g. this is the approach now taken by the urllib.parse module).
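As a rough illustration of the two options, the sketch below decodes a header-like value, manipulates it as text, and encodes the result again. The helper names to_text and to_binary and the sample header values are assumptions made up for this example:

```python
def to_text(data, errors="strict"):
    """Decode ASCII compatible binary data so it can be handled as text."""
    return data.decode("ascii", errors)

def to_binary(text, errors="strict"):
    """Reverse to_text() once the text manipulation is finished."""
    return text.encode("ascii", errors)

pure_ascii = b"Host: example.com"
latin1_header = b"X-Name: Zo\xeb"  # ASCII compatible (Latin-1), but not pure ASCII

# Option 1: strict mode accepts true 7-bit data and rejects everything else.
host_text = to_text(pure_ascii)
strict_rejected = False
try:
    to_text(latin1_header)
except UnicodeDecodeError:
    strict_rejected = True

# Option 2: surrogateescape smuggles the non-ASCII bytes through as lone
# surrogates, so they can be restored unchanged when encoding again.
smuggled = to_text(latin1_header, "surrogateescape")
name, _, value = smuggled.partition(": ")
restored_value = to_binary(value, "surrogateescape")
```

The round trip is only lossless because both calls use the same error handler; encoding the smuggled text in strict mode would fail instead.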

Why did we break it?

That last paragraph in the previous section hints at the answer: assuming that binary data uses an ASCII compatible encoding and manipulating it accordingly can lead to silent data corruption if the assumption is incorrect.

In a world where there are multiple ASCII incompatible text encodings in regular use (e.g. UTF-16, UTF-32, ShiftJIS, many of the CJK codecs), that’s a problem.

Another regular problem with code that supposedly supports both Unicode and encoded text is that it may not correctly handle multi-byte, variable width and other stateful encodings where the meaning of the current byte may depend on the values of one or more previous bytes, even if the code does happen to correctly handle ASCII-incompatible stateless single-byte encodings.
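A small sketch of this failure mode, using Shift JIS as the example encoding. The file name is invented for illustration; the relevant fact is that the character U+8868 encodes to two bytes whose second byte has the same value as an ASCII backslash:

```python
# U+8868 (a common kanji) is b"\x95\x5c" in Shift JIS, and 0x5C is also
# the ASCII code for a backslash.
encoded = "\u8868.txt".encode("shift_jis")

# Code that assumes ASCII compatibility and splits on backslashes, with
# explicit decode and encode steps using surrogateescape.
decoded = encoded.decode("ascii", "surrogateescape")
pieces = [
    part.encode("ascii", "surrogateescape")
    for part in decoded.split("\\")
]

# No exception was raised above, but the two-byte character has been cut
# in half, so the first piece is no longer valid Shift JIS.
first_piece_is_valid = True
try:
    pieces[0].decode("shift_jis")
except UnicodeDecodeError:
    first_piece_is_valid = False
```

Nothing in the splitting code signals a problem; the damage only becomes visible if something later tries to interpret the pieces in their real encoding.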

All of these problems can be dealt with if you appropriately vet the encoding of any binary data that is passed in. However, this is not only often easier said than done, but Python 2 doesn’t really offer you any good tools for finding out when you’ve stuffed it up. They’re data driven bugs, but the errors may never turn into exceptions, instead just causing flaws in the resulting text output.

This was a gross violation of “The Zen of Python”, specifically the part about “Errors should never pass silently. Unless explicitly silenced”.

As a concrete example of the kind of obscure errors this can cause, I recently tracked down a problem that was leading to my web server receiving a request that consisted solely of the letter “G”. From what I have been able to determine, that error was the result of:

  1. M2Crypto emitting a Unicode value for an HTTP header value

  2. The SSL connection combining this with other values, creating an entire Unicode string instead of the expected byte sequence

  3. The SSL connection interpreting that string via the buffer API

  4. The SSL connection seeing the additional NULs due to the UCS4 internal encoding and truncating the string accordingly

This has now been worked around by explicitly encoding the Unicode value erroneously emitted, but it was a long hunt to find the problem when the initial symptom was just a 404 error from the web server.

Since Python 3 is a lot fussier when it comes to the ways it will allow binary and text data to implicitly interact, this would have been picked up client side as soon as any attempt was made to combine the Unicode text value with the already encoded binary data.
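The truncation to a single “G” can be approximated in a few lines. This is not the M2Crypto or OpenSSL code involved; it assumes that UTF-32-LE is a reasonable stand-in for the UCS4 internal buffer of a wide Python 2 build on a little endian machine, and the request line is an invented example:

```python
request_line = "GET /index.html HTTP/1.1"

# Assumption: the raw UCS4 buffer looks like the UTF-32-LE encoding,
# i.e. four bytes per character with three NUL bytes after each ASCII one.
raw_buffer = request_line.encode("utf-32-le")
leading_bytes = raw_buffer[:8]

# A consumer that treats the buffer as a NUL-terminated C string stops
# reading at the first NUL byte.
as_c_string = raw_buffer.split(b"\x00", 1)[0]
```

Under those assumptions the consumer sees only the first byte of the request, which matches the symptom described above.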

The other key reason for changing the text model of the language is that the Python 2 model only works properly on POSIX systems. Unlike POSIX, Unicode capable interfaces on Windows, the JVM and the CLR (whether .NET or Mono), use Unicode natively rather than using encoded bytestrings.

The Python 3 model, by contrast, aims to handle Unicode correctly on all platforms, with the surrogateescape error handler introduced to handle the case of data in operating system interfaces that doesn’t match the declared encoding on POSIX systems.
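The sketch below imitates that situation without touching the real file system. It assumes a declared encoding of UTF-8 and a file name that was actually written as Latin-1; on POSIX systems, os.fsdecode() and os.fsencode() apply the same error handler using the configured file system encoding:

```python
# Assumption: UTF-8 is the declared encoding, but this name is Latin-1.
raw_name = b"caf\xe9.txt"

strict_decode_failed = False
try:
    raw_name.decode("utf-8")
except UnicodeDecodeError:
    strict_decode_failed = True

# surrogateescape keeps the undecodable byte as a lone surrogate, so
# ordinary text operations still work on the rest of the name...
name = raw_name.decode("utf-8", "surrogateescape")
has_txt_suffix = name.endswith(".txt")

# ...and the original bytes can be recovered exactly when the name is
# passed back to the operating system.
round_tripped = name.encode("utf-8", "surrogateescape")

# A strict encode still refuses to emit the escaped byte silently.
strict_encode_failed = False
try:
    name.encode("utf-8")
except UnicodeEncodeError:
    strict_encode_failed = True
```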

Why are the web framework developers irritated?

We knew when we released Python 3 that it was going to take quite a while for the binary/text split to be fully resolved. Most of the burden of that resolution falls on the shoulders of those dealing with the boundaries between text data and binary protocols. Web frameworks have to deal with these issues both on the network side and on the data storage side.

Those developers also have good reason to want to avoid decoding to Unicode - until Python 3.3 was released, Unicode strings consumed up to four times the memory consumed by 8 bit strings (depending on build options).

That means framework developers face an awkward choice in their near term Python 3 porting efforts:

  • do it “right” (i.e. converting to the text format for text manipulations), and keep track of the need to convert the result back to bytes

  • split their code into parallel binary and text APIs (potentially duplicating a lot of code and making it much harder to maintain)

  • include multiple “binary or text” checks within the algorithm implementation (this can get very untidy very quickly)

  • develop a custom extension type for implementing a str-style API on top of encoded binary data (this is hard to do without reintroducing all the problems with ASCII incompatible encodings noted above, but a custom type provides more scope to make it clear it is only appropriate in contexts where ASCII compatible encodings can be safely assumed, such as many web protocols)

I have a personal preference for the first choice as the current path of least resistance, as reflected in the way I implemented the binary input support for the urllib.parse APIs in Python 3.2. However, the last option (or something along those lines) will likely be needed in order to make ASCII compatible binary protocol handling as convenient in Python 3 as it is in Python 2.

The last option is still one of the options for possible future Python 3 improvements listed under “Is Python 3 more convenient than Python 2 in every respect?”.
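A minimal sketch of the first option, written in the same general spirit as the urllib.parse support but not copied from it. The function split_host_port is a hypothetical helper invented for this example, and it deliberately ignores details such as IPv6 literals:

```python
def split_host_port(netloc):
    """Split 'host:port', accepting either str or ASCII-only bytes.

    Hypothetical helper: the result has the same type as the input.
    """
    is_binary = isinstance(netloc, bytes)
    # Decode on input (strict ASCII), so the algorithm only handles text.
    text = netloc.decode("ascii") if is_binary else netloc

    host, _, port = text.partition(":")

    # Encode on output, so binary callers get binary results back.
    if is_binary:
        return host.encode("ascii"), port.encode("ascii")
    return host, port

text_result = split_host_port("example.com:8080")
binary_result = split_host_port(b"example.com:8080")
```

The text manipulation is written once, and the only type-specific code sits at the two edges of the function. Binary input that is not pure ASCII fails immediately with UnicodeDecodeError rather than being processed on a guess.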

Couldn’t the implicit decoding just be disabled in Python 2?

While Python 2 does provide a mechanism that allows the implicit decoding mechanism to be disabled, actually trying to use it breaks the world:

    >>> import urlparse
    >>> urlparse.urlsplit("http://example.com")
    SplitResult(scheme='http', netloc='example.com', path='', query='', fragment='')
    >>> urlparse.urlsplit(u"http://example.com")
    SplitResult(scheme=u'http', netloc=u'example.com', path=u'', query='', fragment='')
    >>> import sys
    >>> reload(sys).setdefaultencoding("undefined")
    >>> urlparse.clear_cache()
    >>> urlparse.urlsplit("http://example.com")
    SplitResult(scheme='http', netloc='example.com', path='', query='', fragment='')
    >>> urlparse.urlsplit(u"http://example.com")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib64/python2.7/urlparse.py", line 181, in urlsplit
        i = url.find(':')
      File "/usr/lib64/python2.7/encodings/undefined.py", line 22, in decode
        raise UnicodeError("undefined encoding")
    UnicodeError: undefined encoding

(If you don’t clear the parsing cache after disabling the default encoding and retest with the same URLs, that second call may appear to work, but that’s only because it gets a hit in the cache from the earlier successful call. Using a different URL or clearing the caches as shown will reveal the error).

This is why turning off the implicit decoding is such a big deal that it required a major version bump for the language definition: there is a lot of Python 2 code that only handles Unicode because 8-bit strings (including literals) are implicitly promoted to Unicode as needed. Since Python 3 removes all the implicit conversions, code that previously relied on them in order to accept both binary and text inputs (like the Python 2 URL parsing code shown above) instead needs to be updated to explicitly handle both binary and text inputs.

So, in contrast to the Python 2 code above, the Python 3 version not only changes the types of the components in the result, but also changes the type of the result itself:

    >>> from urllib import parse
    >>> parse.urlsplit("http://example.com")
    SplitResult(scheme='http', netloc='example.com', path='', query='', fragment='')
    >>> parse.urlsplit(b"http://example.com")
    SplitResultBytes(scheme=b'http', netloc=b'example.com', path=b'', query=b'', fragment=b'')

However, it’s also no longer dependent on a global configuration setting that controls how 8-bit string literals are converted to Unicode text - instead, the decision on how to convert from bytes to text is handled entirely within the function call.
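The consequences of making that decision inside the function call can be checked with a short Python 3 script. The URLs are made-up examples; the behaviour shown is the strict ASCII handling of binary input by urllib.parse, plus its refusal to mix text and binary arguments:

```python
from urllib.parse import urljoin, urlsplit

# Text in, text out; ASCII-only bytes in, bytes out.
text_netloc = urlsplit("http://example.com/a").netloc
binary_netloc = urlsplit(b"http://example.com/a").netloc

# Binary input that is not pure ASCII is rejected instead of guessed at.
non_ascii_rejected = False
try:
    urlsplit(b"http://ex\xe4mple.com/")
except UnicodeDecodeError:
    non_ascii_rejected = True

# Mixing text and binary arguments in a single call is also an error.
mixed_rejected = False
try:
    urljoin("http://example.com/", b"page")
except TypeError:
    mixed_rejected = True
```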

Where to from here?

The revised text handling design in Python 3 is definitely a case of the pursuit of correctness triumphing over convenience. “Usually handy, but occasionally completely and totally wrong” is not a good way to design a language (If you question this, compare and contrast the experience of programming in C++ and Python. Both are languages with a strong C influence, but the former makes a habit of indulging in premature optimisations that can go seriously wrong if their assumptions are violated. Guess which of the two is almost universally seen as being more developer hostile?).

The challenge for Python 3.3 and beyond is to start bringing back some of the past convenience that resulted from being able to blur the lines between binary and text data without unduly compromising on the gains in correctness.

The efficient Unicode representation in Python 3.3 (which uses the smallest per-character size out of 1, 2 and 4 that can handle all characters in the string) was a solid start down that road, as was the restoration of Unicode string literal support in PEP 414 (as that was a change library and framework developers couldn’t address on behalf of their users).
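The per-character sizes can be estimated from Python itself. This sketch relies on sys.getsizeof(), which reports CPython implementation details, so it assumes CPython 3.3 or later; the helper name bytes_per_character is invented for the example:

```python
import sys

def bytes_per_character(char, count=1000):
    """Estimate storage per character by comparing two string sizes.

    Subtracting the sizes cancels out the fixed object header, leaving
    only the cost of the `count` additional characters.
    """
    longer = sys.getsizeof(char * (count + 1))
    shorter = sys.getsizeof(char)
    return (longer - shorter) // count

ascii_size = bytes_per_character("a")            # fits in 1 byte
bmp_size = bytes_per_character("\u20ac")         # euro sign, needs 2 bytes
astral_size = bytes_per_character("\U0001F600")  # outside the BMP, needs 4
```

A single character outside the narrower range is enough to widen the whole string, since the size is chosen per string rather than per character.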

Python 3.4 restored full support for the binary transform codecs through the existing type neutral codecs module API (along with improved handling of codec errors in general).

Some other possible steps towards making Python 3 as convenient a language as Python 2 for wire protocol handling are discussed in “Is Python 3 more convenient than Python 2 in every respect?”

But for most Python programmers, this issue simply doesn’t arise. Binary data is binary data, text characters are text characters, and the two only meet at well-defined boundaries. It’s only people that are writing the libraries and frameworks that implement those boundaries that really need to grapple with the details of these concepts.
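A brief sketch of what that looks like in practice, assuming Python 3.4 or later. The codec names used are the standard library’s hex_codec, zlib_codec and rot_13 transforms:

```python
import codecs

# binary -> binary transforms go through the type neutral functions.
hex_encoded = codecs.encode(b"hello", "hex_codec")
hex_decoded = codecs.decode(hex_encoded, "hex_codec")

payload = b"x" * 100
compressed = codecs.encode(payload, "zlib_codec")
decompressed = codecs.decode(compressed, "zlib_codec")

# text -> text transforms use the same functions.
rotated = codecs.encode("hello", "rot_13")

# The str.encode() convenience method remains limited to real text
# encodings and rejects the transform codecs.
method_rejected = False
try:
    "hello".encode("rot_13")
except LookupError:
    method_rejected = True
```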
