PEP 383 – Non-decodable Bytes in System Character Interfaces

Author: Martin von Löwis <martin at v.loewis.de>
Status:

Abstract

File names, environment variables, and command line arguments are defined as being character data in POSIX; the C APIs, however, allow passing arbitrary bytes, whether these conform to a certain encoding or not. This PEP proposes a means of dealing with such irregularities by embedding the bytes in character strings in a way that allows recreation of the original byte string.

Rationale

The C char type is a data type that is commonly used to represent both character data and bytes. Certain POSIX interfaces are specified and widely understood as operating on character data; however, the system call interfaces make no assumption on the encoding of these data, and pass them on as-is. With Python 3, character strings use a Unicode-based internal representation, making it difficult to ignore the encoding of byte strings in the same way that the C interfaces can ignore the encoding.

On the other hand, Microsoft Windows NT has corrected the original design limitation of Unix, and made it explicit in its system interfaces that these data (file names, environment variables, command line arguments) are indeed character data, by providing a Unicode-based API (keeping a C-char-based one for backwards compatibility).

For Python 3, one proposed solution is to provide two sets of APIs: a byte-oriented one, and a character-oriented one, where the character-oriented one would be limited to not being able to represent all data accurately. Unfortunately, for Windows, the situation would be exactly the opposite: the byte-oriented interface cannot represent all data; only the character-oriented API can. As a consequence, libraries and applications that want to support all user data in a cross-platform manner have to accept a mish-mash of bytes and characters exactly in the way that caused endless troubles for Python 2.x.

With this PEP, a uniform treatment of these data as characters becomes possible. The uniformity is achieved by using specific encoding algorithms, meaning that the data can be converted back to bytes on POSIX systems only if the same encoding is used.

Being able to treat such strings uniformly will allow application writers to abstract from details specific to the operating system, and will reduce the risk of one API failing when the other API would have worked.

Specification

On Windows, Python uses the wide character APIs to access character-oriented APIs, allowing direct conversion of the environmental data to Python str objects (PEP 277).

On POSIX systems, Python currently applies the locale’s encoding to convert the byte data to Unicode, failing for characters that cannot be decoded. With this PEP, non-decodable bytes >= 128 will be represented as lone surrogate codes U+DC80..U+DCFF. Bytes below 128 will produce exceptions; see the discussion below.
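The mapping described above can be illustrated with a minimal sketch. It assumes a UTF-8 locale encoding and uses a made-up file name; the "surrogateescape" handler name is the one introduced by this PEP.

```python
# Assumption: the locale encoding is UTF-8. 0xE9 is "é" in Latin-1, but it
# is not a valid UTF-8 sequence when followed by an ASCII byte.
raw_name = b"caf\xe9.txt"  # hypothetical file name as returned by the OS

decoded = raw_name.decode("utf-8", "surrogateescape")

# The undecodable byte 0xE9 is represented as the lone surrogate U+DCE9,
# i.e. U+DC00 plus the byte value.
escaped_char = decoded[3]
print(hex(ord(escaped_char)))

# One character per original byte here, since every other byte is ASCII.
print(len(decoded) == len(raw_name))
```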

To convert non-decodable bytes, a new error handler (PEP 293) “surrogateescape” is introduced, which produces these surrogates. On encoding, the error handler converts the surrogate back to the corresponding byte. This error handler will be used in any API that receives or produces file names, command line arguments, or environment variables.
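As a rough illustration of the round trip, the sketch below decodes and re-encodes a few sample byte strings with the same encoding and the "surrogateescape" handler. The helper name `roundtrip` and the sample values are made up for this example; UTF-8 is assumed as the encoding.

```python
def roundtrip(raw, encoding="utf-8"):
    """Decode bytes to str and encode back, escaping undecodable bytes."""
    text = raw.decode(encoding, "surrogateescape")
    return text.encode(encoding, "surrogateescape")

# Hypothetical sample data: valid UTF-8, Latin-1 bytes, and arbitrary junk.
samples = [
    b"plain.txt",
    "naïve.txt".encode("utf-8"),
    b"caf\xe9.txt",
    b"\xff\xfe\x80",
]

lossless = all(roundtrip(raw) == raw for raw in samples)
print(lossless)
```

In later Python versions, os.fsencode() and os.fsdecode() wrap this pattern using the file system encoding, so application code rarely needs to name the handler explicitly.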

The error handler interface is extended to allow the encode error handler to return byte strings immediately, in addition to returning Unicode strings which then get encoded again (also see the discussion below).

Byte-oriented interfaces that already exist in Python 3.0 are not affected by this specification. They are neither enhanced nor deprecated.

External libraries that operate on file names (such as GUI file choosers) should also encode them according to the PEP.

Discussion

This surrogateescape encoding is based on Markus Kuhn’s idea that he called UTF-8b [3].

While providing a uniform API to non-decodable bytes, this interface has the limitation that the chosen representation only “works” if the data get converted back to bytes with the surrogateescape error handler also. Encoding the data with the locale’s encoding and the (default) strict error handler will raise an exception, encoding them with UTF-8 will produce nonsensical data.
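The limitation can be seen in a short sketch. The helper `try_encode` is hypothetical and only exists to capture failures. Note that in current CPython releases the UTF-8 codec with the strict handler also rejects lone surrogates with an exception instead of emitting bytes for them.

```python
def try_encode(text, encoding, errors):
    """Return the encoded bytes, or None if encoding raises."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None

# A string carrying one escaped byte (0xE9), as a system API might return it.
name = b"caf\xe9".decode("utf-8", "surrogateescape")

strict_ascii = try_encode(name, "ascii", "strict")           # fails
strict_utf8 = try_encode(name, "utf-8", "strict")            # fails in current CPython
replaced = try_encode(name, "ascii", "replace")              # succeeds, but loses the byte
restored = try_encode(name, "utf-8", "surrogateescape")      # original bytes

print(strict_ascii, strict_utf8, replaced, restored)
```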

Data obtained from other sources may conflict with data produced by this PEP. Dealing with such conflicts is out of scope of the PEP.

This PEP allows the possibility of “smuggling” bytes in character strings. This would be a security risk if the bytes are security-critical when interpreted as characters on a target system, such as path name separators. For this reason, the PEP rejects smuggling bytes below 128. If the target system uses EBCDIC, such smuggled bytes may still be a security risk, allowing smuggling of e.g. square brackets or the backslash. Python currently does not support EBCDIC, so this should not be a problem in practice. Anybody porting Python to an EBCDIC system might want to adjust the error handlers, or come up with other approaches to address the security risks.

Encodings that are not compatible with ASCII are not supported by this specification; bytes in the ASCII range that fail to decode will cause an exception. It is widely agreed that such encodings should not be used as locale charsets.
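The refusal to escape bytes below 128 can be observed by calling the error handler directly. This is only a sketch: the codec name "demo-codec", the failure reason, and the helper `escape` are made up, and the exception object is built by hand to stand in for what a real decoder would pass to the handler.

```python
import codecs

handler = codecs.lookup_error("surrogateescape")

def escape(raw, start, end):
    """Call the handler as a decoder would for raw[start:end]."""
    # Hypothetical codec name and reason; only the bytes and range matter.
    exc = UnicodeDecodeError("demo-codec", raw, start, end, "illustrative failure")
    return handler(exc)

# A byte >= 128 is escaped: the handler returns the replacement string
# and the position at which decoding should resume.
high_result = escape(b"\x80", 0, 1)

# A byte below 128 (here a path separator) is not escaped; the handler
# re-raises the decoding error instead.
try:
    escape(b"/", 0, 1)
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True

print(high_result, ascii_rejected)
```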

For most applications, we assume that they eventually pass data received from a system interface back into the same system interfaces. For example, an application invoking os.listdir() will likely pass the result strings back into APIs like os.stat() or open(), which then encode them back into their original byte representation. Applications that need to process the original byte strings can obtain them by encoding the character strings with the file system encoding, passing “surrogateescape” as the error handler name. For example, a function that works like os.listdir, except for accepting and returning bytes, would be written as:

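    import os
    import sys

    def listdir_b(dirname):
        fse = sys.getfilesystemencoding()
        # Decode the bytes path so that the str-based API can be used.
        dirname = dirname.decode(fse, "surrogateescape")
        for fn in os.listdir(dirname):
            # fn is now a str object
            yield fn.encode(fse, "surrogateescape")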

The extension to the encode error handler interface proposed by this PEP is necessary to implement the ‘surrogateescape’ error handler, because there are required byte sequences which cannot be generated from replacement Unicode. However, the encode error handler interface presently requires replacement Unicode to be provided in lieu of the non-encodable Unicode from the source string. Then it promptly encodes that replacement Unicode. In some error handlers, such as the ‘surrogateescape’ proposed here, it is also simpler and more efficient for the error handler to provide a pre-encoded replacement byte string, rather than forcing it to calculate Unicode from which the encoder would create the desired bytes.
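To make the extended interface concrete, here is a minimal sketch of an encode error handler that returns bytes directly. It is a simplified re-implementation for illustration only, not the handler shipped with Python; the names `demo_escape` and "demo-escape" are made up. With the ASCII codec, the byte 0xE9 could not be produced from any replacement string, since a replacement string would itself have to be encodable as ASCII.

```python
import codecs

def demo_escape(exc):
    """Illustrative encode-only handler: U+DC80..U+DCFF -> bytes 0x80..0xFF."""
    if not isinstance(exc, UnicodeEncodeError):
        raise exc
    out = bytearray()
    for ch in exc.object[exc.start:exc.end]:
        code = ord(ch)
        if not 0xDC80 <= code <= 0xDCFF:
            # Not one of our escape characters: report the original error.
            raise exc
        out.append(code - 0xDC00)
    # Return pre-encoded bytes plus the position where encoding resumes.
    return bytes(out), exc.end

codecs.register_error("demo-escape", demo_escape)

encoded = "caf\udce9".encode("ascii", "demo-escape")
print(encoded)
```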

A few alternative approaches have been proposed:

* create a new string subclass that supports embedded bytes
* use different escape schemes, such as escaping with a NUL character, or mapping to infrequent characters.

Of these proposals, the approach of escaping each byte XX with the sequence U+0000 U+00XX has the disadvantage that encoding to UTF-8 will introduce a NUL byte in the UTF-8 sequence. As a consequence, C libraries may interpret this as a string termination, even though the string continues. In particular, the gtk libraries will truncate text in this case; other libraries may show similar problems.
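The truncation problem can be sketched as follows. The NUL-based string is built by hand to mimic the rejected scheme, and ctypes.c_char_p is used here only as a stand-in for a C library that treats its input as a NUL-terminated string; the sample text is made up.

```python
import ctypes

# Rejected scheme: the undecodable byte 0xE9 becomes U+0000 U+00E9.
nul_escaped = "ab" + "\x00" + "\xe9" + "cd"
nul_utf8 = nul_escaped.encode("utf-8")

# Reading the buffer as a C string stops at the embedded NUL byte.
seen_by_c_nul = ctypes.c_char_p(nul_utf8).value

# Scheme from this PEP: the same byte becomes the lone surrogate U+DCE9,
# which encodes back to the single original byte.
surrogate_escaped = "ab\udce9cd"
surrogate_bytes = surrogate_escaped.encode("utf-8", "surrogateescape")
seen_by_c_surrogate = ctypes.c_char_p(surrogate_bytes).value

print(seen_by_c_nul, seen_by_c_surrogate)
```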

References

[3] UTF-8b https://web.archive.org/web/20090830064219/http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0383.rst

Last modified: 2025-02-01 08:55:40 GMT