PEP 414 – Explicit Unicode Literal for Python 3.3

Author: Armin Ronacher <armin.ronacher at active-4.com>, Alyssa Coghlan <ncoghlan at gmail.com>
Status:

Abstract

This document proposes the reintegration of an explicit unicode literal from Python 2.x to the Python 3.x language specification, in order to reduce the volume of changes needed when porting Unicode-aware Python 2 applications to Python 3.

BDFL Pronouncement

This PEP has been formally accepted for Python 3.3:

I’m accepting the PEP. It’s about as harmless as they come. Make it so.

Proposal

This PEP proposes that Python 3.3 restore support for Python 2’s Unicode literal syntax, substantially increasing the number of lines of existing Python 2 code in Unicode aware applications that will run without modification on Python 3.

Specifically, the Python 3 definition for string literal prefixes will be expanded to allow:

"u"|"U"

in addition to the currently supported:

"r"|"R"

The following will all denote ordinary Python 3 strings:

'text'
"text"
'''text'''
"""text"""
u'text'
u"text"
u'''text'''
u"""text"""
U'text'
U"text"
U'''text'''
U"""text"""

No changes are proposed to Python 3’s actual Unicode handling, only to the acceptable forms for string literals.
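As a quick check of the equivalence described above, the following sketch compares prefixed and unprefixed forms. It assumes Python 3.3 or later, and the variable names are purely illustrative.

```python
# Assumes Python 3.3+; on Python 3.0-3.2 the u/U prefixes are a syntax error.
plain = 'text'
lower = u'text'
upper = U"""text"""

# All three are ordinary str objects with identical contents.
same_type = type(plain) is type(lower) is type(upper) is str
same_value = plain == lower == upper
```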

Exclusion of “Raw” Unicode Literals

Python 2 supports a concept of “raw” Unicode literals that don’t meet the conventional definition of a raw string: \uXXXX and \UXXXXXXXX escape sequences are still processed by the compiler and converted to the appropriate Unicode code points when creating the associated Unicode objects.

Python 3 has no corresponding concept - the compiler performs no preprocessing of the contents of raw string literals. This matches the behaviour of 8-bit raw string literals in Python 2.

Since such strings are rarely used and would be interpreted differently in Python 3 if permitted, it was decided that leaving them out entirely was a better choice. Code which uses them will thus still fail immediately on Python 3 (with a Syntax Error), rather than potentially producing different output.
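The difference between the restored prefix and the excluded one can be observed by compiling small snippets. This is a minimal sketch assuming Python 3.3 or later; the helper name compiles is hypothetical.

```python
def compiles(source):
    """Return True if source compiles under the running interpreter."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True

accepts_u = compiles("s = u'text'")    # plain Unicode prefix, restored in 3.3+
accepts_ur = compiles("s = ur'text'")  # "raw Unicode" prefix, deliberately excluded
```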

To get equivalent behaviour that will run on both Python 2 and Python 3, either an ordinary Unicode literal can be used (with appropriate additional escaping within the string), or else string concatenation or string formatting can be used to combine the raw portions of the string with those that require the use of Unicode escape sequences.
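The alternatives just described might look as follows. The sample string (a backslash-d regular expression fragment followed by U+00E9) is only an illustration; the sketch is written for Python 3.3+, and the same literals are also valid Python 2 syntax.

```python
# Option 1: ordinary Unicode literal, with the backslash escaped by hand.
escaped = u'\\d+\u00e9'

# Option 2: concatenate a raw portion with a portion using Unicode escapes.
joined = r'\d+' + u'\u00e9'

# Option 3: string formatting, with the raw portion acting as the template.
formatted = r'\d+%s' % u'\u00e9'
```

All three spellings produce the same four-character string: a backslash, "d", "+", and the single code point U+00E9.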

Note that when using from __future__ import unicode_literals in Python 2, the nominally “raw” Unicode string literals will process \uXXXX and \UXXXXXXXX escape sequences, just like Python 2 strings explicitly marked with the “raw Unicode” prefix.

Author’s Note

This PEP was originally written by Armin Ronacher, and Guido’s approval was given based on that version.

The currently published version has been rewritten by Alyssa Coghlan to include additional historical details and rationale that were taken into account when Guido made his decision, but were not explicitly documented in Armin’s version of the PEP.

Readers should be aware that many of the arguments in this PEP are not technical ones. Instead, they relate heavily to the social and personal aspects of software development.

Rationale

With the release of a Python 3 compatible version of the Web Server Gateway Interface (WSGI) specification (PEP 3333) for Python 3.2, many parts of the Python web ecosystem have been making a concerted effort to support Python 3 without adversely affecting their existing developer and user communities.

One major item of feedback from key developers in those communities, including Chris McDonough (WebOb, Pyramid), Armin Ronacher (Flask, Werkzeug), Jacob Kaplan-Moss (Django) and Kenneth Reitz (requests), is that the requirement to change the spelling of every Unicode literal in an application (regardless of how that is accomplished) is a key stumbling block for porting efforts.

In particular, unlike many of the other Python 3 changes, it isn’t one that framework and library authors can easily handle on behalf of their users. Most of those users couldn’t care less about the “purity” of the Python language specification; they just want their websites and applications to work as well as possible.

While it is the Python web community that has been most vocal in highlighting this concern, it is expected that other highly Unicode aware domains (such as GUI development) may run into similar issues as they (and their communities) start making concerted efforts to support Python 3.

Common Objections

Complaint: This PEP may harm adoption of Python 3.2

This complaint is interesting, as it carries within it a tacit admission that this PEP will make it easier to port Unicode aware Python 2 applications to Python 3.

There are many existing Python communities that are prepared to put up with the constraints imposed by the existing suite of porting tools, or to update their Python 2 code bases sufficiently that the problems are minimised.

This PEP is not for those communities. Instead, it is designed specifically to help people that don’t want to put up with those difficulties.

However, since the proposal is for a comparatively small tweak to the language syntax with no semantic changes, it is feasible to support it as a third party import hook. While such an import hook imposes some import time overhead, and requires additional steps from each application that needs it to get the hook in place, it allows applications that target Python 3.2 to use libraries and frameworks that would otherwise only run on Python 3.3+ due to their use of unicode literal prefixes.

One such import hook project is Vinay Sajip’s uprefix [4].
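To give a feel for the source transformation such a hook has to perform, here is a minimal sketch built on the standard library tokenize module. It is not the implementation used by uprefix or any other project; the function name strip_u_prefixes is hypothetical, and the import machinery that would call it is omitted.

```python
import io
import tokenize

def strip_u_prefixes(source):
    """Return source with the u/U prefix removed from string literals."""
    lines = source.splitlines(keepends=True)
    # Record the (row, column) of every string token that starts with u or U.
    positions = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and tok.string[0] in "uU":
            positions.append(tok.start)
    # Delete right-to-left so earlier column offsets on a line stay valid.
    for row, col in sorted(positions, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + line[col + 1:]
    return "".join(lines)

converted = strip_u_prefixes("x = u'a' + U\"b\"\nu = 1\n")
```

Working on tokens rather than raw text means an identifier that happens to be named u is left alone; only the prefix character of actual string literals is dropped.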

For those that prefer to translate their code in advance rather than converting on the fly at import time, Armin Ronacher is working on a hook that runs at install time rather than during import [5].

Combining the two approaches is of course also possible. For example, the import hook could be used for rapid edit-test cycles during local development, but the install hook for continuous integration tasks and deployment on Python 3.2.

The approaches described in this section may prove useful, for example, for applications that wish to target Python 3 on the Ubuntu 12.04 LTS release, which will ship with Python 2.7 and 3.2 as officially supported Python versions.

Complaint: Python 3 shouldn’t be made worse just to support porting from Python 2

This is indeed one of the key design principles of Python 3. However, one of the key design principles of Python as a whole is that “practicality beats purity”. If we’re going to impose a significant burden on third party developers, we should have a solid rationale for doing so.

In most cases, the rationale for backwards incompatible Python 3 changes is either to improve code correctness (for example, stricter default separation of binary and text data and integer division upgrading to floats when necessary), to reduce typical memory usage (for example, increased usage of iterators and views over concrete lists), or to remove distracting nuisances that make Python code harder to read without increasing its expressiveness (for example, the comma based syntax for naming caught exceptions). Changes backed by such reasoning are not going to be reverted, regardless of objections from Python 2 developers attempting to make the transition to Python 3.

In many cases, Python 2 offered two ways of doing things for historical reasons. For example, inequality could be tested with both != and <> and integer literals could be specified with an optional L suffix. Such redundancies have been eliminated in Python 3, which reduces the overall size of the language and improves consistency across developers.

In the original Python 3 design (up to and including Python 3.2), the explicit prefix syntax for unicode literals was deemed to fall into this category, as it is completely unnecessary in Python 3. However, the difference between those other cases and unicode literals is that the unicode literal prefix is not redundant in Python 2 code: it is a programmatically significant distinction that needs to be preserved in some fashion to avoid losing information.

While porting tools were created to help with the transition (see next section) it still creates an additional burden on heavy users of unicode strings in Python 2, solely so that future developers learning Python 3 don’t need to be told “For historical reasons, string literals may have an optional u or U prefix. Never use this yourselves, it’s just there to help with porting from an earlier version of the language.”

Plenty of students learning Python 2 received similar warnings regarding string exceptions without being confused or irreparably stunted in their growth as Python developers. It will be the same with this feature.

This point is further reinforced by the fact that Python 3 still allows the uppercase variants of the B and R prefixes for bytes literals and raw bytes and string literals. If the potential for confusion due to string prefix variants is that significant, where was the outcry asking that these redundant prefixes be removed along with all the other redundancies that were eliminated in Python 3?
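The redundancy mentioned here is easy to confirm on any Python 3 release. In this short sketch the variable names are illustrative only.

```python
# Uppercase prefixes remain valid in Python 3 and mean the same thing
# as their lowercase counterparts.
bytes_match = B'abc' == b'abc'
raw_match = R'\n' == r'\n' == '\\n'
raw_bytes_match = BR'\n' == br'\n' == b'\\n'
```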

Just as support for string exceptions was eliminated from Python 2 using the normal deprecation process, support for redundant string prefix characters (specifically, B, R, u, U) may eventually be eliminated from Python 3, regardless of the current acceptance of this PEP. However, such a change will likely only occur once third party libraries supporting Python 2.7 are about as common as libraries supporting Python 2.2 or 2.3 are today.

Complaint: The WSGI “native strings” concept is an ugly hack

One reason the removal of unicode literals has provoked such concern amongst the web development community is that the updated WSGI specification had to make a few compromises to minimise the disruption for existing web servers that provide a WSGI-compatible interface (this was deemed necessary in order to make the updated standard a viable target for web application authors and web framework developers).

One of those compromises is the concept of a “native string”. WSGI defines three different kinds of string:

text strings: handled as unicode in Python 2 and str in Python 3
native strings: handled as str in both Python 2 and Python 3
binary data: handled as str in Python 2 and bytes in Python 3

Some developers consider WSGI’s “native strings” to be an ugly hack, as they are explicitly documented as being used solely for latin-1 decoded “text”, regardless of the actual encoding of the underlying data. Using this approach bypasses many of the updates to Python 3’s data model that are designed to encourage correct handling of text encodings. However, it generally works due to the specific details of the problem domain - web server and web framework developers are some of the individuals most aware of how blurry the line can get between binary data and text when working with HTTP and related protocols, and how important it is to understand the implications of the encodings in use when manipulating encoded text data. At the application level most of these details are hidden from the developer by the web frameworks and support libraries (both in Python 2 and in Python 3).
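The reason the latin-1 convention “generally works” is that latin-1 maps every byte value to the code point with the same number, so no data is lost. The following sketch illustrates the round trip on Python 3.3+; the sample value and variable names are assumptions made for the example.

```python
# Bytes as they might arrive on the wire: UTF-8 encoded text.
wire_bytes = u'caf\u00e9'.encode('utf-8')

# A "native string" on Python 3: each byte becomes the code point of the
# same value via latin-1, whatever the real encoding of the data is.
native = wire_bytes.decode('latin-1')

# latin-1 is lossless for bytes, so the original data can be recovered
# and then decoded with the encoding that actually applies.
recovered = native.encode('latin-1')
text = recovered.decode('utf-8')
```

Note that native is a five-character string that does not equal the real text; it is only a carrier for the underlying bytes.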

In practice, native strings are a useful concept because there are some APIs (both in the standard library and in third party frameworks and packages) and some internal interpreter details that are designed primarily to work with str. These components often don’t support unicode in Python 2 or bytes in Python 3, or, if they do, require additional encoding details and/or impose constraints that don’t apply to the str variants.

Some examples of interfaces that are best handled by using actual str instances are:

Python identifiers (as attributes, dict keys, class names, module names, import references, etc)
URLs for the most part as well as HTTP headers in urllib/http servers
WSGI environment keys and CGI-inherited values
Python source code for dynamic compilation and AST hacks
Exception messages
__repr__ return value
preferred filesystem paths
preferred OS environment

In Python 2.6 and 2.7, these distinctions are most naturally expressed as follows:

u"": text string (unicode)
"": native string (str)
b"": binary data (str, also aliased as bytes)

In Python 3, the latin-1 decoded native strings are not distinguished from any other text strings:

"": text string (str)
"": native string (str)
b"": binary data (bytes)

If from __future__ import unicode_literals is used to modify the behaviour of Python 2, then, along with an appropriate definition of n(), the distinction can be expressed as:

"": text string
n(""): native string
b"": binary data

(While n=str works for simple cases, it can sometimes have problems due to non-ASCII source encodings)
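The PEP does not spell out a definition of n(). One possible shape is sketched below, under the assumption that latin-1 is used to mirror the WSGI convention; the u"" literal in the usage line stands in for what an unprefixed literal would be in a Python 2 module that uses the unicode_literals import.

```python
import sys

if sys.version_info[0] >= 3:
    def n(text):
        # Python 3: text literals are already native str instances.
        return text
else:
    def n(text):
        # Python 2 with unicode_literals: literals are unicode, so encode
        # to get a native 8-bit str. latin-1 is an assumption made here.
        return text.encode('latin-1')

header_name = n(u"Content-Type")
```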

In the common subset of Python 2 and Python 3 (with appropriate specification of a source encoding and definitions of the u() and b() helper functions), they can be expressed as:

u(""): text string
"": native string
b(""): binary data

That last approach is the only variant that supports Python 2.5 and earlier.
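The u() and b() helpers are not defined in this PEP. A minimal sketch of one common way to write them follows, similar in spirit to what compatibility libraries provide; the exact encodings chosen here are assumptions.

```python
import sys

if sys.version_info[0] >= 3:
    def u(literal):
        # Native literals are already text on Python 3.
        return literal

    def b(literal):
        # Map each character to the byte with the same value.
        return literal.encode('latin-1')
else:
    def u(literal):
        # Interpret escapes such as \uXXXX in the 8-bit literal.
        return literal.decode('unicode_escape')

    def b(literal):
        # Native literals are already 8-bit data on Python 2.
        return literal

text = u("text")
data = b("data")
```

The cost of this approach is a function call wrapped around every text and binary literal, which is part of what the restored prefix syntax avoids.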

Of all the alternatives, the format currently supported in Python 2.6 and 2.7 is by far the cleanest approach that clearly distinguishes the three desired kinds of behaviour. With this PEP, that format will also be supported in Python 3.3+. It will also be supported in Python 3.1 and 3.2 through the use of import and install hooks. While it is significantly less likely, it is also conceivable that the hooks could be adapted to allow the use of the b prefix on Python 2.5.

Complaint: The existing tools should be good enough for everyone

A commonly expressed sentiment from developers that have already successfully ported applications to Python 3 is along the lines of “if you think it’s hard, you’re doing it wrong” or “it’s not that hard, just try it!”. While it is no doubt unintentional, these responses all have the effect of telling the people that are pointing out inadequacies in the current porting toolset “there’s nothing wrong with the porting tools, you just suck and don’t know how to use them properly”.

These responses are a case of completely missing the point of what people are complaining about. The feedback that resulted in this PEP isn’t due to people complaining that ports aren’t possible. Instead, the feedback is coming from people that have successfully completed ports and are objecting that they found the experience thoroughly unpleasant for the class of application that they needed to port (specifically, Unicode aware web frameworks and support libraries).

This is a subjective appraisal, and it’s the reason why the Python 3 porting tools ecosystem is a case where the “one obvious way to do it” philosophy emphatically does not apply. While it was originally intended that “develop in Python 2, convert with 2to3, test both” would be the standard way to develop for both versions in parallel, in practice, the needs of different projects and developer communities have proven to be sufficiently diverse that a variety of approaches have been devised, allowing each group to select an approach that best fits their needs.

Lennart Regebro has produced an excellent overview of the available migration strategies [2], and a similar review is provided in the official porting guide [3]. (Note that the official guidance has softened to “it depends on your specific situation” since Lennart wrote his overview).

However, both of those guides are written from the founding assumption that all of the developers involved are already committed to the idea of supporting Python 3. They make no allowance for the social aspects of such a change when you’re interacting with a user base that may not be especially tolerant of disruptions without a clear benefit, or are trying to persuade Python 2 focused upstream developers to accept patches that are solely about improving Python 3 forward compatibility.

With the current porting toolset, every migration strategy will result in changes to every Unicode literal in a project. No exceptions. They will be converted to either an unprefixed string literal (if the project decides to adopt the unicode_literals import) or else to a converter call like u("text").

If theunicode_literals import approach is employed, but is not adoptedacross the entire project at the same time, then the meaning of a bare stringliteral may become annoyingly ambiguous. This problem can be particularlypernicious foraggregated software, like a Django site - in such a situation,some files may end up using theunicode_literals import and others may not,creating definite potential for confusion.

While these problems are clearly solvable at a technical level, they’re a completely unnecessary distraction at the social level. Developer energy should be reserved for addressing real technical difficulties associated with the Python 3 transition (like distinguishing their 8-bit text strings from their binary data). They shouldn’t be punished with additional code changes (even automated ones) solely due to the fact that they have already explicitly identified their Unicode strings in Python 2.

Armin Ronacher has created an experimental extension to 2to3 which only modernizes Python code to the extent that it runs on Python 2.7 or later with support from the cross-version compatibility six library. This tool is available as python-modernize [1]. Currently, the deltas generated by this tool will affect every Unicode literal in the converted source. This will create legitimate concerns amongst upstream developers asked to accept such changes, and amongst framework users being asked to change their applications.

However, by eliminating the noise from changes to the Unicode literal syntax, many projects could be cleanly and (comparatively) non-controversially made forward compatible with Python 3.3+ just by running python-modernize and applying the recommended changes.

References

[1] Python-Modernize (http://github.com/mitsuhiko/python-modernize)

[2] Porting to Python 3: Migration Strategies (http://python3porting.com/strategies.html)

[3] Porting Python 2 Code to Python 3 (http://docs.python.org/howto/pyporting.html)

[4] uprefix import hook project (https://bitbucket.org/vinay.sajip/uprefix)

[5] install hook to remove unicode string prefix characters (https://github.com/mitsuhiko/unicode-literals-pep/tree/master/install-hook)

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0414.rst

Last modified: 2025-02-01 08:59:27 GMT
