
Python Enhancement Proposals

PEP 261 – Support for “wide” Unicode characters


Author:
Paul Prescod <paul at prescod.net>
Status:
Final
Type:
Standards Track
Created:
27-Jun-2001
Python-Version:
2.2
Post-History:
27-Jun-2001


Abstract

Python 2.1 unicode characters can have ordinals only up to 2**16-1. This range corresponds to a range in Unicode known as the Basic Multilingual Plane. There are now characters in Unicode that live on other “planes”. The largest addressable character in Unicode has the ordinal 17*2**16-1 (0x10ffff). For readability, we will call this TOPCHAR and call characters in this range “wide characters”.
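The abstract’s arithmetic can be checked directly. (A sketch in modern Python 3, where chr() and ord() have replaced Python 2’s unichr(); “TOPCHAR” and “BMP_MAX” are this PEP’s names, not standard library ones.)

```python
# TOPCHAR: the largest Unicode code point -- 17 planes of 2**16 code points each.
TOPCHAR = 17 * 2**16 - 1
assert TOPCHAR == 0x10FFFF

# The Basic Multilingual Plane covers ordinals 0 .. 2**16-1.
BMP_MAX = 2**16 - 1
assert BMP_MAX == 0xFFFF

# Characters above BMP_MAX and up to TOPCHAR are the "wide characters"
# this PEP is about.
print(hex(TOPCHAR))  # → 0x10ffff
```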

Glossary

Character
Used by itself, means the addressable units of a Python Unicode string.
Code point
A code point is an integer between 0 and TOPCHAR. If you imagine Unicode as a mapping from integers to characters, each integer is a code point. But the integers between 0 and TOPCHAR that do not map to characters are also code points. Some will someday be used for characters. Some are guaranteed never to be used for characters.
Codec
A set of functions for translating between physical encodings (e.g. on disk or coming in from a network) and logical Python objects.
Encoding
Mechanism for representing abstract characters in terms of physical bits and bytes. Encodings allow us to store Unicode characters on disk and transmit them over networks in a manner that is compatible with other Unicode software.
Surrogate pair
Two physical characters that represent a single logical character. Part of a convention for representing 32-bit code points in terms of two 16-bit code points.
Unicode string
A Python type representing a sequence of code points with “string semantics” (e.g. case conversions, regular expression compatibility, etc.) Constructed with the unicode() function.
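The surrogate-pair convention from the glossary follows fixed arithmetic defined by UTF-16: the code point’s offset above the BMP is split into two 10-bit halves. A small sketch (Python 3; the helper name is illustrative, not from the PEP):

```python
def to_surrogate_pair(cp):
    """Split a code point above the BMP into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    offset = cp - 0x10000              # 20 bits remain after removing the BMP range
    high = 0xD800 + (offset >> 10)     # lead surrogate carries the top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # trail surrogate carries the bottom 10 bits
    return high, low

print([hex(x) for x in to_surrogate_pair(0x10000)])   # → ['0xd800', '0xdc00']
print([hex(x) for x in to_surrogate_pair(0x10FFFF)])  # → ['0xdbff', '0xdfff']
```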

Proposed Solution

One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to use 4 bytes per character. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow the 4-byte implementation as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory.

The 4-byte option is called “wide Py_UNICODE”. The 2-byte option is called “narrow Py_UNICODE”.

Most things will behave identically in the wide and narrow worlds.

  • unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one string.
  • unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string on wide Python builds. On narrow builds it will raise ValueError.

    ISSUE

Python currently allows \U literals that cannot be represented as a single Python character. It generates two Python characters known as a “surrogate pair”. Should this be disallowed on future narrow Python builds?

    Pro:

Python already allows the construction of a surrogate pair for a large unicode literal character escape sequence. This is basically designed as a simple way to construct “wide characters” even in a narrow Python build. It is also somewhat logical considering that the Unicode-literal syntax is basically a short-form way of invoking the unicode-escape codec.

    Con:

Surrogates could be easily created this way but the user still needs to be careful about slicing, indexing, printing etc. Therefore, some have suggested that Unicode literals should not support surrogates.

    ISSUE

Should Python allow the construction of characters that do not correspond to Unicode code points? Unassigned Unicode code points should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?

    Pro:

If a Python user thinks they know what they’re doing why should we try to prevent them from violating the Unicode spec? After all, we don’t stop 8-bit strings from containing non-ASCII characters.

    Con:

Codecs and other Unicode-consuming code will have to be careful of these characters which are disallowed by the Unicode specification.
  • ord() is always the inverse of unichr()
  • There is an integer value in the sys module that describes the largest ordinal for a character in a Unicode string on the current interpreter. sys.maxunicode is 2**16-1 (0xffff) on narrow builds of Python and TOPCHAR on wide builds.

    ISSUE:

Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr (if they differ)? There has also been a suggestion of sys.unicodewidth which can take the values 'wide' and 'narrow'.
  • every Python Unicode character represents exactly one Unicode code point (i.e. Python Unicode Character = Abstract Unicode character).
  • codecs will be upgraded to support “wide characters” (represented directly in UCS-4, and as variable-length sequences in UTF-8 and UTF-16). This is the main part of the implementation left to be done.
  • There is a convention in the Unicode world for encoding a 32-bit code point in terms of two 16-bit code points. These are known as “surrogate pairs”. Python’s codecs will adopt this convention and encode 32-bit code points as surrogate pairs on narrow Python builds.

    ISSUE

Should there be a way to tell codecs not to generate surrogates and instead treat wide characters as errors?

    Pro:

I might want to write code that works only with fixed-width characters and does not have to worry about surrogates.

    Con:

    No clear proposal of how to communicate this to codecs.
  • there are no restrictions on constructing strings that use code points “reserved for surrogates” improperly. These are called “isolated surrogates”. The codecs should disallow reading these from files, but you could construct them using string literals or unichr().
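The codec convention described above is observable in any Python today: encoding a wide character to UTF-16 produces exactly one surrogate pair, and decoding the pair recovers the single logical character. (A Python 3 sketch; on a narrow 2.2 build the same bytes would decode to a two-character string.)

```python
# U+10400 lies just above the Basic Multilingual Plane.
wide = "\U00010400"
encoded = wide.encode("utf-16-be")

# Four bytes: the high surrogate 0xD801 followed by the low surrogate 0xDC00.
print(encoded.hex())  # → d801dc00

# Decoding the surrogate pair recovers the single logical character.
assert encoded.decode("utf-16-be") == wide
```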

Implementation

There is a new define:

#define Py_UNICODE_SIZE 2

To test whether UCS2 or UCS4 is in use, the derived macro Py_UNICODE_WIDE should be used, which is defined when UCS-4 is in use.
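At the Python level, the build width is visible through sys.maxunicode, described earlier. A portable check might look like the following sketch (note that since Python 3.3 narrow builds no longer exist, so modern interpreters always report the wide value):

```python
import sys

# sys.maxunicode is the largest ordinal a Unicode string can hold.
if sys.maxunicode == 0x10FFFF:
    print("wide build: full Unicode range available")
elif sys.maxunicode == 0xFFFF:
    print("narrow build: BMP only; wide characters need surrogate pairs")

# One of the two documented values on any build.
assert sys.maxunicode in (0xFFFF, 0x10FFFF)
```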

There is a new configure option:

--enable-unicode=ucs2    configures a narrow Py_UNICODE, and uses wchar_t if it fits
--enable-unicode=ucs4    configures a wide Py_UNICODE, and uses wchar_t if it fits
--enable-unicode         same as “=ucs2”
--disable-unicode        entirely remove the Unicode functionality.

It is also proposed that one day --enable-unicode will just default to the width of your platform’s wchar_t.

Windows builds will be narrow for a while based on the fact that there have been few requests for wide characters, those requests are mostly from hard-core programmers with the ability to buy their own Python and Windows itself is strongly biased towards 16-bit characters.

Notes

This PEP does NOT imply that people using Unicode need to use a 4-byte encoding for their files on disk or sent over the network. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode-encoding.
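The point that a 4-byte internal representation does not force a 4-byte on-disk encoding is easy to demonstrate: the same characters occupy different numbers of bytes under different encodings. (Python 3 shown.)

```python
text = "hello"  # pure ASCII content, 5 characters

# ASCII and UTF-8 use one byte per character for this text.
assert len(text.encode("ascii")) == 5
assert len(text.encode("utf-8")) == 5

# UTF-16 uses two bytes per character plus a 2-byte byte-order mark.
assert len(text.encode("utf-16")) == 12

# UTF-32 uses four bytes per character plus a 4-byte byte-order mark.
assert len(text.encode("utf-32")) == 24

print("same 5 characters, 5 to 24 bytes depending on the encoding")
```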

It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line-based indexing.

Rejected Suggestions

More or less the status-quo

We could officially say that Python characters are 16-bit and require programmers to implement wide characters in their application logic by combining surrogate pairs. This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. Plus these abstracted pseudo-strings would not be legal as input to the regular expression engine.
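Combining surrogate pairs in application logic means every program would carry a helper like this sketch (Python 3; the function name is illustrative), and would have to invoke the reverse of it at every indexing, slicing, and length operation, which is the inefficiency the paragraph refers to:

```python
def combine_surrogates(high, low):
    """Recombine a UTF-16 surrogate pair into the code point it represents."""
    assert 0xD800 <= high <= 0xDBFF, "not a high (lead) surrogate"
    assert 0xDC00 <= low <= 0xDFFF, "not a low (trail) surrogate"
    # Top 10 bits from the high surrogate, bottom 10 from the low one.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

print(hex(combine_surrogates(0xD801, 0xDC00)))  # → 0x10400
```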

“Space-efficient Unicode” type

Another class of solution is to use some efficient storage internally but present an abstraction of wide characters to the programmer. Any of these would require a much more complex implementation than the accepted solution. For instance consider the impact on the regular expression engine. In theory, we could move to this implementation in the future without breaking Python code. A future Python could “emulate” wide Python semantics on narrow Python. Guido is not willing to undertake the implementation right now.

Two types

We could introduce a 32-bit Unicode type alongside the 16-bit type. There is a lot of code that expects there to be only a single Unicode type.

This PEP represents the least-effort solution. Over the next several years, 32-bit Unicode characters will become more common and that may either convince us that we need a more sophisticated solution or (on the other hand) convince us that simply mandating wide Unicode characters is an appropriate solution. Right now the two options on the table are do nothing or do this.

References

Unicode Glossary: http://www.unicode.org/glossary/

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0261.rst

Last modified: 2025-02-01 08:59:27 GMT

