Python 2.1 unicode characters can have ordinals only up to 2**16-1. This range corresponds to a range in Unicode known as the Basic Multilingual Plane. There are now characters in Unicode that live on other “planes”. The largest addressable character in Unicode has the ordinal 17*2**16-1 (0x10ffff). For readability, we will call this TOPCHAR and call characters in the range 2**16 through TOPCHAR “wide characters”.
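The arithmetic behind these limits can be checked directly. The following is a minimal sketch in present-day Python syntax; the names `PLANE_SIZE`, `PLANE_COUNT`, `BMP_MAX` and `is_wide` are illustrative only and are not part of this proposal.

```python
# Each Unicode plane holds 2**16 code points, and Unicode defines 17 planes.
PLANE_SIZE = 2 ** 16
PLANE_COUNT = 17

BMP_MAX = PLANE_SIZE - 1                # last ordinal of the Basic Multilingual Plane
TOPCHAR = PLANE_COUNT * PLANE_SIZE - 1  # largest addressable ordinal


def is_wide(ordinal):
    """Return True if the ordinal lies above the BMP but not above TOPCHAR."""
    return BMP_MAX < ordinal <= TOPCHAR


print(hex(BMP_MAX), hex(TOPCHAR))  # 0xffff 0x10ffff
```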
One solution would be to merely increase the maximum ordinal to a larger value. Unfortunately the only straightforward implementation of this idea is to use 4 bytes per character. This has the effect of doubling the size of most Unicode strings. In order to avoid imposing this cost on every user, Python 2.2 will allow the 4-byte implementation as a build-time option. Users can choose whether they care about wide characters or prefer to preserve memory.
The 4-byte option is called “wide Py_UNICODE”. The 2-byte option is called “narrow Py_UNICODE”.
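The size cost can be approximated without inspecting interpreter internals. The sketch below assumes that UTF-16 and UTF-32 encodings (little-endian, no byte order mark) are a fair stand-in for 2-byte and 4-byte storage of text that stays within the Basic Multilingual Plane. It does not measure actual `Py_UNICODE` buffers.

```python
sample = "hello world"  # 11 characters, all within the Basic Multilingual Plane

# Two bytes per character, as in a narrow build.
narrow_size = len(sample.encode("utf-16-le"))

# Four bytes per character, as in a wide build.
wide_size = len(sample.encode("utf-32-le"))

print(narrow_size, wide_size)  # 22 44
```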
Most things will behave identically in the wide and narrow worlds.
unichr(i) for 0 <= i < 2**16 (0x10000) always returns a length-one string.
unichr(i) for 2**16 <= i <= TOPCHAR will return a length-one string on wide Python builds. On narrow builds it will raise ValueError.
ISSUE
Python currently allows \U literals that cannot be represented as a single Python character. It generates two Python characters known as a “surrogate pair”. Should this be disallowed on future narrow Python builds?
Pro:
Python already allows the construction of a surrogate pair for a large Unicode literal character escape sequence. This is basically designed as a simple way to construct “wide characters” even in a narrow Python build. It is also somewhat logical considering that the Unicode-literal syntax is basically a short-form way of invoking the unicode-escape codec.
Con:
Surrogates could be easily created this way but the user still needs to be careful about slicing, indexing, printing etc. Therefore, some have suggested that Unicode literals should not support surrogates.
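For reference, the surrogate pair for a wide character follows from the standard UTF-16 arithmetic. The function name `to_surrogate_pair` is hypothetical; this is a sketch of the calculation, not of how the interpreter implements \U literals.

```python
def to_surrogate_pair(ordinal):
    """Split a wide character ordinal into (high, low) 16-bit surrogates."""
    if not 0x10000 <= ordinal <= 0x10FFFF:
        raise ValueError("ordinal is not a wide character")
    offset = ordinal - 0x10000            # 20-bit value
    high = 0xD800 + (offset >> 10)        # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))  # 0xd800 0xdc00
```

A string built from such a pair on a narrow build has length two, which is why slicing and indexing need care.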
ISSUE
Should Python allow the construction of characters that do not correspond to Unicode code points? Unassigned Unicode code points should obviously be legal (because they could be assigned at any time). But code points above TOPCHAR are guaranteed never to be used by Unicode. Should we allow access to them anyhow?
Pro:
If a Python user thinks they know what they’re doing why should we try to prevent them from violating the Unicode spec? After all, we don’t stop 8-bit strings from containing non-ASCII characters.
Con:
Codecs and other Unicode-consuming code will have to be careful of these characters which are disallowed by the Unicode specification.
ord() is always the inverse of unichr().
sys.maxunicode is 2**16-1 (0xffff) on narrow builds of Python and TOPCHAR on wide builds.
ISSUE
Should there be distinct constants for accessing TOPCHAR and the real upper bound for the domain of unichr (if they differ)? There has also been a suggestion of sys.unicodewidth which can take the values 'wide' and 'narrow'.
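Until such a constant exists, the build width can be inferred from sys.maxunicode. The helper below, `unicode_width`, is hypothetical and only mirrors the values suggested for sys.unicodewidth. Its optional argument exists so that both cases can be exercised on a single interpreter.

```python
import sys

BMP_MAX = 0xFFFF  # 2**16-1, the value of sys.maxunicode on a narrow build


def unicode_width(maxunicode=None):
    """Return 'narrow' or 'wide' based on the largest supported ordinal."""
    if maxunicode is None:
        maxunicode = sys.maxunicode
    return "narrow" if maxunicode <= BMP_MAX else "wide"


print(unicode_width())  # depends on the interpreter in use
```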
ISSUE
Should there be a way to tell codecs not to generate surrogates and instead treat wide characters as errors?
Pro:
I might want to write code that works only with fixed-width characters and does not have to worry about surrogates.
Con:
No clear proposal of how to communicate this to codecs.
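In the absence of a codec-level switch, an application can enforce the restriction itself before encoding. This is a sketch written against a present-day interpreter in which one wide character is one string element. The function name `encode_utf16_no_surrogates` is hypothetical and no such codec option is defined by this proposal.

```python
def encode_utf16_no_surrogates(text):
    """Encode to little-endian UTF-16, treating wide characters as errors."""
    for index, char in enumerate(text):
        if ord(char) > 0xFFFF:
            # Refuse to emit a surrogate pair for this character.
            raise UnicodeEncodeError(
                "utf-16-le", text, index, index + 1,
                "wide character not allowed",
            )
    return text.encode("utf-16-le")


print(encode_utf16_no_surrogates("ab"))  # b'a\x00b\x00'
```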
There is a new define:
    #define Py_UNICODE_SIZE 2
To test whether UCS-2 or UCS-4 is in use, the derived macro Py_UNICODE_WIDE should be used, which is defined when UCS-4 is in use.
There is a new configure option:
| Option | Effect |
| --- | --- |
| --enable-unicode=ucs2 | configures a narrow Py_UNICODE, and uses wchar_t if it fits |
| --enable-unicode=ucs4 | configures a wide Py_UNICODE, and uses wchar_t if it fits |
| --enable-unicode | same as “=ucs2” |
| --disable-unicode | entirely remove the Unicode functionality. |
It is also proposed that one day --enable-unicode will just default to the width of your platform's wchar_t.
Windows builds will be narrow for a while, for three reasons: there have been few requests for wide characters, those requests are mostly from hard-core programmers with the ability to build their own Python, and Windows itself is strongly biased towards 16-bit characters.
This PEP does NOT imply that people using Unicode need to use a 4-byte encoding for their files on disk or sent over the network. It only allows them to do so. For example, ASCII is still a legitimate (7-bit) Unicode-encoding.
It has been proposed that there should be a module that handles surrogates in narrow Python builds for programmers. If someone wants to implement that, it will be another PEP. It might also be combined with features that allow other kinds of character-, word- and line- based indexing.
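The independence of in-memory width and external encoding can be seen by encoding the same text several ways. The sketch below uses present-day Python syntax and the standard codec names; the variable names are illustrative.

```python
text = "A\U00010000"  # one ASCII character followed by one wide character

# Size in bytes of the same text under three external encodings.
sizes = {
    name: len(text.encode(name))
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}
print(sizes)  # {'utf-8': 5, 'utf-16-le': 6, 'utf-32-le': 8}

# Pure ASCII text still needs only one byte per character on disk.
print("abc".encode("ascii"))  # b'abc'
```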
More or less the status-quo
We could officially say that Python characters are 16-bit and require programmers to implement wide characters in their application logic by combining surrogate pairs. This is a heavy burden because emulating 32-bit characters is likely to be very inefficient if it is coded entirely in Python. Plus these abstracted pseudo-strings would not be legal as input to the regular expression engine.
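To give a sense of that burden, here is a minimal sketch of the application-level logic needed just to iterate over code points in a sequence of 16-bit units. The name `iter_code_points` is hypothetical, and passing unpaired surrogates through unchanged is one possible policy among several.

```python
def iter_code_points(units):
    """Yield code points from 16-bit units, joining valid surrogate pairs."""
    index = 0
    while index < len(units):
        unit = units[index]
        has_next = index + 1 < len(units)
        if (0xD800 <= unit <= 0xDBFF and has_next
                and 0xDC00 <= units[index + 1] <= 0xDFFF):
            # High surrogate followed by low surrogate: combine into one ordinal.
            low = units[index + 1]
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            index += 2
        else:
            # BMP character or unpaired surrogate: pass through unchanged.
            yield unit
            index += 1


print([hex(cp) for cp in iter_code_points([0x41, 0xD83D, 0xDE00, 0x42])])
# ['0x41', '0x1f600', '0x42']
```

Every operation that indexes, slices or measures such a sequence would need similar handling.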
“Space-efficient Unicode” type
Another class of solution is to use some efficient storage internally but present an abstraction of wide characters to the programmer. Any of these would require a much more complex implementation than the accepted solution. For instance consider the impact on the regular expression engine. In theory, we could move to this implementation in the future without breaking Python code. A future Python could “emulate” wide Python semantics on narrow Python. Guido is not willing to undertake the implementation right now.
Two types
We could introduce a 32-bit Unicode type alongside the 16-bit type. There is a lot of code that expects there to be only a single Unicode type.
This PEP represents the least-effort solution. Over the next several years, 32-bit Unicode characters will become more common and that may either convince us that we need a more sophisticated solution or (on the other hand) convince us that simply mandating wide Unicode characters is an appropriate solution. Right now the two options on the table are do nothing or do this.
Unicode Glossary: http://www.unicode.org/glossary/
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-0261.rst
Last modified: 2025-02-01 08:59:27 GMT