
Python Enhancement Proposals

PEP 393 – Flexible String Representation

Author:
Martin von Löwis <martin at v.loewis.de>
Status:
Final
Type:
Standards Track
Created:
24-Jan-2010
Python-Version:
3.3
Post-History:



Abstract

The Unicode string type is changed to support multiple internal representations, depending on the character with the largest Unicode ordinal (1, 2, or 4 bytes). This will allow a space-efficient representation in common cases, but give access to full UCS-4 on all systems. For compatibility with existing APIs, several representations may exist in parallel; over time, this compatibility should be phased out. The distinction between narrow and wide Unicode builds is dropped. An implementation of this PEP is available at [1].

Rationale

There are two classes of complaints about the current implementation of the unicode type: on systems only supporting UTF-16, users complain that non-BMP characters are not properly supported. On systems using UCS-4 internally (and also sometimes on systems using UCS-2), there is a complaint that Unicode strings take up too much memory - especially compared to Python 2.x, where the same code would often use ASCII strings (i.e. ASCII-encoded byte strings). With the proposed approach, ASCII-only Unicode strings will again use only one byte per character, while still allowing efficient indexing of strings containing non-BMP characters (as strings containing them will use 4 bytes per character).

One problem with the approach is support for existing applications (e.g. extension modules). For compatibility, redundant representations may be computed. Applications are encouraged to phase out reliance on a specific internal representation if possible. As interaction with other libraries will often require some sort of internal representation, the specification chooses UTF-8 as the recommended way of exposing strings to C code.

For many strings (e.g. ASCII), multiple representations may actually share memory (e.g. the shortest form may be shared with the UTF-8 form if all characters are ASCII). With such sharing, the overhead of compatibility representations is reduced. If representations do share data, it is also possible to omit structure fields, reducing the base size of string objects.

Specification

Unicode structures are now defined as a hierarchy of structures, namely:

typedef struct {
  PyObject_HEAD
  Py_ssize_t length;
  Py_hash_t hash;
  struct {
      unsigned int interned:2;
      unsigned int kind:2;
      unsigned int compact:1;
      unsigned int ascii:1;
      unsigned int ready:1;
  } state;
  wchar_t *wstr;
} PyASCIIObject;

typedef struct {
  PyASCIIObject _base;
  Py_ssize_t utf8_length;
  char *utf8;
  Py_ssize_t wstr_length;
} PyCompactUnicodeObject;

typedef struct {
  PyCompactUnicodeObject _base;
  union {
      void *any;
      Py_UCS1 *latin1;
      Py_UCS2 *ucs2;
      Py_UCS4 *ucs4;
  } data;
} PyUnicodeObject;

Objects for which both size and maximum character are known at creation time are called “compact” unicode objects; character data immediately follow the base structure. If the maximum character is less than 128, they use the PyASCIIObject structure, and the UTF-8 data, the UTF-8 length and the wstr length are the same as the length of the ASCII data. For non-ASCII strings, the PyCompactUnicodeObject structure is used. Resizing compact objects is not supported.

Objects for which the maximum character is not given at creation time are called “legacy” objects, created through PyUnicode_FromStringAndSize(NULL, length). They use the PyUnicodeObject structure. Initially, their data is only in the wstr pointer; when PyUnicode_READY is called, the data pointer (union) is allocated. Resizing is possible as long as PyUnicode_READY has not been called.

The fields have the following interpretations:

  • length: number of code points in the string (result of sq_length)
  • interned: interned-state (SSTATE_*) as in 3.2
  • kind: form of string
    • 00 => str is not initialized (data are in wstr)
    • 01 => 1 byte (Latin-1)
    • 10 => 2 byte (UCS-2)
    • 11 => 4 byte (UCS-4);
  • compact: the object uses one of the compact representations (implies ready)
  • ascii: the object uses the PyASCIIObject representation (implies compact and ready)
  • ready: the canonical representation is ready to be accessed through PyUnicode_DATA and PyUnicode_GET_LENGTH. This is set either if the object is compact, or the data pointer and length have been initialized.
  • wstr_length, wstr: representation in the platform’s wchar_t (null-terminated). If wchar_t is 16-bit, this form may use surrogate pairs; wstr_length differs from length only if there are surrogate pairs in the representation.
  • utf8_length, utf8: UTF-8 representation (null-terminated).
  • data: shortest-form representation of the unicode string. The string is null-terminated (in its respective representation).

All three representations are optional, although the data form is considered the canonical representation, which can be absent only while the string is being created. If the representation is absent, the pointer is NULL, and the corresponding length field may contain arbitrary data.

The Py_UNICODE type is still supported but deprecated. It is always defined as a typedef for wchar_t, so the wstr representation can double as the Py_UNICODE representation.

The data and utf8 pointers point to the same memory if the string uses only ASCII characters (using only Latin-1 is not sufficient). The data and wstr pointers point to the same memory if the string happens to fit exactly into the wchar_t type of the platform (i.e. uses some BMP-not-Latin-1 characters if sizeof(wchar_t) is 2, and uses some non-BMP characters if sizeof(wchar_t) is 4).
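The shortest-form rule implied by the kind field can be summarized in a few lines of plain C. This is an illustrative sketch only; the helper name is hypothetical and not part of the CPython API:

```c
#include <stdint.h>

/* Hypothetical helper (not a CPython API): bytes per character of the
   shortest-form representation for a given maximum character. */
static int element_size_for_maxchar(uint32_t maxchar)
{
    if (maxchar < 0x100)     /* Latin-1 (ASCII if below 0x80) */
        return 1;
    if (maxchar < 0x10000)   /* UCS-2: BMP characters only */
        return 2;
    return 4;                /* UCS-4: non-BMP characters present */
}
```

A single non-BMP character thus forces the whole string into the 4-byte form, which is the trade-off discussed in the Rationale.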

String Creation

The recommended way to create a Unicode object is to use the function PyUnicode_New:

PyObject *PyUnicode_New(Py_ssize_t size, Py_UCS4 maxchar);

Both parameters must denote the eventual size/range of the strings. In particular, codecs using this API must compute both the number of characters and the maximum character in advance. A string is allocated according to the specified size and character range and is null-terminated; the actual characters in it may be uninitialized.
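For already-decoded input, the first of the two required passes could look like the following plain-C sketch (the function name is made up; a real codec scans its encoded input and also computes the character count):

```c
#include <stddef.h>
#include <stdint.h>

/* First pass: find the maximum code point of already-decoded input.
   Together with the element count, this supplies the (size, maxchar)
   arguments for PyUnicode_New.  A second pass (not shown) would fill
   the allocated string with PyUnicode_WRITE. */
static uint32_t max_char_of(const uint32_t *input, size_t n)
{
    uint32_t maxchar = 0;
    for (size_t i = 0; i < n; i++)
        if (input[i] > maxchar)
            maxchar = input[i];
    return maxchar;
}
```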

PyUnicode_FromString and PyUnicode_FromStringAndSize remain supported for processing UTF-8 input; the input is decoded, and the UTF-8 representation is not yet set for the string.

PyUnicode_FromUnicode remains supported but is deprecated. If the Py_UNICODE pointer is non-null, the data representation is set. If the pointer is NULL, a properly-sized wstr representation is allocated, which can be modified until PyUnicode_READY() is called (explicitly or implicitly). Resizing a Unicode string remains possible until it is finalized.

PyUnicode_READY() converts a string containing only a wstr representation into the canonical representation. Unless wstr and data can share the memory, the wstr representation is discarded after the conversion. The macro returns 0 on success and -1 on failure, which happens in particular if the memory allocation fails.

String Access

The canonical representation can be accessed using two macros, PyUnicode_KIND and PyUnicode_DATA. PyUnicode_KIND gives one of the values PyUnicode_WCHAR_KIND (0), PyUnicode_1BYTE_KIND (1), PyUnicode_2BYTE_KIND (2), or PyUnicode_4BYTE_KIND (3). PyUnicode_DATA gives the void pointer to the data. Access to individual characters should use PyUnicode_{READ|WRITE}[_CHAR]:

  • PyUnicode_READ(kind, data, index)
  • PyUnicode_WRITE(kind, data, index, value)
  • PyUnicode_READ_CHAR(unicode, index)

All these macros assume that the string is in canonical form; callers need to ensure this by calling PyUnicode_READY.
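The read side can be pictured as a switch on the kind. The following stand-alone function mirrors what PyUnicode_READ does for a canonical representation; it is a sketch, not the actual macro:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for PyUnicode_READ: interpret the untyped data pointer
   according to the kind (1 = 1-byte, 2 = 2-byte, 3 = 4-byte) and
   fetch the code point at an index. */
static uint32_t read_char(int kind, const void *data, size_t index)
{
    switch (kind) {
    case 1:  return ((const uint8_t  *)data)[index];  /* Latin-1 */
    case 2:  return ((const uint16_t *)data)[index];  /* UCS-2 */
    default: return ((const uint32_t *)data)[index];  /* UCS-4 */
    }
}
```

Because the data pointer is untyped, indexing must go through the kind; this is why the real macros take kind and data as separate arguments.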

A new function PyUnicode_AsUTF8 is provided to access the UTF-8 representation. It is thus identical to the existing _PyUnicode_AsString, which is removed. The function will compute the UTF-8 representation when first called. Since this representation will consume memory until the string object is released, applications should use the existing PyUnicode_AsUTF8String where possible (which generates a new bytes object every time). APIs that implicitly convert a string to a char* (such as the ParseTuple functions) will use PyUnicode_AsUTF8 to compute the conversion.

New API

This section summarizes the API additions.

Macros to access the internal representation of a Unicode object (read-only):

  • PyUnicode_IS_COMPACT_ASCII(o), PyUnicode_IS_COMPACT(o), PyUnicode_IS_READY(o)
  • PyUnicode_GET_LENGTH(o)
  • PyUnicode_KIND(o), PyUnicode_CHARACTER_SIZE(o), PyUnicode_MAX_CHAR_VALUE(o)
  • PyUnicode_DATA(o), PyUnicode_1BYTE_DATA(o), PyUnicode_2BYTE_DATA(o), PyUnicode_4BYTE_DATA(o)

Character access macros:

  • PyUnicode_READ(kind, data, index), PyUnicode_READ_CHAR(o, index)
  • PyUnicode_WRITE(kind, data, index, value)

Other macros:

  • PyUnicode_READY(o)
  • PyUnicode_CONVERT_BYTES(from_type, to_type, begin, end, to)

String creation functions:

  • PyUnicode_New(size, maxchar)
  • PyUnicode_FromKindAndData(kind, data, size)
  • PyUnicode_Substring(o, start, end)

Character access utility functions:

  • PyUnicode_GetLength(o), PyUnicode_ReadChar(o, index), PyUnicode_WriteChar(o, index, character)
  • PyUnicode_CopyCharacters(to, to_start, from, from_start, how_many)
  • PyUnicode_FindChar(str, ch, start, end, direction)

Representation conversion:

  • PyUnicode_AsUCS4(o, buffer, buflen)
  • PyUnicode_AsUCS4Copy(o)
  • PyUnicode_AsUnicodeAndSize(o, size_out)
  • PyUnicode_AsUTF8(o)
  • PyUnicode_AsUTF8AndSize(o, size_out)

UCS4 utility functions:

  • Py_UCS4_{strlen, strcpy, strcat, strncpy, strcmp, strncmp, strchr, strrchr}

Stable ABI

The following functions are added to the stable ABI (PEP 384), as they are independent of the actual representation of Unicode objects: PyUnicode_New, PyUnicode_Substring, PyUnicode_GetLength, PyUnicode_ReadChar, PyUnicode_WriteChar, PyUnicode_Find, PyUnicode_FindChar.

GDB Debugging Hooks

Tools/gdb/libpython.py contains debugging hooks that embed knowledge about the internals of CPython’s data types, including PyUnicodeObject instances. It has been updated to track the change.

Deprecations, Removals, and Incompatibilities

While the Py_UNICODE representation and APIs are deprecated with this PEP, no removal of the respective APIs is scheduled. The APIs should remain available at least five years after the PEP is accepted; before they are removed, existing extension modules should be studied to find out whether a sufficient majority of the open-source code on PyPI has been ported to the new API. A reasonable motivation for using the deprecated API even in new code is for code that must work on both Python 2 and Python 3.

The following macros and functions are deprecated:

  • PyUnicode_FromUnicode
  • PyUnicode_GET_SIZE, PyUnicode_GetSize, PyUnicode_GET_DATA_SIZE
  • PyUnicode_AS_UNICODE, PyUnicode_AsUnicode, PyUnicode_AsUnicodeAndSize
  • PyUnicode_COPY, PyUnicode_FILL, PyUnicode_MATCH
  • PyUnicode_Encode, PyUnicode_EncodeUTF7, PyUnicode_EncodeUTF8, PyUnicode_EncodeUTF16, PyUnicode_EncodeUTF32, PyUnicode_EncodeUnicodeEscape, PyUnicode_EncodeRawUnicodeEscape, PyUnicode_EncodeLatin1, PyUnicode_EncodeASCII, PyUnicode_EncodeCharmap, PyUnicode_TranslateCharmap, PyUnicode_EncodeMBCS, PyUnicode_EncodeDecimal, PyUnicode_TransformDecimalToASCII
  • Py_UNICODE_{strlen, strcat, strcpy, strcmp, strchr, strrchr}
  • PyUnicode_AsUnicodeCopy
  • PyUnicode_GetMax

_PyUnicode_AsDefaultEncodedString is removed. It previously returned a borrowed reference to a UTF-8-encoded bytes object. Since the unicode object can no longer cache such a reference, implementing it without leaking memory is not possible. No deprecation phase is provided, since it was an API for internal use only.

Extension modules using the legacy API may inadvertently call PyUnicode_READY, by calling some API that requires that the object is ready, and then continue accessing the (now invalid) Py_UNICODE pointer. Such code will break with this PEP. The code was already flawed in 3.2, as there was no explicit guarantee that the PyUnicode_AS_UNICODE result would stay valid after an API call (due to the possibility of string resizing). Modules that face this issue need to re-fetch the Py_UNICODE pointer after API calls; doing so will continue to work correctly in earlier Python versions.

Discussion

Several concerns have been raised about the approach presented here:

It makes the implementation more complex. That’s true, but considered worth it given the benefits.

The Py_UNICODE representation is not instantaneously available, slowing down applications that request it. While this is also true, applications that care about this problem can be rewritten to use the data representation.

Performance

Performance of this patch must be considered for both memory consumption and runtime efficiency. For memory consumption, the expectation is that applications that have many large strings will see a reduction in memory usage. For small strings, the effects depend on the pointer size of the system, and the size of the Py_UNICODE/wchar_t type. The following table demonstrates this (object sizes in bytes) for various small ASCII and Latin-1 string sizes and platforms.

string size   Python 3.2                        This PEP
              16-bit wchar_t   32-bit wchar_t   ASCII           Latin-1
              32-bit  64-bit   32-bit  64-bit   32-bit  64-bit  32-bit  64-bit
1             32      64       40      64       32      56      40      80
2             40      64       40      72       32      56      40      80
3             40      64       48      72       32      56      40      80
4             40      72       48      80       32      56      48      80
5             40      72       56      80       32      56      48      80
6             48      72       56      88       32      56      48      80
7             48      72       64      88       32      56      48      80
8             48      80       64      96       40      64      48      88

The runtime effect is significantly affected by the API being used. After porting the relevant pieces of code to the new API, the iobench, stringbench, and json benchmarks typically see slowdowns of 1% to 30%; specific benchmarks may show speedups, or significantly larger slowdowns.

In actual measurements of a Django application ([2]), significant reductions of memory usage could be found. For example, the storage for Unicode objects reduced to 2216807 bytes, down from 6378540 bytes for a wide Unicode build, and down from 3694694 bytes for a narrow Unicode build (all on a 32-bit system). This reduction came from the prevalence of ASCII strings in this application; out of 36,000 strings (with 1,310,000 chars), 35,713 were ASCII strings (with 1,300,000 chars). The sources for these strings were not further analysed; many of them likely originate from identifiers in the library, and string constants in Django’s source code.

In comparison to Python 2, both Unicode and byte strings need to be accounted for. In the test application, Unicode and byte strings combined had a length of 2,046,000 units (bytes/chars) in 2.x, and 2,200,000 units in 3.x. On a 32-bit system, where the 2.x build used 32-bit wchar_t/Py_UNICODE, the 2.x test used 3,620,000 bytes, and the 3.x build 3,340,000 bytes. This reduction in 3.x using the PEP compared to 2.x only occurs when comparing with a wide unicode build.

Porting Guidelines

Only a small fraction of C code is affected by this PEP, namely code that needs to look “inside” unicode strings. That code doesn’t necessarily need to be ported to this API, as the existing API will continue to work correctly. In particular, modules that need to support both Python 2 and Python 3 might get too complicated when simultaneously supporting this new API and the old Unicode API.

In order to port modules to the new API, try to eliminate the use of these API elements:

  • the Py_UNICODE type,
  • PyUnicode_AS_UNICODE and PyUnicode_AsUnicode,
  • PyUnicode_GET_SIZE and PyUnicode_GetSize, and
  • PyUnicode_FromUnicode.

When iterating over an existing string, or looking at specific characters, use indexing operations rather than pointer arithmetic; indexing works well for PyUnicode_READ(_CHAR) and PyUnicode_WRITE. Use void* as the buffer type for characters to let the compiler detect invalid dereferencing operations. If you do want to use pointer arithmetic (e.g. when converting existing code), use (unsigned) char* as the buffer type, and keep the element size (1, 2, or 4) in a variable. Notice that (1<<(kind-1)) will produce the element size given a buffer kind.
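The element-size expression can be checked directly; here it is wrapped in a small hypothetical helper (the function name is made up for illustration):

```c
/* (1 << (kind - 1)) maps the buffer kinds to their element sizes:
   kind 1 (Latin-1) -> 1 byte, kind 2 (UCS-2) -> 2, kind 3 (UCS-4) -> 4. */
static int element_size(int kind)
{
    return 1 << (kind - 1);
}
```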

When creating new strings, it was common in Python to start off with a heuristic buffer size, and then grow or shrink if the heuristic failed. With this PEP, this is now less practical, as you need a heuristic not only for the length of the string, but also for the maximum character.

In order to avoid heuristics, you need to make two passes over the input: once to determine the output length and the maximum character; then allocate the target string with PyUnicode_New and iterate over the input a second time to produce the final output. While this may sound expensive, it could actually be cheaper than having to copy the result again as in the following approach.

If you take the heuristic route, avoid allocating a string meant to be resized, as resizing strings won’t work for their canonical representation. Instead, allocate a separate buffer to collect the characters, and then construct a unicode object from that using PyUnicode_FromKindAndData. One option is to use Py_UCS4 as the buffer element, assuming the worst case for character ordinals. This will allow pointer arithmetic, but may require a lot of memory. Alternatively, start with a 1-byte buffer, and increase the element size as you encounter larger characters. In any case, PyUnicode_FromKindAndData will scan over the buffer to verify the maximum character.
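The start-narrow-and-widen alternative can be sketched in plain C. All names here are hypothetical; for brevity the buffer has a fixed capacity, only the 1-byte to 4-byte widening is shown (a full implementation would also have a 2-byte stage and grow the buffer on the heap), and the final call to PyUnicode_FromKindAndData is left out:

```c
#include <stddef.h>
#include <stdint.h>

enum { DEMO_CAP = 8 };

/* Hypothetical widening buffer: collect characters into a 1-byte
   array, and convert the data collected so far to 4-byte elements
   when a character above 0xFF appears. */
struct demo_buf {
    int      elsize;             /* current element size: 1 or 4 */
    size_t   len;
    uint8_t  narrow[DEMO_CAP];
    uint32_t wide[DEMO_CAP];
};

static void demo_append(struct demo_buf *b, uint32_t ch)
{
    if (b->elsize == 1 && ch > 0xFF) {
        for (size_t i = 0; i < b->len; i++)  /* widen existing data */
            b->wide[i] = b->narrow[i];
        b->elsize = 4;
    }
    if (b->elsize == 1)
        b->narrow[b->len++] = (uint8_t)ch;
    else
        b->wide[b->len++] = ch;
}

/* Run one scenario: two Latin-1 characters, then a BMP character,
   which forces the widening; returns the final element size. */
static int demo_final_elsize(void)
{
    struct demo_buf b = { 1, 0, {0}, {0} };
    demo_append(&b, 'a');
    demo_append(&b, 'b');
    demo_append(&b, 0x20AC);
    return b.elsize;
}
```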

For common tasks, direct access to the string representation may not be necessary: PyUnicode_Find, PyUnicode_FindChar, PyUnicode_Ord, and PyUnicode_CopyCharacters help in analyzing and creating string objects, operating on indexes instead of data pointers.

References

[1]
PEP 393 branch: https://bitbucket.org/t0rsten/pep-393
[2]
Django measurement results: https://web.archive.org/web/20160911215951/http://www.dcl.hpi.uni-potsdam.de/home/loewis/djmemprof/

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0393.rst

Last modified: 2025-02-01 08:55:40 GMT

