PEP 624 – Remove Py_UNICODE encoder APIs

Author: Inada Naoki <songofacandy at gmail.com>
Status:

Abstract

This PEP proposes to remove deprecated Py_UNICODE encoder APIs in Python 3.11:

PyUnicode_Encode()
PyUnicode_EncodeASCII()
PyUnicode_EncodeLatin1()
PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()
PyUnicode_EncodeUTF32()
PyUnicode_EncodeUnicodeEscape()
PyUnicode_EncodeRawUnicodeEscape()
PyUnicode_EncodeCharmap()
PyUnicode_TranslateCharmap()
PyUnicode_EncodeDecimal()
PyUnicode_TransformDecimalToASCII()

Note

PEP 623 proposes to remove Unicode object APIs relating to Py_UNICODE. This PEP, on the other hand, does not relate to the Unicode object. The two PEPs are split because they have different motivations and need different discussions.

Motivation

In general, reducing the number of APIs that have been deprecated for a long time and have few users is a good idea: it not only improves the maintainability of CPython, but also helps API users and other Python implementations.

Rationale

Deprecated since Python 3.3

Py_UNICODE and the APIs using it have been deprecated since Python 3.3.

Inefficient

All of these APIs are implemented using PyUnicode_FromWideChar. So these APIs are inefficient when the user wants to encode a Unicode object.

Not used widely

In a search of the top 4000 PyPI packages [1], only pyodbc uses these APIs:

PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()

pyodbc uses these APIs to encode a Unicode object into a bytes object, so it is easy to fix. [2]

Alternative APIs

There are alternative APIs that accept PyObject *unicode instead of Py_UNICODE *. Users can migrate to them.

Deprecated API	Alternative APIs
`PyUnicode_Encode()`	`PyUnicode_AsEncodedString()`
`PyUnicode_EncodeASCII()`	`PyUnicode_AsASCIIString()` (1)
`PyUnicode_EncodeLatin1()`	`PyUnicode_AsLatin1String()` (1)
`PyUnicode_EncodeUTF7()`	(2)
`PyUnicode_EncodeUTF8()`	`PyUnicode_AsUTF8String()` (1)
`PyUnicode_EncodeUTF16()`	`PyUnicode_AsUTF16String()` (3)
`PyUnicode_EncodeUTF32()`	`PyUnicode_AsUTF32String()` (3)
`PyUnicode_EncodeUnicodeEscape()`	`PyUnicode_AsUnicodeEscapeString()`
`PyUnicode_EncodeRawUnicodeEscape()`	`PyUnicode_AsRawUnicodeEscapeString()`
`PyUnicode_EncodeCharmap()`	`PyUnicode_AsCharmapString()` (1)
`PyUnicode_TranslateCharmap()`	`PyUnicode_Translate()`
`PyUnicode_EncodeDecimal()`	(4)
`PyUnicode_TransformDecimalToASCII()`	(4)

Notes:

1. The const char *errors parameter is missing.
2. There is no public alternative API. But users can use the generic PyUnicode_AsEncodedString() instead.
3. The const char *errors and int byteorder parameters are missing.
4. There is no direct replacement. But Py_UNICODE_TODECIMAL can be used instead. CPython uses _PyUnicode_TransformDecimalAndSpaceToASCII for converting from Unicode to numbers instead.

Plan

Remove these APIs in Python 3.11. They have been deprecated already.

PyUnicode_Encode()
PyUnicode_EncodeASCII()
PyUnicode_EncodeLatin1()
PyUnicode_EncodeUTF7()
PyUnicode_EncodeUTF8()
PyUnicode_EncodeUTF16()
PyUnicode_EncodeUTF32()
PyUnicode_EncodeUnicodeEscape()
PyUnicode_EncodeRawUnicodeEscape()
PyUnicode_EncodeCharmap()
PyUnicode_TranslateCharmap()
PyUnicode_EncodeDecimal()
PyUnicode_TransformDecimalToASCII()

Alternative Ideas

Replace `Py_UNICODE` with `PyObject`

As described in the “Alternative APIs” section, some APIs don’t have public alternative APIs accepting PyObject *unicode input. And some public alternative APIs have restrictions like missing errors and byteorder parameters.

Instead of removing deprecated APIs, we can reuse their names for alternative public APIs.

Since we have private alternative APIs already, this is just a matter of renaming them from their private names to the public, deprecated names.

Rename to	Rename from
`PyUnicode_EncodeASCII()`	`_PyUnicode_AsASCIIString()`
`PyUnicode_EncodeLatin1()`	`_PyUnicode_AsLatin1String()`
`PyUnicode_EncodeUTF7()`	`_PyUnicode_EncodeUTF7()`
`PyUnicode_EncodeUTF8()`	`_PyUnicode_AsUTF8String()`
`PyUnicode_EncodeUTF16()`	`_PyUnicode_EncodeUTF16()`
`PyUnicode_EncodeUTF32()`	`_PyUnicode_EncodeUTF32()`

Pros:

We have a more consistent API set.

Cons:

Backward incompatible.
We have more public APIs to maintain for rare use cases.
Existing public APIs are enough for most use cases, and PyUnicode_AsEncodedString() can be used in other cases.

Replace `Py_UNICODE` with `Py_UCS4`

We can replace Py_UNICODE with Py_UCS4 and undeprecate these APIs.

The UTF-8, UTF-16, and UTF-32 encoders support Py_UCS4 internally. So PyUnicode_EncodeUTF8(), PyUnicode_EncodeUTF16(), and PyUnicode_EncodeUTF32() can avoid creating a temporary Unicode object.
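
To make concrete what encoding directly from a UCS-4 buffer means, here is a standalone sketch in plain C. It is not CPython's encoder and does not use the C API: the function name ucs4_to_utf8 is hypothetical, uint32_t stands in for Py_UCS4, and errors are reported by return value instead of through an error handler. It only shows that a buffer of code points can be turned into UTF-8 bytes without building an intermediate string object.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython API: encode `len` UCS-4 code points
 * from `src` into `dst` as UTF-8. `dst` must have room for 4 * len bytes.
 * Returns the number of bytes written, or (size_t)-1 if a code point is a
 * surrogate or lies above U+10FFFF. */
static size_t
ucs4_to_utf8(const uint32_t *src, size_t len, unsigned char *dst)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t c = src[i];
        if (c < 0x80) {
            /* 1 byte: ASCII */
            dst[n++] = (unsigned char)c;
        }
        else if (c < 0x800) {
            /* 2 bytes */
            dst[n++] = (unsigned char)(0xC0 | (c >> 6));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        else if (c < 0x10000) {
            /* 3 bytes; lone surrogates are not valid scalar values */
            if (c >= 0xD800 && c <= 0xDFFF) {
                return (size_t)-1;
            }
            dst[n++] = (unsigned char)(0xE0 | (c >> 12));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        else if (c <= 0x10FFFF) {
            /* 4 bytes */
            dst[n++] = (unsigned char)(0xF0 | (c >> 18));
            dst[n++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            dst[n++] = (unsigned char)(0x80 | (c & 0x3F));
        }
        else {
            return (size_t)-1;
        }
    }
    return n;
}
```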

Pros:

We can avoid creating a temporary Unicode object when encoding from Py_UCS4* into a bytes object with the UTF-8, UTF-16, and UTF-32 codecs.

Cons:

Backward incompatible.
We have more public APIs to maintain for rare use cases.
Other Python implementations that want to support the Python/C API need to support these APIs too.
If we change the Unicode internal representation to UTF-8 in the future, we need to keep UCS-4 support only for these APIs.

Replace `Py_UNICODE` with `wchar_t`

We can replace Py_UNICODE with wchar_t. Since Py_UNICODE is already a typedef of wchar_t, this is the status quo.

On platforms where sizeof(wchar_t) == 4, we can avoid creating a temporary Unicode object when encoding from wchar_t* to bytes objects using the UTF-8, UTF-16, and UTF-32 codecs, like the “Replace Py_UNICODE with Py_UCS4” idea.
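
The dependence on sizeof(wchar_t) can be illustrated with a small standalone sketch in plain C. The function wchar_to_ucs4 is hypothetical and is not how CPython handles the conversion. It rests on the assumption that a 2-byte wchar_t buffer holds UTF-16, so characters outside the Basic Multilingual Plane arrive as surrogate pairs and need a conversion pass, while a 4-byte wchar_t buffer already holds one code point per element.

```c
#include <stddef.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper, not a CPython API: copy `len` wchar_t units from
 * `src` into `dst` as UCS-4 code points and return how many were written.
 * `dst` must have room for `len` elements.
 * When sizeof(wchar_t) == 4 this is an element-wise copy. When
 * sizeof(wchar_t) == 2 the input is assumed to be UTF-16, and each valid
 * surrogate pair is combined into one code point. */
static size_t
wchar_to_ucs4(const wchar_t *src, size_t len, uint32_t *dst)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i++) {
        uint32_t c = (uint32_t)src[i];
        if (sizeof(wchar_t) == 2
            && c >= 0xD800 && c <= 0xDBFF && i + 1 < len) {
            uint32_t low = (uint32_t)src[i + 1];
            if (low >= 0xDC00 && low <= 0xDFFF) {
                /* Combine high and low surrogate into one code point. */
                c = 0x10000 + ((c - 0xD800) << 10) + (low - 0xDC00);
                i++;
            }
        }
        dst[n++] = c;
    }
    return n;
}
```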

Pros:

Backward compatible.
We can avoid creating a temporary Unicode object when encoding from wchar_t* into a bytes object with the UTF-8, UTF-16, and UTF-32 codecs on platforms where sizeof(wchar_t) == 4.

Cons:

Although Windows is the major platform that uses wchar_t heavily, these APIs always need to create a temporary Unicode object there because sizeof(wchar_t) == 2 on Windows.
We have more public APIs to maintain for rare use cases.
Other Python implementations that want to support the Python/C API need to support these APIs too.
If we change the Unicode internal representation to UTF-8 in the future, we need to keep UCS-4 support only for these APIs.

Rejected Ideas

Emit runtime warning

In addition to the existing compiler warning, emitting a runtime DeprecationWarning was suggested.

But these APIs don’t release the GIL for now. Emitting a warning from such APIs is not safe. See this example.

    PyObject *u = PyList_GET_ITEM(list, i);  // u is borrowed reference.
    PyObject *b = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(u),
            PyUnicode_GET_SIZE(u), NULL);
    // Assumes u is still living reference.
    PyObject *t = PyTuple_Pack(2, u, b);
    Py_DECREF(b);
    return t;

If we emit a Python warning from PyUnicode_EncodeUTF8(), warning filters and other threads may change the list, and u can be a dangling reference after PyUnicode_EncodeUTF8() returns.

Discussions

Objections

Removing these APIs removes the ability to use codecs without a temporary Unicode object.
- Codecs cannot encode a Unicode buffer directly without a temporary Unicode object since Python 3.3. All of these APIs create a temporary Unicode object for now, so removing them doesn’t reduce any abilities.
Why not remove decoder APIs too?
- They are part of the stable ABI.
- PyUnicode_DecodeASCII() and PyUnicode_DecodeUTF8() are used very widely. Deprecating them is not worthwhile.
- Decoder APIs can decode from a byte buffer directly, without creating a temporary bytes object. On the other hand, encoder APIs cannot avoid a temporary Unicode object.

References

[1]

Source package list chosen from top 4000 PyPI packages. (https://github.com/methane/notes/blob/master/2020/wchar-cache/package_list.txt)

[2]

pyodbc – Don’t use PyUnicode_Encode API #792 (https://github.com/mkleehammer/pyodbc/pull/792)

Copyright

This document is placed in the public domain or under the CC0-1.0-Universal license, whichever is more permissive.

Source: https://github.com/python/peps/blob/main/peps/pep-0624.rst

Last modified: 2025-02-01 08:55:40 GMT
