Description
Feature or enhancement
Creating a Python string object efficiently is complicated. Python has a private _PyUnicodeWriter API, which is already used by these projects:
Affected projects (5):
- Cython (3.0.9)
- asyncpg (0.29.0)
- catboost (1.2.3)
- frozendict (2.4.0)
- immutables (0.20)
I propose making the API public, both to promote it and to help maintainers of C extensions write more efficient code for creating Python string objects.
API:
    typedef struct PyUnicodeWriter PyUnicodeWriter;

    PyAPI_FUNC(PyUnicodeWriter*) PyUnicodeWriter_Create(void);
    PyAPI_FUNC(void) PyUnicodeWriter_Discard(PyUnicodeWriter *writer);
    PyAPI_FUNC(PyObject*) PyUnicodeWriter_Finish(PyUnicodeWriter *writer);

    PyAPI_FUNC(void) PyUnicodeWriter_SetOverallocate(
        PyUnicodeWriter *writer,
        int overallocate);

    PyAPI_FUNC(int) PyUnicodeWriter_WriteChar(
        PyUnicodeWriter *writer,
        Py_UCS4 ch);
    PyAPI_FUNC(int) PyUnicodeWriter_WriteUTF8(
        PyUnicodeWriter *writer,
        const char *str,  // decoded from UTF-8
        Py_ssize_t len);  // use strlen() if len < 0
    PyAPI_FUNC(int) PyUnicodeWriter_Format(
        PyUnicodeWriter *writer,
        const char *format,
        ...);

    // Write str(obj)
    PyAPI_FUNC(int) PyUnicodeWriter_WriteStr(
        PyUnicodeWriter *writer,
        PyObject *obj);

    // Write repr(obj)
    PyAPI_FUNC(int) PyUnicodeWriter_WriteRepr(
        PyUnicodeWriter *writer,
        PyObject *obj);

    // Write str[start:end]
    PyAPI_FUNC(int) PyUnicodeWriter_WriteSubstring(
        PyUnicodeWriter *writer,
        PyObject *str,
        Py_ssize_t start,
        Py_ssize_t end);
The internal writer buffer is overallocated by default. PyUnicodeWriter_Finish() truncates the buffer to the exact size if the buffer was overallocated.
Overallocation avoids the quadratic cost of repeated reallocation when appending short strings in a loop. Use PyUnicodeWriter_SetOverallocate(writer, 0) to disable overallocation just before the last write.
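To illustrate why overallocation matters, here is a minimal sketch of a toy growable buffer. It is not the CPython implementation: the toy_writer type, its helper functions and the 25% growth factor are all assumptions made up for this example. It only counts how often the buffer has to be reallocated with and without extra headroom.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy growable byte buffer (hypothetical; NOT the CPython writer). */
typedef struct {
    char *data;
    size_t size;       /* bytes in use */
    size_t capacity;   /* bytes allocated */
    int overallocate;  /* non-zero: reserve extra room when growing */
    size_t reallocs;   /* how many times the buffer had to grow */
} toy_writer;

static void toy_init(toy_writer *w, int overallocate)
{
    assert(w != NULL);
    w->data = NULL;
    w->size = 0;
    w->capacity = 0;
    w->overallocate = overallocate;
    w->reallocs = 0;
}

/* Append len bytes. Returns 0 on success, -1 on allocation failure. */
static int toy_write(toy_writer *w, const char *s, size_t len)
{
    if (len == 0) {
        return 0;
    }
    size_t needed = w->size + len;
    if (needed > w->capacity) {
        size_t newcap = needed;
        if (w->overallocate) {
            newcap += needed / 4;  /* assumed growth factor: 25% headroom */
        }
        char *p = realloc(w->data, newcap);
        if (p == NULL) {
            return -1;
        }
        w->data = p;
        w->capacity = newcap;
        w->reallocs++;
    }
    memcpy(w->data + w->size, s, len);
    w->size = needed;
    return 0;
}

static void toy_free(toy_writer *w)
{
    free(w->data);
    w->data = NULL;
}

/* Append "ab" n times and report how many reallocations were needed. */
static size_t toy_count_reallocs(size_t n, int overallocate)
{
    toy_writer w;
    toy_init(&w, overallocate);
    for (size_t i = 0; i < n; i++) {
        if (toy_write(&w, "ab", 2) < 0) {
            break;
        }
    }
    size_t count = w.reallocs;
    toy_free(&w);
    return count;
}
```

In this model, calling toy_count_reallocs(1000, 0) reallocates on every write, while toy_count_reallocs(1000, 1) needs only a few dozen reallocations because the capacity grows geometrically. Each reallocation may copy the whole buffer, which is where the quadratic cost comes from.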
The writer takes care of the internal buffer kind: Py_UCS1 (latin1), Py_UCS2 (BMP) or Py_UCS4 (full Unicode Character Set). It also implements an optimization if a single write is made using PyUnicodeWriter_WriteStr(): it returns the string unchanged without any copy.
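The choice of buffer kind can be sketched as picking the narrowest storage width that fits the largest code point written so far. The snippet below is a standalone illustration under that assumption; the toy_kind enum and both helper functions are hypothetical names, not part of the CPython API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes per character for each storage width (hypothetical names). */
enum toy_kind {
    TOY_KIND_1BYTE = 1,  /* code points up to U+00FF (latin1) */
    TOY_KIND_2BYTE = 2,  /* code points up to U+FFFF (BMP) */
    TOY_KIND_4BYTE = 4   /* code points up to U+10FFFF */
};

/* Narrowest kind able to store a single code point. */
static enum toy_kind toy_kind_for_char(uint32_t ch)
{
    assert(ch <= 0x10FFFF);
    if (ch <= 0xFF) {
        return TOY_KIND_1BYTE;
    }
    if (ch <= 0xFFFF) {
        return TOY_KIND_2BYTE;
    }
    return TOY_KIND_4BYTE;
}

/* Narrowest kind able to store every code point in chars[0..n-1]. */
static enum toy_kind toy_kind_for_text(const uint32_t *chars, size_t n)
{
    enum toy_kind kind = TOY_KIND_1BYTE;
    for (size_t i = 0; i < n; i++) {
        enum toy_kind k = toy_kind_for_char(chars[i]);
        if (k > kind) {
            kind = k;  /* a wider character forces a wider buffer */
        }
    }
    return kind;
}
```

A writer built this way would start with one byte per character and widen the buffer only when a write contains a character that does not fit, which keeps ASCII and latin1 output compact.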
Example of usage (simplified code from Python/unionobject.c):
    static PyObject *
    union_repr(PyObject *self)
    {
        unionobject *alias = (unionobject *)self;
        Py_ssize_t len = PyTuple_GET_SIZE(alias->args);

        PyUnicodeWriter *writer = PyUnicodeWriter_Create();
        if (writer == NULL) {
            return NULL;
        }

        for (Py_ssize_t i = 0; i < len; i++) {
            if (i > 0 && PyUnicodeWriter_WriteUTF8(writer, " | ", 3) < 0) {
                goto error;
            }
            PyObject *p = PyTuple_GET_ITEM(alias->args, i);
            if (PyUnicodeWriter_WriteRepr(writer, p) < 0) {
                goto error;
            }
        }
        return PyUnicodeWriter_Finish(writer);

    error:
        PyUnicodeWriter_Discard(writer);
        return NULL;
    }
Linked PRs
- gh-119182: Add PyUnicodeWriter C API #119184
- gh-119396: Optimize PyUnicode_FromFormat() UTF-8 decoder #119398
- gh-119182: Decode PyUnicode_FromFormat() format string from UTF-8 #120248
- gh-119182: Use strict error handler in PyUnicode_FromFormat() #120307
- gh-119182: Add PyUnicodeWriter_DecodeUTF8Stateful() #120639
- gh-119182: Optimize PyUnicode_FromFormat() #120796
- gh-119182: Use public PyUnicodeWriter API in union_repr() #120797
- gh-119182: Use public PyUnicodeWriter API in ga_repr() #120799
- gh-119182: Use public PyUnicodeWriter in contextvar_tp_repr() #120809
- gh-119182: Rewrite PyUnicodeWriter tests in Python #120845
- gh-119182: Add PyUnicodeWriter_WriteUCS4() function #120849
- gh-119182: Use PyUnicodeWriter_WriteWideChar() #120851
- gh-119182: Add checks to PyUnicodeWriter APIs #120870
- gh-119182: Complete PyUnicodeWriter documentation #127607
- gh-119182: Use public PyUnicodeWriter in wrap_strftime() #129206
- gh-119182: Use public PyUnicodeWriter in time_strftime() #129207
- gh-119182: Use public PyUnicodeWriter in ast_unparse.c #129208
- gh-119182: Use public PyUnicodeWriter in Python-ast.c #129209
- gh-119182: Use public PyUnicodeWriter in stringio.c #129243
- gh-119182: Use public PyUnicodeWriter in _json.c #129249