Unicode and passing strings
Note
This page uses two different syntax variants:
Cython specific cdef syntax, which was designed to make type declarations concise and easily readable from a C/C++ perspective.
Pure Python syntax which allows static Cython type declarations in pure Python code, following PEP-484 type hints and PEP 526 variable annotations.
To make use of C data types in Python syntax, you need to import the special cython module in the Python module that you want to compile, e.g.
import cython
If you use the pure Python syntax we strongly recommend you use a recent Cython 3 release, since significant improvements have been made here compared to the 0.29.x releases.
Similar to the string semantics in Python 3, Cython strictly separates byte strings and unicode strings. Above all, this means that by default there is no automatic conversion between byte strings and unicode strings (except for what Python 2 does in string operations). All encoding and decoding must pass through an explicit encoding/decoding step. To ease conversion between Python and C strings in simple cases, the module-level c_string_type and c_string_encoding directives can be used to implicitly insert these encoding/decoding steps.
Python string types in Cython code
Cython supports four Python string types: bytes, str, unicode and basestring. The bytes and unicode types are the specific types known from normal Python 2.x (named bytes and str in Python 3). Additionally, Cython also supports the bytearray type which behaves like the bytes type, except that it is mutable.
The str type is special in that it is the byte string in Python 2 and the Unicode string in Python 3 (for Cython code compiled with language level 2, i.e. the default). That is, it always corresponds exactly with the type that the Python runtime itself calls str. Thus, in Python 2, both bytes and str represent the byte string type, whereas in Python 3, both str and unicode represent the Python Unicode string type. The switch is made at C compile time; the Python version that is used to run Cython is not relevant.
When compiling Cython code with language level 3, the str type is identified with exactly the Unicode string type at Cython compile time, i.e. it does not identify with bytes when running in Python 2.
Note that the str type is not compatible with the unicode type in Python 2, i.e. you cannot assign a Unicode string to a variable or argument that is typed str. The attempt will result in either a compile time error (if detectable) or a TypeError exception at runtime. You should therefore be careful when you statically type a string variable in code that must be compatible with Python 2, as this Python version allows a mix of byte strings and unicode strings for data and users normally expect code to be able to work with both. Code that only targets Python 3 can safely type variables and arguments as either bytes or unicode.
The basestring type represents both the types str and unicode, i.e. all Python text string types in Python 2 and Python 3. This can be used for typing text variables that normally contain Unicode text (at least in Python 3) but must additionally accept the str type in Python 2 for backwards compatibility reasons. It is not compatible with the bytes type. Its usage should be rare in normal Cython code as the generic object type (i.e. untyped code) will normally be good enough and has the additional advantage of supporting the assignment of string subtypes. Support for the basestring type was added in Cython 0.20.
String literals
Cython understands all Python string type prefixes:
b'bytes' for byte strings
u'text' for Unicode strings
f'formatted {value}' for formatted Unicode string literals as defined by PEP 498 (added in Cython 0.24)
Unprefixed string literals become str objects when compiling with language level 2 and str objects (i.e. unicode) with language level 3.
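As a quick check of the three prefixes, the following sketch runs in plain Python 3 (it is not Cython-specific, and the variable names are made up for illustration). It only shows which object type each literal produces at runtime:

```python
value = 42  # hypothetical value used in the formatted literal

byte_literal = b'bytes'            # byte string
text_literal = u'text'             # Unicode string
formatted = f'formatted {value}'   # formatted Unicode string (PEP 498)

print(type(byte_literal).__name__)
print(type(text_literal).__name__)
print(formatted)
```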
General notes about C strings
In many use cases, C strings (a.k.a. character pointers) are slow and cumbersome. For one, they usually require manual memory management in one way or another, which makes it more likely to introduce bugs into your code.
Then, Python string objects cache their length, so requesting it (e.g. to validate the bounds of index access or when concatenating two strings into one) is an efficient constant time operation. In contrast, calling strlen() to get this information from a C string takes linear time, which makes many operations on C strings rather costly.
Regarding text processing, Python has built-in support for Unicode, which C lacks completely. If you are dealing with Unicode text, you are usually better off using Python Unicode string objects than trying to work with encoded data in C strings. Cython makes this quite easy and efficient.
Generally speaking: unless you know what you are doing, avoid using C strings where possible and use Python string objects instead. The obvious exception to this is when passing them back and forth from and to external C code. Also, C++ strings remember their length as well, so they can provide a suitable alternative to Python bytes objects in some cases, e.g. when reference counting is not needed within a well defined context.
Passing byte strings
We have dummy C functions declared that we are going to reuse throughout this tutorial:
from cython.cimports.libc.stdlib import malloc
from cython.cimports.libc.string import strcpy, strlen

hello_world = cython.declare(cython.p_char, 'hello world')
n = cython.declare(cython.Py_ssize_t, strlen(hello_world))

@cython.cfunc
def c_call_returning_a_c_string() -> cython.p_char:
    c_string: cython.p_char = cython.cast(cython.p_char, malloc((n + 1) * cython.sizeof(cython.char)))
    if not c_string:
        return cython.NULL  # malloc failed

    strcpy(c_string, hello_world)
    return c_string

@cython.cfunc
def get_a_c_string(c_string_ptr: cython.pp_char, length: cython.p_Py_ssize_t) -> cython.int:
    c_string_ptr[0] = cython.cast(cython.p_char, malloc((n + 1) * cython.sizeof(cython.char)))
    if not c_string_ptr[0]:
        return -1  # malloc failed

    strcpy(c_string_ptr[0], hello_world)
    length[0] = n
    return 0
Warning
The code provided above / on this page uses an external native (non-Python) library through a cimport (cython.cimports). Cython compilation enables this, but there is no support for this from plain Python. Trying to run this code from Python (without compilation) will fail when accessing the external library. This is described in more detail in Calling C functions.
from libc.stdlib cimport malloc
from libc.string cimport strcpy, strlen

cdef char* hello_world = 'hello world'
cdef size_t n = strlen(hello_world)

cdef char* c_call_returning_a_c_string():
    cdef char* c_string = <char *> malloc((n + 1) * sizeof(char))
    if not c_string:
        return NULL  # malloc failed

    strcpy(c_string, hello_world)
    return c_string

cdef int get_a_c_string(char** c_string_ptr, Py_ssize_t *length):
    c_string_ptr[0] = <char *> malloc((n + 1) * sizeof(char))
    if not c_string_ptr[0]:
        return -1  # malloc failed

    strcpy(c_string_ptr[0], hello_world)
    length[0] = n
    return 0
We make a corresponding c_func.pxd to be able to cimport those functions:
cdef char* c_call_returning_a_c_string()
cdef int get_a_c_string(char** c_string, Py_ssize_t *length)
It is very easy to pass byte strings between C code and Python. When receiving a byte string from a C library, you can let Cython convert it into a Python byte string by simply assigning it to a Python variable:
from cython.cimports.c_func import c_call_returning_a_c_string

c_string = cython.declare(cython.p_char, c_call_returning_a_c_string())
if c_string is cython.NULL:
    ...  # handle error

py_string = cython.declare(bytes, c_string)

from c_func cimport c_call_returning_a_c_string

cdef char* c_string = c_call_returning_a_c_string()
if c_string is NULL:
    ...  # handle error

cdef bytes py_string = c_string
A type cast to object or bytes will do the same thing:
py_string = cython.cast(bytes, c_string)
py_string = <bytes> c_string
This creates a Python byte string object that holds a copy of the original C string. It can be safely passed around in Python code, and will be garbage collected when the last reference to it goes out of scope. It is important to remember that null bytes in the string act as terminator character, as generally known from C. The above will therefore only work correctly for C strings that do not contain null bytes.
Besides not working for null bytes, the above is also very inefficient for long strings, since Cython has to call strlen() on the C string first to find out the length by counting the bytes up to the terminating null byte. In many cases, the user code will know the length already, e.g. because a C function returned it. In this case, it is much more efficient to tell Cython the exact number of bytes by slicing the C string. Here is an example:
from cython.cimports.libc.stdlib import free
from cython.cimports.c_func import get_a_c_string

def main():
    c_string: cython.p_char = cython.NULL
    length: cython.Py_ssize_t = 0

    # get pointer and length from a C function
    get_a_c_string(cython.address(c_string), cython.address(length))

    try:
        py_bytes_string = c_string[:length]  # Performs a copy of the data
    finally:
        free(c_string)

from libc.stdlib cimport free
from c_func cimport get_a_c_string

def main():
    cdef char* c_string = NULL
    cdef Py_ssize_t length = 0

    # get pointer and length from a C function
    get_a_c_string(&c_string, &length)

    try:
        py_bytes_string = c_string[:length]  # Performs a copy of the data
    finally:
        free(c_string)
Here, no additional byte counting is required and length bytes from the c_string will be copied into the Python bytes object, including any null bytes. Keep in mind that the slice indices are assumed to be accurate in this case and no bounds checking is done, so incorrect slice indices will lead to data corruption and crashes.
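The difference between null-terminated reading and reading with an explicit length can also be observed from plain Python. The sketch below uses the standard ctypes module as a stand-in for C code; this is an assumption made purely for illustration and is not how Cython performs the conversion internally:

```python
import ctypes

# A 7-byte payload with an embedded null byte.
payload = b'abc\x00def'

# Copies the payload into a C char array and appends a terminating null byte.
buf = ctypes.create_string_buffer(payload)

# Reading the buffer as a C string stops at the first null byte.
as_c_string = buf.value

# Reading an explicit number of bytes keeps the embedded null byte.
with_length = ctypes.string_at(buf, len(payload))

print(as_c_string)
print(with_length)
```

The length passed to string_at() plays the role of the slice bound in c_string[:length]: it is trusted as given, so it must not exceed the size of the buffer.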
Note that the creation of the Python bytes string can fail with an exception, e.g. due to insufficient memory. If you need to free() the string after the conversion, you should wrap the assignment in a try-finally construct:
from cython.cimports.libc.stdlib import free
from cython.cimports.c_func import c_call_returning_a_c_string

py_string = cython.declare(bytes)
c_string = cython.declare(cython.p_char, c_call_returning_a_c_string())
try:
    py_string = c_string
finally:
    free(c_string)

from libc.stdlib cimport free
from c_func cimport c_call_returning_a_c_string

cdef bytes py_string
cdef char* c_string = c_call_returning_a_c_string()
try:
    py_string = c_string
finally:
    free(c_string)
To convert the byte string back into a C char*, use the opposite assignment:
other_c_string = cython.declare(cython.p_char, py_string)  # other_c_string is a 0-terminated string.
cdef char* other_c_string = py_string  # other_c_string is a 0-terminated string.
This is a very fast operation after which other_c_string points to the byte string buffer of the Python string itself. It is tied to the life time of the Python string. When the Python string is garbage collected, the pointer becomes invalid. It is therefore important to keep a reference to the Python string as long as the char* is in use. Often enough, this only spans the call to a C function that receives the pointer as parameter. Special care must be taken, however, when the C function stores the pointer for later use. Apart from keeping a Python reference to the string object, no manual memory management is required.
Starting with Cython 0.20, the bytearray type is supported and coerces in the same way as the bytes type. However, when using it in a C context, special care must be taken not to grow or shrink the object buffer after converting it to a C string pointer. These modifications can change the internal buffer address, which will make the pointer invalid.
Accepting strings from Python code
The other side, receiving input from Python code, may appear simple at first sight, as it only deals with objects. However, getting this right without making the API too narrow or too unsafe may not be entirely obvious.
In the case that the API only deals with byte strings, i.e. binary data or encoded text, it is best not to type the input argument as something like bytes, because that would restrict the allowed input to exactly that type and exclude both subtypes and other kinds of byte containers, e.g. bytearray objects or memory views.
Depending on how (and where) the data is being processed, it may be a good idea to instead receive a 1-dimensional memory view, e.g.
def process_byte_data(data: cython.uchar[:]):
    length = data.shape[0]
    first_byte = data[0]
    slice_view = data[1:-1]
    # ...

def process_byte_data(unsigned char[:] data):
    length = data.shape[0]
    first_byte = data[0]
    slice_view = data[1:-1]
    # ...
Cython’s memory views are described in more detail in Typed Memoryviews, but the above example already shows most of the relevant functionality for 1-dimensional byte views. They allow for efficient processing of arrays and accept anything that can unpack itself into a byte buffer, without intermediate copying. The processed content can finally be returned in the memory view itself (or a slice of it), but it is often better to copy the data back into a flat and simple bytes or bytearray object, especially when only a small slice is returned. Since memoryviews do not copy the data, they would otherwise keep the entire original buffer alive. The general idea here is to be liberal with input by accepting any kind of byte buffer, but strict with output by returning a simple, well adapted object. This can simply be done as follows:
def process_byte_data(data: cython.uchar[:]):
    # ... process the data, here, dummy processing.
    return_all: cython.bint = (data[0] == 108)

    if return_all:
        return bytes(data)
    else:
        # example for returning a slice
        return bytes(data[5:7])

def process_byte_data(unsigned char[:] data):
    # ... process the data, here, dummy processing.
    cdef bint return_all = (data[0] == 108)

    if return_all:
        return bytes(data)
    else:
        # example for returning a slice
        return bytes(data[5:7])
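The "liberal input, strict output" idea can be tried out without compiling anything by using Python's built-in memoryview in place of a typed Cython memory view. This is only an approximation under that assumption: the built-in type is slower than a typed view, and the sketch assumes the buffer holds unsigned bytes.

```python
from array import array

def process_byte_data(data):
    """Accept any object exposing a byte buffer; return a plain bytes object."""
    view = memoryview(data)          # no copy of the underlying buffer
    if len(view) == 0:
        return b''
    return_all = (view[0] == 108)    # 108 is the byte value of 'l'
    if return_all:
        return bytes(view)           # copy everything into a bytes object
    # copy only a small slice, so the original buffer is not kept alive
    return bytes(view[5:7])

print(process_byte_data(b'lorem ipsum'))
print(process_byte_data(bytearray(b'hello world')))
print(process_byte_data(array('B', b'hello world')))
```

All three calls return a plain bytes object, regardless of which buffer type was passed in.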
For read-only buffers, like bytes, the memoryview item type should be declared as const (see Read-only views). If the byte input is actually encoded text, and the further processing should happen at the Unicode level, then the right thing to do is to decode the input straight away. This is almost only a problem in Python 2.x, where Python code expects that it can pass a byte string (str) with encoded text into a text API. Since this usually happens in more than one place in the module’s API, a helper function is almost always the way to go, since it allows for easy adaptation of the input normalisation process later.
This kind of input normalisation function will commonly look similar to the following:
from cython.cimports.cpython.version import PY_MAJOR_VERSION

@cython.cfunc
def _text(s) -> str:
    if type(s) is str:
        # Fast path for most common case(s).
        return cython.cast(str, s)

    elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
        # Only accept byte strings as text input in Python 2.x, not in Py3.
        return cython.cast(bytes, s).decode('ascii')

    elif isinstance(s, str):
        # We know from the fast path above that 's' can only be a subtype here.
        # An evil cast to <str> might still work in some(!) cases,
        # depending on what the further processing does.  To be safe,
        # we can always create a copy instead.
        return str(s)

    else:
        raise TypeError("Could not convert to str.")

from cpython.version cimport PY_MAJOR_VERSION

cdef str _text(s):
    if type(s) is str:
        # Fast path for most common case(s).
        return <str>s

    elif PY_MAJOR_VERSION < 3 and isinstance(s, bytes):
        # Only accept byte strings as text input in Python 2.x, not in Py3.
        return (<bytes>s).decode('ascii')

    elif isinstance(s, str):
        # We know from the fast path above that 's' can only be a subtype here.
        # An evil cast to <str> might still work in some(!) cases,
        # depending on what the further processing does.  To be safe,
        # we can always create a copy instead.
        return str(s)

    else:
        raise TypeError("Could not convert to str.")

cdef str _text(s)
And should then be used like this:
from cython.cimports.to_unicode import _text

def api_func(s):
    text_input = _text(s)
    # ...

from to_unicode cimport _text

def api_func(s):
    text_input = _text(s)
    # ...
Similarly, if the further processing happens at the byte level, but Unicode string input should be accepted, then the following might work, if you are using memory views:
# define a global name for whatever char type is used in the module
char_type = cython.typedef(cython.uchar)

@cython.cfunc
def _chars(s) -> char_type[:]:
    if isinstance(s, str):
        # encode to the specific encoding used inside of the module
        s = cython.cast(str, s).encode('utf8')
    return s

# define a global name for whatever char type is used in the module
ctypedef unsigned char char_type

cdef char_type[:] _chars(s):
    if isinstance(s, str):
        # encode to the specific encoding used inside of the module
        s = (<str>s).encode('utf8')
    return s
In this case, you might want to additionally ensure that byte string input really uses the correct encoding, e.g. if you require pure ASCII input data, you can run over the buffer in a loop and check the highest bit of each byte. This should then also be done in the input normalisation function.
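The highest-bit check described above could look roughly like the following plain Python sketch. The function name is_pure_ascii is hypothetical, and the code assumes the input exposes a buffer of unsigned bytes; in compiled Cython code the same loop would normally run over a typed memory view instead.

```python
def is_pure_ascii(data):
    """Return True if no byte in the buffer has its highest bit set."""
    for byte in memoryview(data):
        if byte & 0x80:      # highest bit set: not a 7-bit ASCII value
            return False
    return True

print(is_pure_ascii(b'plain text'))
print(is_pure_ascii(u'abc\u00f6'.encode('utf8')))
```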
Dealing with “const”
Many C libraries use the const modifier in their API to declare that they will not modify a string, or to require that users must not modify a string they return, for example:
typedef const char specialChar;
int process_string(const char* s);
const unsigned char* look_up_cached_string(const unsigned char* key);
Cython has support for the const modifier in the language, so you can declare the above functions straight away as follows:
cdef extern from "someheader.h":
    ctypedef const char specialChar
    int process_string(const char* s)
    const unsigned char* look_up_cached_string(const unsigned char* key)
Decoding bytes to text
The initially presented way of passing and receiving C strings is sufficient if your code only deals with binary data in the strings. When we deal with encoded text, however, it is best practice to decode the C byte strings to Python Unicode strings on reception, and to encode Python Unicode strings to C byte strings on the way out.
With a Python byte string object, you would normally just call the bytes.decode() method to decode it into a Unicode string:
ustring = byte_string.decode('UTF-8')
Cython allows you to do the same for a C string, as long as it contains no null bytes:
from cython.cimports.c_func import c_call_returning_a_c_string

some_c_string = cython.declare(cython.p_char, c_call_returning_a_c_string())
ustring = some_c_string.decode('UTF-8')

from c_func cimport c_call_returning_a_c_string

cdef char* some_c_string = c_call_returning_a_c_string()
ustring = some_c_string.decode('UTF-8')
And, more efficiently, for strings where the length is known:
from cython.cimports.c_func import get_a_c_string

c_string = cython.declare(cython.p_char, cython.NULL)
length = cython.declare(cython.Py_ssize_t, 0)

# get pointer and length from a C function
get_a_c_string(cython.address(c_string), cython.address(length))

ustring = c_string[:length].decode('UTF-8')

from c_func cimport get_a_c_string

cdef char* c_string = NULL
cdef Py_ssize_t length = 0

# get pointer and length from a C function
get_a_c_string(&c_string, &length)

ustring = c_string[:length].decode('UTF-8')
The same should be used when the string contains null bytes, e.g. when it uses an encoding like UCS-4, where each character is encoded in four bytes most of which tend to be 0.
Again, no bounds checking is done if slice indices are provided, so incorrect indices lead to data corruption and crashes. However, using negative indices is possible and will inject a call to strlen() in order to determine the string length. Obviously, this only works for 0-terminated strings without internal null bytes. Text encoded in UTF-8 or one of the ISO-8859 encodings is usually a good candidate. If in doubt, it’s better to pass indices that are ‘obviously’ correct than to rely on the data to be as expected.
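To see why a UCS-4 style encoding cannot be handled as a 0-terminated string, the following plain Python sketch encodes a short text as UTF-32 (little endian, chosen here only for illustration) and counts the null bytes in the result:

```python
text = u'abc'

# 4 bytes per character; the '-LE' variant writes no byte order mark.
ucs4 = text.encode('UTF-32-LE')

null_count = ucs4.count(0)

# What a strlen()-style scan would see: everything up to the first null byte.
seen_as_c_string = ucs4.split(b'\x00', 1)[0]

# Decoding with the full, known length recovers the text.
decoded = ucs4.decode('UTF-32-LE')

print(len(ucs4), null_count, seen_as_c_string, decoded)
```

Nine of the twelve bytes are zero here, so only an explicit length (as in c_string[:length]) describes the data correctly.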
It is common practice to wrap string conversions (and non-trivial type conversions in general) in dedicated functions, as this needs to be done in exactly the same way whenever receiving text from C. This could look as follows:
from cython.cimports.libc.stdlib import free

@cython.cfunc
def tounicode(s: cython.p_char) -> str:
    return s.decode('UTF-8', 'strict')

@cython.cfunc
def tounicode_with_length(
        s: cython.p_char, length: cython.size_t) -> str:
    return s[:length].decode('UTF-8', 'strict')

@cython.cfunc
def tounicode_with_length_and_free(
        s: cython.p_char, length: cython.size_t) -> str:
    try:
        return s[:length].decode('UTF-8', 'strict')
    finally:
        free(s)

from libc.stdlib cimport free

cdef str tounicode(char* s):
    return s.decode('UTF-8', 'strict')

cdef str tounicode_with_length(
        char* s, size_t length):
    return s[:length].decode('UTF-8', 'strict')

cdef str tounicode_with_length_and_free(
        char* s, size_t length):
    try:
        return s[:length].decode('UTF-8', 'strict')
    finally:
        free(s)
Most likely, you will prefer shorter function names in your code based on the kind of string being handled. Different types of content often imply different ways of handling them on reception. To make the code more readable and to anticipate future changes, it is good practice to use separate conversion functions for different types of strings.
Encoding text to bytes
The reverse way, converting a Python unicode string to a C char*, is pretty efficient by itself, assuming that what you actually want is a memory managed byte string:
py_byte_string = py_unicode_string.encode('UTF-8')
c_string = cython.declare(cython.p_char, py_byte_string)

py_byte_string = py_unicode_string.encode('UTF-8')
cdef char* c_string = py_byte_string
As noted before, this takes the pointer to the byte buffer of the Python byte string. Trying to do the same without keeping a reference to the Python byte string will fail with a compile error:
# this will not compile !
c_string = cython.declare(cython.p_char, py_unicode_string.encode('UTF-8'))

# this will not compile !
cdef char* c_string = py_unicode_string.encode('UTF-8')
Here, the Cython compiler notices that the code takes a pointer to a temporary string result that will be garbage collected after the assignment. Later access to the invalidated pointer will read invalid memory and likely result in a segfault. Cython will therefore refuse to compile this code.
C++ strings
When wrapping a C++ library, strings will usually come in the form of the std::string class. As with C strings, Python byte strings automatically coerce from and to C++ strings:
# distutils: language = c++

from cython.cimports.libcpp.string import string

def get_bytes():
    py_bytes_object = b'hello world'
    s: string = py_bytes_object

    s.append(b'abc')
    py_bytes_object = s
    return py_bytes_object

# distutils: language = c++

from libcpp.string cimport string

def get_bytes():
    py_bytes_object = b'hello world'
    cdef string s = py_bytes_object

    s.append(b'abc')
    py_bytes_object = s
    return py_bytes_object
The memory management situation is different than in C because the creation of a C++ string makes an independent copy of the string buffer which the string object then owns. It is therefore possible to convert temporarily created Python objects directly into C++ strings. A common way to make use of this is when encoding a Python unicode string into a C++ string:
cpp_string = cython.declare(string, py_unicode_string.encode('UTF-8'))
cdef string cpp_string = py_unicode_string.encode('UTF-8')
Note that this involves a bit of overhead because it first encodes the Unicode string into a temporarily created Python bytes object and then copies its buffer into a new C++ string.
For the other direction, efficient decoding support is available in Cython 0.17 and later:
# distutils: language = c++

from cython.cimports.libcpp.string import string

def get_ustrings():
    s: string = string(b'abcdefg')

    ustring1 = s.decode('UTF-8')
    ustring2 = s[2:-2].decode('UTF-8')
    return ustring1, ustring2

# distutils: language = c++

from libcpp.string cimport string

def get_ustrings():
    cdef string s = string(b'abcdefg')

    ustring1 = s.decode('UTF-8')
    ustring2 = s[2:-2].decode('UTF-8')
    return ustring1, ustring2
For C++ strings, decoding slices will always take the proper length of the string into account and apply Python slicing semantics (e.g. return empty strings for out-of-bounds indices).
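The Python slicing semantics referred to here are those of ordinary Python byte strings. As a point of comparison only (this runs on a bytes object, not on a C++ string), the same slice expressions behave as follows in plain Python:

```python
data = b'abcdefg'

# Negative indices count from the end, as in s[2:-2] above.
middle = data[2:-2].decode('UTF-8')

# Out-of-bounds indices are clipped and yield an empty result, not an error.
out_of_range = data[10:20].decode('UTF-8')

print(repr(middle), repr(out_of_range))
```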
Auto encoding and decoding
Automatic conversions are controlled by the directives c_string_type and c_string_encoding. They can be used to change the Python string types that C/C++ strings coerce from and to. By default, they only coerce from and to the bytes type, and encoding or decoding must be done explicitly, as described above.
This can be inconvenient if all C strings that are being processed (or the large majority) contain text, and automatic encoding and decoding from and to Python unicode objects can reduce the code overhead a little. In this case, you can set the c_string_type directive in your module to unicode and the c_string_encoding to the encoding that your C code uses, for example:
# cython: c_string_type=unicode, c_string_encoding=utf8

cdef char* c_string = 'abcdefg'

# implicit decoding:
cdef object py_unicode_object = c_string

# explicit conversion to Python bytes:
py_bytes_object = <bytes>c_string
The other direction, i.e. automatic encoding to C strings, is only supported for ASCII/UTF-8. CPython handles the memory management in this case by keeping an encoded copy of the string alive together with the original unicode string. Otherwise, there would be no way to limit the lifetime of the encoded string in any sensible way, thus rendering any attempt to extract a C string pointer from it a dangerous endeavour. The following safely converts a Unicode string to UTF-8 (change c_string_encoding to ASCII to limit it to that):
# cython: c_string_type=unicode, c_string_encoding=UTF8

def func():
    ustring: str = 'abc'
    cdef const char* s = ustring
    return s[0]  # returns 'a' as a Unicode text string
(This example uses a function context in order to safely control the lifetime of the Unicode string. Global Python variables can be modified from the outside, which makes it dangerous to rely on the lifetime of their values.)
Source code encoding
When string literals appear in the code, the source code encoding is important. It determines the byte sequence that Cython will store in the C code for bytes literals, and the Unicode code points that Cython builds for unicode literals when parsing the byte encoded source file. Following PEP 263, Cython supports the explicit declaration of source file encodings. For example, putting the following comment at the top of an ISO-8859-15 (Latin-9) encoded source file (into the first or second line) is required to enable ISO-8859-15 decoding in the parser:
# -*- coding: ISO-8859-15 -*-
When no explicit encoding declaration is provided, the source code is parsed as UTF-8 encoded text, as specified by PEP 3120. UTF-8 is a very common encoding that can represent the entire Unicode set of characters and is compatible with plain ASCII encoded text that it encodes efficiently. This makes it a very good choice for source code files which usually consist mostly of ASCII characters.
As an example, putting the following line into a UTF-8 encoded source file will print 5, as UTF-8 encodes the letter 'ö' in the two byte sequence '\xc3\xb6':
print(len(b'abcö'))
whereas the following ISO-8859-15 encoded source file will print 4, as the encoding uses only 1 byte for this letter:
# -*- coding: ISO-8859-15 -*-
print(len(b'abcö'))
Note that the unicode literal u'abcö' is a correctly decoded four character Unicode string in both cases, whereas the unprefixed Python str literal 'abcö' will become a byte string in Python 2 (thus having length 4 or 5 in the examples above), and a 4 character Unicode string in Python 3. If you are not familiar with encodings, this may not appear obvious at first read. See CEP 108 for details.
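The two byte lengths can be reproduced in plain Python 3 without depending on how the source file itself is encoded. The sketch below writes the letter 'ö' as the escape \u00f6 and encodes the text explicitly; it illustrates the encodings only and does not go through Cython's parser:

```python
text = u'abc\u00f6'   # 'abcö', written with an escape to stay encoding independent

utf8_bytes = text.encode('UTF-8')
latin9_bytes = text.encode('ISO-8859-15')

print(len(text))          # number of Unicode characters
print(len(utf8_bytes))    # 'ö' takes two bytes in UTF-8
print(len(latin9_bytes))  # 'ö' takes one byte in ISO-8859-15
```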
As a rule of thumb, it is best to avoid unprefixed non-ASCII str literals and to use unicode string literals for all text. Cython also supports the __future__ import unicode_literals that instructs the parser to read all unprefixed str literals in a source file as unicode string literals, just like Python 3.
Single bytes and characters
The Python C-API uses the normal C char type to represent a byte value, but it has two special integer types for a Unicode code point value, i.e. a single Unicode character: Py_UNICODE and Py_UCS4. Cython supports the first natively; support for Py_UCS4 is new in Cython 0.15. Py_UNICODE is either defined as an unsigned 2-byte or 4-byte integer, or as wchar_t, depending on the platform. The exact type is a compile time option in the build of the CPython interpreter and extension modules inherit this definition at C compile time. The advantage of Py_UCS4 is that it is guaranteed to be large enough for any Unicode code point value, regardless of the platform. It is defined as a 32bit unsigned int or long.
In Cython, the char type behaves differently from the Py_UNICODE and Py_UCS4 types when coercing to Python objects. Similar to the behaviour of the bytes type in Python 3, the char type coerces to a Python integer value by default, so that the following prints 65 and not A:
# -*- coding: ASCII -*-

char_val = cython.declare(cython.char, 'A')
assert char_val == 65  # ASCII encoded byte value of 'A'
print(char_val)

# -*- coding: ASCII -*-

cdef char char_val = 'A'
assert char_val == 65  # ASCII encoded byte value of 'A'
print(char_val)
If you want a Python bytes string instead, you have to request it explicitly, and the following will print A (or b'A' in Python 3):
print(cython.cast(bytes, char_val))
print(<bytes>char_val)
The explicit coercion works for any C integer type. Values outside of the range of a char or unsigned char will raise an OverflowError at runtime. Coercion will also happen automatically when assigning to a typed variable, e.g.:
py_byte_string = cython.declare(bytes)
py_byte_string = char_val

cdef bytes py_byte_string
py_byte_string = char_val
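The integer behaviour of char mirrors what plain Python 3 does when indexing a bytes object. The following sketch shows that analogy only; it uses the built-in bytes type rather than a C char, and in plain Python an out-of-range value raises ValueError rather than the OverflowError that the text above describes for Cython's coercion.

```python
byte_string = b'A'

char_val = byte_string[0]       # indexing bytes yields an integer in Python 3
as_bytes = bytes([char_val])    # explicit conversion back to a one-byte string

try:
    bytes([256])                # 256 does not fit into a single byte
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

print(char_val, as_bytes, out_of_range_rejected)
```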
On the other hand, the Py_UNICODE and Py_UCS4 types are rarely used outside of the context of a Python unicode string, so their default behaviour is to coerce to a Python unicode object. The following will therefore print the character A, as would the same code with the Py_UNICODE type:
uchar_val = cython.declare(cython.Py_UCS4, u'A')
assert uchar_val == 65  # character point value of u'A'
print(uchar_val)

cdef Py_UCS4 uchar_val = u'A'
assert uchar_val == 65  # character point value of u'A'
print(uchar_val)
Again, explicit casting will allow users to override this behaviour. The following will print 65:
uchar_val = cython.declare(cython.Py_UCS4, u'A')
print(cython.cast(cython.long, uchar_val))

cdef Py_UCS4 uchar_val = u'A'
print(<long>uchar_val)
Note that casting to a C long (or unsigned long) will work just fine, as the maximum code point value that a Unicode character can have is 1114111 (0x10FFFF). On platforms with 32bit or more, int is just as good.
Narrow Unicode builds
In narrow Unicode builds of CPython before version 3.3, i.e. builds where sys.maxunicode is 65535 (such as all Windows builds, as opposed to 1114111 in wide builds), it is still possible to use Unicode character code points that do not fit into the 16 bit wide Py_UNICODE type. For example, such a CPython build will accept the unicode literal u'\U00012345'. However, the underlying system level encoding leaks into Python space in this case, so that the length of this literal becomes 2 instead of 1. This also shows when iterating over it or when indexing into it. The visible substrings are u'\uD808' and u'\uDF45' in this example. They form a so-called surrogate pair that represents the above character.
For more information on this topic, it is worth reading the Wikipedia article about the UTF-16 encoding.
The same properties apply to Cython code that gets compiled for a narrow CPython runtime environment. In most cases, e.g. when searching for a substring, this difference can be ignored as both the text and the substring will contain the surrogates. So most Unicode processing code will work correctly also on narrow builds. Encoding, decoding and printing will work as expected, so that the above literal turns into exactly the same byte sequence on both narrow and wide Unicode platforms.
However, programmers should be aware that a single Py_UNICODE value (or single ‘character’ unicode string in CPython) may not be enough to represent a complete Unicode character on narrow platforms. For example, if an independent search for u'\uD808' and u'\uDF45' in a unicode string succeeds, this does not necessarily mean that the character u'\U00012345' is part of that string. It may well be that two different characters are in the string that just happen to share a code unit with the surrogate pair of the character in question. Looking for substrings works correctly because the two code units in the surrogate pair use distinct value ranges, so the pair is always identifiable in a sequence of code points.
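The surrogate pair quoted above follows directly from the UTF-16 encoding rules. As a worked example, the plain Python sketch below computes the pair for U+12345 by hand and cross-checks it against Python's own UTF-16 codec; the helper name surrogate_pair is hypothetical.

```python
import struct

def surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)    # lower 10 bits of the offset
    return high, low

high, low = surrogate_pair(0x12345)

# Cross-check: encode the character as big-endian UTF-16 and read two 16 bit units.
codec_units = struct.unpack('>2H', u'\U00012345'.encode('UTF-16-BE'))

print(hex(high), hex(low), codec_units == (high, low))
```

The high surrogate always falls into 0xD800..0xDBFF and the low surrogate into 0xDC00..0xDFFF, which is why the two halves of a pair can be told apart in a sequence of code units.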
As of version 0.15, Cython has extended support for surrogate pairs so that you can safely use an in test to search character values from the full Py_UCS4 range even on narrow platforms:
    # Pure Python syntax
    uchar = cython.declare(cython.Py_UCS4, 0x12345)
    print(uchar in some_unicode_string)

    # Cython (cdef) syntax
    cdef Py_UCS4 uchar = 0x12345
    print(uchar in some_unicode_string)
Similarly, it can coerce a one character string with a high Unicode code point value to a Py_UCS4 value on both narrow and wide Unicode platforms:
    # Pure Python syntax
    uchar = cython.declare(cython.Py_UCS4, u'\U00012345')
    assert uchar == 0x12345

    # Cython (cdef) syntax
    cdef Py_UCS4 uchar = u'\U00012345'
    assert uchar == 0x12345
In CPython 3.3 and later, the Py_UNICODE type is an alias for the system specific wchar_t type and is no longer tied to the internal representation of the Unicode string. Instead, any Unicode character can be represented on all platforms without resorting to surrogate pairs. This implies that narrow builds no longer exist from that version on, regardless of the size of Py_UNICODE. See PEP 393 for details.
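The effect is easy to observe from Python itself. The following short check is only a sketch and assumes it runs on CPython 3.3 or later:

```python
import sys

text = "\U00012345"

# With PEP 393 there is no narrow/wide distinction any more.
full_range = sys.maxunicode == 0x10FFFF
length = len(text)       # counted in code points, not UTF-16 code units
pieces = list(text)      # iteration yields the whole character

print(full_range, length, pieces == [text])
```

Length, iteration and indexing all treat the literal as a single character, in contrast to the narrow-build behaviour described earlier.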
Cython 0.16 and later handles this change internally and does the right thing also for single character values as long as either type inference is applied to untyped variables or the portable Py_UCS4 type is explicitly used in the source code instead of the platform specific Py_UNICODE type. Optimisations that Cython applies to the Python unicode type will automatically adapt to PEP 393 at C compile time, as usual.
Iteration¶
Cython 0.13 supports efficient iteration over char*, bytes and unicode strings, as long as the loop variable is appropriately typed. So the following will generate the expected C code:
    # Pure Python syntax
    def iterate_char():
        c_string: cython.p_char = "Hello to A C-string's world"

        c: cython.char
        for c in c_string[:11]:
            if c == b'A':
                print("Found the letter A")

    # Cython (cdef) syntax
    def iterate_char():
        cdef char* c_string = "Hello to A C-string's world"

        cdef char c
        for c in c_string[:11]:
            if c == b'A':
                print("Found the letter A")
The same applies to bytes objects:
    # Pure Python syntax
    def iterate_bytes():
        bytes_string: bytes = b"hello to A bytes' world"

        c: cython.char
        for c in bytes_string:
            if c == b'A':
                print("Found the letter A")

    # Cython (cdef) syntax
    def iterate_bytes():
        cdef bytes bytes_string = b"hello to A bytes' world"

        cdef char c
        for c in bytes_string:
            if c == b'A':
                print("Found the letter A")
For unicode objects, Cython will automatically infer the type of the loop variable as Py_UCS4:
    # Pure Python syntax
    def iterate_string():
        ustring: unicode = u'Hello world'

        # NOTE: no typing required for 'uchar' !
        for uchar in ustring:
            if uchar == u'A':
                print("Found the letter A")

    # Cython (cdef) syntax
    def iterate_string():
        cdef unicode ustring = u'Hello world'

        # NOTE: no typing required for 'uchar' !
        for uchar in ustring:
            if uchar == u'A':
                print("Found the letter A")
The automatic type inference usually leads to much more efficient code here. However, note that some unicode operations still require the value to be a Python object, so Cython may end up generating redundant conversion code for the loop variable value inside of the loop. If this leads to a performance degradation for a specific piece of code, you can either type the loop variable as a Python object explicitly, or assign its value to a Python typed variable somewhere inside of the loop to enforce one-time coercion before running Python operations on it.
There are also optimisations for in tests, so that the following code will run in plain C code (actually using a switch statement):
    # Pure Python syntax
    @cython.ccall
    def is_in(uchar_val: cython.Py_UCS4) -> cython.void:
        if uchar_val in u'abcABCxY':
            print("The character is in the string.")
        else:
            print("The character is not in the string")

    # Cython (cdef) syntax
    cpdef void is_in(Py_UCS4 uchar_val):
        if uchar_val in u'abcABCxY':
            print("The character is in the string.")
        else:
            print("The character is not in the string")
Combined with the looping optimisation above, this can result in very efficient character switching code, e.g. in unicode parsers.
Windows and wide character APIs¶
Warning
The use of Py_UNICODE* strings outside of Windows is strongly discouraged. Py_UNICODE is inherently not portable between different platforms and Python versions.
Support for the Py_UNICODE C-API has been removed in CPython 3.12. Code that uses it will no longer compile in recent CPython releases. Since version 3.3, CPython provides a flexible internal representation of unicode strings (PEP 393), which makes all Py_UNICODE related APIs deprecated and inefficient.
Windows system APIs natively support Unicode in the form of zero-terminated UTF-16 encoded wchar_t* strings, so called "wide strings".
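To make the memory layout of such a wide string concrete, here is a minimal plain-Python sketch. The helper name to_wide_string is hypothetical, and little-endian byte order is assumed because that is what Windows uses. It only illustrates the layout and is not how Cython performs the conversion.

```python
def to_wide_string(text):
    """Encode text as UTF-16-LE bytes followed by a 16-bit zero terminator."""
    return text.encode("utf-16-le") + b"\x00\x00"


buf = to_wide_string("Hi\u263a")
print(buf.hex())  # 480069003a260000
```

Each character of this sample occupies one 16-bit code unit, and the final two zero bytes mark the end of the string. Text that itself contains U+0000 would appear truncated to an API that relies on the terminator.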
By default, Windows builds of CPython define Py_UNICODE as a synonym for wchar_t. This makes the internal unicode representation compatible with UTF-16 and allows for efficient zero-copy conversions. This also means that Windows builds are always Narrow Unicode builds with all the caveats.
To aid interoperation with Windows APIs, Cython 0.19 supports wide strings (in the form of Py_UNICODE*) and implicitly converts them to and from unicode string objects. These conversions behave the same way as they do for char* and bytes as described in Passing byte strings.
In addition to automatic conversion, unicode literals that appear in C context become C-level wide string literals, and the len() built-in function is specialized to compute the length of a zero-terminated Py_UNICODE* string or array.
Here is an example of how one would call a Unicode API on Windows:
    import sys  # needed for sys.version_info below

    cdef extern from "Windows.h":

        ctypedef Py_UNICODE WCHAR
        ctypedef const WCHAR* LPCWSTR
        ctypedef void* HWND

        int MessageBoxW(HWND hWnd, LPCWSTR lpText, LPCWSTR lpCaption, int uType)

    title = u"Windows Interop Demo - Python %d.%d.%d" % sys.version_info[:3]
    MessageBoxW(NULL, u"Hello Cython \u263a", title, 0)
One consequence of the CPython 3.3 changes is that len() of unicode strings is always measured in code points ("characters"), while Windows APIs expect the number of UTF-16 code units (where each surrogate is counted individually). To always get the number of code units, call PyUnicode_GetSize() directly.
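The difference between the two counts can be illustrated in plain Python without any C-API call. This is only a sketch: utf16_code_units is a hypothetical helper that derives the count from the encoded form, and it assumes the text contains no lone surrogates, which the encoder would reject.

```python
def utf16_code_units(text):
    """Number of UTF-16 code units in text (no BOM, no terminator)."""
    return len(text.encode("utf-16-le")) // 2


message = "Hello Cython \u263a \U00012345"
code_points = len(message)
code_units = utf16_code_units(message)
print(code_points, code_units)  # 16 17
```

The two numbers differ by one because U+12345 lies outside the Basic Multilingual Plane and therefore needs a surrogate pair, while U+263A fits into a single code unit.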