PEP 296 – Adding a bytes Object Type

Author: Scott Gilbert <xscottg at yahoo.com>
Status: Withdrawn

Notice

This PEP is withdrawn by the author (in favor of PEP 358).

This PEP proposes the creation of a new standard type and builtin constructor called ‘bytes’. The bytes object is an efficiently stored array of bytes with some additional characteristics that set it apart from several implementations that are similar.

Rationale

Python currently has many objects that implement something akin to the bytes object of this proposal. For instance the standard string, buffer, array, and mmap objects are all very similar in some regards to the bytes object. Additionally, several significant third party extensions have created similar objects to try to fill similar needs. Frustratingly, each of these objects is too narrow in scope and is missing critical features to make it applicable to a wider category of problems.

Specification

The bytes object has the following important characteristics:

Efficient underlying array storage via the standard C type “unsigned char”. This allows fine-grained control over how much memory is allocated. With the alignment restrictions designated in the next item, it is trivial for low level extensions to cast the pointer to a different type as needed.
Also, since the object is implemented as an array of bytes, it is possible to pass the bytes object to the extensive library of routines already in the standard library that presently work with strings. For instance, the bytes object in conjunction with the struct module could be used to provide a complete replacement for the array module using only Python script.
If an unusual platform comes to light, one where there isn’t a native unsigned 8 bit type, the object will do its best to represent itself at the Python script level as though it were an array of 8 bit unsigned values. It is doubtful whether many extensions would handle this correctly, but Python script could be portable in these cases.
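The idea of pairing a byte array with the struct module to stand in for the array module can be sketched roughly as follows. This is only an illustration under stated assumptions: bytearray is used as a stand-in for the bytes type proposed here, and the class name StructArray is hypothetical.

```python
import struct


class StructArray:
    """Hypothetical fixed-size typed array layered over a mutable byte buffer."""

    def __init__(self, typecode, count):
        # "=" selects standard sizes with no padding; calcsize gives the item width.
        self._fmt = "=" + typecode
        self._itemsize = struct.calcsize(self._fmt)
        # Assumption: bytearray stands in for the proposed bytes object.
        self._buf = bytearray(self._itemsize * count)
        self._count = count

    def __len__(self):
        return self._count

    def __getitem__(self, index):
        if not 0 <= index < self._count:
            raise IndexError("index out of range")
        return struct.unpack_from(self._fmt, self._buf, index * self._itemsize)[0]

    def __setitem__(self, index, value):
        if not 0 <= index < self._count:
            raise IndexError("index out of range")
        struct.pack_into(self._fmt, self._buf, index * self._itemsize, value)


values = StructArray("i", 4)
values[0] = 7
values[3] = -2
```

Items are packed and unpacked in place, so no intermediate string is created for each access.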
Alignment of the allocated byte array is whatever is promised by the platform implementation of malloc. A bytes object created from an extension can be supplied with memory of whatever alignment the extension author sees fit.
This alignment restriction should allow the bytes object to be used as storage for all standard C types - including PyComplex objects or other structs of standard C types. Further alignment restrictions can be provided by extensions as necessary.
The bytes object implements a subset of the sequence operations provided by string/array objects, but with slightly different semantics in some cases. In particular, a slice always returns a new bytes object, but the underlying memory is shared between the two objects. This type of slice behavior has been called creating a “view”. Additionally, repetition and concatenation are undefined for bytes objects and will raise an exception.
As these objects are likely to find use in high performance applications, one motivation for the decision to use view slicing is that copying between bytes objects should be very efficient and not require the creation of temporary objects. The following code illustrates this:
```
# create two 10 Meg bytes objects
b1 = bytes(10000000)
b2 = bytes(10000000)

# copy from part of one to another without creating a 1 Meg temporary
b1[2000000:3000000] = b2[4000000:5000000]
```
Slice assignment where the rvalue is not the same length as the lvalue will raise an exception. However, slice assignment will work correctly with overlapping slices (typically implemented with memmove).
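The slicing rules described above can be tried out with objects that exist in current Python. The sketch below assumes that a memoryview over a bytearray is an acceptable stand-in for the proposed bytes object; it is an analogy, not the type specified by this PEP.

```python
buf = bytearray(b"0123456789")
view = memoryview(buf)

# A slice of the view is a new object that shares the underlying memory.
window = view[2:5]
window[0] = ord("X")
shares_memory = buf[2] == ord("X")

# Overlapping slice assignment behaves as if the source were copied first.
view[4:8] = view[2:6]

# Mismatched lengths are rejected rather than resizing the buffer.
try:
    view[0:4] = view[0:5]
    length_mismatch_rejected = False
except ValueError:
    length_mismatch_rejected = True

# Concatenation is not defined for views.
try:
    view + view
    concatenation_rejected = False
except TypeError:
    concatenation_rejected = True
```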
The bytes object will be recognized as a native type by the pickle and cPickle modules for efficient serialization. (In truth, this is the only requirement that can’t be implemented via a third party extension.)
Partial solutions to address the need to serialize the data stored in a bytes-like object without creating a temporary copy of the data into a string have been implemented in the past. The tofile and fromfile methods of the array object are good examples of this. The bytes object will support these methods too. However, pickling is useful in other situations - such as in the shelve module, or implementing RPC of Python objects, and requiring the end user to use two different serialization mechanisms to get an efficient transfer of data is undesirable.
XXX: Will try to implement pickling of the new bytes object in such a way that previous versions of Python will unpickle it as a string object.
When unpickling, the bytes object will be created from memory allocated from Python (via malloc). As such, it will lose any additional properties that an extension supplied pointer might have provided (special alignment, or special types of memory).
XXX: Will try to make it so that C subclasses of bytes type can supply the memory that will be unpickled into. For instance, a derived class called PageAlignedBytes would unpickle to memory that is also page aligned.
On any platform where an int is 32 bits (most of them), it is currently impossible to create a string with a length larger than can be represented in 31 bits. As such, pickling to a string will raise an exception when the operation is not possible.
At least on platforms supporting large files (many of them), pickling large bytes objects to files should be possible via repeated calls to the file.write() method.
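Writing a large buffer through repeated write() calls might look something like the sketch below. The helper name write_in_chunks and its chunk_size parameter are hypothetical, io.BytesIO stands in for a real file, and memoryview is assumed as the way to slice without copying.

```python
import io


def write_in_chunks(fileobj, data, chunk_size=1 << 20):
    """Write a bytes-like object to fileobj in pieces of at most chunk_size bytes."""
    view = memoryview(data)
    written = 0
    for start in range(0, len(view), chunk_size):
        # Slicing a memoryview does not copy the underlying bytes.
        written += fileobj.write(view[start:start + chunk_size])
    return written


payload = bytearray(range(256)) * 10
sink = io.BytesIO()
total = write_in_chunks(sink, payload, chunk_size=1000)
```

Each call hands the file a bounded slice, so no single write needs a length beyond what the platform supports.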
The bytes type supports the PyBufferProcs interface, but a bytes object provides the additional guarantee that the pointer will not be deallocated or reallocated as long as a reference to the bytes object is held. This implies that a bytes object is not resizable once it is created, but allows the global interpreter lock (GIL) to be released while a separate thread manipulates the memory pointed to if the PyBytes_Check(...) test passes.
This characteristic of the bytes object allows it to be used in situations such as asynchronous file I/O or on multiprocessor machines where the pointer obtained by PyBufferProcs will be used independently of the global interpreter lock.
Knowing that the pointer cannot be reallocated or freed after the GIL is released gives extension authors the capability to get true concurrency and make use of additional processors for long running computations on the pointer.
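For comparison, current Python addresses the same pointer-stability concern differently: a resizable bytearray refuses to resize while a view exports its memory. The sketch below shows that behavior as an analogy only; it is not the mechanism proposed in this PEP, where the object is simply never resizable.

```python
buf = bytearray(16)
view = memoryview(buf)   # exports the buffer's memory

try:
    buf.extend(b"more")  # would require reallocating the storage
    resize_blocked = False
except BufferError:
    resize_blocked = True

view.release()           # drop the export
buf.extend(b"more")      # resizing is allowed again
```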
In C/C++ extensions, the bytes object can be created from a supplied pointer and destructor function to free the memory when the reference count goes to zero.
The special implementation of slicing for the bytes object allows multiple bytes objects to refer to the same pointer/destructor. As such, a refcount will be kept on the actual pointer/destructor. This refcount is separate from the refcount typically associated with Python objects.
XXX: It may be desirable to expose the inner refcounted object as an actual Python object. If a good use case arises, it should be possible for this to be implemented later with no loss to backwards compatibility.
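The separate refcount on the pointer/destructor pair can be modeled in pure Python. The sketch below is a toy model, not the C implementation: the names SharedBlock and BytesView are hypothetical, and the destructor is assumed to receive the memory and a user value, mirroring the shape of the C destructor signature.

```python
class SharedBlock:
    """Hypothetical stand-in for the inner refcounted pointer/destructor pair."""

    def __init__(self, storage, destructor=None, user=None):
        self.storage = storage
        self.destructor = destructor
        self.user = user
        self.refcount = 0

    def acquire(self):
        self.refcount += 1

    def release(self):
        self.refcount -= 1
        # Run the destructor once, when the last view lets go.
        if self.refcount == 0 and self.destructor is not None:
            self.destructor(self.storage, self.user)


class BytesView:
    """Hypothetical view object; each view holds one reference on the block."""

    def __init__(self, block, start, stop):
        self.block, self.start, self.stop = block, start, stop
        block.acquire()

    def slice(self, start, stop):
        # A slice is a new view over the same block, not a copy.
        return BytesView(self.block, self.start + start, self.start + stop)

    def close(self):
        self.block.release()


freed = []
block = SharedBlock(bytearray(100),
                    destructor=lambda mem, user: freed.append(user),
                    user="tag")
whole = BytesView(block, 0, 100)
part = whole.slice(10, 20)
whole.close()
still_alive = freed == []   # the slice keeps the block alive
part.close()                # last reference gone: destructor runs
```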
It is also possible to designate the bytes object as readonly; in this case it isn’t actually mutable, but does provide the other features of a bytes object.
The bytes object keeps track of the length of its data with a Python LONG_LONG type. Even though the current definition for PyBufferProcs restricts the length to be the size of an int, this PEP does not propose to make any changes there. Instead, extensions can work around this limit by making an explicit PyBytes_Check(...) call, and if that succeeds they can make a PyBytes_GetReadBuffer(...) or PyBytes_GetWriteBuffer(...) call to get the pointer and full length of the object as a LONG_LONG.
The bytes object will raise an exception if the standard PyBufferProcs mechanism is used and the size of the bytes object is greater than can be represented by an integer.
From Python scripting, the bytes object will be subscriptable with longs so the 32 bit int limit can be avoided.
There is still a problem with the len() function as it is PyObject_Size() and this returns an int as well. As a workaround, the bytes object will provide a .length() method that will return a long.
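A similar limit on len() can still be observed in current CPython, where the ceiling is sys.maxsize rather than a C int. The sketch below uses a hypothetical class, HugeBuffer, to show why a separate .length() method sidesteps the problem; it allocates no memory and only models the size bookkeeping.

```python
import sys


class HugeBuffer:
    """Hypothetical object whose logical size exceeds what len() can report."""

    def __init__(self, size):
        self._size = size

    def __len__(self):
        return self._size

    def length(self):
        # Plain method call: returns an arbitrary-precision integer.
        return self._size


huge = HugeBuffer(sys.maxsize + 1)

try:
    len(huge)               # len() must fit in a machine-sized integer
    len_overflowed = False
except OverflowError:
    len_overflowed = True

full_size = huge.length()   # no such restriction on an ordinary method
```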
The bytes object can be constructed at the Python scripting level by passing an int/long to the bytes constructor with the number of bytes to allocate. For example:
```
b = bytes(100000)  # alloc 100K bytes
```
The constructor can also take another bytes object. This will be useful for the implementation of unpickling, and in converting a read-write bytes object into a read-only one. An optional second argument will be used to designate creation of a readonly bytes object.
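Deriving a read-only object from a read-write one has a close analogue in current Python (3.8 and later), sketched below. The use of memoryview.toreadonly() is an assumption made for illustration; the constructor argument proposed in this PEP is not available to run.

```python
buf = bytearray(b"abcd")
writable = memoryview(buf)
readonly = writable.toreadonly()   # same memory, writes disallowed

writable[0] = ord("A")             # still visible through the read-only view

try:
    readonly[1] = ord("B")
    write_rejected = False
except TypeError:
    write_rejected = True
```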
From the C API, the bytes object can be allocated using any of the following signatures:
```
PyObject* PyBytes_FromLength(LONG_LONG len, int readonly);
PyObject* PyBytes_FromPointer(void* ptr, LONG_LONG len, int readonly,
                              void (*dest)(void *ptr, void *user),
                              void* user);
```
In the PyBytes_FromPointer(...) function, if the dest function pointer is passed in as NULL, it will not be called. This should only be used for creating bytes objects from statically allocated space.
The user pointer has been called a closure in other places. It is a pointer that the user can use for whatever purposes. It will be passed to the destructor function on cleanup and can be useful for a number of things. If the user pointer is not needed, NULL should be passed instead.
The bytes type will be a new style class as that seems to be where all standard Python types are headed.

Contrast to existing types

The most common way to work around the lack of a bytes object has been to simply use a string object in its place. Binary files, the struct/array modules, and several other examples exist of this. Putting aside the style issue that these uses typically have nothing to do with text strings, there is the real problem that strings are not mutable, so direct manipulation of the data returned in these cases is not possible. Also, numerous optimizations in the string module (such as caching the hash value or interning the pointers) mean that extension authors are on very thin ice if they try to break the rules with the string object.

The buffer object seems like it was intended to address the purpose that the bytes object is trying to fulfill, but several shortcomings in its implementation [1] have made it less useful in many common cases. The buffer object made a different choice for its slicing behavior (it returns new strings instead of buffers for slicing and other operations), and it doesn’t make many of the promises on alignment or being able to release the GIL that the bytes object does.

Also in regards to the buffer object, it is not possible to simply replace the buffer object with the bytes object and maintain backwards compatibility. The buffer object provides a mechanism to take the PyBufferProcs supplied pointer of another object and present it as its own. Since the behavior of the other object can not be guaranteed to follow the same set of strict rules that a bytes object does, it can’t be used in places that a bytes object could.

The array module supports the creation of an array of bytes, but it does not provide a C API for supplying pointers and destructors to extension supplied memory. This makes it unusable for constructing objects out of shared memory, or memory that has special alignment or locking for things like DMA transfers. Also, the array object does not currently pickle. Finally, since the array object allows its contents to grow, via the extend method, the pointer can be changed if the GIL is not held while using it.

Creating a buffer object from an array object has the same problem of leaving an invalid pointer when the array object is resized.

The mmap object caters to its particular niche, but does not attempt to solve a wider class of problems.

Finally, no third party extension can implement pickling without creating a temporary object of a standard Python type. For example, in the Numeric community, it is unpleasant that a large array can’t pickle without creating a large binary string to duplicate the array data.

Backward Compatibility

The only possibility for backwards compatibility problems that the author is aware of is in previous versions of Python that try to unpickle data containing the new bytes type.

Reference Implementation

XXX: Actual implementation is in progress, but changes are still possible as this PEP gets further review.

The following new files will be added to the Python baseline:

    Include/bytesobject.h    # C interface
    Objects/bytesobject.c    # C implementation
    Lib/test/test_bytes.py   # unit testing
    Doc/lib/libbytes.tex     # documentation

The following files will also be modified:

    Include/Python.h         # adding bytesmodule.h include file
    Python/bltinmodule.c     # adding the bytes type object
    Modules/cPickle.c        # adding bytes to the standard types
    Lib/pickle.py            # adding bytes to the standard types

It is possible that several other modules could be cleaned up and implemented in terms of the bytes object. The mmap module comes to mind first, but as noted above it would be possible to reimplement the array module as a pure Python module. While it is attractive that this PEP could actually reduce the amount of source code by some amount, the author feels that this could cause unnecessary risk for breaking existing applications and should be avoided at this time.

Additional Notes/Comments

Guido van Rossum wondered whether it would make sense to be able to create a bytes object from a mmap object. The mmap object appears to support the requirements necessary to provide memory for a bytes object. (It doesn’t resize, and the pointer is valid for the lifetime of the object.) As such, a method could be added to the mmap module such that a bytes object could be created directly from a mmap object. An initial stab at how this would be implemented would be to use the PyBytes_FromPointer() function described above and pass the mmap_object as the user pointer. The destructor function would decref the mmap_object for cleanup.
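Something close to creating a byte view directly over mmap memory can be tried in current Python. The sketch below assumes a memoryview over an anonymous mapping as the analogue; it does not use the PyBytes_FromPointer() approach outlined above, which is C-level and was never implemented.

```python
import mmap

mapping = mmap.mmap(-1, 4096)       # anonymous mapping, not backed by a file
view = memoryview(mapping)          # view over the mapped memory, no copy

view[0:5] = b"hello"                # writes go straight to the mapping
first_bytes = mapping[0:5]

try:
    mapping.close()                 # refused while the view still exports the memory
    close_blocked = False
except BufferError:
    close_blocked = True

view.release()
mapping.close()
```

Refusing to close while a view exists plays the role that the decref in the destructor would play in the C approach: the mapping stays valid for as long as something points into it.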
Todd Miller notes that it may be useful to have two new functions: PyObject_AsLargeReadBuffer() and PyObject_AsLargeWriteBuffer() that are similar to PyObject_AsReadBuffer() and PyObject_AsWriteBuffer(), but support getting a LONG_LONG length in addition to the void* pointer. These functions would allow extension authors to work transparently with bytes objects (which support LONG_LONG lengths) and most other buffer like objects (which only support int lengths). These functions could be in lieu of, or in addition to, creating specific PyBytes_GetReadBuffer() and PyBytes_GetWriteBuffer() functions.
XXX: The author thinks this is a very good idea as it paves the way for other objects to eventually support large (64 bit) pointers, and it should only affect abstract.c and abstract.h. Should this be added above?
It was generally agreed that abusing the segment count of the PyBufferProcs interface is not a good hack to work around the 31 bit limitation of the length. If you don’t know what this means, then you’re in good company. Most code in the Python baseline, and presumably in many third party extensions, punts when the segment count is not 1.