PEP 358

Author: Neil Schemenauer <nas at arctrix.com>, Guido van Rossum <guido at python.org>
Status:

Update

This PEP has partially been superseded by PEP 3137.

Abstract

This PEP outlines the introduction of a raw bytes sequence type. Adding the bytes type is one step in the transition to Unicode-based str objects which will be introduced in Python 3.0.

The PEP describes how the bytes type should work in Python 2.6, as well as how it should work in Python 3.0. (Occasionally there are differences because in Python 2.6, we have two string types, str and unicode, while in Python 3.0 we will only have one string type, whose name will be str but whose semantics will be like the 2.6 unicode type.)

Motivation

Python’s current string objects are overloaded. They serve to hold both sequences of characters and sequences of bytes. This overloading of purpose leads to confusion and bugs. In future versions of Python, string objects will be used for holding character data. The bytes object will fulfil the role of a byte container. Eventually the unicode type will be renamed to str and the old str type will be removed.

Specification

A bytes object stores a mutable sequence of integers that are in the range 0 to 255. Unlike string objects, indexing a bytes object returns an integer. Assigning or comparing an object that is not an integer to an element causes a TypeError exception. Assigning a value outside the range 0 to 255 to an element causes a ValueError exception. The .__len__() method of bytes returns the number of integers stored in the sequence (i.e. the number of bytes).
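
The indexing and assignment rules above can be tried in current Python 3, where the mutable type this PEP describes most closely corresponds to bytearray (the name settled on in PEP 3137). The sketch below assumes that mapping; the variable names are illustrative only.

```python
# Assumption: bytearray stands in for the mutable bytes type of this PEP.
data = bytearray([72, 105, 33])

first = data[0]          # indexing yields an int, not a 1-character string
length = len(data)       # number of stored bytes

data[2] = 63             # an in-range assignment mutates the object in place

try:
    data[0] = 256        # outside the range 0 to 255
except ValueError as exc:
    range_error = exc

try:
    data[0] = "H"        # not an integer
except TypeError as exc:
    type_error = exc
```

Both failed assignments leave the object unchanged, so data still holds the bytes for "Hi?" afterwards.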

The constructor of the bytes object has the following signature:

bytes([initializer[, encoding]])

If no arguments are provided then a bytes object containing zero elements is created and returned. The initializer argument can be a string (in 2.6, either str or unicode), an iterable of integers, or a single integer. The pseudo-code for the constructor (optimized for clear semantics, not for speed) is:

def bytes(initializer=0, encoding=None):
    if isinstance(initializer, int):  # In 2.6, int -> (int, long)
        initializer = [0]*initializer
    elif isinstance(initializer, basestring):
        if isinstance(initializer, unicode):  # In 3.0, "if True"
            if encoding is None:
                # In 3.0, raise TypeError("explicit encoding required")
                encoding = sys.getdefaultencoding()
            initializer = initializer.encode(encoding)
        initializer = [ord(c) for c in initializer]
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        tmp = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            tmp.append(c)
        initializer = tmp
    new = <new bytes object of length len(initializer)>
    for i, c in enumerate(initializer):
        new[i] = c
    return new
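
As a rough runnable counterpart to the pseudo-code, the sketch below follows the 3.0 branch of the comments (a string initializer requires an explicit encoding). The name make_bytes is hypothetical, and returning a bytearray is an assumption based on PEP 3137; it is not the reference implementation.

```python
def make_bytes(initializer=0, encoding=None):
    """Hypothetical helper mirroring the constructor rules with 3.0 semantics."""
    if isinstance(initializer, int):
        # A single integer requests that many zero bytes.
        values = [0] * initializer
    elif isinstance(initializer, str):
        if encoding is None:
            raise TypeError("explicit encoding required")
        values = list(initializer.encode(encoding))
    else:
        if encoding is not None:
            raise TypeError("no encoding allowed for this initializer")
        values = []
        for c in initializer:
            if not isinstance(c, int):
                raise TypeError("initializer must be iterable of ints")
            if not 0 <= c < 256:
                raise ValueError("initializer element out of range")
            values.append(c)
    # Assumption: bytearray plays the role of the new mutable bytes object.
    new = bytearray(len(values))
    for i, c in enumerate(values):
        new[i] = c
    return new


empty = make_bytes()
zeros = make_bytes(3)
encoded = make_bytes("\u00e9", "utf-8")
from_ints = make_bytes([10, 20, 30])
```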

The .__repr__() method returns a string that can be evaluated to generate a new bytes object containing a bytes literal:

>>> bytes([10, 20, 30])
b'\n\x14\x1e'
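
The round trip from repr back to an equal object can be checked in Python 3. This sketch uses the immutable Python 3 bytes type and ast.literal_eval (a safer stand-in for a bare eval); both choices are assumptions made for illustration.

```python
import ast

original = bytes([10, 20, 30])
text = repr(original)                # the bytes literal as a string
rebuilt = ast.literal_eval(text)     # evaluate the literal without running code
```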

The object has a .decode() method equivalent to the .decode() method of the str object. The object has a classmethod .fromhex() that takes a string of characters from the set [0-9a-fA-F ] and returns a bytes object (similar to binascii.unhexlify). For example:

>>> bytes.fromhex('5c5350ff')
b'\\SP\xff'
>>> bytes.fromhex('5c 53 50 ff')
b'\\SP\xff'

The object has a .hex() method that does the reverse conversion (similar to binascii.hexlify):

>>> bytes([92, 83, 80, 255]).hex()
'5c5350ff'
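
The two conversions can be compared with their binascii counterparts. This is a small sketch for Python 3.5 or later (where .hex() is available); the variable names are illustrative.

```python
import binascii

raw = bytes.fromhex("5c 53 50 ff")        # spaces between byte pairs are accepted
hex_text = raw.hex()                       # str of lowercase hex digits
round_trip = bytes.fromhex(hex_text)

via_unhexlify = binascii.unhexlify("5c5350ff")
via_hexlify = binascii.hexlify(raw)        # note: returns bytes, not str
```

One difference worth noting is that binascii.hexlify returns a bytes object, while .hex() returns a str.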

The bytes object has some methods similar to list methods, and others similar to str methods. Here is a complete list of methods, with their approximate signatures:

.__add__(bytes) -> bytes
.__contains__(int | bytes) -> bool
.__delitem__(int | slice) -> None
.__delslice__(int, int) -> None
.__eq__(bytes) -> bool
.__ge__(bytes) -> bool
.__getitem__(int | slice) -> int | bytes
.__getslice__(int, int) -> bytes
.__gt__(bytes) -> bool
.__iadd__(bytes) -> bytes
.__imul__(int) -> bytes
.__iter__() -> iterator
.__le__(bytes) -> bool
.__len__() -> int
.__lt__(bytes) -> bool
.__mul__(int) -> bytes
.__ne__(bytes) -> bool
.__reduce__(...) -> ...
.__reduce_ex__(...) -> ...
.__repr__() -> str
.__reversed__() -> bytes
.__rmul__(int) -> bytes
.__setitem__(int | slice, int | iterable[int]) -> None
.__setslice__(int, int, iterable[int]) -> None
.append(int) -> None
.count(int) -> int
.decode(str) -> str | unicode  # in 3.0, only str
.endswith(bytes) -> bool
.extend(iterable[int]) -> None
.find(bytes) -> int
.index(bytes | int) -> int
.insert(int, int) -> None
.join(iterable[bytes]) -> bytes
.partition(bytes) -> (bytes, bytes, bytes)
.pop([int]) -> int
.remove(int) -> None
.replace(bytes, bytes) -> bytes
.rindex(bytes | int) -> int
.rpartition(bytes) -> (bytes, bytes, bytes)
.split(bytes) -> list[bytes]
.startswith(bytes) -> bool
.reverse() -> None
.rfind(bytes) -> int
.rsplit(bytes) -> list[bytes]
.translate(bytes, [bytes]) -> bytes
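
To illustrate the mix of list-like and str-like methods, the sketch below exercises a few of them on a Python 3 bytearray, which is assumed here to be the closest existing equivalent of the proposed type. It covers only a sample of the list above.

```python
# List-like methods operate on individual integers.
seq = bytearray(b"ab")
seq.append(99)            # b"abc"
seq.extend([100, 101])    # b"abcde"
seq.insert(0, 122)        # b"zabcde"
last = seq.pop()          # removes and returns 101
seq.remove(122)           # b"abcd"
seq.reverse()             # b"dcba"

# Str-like methods operate on byte sequences.
record = bytearray(b"key=value")
position = record.find(b"=")
head, sep, tail = record.partition(b"=")
fields = record.split(b"=")
joined = bytearray(b"-").join(fields)
replaced = record.replace(b"value", b"v")

# Containment accepts either an int or a byte sequence.
has_int = 61 in record
has_sub = b"ey" in record
```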

Note the conspicuous absence of .isupper(), .upper(), and friends. (But see “Open Issues” below.) There is no .__hash__() because the object is mutable. There is no use case for a .sort() method.
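
The consequence of having no .__hash__() can be seen with Python 3's bytearray, used here as an assumed stand-in for the mutable type: it cannot be a dictionary key, whereas an immutable copy can.

```python
mutable = bytearray(b"key")

try:
    hash(mutable)
    hashable = True
except TypeError:
    # Mutable byte sequences are unhashable.
    hashable = False

# An immutable copy can serve as a dictionary key instead.
lookup = {bytes(mutable): "value"}

has_sort = hasattr(mutable, "sort")
```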

The bytes type also supports the buffer interface, supporting reading and writing binary (but not character) data.
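
In Python 3 the buffer interface is exposed through memoryview and through I/O calls such as readinto(). The sketch below assumes bytearray as the mutable type and uses an in-memory io.BytesIO stream purely for illustration.

```python
import io

buf = bytearray(4)
view = memoryview(buf)       # shares memory with buf, no copy is made

view[0] = 0x41               # writing through the view changes buf

# Binary I/O can fill the remaining bytes in place.
stream = io.BytesIO(b"xyz")
count = stream.readinto(view[1:])
```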

Out of Scope Issues

Python 3k will have a much different I/O subsystem. Deciding how that I/O subsystem will work and interact with the bytes object is out of the scope of this PEP. The expectation however is that binary I/O will read and write bytes, while text I/O will read strings. Since the bytes type supports the buffer interface, the existing binary I/O operations in Python 2.6 will support bytes objects.
It has been suggested that a special method named .__bytes__() be added to the language to allow objects to be converted into byte arrays. This decision is out of scope.
A bytes literal of the form b"..." is also proposed. This is the subject of PEP 3112.

Open Issues

The .decode() method is redundant since a bytes object b can also be decoded by calling unicode(b, <encoding>) (in 2.6) or str(b, <encoding>) (in 3.0). Do we need encode/decode methods at all? In a sense the spelling using a constructor is cleaner.
Need to specify the methods still more carefully.
Pickling and marshalling support need to be specified.
Should all those list methods really be implemented?
A case could be made for supporting .ljust(), .rjust(), .center() with a mandatory second argument.
A case could be made for supporting .split() with a mandatory argument.
A case could even be made for supporting .islower(), .isupper(), .isspace(), .isalpha(), .isalnum(), .isdigit() and the corresponding conversions (.lower() etc.), using the ASCII definitions for letters, digits and whitespace. If this is accepted, the cases for .ljust(), .rjust(), .center() and .split() become much stronger, and they should have default arguments as well, using an ASCII space or all ASCII whitespace (for .split()).
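
For the first open issue above, the two spellings of decoding can be compared directly in Python 3, where both survived. The sketch uses the 3.0 form str(b, <encoding>); the variable names are illustrative.

```python
payload = bytes([104, 105])

via_method = payload.decode("ascii")       # method spelling
via_constructor = str(payload, "ascii")    # constructor spelling
```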

Frequently Asked Questions

Q: Why have the optional encoding argument when the encode method of Unicode objects does the same thing?

A: In the current version of Python, the encode method returns a str object and we cannot change that without breaking code. The construct bytes(s.encode(...)) is expensive because it has to copy the byte sequence multiple times. Also, Python generally provides two ways of converting an object of type A into an object of type B: ask an A instance to convert itself to a B, or ask the type B to create a new instance from an A. Depending on what A and B are, both APIs make sense; sometimes reasons of decoupling require that A can’t know about B, in which case you have to use the latter approach; sometimes B can’t know about A, in which case you have to use the former.
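
Both conversion directions described in this answer exist in Python 3 and give the same result. The sketch below shows them side by side, and also shows that the constructor rejects a string without an encoding, matching the 3.0 comment in the constructor pseudo-code.

```python
text = "na\u00efve"

via_constructor = bytes(text, "utf-8")   # ask type B to build itself from an A
via_method = text.encode("utf-8")        # ask the A instance to convert itself

try:
    bytes(text)                          # no encoding supplied
    needs_encoding = False
except TypeError:
    needs_encoding = True
```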

Q: Why does bytes ignore the encoding argument if the initializer is a str? (This only applies to 2.6.)

A: There is no sane meaning that the encoding can have in that case. str objects are byte arrays and they know nothing about the encoding of character data they contain. We need to assume that the programmer has provided a str object that already uses the desired encoding. If you need something other than a pure copy of the bytes then you need to first decode the string. For example:

bytes(s.decode(encoding1), encoding2)
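
A concrete instance of this decode-then-encode pattern, written for Python 3 where the byte-oriented 2.6 str corresponds to bytes. The choice of Latin-1 as encoding1 and UTF-8 as encoding2 is only an example.

```python
raw_latin1 = b"caf\xe9"                       # "café" encoded as Latin-1

# Decode with the source encoding, then build bytes in the target encoding.
as_utf8 = bytes(raw_latin1.decode("latin-1"), "utf-8")
```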

Q: Why not have the encoding argument default to Latin-1 (or some other encoding that covers the entire byte range) rather than ASCII?

A: The system default encoding for Python is ASCII. It seems least confusing to use that default. Also, in Py3k, using Latin-1 as the default might not be what users expect. For example, they might prefer a Unicode encoding. Any default will not always work as expected. At least ASCII will complain loudly if you try to encode non-ASCII data.
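
The "complain loudly" behaviour can be demonstrated by naming each encoding explicitly. The explicit arguments matter because in Python 3 str.encode() defaults to UTF-8 rather than ASCII; the sample text is arbitrary.

```python
text = "caf\u00e9"

try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    # ASCII refuses non-ASCII characters instead of guessing.
    ascii_failed = True

latin1_bytes = text.encode("latin-1")   # succeeds silently, one byte per character
utf8_bytes = text.encode("utf-8")       # succeeds with a different byte sequence
```

The two successful encodings yield different bytes for the same text, which is why a silent default could surprise users.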

Copyright

This document has been placed in the public domain.

Source: https://github.com/python/peps/blob/main/peps/pep-0358.rst

Last modified: 2025-02-01 08:59:27 GMT
