Python Enhancement Proposals

Python »
PEP Index »
PEP 461

PEP 461 – Adding % formatting to bytes and bytearray

Author:: Ethan Furman <ethan at stoneleaf.us>
Status:

Table of Contents

Abstract

This PEP proposes adding % formatting operations similar to Python 2’sstrtype tobytes andbytearray[1][2].

While interpolation is usually thought of as a string operation, there arecases where interpolation onbytes orbytearrays make sense, and thework needed to make up for this missing functionality detracts from the overallreadability of the code.

Motivation

With Python 3 and the split betweenstr andbytes, one small butimportant area of programming became slightly more difficult, and much morepainful – wire format protocols[3].

This area of programming is characterized by a mixture of binary data andASCII compatible segments of text (aka ASCII-encoded text). Bringing back arestricted %-interpolation forbytes andbytearray will aid both inwriting new wire format code, and in porting Python 2 wire format code.

Common use-cases includedbf andpdf file formats,emailformats, andFTP andHTTP communications, among many others.

Proposed semantics for`bytes` and`bytearray` formatting

%-interpolation

All the numeric formatting codes (d,i,o,u,x,X,e,E,f,F,g,G, and any that are subsequently addedto Python 3) will be supported, and will work as they do for str, includingthe padding, justification and other related modifiers (currently#,0,-, space, and+ (plus any added to Python 3)). The onlynon-numeric codes allowed arec,b,a, ands (which is asynonym for b).

For the numeric codes, the only difference betweenstr andbytes (orbytearray) interpolation is that the results from these codes will beASCII-encoded text, not unicode. In other words, for any numeric formattingcode%x:

b"%x"%val

is equivalent to:

("%x"%val).encode("ascii")

Examples:

>>>b'%4x'%10b'   a'>>>b'%#4x'%10' 0xa'>>>b'%04X'%10'000A'

%c will insert a single byte, either from anint in range(256), or fromabytes argument of length 1, not from astr.

Examples:

>>>b'%c'%48b'0'>>>b'%c'%b'a'b'a'

%b will insert a series of bytes. These bytes are collected in one of twoways:

input type supportsPy_buffer[4]?use it to collect the necessary bytes
input type is something else?use its__bytes__ method[5] ; if there isn’t one, raise aTypeError

In particular,%b will not accept numbers norstr.str is rejectedas the string to bytes conversion requires an encoding, and we are refusing toguess; numbers are rejected because:

what makes a number is fuzzy (float? Decimal? Fraction? some user type?)
allowing numbers would lead to ambiguity between numbers and textualrepresentations of numbers (3.14 vs ‘3.14’)
given the nature of wire formats, explicit is definitely better than implicit

%s is included as a synonym for%b for the sole purpose of making 2/3 codebases easier to maintain. Python 3 only code should use%b.

Examples:

>>>b'%b'%b'abc'b'abc'>>>b'%b'%'some string'.encode('utf8')b'some string'>>>b'%b'%3.14Traceback (most recent call last):...TypeError:b'%b' does not accept 'float'>>>b'%b'%'hello world!'Traceback (most recent call last):...TypeError:b'%b' does not accept 'str'

%a will give the equivalent ofrepr(some_obj).encode('ascii','backslashreplace') on the interpolatedvalue. Use cases include developing a new protocol and writing landmarksinto the stream; debugging data going into an existing protocol to see ifthe problem is the protocol itself or bad data; a fall-back for a serializationformat; or any situation where defining__bytes__ would not be appropriatebut a readable/informative representation is needed[6].

%r is included as a synonym for%a for the sole purpose of making 2/3code bases easier to maintain. Python 3 only code use%a[7].

Examples:

>>>b'%a'%3.14b'3.14'>>>b'%a'%b'abc'b"b'abc'">>>b'%a'%'def'b"'def'"

Compatibility with Python 2

As noted above,%s and%r are being included solely to help easemigration from, and/or have a single code base with, Python 2. This isimportant as there are modules both in the wild and behind closed doors thatcurrently use the Python 2str type as abytes container, and henceare using%s as a bytes interpolator.

However,%b and%a should be used in new, Python 3 only code, so%sand%r will immediately be deprecated, but not removed from the 3.x series[7].

Proposed variations

It has been proposed to automatically use.encode('ascii','strict') forstr arguments to%b.

Rejected as this would lead to intermittent failures. Better to have theoperation always fail so the trouble-spot can be correctly fixed.

It has been proposed to have%b return the ascii-encoded repr when thevalue is astr (b’%b’ % ‘abc’ –> b“‘abc’”).

Rejected as this would lead to hard to debug failures far from the problemsite. Better to have the operation always fail so the trouble-spot can beeasily fixed.

Originally this PEP also proposed adding format-style formatting, but it wasdecided that format and its related machinery were all strictly text (akastr) based, and it was dropped.

Various new special methods were proposed, such as__ascii__,__format_bytes__, etc.; such methods are not needed at this time, but canbe visited again later if real-world use shows deficiencies with this solution.

A competing PEP,PEP 460 Add binary interpolation and formatting,also exists.

Objections

The objections raised against this PEP were mainly variations on two themes:

thebytes andbytearray types are for pure binary data, with noassumptions about encodings
offering %-interpolation that assumes an ASCII encoding will be anattractive nuisance and lead us back to the problems of the Python 2str/unicode text model

As was seen during the discussion,bytes andbytearray are also usedfor mixed binary data and ASCII-compatible segments: file formats such asdbf andpdf, network protocols such asftp andemail, etc.

bytes andbytearray already have several methods which assume an ASCIIcompatible encoding.upper(),isalpha(), andexpandtabs() to namejust a few. %-interpolation, with its very restricted mini-language, will notbe any more of a nuisance than the already existing methods.

Some have objected to allowing the full range of numeric formatting codes withthe claim that decimal alone would be sufficient. However, at least twoformats (dbf and pdf) make use of non-decimal numbers.