Movatterモバイル変換


[0]ホーム

URL:


— FREE Email Series —

🐍 Python Tricks 💌

Python Tricks Dictionary Merge

🔒 No spam. Unsubscribe any time.

Browse TopicsGuided Learning Paths
Basics Intermediate Advanced
aialgorithmsapibest-practicescareercommunitydatabasesdata-sciencedata-structuresdata-vizdevopsdjangodockereditorsflaskfront-endgamedevguimachine-learningnewsnumpyprojectspythonstdlibtestingtoolsweb-devweb-scraping

Table of Contents

Bytes Objects: Handling Binary Data in Python

Bytes Objects: Handling Binary Data in Python

byBartosz ZaczyńskiReading time estimate 1h 9mintermediatepython

Table of Contents

Remove ads

Thebytes data type is an immutable sequence of unsigned bytes used for handling binary data in Python. You can create abytes object using the literal syntax, thebytes() function, or thebytes.fromhex() method. Sincebytes are closely related to strings, you often convert between the two data types, applying the correct character encoding.

By the end of this tutorial, you’ll understand that:

  • Pythonbytes objects are immutable sequences of unsigned bytes used for handling binary data.
  • Thedifference betweenbytes andbytearray is thatbytes objects are read-only, whilebytearray objects are mutable.
  • You convert aPython string tobytes using thestr.encode() method, thebytes() function, or thecodecs module.
  • Endianness refers to the byte order used to represent binary data in memory, which can be either little-endian or big-endian.

This tutorial starts with a brief overview of binary data fundamentals, setting the scene for the remaining part, which delves into creating and manipulatingbytes objects in Python. Along the way, it touches on related topics, such asbytearray, bytes-like objects, and the buffer protocol. To top it off, you’ll find several real-life examples and exercises at the end, which demonstrate the concepts discussed.

To get the most out of this tutorial, you should be familiar withPython basics, particularlybuilt-in data types.

Get Your Code:Click here to download the free sample code that you’ll use to learn about bytes objects and handling binary data in Python.

Take the Quiz: Test your knowledge with our interactive “Python Bytes” quiz. You’ll receive a score upon completion to help you track your learning progress:


Bytes Objects: Handling Binary Data in Python

Interactive Quiz

Python Bytes

In this quiz, you'll test your understanding of Python bytes objects. By working through this quiz, you'll revisit the key concepts related to this low-level data type.

Brushing Up on Binary Fundamentals

If you’re new to binary data or need a quick refresher, then consider sticking around. This section will provide a brief overview of binary representations, emphasizing a Python programmer’s perspective. On the other hand, if you’re already comfortable with the basics, then feel free to dive right intocreatingbytes objects in Python.

Bits, Bytes, and Binary Data

Virtually every piece of information, from books and music to movies, can be stored asbinary data in a computer’s memory. The wordbinary implies that the information is stored as a sequence ofbinary digits, orbits for short. Each bit can hold a value of either one or zero, which is particularly well-suited for storage in electronic devices since they often use distinct voltage levels to represent these binary states.

For example, the binary sequence below may represent the color of a pixel in an image:

 
1 1 0 0 0 0 1 0 0 0 1 1 0 0 1 1 0 0 1 0 0 1

To make the interpretation of such binary sequences more systematic, you often arrange the individual bits into uniform groups. The standard unit of information in modern computing consists of exactly eight bits, which is why it’s sometimes known as anoctet, although most people call it abyte. A single 8-bit byte allows for 256 possible bit combinations (28).

With this in mind, you can break up the bit sequence above into these three bytes:

   
0 0 1 1 0 0 0 01 0 0 0 1 1 0 01 1 0 0 1 0 0 1

Notice that the leftmost byte has been padded with two leading zeros to ensure a consistent number of bits across all bytes. Together, they form a24-bit color depth, letting you choose from more than 16 million (224) unique colors per pixel.

In this case, each byte corresponds to one of threeprimary colors (red, green, and blue) within theRGB color model, effectively serving as coordinates in theRGB color space. Changing their proportions can be loosely compared to mixing paints to achieve a desired hue.

Note: Strictly speaking, the RGB color model is anadditive one, meaning it combines specific wavelengths of light to synthesize complex colors. In contrast, paint mixing follows asubtractive model, where pigments absorb certain wavelengths of light from the visible spectrum.

To reveal the pixel’s primary colors as decimal numbers, you can open thePython REPL and definebinary literals by prefixing the corresponding bit sequences with0b:

Python
>>>0b00110000,0b10001100,0b11001001(48, 140, 201)

Binary literals are an alternative way of defining integers in Python. Other types of numeric literals include hexadecimal and octal. For example, you can represent the integer48 as0x30 in hexadecimal or0o60 in octal, allowing you to write the same number differently.

Having such flexibility comes in handy since it’s customary to express byte values using thehexadecimal numeral system. By rewriting each byte as atwo-digit hex number, you can represent your pixel color much more compactly compared to the equivalent binary sequence:

Python
>>>hex(48),hex(140),hex(201)('0x30', '0x8c', '0xc9')>>>int("308cc9",base=16)3181769>>>int("001100001000110011001001",base=2)3181769

Calling the built-inhex() function on an integer returns the corresponding hexadecimal literal as a string. When you combine the resulting hex numbers, you’re able to describe a 24-bit color with just six digits (308cc9). Go ahead and open anonline color picker to see what that encoded value looks like:

Google's Color Picker Tool

You can play around with this tool to explore different primary color combinations and see how they affect the resulting hexadecimal value.

The RGB model deliberately fits each color channel onto a byte, making this representation simple and memory-efficient. However, if you wanted to use a greater color depth to achieve higher fidelity, then you’d need to allocate more than one byte per channel. Furthermore, most computers read and write data in chunks of certain sizes for better performance, often requiring your data to be properlyaligned.

Bits and bytes are fundamental building blocks of digital data, giving structure and meaning to your binary sequences. In the next section, you’ll learn how to manage binary numbers longer than a byte and how to cope with the associated challenges.

Binary Words and Endianness

When single bytes don’t cut it, you can group them into larger units of information calledwords. The size of abinary word in bytes is usually a multiple of a power of two, as reflected by most data types native to popular programming languages. For example, here are a few basic types supported byJava with their byte sizes:

TypeSizeMinimum ValueMaximum Value
byte20 = 1 byte-128127
short21 = 2 bytes-32,76832,767
int22 = 4 bytes-2,147,483,6482,147,483,647
long23 = 8 bytes-9,223,372,036,854,775,8089,223,372,036,854,775,807

Neither Java nor Python provides a built-in numeric type capable of representing a three-byte value, effectively forcing you to use more memory than necessary for each pixel. Alternatively, you could create a custom data structure to store pixels as byte triplets, but you would inevitably introduce additional complexity and performance overhead in your code.

Note that as soon as you start using more than one byte to represent a value, it raises an interesting conundrum known as theendianness orbyte order. Depending on how you decide to store your multi-byte numbers in memory, their interpretation can vary significantly.

The following example demonstrates this using the integer’s.to_bytes() method, which you’ll explore in more detail later:

Python
>>>color=0x308cc9>>>list(color.to_bytes(length=4,byteorder="little"))[201, 140, 48, 0]>>>list(color.to_bytes(length=4,byteorder="big"))[0, 48, 140, 201]

You specify an RGB color as a numeric value using the hexadecimal notation. By calling.to_bytes() with thelength=4 parameter, you convert the number into a sequence of four bytes, which is why you see an extra byte with the value of zero in the output. Thebyteorder parameter allows you to control the order in which these bytes are arranged.

Endianness is primarily expressed as big-endian or little-endian. Inlittle-endian, you arrange the subsequent bytes starting with theleast significant byte—in this case, hexadecimal0xc9 or decimal201—at the lowest address. Conversely, inbig-endian, you first start with the most significant byte, which is zero. This is very much like choosing whether to read a sentence forward or backward, word by word.

When you apply the wrong endianness to a piece of binary data, you’ll likely get an incorrect value. Here’s another code snippet that illustrates this with Python’sbytes data type, which you’ll learn about soon:

Python
>>>data=bytes([0,48,140,201])>>>int.from_bytes(data,byteorder="little")3381407744>>>int.from_bytes(data,byteorder="big")3181769

Notice how different the resulting numbers are, even though you derived them from the same sequence of bytes! This goes to show the importance of choosing the right byte order when working with longer byte sequences.

Note: Contrary to popular belief, the completebit sequences produced by little-endian and big-endian orders aren’t mirror reflections of each other:

Python
>>>little_endian=format(3381407744,"032b")>>>big_endian=format(3181769,"032b")>>>print(little_endian,big_endian,sep="\n")1100100110001100001100000000000000000000001100001000110011001001>>>little_endian==big_endian[::-1]False

If you look closely at the two bit patterns above, then you’ll quickly notice that they don’t match when one is reversed. Remember that it’s the wholebytes that are stored in reverse order, but the internalbit numbering within each byte remains consistent and unambiguous:

Python
>>>importtextwrap>>>print(..." ".join(textwrap.wrap(little_endian,8)),..." ".join(textwrap.wrap(big_endian,8)),...sep="\n"...)11001001 10001100 00110000 0000000000000000 00110000 10001100 11001001

For example, the bit sequence in the leftmost byte of the first highlighted line is identical to that of the rightmost byte of the following line. Similarly, the next byte from the left on the first line matches the second byte from the right on the line below, and so on. In fact, both binary values consist of the same set of four bytes, just placed in reverse order.

How do you know which byte order to use? Usually, you don’t care because Python works hard to remain neutral about the byte order as much as possible, making your code portable across platforms that may have different endianness. To find out your hardware’s native byte order, use thesys module:

Python
>>>importsys>>>sys.byteorder'little'

This means that Python internally stores values in little-endian byte order, which is what most computers use these days. That said, if you’re still on an older Apple computer or work with some enterprise systems, then you might stumble on big-endian or evenbi-endian.

As long as you don’t interface with the outside world, you can ignore these differences in byte order. Unfortunately, that’s only one side of the coin. If you plan to communicate with other systems over the network or exchange data through binary files—especially using Python’s low-level constructs—then you might need to ensure appropriate conversion yourself. Otherwise, you’re gambling with potential data corruption at some point.

Note: When you decide to process binary data at a low level, always check the file format or transmission protocol specification to ensure the correct byte order. For example, major network protocols have adopted big-endian, sometimes callednetwork byte order in this context, as a standard. Also, beware of mixed byte orders in certain file formats, like multimedia containers.

You can represent arbitrary data in digital form by combining bytes into longer sequences. It’s entirely up to you how you want to interpret these byte sequences, though. Depending on the context, the same piece of binary data can mean different things.

So far, you’ve learned about bits, bytes, binary words, and endianness. But, to fully understand how to handle binary data, you’ll need to delve into another key concept: signedness.

Signedness and the Sign Bit

As mentioned earlier, an 8-bit byte consists of 256 unique bit combinations (28). It can represent either a smallunsigned integer ranging from 0 to 255, or asigned integer in the range of -128 to 127. Python only understandsunsigned bytes, but there are ways to emulate signed bytes should you need to—more on that later.

Note: You could theoretically represent afloating-point number on a single byte by encoding thesign, exponent, and mantissa using fewer bits. However, such a data type would suffer from extremely low precision and a limited range of values, rendering it impractical.

In some applications, however, you might encounter ahalf-precision data format, which strikes a balance between accuracy and storage. It encodes floating-point values on just two bytes instead of the usual four (single precision) or eight (double precision).

Therefore, it’s more common to think about individual bytes as small integers for pragmatic reasons.

Thesignedness of a number may play a role in certain data formats, such as those representing digital audio samples, which encode amplitude levels. When dealing with low-level binary data, you must know beforehand whether a given bit sequence is signed or unsigned. This applies to both single bytes and binary words comprised of more than one byte. Otherwise, you may misinterpret the data, akin to choosing the wrong endianness:

Python
>>>data=bytes([0b10101010])>>>int.from_bytes(data,signed=True)-86>>>int.from_bytes(data,signed=False)170

Here, for example, you tell Python to interpret the specified bit sequence as a signed integer and then as an unsigned one. Consequently, you get two completely different values. Note that this won’t always be the case:

Python
>>>data=bytes([0b01010101])>>>int.from_bytes(data,signed=True)85>>>int.from_bytes(data,signed=False)85

After having changed the bit sequence ever so slightly, you now get two identical values regardless of the mode. Read on to find out what’s different about this particular bit pattern!

If you haven’t dipped your toes into other programming languages, thenunsigned numbers might be a foreign concept to you. After all, Python’snumeric types let you store positive and negative values—as well as zero—at will. Such a design choice makes it straightforward to think about numbers when you solve programming problems, but it doesn’t always translate well to binary data formats. Why does this distinction at the binary level exist?

While unsigned types can’t store negative values, they double the range of positive values compared to signed types with the same number of bits. If you know for sure that your data will never include negative values, then choosing an unsigned type will maximize your storage efficiency. Think of tracking a video view count, for example.

This doubling of positive values stems from an extra bit that you gain by not having to keep the sign of your number. The bulk ofsigned number representations repurpose the most significant bit—located on the far left of the sequence—as asign bit, which indicates whether the number is positive or negative. However, that only leaves you with the remaining bits to represent the magnitude of the number.

When thesign bit is switched on (1), it means that the stored number is negative. On the other hand, when the sign bit is switched off (0), then the number is either positive or equal to zero. Tying this information back to the two previous examples, you can conclude the following:

  • The leftmost bit in the first bit pattern was equal to one, serving as the sign bit when you specifiedsigned=True. Conversely, in the unsigned mode, the same bit acted as a regular bit, contributing to the number’s value, which is why you saw two different results.
  • In the second bit pattern, however, the leftmost bit was zero, making no contribution to the number’s magnitude in either of the modes. Moreover, it didn’t affect the sign of the number, allowing it to remain positive or zero in both signed and unsigned interpretations.

Beneath the surface, modern computers represent signed integers as bit sequences usingtwo’s complement, which is by far the most efficient method for binary arithmetic and hardware design, making it a widely accepted standard. Things get more tricky in Python, which takes an adaptive approach to seamlessly store integers of arbitrary precision. It accomplishes this by using a custom representationwithout an explicit sign bit.

The technical details ofPython’s integer representations are beyond the scope of this tutorial. Still, you might want to know about various ways toemulate the sign bit when working with fixed-length binary sequences in Python. You’ll learn about them now.

Two’s Complement vs Python

Despite not having an explicit sign bit in their internal representations, the numeric types in Python are indeed signed. At the same time—as if that weren’t confusing enough—Python can only representunsigned bytes with itsbytes data type. So, if you try to define a byte sequence containing at least one negative value, then you’ll get aValueErrorexception:

Python
>>>bytes([-42])Traceback (most recent call last):...bytes([-42])~~~~~^^^^^^^ValueError:bytes must be in range(0, 256)

The error message tells you that each of the bytes in your sequence must fit between 0 and 255, inclusive. The 256 in the message is there because arange() object corresponds to aright-open interval, meaning it includes the start but not the end value.

To work around this limitation, you can choose from many options. Arguably, the most straightforward way to usetwo’s complement when reading or writing binary data in Python is to call an integer’s.to_bytes() and.from_bytes() methods with thesigned parameter set toTrue:

Python
>>>fornumberin[42,2,1,-1,-2,-42]:...binary_word=number.to_bytes(length=2,byteorder="big",signed=True)...bytes_binary=" ".join(format(byte,"08b")forbyteinbinary_word)...bytes_decimal=", ".join(format(byte,"3d")forbyteinbinary_word)...print(f"{number:3d}: [{bytes_binary}] [{bytes_decimal}]")... 42: [00000000 00101010] [  0,  42]  2: [00000000 00000010] [  0,   2]  1: [00000000 00000001] [  0,   1] -1: [11111111 11111111] [255, 255] -2: [11111111 11111110] [255, 254]-42: [11111111 11010110] [255, 214]>>>int.from_bytes(bytes([255,214]),byteorder="big",signed=True)-42

In this case, youloop over a few sample integers and convert them into binary words, each consisting of two unsigned bytes. Because you specify big-endian byte order, the most significant byte appears first. Setting thesigned parameter toTrue enables two’s complement, preventing anoverflow error during the conversion.

In two’s complement, you can represent negative numbers by inverting all the bits of their positive counterparts and adding one to the obtained value. For example, you derive -42 by first taking the binary representation of 42, which is00000000 00101010, inverting the bits, and then adding one to get11111111 11010110. The resulting bit pattern corresponds to two bytes, 255 and 214 in decimal, which encode the value -42 in two’s complement.

Later, you restore one of the binary words to its original form by usingint.from_bytes() and specifying matching argument values. Try changing those parameters to see their effect on the output.

Other techniques for dealing with signed bytes and words in Python include usingbitmasks and bit shifting, themodulo operator (%), as well as various modules that ship with Python:

TechniqueExample
Bitmaskbytes([-42 & 0x00ff, (-42 & 0xff00) >> 8])
Modulobytes([-42 % 2**8, -42 % 2**16 // 2**8])
Arrayarray.array("h", [-42]).tobytes()
C Structstruct.pack("<h", -42)
C Typesctypes.string_at(ctypes.byref(ctypes.c_short(-42)), 2)

These are only examples, so you don’t need to fully understand them right now. If you’d like to learn more about these methods, then check out the tutorial onbitwise operators in Python. Keep in mind thatarray andctypes rely on your system’s native byte order, so you’ll need to handle them with care.

Now that you have a solid understanding of binary data fundamentals, you can move on to explore how Python handles bytes and words. Next up, you’ll learn about thebytes data type in more depth, including bytes-like objects and the buffer protocol.

Getting to Know thebytes Object in Python

Python is a high-level programming language that comes withbatteries included, meaning it provides a rich set of abstractions for all sorts of tasks out of the box. But, even if you can’t find the desired tool for your specific problem in thestandard library, there’s a good chance athird-party package exists to fill the gap. In other words, writing Python code often feels like assembling a puzzle from prefabricated pieces.

Because of this, you rarely need a deep understanding of how Python works under the hood when crunching or pushing data around. But it doesn’t completely shield you from these aspects if you choose to engage with them.

One such scenario is manipulatingbinary data directly, which enables fine-grained control and can be valuable for educational purposes. Occasionally, it may be your only option when there’s no built-in alternative for your specific use case.

Note: Unless you have a compelling reason, it’s generally preferable to stick to the available high-level abstractions rather than reinvent the wheel. Not only will you save time, but you’ll also ensure better performance in most cases and increase reliability and convenience. Only go down the low-level path when absolutely necessary!

Without further ado, it’s time to introduce a few lesser-known data types that allow you to handle raw binary data in Python.

Bytes-Like Objects andbytes

Binary data can be interpreted and manipulated differently depending on its structure. At a fundamental level, you can view a chunk of binary data in two primary ways:

  1. As Individual Bytes: An ordered series of byte values, each carrying some meaning
  2. As a Complex Structure: Multiple fields packed together, with fixed or varying sizes

The image below illustrates these concepts by showing the binary representation of ASCII characters, uniformly sized pixel values, and various fields within image metadata:

Single-Byte vs Multi-Byte Oriented Data

A text file is typically a sequence of simple bytes representing characters. Unless you want to encode non-ASCII letters, which may require more than one byte per character, each byte corresponds directly to a character’s ordinal value in the givencode page. For example, the hexadecimal byte value0x52 (82 in decimal) represents the letterR in ASCII.

Many image file formats store pixels as a sequence of three-byte words consisting of red, green, and blue primary colors. However, in the example above, each pixel has an extra byte for thealpha channel, representing the transparency level. This makes each pixel four bytes long. For example, the first pixel comprises the hexadecimal values0x30 for red,0x8c for green,0xc9 for blue, and0xff for the alpha channel, which indicates full opacity.

Meanwhile, an imagefile header contains structured metadata to help graphics-editing software correctly interpret the pixels. Such metadata may include details like image dimensions—for example, 1920 pixels wide by 1080 pixels high—a 32-bit color depth, andGPS coordinates captured by the camera when the photo was taken. These fields are packed together one after another as distinct binary words whose sizes you must know upfront.

If you plan to work withsingle bytes in Python, then you’ll get the greatest level of control by using one of its built-inbinary sequence types:

  • bytes: A read-only sequence of unsigned bytes
  • bytearray: Amutable counterpart ofbytes
  • memoryview: A lightweight view into the memory of abytes-like object

Conversely, when working withmulti-byte fields, you’ll likely prefer higher-level abstractions from the standard library, such as:

  • array.array: A homogeneous sequence of numeric values
  • struct: A module for working with C-style data structures
  • ctypes: A foreign function library for manipulating C data types
  • io.BytesIO: Afile-like object for handling binary data streams

The three built-in types listed earlier, along with thearray data structure above, are examples of Python’sbytes-like objects designed for efficient sharing of large amounts of binary data.

Note: You can use amemoryview to work with both byte-oriented and multi-byte-oriented data, as it automatically computes memory offsets for given index positions based on the underlying item size.

To read the data depicted in the diagram above, you can use Python’s bytes-like objects, the modules mentioned, or you can hand-craft a list of byte values using hexadecimal literals as follows:

Python
>>>data=[0x52,0x65,0x61,0x6c,0x20,0x50,0x79,0x74,0x68,0x6f,0x6e]>>>"".join(map(chr,data))'Real Python'

Aftermapping each byte value into the corresponding character with Python’schr() function, you.join() them into the string'Real Python'. Keep in mind that this approach only works for byte values within theASCII range. To decipher foreignUnicode characters, which may be encoded using more than one byte, you’ll have to specify the respectivecharacter encoding:

Python
>>>data=[0x63,0x61,0x66,0xc3,0xa9]>>>bytes(data).decode("utf-8")'café'>>>"".join(map(chr,data))'café'

Theaccented letteré uses two bytes when represented in the UTF-8 encoding. Therefore, you wrap your list of hexadecimal values in abytes() object and call its.decode() method with an appropriate encoding. Otherwise, when you try to interpret each byte individually as before, you end up with the garbled text'café'.

For image pixels, which occupy a fixed number of bytes each, Python’s numericarray is a more suitable choice:

Python
>>>fromarrayimportarray>>>fromPILimportImage>>>pixels=array("I",bytes([...0x30,0x8c,0xc9,0xff,...0xfc,0xba,0x03,0xff,...0x25,0xba,0x34,0xff,......0xfc,0xba,0x03,0xff,...]))>>>image=Image.new(mode="RGBA",size=(1920,1080))>>>image.putdata(pixels)>>>image.save("image.png")

You can pass your array straight to aPillow image instance, making sure to match theimage mode and dimensions correctly. Additionally, you may sometimes need toswap the bytes in the array elements, depending on your system’s endianness.

Note: In practice, most users who work with Pillow choose NumPy arrays over Python arrays due to their efficiency and versatility. You’ll see an example of this in the next section.

Unlike the individual ASCII characters or image pixels, a file header typically consists of multiple data fields with varying sizes, which are placed next to each other. While you could read them one at a time, thestruct module offers a more concise and convenient way to untangle such a group of fields all at once:

Python
>>>importstruct>>>struct.unpack("<HHBff",bytes([...0x80,0x07,...0x38,0x04,...0x20,...0x9c,0x37,0x48,0x42,...0x08,0x7c,0x9f,0x41...]))(1920, 1080, 32, 50.05430603027344, 19.935562133789062)

The first argument tostruct.unpack() is aformat string that specifies how the sequence of bytes passed as the second argument should be interpreted. In this case, the less-than symbol (<) determines little-endian byte order, while the following characters correspond to the expected types of the subsequent data fields. Here they are:

  • HH: Two consecutive unsigned shorts (16-bit integers)
  • B: One unsigned byte (8-bit integer)
  • ff: Two single-precision floats (32-bit float)

As a result, you get a tuple of five elements representing the decoded values from the header according to the specified format.

Okay. You’ve now seen an overview of the tools available in Python for handling different types of binary data, most of which are considered bytes-like objects. One of them is thebytes data type, which you’ll explore in more detail later in this tutorial. But first, you’ll learn what makes bytes-like objects special.

Python’s Buffer Protocol

Bytes-like objects conform to thebuffer protocol, which describes a standard interface for objects to directly access their internaldata buffers. This mechanism eliminates the need for expensive copying when you slice or pass large amounts of data around, improving both performance and memory use.

So far, you’ve only seen a few bytes-like objects built into Python, which can be useful for working with binary data at a low level. However, the real power of the underlying buffer protocol lies elsewhere: it can facilitate efficient interoperability between third-party libraries. For example, you may feed a Pillow image with aNumPy array instead of a Pythonarray of pixels:

Pythongenerate_image.py
importnumpyasnpfromPILimportImagepixels=np.random.default_rng().random((1080,1920,3))*255image=Image.fromarray(pixels.astype("uint8"))image.save("image.png")

This code snippet creates a Pillow image from a NumPy array filled with random pixel values and saves it to a file in your current working directory using thePortable Network Graphics (PNG) format.

Both libraries leverage compiledC extension modules for the heavy lifting to maximize performance. Since the Python interpreter loads them into the same system process, they share a common address space. NumPy exposes an array-like region of memory to Python through the buffer protocol. This allows Pillow to read and manipulate data managed by NumPy without unnecessary copying, even for arrays containing millions of pixels.

Although you can exchange data in a similar fashion using alternative methods, the buffer protocol offers several advantages:

  • Interoperability: It’s widely supported by popular Python libraries, especially those in the scientific ecosystem, ensuring their seamless integration.
  • Efficiency: It allows raw memory to be accessed directly without the overhead of copying or type conversions.
  • Safety: It relies on Python’s automatic memory management, reducing the risk ofmemory leaks andsegmentation faults.
  • Convenience: It supports complex memory layouts by handlingstrides and offsets for you.

It’s worth noting that the buffer protocol is a low-level C API, which hasn’t traditionally been accessible in pure Python. Previously, implementing a custom bytes-like object required writing and building a C extension module. However, this changed with the introduction ofPEP 688 inPython 3.12, which brought two newspecial methods:

  1. .__buffer__(): Returns the buffer wrapped in amemoryview object
  2. .__release_buffer__(): Optionally frees any buffer-related resources

These methods let you implement the buffer protocol in your ownPython classes, making them usable as buffers in compiled code. The buffer itself might be allocated by a C extension module or a built-in data type likebytes orarray.

You can quickly check whether a given object supports the buffer protocol by callingmemoryview() on it:

Python
>>>text_bytes=bytes([0x63,0x61,0x66,0xc3,0xa9])>>>text=text_bytes.decode("utf-8")>>>memoryview(text_bytes)<memory at 0x764ca33de5c0>>>>memoryview(text)Traceback (most recent call last):...memoryview(text)~~~~~~~~~~^^^^^^TypeError:memoryview: a bytes-like object is required, not 'str'

Upon success, you’ll get a new instance of thememoryview class. Otherwise, Python will raise aTypeError exception with a message explaining that the object in question isn’t a bytes-like one. NumPy arrays also expose their underlying data as a buffer, which you can access in Python usingmemoryview.

At this point, you know thatbytes is one of Python’s built-in binary sequence types, representing a read-only series of unsigned bytes. It’s also a bytes-like object that implements the buffer protocol for low-level binary data manipulation. With that covered, it’s time to explore how to create and work withbytes in Python.

Creating Pythonbytes by Hand

In this section, you’ll explore three different ways to createbytes objects in Python, including abytes literal, thebytes() function, and thebytes.fromhex() class method. Each one is suited to specific use cases, which you’ll discover as you delve deeper into their distinct features.

The Bytes Literal Format

The quickest way to produce abytes object in Python resembles defining astring literal, only prefixed with the letterb orB:

Python
>>>MAGIC_NUMBER=b"GIF89a">>>type(MAGIC_NUMBER)<class 'bytes'>

This method works best when your binary sequence consists of reasonably few bytes that you know upfront and you want to embed them in your Python source code in literal form. Such a literal could serve as aconstant meant to represent abinary file signature, also known as a “magic number.”

In this case, the byte values stored in yourbytes object happen to have a human-readable representation because they correspond to well-known character codes:

       
CharacterGIF89a
Code Point717370565797

To embed non-printable values, you’ll have to rewrite them using a special notation. That’s becausebytes literals are limited toASCII characters, making them slightly more restrictive than string literals, which can contain any Unicode character, including emojis.

Thankfully,string andbytes literals have a lot in common. They both supportescape character sequences, letting you encode certain symbols like quotation marks or special characters like newlines with plain ASCII.

Anescape sequence always begins with a backslash (\) followed by one or more ASCII characters. For example, you can use the hexadecimal (\xhh) or octal (\ooo) notation, whereh denotes a hexadecimal digit ando an octal one, to represent any numeric byte value:

Python
>>>b"caf\xc3\251"b'caf\xc3\xa9'

The hexadecimal sequence\xc3 corresponds to the byte value 195 in the decimal system, while the octal sequence\251 represents the byte value 169. Together, these two bytes encode the accented letteré in UTF-8. While neither of the two bytes can fit into the ASCII range between 0 and 127, the individual characters of their escape sequences do. As you can see, Python prefers the hexadecimal notation when displayingbytes in the REPL.

Note: Pythonbytes literals support nearly all escape sequences available in string literals, except for a handful that are specific to encoding Unicode characters.

Another similarity between string andbytes literals is that they have their raw counterparts. Araw string literal disables the interpretation of escape character sequences, which could conflict with the syntax ofregular expressions, for example. Likewise,rawbytes literals are often used to define regex patterns when searching, matching, or substituting parts of binary content:

Pythonfind_timestamps.py
importrefrompathlibimportPathbinary_data=Path("picture.jpg").read_bytes()pattern=rb"\d{4}:\d{2}:\d{2} \d{2}:\d{2}:\d{2}"formatchinre.finditer(pattern,binary_data):print(match)

To define a rawbytes literal in Python, you can use any combination of the lowercase or uppercase lettersr andb as a prefix. In this example, you look for timestamps in a JPEG file, which usually keeps the camera’s original date, file creation date, and last modified date within itsExif metadata.

Declaring your literal as raw instructs Python to treat the sequence\d literally as a pair of characters instead of failing due to an invalid escape sequence. Alternatively, you could escape each backslash in a plainbytes literal to achieve a similar effect:

Python
pattern=b"\\d{4}:\\d{2}:\\d{2}\\d{2}:\\d{2}:\\d{2}"

However, this makes the pattern less readable and doesn’t scale well since the regular expression syntax involves other conflicting symbols.

Note: Unlike Python strings,bytes don’t supportformatted string literals, commonly referred to asf-strings. When you want to format abytes object, consider using an f-string literal first, and then convert it to a sequence of bytes.

Finally, you may occasionally need to declare anemptybytes literal, for instance, when reading a binary file in chunks:

Pythonread_chunks.py
defread_chunks(filename,max_bytes=1024):withopen(filename,mode="rb")asfile:whileTrue:chunk=file.read(max_bytes)ifchunk==b"":breakyieldchunkforchunkinread_chunks("picture.jpg"):print(len(chunk))

Python reports reaching theend of the file (EOF) through an emptybytes object returned from thefile.read() method. By comparing the returned value to an emptybytes literal (b""), you’ll know when to stop reading from the file.

Literals are great forstatic byte sequences of relatively short length that you know ahead of time. In contrast, if you want to create abytes object dynamically based on some data or conditions that can change at runtime, then use thebytes() function.

The Built-inbytes() Function

To create a newbytes instance, you can invoke itsclass constructor just like a regularPython function. For simplicity, you’ll refer to it as abuilt-in function in this section. Depending on how many arguments and what types you pass tobytes(), you can createbytes objects in several different ways:

Python Syntax
# Argumentless:bytes()# Single argument:bytes(length:int)bytes(data:Buffer)bytes(instance:CustomClass)bytes(values:Iterable[int])# Two or three arguments:bytes(text:str,encoding:str[,errors:str="strict"])

When called without arguments, thebytes() function creates anemptybytes object:

Python
>>>bytes()b''

The resulting object is a byte sequence of length zero, which you may use as asentinel value to indicate an absence of data in a binary stream.

Callingbytes() with a positive integer as an argument allocates a sequence of the specified length, and then fills it with zeros ornull bytes. Note that callingbytes(0) has the same effect as declaring an emptybytes literal (b""):

Python
>>>bytes(5)b'\x00\x00\x00\x00\x00'>>>bytes(0)b''

A zero-filledbytes object can be useful as padding since it’s inherently read-only. For example, you might use one to pad the rows of pixels in abitmap image.

The next type of argument that you can pass intobytes() is anotherbytes-like object, which implements the buffer protocol. Callingbytes() on such an object allows you to break down a sequence of binary words into a sequence of their individual bytes:

Python
>>>fromarrayimportarray>>>bytes(array("f",[50.05430603027344,19.935562133789062]))b'\x9c7HB\x08|\x9fA'

Here, you start with a Pythonarray containing two single-precision floats (32-bit), which you then convert into raw bytes. This technique opens up the possibility to reinterpret the underlying binary data in completely new ways:

Python
>>>coordinates=array("I")>>>coordinates.frombytes(b"\x9c7HB\x08|\x9fA")>>>latitude,longitude=coordinates>>>latitude,longitude(1112029084, 1100971016)

For example, you can load the resultingbytes object into an array of unsigned integers. As long as you take responsibility for potential conversion failures, it’ll let you treat the original floating-point numbers as if they were integers all along. The best part is that you can reverse this process without losing a single bit of information!

Another use case for callingbytes() on a bytes-like object is when you want to obtain its read-only copy:

Python
>>>mutable=bytearray([0x30,0x8c,0xc9,0xff])>>>immutable=bytes(mutable)>>>mutable==immutableTrue>>>mutable.clear()>>>mutable==immutableFalse

Abytearray is writable, allowing you to modify its contents, whereas abytes object is fixed, providing a read-only version of the data. The comparison between these two data types is similar to that betweenlists and tuples, since one is mutable and the other immutable. Notice that modifying the original bytes-like object has no effect on the copy you created withbytes().

Note: You can also use thebytes() function on instances of a custom class, provided that it implements the.__bytes__() special method:

Python
>>>classUserStatus:...def__init__(self,user_id,message):...self.user_id=user_id...self.message=message......def__bytes__(self):...returnb"".join([...self.user_id.to_bytes(4,"little",signed=False),...self.message.encode("utf-8")...])...>>>user_status=UserStatus(42,"Away from keyboard\N{PALM TREE}")>>>bytes(user_status)b'*\x00\x00\x00Away from keyboard \xf0\x9f\x8c\xb4'

This method allows you to represent your object as a series of bytes, which are ready for transmission over a network or storage in a binary file. Keep in mind that.__bytes__() takes precedence over the buffer protocol’s.__buffer__() method, so avoid using them together.

The last way to call the single-argument flavor ofbytes() involves using aniterable of small integers as input:

Python
>>>bytes([0x30,0x8c,0xc9,0xff])b'0\x8c\xc9\xff'>>>bytes(range(65,91))b'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

You’ve seen this method of specifying byte sequences many times over in this tutorial. The firstiterable in the example above is a list of integers defined as hexadecimal literals, whereas the second iterable is arange() object. While your iterable can be eagerly orlazily evaluated, it must have a fixed size to avoid crashing the computer due to excessive memory use. So, please don’t use an infinite iterator from theitertools module!

Remember that each element of the input iterable represents a byte value, which must be greater than or equal to 0 and less than 256. Otherwise, you’ll face a familiar error:

Python
>>>bytes([-128,0,127,256])Traceback (most recent call last):...bytes([-128,0,127,256])~~~~~^^^^^^^^^^^^^^^^^^^^^ValueError:bytes must be in range(0, 256)

Both -128 and 256 aren’t valid unsigned byte values, as they fall outside the permissible range.

Finally, you can call thebytes() function withtwo arguments, both of which must be Python strings. The first argument is an ordinary piece of text, while the second argument is the desired character encoding:

Python
>>>bytes("café","utf-8")b'caf\xc3\xa9'

Callingbytes() like that turns a string into abytes object. The resulting binary sequence might be longer than the original number of Unicode characters, depending on theircode points and the chosen encoding.

Note: A morePythonic way to encode a string into bytes entails calling itsstr.encode() method instead of thebytes() function.

Sometimes, a string character might not have a corresponding binary representation in the given encoding. For instance, you can’t translate the accented letteré, which has anordinal value of 233, to one of the 128 ASCII values:

Python
>>>bytes("café","ascii")Traceback (most recent call last):...bytes("café","ascii")~~~~~^^^^^^^^^^^^^^^^^UnicodeEncodeError:'ascii' codec can't encode character '\xe9'⮑ in position 3: ordinal not in range(128)>>>ord("é")233

To prevent Python from raising an error in cases like this, you can relax the default strategy forhandling encoding errors with an optionalthird argument tobytes():

Python
>>>bytes("café","ascii",errors="ignore")b'caf'>>>bytes("café","ascii",errors="replace")b'caf?'>>>bytes("café","ascii",errors="namereplace")b'caf\\N{LATIN SMALL LETTER E WITH ACUTE}'>>>bytes("café","ascii",errors="backslashreplace")b'caf\\xe9'>>>bytes("café","ascii",errors="xmlcharrefreplace")b'caf&#233;'

In fact, callingbytes() with two arguments is the same as calling it with an implicit third argument,errors="strict" by default. The offending characters won’t be considered at all when you choose to ignore the encoding errors. Alternatively, you can replace those characters with a placeholder or an ASCII-safe replacement sequence.

Python provides yet another way to create a newbytes object directly from a string of hexadecimal digits, which you’ll learn about next.

Thebytes.fromhex() Class Method

If you’ve been following along, then you know that using thehexadecimal system is a popular way of representing byte values concisely. In this notation, you need between one and two hexadecimal digits for every byte. Once you exceed the maximum value of 255, you’ll require three digits or more:

Python
>>>hex(9)'0x9'>>>hex(10)'0xa'>>>hex(42)'0x2a'>>>hex(256)'0x100'

When you run out of the decimal digits (0-9), you start using the first six letters of the alphabet (A-F) to represent the values between 10 and 15. But even when you technically only need one hexadecimal digit, it’s common to use two by putting a zero in front of it:

Python
>>>0x099

This practice ensures that every byte is displayed with the same number of hexadecimal digits, which helps maintain consistent formatting. It can be especially useful when you want to print and align multiple bytes in columns or rows.

Because hex is such a widespread notation, Python provides a convenientclass method in itsbytes data type, which converts a string of hexadecimal digits into a newbytes instance:

Python
>>>bytes.fromhex("308cc9")b'0\x8c\xc9'

You’re free to use a mix of lowercase and uppercase letters in the string. Additionally, any whitespace characters are ignored for better readability:

Python
>>>bytes.fromhex("30 8C C9")b'0\x8c\xc9'

There’s a companion method in the resultingbytes object that converts the byte sequence back into a hexadecimal string:

Python
>>>b"0\x8c\xc9".hex()'308cc9'

But that’s just one of many features ofbytes available to you in Python. Now that you know how to create abytes object, you’ll explore what you can do with it.

Manipulating Bytes Objects in Python

Python’sbytes objects share many similarities with strings but are specifically designed for handling binary data. In this section, you’ll explore how they compare to strings and how you can convert between the two data types. Finally, you’ll discover alternative ways to interpret the underlying byte values.

Usebytes Like a Python String

As already hinted at by their literal forms, Python’sbytes andstr data types have many similarities. That’s no coincidence since bytes were often synonymous with characters in the early days of computing. For example, theC programming language still lacks a separate type to represent bytes, instead relying on thechar data type to handle both characters and bytes. Likewise, explicitbytes didn’t appear in Python until the 3.0 release.

Note: You can check the officialrelease notes to find out the differences between Python 2.x and Python 3.x concerning binary data and Unicode handling.

First and foremost, bytes and strings fall under the broad category ofPython sequences. Like othersequence types in Python, thebytes data type supports standard sequence operations, such as:

OperationBytes ExampleString Example
Indexingdata[-1]text[-1]
Slicingdata[3:7:2]text[3:7:2]
Iteratingiter(data)iter(text)
Reversingreversed(data)reversed(text)
Measuring the Lengthlen(data)len(text)
Finding the Indexdata.index(b"\xff\xe1")text.index("Python")
Counting Occurrencesdata.count(b"\xff\xe1")text.count("Python")
Checking Membershipb"\xff\xe1" in data"Python" in text

Just like strings,bytes objects support two additional operators. Thestar operator (*) allows you to repeat the same byte sequence multiple times, while theplus operator (+) lets youconcatenate two or morebytes instances into one:

Python
>>>(b"\xAA"*7)+b"\xAB"b'\xaa\xaa\xaa\xaa\xaa\xaa\xaa\xab'

The code above uses both operators to construct thepreamble andstart frame delimiter (SFD) for anEthernet packet, consisting of seven repeated bytes of0xAA followed by a single byte of0xAB. The underlying bit pattern alternates between ones and zeros until it’s broken by the delimiter, helping network devices synchronize their timing with the incoming data.

You can use Pythonbytes in any context where aniterable object is expected, for example, to count the frequency of unique byte values:

Python
>>>fromcollectionsimportCounter>>>Counter(bytes.fromhex("beef babe"))Counter({190: 2, 239: 1, 186: 1})

In this example, you take advantage of theCounter class from thecollections module to create amapping of numeric byte values to their counts. Along the way, you demonstrate a whimsical use ofhexspeak by using the hexadecimal values of 0xBE 0xEF and 0xBA 0xBE.

Other than that, Pythonbytes andstr share over eighty percent of methods in theirpublic interfaces. After all, operations likefinding and replacing substrings orsplitting andjoining strings can be just as useful for byte sequences:

Python
>>>frompathlibimportPath>>>defhas_exif(path):...returnpath.read_bytes().find(b"\xff\xe1")!=-1...>>>has_exif(Path("camera_photo.jpg"))True>>>has_exif(Path("exif_stripped.jpg"))False

In this example, you call.find() to determine the offset of the first occurrence of an Exif segment marker (0xFF 0xE1) in a JPEG file. If the given byte sequence isn’t found, then the method returns -1. Note that the above implementation isn’t the most efficient since it loads the entire file into memory before searching.

The last major similarity between Pythonbytes and strings is that they’re bothimmutable, making their instances read-only:

Python
>>>byte_sequence=b"Real Python">>>byte_sequence[:4]=b"Monty"Traceback (most recent call last):...byte_sequence[:4]=b"Monty"~~~~~~~~~~~~~^^^^TypeError:'bytes' object does not support item assignment>>>delbyte_sequence[:5]Traceback (most recent call last):...delbyte_sequence[:5]~~~~~~~~~~~~~^^^^TypeError:'bytes' object does not support item deletion

You can’t modifybytes and strings once they’re created because they don’t supportassignments ordeletions. The best you can do is to wrap them in a mutablebytearray, modify the contents as needed, and then convert back to the respective immutable type, effectively creating a copy.

Whilebytes and strings look mostly similar at first glance, they have some notable differences. For instance, one quirk ofbytes is thatindexing returns an integer while slicing returns abytes object, whereas performing the same operations on a string object always results in another string:

Python
>>>byte_sequence=b"Real Python">>>byte_sequence[5]80>>>byte_sequence[5:]b'Python'>>>text="Real Python">>>text[5]'P'>>>text[5:]'Python'

Because Python doesn’t have a distinct type for Unicode characters, it must represent individual characters as single-element strings. In contrast, abytes object is a sequence of numeric byte values, which are stored as integers.

Consequently, when you apply themembership test operatorsin andnot in against a byte sequence, you can check for individual elements using integers and for subsequences usingbytes:

Python
>>>80inbyte_sequenceTrue>>>b"P"inbyte_sequenceTrue>>>b"Python"inbyte_sequenceTrue

The decimal value 80 is a code point for the letterP in ASCII, which is present in the sequence. Similarly, both the byte stringb"Python" and the single-byte sequenceb"P" are found within the full sequence.

Thelength reported bybytes and strings representing precisely the same piece of text may differ:

Python
>>>len("café")4>>>len(bytes("café","raw_unicode_escape"))4>>>len(bytes("café","utf-8"))5>>>len(bytes("café","utf-16"))10

A Python string is a sequence of Unicode characters, each having a unique ordinal value or code point. These code points can range from zero toover a million, which is far beyond the standard byte range, necessitating a clever binary representation. Depending on your chosen encoding, the same character might use fewer or more bytes per code point.

Unlike in legacy Python 2, you can’t mix numeric bytes with strings in one expression unless you do an explicit conversion one way or the other:

Python
>>>b"Real Python"=="Real Python"False>>>b"Real"+"Python"Traceback (most recent call last):...b"Real"+"Python"~~~~~~~~^~~~~~~~~~TypeError:can't concat str to bytes>>>b"Real".decode()+"Python"'RealPython'>>>b"Real"+"Python".encode()b'RealPython'

Python enforces a clear distinction between text (str) and binary data (bytes), reducing the likelihood of errors and making your code more predictable. This applies equally to operators, such as== and+, and methods like.find().

Since you’ll often need to encode a string into bytes ordecode bytes back into a string, you’ll revisit how to convert between these two closely related types in the next section.

Convert Between Bytes and Strings

Strings and byte sequences in Python exhibit symmetry by providing two complementary methods, which allow you to convert between them seamlessly:

  1. str.encode()
  2. bytes.decode()

Encoding a string turns its characters into a sequence of bytes, while decoding binary data reverses this process:

Python
>>>"café"=="café".encode("utf-8").decode("utf-8")True

As you can see, chaining.encode() and.decode() with the same encoding as arguments preserves the original text.

In addition to these two methods, you can convert betweenbytes and strings by calling thebytes() andstr() functions, as well as using thecodecs module from the standard library:

Python
>>>bytes("café","utf-8")b'caf\xc3\xa9'>>>str(b"caf\xc3\xa9","utf-8")'café'>>>importcodecs>>>codecs.encode("café","utf-8")b'caf\xc3\xa9'>>>codecs.decode(b"caf\xc3\xa9","utf-8")'café'

While most of these tools assume some default encoding, it’s highly recommended to always specify one explicitly to avoid surprises. That’s because different operating systems may use incompatible encodings by default. Your best bet is usually to use theUTF-8 encoding, which is widely supported and can handle a wide range of characters from various spoken languages.

Note: Python will start usingUTF-8 as the default encoding on all supported platforms in the near future.

The three conversion methods are largely interchangeable, so you can choose whichever you like most. However, callingstr.encode() andbytes.decode() directly on your Python strings andbytes is the most readable and common way to convert between them.

On the other hand, thecodecs module supports advanced encodings, including a few special ones that only apply tobytes rather than strings:

Python
>>>codecs.encode(b"0\x8c\xc9","base64")b'MIzJ\n'>>>codecs.encode("secret_password","rot-13")'frperg_cnffjbeq'

TheBase64 encoding allows you to represent binary data as text using only ASCII characters, whileROT-13 is a basic substitution cipher that rotates letters in your string by thirteen places. You can also leveragecodecs to apply custom encodings previously registered with Python.

Keep in mind that the conversion between strings andbytes will only work as long you use the correct character encoding:

Python
>>>"żółć".encode("ascii")Traceback (most recent call last):..."żółć".encode("ascii")UnicodeEncodeError:'ascii' codec can't encode characters⮑ in position 0-3: ordinal not in range(128)>>>"żółć".encode("iso-8859-2").decode("iso-8859-1")'¿ó³æ'

When you attempt to decode bytes with the wrong encoding—or the other way around—you’ll likely get an error. In the worst-case scenario, the conversion will fail silently without errors, producing incorrect output. So, always ensure that you use the right encoding when working with text and bytes.

By now, you’re familiar with creatingbytes instances from scratch and from strings, as well as manipulating them. However, the default string representation of byte sequences in the REPL isn’t always the most readable. Next up, you’ll discover ways to customize the display ofbytes, making them easier to understand.

Represent Bytes in Python Differently

When your byte sequence contains only ASCII characters, its default representation in the Python REPL looks similar to that of a string:

Python
>>>b"Real Python"b'Real Python'

This is perfectly readable and straightforward. However, as soon as you add non-printable ASCII characters, such as thecarriage return control character or non-ASCII byte values to the mix, then the output becomes less intuitive:

Python
>>>bytes.fromhex("89 50 4e 47 0d 0a 1a 0a")b'\x89PNG\r\n\x1a\n'

This piece of binary data includes a few values that can’t be expressed with pure ASCII—they must be properly escaped, either using a predefined escape sequence like\n or hexadecimal notation (\x1a). Fortunately, Python consistently formats such hexadecimal numbers with two digits so their boundaries remain clearly visible.

To reveal thenumeric values behind those bytes, you can convert yourbytes object to another sequence type, such as a Python list:

Python
>>>list(b"\x89PNG\r\n\x1a\n")[137, 80, 78, 71, 13, 10, 26, 10]

This allows you to see the decimal value of each byte in your sequence, which can be useful fordebugging or understanding the structure of binary data. Remember that these values represent unsigned integers, meaning they range from 0 to 255.

If you’d like to uncover the underlyingbit patterns of each individual byte, then consider the following code snippet:

Python
>>>" ".join(format(byte,"08b")forbyteinb"\x89PNG\r\n\x1a\n")'10001001 01010000 01001110 01000111 00001101 00001010 00011010 00001010'

You use a niftygenerator expression that iterates over each byte and callsformat() to convert it to an 8-bit binary string. The format specifier ("08b") ensures appropriate padding to maintain a consistent length for each byte. Lastly, you combine the resulting bit patterns into a single string with spaces separating them.

Alternatively, you might want to convert your byte sequence into ahexadecimal string by calling thebytes.hex() method:

Python
>>>color=b"0\x8c\xc9\xff">>>color.hex()'308cc9ff'>>>print(f"#{color.hex()}")#308cc9ff

Here, you also take advantage of Python’s f-string literal to format the resulting string according to a popular color code format.

The.hex() method accepts two optional parameters: aseparator character and thenumber of bytes per group between these separators. Both are particularly useful for formattingIPv6 addresses as strings:

Python
>>>ipv6=b"\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4">>>ipv6.hex()'20010db885a3000000008a2e03707334'>>>ipv6.hex(":")'20:01:0d:b8:85:a3:00:00:00:00:8a:2e:03:70:73:34'>>>ipv6.hex(":",2)'2001:0db8:85a3:0000:0000:8a2e:0370:7334'

The separator can be either a Python string or abytes object containing a single ASCII character. The second argument determines how many bytes to group together before the separator gets inserted. This number can be either positive or negative, although you’ll only see the difference with an uneven number of bytes in the sequence, as shown below:

Python
>>>color=b"0\x8c\xc9">>>color.hex(" ",2)'30 8cc9'>>>color.hex(" ",-2)'308c c9'

When the second parameter is a positive number, Python calculates the separator placements from the right, so the two bytes0x8c and0xc9 in the example above end up being grouped together, while the byte0x30 remains separate. Using a negative value reverses the direction, which aligns with your natural left-to-right reading order. Note that splitting an RGB color like this isn’t practical and only serves as an illustration.

Last but not least, a sequence of raw bytes might represent structured data or a series of more complex values. Eventually, you’ll want to transform these low-levelbytes into acollection of data types understood by Python to work more comfortably. You can do so by leveraging one of the few modules you saw earlier when you first learned about thebytes-like objects andbytes. Here they are as a quick reminder:

  • array
  • struct
  • ctypes
  • io

Another noteworthy module in the Python standard library ispickle, which you typically use toserialize data:

Python
>>>importpickle>>>instance=pickle.loads(bytes([...0x80,0x04,0x95,0x2f,0x00,0x00,0x00,0x00,...0x00,0x00,0x00,0x8c,0x0b,0x63,0x6f,0x6c,...0x6c,0x65,0x63,0x74,0x69,0x6f,0x6e,0x73,...0x94,0x8c,0x07,0x43,0x6f,0x75,0x6e,0x74,...0x65,0x72,0x94,0x93,0x94,0x7d,0x94,0x28,...0x4b,0xbe,0x4b,0x02,0x4b,0xef,0x4b,0x01,...0x4b,0xba,0x4b,0x01,0x75,0x85,0x94,0x52,...0x94,0x2e,...]))>>>instanceCounter({190: 2, 239: 1, 186: 1})>>>type(instance)<class 'collections.Counter'>

When you callpickle.loads() with abytes object as an argument, Python attempts to deserialize an object from the underlying byte stream, which must conform to acertain data format. In this case, the operation succeeds, and you get a newcollections.Counter instance reconstructed from the byte stream.

That was a ton of information to take in! Hopefully, you now have a solid understanding of how to handle low-level binary data in Python and can begin applying your new knowledge to real-world problems. In the next section, you’ll put it into action by exploring a few practical examples.

Working With Python Bytes in Practice

It can’t be stressed enough that reaching for low-level Python constructs, such asbytes, to manipulate raw byte sequences should be your last resort. In real life, higher-level abstractions and libraries are often a better choice. That said, studying and reverse-engineering established binary file formats and network protocols can still be valuable, even when more efficient tools and solutions already exist.

In this section, you’ll implement bits of Python code to handle various binary data formats, including:

  • Base64 Encoding
  • Bitmap File Format
  • Custom File Format
  • Executable File Format
  • OpenStreetMap File Format
  • Pickle Serialization Protocol
  • Python Bytecode
  • Redis Serialization Protocol
  • Waveform Audio File Format

These exercises will help you understand the intricacies of binary data manipulation, not just in Python but in general.

Read and Write Binary Files of Varying Sizes

Two of the most fundamental tasks associated with handling binary data are storing and retrieving byte sequences from a file on disk. If your file is relatively small, then you can load its contents into memory as a whole, process it somehow, and save the modified data. The most concise way to achieve this is with the help of thepathlib module, whosePath object exposes convenient methods for reading and writing raw bytes in one go:

Pythonchecksum.py
importhashlibimportsysfrompathlibimportPathpython_path=Path(sys.executable)checksum_path=python_path.with_suffix(".md5")machine_code=python_path.read_bytes()checksum_path.write_bytes(hashlib.md5(machine_code).digest())print("Saved",checksum_path)

The script above reads theentire binary content of the Python interpreter as raw bytes, effectively loading its machine code into a variable. Next, it calculates the file’sMD5 digest and saves the resulting binarychecksum to a new file with the same name as your Python interpreter, but with an.md5 extension.

If you’re dealing with a slightly larger file, then loading it at once might be too slow for all practical purposes. In such cases, it’s usually better to read your file incrementally insmall chunks, which may optionally overlap to ensure data continuity:

Pythonchecksum_incremental.py
importhashlibimportsysfrompathlibimportPathdefcalculate_checksum(path,chunk_size=4096):checksum=hashlib.md5()withpath.open(mode="rb")asfile:whilechunk:=file.read(chunk_size):checksum.update(chunk)returnchecksum.digest()print(calculate_checksum(Path(sys.executable)))

In this version of your script, you read the specified file in chunks of four kilobytes by default, updating the checksum with each chunk until the entire file is processed. This method can be more efficient for larger files, especially when you’re only interested in processing a small portion of the file. Take note of thewalrus operator (:=), which enables you to read a chunk of data and simultaneously check if the chunk is non-empty.

To handlevery large files—on the order of tens of gigabytes or more—which wouldn’t possibly fit into your computer’s physical memory, you can use memory-mapped files, ormmap for short. This technique gives you the illusion of unlimited memory by projecting parts of the file onto a virtual address space. For example, here’s how you can leverage memory mapping to processOpenStreetMap (OSM) data stored in a massiveXML file:

Pythonosm_mmap.py
frommmapimportACCESS_READ,mmapfrompathlibimportPathclassOpenStreetMap:def__init__(self,path):self.file=path.open(mode="rb")self.stream=mmap(self.file.fileno(),0,access=ACCESS_READ)def__enter__(self):returnselfdef__exit__(self,exc_type,exc_val,exc_tb):self.stream.close()self.file.close()def__iter__(self):end=0while(begin:=self.stream.find(b"<way",end))!=-1:end=self.stream.find(b"</way>",begin)yieldself.stream[begin:end+len(b"</way>")]if__name__=="__main__":withOpenStreetMap(Path("map.osm"))asosm:forway_taginosm:print(way_tag)

The code snippet above reads the file in astreaming fashion, discarding fragments that are of no interest to you and letting you process only relevant XML tags as soon as they’re found.

While XML is a text-based data format, themmap module gives you direct access to the underlying bytes on disk. Depending on the access mode, you may quicklyseek a specific byte offset or use the subscript notation to modify the file contents in place, as if it were areally long Python byte sequence!

If you’re looking for a hands-on example demonstrating how to work with a well-known binary data file format, then check out our tutorial onreading and writing WAV files in Python. On the other hand, if you’re interested in designing a custom binary file format from scratch and using it in Python, then follow the steps outlined in themaze solver project.

Communicate Over a Binary Network Protocol

Clients and servers communicate through the network by exchanging messages that adhere to a well-defined set of rules, collectively known as aprotocol. Somecommunication protocols are text-based, while others are binary. For example,HTTP uses a plain text format, while its secure counterpartHTTPS turns the same payload into encrypted binary data.

Unless you’re working with a highly specialized protocol developed for your own needs, you’ll often find a plethora of third-party libraries andsoftware development kits (SDKs) that can handle the mundane details for you. However, to learn the mechanics of a binary protocol, you’ll usesocket programming to build a minimal Python client that connects to aRedis database.

Note: This exercise isn’t meant as a replacement for mature libraries likeredis-py. For production applications, you should choose battle-tested solutions offering optimal performance, security, and reliability.

Under the hood, Redis uses a protocol known asRESP, which was intentionally designed to be human-readable. Although it’s abinary protocol, the transmitted messages contain predominantly ASCII characters. The first byte of every message determines its type, which can be simple, bulk, or aggregate. The table below shows a few sample RESP messages:

Redis TypeExample
Simple String+PONG\r\n
Error Message-ERR Unknown command 'FOO'\r\n
Integer:1738769911\r\n
Bulk String$5\r\ncaf\xc3\xa9\r\n
NIL$-1\r\n
Empty Array*0\r\n
Array*2\r\n$5\r\nviews\r\n$8\r\nsessions\r\n

You’ll notice that each message is terminated with a Windows-style newline (\r\n) or CRLF, which stands forcarriage return (CR) andline feed (LF). This control character can appear more than once, acting as a delimiter separating different parts of the more complex messages. For instance, the dollar sign ($) indicates abulk string consisting of the total number ofbytes and the UTF-8 encoded string.

The communication with Redis is based on therequest-response model, where the client sends a request to the server and waits for a response. ARESP request is anarray of bulk strings, with the first—and sometimes the second—element representing aRedis command, followed by any optional arguments. Here are some examples:

CommandRequestResponse
PING*1\r\n$4\r\nPING\r\n+PONG\r\n
KEYS*2\r\n$4\r\nKEYS\r\n$1\r\n*\r\n*0\r\n
SET*3\r\n$3\r\nSET\r\n$3\r\nkey\r\n$5\r\nvalue\r\n+OK\r\n
GET*2\r\n$3\r\nGET\r\n$3\r\nkey\r\n$-1\r\n
DEL*2\r\n$3\r\nDEL\r\n$3\r\nkey\r\n:1\r\n

Knowing these details allows you to implement the following helper functions for constructing such messages in Python:

Pythonredis_client.py
defarray(*items):binary=bytearray(f"*{len(items)}\r\n".encode("ascii"))foriteminitems:binary.extend(item)returnbytes(binary)defbulk_string(value):binary=value.encode("utf-8")returnf"${len(binary)}\r\n".encode("utf-8")+binary+b"\r\n"

The first function takes avariable number of arguments, which must be bytes-like objects, and concatenates them into a singlebytes object representing an array. The other function takes a Python string argument and encodes it as a Redis bulk string. Both functions are sufficient to encode most RESP messages.

You can now use these functions to send raw bytes to the Redis server. Go ahead and define the followingRedisClient class at the top of your script:

Pythonredis_client.py
importsocketclassRedisClient:def__init__(self,address="localhost",port=6379):self._socket=socket.create_connection((address,port))def__enter__(self):returnselfdef__exit__(self,exc_type,exc_val,exc_tb):self._socket.close()defping(self):self._socket.sendall(array(bulk_string("PING")))# ...

The class constructor opens a newTCP connection to the host and port number specified as arguments. Your class implements thecontext manager interface, which manages the life cycle of the underlying client socket. Finally, the.ping() method sends a binary message to the server, testing whether the connection is still alive.

To receive the server’s response, you’ll initially want to read the first line of the message, which determines the interpretation of the remaining data. The best approach is to load the individual bytes into a mutablebytearray using a loop until you find the first CRLF occurrence:

Pythonredis_client.py
importsocketclassRedisClient:# ...defget_response(self):line=bytearray()whilenotline.endswith(b"\r\n"):line.extend(self._socket.recv(1))# ...

Please note that this simplified implementation ignores error handling to keep things clear.

Now that you know the first line, you can usepattern matching to decide how to convert raw bytes into Python values:

Pythonredis_client.py
 1importsocket 2 3classRedisClient: 4# ... 5 6defget_response(self): 7line=bytearray() 8whilenotline.endswith(b"\r\n"): 9line.extend(self._socket.recv(1))10matchprefix:=line[:1]:11caseb"+"|b"-":12returnline[1:-2].decode("ascii")13caseb":":14returnint(line[1:-2])15caseb"$":16if(length:=int(line[1:]))==-1:17returnNone18else:19data=self._socket.recv(length+2)20returndata[:-2].decode("utf-8")21caseb"*":22return[self.get_response()for_inrange(int(line[1:]))]23case_:24raiseValueError(f"Unsupported type:{prefix}")2526# ...

You keep processing the response from Redis based on the first byte, which indicates a specific RESP data type. Here’s a line-by-line breakdown of how this process works:

  • Line 10: You begin by extracting the first byte from the first line of the response, and you assign it to a variable calledprefix.
  • Lines 11 and 12: If the prefix is a plus (+) or minus sign (-), then you decode the line as a simple ASCII string without the trailing CRLF control character.
  • Lines 13 and 14: Otherwise, if the prefix is a colon (:), then you interpret the line as an integer.
  • Line 15: The dollar sign ($) indicates either anil value or a bulk string, so you extract the length field from the line and proceed as follows:
    • Lines 16 and 17: If the length equals -1, then you returnNone.
    • Lines 18-20: Otherwise, you read the specified number of bytes from the socket, taking into account the extra two bytes for the CRLF control character, and then you decode the resulting data as a UTF-8 string.
  • Lines 21 and 22: If the prefix is a star (*), you interpret the response as an array whose length is determined by the rest of the line. Next, you read the subsequent array elements byrecursively calling.get_response().
  • Lines 23 and 24: Finally, when none of the above cases match, then you raise aValueError indicating an unsupported RESP type.

Before moving any further, make sure that you have a running server instance. If you don’t want to install Redis, then you can spin up aDocker container with Redis inside by using the following command:

Shell
$dockerrun-p6379:6379redis

The-p option maps the container’s port 6379 to your host’s port 6379, allowing you to access Redis on its default port from a local machine.

Putting it together, you can test drive your custom Redis client using the Python REPL:

Python
>>>fromredis_clientimportRedisClient>>>withRedisClient()asredis:...redis.ping()...redis.get_response()...'PONG'

Great! As long as the server is running and is accessible from your local machine, you should get a response.

Go ahead and implement the rest of the Redis commands mentioned earlier using the RESP protocol. The supporting materials include a sample Redis client implementation, which covers a few more commands.

Serialize Python Objects Using a Binary Format

To dump your objects to a binary file or transfer them over the wire and later reconstruct on a remote machine, you can use thepickle module again. It supports a subset of Python built-in types, collections, and abstract data types from the standard library:

Python
>>>importpickle>>>pickle.dumps(42)b'\x80\x04K*.'>>>fromcollectionsimportdeque>>>pickle.dumps(deque([42]))b'\x80\x04\x95\x1f...\x05deque\x94\x93\x94)R\x94K*a.'>>>fromdatetimeimportdate>>>pickle.dumps(date.today())b'\x80\x04\x95...\x04\x07\xe9\x02\x06\x94\x85\x94R\x94.'

In each case, you get abytes object, which is ready to be written to a file or sent over a network connection. You can then deserialize these raw byte sequences back into their original form:

Python
>>>pickle.loads(b"\x80\x04K*.")42

For more advanced uses cases, which might include dealing with stateful objects or instances of custom classes, check out the tutorial ondata serialization in Python. It also covers other binary serialization protocols.

Manipulate Pixel Data to Conceal a Hidden Message

If you’re looking for an advanced yet fun example of using low-level binary processing in Python, then enterdigital steganography:

Original Bitmap vs Altered Bitmap With Secret Data
Hiding a Secret Message in the Pixel Data of an Image

You can find a detailed discussion on this topic in thelast section of the bitwise operators tutorial. In a nutshell, this technique allows you to store a secret file within the pixel data by flipping specific bits while maintaining the overall appearance of the original image.

Embed an Image in Markdown Using Base64

Markdown is a popular markup language used to format all sorts of text documents, ranging fromREADME files to entire books. It’s lightweight, human-readable, and widely supported. While it contains only plain text, you can embed binary images in it by using the Base64 encoding. Here’s how to do this using Python:

Pythonembed.py
frombase64importb64encodefrompathlibimportPathdefembed_image(path:Path,label:str="")->str:ascii_string=b64encode(path.read_bytes()).decode("ascii")returnf"![{label}](data:image/jpeg;base64,{ascii_string})"if__name__=="__main__":output_path=Path("picture.md")output_path.write_text(embed_image(Path("picture.jpg")))print("Saved",output_path)

In this example, you read the entire binary content of a JPEG file into Python and encode it using thebase64 module. Next, you embed the image directly into your document using the corresponding Markdown syntax. As a result, the size of your original binary data increases by one third, so this method only makes sense for relatively small images!

Compile and Execute Python Bytecode

As the name implies, Pythonbytecode is a low-level representation of Python code, consisting of bytes that correspond to specific instructions known asopcodes. These opcodes serve as commands for the Python interpreter, which executes them at runtime. Unlike human-readable Python source code that you write by hand or generate withAI, bytecode is more compact and efficient for execution.

If you’re curious about how to read, manipulate, and even execute such raw byte sequences using Python, then check out the tutorial on the__pycache__ folder.

Conclusion

By now, you should feel comfortable working withbytes objects in Python, whether you’re processing binary data, handling fileinput/output (I/O), or dealing with networking protocols. Understanding these low-level constructs allows you to work efficiently with performance-sensitive applications, custom binary formats, and data interoperability between different systems.

In this tutorial, you’ve learned how to:

  • Differentiate betweenbytes,bytearray, and otherbytes-like objects
  • Create instances of thePythonbytes data type in various ways
  • Convert between Python strings andbytes usingcharacter encoding
  • Interpretbinary words comprised of multiple bytes correctly
  • Manipulatebytes objects using standardsequence operations
  • Usebytes in file I/O, network communication, and data serialization

Now, it’s your turn! Try out some of the concepts you’ve learned by working with real-world binary data, and don’t hesitate to experiment. If you have questions or want to share your experiences, join the discussion in the comments!

Get Your Code:Click here to download the free sample code that you’ll use to learn about bytes objects and handling binary data in Python.

Frequently Asked Questions

Now that you have some experience withbytes objects in Python, you can use the questions and answers below to check your understanding and recap what you’ve learned.

These FAQs are related to the most important concepts you’ve covered in this tutorial. Click theShow/Hide toggle beside each question to reveal the answer.

bytes objects in Python are immutable sequences of unsigned 8-bit integers, used to handle binary data.

You can create abytes object by using a bytes literal, thebytes() function, or thebytes.fromhex() method.

While both represent sequences of bytes,bytes is immutable, whereasbytearray is mutable, allowing modifications.

You convert a string tobytes by using thestr.encode() method and specifying the desired character encoding.

Endianness refers to the order of bytes in multi-byte data types, either starting with the least significant byte (little-endian) or the most significant byte (big-endian).

Take the Quiz: Test your knowledge with our interactive “Python Bytes” quiz. You’ll receive a score upon completion to help you track your learning progress:


Bytes Objects: Handling Binary Data in Python

Interactive Quiz

Python Bytes

In this quiz, you'll test your understanding of Python bytes objects. By working through this quiz, you'll revisit the key concepts related to this low-level data type.

🐍 Python Tricks 💌

Get a short & sweetPython Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.

Python Tricks Dictionary Merge

AboutBartosz Zaczyński

Bartosz is an experienced software engineer and Python educator with an M.Sc. in Applied Computer Science.

» More about Bartosz

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

MasterReal-World Python Skills With Unlimited Access to Real Python

Locked learning resources

Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:

Level Up Your Python Skills »

MasterReal-World Python Skills
With Unlimited Access to Real Python

Locked learning resources

Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:

Level Up Your Python Skills »

What Do You Think?

Rate this article:

What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.

Commenting Tips: The most useful comments are those written with the goal of learning from or helping out other students.Get tips for asking good questions andget answers to common questions in our support portal.


Looking for a real-time conversation? Visit theReal Python Community Chat or join the next“Office Hours” Live Q&A Session. Happy Pythoning!

Keep Learning

Related Topics:intermediatepython

Related Learning Paths:

Related Tutorials:

Keep reading Real Python by creating a free account or signing in:

Already have an account?Sign-In

Almost there! Complete this form and click the button below to gain instant access:

Bytes Objects: Handling Binary Data in Python

Bytes Objects: Handling Binary Data in Python (Sample Code)

🔒 No spam. We take your privacy seriously.


[8]ページ先頭

©2009-2026 Movatter.jp