Memory and IO Interfaces#

This section will introduce you to the major concepts in PyArrow’s memory management and IO systems:

  • Buffers

  • Memory pools

  • File-like and stream-like objects

Referencing and Allocating Memory#

pyarrow.Buffer#

The Buffer object wraps the C++ arrow::Buffer type, which is the primary tool for memory management in Apache Arrow in C++. It permits higher-level array classes to safely interact with memory which they may or may not own. arrow::Buffer can be zero-copy sliced to permit Buffers to cheaply reference other Buffers, while preserving memory lifetime and clean parent-child relationships.

There are many implementations of arrow::Buffer, but they all provide a standard interface: a data pointer and length. This is similar to Python’s built-in buffer protocol and memoryview objects.

A Buffer can be created from any Python object implementing the buffer protocol by calling the py_buffer() function. Let’s consider a bytes object:

In [1]: import pyarrow as pa

In [2]: data = b'abcdefghijklmnopqrstuvwxyz'

In [3]: buf = pa.py_buffer(data)

In [4]: buf
Out[4]: <pyarrow.Buffer address=0x7fdfff036750 size=26 is_cpu=True is_mutable=False>

In [5]: buf.size
Out[5]: 26

Creating a Buffer in this way does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object.

External memory, in the form of a raw pointer and size, can also be referenced using the foreign_buffer() function.

Buffers can be used in circumstances where a Python buffer or memoryview is required, and such conversions are zero-copy:

In [6]: memoryview(buf)
Out[6]: <memory at 0x7fdfff078100>

The Buffer’s to_pybytes() method converts the Buffer’s data to a Python bytestring (thus making a copy of the data):

In [7]: buf.to_pybytes()
Out[7]: b'abcdefghijklmnopqrstuvwxyz'

Memory Pools#

All memory allocations and deallocations (like malloc and free in C) are tracked in an instance of MemoryPool. This means that we can precisely track the amount of memory that has been allocated:

In [8]: pa.total_allocated_bytes()
Out[8]: 56640

Let’s allocate a resizable Buffer from the default pool:

In [9]: buf = pa.allocate_buffer(1024, resizable=True)

In [10]: pa.total_allocated_bytes()
Out[10]: 57664

In [11]: buf.resize(2048)

In [12]: pa.total_allocated_bytes()
Out[12]: 58688

The default allocator requests memory in a minimum increment of 64 bytes. If the buffer is garbage-collected, all of the memory is freed:

In [13]: buf = None

In [14]: pa.total_allocated_bytes()
Out[14]: 56640

Besides the default built-in memory pool, there may be additional memory pools to choose from (such as jemalloc) depending on how Arrow was built. One can get the backend name for a memory pool:

>>> pa.default_memory_pool().backend_name
'mimalloc'

See also

On-GPU buffers using Arrow’s optional CUDA integration.

Input and Output#

The Arrow C++ libraries have several abstract interfaces for different kinds of IO objects:

  • Read-only streams

  • Read-only files supporting random access

  • Write-only streams

  • Write-only files supporting random access

  • Files supporting reads, writes, and random access

In the interest of making these objects behave more like Python’s built-in file objects, we have defined a NativeFile base class which implements the same API as regular Python file objects.

NativeFile has some important features which make it preferable to using Python files with PyArrow where possible:

  • Other Arrow classes can access the internal C++ IO objects natively, and do not need to acquire the Python GIL

  • Native C++ IO may be able to do zero-copy IO, such as with memory maps

There are several kinds of NativeFile subclasses available, such as OSFile, MemoryMappedFile, BufferReader and PythonFile.

There are also high-level APIs to make instantiating common kinds of streams easier.

High-Level API#

Input Streams#

The input_stream() function allows creating a readable NativeFile from various kinds of sources.

  • If passed a Buffer or a memoryview object, a BufferReader will be returned:

    In [15]: buf = memoryview(b"some data")

    In [16]: stream = pa.input_stream(buf)

    In [17]: stream.read(4)
    Out[17]: b'some'
  • If passed a string or file path, it will open the given file on disk for reading, creating an OSFile. Optionally, the file can be compressed: if its filename ends with a recognized extension such as .gz, its contents will automatically be decompressed on reading.

    In [18]: import gzip

    In [19]: with gzip.open('example.gz', 'wb') as f:
       ....:     f.write(b'some data\n' * 3)
       ....:

    In [20]: stream = pa.input_stream('example.gz')

    In [21]: stream.read()
    Out[21]: b'some data\nsome data\nsome data\n'
  • If passed a Python file object, it will be wrapped in a PythonFile such that the Arrow C++ libraries can read data from it (at the expense of a slight overhead).

Output Streams#

output_stream() is the equivalent function for output streams and allows creating a writable NativeFile. It has the same features as explained above for input_stream(), such as being able to write to buffers or do on-the-fly compression.

In [22]: with pa.output_stream('example1.dat') as stream:
   ....:     stream.write(b'some data')
   ....:

In [23]: f = open('example1.dat', 'rb')

In [24]: f.read()
Out[24]: b'some data'

On-Disk and Memory Mapped Files#

PyArrow includes two ways to interact with data on disk: standard operating system-level file APIs, and memory-mapped files. In regular Python we can write:

In [25]: with open('example2.dat', 'wb') as f:
   ....:     f.write(b'some example data')
   ....:

Using pyarrow’s OSFile class, you can write:

In [26]: with pa.OSFile('example3.dat', 'wb') as f:
   ....:     f.write(b'some example data')
   ....:

For reading files, you can use OSFile or MemoryMappedFile. The difference between these is that OSFile allocates new memory on each read, like Python file objects. In reads from memory maps, the library constructs a buffer referencing the mapped memory without any memory allocation or copying:

In [27]: file_obj = pa.OSFile('example2.dat')

In [28]: mmap = pa.memory_map('example3.dat')

In [29]: file_obj.read(4)
Out[29]: b'some'

In [30]: mmap.read(4)
Out[30]: b'some'

The read method implements the standard Python file read API. To read into Arrow Buffer objects, use read_buffer:

In [31]: mmap.seek(0)
Out[31]: 0

In [32]: buf = mmap.read_buffer(4)

In [33]: print(buf)
<pyarrow.Buffer address=0x7fe0dbe86000 size=4 is_cpu=True is_mutable=False>

In [34]: buf.to_pybytes()
Out[34]: b'some'

Many tools in PyArrow, particularly the Apache Parquet interface and the file and stream messaging tools, are more efficient when used with these NativeFile types than with normal Python file objects.

In-Memory Reading and Writing#

To assist with serialization and deserialization of in-memory data, we have file interfaces that can read and write to Arrow Buffers.

In [35]: writer = pa.BufferOutputStream()

In [36]: writer.write(b'hello, friends')
Out[36]: 14

In [37]: buf = writer.getvalue()

In [38]: buf
Out[38]: <pyarrow.Buffer address=0x7fe071a900c0 size=14 is_cpu=True is_mutable=True>

In [39]: buf.size
Out[39]: 14

In [40]: reader = pa.BufferReader(buf)

In [41]: reader.seek(7)
Out[41]: 7

In [42]: reader.read(7)
Out[42]: b'friends'

These have similar semantics to Python’s built-in io.BytesIO.