High-Level Overview

The Arrow C++ library is composed of several parts, each of which serves a specific purpose.

The physical layer

Memory management abstractions provide a uniform API over memory that may be allocated through various means, such as heap allocation, the memory mapping of a file, or a static memory area. In particular, the buffer abstraction represents a contiguous area of physical data.
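To make the idea of a uniform view over differently allocated memory concrete, here is a minimal sketch. It uses only the C++ standard library and is not Arrow's API: `BufferView`, `Slice`, `SumBytes`, and `kStaticBytes` are hypothetical names invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical, simplified stand-in for a buffer (not Arrow's class): a
// non-owning view over a contiguous byte range. The view does not know or
// care how the bytes were allocated.
struct BufferView {
  const std::uint8_t* data;
  std::size_t size;

  // Zero-copy slice: a new view over a sub-range of the same memory.
  BufferView Slice(std::size_t offset, std::size_t length) const {
    assert(offset <= size && length <= size - offset);
    return BufferView{data + offset, length};
  }
};

// A static memory area, one of the allocation sources named above.
static const std::uint8_t kStaticBytes[4] = {1, 2, 3, 4};

// Sums the bytes of any buffer, regardless of where its memory lives.
inline unsigned SumBytes(const BufferView& buf) {
  unsigned total = 0;
  for (std::size_t i = 0; i < buf.size; ++i) total += buf.data[i];
  return total;
}
```

Code written against `BufferView` works unchanged whether the bytes come from the heap, a static area, or (in principle) a memory-mapped file, which is the point of the abstraction.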

The one-dimensional layer

Data types govern the logical interpretation of physical data. Many operations in Arrow are parameterized, at compile time or at runtime, by a data type.

Arrays assemble one or several buffers with a data type, allowing them to be viewed as a logical contiguous sequence of values (possibly nested).
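As a rough illustration of how several buffers plus a data type yield a logical sequence, the sketch below models a nullable 32-bit integer array as a validity bitmap and a values buffer. It is a standard-library-only model, not Arrow's API: `Int32ArrayModel` and `MakeInt32Array` are hypothetical names. The bit numbering (bit `i % 8` of byte `i / 8`) follows the Arrow columnar format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model (not Arrow's class) of a nullable int32 array: a
// validity bitmap buffer plus a values buffer, read through a fixed type.
struct Int32ArrayModel {
  std::vector<std::uint8_t> validity;  // bit i set => slot i is non-null
  std::vector<std::int32_t> values;    // one slot per element
  std::size_t length;

  // Least-significant-bit numbering within each bitmap byte.
  bool IsValid(std::size_t i) const {
    assert(i < length);
    return ((validity[i / 8] >> (i % 8)) & 1) != 0;
  }
};

// Builds both buffers from logical values and per-slot validity flags.
inline Int32ArrayModel MakeInt32Array(const std::vector<std::int32_t>& vals,
                                      const std::vector<bool>& is_valid) {
  assert(vals.size() == is_valid.size());
  Int32ArrayModel arr;
  arr.length = vals.size();
  arr.validity.assign((vals.size() + 7) / 8, 0);  // one bit per slot
  arr.values.assign(vals.size(), 0);              // null slots stay zero
  for (std::size_t i = 0; i < vals.size(); ++i) {
    if (is_valid[i]) {
      arr.validity[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
      arr.values[i] = vals[i];
    }
  }
  return arr;
}
```

The same two byte ranges would mean something different under another data type; the type is what turns raw buffers into logical values.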

Chunked arrays are a generalization of arrays, combining several arrays of the same type into a longer logical sequence of values.
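The sketch below shows one way a logical index could be resolved across chunks. It is a simplified, standard-library-only model under the assumption that every chunk holds 64-bit integers; `Chunk`, `ChunkedModel`, and `At` are hypothetical names, not Arrow's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model: each chunk is a contiguous run of same-typed values.
using Chunk = std::vector<std::int64_t>;

struct ChunkedModel {
  std::vector<Chunk> chunks;

  // Total logical length is the sum of the chunk lengths.
  std::size_t length() const {
    std::size_t n = 0;
    for (const Chunk& c : chunks) n += c.size();
    return n;
  }

  // Resolves a logical index to a value by walking the chunks in order.
  // Empty chunks are skipped naturally because no index is less than zero.
  std::int64_t At(std::size_t index) const {
    for (const Chunk& c : chunks) {
      if (index < c.size()) return c[index];
      index -= c.size();
    }
    assert(false && "index out of range");
    return 0;
  }
};
```

The linear walk keeps the sketch short; an implementation that performs many lookups would more plausibly precompute cumulative chunk offsets and use a binary search.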

The two-dimensional layer

Schemas describe a logical collection of several pieces of data, each with a distinct name and type, and optional metadata.

Tables are collections of chunked arrays in accordance with a schema. They are the most capable dataset-providing abstraction in Arrow.

Record batches are collections of contiguous arrays, described by a schema. They allow incremental construction or serialization of tables.
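The relationship between schemas, record batches, and tables can be sketched as follows. This is a toy model built on the standard library only, assuming every column holds 64-bit integers; `SchemaModel`, `RecordBatchModel`, `TableModel`, and `Append` are hypothetical names, not Arrow's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model; every column is int64 to keep the sketch short.
using Column = std::vector<std::int64_t>;

struct SchemaModel {
  std::vector<std::string> field_names;
};

// A record batch: one contiguous column per schema field, all equally long.
struct RecordBatchModel {
  std::vector<Column> columns;
  std::size_t num_rows() const {
    return columns.empty() ? 0 : columns[0].size();
  }
};

// A table: for each field, a list of chunks (one chunk per appended batch).
struct TableModel {
  SchemaModel schema;
  std::vector<std::vector<Column>> chunked_columns;
  std::size_t num_rows = 0;

  explicit TableModel(SchemaModel s)
      : schema(std::move(s)), chunked_columns(schema.field_names.size()) {}

  // Appends a batch as one new chunk per column. Earlier chunks are not
  // merged or rewritten. Returns false if the batch does not fit the schema.
  bool Append(const RecordBatchModel& batch) {
    if (batch.columns.size() != schema.field_names.size()) return false;
    for (const Column& c : batch.columns) {
      if (c.size() != batch.num_rows()) return false;
    }
    for (std::size_t i = 0; i < batch.columns.size(); ++i) {
      chunked_columns[i].push_back(batch.columns[i]);
    }
    num_rows += batch.num_rows();
    return true;
  }
};
```

Because each appended batch becomes one more chunk per column, a table can grow incrementally while each batch stays a self-contained, contiguous unit.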

The compute layer

Datums are flexible dataset references, able to hold, for example, an array or table reference.

Kernels are specialized computation functions running in a loop over a given set of datums representing input and output parameters to the functions.
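As a loose illustration of a kernel's shape, the sketch below implements a toy element-wise addition that loops over two inputs and fills an output. It is a standard-library-only model, not Arrow's compute API: `Int64ArrayModel` and `AddKernel` are hypothetical names, and validity is stored as one byte per slot purely for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical model of a nullable int64 array. Validity is kept as one
// byte per slot here for brevity, rather than as a packed bitmap.
struct Int64ArrayModel {
  std::vector<std::int64_t> values;
  std::vector<std::uint8_t> valid;  // 1 = non-null, 0 = null
};

// A toy element-wise "add" kernel: loops over two inputs of equal length
// and fills the output. A result slot is null if either input slot is null.
inline void AddKernel(const Int64ArrayModel& left,
                      const Int64ArrayModel& right, Int64ArrayModel* out) {
  assert(left.values.size() == right.values.size());
  const std::size_t n = left.values.size();
  out->values.assign(n, 0);
  out->valid.assign(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    if (left.valid[i] && right.valid[i]) {
      out->values[i] = left.values[i] + right.values[i];
      out->valid[i] = 1;
    }
  }
}
```

The inputs and the output are passed as explicit parameters, mirroring the description of kernels as loops over datums that represent both inputs and outputs.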

Acero (pronounced [aˈsɜɹo] / ah-SERR-oh) is a streaming execution engine that allows computation to be expressed as a graph of operators which can transform streams of data.

The IO layer

Streams allow untyped sequential or seekable access over external data of various kinds (for example compressed or memory-mapped).

The Inter-Process Communication (IPC) layer

A messaging format allows interchange of Arrow data between processes, using as few copies as possible.

The file formats layer

Reading and writing Arrow data from/to various file formats is possible, for example Parquet, CSV, ORC, or the Arrow-specific Feather format.

The devices layer

Basic CUDA integration is provided, allowing Arrow data backed by GPU-allocated memory to be described.

The filesystem layer

A filesystem abstraction allows reading and writing data from different storage backends, such as the local filesystem or an S3 bucket.