Introduction #

Apache Arrow was born from the need for a set of standards aroundtabular data representation and interchange between systems.The adoption of these standards reduces computing costs of dataserialization/deserialization and implementation costs acrosssystems implemented in different programming languages.

The Apache Arrow specification can be implemented in any programminglanguage but official implementations for many languages are available.An implementation consists of format definitions using the constructsoffered by the language and common in-memory data processing algorithms(e.g. slicing and concatenating). Users can extend and use the utilitiesprovided by the Apache Arrow implementation in their programminglanguage of choice. Some implementations are further ahead and feature avast set of algorithms for in-memory analytical data processing. More detailabout how implementations differ can be found on theImplementation Status page.

Apart from this initial vision, Arrow has grown to also develop amulti-language collection of libraries for solving problems related toin-memory analytical data processing. This covers topics like:

Zero-copy shared memory and RPC-based data movement
Reading and writing file formats (like CSV,Apache ORC, andApache Parquet)
In-memory analytics and query processing

Arrow Columnar Format#

Apache Arrow focuses on tabular data. For an example, let’s considerwe have data that can be organized into a table:

Diagram with tabular data of 4 rows and columns. — Diagram of a tabular data structure.#

Tabular data can be represented in memory using a row-based format or acolumn-based format. The row-based format stores data row-by-row, meaning the rowsare adjacent in the computer memory:

Tabular data being structured row by row in computer memory. — Tabular data being saved in memory row by row.#

In a columnar format, the data is organized column-by-column instead.This organization makes analytical operations like filtering, grouping,aggregations and others, more efficient thanks to memory locality.When processing the data, the memory locations accessed by the CPU tendto be near one another. By keeping the data contiguous in memory, it alsoenables vectorization of the computations. Most modern CPUs haveSIMD instructions (a single instruction that operates on multiple values atonce) enabling parallel processing and execution of operations on vector datausing a single CPU instruction.

Apache Arrow is solving this exact problem. It is the specification thatuses the columnar layout.

Tabular data being structured column by column in computer memory. — The same tabular data being saved in memory column by column.#

Each column is called anArray in Arrow terminology. Arrays can be ofdifferent data types and the way their values are stored in memory varies amongthe data types. The specification of how these values are arranged in memory iswhat we call aphysical memory layout. One contiguous region of memory thatstores data for arrays is called aBuffer. An array consists of one or morebuffers.

Next sections give an introduction to Arrow Columnar Format explaining thedifferent physical layouts. The full specification of the format can be foundatArrow Columnar Format.

Support for Null Values#

Arrow supports missing values or “nulls” for all data types: any valuein an array may be semantically null, whether primitive or nested data type.

In Arrow, a dedicated buffer, known as the validity (or “null”) bitmap,is used alongside the data indicating whether each value in the array isnull or not: a value of 1 means that the value is not-null (“valid”), whereasa value of 0 indicates that the value is null.

This validity bitmap is optional: if there are no missing values inthe array the buffer does not need to be allocated (as in the examplecolumn 1 in the diagram below).

Note

We read validity bitmaps right-to-left within a group of 8 bits due toleast-significant bit numberingbeing used.

This is also how we have represented the validity bitmaps in thediagrams included in this document.

Primitive Layouts#

Fixed Size Primitive Layout#

A primitive column represents an array of values where each valuehas the same physical size measured in bytes. Data types that use thefixed size primitive layout are, for example, signed and unsignedinteger data types, floating point numbers, boolean, decimal and temporaldata types.

Diagram is showing the difference between the primitive data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for primitive data types.#

Note

The boolean data type is represented with a primitive layout where thevalues are encoded in bits instead of bytes. That means the physicallayout includes a values bitmap buffer and possibly a validity bitmapbuffer.

Diagram is showing the difference between the boolean data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for boolean data type.#

Note

Arrow also has a concept of Null data type where all values are null. Inthis case no buffers are allocated.

Variable length binary and string#

In contrast to the fixed size primitive layout, the variable length layoutallows representing an array where each element can have a variable sizein bytes. This layout is used for binary and string data.

The bytes of all elements in a binary or string column are stored togetherconsecutively in a single buffer or region of memory. To know where each elementof the column starts and ends, the physical layout also includes integer offsets.The offsets buffer is always one element longer than the array.The last two offsets define the start and the end of the lastbinary/string element.

Binary and string data types share the same physical layout. The onlydifference between them is that a string-typed array is assumed to containvalid UTF-8 string data.

The difference between binary/string and large binary/string is in the offsetdata type. In the first case that is int32 and in the second it is int64.

The limitation of data types using 32 bit offsets is that they have a maximum size of2GB per array. One can still use the non-large variants for bigger data, butthen multiple chunks are needed.

Diagram is showing the difference between the variable length string data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for variable length string data types.#

Variable length binary and string view#

This layout is an alternative for the variable length binary layout and is adaptedfrom TU Munich’sUmbraDB and is similar to the string layout used inDuckDB andVelox (and sometimes also called “German strings”).

The main difference to the classical binary and string layout is the views buffer.It includes the length of the string, and then either its characters appearinginline (for small strings) or only the first 4 bytes of the string and an offset intoone of the potentially several data buffers. Because it uses an offset and length to referto the data buffer, the bytes of all elements do not need to be storedconsecutively in a single buffer. This enables out of order writing ofvariable length elements into the array.

These properties are important for efficient string processing. The prefixenables a profitable fast path for string comparisons, which are frequentlydetermined within the first four bytes. Selecting elements is a simple “gather”operation on the fixed-width views buffer and does not need to rewrite thevalues buffers.

Diagram is showing the difference between the variable length string view data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for variable length string view data type.#

Nested Layouts#

Nested data types introduce the concept of parent and child arrays. They expressrelationships between physical value arrays in a nested data type structure.

Nested data types depend on one or more other child data types. For instance, Listis a nested data type (parent) that has one child (the data type of the values inthe list).

List#

The list data type enables representing an array where each element is a sequenceof elements of the same data type. The layout is similar to variable-size binaryor string layout as it has an offsets buffer to define where the sequence of valuesfor each element starts and ends, with all the values being stored consecutivelyin a values child array.

The offsets in the list data type are int32 while in the large list the offsetsare int64.

Diagram is showing the difference between the variable size list data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for variable size list data type.#

Fixed Size List#

Fixed-size list is a special case of variable-size list where each column slotcontains a fixed size sequence meaning all lists are the same size and so theoffset buffer is no longer needed.

List View#

In contrast to the list type, list view type also has a size buffer togetherwith an offset buffer. The offsets continue to indicate the start of eachelement but size is now saved in a separate size buffer. This allowsout-of-order offsets as the sizes aren’t derived from the consecutiveoffsets anymore.

Struct#

A struct is a nested data type parameterized by an ordered sequence of fields(a data type and a name).

There is one child array for each field
Child arrays are independent and need not be adjacent to each other inmemory. They only need to have the same length.

One can think of an individual struct field as a key-value pair where thekey is the field name and the child array its values. The field (key) issaved in the schema and the values of a specific field (key) are saved inthe child array.

Diagram is showing the difference between the struct data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for struct data type.#

Map#

The Map data type represents nested data where each value is a variable number ofkey-value pairs. Its physical representation is the same as a list of{key,value}structs.

The difference between the struct and map data types is that a struct holds the keyin the schema, requiring keys to be strings, and the values are stored in thechild arrays,one for each field. There can be multiple keys and therefore multiple child arrays.The map, on the other hand, has one child array holding all the different keys (thatthus all need to be of the same data type, but not necessarily strings) and a secondchild array holding all the values. The values need to be of the same data type; however,the data type doesn’t have to match that of the keys.

Also, the map stores the struct in a list and needs an offset as the list isvariable shape.

Diagram is showing the difference between the map data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for map data type.#

Union#

The union is a nested data type where each slot in the union has a value with a data typechosen from a subset of possible Arrow data types. That means that a union array representsa mixed-type array. Unlike other data types, unions do not have their own validity bitmapand the nullness is determined by the child arrays.

Arrow defines two distinct union data types, “dense” and “sparse”.

Dense Union#

A Dense Union has one child array for each data type present in the mixed-type array andtwo buffers of its own:

Types buffer: holds data type id for each slot of the array. Data type id isfrequently the index of the child array; however, the relationship between data typeID and the child index is a parameter of the data type.
Offsets buffer: holds relative offset into the respective child array for eacharray slot.

Diagram is showing the difference between the dense union data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for dense union data type.#

Sparse union#

A sparse union has the same structure as a dense union, with the omission of the offsetsbuffer. In this case, the child arrays are each equal in length to the length of the union.

Dictionary Encoded Layout#

Dictionary encoding can be effective when one has data with many repeated values.The values are represented by integers referencing a dictionary usually consisting ofunique values.

Diagram is showing the difference between the dictionary data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for dictionary data type.#

Run-End Encoded Layout#

Run-end encoding is well-suited for representing data containing sequences of thesame value. These sequences are called runs. A run-end encoded array has no buffersof its own, but has two child arrays:

Run ends array: holds the index in the array where each run ends. The numberof run ends is the same as the length of its parent array.
Values array: the actual values without repetitions (together with null values).

Note that nulls of the parent array are strictly represented in the values array.

Diagram is showing the difference between the run-end encoded data type presented in a Table and the data actually stored in computer memory. — Physical layout diagram for run-end encoded data type.#

See also

Table of all ArrowData Types.

Overview of Arrow Terminology#

Physical layoutA specification for how to represent values of an array in memory.

BufferA contiguous region of memory with a given length in bytes. Buffers are used to store datafor arrays. Sometimes we use the notion of number of elements in a buffer which can only beused if we know the data type of the array that wraps this specific buffer.

ArrayA contiguous, one-dimensional sequence of values with known length where all values have thesame data type. An array consists of zero or more buffers.

Chunked ArrayA discontiguous, one-dimensional sequence of values with known length where all values havethe same data type. Consists of zero or more arrays, the “chunks”.

Note

Chunked Array is a concept specific to certain implementations such as Arrow C++ and PyArrow.

RecordBatchA contiguous, two-dimensional data structure which consists of an ordered collection of arraysof the same length.

SchemaAn ordered collection of fields that communicates all the data types of an objectlike a RecordBatch or Table. Schemas can contain optional key/value metadata.

FieldA Field includes a field name, a data type, a nullability flag andoptional key-value metadata for a specific column in a RecordBatch.

TableA discontiguous, two-dimensional chunk of data consisting of an ordered collection of ChunkedArrays. All Chunked Arrays have the same length, but may have different types. Different columnsmay be chunked differently.

Note

Table is a concept specific to certain implementations such as Arrow C++ and PyArrow. In Javaimplementation, for example, a Table is not a collection of Chunked Arrays but a collection ofRecordBatches.

A graphical representation of an Arrow Table and a Record Batch, with structure as described in text above.

See also

TheGlossary for more terms.

Extension Types#

In case the system or application needs to extend standard Arrow data types withcustom semantics, this is enabled by defining extension types.

Examples of an extension type areUUID orFixed shape tensor extension type.

Extension types can be defined by annotating any of the built-in Arrow data types(the “storage type”) with a custom type name and optional serialized representation('ARROW:extension:name' and'ARROW:extension:metadata' keys in the Fieldmetadata structure).

See also

TheExtension Types documentation.

Canonical Extension Types#

It is beneficial to share the definitions of well-known extension types so as toimprove interoperability between different systems integrating Arrow columnar data.For this reason canonical extension types are defined in Arrow itself.

See also

TheCanonical Extension Types documentation.

Community Extension Types#

These are Arrow extension types that have been established as standards within specificdomain areas.

Example:

GeoArrow: A collection of Arrow extension types for representing vector geometries

Sharing Arrow data#

Arrow memory layout is meant to be a universal standard for representing tabular data in memory,not tied to a specific implementation. The Arrow standard defines two protocols forwell-defined and unambiguous communication of Arrow data between applications:

Protocol to share Arrow data between processes or over the network is calledSerialization and Interprocess Communication (IPC).The specification for sharing data is called IPC message format which defines how Arrowarray or record batch buffers are stacked together to be serialized and deserialized.
To share Arrow data in the same processThe Arrow C data interface is used, meant for sharingthe same buffer zero-copy in memory between different libraries within the same process.

On this page

Edit on GitHub

Movatterモバイル変換

Introduction#

Arrow Columnar Format#

Support for Null Values#

Primitive Layouts#

Fixed Size Primitive Layout#

Variable length binary and string#

Variable length binary and string view#

Nested Layouts#

List#

Fixed Size List#

List View#

Struct#

Map#

Union#

Dense Union#

Sparse union#

Dictionary Encoded Layout#

Run-End Encoded Layout#

Overview of Arrow Terminology#

Extension Types#

Canonical Extension Types#

Community Extension Types#

Sharing Arrow data#

Introduction #