Glossary#
- array#
- vector#
A contiguous, one-dimensional sequence of values with known length where all values have the same type. An array consists of zero or more buffers, a non-negative length, and a data type. The buffers of an array are laid out according to the data type as defined by the columnar format.
Arrays are contiguous in the sense that iterating the values of an array will iterate through a single set of buffers, even though an array may consist of multiple disjoint buffers, or may consist of child arrays that themselves span multiple buffers.
Arrays are one-dimensional in that they are a sequence of slots or singular values, even though for some data types (like structs or unions), a slot may represent multiple values.
Defined by the Arrow Columnar Format.
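As a rough illustration of the pieces named above, the following sketch models a nullable Int32 array in plain Python as a type name, a length, and two buffers (a validity bitmap and a values buffer). `SketchArray` and `int32_array` are hypothetical names, not part of any Arrow library. The sketch assumes little-endian values and omits the buffer padding and alignment that the columnar format recommends.

```python
import struct
from dataclasses import dataclass


@dataclass
class SketchArray:
    """Hypothetical stand-in for an array: a type name, a length, and buffers."""
    type_name: str
    length: int
    buffers: list  # [validity bitmap, values buffer], both bytes objects

    def __getitem__(self, i):
        validity, values = self.buffers
        # Validity bits are stored least-significant bit first within each byte.
        if not (validity[i // 8] >> (i % 8)) & 1:
            return None
        # Each Int32 slot occupies 4 little-endian bytes in the values buffer.
        return struct.unpack_from("<i", values, i * 4)[0]


def int32_array(items):
    """Build a SketchArray from Python ints, with None marking a null slot."""
    validity = bytearray((len(items) + 7) // 8)
    values = bytearray()
    for i, item in enumerate(items):
        if item is not None:
            validity[i // 8] |= 1 << (i % 8)
        # Null slots still take up space in the values buffer.
        values += struct.pack("<i", 0 if item is None else item)
    return SketchArray("Int32", len(items), [bytes(validity), bytes(values)])


arr = int32_array([1, None, 3])
print(arr.length, [arr[i] for i in range(arr.length)])  # 3 [1, None, 3]
```

Reading a slot only ever touches this one set of buffers, which is the sense in which the array is contiguous.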
- buffer#
A contiguous region of memory with a given length. Buffers are used to store data for arrays.
Buffers may be in CPU memory, memory-mapped from a file, in device (e.g. GPU) memory, etc., though not all Arrow implementations support all of these possibilities.
- canonical extension type#
An extension type that has been standardized by the Arrow community so as to improve interoperability between implementations.
- child array#
- parent array#
In an array of a nested type, the parent array corresponds to the parent type and the child array(s) correspond to the child type(s). For example, a List[Int32]-type parent array has an Int32-type child array.
- child type#
- parent type#
In a nested type, the nested type is the parent type, and the child type(s) are its parameters. For example, in List[Int32], List is the parent type and Int32 is the child type.
- chunked array#
A discontiguous, one-dimensional sequence of values with known length where all values have the same type. Consists of zero or more arrays, the “chunks”.
Chunked arrays are discontiguous in the sense that iterating the values of a chunked array may require iterating through different buffers for different indices.
Not part of the columnar format; this term is specific to certain language implementations of Arrow (primarily C++ and its bindings).
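To make "different buffers for different indices" concrete, here is a minimal sketch of how a logical index might be resolved to a chunk and a position within that chunk. `SketchChunkedArray` is a hypothetical class that uses plain Python lists as chunks; it is not how any particular Arrow implementation does it.

```python
from bisect import bisect_right
from itertools import accumulate


class SketchChunkedArray:
    """Hypothetical model: a sequence of chunks, each a plain Python list."""

    def __init__(self, chunks):
        # Empty chunks hold no slots, so they are dropped up front.
        self.chunks = [c for c in chunks if len(c) > 0]
        # ends[k] is the logical index one past the last slot of chunk k.
        self.ends = list(accumulate(len(c) for c in self.chunks))

    def __len__(self):
        return self.ends[-1] if self.ends else 0

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError(i)
        k = bisect_right(self.ends, i)        # chunk holding logical index i
        start = self.ends[k - 1] if k else 0  # logical index of that chunk's first slot
        return self.chunks[k][i - start]


ca = SketchChunkedArray([[10, 20], [], [30], [40, 50, 60]])
print(len(ca), ca[2], ca[5])  # 6 30 60
```

Indices 2 and 5 land in different chunks, so they are read from different underlying storage even though the chunked array presents one logical sequence.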
- complex type#
- nested type#
A data type whose structure depends on one or more other child data types. For instance, List is a nested type that has one child.
Two nested types are equal if and only if their child types are also equal.
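The parent/child relationship and the equality rule can be sketched with a small recursive type descriptor. `SketchType` is a hypothetical class for illustration only; real implementations may compare additional details (such as field names) that this sketch ignores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchType:
    """Hypothetical type descriptor: a name plus zero or more child types."""
    name: str
    children: tuple = ()

    def __str__(self):
        if not self.children:
            return self.name
        return f"{self.name}[{', '.join(str(c) for c in self.children)}]"


int32 = SketchType("Int32")
utf8 = SketchType("Utf8")
list_of_int32 = SketchType("List", (int32,))  # List is the parent type, Int32 the child type
list_of_utf8 = SketchType("List", (utf8,))

# The generated dataclass equality compares the name and, recursively, the children.
print(list_of_int32)                                  # List[Int32]
print(list_of_int32 == SketchType("List", (int32,)))  # True
print(list_of_int32 == list_of_utf8)                  # False
```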
- data type#
- type#
A type that a value can take, such as Int8 or List[Utf8]. The type of an array determines how its values are laid out in memory according to the Arrow Columnar Format.
- dictionary#
An array of values that accompanies a dictionary-encoded array.
- dictionary-encoding#
An array that stores its values as indices into a dictionary array instead of storing the values directly.
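The relationship between the indices and the dictionary can be sketched in a few lines of plain Python. The function names `dictionary_encode` and `dictionary_decode` are hypothetical helpers for this illustration, and the sketch does not handle null values.

```python
def dictionary_encode(values):
    """Return (indices, dictionary) such that dictionary[indices[i]] == values[i]."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in the dictionary
    indices = []
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return indices, dictionary


def dictionary_decode(indices, dictionary):
    """Recover the original values by looking up each index."""
    return [dictionary[i] for i in indices]


indices, dictionary = dictionary_encode(["a", "b", "a", "c", "b"])
print(indices, dictionary)  # [0, 1, 0, 2, 1] ['a', 'b', 'c']
```

Each repeated value is stored once in the dictionary, so columns with few distinct values take less space in this form.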
- extension type#
- storage type#
An extension type is a user-defined data type that adds additional semantics to an existing data type. This allows implementations that do not support a particular extension type to still handle the underlying data type (the “storage type”).
For example, a UUID can be represented as a 16-byte fixed-size binary type.
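The UUID example can be sketched with Python's standard `uuid` module. The helpers `storage_slot` and `uuid_slot` are hypothetical names standing in for a reader that only understands the storage type and a reader that also understands the extension type.

```python
import uuid

WIDTH = 16  # byte width of the fixed-size binary storage type

ids = [uuid.UUID(int=1), uuid.UUID(int=2)]

# Storage view: a single values buffer of concatenated 16-byte entries.
storage = b"".join(u.bytes for u in ids)


def storage_slot(buf, i):
    """What a reader unaware of the extension type sees: 16 raw bytes."""
    return buf[i * WIDTH:(i + 1) * WIDTH]


def uuid_slot(buf, i):
    """What a reader aware of the extension type can recover from the same bytes."""
    return uuid.UUID(bytes=storage_slot(buf, i))


print(len(storage), uuid_slot(storage, 1))
# 32 00000000-0000-0000-0000-000000000002
```

Both readers operate on identical bytes; only the interpretation layered on top differs.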
- field#
A column in a schema. Consists of a field name, a data type, a flag indicating whether the field is nullable or not, and optional key-value metadata.
- IPC file format#
- file format#
- random-access format#
An extension of the IPC streaming format that can be used to serialize Arrow data to disk, then read it back with random access to individual record batches.
- IPC format#
A specification for how to serialize Arrow data, so it can be sent between processes/machines, or persisted on disk.
- IPC message#
- message#
The IPC representation of a particular in-memory structure, like a record batch or schema. Will always be one of the members of MessageHeader in the Flatbuffers protocol file.
- IPC streaming format#
- streaming format#
A protocol for streaming Arrow data or for serializing data to a file, consisting of a stream of IPC messages.
- physical layout#
A specification for how to arrange values in memory.
- primitive type#
A data type that does not have any child types.
- record batch#
In the IPC format: the primitive unit of data. A record batch consists of an ordered list of buffers corresponding to a schema.
In some implementations (primarily C++ and its bindings): a contiguous, two-dimensional chunk of data. A record batch consists of an ordered collection of arrays of the same length.
Like arrays, record batches are contiguous in the sense that iterating the rows of a record batch will iterate through a single set of buffers.
- schema#
A collection of fields with optional metadata that determines all the data types of an object like a record batch or table.
- slot#
A single logical value within an array, i.e. a “row”.
- table#
A discontiguous, two-dimensional chunk of data consisting of an ordered collection of chunked arrays. All chunked arrays have the same length, but may have different types. Different columns may be chunked differently.
Like chunked arrays, tables are discontiguous in the sense that iterating the rows of a table may require iterating through different buffers for different indices.
Not part of the columnar format; this term is specific to certain language implementations of Arrow (for example C++ and its bindings, and Go).
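Because columns may be chunked differently, one way to walk a table row by row is to cut it into record-batch-like pieces at every chunk boundary of every column. The sketch below is an assumption about how that could be done, not a description of any specific implementation. `to_batches` is a hypothetical helper, a table is modeled as a dict mapping column names to lists of chunks, and values are copied for simplicity where a real implementation would likely slice without copying.

```python
from itertools import accumulate


def to_batches(columns):
    """Split a table (dict of name -> list of chunks) into row-aligned batches.

    Cuts are made at every chunk boundary of every column, so each batch
    covers rows that sit inside a single chunk of each column.
    """
    lengths = {sum(len(c) for c in chunks) for chunks in columns.values()}
    if len(lengths) != 1:
        raise ValueError("all columns of a table must have the same length")
    # Union of chunk end positions across all columns, in row order.
    cuts = sorted({end for chunks in columns.values()
                   for end in accumulate(len(c) for c in chunks)})
    # Flattened copy of each column, used only to keep the slicing simple.
    flat = {name: [v for c in chunks for v in c] for name, chunks in columns.items()}
    batches, start = [], 0
    for end in cuts:
        if end > start:
            batches.append({name: vals[start:end] for name, vals in flat.items()})
            start = end
    return batches


table = {
    "id": [[1, 2], [3, 4, 5]],              # chunk boundaries after rows 2 and 5
    "name": [["a", "b", "c"], ["d", "e"]],  # chunk boundaries after rows 3 and 5
}
batches = to_batches(table)
for batch in batches:
    print(batch)
# {'id': [1, 2], 'name': ['a', 'b']}
# {'id': [3], 'name': ['c']}
# {'id': [4, 5], 'name': ['d', 'e']}
```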

