Arrow Columnar Format#
Version: 1.5
TheArrow columnar format includes a language-agnostic in-memorydata structure specification, metadata serialization, and a protocolfor serialization and generic data transport.
This document is intended to provide adequate detail to create a newimplementation of the columnar format without the aid of an existingimplementation. We utilize Google’sFlatbuffers project formetadata serialization, so it will be necessary to refer to theproject’sFlatbuffers protocol definition fileswhile reading this document.
The columnar format has some key features:
Data adjacency for sequential access (scans)
O(1) (constant-time) random access[1]
SIMD and vectorization-friendly
Relocatable without “pointer swizzling”, allowing for true zero-copyaccess in shared memory
The Arrow columnar format provides analytical performance and datalocality guarantees in exchange for comparatively more expensivemutation operations. This document is concerned only with in-memorydata representation and serialization details; issues such ascoordinating mutation of data structures are left to be handled byimplementations.
[1]Except for theRun-End Encoded Layout where random access isO(log n).
Terminology#
Since different projects have used different words to describe variousconcepts, here is a small glossary to help disambiguate.
Array orVector: a sequence of values with known length allhaving the same type. These terms are used interchangeably indifferent Arrow implementations, but we use “array” in thisdocument.
Slot: a single logical value in an array of some particular data type
Buffer orContiguous memory region: a sequential virtualaddress space with a given length. Any byte can be reached via asingle pointer offset less than the region’s length.
Physical Layout: The underlying memory layout for an arraywithout taking into account any value semantics. For example, a32-bit signed integer array and 32-bit floating point array have thesame layout.
Data type: An application-facing semantic value type that isimplemented using some physical layout. For example, Decimal128values are stored as 16 bytes in a fixed-size binarylayout. A timestamp may be stored as 64-bit fixed-size layout.
Primitive type: a data type having no child types. This includessuch types as fixed bit-width, variable-size binary, and null types.
Nested type: a data type whose full structure depends on one ormore other child types. Two fully-specified nested types are equalif and only if their child types are equal. For example,
List<U>is distinct fromList<V>iff U and V are different types.Parent andchild arrays: names to express relationshipsbetween physical value arrays in a nested type structure. Forexample, a
List<T>-type parent array has a T-type array as itschild (see more on lists below).Parametric type: a type which requires additional parametersfor full determination of its semantics. For example, all nested typesare parametric by construction. A timestamp is also parametric as it needsa unit (such as microseconds) and a timezone.
Data Types#
The fileSchema.fbs defines built-in data types supported by theArrow columnar format. Each data type uses a well-defined physical layout.
Schema.fbs is the authoritative source for the description of thestandard Arrow data types. However, we also provide the below table forconvenience:
Type | Type Parameters(1) | Physical Memory Layout |
|---|---|---|
Null | Null | |
Boolean | Fixed-size Primitive | |
Int |
| “ (same as above) |
Floating Point |
| “ |
Decimal |
| “ |
Date |
| “ |
Time |
| “ |
Timestamp |
| “ |
Interval |
| “ |
Duration |
| “ |
Fixed-Size Binary |
| Fixed-size Binary |
Binary | Variable-size Binary with 32-bit offsets | |
Utf8 | “ | |
Large Binary | Variable-size Binary with 64-bit offsets | |
Large Utf8 | “ | |
Binary View | Variable-size Binary View | |
Utf8 View | “ | |
Fixed-Size List |
| Fixed-size List |
List |
| Variable-size List with 32-bit offsets |
Large List |
| Variable-size List with 64-bit offsets |
List View |
| Variable-size List View with 32-bit offsets and sizes |
Large List View |
| Variable-size List View with 64-bit offsets and sizes |
Struct |
| Struct |
Map |
| Variable-size List of Structs |
Union |
| Dense or Sparse Union(3) |
Dictionary |
| Dictionary Encoded |
Run-End Encoded |
| Run-End Encoded |
(1) Type parameters listed initalics denote a data type’s child types.
(2) Thebit width parameter of a Time type is technically redundant aseachunit mandates a single bit width.
(3) Whether a Union type uses the Sparse or Dense layout is denoted by itsmode parameter.
(4) Theindex type of a Dictionary type can only be an integer type,preferably signed, with width 8 to 64 bits.
(5) Therun end type of a Run-End Encoded type can only be a signed integer typewith width 16 to 64 bits.
Note
Sometimes the term “logical type” is used to denote the Arrow data typesand distinguish them from their respective physical layouts. However,unlike other type systems such asApache Parquet’s,the Arrow type system doesn’t have separate notions of physical types andlogical types.
The Arrow type system separately providesextension types, which allowannotating standard Arrow data types with richer application-facing semantics(for example defining a “JSON” type laid upon the standard String data type).
Physical Memory Layout#
Arrays are defined by a few pieces of metadata and data:
A data type.
A sequence of buffers.
A length as a 64-bit signed integer. Implementations are permittedto be limited to 32-bit lengths, see more on this below.
A null count as a 64-bit signed integer.
An optionaldictionary, for dictionary-encoded arrays.
Nested arrays additionally have a sequence of one or more sets ofthese items, called thechild arrays.
Each data type has a well-defined physical layout. Here are the differentphysical layouts defined by Arrow:
Primitive (fixed-size): a sequence of values each having thesame byte or bit width
Variable-size Binary: a sequence of values each having a variablebyte length. Two variants of this layout are supported using 32-bitand 64-bit length encoding.
View of Variable-size Binary: a sequence of values each having avariable byte length. In contrast to Variable-size Binary, the valuesof this layout are distributed across potentially multiple buffersinstead of densely and sequentially packed in a single buffer.
Fixed-size List: a nested layout where each value has the samenumber of elements taken from a child data type.
Variable-size List: a nested layout where each value is avariable-length sequence of values taken from a child data type. Twovariants of this layout are supported using 32-bit and 64-bit lengthencoding.
View of Variable-size List: a nested layout where each value is avariable-length sequence of values taken from a child data type. Thislayout differs fromVariable-size List by having an additionalbuffer containing the sizes of each list value. This removes a constrainton the offsets buffer — it does not need to be in order.
Struct: a nested layout consisting of a collection of namedchildfields each having the same length but possibly differenttypes.
Sparse andDense Union: a nested layout representing asequence of values, each of which can have type chosen from acollection of child array types.
Dictionary-Encoded: a layout consisting of a sequence ofintegers (any bit-width) which represent indexes into a dictionarywhich could be of any type.
Run-End Encoded (REE): a nested layout consisting of two child arrays,one representing values, and one representing the logical index wherethe run of a corresponding value ends.
Null: a sequence of all null values.
The Arrow columnar memory layout only applies todata and notmetadata. Implementations are free to represent metadata in-memoryin whichever form is convenient for them. We handle metadataserialization in an implementation-independent way usingFlatbuffers, detailed below.
Buffer Alignment and Padding#
Implementations are recommended to allocate memory on alignedaddresses (multiple of 8- or 64-bytes) and pad (overallocate) to alength that is a multiple of 8 or 64 bytes. When serializing Arrowdata for interprocess communication, these alignment and paddingrequirements are enforced. If possible, we suggest that you preferusing 64-byte alignment and padding. Unless otherwise noted, paddedbytes do not need to have a specific value.
The alignment requirement follows best practices for optimized memoryaccess:
Elements in numeric arrays will be guaranteed to be retrieved via aligned access.
On some architectures alignment can help limit partially used cache lines.
The recommendation for 64 byte alignment comes from theIntelperformance guide that recommends alignment of memory to match SIMDregister width. The specific padding length was chosen because itmatches the largest SIMD instruction registers available on widelydeployed x86 architecture (Intel AVX-512).
The recommended padding of 64 bytes allows for usingSIMDinstructions consistently in loops without additional conditionalchecks. This should allow for simpler, efficient and CPUcache-friendly code. In other words, we can load the entire 64-bytebuffer into a 512-bit wide SIMD register and get data-levelparallelism on all the columnar values packed into the 64-bytebuffer. Guaranteed padding can also allow certain compilers togenerate more optimized code directly (e.g. One can safely use Intel’s-qopt-assume-safe-padding).
Array lengths#
Array lengths are represented in the Arrow metadata as a 64-bit signedinteger. An implementation of Arrow is considered valid even if it onlysupports lengths up to the maximum 32-bit signed integer, though. If usingArrow in a multi-language environment, we recommend limiting lengths to231 - 1 elements or less. Larger data sets can be represented usingmultiple array chunks.
Null count#
The number of null value slots is a property of the physical array andconsidered part of the data structure. The null count is representedin the Arrow metadata as a 64-bit signed integer, as it may be aslarge as the array length.
Validity bitmaps#
Any value in an array may be semantically null, whether primitive or nestedtype.
All array types, with the exception of union types (more on these later),utilize a dedicated memory buffer, known as the validity (or “null”) bitmap, toencode the nullness or non-nullness of each value slot. The validity bitmapmust be large enough to have at least 1 bit for each array slot.
Whether any array slot is valid (non-null) is encoded in the respective bits ofthis bitmap. A 1 (set bit) for indexj indicates that the value is not null,while a 0 (bit not set) indicates that it is null. Bitmaps are to beinitialized to be all unset at allocation time (this includes padding):
is_valid[j]->bitmap[j/8]&(1<<(j%8))
We useleast-significant bit (LSB) numbering (also known asbit-endianness). This means that within a group of 8 bits, we readright-to-left:
values=[0,1,null,2,null,3]bitmapjmod87654321000101011
Arrays having a 0 null count may choose to not allocate the validitybitmap; how this is represented depends on the implementation (forexample, a C++ implementation may represent such an “absent” validitybitmap using a NULL pointer). Implementations may choose to always allocatea validity bitmap anyway as a matter of convenience. Consumers of Arrowarrays should be ready to handle those two possibilities.
Nested type arrays (except for union types as noted above) have their owntop-level validity bitmap and null count, regardless of the null count andvalid bits of their child arrays.
Array slots which are null are not required to have a particular value;any “masked” memory can have any value and need not be zeroed, thoughimplementations frequently choose to zero memory for null values.
Fixed-size Primitive Layout#
A primitive value array represents an array of values each having thesame physical slot width typically measured in bytes, though the specalso provides for bit-packed types (e.g. boolean values encoded inbits).
Internally, the array contains a contiguous memory buffer whose totalsize is at least as large as the slot width multiplied by the arraylength. For bit-packed types, the size is rounded up to the nearestbyte.
The associated validity bitmap is contiguously allocated (as describedabove) but does not need to be adjacent in memory to the valuesbuffer.
Example Layout: Int32 Array
For example a primitive array of int32s:
[1,null,2,4,8]
Would look like:
*Length:5,Nullcount:1*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00011101|0(padding)|*ValueBuffer:|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-19|Bytes20-63||-------------|-------------|-------------|-------------|-------------|-----------------------||1|unspecified|2|4|8|unspecified(padding)|
Example Layout: Non-null int32 Array
[1,2,3,4,8] has two possible layouts:
*Length:5,Nullcount:0*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00011111|0(padding)|*ValueBuffer:|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-19|Bytes20-63||-------------|-------------|-------------|-------------|-------------|-----------------------||1|2|3|4|8|unspecified(padding)|
or with the bitmap elided:
*Length5,Nullcount:0*Validitybitmapbuffer:Notrequired*ValueBuffer:|Bytes0-3|Bytes4-7|Bytes8-11|bytes12-15|bytes16-19|Bytes20-63||-------------|-------------|-------------|-------------|-------------|-----------------------||1|2|3|4|8|unspecified(padding)|
Variable-size Binary Layout#
Each value in this layout consists of 0 or more bytes. While primitivearrays have a single values buffer, variable-size binary have anoffsets buffer anddata buffer.
The offsets buffer containslength+1 signed integers (either32-bit or 64-bit, depending on the data type), which encode thestart position of each slot in the data buffer. The length of thevalue in each slot is computed using the difference between the offsetat that slot’s index and the subsequent offset. For example, theposition and length of slot j is computed as:
slot_position=offsets[j]slot_length=offsets[j+1]-offsets[j]//(for0<=j<length)
It should be noted that a null value may have a positive slot length.That is, a null value may occupy anon-empty memory space in the databuffer. When this is true, the content of the corresponding memory spaceis undefined.
Offsets must be monotonically increasing, that isoffsets[j+1]>=offsets[j]for0<=j<length, even for null slots. This property ensures thelocation for all values is valid and well defined.
Generally the first slot in the offsets array is 0, and the last slotis the length of the values array. When serializing this layout, werecommend normalizing the offsets to start at 0.
Example Layout: ``VarBinary``
['joe',null,null,'mark']
will be represented as follows:
*Length:4,Nullcount:2*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00001001|0(padding)|*Offsetsbuffer:|Bytes0-19|Bytes20-63||----------------|-----------------------||0,3,3,3,7|unspecified(padding)|*Valuebuffer:|Bytes0-6|Bytes7-63||----------------|-----------------------||joemark|unspecified(padding)|
Variable-size Binary View Layout#
Note
New in Arrow Columnar Format 1.4
Each value in this layout consists of 0 or more bytes. These bytes’locations are indicated using aviews buffer, which may point to oneof potentially severaldata buffers or may contain the charactersinline.
The views buffer containslength view structures with the following layout:
*Shortstrings,length<=12|Bytes0-3|Bytes4-15||------------|---------------------------------------||length|data(paddedwith0)|*Longstrings,length>12|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15||------------|------------|------------|-------------||length|prefix|buf.index|offset|
In both the long and short string cases, the first four bytes encode thelength of the string and can be used to determine how the rest of the viewshould be interpreted.
In the short string case the string’s bytes are inlined — stored inside theview itself, in the twelve bytes which follow the length. Any remaining bytesafter the string itself are padded with0.
In the long string case, a buffer index indicates which data bufferstores the data bytes and an offset indicates where in that buffer thedata bytes begin. Buffer index 0 refers to the first data buffer, IEthe first bufferafter the validity buffer and the views buffer.The half-open range[offset,offset+length) must be entirely containedwithin the indicated buffer. A copy of the first four bytes of the string isstored inline in the prefix, after the length. This prefix enables aprofitable fast path for string comparisons, which are frequently determinedwithin the first four bytes.
All integers (length, buffer index, and offset) are signed.
This layout is adapted from TU Munich’sUmbraDB.
Note that this layout uses one additional buffer to store the variadic bufferlengths in theArrow C data interface.
Variable-size List Layout#
List is a nested type which is semantically similar to variable-sizebinary. There are two list layout variations — “list” and “list-view” —and each variation can be delimited by either 32-bit or 64-bit offsetsintegers.
List Layout#
The List layout is defined by two buffers, a validity bitmap and an offsetsbuffer, and a child array. The offsets are the same as in thevariable-size binary case, and both 32-bit and 64-bit signed integeroffsets are supported options for the offsets. Rather than referencingan additional data buffer, instead these offsets reference the childarray.
Similar to the layout of variable-size binary, a null value maycorrespond to anon-empty segment in the child array. When this istrue, the content of the corresponding segment can be arbitrary.
A list type is specified likeList<T>, whereT is any type(primitive or nested). In these examples we use 32-bit offsets wherethe 64-bit offset version would be denoted byLargeList<T>.
Example Layout: ``List<Int8>`` Array
We illustrate an example ofList<Int8> with length 4 having values:
[[12,-7,25],null,[0,-127,127,50],[]]
will have the following representation:
*Length:4,Nullcount:1*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00001101|0(padding)|*Offsetsbuffer(int32)|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-19|Bytes20-63||------------|-------------|-------------|-------------|-------------|-----------------------||0|3|3|7|7|unspecified(padding)|*Valuesarray(Int8Array):*Length:7,Nullcount:0*Validitybitmapbuffer:Notrequired*Valuesbuffer(int8)|Bytes0-6|Bytes7-63||------------------------------|-----------------------||12,-7,25,0,-127,127,50|unspecified(padding)|
Example Layout: ``List<List<Int8>>``
[[[1,2],[3,4]],[[5,6,7],null,[8]],[[9,10]]]
will be represented as follows:
* Length 3* Nulls count: 0* Validity bitmap buffer: Not required* Offsets buffer (int32) | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | |------------|------------|------------|-------------|-----------------------| | 0 | 2 | 5 | 6 | unspecified (padding) |* Values array (`List<Int8>`) * Length: 6, Null count: 1 * Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-------------| | 00110111 | 0 (padding) | * Offsets buffer (int32) | Bytes 0-27 | Bytes 28-63 | |----------------------|-----------------------| | 0, 2, 4, 7, 7, 8, 10 | unspecified (padding) | * Values array (Int8): * Length: 10, Null count: 0 * Validity bitmap buffer: Not required | Bytes 0-9 | Bytes 10-63 | |-------------------------------|-----------------------| | 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | unspecified (padding) |
ListView Layout#
Note
New in Arrow Columnar Format 1.4
The ListView layout is defined by three buffers: a validity bitmap, an offsetsbuffer, and an additional sizes buffer. Sizes and offsets have the identical bitwidth and both 32-bit and 64-bit signed integer options are supported.
As in the List layout, the offsets encode the start position of each slot in thechild array. In contrast to the List layout, list lengths are stored explicitlyin the sizes buffer instead of inferred. This allows offsets to be out of order.Elements of the child array do not have to be stored in the same order theylogically appear in the list elements of the parent array.
Every list-view value, including null values, has to guarantee the followinginvariants:
0<=offsets[i]<=lengthofthechildarray0<=offsets[i]+size[i]<=lengthofthechildarray
A list-view type is specified likeListView<T>, whereT is any type(primitive or nested). In these examples we use 32-bit offsets and sizes wherethe 64-bit version would be denoted byLargeListView<T>.
Example Layout: ``ListView<Int8>`` Array
We illustrate an example ofListView<Int8> with length 4 having values:
[[12,-7,25],null,[0,-127,127,50],[]]
It may have the following representation:
*Length:4,Nullcount:1*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00001101|0(padding)|*Offsetsbuffer(int32)|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-63||------------|-------------|-------------|-------------|-----------------------||0|7|3|0|unspecified(padding)|*Sizesbuffer(int32)|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-63||------------|-------------|-------------|-------------|-----------------------||3|0|4|0|unspecified(padding)|*Valuesarray(Int8Array):*Length:7,Nullcount:0*Validitybitmapbuffer:Notrequired*Valuesbuffer(int8)|Bytes0-6|Bytes7-63||------------------------------|-----------------------||12,-7,25,0,-127,127,50|unspecified(padding)|
Example Layout: ``ListView<Int8>`` Array
We continue with theListView<Int8> type, but this instance illustrates outof order offsets and sharing of child array values. It is an array with length 5having logical values:
[[12,-7,25],null,[0,-127,127,50],[],[50,12]]
It may have the following representation:
*Length:5,Nullcount:1*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00011101|0(padding)|*Offsetsbuffer(int32)|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-19|Bytes20-63||------------|-------------|-------------|-------------|-------------|-----------------------||4|7|0|0|3|unspecified(padding)|*Sizesbuffer(int32)|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-19|Bytes20-63||------------|-------------|-------------|-------------|-------------|-----------------------||3|0|4|0|2|unspecified(padding)|*Valuesarray(Int8Array):*Length:7,Nullcount:0*Validitybitmapbuffer:Notrequired*Valuesbuffer(int8)|Bytes0-6|Bytes7-63||------------------------------|-----------------------||0,-127,127,50,12,-7,25|unspecified(padding)|
Fixed-Size List Layout#
Fixed-Size List is a nested type in which each array slot contains afixed-size sequence of values all having the same type.
A fixed size list type is specified likeFixedSizeList<T>[N],whereT is any type (primitive or nested) andN is a 32-bitsigned integer representing the length of the lists.
A fixed size list array is represented by a values array, which is achild array of type T. T may also be a nested type. The value in slotj of a fixed size list array is stored in anN-long slice ofthe values array, starting at an offset ofj*N.
Example Layout: ``FixedSizeList<byte>[4]`` Array
Here we illustrateFixedSizeList<byte>[4].
For an array of length 4 with respective values:
[[192,168,0,12],null,[192,168,0,25],[192,168,0,1]]
will have the following representation:
*Length:4,Nullcount:1*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00001101|0(padding)|*Valuesarray(bytearray):*Length:16,Nullcount:0*validitybitmapbuffer:Notrequired|Bytes0-3|Bytes4-7|Bytes8-15||-----------------|-------------|---------------------------------||192,168,0,12|unspecified|192,168,0,25,192,168,0,1|
Struct Layout#
A struct is a nested type parameterized by an ordered sequence oftypes (which can all be distinct), called its fields. Each field musthave a UTF8-encoded name, and these field names are part of the typemetadata.
Physically, a struct array has one child array for each field. Thechild arrays are independent and need not be adjacent to each other inmemory. A struct array also has a validity bitmap to encode top-levelvalidity information.
For example, the struct (field names shown here as strings for illustrationpurposes):
Struct<name:VarBinaryage:Int32>
has two child arrays, oneVarBinary array (using variable-size binarylayout) and one 4-byte primitive value array havingInt32 logicaltype.
Example Layout: ``Struct<VarBinary, Int32>``
The layout for[{'joe',1},{null,2},null,{'mark',4}], havingchild arrays['joe',null,'alice','mark'] and[1,2,null,4]would be:
* Length: 4, Null count: 1* Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00001011 | 0 (padding) |* Children arrays: * field-0 array (`VarBinary`): * Length: 4, Null count: 1 * Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00001101 | 0 (padding) | * Offsets buffer: | Bytes 0-19 | Bytes 20-63 | |----------------|-----------------------| | 0, 3, 3, 8, 12 | unspecified (padding) | * Value buffer: | Bytes 0-11 | Bytes 12-63 | |----------------|-----------------------| | joealicemark | unspecified (padding) | * field-1 array (int32 array): * Length: 4, Null count: 1 * Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00001011 | 0 (padding) | * Value Buffer: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 | |-------------|-------------|-------------|-------------|-----------------------| | 1 | 2 | unspecified | 4 | unspecified (padding) |
Struct Validity#
A struct array has its own validity bitmap that is independent of itschild arrays’ validity bitmaps. The validity bitmap for the structarray might indicate a null when one or more of its child arrays hasa non-null value in its corresponding slot; or conversely, a childarray might indicate a null in its validity bitmap while the struct array’svalidity bitmap shows a non-null value.
Therefore, to know whether a particular child entry is valid, one musttake the logical AND of the corresponding bits in the two validity bitmaps(the struct array’s and the child array’s).
This is illustrated in the example above, one of the child arrays has avalid entry'alice' for the null struct but it is “hidden” by thestruct array’s validity bitmap. However, when treated independently,corresponding entries of the children array will be non-null.
Union Layout#
A union is defined by an ordered sequence of types; each slot in theunion can have a value chosen from these types. The types are namedlike a struct’s fields, and the names are part of the type metadata.
Unlike other data types, unions do not have their own validity bitmap. Instead,the nullness of each slot is determined exclusively by the child arrays whichare composed to create the union.
We define two distinct union types, “dense” and “sparse”, that areoptimized for different use cases.
Dense Union#
Dense union represents a mixed-type array with 5 bytes of overhead foreach value. Its physical layout is as follows:
One child array for each type
Types buffer: A buffer of 8-bit signed integers. Each type in theunion has a corresponding type id whose values are found in thisbuffer. A union with more than 127 possible types can be modeled asa union of unions.
Offsets buffer: A buffer of signed Int32 values indicating therelative offset into the respective child array for the type in agiven slot. The respective offsets for each child value array mustbe in order / increasing.
Example Layout: ``DenseUnion<f: Float32, i: Int32>``
For the union array:
[{f=1.2},null,{f=3.4},{i=5}]
will have the following layout:
*Length:4,Nullcount:0*Typesbuffer:|Byte0|Byte1|Byte2|Byte3|Bytes4-63||----------|-------------|----------|----------|-----------------------||0|0|0|1|unspecified(padding)|*Offsetbuffer:|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-15|Bytes16-63||-----------|-------------|------------|-------------|-----------------------||0|1|2|0|unspecified(padding)|*Childrenarrays:*Field-0array(f:Float32):*Length:3,Nullcount:1*Validitybitmapbuffer:00000101*ValueBuffer:|Bytes0-11|Bytes12-63||----------------|-----------------------||1.2,null,3.4|unspecified(padding)|*Field-1array(i:Int32):*Length:1,Nullcount:0*Validitybitmapbuffer:Notrequired*ValueBuffer:|Bytes0-3|Bytes4-63||-----------|-----------------------||5|unspecified(padding)|
Sparse Union#
A sparse union has the same structure as a dense union, with the omission ofthe offsets array. In this case, the child arrays are each equal in length tothe length of the union.
While a sparse union may use significantly more space compared with adense union, it has some advantages that may be desirable in certainuse cases:
A sparse union is more amenable to vectorized expression evaluation in some use cases.
Equal-length arrays can be interpreted as a union by only defining the types array.
Example layout: ``SparseUnion<i: Int32, f: Float32, s: VarBinary>``
For the union array:
[{i=5},{f=1.2},{s='joe'},{f=3.4},{i=4},{s='mark'}]
will have the following layout:
* Length: 6, Null count: 0* Types buffer: | Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Bytes 6-63 | |------------|-------------|-------------|-------------|-------------|--------------|-----------------------| | 0 | 1 | 2 | 1 | 0 | 2 | unspecified (padding) |* Children arrays: * i (Int32): * Length: 6, Null count: 4 * Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00010001 | 0 (padding) | * Value buffer: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 | |-------------|-------------|-------------|-------------|-------------|--------------|-----------------------| | 5 | unspecified | unspecified | unspecified | 4 | unspecified | unspecified (padding) | * f (Float32): * Length: 6, Null count: 4 * Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00001010 | 0 (padding) | * Value buffer: | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-63 | |--------------|-------------|-------------|-------------|-------------|-------------|-----------------------| | unspecified | 1.2 | unspecified | 3.4 | unspecified | unspecified | unspecified (padding) | * s (`VarBinary`) * Length: 6, Null count: 4 * Validity bitmap buffer: | Byte 0 (validity bitmap) | Bytes 1-63 | |--------------------------|-----------------------| | 00100100 | 0 (padding) | * Offsets buffer (Int32) | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | Bytes 20-23 | Bytes 24-27 | Bytes 28-63 | |------------|-------------|-------------|-------------|-------------|-------------|-------------|------------------------| | 0 | 0 | 0 | 3 | 3 | 3 | 7 | unspecified (padding) | * Values buffer: | Bytes 0-6 | Bytes 7-63 | |------------|-----------------------| | joemark | unspecified (padding) |
Only the slot in the array corresponding to the type index is considered. All“unselected” values are ignored and could be any semantically correct arrayvalue.
Null Layout#
We provide a simplified memory-efficient layout for the Null data typewhere all values are null. In this case no memory buffers areallocated.
Dictionary-encoded Layout#
Dictionary encoding is a data representation technique to representvalues by integers referencing adictionary usually consisting ofunique values. It can be effective when you have data with manyrepeated values.
Any array can be dictionary-encoded. The dictionary is stored as an optionalproperty of an array. When a field is dictionary encoded, the values arerepresented by an array of non-negative integers representing the index of thevalue in the dictionary. The memory layout for a dictionary-encoded array isthe same as that of a primitive integer layout. The dictionary is handled as aseparate columnar array with its own respective layout.
As an example, you could have the following data:
type:VarBinary['foo','bar','foo','bar',null,'baz']
In dictionary-encoded form, this could appear as:
dataVarBinary(dictionary-encoded)index_type:Int32values:[0,1,0,1,null,2]dictionarytype:VarBinaryvalues:['foo','bar','baz']
Note that a dictionary is permitted to contain duplicate values ornulls:
dataVarBinary(dictionary-encoded)index_type:Int32values:[0,1,3,1,4,2]dictionarytype:VarBinaryvalues:['foo','bar','baz','foo',null]
The null count of such arrays is dictated only by the validity bitmapof its indices, irrespective of any null values in the dictionary.
Since unsigned integers can be more difficult to work with in some cases(e.g. in the JVM), we recommend preferring signed integers over unsignedintegers for representing dictionary indices. Additionally, we recommendavoiding using 64-bit unsigned integer indices unless they are required by anapplication.
We discuss dictionary encoding as it relates to serialization furtherbelow.
Run-End Encoded Layout#
Note
New in Arrow Columnar Format 1.3
Run-end encoding (REE) is a variation of run-length encoding (RLE). Theseencodings are well-suited for representing data containing sequences of thesame value, called runs. In run-end encoding, each run is represented as avalue and an integer giving the index in the array where the run ends.
Any array can be run-end encoded. A run-end encoded array has no buffersby itself, but has two child arrays. The first child array, called the run ends array,holds either 16, 32, or 64-bit signed integers. The actual values of each runare held in the second child array.For the purposes of determining field names and schemas, these child arraysare prescribed the standard names ofrun_ends andvalues respectively.
The values in the first child array represent the accumulated length of all runsfrom the first to the current one, i.e. the logical index where thecurrent run ends. This allows relatively efficient random access from a logicalindex using binary search. The length of an individual run can be determined bysubtracting two adjacent values. (Contrast this with run-length encoding, inwhich the lengths of the runs are represented directly, and in which randomaccess is less efficient.)
Note
Because therun_ends child array cannot have nulls, it’s reasonableto consider why therun_ends are a child array instead of just abuffer, like the offsets for aVariable-size List Layout. Thislayout was considered, but it was decided to use the child arrays.
Child arrays allow us to keep the “logical length” (the decoded length)associated with the parent array and the “physical length” (the numberof run ends) associated with the child arrays. Ifrun_ends was abuffer in the parent array then the size of the buffer would be unrelatedto the length of the array and this would be confusing.
A run must have a length of at least 1. This means the values in therun ends array all are positive and in strictly ascending order. A run end cannot benull.
The REE parent has no validity bitmap, and it’s null count field should always be 0.Null values are encoded as runs with the value null.
As an example, you could have the following data:
type:Float32[1.0,1.0,1.0,1.0,null,null,2.0]
In Run-end-encoded form, this could appear as:
*Length:7,Nullcount:0*ChildArrays:*run_ends(Int32):*Length:3,Nullcount:0(RunEndscannotbenull)*Validitybitmapbuffer:Notrequired(ifitexists,itshouldbeall1s)*Valuesbuffer|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-63||-------------|-------------|-------------|-----------------------||4|6|7|unspecified(padding)|*values(Float32):*Length:3,Nullcount:1*Validitybitmapbuffer:|Byte0(validitybitmap)|Bytes1-63||--------------------------|-----------------------||00000101|0(padding)|*Valuesbuffer|Bytes0-3|Bytes4-7|Bytes8-11|Bytes12-63||-------------|-------------|-------------|-----------------------||1.0|unspecified|2.0|unspecified(padding)|
Buffer Listing for Each Layout#
For the avoidance of ambiguity, we provide listing the order and typeof memory buffers for each layout.
Layout Type | Buffer 0 | Buffer 1 | Buffer 2 | Variadic Buffers |
|---|---|---|---|---|
Primitive | validity | data | ||
Variable Binary | validity | offsets | data | |
Variable Binary View | validity | views | data | |
List | validity | offsets | ||
List View | validity | offsets | sizes | |
Fixed-size List | validity | |||
Struct | validity | |||
Sparse Union | type ids | |||
Dense Union | type ids | offsets | ||
Null | ||||
Dictionary-encoded | validity | data (indices) | ||
Run-end encoded |
Serialization and Interprocess Communication (IPC)#
The primitive unit of serialized data in the columnar format is the“record batch”. Semantically, a record batch is an ordered collectionof arrays, known as itsfields, each having the same length as oneanother but potentially different data types. A record batch’s fieldnames and types collectively form the batch’sschema.
In this section we define a protocol for serializing record batchesinto a stream of binary payloads and reconstructing record batchesfrom these payloads without need for memory copying.
The columnar IPC protocol utilizes a one-way stream of binary messagesof these types:
Schema
RecordBatch
DictionaryBatch
We specify a so-calledencapsulated IPC message format whichincludes a serialized Flatbuffer type along with an optional messagebody. We define this message format before describing how to serializeeach constituent IPC message type.
Encapsulated message format#
For simple streaming and file-based serialization, we define a“encapsulated” message format for interprocess communication. Suchmessages can be “deserialized” into in-memory Arrow array objects byexamining only the message metadata without any need to copy or moveany of the actual data.
The encapsulated binary message format is as follows:
A 32-bit continuation indicator. The value
0xFFFFFFFFindicatesa valid message. This component was introduced in version 0.15.0 inpart to address the 8-byte alignment requirement of FlatbuffersA 32-bit little-endian length prefix indicating the metadata size
The message metadata as using the
Messagetype defined inMessage.fbsPadding bytes to an 8-byte boundary
The message body, whose length must be a multiple of 8 bytes
Schematically, we have:
<continuation:0xFFFFFFFF><metadata_size:int32><metadata_flatbuffer:bytes><padding><messagebody>
The complete serialized message must be a multiple of 8 bytes so that messagescan be relocated between streams. Otherwise the amount of padding between themetadata and the message body could be non-deterministic.
Themetadata_size includes the size of theMessage pluspadding. Themetadata_flatbuffer contains a serializedMessageFlatbuffer value, which internally includes:
A version number
A particular message value (one of
Schema,RecordBatch, orDictionaryBatch)The size of the message body
A
custom_metadatafield for any application-supplied metadata
When read from an input stream, generally theMessage metadata isinitially parsed and validated to obtain the body size. Then the bodycan be read.
Schema message#
The Flatbuffers filesSchema.fbs contains the definitions for allbuilt-in data types and theSchema metadata type which representsthe schema of a given record batch. A schema consists of an orderedsequence of fields, each having a name and type. A serializedSchemadoes not contain any data buffers, only type metadata.
TheField Flatbuffers type contains the metadata for a singlearray. This includes:
The field’s name
The field’s data type
Whether the field is semantically nullable. While this has nobearing on the array’s physical layout, many systems distinguishnullable and non-nullable fields and we want to allow them topreserve this metadata to enable faithful schema round trips.
A collection of child
Fieldvalues, for nested typesA
dictionaryproperty indicating whether the field isdictionary-encoded or not. If it is, a dictionary “id” is assignedto allow matching a subsequent dictionary IPC message with theappropriate field.
We additionally provide both schema-level and field-levelcustom_metadata attributes allowing for systems to insert theirown application defined metadata to customize behavior.
RecordBatch message#
A RecordBatch message contains the actual data buffers correspondingto the physical memory layout determined by a schema. The metadata forthis message provides the location and size of each buffer, permittingArray data structures to be reconstructed using pointer arithmetic andthus no memory copying.
The serialized form of the record batch is the following:
The
dataheader, defined as theRecordBatchtype inMessage.fbs.The
body, a flat sequence of memory buffers written end-to-endwith appropriate padding to ensure a minimum of 8-byte alignment
The data header contains the following:
The length and null count for each flattened field in the recordbatch
The memory offset and length of each constituent
Bufferin therecord batch’s body
Fields and buffers are flattened by a pre-order depth-first traversalof the fields in the record batch. For example, let’s consider theschema
col1:Struct<a:Int32,b:List<item:Int64>,c:Float64>col2:Utf8
The flattened version of this is:
FieldNode0:Structname='col1'FieldNode1:Int32name='a'FieldNode2:Listname='b'FieldNode3:Int64name='item'FieldNode4:Float64name='c'FieldNode5:Utf8name='col2'
For the buffers produced, we would have the following (refer to thetable above):
buffer0:field0validitybuffer1:field1validitybuffer2:field1valuesbuffer3:field2validitybuffer4:field2offsetsbuffer5:field3validitybuffer6:field3valuesbuffer7:field4validitybuffer8:field4valuesbuffer9:field5validitybuffer10:field5offsetsbuffer11:field5data
TheBuffer Flatbuffers value describes the location and size of apiece of memory. Generally these are interpreted relative to theencapsulated message format defined below.
Thesize field ofBuffer is not required to account for paddingbytes. Since this metadata can be used to communicate in-memory pointeraddresses between libraries, it is recommended to setsize to the actualmemory size rather than the padded size.
Variadic buffers#
Note
New in Arrow Columnar Format 1.4
Some types such as Utf8View are represented using a variable number of buffers.For each such Field in the pre-ordered flattened logical schema, there will bean entry invariadicBufferCounts to indicate the number of variadic bufferswhich belong to that Field in the current RecordBatch.
For example, consider the schema
col1:Struct<a:Int32,b:BinaryView,c:Float64>col2:Utf8View
This has two fields with variadic buffers, sovariadicBufferCounts willhave two entries in each RecordBatch. For a RecordBatch of this schema withvariadicBufferCounts=[3,2], the flattened buffers would be:
buffer0:col1validitybuffer1:col1.avaliditybuffer2:col1.avaluesbuffer3:col1.bvaliditybuffer4:col1.bviewsbuffer5:col1.bdatabuffer6:col1.bdatabuffer7:col1.bdatabuffer8:col1.cvaliditybuffer9:col1.cvaluesbuffer10:col2validitybuffer11:col2viewsbuffer12:col2databuffer13:col2data
Compression#
There are three different options for compression of record batchbody buffers: Buffers can be uncompressed, buffers can becompressed with thelz4 compression codec, or buffers can becompressed with thezstd compression codec. Buffers in theflat sequence of a message body must be compressed separately usingthe same codec. Specific buffers in the sequence of compressedbuffers may be left uncompressed (for example if compressing thosespecific buffers would not appreciably reduce their size).
The compression type used is defined in thedataheaderof theRecordBatch message in the optionalcompressionfield with the default being uncompressed.
Note
lz4 compression codec means theLZ4 frame formatand should not to be confused with“raw” (also called “block”) format.
The difference between compressed and uncompressed buffers in theserialized form is as follows:
If the buffers in theRecordBatch message arecompressed
the
dataheaderincludes the length and memory offsetof eachcompressed buffer in the record batch’s body togetherwith the compression typethe
bodyincludes a flat sequence ofcompressed bufferstogether with thelength of the uncompressed buffer as a 64-bitlittle-endian signed integer stored in the first 8 bytes of eachbuffer in the sequence. This uncompressed length can be set to-1to indicatethat that specific buffer is left uncompressed.
If the buffers in theRecordBatch message areuncompressed
the
dataheaderincludes the length and memory offsetof eachuncompressed buffer in the record batch’s bodythe
bodyincludes a flat sequence ofuncompressed buffers.
Note
Some Arrow implementations lack support for producing and consumingIPC data with compressed buffers using one or either of the codecslisted above. SeeImplementation Status for details.
Some applications might apply compression in the protocol they useto store or transport Arrow IPC data. (For example, an HTTP servermight serve gzip-compressed Arrow IPC streams.) Applications thatalready use compression in their storage or transport protocolsshould avoid using buffer compression. Double compression typicallyworsens performance and does not substantially improve compressionratios.
Byte Order (Endianness)#
The Arrow format is little endian by default.
Serialized Schema metadata has an endianness field indicatingendianness of RecordBatches. Typically this is the endianness of thesystem where the RecordBatch was generated. The main use case isexchanging RecordBatches between systems with the same Endianness. Atfirst we will return an error when trying to read a Schema with anendianness that does not match the underlying system. The referenceimplementation is focused on Little Endian and provides tests forit. Eventually we may provide automatic conversion via byte swapping.
IPC Streaming Format#
We provide a streaming protocol or “format” for record batches. It ispresented as a sequence of encapsulated messages, each of whichfollows the format above. The schema comes first in the stream, and itis the same for all of the record batches that follow. If any fieldsin the schema are dictionary-encoded, one or moreDictionaryBatchmessages will be included.DictionaryBatch andRecordBatchmessages may be interleaved, but before any dictionary key is used inaRecordBatch it should be defined in aDictionaryBatch.
<SCHEMA><DICTIONARY0>...<DICTIONARYk-1><RECORDBATCH0>...<DICTIONARYxDELTA>...<DICTIONARYyDELTA>...<RECORDBATCHn-1><EOS[optional]:0xFFFFFFFF0x00000000>
Note
An edge-case for interleaved dictionary and record batches occurswhen the record batches contain dictionary encoded arrays that arecompletely null. In this case, the dictionary for the encoded column mightappear after the first record batch.
When a stream reader implementation is reading a stream, after eachmessage, it may read the next 8 bytes to determine both if the streamcontinues and the size of the message metadata that follows. Once themessage flatbuffer is read, you can then read the message body.
The stream writer can signal end-of-stream (EOS) either by writing 8 bytescontaining the 4-byte continuation indicator (0xFFFFFFFF) followed by 0metadata length (0x00000000) or closing the stream interface. Werecommend the “.arrows” file extension for the streaming format althoughin many cases these streams will not ever be stored as files.
IPC File Format#
We define a “file format” supporting random access that is an extension ofthe stream format. The file starts and ends with a magic stringARROW1(plus padding). What follows in the file is identical to the stream format.At the end of the file, we write afooter containing a redundant copy ofthe schema (which is a part of the streaming format) plus memory offsets andsizes for each of the data blocks in the file. This enables random access toany record batch in the file. SeeFile.fbs for the precise details of thefile footer.
Schematically we have:
<magicnumber"ARROW1"><emptypaddingbytes[to8byteboundary]><STREAMINGFORMATwithEOS><FOOTER><FOOTERSIZE:int32><magicnumber"ARROW1">
In the file format, there is no requirement that dictionary keysshould be defined in aDictionaryBatch before they are used in aRecordBatch, as long as the keys are defined somewhere in thefile. Further more, it is invalid to have more than onenon-deltadictionary batch per dictionary ID (i.e. dictionary replacement is notsupported). Delta dictionaries are applied in the order they appear inthe file footer. We recommend the “.arrow” extension for files created withthis format. Note that files created with this format are sometimes called“Feather V2” or with the “.feather” extension, the name and the extensionderived from “Feather (V1)”, which was a proof of concept early inthe Arrow project for language-agnostic fast data frame storage forPython (pandas) and R.
Dictionary Messages#
Dictionaries are written in the stream and file formats as a sequence of recordbatches, each having a single field. The complete semantic schema for asequence of record batches, therefore, consists of the schema along with all ofthe dictionaries. The dictionary types are found in the schema, so it isnecessary to read the schema to first determine the dictionary types so thatthe dictionaries can be properly interpreted:
tableDictionaryBatch{id:long;data:RecordBatch;isDelta:boolean=false;}
The dictionaryid in the message metadata can be referenced one or more timesin the schema, so that dictionaries can even be used for multiple fields. SeetheDictionary-encoded Layout section for more about the semantics ofdictionary-encoded data.
The dictionaryisDelta flag allows existing dictionaries to beexpanded for future record batch materializations. A dictionary batchwithisDelta set indicates that its vector should be concatenatedwith those of any previous batches with the sameid. In a streamwhich encodes one column, the list of strings["A","B","C","B","D","C","E","A"], with a delta dictionary batch could take theform:
<SCHEMA><DICTIONARY0>(0)"A"(1)"B"(2)"C"<RECORDBATCH0>0121<DICTIONARY0DELTA>(3)"D"(4)"E"<RECORDBATCH1>3240EOS
Alternatively, ifisDelta is set to false, then the dictionaryreplaces the existing dictionary for the same ID. Using the sameexample as above, an alternate encoding could be:
<SCHEMA><DICTIONARY0>(0)"A"(1)"B"(2)"C"<RECORDBATCH0>0121<DICTIONARY0>(0)"A"(1)"C"(2)"D"(3)"E"<RECORDBATCH1>2130EOS
Custom Application Metadata#
We provide acustom_metadata field at three levels to provide amechanism for developers to pass application-specific metadata inArrow protocol messages. This includesField,Schema, andMessage.
The colon symbol: is to be used as a namespace separator. It canbe used multiple times in a key.
TheARROW pattern is a reserved namespace for internal Arrow usein thecustom_metadata fields. For example,ARROW:extension:name.
Extension Types#
User-defined “extension” types can be defined setting certainKeyValue pairs incustom_metadata in theField metadatastructure. These extension keys are:
'ARROW:extension:name'for the string name identifying thecustom data type. We recommend that you use a “namespace”-styleprefix for extension type names to minimize the possibility ofconflicts with multiple Arrow readers and writers in the sameapplication. For example, usemyorg.name_of_typeinstead ofsimplyname_of_type'ARROW:extension:metadata'for a serialized representationof theExtensionTypenecessary to reconstruct the custom type
Note
Extension names beginning witharrow. are reserved forcanonical extension types,they should not be used for third-party extension types.
This extension metadata can annotate any of the built-in Arrow logicaltypes. For example, Arrow specifies a canonical extension type thatrepresents a UUID as aFixedSizeBinary(16). Arrow implementations arenot required to support canonical extensions, so an implementation thatdoes not support this UUID type will simply interpret it as aFixedSizeBinary(16) and pass along thecustom_metadata insubsequent Arrow protocol messages.
Extension types may or may not use the'ARROW:extension:metadata' field. Let’s consider some exampleextension types:
uuidrepresented asFixedSizeBinary(16)with empty metadatalatitude-longituderepresented asstruct<latitude:double,longitude:double>, and empty metadatatensor(multidimensional array) stored asBinaryvalues andhaving serialized metadata indicating the data type and shape ofeach value. This could be JSON like{'type':'int8','shape':[4,5]}for a 4x5 cell tensor.trading-timerepresented asTimestampwith serializedmetadata indicating the market trading calendar the data correspondsto
See also
Implementation guidelines#
An execution engine (or framework, or UDF executor, or storage engine,etc) can implement only a subset of the Arrow spec and/or extend itgiven the following constraints:
Implementing a subset of the spec#
If only producing (and not consuming) arrow vectors: Any subsetof the vector spec and the corresponding metadata can be implemented.
If consuming and producing vectors: There is a minimal subset ofvectors to be supported. Production of a subset of vectors andtheir corresponding metadata is always fine. Consumption of vectorsshould at least convert the unsupported input vectors to thesupported subset (for example Timestamp.millis to timestamp.microsor int32 to int64).
Extensibility#
An execution engine implementor can also extend their memoryrepresentation with their own vectors internally as long as they arenever exposed. Before sending data to another system expecting Arrowdata, these custom vectors should be converted to a type that exist inthe Arrow spec.

