Developing Arrow C++ Compute

This section provides information for developers of the Arrow C++ Compute module.

Row Table

The row table in Arrow represents data stored in row-major format. This format is particularly useful for scenarios involving random access to individual rows and where all columns are frequently accessed together. It is especially advantageous for hash-table keys and facilitates efficient operations such as grouping and hash joins by optimizing memory access patterns and data locality.

Metadata

A row table is defined by its metadata, RowTableMetadata, which includes information about its schema, alignment, and derived properties.

The schema specifies the types and order of columns. Each row in the row table contains the data for each column in that logical order (the physical order may vary; see Row Encoding for details).

Note

Columns of nested types or large binary types are not supported in the row table.

One important property derived from the schema is whether the row table is fixed-length or varying-length. A fixed-length row table contains only fixed-length columns, while a varying-length row table includes at least one varying-length column. This distinction determines how data is stored and accessed in the row table.

Each row in the row table is aligned to RowTableMetadata::row_alignment bytes. Fixed-length columns with non-power-of-2 lengths are also aligned to RowTableMetadata::row_alignment bytes. Varying-length columns are aligned to RowTableMetadata::string_alignment bytes.
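The alignment rules above amount to rounding an offset up to the next multiple of the alignment. The following is a minimal sketch of that arithmetic, not Arrow's actual implementation: the helper names PaddingForAlignment and AlignUp are hypothetical, and the alignment is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not part of Arrow's API): number of padding bytes
// needed to advance `offset` to the next multiple of `alignment`.
// Assumption: `alignment` is a positive power of two.
inline int64_t PaddingForAlignment(int64_t offset, int64_t alignment) {
  assert(alignment > 0 && (alignment & (alignment - 1)) == 0);
  return (alignment - (offset & (alignment - 1))) & (alignment - 1);
}

// Hypothetical helper: `offset` rounded up to a multiple of `alignment`.
inline int64_t AlignUp(int64_t offset, int64_t alignment) {
  return offset + PaddingForAlignment(offset, alignment);
}
```

With an 8-byte alignment, for instance, a value ending at byte 21 is followed by 3 padding bytes so that the next value starts at byte 24.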

Buffer Layout

Similar to most Arrow Arrays, the row table consists of three buffers:

  • Null Masks Buffer: Indicates null values for each column in each row.

  • Fixed-length Buffer: Stores row data for fixed-length tables or offsets to varying-length data for varying-length tables.

  • Varying-length Buffer (Optional): Contains row data for varying-length tables; unused for fixed-length tables.

Row Format

Null Masks

For each row, a contiguous sequence of bits represents whether each column in that row is null. Each bit corresponds to a specific column, with 1 indicating the value is null and 0 indicating the value is valid. Note that this is the opposite of how the validity bitmap works for Arrays. The null mask for a row occupies RowTableMetadata::null_masks_bytes_per_row bytes.
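A null check against this layout can be sketched as follows. This is an illustration only: IsNull is a hypothetical helper, the masks of consecutive rows are assumed to be stored back to back, and bits are assumed to be numbered least-significant-bit first within each byte (as in Arrow validity bitmaps).

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not Arrow's API): is column `col` of row `row` null?
// Assumptions: each row's mask occupies `bytes_per_row` bytes, masks are
// stored contiguously, and bits are numbered LSB-first within a byte.
inline bool IsNull(const std::vector<uint8_t>& null_masks,
                   int64_t bytes_per_row, int64_t row, int col) {
  assert(row >= 0 && col >= 0 && col < bytes_per_row * 8);
  assert((row + 1) * bytes_per_row <= static_cast<int64_t>(null_masks.size()));
  const uint8_t* row_mask = null_masks.data() + row * bytes_per_row;
  // A set bit means null, the opposite of an Array validity bitmap.
  return ((row_mask[col / 8] >> (col % 8)) & 1) != 0;
}
```

Because a set bit means null here, the result must be inverted before it is compared against a validity bit taken from an Array.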

Fixed-length Row Data

In a fixed-length row table, row data is directly stored in the fixed-length buffer. All columns in each row are stored sequentially. Notably, a boolean column is special because, in a normal Arrow Array, it is stored using 1 bit, whereas in a row table, it occupies 1 byte. The varying-length buffer is not used in this case.

For example, a row table with the schema (int32, boolean) and rows [[7, false], [8, true], [9, false], ...] is stored in the fixed-length buffer as follows:

Row 0                   Row 1                   Row 2
7 0 0 0, 0 (padding)    8 0 0 0, 1 (padding)    9 0 0 0, 0 (padding)
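The layout in this example can be reproduced with a short sketch. EncodeFixedRows is a hypothetical function written for illustration, not Arrow's encoder; it assumes little-endian byte order for the int32 column, zero-filled padding, and a power-of-two row alignment supplied by the caller.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical encoder for the schema (int32, boolean); not Arrow's API.
// Assumptions: little-endian int32, zero-filled padding, and a
// power-of-two `row_alignment`.
inline std::vector<uint8_t> EncodeFixedRows(
    const std::vector<std::pair<int32_t, bool>>& rows, int64_t row_alignment) {
  assert(row_alignment > 0 && (row_alignment & (row_alignment - 1)) == 0);
  const int64_t unpadded = 4 + 1;  // int32 plus boolean stored as one byte
  const size_t row_width = static_cast<size_t>(
      (unpadded + row_alignment - 1) & ~(row_alignment - 1));
  std::vector<uint8_t> buffer(rows.size() * row_width, 0);
  for (size_t i = 0; i < rows.size(); ++i) {
    uint8_t* out = buffer.data() + i * row_width;
    const uint32_t v = static_cast<uint32_t>(rows[i].first);
    for (int b = 0; b < 4; ++b) {
      out[b] = static_cast<uint8_t>(v >> (8 * b));  // little-endian bytes
    }
    out[4] = rows[i].second ? 1 : 0;  // boolean occupies a whole byte
  }
  return buffer;
}
```

Calling EncodeFixedRows({{7, false}, {8, true}, {9, false}}, 8) under these assumptions yields 24 bytes, with each row occupying 8 bytes: 4 for the integer, 1 for the boolean, and 3 of padding.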

Offsets for Varying-length Row Data

In a varying-length row table, the fixed-length buffer contains offsets to the varying-length row data, which is stored separately in the optional varying-length buffer. The offsets are of type RowTableMetadata::offset_type (fixed as int64_t) and indicate the starting position of the row data for each row.

Varying-length Row Data

In a varying-length row table, the varying-length buffer contains the actual row data, stored contiguously. The offsets in the fixed-length buffer point to the starting position of each row’s data.
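Locating a row through the offsets can be sketched as below. RowView and GetRow are hypothetical names, the offsets are shown as an already-typed vector of int64_t rather than a raw byte buffer, and the sketch assumes there is one more offset than there are rows, so that a row's length is the difference between adjacent offsets.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical view of one encoded row; not part of Arrow's API.
struct RowView {
  const uint8_t* data;
  int64_t length;
};

// Hypothetical accessor. Assumption: `offsets` has num_rows + 1 entries,
// so row i spans the byte range [offsets[i], offsets[i + 1]).
inline RowView GetRow(const std::vector<int64_t>& offsets,
                      const std::vector<uint8_t>& row_data, int64_t row) {
  assert(row >= 0 && row + 1 < static_cast<int64_t>(offsets.size()));
  const int64_t begin = offsets[row];
  const int64_t end = offsets[row + 1];
  assert(begin <= end && end <= static_cast<int64_t>(row_data.size()));
  return RowView{row_data.data() + begin, end - begin};
}
```

With offsets of 0, 32, 64, and 104, for example, GetRow for row 2 returns a view starting at byte 64 with a length of 40 bytes.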

Row Encoding

A varying-length row is encoded as follows:

  • Fixed-length columns are stored first.

  • A sequence of offsets to each varying-length column follows. Each offset is 32-bit and indicates the end position within the row data of the corresponding varying-length column.

  • Varying-length columns are stored last.
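The three steps above can be sketched for a single row as follows. EncodeVaryingRow and its helpers are hypothetical and simplified: fixed-length columns are restricted to int32, values are written little-endian, padding is zero-filled, and the start of each varying-length column as well as the end of the row are padded to one caller-supplied alignment. Any padding that Arrow may apply inside the fixed-length part is ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helpers for illustration; none of this is Arrow's API.
inline void WriteLE32(std::vector<uint8_t>& out, size_t pos, uint32_t v) {
  for (int b = 0; b < 4; ++b) out[pos + b] = static_cast<uint8_t>(v >> (8 * b));
}

inline void PadTo(std::vector<uint8_t>& out, size_t alignment) {
  while (out.size() % alignment != 0) out.push_back(0);  // zero-filled padding
}

// Encodes one row: int32 fixed-length columns, then one 32-bit end offset per
// varying-length column, then the varying-length values themselves.
inline std::vector<uint8_t> EncodeVaryingRow(
    const std::vector<int32_t>& fixed_cols,
    const std::vector<std::string>& var_cols, size_t alignment) {
  assert(alignment > 0);
  std::vector<uint8_t> row;
  // 1. Fixed-length columns come first.
  for (int32_t v : fixed_cols) {
    row.resize(row.size() + 4);
    WriteLE32(row, row.size() - 4, static_cast<uint32_t>(v));
  }
  // 2. Reserve one 32-bit end offset per varying-length column.
  const size_t offsets_pos = row.size();
  row.resize(row.size() + 4 * var_cols.size(), 0);
  // 3. Varying-length columns come last, each starting at an aligned position.
  for (size_t i = 0; i < var_cols.size(); ++i) {
    PadTo(row, alignment);
    row.insert(row.end(), var_cols[i].begin(), var_cols[i].end());
    // The stored offset is the END position of this column within the row.
    WriteLE32(row, offsets_pos + 4 * i, static_cast<uint32_t>(row.size()));
  }
  PadTo(row, alignment);  // assumed: the row end is padded as well
  return row;
}
```

Under these assumptions, EncodeVaryingRow({7, 0}, {"Alice", "x"}, 8) produces a 32-byte row whose two stored end offsets are 21 and 25.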

For example, a row table with the schema (int32, string, string, int32) and rows [[7, 'Alice', 'x', 0], [8, 'Bob', 'y', 1], [9, 'Charlotte', 'z', 2], ...] is stored as follows (assuming 8-byte alignment for varying-length columns):

Fixed-length buffer (row offsets):

Row 0              Row 1               Row 2               Row 3
0 0 0 0 0 0 0 0    32 0 0 0 0 0 0 0    64 0 0 0 0 0 0 0    104 0 0 0 0 0 0 0

Varying-length buffer (row data):

Row   Fixed-length Cols   Varying-length Offsets   Varying-length Cols
0     7 0 0 0, 0 0 0 0    21 0 0 0, 25 0 0 0       Alice~~~x~~~~~~~
1     8 0 0 0, 1 0 0 0    19 0 0 0, 25 0 0 0       Bob~~~~~y~~~~~~~
2     9 0 0 0, 2 0 0 0    25 0 0 0, 33 0 0 0       Charlotte~~~~~~~z~~~~~~~
3