Two-dimensional Datasets

Record Batches

class RecordBatch

Collection of equal-length arrays matching a particular Schema.

A record batch is a table-like data structure that is semantically a sequence of fields, each a contiguous Arrow array.

Public Functions

Result<std::shared_ptr<StructArray>> ToStructArray() const

Convert record batch to struct array.

Create a struct array whose child arrays are the record batch’s columns. Note that the record batch’s top-level field metadata cannot be reflected in the resulting struct array.

Result<std::shared_ptr<Tensor>> ToTensor(bool null_to_nan = false, bool row_major = true, MemoryPool* pool = default_memory_pool()) const

Convert record batch with one data type to Tensor.

Create a Tensor object with shape (number of rows, number of columns). In column-major layout (row_major = false) the strides are (type size in bytes, type size in bytes * number of rows); in row-major layout (the default) they are (type size in bytes * number of columns, type size in bytes).

Parameters:
  • null_to_nan [in] if true, convert nulls to NaN

  • row_major [in] if true, create row-major Tensor else column-major Tensor

  • pool [in] the memory pool to allocate the tensor buffer

Returns:

the resulting Tensor
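
The stride rule for the two layouts can be checked with plain arithmetic. The sketch below does not call Arrow; `DenseStrides` is a hypothetical helper that assumes a dense 2-D buffer with no padding.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Byte strides of a dense 2-D tensor of shape (num_rows, num_cols).
// Hypothetical helper for illustration; plain arithmetic, no Arrow calls.
std::array<int64_t, 2> DenseStrides(int64_t num_rows, int64_t num_cols,
                                    int64_t type_size, bool row_major) {
  assert(num_rows >= 0 && num_cols >= 0 && type_size > 0);
  if (row_major) {
    // Stepping to the next row skips one full row of num_cols values.
    return {type_size * num_cols, type_size};
  }
  // Stepping to the next column skips one full column of num_rows values.
  return {type_size, type_size * num_rows};
}
```

For a batch of 3 rows and 2 columns of 8-byte values, the column-major strides come out as (8, 24) and the row-major strides as (16, 8).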

bool Equals(const RecordBatch& other, bool check_metadata = false, const EqualOptions& opts = EqualOptions::Defaults()) const

Determine if two record batches are equal.

Parameters:
  • other [in] the RecordBatch to compare with

  • check_metadata [in] if true, the schema metadata will be compared, regardless of the value set in EqualOptions::use_metadata

  • opts [in] the options for equality comparisons

Returns:

true if batches are equal

bool Equals(const RecordBatch& other, const EqualOptions& opts) const

Determine if two record batches are equal.

Parameters:
  • other [in] the RecordBatch to compare with

  • opts [in] the options for equality comparisons

Returns:

true if batches are equal

inline bool ApproxEquals(const RecordBatch& other, const EqualOptions& opts = EqualOptions::Defaults()) const

Determine if two record batches are approximately equal.

Parameters:
  • other [in] the RecordBatch to compare with

  • opts [in] the options for equality comparisons

Returns:

true if batches are approximately equal

inline const std::shared_ptr<Schema>& schema() const
Returns:

the record batch’s schema

Result<std::shared_ptr<RecordBatch>> ReplaceSchema(std::shared_ptr<Schema> schema) const

Replace the schema with another schema with the same types, but potentially different field names and/or metadata.

virtual const std::vector<std::shared_ptr<Array>>& columns() const = 0

Retrieve all columns at once.

virtual std::shared_ptr<Array> column(int i) const = 0

Retrieve an array from the record batch.

Parameters:

i [in] field index, not bounds-checked

Returns:

an Array object

std::shared_ptr<Array> GetColumnByName(const std::string& name) const

Retrieve an array from the record batch.

Parameters:

name [in] field name

Returns:

an Array or null if no field was found

virtual std::shared_ptr<ArrayData> column_data(int i) const = 0

Retrieve an array’s internal data from the record batch.

Parameters:

i [in] field index, not bounds-checked

Returns:

an internal ArrayData object

virtual const ArrayDataVector& column_data() const = 0

Retrieve all arrays’ internal data from the record batch.

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, const std::shared_ptr<Field>& field, const std::shared_ptr<Array>& column) const = 0

Add a column to the record batch, producing a new RecordBatch.

Parameters:
  • i [in] field index, which will be bounds-checked

  • field [in] field to be added

  • column [in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> AddColumn(int i, std::string field_name, const std::shared_ptr<Array>& column) const

Add a new nullable column to the record batch, producing a new RecordBatch.

For non-nullable columns, use the Field-based version of this method.

Parameters:
  • i [in] field index, which will be bounds-checked

  • field_name [in] name of field to be added

  • column [in] column to be added

virtual Result<std::shared_ptr<RecordBatch>> SetColumn(int i, const std::shared_ptr<Field>& field, const std::shared_ptr<Array>& column) const = 0

Replace a column in the record batch, producing a new RecordBatch.

Parameters:
  • i [in] field index, which will be bounds-checked

  • field [in] field to be replaced

  • column [in] column to be replaced

virtual Result<std::shared_ptr<RecordBatch>> RemoveColumn(int i) const = 0

Remove a column from the record batch, producing a new RecordBatch.

Parameters:

i [in] field index, which will be bounds-checked

const std::string& column_name(int i) const

Name of the i-th column.

int num_columns() const
Returns:

the number of columns in the record batch

inline int64_t num_rows() const
Returns:

the number of rows (the corresponding length of each column)

Result<std::shared_ptr<RecordBatch>> CopyTo(const std::shared_ptr<MemoryManager>& to) const

Copy the entire RecordBatch to destination MemoryManager.

This uses Array::CopyTo on each column of the record batch to create a new record batch where all underlying buffers for the columns have been copied to the destination MemoryManager. This uses MemoryManager::CopyBuffer under the hood.

Result<std::shared_ptr<RecordBatch>> ViewOrCopyTo(const std::shared_ptr<MemoryManager>& to) const

View or copy the entire RecordBatch to destination MemoryManager.

This uses Array::ViewOrCopyTo on each column of the record batch to create a new record batch where all underlying buffers for the columns have been zero-copy viewed on the destination MemoryManager, falling back to performing a copy if it can’t be viewed as a zero-copy buffer. This uses Buffer::ViewOrCopy under the hood.

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset) const

Slice each of the arrays in the record batch.

Parameters:

offset [in] the starting offset to slice, through end of batch

Returns:

new record batch

virtual std::shared_ptr<RecordBatch> Slice(int64_t offset, int64_t length) const = 0

Slice each of the arrays in the record batch.

Parameters:
  • offset [in] the starting offset to slice

  • length [in] the number of elements to slice from offset

Returns:

new record batch

std::string ToString() const
Returns:

PrettyPrint representation suitable for debugging

std::vector<std::string> ColumnNames() const

Return names of all columns.

Result<std::shared_ptr<RecordBatch>> RenameColumns(const std::vector<std::string>& names) const

Rename columns with provided names.

Result<std::shared_ptr<RecordBatch>> SelectColumns(const std::vector<int>& indices) const

Return new record batch with specified columns.

virtual Status Validate() const

Perform cheap validation checks to determine obvious inconsistencies within the record batch’s schema and internal data.

This is O(k) where k is the total number of fields and array descendants.

Returns:

Status

virtual Status ValidateFull() const

Perform extensive validation checks to determine inconsistencies within the record batch’s schema and internal data.

This is potentially O(k*n) where n is the number of rows.

Returns:

Status

virtual const std::shared_ptr<Device::SyncEvent>& GetSyncEvent() const = 0

EXPERIMENTAL: Return a top-level sync event object for this record batch.

If all of the data for this record batch is in CPU memory, then this will return null. If the data for this batch is on a device and synchronization is needed before accessing the data, the returned sync event will allow for it.

Returns:

null or a Device::SyncEvent

Result<std::shared_ptr<Array>> MakeStatisticsArray(MemoryPool* pool = default_memory_pool()) const

Create a statistics array of this record batch.

The created array follows the C data interface statistics specification. See https://arrow.apache.org/docs/format/StatisticsSchema.html for details.

Parameters:

pool [in] the memory pool to allocate memory from

Returns:

the statistics array of this record batch

Public Static Functions

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<Array>> columns, std::shared_ptr<Device::SyncEvent> sync_event = NULLPTR)
Parameters:
  • schema [in] The record batch schema

  • num_rows [in] length of fields in the record batch. Each array should have the same length as num_rows

  • columns [in] the record batch fields as vector of arrays

  • sync_event [in] optional synchronization event for non-CPU device memory used by buffers

static std::shared_ptr<RecordBatch> Make(std::shared_ptr<Schema> schema, int64_t num_rows, std::vector<std::shared_ptr<ArrayData>> columns, DeviceAllocationType device_type = DeviceAllocationType::kCPU, std::shared_ptr<Device::SyncEvent> sync_event = NULLPTR)

Construct record batch from vector of internal data structures.

This overload is intended for internal use or for advanced users.

Since

0.5.0

Parameters:
  • schema – the record batch schema

  • num_rows – the number of semantic rows in the record batch. This should be equal to the length of each field

  • columns – the data for the batch’s columns

  • device_type – the type of the device that the Arrow columns are allocated on

  • sync_event – optional synchronization event for non-CPU device memory used by buffers

static Result<std::shared_ptr<RecordBatch>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool* pool = default_memory_pool())

Create an empty RecordBatch of a given schema.

The output RecordBatch will be created with DataTypes from the given schema.

Parameters:
  • schema [in] the schema of the empty RecordBatch

  • pool [in] the memory pool to allocate memory from

Returns:

the resulting RecordBatch

static Result<std::shared_ptr<RecordBatch>> FromStructArray(const std::shared_ptr<Array>& array, MemoryPool* pool = default_memory_pool())

Construct record batch from struct array.

This constructs a record batch using the child arrays of the given array, which must be a struct array.

This operation will usually be zero-copy. However, if the struct array has an offset or a validity bitmap then these will need to be pushed into the child arrays. Pushing the offset is zero-copy but pushing the validity bitmap is not.

Parameters:
  • array [in] the source array, must be a StructArray

  • pool [in] the memory pool to allocate new validity bitmaps
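
Why pushing a validity bitmap cannot be zero-copy is easier to see in a toy model. The sketch below assumes that a child slot remains valid only when both the struct slot and the child slot are valid; bitmaps are modelled as `std::vector<bool>`, and `PushValidityIntoChild` is a hypothetical name, not an Arrow function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model: true means "not null". Assumption: the merged child bitmap is
// the element-wise AND of the struct's bitmap and the child's own bitmap,
// so a fresh bitmap has to be allocated and filled.
std::vector<bool> PushValidityIntoChild(const std::vector<bool>& parent,
                                        const std::vector<bool>& child) {
  assert(parent.size() == child.size());
  std::vector<bool> merged(child.size());
  for (std::size_t i = 0; i < child.size(); ++i) {
    merged[i] = parent[i] && child[i];
  }
  return merged;
}
```

The new bitmap is where the `pool` argument would come into play in this model.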

class RecordBatchReader

Abstract interface for reading a stream of record batches.

Subclassed by arrow::TableBatchReader, arrow::csv::StreamingReader, arrow::flight::sql::example::SqliteStatementBatchReader, arrow::flight::sql::example::SqliteTablesWithSchemaBatchReader, arrow::ipc::RecordBatchStreamReader, arrow::json::StreamingReader

Public Functions

virtual std::shared_ptr<Schema> schema() const = 0
Returns:

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch>* batch) = 0

Read the next record batch in the stream.

Return null for batch when reaching end of stream.

Example:

    while (true) {
      std::shared_ptr<RecordBatch> batch;
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (!batch) {
        break;
      }
      // handling the `batch`, the `batch->num_rows()`
      // might be 0.
    }

Parameters:

batch [out] the next loaded batch, null at end of stream. Returning an empty batch doesn’t mean the end of stream because it is valid data.

Returns:

Status
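
The difference between an empty batch and end of stream can be exercised without Arrow. In the sketch below, `ToyBatch`, `ToyReader` and `Drain` are hypothetical stand-ins that mirror only the null-at-end contract; Status handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-ins for RecordBatch and RecordBatchReader.
struct ToyBatch {
  int64_t num_rows;
};

class ToyReader {
 public:
  explicit ToyReader(std::vector<int64_t> batch_sizes)
      : batch_sizes_(std::move(batch_sizes)) {}

  // Mirrors the ReadNext contract: *batch is null once the stream ends.
  void ReadNext(std::shared_ptr<ToyBatch>* batch) {
    assert(batch != nullptr);
    if (pos_ == batch_sizes_.size()) {
      batch->reset();
      return;
    }
    *batch = std::make_shared<ToyBatch>(ToyBatch{batch_sizes_[pos_++]});
  }

 private:
  std::vector<int64_t> batch_sizes_;
  std::size_t pos_ = 0;
};

// Returns (number of batches, total rows). A zero-row batch is counted as
// a batch; only a null pointer ends the loop.
std::pair<int, int64_t> Drain(ToyReader& reader) {
  int batches = 0;
  int64_t rows = 0;
  while (true) {
    std::shared_ptr<ToyBatch> batch;
    reader.ReadNext(&batch);
    if (!batch) break;
    ++batches;
    rows += batch->num_rows;
  }
  return {batches, rows};
}
```

Draining a toy stream with batch sizes 3, 0 and 2 yields three batches and five rows, so the empty batch in the middle does not stop the loop.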

inline Result<std::shared_ptr<RecordBatch>> Next()

Iterator interface.

inline virtual Status Close()

Finalize the reader.

inline virtual DeviceAllocationType device_type() const

EXPERIMENTAL: Get the device type for record batches this reader produces.

The default implementation returns DeviceAllocationType::kCPU.

inline RecordBatchReaderIterator begin()

Return an iterator to the first record batch in the stream.

inline RecordBatchReaderIterator end()

Return an iterator to the end of the stream.

Result<RecordBatchVector> ToRecordBatches()

Consume entire stream as a vector of record batches.

Result<std::shared_ptr<Table>> ToTable()

Read all batches and concatenate as arrow::Table.

Public Static Functions

static Result<std::shared_ptr<RecordBatchReader>> Make(RecordBatchVector batches, std::shared_ptr<Schema> schema = NULLPTR, DeviceAllocationType device_type = DeviceAllocationType::kCPU)

Create a RecordBatchReader from a vector of RecordBatch.

Parameters:
  • batches [in] the vector of RecordBatch to read from

  • schema [in] schema to conform to. Will be inferred from the first element if not provided.

  • device_type [in] the type of device that the batches are allocated on

static Result<std::shared_ptr<RecordBatchReader>> MakeFromIterator(Iterator<std::shared_ptr<RecordBatch>> batches, std::shared_ptr<Schema> schema, DeviceAllocationType device_type = DeviceAllocationType::kCPU)

Create a RecordBatchReader from an Iterator of RecordBatch.

Parameters:
  • batches [in] an iterator of RecordBatch to read from.

  • schema [in] schema that each record batch in iterator will conform to.

  • device_type [in] the type of device that the batches are allocated on

class RecordBatchReaderIterator
class TableBatchReader : public arrow::RecordBatchReader

Compute a stream of record batches from a (possibly chunked) Table.

The conversion is zero-copy: each record batch is a view over a slice of the table’s columns.

The table is expected to be valid prior to using it with the batch reader.

Public Functions

explicit TableBatchReader(const Table& table)

Construct a TableBatchReader for the given table.

virtual std::shared_ptr<Schema> schema() const override
Returns:

the shared schema of the record batches in the stream

virtual Status ReadNext(std::shared_ptr<RecordBatch>* out) override

Read the next record batch in the stream.

Return null for the batch when reaching end of stream.

Example:

    while (true) {
      std::shared_ptr<RecordBatch> batch;
      ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
      if (!batch) {
        break;
      }
      // handling the `batch`, the `batch->num_rows()`
      // might be 0.
    }

Parameters:

out [out] the next loaded batch, null at end of stream. Returning an empty batch doesn’t mean the end of stream because it is valid data.

Returns:

Status

void set_chunksize(int64_t chunksize)

Set the desired maximum number of rows for record batches.

The actual number of rows in each record batch may be smaller, depending on actual chunking characteristics of each table column.
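
How column chunking can shrink batches below the requested maximum is sketched below in plain C++. It assumes each batch ends at the nearest chunk boundary of any column or at the maximum size, whichever comes first; `BatchLengths` is a hypothetical helper, and the actual reader may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// chunk_lengths[c] lists the chunk lengths of column c. Preconditions of
// this sketch: all columns cover the same number of rows, no empty chunks.
std::vector<int64_t> BatchLengths(
    const std::vector<std::vector<int64_t>>& chunk_lengths,
    int64_t max_chunksize) {
  assert(max_chunksize > 0);
  std::vector<int64_t> out;
  if (chunk_lengths.empty()) return out;
  int64_t total = 0;
  for (int64_t len : chunk_lengths[0]) total += len;

  std::vector<std::size_t> chunk(chunk_lengths.size(), 0);  // current chunk
  std::vector<int64_t> used(chunk_lengths.size(), 0);  // rows consumed in it
  for (int64_t done = 0; done < total;) {
    // Start from the requested maximum, then shrink to the closest
    // chunk boundary among all columns.
    int64_t len = std::min(max_chunksize, total - done);
    for (std::size_t c = 0; c < chunk_lengths.size(); ++c) {
      len = std::min(len, chunk_lengths[c][chunk[c]] - used[c]);
    }
    for (std::size_t c = 0; c < chunk_lengths.size(); ++c) {
      used[c] += len;
      if (used[c] == chunk_lengths[c][chunk[c]]) {
        ++chunk[c];
        used[c] = 0;
      }
    }
    out.push_back(len);
    done += len;
  }
  return out;
}
```

With one column chunked as (4, 2) and another as (3, 3), a maximum of 10 rows still produces batches of 3, 1 and 2 rows under this model, because no batch crosses a chunk boundary.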

Tables

class Table

Logical table as sequence of chunked arrays.

Public Functions

inline const std::shared_ptr<Schema>& schema() const

Return the table schema.

virtual std::shared_ptr<ChunkedArray> column(int i) const = 0

Return a column by index.

virtual const std::vector<std::shared_ptr<ChunkedArray>>& columns() const = 0

Return vector of all columns for table.

inline std::shared_ptr<Field> field(int i) const

Return a column’s field by index.

std::vector<std::shared_ptr<Field>> fields() const

Return vector of all fields for table.

virtual std::shared_ptr<Table> Slice(int64_t offset, int64_t length) const = 0

Construct a zero-copy slice of the table with the indicated offset and length.

Parameters:
  • offset [in] the index of the first row in the constructed slice

  • length [in] the number of rows of the slice. If there are not enough rows in the table, the length will be adjusted accordingly

Returns:

a new object wrapped in std::shared_ptr<Table>
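
The length adjustment amounts to a clamp against the rows remaining after the offset. A minimal sketch, assuming the offset itself lies within the table (treated here as a precondition); `ClampSlice` is a hypothetical helper, not part of the API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Returns the (offset, length) a slice request effectively covers.
// Precondition of this sketch: 0 <= offset <= num_rows and length >= 0.
std::pair<int64_t, int64_t> ClampSlice(int64_t num_rows, int64_t offset,
                                       int64_t length) {
  assert(num_rows >= 0 && length >= 0);
  assert(offset >= 0 && offset <= num_rows);
  // Not enough rows after the offset: shorten the slice to what is left.
  return {offset, std::min(length, num_rows - offset)};
}
```

Requesting 5 rows from offset 7 of a 10-row table gives a slice of 3 rows in this model.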

inline std::shared_ptr<Table> Slice(int64_t offset) const

Slice from first row at offset until end of the table.

inline std::shared_ptr<ChunkedArray> GetColumnByName(const std::string& name) const

Return a column by name.

Parameters:

name [in] field name

Returns:

a ChunkedArray or null if no field was found

virtual Result<std::shared_ptr<Table>> RemoveColumn(int i) const = 0

Remove a column from the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> AddColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0

Add a column to the table, producing a new Table.

virtual Result<std::shared_ptr<Table>> SetColumn(int i, std::shared_ptr<Field> field_arg, std::shared_ptr<ChunkedArray> column) const = 0

Replace a column in the table, producing a new Table.

std::vector<std::string> ColumnNames() const

Return names of all columns.

Result<std::shared_ptr<Table>> RenameColumns(const std::vector<std::string>& names) const

Rename columns with provided names.

Result<std::shared_ptr<Table>> SelectColumns(const std::vector<int>& indices) const

Return new table with specified columns.

virtual std::shared_ptr<Table> ReplaceSchemaMetadata(const std::shared_ptr<const KeyValueMetadata>& metadata) const = 0

Replace schema key-value metadata with new metadata.

Since

0.5.0

Parameters:

metadata [in] new KeyValueMetadata

Returns:

new Table

virtual Result<std::shared_ptr<Table>> Flatten(MemoryPool* pool = default_memory_pool()) const = 0

Flatten the table, producing a new Table.

Any column with a struct type will be flattened into multiple columns.

Parameters:

pool [in] The pool for buffer allocations, if any

std::string ToString() const
Returns:

PrettyPrint representation suitable for debugging

virtual Status Validate() const = 0

Perform cheap validation checks to determine obvious inconsistencies within the table’s schema and internal data.

This is O(k*m) where k is the total number of field descendants, and m is the number of chunks.

Returns:

Status

virtual Status ValidateFull() const = 0

Perform extensive validation checks to determine inconsistencies within the table’s schema and internal data.

This is O(k*n) where k is the total number of field descendants, and n is the number of rows.

Returns:

Status

inline int num_columns() const

Return the number of columns in the table.

inline int64_t num_rows() const

Return the number of rows (equal to each column’s logical length).

bool Equals(const Table& other, const EqualOptions& opts) const

Determine if two tables are equal.

Parameters:
  • other [in] the table to compare with

  • opts [in] the options for equality comparisons

Returns:

true if two tables are equal

inline bool Equals(const Table& other, bool check_metadata = false, const EqualOptions& opts = EqualOptions::Defaults()) const

Determine if two tables are equal.

Parameters:
  • other [in] the table to compare with

  • check_metadata [in] if true, the schema metadata will be compared, regardless of the value set in EqualOptions::use_metadata

  • opts [in] the options for equality comparisons

Returns:

true if two tables are equal

Result<std::shared_ptr<Table>> CombineChunks(MemoryPool* pool = default_memory_pool()) const

Make a new table by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into zero or one chunk.

To avoid buffer overflow, binary columns may be combined into multiple chunks. Chunks will have the maximum possible length.

Parameters:

pool [in] The pool for buffer allocations
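
The "zero or one chunk" outcome can be modelled on plain vectors. This is a toy sketch, not Arrow code: `ToyChunk` and `CombineToyChunks` are hypothetical names, and the multi-chunk case for large binary columns is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using ToyChunk = std::vector<int>;

// Concatenates all chunks of one column. A column without chunks stays at
// zero chunks; otherwise exactly one chunk holding every value is returned.
std::vector<ToyChunk> CombineToyChunks(const std::vector<ToyChunk>& chunks) {
  if (chunks.empty()) return {};
  std::size_t total = 0;
  ToyChunk combined;
  for (const ToyChunk& chunk : chunks) {
    total += chunk.size();
    combined.insert(combined.end(), chunk.begin(), chunk.end());
  }
  assert(combined.size() == total);  // every input value was copied once
  return {combined};
}
```

Unlike slicing, this copies values into a new buffer, which is what the `pool` argument is for.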

Result<std::shared_ptr<RecordBatch>> CombineChunksToBatch(MemoryPool* pool = default_memory_pool()) const

Make a new record batch by combining the chunks this table has.

All the underlying chunks in the ChunkedArray of each column are concatenated into a single chunk.

Parameters:

pool [in] The pool for buffer allocations

Public Static Functions

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, std::vector<std::shared_ptr<ChunkedArray>> columns, int64_t num_rows = -1)

Construct a Table from schema and columns.

If columns is zero-length, the table’s number of rows is zero.

Parameters:
  • schema [in] The table schema (column types)

  • columns [in] The table’s columns as chunked arrays

  • num_rows [in] number of rows in table, -1 (default) to infer from columns
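
The row-count rule can be written out as a small function. The sketch assumes that inference uses the length of the first column; `ResolveNumRows` is a hypothetical helper that works on column lengths only.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// column_lengths[i] is the logical length of column i.
int64_t ResolveNumRows(const std::vector<int64_t>& column_lengths,
                       int64_t num_rows) {
  assert(num_rows >= -1);
  if (num_rows >= 0) return num_rows;    // explicit count supplied
  if (column_lengths.empty()) return 0;  // no columns: zero rows
  return column_lengths.front();         // assumption: infer from column 0
}
```

Passing an explicit `num_rows` matters mainly for tables without columns, where nothing can be inferred.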

static std::shared_ptr<Table> Make(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<Array>>& arrays, int64_t num_rows = -1)

Construct a Table from schema and arrays.

Parameters:
  • schema [in] The table schema (column types)

  • arrays [in] The table’s columns as arrays

  • num_rows [in] number of rows in table, -1 (default) to infer from columns

static Result<std::shared_ptr<Table>> MakeEmpty(std::shared_ptr<Schema> schema, MemoryPool* pool = default_memory_pool())

Create an empty Table of a given schema.

The output Table will be created with a single empty chunk per column.

Parameters:
  • schema [in] the schema of the empty Table

  • pool [in] the memory pool to allocate memory from

Returns:

the resulting Table

static Result<std::shared_ptr<Table>> FromRecordBatchReader(RecordBatchReader* reader)

Construct a Table from a RecordBatchReader.

Parameters:

reader [in] the arrow::RecordBatchReader that produces batches

static Result<std::shared_ptr<Table>> FromRecordBatches(const std::vector<std::shared_ptr<RecordBatch>>& batches)

Construct a Table from RecordBatches, using schema supplied by the first RecordBatch.

Parameters:

batches [in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromRecordBatches(std::shared_ptr<Schema> schema, const std::vector<std::shared_ptr<RecordBatch>>& batches)

Construct a Table from RecordBatches, using supplied schema.

There may be zero record batches.

Parameters:
  • schema [in] the arrow::Schema for each batch

  • batches [in] a std::vector of record batches

static Result<std::shared_ptr<Table>> FromChunkedStructArray(const std::shared_ptr<ChunkedArray>& array)

Construct a Table from a chunked StructArray.

One column will be produced for each field of the StructArray.

Parameters:

array [in] a chunked StructArray

Result<std::shared_ptr<Table>> arrow::ConcatenateTables(const std::vector<std::shared_ptr<Table>>& tables, ConcatenateTablesOptions options = ConcatenateTablesOptions::Defaults(), MemoryPool* memory_pool = default_memory_pool())

Construct a new table from multiple input tables.

The new table is assembled from existing column chunks without copying, if schemas are identical. If schemas do not match exactly and unify_schemas is enabled in options (off by default), an attempt is made to unify them, and then column chunks are converted to their respective unified datatype, which will probably incur a copy. arrow::PromoteTableToSchema is used to unify schemas.

Tables are concatenated in the order they are provided, and the order of rows within tables will be preserved.

Parameters:
  • tables [in] a std::vector of Tables to be concatenated

  • options [in] specify how to unify schema of input tables

  • memory_pool [in] MemoryPool to be used if null-filled arrays need to be created or if existing column chunks need to undergo type conversion

Returns:

new Table
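
Concatenation without copying can be pictured as appending chunk pointers. The toy sketch below models one column per table with shared pointers to chunks; `ToyColumn` and `ConcatenateToyColumns` are hypothetical names, and schema unification is not modelled.

```cpp
#include <cassert>
#include <memory>
#include <vector>

using ToyChunkPtr = std::shared_ptr<std::vector<int>>;
using ToyColumn = std::vector<ToyChunkPtr>;  // one chunked column

// Appends the chunk lists in input order. Only pointers are copied, so the
// result shares every chunk with its inputs.
ToyColumn ConcatenateToyColumns(const std::vector<ToyColumn>& columns) {
  ToyColumn out;
  for (const ToyColumn& column : columns) {
    for (const ToyChunkPtr& chunk : column) {
      assert(chunk != nullptr);
      out.push_back(chunk);
    }
  }
  return out;
}
```

Because chunks are appended in input order, row order within and across the inputs is preserved in this model.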

Result<std::shared_ptr<Table>> arrow::PromoteTableToSchema(const std::shared_ptr<Table>& table, const std::shared_ptr<Schema>& schema, MemoryPool* pool = default_memory_pool())

Promotes a table to conform to the given schema.

If a field in the schema does not have a corresponding column in the table, a column of nulls will be added to the resulting table. If the corresponding column is of type Null, it will be promoted to the type specified by schema, with null values filled. The column will be cast to the type specified by the schema.

Returns an error:

  • if the corresponding column’s type is not compatible with the schema.

  • if there is a column in the table that does not exist in the schema.

  • if the cast fails or casting would be required but is not available.

Parameters:
  • table [in] the input Table

  • schema [in] the target schema to promote to

  • pool [in] The memory pool to be used if null-filled arrays need to be created.
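
The promotion rules can be modelled as a per-field decision. The sketch below works on (name, type name) string pairs rather than Arrow types, assumes field names are unique, and treats every type mismatch other than Null as an error, so casting is not modelled; `ToyField` and `PlanPromotion` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using ToyField = std::pair<std::string, std::string>;  // (name, type name)

// Fills *plan with one action per schema field and returns true, or
// returns false where the rules above would report an error.
bool PlanPromotion(const std::vector<ToyField>& table,
                   const std::vector<ToyField>& schema,
                   std::vector<std::string>* plan) {
  assert(plan != nullptr);
  plan->clear();
  const std::map<std::string, std::string> table_types(table.begin(),
                                                       table.end());
  std::size_t matched = 0;
  for (const ToyField& field : schema) {
    const auto it = table_types.find(field.first);
    if (it == table_types.end()) {
      plan->push_back("add nulls");  // no corresponding column
    } else if (it->second == field.second) {
      plan->push_back("keep");
      ++matched;
    } else if (it->second == "null") {
      plan->push_back("promote nulls");  // Null column takes the new type
      ++matched;
    } else {
      return false;  // incompatible type in this simplified model
    }
  }
  // A table column that the schema does not mention is an error.
  return matched == table_types.size();
}
```

For a table with columns a: int64 and b: null and a schema of a: int64, b: utf8, c: float64, the plan is "keep", "promote nulls", "add nulls".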

Result<std::shared_ptr<Table>> arrow::PromoteTableToSchema(const std::shared_ptr<Table>& table, const std::shared_ptr<Schema>& schema, const compute::CastOptions& options, MemoryPool* pool = default_memory_pool())

Promotes a table to conform to the given schema.

If a field in the schema does not have a corresponding column in the table, a column of nulls will be added to the resulting table. If the corresponding column is of type Null, it will be promoted to the type specified by schema, with null values filled. The column will be cast to the type specified by the schema.

Returns an error:

  • if the corresponding column’s type is not compatible with the schema.

  • if there is a column in the table that does not exist in the schema.

  • if the cast fails or casting would be required but is not available.

Parameters:
  • table [in] the input Table

  • schema [in] the target schema to promote to

  • options [in] The cast options to allow promotion of types

  • pool [in] The memory pool to be used if null-filled arrays need to be created.
