Arrays

The central type in Arrow is the class arrow::Array. An array represents a known-length sequence of values all having the same type. Internally, those values are represented by one or several buffers, the number and meaning of which depend on the array’s data type, as documented in the Arrow data layout specification.

Those buffers consist of the value data itself and an optional bitmap buffer that indicates which array entries are null values. The bitmap buffer can be entirely omitted if the array is known to have zero null values.

There are concrete subclasses of arrow::Array for each data type that help you access individual values of the array.

Building an array

Available strategies

As Arrow objects are immutable, they cannot be populated directly like, for example, a std::vector. Instead, several strategies can be used:

  • if the data already exists in memory with the right layout, you can wrap said memory inside arrow::Buffer instances and then construct an arrow::ArrayData describing the array;

  • otherwise, the arrow::ArrayBuilder base class and its concrete subclasses help build up array data incrementally, without having to deal with details of the Arrow format yourself.

Note

For cases where performance isn’t important, such as examples or tests, you may prefer to use the *FromJSONString helpers, which can create Arrays using a JSON text shorthand. See FromJSONString Helpers.

Using ArrayBuilder and its subclasses

To build an Int64 Arrow array, we can use the arrow::Int64Builder class. In the following example, we build an array of the range 1 to 8 where the element that should hold the value 4 is nulled:

    arrow::Int64Builder builder;
    builder.Append(1);
    builder.Append(2);
    builder.Append(3);
    builder.AppendNull();
    builder.Append(5);
    builder.Append(6);
    builder.Append(7);
    builder.Append(8);

    auto maybe_array = builder.Finish();
    if (!maybe_array.ok()) {
      // ... do something on array building failure
    }
    std::shared_ptr<arrow::Array> array = *maybe_array;

The resulting Array (which can be cast to the concrete arrow::Int64Array subclass if you want to access its values) then consists of two arrow::Buffers. The first buffer holds the null bitmap, which consists here of a single byte with the bits 1|1|1|1|0|1|1|1. As we use least-significant bit (LSB) numbering, this indicates that the fourth entry in the array is null. The second buffer is simply an int64_t array containing all the above values. As the fourth entry is null, the value at that position in the buffer is undefined.

Here is how you could access the concrete array’s contents:

    // Cast the Array to its actual type to access its data
    auto int64_array = std::static_pointer_cast<arrow::Int64Array>(array);

    // Get the pointer to the null bitmap
    const uint8_t* null_bitmap = int64_array->null_bitmap_data();

    // Get the pointer to the actual data
    const int64_t* data = int64_array->raw_values();

    // Alternatively, given an array index, query its null bit and value directly
    int64_t index = 2;
    if (!int64_array->IsNull(index)) {
      int64_t value = int64_array->Value(index);
    }
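To make the LSB numbering described above concrete, here is a minimal standard-library sketch of how a single validity bit could be read from a raw bitmap pointer. The helper name BitIsSet and the constant kExampleBitmap are hypothetical names introduced for illustration and are not part of the Arrow API. The sketch also assumes a non-null bitmap pointer and an array with no slice offset.

```cpp
#include <cassert>
#include <cstdint>

// Returns true if entry `i` is marked valid (non-null) in a validity bitmap
// that uses least-significant bit (LSB) numbering: entry i is stored in byte
// i / 8, at bit position i % 8 counted from the least-significant bit.
// NOTE: `BitIsSet` is a hypothetical helper, not an Arrow function.
inline bool BitIsSet(const uint8_t* bitmap, int64_t i) {
  assert(bitmap != nullptr && i >= 0);
  return ((bitmap[i / 8] >> (i % 8)) & 1) != 0;
}

// The single bitmap byte from the example above, written most-significant
// bit first as 1|1|1|1|0|1|1|1, which is 0xF7.
constexpr uint8_t kExampleBitmap[] = {0xF7};
```

With this byte, BitIsSet(kExampleBitmap, 3) is false (the fourth entry is null) while every other index from 0 to 7 yields true. In real code the bitmap pointer may be null when the array has no nulls, so querying IsNull() as shown earlier is usually the simpler choice.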

Note

arrow::Int64Array (respectively arrow::Int64Builder) is just a typedef, provided for convenience, of arrow::NumericArray<Int64Type> (respectively arrow::NumericBuilder<Int64Type>).

Performance

While it is possible to build an array value-by-value as in the example above, to attain the highest performance it is recommended to use the bulk appending methods (usually named AppendValues) in the concrete arrow::ArrayBuilder subclasses.

If you know the number of elements in advance, it is also recommended to presize the working area by calling the Resize() or Reserve() methods.

Here is how one could rewrite the above example to take advantage of those APIs:

    arrow::Int64Builder builder;
    // Make place for 8 values in total
    builder.Reserve(8);
    // Bulk append the given values (with a null in 4th place as indicated by the
    // validity vector)
    std::vector<bool> validity = {true, true, true, false, true, true, true, true};
    std::vector<int64_t> values = {1, 2, 3, 0, 5, 6, 7, 8};
    builder.AppendValues(values, validity);

    auto maybe_array = builder.Finish();

If you still must append values one by one, some concrete builder subclasses have methods marked “Unsafe” that assume the working area has been correctly presized, and offer higher performance in exchange:

    arrow::Int64Builder builder;
    // Make place for 8 values in total
    builder.Reserve(8);
    builder.UnsafeAppend(1);
    builder.UnsafeAppend(2);
    builder.UnsafeAppend(3);
    builder.UnsafeAppendNull();
    builder.UnsafeAppend(5);
    builder.UnsafeAppend(6);
    builder.UnsafeAppend(7);
    builder.UnsafeAppend(8);

    auto maybe_array = builder.Finish();

Size Limitations and Recommendations

Some array types are structurally limited to 32-bit sizes. This is the case for list arrays (which can hold up to 2^31 elements), string arrays and binary arrays (which can hold up to 2GB of binary data), at least. Some other array types can hold up to 2^63 elements in the C++ implementation, but other Arrow implementations can have a 32-bit size limitation for those array types as well.

For these reasons, it is recommended that huge data be chunked in subsets of more reasonable size.
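As a rough illustration of this recommendation, the following standard-library sketch computes how a long sequence could be divided into consecutive ranges of bounded length. The helper name ChunkRanges is hypothetical, and the maximum chunk length is an assumption left to the caller; Arrow does not prescribe a particular value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Splits `total_length` values into consecutive (offset, length) ranges, each
// holding at most `max_chunk_length` values.
// NOTE: `ChunkRanges` is a hypothetical helper, not an Arrow function.
inline std::vector<std::pair<int64_t, int64_t>> ChunkRanges(
    int64_t total_length, int64_t max_chunk_length) {
  assert(total_length >= 0 && max_chunk_length > 0);
  std::vector<std::pair<int64_t, int64_t>> ranges;
  int64_t offset = 0;
  while (offset < total_length) {
    // The last range may be shorter than the others.
    const int64_t length = std::min(max_chunk_length, total_length - offset);
    ranges.emplace_back(offset, length);
    offset += length;
  }
  return ranges;
}
```

For example, ChunkRanges(10, 4) yields the ranges (0, 4), (4, 4) and (8, 2). Each range could then drive a separate builder pass, with the resulting arrays collected into a chunked array.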

Chunked Arrays

An arrow::ChunkedArray is, like an array, a logical sequence of values; but unlike a simple array, a chunked array does not require the entire sequence to be physically contiguous in memory. Also, the constituents of a chunked array need not have the same size, but they must all have the same data type.

A chunked array is constructed by aggregating any number of arrays. Here we’ll build a chunked array with the same logical values as in the example above, but in two separate chunks:

    std::vector<std::shared_ptr<arrow::Array>> chunks;
    std::shared_ptr<arrow::Array> array;

    // Build first chunk
    arrow::Int64Builder builder;
    builder.Append(1);
    builder.Append(2);
    builder.Append(3);
    if (!builder.Finish(&array).ok()) {
      // ... do something on array building failure
    }
    chunks.push_back(std::move(array));

    // Build second chunk
    builder.Reset();
    builder.AppendNull();
    builder.Append(5);
    builder.Append(6);
    builder.Append(7);
    builder.Append(8);
    if (!builder.Finish(&array).ok()) {
      // ... do something on array building failure
    }
    chunks.push_back(std::move(array));

    auto chunked_array = std::make_shared<arrow::ChunkedArray>(std::move(chunks));

    assert(chunked_array->num_chunks() == 2);
    // Logical length in number of values
    assert(chunked_array->length() == 8);
    assert(chunked_array->null_count() == 1);
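Because the chunks are not contiguous, a logical position in a chunked array corresponds to a chunk plus a position inside that chunk. The standard-library sketch below shows one simple way this mapping could be computed from the chunk lengths alone. The helper name LocateInChunks is hypothetical, and the linear scan is chosen for clarity; it is not a description of how Arrow resolves indices internally.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Maps a logical index to (chunk index, index within that chunk), given the
// length of each chunk in order. Returns (-1, -1) if the index is out of range.
// NOTE: `LocateInChunks` is a hypothetical helper, not an Arrow function.
inline std::pair<int, int64_t> LocateInChunks(
    const std::vector<int64_t>& chunk_lengths, int64_t logical_index) {
  assert(logical_index >= 0);
  for (int c = 0; c < static_cast<int>(chunk_lengths.size()); ++c) {
    if (logical_index < chunk_lengths[c]) {
      return {c, logical_index};
    }
    // Skip this chunk and make the index relative to the next one.
    logical_index -= chunk_lengths[c];
  }
  return {-1, -1};
}
```

With the chunk lengths 3 and 5 from the example above, logical index 3 maps to chunk 1, position 0, which is where the null value was appended.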

Slicing

As with physical memory buffers, it is possible to make zero-copy slices of arrays and chunked arrays, to obtain an array or chunked array referring to some logical subsequence of the data. This is done by calling the arrow::Array::Slice() and arrow::ChunkedArray::Slice() methods, respectively.
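The idea behind a zero-copy slice can be sketched with a small standard-library model: the slice shares the underlying storage and only records a different offset and length. The type Int64View and the function MakeView below are hypothetical and greatly simplified (for instance, they ignore the validity bitmap); they illustrate the concept rather than Arrow's actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// A simplified stand-in for an array: shared, immutable storage plus an
// (offset, length) window onto it.
// NOTE: `Int64View` is a hypothetical type, not part of Arrow.
struct Int64View {
  std::shared_ptr<const std::vector<int64_t>> storage;
  int64_t offset;
  int64_t length;

  // Reads the i-th value of this view (relative to the view's own offset).
  int64_t Value(int64_t i) const {
    assert(i >= 0 && i < length);
    return (*storage)[static_cast<std::size_t>(offset + i)];
  }

  // Zero-copy: the result shares `storage`; only offset and length change.
  Int64View Slice(int64_t slice_offset, int64_t slice_length) const {
    assert(slice_offset >= 0 && slice_length >= 0 &&
           slice_offset + slice_length <= length);
    return Int64View{storage, offset + slice_offset, slice_length};
  }
};

// Takes ownership of `values` and exposes all of them through a view.
inline Int64View MakeView(std::vector<int64_t> values) {
  auto storage =
      std::make_shared<const std::vector<int64_t>>(std::move(values));
  const int64_t length = static_cast<int64_t>(storage->size());
  return Int64View{storage, 0, length};
}
```

Slicing a view over the values 1 to 8 with Slice(2, 3) gives a three-element view whose first value is 3, and both views point at the same storage. Slices of slices compose by adding offsets, so no data is copied at any step.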

FromJSONString Helpers

A set of helper functions is provided for concisely creating Arrays and Scalars from JSON text. These helpers are intended to be used in examples, tests, or for quick prototyping and are not intended to be used where performance matters. Most users will want to use the API described in Reading JSON files, which provides a performant way to create arrow::Table and arrow::RecordBatch objects from line-separated JSON files.

Examples for ArrayFromJSONString, ChunkedArrayFromJSONString, and DictArrayFromJSONString are shown below:

    // Simple types
    ARROW_ASSIGN_OR_RAISE(auto int32_array,
                          ArrayFromJSONString(arrow::int32(), "[1, 2, 3]"));
    ARROW_ASSIGN_OR_RAISE(auto float64_array,
                          ArrayFromJSONString(arrow::float64(), "[4.0, 5.0, 6.0]"));
    ARROW_ASSIGN_OR_RAISE(auto bool_array,
                          ArrayFromJSONString(arrow::boolean(), "[true, false, true]"));
    ARROW_ASSIGN_OR_RAISE(
        auto string_array,
        ArrayFromJSONString(arrow::utf8(), R"(["Hello", "World", null])"));

    // Timestamps can be created from string representations
    ARROW_ASSIGN_OR_RAISE(
        auto ts_array,
        ArrayFromJSONString(
            timestamp(arrow::TimeUnit::SECOND),
            R"(["1970-01-01", "2000-02-29","3989-07-14","1900-02-28"])"));

    // List, Map, Struct
    ARROW_ASSIGN_OR_RAISE(
        auto list_array,
        ArrayFromJSONString(list(arrow::int64()),
                            "[[null], [], null, [4, 5, 6, 7, 8], [2, 3]]"));
    ARROW_ASSIGN_OR_RAISE(
        auto map_array,
        ArrayFromJSONString(
            map(arrow::utf8(), arrow::int32()),
            R"([[["joe", 0], ["mark", null]], null, [["cap", 8]], []])"));
    ARROW_ASSIGN_OR_RAISE(
        auto struct_array,
        ArrayFromJSONString(
            arrow::struct_({field("one", arrow::int32()), field("two", arrow::int32())}),
            "[[11, 22], null, [null, 33]]"));

    // ChunkedArrayFromJSONString
    ARROW_ASSIGN_OR_RAISE(
        auto chunked_array,
        ChunkedArrayFromJSONString(arrow::int32(), {"[5, 10]", "[null]", "[16]"}));

    // DictArrayFromJSONString
    ARROW_ASSIGN_OR_RAISE(
        auto dict_array,
        DictArrayFromJSONString(dictionary(arrow::int32(), arrow::utf8()),
                                "[0, 1, 0, 2, 0, 3]", R"(["k1", "k2", "k3", "k4"])"));

Please see the FromJSONString API listing for the complete set of helpers.