Basic Arrow Data Structures
Apache Arrow provides fundamental data structures for representing data: Array, ChunkedArray, RecordBatch, and Table. This article shows how to construct these data structures from primitive data types; specifically, we will work with integers of varying size representing days, months, and years. We will use them to create the following data structures:
Arrow Arrays
RecordBatch, from Arrays
Table, from ChunkedArrays
Pre-requisites
Before continuing, make sure you have:
An Arrow installation, which you can set up here: Using Arrow C++ in your own project
Understanding of how to use basic C++ data structures
Understanding of basic C++ data types
Setup
Before trying out Arrow, we need to fill in a couple of gaps:
We need to include necessary headers.
A main() is needed to glue things together.
Includes
First, as ever, we need some includes. We’ll get iostream for output, then import Arrow’s basic functionality from api.h, like so:
#include <arrow/api.h>
#include <iostream>
Main()
Next, we need a main(). A common pattern with Arrow looks like the following:
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
This allows us to easily use Arrow’s error-handling macros, which will return to main() with an arrow::Status object if a failure occurs, and this main() will report the error. Note that this means Arrow never raises exceptions, instead relying upon returning Status. For more on that, read here: Conventions.
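To make the status-returning pattern more concrete, here is a minimal sketch that uses only the standard library. The names `MiniStatus`, `MINI_RETURN_NOT_OK`, `CheckDay`, `CheckMonth`, and `CheckDate` are hypothetical and invented for this illustration; they are not Arrow's types or macros, and Arrow's real implementation is more involved.

```cpp
#include <string>
#include <utility>

// Hypothetical stand-in for a status object: a success flag plus a message.
class MiniStatus {
 public:
  static MiniStatus OK() { return MiniStatus(true, ""); }
  static MiniStatus Invalid(std::string message) {
    return MiniStatus(false, std::move(message));
  }
  bool ok() const { return ok_; }
  const std::string& message() const { return message_; }

 private:
  MiniStatus(bool ok, std::string message)
      : ok_(ok), message_(std::move(message)) {}
  bool ok_;
  std::string message_;
};

// Hypothetical macro: evaluate the expression once and, if it failed,
// return the failure from the enclosing function right away.
#define MINI_RETURN_NOT_OK(expr)         \
  do {                                   \
    MiniStatus mini_status_tmp = (expr); \
    if (!mini_status_tmp.ok()) {         \
      return mini_status_tmp;            \
    }                                    \
  } while (false)

MiniStatus CheckDay(int day) {
  if (day < 1 || day > 31) return MiniStatus::Invalid("day out of range");
  return MiniStatus::OK();
}

MiniStatus CheckMonth(int month) {
  if (month < 1 || month > 12) return MiniStatus::Invalid("month out of range");
  return MiniStatus::OK();
}

// Each step either succeeds or makes CheckDate return early with the failure.
MiniStatus CheckDate(int day, int month) {
  MINI_RETURN_NOT_OK(CheckDay(day));
  MINI_RETURN_NOT_OK(CheckMonth(month));
  return MiniStatus::OK();
}
```

A caller would inspect `ok()` on the returned object, much as the main() above does with the result of RunMain(). The point of the macro is that the happy path reads as a straight sequence of calls, with no exceptions involved.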
To accompany this main(), we have a RunMain() from which any Status objects can return. This is where we’ll write the rest of the program:
arrow::Status RunMain() {
Making an Arrow Array
Building int8 Arrays
Given that we have some data in standard C++ arrays, and want to use Arrow, we need to move the data from said arrays into Arrow arrays. We still guarantee contiguity of memory in an Array, so no worries about a performance loss when using Array vs C++ arrays. The easiest way to construct an Array uses an ArrayBuilder.
The following code initializes an ArrayBuilder for an Array that will hold 8-bit integers. Specifically, it uses the AppendValues() method, present in concrete arrow::ArrayBuilder subclasses, to fill the ArrayBuilder with the contents of a standard C++ array. Note the use of ARROW_RETURN_NOT_OK. If AppendValues() fails, this macro will return to main(), which will print out the meaning of the failure.
// Builders are the main way to create Arrays in Arrow from existing values that are not
// on-disk. In this case, we'll make a simple array, and feed that in.
// Data types are important as ever, and there is a Builder for each compatible type;
// in this case, int8.
arrow::Int8Builder int8builder;
int8_t days_raw[5] = {1, 12, 17, 23, 28};
// AppendValues, as called, puts 5 values from days_raw into our Builder object.
ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
Once an ArrayBuilder has the values we want in our Array, we can use ArrayBuilder::Finish() to output the final structure to an Array. Specifically, we output to a std::shared_ptr<arrow::Array>. Note the use of ARROW_ASSIGN_OR_RAISE in the following code. Finish() outputs an arrow::Result object, which ARROW_ASSIGN_OR_RAISE can process. If the method fails, it will return to main() with a Status that will explain what went wrong. If it succeeds, then it will assign the final output to the left-hand variable.
// We only have a Builder though, not an Array -- the following code pushes out the
// built up data into a proper Array.
std::shared_ptr<arrow::Array> days;
ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());
As soon as an ArrayBuilder has had its Finish method called, its state resets, so it can be used again, as if it were fresh. Thus, we repeat the process above for our second array:
// Builders clear their state every time they fill an Array, so if the type is the same,
// we can re-use the builder. We do that here for month values.
int8_t months_raw[5] = {1, 3, 5, 7, 1};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
std::shared_ptr<arrow::Array> months;
ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());
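The append-then-finish cycle, including the reset that makes a builder reusable, can be pictured with a toy builder. This is only a sketch of the idea under simplifying assumptions: `MiniInt8Builder` is a hypothetical class backed by a `std::vector`, it does no error reporting, and it says nothing about how Arrow's builders manage memory internally.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical toy builder: collects values, then hands them off on Finish().
class MiniInt8Builder {
 public:
  // Copy `count` values from a plain C++ array into the pending buffer.
  void AppendValues(const int8_t* values, std::size_t count) {
    buffer_.insert(buffer_.end(), values, values + count);
  }

  // Number of values appended since the last Finish().
  std::size_t length() const { return buffer_.size(); }

  // Hand the accumulated values to a shared, finished container and reset,
  // so the same builder can start on the next array.
  std::shared_ptr<std::vector<int8_t>> Finish() {
    auto finished = std::make_shared<std::vector<int8_t>>(std::move(buffer_));
    buffer_.clear();  // A moved-from vector is valid; clear() makes it empty for sure.
    return finished;
  }

 private:
  std::vector<int8_t> buffer_;
};
```

After one `Finish()` call the toy builder reports a length of zero again, which is why a second batch of values of the same type can go through the same object without affecting the array produced earlier.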
Building int16 Arrays
An ArrayBuilder has its type specified at the time of declaration. Once this is done, it cannot have its type changed. We have to make a new one when we switch to year data, which requires a 16-bit integer at the minimum. Of course, there’s an ArrayBuilder for that. It uses the exact same methods, but with the new data type:
// Now that we change to int16, we use the Builder for that data type instead.
arrow::Int16Builder int16builder;
int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
std::shared_ptr<arrow::Array> years;
ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());
Now, we have three Arrow Arrays, with some variance in type.
Making a RecordBatch
A columnar data format only really comes into play when you have a table. So, let’s make one. The first kind we’ll make is the RecordBatch. This uses Arrays internally, which means all data will be contiguous within each column, but any appending or concatenating will require copying. Making a RecordBatch has two steps, given existing Arrays:
Defining a Schema
To get started making a RecordBatch, we first need to define characteristics of the columns, each represented by a Field instance. Each Field contains a name and datatype for its associated column; then, a Schema groups them together and sets the order of the columns, like so:
// Now, we want a RecordBatch, which has columns and labels for said columns.
// This gets us to the 2d data structures we want in Arrow.
// These are defined by schema, which have fields -- here we get both those object types
// ready.
std::shared_ptr<arrow::Field> field_day, field_month, field_year;
std::shared_ptr<arrow::Schema> schema;
// Every field needs its name and data type.
field_day = arrow::field("Day", arrow::int8());
field_month = arrow::field("Month", arrow::int8());
field_year = arrow::field("Year", arrow::int16());
// The schema can be built from a vector of fields, and we do so here.
schema = arrow::schema({field_day, field_month, field_year});
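As a rough mental model of what the schema contributes, the sketch below treats it as an ordered list of (name, type) pairs in which a column's position is simply its index in the list. `MiniField` and `MiniSchema` are hypothetical types made up for this illustration, with the data type reduced to a plain string; they are not Arrow's classes.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical field: a column label plus a type name kept as plain text.
struct MiniField {
  std::string name;
  std::string type_name;
};

// Hypothetical schema: the order of the fields is the order of the columns.
class MiniSchema {
 public:
  explicit MiniSchema(std::vector<MiniField> fields)
      : fields_(std::move(fields)) {}

  std::size_t num_fields() const { return fields_.size(); }

  // Bounds-checked access to the field describing column i.
  const MiniField& field(std::size_t i) const { return fields_.at(i); }

  // Position of the first column with the given name, or -1 if there is none.
  int FieldIndex(const std::string& name) const {
    for (std::size_t i = 0; i < fields_.size(); ++i) {
      if (fields_[i].name == name) return static_cast<int>(i);
    }
    return -1;
  }

 private:
  std::vector<MiniField> fields_;
};
```

With fields listed as Day, Month, Year, the toy `FieldIndex("Year")` gives position 2, which mirrors how the order passed to the schema fixes the order in which the column data must later be supplied.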
Building a RecordBatch
With data in Arrays from the previous section, and column descriptions in our Schema from the previous step, we can make the RecordBatch. Note that the length of the columns is necessary, and the length is shared by all columns.
// With the schema and Arrays full of data, we can make our RecordBatch! Here,
// each column is internally contiguous. This is in opposition to Tables, which we'll
// see next.
std::shared_ptr<arrow::RecordBatch> rbatch;
// The RecordBatch needs the schema, length for columns, which all must match,
// and the actual data itself.
rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});
std::cout << rbatch->ToString();
Now, we have our data in a nice tabular form, safely within the RecordBatch. What we can do with this will be discussed in the later tutorials.
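The requirement that every column shares the batch length can be written down as a small check. The sketch below is only an illustration of that invariant: `MiniColumn` and `ColumnsShareLength` are hypothetical names, and nothing here describes what checks Arrow itself performs when a RecordBatch is made.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of one column: its label and how many values it holds.
struct MiniColumn {
  std::string name;
  std::size_t length;
};

// Returns true only when every column holds exactly `num_rows` values.
bool ColumnsShareLength(const std::vector<MiniColumn>& columns,
                        std::size_t num_rows) {
  for (const MiniColumn& column : columns) {
    if (column.length != num_rows) return false;
  }
  return true;
}
```

For the three five-element columns built so far, such a check passes with a row count of 5 and fails as soon as any single column is shorter or longer than the others.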
Making a ChunkedArray
Let’s say that we want an array made up of sub-arrays, because it can be useful for avoiding data copies when concatenating, for parallelizing work, for fitting each chunk into cache, or for exceeding the 2,147,483,647 row limit in a standard Arrow Array. For this, Arrow offers ChunkedArray, which can be made up of individual Arrow Arrays. In this example, we can reuse the arrays we made earlier in part of our chunked array, allowing us to extend them without having to copy data. So, let’s build a few more Arrays, using the same builders for ease of use:
// Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
// Builders.
int8_t days_raw2[5] = {6, 12, 3, 30, 22};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
std::shared_ptr<arrow::Array> days2;
ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

int8_t months_raw2[5] = {5, 4, 11, 3, 2};
ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
std::shared_ptr<arrow::Array> months2;
ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
std::shared_ptr<arrow::Array> years2;
ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());
In order to support an arbitrary number of Arrays in the construction of the ChunkedArray, Arrow supplies ArrayVector. This provides a vector for Arrays, and we’ll use it here to prepare to make a ChunkedArray:
// ChunkedArrays let us have a list of arrays, which aren't contiguous
// with each other. First, we get a vector of arrays.
arrow::ArrayVector day_vecs{days, days2};
In order to leverage Arrow, we do need to take that last step, and move into a ChunkedArray:
// Then, we use that to initialize a ChunkedArray, which can be used with other
// functions in Arrow! This is good, since having a normal vector of arrays wouldn't
// get us far.
std::shared_ptr<arrow::ChunkedArray> day_chunks =
    std::make_shared<arrow::ChunkedArray>(day_vecs);
With a ChunkedArray for our day values, we now just need to repeat the process for the month and year data:
// Repeat for months.
arrow::ArrayVector month_vecs{months, months2};
std::shared_ptr<arrow::ChunkedArray> month_chunks =
    std::make_shared<arrow::ChunkedArray>(month_vecs);

// Repeat for years.
arrow::ArrayVector year_vecs{years, years2};
std::shared_ptr<arrow::ChunkedArray> year_chunks =
    std::make_shared<arrow::ChunkedArray>(year_vecs);
With that, we are left with three ChunkedArrays, varying in type.
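To see why chunking avoids copies, it can help to picture a chunked array as a list of shared pointers to separately allocated pieces, with a logical index resolved by walking the chunks. The sketch below does that with standard containers. `MiniChunk` and `MiniChunkedArray` are hypothetical names for this illustration, and the linear scan is chosen for simplicity; it is not a description of how Arrow resolves indices.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical chunk: a shared, read-only block of contiguous values.
using MiniChunk = std::shared_ptr<const std::vector<int8_t>>;

// Hypothetical chunked array: holds pointers to chunks, never copies their data.
class MiniChunkedArray {
 public:
  explicit MiniChunkedArray(std::vector<MiniChunk> chunks)
      : chunks_(std::move(chunks)) {}

  std::size_t num_chunks() const { return chunks_.size(); }

  // The logical length is the sum of the chunk lengths.
  std::size_t length() const {
    std::size_t total = 0;
    for (const MiniChunk& chunk : chunks_) total += chunk->size();
    return total;
  }

  // Resolve a logical index by skipping whole chunks until it fits in one.
  int8_t Value(std::size_t index) const {
    for (const MiniChunk& chunk : chunks_) {
      if (index < chunk->size()) return (*chunk)[index];
      index -= chunk->size();
    }
    throw std::out_of_range("index past the end of all chunks");
  }

 private:
  std::vector<MiniChunk> chunks_;
};
```

With two five-element chunks, logical index 4 is the last value of the first chunk and index 5 is the first value of the second. Extending the toy array with another chunk only adds one more pointer to the list, and the existing chunks stay where they are.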
Making a Table
One particularly useful thing we can do with the ChunkedArrays from the previous section is creating Tables. Much like a RecordBatch, a Table stores tabular data. However, a Table does not guarantee contiguity, due to being made up of ChunkedArrays. This can be useful for logic, parallelizing work, for fitting chunks into cache, or exceeding the 2,147,483,647 row limit present in Array and, thus, RecordBatch.
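As a small worked example of the row limit mentioned above, the following sketch splits a total row count into chunk lengths that each stay at or below a chosen maximum. `PlanChunkLengths` is a hypothetical helper written for this illustration. It only does the arithmetic and does not touch Arrow at all.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical helper: split `total_rows` into consecutive chunk lengths,
// none larger than `max_chunk_rows`. Returns an empty plan for bad input.
std::vector<int64_t> PlanChunkLengths(int64_t total_rows, int64_t max_chunk_rows) {
  std::vector<int64_t> lengths;
  if (max_chunk_rows <= 0) return lengths;
  int64_t remaining = total_rows;
  while (remaining > 0) {
    const int64_t take = std::min(remaining, max_chunk_rows);
    lengths.push_back(take);
    remaining -= take;
  }
  return lengths;
}
```

For instance, 5,000,000,000 rows with a per-chunk maximum of 2,147,483,647 come out as three chunks: two full ones and a final chunk of 705,032,706 rows.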
If you read the RecordBatch section, you may note that the Table::Make() call in the following code is effectively identical to RecordBatch::Make(); it just happens to put the length of the columns in position 3, and makes a Table. The length here is 10 because each column is made up of two chunks of five values. We re-use the Schema from before, and make our Table:
// A Table is the structure we need for these non-contiguous columns, and keeps them
// all in one place for us so we can use them as if they were normal arrays.
std::shared_ptr<arrow::Table> table;
table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);
std::cout << table->ToString();
Now, we have our data in a nice tabular form, safely within the Table. What we can do with this will be discussed in the later tutorials.
Ending Program
At the end, we just return Status::OK(), so the main() knows that we’re done, and that everything’s okay.
  return arrow::Status::OK();
}
Wrapping Up
With that, you’ve created the fundamental data structures in Arrow, and can proceed to getting them in and out of a program with file I/O in the next article.
Refer to the below for a copy of the complete code:
#include <arrow/api.h>

#include <iostream>

arrow::Status RunMain() {
  // Builders are the main way to create Arrays in Arrow from existing values that are not
  // on-disk. In this case, we'll make a simple array, and feed that in.
  // Data types are important as ever, and there is a Builder for each compatible type;
  // in this case, int8.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  // AppendValues, as called, puts 5 values from days_raw into our Builder object.
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));

  // We only have a Builder though, not an Array -- the following code pushes out the
  // built up data into a proper Array.
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  // Builders clear their state every time they fill an Array, so if the type is the same,
  // we can re-use the builder. We do that here for month values.
  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  // Now that we change to int16, we use the Builder for that data type instead.
  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Now, we want a RecordBatch, which has columns and labels for said columns.
  // This gets us to the 2d data structures we want in Arrow.
  // These are defined by schema, which have fields -- here we get both those object types
  // ready.
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  // Every field needs its name and data type.
  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  // The schema can be built from a vector of fields, and we do so here.
  schema = arrow::schema({field_day, field_month, field_year});

  // With the schema and Arrays full of data, we can make our RecordBatch! Here,
  // each column is internally contiguous. This is in opposition to Tables, which we'll
  // see next.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  // The RecordBatch needs the schema, length for columns, which all must match,
  // and the actual data itself.
  rbatch = arrow::RecordBatch::Make(schema, days->length(), {days, months, years});

  std::cout << rbatch->ToString();

  // Now, let's get some new arrays! It'll be the same datatypes as above, so we re-use
  // Builders.
  int8_t days_raw2[5] = {6, 12, 3, 30, 22};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw2, 5));
  std::shared_ptr<arrow::Array> days2;
  ARROW_ASSIGN_OR_RAISE(days2, int8builder.Finish());

  int8_t months_raw2[5] = {5, 4, 11, 3, 2};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw2, 5));
  std::shared_ptr<arrow::Array> months2;
  ARROW_ASSIGN_OR_RAISE(months2, int8builder.Finish());

  int16_t years_raw2[5] = {1980, 2001, 1915, 2020, 1996};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw2, 5));
  std::shared_ptr<arrow::Array> years2;
  ARROW_ASSIGN_OR_RAISE(years2, int16builder.Finish());

  // ChunkedArrays let us have a list of arrays, which aren't contiguous
  // with each other. First, we get a vector of arrays.
  arrow::ArrayVector day_vecs{days, days2};
  // Then, we use that to initialize a ChunkedArray, which can be used with other
  // functions in Arrow! This is good, since having a normal vector of arrays wouldn't
  // get us far.
  std::shared_ptr<arrow::ChunkedArray> day_chunks =
      std::make_shared<arrow::ChunkedArray>(day_vecs);

  // Repeat for months.
  arrow::ArrayVector month_vecs{months, months2};
  std::shared_ptr<arrow::ChunkedArray> month_chunks =
      std::make_shared<arrow::ChunkedArray>(month_vecs);

  // Repeat for years.
  arrow::ArrayVector year_vecs{years, years2};
  std::shared_ptr<arrow::ChunkedArray> year_chunks =
      std::make_shared<arrow::ChunkedArray>(year_vecs);

  // A Table is the structure we need for these non-contiguous columns, and keeps them
  // all in one place for us so we can use them as if they were normal arrays.
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, {day_chunks, month_chunks, year_chunks}, 10);

  std::cout << table->ToString();

  return arrow::Status::OK();
}

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

