Arrow Compute#
Apache Arrow provides compute functions to facilitate efficient andportable data processing. In this article, you will use Arrow’s computefunctionality to:
Calculate a sum over a column
Calculate element-wise sums over two columns
Search for a value in a column
Pre-requisites#
Before continuing, make sure you have:
An Arrow installation, which you can set up here:Using Arrow C++ in your own project.If you’re compiling Arrow yourself, be sure you compile with the compute moduleenabled (i.e.,
-DARROW_COMPUTE=ON), seeOptional Components.An understanding of basic Arrow data structures fromBasic Arrow Data Structures
Setup#
Before running some computations, we need to fill in a couple gaps:
We need to include necessary headers.
A
main()is needed to glue things together.We need data to play with.
Includes#
Before writing C++ code, we need some includes. We’ll getiostream for output, then import Arrow’scompute functionality:
#include<arrow/api.h>#include<arrow/compute/api.h>#include<iostream>
Main()#
For our glue, we’ll use themain() pattern from the previous tutorial ondata structures:
intmain(){arrow::Statusst=RunMain();if(!st.ok()){std::cerr<<st<<std::endl;return1;}return0;}
Which, like when we used it before, is paired with aRunMain():
arrow::StatusRunMain(){
Generating Tables for Computation#
Before we begin, we’ll initialize aTable with two columns to play with. We’ll usethe method fromBasic Arrow Data Structures, so look backthere if anything’s confusing:
// Create a couple 32-bit integer arrays.arrow::Int32Builderint32builder;int32_tsome_nums_raw[5]={34,624,2223,5654,4356};ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw,5));std::shared_ptr<arrow::Array>some_nums;ARROW_ASSIGN_OR_RAISE(some_nums,int32builder.Finish());int32_tmore_nums_raw[5]={75342,23,64,17,736};ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw,5));std::shared_ptr<arrow::Array>more_nums;ARROW_ASSIGN_OR_RAISE(more_nums,int32builder.Finish());// Make a table out of our pair of arrays.std::shared_ptr<arrow::Field>field_a,field_b;std::shared_ptr<arrow::Schema>schema;field_a=arrow::field("A",arrow::int32());field_b=arrow::field("B",arrow::int32());schema=arrow::schema({field_a,field_b});// Initialize the compute module to register the required compute kernels.ARROW_RETURN_NOT_OK(arrow::compute::Initialize());std::shared_ptr<arrow::Table>table;table=arrow::Table::Make(schema,{some_nums,more_nums},5);
Calculating a Sum over an Array#
Using a computation function has two general steps, which we separatehere:
Preparing a
Datumfor outputCalling
compute::Sum(), a convenience function for summation over anArrayRetrieving and printing output
Prepare Memory for Output with Datum#
When computation is done, we need somewhere for our results to go. InArrow, the object for such output is calledDatum. This object is usedto pass around inputs and outputs in compute functions, and can containmany differently-shaped Arrow data structures. We’ll need it to retrievethe output from compute functions.
// The Datum class is what all compute functions output to, and they can take Datums// as inputs, as well.arrow::Datumsum;
Call Sum()#
Here, we’ll get ourTable, which has columns “A” and “B”, and sum overcolumn “A.” For summation, there is a convenience function, calledcompute::Sum(), which reduces the complexity of the compute interface. We’ll lookat the more complex version for the next computation. For a givenfunction, refer toCompute Functions to see if there is aconvenience function.compute::Sum() takes in a givenArray orChunkedArray– here, we useTable::GetColumnByName() to pass in column A. Then, it outputs toaDatum. Putting that all together, we get this:
// Here, we can use arrow::compute::Sum. This is a convenience function, and the next// computation won't be so simple. However, using these where possible helps// readability.ARROW_ASSIGN_OR_RAISE(sum,arrow::compute::Sum({table->GetColumnByName("A")}));
Get Results from Datum#
The previous step leaves us with aDatum which contains our sum.However, we cannot print it directly – its flexibility in holdingarbitrary Arrow data structures means we have to retrieve our datacarefully. First, to understand what’s in it, we can check which kind ofdata structure it is, then what kind of primitive is being held:
// Get the kind of Datum and what it holds -- this is a Scalar, with int64.std::cout<<"Datum kind: "<<sum.ToString()<<" content type: "<<sum.type()->ToString()<<std::endl;
This should report theDatum stores aScalar with a 64-bit integer. Justto see what the value is, we can print it out like so, which yields12891:
// Note that we explicitly request a scalar -- the Datum cannot simply give what it is,// you must ask for the correct type.std::cout<<sum.scalar_as<arrow::Int64Scalar>().value<<std::endl;
Now we’ve usedcompute::Sum() and gotten what we want out of it!
Calculating Element-Wise Array Addition with CallFunction()#
A next layer of complexity uses whatcompute::Sum() was helpfully hiding:compute::CallFunction(). For this example, we will explore how to use the morerobustcompute::CallFunction() with the “add” compute function. The patternremains similar:
Preparing a Datum for output
Calling
compute::CallFunction()with “add”Retrieving and printing output
Prepare Memory for Output with Datum#
Once more, we’ll need a Datum for any output we get:
arrow::Datumelement_wise_sum;
Use CallFunction() with “add”#
compute::CallFunction() takes the name of the desired function as its firstargument, then the data inputs for said function as a vector in itssecond argument. Right now, we want an element-wise addition betweencolumns “A” and “B”. So, we’ll ask for “add,” pass in columns “A and B”,and output to ourDatum. Put this all together, and we get:
// Get element-wise sum of both columns A and B in our Table. Note that here we use// CallFunction(), which takes the name of the function as the first argument.ARROW_ASSIGN_OR_RAISE(element_wise_sum,arrow::compute::CallFunction("add",{table->GetColumnByName("A"),table->GetColumnByName("B")}));
See also
Available functions for a list of other functions to go withcompute::CallFunction()
Get Results from Datum#
Again, theDatum needs some careful handling. Said handling is mucheasier when we know what’s in it. ThisDatum holds aChunkedArray with32-bit integers, but we can print that to confirm:
// Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32.std::cout<<"Datum kind: "<<element_wise_sum.ToString()<<" content type: "<<element_wise_sum.type()->ToString()<<std::endl;
Since it’s aChunkedArray, we request that from theDatum –ChunkedArrayhas aChunkedArray::ToString() method, so we’ll use that to print out its contents:
// This time, we get a ChunkedArray, not a scalar.std::cout<<element_wise_sum.chunked_array()->ToString()<<std::endl;
The output looks like this:
Datumkind:ChunkedArraycontenttype:int32[[75376,647,2287,5671,5092]]
Now, we’ve usedcompute::CallFunction(), instead of a convenience function! Thisenables a much wider range of available computations.
Searching for a Value with CallFunction() and Options#
One class of computations remains.compute::CallFunction() uses a vector for datainputs, but computation often needs additional arguments to function. Inorder to supply this, computation functions may be associated withstructs where their arguments can be defined. You can check a givenfunction to see which struct it useshere. For this example, we’ll search for a value in column “A” usingthe “index” compute function. This process has three steps, as opposedto the two from before:
Preparing a
Datumfor outputPreparing
compute::IndexOptionsCalling
compute::CallFunction()with “index” andcompute::IndexOptionsRetrieving and printing output
Prepare Memory for Output with Datum#
We’ll need aDatum for any output we get:
// Use an options struct to set up searching for 2223 in column A (the third item).arrow::Datumthird_item;
Configure “index” with IndexOptions#
For this exploration, we’ll use the “index” function – this is asearching method, which returns the index of an input value. In order topass this input value, we require ancompute::IndexOptions struct. So, let’s makethat struct:
// An options struct is used in lieu of passing an arbitrary amount of arguments.arrow::compute::IndexOptionsindex_options;
In a searching function, one requires a target value. Here, we’ll use2223, the third item in column A, and configure our struct accordingly:
// We need an Arrow Scalar, not a raw value.index_options.value=arrow::MakeScalar(2223);
Use CallFunction() with “index” and IndexOptions#
To actually run the function, we usecompute::CallFunction() again, this timepassing our IndexOptions struct by reference as a third argument. Asbefore, the first argument is the name of the function, and the secondour data input:
ARROW_ASSIGN_OR_RAISE(third_item,arrow::compute::CallFunction("index",{table->GetColumnByName("A")},&index_options));
Get Results from Datum#
One last time, let’s see what ourDatum has! This will be aScalar witha 64-bit integer, and the output will be 2:
// Get the kind of Datum and what it holds -- this is a Scalar, with int64std::cout<<"Datum kind: "<<third_item.ToString()<<" content type: "<<third_item.type()->ToString()<<std::endl;// We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.std::cout<<third_item.scalar_as<arrow::Int64Scalar>().value<<std::endl;
Ending Program#
At the end, we just returnarrow::Status::OK(), so themain() knows thatwe’re done, and that everything’s okay, just like the precedingtutorials.
returnarrow::Status::OK();}
With that, you’ve used compute functions which fall into the three maintypes – with and without convenience functions, then with an Optionsstruct. Now you can process anyTable you need to, and solve whateverdata problem you have that fits into memory!
Which means that now we have to see how we can work withlarger-than-memory datasets, via Arrow Datasets in the next article.
Refer to the below for a copy of the complete code:
19// (Doc section: Includes) 20#include<arrow/api.h> 21#include<arrow/compute/api.h> 22 23#include<iostream> 24// (Doc section: Includes) 25 26// (Doc section: RunMain) 27arrow::StatusRunMain(){ 28// (Doc section: RunMain) 29// (Doc section: Create Tables) 30// Create a couple 32-bit integer arrays. 31arrow::Int32Builderint32builder; 32int32_tsome_nums_raw[5]={34,624,2223,5654,4356}; 33ARROW_RETURN_NOT_OK(int32builder.AppendValues(some_nums_raw,5)); 34std::shared_ptr<arrow::Array>some_nums; 35ARROW_ASSIGN_OR_RAISE(some_nums,int32builder.Finish()); 36 37int32_tmore_nums_raw[5]={75342,23,64,17,736}; 38ARROW_RETURN_NOT_OK(int32builder.AppendValues(more_nums_raw,5)); 39std::shared_ptr<arrow::Array>more_nums; 40ARROW_ASSIGN_OR_RAISE(more_nums,int32builder.Finish()); 41 42// Make a table out of our pair of arrays. 43std::shared_ptr<arrow::Field>field_a,field_b; 44std::shared_ptr<arrow::Schema>schema; 45 46field_a=arrow::field("A",arrow::int32()); 47field_b=arrow::field("B",arrow::int32()); 48 49schema=arrow::schema({field_a,field_b}); 50 51// Initialize the compute module to register the required compute kernels. 52ARROW_RETURN_NOT_OK(arrow::compute::Initialize()); 53 54std::shared_ptr<arrow::Table>table; 55table=arrow::Table::Make(schema,{some_nums,more_nums},5); 56// (Doc section: Create Tables) 57 58// (Doc section: Sum Datum Declaration) 59// The Datum class is what all compute functions output to, and they can take Datums 60// as inputs, as well. 61arrow::Datumsum; 62// (Doc section: Sum Datum Declaration) 63// (Doc section: Sum Call) 64// Here, we can use arrow::compute::Sum. This is a convenience function, and the next 65// computation won't be so simple. However, using these where possible helps 66// readability. 67ARROW_ASSIGN_OR_RAISE(sum,arrow::compute::Sum({table->GetColumnByName("A")})); 68// (Doc section: Sum Call) 69// (Doc section: Sum Datum Type) 70// Get the kind of Datum and what it holds -- this is a Scalar, with int64. 71std::cout<<"Datum kind: "<<sum.ToString() 72<<" content type: "<<sum.type()->ToString()<<std::endl; 73// (Doc section: Sum Datum Type) 74// (Doc section: Sum Contents) 75// Note that we explicitly request a scalar -- the Datum cannot simply give what it is, 76// you must ask for the correct type. 77std::cout<<sum.scalar_as<arrow::Int64Scalar>().value<<std::endl; 78// (Doc section: Sum Contents) 79 80// (Doc section: Add Datum Declaration) 81arrow::Datumelement_wise_sum; 82// (Doc section: Add Datum Declaration) 83// (Doc section: Add Call) 84// Get element-wise sum of both columns A and B in our Table. Note that here we use 85// CallFunction(), which takes the name of the function as the first argument. 86ARROW_ASSIGN_OR_RAISE(element_wise_sum,arrow::compute::CallFunction( 87"add",{table->GetColumnByName("A"), 88table->GetColumnByName("B")})); 89// (Doc section: Add Call) 90// (Doc section: Add Datum Type) 91// Get the kind of Datum and what it holds -- this is a ChunkedArray, with int32. 92std::cout<<"Datum kind: "<<element_wise_sum.ToString() 93<<" content type: "<<element_wise_sum.type()->ToString()<<std::endl; 94// (Doc section: Add Datum Type) 95// (Doc section: Add Contents) 96// This time, we get a ChunkedArray, not a scalar. 97std::cout<<element_wise_sum.chunked_array()->ToString()<<std::endl; 98// (Doc section: Add Contents) 99100// (Doc section: Index Datum Declare)101// Use an options struct to set up searching for 2223 in column A (the third item).102arrow::Datumthird_item;103// (Doc section: Index Datum Declare)104// (Doc section: IndexOptions Declare)105// An options struct is used in lieu of passing an arbitrary amount of arguments.106arrow::compute::IndexOptionsindex_options;107// (Doc section: IndexOptions Declare)108// (Doc section: IndexOptions Assign)109// We need an Arrow Scalar, not a raw value.110index_options.value=arrow::MakeScalar(2223);111// (Doc section: IndexOptions Assign)112// (Doc section: Index Call)113ARROW_ASSIGN_OR_RAISE(114third_item,arrow::compute::CallFunction("index",{table->GetColumnByName("A")},115&index_options));116// (Doc section: Index Call)117// (Doc section: Index Inspection)118// Get the kind of Datum and what it holds -- this is a Scalar, with int64119std::cout<<"Datum kind: "<<third_item.ToString()120<<" content type: "<<third_item.type()->ToString()<<std::endl;121// We get a scalar -- the location of 2223 in column A, which is 2 in 0-based indexing.122std::cout<<third_item.scalar_as<arrow::Int64Scalar>().value<<std::endl;123// (Doc section: Index Inspection)124// (Doc section: Ret)125returnarrow::Status::OK();126}127// (Doc section: Ret)128129// (Doc section: Main)130intmain(){131arrow::Statusst=RunMain();132if(!st.ok()){133std::cerr<<st<<std::endl;134return1;135}136return0;137}138// (Doc section: Main)

