Arrow File I/O#

Apache Arrow provides file I/O functions to facilitate use of Arrow from the start to the end of an application. In this article, you will:

  1. Read an Arrow file into a RecordBatch and write it back out afterwards

  2. Read a CSV file into a Table and write it back out afterwards

  3. Read a Parquet file into a Table and write it back out afterwards

Pre-requisites#

Before continuing, make sure you have:

  1. An Arrow installation, which you can set up here: Using Arrow C++ in your own project

  2. An understanding of basic Arrow data structures from Basic Arrow Data Structures

  3. A directory to run the final application in – this program will generate some files, so be prepared for that.

Setup#

Before writing out some file I/O, we need to fill in a couple gaps:

  1. We need to include necessary headers.

  2. A main() is needed to glue things together.

  3. We need files to play with.

Includes#

Before writing C++ code, we need some includes. We’ll get iostream for output, then import Arrow’s I/O functionality for each file type we’ll work with in this article:

#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <iostream>

Main()#

For our glue, we’ll use the main() pattern from the previous tutorial on data structures:

int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}

Which, like when we used it before, is paired with a RunMain():

arrow::Status RunMain() {

Generating Files for Reading#

We need some files to actually play with. In practice, you’ll likely have some input for your own application. Here, however, we want to explore doing I/O for the sake of it, so let’s generate some files to make this easy to follow. To create those, we’ll define a helper function that we’ll run first. Feel free to read through this, but the concepts used will be explained later in this article. Note that we’re using the day/month/year data from the previous tutorial. For now, just copy the function in:

arrow::Status GenInitialFile() {
  // Make a couple 8-bit integer arrays and a 16-bit integer array -- just like
  // basic Arrow example.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Get a vector of our Arrays
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  // Make a schema to initialize the Table with
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  // With the schema and data, create a Table
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  // Write out test files in IPC, CSV, and Parquet for the example to use.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.arrow"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, schema));
  ARROW_RETURN_NOT_OK(ipc_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.csv"));
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(csv_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 5));

  return arrow::Status::OK();
}

To get the files for the rest of your code to function, make sure to call GenInitialFile() as the very first line in RunMain() to initialize the environment:

// Generate initial files for each format with a helper function -- don't worry,
// we'll also write a table in this example.
ARROW_RETURN_NOT_OK(GenInitialFile());

I/O with Arrow Files#

We’re going to go through this step by step, reading then writing, as follows:

  1. Reading a file

    1. Open the file

    2. Bind file to ipc::RecordBatchFileReader

    3. Read file to RecordBatch

  2. Writing a file

    1. Get an io::FileOutputStream

    2. Write to file from RecordBatch

Opening a File#

To actually read a file, we need some way to point to it. In Arrow, that means we’re going to get an io::ReadableFile object – much like an ArrayBuilder can clear and make new arrays, we can reassign this to new files, so we’ll use this instance throughout the examples:

// First, we have to set up a ReadableFile object, which just lets us point our
// readers to the right data on disk. We'll be reusing this object, and rebinding
// it to multiple files throughout the example.
std::shared_ptr<arrow::io::ReadableFile> infile;

An io::ReadableFile does little alone – we actually bind it to a file with io::ReadableFile::Open(). For our purposes here, the default arguments suffice:

// Get "test_in.arrow" into our file pointerARROW_ASSIGN_OR_RAISE(infile,arrow::io::ReadableFile::Open("test_in.arrow",arrow::default_memory_pool()));

Opening an Arrow File Reader#

An io::ReadableFile is too generic to offer all the functionality needed to read an Arrow file. We need to use it to get an ipc::RecordBatchFileReader object. This object implements all the logic needed to read an Arrow file with correct formatting. We get one through ipc::RecordBatchFileReader::Open():

// Open up the file with the IPC features of the library, gives us a reader object.
ARROW_ASSIGN_OR_RAISE(auto ipc_reader, arrow::ipc::RecordBatchFileReader::Open(infile));

Reading an Open Arrow File to RecordBatch#

We have to use a RecordBatch to read an Arrow file, so we’ll get a RecordBatch. Once we have that, we can actually read the file. Arrow files can have multiple RecordBatches, so we must pass an index. This file only has one, so pass 0:

// Using the reader, we can read Record Batches. Note that this is specific to IPC;
// for other formats, we focus on Tables, but here, RecordBatches are used.
std::shared_ptr<arrow::RecordBatch> rbatch;
ARROW_ASSIGN_OR_RAISE(rbatch, ipc_reader->ReadRecordBatch(0));

Prepare a FileOutputStream#

For output, we need an io::FileOutputStream. Just like our io::ReadableFile, we’ll be reusing this, so be ready for that. We open files the same way as when reading:

// Just like with input, we get an object for the output file.
std::shared_ptr<arrow::io::FileOutputStream> outfile;
// Bind it to "test_out.arrow"
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.arrow"));

Write Arrow File from RecordBatch#

Now, we grab the RecordBatch we read in earlier and use it, along with our target file, to create an ipc::RecordBatchWriter. The ipc::RecordBatchWriter needs two things:

  1. the target file

  2. the Schema for our RecordBatch (in case we need to write more RecordBatches of the same format).

The Schema comes from our existing RecordBatch, and the target file is the output stream we just created.

// Set up a writer with the output file -- and the schema! We're defining everything
// here, loading to fire.
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                      arrow::ipc::MakeFileWriter(outfile, rbatch->schema()));

We can just call ipc::RecordBatchWriter::WriteRecordBatch() with our RecordBatch to fill up our file:

// Write the record batch.
ARROW_RETURN_NOT_OK(ipc_writer->WriteRecordBatch(*rbatch));

For IPC in particular, the writer has to be closed since it anticipates more than one batch may be written. To do that:

// Specifically for IPC, the writer needs to be explicitly closed.
ARROW_RETURN_NOT_OK(ipc_writer->Close());

Now we’ve read and written an IPC file!

I/O with CSV#

We’re going to go through this step by step, reading then writing, as follows:

  1. Reading a file

    1. Open the file

    2. Prepare Table

    3. Read file using csv::TableReader

  2. Writing a file

    1. Get an io::FileOutputStream

    2. Write to file from Table

Opening a CSV File#

For a CSV file, we need to open an io::ReadableFile, just like an Arrow file, and reuse our io::ReadableFile object from before to do so:

// Bind our input file to "test_in.csv"
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.csv"));

Preparing a Table#

CSV can be read into a Table, so declare a pointer to a Table:

std::shared_ptr<arrow::Table> csv_table;

Read a CSV File to Table#

The CSV reader has option structs which need to be passed – luckily, there are defaults for these which we can pass directly. For reference on the other options, go here: File Formats. Our file here has no special delimiters and is small, so we can make our reader with defaults:

// The CSV reader has several objects for various options. For now, we'll use defaults.
ARROW_ASSIGN_OR_RAISE(
    auto csv_reader,
    arrow::csv::TableReader::Make(
        arrow::io::default_io_context(), infile, arrow::csv::ReadOptions::Defaults(),
        arrow::csv::ParseOptions::Defaults(), arrow::csv::ConvertOptions::Defaults()));

With the CSV reader primed, we can use its csv::TableReader::Read() method to fill our Table:

// Read the table.
ARROW_ASSIGN_OR_RAISE(csv_table, csv_reader->Read());

Write a CSV File from Table#

CSV writing to Table looks exactly like IPC writing to RecordBatch, except with our Table, and using ipc::RecordBatchWriter::WriteTable() instead of ipc::RecordBatchWriter::WriteRecordBatch(). Note that the same writer class is used – we’re writing with ipc::RecordBatchWriter::WriteTable() because we have a Table. We’ll target a file, use our Table’s Schema, and then write the Table:

// Bind our output file to "test_out.csv"
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.csv"));
// The CSV writer has simpler defaults, review API documentation for more complex usage.
ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                      arrow::csv::MakeCSVWriter(outfile, csv_table->schema()));
ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*csv_table));
// Not necessary, but a safe practice.
ARROW_RETURN_NOT_OK(csv_writer->Close());

Now, we’ve read and written a CSV file!

File I/O with Parquet#

We’re going to go through this step by step, reading then writing, as follows:

  1. Reading a file

    1. Open the file

    2. Prepare parquet::arrow::FileReader

    3. Read file to Table

  2. Writing a file

    1. Write Table to file

Opening a Parquet File#

Once more, this file format, Parquet, needs an io::ReadableFile, which we already have, and for the io::ReadableFile::Open() method to be called on a file:

// Bind our input file to "test_in.parquet"
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.parquet"));

Setting up a Parquet Reader#

As always, we need a Reader to actually read the file. We’ve been getting Readers for each file format from the Arrow namespace. This time, we enter the Parquet namespace to get the parquet::arrow::FileReader:

std::unique_ptr<parquet::arrow::FileReader> reader;

Now, to set up our reader, we call parquet::arrow::OpenFile(). Yes, this is necessary even though we used io::ReadableFile::Open(). It returns a Result holding the reader, which we unwrap with PARQUET_ASSIGN_OR_THROW, since Parquet surfaces errors as exceptions rather than Statuses:

// OpenFile() returns a Result holding the reader; PARQUET_ASSIGN_OR_THROW unwraps
// it, throwing instead of returning a Status on failure.
PARQUET_ASSIGN_OR_THROW(reader,
                        parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));

Reading a Parquet File to Table#

With a prepared parquet::arrow::FileReader in hand, we can read to a Table, except we must pass the Table by reference instead of outputting to it:

std::shared_ptr<arrow::Table> parquet_table;
// Read the table.
PARQUET_THROW_NOT_OK(reader->ReadTable(&parquet_table));

Writing a Parquet File from Table#

For single-shot writes, writing a Parquet file does not need a writer object. Instead, we give it our table, point to the memory pool it will use for any necessary memory consumption, tell it where to write, and the chunk size if it needs to break up the file at all:

// Parquet writing does not need a declared writer object. Just get the output
// file bound, then pass in the table, memory pool, output, and chunk size for
// breaking up the Table on-disk.
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.parquet"));
PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
    *parquet_table, arrow::default_memory_pool(), outfile, 5));

Ending Program#

At the end, we just return Status::OK(), so the main() knows that we’re done, and that everything’s okay. Just like in the first tutorial.

  return arrow::Status::OK();
}

With that, you’ve read and written IPC, CSV, and Parquet in Arrow, and can properly load data and write output! Now, we can move into processing data with compute functions in the next article.

Refer to the below for a copy of the complete code:

// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/ipc/api.h>
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <iostream>
// (Doc section: Includes)

// (Doc section: GenInitialFile)
arrow::Status GenInitialFile() {
  // Make a couple 8-bit integer arrays and a 16-bit integer array -- just like
  // basic Arrow example.
  arrow::Int8Builder int8builder;
  int8_t days_raw[5] = {1, 12, 17, 23, 28};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(days_raw, 5));
  std::shared_ptr<arrow::Array> days;
  ARROW_ASSIGN_OR_RAISE(days, int8builder.Finish());

  int8_t months_raw[5] = {1, 3, 5, 7, 1};
  ARROW_RETURN_NOT_OK(int8builder.AppendValues(months_raw, 5));
  std::shared_ptr<arrow::Array> months;
  ARROW_ASSIGN_OR_RAISE(months, int8builder.Finish());

  arrow::Int16Builder int16builder;
  int16_t years_raw[5] = {1990, 2000, 1995, 2000, 1995};
  ARROW_RETURN_NOT_OK(int16builder.AppendValues(years_raw, 5));
  std::shared_ptr<arrow::Array> years;
  ARROW_ASSIGN_OR_RAISE(years, int16builder.Finish());

  // Get a vector of our Arrays
  std::vector<std::shared_ptr<arrow::Array>> columns = {days, months, years};

  // Make a schema to initialize the Table with
  std::shared_ptr<arrow::Field> field_day, field_month, field_year;
  std::shared_ptr<arrow::Schema> schema;

  field_day = arrow::field("Day", arrow::int8());
  field_month = arrow::field("Month", arrow::int8());
  field_year = arrow::field("Year", arrow::int16());

  schema = arrow::schema({field_day, field_month, field_year});
  // With the schema and data, create a Table
  std::shared_ptr<arrow::Table> table;
  table = arrow::Table::Make(schema, columns);

  // Write out test files in IPC, CSV, and Parquet for the example to use.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.arrow"));
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, schema));
  ARROW_RETURN_NOT_OK(ipc_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(ipc_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.csv"));
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*table));
  ARROW_RETURN_NOT_OK(csv_writer->Close());

  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_in.parquet"));
  PARQUET_THROW_NOT_OK(
      parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), outfile, 5));

  return arrow::Status::OK();
}
// (Doc section: GenInitialFile)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: Gen Files)
  // Generate initial files for each format with a helper function -- don't worry,
  // we'll also write a table in this example.
  ARROW_RETURN_NOT_OK(GenInitialFile());
  // (Doc section: Gen Files)

  // (Doc section: ReadableFile Definition)
  // First, we have to set up a ReadableFile object, which just lets us point our
  // readers to the right data on disk. We'll be reusing this object, and rebinding
  // it to multiple files throughout the example.
  std::shared_ptr<arrow::io::ReadableFile> infile;
  // (Doc section: ReadableFile Definition)
  // (Doc section: Arrow ReadableFile Open)
  // Get "test_in.arrow" into our file pointer
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(
                                    "test_in.arrow", arrow::default_memory_pool()));
  // (Doc section: Arrow ReadableFile Open)
  // (Doc section: Arrow Read Open)
  // Open up the file with the IPC features of the library, gives us a reader object.
  ARROW_ASSIGN_OR_RAISE(auto ipc_reader, arrow::ipc::RecordBatchFileReader::Open(infile));
  // (Doc section: Arrow Read Open)
  // (Doc section: Arrow Read)
  // Using the reader, we can read Record Batches. Note that this is specific to IPC;
  // for other formats, we focus on Tables, but here, RecordBatches are used.
  std::shared_ptr<arrow::RecordBatch> rbatch;
  ARROW_ASSIGN_OR_RAISE(rbatch, ipc_reader->ReadRecordBatch(0));
  // (Doc section: Arrow Read)

  // (Doc section: Arrow Write Open)
  // Just like with input, we get an object for the output file.
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  // Bind it to "test_out.arrow"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.arrow"));
  // (Doc section: Arrow Write Open)
  // (Doc section: Arrow Writer)
  // Set up a writer with the output file -- and the schema! We're defining everything
  // here, loading to fire.
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::ipc::RecordBatchWriter> ipc_writer,
                        arrow::ipc::MakeFileWriter(outfile, rbatch->schema()));
  // (Doc section: Arrow Writer)
  // (Doc section: Arrow Write)
  // Write the record batch.
  ARROW_RETURN_NOT_OK(ipc_writer->WriteRecordBatch(*rbatch));
  // (Doc section: Arrow Write)
  // (Doc section: Arrow Close)
  // Specifically for IPC, the writer needs to be explicitly closed.
  ARROW_RETURN_NOT_OK(ipc_writer->Close());
  // (Doc section: Arrow Close)

  // (Doc section: CSV Read Open)
  // Bind our input file to "test_in.csv"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.csv"));
  // (Doc section: CSV Read Open)
  // (Doc section: CSV Table Declare)
  std::shared_ptr<arrow::Table> csv_table;
  // (Doc section: CSV Table Declare)
  // (Doc section: CSV Reader Make)
  // The CSV reader has several objects for various options. For now, we'll use defaults.
  ARROW_ASSIGN_OR_RAISE(
      auto csv_reader,
      arrow::csv::TableReader::Make(
          arrow::io::default_io_context(), infile, arrow::csv::ReadOptions::Defaults(),
          arrow::csv::ParseOptions::Defaults(), arrow::csv::ConvertOptions::Defaults()));
  // (Doc section: CSV Reader Make)
  // (Doc section: CSV Read)
  // Read the table.
  ARROW_ASSIGN_OR_RAISE(csv_table, csv_reader->Read());
  // (Doc section: CSV Read)

  // (Doc section: CSV Write)
  // Bind our output file to "test_out.csv"
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.csv"));
  // The CSV writer has simpler defaults, review API documentation for more complex usage.
  ARROW_ASSIGN_OR_RAISE(auto csv_writer,
                        arrow::csv::MakeCSVWriter(outfile, csv_table->schema()));
  ARROW_RETURN_NOT_OK(csv_writer->WriteTable(*csv_table));
  // Not necessary, but a safe practice.
  ARROW_RETURN_NOT_OK(csv_writer->Close());
  // (Doc section: CSV Write)

  // (Doc section: Parquet Read Open)
  // Bind our input file to "test_in.parquet"
  ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open("test_in.parquet"));
  // (Doc section: Parquet Read Open)
  // (Doc section: Parquet FileReader)
  std::unique_ptr<parquet::arrow::FileReader> reader;
  // (Doc section: Parquet FileReader)
  // (Doc section: Parquet OpenFile)
  // OpenFile() returns a Result holding the reader; PARQUET_ASSIGN_OR_THROW unwraps
  // it, throwing instead of returning a Status on failure.
  PARQUET_ASSIGN_OR_THROW(reader,
                          parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));
  // (Doc section: Parquet OpenFile)

  // (Doc section: Parquet Read)
  std::shared_ptr<arrow::Table> parquet_table;
  // Read the table.
  PARQUET_THROW_NOT_OK(reader->ReadTable(&parquet_table));
  // (Doc section: Parquet Read)

  // (Doc section: Parquet Write)
  // Parquet writing does not need a declared writer object. Just get the output
  // file bound, then pass in the table, memory pool, output, and chunk size for
  // breaking up the Table on-disk.
  ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open("test_out.parquet"));
  PARQUET_THROW_NOT_OK(parquet::arrow::WriteTable(
      *parquet_table, arrow::default_memory_pool(), outfile, 5));
  // (Doc section: Parquet Write)
  // (Doc section: Return)
  return arrow::Status::OK();
}
// (Doc section: Return)

// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)