Arrow Datasets#
Arrow C++ provides the concept and implementation of Datasets to work with fragmented data, which can be larger than memory, whether that's due to generating large amounts of data, reading it in from a stream, or having a large file on disk. In this article, you will:
read a multi-file partitioned dataset and put it into a Table,
write out a partitioned dataset from a Table.
Prerequisites#
Before continuing, make sure you have:
An Arrow installation, which you can set up here: Using Arrow C++ in your own project

An understanding of basic Arrow data structures from Basic Arrow Data Structures
To see how this differs from single-file reading, it may be useful to have also read Arrow File I/O. However, it is not required.
Setup#
Before running some computations, we need to fill in a couple of gaps:
We need to include necessary headers.

A main() is needed to glue things together.

We need data on disk to play with.
Includes#
Before writing C++ code, we need some includes. We'll get iostream for output, then import Arrow's dataset and compute functionality, along with the Parquet headers for setting up our example files:
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>
// We use Parquet headers for setting up examples; they are not required for using
// datasets.
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <unistd.h>
#include <iostream>
Main()#
For our glue, we'll use the main() pattern from the previous tutorial on data structures:
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
Which, like when we used it before, is paired with a RunMain():
arrow::Status RunMain() {
Generating Files for Reading#
We need some files to actually play with. In practice, you'll likely have some input for your own application. Here, however, we want to explore without the overhead of supplying or finding a dataset, so let's generate some to make this easy to follow. Feel free to read through this, but the concepts will be visited properly in this article – just copy it in, for now, and realize it ends with a partitioned dataset on disk:
// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  // This code should look familiar from the basic Arrow example, and is not the
  // focus of this example. However, we need data to work with, and this makes that!
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<arrow::fs::FileSystem>& filesystem,
    const std::string& root_path) {
  // Much like CreateTable(), this is a utility that gets us the dataset we'll be
  // reading from. Don't worry, we also write a dataset in the example proper.
  auto base_path = root_path + "parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, 2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, 2048));
  return base_path;
}

arrow::Status PrepareEnv() {
  // Initialize the compute module to register the required kernels for Dataset
  ARROW_RETURN_NOT_OK(arrow::compute::Initialize());
  // Get our environment prepared for reading, by setting up some quick writing.
  ARROW_ASSIGN_OR_RAISE(auto src_table, CreateTable());
  std::shared_ptr<arrow::fs::FileSystem> setup_fs;
  // Note this operates in the directory the executable is built in.
  char setup_path[256];
  char* result = getcwd(setup_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }

  ARROW_ASSIGN_OR_RAISE(setup_fs, arrow::fs::FileSystemFromUriOrPath(setup_path));
  ARROW_ASSIGN_OR_RAISE(auto dset_path, CreateExampleParquetDataset(setup_fs, ""));

  return arrow::Status::OK();
}
In order to actually have these files, make sure the first thing called in RunMain() is our helper function PrepareEnv(), which will get a dataset on disk for us to play with:
ARROW_RETURN_NOT_OK(PrepareEnv());
Reading a Partitioned Dataset#
Reading a Dataset is a distinct task from reading a single file. The task takes more work than reading a single file, due to needing to be able to parse multiple files and/or folders. This process can be broken up into the following steps:
Get a fs::FileSystem object for the local FS

Create a fs::FileSelector and use it to prepare a dataset::FileSystemDatasetFactory

Build a dataset::Dataset using the dataset::FileSystemDatasetFactory

Use a dataset::Scanner to read into a Table
Preparing a FileSystem Object#
In order to begin, we'll need to be able to interact with the local filesystem. In order to do that, we'll need a fs::FileSystem object. A fs::FileSystem is an abstraction that lets us use the same interface whether we're using Amazon S3, Google Cloud Storage, or local disk – and we'll be using local disk. So, let's declare it:
// First, we need a filesystem object, which lets us interact with our local
// filesystem starting at a given path. For the sake of simplicity, that'll be
// the current directory.
std::shared_ptr<arrow::fs::FileSystem> fs;
For this example, we'll have our FileSystem's base path exist in the same directory as the executable. fs::FileSystemFromUriOrPath() lets us get a fs::FileSystem object for any of the types of supported filesystems. Here, though, we'll just pass our path:
// Get the CWD, use it to make the FileSystem object.
char init_path[256];
char* result = getcwd(init_path, 256);
if (result == NULL) {
  return arrow::Status::IOError("Fetching PWD failed.");
}
ARROW_ASSIGN_OR_RAISE(fs, arrow::fs::FileSystemFromUriOrPath(init_path));
See also
fs::FileSystem for the other supported filesystems.
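Because fs::FileSystemFromUriOrPath() also accepts URIs, pointing the same code at remote storage is mostly a matter of changing the string. As a minimal sketch – assuming an Arrow build with S3 support enabled, and a hypothetical bucket name – it could look like this:

// A sketch, not used in this example: the same interface against Amazon S3.
// "my-bucket" is a hypothetical name; depending on your Arrow version, you may
// also need to call arrow::fs::InitializeS3() beforehand.
std::shared_ptr<arrow::fs::FileSystem> remote_fs;
ARROW_ASSIGN_OR_RAISE(remote_fs,
                      arrow::fs::FileSystemFromUriOrPath("s3://my-bucket/datasets"));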
Creating a FileSystemDatasetFactory#
A fs::FileSystem stores a lot of metadata, but we need to be able to traverse it and parse that metadata. In Arrow, we use a FileSelector to do so:
// A file selector lets us actually traverse a multi-file dataset.
arrow::fs::FileSelector selector;
This fs::FileSelector isn't able to do anything yet. In order to use it, we need to configure it – we'll have it start any selection in "parquet_dataset," which is where the environment preparation process has left us a dataset, and set recursive to true, which allows for traversal of folders.
selector.base_dir = "parquet_dataset";
// Recursive is a safe bet if you don't know the nesting of your dataset.
selector.recursive = true;
To get a dataset::Dataset from a fs::FileSystem, we need to prepare a dataset::FileSystemDatasetFactory. This is a long but descriptive name – it'll make us a factory to get data from our fs::FileSystem. First, we configure it by filling a dataset::FileSystemFactoryOptions struct:
// Making an options object lets us configure our dataset reading.
arrow::dataset::FileSystemFactoryOptions options;
// We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
// schema. We won't set any other options, defaults are fine.
options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();
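Hive-style isn't the only layout Arrow Datasets understands. If your directories were bare values, like "1/data.parquet" instead of "a=1/data.parquet", a dataset::DirectoryPartitioning factory would be the one to reach for. A sketch, assuming the path segments correspond to a field named "a":

// An alternative, not used in this example: directory partitioning, where path
// segments are bare values, so we must name the fields ourselves.
options.partitioning = arrow::dataset::DirectoryPartitioning::MakeFactory({"a"});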
There are many file formats, and we have to pick one the factory should expect when actually reading. Parquet is what we have on disk, so of course we'll ask for that when reading:
auto read_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
After setting up the fs::FileSystem, fs::FileSelector, options, and file format, we can make that dataset::FileSystemDatasetFactory. This simply requires passing in everything we've prepared and assigning that to a variable:
// Now, we get a factory that will let us get our dataset -- we don't have the
// dataset yet!
ARROW_ASSIGN_OR_RAISE(auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                                        fs, selector, read_format, options));
Build Dataset using Factory#
With a dataset::FileSystemDatasetFactory set up, we can actually build our dataset::Dataset with dataset::FileSystemDatasetFactory::Finish(), just like with an ArrayBuilder back in the basic tutorial:
// Now we build our dataset from the factory.
ARROW_ASSIGN_OR_RAISE(auto read_dataset, factory->Finish());
Now, we have a dataset::Dataset object in memory. This does not mean that the entire dataset is manifested in memory, but that we now have access to tools that allow us to explore and use the dataset that is on disk. For example, we can grab the fragments (files) that make up our whole dataset and print them out, along with some basic information:
// Print out the fragments
ARROW_ASSIGN_OR_RAISE(auto fragments, read_dataset->GetFragments());
for (const auto& fragment : fragments) {
  std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
  std::cout << "Partition expression: "
            << (*fragment)->partition_expression().ToString() << std::endl;
}
Move Dataset into Table#
One way we can do something with Datasets is to get them into a Table, where anything we've learned about Tables applies.
See also
Acero: A C++ streaming execution engine for execution that avoids manifesting the entire dataset in memory.
In order to move a Dataset's contents into a Table, we need a dataset::Scanner, which scans the data and outputs it to the Table. First, we get a dataset::ScannerBuilder from the dataset::Dataset:
// Scan dataset into a Table -- once this is done, you can do
// normal table things with it, like computation and printing. However, now you're
// also dedicated to being in memory.
ARROW_ASSIGN_OR_RAISE(auto read_scan_builder, read_dataset->NewScan());
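Though we won't use them in this example, the Builder is also where column selection and row filtering would be configured before scanning. A sketch, assuming we only wanted columns "a" and "b" for rows where "a" exceeds 3:

// A sketch, not used in this example: configure the builder to project and filter
// before finishing it, so only the needed data gets read.
ARROW_RETURN_NOT_OK(read_scan_builder->Project({"a", "b"}));
ARROW_RETURN_NOT_OK(read_scan_builder->Filter(arrow::compute::greater(
    arrow::compute::field_ref("a"), arrow::compute::literal(3))));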
Of course, a Builder’s only use is to get us ourdataset::Scanner, so let’s usedataset::ScannerBuilder::Finish():
ARROW_ASSIGN_OR_RAISE(auto read_scanner, read_scan_builder->Finish());
Now that we have a tool to move through our dataset::Dataset, let's use it to get our Table. dataset::Scanner::ToTable() offers exactly what we're looking for, and we can print the results:
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, read_scanner->ToTable());
std::cout << table->ToString();
This leaves us with a normal Table. Again, to do things with Datasets without moving to a Table, consider using Acero.
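If the dataset is too large for a Table, the Scanner can also hand data over incrementally. A minimal sketch, assuming a freshly built scanner (a scanner's stream is consumed as it's read), using dataset::Scanner::ScanBatches():

// A sketch, not used in this example: visit the dataset one RecordBatch at a time
// instead of materializing all of it.
ARROW_ASSIGN_OR_RAISE(auto batch_scan_builder, read_dataset->NewScan());
ARROW_ASSIGN_OR_RAISE(auto batch_scanner, batch_scan_builder->Finish());
ARROW_ASSIGN_OR_RAISE(auto batch_iter, batch_scanner->ScanBatches());
for (arrow::Result<arrow::dataset::TaggedRecordBatch> maybe_batch : batch_iter) {
  ARROW_ASSIGN_OR_RAISE(auto tagged, maybe_batch);
  // Each TaggedRecordBatch carries the batch plus the fragment it came from.
  std::cout << "Read a batch of " << tagged.record_batch->num_rows() << " rows"
            << std::endl;
}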
Writing a Dataset to Disk from Table#
Writing a dataset::Dataset is a distinct task from writing a single file. The task takes more work than writing a single file, due to needing to be able to handle a partitioning scheme across multiple files and folders. This process can be broken up into the following steps:
Prepare a TableBatchReader

Create a dataset::Scanner to pull data from the TableBatchReader

Prepare schema, partitioning, and file format options

Set up dataset::FileSystemDatasetWriteOptions – a struct that configures our writing functions

Write the dataset to disk
Prepare Data from Table for Writing#
We have a Table, and we want to get a dataset::Dataset on disk. In fact, for the sake of exploration, we'll use a different partitioning scheme for the dataset – instead of just breaking into halves like the original fragments, we'll partition based on each row's value in the "a" column.
To get started on that, let's get a TableBatchReader! This makes it very easy to write to a Dataset, and can be used elsewhere whenever a Table needs to be broken into a stream of RecordBatches. Here, we can just use the TableBatchReader's constructor, with our table:
// Now, let's get a table out to disk as a dataset!
// We make a RecordBatchReader from our Table, then set up a scanner, which lets us
// go to a file.
std::shared_ptr<arrow::TableBatchReader> write_dataset =
    std::make_shared<arrow::TableBatchReader>(table);
Create Scanner for Moving Table Data#
The process for writing a dataset::Dataset, once a source of data is available, is similar to the reverse of reading it. Before, we used a dataset::Scanner in order to scan into a Table – now, we need one to read out of our TableBatchReader. To get that dataset::Scanner, we'll make a dataset::ScannerBuilder based on our TableBatchReader, then use that Builder to build a dataset::Scanner:
auto write_scanner_builder =
    arrow::dataset::ScannerBuilder::FromRecordBatchReader(write_dataset);
ARROW_ASSIGN_OR_RAISE(auto write_scanner, write_scanner_builder->Finish());
Prepare Schema, Partitioning, and File Format Variables#
Since we want to partition based on the "a" column, we need to declare that. When defining our partitioning Schema, we'll just have a single Field that contains "a". (Partition values end up as strings in the directory names, which is why utf8 works here even though the column itself is int64.)
// The partition schema determines which fields are used as keys for partitioning.
auto partition_schema = arrow::schema({arrow::field("a", arrow::utf8())});
This Schema determines what the key is for partitioning, but we need to choose the algorithm that'll do something with this key. We will use Hive-style again, this time with our schema passed to it as configuration:
// We'll use Hive-style partitioning, which creates directories with "key=value"
// pairs.
auto partitioning =
    std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
Several file formats are available, but Parquet is commonly used with Arrow, so we'll write back out to that:
// Now, we declare we'll be writing Parquet files.
auto write_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
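Parquet isn't the only format the dataset writer knows; for instance, Arrow's own IPC file format (also known as Feather) has a matching class. A sketch of that substitution, not used in this example:

// An alternative, not used in this example: write Arrow IPC (Feather) files
// instead of Parquet.
auto ipc_format = std::make_shared<arrow::dataset::IpcFileFormat>();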
Configure FileSystemDatasetWriteOptions#
In order to write to disk, we need some configuration. We'll do so by setting values in a dataset::FileSystemDatasetWriteOptions struct. We'll initialize it with defaults where possible:
// This time, we make Options for writing, but do much more configuration.
arrow::dataset::FileSystemDatasetWriteOptions write_options;
// Defaults to start.
write_options.file_write_options = write_format->DefaultWriteOptions();
One important step in writing to file is having a fs::FileSystem to target. Luckily, we have one from when we set it up for reading. This is a simple variable assignment:
// Use the filesystem we already have.
write_options.filesystem = fs;
Arrow can make the directory, but it does need a name for said directory, so let's give it one and call it "write_dataset":
// Write to the folder "write_dataset" in current directory.
write_options.base_dir = "write_dataset";
We made a partitioning method previously, declaring that we'd use Hive-style – this is where we actually pass that to our writing function:
// Use the partitioning declared above.
write_options.partitioning = partitioning;
Part of what’ll happen is Arrow will break up files, thus preventingthem from being too large to handle. This is what makes a datasetfragmented in the first place. In order to set this up, we need a basename for each fragment in a directory – in this case, we’ll have“part{i}.parquet”, which means the third file (within the samedirectory) will be called “part3.parquet”, for example:
// Define what the name for the files making up the dataset will be.
write_options.basename_template = "part{i}.parquet";
Sometimes, data will be written to the same location more than once, and overwriting will be accepted. Since we may want to run this application more than once, we will set Arrow to overwrite existing data – if we didn't, Arrow would abort due to seeing existing data after the first run of this application:
// Set behavior to overwrite existing data -- specifically, this lets this example
// be run more than once, and allows whatever code you have to overwrite what's there.
write_options.existing_data_behavior =
    arrow::dataset::ExistingDataBehavior::kOverwriteOrIgnore;
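kOverwriteOrIgnore is not the only choice – the dataset::ExistingDataBehavior enum also offers kError, and kDeleteMatchingPartitions for wiping a partition clean before writing into it. A sketch of that last option, not used in this example:

// An alternative, not used in this example: delete existing data in any partition
// we are about to write to, instead of overwriting file-by-file.
write_options.existing_data_behavior =
    arrow::dataset::ExistingDataBehavior::kDeleteMatchingPartitions;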
Write Dataset to Disk#
Once the dataset::FileSystemDatasetWriteOptions has been configured, and a dataset::Scanner is prepared to parse the data, we can pass the Options and dataset::Scanner to dataset::FileSystemDataset::Write() to write out to disk:
// Write to disk!
ARROW_RETURN_NOT_OK(
    arrow::dataset::FileSystemDataset::Write(write_options, write_scanner));
You can review your disk to see that you've written a folder containing subfolders for every value of "a", which each have Parquet files!
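If you'd rather confirm this from code than from a shell, a short sketch reusing the fs::FileSelector machinery from earlier can list what was written:

// A sketch, not used in this example: list the files we just wrote.
arrow::fs::FileSelector written_selector;
written_selector.base_dir = "write_dataset";
written_selector.recursive = true;
ARROW_ASSIGN_OR_RAISE(auto written_files, fs->GetFileInfo(written_selector));
for (const auto& info : written_files) {
  std::cout << info.path() << std::endl;
}

You should see one "a=<value>" directory per distinct value in column "a", each holding files named by our basename template.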
Ending Program#
At the end, we just return Status::OK(), so the main() knows that we're done, and that everything's okay, just like the preceding tutorials.
return arrow::Status::OK();
}
With that, you’ve read and written partitioned datasets! This method,with some configuration, will work for any supported dataset format. Foran example of such a dataset, the NYC Taxi dataset is a well-knownone, which you can findhere.Now you can get larger-than-memory data mapped for use!
That means we now need to be able to process this data without pulling it all into memory at once. For this, try Acero.
See also
Acero: A C++ streaming execution engine for more information on Acero.
Refer to the below for a copy of the complete code:
// (Doc section: Includes)
#include <arrow/api.h>
#include <arrow/compute/api.h>
#include <arrow/dataset/api.h>
// We use Parquet headers for setting up examples; they are not required for using
// datasets.
#include <parquet/arrow/reader.h>
#include <parquet/arrow/writer.h>

#include <unistd.h>
#include <iostream>
// (Doc section: Includes)

// (Doc section: Helper Functions)
// Generate some data for the rest of this example.
arrow::Result<std::shared_ptr<arrow::Table>> CreateTable() {
  // This code should look familiar from the basic Arrow example, and is not the
  // focus of this example. However, we need data to work with, and this makes that!
  auto schema =
      arrow::schema({arrow::field("a", arrow::int64()), arrow::field("b", arrow::int64()),
                     arrow::field("c", arrow::int64())});
  std::shared_ptr<arrow::Array> array_a;
  std::shared_ptr<arrow::Array> array_b;
  std::shared_ptr<arrow::Array> array_c;
  arrow::NumericBuilder<arrow::Int64Type> builder;
  ARROW_RETURN_NOT_OK(builder.AppendValues({0, 1, 2, 3, 4, 5, 6, 7, 8, 9}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_a));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({9, 8, 7, 6, 5, 4, 3, 2, 1, 0}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_b));
  builder.Reset();
  ARROW_RETURN_NOT_OK(builder.AppendValues({1, 2, 1, 2, 1, 2, 1, 2, 1, 2}));
  ARROW_RETURN_NOT_OK(builder.Finish(&array_c));
  return arrow::Table::Make(schema, {array_a, array_b, array_c});
}

// Set up a dataset by writing two Parquet files.
arrow::Result<std::string> CreateExampleParquetDataset(
    const std::shared_ptr<arrow::fs::FileSystem>& filesystem,
    const std::string& root_path) {
  // Much like CreateTable(), this is a utility that gets us the dataset we'll be
  // reading from. Don't worry, we also write a dataset in the example proper.
  auto base_path = root_path + "parquet_dataset";
  ARROW_RETURN_NOT_OK(filesystem->CreateDir(base_path));
  // Create an Arrow Table
  ARROW_ASSIGN_OR_RAISE(auto table, CreateTable());
  // Write it into two Parquet files
  ARROW_ASSIGN_OR_RAISE(auto output,
                        filesystem->OpenOutputStream(base_path + "/data1.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(0, 5), arrow::default_memory_pool(), output, 2048));
  ARROW_ASSIGN_OR_RAISE(output,
                        filesystem->OpenOutputStream(base_path + "/data2.parquet"));
  ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(
      *table->Slice(5), arrow::default_memory_pool(), output, 2048));
  return base_path;
}

arrow::Status PrepareEnv() {
  // Initialize the compute module to register the required kernels for Dataset
  ARROW_RETURN_NOT_OK(arrow::compute::Initialize());
  // Get our environment prepared for reading, by setting up some quick writing.
  ARROW_ASSIGN_OR_RAISE(auto src_table, CreateTable());
  std::shared_ptr<arrow::fs::FileSystem> setup_fs;
  // Note this operates in the directory the executable is built in.
  char setup_path[256];
  char* result = getcwd(setup_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }

  ARROW_ASSIGN_OR_RAISE(setup_fs, arrow::fs::FileSystemFromUriOrPath(setup_path));
  ARROW_ASSIGN_OR_RAISE(auto dset_path, CreateExampleParquetDataset(setup_fs, ""));

  return arrow::Status::OK();
}
// (Doc section: Helper Functions)

// (Doc section: RunMain)
arrow::Status RunMain() {
  // (Doc section: RunMain)
  // (Doc section: PrepareEnv)
  ARROW_RETURN_NOT_OK(PrepareEnv());
  // (Doc section: PrepareEnv)

  // (Doc section: FileSystem Declare)
  // First, we need a filesystem object, which lets us interact with our local
  // filesystem starting at a given path. For the sake of simplicity, that'll be
  // the current directory.
  std::shared_ptr<arrow::fs::FileSystem> fs;
  // (Doc section: FileSystem Declare)

  // (Doc section: FileSystem Init)
  // Get the CWD, use it to make the FileSystem object.
  char init_path[256];
  char* result = getcwd(init_path, 256);
  if (result == NULL) {
    return arrow::Status::IOError("Fetching PWD failed.");
  }
  ARROW_ASSIGN_OR_RAISE(fs, arrow::fs::FileSystemFromUriOrPath(init_path));
  // (Doc section: FileSystem Init)

  // (Doc section: FileSelector Declare)
  // A file selector lets us actually traverse a multi-file dataset.
  arrow::fs::FileSelector selector;
  // (Doc section: FileSelector Declare)
  // (Doc section: FileSelector Config)
  selector.base_dir = "parquet_dataset";
  // Recursive is a safe bet if you don't know the nesting of your dataset.
  selector.recursive = true;
  // (Doc section: FileSelector Config)
  // (Doc section: FileSystemFactoryOptions)
  // Making an options object lets us configure our dataset reading.
  arrow::dataset::FileSystemFactoryOptions options;
  // We'll use Hive-style partitioning. We'll let Arrow Datasets infer the partition
  // schema. We won't set any other options, defaults are fine.
  options.partitioning = arrow::dataset::HivePartitioning::MakeFactory();
  // (Doc section: FileSystemFactoryOptions)
  // (Doc section: File Format Setup)
  auto read_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  // (Doc section: File Format Setup)
  // (Doc section: FileSystemDatasetFactory Make)
  // Now, we get a factory that will let us get our dataset -- we don't have the
  // dataset yet!
  ARROW_ASSIGN_OR_RAISE(auto factory, arrow::dataset::FileSystemDatasetFactory::Make(
                                          fs, selector, read_format, options));
  // (Doc section: FileSystemDatasetFactory Make)
  // (Doc section: FileSystemDatasetFactory Finish)
  // Now we build our dataset from the factory.
  ARROW_ASSIGN_OR_RAISE(auto read_dataset, factory->Finish());
  // (Doc section: FileSystemDatasetFactory Finish)
  // (Doc section: Dataset Fragments)
  // Print out the fragments
  ARROW_ASSIGN_OR_RAISE(auto fragments, read_dataset->GetFragments());
  for (const auto& fragment : fragments) {
    std::cout << "Found fragment: " << (*fragment)->ToString() << std::endl;
    std::cout << "Partition expression: "
              << (*fragment)->partition_expression().ToString() << std::endl;
  }
  // (Doc section: Dataset Fragments)
  // (Doc section: Read Scan Builder)
  // Scan dataset into a Table -- once this is done, you can do
  // normal table things with it, like computation and printing. However, now you're
  // also dedicated to being in memory.
  ARROW_ASSIGN_OR_RAISE(auto read_scan_builder, read_dataset->NewScan());
  // (Doc section: Read Scan Builder)
  // (Doc section: Read Scanner)
  ARROW_ASSIGN_OR_RAISE(auto read_scanner, read_scan_builder->Finish());
  // (Doc section: Read Scanner)
  // (Doc section: To Table)
  ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, read_scanner->ToTable());
  std::cout << table->ToString();
  // (Doc section: To Table)

  // (Doc section: TableBatchReader)
  // Now, let's get a table out to disk as a dataset!
  // We make a RecordBatchReader from our Table, then set up a scanner, which lets us
  // go to a file.
  std::shared_ptr<arrow::TableBatchReader> write_dataset =
      std::make_shared<arrow::TableBatchReader>(table);
  // (Doc section: TableBatchReader)
  // (Doc section: WriteScanner)
  auto write_scanner_builder =
      arrow::dataset::ScannerBuilder::FromRecordBatchReader(write_dataset);
  ARROW_ASSIGN_OR_RAISE(auto write_scanner, write_scanner_builder->Finish());
  // (Doc section: WriteScanner)
  // (Doc section: Partition Schema)
  // The partition schema determines which fields are used as keys for partitioning.
  auto partition_schema = arrow::schema({arrow::field("a", arrow::utf8())});
  // (Doc section: Partition Schema)
  // (Doc section: Partition Create)
  // We'll use Hive-style partitioning, which creates directories with "key=value"
  // pairs.
  auto partitioning =
      std::make_shared<arrow::dataset::HivePartitioning>(partition_schema);
  // (Doc section: Partition Create)
  // (Doc section: Write Format)
  // Now, we declare we'll be writing Parquet files.
  auto write_format = std::make_shared<arrow::dataset::ParquetFileFormat>();
  // (Doc section: Write Format)
  // (Doc section: Write Options)
  // This time, we make Options for writing, but do much more configuration.
  arrow::dataset::FileSystemDatasetWriteOptions write_options;
  // Defaults to start.
  write_options.file_write_options = write_format->DefaultWriteOptions();
  // (Doc section: Write Options)
  // (Doc section: Options FS)
  // Use the filesystem we already have.
  write_options.filesystem = fs;
  // (Doc section: Options FS)
  // (Doc section: Options Target)
  // Write to the folder "write_dataset" in current directory.
  write_options.base_dir = "write_dataset";
  // (Doc section: Options Target)
  // (Doc section: Options Partitioning)
  // Use the partitioning declared above.
  write_options.partitioning = partitioning;
  // (Doc section: Options Partitioning)
  // (Doc section: Options Name Template)
  // Define what the name for the files making up the dataset will be.
  write_options.basename_template = "part{i}.parquet";
  // (Doc section: Options Name Template)
  // (Doc section: Options File Behavior)
  // Set behavior to overwrite existing data -- specifically, this lets this example
  // be run more than once, and allows whatever code you have to overwrite what's there.
  write_options.existing_data_behavior =
      arrow::dataset::ExistingDataBehavior::kOverwriteOrIgnore;
  // (Doc section: Options File Behavior)
  // (Doc section: Write Dataset)
  // Write to disk!
  ARROW_RETURN_NOT_OK(
      arrow::dataset::FileSystemDataset::Write(write_options, write_scanner));
  // (Doc section: Write Dataset)
  // (Doc section: Ret)
  return arrow::Status::OK();
}
// (Doc section: Ret)
// (Doc section: Main)
int main() {
  arrow::Status st = RunMain();
  if (!st.ok()) {
    std::cerr << st << std::endl;
    return 1;
  }
  return 0;
}
// (Doc section: Main)

