Reading and writing Parquet files#
The Parquet format is a space-efficient columnar storage format for complex data. The Parquet C++ implementation is part of the Apache Arrow project and benefits from tight integration with the Arrow C++ classes and facilities.
Reading Parquet files#
The arrow::FileReader class reads data into Arrow Tables and RecordBatches.
The StreamReader class allows data to be read using a C++ input-stream approach, reading fields column by column and row by row. This approach is offered for ease of use and type-safety. It is of course also useful when data must be streamed as files are read and written incrementally.
Please note that the performance of the StreamReader will not be as good due to the type checking and the fact that column values are processed one at a time.
FileReader#
To read Parquet data into Arrow structures, use arrow::FileReader. To construct one, it requires a ::arrow::io::RandomAccessFile instance representing the input file. To read the whole file at once, use arrow::FileReader::ReadTable():
// #include "arrow/io/api.h"// #include "parquet/arrow/reader.h"arrow::MemoryPool*pool=arrow::default_memory_pool();std::shared_ptr<arrow::io::RandomAccessFile>input;ARROW_ASSIGN_OR_RAISE(input,arrow::io::ReadableFile::Open(path_to_file));// Open Parquet file readerstd::unique_ptr<parquet::arrow::FileReader>arrow_reader;ARROW_ASSIGN_OR_RAISE(arrow_reader,parquet::arrow::OpenFile(input,pool));// Read entire file as a single Arrow tablestd::shared_ptr<arrow::Table>table;ARROW_RETURN_NOT_OK(arrow_reader->ReadTable(&table));
Finer-grained options are available through the arrow::FileReaderBuilder helper class, which accepts the ReaderProperties and ArrowReaderProperties classes.
For reading as a stream of batches, use the arrow::FileReader::GetRecordBatchReader() method to retrieve an arrow::RecordBatchReader. It will use the batch size set in ArrowReaderProperties.
// #include "arrow/io/api.h"// #include "parquet/arrow/reader.h"arrow::MemoryPool*pool=arrow::default_memory_pool();// Configure general Parquet reader settingsautoreader_properties=parquet::ReaderProperties(pool);reader_properties.set_buffer_size(4096*4);reader_properties.enable_buffered_stream();// Configure Arrow-specific Parquet reader settingsautoarrow_reader_props=parquet::ArrowReaderProperties();arrow_reader_props.set_batch_size(128*1024);// default 64 * 1024parquet::arrow::FileReaderBuilderreader_builder;ARROW_RETURN_NOT_OK(reader_builder.OpenFile(path_to_file,/*memory_map=*/false,reader_properties));reader_builder.memory_pool(pool);reader_builder.properties(arrow_reader_props);std::unique_ptr<parquet::arrow::FileReader>arrow_reader;ARROW_ASSIGN_OR_RAISE(arrow_reader,reader_builder.Build());std::shared_ptr<::arrow::RecordBatchReader>rb_reader;ARROW_ASSIGN_OR_RAISE(rb_reader,arrow_reader->GetRecordBatchReader());for(arrow::Result<std::shared_ptr<arrow::RecordBatch>>maybe_batch:*rb_reader){// Operate on each batch...}
See also
For reading multi-file datasets or pushing down filters to prune row groups, see Tabular Datasets.
Performance and Memory Efficiency#
For remote filesystems, use read coalescing (pre-buffering) to reduce the number of API calls:
auto arrow_reader_props = parquet::ArrowReaderProperties();
arrow_reader_props.set_pre_buffer(true);
The defaults are generally tuned towards good performance, but parallel column decoding is off by default. Enable it in the constructor of ArrowReaderProperties:
auto arrow_reader_props = parquet::ArrowReaderProperties(/*use_threads=*/true);
If memory efficiency is more important than performance, then (see the combined sketch after this list):
Do not turn on read coalescing (pre-buffering) in parquet::ArrowReaderProperties.
Read data in batches using arrow::FileReader::GetRecordBatchReader().
Turn on enable_buffered_stream in parquet::ReaderProperties.
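A minimal sketch that combines these recommendations, reusing path_to_file from the earlier examples and leaving pre-buffering at its default (off):

// Memory-oriented reader configuration; path_to_file is assumed to be defined.
arrow::MemoryPool* pool = arrow::default_memory_pool();

auto reader_properties = parquet::ReaderProperties(pool);
reader_properties.enable_buffered_stream();    // read pages through a small buffer

auto arrow_reader_props = parquet::ArrowReaderProperties();
arrow_reader_props.set_batch_size(64 * 1024);  // keep batches small
// Note: pre-buffering is simply left disabled (the default).

parquet::arrow::FileReaderBuilder reader_builder;
ARROW_RETURN_NOT_OK(
    reader_builder.OpenFile(path_to_file, /*memory_map=*/false, reader_properties));
reader_builder.memory_pool(pool);
reader_builder.properties(arrow_reader_props);

std::unique_ptr<parquet::arrow::FileReader> arrow_reader;
ARROW_ASSIGN_OR_RAISE(arrow_reader, reader_builder.Build());

std::shared_ptr<arrow::RecordBatchReader> rb_reader;
ARROW_ASSIGN_OR_RAISE(rb_reader, arrow_reader->GetRecordBatchReader());
// Consume one batch at a time instead of materializing the whole table.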
In addition, if you know certain columns contain many repeated values, you can read them as dictionary encoded columns. This is enabled with the set_read_dictionary setting on ArrowReaderProperties. If the files were written with Arrow C++ and the store_schema was activated, then the original Arrow schema will be automatically read and will override this setting.
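For illustration, a short sketch of requesting dictionary decoding for a single column; set_read_dictionary takes the index of the Parquet column, and the index 0 here is just a placeholder:

auto arrow_reader_props = parquet::ArrowReaderProperties();
// Read Parquet column 0 back as an Arrow dictionary-encoded column.
arrow_reader_props.set_read_dictionary(/*column_index=*/0, /*read_dict=*/true);
// Pass arrow_reader_props to FileReaderBuilder::properties() as shown above.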
StreamReader#
The StreamReader allows for Parquet files to be read using standard C++ input operators, which ensures type-safety.
Please note that types must match the schema exactly, i.e. if the schema field is an unsigned 16-bit integer then you must supply a uint16_t type.
Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:
Attempt to read field by supplying the incorrect type.
Attempt to read beyond end of row.
Attempt to read beyond end of file.
#include"arrow/io/file.h"#include"parquet/stream_reader.h"{std::shared_ptr<arrow::io::ReadableFile>infile;PARQUET_ASSIGN_OR_THROW(infile,arrow::io::ReadableFile::Open("test.parquet"));parquet::StreamReaderstream{parquet::ParquetFileReader::Open(infile)};std::stringarticle;floatprice;uint32_tquantity;while(!stream.eof()){stream>>article>>price>>quantity>>parquet::EndRow;// ...}}
Writing Parquet files#
WriteTable#
The arrow::WriteTable() function writes an entire ::arrow::Table to an output file.
// #include "parquet/arrow/writer.h"// #include "arrow/util/type_fwd.h"usingparquet::ArrowWriterProperties;usingparquet::WriterProperties;ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table>table,GetTable());// Choose compressionstd::shared_ptr<WriterProperties>props=WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();// Opt to store Arrow schema for easier reads back into Arrowstd::shared_ptr<ArrowWriterProperties>arrow_props=ArrowWriterProperties::Builder().store_schema()->build();std::shared_ptr<arrow::io::FileOutputStream>outfile;ARROW_ASSIGN_OR_RAISE(outfile,arrow::io::FileOutputStream::Open(path_to_file));ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table.get(),arrow::default_memory_pool(),outfile,/*chunk_size=*/3,props,arrow_props));
Note
Column compression is off by default in C++. See below for how to choose a compression codec in the writer properties.
To write out data batch-by-batch, use arrow::FileWriter.
// #include "parquet/arrow/writer.h"// #include "arrow/util/type_fwd.h"usingparquet::ArrowWriterProperties;usingparquet::WriterProperties;// Data is in RBRstd::shared_ptr<arrow::RecordBatchReader>batch_stream;ARROW_ASSIGN_OR_RAISE(batch_stream,GetRBR());// Choose compressionstd::shared_ptr<WriterProperties>props=WriterProperties::Builder().compression(arrow::Compression::SNAPPY)->build();// Opt to store Arrow schema for easier reads back into Arrowstd::shared_ptr<ArrowWriterProperties>arrow_props=ArrowWriterProperties::Builder().store_schema()->build();// Create a writerstd::shared_ptr<arrow::io::FileOutputStream>outfile;ARROW_ASSIGN_OR_RAISE(outfile,arrow::io::FileOutputStream::Open(path_to_file));std::unique_ptr<parquet::arrow::FileWriter>writer;ARROW_ASSIGN_OR_RAISE(writer,parquet::arrow::FileWriter::Open(*batch_stream->schema().get(),arrow::default_memory_pool(),outfile,props,arrow_props));// Write each batch as a row_groupfor(arrow::Result<std::shared_ptr<arrow::RecordBatch>>maybe_batch:*batch_stream){ARROW_ASSIGN_OR_RAISE(autobatch,maybe_batch);ARROW_ASSIGN_OR_RAISE(autotable,arrow::Table::FromRecordBatches(batch->schema(),{batch}));ARROW_RETURN_NOT_OK(writer->WriteTable(*table.get(),batch->num_rows()));}// Write file footer and closeARROW_RETURN_NOT_OK(writer->Close());
StreamWriter#
The StreamWriter allows for Parquet files to be written using standard C++ output operators, similar to reading with the StreamReader class. This type-safe approach also ensures that rows are written without omitting fields and allows for new row groups to be created automatically (after a certain volume of data) or explicitly by using the EndRowGroup stream modifier.
Exceptions are used to signal errors. A ParquetException is thrown in the following circumstances:
Attempt to write a field using an incorrect type.
Attempt to write too many fields in a row.
Attempt to skip a required field.
#include"arrow/io/file.h"#include"parquet/stream_writer.h"{std::shared_ptr<arrow::io::FileOutputStream>outfile;PARQUET_ASSIGN_OR_THROW(outfile,arrow::io::FileOutputStream::Open("test.parquet"));parquet::WriterProperties::Builderbuilder;std::shared_ptr<parquet::schema::GroupNode>schema;// Set up builder with required compression type etc.// Define schema.// ...parquet::StreamWriteros{parquet::ParquetFileWriter::Open(outfile,schema,builder.build())};// Loop over some data structure which provides the required// fields to be written and write each row.for(constauto&a:getArticles()){os<<a.name()<<a.price()<<a.quantity()<<parquet::EndRow;}}
Writer properties#
To configure how Parquet files are written, use the WriterProperties::Builder:
#include"parquet/arrow/writer.h"#include"arrow/util/type_fwd.h"usingparquet::WriterProperties;usingparquet::ParquetVersion;usingparquet::ParquetDataPageVersion;usingarrow::Compression;std::shared_ptr<WriterProperties>props=WriterProperties::Builder().max_row_group_length(64*1024).created_by("My Application").version(ParquetVersion::PARQUET_2_6).data_page_version(ParquetDataPageVersion::V2).compression(Compression::SNAPPY).build();
The max_row_group_length sets an upper bound on the number of rows per row group that takes precedence over the chunk_size passed in the write methods.
You can set the version of Parquet to write with version, which determines which logical types are available. In addition, you can set the data page version with data_page_version. It's V1 by default; setting it to V2 will allow more optimal compression (skipping compressing pages where there isn't a space benefit), but not all readers support this data page version.
Compression is off by default, but to get the most out of Parquet, you should also choose a compression codec. You can choose one for the whole file or choose one for individual columns. If you choose a mix, the file-level option will apply to columns that don't have a specific compression codec. See ::arrow::Compression for options.
Column data encodings can likewise be applied at the file level or at the column level. By default, the writer will attempt to dictionary-encode all supported columns, unless the dictionary grows too large. This behavior can be changed at the file level or at the column level with disable_dictionary(). When not using dictionary encoding, it will fall back to the encoding set for the column or the overall file; by default Encoding::PLAIN, but this can be changed with encoding().
#include"parquet/arrow/writer.h"#include"arrow/util/type_fwd.h"usingparquet::WriterProperties;usingarrow::Compression;usingparquet::Encoding;std::shared_ptr<WriterProperties>props=WriterProperties::Builder().compression(Compression::SNAPPY)// Fallback->compression("colA",Compression::ZSTD)// Only applies to column "colA"->encoding(Encoding::BIT_PACKED)// Fallback->encoding("colB",Encoding::RLE)// Only applies to column "colB"->disable_dictionary("colB")// Never dictionary-encode column "colB"->build();
Statistics are enabled by default for all columns. You can disable statistics for all columns or for specific columns using disable_statistics on the builder. There is a max_statistics_size which limits the maximum number of bytes that may be used for min and max values, useful for types like strings or binary blobs. If the page index is enabled for a column using enable_write_page_index, then statistics are not written to the page header, because they would be duplicated in the ColumnIndex.
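As an illustration, a sketch of these builder options; the column name "colA" is a placeholder, and enable_write_page_index may not be available in older Arrow releases:

using parquet::WriterProperties;

std::shared_ptr<WriterProperties> props =
    WriterProperties::Builder()
        .disable_statistics("colA")   // no statistics for column "colA"
        ->max_statistics_size(64)     // cap min/max statistics values at 64 bytes
        ->enable_write_page_index()   // write ColumnIndex/OffsetIndex structures
        ->build();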
There are also Arrow-specific settings that can be configured with parquet::ArrowWriterProperties:
#include"parquet/arrow/writer.h"usingparquet::ArrowWriterProperties;std::shared_ptr<ArrowWriterProperties>arrow_props=ArrowWriterProperties::Builder().enable_deprecated_int96_timestamps()// default False->store_schema()// default False->build();
These options mostly dictate how Arrow types are converted to Parquet types. Turning on store_schema will cause the writer to store the serialized Arrow schema within the file metadata. Since there is no bijection between Parquet schemas and Arrow schemas, storing the Arrow schema allows the Arrow reader to more faithfully recreate the original data. This mapping from Parquet types back to the original Arrow types includes:
Reading timestamps with the original timezone information (Parquet does not support time zones);
Reading Arrow types from their storage types (such as Duration from int64 columns);
Reading string and binary columns back into large variants with 64-bit offsets;
Reading back columns as dictionary encoded (whether an Arrow column and the serialized Parquet version are dictionary encoded are independent).
Supported Parquet features#
The Parquet format has many features, and Parquet C++ supports a subset of them.
Page types#
| Page type | Notes |
|---|---|
| DATA_PAGE | |
| DATA_PAGE_V2 | |
| DICTIONARY_PAGE | |
Unsupported page type: INDEX_PAGE. When reading a Parquet file, pages of this type are ignored.
Compression#
| Compression codec | Notes |
|---|---|
| SNAPPY | |
| GZIP | |
| BROTLI | |
| LZ4 | (1) |
| ZSTD | |
(1) On the read side, Parquet C++ is able to decompress both the regular LZ4 block format and the ad-hoc Hadoop LZ4 format used by the reference Parquet implementation. On the write side, Parquet C++ always generates the ad-hoc Hadoop LZ4 format.
Unsupported compression codec: LZO.
Encodings#
| Encoding | Reading | Writing | Notes |
|---|---|---|---|
| PLAIN | ✓ | ✓ | |
| PLAIN_DICTIONARY | ✓ | ✓ | |
| BIT_PACKED | ✓ | ✓ | (1) |
| RLE | ✓ | ✓ | (1) |
| RLE_DICTIONARY | ✓ | ✓ | (2) |
| BYTE_STREAM_SPLIT | ✓ | ✓ | |
| DELTA_BINARY_PACKED | ✓ | ✓ | |
| DELTA_BYTE_ARRAY | ✓ | ✓ | |
| DELTA_LENGTH_BYTE_ARRAY | ✓ | ✓ | |
(1) Only supported for encoding definition and repetition levels, and boolean values.
(2) On the write path, RLE_DICTIONARY is only enabled if Parquet format version 2.4 or greater is selected in WriterProperties::version().
Types#
Physical types#
| Physical type | Mapped Arrow type | Notes |
|---|---|---|
| BOOLEAN | Boolean | |
| INT32 | Int32 / other | (1) |
| INT64 | Int64 / other | (1) |
| INT96 | Timestamp (nanoseconds) | (2) |
| FLOAT | Float32 | |
| DOUBLE | Float64 | |
| BYTE_ARRAY | Binary / LargeBinary / BinaryView | (1) |
| FIXED_LENGTH_BYTE_ARRAY | FixedSizeBinary / other | (1) |
(1) Can be mapped to other Arrow types, depending on the logical type (see the table below).
(2) On the write side, ArrowWriterProperties::support_deprecated_int96_timestamps() must be enabled.
Logical types#
Specific logical types can override the default Arrow type mapping for a givenphysical type.
| Logical type | Physical type | Mapped Arrow type | Notes |
|---|---|---|---|
| NULL | Any | Null | (1) |
| INT | INT32 | Int8 / UInt8 / Int16 / UInt16 / Int32 / UInt32 | |
| INT | INT64 | Int64 / UInt64 | |
| DECIMAL | INT32 / INT64 / BYTE_ARRAY / FIXED_LENGTH_BYTE_ARRAY | Decimal32 / Decimal64 / Decimal128 / Decimal256 | (2) |
| DATE | INT32 | Date32 | (3) |
| TIME | INT32 | Time32 (milliseconds) | |
| TIME | INT64 | Time64 (micro- or nanoseconds) | |
| TIMESTAMP | INT64 | Timestamp (milli-, micro- or nanoseconds) | |
| STRING | BYTE_ARRAY | String / LargeString / StringView | |
| LIST | Any | List / LargeList | (4) |
| MAP | Any | Map | (5) |
| FLOAT16 | FIXED_LENGTH_BYTE_ARRAY | HalfFloat | |
| UUID | FIXED_LENGTH_BYTE_ARRAY | Extension (arrow.uuid) | (6) |
| JSON | BYTE_ARRAY | Extension (arrow.json) | (6) |
| GEOMETRY | BYTE_ARRAY | Extension (geoarrow.wkb) | (6) (7) |
| GEOGRAPHY | BYTE_ARRAY | Extension (geoarrow.wkb) | (6) (7) |
(1) On the write side, the Parquet physical type INT32 is generated.
(2) On the write side, a FIXED_LENGTH_BYTE_ARRAY is always emitted except if store_decimal_as_integer is set to true.
(3) On the write side, an Arrow Date64 is also mapped to a Parquet DATE INT32.
(4) On the write side, an Arrow FixedSizeList is also mapped to a Parquet LIST.
(5) On the read side, a key with multiple values does not get deduplicated, in contradiction with the Parquet specification.
(6) Requires that arrow_extensions_enabled in ArrowReaderProperties is true. When false, the underlying storage type is read.
(7) Requires that the geoarrow.wkb extension type is registered.
Unsupported logical type: BSON. If such a type is encountered when reading a Parquet file, the default physical type mapping is used (for example, a Parquet BSON column may be read as Arrow Binary or FixedSizeBinary).
Converted types#
While converted types are deprecated in the Parquet format (they are superseded by logical types), they are recognized and emitted by the Parquet C++ implementation so as to maximize compatibility with other Parquet implementations.
Special cases#
An Arrow Extension type is written out as its storage type. It can still be recreated at read time using Parquet metadata (see "Roundtripping Arrow types" below). Some extension types have Parquet LogicalType equivalents (e.g., UUID, JSON, GEOMETRY, GEOGRAPHY). These are created automatically if the appropriate option is set in the ArrowReaderProperties, even if there was no Arrow schema stored in the Parquet metadata (see the sketch after this list).
An Arrow Dictionary type is written out as its value type. It can stillbe recreated at read time using Parquet metadata (see “Roundtripping Arrowtypes” below).
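A brief sketch of opting in to the extension-type mapping mentioned above at read time; the setter name set_arrow_extensions_enabled is an assumption based on the usual ArrowReaderProperties naming and should be verified against your Arrow version:

auto arrow_reader_props = parquet::ArrowReaderProperties();
// Assumed setter for the arrow_extensions_enabled property described above;
// when enabled, UUID/JSON/GEOMETRY/GEOGRAPHY columns are read as extension types.
arrow_reader_props.set_arrow_extensions_enabled(true);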
Roundtripping Arrow types and schema#
While there is no bijection between Arrow types and Parquet types, it is possible to serialize the Arrow schema as part of the Parquet file metadata. This is enabled using ArrowWriterProperties::store_schema().
On the read path, the serialized schema will be automatically recognizedand will recreate the original Arrow data, converting the Parquet data asrequired.
As an example, when serializing an Arrow LargeList to Parquet:
The data is written out as a Parquet LIST
When read back, the Parquet LIST data is decoded as an Arrow LargeList if ArrowWriterProperties::store_schema() was enabled when writing the file; otherwise, it is decoded as an Arrow List.
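A condensed sketch of this roundtrip; GetTableWithLargeList() is a hypothetical helper returning a table that contains a LargeList column, and path_to_file is as in the earlier examples:

// Write with store_schema() so the Arrow schema is embedded in the file metadata.
ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Table> table, GetTableWithLargeList());

auto arrow_props = parquet::ArrowWriterProperties::Builder().store_schema()->build();

std::shared_ptr<arrow::io::FileOutputStream> outfile;
ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(path_to_file));
ARROW_RETURN_NOT_OK(parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                               outfile, /*chunk_size=*/1024,
                                               parquet::default_writer_properties(),
                                               arrow_props));

// Read back: the embedded schema restores LargeList instead of the default List.
std::shared_ptr<arrow::io::RandomAccessFile> input;
ARROW_ASSIGN_OR_RAISE(input, arrow::io::ReadableFile::Open(path_to_file));
std::unique_ptr<parquet::arrow::FileReader> reader;
ARROW_ASSIGN_OR_RAISE(reader,
                      parquet::arrow::OpenFile(input, arrow::default_memory_pool()));
std::shared_ptr<arrow::Table> roundtripped;
ARROW_RETURN_NOT_OK(reader->ReadTable(&roundtripped));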
Parquet field id#
The Parquet format supports an optional integer field id which can be assigned to a given field. This is used, for example, in the Apache Iceberg specification.
On the writer side, if PARQUET:field_id is present as a metadata key on an Arrow field, then its value is parsed as a non-negative integer and is used as the field id for the corresponding Parquet field.
On the reader side, Arrow will convert such a field id to a metadata key named PARQUET:field_id on the corresponding Arrow field.
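As an illustration, a sketch of attaching a field id to an Arrow field before writing; the column name "col" and the id value 7 are arbitrary:

// Attach the PARQUET:field_id metadata key to an Arrow field.
auto field_metadata = arrow::key_value_metadata({"PARQUET:field_id"}, {"7"});
auto schema = arrow::schema(
    {arrow::field("col", arrow::int32(), /*nullable=*/true, field_metadata)});
// Writing a table with this schema assigns field id 7 to the Parquet field "col";
// reading the file back re-creates the same metadata key on the Arrow field.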
Serialization details#
The Arrow schema is serialized as an Arrow IPC schema message, then base64-encoded and stored under the ARROW:schema metadata key in the Parquet file metadata.
Limitations#
Writing or reading back FixedSizeList data with null entries is not supported.
Encryption#
Parquet C++ implements all features specified in the encryption specification, except for encryption of column index and bloom filter modules.
More specifically, Parquet C++ supports:
AES_GCM_V1 and AES_GCM_CTR_V1 encryption algorithms.
AAD suffix for Footer, ColumnMetaData, Data Page, Dictionary Page,Data PageHeader, Dictionary PageHeader module types. Other module types(ColumnIndex, OffsetIndex, BloomFilter Header, BloomFilter Bitset) are notsupported.
EncryptionWithFooterKey and EncryptionWithColumnKey modes.
Encrypted Footer and Plaintext Footer modes.
Configuration#
Parquet encryption uses a parquet::encryption::CryptoFactory that has access to a Key Management System (KMS), which stores the actual encryption keys, referenced by key ids. The Parquet encryption configuration uses only key ids, never actual keys.
Parquet metadata encryption is configured via parquet::encryption::EncryptionConfiguration:
// Set write options with encryption configuration.autoencryption_config=std::make_shared<parquet::encryption::EncryptionConfiguration>(std::string("footerKeyId"));
If encryption_config->uniform_encryption is set to true, then all columns are encrypted with the same key as the Parquet metadata. Otherwise, individual columns are encrypted with individual keys as configured via encryption_config->column_keys. This field expects a string of the format "columnKeyID1:colName1,colName2;columnKeyID3:colName3...".
// Set write options with encryption configuration.autoencryption_config=std::make_shared<parquet::encryption::EncryptionConfiguration>(std::string("footerKeyId"));encryption_config->column_keys="columnKeyId: i, s.a, s.b, m.key_value.key, m.key_value.value, l.list.element";
See the full Parquet column encryption example.
Note
Encrypting columns that have nested fields (struct, map or list data types) requires column keys for the inner fields, not the outer column itself. Configuring a column key for the outer column causes this error (here the column name is col):
OSError: Encrypted column col not in file schema
Conventionally, the key and value fields of a map column m have the names m.key_value.key and m.key_value.value, respectively. The inner field of a list column l has the name l.list.element. An inner field f of a struct column s has the name s.f.
Miscellaneous#
| Feature | Reading | Writing | Notes |
|---|---|---|---|
| Column Index | ✓ | ✓ | (1) |
| Offset Index | ✓ | ✓ | (1) |
| Bloom Filter | ✓ | ✓ | (2) |
| CRC checksums | ✓ | ✓ | |
(1) Access to the Column and Offset Index structures is provided, but data read APIs do not currently make any use of them.
(2) APIs are provided for creating, serializing and deserializing Bloom Filters, but they are not integrated into data read APIs.