File Formats#
CSV reader#
- struct ConvertOptions#
Public Members
- bool check_utf8 = true#
Whether to check UTF8 validity of string columns.
- std::unordered_map<std::string, std::shared_ptr<DataType>> column_types#
Optional per-column types (disabling type inference on those columns).
- std::vector<std::string> null_values#
Recognized spellings for null values.
- std::vector<std::string> true_values#
Recognized spellings for boolean true values.
- std::vector<std::string> false_values#
Recognized spellings for boolean false values.
- bool strings_can_be_null = false#
Whether string / binary columns can have null values.
If true, then strings in “null_values” are considered null for string columns. If false, then all strings are valid string values.
- bool quoted_strings_can_be_null = true#
Whether quoted values can be null.
If true, then strings in “null_values” are also considered null when they appear quoted in the CSV file. Otherwise, quoted values are never considered null.
- bool auto_dict_encode = false#
Whether to try to automatically dict-encode string / binary data.
If true, then when type inference detects a string or binary column, it is dict-encoded up to auto_dict_max_cardinality distinct values (per chunk), after which it switches to regular encoding. This setting is ignored for non-inferred columns (those in column_types).
- char decimal_point = '.'#
Decimal point character for floating-point and decimal data.
- std::vector<std::string> include_columns#
If non-empty, indicates the names of columns from the CSV file that should actually be read and converted (in the vector’s order).
Columns not in this vector will be ignored.
- bool include_missing_columns = false#
If false, columns in include_columns but not in the CSV file will error out. If true, columns in include_columns but not in the CSV file will produce a column of nulls (whose type is selected using column_types, or null by default). This option is ignored if include_columns is empty.
- std::vector<std::shared_ptr<TimestampParser>> timestamp_parsers#
User-defined timestamp parsers, using the virtual parser interface in arrow/util/value_parsing.h.
More than one parser can be specified, and the CSV conversion logic will try parsing values starting from the beginning of this vector. If no parsers are specified, we use the default built-in ISO-8601 parser.
Public Static Functions
- static ConvertOptions Defaults()#
Create conversion options with default values, including conventional values for null_values, true_values and false_values.
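A minimal sketch of how these options compose (the column names used here are hypothetical):

```cpp
#include <arrow/api.h>
#include <arrow/csv/api.h>

arrow::csv::ConvertOptions MakeConvertOptions() {
  auto options = arrow::csv::ConvertOptions::Defaults();
  // Disable inference for selected columns by giving them explicit types.
  options.column_types["id"] = arrow::int64();
  options.column_types["price"] = arrow::decimal128(10, 2);
  // Treat these spellings as nulls, including inside string columns.
  options.null_values = {"", "NA", "null"};
  options.strings_can_be_null = true;
  // Only read and convert these columns, in this order.
  options.include_columns = {"id", "price", "created_at"};
  return options;
}
```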
- struct ParseOptions#
Public Members
- bool quoting = true#
Whether quoting is used.
- char quote_char = '"'#
Quoting character (if quoting is true).
- bool double_quote = true#
Whether a quote inside a value is double-quoted.
- bool escaping = false#
Whether escaping is used.
- char escape_char = kDefaultEscapeChar#
Escaping character (if escaping is true).
- bool newlines_in_values = false#
Whether values are allowed to contain CR (0x0d) and LF (0x0a) characters.
- bool ignore_empty_lines = true#
Whether empty lines are ignored.
If false, an empty line represents a single empty value (assuming a one-column CSV file).
- InvalidRowHandler invalid_row_handler#
A handler function for rows which do not have the correct number of columns.
Public Static Functions
- static ParseOptions Defaults()#
Create parsing options with default values.
- struct ReadOptions#
Public Members
- bool use_threads = true#
Whether to use the global CPU thread pool.
- int32_t block_size = 1 << 20#
Block size we request from the IO layer.
This will determine multi-threading granularity as well as the size of individual record batches. The minimum valid value for block size is 1.
- int32_t skip_rows = 0#
Number of header rows to skip (not including the row of column names, if any).
- int32_t skip_rows_after_names = 0#
Number of rows to skip after the column names are read, if any.
- std::vector<std::string> column_names#
Column names for the target table.
If empty, fall back on autogenerate_column_names.
- bool autogenerate_column_names = false#
Whether to autogenerate column names if column_names is empty.
If true, column names will be of the form “f0”, “f1”… If false, column names will be read from the first CSV row after skip_rows.
Public Static Functions
- static ReadOptions Defaults()#
Create read options with default values.
- class TableReader#
A class that reads an entire CSV file into an Arrow Table.
Public Functions
Public Static Functions
- static Result<std::shared_ptr<TableReader>> Make(io::IOContext io_context, std::shared_ptr<io::InputStream> input, const ReadOptions&, const ParseOptions&, const ConvertOptions&)#
Create a TableReader instance.
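A sketch of end-to-end usage (the file path is hypothetical); the read, parse and convert options described above all feed into TableReader::Make:

```cpp
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/result.h>
#include <arrow/table.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadCsvFile() {
  // Open the input file (path is hypothetical).
  ARROW_ASSIGN_OR_RAISE(auto input,
                        arrow::io::ReadableFile::Open("data.csv"));
  auto read_options = arrow::csv::ReadOptions::Defaults();
  auto parse_options = arrow::csv::ParseOptions::Defaults();
  auto convert_options = arrow::csv::ConvertOptions::Defaults();
  // Build the reader and materialize the whole file as a Table.
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::csv::TableReader::Make(arrow::io::default_io_context(), input,
                                    read_options, parse_options,
                                    convert_options));
  return reader->Read();
}
```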
- class StreamingReader : public arrow::RecordBatchReader#
A class that reads a CSV file incrementally.
Caveats:
For now, this is always single-threaded (regardless of ReadOptions::use_threads).
Type inference is done on the first block and types are frozen afterwards; to make sure the right data types are inferred, either set ReadOptions::block_size to a large enough value, or use ConvertOptions::column_types to set the desired data types explicitly.
Public Functions
- virtual int64_t bytes_read() const = 0#
Return the number of bytes which have been read and processed.
The returned number includes CSV bytes which the StreamingReader has finished processing, but not bytes for which some processing (e.g. CSV parsing or conversion to Arrow layout) is still ongoing.
Furthermore, the following rules apply:
bytes skipped by ReadOptions.skip_rows are counted as being read before any records are returned.
bytes read while parsing the header are counted as being read before any records are returned.
bytes skipped by ReadOptions.skip_rows_after_names are counted after the first batch is returned.
Public Static Functions
- static Future<std::shared_ptr<StreamingReader>> MakeAsync(io::IOContext io_context, std::shared_ptr<io::InputStream> input, arrow::internal::Executor* cpu_executor, const ReadOptions&, const ParseOptions&, const ConvertOptions&)#
Create a StreamingReader instance.
This involves some I/O as the first batch must be loaded during the creation process, so it is returned as a future.
Currently, the StreamingReader is not async-reentrant and does not do any fan-out parsing (see ARROW-11889).
CSV writer#
- struct WriteOptions#
Public Members
- bool include_header = true#
Whether to write an initial header line with column names.
- int32_t batch_size = 1024#
Maximum number of rows processed at a time.
The CSV writer converts and writes data in batches of N rows. This number can impact performance.
- std::string null_string#
The string to write for null values. Quotes are not allowed in this string.
- io::IOContext io_context#
IO context for writing.
- std::string eol = "\n"#
The end of line character to use for ending rows.
- QuotingStyle quoting_style = QuotingStyle::Needed#
Quoting style.
Public Static Functions
- static WriteOptions Defaults()#
Create write options with default values.
- Status WriteCSV(const Table& table, const WriteOptions& options, arrow::io::OutputStream* output)#
Convert table to CSV and write the result to output.
Experimental
- Status WriteCSV(const RecordBatch& batch, const WriteOptions& options, arrow::io::OutputStream* output)#
Convert batch to CSV and write the result to output.
Experimental
- Status WriteCSV(const std::shared_ptr<RecordBatchReader>& reader, const WriteOptions& options, arrow::io::OutputStream* output)#
Convert batches read through a RecordBatchReader to CSV and write the results to output.
Experimental
- Result<std::shared_ptr<ipc::RecordBatchWriter>> MakeCSVWriter(std::shared_ptr<io::OutputStream> sink, const std::shared_ptr<Schema>& schema, const WriteOptions& options = WriteOptions::Defaults())#
Create a new CSV writer.
User is responsible for closing the actual OutputStream.
- Parameters:
sink –[in] output stream to write to
schema –[in] the schema of the record batches to be written
options –[in] options for serialization
- Returns:
Result<std::shared_ptr<RecordBatchWriter>>
- Result<std::shared_ptr<ipc::RecordBatchWriter>> MakeCSVWriter(io::OutputStream* sink, const std::shared_ptr<Schema>& schema, const WriteOptions& options = WriteOptions::Defaults())#
Create a new CSV writer.
- Parameters:
sink –[in] output stream to write to (does not take ownership)
schema –[in] the schema of the record batches to be written
options –[in] options for serialization
- Returns:
Result<std::shared_ptr<RecordBatchWriter>>
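A minimal sketch of one-shot CSV writing with WriteCSV (the output path is hypothetical):

```cpp
#include <arrow/csv/api.h>
#include <arrow/io/api.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

arrow::Status WriteCsvFile(const arrow::Table& table) {
  // Create the output stream (path is hypothetical).
  ARROW_ASSIGN_OR_RAISE(auto output,
                        arrow::io::FileOutputStream::Open("out.csv"));
  auto options = arrow::csv::WriteOptions::Defaults();
  options.include_header = true;
  // One-shot conversion of the whole table to CSV.
  ARROW_RETURN_NOT_OK(arrow::csv::WriteCSV(table, options, output.get()));
  return output->Close();
}
```

For incremental writing, MakeCSVWriter returns a RecordBatchWriter that accepts record batches one at a time.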
Line-separated JSON#
- enum class arrow::json::UnexpectedFieldBehavior : char#
Values:
- enumerator Ignore#
Unexpected JSON fields are ignored.
- enumerator Error#
Unexpected JSON fields error out.
- enumerator InferType#
Unexpected JSON fields are type-inferred and included in the output.
- struct ReadOptions#
Public Members
- bool use_threads = true#
Whether to use the global CPU thread pool.
- int32_t block_size = 1 << 20#
Block size we request from the IO layer; also determines the size of chunks when use_threads is true.
Public Static Functions
- static ReadOptions Defaults()#
Create read options with default values.
- struct ParseOptions#
Public Members
- std::shared_ptr<Schema> explicit_schema#
Optional explicit schema (disables type inference on those fields).
- bool newlines_in_values = false#
Whether objects may be printed across multiple lines (for example pretty-printed).
If true, parsing may be slower.
- UnexpectedFieldBehavior unexpected_field_behavior = UnexpectedFieldBehavior::InferType#
How JSON fields outside of explicit_schema (if given) are treated.
Public Static Functions
- static ParseOptions Defaults()#
Create parsing options with default values.
- class TableReader#
A class that reads an entire JSON file into an Arrow Table.
The file is expected to consist of individual line-separated JSON objects.
Public Functions
Public Static Functions
- static Result<std::shared_ptr<TableReader>> Make(MemoryPool* pool, std::shared_ptr<io::InputStream> input, const ReadOptions&, const ParseOptions&)#
Create a TableReader instance.
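A minimal sketch of reading a line-delimited JSON file into a Table (the file path is hypothetical):

```cpp
#include <arrow/io/api.h>
#include <arrow/json/api.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/table.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadJsonFile() {
  // Open the line-delimited JSON input (path is hypothetical).
  ARROW_ASSIGN_OR_RAISE(auto input,
                        arrow::io::ReadableFile::Open("data.jsonl"));
  auto read_options = arrow::json::ReadOptions::Defaults();
  auto parse_options = arrow::json::ParseOptions::Defaults();
  // Unexpected fields are inferred by default; use Ignore or Error to change that.
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::json::TableReader::Make(arrow::default_memory_pool(), input,
                                     read_options, parse_options));
  return reader->Read();
}
```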
- class StreamingReader : public arrow::RecordBatchReader#
A class that reads a JSON file incrementally.
JSON data is read from a stream in fixed-size blocks (configurable with ReadOptions::block_size). Each block is converted to a RecordBatch. Yielded batches have a consistent schema but may differ in row count.
The supplied ParseOptions are used to determine a schema, based either on a provided explicit schema or inferred from the first non-empty block. Afterwards, the target schema is frozen. If UnexpectedFieldBehavior::InferType is specified, unexpected fields will only be inferred for the first block. Afterwards they’ll be treated as errors.
If ReadOptions::use_threads is true, each block’s parsing/decoding task will be parallelized on the given cpu_executor (with readahead corresponding to the executor’s capacity). If an executor isn’t provided, the global thread pool will be used.
If ReadOptions::use_threads is false, computations will be run on the calling thread and cpu_executor will be ignored.
Public Functions
- virtual Future<std::shared_ptr<RecordBatch>> ReadNextAsync() = 0#
Read the next RecordBatch asynchronously.
This function is async-reentrant (but not synchronously reentrant). However, if threading is disabled, this will block until completion.
- virtual int64_t bytes_processed() const = 0#
Get the number of bytes which have been successfully converted to record batches and consumed.
Public Static Functions
- static Result<std::shared_ptr<StreamingReader>> Make(std::shared_ptr<io::InputStream> stream, const ReadOptions& read_options, const ParseOptions& parse_options, const io::IOContext& io_context = io::default_io_context(), ::arrow::internal::Executor* cpu_executor = NULLPTR)#
Create a StreamingReader from an InputStream.
Blocks until the initial batch is loaded.
- Parameters:
stream –[in] JSON source stream
read_options –[in] Options for reading
parse_options –[in] Options for chunking, parsing, and conversion
io_context –[in] Context for IO operations (optional)
cpu_executor –[in] Executor for computation tasks (optional)
- Returns:
The initialized reader
- static Future<std::shared_ptr<StreamingReader>> MakeAsync(std::shared_ptr<io::InputStream> stream, const ReadOptions& read_options, const ParseOptions& parse_options, const io::IOContext& io_context = io::default_io_context(), ::arrow::internal::Executor* cpu_executor = NULLPTR)#
Create a StreamingReader from an InputStream asynchronously.
Returned future completes after loading the first batch.
- Parameters:
stream –[in] JSON source stream
read_options –[in] Options for reading
parse_options –[in] Options for chunking, parsing, and conversion
io_context –[in] Context for IO operations (optional)
cpu_executor –[in] Executor for computation tasks (optional)
- Returns:
Future for the initialized reader
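A sketch of incremental reading with the synchronous Make (the file path is hypothetical); since StreamingReader is a RecordBatchReader, the inherited ReadNext can also be used:

```cpp
#include <arrow/io/api.h>
#include <arrow/json/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>

arrow::Status StreamJsonFile() {
  ARROW_ASSIGN_OR_RAISE(auto input,
                        arrow::io::ReadableFile::Open("data.jsonl"));
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::json::StreamingReader::Make(input,
                                         arrow::json::ReadOptions::Defaults(),
                                         arrow::json::ParseOptions::Defaults()));
  // Drain the stream batch by batch; a null batch signals end of input.
  std::shared_ptr<arrow::RecordBatch> batch;
  while (true) {
    ARROW_RETURN_NOT_OK(reader->ReadNext(&batch));
    if (batch == nullptr) break;
    // ... process `batch` ...
  }
  return arrow::Status::OK();
}
```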
Parquet reader#
- class ReaderProperties#
Public Functions
- inline bool is_buffered_stream_enabled() const#
Buffered stream reading allows the user to control the memory usage of parquet readers.
This ensures that all RandomAccessFile::ReadAt calls are wrapped in a buffered reader that uses a fixed-size buffer (of size buffer_size()) instead of the full size of the ReadAt. The primary reason for this control knob is resource control, not performance.
- inline void enable_buffered_stream()#
Enable buffered stream reading.
- inline void disable_buffered_stream()#
Disable buffered stream reading.
- inline int64_t buffer_size() const#
Return the size of the buffered stream buffer.
- inline void set_buffer_size(int64_t size)#
Set the size of the buffered stream buffer in bytes.
- inline int32_t thrift_string_size_limit() const#
Return the size limit on thrift strings.
This limit helps prevent space and time bombs in files, but may need to be increased in order to read files with especially large headers.
- inline void set_thrift_string_size_limit(int32_t size)#
Set the size limit on thrift strings.
- inline int32_t thrift_container_size_limit() const#
Return the size limit on thrift containers.
This limit helps prevent space and time bombs in files, but may need to be increased in order to read files with especially large headers.
- inline void set_thrift_container_size_limit(int32_t size)#
Set the size limit on thrift containers.
- inline void file_decryption_properties(std::shared_ptr<FileDecryptionProperties> decryption)#
Set the decryption properties.
- inline const std::shared_ptr<FileDecryptionProperties>& file_decryption_properties() const#
Return the decryption properties.
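A sketch of tuning ReaderProperties before opening a file (the buffer and limit values are illustrative):

```cpp
#include <parquet/properties.h>

parquet::ReaderProperties MakeReaderProperties() {
  parquet::ReaderProperties props;  // uses the default memory pool
  // Wrap ReadAt calls in a bounded buffer to cap memory usage.
  props.enable_buffered_stream();
  props.set_buffer_size(1 << 20);  // 1 MiB buffer
  // Raise the thrift limit only if file metadata is unusually large.
  props.set_thrift_string_size_limit(100 * 1000 * 1000);
  return props;
}
```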
- class ArrowReaderProperties#
EXPERIMENTAL: Properties for configuring FileReader behavior.
Public Functions
- inline void set_use_threads(bool use_threads)#
Set whether to use the IO thread pool to parse columns in parallel.
Default is false.
- inline bool use_threads() const#
Return whether multiple threads will be used.
- inline void set_read_dictionary(int column_index, bool read_dict)#
Set whether to read a particular column as dictionary encoded.
If the file metadata contains a serialized Arrow schema, then …
This is only supported for columns with a Parquet physical type of BYTE_ARRAY, such as string or binary types.
- inline bool read_dictionary(int column_index) const#
Return whether the column at the index will be read as dictionary.
- inline void set_batch_size(int64_t batch_size)#
Set the maximum number of rows to read into a record batch.
Will only be fewer rows when there are no more rows in the file. Note that some APIs such as ReadTable may ignore this setting.
- inline int64_t batch_size() const#
Return the batch size in rows.
Note that some APIs such as ReadTable may ignore this setting.
- inline void set_pre_buffer(bool pre_buffer)#
Enable read coalescing (default false).
When enabled, the Arrow reader will pre-buffer necessary regions of the file in-memory. This is intended to improve performance on high-latency filesystems (e.g. Amazon S3).
- inline bool pre_buffer() const#
Return whether read coalescing is enabled.
- inline void set_cache_options(::arrow::io::CacheOptions options)#
Set options for read coalescing.
This can be used to tune the implementation for characteristics of different filesystems.
- inline const ::arrow::io::CacheOptions& cache_options() const#
Return the options for read coalescing.
- inline void set_io_context(const ::arrow::io::IOContext& ctx)#
Set execution context for read coalescing.
- inline const ::arrow::io::IOContext& io_context() const#
Return the execution context used for read coalescing.
- inline void set_coerce_int96_timestamp_unit(::arrow::TimeUnit::type unit)#
Set timestamp unit to use for deprecated INT96-encoded timestamps (default is NANO).
- inline void set_arrow_extensions_enabled(bool extensions_enabled)#
Enable Parquet-supported Arrow extension types.
When enabled, Parquet logical types will be mapped to their corresponding Arrow extension types at read time, if such exist. Currently only the arrow::extension::json() extension type is supported. Columns whose LogicalType is JSON will be interpreted as arrow::extension::json(), with storage type inferred from the serialized Arrow schema if present, or utf8 by default.
- inline void set_should_load_statistics(bool should_load_statistics)#
Set whether to load statistics as much as possible.
Default is false.
- inline bool should_load_statistics() const#
Return whether statistics are loaded as much as possible.
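A sketch of common ArrowReaderProperties settings for reading from high-latency storage (the values are illustrative):

```cpp
#include <parquet/properties.h>

parquet::ArrowReaderProperties MakeArrowReaderProperties() {
  parquet::ArrowReaderProperties props;
  props.set_use_threads(true);      // parse columns in parallel
  props.set_batch_size(64 * 1024);  // rows per record batch
  props.set_pre_buffer(true);       // coalesce reads (useful for S3-like stores)
  props.set_read_dictionary(/*column_index=*/0, true);
  return props;
}
```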
- class ParquetFileReader#
Public Functions
- std::shared_ptr<PageIndexReader> GetPageIndexReader()#
Returns the PageIndexReader.
Only one instance is ever created.
If the file does not have the page index, nullptr may be returned. Because it pays to check for the existence of the page index in the file, a non-null value may be returned even if the page index does not exist. It is the caller’s responsibility to check the return value and follow-up calls to PageIndexReader.
WARNING: The returned PageIndexReader must not outlive the ParquetFileReader. Initialization in GetPageIndexReader() is not thread-safe.
- BloomFilterReader& GetBloomFilterReader()#
Returns the BloomFilterReader.
Only one instance is ever created.
WARNING: The returned BloomFilterReader must not outlive the ParquetFileReader. Initialization in GetBloomFilterReader() is not thread-safe.
- void PreBuffer(const std::vector<int>& row_groups, const std::vector<int>& column_indices, const ::arrow::io::IOContext& ctx, const ::arrow::io::CacheOptions& options)#
Pre-buffer the specified column indices in all row groups.
Readers can optionally call this to cache the necessary slices of the file in-memory before deserialization. Arrow readers can automatically do this via an option. This is intended to increase performance when reading from high-latency filesystems (e.g. Amazon S3).
After calling this, creating readers for row groups/column indices that were not buffered may fail. Creating multiple readers for a subset of the buffered regions is acceptable. This may be called again to buffer a different set of row groups/columns.
If memory usage is a concern, note that data will remain buffered in memory until either PreBuffer() is called again, or the reader itself is destructed. Reading - and buffering - only one row group at a time may be useful.
This method may throw.
- ::arrow::Result<std::vector<::arrow::io::ReadRange>> GetReadRanges(const std::vector<int>& row_groups, const std::vector<int>& column_indices, int64_t hole_size_limit = 1024 * 1024, int64_t range_size_limit = 64 * 1024 * 1024)#
Retrieve the list of byte ranges that would need to be read to retrieve the data for the specified row groups and column indices.
A reader can optionally call this if they wish to handle their own caching and management of file reads (or offload them to other readers). Unlike PreBuffer, this method will not perform any actual caching or reads, instead just using the file metadata to determine the byte ranges that would need to be read if you were to consume the entirety of the column chunks for the provided columns in the specified row groups.
If row_groups or column_indices are empty, then the result of this will be empty.
hole_size_limit represents the maximum distance, in bytes, between two consecutive ranges; beyond this value, ranges will not be combined. The default value is 1 MiB.
range_size_limit is the maximum size in bytes of a combined range; if combining two consecutive ranges would produce a range larger than this, they are not combined. The default value is 64 MiB. This must be larger than hole_size_limit.
This will not take into account page indexes or any other predicate push down benefits that may be available.
- ::arrow::Future<> WhenBuffered(const std::vector<int>& row_groups, const std::vector<int>& column_indices) const#
Wait for the specified row groups and column indices to be pre-buffered.
After the returned Future completes, reading the specified row groups/columns will not block.
PreBuffer must be called first. This method does not throw.
- struct Contents#
- class FileReader#
Arrow read adapter class for deserializing Parquet files as Arrow row batches.
This interface caters for different use cases and thus provides different interfaces. In its most simplistic form, we cater for a user that wants to read the whole Parquet file at once with the FileReader::ReadTable method.
More advanced users that also want to implement parallelism on top of each single Parquet file should do this on the RowGroup level. For this, they can call FileReader::RowGroup(i)->ReadTable to receive only the specified RowGroup as a table.
In the most advanced situation, where a consumer wants to independently read RowGroups in parallel and consume each column individually, they can call FileReader::RowGroup(i)->Column(j)->Read and receive an arrow::Column instance.
Finally, one can also get a stream of record batches using FileReader::GetRecordBatchReader(). This can internally decode columns in parallel if use_threads was enabled in the ArrowReaderProperties.
The parquet format supports an optional integer field_id which can be assigned to a field. Arrow will convert these field IDs to a metadata key named PARQUET:field_id on the appropriate field.
Public Functions
- virtual ::arrow::Status GetSchema(std::shared_ptr<::arrow::Schema>* out) = 0#
Return arrow schema for all the columns.
- virtual ::arrow::Status ReadColumn(int i, std::shared_ptr<::arrow::ChunkedArray>* out) = 0#
Read column as a whole into a chunked array.
The index i refers to the index of the top level schema field, which may be nested or flat, e.g.:
0 foo.bar
    foo.bar.baz
    foo.qux
1 foo2
2 foo3
i=0 will read the entire foo struct, i=1 the foo2 primitive column, etc.
- ::arrow::Status GetRecordBatchReader(std::unique_ptr<::arrow::RecordBatchReader>* out)#
Return a RecordBatchReader of all row groups and columns.
- Deprecated:
Deprecated in 19.0.0. Use the arrow::Result version instead.
- virtual ::arrow::Result<std::unique_ptr<::arrow::RecordBatchReader>> GetRecordBatchReader() = 0#
Return a RecordBatchReader of all row groups and columns.
- ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices, std::unique_ptr<::arrow::RecordBatchReader>* out)#
Return a RecordBatchReader of row groups selected from row_group_indices.
Note that the ordering in row_group_indices matters. FileReaders must outlive their RecordBatchReaders.
- Deprecated:
Deprecated in 19.0.0. Use the arrow::Result version instead.
- Returns:
error Status if row_group_indices contains an invalid index
- virtual ::arrow::Result<std::unique_ptr<::arrow::RecordBatchReader>> GetRecordBatchReader(const std::vector<int>& row_group_indices) = 0#
Return a RecordBatchReader of row groups selected from row_group_indices.
Note that the ordering in row_group_indices matters. FileReaders must outlive their RecordBatchReaders.
- Returns:
error Result if row_group_indices contains an invalid index
- ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices, const std::vector<int>& column_indices, std::unique_ptr<::arrow::RecordBatchReader>* out)#
Return a RecordBatchReader of row groups selected from row_group_indices, whose columns are selected by column_indices.
Note that the ordering in row_group_indices and column_indices matters. FileReaders must outlive their RecordBatchReaders.
- Deprecated:
Deprecated in 19.0.0. Use the arrow::Result version instead.
- Returns:
error Status if either row_group_indices or column_indices contains an invalid index
- virtual ::arrow::Result<std::unique_ptr<::arrow::RecordBatchReader>> GetRecordBatchReader(const std::vector<int>& row_group_indices, const std::vector<int>& column_indices) = 0#
Return a RecordBatchReader of row groups selected from row_group_indices, whose columns are selected by column_indices.
Note that the ordering in row_group_indices and column_indices matters. FileReaders must outlive their RecordBatchReaders.
- Returns:
error Result if either row_group_indices or column_indices contains an invalid index
- ::arrow::Status GetRecordBatchReader(const std::vector<int>& row_group_indices, const std::vector<int>& column_indices, std::shared_ptr<::arrow::RecordBatchReader>* out)#
Return a RecordBatchReader of row groups selected from row_group_indices, whose columns are selected by column_indices.
Note that the ordering in row_group_indices and column_indices matters. FileReaders must outlive their RecordBatchReaders.
- Parameters:
row_group_indices – which row groups to read (order determines read order).
column_indices – which columns to read (order determines output schema).
out –[out] record batch stream from parquet data.
- Returns:
error Status if either row_group_indices or column_indices contains an invalid index
- virtual ::arrow::Result<std::function<::arrow::Future<std::shared_ptr<::arrow::RecordBatch>>()>> GetRecordBatchGenerator(std::shared_ptr<FileReader> reader, const std::vector<int> row_group_indices, const std::vector<int> column_indices, ::arrow::internal::Executor* cpu_executor = NULLPTR, int64_t rows_to_readahead = 0) = 0#
Return a generator of record batches.
The FileReader must outlive the generator, so this requires that you pass in a shared_ptr.
- Returns:
error Result if either row_group_indices or column_indices contains an invalid index
- virtual ::arrow::Status ReadTable(std::shared_ptr<::arrow::Table>* out) = 0#
Read all columns into a Table.
- virtual ::arrow::Status ReadTable(const std::vector<int>& column_indices, std::shared_ptr<::arrow::Table>* out) = 0#
Read the given columns into a Table.
The indicated column indices are relative to the internal representation of the parquet table. For instance:
0 foo.bar
    foo.bar.baz   0
    foo.bar.baz2  1
  foo.qux         2
1 foo2            3
2 foo3            4
i=0 will read foo.bar.baz, i=1 will read only foo.bar.baz2 and so on. Only leaf fields have indices; foo itself doesn’t have an index. To get the index for a particular leaf field, one can use manifest().schema_fields to get the top level fields, and then walk the tree to identify the relevant leaf fields and access its column_index. To get the total number of leaf fields, use FileMetadata.num_columns().
- virtual ::arrow::Status ScanContents(std::vector<int> columns, const int32_t column_batch_size, int64_t* num_rows) = 0#
Scan file contents with one thread, return number of rows.
- virtual std::shared_ptr<RowGroupReader> RowGroup(int row_group_index) = 0#
Return a reader for the RowGroup; this object must not outlive the FileReader.
- virtual int num_row_groups() const = 0#
The number of row groups in the file.
- virtual void set_use_threads(bool use_threads) = 0#
Set whether to use multiple threads during reads of multiple columns.
By default only one thread is used.
- virtual void set_batch_size(int64_t batch_size) = 0#
Set number of records to read per batch for the RecordBatchReader.
Set number of records to read per batch for the RecordBatchReader.
Public Static Functions
- static ::arrow::Status Make(::arrow::MemoryPool* pool, std::unique_ptr<ParquetFileReader> reader, const ArrowReaderProperties& properties, std::unique_ptr<FileReader>* out)#
Factory function to create a FileReader from a ParquetFileReader and properties.
- static ::arrow::Status Make(::arrow::MemoryPool* pool, std::unique_ptr<ParquetFileReader> reader, std::unique_ptr<FileReader>* out)#
Factory function to create a FileReader from a ParquetFileReader.
- class FileReaderBuilder#
Experimental helper class for bindings (like Python) that struggle either with std::move or C++ exceptions.
Public Functions
- ::arrow::Status Open(std::shared_ptr<::arrow::io::RandomAccessFile> file, const ReaderProperties& properties = default_reader_properties(), std::shared_ptr<FileMetaData> metadata = NULLPTR)#
Create FileReaderBuilder from Arrow file and optional properties / metadata.
- ::arrow::Status OpenFile(const std::string& path, bool memory_map = false, const ReaderProperties& props = default_reader_properties(), std::shared_ptr<FileMetaData> metadata = NULLPTR)#
Create FileReaderBuilder from file path and optional properties / metadata.
- FileReaderBuilder* memory_pool(::arrow::MemoryPool* pool)#
Set Arrow MemoryPool for memory allocation.
- FileReaderBuilder* properties(const ArrowReaderProperties& arg_properties)#
Set Arrow reader properties.
- ::arrow::Status Build(std::unique_ptr<FileReader>* out)#
Build FileReader instance.
- ::arrow::Status OpenFile(std::shared_ptr<::arrow::io::RandomAccessFile>, ::arrow::MemoryPool* allocator, std::unique_ptr<FileReader>* reader)#
Build FileReader from Arrow file and MemoryPool.
Advanced settings are supported through the FileReaderBuilder class.
- Deprecated:
Deprecated in 19.0.0. Use the arrow::Result version instead.
- ::arrow::Result<std::unique_ptr<FileReader>> OpenFile(std::shared_ptr<::arrow::io::RandomAccessFile>, ::arrow::MemoryPool* allocator)#
Build FileReader from Arrow file and MemoryPool.
Advanced settings are supported through the FileReaderBuilder class.
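A sketch that ties these pieces together: open a file with FileReaderBuilder, apply ArrowReaderProperties, and read it as a Table (the path is hypothetical):

```cpp
#include <arrow/io/api.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/reader.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadParquetFile() {
  ARROW_ASSIGN_OR_RAISE(auto file,
                        arrow::io::ReadableFile::Open("data.parquet"));
  parquet::ArrowReaderProperties arrow_props;
  arrow_props.set_use_threads(true);

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(file));
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.memory_pool(arrow::default_memory_pool())
                          ->properties(arrow_props)
                          ->Build(&reader));

  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
  return table;
}
```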
- class StreamReader#
A class for reading Parquet files using an input stream type API.
The values given must be of the correct type, i.e. the type must match the file schema exactly, otherwise a ParquetException will be thrown.
The user must explicitly advance to the next row using the EndRow() function or the EndRow input manipulator.
Required and optional fields are supported:
Required fields are read using operator>>(T)
Optional fields are read with operator>>(std::optional<T>)
Note that operator>>(std::optional<T>) can be used to read required fields.
Similarly operator>>(T) can be used to read optional fields. However, if the value is not present then a ParquetException will be raised.
Currently there is no support for repeated fields.
Public Functions
- void EndRow()#
Terminate current row and advance to next one.
- Throws:
ParquetException – if all columns in the row were not read or skipped.
- int64_t SkipColumns(int64_t num_columns_to_skip)#
Skip the data in the next columns.
If the number of columns exceeds the columns remaining on the current row then skipping is terminated - it does not continue skipping columns on the next row. Skipping of columns still requires the use of ‘EndRow’ even if all remaining columns were skipped.
- Returns:
Number of columns actually skipped.
- int64_t SkipRows(int64_t num_rows_to_skip)#
Skip the data in the next rows.
Skipping of rows is not allowed if reading of data for the current row is not finished. Skipping of rows will be terminated if the end of file is reached.
- Returns:
Number of rows actually skipped.
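A sketch of row-wise reading with StreamReader. The file path and the column types (int32, string) are hypothetical and must match the actual file schema:

```cpp
#include <iostream>
#include <memory>
#include <string>
#include <arrow/io/api.h>
#include <parquet/exception.h>
#include <parquet/stream_reader.h>

void ReadRows() {
  std::shared_ptr<arrow::io::ReadableFile> infile;
  PARQUET_ASSIGN_OR_THROW(infile,
                          arrow::io::ReadableFile::Open("data.parquet"));

  parquet::StreamReader stream{parquet::ParquetFileReader::Open(infile)};

  // Columns here are assumed to be (int32, string); adapt to the real schema.
  int32_t id;
  std::string name;
  while (!stream.eof()) {
    stream >> id >> name >> parquet::EndRow;
    std::cout << id << " " << name << "\n";
  }
}
```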
Parquet writer#
- class WriterProperties#
- class Builder#
Public Functions
- inline Builder* memory_pool(MemoryPool* pool)#
Specify the memory pool for the writer. Default default_memory_pool.
- inline Builder* enable_dictionary()#
Enable dictionary encoding in general for all columns.
Default enabled.
- inline Builder* disable_dictionary()#
Disable dictionary encoding in general for all columns.
Default enabled.
- inline Builder* enable_dictionary(const std::string& path)#
Enable dictionary encoding for the column specified by path.
Default enabled.
- inline Builder* enable_dictionary(const std::shared_ptr<schema::ColumnPath>& path)#
Enable dictionary encoding for the column specified by path.
Default enabled.
- inline Builder* disable_dictionary(const std::string& path)#
Disable dictionary encoding for the column specified by path.
Default enabled.
- inline Builder* disable_dictionary(const std::shared_ptr<schema::ColumnPath>& path)#
Disable dictionary encoding for the column specified by path.
Default enabled.
- inline Builder* dictionary_pagesize_limit(int64_t dictionary_psize_limit)#
Specify the dictionary page size limit per row group. Default 1MB.
- inline Builder* write_batch_size(int64_t write_batch_size)#
Specify the write batch size while writing batches of Arrow values into Parquet.
Default 1024.
- inline Builder* max_row_group_length(int64_t max_row_group_length)#
Specify the max number of rows to put in a single row group.
Default 1Mi rows.
- inline Builder* data_page_version(ParquetDataPageVersion data_page_version)#
Specify the data page version.
Default V1.
- inline Builder* version(ParquetVersion::type version)#
Specify the Parquet file version.
Default PARQUET_2_6.
- inline Builder* encoding(Encoding::type encoding_type)#
Define the encoding that is used when we don’t utilise dictionary encoding.
This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.
- inline Builder* encoding(const std::string& path, Encoding::type encoding_type)#
Define the encoding that is used when we don’t utilise dictionary encoding.
This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.
- inline Builder* encoding(const std::shared_ptr<schema::ColumnPath>& path, Encoding::type encoding_type)#
Define the encoding that is used when we don’t utilise dictionary encoding.
This applies either if dictionary encoding is disabled or if we fall back because the dictionary grew too large.
- inline Builder* compression(Compression::type codec)#
Specify compression codec in general for all columns.
Default UNCOMPRESSED.
- inline Builder* max_statistics_size(size_t max_stats_sz)#
Specify max statistics size to store min max value.
Default 4KB.
- inline Builder* compression(const std::string& path, Compression::type codec)#
Specify compression codec for the column specified by path.
Default UNCOMPRESSED.
- inline Builder* compression(const std::shared_ptr<schema::ColumnPath>& path, Compression::type codec)#
Specify compression codec for the column specified by path.
Default UNCOMPRESSED.
- inline Builder* compression_level(int compression_level)#
Specify the default compression level for the compressor in every column.
In case a column does not have an explicitly specified compression level, the default one would be used.
The provided compression level is compressor specific. The user would have to familiarize oneself with the available levels for the selected compressor. If the compressor does not allow for selecting different compression levels, calling this function would not have any effect. Parquet and Arrow do not validate the passed compression level. If no level is selected by the user or if the special std::numeric_limits<int>::min() value is passed, then Arrow selects the compression level.
If other compressor-specific options need to be set in addition to the compression level, use the codec_options method.
- inline Builder* compression_level(const std::string& path, int compression_level)#
Specify a compression level for the compressor for the column described by path.
The provided compression level is compressor specific. The user would have to familiarize oneself with the available levels for the selected compressor. If the compressor does not allow for selecting different compression levels, calling this function would not have any effect. Parquet and Arrow do not validate the passed compression level. If no level is selected by the user or if the special std::numeric_limits<int>::min() value is passed, then Arrow selects the compression level.
- inline Builder* compression_level(const std::shared_ptr<schema::ColumnPath>& path, int compression_level)#
Specify a compression level for the compressor for the column described by path.
The provided compression level is compressor specific. The user would have to familiarize oneself with the available levels for the selected compressor. If the compressor does not allow for selecting different compression levels, calling this function would not have any effect. Parquet and Arrow do not validate the passed compression level. If no level is selected by the user or if the special std::numeric_limits<int>::min() value is passed, then Arrow selects the compression level.
- inline Builder* codec_options(const std::shared_ptr<::arrow::util::CodecOptions>& codec_options)#
Specify the default codec options for the compressor in every column.
The codec options allow configuring the compression level as well as other codec-specific options.
- inline Builder* codec_options(const std::string& path, const std::shared_ptr<::arrow::util::CodecOptions>& codec_options)#
Specify the codec options for the compressor for the column described by path.
- inline Builder* codec_options(const std::shared_ptr<schema::ColumnPath>& path, const std::shared_ptr<::arrow::util::CodecOptions>& codec_options)#
Specify the codec options for the compressor for the column described by path.
- inline Builder* encryption(std::shared_ptr<FileEncryptionProperties> file_encryption_properties)#
Define the file encryption properties.
Default NULL.
- inline Builder* enable_statistics(const std::string& path)#
Enable statistics for the column specified by path.
Default enabled.
- inline Builder* enable_statistics(const std::shared_ptr<schema::ColumnPath>& path)#
Enable statistics for the column specified by path.
Default enabled.
- inline Builder* set_sorting_columns(std::vector<SortingColumn> sorting_columns)#
Define the sorting columns.
Default empty.
If sorting columns are set, the user should ensure that records are sorted by the sorting columns. Otherwise, the stored data will be inconsistent with the sorting_columns metadata.
- inline Builder* disable_statistics(const std::string& path)#
Disable statistics for the column specified by path.
Default enabled.
- inline Builder* disable_statistics(const std::shared_ptr<schema::ColumnPath>& path)#
Disable statistics for the column specified by path.
Default enabled.
- inline Builder* enable_store_decimal_as_integer()#
Allow decimals with 1 <= precision <= 18 to be stored as integers.
In Parquet, DECIMAL can be stored in any of the following physical types:
int32: for 1 <= precision <= 9.
int64: for 10 <= precision <= 18.
fixed_len_byte_array: precision is limited by the array size. Length n can store <= floor(log_10(2^(8*n - 1) - 1)) base-10 digits.
binary: precision is unlimited. The minimum number of bytes to store the unscaled value is used.
By default, this is DISABLED and all decimal types annotate fixed_len_byte_array.
When enabled, the C++ writer will use following physical types to store decimals:
int32: for 1 <= precision <= 9.
int64: for 10 <= precision <= 18.
fixed_len_byte_array: for precision > 18.
As a consequence, decimal columns stored in integer types are more compact.
- inline Builder* disable_store_decimal_as_integer()#
Disable storing the decimal logical type with 1 <= precision <= 18 as an integer physical type.
Default disabled.
- inline Builder* enable_write_page_index()#
Enable writing page index in general for all columns.
Default disabled.
Writing statistics to the page index disables the old method of writing statistics to each data page header. The page index makes filtering more efficient than the page header, as it gathers all the statistics for a Parquet file in a single place, avoiding scattered I/O.
Please check the link below for more details: apache/parquet-format
- inline Builder* disable_write_page_index()#
Disable writing page index in general for all columns. Default disabled.
- inline Builder* enable_write_page_index(const std::string& path)#
Enable writing page index for the column specified by path. Default disabled.
- inline Builder* enable_write_page_index(const std::shared_ptr<schema::ColumnPath>& path)#
Enable writing page index for the column specified by path. Default disabled.
- inline Builder* disable_write_page_index(const std::string& path)#
Disable writing page index for the column specified by path. Default disabled.
- inline Builder* disable_write_page_index(const std::shared_ptr<schema::ColumnPath>& path)#
Disable writing page index for the column specified by path. Default disabled.
- inline Builder* set_size_statistics_level(SizeStatisticsLevel level)#
Set the level to write size statistics for all columns.
Default is None.
- Parameters:
level – The level to write size statistics. Note that if page index is not enabled, page level size statistics will not be written even if the level is set to PageAndColumnChunk.
- inline std::shared_ptr<WriterProperties> build()#
Build the WriterProperties with the builder parameters.
- Returns:
The WriterProperties defined by the builder.
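A sketch of assembling WriterProperties with the builder (the column path and the particular choices are illustrative):

```cpp
#include <memory>
#include <parquet/properties.h>

std::shared_ptr<parquet::WriterProperties> MakeWriterProperties() {
  parquet::WriterProperties::Builder builder;
  builder.compression(parquet::Compression::SNAPPY)
      ->version(parquet::ParquetVersion::PARQUET_2_6)
      ->max_row_group_length(1024 * 1024)
      // Per-column override: "values" is a hypothetical column path.
      ->disable_dictionary("values")
      ->enable_write_page_index();
  return builder.build();
}
```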
- class ArrowWriterProperties#
Public Functions
- inline bool compliant_nested_types() const#
Enable nested type naming according to the parquet specification.
Older versions of arrow wrote out field names for nested lists based on the name of the field. According to the parquet specification they should always be “element”.
- inline EngineVersion engine_version() const#
The underlying engine version to use when writing Arrow data.
V2 is currently the latest; V1 is considered deprecated but left in place in case there are bugs detected in V2.
- inline bool use_threads() const#
Returns whether the writer will use multiple threads to write columns in parallel in the buffered row group mode.
- ::arrow::internal::Executor* executor() const#
Returns the executor used to write columns in parallel.
- class Builder#
Public Functions
- inline Builder* disable_deprecated_int96_timestamps()#
Disable writing legacy int96 timestamps (default disabled).
- inline Builder* enable_deprecated_int96_timestamps()#
Enable writing legacy int96 timestamps (default disabled).
May be turned on to write timestamps compatible with older Parquet writers. This takes precedence over coerce_timestamps.
- inline Builder* coerce_timestamps(::arrow::TimeUnit::type unit)#
Coerce all timestamps to the specified time unit.
- Parameters:
unit – time unit to truncate to. For Parquet versions 1.0 and 2.4, nanoseconds are cast to microseconds.
- inline Builder* allow_truncated_timestamps()#
Allow loss of data when truncating timestamps.
This is disallowed by default and an error will be returned.
- inline Builder* disallow_truncated_timestamps()#
Disallow loss of data when truncating timestamps (default).
- inline Builder* store_schema()#
EXPERIMENTAL: Write binary serialized Arrow schema to the file, to enable certain read options (like “read_dictionary”) to be set automatically.
- inline Builder* enable_compliant_nested_types()#
When enabled, will not preserve Arrow field names for list types.
Instead of using the field names Arrow uses for the values array of list types (default “item”), will use “element”, as is specified in the Parquet spec.
This is enabled by default.
- inline Builder* set_engine_version(EngineVersion version)#
Set the version of the Parquet writer engine.
- inline Builder* set_use_threads(bool use_threads)#
Set whether to use multiple threads to write columns in parallel in the buffered row group mode.
WARNING: If writing multiple files in parallel in the same executor, deadlock may occur if use_threads is true. Please disable it in this case.
Default is false.
- inline Builder* set_executor(::arrow::internal::Executor* executor)#
Set the executor to write columns in parallel in the buffered row group mode.
Default is nullptr and the default cpu executor will be used.
- inline std::shared_ptr<ArrowWriterProperties> build()#
Create the final properties.
- class FileWriter#
Iterative FileWriter class.
For basic usage, can write a Table at a time, creating one or more row groups per write call.
For advanced usage, can write column-by-column: Start a new RowGroup or Chunk with NewRowGroup, then write column-by-column the whole column chunk.
If PARQUET:field_id is present as a metadata key on a field, and the corresponding value is a nonnegative integer, then it will be used as the field_id in the parquet file.
Public Functions
- ::arrow::Result<std::unique_ptr<FileWriter>> Open(const ::arrow::Schema& schema, MemoryPool* pool, std::shared_ptr<::arrow::io::OutputStream> sink, std::shared_ptr<WriterProperties> properties = default_writer_properties(), std::shared_ptr<ArrowWriterProperties> arrow_properties = default_arrow_writer_properties())#
Try to create an Arrow to Parquet file writer.
- Since
11.0.0
- Parameters:
schema – schema of data that will be passed.
pool – memory pool to use.
sink – output stream to write Parquet data.
properties – general Parquet writer properties.
arrow_properties – Arrow-specific writer properties.
- virtual ::arrow::Status WriteTable(const ::arrow::Table& table, int64_t chunk_size = DEFAULT_MAX_ROW_GROUP_LENGTH) = 0#
Write a Table to Parquet.
- Parameters:
table – Arrow table to write.
chunk_size – maximum number of rows to write per row group.
- virtual ::arrow::Status NewRowGroup() = 0#
Start a new row group.
Returns an error if not all columns have been written.
- inline ::arrow::Status NewRowGroup(int64_t chunk_size)#
Start a new row group.
- Deprecated:
Deprecated in 19.0.0.
- virtual ::arrow::Status WriteColumnChunk(const ::arrow::Array& data) = 0#
Write ColumnChunk in row group using an array.
- virtual ::arrow::Status WriteColumnChunk(const std::shared_ptr<::arrow::ChunkedArray>& data, int64_t offset, int64_t size) = 0#
Write ColumnChunk in row group using a slice of a ChunkedArray.
- virtual ::arrow::Status WriteColumnChunk(const std::shared_ptr<::arrow::ChunkedArray>& data) = 0#
Write ColumnChunk in a row group using a ChunkedArray.
- virtual ::arrow::Status NewBufferedRowGroup() = 0#
Start a new buffered row group.
Returns an error if not all columns have been written.
- virtual ::arrow::Status WriteRecordBatch(const ::arrow::RecordBatch& batch) = 0#
Write a RecordBatch into the buffered row group.
Multiple RecordBatches can be written into the same row group through this method.
WriterProperties.max_row_group_length() is respected and a new row group will be created if the current row group exceeds the limit.
Batches get flushed to the output stream once NewBufferedRowGroup() or Close() is called.
WARNING: If you are writing multiple files in parallel in the same executor, deadlock may occur if ArrowWriterProperties::use_threads is set to true to write columns in parallel. Please disable the use_threads option in this case.
- virtual ::arrow::Status AddKeyValueMetadata(const std::shared_ptr<const ::arrow::KeyValueMetadata>& key_value_metadata) = 0#
Add key-value metadata to the file.
WARNING: If store_schema is enabled, ARROW:schema would be stored in the key-value metadata. Overwriting this key would result in store_schema being unusable during read.
Note
This will overwrite any existing metadata with the same key.
This will overwrite any existing metadata with the same key.
- Parameters:
key_value_metadata –[in] the metadata to add.
- Returns:
Error ifClose() has been called.
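A sketch of iterative writing with FileWriter, using a buffered row group (the output path is hypothetical):

```cpp
#include <memory>
#include <vector>
#include <arrow/io/api.h>
#include <arrow/record_batch.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteBatches(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::vector<std::shared_ptr<arrow::RecordBatch>>& batches) {
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("out.parquet"));
  ARROW_ASSIGN_OR_RAISE(
      auto writer,
      parquet::arrow::FileWriter::Open(*schema, arrow::default_memory_pool(),
                                       sink));
  ARROW_RETURN_NOT_OK(writer->NewBufferedRowGroup());
  for (const auto& batch : batches) {
    // Row groups are split automatically once max_row_group_length is reached.
    ARROW_RETURN_NOT_OK(writer->WriteRecordBatch(*batch));
  }
  return writer->Close();
}
```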
- ::arrow::Status parquet::arrow::WriteTable(const ::arrow::Table& table, MemoryPool* pool, std::shared_ptr<::arrow::io::OutputStream> sink, int64_t chunk_size = DEFAULT_MAX_ROW_GROUP_LENGTH, std::shared_ptr<WriterProperties> properties = default_writer_properties(), std::shared_ptr<ArrowWriterProperties> arrow_properties = default_arrow_writer_properties())#
Write a Table to Parquet.
This writes one table in a single shot. To write a Parquet file with multiple tables iteratively, see parquet::arrow::FileWriter.
- Parameters:
table – Table to write.
pool – memory pool to use.
sink – output stream to write Parquet data.
chunk_size – maximum number of rows to write per row group.
properties – general Parquet writer properties.
arrow_properties – Arrow-specific writer properties.
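A one-shot write sketch (the path is hypothetical; chunk_size controls the row group size):

```cpp
#include <arrow/io/api.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteWholeTable(const std::shared_ptr<arrow::Table>& table) {
  ARROW_ASSIGN_OR_RAISE(auto sink,
                        arrow::io::FileOutputStream::Open("table.parquet"));
  // Write the entire table, creating a row group per ~64k rows.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(), sink,
                                    /*chunk_size=*/64 * 1024);
}
```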
- class StreamWriter#
A class for writing Parquet files using an output stream type API.
The values given must be of the correct type, i.e. the type must match the file schema exactly, otherwise a ParquetException will be thrown.
The user must explicitly indicate the end of the row using the EndRow() function or the EndRow output manipulator.
A maximum row group size can be configured; the default size is 512MB. Alternatively the row group size can be set to zero and the user can create new row groups by calling the EndRowGroup() function or using the EndRowGroup output manipulator.
Required and optional fields are supported:
Required fields are written using operator<<(T)
Optional fields are written using operator<<(std::optional<T>).
Note that operator<<(T) can be used to write optional fields.
Similarly, operator<<(std::optional<T>) can be used to write required fields. However, if the optional parameter does not have a value (i.e. it is nullopt) then a ParquetException will be raised.
Currently there is no support for repeated fields.
Public Functions
- StreamWriter& operator<<(bool v)#
Output operators for required fields.
These can also be used for optional fields when a value must be set.
- template<int N>
inline StreamWriter& operator<<(const char (&v)[N])#
Output operators for fixed length strings.
- StreamWriter& operator<<(const char* v)#
Output operators for variable length strings.
- template<typename T>
inline StreamWriter& operator<<(const optional<T>& v)#
Output operator for optional fields.
- int64_t SkipColumns(int num_columns_to_skip)#
Skip the next N columns of optional data.
If there are fewer than N columns remaining then the excess columns are ignored.
- Throws:
ParquetException – if there is an attempt to skip any required column.
- Returns:
Number of columns actually skipped.
- void EndRow()#
Terminate the current row and advance to next one.
- Throws:
ParquetException – if all columns in the row were not written or skipped.
- void EndRowGroup()#
Terminate the current row group and create new one.
- struct FixedStringView#
Helper class to write fixed length strings.
This is useful as the standard string view (such as std::string_view) is for variable length data.
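A sketch of row-wise writing with StreamWriter. The two-column schema ("id", "name") and the output path are hypothetical:

```cpp
#include <memory>
#include <arrow/io/api.h>
#include <parquet/exception.h>
#include <parquet/schema.h>
#include <parquet/stream_writer.h>

void WriteRows() {
  std::shared_ptr<arrow::io::FileOutputStream> outfile;
  PARQUET_ASSIGN_OR_THROW(outfile,
                          arrow::io::FileOutputStream::Open("rows.parquet"));

  // Define a two-column Parquet schema (hypothetical columns).
  parquet::schema::NodeVector fields;
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "id", parquet::Repetition::REQUIRED, parquet::Type::INT32,
      parquet::ConvertedType::INT_32));
  fields.push_back(parquet::schema::PrimitiveNode::Make(
      "name", parquet::Repetition::OPTIONAL, parquet::Type::BYTE_ARRAY,
      parquet::ConvertedType::UTF8));
  auto schema = std::static_pointer_cast<parquet::schema::GroupNode>(
      parquet::schema::GroupNode::Make("schema", parquet::Repetition::REQUIRED,
                                       fields));

  parquet::StreamWriter os{parquet::ParquetFileWriter::Open(outfile, schema)};
  // Optional fields can also be written via std::optional<T>.
  os << int32_t(1) << "alice" << parquet::EndRow;
  os << int32_t(2) << "bob" << parquet::EndRow;
  // The destructor of StreamWriter finalizes the file.
}
```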
ORC#
- class ORCFileReader#
Read an Arrow Table or RecordBatch from an ORC file.
Public Functions
- Result<std::shared_ptr<Schema>> ReadSchema()#
Return the schema read from the ORC file.
- Returns:
the returned Schema object
- Result<std::shared_ptr<Table>> Read()#
Read the file as a Table.
The table will be composed of one record batch per stripe.
- Returns:
the returned Table
- Result<std::shared_ptr<Table>> Read(const std::shared_ptr<Schema>& schema)#
Read the file as a Table.
The table will be composed of one record batch per stripe.
- Result<std::shared_ptr<Table>> Read(const std::vector<int>& include_indices)#
Read the file as a Table.
The table will be composed of one record batch per stripe.
- Parameters:
include_indices –[in] the selected field indices to read
- Returns:
the returned Table
- Result<std::shared_ptr<Table>> Read(const std::vector<std::string>& include_names)#
Read the file as a Table.
The table will be composed of one record batch per stripe.
- Parameters:
include_names –[in] the selected field names to read
- Returns:
the returned Table
- Result<std::shared_ptr<Table>> Read(const std::shared_ptr<Schema>& schema, const std::vector<int>& include_indices)#
Read the file as a Table.
The table will be composed of one record batch per stripe.
- Result<std::shared_ptr<RecordBatch>> ReadStripe(int64_t stripe)#
Read a single stripe as a RecordBatch.
- Parameters:
stripe –[in] the stripe index
- Returns:
the returned RecordBatch
- Result<std::shared_ptr<RecordBatch>> ReadStripe(int64_t stripe, const std::vector<int>& include_indices)#
Read a single stripe as a RecordBatch.
- Parameters:
stripe –[in] the stripe index
include_indices –[in] the selected field indices to read
- Returns:
the returned RecordBatch
- Result<std::shared_ptr<RecordBatch>> ReadStripe(int64_t stripe, const std::vector<std::string>& include_names)#
Read a single stripe as a RecordBatch.
- Parameters:
stripe –[in] the stripe index
include_names –[in] the selected field names to read
- Returns:
the returned RecordBatch
- Status Seek(int64_t row_number)#
Seek to the designated row.
Invoking NextStripeReader() after a seek will return a stripe reader starting from the designated row.
- Parameters:
row_number –[in] the row number to seek to
- Result<std::shared_ptr<RecordBatchReader>> NextStripeReader(int64_t batch_size)#
Get a stripe level record batch iterator.
Each record batch will have up to batch_size rows. NextStripeReader serves as a fine-grained alternative to ReadStripe, which may cause OOM issues by loading the whole stripe into memory.
Note this will only read rows for the current stripe, not the entire file.
- Parameters:
batch_size –[in] the maximum number of rows in each record batch
- Returns:
the returned stripe reader
- Result<std::shared_ptr<RecordBatchReader>> NextStripeReader(int64_t batch_size, const std::vector<int>& include_indices)#
Get a stripe level record batch iterator.
Each record batch will have up to batch_size rows. NextStripeReader serves as a fine-grained alternative to ReadStripe, which may cause OOM issues by loading the whole stripe into memory.
Note this will only read rows for the current stripe, not the entire file.
- Parameters:
batch_size –[in] the maximum number of rows in each record batch
include_indices –[in] the selected field indices to read
- Returns:
the stripe reader
- Result<std::shared_ptr<RecordBatchReader>> GetRecordBatchReader(int64_t batch_size, const std::vector<std::string>& include_names)#
Get a record batch iterator for the entire file.
Each record batch will have up to batch_size rows.
- Parameters:
batch_size –[in] the maximum number of rows in each record batch
include_names –[in] the selected field names to read, if not empty (otherwise all fields are read)
- Returns:
the record batch iterator
- int64_t NumberOfStripes()#
The number of stripes in the file.
- int64_t NumberOfRows()#
The number of rows in the file.
- StripeInformation GetStripeInformation(int64_t stripe)#
StripeInformation for each stripe.
- FileVersion GetFileVersion()#
Get the format version of the file.
Currently known values are 0.11 and 0.12.
- Returns:
The FileVersion of the ORC file.
- std::string GetSoftwareVersion()#
Get the software instance and version that wrote this file.
- Returns:
a user-facing string that specifies the software version
- Result<Compression::type> GetCompression()#
Get the compression kind of the file.
- Returns:
The kind of compression in the ORC file.
- int64_t GetCompressionSize()#
Get the buffer size for the compression.
- Returns:
Number of bytes to buffer for the compression codec.
- int64_t GetRowIndexStride()#
Get the number of rows per entry in the row index.
- Returns:
the number of rows per entry in the row index, or 0 if there is no row index.
- WriterId GetWriterId()#
Get the ID of the writer that generated the file.
- Returns:
UNKNOWN_WRITER if the writer ID is undefined
- int32_t GetWriterIdValue()#
Get the writer id value when getWriterId() returns an unknown writer.
- Returns:
the integer value of the writer ID.
- WriterVersion GetWriterVersion()#
Get the version of the writer.
- Returns:
the version of the writer.
- int64_t GetNumberOfStripeStatistics()#
Get the number of stripe statistics in the file.
- Returns:
the number of stripe statistics
- int64_t GetContentLength()#
Get the length of the data stripes in the file.
- Returns:
the number of bytes in the stripes
- int64_t GetStripeStatisticsLength()#
Get the length of the file stripe statistics.
- Returns:
the number of compressed bytes in the file stripe statistics
- int64_t GetFileFooterLength()#
Get the length of the file footer.
- Returns:
the number of compressed bytes in the file footer
- int64_t GetFilePostscriptLength()#
Get the length of the file postscript.
- Returns:
the number of bytes in the file postscript
- int64_t GetFileLength()#
Get the total length of the file.
- Returns:
the number of bytes in the file
- std::string GetSerializedFileTail()#
Get the serialized file tail.
Useful if another reader of the same file wants to avoid re-reading the file tail. See ReadOptions.SetSerializedFileTail().
- Returns:
a string of bytes with the file tail
- Result<std::shared_ptr<const KeyValueMetadata>> ReadMetadata()#
Return the metadata read from the ORC file.
- Returns:
A KeyValueMetadata object containing the ORC metadata
Public Static Functions
- static Result<std::unique_ptr<ORCFileReader>> Open(const std::shared_ptr<io::RandomAccessFile>& file, MemoryPool* pool)#
Creates a new ORC reader.
- Parameters:
file –[in] the data source
pool –[in] a MemoryPool to use for buffer allocations
- Returns:
the returned reader object
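A sketch of reading an ORC file into a Table (the path is hypothetical):

```cpp
#include <arrow/adapters/orc/adapter.h>
#include <arrow/io/api.h>
#include <arrow/memory_pool.h>
#include <arrow/result.h>
#include <arrow/table.h>

arrow::Result<std::shared_ptr<arrow::Table>> ReadOrcFile() {
  ARROW_ASSIGN_OR_RAISE(auto file,
                        arrow::io::ReadableFile::Open("data.orc"));
  ARROW_ASSIGN_OR_RAISE(
      auto reader,
      arrow::adapters::orc::ORCFileReader::Open(file,
                                                arrow::default_memory_pool()));
  // One record batch per stripe is assembled into the result table.
  return reader->Read();
}
```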
- struct WriteOptions#
Options for the ORC Writer.
Public Members
- int64_t batch_size = 1024#
Number of rows the ORC writer writes at a time, default 1024.
- FileVersion file_version = FileVersion(0, 12)#
Which ORC file version to use, default FileVersion(0, 12).
- int64_t stripe_size = 64 * 1024 * 1024#
Size of each ORC stripe in bytes, default 64 MiB.
- Compression::type compression = Compression::UNCOMPRESSED#
The compression codec of the ORC file; there is no compression by default.
- int64_t compression_block_size = 64 * 1024#
The size of each compression block in bytes, default 64 KiB.
- CompressionStrategy compression_strategy = CompressionStrategy::kSpeed#
The compression strategy, i.e. speed vs size reduction, default CompressionStrategy::kSpeed.
- int64_t row_index_stride = 10000#
The number of rows per entry in the row index, default 10000.
- double padding_tolerance = 0.0#
The padding tolerance, default 0.0.
- double dictionary_key_size_threshold = 0.0#
The dictionary key size threshold.
0 to disable dictionary encoding. 1 to always enable dictionary encoding, default 0.0.
- std::vector<int64_t> bloom_filter_columns#
The array of columns that use the bloom filter, default empty.
- double bloom_filter_fpp = 0.05#
The upper limit of the false-positive rate of the bloom filter, default 0.05.
- class ORCFileWriter#
Write an Arrow Table or RecordBatch to an ORC file.
Public Functions
- Status Write(const Table& table)#
Write a table.
This can be called multiple times.
Tables passed in subsequent calls must match the schema of the table that was written first.
- Parameters:
table –[in] the Arrow table from which data is extracted.
- Returns:
- Status Write(const RecordBatch& record_batch)#
Write a RecordBatch.
This can be called multiple times.
RecordBatches passed in subsequent calls must match the schema of the RecordBatch that was written first.
- Parameters:
record_batch –[in] the Arrow RecordBatch from which data is extracted.
- Returns:
Public Static Functions
- static Result<std::unique_ptr<ORCFileWriter>> Open(io::OutputStream* output_stream, const WriteOptions& write_options = WriteOptions())#
Creates a new ORC writer.
- Parameters:
output_stream –[in] a pointer to the io::OutputStream to write into
write_options –[in] the ORC writer options for Arrow
- Returns:
the returned writer object
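A sketch of writing a Table to ORC (the path and compression choice are illustrative; ORCFileWriter::Close() is assumed to finalize the file):

```cpp
#include <arrow/adapters/orc/adapter.h>
#include <arrow/io/api.h>
#include <arrow/result.h>
#include <arrow/status.h>
#include <arrow/table.h>

arrow::Status WriteOrcFile(const arrow::Table& table) {
  ARROW_ASSIGN_OR_RAISE(auto output,
                        arrow::io::FileOutputStream::Open("out.orc"));
  arrow::adapters::orc::WriteOptions options;
  options.compression = arrow::Compression::ZSTD;
  ARROW_ASSIGN_OR_RAISE(
      auto writer,
      arrow::adapters::orc::ORCFileWriter::Open(output.get(), options));
  ARROW_RETURN_NOT_OK(writer->Write(table));
  return writer->Close();
}
```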