Reading and Writing ORC files#

The Apache ORC project provides a standardized open-source columnar storage format for use in data analysis systems. It was originally created for use in Apache Hadoop, and systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark have adopted it as a shared standard for high-performance data IO.

Apache Arrow is an ideal in-memory representation layer for data that is being read or written with ORC files.

Supported ORC features#

The ORC format has many features, and we support a subset of them.

Data types#

Here is a list of ORC types and the Arrow types they map to.

Logical type       Mapped Arrow type                   Notes
-----------------  ----------------------------------  -------
BOOLEAN            Boolean
BYTE               Int8
SHORT              Int16
INT                Int32
LONG               Int64
FLOAT              Float32
DOUBLE             Float64
STRING             String/LargeString                  (1)
BINARY             Binary/LargeBinary/FixedSizeBinary  (1)
TIMESTAMP          Timestamp/Date64                    (1) (2)
TIMESTAMP_INSTANT  Timestamp                           (2)
LIST               List/LargeList/FixedSizeList        (1)
MAP                Map
STRUCT             Struct
UNION              SparseUnion/DenseUnion              (1)
DECIMAL            Decimal128/Decimal256               (1)
DATE               Date32
VARCHAR            String                              (3)
CHAR               String                              (3)

  • (1) On the read side the ORC type is read as the first corresponding Arrow type in the table.

  • (2) On the write side the ORC TIMESTAMP_INSTANT type is used when a timezone is provided; otherwise ORC TIMESTAMP is used. On the read side both ORC TIMESTAMP and TIMESTAMP_INSTANT types are read as the Arrow Timestamp type with arrow::TimeUnit::NANO, and the timezone is set to UTC for the ORC TIMESTAMP_INSTANT type only.

  • (3) On the read side both ORC CHAR and VARCHAR types are read as the Arrow String type. ORC CHAR and VARCHAR types are not supported on the write side.
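The read-side and write-side rules in notes (1) and (2) can be modelled in a few lines. The sketch below is a stand-alone illustration using only the standard library; it does not call the Arrow API, and the helper names `ArrowTypeOnRead` and `OrcTimestampTypeOnWrite` are hypothetical. It assumes an empty timezone string means "no timezone provided".

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper (not part of Arrow): returns the Arrow type produced
// when reading an ORC type, i.e. the first Arrow type listed for it in the
// table above (note 1). Returns std::nullopt for unknown ORC type names.
std::optional<std::string> ArrowTypeOnRead(std::string_view orc_type) {
  static const std::unordered_map<std::string_view, std::string> kReadMapping = {
      {"BOOLEAN", "Boolean"},     {"BYTE", "Int8"},
      {"SHORT", "Int16"},         {"INT", "Int32"},
      {"LONG", "Int64"},          {"FLOAT", "Float32"},
      {"DOUBLE", "Float64"},      {"STRING", "String"},
      {"BINARY", "Binary"},       {"TIMESTAMP", "Timestamp"},
      {"TIMESTAMP_INSTANT", "Timestamp"},
      {"LIST", "List"},           {"MAP", "Map"},
      {"STRUCT", "Struct"},       {"UNION", "SparseUnion"},
      {"DECIMAL", "Decimal128"},  {"DATE", "Date32"},
      {"VARCHAR", "String"},      {"CHAR", "String"},
  };
  auto it = kReadMapping.find(orc_type);
  if (it == kReadMapping.end()) return std::nullopt;
  return it->second;
}

// Hypothetical helper (not part of Arrow): picks the ORC timestamp type used
// on the write side (note 2). Assumption: an empty string means no timezone.
std::string OrcTimestampTypeOnWrite(std::string_view timezone) {
  return timezone.empty() ? "TIMESTAMP" : "TIMESTAMP_INSTANT";
}
```

For example, under these assumptions an Arrow Timestamp column with timezone "UTC" would be written as TIMESTAMP_INSTANT, while a timezone-less column would be written as TIMESTAMP.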

Compression#

Supported compression codecs:

  • SNAPPY

  • GZIP/ZLIB

  • LZ4

  • ZSTD

Unsupported compression codec: LZO.
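When a codec name comes from user configuration, it can be useful to validate it against this list before attempting a write. The following is a minimal stand-alone sketch of such a check; the `OrcCodec` enum and `ParseSupportedCodec` function are hypothetical and are not part of the Arrow API. It assumes codec names are matched case-insensitively and that GZIP and ZLIB name the same codec, as the combined entry in the list suggests.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>

// Hypothetical enum (not part of Arrow) covering the supported codecs.
enum class OrcCodec { kSnappy, kZlib, kLz4, kZstd };

// Hypothetical helper: maps a codec name to a supported codec, or returns
// std::nullopt for anything else (including the unsupported LZO codec).
std::optional<OrcCodec> ParseSupportedCodec(std::string name) {
  // Upper-case the name so that "zstd" and "ZSTD" are treated alike.
  std::transform(name.begin(), name.end(), name.begin(), [](unsigned char c) {
    return static_cast<char>(std::toupper(c));
  });
  if (name == "SNAPPY") return OrcCodec::kSnappy;
  if (name == "GZIP" || name == "ZLIB") return OrcCodec::kZlib;
  if (name == "LZ4") return OrcCodec::kLz4;
  if (name == "ZSTD") return OrcCodec::kZstd;
  return std::nullopt;
}
```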

Reading ORC Files#

The ORCFileReader class reads data for an entire file or stripe into an ::arrow::Table.

ORCFileReader#

The ORCFileReader class requires an ::arrow::io::RandomAccessFile instance representing the input file.

    #include <arrow/adapters/orc/adapter.h>

    #include <utility>

    {
        // ...
        arrow::Status st;
        arrow::MemoryPool* pool = arrow::default_memory_pool();
        std::shared_ptr<arrow::io::RandomAccessFile> input = ...;

        // Open ORC file reader
        auto maybe_reader = arrow::adapters::orc::ORCFileReader::Open(input, pool);
        if (!maybe_reader.ok()) {
            // Handle error instantiating file reader...
        }
        // Move the unique_ptr out of the Result (a unique_ptr cannot be copied)
        std::unique_ptr<arrow::adapters::orc::ORCFileReader> reader =
            std::move(maybe_reader).ValueOrDie();

        // Read entire file as a single Arrow table
        auto maybe_table = reader->Read();
        if (!maybe_table.ok()) {
            // Handle error reading ORC data...
        }
        std::shared_ptr<arrow::Table> table = maybe_table.ValueOrDie();
    }

Writing ORC Files#

ORCFileWriter#

An ORC file is written to an arrow::io::OutputStream.

    #include <arrow/adapters/orc/adapter.h>

    #include <utility>

    {
        // Oneshot write
        // ...
        // input_table is the std::shared_ptr<arrow::Table> to be written
        std::shared_ptr<arrow::io::OutputStream> output = ...;
        auto writer_options = arrow::adapters::orc::WriteOptions();
        auto maybe_writer =
            arrow::adapters::orc::ORCFileWriter::Open(output.get(), writer_options);
        if (!maybe_writer.ok()) {
            // Handle error instantiating file writer...
        }
        // Move the unique_ptr out of the Result (a unique_ptr cannot be copied)
        std::unique_ptr<arrow::adapters::orc::ORCFileWriter> writer =
            std::move(maybe_writer).ValueOrDie();
        if (!(writer->Write(*input_table)).ok()) {
            // Handle write error...
        }
        if (!(writer->Close()).ok()) {
            // Handle close error...
        }
    }