Reading and Writing the Apache ORC Format#

The Apache ORC project provides a standardized open-source columnar storage format for use in data analysis systems. It was created originally for use in Apache Hadoop with systems like Apache Drill, Apache Hive, Apache Impala, and Apache Spark adopting it as a shared standard for high performance data IO.

Apache Arrow is an ideal in-memory representation layer for data that is being read or written with ORC files.

Obtaining pyarrow with ORC Support#

If you installed pyarrow with pip or conda, it should be built with ORC support bundled:

>>> from pyarrow import orc

If you are building pyarrow from source, you must use -DARROW_ORC=ON when compiling the C++ libraries and enable the ORC extensions when building pyarrow. See the Python Development page for more details.

Reading and Writing Single Files#

The functions read_table() and write_table() read and write the pyarrow.Table object, respectively.

Let’s look at a simple table:

>>> import numpy as np
>>> import pyarrow as pa

>>> table = pa.table(
...     {
...         'one': [-1, np.nan, 2.5],
...         'two': ['foo', 'bar', 'baz'],
...         'three': [True, False, True]
...     }
... )

We write this to ORC format with write_table:

>>> from pyarrow import orc
>>> orc.write_table(table, 'example.orc')

This creates a single ORC file. In practice, an ORC dataset may consist of many files in many directories. We can read a single file back with read_table:

>>> table2 = orc.read_table('example.orc')

You can pass a subset of columns to read, which can be much faster than reading the whole file (due to the columnar layout):

>>> orc.read_table('example.orc', columns=['one', 'three'])
pyarrow.Table
one: double
three: bool
----
one: [[-1,nan,2.5]]
three: [[true,false,true]]

We need not use a string to specify the origin of the file. It can be any of:

  • A file path as a string

  • A Python file object

  • A pathlib.Path object

  • A NativeFile from PyArrow

In general, a Python file object will have the worst read performance, while a string file path or an instance of NativeFile (especially memory maps) will perform the best.
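As a small illustration of the last two options, the same file can be read through a pathlib.Path or through a memory-mapped NativeFile (this sketch assumes the example.orc file written above is present in the current directory):

>>> import pathlib

>>> # a pathlib.Path works the same as a string path
>>> table_from_path = orc.read_table(pathlib.Path('example.orc'))

>>> # a memory-mapped NativeFile avoids extra copies on read
>>> with pa.memory_map('example.orc') as source:
...     table_from_map = orc.read_table(source)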

We can also read partitioned datasets with multiple ORC files through the pyarrow.dataset interface.
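For example, a minimal sketch of that interface, where the directory name orc_dataset/ is only illustrative and is assumed to contain one or more ORC files:

>>> import pyarrow.dataset as ds

>>> dataset = ds.dataset('orc_dataset/', format='orc')
>>> table = dataset.to_table()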

ORC file writing options#

write_table() has a number of options to control various settings when writing an ORC file.

  • file_version, the ORC format version to use. '0.11' ensures compatibility with older readers, while '0.12' is the newer one.

  • stripe_size, to control the approximate size of data within a column stripe. This currently defaults to 64MB.

See the write_table() docstring for more details.
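For example, a sketch passing both options (the output file name here is only illustrative):

>>> orc.write_table(
...     table,
...     'example_v11.orc',
...     file_version='0.11',
...     stripe_size=32 * 1024 * 1024  # 32MB stripes instead of the 64MB default
... )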

Finer-grained Reading and Writing#

read_table uses the ORCFile class, which has other features:

>>> orc_file = orc.ORCFile('example.orc')
>>> orc_file.metadata
-- metadata --
>>> orc_file.schema
one: double
two: string
three: bool
>>> orc_file.nrows
3

See the ORCFile docstring for more details.

As described in the Apache ORC format, an ORC file consists of multiple stripes. read_table will read all of the stripes and concatenate them into a single table. You can read individual stripes with read_stripe:

>>> orc_file.nstripes
1
>>> orc_file.read_stripe(0)
pyarrow.RecordBatch
one: double
two: string
three: bool
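As a sketch of how the pieces fit together, the individual stripes can also be reassembled into a single table manually, which is roughly what read_table does for you:

>>> stripes = [orc_file.read_stripe(i) for i in range(orc_file.nstripes)]
>>> table_from_stripes = pa.Table.from_batches(stripes)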

We can write an ORC file using ORCWriter:

>>> with orc.ORCWriter('example2.orc') as writer:
...     writer.write(table)
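Since ORCWriter keeps the file open until the with block exits, it can be handed data incrementally; the sketch below assumes that repeated write() calls simply append more rows to the same file:

>>> with orc.ORCWriter('example3.orc') as writer:
...     writer.write(table)
...     writer.write(table)  # assumed to append additional rows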

Compression#

The data within a column stripe can be compressed after the encoding passes (dictionary, RLE encoding). In PyArrow we don't use compression by default, but Snappy, ZSTD, Zlib, and LZ4 are also supported:

>>> orc.write_table(table, where, compression='uncompressed')
>>> orc.write_table(table, where, compression='zlib')
>>> orc.write_table(table, where, compression='zstd')
>>> orc.write_table(table, where, compression='snappy')

Snappy generally results in better performance, while Zlib may yield smaller files.

Reading from cloud storage#

In addition to local files, pyarrow supports other filesystems, such as cloud filesystems, through the filesystem keyword:

>>> from pyarrow import fs
>>> s3 = fs.S3FileSystem(region="us-east-2")
>>> table = orc.read_table("bucket/object/key/prefix", filesystem=s3)
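Writing to cloud storage works similarly; one sketch (with an illustrative bucket path) is to open an output stream on the filesystem object and pass it to write_table:

>>> with s3.open_output_stream("bucket/object/key/output.orc") as out:
...     orc.write_table(table, out)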