Reading JSON files#

Line-separated JSON files can either be read as a single Arrow Table with a TableReader or streamed as RecordBatches with a StreamingReader.

Both of these readers require an arrow::io::InputStream instance representing the input file. Their behavior can be customized using a combination of ReadOptions, ParseOptions, and other parameters.

TableReader#

TableReader reads an entire file in one shot as a Table. Each independent JSON object in the input file is converted to a row in the output table.

```cpp
#include "arrow/json/api.h"

{
   // ...
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   std::shared_ptr<arrow::io::InputStream> input = ...;

   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();

   // Instantiate TableReader from input stream and options
   auto maybe_reader =
       arrow::json::TableReader::Make(pool, input, read_options, parse_options);
   if (!maybe_reader.ok()) {
      // Handle TableReader instantiation error...
   }
   auto reader = *maybe_reader;

   // Read table from JSON file
   auto maybe_table = reader->Read();
   if (!maybe_table.ok()) {
      // Handle JSON read error
      // (for example a JSON syntax error or failed type conversion)
   }
   auto table = *maybe_table;
}
```

StreamingReader#

StreamingReader reads a file incrementally from blocks of a roughly equal byte size, each yielding a RecordBatch. Each independent JSON object in a block is converted to a row in the output batch.

All batches adhere to a consistent Schema, which is derived from the first loaded batch. Alternatively, an explicit schema may be passed via ParseOptions.

```cpp
#include "arrow/json/api.h"

{
   // ...
   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();

   std::shared_ptr<arrow::io::InputStream> stream;
   auto result = arrow::json::StreamingReader::Make(stream,
                                                    read_options,
                                                    parse_options);
   if (!result.ok()) {
      // Handle instantiation error
   }
   std::shared_ptr<arrow::json::StreamingReader> reader = *result;

   for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *reader) {
      if (!maybe_batch.ok()) {
         // Handle read/parse error
      }
      std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
      // Operate on each batch...
   }
}
```

Data types#

Since JSON values are typed, the possible Arrow data types on output depend on the input value types. Top-level JSON values should always be objects. The fields of top-level objects are taken to represent columns in the Arrow data. For each name/value pair in a JSON object, there are two possible modes of deciding the output data type:

  • if the name is in ParseOptions::explicit_schema, conversion of the JSON value to the corresponding Arrow data type is attempted;

  • otherwise, the Arrow data type is determined via type inference on the JSON value, trying out a number of Arrow data types in order.

The following tables show the possible combinations for each of those two modes.

Explicit conversions from JSON to Arrow#

| JSON value type | Allowed Arrow data types |
|-----------------|--------------------------|
| Null            | Any (including Null) |
| Number          | All Integer types, Float32, Float64, Date32, Date64, Time32, Time64 |
| Boolean         | Boolean |
| String          | Binary, LargeBinary, String, LargeString, Timestamp |
| Array           | List |
| Object (nested) | Struct |

Implicit type inference from JSON to Arrow#

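| JSON value type | Inferred Arrow data types (in order) |
|-----------------|--------------------------------------|
| Null            | Null, any other |
| Number          | Int64, Float64 |
| Boolean         | Boolean |
| String          | Timestamp (with seconds unit), String |
| Array           | List |
| Object (nested) | Struct |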
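To illustrate what "in order" means for the Number row, here is a toy model in plain C++ that does not use Arrow. It assumes inference settles on the first candidate type that can represent every value in the column, so a single non-integer literal moves the whole column to Float64. This is a simplified sketch of the idea, not Arrow's implementation, and all names in it are hypothetical.

```cpp
#include <cerrno>
#include <cstdlib>
#include <string>
#include <vector>

// Candidate types for JSON numbers, listed in the order they are tried.
enum class InferredType { kInt64, kFloat64 };

// True if the whole literal parses as a signed 64-bit integer.
bool ParsesAsInt64(const std::string& literal) {
  if (literal.empty()) return false;
  errno = 0;
  char* end = nullptr;
  std::strtoll(literal.c_str(), &end, 10);
  // Reject out-of-range values and trailing characters such as ".5" or "e3".
  return errno == 0 && end == literal.c_str() + literal.size();
}

// Pick Int64 if every literal fits, otherwise fall back to Float64.
InferredType InferNumberColumn(const std::vector<std::string>& literals) {
  for (const auto& literal : literals) {
    if (!ParsesAsInt64(literal)) return InferredType::kFloat64;
  }
  return InferredType::kInt64;
}
```

Under this model, a column holding the literals `1` and `2` is inferred as Int64, while `1` and `2.5` together yield Float64.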