Reading JSON files #

Line-separated JSON files can either be read as a single Arrow Table with a TableReader or streamed as RecordBatches with a StreamingReader.

Both of these readers require an arrow::io::InputStream instance representing the input file. Their behavior can be customized using a combination of ReadOptions, ParseOptions, and other parameters.

TableReader#

TableReader reads an entire file in one shot as a Table. Each independent JSON object in the input file is converted to a row in the output table.

#include "arrow/json/api.h"

{
   // ...
   arrow::MemoryPool* pool = arrow::default_memory_pool();
   std::shared_ptr<arrow::io::InputStream> input = ...;

   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();

   // Instantiate TableReader from input stream and options
   auto maybe_reader =
       arrow::json::TableReader::Make(pool, input, read_options, parse_options);
   if (!maybe_reader.ok()) {
      // Handle TableReader instantiation error...
   }
   auto reader = *maybe_reader;

   // Read table from JSON file
   auto maybe_table = reader->Read();
   if (!maybe_table.ok()) {
      // Handle JSON read error
      // (for example a JSON syntax error or failed type conversion)
   }
   auto table = *maybe_table;
}

StreamingReader#

StreamingReader reads a file incrementally from blocks of a roughly equal byte size, each yielding a RecordBatch. Each independent JSON object in a block is converted to a row in the output batch.

All batches adhere to a consistent Schema, which is derived from the first loaded batch. Alternatively, an explicit schema may be passed via ParseOptions.

#include "arrow/json/api.h"

{
   // ...
   auto read_options = arrow::json::ReadOptions::Defaults();
   auto parse_options = arrow::json::ParseOptions::Defaults();

   std::shared_ptr<arrow::io::InputStream> stream;
   auto result = arrow::json::StreamingReader::Make(stream,
                                                    read_options,
                                                    parse_options);
   if (!result.ok()) {
      // Handle instantiation error
   }
   std::shared_ptr<arrow::json::StreamingReader> reader = *result;

   for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *reader) {
      if (!maybe_batch.ok()) {
         // Handle read/parse error
      }
      std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
      // Operate on each batch...
   }
}

Data types#

Since JSON values are typed, the possible Arrow data types on output depend on the input value types. Top-level JSON values should always be objects. The fields of top-level objects are taken to represent columns in the Arrow data. For each name/value pair in a JSON object, there are two possible modes of deciding the output data type:

if the name is in ParseOptions::explicit_schema, conversion of the JSON value to the corresponding Arrow data type is attempted;
otherwise, the Arrow data type is determined via type inference on the JSON value, trying out a number of Arrow data types in order.

The following tables show the possible combinations for each of those two modes.

Explicit conversions from JSON to Arrow#
JSON value type	Allowed Arrow data types
Null	Any (including Null)
Number	All Integer types, Float32, Float64, Date32, Date64, Time32, Time64
Boolean	Boolean
String	Binary, LargeBinary, String, LargeString, Timestamp
Array	List
Object (nested)	Struct

Implicit type inference from JSON to Arrow#
JSON value type	Inferred Arrow data types (in order)
Null	Null, any other
Number	Int64, Float64
Boolean	Boolean
String	Timestamp (with seconds unit), String
Array	List
Object (nested)	Struct
