Reading JSON files#
Line-separated JSON files can either be read as a single Arrow Table with a TableReader or streamed as RecordBatches with a StreamingReader.
Both of these readers require an arrow::io::InputStream instance representing the input file. Their behavior can be customized using a combination of ReadOptions, ParseOptions, and other parameters.
TableReader#
TableReader reads an entire file in one shot as a Table. Each independent JSON object in the input file is converted to a row in the output table.
```cpp
#include "arrow/json/api.h"

{
  // ...
  arrow::MemoryPool* pool = arrow::default_memory_pool();
  std::shared_ptr<arrow::io::InputStream> input = ...;

  auto read_options = arrow::json::ReadOptions::Defaults();
  auto parse_options = arrow::json::ParseOptions::Defaults();

  // Instantiate TableReader from input stream and options
  auto maybe_reader =
      arrow::json::TableReader::Make(pool, input, read_options, parse_options);
  if (!maybe_reader.ok()) {
    // Handle TableReader instantiation error...
  }
  auto reader = *maybe_reader;

  // Read table from JSON file
  auto maybe_table = reader->Read();
  if (!maybe_table.ok()) {
    // Handle JSON read error
    // (for example a JSON syntax error or failed type conversion)
  }
  auto table = *maybe_table;
}
```
StreamingReader#
StreamingReader reads a file incrementally from blocks of a roughly equal byte size, each yielding a RecordBatch. Each independent JSON object in a block is converted to a row in the output batch.
All batches adhere to a consistent Schema, which is derived from the first loaded batch. Alternatively, an explicit schema may be passed via ParseOptions.
```cpp
#include "arrow/json/api.h"

{
  // ...
  auto read_options = arrow::json::ReadOptions::Defaults();
  auto parse_options = arrow::json::ParseOptions::Defaults();

  std::shared_ptr<arrow::io::InputStream> stream;
  auto result = arrow::json::StreamingReader::Make(stream, read_options, parse_options);
  if (!result.ok()) {
    // Handle instantiation error
  }
  std::shared_ptr<arrow::json::StreamingReader> reader = *result;

  for (arrow::Result<std::shared_ptr<arrow::RecordBatch>> maybe_batch : *reader) {
    if (!maybe_batch.ok()) {
      // Handle read/parse error
    }
    std::shared_ptr<arrow::RecordBatch> batch = *maybe_batch;
    // Operate on each batch...
  }
}
```
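To picture what "blocks of a roughly equal byte size" means for line-separated input, the sketch below cuts a buffer into pieces of about `block_size` bytes and extends each piece to the next newline, so that no JSON object straddles two blocks. It uses only the standard library; `SplitIntoBlocks` is a hypothetical helper written for illustration and is not Arrow's actual chunking code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of Arrow): cut line-separated JSON into blocks
// of roughly `block_size` bytes. Each block is extended to the next newline so
// that every JSON object ends up whole inside exactly one block.
std::vector<std::string> SplitIntoBlocks(const std::string& data,
                                         std::size_t block_size) {
  if (block_size == 0) block_size = 1;  // guard against zero-length blocks

  std::vector<std::string> blocks;
  std::size_t start = 0;
  while (start < data.size()) {
    // Tentative end of the block, before newline alignment.
    std::size_t end = start + block_size;
    if (end >= data.size()) {
      end = data.size();
    } else {
      // Extend to just past the next newline, or to the end of the input.
      std::size_t newline = data.find('\n', end - 1);
      end = (newline == std::string::npos) ? data.size() : newline + 1;
    }
    blocks.push_back(data.substr(start, end - start));
    start = end;
  }
  return blocks;
}
```

Because blocks are aligned on object boundaries, their sizes are only approximately equal, and each block can be parsed into its own batch of rows independently of the others.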
Data types#
Since JSON values are typed, the possible Arrow data types on output depend on the input value types. Top-level JSON values should always be objects. The fields of top-level objects are taken to represent columns in the Arrow data. For each name/value pair in a JSON object, there are two possible modes of deciding the output data type:
- if the name is in ParseOptions::explicit_schema, conversion of the JSON value to the corresponding Arrow data type is attempted;
- otherwise, the Arrow data type is determined via type inference on the JSON value, trying out a number of Arrow data types in order.
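The first mode can be set up by attaching a schema to ParseOptions. The following is a minimal sketch rather than a definitive recipe: the helper name `MakeParseOptions` and the field names `id`, `price`, and `created` are hypothetical, chosen only for illustration. It also sets `unexpected_field_behavior` explicitly so that fields missing from the schema fall back to type inference.

```cpp
#include "arrow/api.h"
#include "arrow/json/api.h"

// Hypothetical helper: build ParseOptions that pin the types of three columns.
// The field names below are assumptions made for this example.
arrow::json::ParseOptions MakeParseOptions() {
  auto parse_options = arrow::json::ParseOptions::Defaults();

  // Columns named here are converted to the listed types instead of inferred.
  parse_options.explicit_schema = arrow::schema({
      arrow::field("id", arrow::int32()),                               // JSON Number
      arrow::field("price", arrow::float32()),                          // JSON Number
      arrow::field("created", arrow::timestamp(arrow::TimeUnit::SECOND)),  // JSON String
  });

  // Fields that are not in the schema still get an inferred type.
  parse_options.unexpected_field_behavior =
      arrow::json::UnexpectedFieldBehavior::InferType;
  return parse_options;
}
```

The returned options can then be passed to TableReader::Make or StreamingReader::Make in place of the defaults.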
The following tables show the possible combinations for each of those two modes.
JSON value type | Allowed Arrow data types |
|---|---|
Null | Any (including Null) |
Number | All Integer types, Float32, Float64, Date32, Date64, Time32, Time64 |
Boolean | Boolean |
String | Binary, LargeBinary, String, LargeString, Timestamp |
Array | List |
Object (nested) | Struct |
JSON value type | Inferred Arrow data types (in order) |
|---|---|
Null | Null, any other |
Number | Int64, Float64 |
Boolean | Boolean |
String | Timestamp (with seconds unit), String |
Array | List |
Object (nested) | Struct |
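The inference order in the table above can be mimicked with a small standard-library sketch. It is a deliberately simplified model, not Arrow's implementation: `InferredType`, `MatchesPattern`, `FitsInt64`, and `InferType` are hypothetical names, the input is assumed to be a single well-formed JSON token, and the timestamp check accepts only `YYYY-MM-DD` and `YYYY-MM-DD hh:mm:ss` (with a space or `T` separator), which is narrower than what a real parser handles.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <string>

// Simplified stand-ins for the Arrow types named in the inference table.
enum class InferredType { Null, Boolean, Int64, Float64, Timestamp, String, List, Struct };

// Pattern syntax (hypothetical): 'd' matches a digit, ' ' matches a space or
// 'T', and any other character must match literally.
bool MatchesPattern(const std::string& s, const std::string& pattern) {
  if (s.size() != pattern.size()) return false;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (pattern[i] == 'd') {
      if (!std::isdigit(static_cast<unsigned char>(s[i]))) return false;
    } else if (pattern[i] == ' ') {
      if (s[i] != ' ' && s[i] != 'T') return false;
    } else if (s[i] != pattern[i]) {
      return false;
    }
  }
  return true;
}

// True if the whole token parses as a signed integer that fits in 64 bits.
bool FitsInt64(const std::string& token) {
  errno = 0;
  char* end = nullptr;
  (void)std::strtoll(token.c_str(), &end, 10);
  return errno == 0 && end == token.c_str() + token.size();
}

// Infer a type for one raw JSON token such as `42`, `"abc"`, or `null`,
// trying the candidates in the same order as the table.
InferredType InferType(const std::string& token) {
  if (token.empty() || token == "null") return InferredType::Null;
  if (token == "true" || token == "false") return InferredType::Boolean;
  if (token[0] == '[') return InferredType::List;
  if (token[0] == '{') return InferredType::Struct;
  if (token[0] == '"') {
    // Strip the surrounding quotes, then try Timestamp before String.
    std::string body = token.size() >= 2 ? token.substr(1, token.size() - 2) : std::string();
    bool is_timestamp = MatchesPattern(body, "dddd-dd-dd") ||
                        MatchesPattern(body, "dddd-dd-dd dd:dd:dd");
    return is_timestamp ? InferredType::Timestamp : InferredType::String;
  }
  // Anything else is treated as a number: try Int64 first, then Float64.
  return FitsInt64(token) ? InferredType::Int64 : InferredType::Float64;
}
```

For instance, under these assumptions `InferType("42")` yields `Int64`, `InferType("1.5")` yields `Float64`, and a quoted string that does not look like a timestamp falls through to `String`.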

