Reading JSON files#
Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns “a”, “b”, “c”, “d”:
{"a":1,"b":2.0,"c":"foo","d":false}{"a":4,"b":-5.5,"c":null,"d":true}
The features currently offered are the following:
- multi-threaded or single-threaded reading
- automatic decompression of input files (based on the filename extension, such as my_data.json.gz)
- sophisticated type inference (see below)
Note
Currently only the line-delimited JSON format is supported.
Usage#
JSON reading functionality is available through the pyarrow.json module. In many cases, you will simply call the read_json() function with the file path you want to read from:
>>> from pyarrow import json
>>> fn = 'my_data.json'
>>> table = json.read_json(fn)
>>> table
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> table.to_pandas()
   a    b     c      d
0  1  2.0   foo  False
1  4 -5.5  None   True
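Since decompression is driven by the filename extension, a compressed copy of the same file can be read in exactly the same way. The following is a minimal sketch, assuming a gzip-compressed copy named my_data.json.gz is created first:

>>> import gzip, shutil
>>> from pyarrow import json
>>> # Compress the example file so that the ".gz" extension triggers
>>> # automatic decompression on read.
>>> with open("my_data.json", "rb") as src, gzip.open("my_data.json.gz", "wb") as dst:
...     shutil.copyfileobj(src, dst)
...
>>> json.read_json("my_data.json.gz").schema
a: int64
b: double
c: string
d: bool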
Automatic Type Inference#
Arrow data types are inferred from the JSON types and values of each column:
- JSON null values convert to the null type, but can fall back to any other type.
- JSON booleans convert to bool_.
- JSON numbers convert to int64, falling back to float64 if a non-integer is encountered.
- JSON strings of the kind “YYYY-MM-DD” and “YYYY-MM-DD hh:mm:ss” convert to timestamp[s], falling back to utf8 if a conversion error occurs.
- JSON arrays convert to a list type, and inference proceeds recursively on the JSON arrays’ values.
- Nested JSON objects convert to a struct type, and inference proceeds recursively on the JSON objects’ values.
Thus, reading this JSON file:
{"a":[1,2],"b":{"c":true,"d":"1991-02-03"}}{"a":[3,4,5],"b":{"c":false,"d":"2019-04-01"}}
returns the following data:
>>> table = json.read_json("my_data.json")
>>> table
pyarrow.Table
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[s]>
  child 0, c: bool
  child 1, d: timestamp[s]
>>> table.to_pandas()
           a                                      b
0     [1, 2]  {'c': True, 'd': 1991-02-03 00:00:00}
1  [3, 4, 5] {'c': False, 'd': 2019-04-01 00:00:00}
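As a further illustration of the fallback rules, the sketch below feeds read_json() an in-memory buffer (it accepts file-like objects as well as paths) containing a column that mixes an integer with a non-integer number, which is therefore inferred as float64:

>>> import io
>>> from pyarrow import json
>>> # "x" mixes an integer with a non-integer number, so inference falls
>>> # back from int64 to float64.
>>> buf = io.BytesIO(b'{"x": 1}\n{"x": 2.5}\n')
>>> json.read_json(buf).schema
x: double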
Customized parsing#
To alter the default parsing settings when reading JSON files with an unusual structure, you should create a ParseOptions instance and pass it to read_json(). For example, you can pass an explicit schema in order to bypass automatic type inference.
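As a sketch of the explicit-schema case, the following assumes the my_data.json file from the Usage section and forces column "a" to int32 instead of the int64 that inference would choose, keeping the other field types as inferred:

>>> import pyarrow as pa
>>> from pyarrow import json
>>> # With an explicit schema, type inference is skipped entirely.
>>> schema = pa.schema([("a", pa.int32()),
...                     ("b", pa.float64()),
...                     ("c", pa.string()),
...                     ("d", pa.bool_())])
>>> opts = json.ParseOptions(explicit_schema=schema)
>>> json.read_json("my_data.json", parse_options=opts).schema
a: int32
b: double
c: string
d: bool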
Similarly, you can choose performance settings by passing a ReadOptions instance to read_json().
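For example, a ReadOptions instance along these lines (the values shown are only illustrative) disables multi-threaded reading and enlarges the block size used for chunking the input:

>>> from pyarrow import json
>>> # block_size is expressed in bytes; here 4 MiB per block.
>>> read_options = json.ReadOptions(use_threads=False, block_size=4 * 1024 * 1024)
>>> table = json.read_json("my_data.json", read_options=read_options)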
Incremental reading#
For memory-constrained environments, it is also possible to read a JSON file one batch at a time, using open_json().
In this case, type inference is done on the first block and types are frozen afterwards. To make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ParseOptions.explicit_schema to set the desired data types explicitly.
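A minimal sketch of incremental reading, assuming the returned reader can be used as a context manager and iterated like other record batch readers:

>>> from pyarrow import json
>>> # Each iteration yields one pyarrow.RecordBatch; with the small example
>>> # file the whole input fits in a single batch of two rows.
>>> with json.open_json("my_data.json") as reader:
...     for batch in reader:
...         print(batch.num_rows)
...
2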

