Reading JSON files#

Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns “a”, “b”, “c”, “d”:

{"a":1,"b":2.0,"c":"foo","d":false}{"a":4,"b":-5.5,"c":null,"d":true}

The features currently offered are the following:

  • multi-threaded or single-threaded reading

  • automatic decompression of input files (based on the filename extension, such as my_data.json.gz)

  • sophisticated type inference (see below)

Note

Currently only the line-delimited JSON format is supported.

Usage#

JSON reading functionality is available through the pyarrow.json module. In many cases, you will simply call the read_json() function with the file path you want to read from:

>>> from pyarrow import json
>>> fn = 'my_data.json'
>>> table = json.read_json(fn)
>>> table
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> table.to_pandas()
   a    b     c      d
0  1  2.0   foo  False
1  4 -5.5  None   True
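Since decompression is keyed off the filename extension, a compressed file can be read with the same call. The snippet below is a minimal sketch that writes the two sample rows to a gzip-compressed file (the my_data.json.gz name is just illustrative) and reads it back:

import gzip

import pyarrow.json as json

# Write the two sample rows as line-delimited JSON, gzip-compressed.
rows = [
    '{"a": 1, "b": 2.0, "c": "foo", "d": false}',
    '{"a": 4, "b": -5.5, "c": null, "d": true}',
]
with gzip.open("my_data.json.gz", "wt") as f:
    f.write("\n".join(rows) + "\n")

# The .json.gz extension triggers automatic decompression on read.
table = json.read_json("my_data.json.gz")
print(table.schema)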

Automatic Type Inference#

Arrow data types are inferred from the JSON types and values of each column:

  • JSON null values convert to the null type, but can fall back to any other type.

  • JSON booleans convert to bool_.

  • JSON numbers convert to int64, falling back to float64 if a non-integer is encountered.

  • JSON strings of the kind “YYYY-MM-DD” and “YYYY-MM-DD hh:mm:ss” convert to timestamp[s], falling back to utf8 if a conversion error occurs.

  • JSON arrays convert to a list type, and inference proceeds recursively on the JSON arrays’ values.

  • Nested JSON objects convert to a struct type, and inference proceeds recursively on the JSON objects’ values.

Thus, reading this JSON file:

{"a":[1,2],"b":{"c":true,"d":"1991-02-03"}}{"a":[3,4,5],"b":{"c":false,"d":"2019-04-01"}}

returns the following data:

>>> table = json.read_json("my_data.json")
>>> table
pyarrow.Table
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[s]>
  child 0, c: bool
  child 1, d: timestamp[s]
>>> table.to_pandas()
           a                                       b
0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}

Customized parsing#

To alter the default parsing settings in case of reading JSON files with an unusual structure, you should create a ParseOptions instance and pass it to read_json(). For example, you can pass an explicit schema in order to bypass automatic type inference.
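As a minimal sketch, this passes an explicit schema for the four-column file shown in the Usage section, bypassing type inference (the column types chosen here are just an illustration):

import pyarrow as pa
import pyarrow.json as json

# Declare the desired column types up front instead of inferring them.
schema = pa.schema([
    ("a", pa.int64()),
    ("b", pa.float64()),
    ("c", pa.string()),
    ("d", pa.bool_()),
])

parse_options = json.ParseOptions(explicit_schema=schema)
table = json.read_json("my_data.json", parse_options=parse_options)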

Similarly, you can choose performance settings by passing a ReadOptions instance to read_json().
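For instance, here is a sketch of tuning the reader; whether these particular values help depends on the file and the machine:

import pyarrow.json as json

# Disable multi-threaded reading and use larger blocks; both knobs live on
# ReadOptions and only affect performance, not the resulting data.
read_options = json.ReadOptions(use_threads=False, block_size=4 * 1024 * 1024)
table = json.read_json("my_data.json", read_options=read_options)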

Incremental reading#

For memory-constrained environments, it is also possible to read a JSON file one batch at a time, using open_json().

In this case, type inference is done on the first block and types are frozen afterwards. To make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ParseOptions.explicit_schema to set the desired data types explicitly.
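A minimal sketch of incremental reading, assuming the streaming reader returned by open_json() can be iterated over like a record batch reader, and pinning the schema explicitly as recommended above:

import pyarrow as pa
import pyarrow.json as json

# Pin the schema so the types do not depend on what the first block contains.
schema = pa.schema([
    ("a", pa.int64()),
    ("b", pa.float64()),
    ("c", pa.string()),
    ("d", pa.bool_()),
])
parse_options = json.ParseOptions(explicit_schema=schema)

# Process the file one record batch at a time instead of loading it whole.
reader = json.open_json("my_data.json", parse_options=parse_options)
for batch in reader:
    print(batch.num_rows)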