Reading JSON files#

Arrow supports reading columnar data from line-delimited JSON files. In this context, a JSON file consists of multiple JSON objects, one per line, representing individual data rows. For example, this file represents two rows of data with four columns “a”, “b”, “c”, “d”:

{"a":1,"b":2.0,"c":"foo","d":false}{"a":4,"b":-5.5,"c":null,"d":true}

The features currently offered are the following:

  • multi-threaded or single-threaded reading

  • automatic decompression of input files (based on the filename extension, such as my_data.json.gz)

  • sophisticated type inference (see below)

Note

Currently only the line-delimited JSON format is supported.

Usage#

JSON reading functionality is available through the pyarrow.json module. In many cases, you will simply call the read_json() function with the file path you want to read from:

>>> from pyarrow import json
>>> fn = 'my_data.json'
>>> table = json.read_json(fn)
>>> table
pyarrow.Table
a: int64
b: double
c: string
d: bool
>>> table.to_pandas()
   a    b     c      d
0  1  2.0   foo  False
1  4 -5.5  None   True
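Since decompression is keyed off the filename extension, a compressed file can be read with the same call. The snippet below is a minimal sketch that writes the two sample rows to a gzip-compressed file (the my_data.json.gz name is just illustrative) and reads it back:

import gzip

import pyarrow.json as json

# Write the two sample rows as line-delimited JSON, gzip-compressed.
rows = [
    '{"a": 1, "b": 2.0, "c": "foo", "d": false}',
    '{"a": 4, "b": -5.5, "c": null, "d": true}',
]
with gzip.open("my_data.json.gz", "wt") as f:
    f.write("\n".join(rows) + "\n")

# The .json.gz extension triggers automatic decompression on read.
table = json.read_json("my_data.json.gz")
print(table.schema)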

Automatic Type Inference#

Arrow data types are inferred from the JSON types and values of each column:

  • JSON null values convert to the null type, but can fall back to any other type.

  • JSON booleans convert to bool_.

  • JSON numbers convert to int64, falling back to float64 if a non-integer is encountered.

  • JSON strings of the kind “YYYY-MM-DD” and “YYYY-MM-DD hh:mm:ss” convert to timestamp[s], falling back to utf8 if a conversion error occurs.

  • JSON arrays convert to a list type, and inference proceeds recursively on the JSON arrays’ values.

  • Nested JSON objects convert to a struct type, and inference proceeds recursively on the JSON objects’ values.

Thus, reading this JSON file:

{"a":[1,2],"b":{"c":true,"d":"1991-02-03"}}{"a":[3,4,5],"b":{"c":false,"d":"2019-04-01"}}

returns the following data:

>>> table = json.read_json("my_data.json")
>>> table
pyarrow.Table
a: list<item: int64>
  child 0, item: int64
b: struct<c: bool, d: timestamp[s]>
  child 0, c: bool
  child 1, d: timestamp[s]
>>> table.to_pandas()
           a                                       b
0     [1, 2]   {'c': True, 'd': 1991-02-03 00:00:00}
1  [3, 4, 5]  {'c': False, 'd': 2019-04-01 00:00:00}

Customized parsing#

To alter the default parsing settings in case of reading JSON files with an unusual structure, you should create a ParseOptions instance and pass it to read_json(). For example, you can pass an explicit schema in order to bypass automatic type inference.
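As a minimal sketch, this passes an explicit schema for the four-column file shown in the Usage section, bypassing type inference (the column types chosen here are just an illustration):

import pyarrow as pa
import pyarrow.json as json

# Declare the desired column types up front instead of inferring them.
schema = pa.schema([
    ("a", pa.int64()),
    ("b", pa.float64()),
    ("c", pa.string()),
    ("d", pa.bool_()),
])

parse_options = json.ParseOptions(explicit_schema=schema)
table = json.read_json("my_data.json", parse_options=parse_options)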

Similarly, you can choose performance settings by passing a ReadOptions instance to read_json().
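For instance, here is a sketch of tuning the reader; whether these particular values help depends on the file and the machine:

import pyarrow.json as json

# Disable multi-threaded reading and use larger blocks; both knobs live on
# ReadOptions and only affect performance, not the resulting data.
read_options = json.ReadOptions(use_threads=False, block_size=4 * 1024 * 1024)
table = json.read_json("my_data.json", read_options=read_options)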

Incremental reading#

For memory-constrained environments, it is also possible to read a JSON file one batch at a time, using open_json().

In this case, type inference is done on the first block and types are frozen afterwards. To make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ParseOptions.explicit_schema to set the desired data types explicitly.
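A minimal sketch of incremental reading, assuming the streaming reader returned by open_json() can be iterated over like a record batch reader, and pinning the schema explicitly as recommended above:

import pyarrow as pa
import pyarrow.json as json

# Pin the schema so the types do not depend on what the first block contains.
schema = pa.schema([
    ("a", pa.int64()),
    ("b", pa.float64()),
    ("c", pa.string()),
    ("d", pa.bool_()),
])
parse_options = json.ParseOptions(explicit_schema=schema)

# Process the file one record batch at a time instead of loading it whole.
reader = json.open_json("my_data.json", parse_options=parse_options)
for batch in reader:
    print(batch.num_rows)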