pkoppstein/jm

CLI for streaming JSON

jm and jm.py are scripts which make it easy to splat (that is, to stream the top-level values in) JSON arrays or JSON objects losslessly, even if they occur in very large JSON documents. (Losslessly here refers primarily to numeric precision, but jm also handles duplicate keys within a JSON object losslessly.)

Once installed, each script is easy to use. For example, to splat the top-level array of a JSON document in a file named input.json, one could write:

jm input.json

or

jm.py input.json

For example:

    $ jm <<< '{"a": 1,"b": "2", "c": {"d": 3} }'
    1
    "2"
    {"d":3}

Further examples and variations are shown below.

For large inputs, jm is typically 3 or more times faster than jm.py and consumes significantly less memory, but jm.py can be made to ignore comments, and Pythonistas might find jm.py of interest as it is easy to modify.

jm requires PHP 8 and the installation of the JSON Machine package.

jm.py requires Python 3 and the installation of the ijson package.

Terminology

In this document, splatting a JSON array is to be understood as producing a stream of the top-level items in the array (one line per item). Similarly, splatting a JSON object means producing a stream of the top-level values, or of the corresponding key-value singleton objects if the -s option is specified. Splatting other JSON values simply means printing them.
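The definition above can be illustrated with a minimal Python sketch. It is an in-memory illustration only: the function name `splat` and its `singleton` parameter are hypothetical, and because `json.loads` parses the whole document at once, the sketch neither streams nor preserves the literal form of numbers or duplicate keys, which is exactly what jm and jm.py are designed to do.

```python
import json


def splat(text, singleton=False):
    """Yield one compact JSON line per top-level item or value.

    Illustration of the terminology only; not how jm or jm.py work.
    """
    value = json.loads(text)
    compact = {"separators": (",", ":")}
    if isinstance(value, list):
        # Splatting an array: one line per top-level item.
        for item in value:
            yield json.dumps(item, **compact)
    elif isinstance(value, dict):
        for key, val in value.items():
            # With singleton=True, mimic -s: emit {key: value} objects.
            yield json.dumps({key: val} if singleton else val, **compact)
    else:
        # Any other JSON value is simply printed.
        yield json.dumps(value, **compact)


for line in splat('{"a": 1,"b": "2", "c": {"d": 3} }'):
    print(line)
```

Calling `splat(text, singleton=True)` on an object yields the single-key objects instead of the bare values.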

jm and jm.py similarities and differences

The two scripts are quite similar in terms of capabilities and typical usage, but there are important differences, notably in the way paths to subdocuments are specified: jm uses JSON Pointer (RFC 6901), whereas jm.py has a less comprehensive notation. As already illustrated, however, typically no path need be specified at all.

If the yajl2 library has been installed (as by brew install yajl) then jm.py will ignore /* C-style comments. */

Numbers

jm.py preserves the precision of numbers, at least to the extent that the ijson and simplejson packages allow.

The way in which jm prints JSON numbers depends on the --recode and --bigint_as_string options, which are mutually exclusive:

  • --recode causes all JSON numbers to be presented as PHP numeric values with the potential loss of information this implies;
  • --bigint_as_string causes JSON "big integers" to be converted to strings to avoid loss of information, but other numbers will be converted to PHP numeric values;
  • if neither of these options is specified, the literal form of numbers is preserved.
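To see why the literal form of a number matters, here is a small Python sketch of the general problem. It uses only the standard `json` and `decimal` modules and says nothing about how jm (a PHP script) is implemented; it merely shows the kind of information that is lost when a JSON number is converted to a native floating-point value.

```python
import json
from decimal import Decimal

text = '[10000000000000000000002, 3.0000000000000000000004]'

# Default decoding: the second number becomes a binary float and is rounded.
as_float = json.loads(text)
print(as_float[1])

# Decoding non-integer numbers as Decimal keeps every digit of the literal.
as_decimal = json.loads(text, parse_float=Decimal)
print(as_decimal[1])

# Python integers have arbitrary precision, but converting the big integer
# to a float (as a language with only 64-bit numbers might) rounds it.
print(float(as_float[0]))
```

The first number survives in Python only because Python integers are unbounded; in environments with fixed-width numbers, an option such as --bigint_as_string is one way to avoid the loss.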

Synopsis of jm

    Usage: jm [ OPTIONS ]  [ FILEPATH ... ]
    or:    jm [-h | --help]
    where FILEPATH defaults to stdin, and the other options are:
         -s | --keys | --tag KEYNAME
         --array
         --bigint_as_string | --recode
         --count | --limit=LIMIT
         --pointer=JSONPOINTER *
         --version

    The --tag option precludes the --array, --keys, and -s options.
    JSONPOINTER defaults to ''.
    * Several pointers may be specified by repeating the --pointer option.

For details, simply invoke the script with the --help option, or review the documentation contained within the script itself.

Synopsis of jm.py

    usage: jm.py [-h] [-i IPATH] [-s] [--values] [-k] [--count] [--limit LIMIT] [--tag KEYNAME] [-v] [-V] [filename ...]

    Stream a JSON array or object.

    positional arguments:
      filename

    options:
      -h, --help            show this help message and exit
      -i IPATH, --ipath IPATH
                            the ijson path to the object or array to be streamed
      -s, --singleton       stream JSON objects as single-key objects
      --values              stream JSON objects by printing the values of their keys
      -k, --keys            stream JSON objects by printing their keys
      --count               count the number of lines that would be printed
      --limit LIMIT         limit the number of JSON values (lines) printed
      --tag KEYNAME         instead of emitting a line X of JSON, emit TAG<tab>X where TAG is determined by KEYNAME (see below)
      -v                    verbose mode
      -V, --version         show program's version number and exit

    The --limit and --count options are mutually exclusive,
    as are the --tag, --s, --keys and --values options.

For details, simply invoke the script with the --help option, or review the documentation contained within the script itself.
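The synopsis does not spell out how TAG is derived from KEYNAME, but example (2) in the Examples section suggests a plausible reading: for an item that is an object containing KEYNAME, TAG is the JSON value stored under that key; otherwise TAG is empty. The following Python sketch encodes that assumption. The function name `tagged_lines` is hypothetical, the rule is inferred from the example rather than taken from the scripts, and the input is parsed in memory rather than streamed.

```python
import json


def tagged_lines(text, keyname):
    """Yield TAG<tab>X lines for the top-level items of a JSON array.

    Assumption inferred from example (2): TAG is the JSON value stored
    under keyname when the item is an object with that key, else empty.
    """
    for item in json.loads(text):
        if isinstance(item, dict) and keyname in item:
            tag = json.dumps(item[keyname])
        else:
            tag = ""
        yield tag + "\t" + json.dumps(item)


for line in tagged_lines('[{"a": 1}, {"a": [2]}, {"b": [3]}]', "a"):
    # Make the tab visible, as the sed command in example (2) does.
    print(line.replace("\t", "<tab>"))
```

A tagged stream of this shape is convenient for line-oriented tools such as sort or join, since the tag occupies the first tab-separated field.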

Examples:

In these examples, $JM means that jm and jm.py can be used interchangeably.

(1) $JM <<< '[1,"2", {"a": 4}, [5.0000000000000000000000000006]]'
yields:

    1
    "2"
    {"a":4}
    [5.0000000000000000000000000006]

(2) $JM --tag a <<< '[{"a": 1}, {"a": [2]}, {"b": [3]}]' | sed 's/\t/<tab>/'
yields

    1<tab>{"a": 1}
    [2]<tab>{"a": [2]}
    <tab>{"b": [3]}

(3) jm --keys <<< '{"a": 1, "b": [2]}'
is equivalent to
jm.py --keys --ipath '' <<< '{"a": 1, "b": [2]}'
Both yield:

    "a"
    "b"

(4) jm <<< '{"a": 1, "b": [2,3]}'
is equivalent to
jm.py --ipath '' --values <<< '{"a": 1, "b": [2,3]}'
Both yield:

    1
    [2,3]

(5) jm -s <<< '{"a": 1, "b": [2,3]}'
is equivalent to
jm.py --ipath '' -s <<< '{"a": 1, "b": [2,3]}'
Both yield:

    {"a": 1}
    {"b": [2,3]}

(6) jm --pointer "/results" <<< '{"results": {"a": 1, "b": [2,3]}}'
is equivalent to
jm.py --values --ipath "results" <<< '{"results": {"a": 1, "b": [2,3]}}'
Both yield the same stream as (4) above, namely:

    1
    [2, 3]

(7) jm --bigint_as_string <<< '[10000000000000000000002, 3.0000000000000000000004]'
yields

    "10000000000000000000002"
    3

(8) jm --array <<< '{"a": 1, "b": [2,3]}'
yields

    [1,[2,3]]

(9) jm --recode <(echo '[1.000000000000000001,20000000000000000003]')
yields

    1
    2.0e+19

(10) jm --pointer "/-" <<< '[1,[2,3]]'
yields

    1
    2
    3

Note that in the last example, the JSON Pointer "/-" points in turn to the items in the top-level array (i.e. 1 and then [2,3]), and that streaming 1 produces 1, and streaming [2,3] produces 2 and then 3.
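The behaviour described in the note can be sketched in Python as pointer resolution followed by splatting. This is a simplified, in-memory illustration with hypothetical function names (`unescape`, `resolve`, `splat`), not the code used by jm or JSON Machine. In particular, treating "-" as "each array item in turn" follows the note above; in RFC 6901 itself, "-" denotes the nonexistent element after the last array element.

```python
import json


def unescape(token):
    # RFC 6901: "~1" stands for "/" and "~0" for "~" (decode "~1" first).
    return token.replace("~1", "/").replace("~0", "~")


def resolve(value, pointer):
    """Yield every value that pointer selects within value."""
    if pointer == "":
        # The empty pointer refers to the whole document.
        yield value
        return
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    tokens = [unescape(t) for t in pointer[1:].split("/")]
    yield from _walk(value, tokens)


def _walk(value, tokens):
    if not tokens:
        yield value
        return
    head, rest = tokens[0], tokens[1:]
    if isinstance(value, list):
        if head == "-":
            # Assumed wildcard: visit each array item in turn.
            for item in value:
                yield from _walk(item, rest)
        elif head.isascii() and head.isdigit() and int(head) < len(value):
            yield from _walk(value[int(head)], rest)
    elif isinstance(value, dict) and head in value:
        yield from _walk(value[head], rest)


def splat(value):
    """Yield the top-level items or values; scalars yield themselves."""
    if isinstance(value, list):
        yield from value
    elif isinstance(value, dict):
        yield from value.values()
    else:
        yield value


doc = json.loads('[1,[2,3]]')
for target in resolve(doc, "/-"):
    for item in splat(target):
        print(json.dumps(item))
```

Here `resolve` first selects 1 and then [2,3], and splatting each selection in turn gives the three output lines of example (10).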

Installation of jm

(1) Install "JSON Machine"

The simplest way to install "JSON Machine" is usually to run:

composer require halaxa/json-machine

in the user's home directory or in the same directory in which you intend to install the jm script.

To install composer, you could try brew install composer using homebrew. See "Additional Documentation" below for further details and alternatives.

If you wish to clone or download the JSON Machine repository instead of installing it using composer, then please note that jm will assume it resides in the directory ~/github/json-machine/.

(2) Download the file named jm from this repository.

If at all possible, ensure it is executable (e.g. chmod +x jm). Otherwise, it can still be run as a PHP script, e.g. for help:

php jm --help

Installation of jm.py

  1. Ensure that both simplejson and ijson are installed, e.g.:
    pip install simplejson
    pip install ijson
  2. Install yajl (OPTIONAL)

If you want to strip /* C-style comments */ from the input files, then ensure the yajl library has been installed (e.g. via brew install yajl).

  3. Download the file named jm.py in the bin directory of this repository.

If at all possible, ensure it is executable (e.g. chmod +x jm.py). Otherwise, it can still be run as a python3 script, e.g. for help:

python3 jm.py --help

Performance Comparisons

The following two tables show some performance metrics for two queries against a "real-world" JSON data set of about 1.5GB: the traffic violations data set from Montgomery County, MD. Further details about the file are given below.

The entries in the tables are based on runs of /usr/bin/time -lp on a 3GHz Mac Mini.

In the tables, u+s is the user+system time in seconds, and mrss is the maximum resident set size in MB. To facilitate comparison, the first entry of the first table shows metrics for the wc command for counting the number of lines in the file.

Except for this first line, the queries in the first table compute the length of the .data array (i.e. 1829779):

    u+s      mrss (MB)   command
    1.3s     1.7         wc -l
    169s     1.3         jm --pointer=/data --count
    306s     1.8         jm.py -i data.item --count
    40s      3987.0      jq '.data|length'
    42s      6442.0      gojq '.data|length'
    53s      7330.0      fq '.data|length'
    46s      10346.0     jaq '.data|length'

In the second table, the queries extract the value of .meta.view.createdAt (i.e. 1403103517):

    u+s      mrss (MB)   command
    0.0s     2.0         jq -n --stream "$CMD"
    0.0s     3.4         gojq -n --stream "$CMD"
    0.1s     13.6        jm --pointer=/meta/view/createdAt
    0.2s     17.6        jm.py -i meta.view.createdAt --limit 1
    233.0s   18.0        jm.py -i meta.view.createdAt
    50.3s    3869.1      jq .meta.view.createdAt
    50.3s    6108.6      gojq .meta.view.createdAt
    57.3s    8252.4      fq .meta.view.createdAt
    56.2s    10474.1     jaq .meta.view.createdAt

In the first two rows of this table, $CMD is defined by:

    CMD='first(inputs|select(length==2 and .[0]==["meta","view","createdAt"]))|.[1]'

The file used in all cases was obtained on Jan 11, 2023 from https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json
It is archived at https://web.archive.org/web/20230112063656/https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json

The file size is 1,459,336,880 bytes, and the .meta.view.createdAt value in the file is 1403103517.

Additional Documentation

Acknowledgements

Special thanks to https://github.com/halaxa, the creator of "JSON Machine".

Thanks also to the creators and maintainers of ijson.
