jm and jm.py are scripts which make it easy to splat (that is, to stream the top-level values in) JSON arrays or JSON objects losslessly, even if they occur in very large JSON documents. (Losslessly here refers primarily to numeric precision, but jm also handles duplicate keys within a JSON object losslessly.)
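To see why duplicate keys need special care, here is a minimal Python sketch using only the standard json module (it is not taken from jm or jm.py). It shows that a plain load keeps only the last of two duplicate keys, whereas the object_pairs_hook parameter exposes every pair. The sample document is made up for illustration.

```python
import json

# Hypothetical sample document with a duplicated key "a".
DOC = '{"a": 1, "a": 2, "b": 3}'

# A plain load builds a dict, so the first "a" is silently overwritten.
as_dict = json.loads(DOC)

# object_pairs_hook receives every (key, value) pair in document order.
as_pairs = json.loads(DOC, object_pairs_hook=list)

print(as_dict)
print(as_pairs)
```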
Once installed, each script is typically trivial or very easy to use, e.g. to splat the top-level array of a JSON document in a file named input.json one could write:
jm input.json
or
jm.py input.json
For example:
    $ jm <<< '{"a": 1,"b": "2", "c": {"d": 3} }'
    1
    "2"
    {"d":3}
Further examples and variations are shown below.
For large inputs, jm is typically 3 or more times faster than jm.py and consumes significantly less memory, but jm.py can be made to ignore comments, and Pythonistas might find jm.py of interest as it is easy to modify.
jm requires PHP 8 and the JSON Machine package.
jm.py requires Python 3 and the ijson package.
In this document, splatting a JSON array is to be understood as producing a stream of the top-level items in the array (one line per item), and similarly, streaming a JSON object means producing a stream of the top-level values, or of the corresponding key-value singleton objects if the -s option is specified. Splatting other JSON values simply means printing them.
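The definition above can be sketched in a few lines of Python. This is only an illustration of the output semantics: it uses the standard json module, loads the whole document into memory, and so is neither streaming nor lossless, unlike jm and jm.py. The function name splat and its singleton parameter are hypothetical.

```python
import json

def splat(value, singleton=False):
    """Yield one compact JSON text per top-level item of `value`."""
    if isinstance(value, list):
        items = value
    elif isinstance(value, dict):
        # With singleton=True, emit {key: value} objects instead of bare values.
        items = [{k: v} for k, v in value.items()] if singleton else value.values()
    else:
        # Any other JSON value is simply printed.
        items = [value]
    for item in items:
        yield json.dumps(item, separators=(",", ":"))

for line in splat(json.loads('{"a": 1,"b": "2", "c": {"d": 3} }')):
    print(line)
```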
The two scripts are quite similar in terms of capabilities and typical usage, but there are important differences, notably in the way paths to subdocuments are specified: jm uses JSON Pointer (RFC 6901), whereas jm.py has a less comprehensive notation. As already illustrated, however, typically no path need be specified at all.
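For readers unfamiliar with JSON Pointer, the following Python sketch shows how an RFC 6901 pointer selects a subdocument of an already-parsed value. It is a simplified illustration, not the code used by jm: the function name resolve_pointer is hypothetical, array indices are not checked for leading zeros, and the "-" token is not handled.

```python
def resolve_pointer(doc, pointer):
    """Return the value that an RFC 6901 JSON Pointer identifies in `doc`."""
    if pointer == "":
        return doc  # the empty pointer refers to the whole document
    if not pointer.startswith("/"):
        raise ValueError("a non-empty JSON Pointer must start with '/'")
    current = doc
    for raw in pointer[1:].split("/"):
        # Unescape in the order the RFC requires: "~1" first, then "~0".
        token = raw.replace("~1", "/").replace("~0", "~")
        if isinstance(current, list):
            current = current[int(token)]  # simplification: minimal index validation
        else:
            current = current[token]
    return current

sample = {"results": {"a": 1, "b": [2, 3]}}
print(resolve_pointer(sample, "/results/b/1"))
```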
If the yajl2 library has been installed (as by brew install yajl) then jm.py will ignore /* C-style comments. */
jm.py preserves the precision of numbers, at least to the extent that the ijson and simplejson packages allow.
The way in which jm prints JSON numbers depends on the --recode and --bigint_as_string options, which are mutually exclusive:
- --recode causes all JSON numbers to be presented as PHP numeric values with the potential loss of information this implies;
- --bigint_as_string causes JSON "big integers" to be converted to strings to avoid loss of information, but other numbers will be converted to PHP numeric values;
- if neither of these options is specified, the literal form of numbers is preserved.
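The kind of information loss at stake can be illustrated in Python with the standard json module. This sketch is not how jm or jm.py handle numbers; it only shows that converting a long decimal literal to a native float drops digits, whereas keeping it as a Decimal does not. Note that Python integers have arbitrary precision, so the "big integer" problem that --bigint_as_string addresses in PHP does not arise here.

```python
import json
from decimal import Decimal

TEXT = "[5.0000000000000000000000000006, 10000000000000000000002]"

# Default parsing turns the first number into a binary float, dropping digits.
lossy = json.loads(TEXT)

# Parsing floats as Decimal keeps every digit of the literal.
exact = json.loads(TEXT, parse_float=Decimal)

print(lossy[0])
print(exact[0])
print(lossy[1])
```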
    Usage: jm [ OPTIONS ] [ FILEPATH ... ]
    or:    jm [-h | --help]
    where FILEPATH defaults to stdin, and the other options are:
        -s | --keys | --tag KEYNAME
        --array
        --bigint_as_string | --recode
        --count | --limit=LIMIT
        --pointer=JSONPOINTER *
        --version

    The --tag option precludes the --array, --keys, and -s options.
    JSONPOINTER defaults to ''.
    * Several pointers may be specified by repeating the --pointer option.

For details, simply invoke the script with the --help option, or review the documentation contained within the script itself.
    usage: jm.py [-h] [-i IPATH] [-s] [--values] [-k] [--count] [--limit LIMIT]
                 [--tag KEYNAME] [-v] [-V]
                 [filename ...]

    Stream a JSON array or object.

    positional arguments:
      filename

    options:
      -h, --help            show this help message and exit
      -i IPATH, --ipath IPATH
                            the ijson path to the object or array to be streamed
      -s, --singleton       stream JSON objects as single-key objects
      --values              stream JSON objects by printing the values of their keys
      -k, --keys            stream JSON objects by printing their keys
      --count               count the number of lines that would be printed
      --limit LIMIT         limit the number of JSON values (lines) printed
      --tag KEYNAME         instead of emitting a line X of JSON, emit TAG<tab>X
                            where TAG is determined by KEYNAME (see below)
      -v                    verbose mode
      -V, --version         show program's version number and exit

    The --limit and --count options are mutually exclusive,
    as are the --tag, --s, --keys and --values options.

For details, simply invoke the script with the --help option, or review the documentation contained within the script itself.
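The TAG<tab>X format of the --tag option can be sketched as follows. This is an illustrative reconstruction in Python, not the code of either script: the function name tag_lines is hypothetical, and the treatment of items that lack the key (an empty tag) is an assumption inferred from example (2) below.

```python
import json

def tag_lines(items, keyname):
    """Yield 'TAG<tab>X' lines, where TAG is the JSON text of item[keyname]."""
    for item in items:
        # Assumption: an item without the key gets an empty tag.
        if isinstance(item, dict) and keyname in item:
            tag = json.dumps(item[keyname])
        else:
            tag = ""
        yield f"{tag}\t{json.dumps(item)}"

for line in tag_lines(json.loads('[{"a": 1}, {"a": [2]}, {"b": [3]}]'), "a"):
    print(line.replace("\t", "<tab>"))
```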
In these examples, $JM means that jm and jm.py can be used interchangeably.
(1) $JM <<< '[1,"2", {"a": 4}, [5.0000000000000000000000000006]]'
yields:
    1
    "2"
    {"a":4}
    [5.0000000000000000000000000006]

(2) $JM --tag a <<< '[{"a": 1}, {"a": [2]}, {"b": [3]}]' | sed 's/\t/<tab>/'
yields
    1<tab>{"a": 1}
    [2]<tab>{"a": [2]}
    <tab>{"b": [3]}

(3) jm --keys <<< '{"a": 1, "b": [2]}'
is equivalent to
    jm.py --keys --ipath '' <<< '{"a": 1, "b": [2]}'
Both yield:
    "a"
    "b"

(4) jm <<< '{"a": 1, "b": [2,3]}'
is equivalent to
    jm.py --ipath '' --values <<< '{"a": 1, "b": [2,3]}'
Both yield:
    1
    [2,3]

(5) jm -s <<< '{"a": 1, "b": [2,3]}'
is equivalent to
    jm.py --ipath '' -s <<< '{"a": 1, "b": [2,3]}'
Both yield:
    {"a": 1}
    {"b": [2,3]}

(6) jm --pointer "/results" <<< '{"results": {"a": 1, "b": [2,3]}}'
is equivalent to
    jm.py --values --ipath "results" <<< '{"results": {"a": 1, "b": [2,3]}}'
Both yield the same stream as (4) above, namely:
    1
    [2, 3]

(7) jm --bigint_as_string <<< '[10000000000000000000002, 3.0000000000000000000004]'
yields
    "10000000000000000000002"
    3

(8) jm --array <<< '{"a": 1, "b": [2,3]}'
yields
    [1,[2,3]]

(9) jm --recode <(echo '[1.000000000000000001,20000000000000000003]')
yields
    1
    2.0e+19

(10) jm --pointer "/-" <<< '[1,[2,3]]'
yields
    1
    2
    3
Note that in the last example, the JSON Pointer "/-" points in turn to the items in the top-level array (i.e. 1 and then [2,3]), and that streaming 1 produces 1, and streaming [2,3] produces 2 and then 3.
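The two-level behaviour described in this note can be mimicked in Python as follows. This is a sketch of the resulting output only, working on an already-parsed array rather than a stream; the function name splat_each is hypothetical.

```python
import json

def splat_each(array):
    """Splat every item of `array` in turn, as example (10) describes."""
    for item in array:
        if isinstance(item, list):
            parts = item            # an array contributes its items
        elif isinstance(item, dict):
            parts = item.values()   # an object contributes its values
        else:
            parts = [item]          # any other value is emitted as-is
        for part in parts:
            yield json.dumps(part, separators=(",", ":"))

for line in splat_each(json.loads('[1,[2,3]]')):
    print(line)
```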
(1) Install "JSON Machine"
The simplest way to install "JSON Machine" is usually to run:
composer require halaxa/json-machine
in the user's home directory or in the same directory in which you intend to install the jm script.
To install composer, you could try brew install composer using homebrew. See "Additional Documentation" below for further details and alternatives.
If you wish to clone or download the JSON Machine repository instead of installing it using composer, then please note that jm will assume it resides in the directory ~/github/json-machine/.
(2) Download the file named jm from this repository.
If at all possible, ensure it is executable (e.g. chmod +x jm). Otherwise, it can still be run as a PHP script, e.g. for help:
php jm --help
- Ensure that both simplejson and ijson are installed, e.g.:
    pip install simplejson
    pip install ijson
- Install yajl (OPTIONAL)
If you want to strip /* C-style comments */ from the input files, then ensure the yajl library has been installed (e.g. via brew install yajl).
- Download the file named jm.py in the bin directory of this repository.
If at all possible, ensure it is executable (e.g. chmod +x jm.py). Otherwise, it can still be run as a python3 script, e.g. for help:
python3 jm.py --help
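Before running jm.py, it may be convenient to confirm that the two Python packages mentioned above can be imported. The following sketch uses only the standard library; the function name missing_modules is hypothetical, and the check says nothing about whether the optional yajl library is available.

```python
import importlib.util

def missing_modules(names):
    """Return the subset of `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# An empty list means that both packages are importable.
print(missing_modules(["ijson", "simplejson"]))
```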
The following two tables show some performance metrics for two queries against a "real-world" JSON data set: the traffic violations data set from Montgomery County, MD. The file is about 1.5GB. Further details about it are given below.
The entries in the tables are based on runs of /usr/bin/time -lp on a 3GHz Mac Mini.
In the tables, u+s is the user+system time in seconds, and mrss is the maximum resident set size in MB. To facilitate comparison, the first entry of the first table shows metrics for the wc command for counting the number of lines in the file.
Except for this first line, the queries in the first table compute the length of the .data array (i.e. 1829779):
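For readers who would like to collect comparable figures without /usr/bin/time, here is a rough Python alternative based on the standard resource module (Unix only). It is not the method used for the tables. The function name measure is hypothetical, and the reported peak memory covers all child processes waited for so far, so it is best used with one command per Python process.

```python
import resource
import subprocess
import sys

def measure(argv):
    """Run `argv` and return (user+system seconds, max RSS as reported by the OS)."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    cpu = (after.ru_utime - before.ru_utime) + (after.ru_stime - before.ru_stime)
    # ru_maxrss is in bytes on macOS and in kilobytes on Linux.
    return cpu, after.ru_maxrss

# Trivial stand-in command; replace it with the command to be measured.
cpu_seconds, max_rss = measure([sys.executable, "-c", "pass"])
print(cpu_seconds, max_rss)
```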
u+s | mrss (MB) | command |
---|---|---|
1.3s | 1.7 | wc -l |
169s | 1.3 | jm --pointer=/data --count |
306s | 1.8 | jm.py -i data.item --count |
40s | 3987.0 | jq '.data|length' |
42s | 6442.0 | gojq '.data|length' |
53s | 7330.0 | fq '.data|length' |
46s | 10346.0 | jaq '.data|length' |
In the second table, the queries extract the value of .meta.view.createdAt (i.e. 1403103517):
u+s | mrss (MB) | command |
---|---|---|
0.0s | 2.0 | jq -n --stream "$CMD" |
0.0s | 3.4 | gojq -n --stream "$CMD" |
0.1s | 13.6 | jm --pointer=/meta/view/createdAt |
0.2s | 17.6 | jm.py -i meta.view.createdAt --limit 1 |
233.0s | 18.0 | jm.py -i meta.view.createdAt |
50.3s | 3869.1 | jq .meta.view.createdAt |
50.3s | 6108.6 | gojq .meta.view.createdAt |
57.3s | 8252.4 | fq .meta.view.createdAt |
56.2s | 10474.1 | jaq .meta.view.createdAt |
In the table above, CMD is defined as:
CMD='first(inputs|select(length==2 and .[0]==["meta","view","createdAt"]))|.[1]'
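One plausible reading of the gap between the fastest rows and the jm.py row without --limit is that the fast commands stop reading as soon as the first match is found (jq's first(...) and jm.py's --limit 1), whereas the slower streaming run continues to the end of the file. The Python sketch below illustrates that effect with a made-up event stream; the names events and matches are hypothetical, and the code is not taken from any of the tools measured.

```python
from itertools import islice

def events(log):
    """Hypothetical parser events: one early match, then 1000 unrelated items."""
    stream = [("meta.view.createdAt", 1403103517)]
    stream += [("data.item", i) for i in range(1000)]
    for path, value in stream:
        log.append(path)  # record that this event was actually read
        yield path, value

def matches(source, wanted):
    """Lazily yield the values whose path equals `wanted`."""
    for path, value in source:
        if path == wanted:
            yield value

limited_log, full_log = [], []
# Taking only the first match stops pulling events immediately afterwards.
limited = list(islice(matches(events(limited_log), "meta.view.createdAt"), 1))
# Without a limit, every event is read in case another match follows.
full = list(matches(events(full_log), "meta.view.createdAt"))

print(limited, len(limited_log))
print(full, len(full_log))
```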
The file used in all cases was obtained on Jan 11, 2023 from https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json
It is archived at https://web.archive.org/web/20230112063656/https://data.montgomerycountymd.gov/api/views/4mse-ku6q/rows.json
The file size is 1,459,336,880 bytes, and the .meta.view.createdAt value in the file is 1403103517.
- "JSON Machine" e.g. https://github.com/halaxa/json-machine
- "JSON Pointer" e.g. https://www.rfc-editor.org/rfc/rfc6901#section-5
- composer e.g. https://getcomposer.org/doc/00-intro.md
- homebrew e.g. https://brew.sh
Special thanks to https://github.com/halaxa, the creator of "JSON Machine".
Thanks also to the creators and maintainers of ijson.