# zimdjson

Parsing gigabytes of JSON per second. A Zig port of simdjson with some fundamental features added.
JSON is everywhere on the Internet. Servers spend a lot of time parsing it. We need a fresh approach.
Welcome to zimdjson: a high-performance JSON parser that takes advantage of SIMD vector instructions, based on the paper *Parsing Gigabytes of JSON per Second*.
The majority of the source code is based on the C++ implementation at https://github.com/simdjson/simdjson, with the addition of some fundamental features:
- Streaming support, which can handle arbitrarily large documents with O(1) memory usage.
- An ergonomic, Serde-like deserialization interface thanks to Zig's compile-time reflection. See Reflection-based JSON.
- More efficient memory usage.
Install the zimdjson library by running the following command in your project root:
zig fetch --save git+https://github.com/ezequielramis/zimdjson#0.1.1
Then write the following in your `build.zig`:
```zig
const zimdjson = b.dependency("zimdjson", .{});
exe.root_module.addImport("zimdjson", zimdjson.module("zimdjson"));
```
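For context, a minimal `build.zig` wiring the module into an executable might look like the sketch below. This assumes a recent Zig build API (0.13/0.14-era); the names `example` and `src/main.zig` are placeholders for your own project.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // "example" and "src/main.zig" are placeholders for your project.
    const exe = b.addExecutable(.{
        .name = "example",
        .root_source_file = b.path("src/main.zig"),
        .target = target,
        .optimize = optimize,
    });

    // The two lines shown in the snippet above.
    const zimdjson = b.dependency("zimdjson", .{});
    exe.root_module.addImport("zimdjson", zimdjson.module("zimdjson"));

    b.installArtifact(exe);

    // Optional: a `zig build run` step for convenience.
    const run_cmd = b.addRunArtifact(exe);
    const run_step = b.step("run", "Run the example");
    run_step.dependOn(&run_cmd.step);
}
```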
As an example, download a sample file called `twitter.json`. Then execute the following:
```zig
const std = @import("std");
const zimdjson = @import("zimdjson");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.StreamParser(.default).init;
    defer parser.deinit(allocator);

    const file = try std.fs.cwd().openFile("twitter.json", .{});
    defer file.close();

    const document = try parser.parseFromReader(allocator, file.reader().any());

    const metadata_count = try document.at("search_metadata").at("count").asUnsigned();
    std.debug.print("{} results.", .{metadata_count});
}
```
```
> zig build run
100 results.
```
To see how the streaming parser above handles multi-gigabyte JSON documents with minimal memory usage, download one of these dumps or play with a file of your choice.
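The program stays the same; only the input file changes. A minimal sketch, where `systemsPopulated.json` stands in for whichever dump you downloaded:

```zig
const std = @import("std");
const zimdjson = @import("zimdjson");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.StreamParser(.default).init;
    defer parser.deinit(allocator);

    // Placeholder path: any multi-gigabyte JSON document works here.
    const file = try std.fs.cwd().openFile("systemsPopulated.json", .{});
    defer file.close();

    // The streaming parser consumes the reader incrementally, so memory
    // usage stays bounded regardless of the document's size.
    const document = try parser.parseFromReader(allocator, file.reader().any());
    _ = document; // query it with .at(...) as in the example above
}
```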
Currently, Linux, Windows, and macOS targets with SIMD-capable CPUs are supported. Missing targets can be added by contributing.
The most recent documentation can be found at https://zimdjson.ramis.ar.
Although the provided interfaces are simple enough, deserializing many data structures by hand tends to produce unnecessary boilerplate. Thanks to Zig's compile-time reflection, we can eliminate it:
```zig
const std = @import("std");
const zimdjson = @import("zimdjson");

const Film = struct {
    name: []const u8,
    year: u32,
    characters: []const []const u8, // we could also use std.ArrayListUnmanaged([]const u8)
};

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}).init;
    const allocator = gpa.allocator();

    var parser = zimdjson.ondemand.FullParser(.default).init;
    defer parser.deinit(allocator);

    const json =
        \\{
        \\  "name": "Esperando la carroza",
        \\  "year": 1985,
        \\  "characters": [
        \\    "Mamá Cora",
        \\    "Antonio",
        \\    "Sergio",
        \\    "Emilia",
        \\    "Jorge"
        \\  ]
        \\}
    ;

    const document = try parser.parseFromSlice(allocator, json);
    const film = try document.as(Film, allocator, .{});
    defer film.deinit();

    try std.testing.expectEqualDeep(
        Film{
            .name = "Esperando la carroza",
            .year = 1985,
            .characters = &.{
                "Mamá Cora",
                "Antonio",
                "Sergio",
                "Emilia",
                "Jorge",
            },
        },
        film.value,
    );
}
```
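As the comment in the struct above hints, standard-library containers can be used in place of plain slices. A minimal variant of the same schema:

```zig
const std = @import("std");

// Same Film schema as above, but `characters` is deserialized into a
// standard-library container instead of a slice.
const Film = struct {
    name: []const u8,
    year: u32,
    characters: std.ArrayListUnmanaged([]const u8),
};
```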
This is just a simple example, but this way of deserializing is as powerful as Serde, so there are many more features we can use, such as:
- Deserializing data structures from the Zig Standard Library.
- Renaming fields.
- Using different union representations.
- Custom handling of unknown fields.
To see all available options it offers, check out its reference.
To see all supported Zig Standard Library data structures, check out this list.
To see how it can really be used, check out the test suite for more examples.
> **Note:** As a rule of thumb, do not trust any benchmark; always verify it yourself. There may be biases that favor a particular candidate, including mine.
The following picture represents parsing speed in GB/s for similar tasks presented in the paper *On-Demand JSON: A Better Way to Parse Documents?*, where the first three tasks iterate over `twitter.json` and the others iterate over a 626 MB JSON file called `systemsPopulated.json` from these dumps.
At first glance the benchmark may look broken, but it is not: caching favors small files, and the streaming parser happily stops as soon as it finds the tweet in the middle of the file.
Let's remove that task to see the other results more clearly.
The following picture corresponds to a second simple benchmark, representing parsing speed in GB/s for near-complete parsing of the `twitter.json` file with reflection-based parsers (`serde_json`, `std.json`).
Note: If you look closely, you'll notice that "zimdjson (On-Demand, Unordered)" is the slowest of all. This is, unfortunately, a behaviour that also occurs with `simdjson` when object keys are unordered. If you do not know the key order, it can be mitigated by using a schema. Thanks to the `glaze` library author for pointing this out.
All benchmarks were run on a 3.30 GHz Intel Skylake processor.