# vastjson

VastJSON library in C++: structured JSON with top-level cached items for giant JSON files, using `nlohmann::json`.
This project emerged from a practical need... see nlohmann::json Issue 1613.
Names like Big, Large and Huge were already taken... so we found something BIGGER, LARGER... and FAST! So it's VastJSON :)
Right now, this works fine for large JSON objects/dictionaries, a mode called *vast objects*. The way it is now, we could also easily support *vast lists* (a single list with thousands of elements), where indexing is partially done. And maybe, with more effort, allow these modes to cooperate in some hybrid strategy, where the user "points out" where the big parts of the JSON are, e.g., "B" -> "B1" is a big list; "C" is a big object; or "root" is a big object (current mode); etc. This would be nice for general-purpose use, but not trivial to implement now.
Currently, these modes are supported:

- `BIG_ROOT_DICT_NO_ROOT_LIST`: the JSON consists of a huge dictionary/object, without any list as a top-level element (and some other possible small bugs... see the `hasError` flag and warnings)
- `BIG_ROOT_DICT_GENERIC`: the JSON consists of a huge dictionary/object (no constraints regarding format or top-level fields)

The more constrained mode should be the fastest (currently `BIG_ROOT_DICT_NO_ROOT_LIST`).
To run the tests:

```
cd tests && make
```
This is "almost" a single-header library, named VastJSON.hpp; just copy it into your project, but remember to also copy its only dependency: json.hpp.
If you prefer, you can blend these files together into a single header file (maybe we can also provide that for future official releases).
There exist amazing libraries for JSON in C++, such as nlohmann::json and RapidJSON.
But...
Imagine you have a JSON file with thousands of top-level entries... I had a JSON file on disk with 1.5GB size, and when it got loaded into memory, it surpassed 10GB!
It's not entirely clear to me why this happens, as the overhead of the JSON structure should be tiny, but I needed a solution anyway... and now I share it with you all.
The idea is simple:
- instead of completely loading the JSON structure into memory, the user is able to lazy-load only specific entries, thus reducing the burden of processing the JSON file.
- the user can drop the JSON structure from memory whenever desired, while keeping a string version of it for future re-loads, if necessary
- the user can also permanently drop entries, if not intending to use them anymore
Sure. The best example is a situation like this:

```
{
    "A" : { /* large amount of data here */ },
    "B" : { /* large amount of data here */ },
    /* thousands of entries here */
    "ZZZZZ..." : { /* large amount of data here */ }
}
```

In this case, VastJSON would load the structure while avoiding parsing internal items, only those at the top level.
Example in C++:
```cpp
std::ifstream if_test("demo/test.json");
vastjson::VastJSON bigj(if_test);
std::cout << "LOADED #KEYS = " << bigj.size() << std::endl;
```
Sure. Imagine you have a SINGLE (or a few) top-level entries, with all the burden inside:

```
{
    "single entry" : { /* large amount of data here */ }
}
```

Definitely do NOT use this library if that's your case.
Currently, it uses the nice json library from nlohmann.
Right now, this is already used successfully for very large databases!
This library must be included before `#include <nlohmann/json.hpp>`, since it pre-defines some parsing operations; see explanation here.
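To illustrate the include-order requirement, a usage fragment (header paths taken from this README):

```cpp
// VastJSON.hpp must come BEFORE nlohmann's json.hpp,
// since it pre-defines some parsing operations.
#include <vastjson/VastJSON.hpp>  // first
#include <nlohmann/json.hpp>      // after, if you also include it directly
```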
The mode `BIG_ROOT_DICT_GENERIC` depends on exceptions (because of this issue with nlohmann::json), while `BIG_ROOT_DICT_NO_ROOT_LIST` does not. It seems to be possible to fix this with custom SAX parsers, but it's not done yet.
Now multiple strategies are supported; some are faster and riskier, such as `BIG_ROOT_DICT_NO_ROOT_LIST`, and some are safer, such as `BIG_ROOT_DICT_GENERIC`.
For example, the fast approach of `BIG_ROOT_DICT_NO_ROOT_LIST` allows parsing errors to occur, so it will inform the user with a warning message and flag `hasError = true`. This is the correct/expected behavior, not a bug. If some parsing error occurs "silently" (without triggering warnings), this may be considered a bug, so feel free to file an Issue here.
Note that this library is not meant for strict validation of the integrity of large JSON files (which we assume to be correct); it is focused only on fast and safe processing of big JSON files. Adding more tests to the `tests/` folder is certainly a nice way to increase project safety.
Consider JSON:
```json
{
    "A" : { },
    "B" : { "B1": 10, "B2": "abcd" },
    "Z" : { }
}
```
And file `main.cpp` (see `demo/` folder):
```cpp
#include <vastjson/VastJSON.hpp>
#include <iostream>
// ...
int main() {
    std::ifstream if_test2("demo/test2.json");
    vastjson::VastJSON bigj2(if_test2); // standard BIG_ROOT_DICT_GENERIC
    std::cout << "LOADED #KEYS = " << bigj2.size() << std::endl; // 3
    std::cout << bigj2["A"] << std::endl;
    std::cout << bigj2["B"]["B2"] << std::endl;
    std::cout << bigj2["Z"] << std::endl;
    // ...
}
```
If stream file ownership is given to VastJSON, it will only consume the necessary parts of the JSON.
```json
{
    "A" : { },
    "B" : { "B1": 10, "B2": "abcd" },
    "C" : { },
    "Z" : { }
}
```
```cpp
// using non-standard strategy BIG_ROOT_DICT_NO_ROOT_LIST
vastjson::VastJSON bigj3(new std::ifstream("demo/test3.json"), BIG_ROOT_DICT_NO_ROOT_LIST);
// pending operations
std::cout << "isPending(): " << bigj3.isPending() << std::endl;
// cache size
std::cout << "cacheSize(): " << bigj3.cacheSize() << std::endl;
std::cout << "getUntil(\"\",1) first key is found" << std::endl;
// get first keys
bigj3.getUntil("", 1);
// iterate over top-level keys (cached only!)
for (auto it = bigj3.begin(); it != bigj3.end(); it++)
    std::cout << it->first << std::endl;
// direct access will load more
std::cout << "direct access to bigj3[\"B\"][\"B1\"] = " << bigj3["B"]["B1"] << std::endl;
// cache size
std::cout << "cacheSize(): " << bigj3.cacheSize() << std::endl;
// still pending
std::cout << "isPending(): " << bigj3.isPending() << std::endl;
// iterate over top-level keys (cached only!)
for (auto it = bigj3.begin(); it != bigj3.end(); it++)
    std::cout << it->first << std::endl;
// real size (will force performing top-level indexing)
std::cout << "compute size will force top-level indexing...\nsize(): " << bigj3.size() << std::endl;
// cache size
std::cout << "cacheSize(): " << bigj3.cacheSize() << std::endl;
// not pending anymore
std::cout << "isPending(): " << bigj3.isPending() << std::endl;
// iterate over top-level keys (cached only!)
for (auto it = bigj3.begin(); it != bigj3.end(); it++)
    std::cout << it->first << std::endl;
```
Output:
```
isPending(): 1
cacheSize(): 0
getUntil("",1) first key is found
A
direct access to bigj3["B"]["B1"] = 10
cacheSize(): 2
isPending(): 1
A
B
compute size will force top-level indexing...
size(): 4
cacheSize(): 4
isPending(): 0
A
B
C
Z
```
```
bazel build ...
./bazel-bin/app_demo
```
Output should be:
```
LOADED #KEYS = 3
{}
"abcd"
{}
```
If you use Bazel Build, just add this into your `WORKSPACE.bazel`:
```
git_repository(
    name = 'VastJSON',
    remote = 'https://github.com/igormcoelho/vastjson.git',
    branch = 'main'
)
```
In `BUILD.bazel`, add this dependency:
```
deps = ["@VastJSON//src/vastjson:vastjson_lib"]
```
Then, just `#include <vastjson/VastJSON.hpp>` in your C++ source files.
MIT License