# vastjson

VastJSON library in C++: structured JSON with top-level cached items for giant JSON files, using `nlohmann::json`.
This project emerged from a practical need... see nlohmann::json Issue 1613.
Names like Big, Large and Huge were already taken... so we found something BIGGER, LARGER... and FAST! So it's VastJSON :)
Right now, this works fine for large JSON objects/dictionaries, a mode called *vast objects*. The way it is now, we could also easily support *vast lists* (a single list with thousands of elements), where indexing is partially done. And maybe, with more effort, allow these modes to cooperate in some hybrid strategy, where the user "points out" where the big parts of the JSON are, e.g., "B" -> "B1" is a big list; "C" is a big object; or "root" is a big object (current mode); etc. This would be nice for general-purpose use, but not trivial to implement now.
Currently, these modes are supported:

- `BIG_ROOT_DICT_NO_ROOT_LIST`: the JSON consists of a huge dictionary/object, without any list as a top-level element (and some other possible small bugs... see the `hasError` flag and warnings)
- `BIG_ROOT_DICT_GENERIC`: the JSON consists of a huge dictionary/object (no constraints regarding format or top-level fields)

The more constrained mode should be the fastest (currently `BIG_ROOT_DICT_NO_ROOT_LIST`).
To run the tests:

```
cd tests && make
```
This is "almost" a single-header library, named VastJSON.hpp; just copy it into your project, but remember to also copy its only dependency: json.hpp.
If you prefer, you can blend these files together into a single header file (maybe we can also provide that for future official releases).
There exist amazing libraries for JSON in C++, such as nlohmann::json and RapidJSON.
But...
Imagine you have a JSON file with thousands of top-level entries... I had a JSON file on disk with 1.5GB size, and when it got loaded into memory, it surpassed 10GB!
It's not entirely clear to me why this happens, as the overhead of the JSON structure should be tiny, but I needed a solution anyway... and now I share it with you all.
The idea is simple:
- instead of completely loading the JSON structure into memory, the user is able to lazy-load only specific entries, thus reducing the burden of processing the JSON file.
- the user can drop the JSON structure from memory whenever desired, while keeping a string version of it for future re-loads, if necessary
- the user can also permanently drop entries, if not intending to use them anymore
Sure. The best example is a situation like this:

```
{
    "A" : { /* large amount of data here */ },
    "B" : { /* large amount of data here */ },
    /* thousands of entries here */
    "ZZZZZ..." : { /* large amount of data here */ }
}
```

In this case, VastJSON would load the structure while avoiding parsing internal items, only those at the top level.
Example in C++:
```cpp
std::ifstream if_test("demo/test.json");
vastjson::VastJSON bigj(if_test);
std::cout << "LOADED #KEYS = " << bigj.size() << std::endl;
```
Sure. Imagine you have a SINGLE (or a few) top-level entries, with all the burden inside:

```
{
    "single entry" : { /* large amount of data here */ }
}
```

Definitely do NOT use this library if that's your case.
Currently, it uses the nice json library from nlohmann.
Right now, this is already used successfully for very large databases!
This library must be included before `#include <nlohmann/json.hpp>`, since it pre-defines some parsing operations; see explanation here.
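To illustrate the include-order requirement, a usage fragment (header paths taken from this README):

```cpp
// VastJSON.hpp must come BEFORE nlohmann's json.hpp,
// since it pre-defines some parsing operations.
#include <vastjson/VastJSON.hpp>  // first
#include <nlohmann/json.hpp>      // after, if you also include it directly
```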
The mode `BIG_ROOT_DICT_GENERIC` depends on exceptions (because of this issue with nlohmann::json), while `BIG_ROOT_DICT_NO_ROOT_LIST` does not. It seems to be possible to fix this with custom SAX parsers, but it's not done yet.
Now multiple strategies are supported; some are faster and riskier, such as `BIG_ROOT_DICT_NO_ROOT_LIST`, and some are safer, such as `BIG_ROOT_DICT_GENERIC`.
For example, the fast approach of `BIG_ROOT_DICT_NO_ROOT_LIST` allows parsing errors to occur, so it will inform the user with a warning message and flag `hasError = true`. This is the correct/expected behavior, not a bug. If some parsing error occurs "silently" (without triggering warnings), this may be considered a bug, so feel free to file an Issue here.
Note that this library is not meant for strict validation of the integrity of large JSON files (which we assume to be correct); it is focused only on fast and safe processing of big JSON files. Adding more tests to the `tests/` folder is certainly a nice way to increase project safety.
Consider JSON:
```json
{
    "A" : { },
    "B" : { "B1": 10, "B2": "abcd" },
    "Z" : { }
}
```
And file `main.cpp` (see `demo/` folder):
```cpp
#include <vastjson/VastJSON.hpp>
#include <iostream>
// ...
int main() {
    std::ifstream if_test2("demo/test2.json");
    vastjson::VastJSON bigj2(if_test2); // standard BIG_ROOT_DICT_GENERIC
    std::cout << "LOADED #KEYS = " << bigj2.size() << std::endl; // 3
    std::cout << bigj2["A"] << std::endl;
    std::cout << bigj2["B"]["B2"] << std::endl;
    std::cout << bigj2["Z"] << std::endl;
    // ...
}
```
If stream file ownership is given to VastJSON, it will only consume the necessary parts of the JSON.
```json
{
    "A" : { },
    "B" : { "B1": 10, "B2": "abcd" },
    "C" : { },
    "Z" : { }
}
```
```cpp
// using non-standard strategy BIG_ROOT_DICT_NO_ROOT_LIST
vastjson::VastJSON bigj3(new std::ifstream("demo/test3.json"), BIG_ROOT_DICT_NO_ROOT_LIST);
// pending operations
std::cout << "isPending(): " << bigj3.isPending() << std::endl;
// cache size
std::cout << "cacheSize(): " << bigj3.cacheSize() << std::endl;
std::cout << "getUntil(\"\",1) first key is found" << std::endl;
// get first keys
bigj3.getUntil("", 1);
// iterate over top-level keys (cached only!)
for (auto it = bigj3.begin(); it != bigj3.end(); it++)
    std::cout << it->first << std::endl;
// direct access will load more
std::cout << "direct access to bigj3[\"B\"][\"B1\"] = " << bigj3["B"]["B1"] << std::endl;
// cache size
std::cout << "cacheSize(): " << bigj3.cacheSize() << std::endl;
// still pending
std::cout << "isPending(): " << bigj3.isPending() << std::endl;
// iterate over top-level keys (cached only!)
for (auto it = bigj3.begin(); it != bigj3.end(); it++)
    std::cout << it->first << std::endl;
// real size (will force performing top-level indexing)
std::cout << "compute size will force top-level indexing...\nsize(): " << bigj3.size() << std::endl;
// cache size
std::cout << "cacheSize(): " << bigj3.cacheSize() << std::endl;
// not pending anymore
std::cout << "isPending(): " << bigj3.isPending() << std::endl;
// iterate over top-level keys (cached only!)
for (auto it = bigj3.begin(); it != bigj3.end(); it++)
    std::cout << it->first << std::endl;
```
Output:
```
isPending(): 1
cacheSize(): 0
getUntil("",1) first key is found
A
direct access to bigj3["B"]["B1"] = 10
cacheSize(): 2
isPending(): 1
A
B
compute size will force top-level indexing...
size(): 4
cacheSize(): 4
isPending(): 0
A
B
C
Z
```
```
bazel build ...
./bazel-bin/app_demo
```
Output should be:
```
LOADED #KEYS = 3
{}
"abcd"
{}
```
If you use Bazel Build, just add this into your `WORKSPACE.bazel`:
```
git_repository(
    name = 'VastJSON',
    remote = 'https://github.com/igormcoelho/vastjson.git',
    branch = 'main'
)
```
In `BUILD.bazel`, add this dependency:
```
deps = ["@VastJSON//src/vastjson:vastjson_lib"]
```
Then, just `#include <vastjson/VastJSON.hpp>` in your C++ source files.
MIT License