
Compressed output format (jsongz) for rethinkdb export/import #251


Open

iantocristian wants to merge 5 commits into rethinkdb:master from iantocristian:gzip-json-export

Conversation

@iantocristian commented Feb 19, 2021 (edited)

Reason for the change
#249

Description
Implemented a feature in the rethinkdb-export and rethinkdb-import scripts: a new export format, jsongz (gzipped JSON).
On the export side, there is a new jsongz writer (based on the json writer implementation) that passes the output through a zlib compressor.
On the import side, JsonGzSourceFile extends JsonSourceFile (slightly modified) and can read from the gzipped JSON data files directly. There is also an addition to the SourceFile constructor to read the uncompressed size from the gzip trailer.
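Reading the uncompressed size from the gzip trailer can be sketched as follows. Per RFC 1952, the last four bytes of a gzip stream (the ISIZE field) hold the uncompressed input size modulo 2^32, so this is only exact for files under 4 GiB. The function name and demo file below are hypothetical, not the PR's code:

```python
import gzip
import os
import struct

def gzip_uncompressed_size(path):
    """Read the ISIZE field from the last 4 bytes of a gzip file.

    Per RFC 1952 this is the uncompressed size modulo 2**32, so it
    is only exact for inputs smaller than 4 GiB.
    """
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack("<I", f.read(4))[0]

# Hypothetical demo file:
path = "demo.json.gz"
with gzip.open(path, "wb") as f:
    f.write(b'[{"id": 1}]\n')  # 12 bytes uncompressed

print(gzip_uncompressed_size(path))  # 12
```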

Checklist

References

Usage:
rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export

Tested with python 2.7.16 and python 3.8.5

@gabor-boros (Member) commented:

Hello @iantocristian 👋
First of all, thank you for your contribution here! 🎉

Could you please add some unit/integration tests for this functionality?

@iantocristian (Author) commented:

👋 @gabor-boros

I would, but it looks like no unit/integration tests exist for the import/export scripts in general 😅. It seems like a big job.

gabor-boros previously approved these changes Mar 23, 2021
@gabor-boros (Member) commented:

@lsabi could you please double check this?

@lsabi (Contributor) left a comment


All in all it seems good. The style has been maintained, but there are missing tests, which could become a problem.

We could write them in a second moment.

Apart from the comments I've added, it seems OK. @gabor-boros do you have anything to add? Especially for the comment about the new line.

if options.format == "jsongz":
    if options.compression_level is None:
        options.compression_level = -1
    elif options.compression_level < 0 or options.compression_level > 9:
@lsabi (Contributor) commented:

What if someone passes -1 as the option?
In my opinion it should be changed into elif options.compression_level < -1 or options.compression_level > 9:

@iantocristian (Author) commented Mar 24, 2021 (edited):

My reasoning was: passing -1 is the same as not specifying the compression level; it's not really setting the compression_level.
Happy to make the change as suggested, though; it might make it easier to switch between setting and not setting the compression level.

@lsabi (Contributor) commented:

I understand your reasoning, but then if someone passes -1, if options.compression_level is None evaluates to False, and if options.compression_level < 0 or options.compression_level > 9 evaluates to True and raises an exception. Which is incorrect, since -1 is an acceptable value.

If you have another suggestion on how to handle such a situation, feel free to add it. Mine was just a potential suggestion on how to handle it.

All in all, it's no big deal to support -1, but it could prevent some errors/exceptions, since I assume the default value is an acceptable value.


Updated as suggested.
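For reference, the agreed-upon validation, which treats -1 as the zlib default, could look roughly like this (a sketch with a hypothetical helper name, not the PR's exact diff):

```python
import zlib

def validate_compression_level(level):
    """Accept None (fall back to the zlib default), -1, or 0-9."""
    if level is None:
        return zlib.Z_DEFAULT_COMPRESSION  # -1
    if level < -1 or level > 9:
        raise ValueError("--compression-level must be between -1 and 9")
    return level

print(validate_compression_level(None))  # -1
print(validate_compression_level(5))     # 5
```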

for item in list(row.keys()):
    if item not in fields:
        del row[item]
if first:
@lsabi (Contributor) commented:

This implies that the objects in the JSON array are each on a separate line.

I'm no compression expert, but since it'll be binary and unreadable from a high-level perspective, why not skip the new line? Objects would be written as follows:
[{...},{...},{...}...{...}]
which would save n + 2 newlines (n - 1 between the objects, plus 2 for the first and the last, plus 1 for the final newline before EOF). On a huge table this could imply saving a considerable amount of bytes. @gabor-boros do you have any knowledge about the topic?

Or maybe I'm wrong and the \n are used for compression. Let me know, I'm curious now.


The code here is replicating what the json export does. The \n characters are not used for compression.

One can still unpack the jsongz file and get the json file inside (it's a standard gzip file), in which case the formatting might help. I considered the gains from removing the \n-s marginal, but you might be right. If you have really small documents, the extra end lines might make a difference.
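The cost of the per-document newlines after compression is easy to measure. A quick experiment with synthetic rows (illustrative data only, not the PR's writer):

```python
import gzip
import json

# Synthetic rows resembling small exported documents.
rows = [json.dumps({"id": i, "name": "user%d" % i}) for i in range(10000)]

# The same array, with and without the per-document newlines.
with_newlines = ("[\n" + ",\n".join(rows) + "\n]\n").encode()
without_newlines = ("[" + ",".join(rows) + "]").encode()

gz_with = gzip.compress(with_newlines)
gz_without = gzip.compress(without_newlines)

print(len(with_newlines) - len(without_newlines))  # 10002 raw bytes saved
print(len(gz_with) - len(gz_without))              # far fewer after gzip
```

Because the repeated ",\n" separator compresses almost as well as ",", the compressed saving is much smaller than the raw one.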

@iantocristian (Author) commented Mar 24, 2021 (edited):

Should say that the import script includes a custom JSON parser (which I found odd; not sure what the reason for using a custom parser was, performance perhaps?) which might be affected by the lack of new lines (I expect it to be happy, though).

@lsabi (Contributor) commented:

if you have really small documents, the extra end lines might make a difference

Did you mean big documents? Hehe

Regarding the custom JSON parser, I have no clue why there's a custom one. Probably, when the library was written, the existing parsers did not fit/match the requirements. Nowadays there are tons of high-performance parsers. In order not to break anything, I would keep the custom one for now.

@iantocristian (Author) commented:

... there are missing tests, which could become a problem. We could write them in a second moment.

I am on the same page here. We need some tests for the backup and restore functionality. But it feels like a different story / PR is in order.

@iantocristian (Author) commented Mar 24, 2021 (edited):

One other comment I got was related to the jsongz extension I used for the data files: why not json.gz? I used jsongz because splitext can't handle json.gz, and json.gz would have required more code changes elsewhere in the scripts.

The downside of using jsongz is that unpacking is more cumbersome, in most cases requiring the extension to be changed before unpacking (e.g. the gzip -d command won't like the jsongz extension). Any thoughts about this?

@lsabi (Contributor) commented:

I haven't worked on the import or export scripts, but an option is to have a list of supported extensions and try to match the last part of the filename against the list.

Another alternative could be to have a hash with the supported extensions pointing towards other extensions, like:

SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}

This way, files ending with json can be decoded immediately, while those ending with gz have to have json before the gz, which is to be checked. Although I don't know how much work performing the switch could be.
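That nested table could be matched against a filename roughly like this (a sketch; detect_format is a hypothetical helper, not code from the scripts):

```python
# Terminal True means "decode now"; a nested dict means the previous
# extension must also match (as in lsabi's suggestion).
SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}

def detect_format(filename):
    """Return the list of matched extensions, outermost last, or None."""
    parts = filename.split(".")
    table = SUPPORTED_EXT
    matched = []
    # Walk the extensions from the end of the filename inwards.
    for ext in reversed(parts[1:]):
        rule = table.get(ext)
        if rule is True:
            matched.append(ext)
            return list(reversed(matched))
        if isinstance(rule, dict):
            matched.append(ext)
            table = rule
            continue
        return None
    return None

print(detect_format("users.json"))     # ['json']
print(detect_format("users.json.gz"))  # ['json', 'gz']
print(detect_format("users.csv.gz"))   # None
```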


@lsabi (Contributor) commented:

To me it looks good.

Only the new line feed, which I don't know if it's worth removing or not.

We can, in a second moment, write tests and check how much it influences the size of the generated file.

@gabor-boros what do you think? For me it can pass.

@iantocristian (Author) commented:

Only the new line feed, which I don't know if it's worth removing or not.

It's one newline per document, right? Plus one at the start and one at the end.

For a table with 1000 documents with an average size of 1 KB, you gain 1002 bytes uncompressed: less than 0.1%.
For a table with 10000 documents with an average size of 200 bytes, you gain 10002 bytes uncompressed: roughly 0.5%.
For anything larger than 1 KB per document, the gain is negligible.
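These estimates are plain arithmetic and easy to reproduce (the helper name here is made up):

```python
def newline_overhead(doc_count, avg_doc_bytes):
    """Uncompressed bytes spent on newlines: one per document,
    plus one after '[' and one before ']'."""
    saved = doc_count + 2
    fraction = saved / (doc_count * avg_doc_bytes)
    return saved, fraction

print(newline_overhead(1000, 1024))  # (1002, ...) -> under 0.1%
print(newline_overhead(10000, 200))  # (10002, ...) -> about 0.5%
```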

Not worth it imo.

Another point is that jsongz = gzipped json. So it should be the same output as json, but compressed.

@lsabi (Contributor) commented:

Percentages vary based on the size of the documents. But if you have a table with millions of records, it implies saving MBs of space. Sure, it'll not be much in comparison to the total size, but it may be better and easier to fit into memory. I'm not sure there'll be such a big table, though.

Nevertheless, as I said, this can be done in a second moment.

What's the point about the jsongz? I don't understand it.

@iantocristian (Author) commented Mar 30, 2021 (edited):

Nevertheless, as I said, this can be done in a second moment.

👍

What's the point about the jsongz? I don't understand it

That it wasn't my intention to change the content that's being dumped, just to compress it.
It could have been an option for the json_writer, but I thought it was less risky to have a separate writer.

@lsabi (Contributor) commented:

Don't worry, we can keep them separate and merge one day.

@gabor-boros do you have anything to add/to complain about this PR?

@AlexC commented:

@lsabi / @gabor-boros just wondering if there was an update on getting this merged in? It would be super useful for us. Thanks


Reviewers

@lsabi left review comments

@gabor-boros left review comments

At least 1 approving review is required to merge this pull request.


4 participants

@iantocristian @gabor-boros @lsabi @AlexC
