
Compressed output format (jsongz) for rethinkdb export/import #251


Open

iantocristian wants to merge 5 commits into rethinkdb:master from iantocristian:gzip-json-export

Conversation

@iantocristian commented Feb 19, 2021 (edited)

Reason for the change
#249

Description
Implemented a feature in the rethinkdb-export and rethinkdb-import scripts: a new export format, jsongz (gzipped JSON).
On the export side, there is a new jsongz writer (based on the json writer implementation) that passes the output through a zlib compressor.
On the import side, JsonGzSourceFile extends JsonSourceFile (slightly modified) and can read from the gzipped JSON data files directly. There is also an addition to the SourceFile constructor to read the uncompressed size from the gzip trailer.
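Reading the uncompressed size from the gzip trailer can be sketched as follows. Per RFC 1952, the last four bytes of a gzip stream (the ISIZE field) hold the uncompressed input size modulo 2^32, so this is only exact for files under 4 GiB. The function name and demo file below are hypothetical, not the PR's code:

```python
import gzip
import os
import struct

def gzip_uncompressed_size(path):
    """Read the ISIZE field from the last 4 bytes of a gzip file.

    Per RFC 1952 this is the uncompressed size modulo 2**32, so it
    is only exact for inputs smaller than 4 GiB.
    """
    with open(path, "rb") as f:
        f.seek(-4, os.SEEK_END)
        return struct.unpack("<I", f.read(4))[0]

# Hypothetical demo file:
path = "demo.json.gz"
with gzip.open(path, "wb") as f:
    f.write(b'[{"id": 1}]\n')  # 12 bytes uncompressed

print(gzip_uncompressed_size(path))  # 12
```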

Checklist

References

Usage:
rethinkdb-export -e test -d export --format jsongz
rethinkdb-export -e test -d export --format jsongz --compression-level 5
rethinkdb-import -i test -d export

Tested with python 2.7.16 and python 3.8.5

@gabor-boros (Member) commented:

Hello @iantocristian 👋
First of all, thank you for your contribution here! 🎉

Could you please add some unit/integration tests for this functionality?

@iantocristian (Author) commented:

👋 @gabor-boros

I would, but it looks like no unit/integration tests exist for the import/export scripts in general 😅. It seems like a big job.

gabor-boros previously approved these changes Mar 23, 2021
@gabor-boros (Member) commented:

@lsabi could you please double check this?

@lsabi (Contributor) left a comment


All in all it seems good. The style has been maintained, but there are missing tests, which could become a problem.

We could write them in a second moment.

Apart from the comments I've added, it seems OK. @gabor-boros do you have anything to add? Especially for the comment about the new line.

if options.format == "jsongz":
    if options.compression_level is None:
        options.compression_level = -1
    elif options.compression_level < 0 or options.compression_level > 9:
@lsabi (Contributor) commented:

What if someone passes -1 as the option?
In my opinion it should be changed into elif options.compression_level < -1 or options.compression_level > 9:

@iantocristian (Author) commented Mar 24, 2021 (edited):

My reasoning was: passing -1 is the same as not specifying the compression level; it's not really setting the compression_level.
Happy to make the change as suggested, though; it might make it easier to switch between setting and not setting the compression level.

@lsabi (Contributor) commented:

I understand your reasoning, but then if someone passes -1, if options.compression_level is None evaluates to False, and if options.compression_level < 0 or options.compression_level > 9 evaluates to True and raises an exception. Which is incorrect, since -1 is an acceptable value.

If you have another suggestion on how to handle such a situation, feel free to add it. Mine was just a potential suggestion on how to handle it.

All in all, it's no big deal to support -1, but it could prevent some errors/exceptions, since I assume the default value is an acceptable value.


Updated as suggested.
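For reference, the agreed-upon validation, which treats -1 as the zlib default, could look roughly like this (a sketch with a hypothetical helper name, not the PR's exact diff):

```python
import zlib

def validate_compression_level(level):
    """Accept None (fall back to the zlib default), -1, or 0-9."""
    if level is None:
        return zlib.Z_DEFAULT_COMPRESSION  # -1
    if level < -1 or level > 9:
        raise ValueError("--compression-level must be between -1 and 9")
    return level

print(validate_compression_level(None))  # -1
print(validate_compression_level(5))     # 5
```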

for item in list(row.keys()):
    if item not in fields:
        del row[item]
if first:
@lsabi (Contributor) commented:

This implies that the objects in the JSON array are each on a separate line.

I'm no compression expert, but since it'll be binary and unreadable from a high-level perspective, why not skip the new line? Objects would be written as follows:
[{...},{...},{...}...{...}]
which would save n + 2 newlines (n - 1 between the objects, plus 2 for the first and the last, plus 1 for the final newline before EOF). On a huge table this could imply saving a considerable amount of bytes. @gabor-boros do you have any knowledge about the topic?

Or maybe I'm wrong and the \n are used for compression. Let me know, I'm curious now.


The code here is replicating what the json export does. The \n characters are not used for compression.

One can still unpack the jsongz file and get the json file inside (it's a standard gzip file), in which case the formatting might help. I considered the gains from removing the \n-s marginal, but you might be right. If you have really small documents, the extra end lines might make a difference.
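The cost of the per-document newlines after compression is easy to measure. A quick experiment with synthetic rows (illustrative data only, not the PR's writer):

```python
import gzip
import json

# Synthetic rows resembling small exported documents.
rows = [json.dumps({"id": i, "name": "user%d" % i}) for i in range(10000)]

# The same array, with and without the per-document newlines.
with_newlines = ("[\n" + ",\n".join(rows) + "\n]\n").encode()
without_newlines = ("[" + ",".join(rows) + "]").encode()

gz_with = gzip.compress(with_newlines)
gz_without = gzip.compress(without_newlines)

print(len(with_newlines) - len(without_newlines))  # 10002 raw bytes saved
print(len(gz_with) - len(gz_without))              # far fewer after gzip
```

Because the repeated ",\n" separator compresses almost as well as ",", the compressed saving is much smaller than the raw one.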

@iantocristian (Author) commented Mar 24, 2021 (edited):

Should say that the import script includes a custom JSON parser (which I found odd; not sure what the reason for using a custom parser was, performance perhaps?) which might be affected by the lack of new lines (I expect it to be happy, though).

@lsabi (Contributor) commented:

if you have really small documents, the extra end lines might make a difference

Did you mean big documents? Hehe

Regarding the custom JSON parser, I have no clue why there's a custom one. Probably, when the library was written, the existing parsers did not fit/match the requirements. Nowadays there are tons of high-performance parsers. In order not to break anything, I would keep the custom one for now.

@iantocristian (Author) commented:

... there are missing tests, which could become a problem. We could write them in a second moment.

I am on the same page here. We need some tests for the backup and restore functionality. But it feels like a different story / PR is in order.

@iantocristian (Author) commented Mar 24, 2021 (edited):

One other comment I got was related to the jsongz extension I used for the data files: why not json.gz? I used jsongz because splitext can't handle json.gz, and json.gz would have required more code changes elsewhere in the scripts.

The downside of using jsongz is that unpacking is more cumbersome, in most cases requiring the extension to be changed before unpacking (e.g. the gzip -d command won't like the jsongz extension). Any thoughts about this?

@lsabi (Contributor) commented:

I haven't worked on the import or export scripts, but an option is to have a list of supported extensions and try to match the last part of the filename against the list.

Another alternative could be to have a hash with the supported extensions pointing towards other extensions, like:

SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}

This way, files ending with json can be decoded immediately, while those ending with gz have to have json before the gz, which is to be checked. Although I don't know how much work performing the switch could be.
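That nested table could be matched against a filename roughly like this (a sketch; detect_format is a hypothetical helper, not code from the scripts):

```python
# Terminal True means "decode now"; a nested dict means the previous
# extension must also match (as in lsabi's suggestion).
SUPPORTED_EXT = {
    "json": True,
    "gz": {"json": True},
}

def detect_format(filename):
    """Return the list of matched extensions, outermost last, or None."""
    parts = filename.split(".")
    table = SUPPORTED_EXT
    matched = []
    # Walk the extensions from the end of the filename inwards.
    for ext in reversed(parts[1:]):
        rule = table.get(ext)
        if rule is True:
            matched.append(ext)
            return list(reversed(matched))
        if isinstance(rule, dict):
            matched.append(ext)
            table = rule
            continue
        return None
    return None

print(detect_format("users.json"))     # ['json']
print(detect_format("users.json.gz"))  # ['json', 'gz']
print(detect_format("users.csv.gz"))   # None
```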


@lsabi (Contributor) commented:

To me it looks good.

Only the new line feed, which I don't know if it's worth removing or not.

We can, in a second moment, write tests and check how much it influences the size of the generated file.

@gabor-boros what do you think? For me it can pass.

@iantocristian (Author) commented:

Only the new line feed, which I don't know if it's worth removing or not.

It's one newline per document, right? Plus one at the start and one at the end.

For a table with 1000 documents with an average size of 1 KB, you gain 1002 bytes uncompressed: less than 0.1%.
For a table with 10000 documents with an average size of 200 bytes, you gain 10002 bytes uncompressed: roughly 0.5%.
For anything larger than 1 KB per document, the gain is negligible.
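These estimates are plain arithmetic and easy to reproduce (the helper name here is made up):

```python
def newline_overhead(doc_count, avg_doc_bytes):
    """Uncompressed bytes spent on newlines: one per document,
    plus one after '[' and one before ']'."""
    saved = doc_count + 2
    fraction = saved / (doc_count * avg_doc_bytes)
    return saved, fraction

print(newline_overhead(1000, 1024))  # (1002, ...) -> under 0.1%
print(newline_overhead(10000, 200))  # (10002, ...) -> about 0.5%
```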

Not worth it imo.

Another point is that jsongz = gzipped json. So it should be the same output as json, but compressed.

@lsabi (Contributor) commented:

Percentages vary based on the size of the documents. But if you have a table with millions of records, it implies saving MBs of space. Sure, it'll not be much in comparison to the total size, but it may be better and easier to fit into memory. I'm not sure there'll be such a big table, though.

Nevertheless, as I said, this can be done in a second moment.

What's the point about the jsongz? I don't understand it.

@iantocristian (Author) commented Mar 30, 2021 (edited):

Nevertheless, as I said, this can be done in a second moment.

👍

What's the point about the jsongz? I don't understand it

That it wasn't my intention to change the content that's being dumped, just to compress it.
It could have been an option for the json_writer, but I thought it was less risky to have a separate writer.

@lsabi (Contributor) commented:

Don't worry, we can keep them separate and merge one day.

@gabor-boros do you have anything to add/to complain about this PR?

@AlexC commented:

@lsabi / @gabor-boros just wondering if there was an update on getting this merged in? It would be super useful for us. Thanks


Reviewers

@lsabi left review comments

@gabor-boros left review comments

At least 1 approving review is required to merge this pull request.


4 participants

@iantocristian @gabor-boros @lsabi @AlexC
