NotificationsYou must be signed in to change notification settings
Fork1.2k
Star23k

[WIP] Lifting the padding requirement from simdjson APIs#1665

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Draft

lemire wants to merge88 commits intomaster

base:master

Choose a base branch

fromdlemire/nopadding

Draft

[WIP] Lifting the padding requirement from simdjson APIs#1665

lemire wants to merge88 commits intomasterfromdlemire/nopadding

Conversation

Copy link

Member

lemire commentedJul 21, 2021•
edited
Loading

@jkeiser did a lot of hard work on two PRs to remove the need for padding in the On Demand front-end:#1511 and#1606

These PRs are diverging from our main branch more and more. So we need to refresh them. Let us do so.

merge fully all stale PRs
actually turn on the feature
update documentation
assess performance change
debug/make tests pass, run through sanitizers
Recover some of the performance loss.

In DOM, there is an unacceptable performance regression.

In On Demand, there is definitively a performance regression, but it is in the expected range (5%). Factors to take into account: (1) we improve the usability (2) in some cases, we save a copy and some memory usage which might be worth more than 5% (3) we might be able to claw back this extra 5%. At a glance, the regression can be explained by the extra number of instructions we need. For example, we get the worst regression with distinct_user_id (pointer), but the number of instructions went from 2166639 to 2431248, so a 12% uptick. (4) The ~5% range is the kind of performance difference we get just by switching compiler... so it is unlikely to displease an existing user (they will not notice the regression).

task	main branch	this PR	loss (%)
amazon_cellphones	2.04 GB/s	1.92 GB/s	6%
large_amazon_cellphones	2.07 GB/s	2.18 GB/s	-5%
partial_tweets	3.04 GB/s	3.00 GB/s	1%
large_random	0.77 GB/s	0.74 GB/s	4%
large_random (unordered)	0.74 GB/s	0.71 GB/s	4%
kostya	2.29 GB/s	2.18 GB/s	5%
distinct_user_id	3.50 GB/s	3.35 GB/s	4%
distinct_user_id (pointer)	3.14 GB/s	2.85 GB/s	9%
find_tweet	4.81 GB/s	4.81 GB/s	0%
top_tweet	3.40 GB/s	3.41 GB/s	0%
top_tweet (forward)	3.11 GB/s	3.01 GB/s	3 %

It is a PR on top of#1663

Fixes#174

jkeiserand others added30 commits

March 20, 2021 14:23

Normalize the words "position" and "end"

7f545a8

Don't step off the end of the index buffer

a334303

Split ondemand_object_tests into smaller files

02762fa

Add __SIMDJSON_CHECK_EOF feature flag

09601eb

This adds "guarded" (i.e., memory-safe) number parsing routines.

1fbd604

Adding potentially useless definition.

081f954

Adding the functions that I meant to add earlier.

42c8541

Merge pull request#1516from simdjson/dlemire/guarded_number_parsing

4d0f102

guarded number parsing (to be merged into Remove padding access from iteration#1511)

Merge branch 'master' into jkeiser/no-padding

d7b08c8

Merge branch 'master' into jkeiser/no-padding

76f7e4a

Merge branch 'master' into jkeiser/no-padding

595c628

Minor correction.

dadfb51

Missing #ifdef SIMDJSON_DEVELOPMENT_CHECKS

930b0be

that got destroyed in the merge. Putting it back.

Another merge issue.

c0fb5f7

Merge branch 'master' into jkeiser/no-padding

0c14707

Merging again. Fixing issue.

0d8f6f5

Merge branch 'master' into jkeiser/no-padding

10a7b9b

Merge issue

633784a

Don't allocate when parsing root number

eda0f19

Don't read padding at all for root booleans and numbers

da4de2e

Don't use padding parsing booleans

b02aedb

Minor changes to compilation test.

ba0406f

I suspect that sanitizers might increase stack usage, to combat that …

f45c59f

…you may set SIMDJSON_NO_FORCE_INLINING (#1614)

Remove unlikely from null / bool parsing

bab51e5

Remove need for padding from SIMD string parsing

6ba9ad3

Make len available to any dom_parser_implementation user

45fbd8c

Don't change the order of fields to avoid perf changes

ac647e6

Fix ARM compile errors

42a31ad

Don't step off the end of the index buffer

6a1f8fc

Split ondemand_object_tests into smaller files

78d1cd9

lemire added7 commits

July 23, 2021 21:56

Let us stress the memory overflow in more tests.

60b6a16

More fixes.

657489e

Turning on developer mode explicitly throughout.

7c8bc65

Merge branch 'dlemire/nopadding' of github.com:simdjson/simdjson into…

ce11f41

… dlemire/nopadding

Syncing the singleheader files.

b6aff44

Guarding the peek/advance.

2907912

Merge branch 'dlemire/nopadding' of github.com:simdjson/simdjson into…

450cb7b

… dlemire/nopadding

Copy link

MemberAuthor

lemire commentedJul 24, 2021

The more time I spend on this, the more depressing it gets.

It breaks everything and seems to imply greater and greater performance tradeoffs.

Copy link

MemberAuthor

lemire commentedJul 24, 2021

On Rome, Twitter/DOM goes from 2.5 to 2.3. It is not the end of the world, but it is months of optimization work undone. Not to mention all the safety that we seem to sacrifice.

To be revisited.

Copy link

MemberAuthor

lemire commentedJul 24, 2021

It looks like this will need a lot of work still.

lemire modified the milestones:1.0,2.0

Jul 24, 2021

lemire added researchExploration of the unknown and removed on demandRelated to simdjson::ondemand functionality labels

Jul 24, 2021

Copy link

MemberAuthor

lemire commentedJul 24, 2021

Marking as research. This is not ready for 1.0.

lemire marked this pull request as draft

July 24, 2021 13:57

Copy link

MemberAuthor

lemire commentedJul 24, 2021

Marked for 2.0.

lemire changed the title~~Lifting the padding requirement from simdjson APIs~~[WIP] Lifting the padding requirement from simdjson APIs

Jul 25, 2021

Copy link

MemberAuthor

lemire commentedJul 25, 2021

I have removed this from the 1.0 milestone.

My main concern is that this change introduces many vulnerabilities and would require extensive testing.

Another concern is that it seems to bring about significant performance regressions, sometimes exceeding 10%.

I could live with the loss of performance but not if it comes with a considerable performance penalty on top of it. The two together make for a bad mix.

Copy link

MemberAuthor

lemire commentedJul 25, 2021

So this will require extensive work.

Copy link

Member

jkeiser commentedAug 4, 2021

Agreed. If I had time to work on it right now and push it forward I might feel different, but between moving to a new house, kids being home for dinner, and pushing towards a major milestone at work I'm saturated.

It feels like there is a way to do this with low performance loss, but the needle has to be threaded very carefully. The target in my head has been up to 5% performance loss, all told (which is roughly the white noise level of the compiler for Dom).

My only sadness--which is absolutely not a reason to delay 1.0--is that making a performance-for-feature trade-off after 1.0 will be harder to sell and therefore may require more engineering effort to preserve a padded version for users that can afford it. Not that that is bad, just that it's more work for smaller gains.
Note that it's possible to put padding back for Dom without changing on demand.

Copy link

MemberAuthor

lemire commentedAug 4, 2021

@jkeiser Thanks. My concern right now has mostly to do with bugs. Pushing something that is not sufficiently well tested as "1.0" would be showing bad judgement.

My only sadness--which is absolutely not a reason to delay 1.0--is that making a performance-for-feature trade-off after 1.0 will be harder to sell and therefore may require more engineering effort to preserve a padded version for users that can afford it. Not that that is bad, just that it's more work for smaller gains.

There is a flip side to this story: as I was working on it, I thought that it was a waste to undo all of the hard work to accelerate processing by going back to a slower version even when users do not do it.

I think that the proper solution might be to use templates to 'branch' on the approach (do you have padding or not?).

Also, from a research angle, there might be other ways to cook this particular chicken. The way I thought you were going to go for is something different... Instead of doing all of these bound checks all over the place... examine the document when you get started, adjust the structural index so that at a strategic location you get a bogus error. This bogus error brings you into a distinct mode where you finish the processing with more careful code. Then you'd get the no-padding for free (given a large enough input).

This "bogus error" approach is also how I would try to handle the "stage 1 in chunks". You give me a 6 GB JSON document. I index it in chunks of 1 MB. I change the index so that somewhere before the end of the chunk, I encounter a bogus error. Then I know to load a new index.

I'll open an issue.

lemire mentioned this pull request

Aug 4, 2021

Experiment with bogus-error approach for no-overhead bound-check-like behavior#1686

Open

Copy link

MemberAuthor

lemire commentedAug 4, 2021

@jkeiser Note that this PR us probably very close to a working state. I am just nervous about how many bugs I will be missing. Sending this out as version 1.0 does not seem right.

Copy link

MemberAuthor

lemire commentedAug 7, 2021

@jkeiser As it is, I have been working quite a bit just to get thecurrent code base ready for release and I am not happy, and it has been months. A reset like the one suggested by this PR would require massive engineering efforts to get back at the level of maturity we have.

Copy link

zptan commentedAug 24, 2023•
edited
Loading

2 years passed since this PR sent, any update here? Or is it given up?

Our client receives JSON data from a server, and have to create apadded_string instance before parsing the data, which is a perf penalty

I have not done tests to check the perf gap between

no padding, just parse the data
padd the data and parse

Can padding be an option? If so, users can choose the best way

Copy link

MemberAuthor

lemire commentedAug 24, 2023

This PR is obsolete in its current state and is not being pursued.

@zptan If you'd like to pick it up and complete it, that would be great.

Copy link

zptan commentedSep 9, 2023

I just found this commit

Jereq/load-gltf@bc70241

So it seems that we already can parse without copying + padding the original json

std::string json_no_padding ="{}";simdjson::ondemand::parser parser;simdjson::ondemand::document doc = parser.iterate(json_no_padding, json_no_padding.size() + simdjson::SIMDJSON_PADDING);

Copy link

MemberAuthor

lemire commentedSep 9, 2023

So it seems that we already can parse without copying + padding the original json

You definitively can do as you describe but if the string is located near the end of a page, and the next page is not allocated, the resulting code could crash.

EzequielRamis mentioned this pull request

Mar 22, 2025

Add link to zimdjson#2352

Merged

Labels

research

Exploration of the unknown

Movatterモバイル変換

[WIP] Lifting the padding requirement from simdjson APIs#1665

Are you sure you want to change the base?

[WIP] Lifting the padding requirement from simdjson APIs#1665

Uh oh!

Conversation

lemire commentedJul 21, 2021• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

lemire commentedJul 24, 2021

Uh oh!

lemire commentedJul 24, 2021

Uh oh!

lemire commentedJul 24, 2021

Uh oh!

lemire commentedJul 24, 2021

Uh oh!

lemire commentedJul 24, 2021

Uh oh!

lemire commentedJul 25, 2021

Uh oh!

lemire commentedJul 25, 2021

Uh oh!

jkeiser commentedAug 4, 2021

Uh oh!

lemire commentedAug 4, 2021

Uh oh!

lemire commentedAug 4, 2021

Uh oh!

lemire commentedAug 7, 2021

Uh oh!

zptan commentedAug 24, 2023• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

lemire commentedAug 24, 2023

Uh oh!

zptan commentedSep 9, 2023

Uh oh!

lemire commentedSep 9, 2023

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

lemire commentedJul 21, 2021•
edited
Loading

zptan commentedAug 24, 2023•
edited
Loading