bpo-46372: Try to mutate the LHS during fast float ops #30594

Closed
brandtbucher wants to merge 18 commits into python:main from brandtbucher:inplace-ops

Conversation

@brandtbucher (Member) commented Jan 14, 2022 (edited by gvanrossum)
This modifies the existing int and float specializations to mutate the LHS operand in-place when possible. It also unifies the implementations of all of these operations.

https://bugs.python.org/issue46372
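As an illustration (not code from the PR), this is the shape of code the specialization targets: a float accumulator that is rebound on every iteration, so by the time the binary op runs, the old value typically has no other references and its storage could be reused instead of allocating a fresh float object.

```python
# Hypothetical illustration of the pattern this PR targets (not from the PR).
def total(values):
    t = 0.0
    for v in values:
        t = t + v  # binary op whose result is immediately stored back into t
    return t

print(total([0.5, 1.0, 2.5]))  # 4.0
```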

@sweeneyde (Member)

I think `_PyLong_Add` and the like could potentially be deleted as well.


@markshannon (Member) left a comment

A few general comments:

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow?
Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

It would be worth adding stats to give us some idea what fraction of operations can use this optimization or something like it. In other words, what fraction of operations have the lhs with a ref-count of 1 (or 2 when assigning to same variable)? That would be generally useful information beyond this particular optimization.

As a general rule, if something can be done at compile time, it should be. If not, it should be done at specialization time. Only do stuff at runtime if we can't avoid it.
We can determine at compile time whether a binary operator is of the form `s += x` or `s = s + x` (where `s` is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the `s = s + x` form and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.
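The refcount test at the heart of the proposal can be sketched with a toy model (purely illustrative: the `Box` class and `add` helper are hypothetical stand-ins, and CPython's actual check lives in the interpreter's C code):

```python
class Box:
    """Toy refcounted object, standing in for a PyFloatObject."""
    def __init__(self, value, refcnt=1):
        self.refcnt = refcnt
        self.value = value

def add(lhs, rhs):
    # If nothing else references lhs, mutating it in place is unobservable,
    # so its storage can be reused for the result.
    if lhs.refcnt == 1:
        lhs.value += rhs.value
        return lhs
    return Box(lhs.value + rhs.value)

x, y = Box(1.5), Box(2.5)
z = add(x, y)
print(z is x, z.value)            # True 4.0 -- storage reused in place

shared = Box(1.5, refcnt=2)       # someone else still holds a reference
w = add(shared, y)
print(w is shared, shared.value)  # False 1.5 -- fresh object, lhs untouched
```

The fraction of operations for which the `refcnt == 1` branch is taken is exactly the statistic being asked for above.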

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher (Member, Author)

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

Yeah, that makes sense. They can also be updated as the implementation changes.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow? Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

Looks like your intuition is correct. Here is a comparison of `main`, int-only, float-only, and both:

Details
| Benchmark | main-0 | inplace-ops-int-0 | inplace-ops-float-0 | inplace-ops-0 |
|---|---|---|---|---|
| 2to3 | 264 ms | not significant | not significant | 266 ms: 1.01x slower |
| chameleon | 7.64 ms | 7.40 ms: 1.03x faster | 7.43 ms: 1.03x faster | 7.45 ms: 1.03x faster |
| chaos | 72.9 ms | 72.3 ms: 1.01x faster | 74.2 ms: 1.02x slower | not significant |
| crypto_pyaes | 84.9 ms | 85.5 ms: 1.01x slower | 85.6 ms: 1.01x slower | 84.2 ms: 1.01x faster |
| deltablue | 4.21 ms | not significant | not significant | 4.24 ms: 1.01x slower |
| django_template | 35.7 ms | 36.2 ms: 1.02x slower | not significant | not significant |
| dulwich_log | 65.3 ms | 65.7 ms: 1.01x slower | 65.0 ms: 1.00x faster | 65.8 ms: 1.01x slower |
| fannkuch | 404 ms | 385 ms: 1.05x faster | 395 ms: 1.02x faster | 394 ms: 1.02x faster |
| float | 78.0 ms | 77.4 ms: 1.01x faster | 76.8 ms: 1.02x faster | not significant |
| go | 149 ms | 151 ms: 1.01x slower | 150 ms: 1.01x slower | not significant |
| hexiom | 6.65 ms | 6.58 ms: 1.01x faster | 6.71 ms: 1.01x slower | 6.76 ms: 1.02x slower |
| json_dumps | 12.7 ms | not significant | 12.9 ms: 1.02x slower | 12.8 ms: 1.01x slower |
| json_loads | 25.4 us | 25.6 us: 1.01x slower | 25.2 us: 1.01x faster | 25.5 us: 1.00x slower |
| logging_format | 6.51 us | 6.36 us: 1.02x faster | 6.42 us: 1.02x faster | 6.40 us: 1.02x faster |
| logging_silent | 111 ns | 108 ns: 1.03x faster | 110 ns: 1.01x faster | 107 ns: 1.04x faster |
| logging_simple | 5.90 us | 5.87 us: 1.00x faster | 5.86 us: 1.01x faster | not significant |
| mako | 11.4 ms | 11.4 ms: 1.00x faster | 11.5 ms: 1.01x slower | not significant |
| meteor_contest | 105 ms | 103 ms: 1.01x faster | 103 ms: 1.01x faster | 103 ms: 1.02x faster |
| nbody | 96.5 ms | 98.0 ms: 1.02x slower | 95.1 ms: 1.01x faster | not significant |
| nqueens | 83.9 ms | not significant | 85.2 ms: 1.02x slower | 85.5 ms: 1.02x slower |
| pathlib | 19.4 ms | not significant | not significant | 19.2 ms: 1.01x faster |
| pickle | 9.86 us | 9.71 us: 1.02x faster | 9.70 us: 1.02x faster | 10.1 us: 1.02x slower |
| pickle_dict | 27.2 us | 28.6 us: 1.05x slower | not significant | 28.2 us: 1.04x slower |
| pickle_list | 4.55 us | 4.61 us: 1.01x slower | 4.61 us: 1.01x slower | 4.41 us: 1.03x faster |
| pickle_pure_python | 335 us | 331 us: 1.01x faster | 333 us: 1.01x faster | 332 us: 1.01x faster |
| pidigits | 194 ms | 195 ms: 1.00x slower | 197 ms: 1.02x slower | 189 ms: 1.02x faster |
| pyflate | 447 ms | 442 ms: 1.01x faster | 441 ms: 1.01x faster | 441 ms: 1.01x faster |
| python_startup | 8.16 ms | 8.23 ms: 1.01x slower | 8.17 ms: 1.00x slower | 8.20 ms: 1.00x slower |
| python_startup_no_site | 5.79 ms | 5.83 ms: 1.01x slower | 5.80 ms: 1.00x slower | 5.82 ms: 1.00x slower |
| raytrace | 315 ms | 310 ms: 1.02x faster | 310 ms: 1.02x faster | 309 ms: 1.02x faster |
| regex_compile | 139 ms | 136 ms: 1.02x faster | 138 ms: 1.01x faster | 138 ms: 1.01x faster |
| regex_dna | 216 ms | 221 ms: 1.03x slower | not significant | 216 ms: 1.00x slower |
| regex_effbot | 3.26 ms | not significant | 3.33 ms: 1.02x slower | not significant |
| regex_v8 | 24.1 ms | not significant | 23.5 ms: 1.03x faster | 24.0 ms: 1.01x faster |
| richards | 53.1 ms | 51.6 ms: 1.03x faster | 51.4 ms: 1.03x faster | 51.6 ms: 1.03x faster |
| scimark_fft | 326 ms | 341 ms: 1.04x slower | 314 ms: 1.04x faster | 333 ms: 1.02x slower |
| scimark_lu | 112 ms | not significant | 110 ms: 1.01x faster | 109 ms: 1.03x faster |
| scimark_monte_carlo | 72.1 ms | 69.5 ms: 1.04x faster | 71.3 ms: 1.01x faster | 68.6 ms: 1.05x faster |
| scimark_sor | 125 ms | 121 ms: 1.04x faster | 124 ms: 1.01x faster | 121 ms: 1.04x faster |
| scimark_sparse_mat_mult | 4.65 ms | 4.76 ms: 1.02x slower | 4.41 ms: 1.05x faster | 4.30 ms: 1.08x faster |
| spectral_norm | 103 ms | 93.1 ms: 1.10x faster | 93.1 ms: 1.10x faster | 91.7 ms: 1.12x faster |
| sympy_expand | 500 ms | not significant | not significant | 502 ms: 1.01x slower |
| sympy_integrate | 21.3 ms | 21.3 ms: 1.00x faster | 21.4 ms: 1.00x slower | 21.4 ms: 1.00x slower |
| sympy_sum | 168 ms | not significant | 168 ms: 1.00x faster | 169 ms: 1.01x slower |
| sympy_str | 302 ms | 299 ms: 1.01x faster | 300 ms: 1.01x faster | not significant |
| telco | 6.52 ms | not significant | 6.63 ms: 1.02x slower | 6.35 ms: 1.03x faster |
| unpack_sequence | 44.4 ns | 45.4 ns: 1.02x slower | not significant | 45.9 ns: 1.03x slower |
| unpickle | 14.2 us | 14.4 us: 1.01x slower | 14.0 us: 1.02x faster | not significant |
| unpickle_list | 5.08 us | 5.16 us: 1.02x slower | 5.03 us: 1.01x faster | 5.12 us: 1.01x slower |
| unpickle_pure_python | 251 us | 247 us: 1.02x faster | 253 us: 1.01x slower | 247 us: 1.01x faster |
| xml_etree_parse | 155 ms | not significant | 157 ms: 1.01x slower | 157 ms: 1.01x slower |
| xml_etree_iterparse | 107 ms | not significant | not significant | 108 ms: 1.01x slower |
| xml_etree_generate | 81.5 ms | 80.6 ms: 1.01x faster | 80.7 ms: 1.01x faster | 80.9 ms: 1.01x faster |
| xml_etree_process | 57.9 ms | 57.7 ms: 1.00x faster | not significant | 58.4 ms: 1.01x slower |
| Geometric mean | (ref) | 1.00x faster | 1.01x faster | 1.01x faster |

As a general rule, if something can be done at compile time, it should be. If not, it should be done at specialization time. Only do stuff at runtime if we can't avoid it. We can determine at compile time whether a binary operator is of the form `s += x` or `s = s + x` (where `s` is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the `s = s + x` form and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I've updated the PR to use the adaptive entry for this purpose. I understand the benefits of a compile-time check, but doing the check at specialization time seems a lot simpler in this particular case.
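The pattern being discussed is visible in the bytecode: both forms compile to a binary operation immediately followed by a store into the same fast local, which is what either the compiler or the specializer can peek at. A small sketch (exact opcode names vary across CPython versions, so only the store into the local is checked):

```python
import dis

def assign_form(s, x):
    s = s + x
    return s

def augmented_form(s, x):
    s += x
    return s

# In both functions, the binary op is followed by a store into the fast
# local `s` (STORE_FAST, or a merged superinstruction in newer versions).
for func in (assign_form, augmented_form):
    ops = [ins.opname for ins in dis.get_instructions(func)]
    print(any(op.startswith("STORE_FAST") for op in ops))  # True
```

Checking at specialization time, as the PR now does, means inspecting exactly this "next instruction" in the bytecode rather than emitting a dedicated opcode at compile time.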

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.

Shall I drop the int changes, then?

@markshannon (Member)

Looks good.
Do you have up to date performance numbers?

@brandtbucher (Member, Author) commented Jan 20, 2022 (edited)

Do you have time for a quick Teams call @markshannon? Might be easier to chat about the stats I gathered in person.

(If not, maybe tomorrow or Monday.)

@markshannon (Member) commented Jan 21, 2022 (edited)

Somewhat surprising benchmarking results.
An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

@sweeneyde (Member)

Somewhat surprising benchmarking results. An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

The pidigits slowdown should be somewhat expected, since it's mostly huge-integer operations, which are specialized and handled by `_PyLong_Add` and such before this PR but not after.
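For context, pidigits is dominated by arithmetic on integers far larger than a machine word, so a fast path that only mutates small operands in place can't help it. A minimal pidigits-style workload, shown here as an illustration (Gibbons' unbounded spigot algorithm, not the benchmark's actual code), makes the shape of the work clear:

```python
def pi_digits(n):
    """First n decimal digits of pi via Gibbons' unbounded spigot.
    The intermediates q, r, t grow into huge integers very quickly,
    which is exactly the regime _PyLong_Add and friends targeted."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    out = []
    while len(out) < n:
        if 4 * q + r - t < m * t:
            out.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return out

print(pi_digits(10))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```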

@brandtbucher changed the title from "bpo-46372: Try to mutate the LHS during fast int/float ops" to "bpo-46372: Try to mutate the LHS during fast float ops" on Jan 24, 2022
@brandtbucher (Member, Author)

Hm, looks like removing the `refcount == 2` optimization from the float ops wiped out most of the performance gains.

I’ll go ahead and try the less-branchy version that we discussed this morning (an adaptive superinstruction based on the existing in-place string addition op).

@markshannon (Member)

LGTM. Feel free to merge once the merge conflicts are fixed.

@gvanrossum (Member) left a comment

Nice and simple!

```c
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
    PyFloat_AS_DOUBLE(lhs) = d; \
```
Review comment (Member):

Presumably (see PEP 674) Victor won't like this use of a macro as a target. :-)

Nevertheless, this whole macro is so nice and straightforward compared to ints!

```c
double d = l OP r; \
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
```
Review comment (Member):

Would there be any benefit in checking whether perhaps rhs has a refcount of 1, in case lhs doesn't? While people are probably more likely to write `return x+1` rather than `return 1+x` (though even I could sometimes see a reason to do that, from "rhyming" with some other operations or just how the textbook presents the formula), writing `return 2*x` is more likely than `return x*2`. And of course for asymmetric operations you may not have a choice, like `return 1/x`.

Of course, there's a cost (more code, more branches). There's also the fact that `+=` and friends steer this in the direction you're taking here.
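The suggestion can be sketched in the same toy-model style as the rest of the discussion (purely illustrative; `Box` and `mul` are hypothetical, and the fallback to the RHS is only valid for commutative operators like `+` and `*`):

```python
class Box:
    """Toy refcounted object standing in for a float."""
    def __init__(self, value, refcnt=1):
        self.refcnt = refcnt
        self.value = value

def mul(lhs, rhs):
    # Prefer reusing lhs; fall back to rhs when only rhs is unshared.
    # Safe here only because multiplication is commutative.
    if lhs.refcnt == 1:
        lhs.value *= rhs.value
        return lhs
    if rhs.refcnt == 1:
        rhs.value *= lhs.value
        return rhs
    return Box(lhs.value * rhs.value)

two = Box(2.0, refcnt=5)   # e.g. a widely shared constant
x = Box(3.5)               # a temporary with a single reference
result = mul(two, x)       # lhs shared, rhs unshared -> rhs reused
print(result is x, result.value)  # True 7.0
```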

@brandtbucher (Member, Author) replied:

Of course, there's a cost (more code, more branches). There's also the fact that `+=` and friends steer this in the direction you're taking here.

Yeah, those are my main motivations for sticking with the LHS here.

@brandtbucher (Member, Author)

Okay, I just pushed variants of `float += float` and `float -= float` that optimize the case where `lhs` has two references: one on the stack, and one named local reference that is about to be overwritten by the next instruction. It costs one extra (fast, predictable) branch, but lets us both mutate `lhs` in place and skip the following store entirely!

This gets us back up to a 1% improvement:

Slower (15):
- crypto_pyaes: 83.5 ms +- 0.9 ms -> 85.1 ms +- 1.2 ms: 1.02x slower
- sqlite_synth: 2.71 us +- 0.04 us -> 2.74 us +- 0.07 us: 1.01x slower
- django_template: 34.8 ms +- 0.4 ms -> 35.2 ms +- 0.5 ms: 1.01x slower
- richards: 49.5 ms +- 0.7 ms -> 50.1 ms +- 0.8 ms: 1.01x slower
- pickle_list: 4.51 us +- 0.06 us -> 4.56 us +- 0.05 us: 1.01x slower
- scimark_sor: 122 ms +- 1 ms -> 123 ms +- 2 ms: 1.01x slower
- fannkuch: 385 ms +- 2 ms -> 389 ms +- 5 ms: 1.01x slower
- float: 76.1 ms +- 0.8 ms -> 76.8 ms +- 0.9 ms: 1.01x slower
- deltablue: 4.07 ms +- 0.05 ms -> 4.10 ms +- 0.05 ms: 1.01x slower
- chameleon: 7.34 ms +- 0.09 ms -> 7.39 ms +- 0.12 ms: 1.01x slower
- pickle_pure_python: 326 us +- 2 us -> 328 us +- 3 us: 1.00x slower
- regex_compile: 135 ms +- 1 ms -> 135 ms +- 1 ms: 1.00x slower
- xml_etree_generate: 79.0 ms +- 0.6 ms -> 79.2 ms +- 0.9 ms: 1.00x slower
- python_startup_no_site: 5.87 ms +- 0.00 ms -> 5.87 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.23 ms +- 0.01 ms -> 8.24 ms +- 0.01 ms: 1.00x slower

Faster (25):
- scimark_sparse_mat_mult: 4.56 ms +- 0.10 ms -> 4.27 ms +- 0.04 ms: 1.07x faster
- regex_dna: 220 ms +- 1 ms -> 210 ms +- 1 ms: 1.05x faster
- spectral_norm: 99.0 ms +- 0.9 ms -> 94.8 ms +- 2.8 ms: 1.04x faster
- pyflate: 456 ms +- 5 ms -> 438 ms +- 3 ms: 1.04x faster
- scimark_lu: 116 ms +- 5 ms -> 112 ms +- 3 ms: 1.04x faster
- scimark_fft: 326 ms +- 2 ms -> 318 ms +- 2 ms: 1.03x faster
- nbody: 94.2 ms +- 4.0 ms -> 91.7 ms +- 0.7 ms: 1.03x faster
- scimark_monte_carlo: 70.9 ms +- 1.1 ms -> 69.1 ms +- 1.5 ms: 1.03x faster
- nqueens: 85.7 ms +- 0.8 ms -> 83.7 ms +- 1.0 ms: 1.02x faster
- logging_simple: 5.93 us +- 0.07 us -> 5.80 us +- 0.06 us: 1.02x faster
- mako: 11.5 ms +- 0.1 ms -> 11.3 ms +- 0.1 ms: 1.02x faster
- json_dumps: 12.6 ms +- 0.2 ms -> 12.4 ms +- 0.2 ms: 1.02x faster
- pathlib: 19.2 ms +- 0.2 ms -> 18.9 ms +- 0.3 ms: 1.02x faster
- logging_format: 6.44 us +- 0.08 us -> 6.35 us +- 0.07 us: 1.01x faster
- hexiom: 6.62 ms +- 0.05 ms -> 6.54 ms +- 0.10 ms: 1.01x faster
- chaos: 71.0 ms +- 0.6 ms -> 70.3 ms +- 0.5 ms: 1.01x faster
- json_loads: 25.3 us +- 0.8 us -> 25.0 us +- 0.3 us: 1.01x faster
- xml_etree_iterparse: 107 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x faster
- unpickle: 14.0 us +- 0.2 us -> 13.9 us +- 0.2 us: 1.01x faster
- pickle_dict: 27.4 us +- 0.1 us -> 27.1 us +- 0.2 us: 1.01x faster
- go: 145 ms +- 1 ms -> 144 ms +- 1 ms: 1.01x faster
- raytrace: 307 ms +- 5 ms -> 305 ms +- 2 ms: 1.01x faster
- regex_effbot: 3.21 ms +- 0.03 ms -> 3.18 ms +- 0.04 ms: 1.01x faster
- dulwich_log: 64.9 ms +- 0.4 ms -> 64.5 ms +- 0.4 ms: 1.01x faster
- pidigits: 192 ms +- 0 ms -> 192 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (18): 2to3, logging_silent, meteor_contest, pickle, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_process

Geometric mean: 1.01x faster
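The two-reference case described above can be sketched in toy-model form (purely illustrative; `Box` and `add_inplace` are hypothetical, and in CPython the "next instruction stores to the same local" check is done against the actual bytecode, not passed as a flag):

```python
class Box:
    """Toy refcounted object standing in for a float."""
    def __init__(self, value, refcnt=1):
        self.refcnt = refcnt
        self.value = value

def add_inplace(lhs, rhs, next_op_stores_to_lhs_local):
    # refcnt == 1: lhs exists only on the stack -> safe to mutate.
    # refcnt == 2 *and* the next instruction overwrites the only other
    # reference (a named local) -> also safe, and the store can be skipped.
    if lhs.refcnt == 1 or (lhs.refcnt == 2 and next_op_stores_to_lhs_local):
        lhs.value += rhs.value
        return lhs
    return Box(lhs.value + rhs.value)

a = Box(1.0, refcnt=2)  # one stack reference, one in the local `a`
b = Box(0.5)
r = add_inplace(a, b, next_op_stores_to_lhs_local=True)
print(r is a, r.value)  # True 1.5 -- mutated in place, store skippable

shared = Box(1.0, refcnt=2)  # second reference is NOT about to die
r2 = add_inplace(shared, b, next_op_stores_to_lhs_local=False)
print(r2 is shared)          # False -- fresh object required
```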

@gvanrossum (Member)

Seems you have a merge conflict. Also, Mark should probably re-review this?


@markshannon (Member) left a comment (edited)

Not a fan of the big macro. Otherwise, it looks good.

```c
};

// NOTE: INPLACE must be a compile-time constant to avoid runtime branching!
#define BINARY_OP_FAST_FLOAT(OP, INPLACE) \
```
Review comment (Member):

I'm not a fan of giant macros, and I don't like the hidden compile-time control flow.
Given that the inplace and non-inplace specialized forms differ in a fundamental way, I'd like that to be explicit in the instructions themselves.

Maybe have two different macros, one for the inplace form and one for the non-inplace form.
You should be able to factor out some of the common code into helper macros.
That might be clearer. Up to you.

@brandtbucher (Member, Author) replied:

Makes sense. I'll go ahead and split it up.

@iritkatriel (Member) left a comment

This has merge conflicts now.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.


Reviewers

@markshannon requested changes

@sweeneyde left review comments

@iritkatriel requested changes

@gvanrossum approved these changes

Assignees

No one assigned

Labels

awaiting changes, performance (Performance or resource usage)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

8 participants

@brandtbucher, @sweeneyde, @bedevere-bot, @markshannon, @gvanrossum, @iritkatriel, @the-knights-who-say-ni, @ezio-melotti
