bpo-46372: Try to mutate the LHS during fast float ops #30594

Closed
brandtbucher wants to merge 18 commits into python:main from brandtbucher:inplace-ops

Conversation

@brandtbucher (Member) commented Jan 14, 2022 (edited by gvanrossum)
This modifies the existing int and float specializations to mutate the LHS operand in-place when possible. It also unifies the implementations of all of these operations.

https://bugs.python.org/issue46372
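As an illustration (not code from the PR), this is the shape of code the specialization targets: a float accumulator that is rebound on every iteration, so by the time the binary op runs, the old value typically has no other references and its storage could be reused instead of allocating a fresh float object.

```python
# Hypothetical illustration of the pattern this PR targets (not from the PR).
def total(values):
    t = 0.0
    for v in values:
        t = t + v  # binary op whose result is immediately stored back into t
    return t

print(total([0.5, 1.0, 2.5]))  # 4.0
```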

@sweeneyde (Member)

I think `_PyLong_Add` and the like could potentially be deleted as well.


@markshannon (Member) left a comment

A few general comments:

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow?
Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

It would be worth adding stats to give us some idea what fraction of operations can use this optimization or something like it. In other words, what fraction of operations have the lhs with a ref-count of 1 (or 2 when assigning to same variable)? That would be generally useful information beyond this particular optimization.

As a general rule, if something can be done at compile time, it should be. If not, it should be done at specialization time. Only do stuff at runtime if we can't avoid it.
We can determine at compile time whether a binary operator is of the form `s += x` or `s = s + x` (where `s` is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the `s = s + x` form and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.
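The refcount test at the heart of the proposal can be sketched with a toy model (purely illustrative: the `Box` class and `add` helper are hypothetical stand-ins, and CPython's actual check lives in the interpreter's C code):

```python
class Box:
    """Toy refcounted object, standing in for a PyFloatObject."""
    def __init__(self, value, refcnt=1):
        self.refcnt = refcnt
        self.value = value

def add(lhs, rhs):
    # If nothing else references lhs, mutating it in place is unobservable,
    # so its storage can be reused for the result.
    if lhs.refcnt == 1:
        lhs.value += rhs.value
        return lhs
    return Box(lhs.value + rhs.value)

x, y = Box(1.5), Box(2.5)
z = add(x, y)
print(z is x, z.value)            # True 4.0 -- storage reused in place

shared = Box(1.5, refcnt=2)       # someone else still holds a reference
w = add(shared, y)
print(w is shared, shared.value)  # False 1.5 -- fresh object, lhs untouched
```

The fraction of operations for which the `refcnt == 1` branch is taken is exactly the statistic being asked for above.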

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher (Member, Author)

Could you attach the benchmark results to the PR, not the issue? The speedup depends on the actual implementation, not the general idea.

Yeah, that makes sense. They can also be updated as the implementation changes.

I'm a bit surprised that this offers much of a speedup given the complexity of the int operators. Maybe the current implementation is just really slow? Does the specialization of ints add much, or is it mainly the float specialization giving the speedup? I'd be curious to see how a float-only version compares.

Looks like your intuition is correct. Here is a comparison of `main`, int-only, float-only, and both:

Details
| Benchmark | main-0 | inplace-ops-int-0 | inplace-ops-float-0 | inplace-ops-0 |
|---|---|---|---|---|
| 2to3 | 264 ms | not significant | not significant | 266 ms: 1.01x slower |
| chameleon | 7.64 ms | 7.40 ms: 1.03x faster | 7.43 ms: 1.03x faster | 7.45 ms: 1.03x faster |
| chaos | 72.9 ms | 72.3 ms: 1.01x faster | 74.2 ms: 1.02x slower | not significant |
| crypto_pyaes | 84.9 ms | 85.5 ms: 1.01x slower | 85.6 ms: 1.01x slower | 84.2 ms: 1.01x faster |
| deltablue | 4.21 ms | not significant | not significant | 4.24 ms: 1.01x slower |
| django_template | 35.7 ms | 36.2 ms: 1.02x slower | not significant | not significant |
| dulwich_log | 65.3 ms | 65.7 ms: 1.01x slower | 65.0 ms: 1.00x faster | 65.8 ms: 1.01x slower |
| fannkuch | 404 ms | 385 ms: 1.05x faster | 395 ms: 1.02x faster | 394 ms: 1.02x faster |
| float | 78.0 ms | 77.4 ms: 1.01x faster | 76.8 ms: 1.02x faster | not significant |
| go | 149 ms | 151 ms: 1.01x slower | 150 ms: 1.01x slower | not significant |
| hexiom | 6.65 ms | 6.58 ms: 1.01x faster | 6.71 ms: 1.01x slower | 6.76 ms: 1.02x slower |
| json_dumps | 12.7 ms | not significant | 12.9 ms: 1.02x slower | 12.8 ms: 1.01x slower |
| json_loads | 25.4 us | 25.6 us: 1.01x slower | 25.2 us: 1.01x faster | 25.5 us: 1.00x slower |
| logging_format | 6.51 us | 6.36 us: 1.02x faster | 6.42 us: 1.02x faster | 6.40 us: 1.02x faster |
| logging_silent | 111 ns | 108 ns: 1.03x faster | 110 ns: 1.01x faster | 107 ns: 1.04x faster |
| logging_simple | 5.90 us | 5.87 us: 1.00x faster | 5.86 us: 1.01x faster | not significant |
| mako | 11.4 ms | 11.4 ms: 1.00x faster | 11.5 ms: 1.01x slower | not significant |
| meteor_contest | 105 ms | 103 ms: 1.01x faster | 103 ms: 1.01x faster | 103 ms: 1.02x faster |
| nbody | 96.5 ms | 98.0 ms: 1.02x slower | 95.1 ms: 1.01x faster | not significant |
| nqueens | 83.9 ms | not significant | 85.2 ms: 1.02x slower | 85.5 ms: 1.02x slower |
| pathlib | 19.4 ms | not significant | not significant | 19.2 ms: 1.01x faster |
| pickle | 9.86 us | 9.71 us: 1.02x faster | 9.70 us: 1.02x faster | 10.1 us: 1.02x slower |
| pickle_dict | 27.2 us | 28.6 us: 1.05x slower | not significant | 28.2 us: 1.04x slower |
| pickle_list | 4.55 us | 4.61 us: 1.01x slower | 4.61 us: 1.01x slower | 4.41 us: 1.03x faster |
| pickle_pure_python | 335 us | 331 us: 1.01x faster | 333 us: 1.01x faster | 332 us: 1.01x faster |
| pidigits | 194 ms | 195 ms: 1.00x slower | 197 ms: 1.02x slower | 189 ms: 1.02x faster |
| pyflate | 447 ms | 442 ms: 1.01x faster | 441 ms: 1.01x faster | 441 ms: 1.01x faster |
| python_startup | 8.16 ms | 8.23 ms: 1.01x slower | 8.17 ms: 1.00x slower | 8.20 ms: 1.00x slower |
| python_startup_no_site | 5.79 ms | 5.83 ms: 1.01x slower | 5.80 ms: 1.00x slower | 5.82 ms: 1.00x slower |
| raytrace | 315 ms | 310 ms: 1.02x faster | 310 ms: 1.02x faster | 309 ms: 1.02x faster |
| regex_compile | 139 ms | 136 ms: 1.02x faster | 138 ms: 1.01x faster | 138 ms: 1.01x faster |
| regex_dna | 216 ms | 221 ms: 1.03x slower | not significant | 216 ms: 1.00x slower |
| regex_effbot | 3.26 ms | not significant | 3.33 ms: 1.02x slower | not significant |
| regex_v8 | 24.1 ms | not significant | 23.5 ms: 1.03x faster | 24.0 ms: 1.01x faster |
| richards | 53.1 ms | 51.6 ms: 1.03x faster | 51.4 ms: 1.03x faster | 51.6 ms: 1.03x faster |
| scimark_fft | 326 ms | 341 ms: 1.04x slower | 314 ms: 1.04x faster | 333 ms: 1.02x slower |
| scimark_lu | 112 ms | not significant | 110 ms: 1.01x faster | 109 ms: 1.03x faster |
| scimark_monte_carlo | 72.1 ms | 69.5 ms: 1.04x faster | 71.3 ms: 1.01x faster | 68.6 ms: 1.05x faster |
| scimark_sor | 125 ms | 121 ms: 1.04x faster | 124 ms: 1.01x faster | 121 ms: 1.04x faster |
| scimark_sparse_mat_mult | 4.65 ms | 4.76 ms: 1.02x slower | 4.41 ms: 1.05x faster | 4.30 ms: 1.08x faster |
| spectral_norm | 103 ms | 93.1 ms: 1.10x faster | 93.1 ms: 1.10x faster | 91.7 ms: 1.12x faster |
| sympy_expand | 500 ms | not significant | not significant | 502 ms: 1.01x slower |
| sympy_integrate | 21.3 ms | 21.3 ms: 1.00x faster | 21.4 ms: 1.00x slower | 21.4 ms: 1.00x slower |
| sympy_sum | 168 ms | not significant | 168 ms: 1.00x faster | 169 ms: 1.01x slower |
| sympy_str | 302 ms | 299 ms: 1.01x faster | 300 ms: 1.01x faster | not significant |
| telco | 6.52 ms | not significant | 6.63 ms: 1.02x slower | 6.35 ms: 1.03x faster |
| unpack_sequence | 44.4 ns | 45.4 ns: 1.02x slower | not significant | 45.9 ns: 1.03x slower |
| unpickle | 14.2 us | 14.4 us: 1.01x slower | 14.0 us: 1.02x faster | not significant |
| unpickle_list | 5.08 us | 5.16 us: 1.02x slower | 5.03 us: 1.01x faster | 5.12 us: 1.01x slower |
| unpickle_pure_python | 251 us | 247 us: 1.02x faster | 253 us: 1.01x slower | 247 us: 1.01x faster |
| xml_etree_parse | 155 ms | not significant | 157 ms: 1.01x slower | 157 ms: 1.01x slower |
| xml_etree_iterparse | 107 ms | not significant | not significant | 108 ms: 1.01x slower |
| xml_etree_generate | 81.5 ms | 80.6 ms: 1.01x faster | 80.7 ms: 1.01x faster | 80.9 ms: 1.01x faster |
| xml_etree_process | 57.9 ms | 57.7 ms: 1.00x faster | not significant | 58.4 ms: 1.01x slower |
| Geometric mean | (ref) | 1.00x faster | 1.01x faster | 1.01x faster |

As a general rule, if something can be done at compile time, it should be. If not, it should be done at specialization time. Only do stuff at runtime if we can't avoid it. We can determine at compile time whether a binary operator is of the form `s += x` or `s = s + x` (where `s` is a "fast" local). We should do that.

I don't know if it is worth having two specializations, one for the `s = s + x` form and one for the general form, or embedding the information in the cache. I suspect that it won't matter.

I've updated the PR to use the adaptive entry for this purpose. I understand the benefits of a compile-time check, but doing the check at specialization time seems a lot simpler in this particular case.
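The pattern being discussed is visible in the bytecode: both forms compile to a binary operation immediately followed by a store into the same fast local, which is what either the compiler or the specializer can peek at. A small sketch (exact opcode names vary across CPython versions, so only the store into the local is checked):

```python
import dis

def assign_form(s, x):
    s = s + x
    return s

def augmented_form(s, x):
    s += x
    return s

# In both functions, the binary op is followed by a store into the fast
# local `s` (STORE_FAST, or a merged superinstruction in newer versions).
for func in (assign_form, augmented_form):
    ops = [ins.opname for ins in dis.get_instructions(func)]
    print(any(op.startswith("STORE_FAST") for op in ops))  # True
```

Checking at specialization time, as the PR now does, means inspecting exactly this "next instruction" in the bytecode rather than emitting a dedicated opcode at compile time.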

I like the float optimization, but I think our efforts to speed up int operations should focus on making int objects more efficient first. Otherwise the coupling of the interpreter to the int implementation is going to make improving the latter even more difficult.

Shall I drop the int changes, then?

@markshannon (Member)

Looks good.
Do you have up to date performance numbers?

@brandtbucher (Member, Author) commented Jan 20, 2022 (edited)

Do you have time for a quick Teams call @markshannon? Might be easier to chat about the stats I gathered in person.

(If not, maybe tomorrow or Monday.)

@markshannon (Member) commented Jan 21, 2022 (edited)

Somewhat surprising benchmarking results.
An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

@sweeneyde (Member)

Somewhat surprising benchmarking results. An overall speedup, but slowdowns on some more numerical benchmarks, particularly pidigits and spectral_norm.

The pidigits slowdown should be somewhat expected, since it's mostly huge-integer operations, which are specialized and handled by `_PyLong_Add` and such before this PR but not after.
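For context, pidigits is dominated by arithmetic on integers far larger than a machine word, so a fast path that only mutates small operands in place can't help it. A minimal pidigits-style workload, shown here as an illustration (Gibbons' unbounded spigot algorithm, not the benchmark's actual code), makes the shape of the work clear:

```python
def pi_digits(n):
    """First n decimal digits of pi via Gibbons' unbounded spigot.
    The intermediates q, r, t grow into huge integers very quickly,
    which is exactly the regime _PyLong_Add and friends targeted."""
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    out = []
    while len(out) < n:
        if 4 * q + r - t < m * t:
            out.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (q * k, (2 * q + r) * x, t * x, k + 1,
                                (q * (7 * k + 2) + r * x) // (t * x), x + 2)
    return out

print(pi_digits(10))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```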

@brandtbucher changed the title from "bpo-46372: Try to mutate the LHS during fast int/float ops" to "bpo-46372: Try to mutate the LHS during fast float ops" on Jan 24, 2022
@brandtbucher (Member, Author)

Hm, looks like removing the `refcount == 2` optimization from the float ops wiped out most of the performance gains.

I’ll go ahead and try the less-branchy version that we discussed this morning (an adaptive superinstruction based on the existing in-place string addition op).

@markshannon (Member)

LGTM. Feel free to merge once the merge conflicts are fixed.

@gvanrossum (Member) left a comment

Nice and simple!

```c
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
    PyFloat_AS_DOUBLE(lhs) = d; \
```
Review comment (Member):

Presumably (see PEP 674) Victor won't like this use of a macro as a target. :-)

Nevertheless, this whole macro is so nice and straightforward compared to ints!

```c
double d = l OP r; \
Py_DECREF(rhs); \
STACK_SHRINK(1); \
if (Py_REFCNT(lhs) == 1) { \
```
Review comment (Member):

Would there be any benefit in checking whether perhaps rhs has a refcount of 1, in case lhs doesn't? While people are probably more likely to write `return x+1` rather than `return 1+x` (though even I could sometimes see a reason to do that, from "rhyming" with some other operations or just how the textbook presents the formula), writing `return 2*x` is more likely than `return x*2`. And of course for asymmetric operations you may not have a choice, like `return 1/x`.

Of course, there's a cost (more code, more branches). There's also the fact that `+=` and friends steer this in the direction you're taking here.
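The suggestion can be sketched in the same toy-model style as the rest of the discussion (purely illustrative; `Box` and `mul` are hypothetical, and the fallback to the RHS is only valid for commutative operators like `+` and `*`):

```python
class Box:
    """Toy refcounted object standing in for a float."""
    def __init__(self, value, refcnt=1):
        self.refcnt = refcnt
        self.value = value

def mul(lhs, rhs):
    # Prefer reusing lhs; fall back to rhs when only rhs is unshared.
    # Safe here only because multiplication is commutative.
    if lhs.refcnt == 1:
        lhs.value *= rhs.value
        return lhs
    if rhs.refcnt == 1:
        rhs.value *= lhs.value
        return rhs
    return Box(lhs.value * rhs.value)

two = Box(2.0, refcnt=5)   # e.g. a widely shared constant
x = Box(3.5)               # a temporary with a single reference
result = mul(two, x)       # lhs shared, rhs unshared -> rhs reused
print(result is x, result.value)  # True 7.0
```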

@brandtbucher (Member, Author) replied:

Of course, there's a cost (more code, more branches). There's also the fact that `+=` and friends steer this in the direction you're taking here.

Yeah, those are my main motivations for sticking with the LHS here.

@brandtbucher (Member, Author)

Okay, I just pushed variants of `float += float` and `float -= float` that optimize the case where `lhs` has two references: one on the stack, and one named local reference that is about to be overwritten by the next instruction. It costs one extra (fast, predictable) branch, but lets us both mutate `lhs` in place and skip the following store entirely!

This gets us back up to a 1% improvement:

Slower (15):
- crypto_pyaes: 83.5 ms +- 0.9 ms -> 85.1 ms +- 1.2 ms: 1.02x slower
- sqlite_synth: 2.71 us +- 0.04 us -> 2.74 us +- 0.07 us: 1.01x slower
- django_template: 34.8 ms +- 0.4 ms -> 35.2 ms +- 0.5 ms: 1.01x slower
- richards: 49.5 ms +- 0.7 ms -> 50.1 ms +- 0.8 ms: 1.01x slower
- pickle_list: 4.51 us +- 0.06 us -> 4.56 us +- 0.05 us: 1.01x slower
- scimark_sor: 122 ms +- 1 ms -> 123 ms +- 2 ms: 1.01x slower
- fannkuch: 385 ms +- 2 ms -> 389 ms +- 5 ms: 1.01x slower
- float: 76.1 ms +- 0.8 ms -> 76.8 ms +- 0.9 ms: 1.01x slower
- deltablue: 4.07 ms +- 0.05 ms -> 4.10 ms +- 0.05 ms: 1.01x slower
- chameleon: 7.34 ms +- 0.09 ms -> 7.39 ms +- 0.12 ms: 1.01x slower
- pickle_pure_python: 326 us +- 2 us -> 328 us +- 3 us: 1.00x slower
- regex_compile: 135 ms +- 1 ms -> 135 ms +- 1 ms: 1.00x slower
- xml_etree_generate: 79.0 ms +- 0.6 ms -> 79.2 ms +- 0.9 ms: 1.00x slower
- python_startup_no_site: 5.87 ms +- 0.00 ms -> 5.87 ms +- 0.01 ms: 1.00x slower
- python_startup: 8.23 ms +- 0.01 ms -> 8.24 ms +- 0.01 ms: 1.00x slower

Faster (25):
- scimark_sparse_mat_mult: 4.56 ms +- 0.10 ms -> 4.27 ms +- 0.04 ms: 1.07x faster
- regex_dna: 220 ms +- 1 ms -> 210 ms +- 1 ms: 1.05x faster
- spectral_norm: 99.0 ms +- 0.9 ms -> 94.8 ms +- 2.8 ms: 1.04x faster
- pyflate: 456 ms +- 5 ms -> 438 ms +- 3 ms: 1.04x faster
- scimark_lu: 116 ms +- 5 ms -> 112 ms +- 3 ms: 1.04x faster
- scimark_fft: 326 ms +- 2 ms -> 318 ms +- 2 ms: 1.03x faster
- nbody: 94.2 ms +- 4.0 ms -> 91.7 ms +- 0.7 ms: 1.03x faster
- scimark_monte_carlo: 70.9 ms +- 1.1 ms -> 69.1 ms +- 1.5 ms: 1.03x faster
- nqueens: 85.7 ms +- 0.8 ms -> 83.7 ms +- 1.0 ms: 1.02x faster
- logging_simple: 5.93 us +- 0.07 us -> 5.80 us +- 0.06 us: 1.02x faster
- mako: 11.5 ms +- 0.1 ms -> 11.3 ms +- 0.1 ms: 1.02x faster
- json_dumps: 12.6 ms +- 0.2 ms -> 12.4 ms +- 0.2 ms: 1.02x faster
- pathlib: 19.2 ms +- 0.2 ms -> 18.9 ms +- 0.3 ms: 1.02x faster
- logging_format: 6.44 us +- 0.08 us -> 6.35 us +- 0.07 us: 1.01x faster
- hexiom: 6.62 ms +- 0.05 ms -> 6.54 ms +- 0.10 ms: 1.01x faster
- chaos: 71.0 ms +- 0.6 ms -> 70.3 ms +- 0.5 ms: 1.01x faster
- json_loads: 25.3 us +- 0.8 us -> 25.0 us +- 0.3 us: 1.01x faster
- xml_etree_iterparse: 107 ms +- 1 ms -> 106 ms +- 1 ms: 1.01x faster
- unpickle: 14.0 us +- 0.2 us -> 13.9 us +- 0.2 us: 1.01x faster
- pickle_dict: 27.4 us +- 0.1 us -> 27.1 us +- 0.2 us: 1.01x faster
- go: 145 ms +- 1 ms -> 144 ms +- 1 ms: 1.01x faster
- raytrace: 307 ms +- 5 ms -> 305 ms +- 2 ms: 1.01x faster
- regex_effbot: 3.21 ms +- 0.03 ms -> 3.18 ms +- 0.04 ms: 1.01x faster
- dulwich_log: 64.9 ms +- 0.4 ms -> 64.5 ms +- 0.4 ms: 1.01x faster
- pidigits: 192 ms +- 0 ms -> 192 ms +- 0 ms: 1.00x faster

Benchmark hidden because not significant (18): 2to3, logging_silent, meteor_contest, pickle, regex_v8, sqlalchemy_declarative, sqlalchemy_imperative, sympy_expand, sympy_integrate, sympy_sum, sympy_str, telco, tornado_http, unpack_sequence, unpickle_list, unpickle_pure_python, xml_etree_parse, xml_etree_process

Geometric mean: 1.01x faster
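The two-reference case described above can be sketched in toy-model form (purely illustrative; `Box` and `add_inplace` are hypothetical, and in CPython the "next instruction stores to the same local" check is done against the actual bytecode, not passed as a flag):

```python
class Box:
    """Toy refcounted object standing in for a float."""
    def __init__(self, value, refcnt=1):
        self.refcnt = refcnt
        self.value = value

def add_inplace(lhs, rhs, next_op_stores_to_lhs_local):
    # refcnt == 1: lhs exists only on the stack -> safe to mutate.
    # refcnt == 2 *and* the next instruction overwrites the only other
    # reference (a named local) -> also safe, and the store can be skipped.
    if lhs.refcnt == 1 or (lhs.refcnt == 2 and next_op_stores_to_lhs_local):
        lhs.value += rhs.value
        return lhs
    return Box(lhs.value + rhs.value)

a = Box(1.0, refcnt=2)  # one stack reference, one in the local `a`
b = Box(0.5)
r = add_inplace(a, b, next_op_stores_to_lhs_local=True)
print(r is a, r.value)  # True 1.5 -- mutated in place, store skippable

shared = Box(1.0, refcnt=2)  # second reference is NOT about to die
r2 = add_inplace(shared, b, next_op_stores_to_lhs_local=False)
print(r2 is shared)          # False -- fresh object required
```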

@gvanrossum (Member)

Seems you have a merge conflict. Also, Mark should probably re-review this?


@markshannon (Member) left a comment (edited)

Not a fan of the big macro. Otherwise, it looks good.

```c
};

// NOTE: INPLACE must be a compile-time constant to avoid runtime branching!
#define BINARY_OP_FAST_FLOAT(OP, INPLACE) \
```
Review comment (Member):

I'm not a fan of giant macros, and I don't like the hidden compile-time control flow.
Given that the inplace and non-inplace specialized forms differ in a fundamental way, I'd like that to be explicit in the instructions themselves.

Maybe have two different macros, one for the inplace form and one for the non-inplace form.
You should be able to factor out some of the common code into helper macros.
That might be clearer. Up to you.

@brandtbucher (Member, Author) replied:

Makes sense. I'll go ahead and split it up.

@iritkatriel (Member) left a comment

This has merge conflicts now.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.


Reviewers

@markshannon requested changes

@sweeneyde left review comments

@iritkatriel requested changes

@gvanrossum approved these changes

Assignees

No one assigned

Labels

awaiting changes, performance (Performance or resource usage)

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

8 participants

@brandtbucher, @sweeneyde, @bedevere-bot, @markshannon, @gvanrossum, @iritkatriel, @the-knights-who-say-ni, @ezio-melotti
