GH-98686: Quicken everything #98687


Merged
brandtbucher merged 12 commits into python:main from brandtbucher:quicken-everything
Nov 2, 2022

Conversation

@brandtbucher (Member) commented Oct 25, 2022 (edited by bedevere-bot)

@brandtbucher added the performance (Performance or resource usage) and interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Oct 25, 2022
@brandtbucher self-assigned this Oct 25, 2022
@brandtbucher (Member, Author) commented:

1% faster:

Slower (10):

- xml_etree_iterparse: 102 ms +- 2 ms -> 106 ms +- 2 ms: 1.04x slower
- python_startup_no_site: 6.09 ms +- 0.01 ms -> 6.29 ms +- 0.01 ms: 1.03x slower
- spectral_norm: 94.8 ms +- 3.9 ms -> 96.7 ms +- 2.0 ms: 1.02x slower
- python_startup: 8.47 ms +- 0.01 ms -> 8.63 ms +- 0.01 ms: 1.02x slower
- chameleon: 6.54 ms +- 0.06 ms -> 6.63 ms +- 0.07 ms: 1.01x slower
- nqueens: 79.9 ms +- 1.1 ms -> 81.0 ms +- 0.9 ms: 1.01x slower
- pickle: 10.3 us +- 0.1 us -> 10.4 us +- 0.1 us: 1.01x slower
- json: 4.66 ms +- 0.10 ms -> 4.71 ms +- 0.10 ms: 1.01x slower
- regex_dna: 204 ms +- 1 ms -> 205 ms +- 1 ms: 1.01x slower
- regex_compile: 129 ms +- 1 ms -> 130 ms +- 0 ms: 1.00x slower

Faster (42):

- richards: 45.2 ms +- 0.7 ms -> 42.2 ms +- 0.8 ms: 1.07x faster
- regex_v8: 22.5 ms +- 0.1 ms -> 21.3 ms +- 0.1 ms: 1.06x faster
- pidigits: 198 ms +- 0 ms -> 189 ms +- 0 ms: 1.05x faster
- go: 141 ms +- 2 ms -> 135 ms +- 1 ms: 1.05x faster
- scimark_sor: 108 ms +- 2 ms -> 104 ms +- 1 ms: 1.04x faster
- mdp: 2.55 sec +- 0.01 sec -> 2.48 sec +- 0.01 sec: 1.03x faster
- pyflate: 410 ms +- 3 ms -> 398 ms +- 3 ms: 1.03x faster
- deepcopy_reduce: 2.95 us +- 0.05 us -> 2.86 us +- 0.04 us: 1.03x faster
- scimark_lu: 111 ms +- 2 ms -> 108 ms +- 2 ms: 1.03x faster
- pickle_list: 4.21 us +- 0.05 us -> 4.11 us +- 0.04 us: 1.03x faster
- deltablue: 3.40 ms +- 0.04 ms -> 3.32 ms +- 0.03 ms: 1.02x faster
- scimark_fft: 322 ms +- 2 ms -> 315 ms +- 4 ms: 1.02x faster
- genshi_text: 21.6 ms +- 0.2 ms -> 21.2 ms +- 0.2 ms: 1.02x faster
- coroutines: 25.0 ms +- 0.3 ms -> 24.5 ms +- 0.3 ms: 1.02x faster
- xml_etree_process: 54.2 ms +- 0.8 ms -> 53.2 ms +- 0.6 ms: 1.02x faster
- xml_etree_generate: 77.0 ms +- 1.2 ms -> 75.6 ms +- 0.7 ms: 1.02x faster
- deepcopy: 334 us +- 6 us -> 328 us +- 3 us: 1.02x faster
- django_template: 33.1 ms +- 0.3 ms -> 32.5 ms +- 0.3 ms: 1.02x faster
- tornado_http: 94.2 ms +- 1.4 ms -> 92.6 ms +- 1.4 ms: 1.02x faster
- pycparser: 1.10 sec +- 0.02 sec -> 1.08 sec +- 0.02 sec: 1.02x faster
- thrift: 756 us +- 14 us -> 745 us +- 9 us: 1.02x faster
- pickle_dict: 31.3 us +- 0.2 us -> 30.9 us +- 0.2 us: 1.01x faster
- fannkuch: 381 ms +- 4 ms -> 376 ms +- 6 ms: 1.01x faster
- sqlalchemy_declarative: 134 ms +- 3 ms -> 133 ms +- 3 ms: 1.01x faster
- async_tree_cpu_io_mixed: 743 ms +- 20 ms -> 734 ms +- 16 ms: 1.01x faster
- pickle_pure_python: 288 us +- 2 us -> 284 us +- 4 us: 1.01x faster
- logging_simple: 5.85 us +- 0.09 us -> 5.79 us +- 0.09 us: 1.01x faster
- logging_format: 6.49 us +- 0.07 us -> 6.42 us +- 0.08 us: 1.01x faster
- sympy_str: 283 ms +- 3 ms -> 280 ms +- 3 ms: 1.01x faster
- sqlglot_normalize: 105 ms +- 1 ms -> 104 ms +- 1 ms: 1.01x faster
- 2to3: 251 ms +- 1 ms -> 248 ms +- 1 ms: 1.01x faster
- mako: 9.76 ms +- 0.05 ms -> 9.67 ms +- 0.04 ms: 1.01x faster
- pathlib: 17.7 ms +- 0.3 ms -> 17.5 ms +- 0.2 ms: 1.01x faster
- gunicorn: 1.08 ms +- 0.01 ms -> 1.07 ms +- 0.01 ms: 1.01x faster
- sqlite_synth: 2.62 us +- 0.06 us -> 2.60 us +- 0.04 us: 1.01x faster
- sympy_expand: 457 ms +- 3 ms -> 454 ms +- 7 ms: 1.01x faster
- regex_effbot: 3.52 ms +- 0.01 ms -> 3.50 ms +- 0.01 ms: 1.01x faster
- dulwich_log: 61.2 ms +- 0.3 ms -> 60.9 ms +- 0.3 ms: 1.01x faster
- sqlglot_optimize: 51.5 ms +- 0.4 ms -> 51.2 ms +- 0.3 ms: 1.01x faster
- nbody: 95.8 ms +- 0.8 ms -> 95.4 ms +- 0.9 ms: 1.00x faster
- generators: 72.7 ms +- 0.2 ms -> 72.4 ms +- 0.3 ms: 1.00x faster
- sympy_integrate: 20.4 ms +- 0.1 ms -> 20.4 ms +- 0.1 ms: 1.00x faster

Benchmark hidden because not significant (33): aiohttp, async_tree_none, async_tree_io, async_tree_memoization, chaos, coverage, crypto_pyaes, deepcopy_memo, float, genshi_xml, hexiom, html5lib, json_dumps, json_loads, logging_silent, meteor_contest, mypy, pprint_safe_repr, pprint_pformat, pylint, raytrace, scimark_monte_carlo, scimark_sparse_mat_mult, sqlalchemy_imperative, sqlglot_parse, sqlglot_transpile, sympy_sum, telco, unpack_sequence, unpickle, unpickle_list, unpickle_pure_python, xml_etree_parse

Geometric mean: 1.01x faster

@markshannon (Member) commented:

I notice that startup (both with and without -S) is slower. Presumably that is the overhead of calling _PyCode_Quicken() all the time.
Presumably all that overhead will go away when we move quickening to the bytecode compiler.

@markshannon (Member) left a comment:

The original design made a distinction between the initial adaptive counter (which was zero as there was already a delay for quickening), and the first value for the backoff counter (15).

This PR loses that distinction.

Maybe we don't need that distinction, since we already have to hit 53 misses before falling back to the adaptive form. Maybe we should retry specialization almost immediately and rely on the exponential backoff in case of repeated failures?

How about setting the initial value to 3 (or 5, or 7) to avoid specializing too aggressively at startup?
We can reuse the same value, as it makes little difference once misses are taken into account (53 + 7 is near enough to 53 + 15).

One final point (not a fault of this PR, but relevant): in adaptive_counter_start(), the counter is set to (2**backoff - 1); there is no reason for this.
We could set the counter to 3 or 5 and the backoff to 4.

@brandtbucher thoughts?
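The counter/backoff interplay being discussed can be sketched in a few lines of Python. This is a rough model of the scheme, not the actual CPython C code; MAX_BACKOFF and the helper names here are illustrative assumptions:

```python
# Rough Python model (not the CPython C implementation) of the adaptive
# counter scheme under discussion. MAX_BACKOFF and the function names are
# illustrative assumptions, not CPython's actual values.

MAX_BACKOFF = 6  # assumed cap on the backoff exponent


def counter_start(backoff):
    # The point being raised: the counter is set to 2**backoff - 1,
    # which couples the retry delay to the backoff exponent.
    return 2 ** backoff - 1


def after_failed_specialization(backoff):
    """Return (new_counter, new_backoff) after a failed specialization."""
    backoff = min(backoff + 1, MAX_BACKOFF)
    return counter_start(backoff), backoff


# Starting from backoff=1, repeated failures wait 1, 3, 7, 15, ... runs
# before the next specialization attempt.
delays = []
backoff = 1
counter = counter_start(backoff)
for _ in range(4):
    delays.append(counter)
    counter, backoff = after_failed_specialization(backoff)
print(delays)  # -> [1, 3, 7, 15]
```

Decoupling the two values, as suggested above, would mean `counter_start` no longer has to be an exact power of two minus one.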

/* The initial counter value is 31 == 2**ADAPTIVE_BACKOFF_START - 1 */
#define ADAPTIVE_BACKOFF_START 5
/* The initial counter value is 1 == 2**ADAPTIVE_BACKOFF_START - 1 */
#define ADAPTIVE_BACKOFF_START 1
@markshannon (Member):

This is for backoff, after failure. Do you want to change this, or just lower the initial counter value?

uint8_t adaptive_opcode = _PyOpcode_Adaptive[opcode];
if (adaptive_opcode) {
    _Py_SET_OPCODE(instructions[i], adaptive_opcode);
    // Make sure the adaptive counter is zero:
    assert(instructions[i + 1] == 0);
    instructions[i + 1] = adaptive_counter_start();
@markshannon (Member):

I don't think this is what you want. adaptive_counter_start() is supposed to be used after resetting to the adaptive form after repeated misses.

Maybe rename adaptive_counter_start to adaptive_initial_backoff_value() to avoid future confusion?

What values did you try for the initial counter value?
1 seems low, as we might be spending too much effort specializing startup code.

@bedevere-bot

When you're done making the requested changes, leave the comment: I have made the requested changes; please review again.

@brandtbucher (Member, Author) commented:

Recapping our discussion on Monday:

An initial value of 1 seems to work best (1% faster) out of all the values I tried (0, 1, 3, 7, 15, 31, 63, 127, 255, 511, 1023, 2047, and 4095). With a value of 1, we specialize the second time each adaptive instruction is run. This makes sense: running twice is probably a much better indicator of "hotness" than running once, but additional warmup delays only prevent specialization. Most types stabilize by the second loop iteration, too.

It's not strictly necessary to use the same value for the initial warmup delay and the initial backoff value, but 1 seems to work well for both. We can always tweak/rename one of them later if we want.

markshannon reacted with thumbs up emoji
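The effect of the initial value described in the recap above can be seen with a tiny model. This is a sketch of the counter scheme under stated assumptions, not the exact C implementation: an adaptive instruction decrements its counter on each execution and only attempts to specialize once the counter reaches zero.

```python
# Tiny model of warmup: an adaptive instruction decrements its counter each
# time it executes and tries to specialize when the counter reaches zero.
# This is an illustrative sketch, not the exact CPython implementation.

def runs_until_specialization(initial_counter):
    counter = initial_counter
    runs = 0
    while True:
        runs += 1          # the instruction executes once
        if counter == 0:   # counter exhausted: attempt specialization
            return runs
        counter -= 1

# With an initial counter of 1, specialization happens on the *second* run.
print(runs_until_specialization(1))   # -> 2
print(runs_until_specialization(0))   # -> 1 (specialize immediately)
print(runs_until_specialization(31))  # -> 32
```

Under this model, an initial counter of 0 would attempt specialization on the very first execution, which is why code that runs only once (e.g. at startup) makes 0 too aggressive a choice.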

@brandtbucher added the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Nov 1, 2022
@bedevere-bot

🤖 New build scheduled with the buildbot fleet by @brandtbucher for commit 17292a3 🤖

If you want to schedule another build, you need to add the ":hammer: test-with-buildbots" label again.

@bedevere-bot removed the 🔨 test-with-buildbots (Test PR w/ buildbots; report in status section) label Nov 1, 2022
@markshannon self-requested a review November 2, 2022 15:03
Reviewers

@markshannon approved these changes

Assignees

@brandtbucher

Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs), performance (Performance or resource usage)

3 participants
@brandtbucher, @markshannon, @bedevere-bot
