Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only)#146307

Open
eendebakpt wants to merge 3 commits intopython:mainfrom
eendebakpt:jit_float_inplace_rebased
Open

gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only)#146307
eendebakpt wants to merge 3 commits intopython:mainfrom
eendebakpt:jit_float_inplace_rebased

Conversation

@eendebakpt
Copy link
Contributor

@eendebakpteendebakpt commentedMar 22, 2026
edited
Loading

We can add the following tier 2 micro-ops that mutate the uniquely-referenced operand:

  • _BINARY_OP_ADD_FLOAT_INPLACE /_INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_SUBTRACT_FLOAT_INPLACE /_INPLACE_RIGHT — unique LHS / RHS
  • _BINARY_OP_MULTIPLY_FLOAT_INPLACE /_INPLACE_RIGHT — unique LHS / RHS
  • _UNARY_NEGATIVE_FLOAT_INPLACE — unique operand

The_RIGHT variants handle commutative ops (add, multiply) plus subtract when only the RHS is unique. The optimizer emits these inoptimizer_bytecodes.c whenPyJitRef_IsUnique(left) orPyJitRef_IsUnique(right) is true and the operand is a known float. The mutated operand is marked as borrowed so the following_POP_TOP becomes_POP_TOP_NOP.

Micro-benchmarks:

ExpressionmainoptimizedSpeedup
total += a*b + c24.0 ns/iter11.5 ns/iter2.1x
total += a + b16.9 ns/iter11.0 ns/iter1.5x
total += a*b + c*d28.5 ns/iter18.3 ns/iter1.6x

pyperformance nbody (20k iterations):

mainoptimizedSpeedup
nbody60.6 ms49.0 ms1.19x (19% faster)

Followup

Some operations that could be added in followup PR's: division of floats, operations on a float and an int with a uniquely referenced float, integer operations (but this is move involved because of small ints and number of digits), operations with complex numbers.

Script
"""Demo script for the inplace float mutation optimization in the tier 2 optimizer../configure --enable-experimental-jit=interpreter --with-pydebug./configure --enable-experimental-jit=yes --with-pydebugUsage:    ./python jit_float_demo.py           # filtered trace (float-related ops)    ./python jit_float_demo.py --all     # full trace (all ops)"""import sysimport timeitSHOW_ALL = "--all" in sys.argv# --- System info ---print("=" * 60)print("CPython JIT / Tier 2 Demo")print("=" * 60)print(f"Python version:  {sys.version}")print(f"Debug build:     {hasattr(sys, 'gettotalrefcount')}")print(f"Free-threaded:   {not sys._is_gil_enabled()}")jit_mod = getattr(sys, "_jit", None)if jit_mod is not None:    print(f"JIT available:   {jit_mod.is_available()}")    print(f"JIT enabled:     {jit_mod.is_enabled()}")else:    print("JIT:             not compiled in")tier2 = Falsetry:    from _testinternalcapi import TIER2_THRESHOLD    tier2 = True    print(f"Tier 2:          enabled (threshold={TIER2_THRESHOLD})")except (ImportError, AttributeError):    print("Tier 2:          disabled (build without _Py_TIER2)")print()# --- Example functions ---def f_adds(n, a, b, c):    """a*b + c per iteration — the multiply result is unique and reused."""    total = 0.0    for i in range(n):        total = a + b + c    return totaldef f_chain(n, a, b, c):    """a*b + c per iteration — the multiply result is unique and reused."""    total = 0.0    for i in range(n):        total += a * b + c    return totaldef f_simple_add(n, a, b):    """a + b per iteration — no unique intermediate."""    total = 0.0    for i in range(n):        total += a + b    return totaldef f_long_chain(n, a, b, c, d):    """a*b + c*d per iteration — two unique intermediates."""    total = 0.0    for i in range(n):        total += a * b + c * d    return totaldef f_negate(n, a, b):    """a*b + c*d per iteration — two unique intermediates."""    total = 0.0    for i in range(n):        total = - (a + b)    return total# --- Warm up to trigger tier 2 ---LOOP = 10_000f_adds(LOOP, 2.0, 3.0, 4.0)f_chain(LOOP, 2.0, 3.0, 4.0)f_simple_add(LOOP, 2.0, 3.0)f_long_chain(LOOP, 2.0, 3.0, 4.0, 5.0)# --- Op annotation ---def annotate_op(name, oparg, func):    """Return a human-readable annotation for a uop."""    varnames = func.__code__.co_varnames    consts = func.__code__.co_consts    # _LOAD_FAST_BORROW_3 → local index 3    for prefix in ("_LOAD_FAST_BORROW_", "_LOAD_FAST_", "_SWAP_FAST_"):        if name.startswith(prefix):            idx = int(name[len(prefix):])            local = varnames[idx] if idx < len(varnames) else f"local{idx}"            return local    # _LOAD_CONST_INLINE_BORROW etc — operand is a pointer, not useful    # but if oparg is a small index into consts, show it    if "LOAD_CONST" in name and oparg < len(consts):        return repr(consts[oparg])    # Binary ops    if "MULTIPLY" in name:        return "*"    if "SUBTRACT" in name:        return "-"    if "ADD" in name and "UNICODE" not in name:        return "+"    # Guards    if name == "_GUARD_TOS_FLOAT":        return "top is float?"    if name == "_GUARD_NOS_FLOAT":        return "2nd is float?"    if name == "_GUARD_TOS_INT":        return "top is int?"    if name == "_GUARD_NOS_INT":        return "2nd is int?"    if "NOT_EXHAUSTED" in name:        return "iter not done?"    # Pop / cleanup    if name == "_POP_TOP_NOP":        return "skip (borrowed/null)"    if name == "_POP_TOP_FLOAT":        return "decref float"    if name == "_POP_TOP_INT":        return "decref int"    # Control flow    if name == "_JUMP_TO_TOP":        return "loop"    if name == "_EXIT_TRACE":        return "exit"    if name == "_DEOPT":        return "deoptimize"    if name == "_ERROR_POP_N":        return "error handler"    if name == "_START_EXECUTOR":        return "trace entry"    if name == "_MAKE_WARM":        return "warmup counter"    return ""# --- Show traces ---FILTER_KEYWORDS = (    "FLOAT", "INPLACE", "BINARY_OP", "NOP",    "LOAD_FAST", "LOAD_CONST", "GUARD",)has_get_executor = Falseif tier2:    try:        from _opcode import get_executor        has_get_executor = True    except ImportError:        passif has_get_executor:    mode = "all ops" if SHOW_ALL else "float-related ops only"    print("-" * 60)    print(f"Tier 2 traces ({mode})")    print("-" * 60)    for label, func in [        ("f_adds: total = a + b + c", f_adds),        ("f_chain: total += a * b + c", f_chain),        ("f_simple_add: total += a + b", f_simple_add),        ("f_long_chain: total += a * b + c * d", f_long_chain),    ]:        code = func.__code__        found = False        for i in range(len(code.co_code) // 2):            try:                ex = get_executor(code, i * 2)            except (ValueError, TypeError, RuntimeError):                continue            if ex is None:                continue            print(f"\n  {label}")            for j, op in enumerate(ex):                name, oparg = op[0], op[1]                if not SHOW_ALL:                    if not any(k in name for k in FILTER_KEYWORDS):                        continue                annotation = annotate_op(name, oparg, func)                marker = " <<<" if "INPLACE" in name else ""                if annotation:                    print(f"    {j:3d}: {name:45s} # {annotation}{marker}")                else:                    print(f"    {j:3d}: {name}{marker}")            found = True            break        if not found:            print(f"\n  {label}: (no executor found)")    print()else:    print("-" * 60)    print("Tier 2 traces: skipped (tier 2 not available)")    print("-" * 60)    print()# --- Benchmark ---print("-" * 60)print("Benchmark")print("-" * 60)N = 2_000_000INNER = 1000benchmarks = [    ("total = a + b + c", lambda: f_adds(INNER, 2.0, 3.0, 4.0)),    ("total += a*b + c  ", lambda: f_chain(INNER, 2.0, 3.0, 4.0)),    ("total += a + b    ", lambda: f_simple_add(INNER, 2.0, 3.0)),    ("total += a*b + c*d", lambda: f_long_chain(INNER, 2.0, 3.0, 4.0, 5.0)),    ("total = - (a + b)    ", lambda: f_negate(INNER, 2.0, 3.0)),]for label, fn in benchmarks:    iters = N // INNER    t = timeit.timeit(fn, number=iters)    ns_per = t / N * 1e9    print(f"  {label}:  {t:.3f}s  ({ns_per:.0f} ns/iter)")print()print("The 'a*b + c' case benefits from _BINARY_OP_ADD_FLOAT_INPLACE:")print("the result of a*b is uniquely referenced, so the addition")print("mutates it in place instead of allocating a new float.")# --- N-body benchmark from pyperformance ---print()print("-" * 60)print("N-body benchmark (from pyperformance)")print("-" * 60)PI = 3.14159265358979323SOLAR_MASS = 4 * PI * PIDAYS_PER_YEAR = 365.24def _nbody_make_system():    bodies = [        # sun        ([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], SOLAR_MASS),        # jupiter        ([4.84143144246472090e+00, -1.16032004402742839e+00, -1.03622044471123109e-01],         [1.66007664274403694e-03*DAYS_PER_YEAR, 7.69901118419740425e-03*DAYS_PER_YEAR, -6.90460016972063023e-05*DAYS_PER_YEAR],         9.54791938424326609e-04*SOLAR_MASS),        # saturn        ([8.34336671824457987e+00, 4.12479856412430479e+00, -4.03523417114321381e-01],         [-2.76742510726862411e-03*DAYS_PER_YEAR, 4.99852801234917238e-03*DAYS_PER_YEAR, 2.30417297573763929e-05*DAYS_PER_YEAR],         2.85885980666130812e-04*SOLAR_MASS),        # uranus        ([1.28943695621391310e+01, -1.51111514016986312e+01, -2.23307578892655734e-01],         [2.96460137564761618e-03*DAYS_PER_YEAR, 2.37847173959480950e-03*DAYS_PER_YEAR, -2.96589568540237556e-05*DAYS_PER_YEAR],         4.36624404335156298e-05*SOLAR_MASS),        # neptune        ([1.53796971148509165e+01, -2.59193146099879641e+01, 1.79258772950371181e-01],         [2.68067772490389322e-03*DAYS_PER_YEAR, 1.62824170038242295e-03*DAYS_PER_YEAR, -9.51592254519715870e-05*DAYS_PER_YEAR],         5.15138902046611451e-05*SOLAR_MASS),    ]    pairs = []    for x in range(len(bodies) - 1):        for y in bodies[x + 1:]:            pairs.append((bodies[x], y))    return bodies, pairsdef _nbody_advance(dt, n, bodies, pairs):    for i in range(n):        for (([x1, y1, z1], v1, m1), ([x2, y2, z2], v2, m2)) in pairs:            dx = x1 - x2            dy = y1 - y2            dz = z1 - z2            mag = dt * ((dx * dx + dy * dy + dz * dz) ** (-1.5))            b1m = m1 * mag            b2m = m2 * mag            v1[0] -= dx * b2m            v1[1] -= dy * b2m            v1[2] -= dz * b2m            v2[0] += dx * b1m            v2[1] += dy * b1m            v2[2] += dz * b1m        for (r, [vx, vy, vz], m) in bodies:            r[0] += dt * vx            r[1] += dt * vy            r[2] += dt * vzdef _nbody_report_energy(bodies, pairs, e=0.0):    for (((x1, y1, z1), v1, m1), ((x2, y2, z2), v2, m2)) in pairs:        dx = x1 - x2        dy = y1 - y2        dz = z1 - z2        e -= (m1 * m2) / ((dx * dx + dy * dy + dz * dz) ** 0.5)    for (r, [vx, vy, vz], m) in bodies:        e += m * (vx * vx + vy * vy + vz * vz) / 2.    return edef _nbody_offset_momentum(ref, bodies, px=0.0, py=0.0, pz=0.0):    for (r, [vx, vy, vz], m) in bodies:        px -= vx * m        py -= vy * m        pz -= vz * m    (r, v, m) = ref    v[0] = px / m    v[1] = py / m    v[2] = pz / mdef bench_nbody(iterations=20000):    bodies, pairs = _nbody_make_system()    _nbody_offset_momentum(bodies[0], bodies)    _nbody_report_energy(bodies, pairs)    _nbody_advance(0.01, iterations, bodies, pairs)    _nbody_report_energy(bodies, pairs)# Warmupbench_nbody(1000)NBODY_RUNS = 5t = timeit.timeit(lambda: bench_nbody(20000), number=NBODY_RUNS)print(f"  nbody (20k iterations, {NBODY_RUNS} runs): {t/NBODY_RUNS*1000:.1f} ms/run")

corona10 and YukiNagat0 reacted with thumbs up emoji
…y-referenced operands in placeWhen the tier 2 optimizer can prove that an operand to a floatoperation is uniquely referenced (refcount 1), mutate it in placeinstead of allocating a new PyFloatObject.New tier 2 micro-ops:- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE (unique LHS)- _BINARY_OP_{ADD,SUBTRACT,MULTIPLY}_FLOAT_INPLACE_RIGHT (unique RHS)- _UNARY_NEGATIVE_FLOAT_INPLACE (unique operand)Speeds up the pyperformance nbody benchmark by ~19%.Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Avoid compound assignment (+=, -=, *=) directly on ob_fval ininplace float ops. On 32-bit Windows, this generates JIT stencilswith _xmm register references that MSVC cannot parse. Instead,read into a local double, compute, and write back.Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add -mno-sse to clang args for i686-pc-windows-msvc target. The COFF32stencil converter cannot handle _xmm register references that clangemits for inline float arithmetic. Using x87 FPU instructions avoidsthis. SSE is optional on 32-bit x86; x87 is the baseline.Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@eendebakpteendebakpt changed the titleDraft: gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only)gh-146306: Optimize float operations by mutating uniquely-referenced operands in place (JIT only)Mar 22, 2026
@eendebakpt
Copy link
ContributorAuthor

Selected pyperformance benchmarks:

### chaos ###Mean +- std dev: 60.7 ms +- 1.5 ms -> 59.5 ms +- 1.5 ms: 1.02x fasterNot significant### fannkuch ###Mean +- std dev: 376 ms +- 4 ms -> 375 ms +- 3 ms: 1.00x fasterNot significant### float ###Mean +- std dev: 60.9 ms +- 3.0 ms -> 58.8 ms +- 2.7 ms: 1.04x fasterSignificant (t=4.00)### nbody ###Mean +- std dev: 87.2 ms +- 1.1 ms -> 72.3 ms +- 0.4 ms: 1.21x fasterSignificant (t=98.34)### raytrace ###Mean +- std dev: 304 ms +- 5 ms -> 299 ms +- 5 ms: 1.02x fasterNot significant### scimark_fft ###Mean +- std dev: 305 ms +- 1 ms -> 297 ms +- 3 ms: 1.03x fasterSignificant (t=19.01)### scimark_lu ###Mean +- std dev: 86.5 ms +- 0.4 ms -> 84.8 ms +- 0.4 ms: 1.02x fasterNot significant### scimark_monte_carlo ###Mean +- std dev: 60.7 ms +- 0.7 ms -> 59.8 ms +- 0.3 ms: 1.02x fasterNot significant### scimark_sor ###Mean +- std dev: 101 ms +- 1 ms -> 99 ms +- 3 ms: 1.03x fasterSignificant (t=7.38)### scimark_sparse_mat_mult ###Mean +- std dev: 5.53 ms +- 0.02 ms -> 5.14 ms +- 0.02 ms: 1.08x fasterSignificant (t=130.74)### spectral_norm ###Mean +- std dev: 84.0 ms +- 0.6 ms -> 79.0 ms +- 0.4 ms: 1.06x fasterSignificant (t=53.73)

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@markshannonmarkshannonAwaiting requested review from markshannonmarkshannon is a code owner

@tomasr8tomasr8Awaiting requested review from tomasr8tomasr8 is a code owner

@Fidget-SpinnerFidget-SpinnerAwaiting requested review from Fidget-SpinnerFidget-Spinner is a code owner

@savannahostrowskisavannahostrowskiAwaiting requested review from savannahostrowskisavannahostrowski is a code owner

@brandtbucherbrandtbucherAwaiting requested review from brandtbucherbrandtbucher is a code owner

@diegorussodiegorussoAwaiting requested review from diegorussodiegorusso is a code owner

Assignees

No one assigned

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

2 participants

@eendebakpt@savannahostrowski

[8]ページ先頭

©2009-2026 Movatter.jp