Summary
MicroPython supports the standardsys.settrace() facility, to trace bytecode execution. That feature is enabled throughMICROPY_PY_SYS_SETTRACE and is disabled by default because it really slows the VM down, and increases memory use in many places due to the overhead of tracking the state of functions.
This PR adds a new optionMICROPY_PY_SYS_SETTRACE_DUAL_VM that, when enabled, duplicates the VM, themp_execute_bytecode() function:
- One version is called
mp_execute_bytecode_standard() and has settrace disabled. - The other version is called
mp_execute_bytecode_tracing() and has settrace enabled.
Then a wrapper functionmp_execute_bytecode() calls into the appropriate VM depending on whether the user has enabled a settrace callback or not.
The result is:
- An increase of about 5k in firmware size (on ARM Cortex-M) when enabling the dual-VM (that's on top of having settrace already enabled).
- Almost no performance is lost if a settrace callback is not registered.
- Full support for settrace if needed.
This is useful for, eg, webassembly where PyScript wants to havesys.settrace enabled, and can afford to have a dual-VM in order not to lose performance with settrace enabled.
This PR also includes some minor fixes for the webassembly port to be able to run the performance benchmarks on that port.
Testing
Tested by running the standard test suite on the unix and webassembly ports, which also have dual-VM enabled in CI.
On the webassembly port, theperf_bench/misc_aes.py performance test runs 200 times faster (!) with dual-VM enabled, compared to just enabling settrace without the dual-VM. So it's a very big win there.
On PYBV10 the metrics are:
- code size increases by 14.5k with
MICROPY_PY_SYS_SETTRACE enabled - code size increases by an extra 5.3k with
MICROPY_PY_SYS_SETTRACE_DUAL_VM enabled as well - performance is roughly halved when enabling settrace
- performance is pretty much all regained when enabling dual-VM
The performance measurements on PYBV10 in detail
EnablingMICROPY_PY_SYS_SETTRACE:
diff of scores (higher is better)N=100 M=100 perf-pyb-std -> perf-pyb-settrace diff diff% (error%)bm_chaos.py 356.93 -> 261.95 : -94.98 = -26.610% (+/-0.01%)bm_fannkuch.py 72.90 -> 61.28 : -11.62 = -15.940% (+/-0.00%)bm_fft.py 2331.03 -> 1681.29 : -649.74 = -27.874% (+/-0.02%)bm_hexiom.py 45.75 -> 13.35 : -32.40 = -70.820% (+/-0.00%)bm_nqueens.py 4111.33 -> 2695.70 : -1415.63 = -34.432% (+/-0.00%)bm_pidigits.py 663.16 -> 606.83 : -56.33 = -8.494% (+/-0.02%)bm_wordcount.py 47.36 -> 39.05 : -8.31 = -17.546% (+/-0.01%)core_import_mpy_multi.py 640.86 -> 483.22 : -157.64 = -24.598% (+/-0.01%)core_import_mpy_single.py 105.97 -> 84.55 : -21.42 = -20.213% (+/-0.01%)core_locals.py 40.28 -> 32.41 : -7.87 = -19.538% (+/-0.00%)core_qstr.py 204.36 -> 192.74 : -11.62 = -5.686% (+/-0.00%)core_str.py 25.86 -> 22.55 : -3.31 = -12.800% (+/-0.00%)core_yield_from.py 355.61 -> 3.84 : -351.77 = -98.920% (+/-0.00%)misc_aes.py 407.85 -> 32.12 : -375.73 = -92.125% (+/-0.01%)misc_mandel.py 3166.90 -> 2791.44 : -375.46 = -11.856% (+/-0.01%)misc_pystone.py 2378.74 -> 279.43 : -2099.31 = -88.253% (+/-0.00%)misc_raytrace.py 361.02 -> 256.57 : -104.45 = -28.932% (+/-0.01%)viper_call0.py 499.49 -> 537.73 : +38.24 = +7.656% (+/-0.00%)viper_call1a.py 522.73 -> 514.72 : -8.01 = -1.532% (+/-0.01%)viper_call1b.py 379.77 -> 385.81 : +6.04 = +1.590% (+/-0.00%)viper_call1c.py 384.11 -> 389.38 : +5.27 = +1.372% (+/-0.01%)viper_call2a.py 510.03 -> 502.42 : -7.61 = -1.492% (+/-0.00%)viper_call2b.py 335.06 -> 341.81 : +6.75 = +2.015% (+/-0.01%)
Tests that are heavy on the VM (misc_aes.py andmisc_pystone.py) run at about half the speed with settrace enabled.
Enabling bothMICROPY_PY_SYS_SETTRACE andMICROPY_PY_SYS_SETTRACE_DUAL_VM (comparing to settrace completely disabled):
diff of scores (higher is better)N=100 M=100 perf-pyb-std -> perf-pyb-settrace-dual-vm diff diff% (error%)bm_chaos.py 356.93 -> 353.71 : -3.22 = -0.902% (+/-0.01%)bm_fannkuch.py 72.90 -> 72.78 : -0.12 = -0.165% (+/-0.00%)bm_fft.py 2331.03 -> 2320.26 : -10.77 = -0.462% (+/-0.00%)bm_float.py 5713.01 -> 5607.72 : -105.29 = -1.843% (+/-0.02%)bm_hexiom.py 45.75 -> 45.06 : -0.69 = -1.508% (+/-0.00%)bm_nqueens.py 4111.33 -> 4035.90 : -75.43 = -1.835% (+/-0.00%)bm_pidigits.py 663.16 -> 639.93 : -23.23 = -3.503% (+/-0.02%)bm_wordcount.py 47.36 -> 46.71 : -0.65 = -1.372% (+/-0.00%)core_import_mpy_multi.py 640.86 -> 608.33 : -32.53 = -5.076% (+/-0.01%)core_import_mpy_single.py 105.97 -> 101.29 : -4.68 = -4.416% (+/-0.01%)core_locals.py 40.28 -> 39.92 : -0.36 = -0.894% (+/-0.00%)core_qstr.py 204.36 -> 202.55 : -1.81 = -0.886% (+/-0.03%)core_str.py 25.86 -> 25.71 : -0.15 = -0.580% (+/-0.00%)core_yield_from.py 355.61 -> 328.12 : -27.49 = -7.730% (+/-0.01%)misc_aes.py 407.85 -> 398.19 : -9.66 = -2.369% (+/-0.01%)misc_mandel.py 3166.90 -> 3148.39 : -18.51 = -0.584% (+/-0.01%)misc_pystone.py 2378.74 -> 2336.25 : -42.49 = -1.786% (+/-0.00%)misc_raytrace.py 361.02 -> 357.93 : -3.09 = -0.856% (+/-0.01%)viper_call0.py 499.49 -> 495.08 : -4.41 = -0.883% (+/-0.01%)viper_call1a.py 522.73 -> 514.77 : -7.96 = -1.523% (+/-0.00%)viper_call1b.py 379.77 -> 374.68 : -5.09 = -1.340% (+/-0.01%)viper_call1c.py 384.11 -> 378.07 : -6.04 = -1.572% (+/-0.00%)viper_call2a.py 510.03 -> 502.42 : -7.61 = -1.492% (+/-0.01%)viper_call2b.py 335.06 -> 332.42 : -2.64 = -0.788% (+/-0.00%)
Performance is almost all regained with dual-VM enabled.
Trade-offs and Alternatives
This obviously duplicates the VM, but I tried to make it as simple as possible in the implementation so that it's easy to maintain. In the future this mechanism could potentially be extended to optimise for other things, although I can't think of any obvious candidates for that.
An alternative to having a dual-VM could be to try and optimise the internal VM macrosFRAME_UPDATE andTRACE_TICK. Although that could save code size, no matter how well these macros are optimised it won't be able to reach the performance of a dual-VM architecture.
I'm not sure exactly how CPython implements efficientsys.settrace these days (they've changed a lot recently with their VM), but could look there for alternative ideas as well.
Summary
MicroPython supports the standard
sys.settrace()facility, to trace bytecode execution. That feature is enabled throughMICROPY_PY_SYS_SETTRACEand is disabled by default because it really slows the VM down, and increases memory use in many places due to the overhead of tracking the state of functions.This PR adds a new option
MICROPY_PY_SYS_SETTRACE_DUAL_VMthat, when enabled, duplicates the VM, themp_execute_bytecode()function:mp_execute_bytecode_standard()and has settrace disabled.mp_execute_bytecode_tracing()and has settrace enabled.Then a wrapper function
mp_execute_bytecode()calls into the appropriate VM depending on whether the user has enabled a settrace callback or not.The result is:
This is useful for, eg, webassembly where PyScript wants to have
sys.settraceenabled, and can afford to have a dual-VM in order not to lose performance with settrace enabled.This PR also includes some minor fixes for the webassembly port to be able to run the performance benchmarks on that port.
Testing
Tested by running the standard test suite on the unix and webassembly ports, which also have dual-VM enabled in CI.
On the webassembly port, the
perf_bench/misc_aes.pyperformance test runs 200 times faster (!) with dual-VM enabled, compared to just enabling settrace without the dual-VM. So it's a very big win there.On PYBV10 the metrics are:
MICROPY_PY_SYS_SETTRACEenabledMICROPY_PY_SYS_SETTRACE_DUAL_VMenabled as wellThe performance measurements on PYBV10 in detail
Enabling
MICROPY_PY_SYS_SETTRACE:Tests that are heavy on the VM (
misc_aes.pyandmisc_pystone.py) run at about half the speed with settrace enabled.Enabling both
MICROPY_PY_SYS_SETTRACEandMICROPY_PY_SYS_SETTRACE_DUAL_VM(comparing to settrace completely disabled):Performance is almost all regained with dual-VM enabled.
Trade-offs and Alternatives
This obviously duplicates the VM, but I tried to make it as simple as possible in the implementation so that it's easy to maintain. In the future this mechanism could potentially be extended to optimise for other things, although I can't think of any obvious candidates for that.
An alternative to having a dual-VM could be to try and optimise the internal VM macros
FRAME_UPDATEandTRACE_TICK. Although that could save code size, no matter how well these macros are optimised it won't be able to reach the performance of a dual-VM architecture.I'm not sure exactly how CPython implements efficient
sys.settracethese days (they've changed a lot recently with their VM), but could look there for alternative ideas as well.