This script measures only the performance of the inner loops of ufunc.
The idea behind it is to remove umath object calls from the equation,
in order to reduce noise and provide stable ratios.

eric-wieser reviewed

Apr 15, 2020

benchmarks/misc/benchin_ufunc.py (outdated, resolved)

eric-wieser reviewed

Apr 15, 2020

benchmarks/misc/benchin_ufunc.py (outdated, resolved)

Member

eric-wieser commented Apr 15, 2020

Can we reuse our existing benchmark machinery here?

Member, Author

seiko2plus commented Apr 15, 2020 (edited)

@eric-wieser, I tried to use ASV, but the results weren't stable enough; check this patch and patch2 from #13516. The idea behind this patch is to benchmark only the inner loops of ufunc in order to reduce noise as much as possible. ASV is also rather slow.

EDIT: I moved the two patches mentioned above to separate pull requests, #15992 and #15990.

Member

eric-wieser commented Apr 15, 2020

It would be nice if we could at least hook into ASV for things like benchmark result comparisons and storage, rather than building our own version of those too. It might be worth starting a conversation with @pv about the best way to do that.

Member

seberg commented Apr 15, 2020 (edited)

@seiko2plus, you are repeating the function run multiple times here within your `run` function. Might that be enough to stabilize the results a bit in asv?

EDIT: This got lost: "You are doing a few other things here that you are not doing in the asv version."

For example, if you just define the `run` function in C (and monkeypatch it into Benchmark), and make it do a couple of C-level calls (to offset the ~200ns or so overhead), that might be enough to get a stable result as well?
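To give a feel for the fixed per-call overhead mentioned above, here is a minimal sketch that is not part of this PR. It times a ufunc on a one-element array, where the inner loop does almost no work, so the measured time is dominated by call overhead. The function name `per_call_overhead_ns` and its defaults are hypothetical, and the `lambda` wrapper adds some Python-level cost of its own, so the figure is only a rough upper bound.

```python
import timeit

import numpy as np


def per_call_overhead_ns(ufunc=np.add, number=100_000, repeat=5):
    """Roughly estimate the fixed cost of one binary ufunc call, in nanoseconds.

    Hypothetical helper for illustration only. The array has a single
    element, so almost all of the measured time is call overhead rather
    than inner-loop work.
    """
    a = np.ones(1, dtype=np.float64)
    out = np.empty_like(a)
    # The lambda itself costs a little; treat the result as an upper bound.
    timer = timeit.Timer(lambda: ufunc(a, a, out=out))
    # Take the fastest batch: slower batches mostly reflect system noise.
    best = min(timer.repeat(repeat=repeat, number=number))
    return best / number * 1e9


if __name__ == "__main__":
    print(f"approx. per-call overhead: {per_call_overhead_ns():.0f} ns")
```

The exact number depends on the machine and the NumPy version, so no expected output is given here.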

seiko2plus force-pushed the new_ufunc_benchmark branch from 3fb1562 to 28b0c07

April 15, 2020 16:37

Member, Author

seiko2plus commented Apr 15, 2020

@seberg, ASV already collects multiple samples for each benchmark, but the results are still not stable enough, even on an idle CPU.

This script is not a replacement for the current ASV implementation. Its main purpose is to detect performance changes in the inner loops of ufunc while removing the functionality of umath and multiarray from the equation, in order to reduce noise as much as possible. It also provides more test cases, such as multiple strides and sizes, and better control over the testing process.

> For example, if you just define the run function in C (and monkeypatch it into Benchmark), and make it do a couple of C-level calls (to offset the ~200ns or so overhead), that might be enough to get a stable result as well?

The problem is that ASV doesn't provide a way to specify the elapsed time manually.
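As an illustration of measuring elapsed time manually across several sizes and strides, here is a minimal sketch. It is not the script from this PR: it still goes through the regular ufunc call from Python, so it does not isolate the inner loop the way the PR intends. The function name `time_ufunc` and the chosen sizes, strides, and repeat counts are assumptions made for the example.

```python
import time

import numpy as np


def time_ufunc(ufunc, size, stride, repeats=20, inner=50):
    """Return the best observed time in seconds for one unary ufunc call.

    Hypothetical helper for illustration only. The input is a strided
    view, so stride > 1 exercises the non-contiguous code path.
    """
    base = np.ones(size * stride, dtype=np.float32)
    src = base[::stride]  # view with `size` elements, `stride` items apart
    out = np.empty(size, dtype=np.float32)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for _ in range(inner):
            ufunc(src, out=out)
        elapsed = (time.perf_counter() - start) / inner
        # Keep the minimum: slower samples mostly reflect system noise.
        best = min(best, elapsed)
    return best


if __name__ == "__main__":
    for size in (1024, 65536):
        for stride in (1, 2, 4):
            seconds = time_ufunc(np.sqrt, size, stride)
            print(f"sqrt size={size:6d} stride={stride}: {seconds * 1e6:.2f} us")
```

Taking the minimum over repeated samples is one common way to damp noise; comparing the ratio of two such minima between builds is a rough analogue of the ratios discussed in this thread.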

seiko2plus force-pushed the new_ufunc_benchmark branch 2 times, most recently from a58ab33 to 5f4bbde

April 15, 2020 19:41

Member, Author

seiko2plus commented Apr 15, 2020 (edited)

> EDIT: This got lost: "You are doing a few other things here that you are not doing in the asv version."

@seberg, I moved the mentioned patches from #13516 into separate pull requests, #15992 and #15990. I also modified the number of repeats and samples to match the default settings of this script,
but the ASV ratios are still not stable enough.

seiko2plus mentioned this pull request

Apr 15, 2020

ENH: Provides a deep benchmark for universal functions #15992

Closed

seiko2plus changed the title ~~ENH: Benchmark script for the inner loops of universal functions.~~ ENH: A standalone benchmark script for the inner loops of ufunc

Apr 15, 2020

seiko2plus force-pushed the new_ufunc_benchmark branch from 5f4bbde to 9b4245b

April 15, 2020 20:24

seiko2plus marked this pull request as draft

April 17, 2020 02:25

Member

r-devulap commented Apr 17, 2020

One possible source of noise is turbo mode. In case you haven't done so already, I would recommend disabling it for benchmarking purposes (set `/sys/devices/system/cpu/intel_pstate/no_turbo` to 1). Maybe that will help? I haven't seen too much variability while benchmarking ufuncs with `asv`.

Member, Author

seiko2plus commented Apr 18, 2020

@r-devulap, before I run any benchmarks, I usually:

- isolate logical cores from scheduling through the Linux kernel options `isolcpus` and `rcu_nocbs`
- reduce scheduling-clock ticks through `nohz_full` for the isolated cores
- use the `--cpu-affinity` option that comes with this script, or what ASV provides, for the isolated cores
- set the scaling governor to `performance` via /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
- disable turbo boost via /sys/devices/system/cpu/intel_pstate/no_turbo
- make sure that the ASLR (address space layout randomization) state is 'full randomization' by writing 2 to /proc/sys/kernel/randomize_va_space
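The settings listed above can be inspected before a run with a small read-only check. The following is a minimal sketch, not part of this PR: it only reads the files named in the list (using cpu0 as a representative for the governor, which is an assumption) and reports `None` where a file is missing, for example on non-Intel or non-Linux systems. The names `SETTINGS`, `read_setting`, and `report` are hypothetical.

```python
import os
from pathlib import Path

# Paths taken from the list above; cpu0 stands in for cpu* (an assumption).
SETTINGS = {
    "no_turbo": "/sys/devices/system/cpu/intel_pstate/no_turbo",
    "scaling_governor_cpu0": "/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor",
    "randomize_va_space": "/proc/sys/kernel/randomize_va_space",
}


def read_setting(path):
    """Return the stripped file content, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def report():
    """Collect the current values of the benchmark-related settings."""
    values = {name: read_setting(path) for name, path in SETTINGS.items()}
    # os.sched_getaffinity is only available on some platforms (e.g. Linux).
    if hasattr(os, "sched_getaffinity"):
        values["cpu_affinity"] = sorted(os.sched_getaffinity(0))
    else:
        values["cpu_affinity"] = None
    return values


if __name__ == "__main__":
    for name, value in report().items():
        print(f"{name}: {value}")
```

Nothing is written to the system here; changing these values requires root privileges and is left to the user or to a dedicated tool.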

Recently, I found that a Python module called `pyperf` provides a tool that tunes the system with the tips above, and many more, via the command `pyperf system tune`.

However, it seems I need idle hardware, not just a few isolated logical cores, to get almost stable ratios from ASV, since any system calls that interrupt the thread while the benchmark samples are being collected will eliminate the benefits of isolating the logical cores via `isolcpus` and `rcu_nocbs`.

One of the things I don't like about ASV is that it uses a separate process for each collected sample,
which makes it too slow.

seiko2plus force-pushed the new_ufunc_benchmark branch from a28f11e to 230bc23

April 18, 2020 17:13

seiko2plus marked this pull request as ready for review

April 18, 2020 17:13

seiko2plus force-pushed the new_ufunc_benchmark branch 2 times, most recently from 8408248 to f17305e

April 19, 2020 13:06

charris added the 01 - Enhancement, 28 - Benchmark, and component: benchmarks labels

Apr 19, 2020

charris changed the title ~~ENH: A standalone benchmark script for the inner loops of ufunc~~ ENH: Standalone benchmark script for the inner loops of ufunc

Apr 19, 2020

seiko2plus force-pushed the new_ufunc_benchmark branch 3 times, most recently from dbce6f3 to e62c951

April 23, 2020 03:08

seiko2plus force-pushed the new_ufunc_benchmark branch from a2ed2e5 to a231322

April 29, 2020 02:08

seiko2plus mentioned this pull request

May 1, 2020

ENH: enable multi-platform SIMD compiler optimizations #13516

Merged

Member

mattip commented Jul 10, 2020

ping @pv. Is there something here that we are all missing?

seiko2plus mentioned this pull request

Jul 11, 2020

ENH: Move dispatch-able umath fast-loops to the new dispatcher #16396

Closed

ENH: Standalone benchmark script for the inner loops of ufunc

5e557b5

    This script measures only the performance of the inner loops of ufunc. The idea behind it is to remove umath object calls from the equation, in order to reduce noise and provide stable ratios.

seiko2plus force-pushed the new_ufunc_benchmark branch from a231322 to 5e557b5

October 7, 2020 07:28

seiko2plus mentioned this pull request

Oct 7, 2020

ENH:Umath Replace raw SIMD of unary float point(32-64) with NPYV - g0 #16247

Merged

11 tasks

seiko2plus mentioned this pull request

Oct 20, 2020

SIMD: Replace raw SIMD of sin/cos with NPYV(universal intrinsics) #17587

Merged

5 tasks

Contributor

hameerabbasi commented Nov 9, 2020 (edited)

I ran this PR on a live environment without a desktop (Ubuntu Server), using the method in the PR description. The noise was around 3% and this PR had a performance impact of ±5%, so not too much of a difference.

Whoops, had the wrong tab open. This comment was meant for #16247; copy-pasting it there.