Benchmark Utils - torch.utils.benchmark
Created On: Nov 02, 2020 | Last Updated On: Jun 12, 2025
- class torch.utils.benchmark.Timer(stmt='pass', setup='pass', global_setup='', timer=<built-in function perf_counter>, globals=None, label=None, sub_label=None, description=None, env=None, num_threads=1, language=Language.PYTHON)[source]
Helper class for measuring execution time of PyTorch statements.
For a full tutorial on how to use this class, see: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
The PyTorch Timer is based on timeit.Timer (and in fact uses timeit.Timer internally), but with several key differences:
- Runtime aware:
Timer will perform warmups (important as some elements of PyTorch are lazily initialized), set threadpool size so that comparisons are apples-to-apples, and synchronize asynchronous accelerator functions when necessary.
- Focus on replicates:
When measuring code, and particularly complex kernels / models, run-to-run variation is a significant confounding factor. It is expected that all measurements should include replicates to quantify noise and allow median computation, which is more robust than mean. To that effect, this class deviates from the timeit API by conceptually merging timeit.Timer.repeat and timeit.Timer.autorange. (Exact algorithms are discussed in method docstrings.) The timeit method is replicated for cases where an adaptive strategy is not desired.
- Optional metadata:
When defining a Timer, one can optionally specify label, sub_label, description, and env. (Defined later.) These fields are included in the representation of the result object and used by the Compare class to group and display results for comparison.
- Instruction counts:
In addition to wall times, Timer can run a statement under Callgrind and report instructions executed.
Directly analogous to timeit.Timer constructor arguments:
stmt, setup, timer, globals
PyTorch Timer specific constructor arguments:
label, sub_label, description, env, num_threads
- Parameters:
stmt (str) – Code snippet to be run in a loop and timed.
setup (str) – Optional setup code. Used to define variables used in stmt.
global_setup (str) – (C++ only) Code which is placed at the top level of the file for things like #include statements.
timer (Callable[[], float]) – Callable which returns the current time. If PyTorch was built without accelerators or there is no accelerator present, this defaults to timeit.default_timer; otherwise it will synchronize accelerators before measuring the time.
globals (dict[str, Any] | None) – A dict which defines the global variables when stmt is being executed. This is the other method for providing variables which stmt needs.
label (str | None) – String which summarizes stmt. For instance, if stmt is “torch.nn.functional.relu(torch.add(x, 1, out=out))” one might set label to “ReLU(x + 1)” to improve readability.
sub_label (str | None) – Provide supplemental information to disambiguate measurements with identical stmt or label. For instance, in our example above sub_label might be “float” or “int”, so that it is easy to differentiate “ReLU(x + 1): (float)” from “ReLU(x + 1): (int)” when printing Measurements or summarizing using Compare.
description (str | None) – String to distinguish measurements with identical label and sub_label. The principal use of description is to signal to Compare the columns of data. For instance one might set it based on the input size to create a table of the form:

                             | n=1 | n=4 | ...
                             ------------- ...
    ReLU(x + 1): (float)     | ... | ... | ...
    ReLU(x + 1): (int)       | ... | ... | ...

using Compare. It is also included when printing a Measurement.
env (str | None) – This tag indicates that otherwise identical tasks were run in different environments, and are therefore not equivalent, for instance when A/B testing a change to a kernel. Compare will treat Measurements with different env specification as distinct when merging replicate runs.
num_threads (int) – The size of the PyTorch threadpool when executing stmt. Single threaded performance is important as both a key inference workload and a good indicator of intrinsic algorithmic efficiency, so the default is set to one. This is in contrast to the default PyTorch threadpool size which tries to utilize all cores.
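As a minimal sketch (the tensor size and metadata values here are illustrative, not prescribed by the API):

```python
import torch
from torch.utils.benchmark import Timer

x = torch.randn(1024)
out = torch.empty_like(x)

# Metadata fields mirror the ReLU(x + 1) example from the parameter docs.
t = Timer(
    stmt="torch.nn.functional.relu(torch.add(x, 1, out=out))",
    globals={"x": x, "out": out},
    label="ReLU(x + 1)",
    sub_label="float",
    description="n=1024",
    num_threads=1,
)
print(t.blocked_autorange())  # returns a Measurement (see below)
```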
- adaptive_autorange(threshold=0.1, *, min_run_time=0.01, max_run_time=10.0, callback=None)[source]
Similar to blocked_autorange but also checks for variability in measurements and repeats until iqr/median is smaller than threshold or max_run_time is reached.
At a high level, adaptive_autorange executes the following pseudo-code:
    `setup`

    times = []
    while times.sum < max_run_time
        start = timer()
        for _ in range(block_size):
            `stmt`
        times.append(timer() - start)

        enough_data = len(times) > 3 and times.sum > min_run_time
        small_iqr = times.iqr / times.mean < threshold

        if enough_data and small_iqr:
            break
- Returns:
A Measurement object that contains measured runtimes and repetition counts, and can be used to compute statistics (mean, median, etc.).
- Return type: Measurement
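A short usage sketch (the workload and threshold are illustrative):

```python
import torch
from torch.utils.benchmark import Timer

t = Timer("y = x @ x", globals={"x": torch.randn(64, 64)})

# Repeats until iqr/median < threshold or max_run_time (seconds) is reached.
m = t.adaptive_autorange(threshold=0.1, max_run_time=5.0)
print(m.median)  # median of the measured times, in seconds
```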
- blocked_autorange(callback=None, min_run_time=0.2)[source]
Measure many replicates while keeping timer overhead to a minimum.
At a high level, blocked_autorange executes the following pseudo-code:
    `setup`

    total_time = 0
    while total_time < min_run_time
        start = timer()
        for _ in range(block_size):
            `stmt`
        total_time += (timer() - start)
Note the variable block_size in the inner loop. The choice of block size is important to measurement quality, and must balance two competing objectives:
- A small block size results in more replicates and generally better statistics.
- A large block size better amortizes the cost of timer invocation, and results in a less biased measurement. This is important because accelerator synchronization time is non-trivial (order single to low double digit microseconds) and would otherwise bias the measurement.
blocked_autorange sets block_size by running a warmup period, increasing block size until timer overhead is less than 0.1% of the overall computation. This value is then used for the main measurement loop.
- Returns:
A Measurement object that contains measured runtimes and repetition counts, and can be used to compute statistics (mean, median, etc.).
- Return type: Measurement
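For example (workload illustrative), the returned Measurement exposes summary statistics computed over the replicates:

```python
import torch
from torch.utils.benchmark import Timer

t = Timer("y = x + 1", globals={"x": torch.randn(1024)})
m = t.blocked_autorange(min_run_time=0.2)

# Per-run statistics derived from the measured replicates.
print(m.mean, m.median, m.iqr)
```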
- collect_callgrind(number: int, *, repeats: None, collect_baseline: bool, retain_out_file: bool) → CallgrindStats[source]
- collect_callgrind(number: int, *, repeats: int, collect_baseline: bool, retain_out_file: bool) → tuple[CallgrindStats, ...]
Collect instruction counts using Callgrind.
Unlike wall times, instruction counts are deterministic (modulo non-determinism in the program itself and small amounts of jitter from the Python interpreter). This makes them ideal for detailed performance analysis. This method runs stmt in a separate process so that Valgrind can instrument the program. Performance is severely degraded due to the instrumentation; however, this is ameliorated by the fact that a small number of iterations is generally sufficient to obtain good measurements.
In order to use this method valgrind, callgrind_control, and callgrind_annotate must be installed.
Because there is a process boundary between the caller (this process) and the stmt execution, globals cannot contain arbitrary in-memory data structures (unlike timing methods). Instead, globals are restricted to builtins, nn.Modules, and TorchScripted functions/modules to reduce the surprise factor from serialization and subsequent deserialization. The GlobalsBridge class provides more detail on this subject. Take particular care with nn.Modules: they rely on pickle and you may need to add an import to setup for them to transfer properly.
By default, a profile for an empty statement will be collected and cached to indicate how many instructions are from the Python loop which drives stmt.
- Returns:
A CallgrindStats object which provides instruction counts and some basic facilities for analyzing and manipulating results.
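A minimal sketch, assuming valgrind, callgrind_control, and callgrind_annotate are installed (the statement is illustrative; torch is available to stmt by default):

```python
from torch.utils.benchmark import Timer

t = Timer("y = torch.ones((8, 8)) + 1")

# Runs stmt in a subprocess under Valgrind; slow, but counts are deterministic.
stats = t.collect_callgrind(number=10)
print(stats.counts(denoise=True))  # total instructions, excluding noisy CPython ops
```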
- timeit(number=1000000)[source]
Mirrors the semantics of timeit.Timer.timeit().
Execute the main statement (stmt) number times. See: https://docs.python.org/3/library/timeit.html#timeit.Timer.timeit
- Return type: Measurement
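A sketch of the non-adaptive path (iteration count illustrative):

```python
import torch
from torch.utils.benchmark import Timer

t = Timer("y = x + 1", globals={"x": torch.randn(16)})

# Fixed iteration count, for when an adaptive strategy is not desired.
m = t.timeit(number=10_000)
print(m)
```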
- class torch.utils.benchmark.Measurement(number_per_run, raw_times, task_spec, metadata=None)[source]
The result of a Timer measurement.
This class stores one or more measurements of a given statement. It is serializable and provides several convenience methods (including a detailed __repr__) for downstream consumers.
- static merge(measurements)[source]
Convenience method for merging replicates.
Merge will extrapolate times to number_per_run=1 and will not transfer any metadata (since it might differ between replicates).
- Return type: list[Measurement]
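For instance (a sketch; the workload is illustrative):

```python
import torch
from torch.utils.benchmark import Timer, Measurement

t = Timer("y = x + 1", globals={"x": torch.randn(16)})
replicates = [t.blocked_autorange(min_run_time=0.05) for _ in range(3)]

# Replicates are folded together and extrapolated to number_per_run=1;
# the result is a list of merged Measurements.
merged = Measurement.merge(replicates)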
- property significant_figures: int
Approximate significant figure estimate.
This property is intended to give a convenient way to estimate the precision of a measurement. It only uses the interquartile region to estimate statistics to try to mitigate skew from the tails, and uses a static z value of 1.645 since it is not expected to be used for small values of n, so z can approximate t.
The significant figure estimation is used in conjunction with the trim_sigfig method to provide a more human interpretable data summary. __repr__ does not use this method; it simply displays raw values. Significant figure estimation is intended for Compare.
- class torch.utils.benchmark.CallgrindStats(task_spec, number_per_run, built_with_debug_symbols, baseline_inclusive_stats, baseline_exclusive_stats, stmt_inclusive_stats, stmt_exclusive_stats, stmt_callgrind_out)[source]
Top level container for Callgrind results collected by Timer.
Manipulation is generally done using the FunctionCounts class, which is obtained by calling CallgrindStats.stats(…). Several convenience methods are provided as well; the most significant is CallgrindStats.as_standardized().
- as_standardized()[source]
Strip library names and some prefixes from function strings.
When comparing two different sets of instruction counts, one stumbling block can be path prefixes. Callgrind includes the full filepath when reporting a function (as it should). However, this can cause issues when diffing profiles. If a key component such as Python or PyTorch was built in separate locations in the two profiles, this can result in something resembling:
     23234231  /tmp/first_build_dir/thing.c:foo(...)
      9823794  /tmp/first_build_dir/thing.c:bar(...)
      ...
        53453  .../aten/src/Aten/...:function_that_actually_changed(...)
      ...
     -9823794  /tmp/second_build_dir/thing.c:bar(...)
    -23234231  /tmp/second_build_dir/thing.c:foo(...)
Stripping prefixes can ameliorate this issue by regularizing the strings and causing better cancellation of equivalent call sites when diffing.
- Return type: CallgrindStats
- counts(*, denoise=False)[source]
Returns the total number of instructions executed.
See FunctionCounts.denoise() for an explanation of the denoise arg.
- Return type: int
- delta(other, inclusive=False)[source]
Diff two sets of counts.
One common reason to collect instruction counts is to determine the effect that a particular change will have on the number of instructions needed to perform some unit of work. If a change increases that number, the next logical question is “why”. This generally involves looking at what part of the code increased in instruction count. This function automates that process so that one can easily diff counts on both an inclusive and exclusive basis.
- Return type: FunctionCounts
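A sketch of the diffing workflow; stats_before and stats_after are hypothetical CallgrindStats collected from two builds of the same workload (e.g. via collect_callgrind):

```python
# stats_before, stats_after: hypothetical CallgrindStats from two runs.
# Standardizing first strips build-path prefixes so equivalent call
# sites cancel when diffing (see as_standardized above).
delta = stats_after.as_standardized().delta(
    stats_before.as_standardized(),
    inclusive=False,
)
print(delta)  # FunctionCounts of per-function instruction-count changes
```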
- stats(inclusive=False)[source]
Returns detailed function counts.
Conceptually, the FunctionCounts returned can be thought of as a tuple of (count, path_and_function_name) tuples.
inclusive matches the semantics of callgrind. If True, the counts include instructions executed by children. inclusive=True is useful for identifying hot spots in code; inclusive=False is useful for reducing noise when diffing counts from two different runs. (See CallgrindStats.delta(…) for more details.)
- Return type: FunctionCounts
- class torch.utils.benchmark.FunctionCounts(_data, inclusive, truncate_rows=True, _linewidth=None)[source]
Container for manipulating Callgrind results.
- It supports:
Addition and subtraction to combine or diff results.
Tuple-like indexing.
A denoise function which strips CPython calls which are known to be non-deterministic and quite noisy.
Two higher order methods (filter and transform) for custom manipulation.
- denoise()[source]
Remove known noisy instructions.
Several instructions in the CPython interpreter are rather noisy. These instructions involve unicode to dictionary lookups which Python uses to map variable names. FunctionCounts is generally a content agnostic container; however, this is sufficiently important for obtaining reliable results to warrant an exception.
- Return type: FunctionCounts
- filter(filter_fn)[source]
Keep only the elements where filter_fn applied to function name returns True.
- Return type: FunctionCounts
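Putting the pieces together (a sketch; stats is a hypothetical CallgrindStats instance, and the "aten" substring is an illustrative filter):

```python
# stats: hypothetical CallgrindStats (e.g. from Timer.collect_callgrind).
counts = stats.as_standardized().stats(inclusive=False).denoise()

# Keep only entries whose function name mentions the substring of interest.
aten_only = counts.filter(lambda fn: "aten" in fn)
print(aten_only)
```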
- class torch.utils.benchmark.Compare(results)[source]
Helper class for displaying the results of many measurements in a formatted table.
The table format is based on the information fields provided in torch.utils.benchmark.Timer (description, label, sub_label, num_threads, etc.). The table can be directly printed using print() or cast as a str.
For a full tutorial on how to use this class, see: https://pytorch.org/tutorials/recipes/recipes/benchmark.html
- Parameters:
results (list[Measurement]) – List of Measurements to display.
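A minimal sketch (sizes and labels illustrative) that builds the kind of table shown in the description parameter docs above:

```python
import torch
from torch.utils.benchmark import Timer, Compare

results = []
for n in (1, 4):
    x = torch.ones(n)
    for dtype_name, xt in (("float", x.float()), ("int", x.int())):
        t = Timer(
            "torch.nn.functional.relu(x + 1)",
            globals={"x": xt},
            label="ReLU(x + 1)",
            sub_label=dtype_name,
            description=f"n={n}",  # becomes a column header
        )
        results.append(t.blocked_autorange(min_run_time=0.05))

compare = Compare(results)
compare.trim_significant_figures()  # uses Measurement.significant_figures
compare.print()
```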