Note
Go to the endto download the full example code.
Profiling your PyTorch Module#
Created On: Dec 30, 2020 | Last Updated: Nov 11, 2025 | Last Verified: Nov 05, 2024
Author:Suraj Subramanian
PyTorch includes a profiler API that is useful to identify the time andmemory costs of various PyTorch operations in your code. Profiler can beeasily integrated in your code, and the results can be printed as a tableor returned in a JSON trace file.
Note
Profiler supports multithreaded models. Profiler runs in thesame thread as the operation but it will also profile child operatorsthat might run in another thread. Concurrently-running profilers will bescoped to their own thread to prevent mixing of results.
Note
PyTorch 1.8 introduces the new API that will replace the older profiler APIin the future releases. Check the new API atthis page.
Head on over tothisrecipefor a quicker walkthrough of Profiler API usage.
importtorchimportnumpyasnpfromtorchimportnnimporttorch.autograd.profilerasprofiler
Performance debugging using Profiler#
Profiler can be useful to identify performance bottlenecks in yourmodels. In this example, we build a custom module that performs twosub-tasks:
a linear transformation on the input, and
use the transformation result to get indices on a mask tensor.
We wrap the code for each sub-task in separate labelled context managers usingprofiler.record_function("label"). In the profiler output, theaggregate performance metrics of all operations in the sub-task willshow up under its corresponding label.
Note that using Profiler incurs some overhead, and is best used only for investigatingcode. Remember to remove it if you are benchmarking runtimes.
classMyModule(nn.Module):def__init__(self,in_features:int,out_features:int,bias:bool=True):super(MyModule,self).__init__()self.linear=nn.Linear(in_features,out_features,bias)defforward(self,input,mask):withprofiler.record_function("LINEAR PASS"):out=self.linear(input)withprofiler.record_function("MASK INDICES"):threshold=out.sum(axis=1).mean().item()hi_idx=np.argwhere(mask.cpu().numpy()>threshold)hi_idx=torch.from_numpy(hi_idx).cuda()returnout,hi_idx
Profile the forward pass#
We initialize random input and mask tensors, and the model.
Before we run the profiler, we warm-up CUDA to ensure accurateperformance benchmarking. We wrap the forward pass of our module in theprofiler.profile context manager. Thewith_stack=True parameter appends thefile and line number of the operation in the trace.
Warning
with_stack=True incurs an additional overhead, and is better suited for investigating code.Remember to remove it if you are benchmarking performance.
model=MyModule(500,10).cuda()input=torch.rand(128,500).cuda()mask=torch.rand((500,500,500),dtype=torch.double).cuda()# warm-upmodel(input,mask)withprofiler.profile(with_stack=True,profile_memory=True)asprof:out,idx=model(input,mask)
Print profiler results#
Finally, we print the profiler results.profiler.key_averagesaggregates the results by operator name, and optionally by inputshapes and/or stack trace events.Grouping by input shapes is useful to identify which tensor shapesare utilized by the model.
Here, we usegroup_by_stack_n=5 which aggregates runtimes by theoperation and its traceback (truncated to the most recent 5 events), anddisplay the events in the order they are registered. The table can alsobe sorted by passing asort_by argument (refer to thedocs forvalid sorting keys).
Note
When running profiler in a notebook, you might see entries like<ipython-input-18-193a910735e8>(13):forwardinstead of filenames in the stacktrace. These correspond to<notebook-cell>(linenumber):calling-function.
print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total',row_limit=5))"""(Some columns are omitted)------------- ------------ ------------ ------------ --------------------------------- Name Self CPU % Self CPU Self CPU Mem Source Location------------- ------------ ------------ ------------ --------------------------------- MASK INDICES 87.88% 5.212s -953.67 Mb /mnt/xarfuse/.../torch/au <ipython-input-...>(10): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ aten::copy_ 12.07% 715.848ms 0 b <ipython-input-...>(12): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/ LINEAR PASS 0.01% 350.151us -20 b /mnt/xarfuse/.../torch/au <ipython-input-...>(7): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ aten::addmm 0.00% 293.342us 0 b /mnt/xarfuse/.../torch/nn /mnt/xarfuse/.../torch/nn /mnt/xarfuse/.../torch/nn <ipython-input-...>(8): forward /mnt/xarfuse/.../torch/nn aten::mean 0.00% 235.095us 0 b <ipython-input-...>(11): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/----------------------------- ------------ ---------- ----------------------------------Self CPU time total: 5.931s"""
Improve memory performance#
Note that the most expensive operations - in terms of memory and time -are atforward(10) representing the operations within MASK INDICES. Let’s try totackle the memory consumption first. We can see that the.to()operation at line 12 consumes 953.67 Mb. This operation copiesmask to the CPU.mask is initialized with atorch.double datatype. Can we reduce the memory footprint by castingit totorch.float instead?
model=MyModule(500,10).cuda()input=torch.rand(128,500).cuda()mask=torch.rand((500,500,500),dtype=torch.float).cuda()# warm-upmodel(input,mask)withprofiler.profile(with_stack=True,profile_memory=True)asprof:out,idx=model(input,mask)print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total',row_limit=5))"""(Some columns are omitted)----------------- ------------ ------------ ------------ -------------------------------- Name Self CPU % Self CPU Self CPU Mem Source Location----------------- ------------ ------------ ------------ -------------------------------- MASK INDICES 93.61% 5.006s -476.84 Mb /mnt/xarfuse/.../torch/au <ipython-input-...>(10): forward /mnt/xarfuse/ /torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ aten::copy_ 6.34% 338.759ms 0 b <ipython-input-...>(12): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/ aten::as_strided 0.01% 281.808us 0 b <ipython-input-...>(11): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/ aten::addmm 0.01% 275.721us 0 b /mnt/xarfuse/.../torch/nn /mnt/xarfuse/.../torch/nn /mnt/xarfuse/.../torch/nn <ipython-input-...>(8): forward /mnt/xarfuse/.../torch/nn aten::_local 0.01% 268.650us 0 b <ipython-input-...>(11): forward _scalar_dense /mnt/xarfuse/.../torch/nn <ipython-input-...>(9): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/----------------- ------------ ------------ ------------ --------------------------------Self CPU time total: 5.347s"""
The CPU memory footprint for this operation has halved.
Improve time performance#
While the time consumed has also reduced a bit, it’s still too high.Turns out copying a matrix from CUDA to CPU is pretty expensive!Theaten::copy_ operator inforward(12) copiesmask to CPUso that it can use the NumPyargwhere function.aten::copy_ atforward(13)copies the array back to CUDA as a tensor. We could eliminate both of these if we use atorch functionnonzero() here instead.
classMyModule(nn.Module):def__init__(self,in_features:int,out_features:int,bias:bool=True):super(MyModule,self).__init__()self.linear=nn.Linear(in_features,out_features,bias)defforward(self,input,mask):withprofiler.record_function("LINEAR PASS"):out=self.linear(input)withprofiler.record_function("MASK INDICES"):threshold=out.sum(axis=1).mean()hi_idx=(mask>threshold).nonzero(as_tuple=True)returnout,hi_idxmodel=MyModule(500,10).cuda()input=torch.rand(128,500).cuda()mask=torch.rand((500,500,500),dtype=torch.float).cuda()# warm-upmodel(input,mask)withprofiler.profile(with_stack=True,profile_memory=True)asprof:out,idx=model(input,mask)print(prof.key_averages(group_by_stack_n=5).table(sort_by='self_cpu_time_total',row_limit=5))"""(Some columns are omitted)-------------- ------------ ------------ ------------ --------------------------------- Name Self CPU % Self CPU Self CPU Mem Source Location-------------- ------------ ------------ ------------ --------------------------------- aten::gt 57.17% 129.089ms 0 b <ipython-input-...>(12): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(25): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/ aten::nonzero 37.38% 84.402ms 0 b <ipython-input-...>(12): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(25): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/ INDEX SCORE 3.32% 7.491ms -119.21 Mb /mnt/xarfuse/.../torch/au <ipython-input-...>(10): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(25): <module> /mnt/xarfuse/.../IPython/aten::as_strided 0.20% 441.587us 0 b <ipython-input-...>(12): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(25): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/ aten::nonzero _numpy 0.18% 395.602us 0 b <ipython-input-...>(12): forward /mnt/xarfuse/.../torch/nn <ipython-input-...>(25): <module> /mnt/xarfuse/.../IPython/ /mnt/xarfuse/.../IPython/-------------- ------------ ------------ ------------ ---------------------------------Self CPU time total: 225.801ms"""
Further Reading#
We have seen how Profiler can be used to investigate time and memory bottlenecks in PyTorch models.Read more about Profiler here: