torch.Tensor.record_stream
- Tensor.record_stream(stream)
Marks the tensor as having been used by this stream. When the tensor is deallocated, ensure the tensor memory is not reused for another tensor until all work queued on `stream` at the time of deallocation is complete.

Note
The caching allocator is aware only of the stream where a tensor was allocated. Because of this, it already correctly manages the lifecycle of tensors that are used on only one stream. But if a tensor is used on a stream different from the stream of origin, the allocator might reuse the memory unexpectedly. Calling this method lets the allocator know which streams have used the tensor.
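As a minimal sketch of this pattern (assuming a CUDA device is available; `side_stream_use` is an illustrative helper, not part of the PyTorch API), recording the side stream lets you deallocate the tensor immediately without manual synchronization:

```python
import torch

def side_stream_use(n: int) -> None:
    # Illustrative sketch: use a tensor on a side stream, then free it
    # safely by recording that stream with the caching allocator.
    s0 = torch.cuda.current_stream()
    s1 = torch.cuda.Stream()
    x = torch.zeros(n, device="cuda")   # x's memory is tied to s0
    s1.wait_stream(s0)                  # s1 waits until x is ready
    with torch.cuda.stream(s1):
        y = x * 2                       # side-stream use of x
    x.record_stream(s1)                 # allocator now knows s1 used x
    del x                               # memory not reused until s1's
                                        # queued work has completed
    torch.cuda.synchronize()

if torch.cuda.is_available():
    side_stream_use(1024)
```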
Warning
This method is most suitable for use cases where you are providing a function that created a tensor on a side stream, and want users to be able to make use of the tensor without having to think carefully about stream safety when using it. These safety guarantees come at some performance and predictability cost (analogous to the tradeoff between GC and manual memory management), so if you are in a situation where you manage the full lifetime of your tensors, you may consider instead manually managing CUDA events so that calling this method is not necessary. In particular, when you call this method, on later allocations the allocator will poll the recorded stream to see if all operations have completed yet; you can potentially race with side stream computation and non-deterministically reuse or fail to reuse memory for an allocation.
You can safely use tensors allocated on side streams without `record_stream()`; you must manually ensure that any non-creation stream uses of a tensor are synced back to the creation stream before you deallocate the tensor. As the CUDA caching allocator guarantees that the memory will only be reused with the same creation stream, this is sufficient to ensure that writes to future reallocations of the memory will be delayed until non-creation stream uses are done. (Counterintuitively, you may observe that on the CPU side we have already reallocated the tensor, even though CUDA kernels on the old tensor are still in progress. This is fine, because CUDA operations on the new tensor will appropriately wait for the old operations to complete, as they are all on the same stream.) Concretely, this looks like this:
```python
with torch.cuda.stream(s0):
    x = torch.zeros(N)

s1.wait_stream(s0)
with torch.cuda.stream(s1):
    y = some_comm_op(x)

# ... some compute on s0 ...

# synchronize creation stream s0 to side stream s1
# before deallocating x
s0.wait_stream(s1)
del x
```
Note that some discretion is required when deciding when to perform `s0.wait_stream(s1)`. In particular, if we were to wait immediately after `some_comm_op`, there wouldn't be any point in having the side stream; it would be equivalent to have run `some_comm_op` on `s0`. Instead, the synchronization must be placed at some appropriate, later point in time where you expect the side stream `s1` to have finished work. This location is typically identified via profiling, e.g., using Chrome traces produced by `torch.autograd.profiler.profile.export_chrome_trace()`. If you place the wait too early, work on `s0` will block until `s1` has finished, preventing further overlapping of communication and computation. If you place the wait too late, you will use more memory than is strictly necessary (as you are keeping `x` live for longer). For a concrete example of how this guidance can be applied in practice, see this post: FSDP and CUDACachingAllocator.