torch.Tensor.record_stream
- Tensor.record_stream(stream)
Marks the tensor as having been used by this stream. When the tensor is deallocated, ensure the tensor memory is not reused for another tensor until all work queued on `stream` at the time of deallocation is complete.

Note
The caching allocator is aware only of the stream where a tensor was allocated. Because of this, it already correctly manages the lifecycle of tensors that are used on only one stream. But if a tensor is used on a stream different from the stream of origin, the allocator might reuse the memory unexpectedly. Calling this method lets the allocator know which streams have used the tensor.
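As a minimal sketch of this pattern (assuming a CUDA device is available; `side_stream_use` is an illustrative helper, not part of the PyTorch API), recording the side stream lets you deallocate the tensor immediately without manual synchronization:

```python
import torch

def side_stream_use(n: int) -> None:
    # Illustrative sketch: use a tensor on a side stream, then free it
    # safely by recording that stream with the caching allocator.
    s0 = torch.cuda.current_stream()
    s1 = torch.cuda.Stream()
    x = torch.zeros(n, device="cuda")   # x's memory is tied to s0
    s1.wait_stream(s0)                  # s1 waits until x is ready
    with torch.cuda.stream(s1):
        y = x * 2                       # side-stream use of x
    x.record_stream(s1)                 # allocator now knows s1 used x
    del x                               # memory not reused until s1's
                                        # queued work has completed
    torch.cuda.synchronize()

if torch.cuda.is_available():
    side_stream_use(1024)
```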
Warning
This method is most suitable for use cases where you are providing a function that created a tensor on a side stream, and want users to be able to make use of the tensor without having to think carefully about stream safety when using it. These safety guarantees come at some performance and predictability cost (analogous to the tradeoff between GC and manual memory management), so if you are in a situation where you manage the full lifetime of your tensors, you may consider instead manually managing CUDA events so that calling this method is not necessary. In particular, when you call this method, on later allocations the allocator will poll the recorded stream to see if all operations have completed yet; you can potentially race with side stream computation and non-deterministically reuse or fail to reuse memory for an allocation.
You can safely use tensors allocated on side streams without `record_stream()`; you must manually ensure that any non-creation stream uses of a tensor are synced back to the creation stream before you deallocate the tensor. As the CUDA caching allocator guarantees that the memory will only be reused with the same creation stream, this is sufficient to ensure that writes to future reallocations of the memory will be delayed until non-creation stream uses are done. (Counterintuitively, you may observe that on the CPU side we have already reallocated the tensor, even though CUDA kernels on the old tensor are still in progress. This is fine, because CUDA operations on the new tensor will appropriately wait for the old operations to complete, as they are all on the same stream.) Concretely, this looks like this:
```python
with torch.cuda.stream(s0):
    x = torch.zeros(N)

s1.wait_stream(s0)
with torch.cuda.stream(s1):
    y = some_comm_op(x)

# ... some compute on s0 ...

# synchronize creation stream s0 to side stream s1
# before deallocating x
s0.wait_stream(s1)
del x
```
Note that some discretion is required when deciding when to perform `s0.wait_stream(s1)`. In particular, if we were to wait immediately after `some_comm_op`, there wouldn't be any point in having the side stream; it would be equivalent to have run `some_comm_op` on `s0`. Instead, the synchronization must be placed at some appropriate, later point in time where you expect the side stream `s1` to have finished work. This location is typically identified via profiling, e.g., using Chrome traces produced by `torch.autograd.profiler.profile.export_chrome_trace()`. If you place the wait too early, work on `s0` will block until `s1` has finished, preventing further overlapping of communication and computation. If you place the wait too late, you will use more memory than is strictly necessary (as you are keeping `x` live for longer). For a concrete example of how this guidance can be applied in practice, see this post: FSDP and CUDACachingAllocator.