This section describes the stream management functions of the CUDA runtime application programming interface.
Type of stream callback functions.
Resets all persisting lines in cache to normal status. Takes effect on function return.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
This function is slated for eventual deprecation and removal. If you do not require the callback to execute in case of a device error, consider usingcudaLaunchHostFunc. Additionally, this function is not supported withcudaStreamBeginCapture andcudaStreamEndCapture, unlikecudaLaunchHostFunc.
Adds a callback to be called on the host after all currently enqueued items in the stream have completed. For each cudaStreamAddCallback call, a callback will be executed exactly once. The callback will block later work in the stream until it is finished.
The callback may be passedcudaSuccess or an error code. In the event of a device error, all subsequently executed callbacks will receive an appropriatecudaError_t.
Callbacks must not make any CUDA API calls. Attempting to use CUDA APIs may result incudaErrorNotPermitted. Callbacks must not perform any synchronization that may depend on outstanding device work or other callbacks that are not mandated to run earlier. Callbacks without a mandated order (in independent streams) execute in undefined order and may be serialized.
For the purposes of Unified Memory, callback execution makes a number of guarantees:
The callback stream is considered idle for the duration of the callback. Thus, for example, a callback may always use memory attached to the callback stream.
The start of execution of a callback has the same effect as synchronizing an event recorded in the same stream immediately prior to the callback. It thus synchronizes streams which have been "joined" prior to the callback.
Adding device work to any stream does not have the effect of making the stream active until all preceding callbacks have executed. Thus, for example, a callback might use global attached memory even if work has been added to another stream, if it has been properly ordered with an event.
Completion of a callback does not cause a stream to become active except as described above. The callback stream will remain idle if no device work follows the callback, and will remain idle across consecutive callbacks without device work in between. Thus, for example, stream synchronization can be done by signaling from a callback at the end of the stream.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaStreamQuery,cudaStreamSynchronize,cudaStreamWaitEvent,cudaStreamDestroy,cudaMallocManaged,cudaStreamAttachMemAsync,cudaLaunchHostFunc,cuStreamAddCallback
Enqueues an operation instream to specify stream association oflength bytes of memory starting fromdevPtr. This function is a stream-ordered operation, meaning that it is dependent on, and will only take effect when, previous work in stream has completed. Any previous association is automatically replaced.
devPtr must point to an one of the following types of memories:
managed memory declared using the __managed__ keyword or allocated withcudaMallocManaged.
a valid host-accessible region of system-allocated pageable memory. This type of memory may only be specified if the device associated with the stream reports a non-zero value for the device attributecudaDevAttrPageableMemoryAccess.
For managed allocations,length must be either zero or the entire allocation's size. Both indicate that the entire allocation's stream association is being changed. Currently, it is not possible to change stream association for a portion of a managed allocation.
For pageable allocations,length must be non-zero.
The stream association is specified usingflags which must be one ofcudaMemAttachGlobal,cudaMemAttachHost orcudaMemAttachSingle. The default value forflags iscudaMemAttachSingle If thecudaMemAttachGlobal flag is specified, the memory can be accessed by any stream on any device. If thecudaMemAttachHost flag is specified, the program makes a guarantee that it won't access the memory on the device from any stream on a device that has a zero value for the device attributecudaDevAttrConcurrentManagedAccess. If thecudaMemAttachSingle flag is specified andstream is associated with a device that has a zero value for the device attributecudaDevAttrConcurrentManagedAccess, the program makes a guarantee that it will only access the memory on the device fromstream. It is illegal to attach singly to the NULL stream, because the NULL stream is a virtual global stream and not a specific stream. An error will be returned in this case.
When memory is associated with a single stream, the Unified Memory system will allow CPU access to this memory region so long as all operations instream have completed, regardless of whether other streams are active. In effect, this constrains exclusive ownership of the managed memory region by an active GPU to per-stream activity instead of whole-GPU activity.
Accessing memory on the device from streams that are not associated with it will produce undefined results. No error checking is performed by the Unified Memory system to ensure that kernels launched into other streams do not access this region.
It is a program's responsibility to order calls tocudaStreamAttachMemAsync via events, synchronization or other means to ensure legal access to memory at all times. Data visibility and coherency will be changed appropriately for all kernels which follow a stream-association change.
Ifstream is destroyed while data is associated with it, the association is removed and the association reverts to the default visibility of the allocation as specified atcudaMallocManaged. For __managed__ variables, the default association is alwayscudaMemAttachGlobal. Note that destroying a stream is an asynchronous operation, and as a result, the change to default association won't happen until all work in the stream has completed.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaStreamWaitEvent,cudaStreamSynchronize,cudaStreamAddCallback,cudaStreamDestroy,cudaMallocManaged,cuStreamAttachMemAsync
Begin graph capture onstream. When a stream is in capture mode, all operations pushed into the stream will not be executed, but will instead be captured into a graph, which will be returned viacudaStreamEndCapture. Capture may not be initiated ifstream iscudaStreamLegacy. Capture must be ended on the same stream in which it was initiated, and it may only be initiated if the stream is not already in capture mode. The capture mode may be queried viacudaStreamIsCapturing. A unique id representing the capture sequence may be queried viacudaStreamGetCaptureInfo.
Ifmode is not cudaStreamCaptureModeRelaxed,cudaStreamEndCapture must be called on this stream from the same thread.
Kernels captured using this API must not use texture and surface references. Reading or writing through any texture or surface reference is undefined behavior. This restriction does not apply to texture and surface objects.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreate,cudaStreamIsCapturing,cudaStreamEndCapture,cudaThreadExchangeStreamCaptureMode
Begin graph capture onstream. When a stream is in capture mode, all operations pushed into the stream will not be executed, but will instead be captured intograph, which will be returned viacudaStreamEndCapture.
Capture may not be initiated ifstream iscudaStreamLegacy. Capture must be ended on the same stream in which it was initiated, and it may only be initiated if the stream is not already in capture mode. The capture mode may be queried viacudaStreamIsCapturing. A unique id representing the capture sequence may be queried viacudaStreamGetCaptureInfo.
Ifmode is not cudaStreamCaptureModeRelaxed,cudaStreamEndCapture must be called on this stream from the same thread.
Kernels captured using this API must not use texture and surface references. Reading or writing through any texture or surface reference is undefined behavior. This restriction does not apply to texture and surface objects.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreate,cudaStreamIsCapturing,cudaStreamEndCapture,cudaThreadExchangeStreamCaptureMode
Copies attributes from source streamsrc to destination streamdst. Both streams must have the same context.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
Creates a new asynchronous stream on the context that is current to the calling host thread. If no context is current to the calling host thread, then the primary context for a device is selected, made current to the calling thread, and initialized before creating a stream on it.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreateWithPriority,cudaStreamCreateWithFlags,cudaStreamGetPriority,cudaStreamGetFlags,cudaStreamGetDevice,cudaStreamQuery,cudaStreamSynchronize,cudaStreamWaitEvent,cudaStreamAddCallback,cudaSetDevice,cudaStreamDestroy,cuStreamCreate
Creates a new asynchronous stream on the context that is current to the calling host thread. If no context is current to the calling host thread, then the primary context for a device is selected, made current to the calling thread, and initialized before creating a stream on it. Theflags argument determines the behaviors of the stream. Valid values forflags are
cudaStreamDefault: Default stream creation flag.
cudaStreamNonBlocking: Specifies that work running in the created stream may run concurrently with work in stream 0 (the NULL stream), and that the created stream should perform no implicit synchronization with stream 0.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreate,cudaStreamCreateWithPriority,cudaStreamGetFlags,cudaStreamGetDevice,cudaStreamQuery,cudaStreamSynchronize,cudaStreamWaitEvent,cudaStreamAddCallback,cudaSetDevice,cudaStreamDestroy,cuStreamCreate
Creates a stream with the specified priority and returns a handle inpStream. The stream is created on the context that is current to the calling host thread. If no context is current to the calling host thread, then the primary context for a device is selected, made current to the calling thread, and initialized before creating a stream on it. This affects the scheduling priority of work in the stream. Priorities provide a hint to preferentially run work with higher priority when possible, but do not preempt already-running work or provide any other functional guarantee on execution order.
priority follows a convention where lower numbers represent higher priorities. '0' represents default priority. The range of meaningful numerical priorities can be queried usingcudaDeviceGetStreamPriorityRange. If the specified priority is outside the numerical range returned bycudaDeviceGetStreamPriorityRange, it will automatically be clamped to the lowest or the highest number in the range.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Stream priorities are supported only on GPUs with compute capability 3.5 or higher.
In the current implementation, only compute kernels launched in priority streams are affected by the stream's priority. Stream priorities have no effect on host-to-device and device-to-host memory operations.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaDeviceGetStreamPriorityRange,cudaStreamGetPriority,cudaStreamQuery,cudaStreamWaitEvent,cudaStreamAddCallback,cudaStreamSynchronize,cudaSetDevice,cudaStreamDestroy,cuStreamCreateWithPriority
Destroys and cleans up the asynchronous stream specified bystream.
In case the device is still doing work in the streamstream whencudaStreamDestroy() is called, the function will return immediately and the resources associated withstream will be released automatically once the device has completed all work instream.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
Use of the handle after this call is undefined behavior.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaStreamQuery,cudaStreamWaitEvent,cudaStreamSynchronize,cudaStreamAddCallback,cuStreamDestroy
End capture onstream, returning the captured graph viapGraph. Capture must have been initiated onstream via a call tocudaStreamBeginCapture. If capture was invalidated, due to a violation of the rules of stream capture, then a NULL graph will be returned.
If themode argument tocudaStreamBeginCapture was not cudaStreamCaptureModeRelaxed, this call must be from the same thread ascudaStreamBeginCapture.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreate,cudaStreamBeginCapture,cudaStreamIsCapturing,cudaGraphDestroy
Queries attributeattr fromhStream and stores it in corresponding member ofvalue_out.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
Query stream state related to stream capture.
If called oncudaStreamLegacy (the "null stream") while a stream not created withcudaStreamNonBlocking is capturing, returnscudaErrorStreamCaptureImplicit.
Valid data (other than capture status) is returned only if both of the following are true:
the call returns cudaSuccess
the returned capture status iscudaStreamCaptureStatusActive
IfedgeData_out is non-NULL thendependencies_out must be as well. Ifdependencies_out is non-NULL andedgeData_out is NULL, but there is non-zero edge data for one or more of the current stream dependencies, the call will returncudaErrorLossyQuery.
Graph objects are not threadsafe.More here.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamBeginCapture,cudaStreamIsCapturing,cudaStreamUpdateCaptureDependencies
cudaSuccess,cudaErrorInvalidValue, cudaErrorDeviceUnavailable,
Returns in*device the device of the stream.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaSetDevice,cudaGetDevice,cudaStreamCreate,cudaStreamGetPriority,cudaStreamGetFlags,cuStreamGetId
Query the flags of a stream. The flags are returned inflags. SeecudaStreamCreateWithFlags for a list of valid flags.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreateWithPriority,cudaStreamCreateWithFlags,cudaStreamGetPriority,cudaStreamGetDevice,cuStreamGetFlags
Query the Id of a stream. The Id is returned instreamId. The Id is unique for the life of the program.
The stream handlehStream can refer to any of the following:
a stream created via any of the CUDA runtime APIs such ascudaStreamCreate,cudaStreamCreateWithFlags andcudaStreamCreateWithPriority, or their driver API equivalents such ascuStreamCreate orcuStreamCreateWithPriority. Passing an invalid handle will result in undefined behavior.
any of the special streams such as the NULL stream,cudaStreamLegacy andcudaStreamPerThread respectively. The driver API equivalents of these are also accepted which are NULL,CU_STREAM_LEGACY andCU_STREAM_PER_THREAD.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreateWithPriority,cudaStreamCreateWithFlags,cudaStreamGetPriority,cudaStreamGetFlags,cuStreamGetId
Query the priority of a stream. The priority is returned in inpriority. Note that if the stream was created with a priority outside the meaningful numerical range returned bycudaDeviceGetStreamPriorityRange, this function returns the clamped priority. SeecudaStreamCreateWithPriority for details about priority clamping.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreateWithPriority,cudaDeviceGetStreamPriorityRange,cudaStreamGetFlags,cudaStreamGetDevice,cuStreamGetPriority
Return the capture status ofstream viapCaptureStatus. After a successful call,*pCaptureStatus will contain one of the following:
cudaStreamCaptureStatusNone: The stream is not capturing.
cudaStreamCaptureStatusActive: The stream is capturing.
cudaStreamCaptureStatusInvalidated: The stream was capturing but an error has invalidated the capture sequence. The capture sequence must be terminated withcudaStreamEndCapture on the stream where it was initiated in order to continue usingstream.
Note that, if this is called oncudaStreamLegacy (the "null stream") while a blocking stream on the same device is capturing, it will returncudaErrorStreamCaptureImplicit and*pCaptureStatus is unspecified after the call. The blocking stream capture is not invalidated.
When a blocking stream is capturing, the legacy stream is in an unusable state until the blocking stream capture is terminated. The legacy stream is not supported for stream capture, but attempted use would have an implicit dependency on the capturing stream(s).
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaStreamCreate,cudaStreamBeginCapture,cudaStreamEndCapture
ReturnscudaSuccess if all operations instream have completed, orcudaErrorNotReady if not.
For the purposes of Unified Memory, a return value ofcudaSuccess is equivalent to having calledcudaStreamSynchronize().
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaStreamWaitEvent,cudaStreamSynchronize,cudaStreamAddCallback,cudaStreamDestroy,cuStreamQuery
Sets attributeattr onhStream from corresponding attribute ofvalue. The updated attribute will be applied to subsequent work submitted to the stream. It will not affect previously submitted work.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
Blocks untilstream has completed all operations. If thecudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the stream is finished with all of its tasks.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaStreamQuery,cudaStreamWaitEvent,cudaStreamAddCallback,cudaStreamDestroy,cuStreamSynchronize
Modifies the dependency set of a capturing stream. The dependency set is the set of nodes that the next captured node in the stream will depend on.
Valid flags arecudaStreamAddCaptureDependencies andcudaStreamSetCaptureDependencies. These control whether the set passed to the API is added to the existing set or replaces it. A flags value of 0 defaults tocudaStreamAddCaptureDependencies.
Nodes that are removed from the dependency set via this API do not result incudaErrorStreamCaptureUnjoined if they are unreachable from the stream atcudaStreamEndCapture.
ReturnscudaErrorIllegalState if the stream is not capturing.
Note that this function may also return error codes from previous, asynchronous launches.
See also:
Makes all future work submitted tostream wait for all work captured inevent. SeecudaEventRecord() for details on what is captured by an event. The synchronization will be performed efficiently on the device when applicable.event may be from a different device thanstream.
flags include:
cudaEventWaitDefault: Default event creation flag.
cudaEventWaitExternal: Event is captured in the graph as an external event node when performing stream capture.
This function uses standarddefault stream semantics.
Note that this function may also return error codes from previous, asynchronous launches.
Note that this function may also returncudaErrorInitializationError,cudaErrorInsufficientDriver orcudaErrorNoDevice if this call tries to initialize internal CUDA RT state.
Note that as specified bycudaStreamAddCallback no CUDA function may be called from callback.cudaErrorNotPermitted may, but is not guaranteed to, be returned as a diagnostic in such case.
See also:
cudaStreamCreate,cudaStreamCreateWithFlags,cudaStreamQuery,cudaStreamSynchronize,cudaStreamAddCallback,cudaStreamDestroy,cuStreamWaitEvent
Sets the calling thread's stream capture interaction mode to the value contained in*mode, and overwrites*mode with the previous mode for the thread. To facilitate deterministic behavior across function or module boundaries, callers are encouraged to use this API in a push-pop fashion:
cudaStreamCaptureMode mode = desiredMode;cudaThreadExchangeStreamCaptureMode(&mode); ...cudaThreadExchangeStreamCaptureMode(&mode); // restore previous mode
During stream capture (seecudaStreamBeginCapture), some actions, such as a call tocudaMalloc, may be unsafe. In the case ofcudaMalloc, the operation is not enqueued asynchronously to a stream, and is not observed by stream capture. Therefore, if the sequence of operations captured viacudaStreamBeginCapture depended on the allocation being replayed whenever the graph is launched, the captured graph would be invalid.
Therefore, stream capture places restrictions on API calls that can be made within or concurrently to acudaStreamBeginCapture-cudaStreamEndCapture sequence. This behavior can be controlled via this API and flags tocudaStreamBeginCapture.
A thread's mode is one of the following:
cudaStreamCaptureModeGlobal: This is the default mode. If the local thread has an ongoing capture sequence that was not initiated withcudaStreamCaptureModeRelaxed atcuStreamBeginCapture, or if any other thread has a concurrent capture sequence initiated withcudaStreamCaptureModeGlobal, this thread is prohibited from potentially unsafe API calls.
cudaStreamCaptureModeThreadLocal: If the local thread has an ongoing capture sequence not initiated withcudaStreamCaptureModeRelaxed, it is prohibited from potentially unsafe API calls. Concurrent capture sequences in other threads are ignored.
cudaStreamCaptureModeRelaxed: The local thread is not prohibited from potentially unsafe API calls. Note that the thread is still prohibited from API calls which necessarily conflict with stream capture, for example, attemptingcudaEventQuery on an event that was last recorded inside a capture sequence.
Note that this function may also return error codes from previous, asynchronous launches.
See also: