cuBLAS

The API Reference guide for cuBLAS, the CUDA Basic Linear Algebra Subroutine library.

1. Introduction

The cuBLAS library is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA® CUDA™ runtime. It allows the user to access the computational resources of NVIDIA Graphics Processing Units (GPUs).

The cuBLAS Library exposes four sets of APIs:

The cuBLAS API, which is simply called cuBLAS API in this document (starting with CUDA 6.0),
The cuBLASXt API (starting with CUDA 6.0),
The cuBLASLt API (starting with CUDA 10.1), and
The cuBLASDx API (not shipped with the CUDA Toolkit).

To use the cuBLAS API, the application must allocate the required matrices and vectors in the GPU memory space, fill them with data, call the sequence of desired cuBLAS functions, and then upload the results from the GPU memory space back to the host. The cuBLAS API also provides helper functions for writing data to and retrieving data from the GPU.

To use the cuBLASXt API, the application may have the data on the Host or any of the devices involved in the computation, and the Library will take care of dispatching the operation to, and transferring the data to, one or multiple GPUs present in the system, depending on the user request.

cuBLASLt is a lightweight library dedicated to GEneral Matrix-to-matrix Multiply (GEMM) operations with a new flexible API. This library adds flexibility in matrix data layouts, input types, compute types, and also in choosing the algorithmic implementations and heuristics through parameter programmability. After a set of options for the intended GEMM operation is identified by the user, these options can be used repeatedly for different inputs. This is analogous to how cuFFT and FFTW first create a plan and then reuse it for FFTs of the same size and type with different input data.

1.1. Data Layout

For maximum compatibility with existing Fortran environments, the cuBLAS library uses column-major storage and 1-based indexing. Since C and C++ use row-major storage, applications written in these languages cannot use the native array semantics for two-dimensional arrays. Instead, macros or inline functions should be defined to implement matrices on top of one-dimensional arrays. For Fortran code ported to C in mechanical fashion, one may choose to retain 1-based indexing to avoid the need to transform loops. In this case, the array index of a matrix element in row “i” and column “j” can be computed via the following macro:

#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

Here, ld refers to the leading dimension of the matrix, which in the case of column-major storage is the number of rows of the allocated matrix (even if only a submatrix of it is being used). For natively written C and C++ code, one would most likely choose 0-based indexing, in which case the array index of a matrix element in row “i” and column “j” can be computed via the following macro:

#define IDX2C(i,j,ld) (((j)*(ld))+(i))
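
As a minimal host-only sketch of how the two macros address the same column-major storage, consider the C code below. The helper names `fill_column_major` and `get_one_based`, and the values written into the matrix, are illustrative assumptions and are not part of cuBLAS.

```c
#include <assert.h>

/* Macros as defined in the text above. */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))   /* 1-based row i, column j */
#define IDX2C(i,j,ld) (((j)*(ld))+(i))           /* 0-based row i, column j */

/* Hypothetical helper: fill a rows x cols matrix stored column-major in a
 * one-dimensional array with leading dimension ld, using 0-based indices.
 * Element (i, j) receives the value 10*i + j. */
static void fill_column_major(float *a, int rows, int cols, int ld)
{
    assert(ld >= rows);  /* the leading dimension must cover all rows */
    for (int j = 0; j < cols; j++) {
        for (int i = 0; i < rows; i++) {
            a[IDX2C(i, j, ld)] = (float)(10 * i + j);
        }
    }
}

/* Hypothetical helper: read element (i, j) using 1-based indices. */
static float get_one_based(const float *a, int i, int j, int ld)
{
    return a[IDX2F(i, j, ld)];
}
```

For a matrix with leading dimension 3, the 0-based element (1, 1) and the 1-based element (2, 2) both map to linear index 4, so either macro can be used on the same buffer as long as the indexing base is applied consistently.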

1.2. New and Legacy cuBLAS API

Starting with version 4.0, the cuBLAS Library provides a new API, in addition to the existing legacy API. This section discusses why a new API is provided, the advantages of using it, and the differences with the existing legacy API.

Warning

The legacy cuBLAS API is deprecated and will be removed in a future release.

The new cuBLAS library API can be used by including the header file cublas_v2.h. It has the following features that the legacy cuBLAS API does not have:

The handle to the cuBLAS library context is initialized using the function cublasCreate() and is explicitly passed to every subsequent library function call. This allows the user to have more control over the library setup when using multiple host threads and multiple GPUs. This also allows the cuBLAS APIs to be reentrant.
The scalars $\alpha$ and $\beta$ can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This change allows library functions to execute asynchronously using streams even when $\alpha$ and $\beta$ are generated by a previous kernel.
When a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only being allowed to be returned by value on the host. This change allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.
The error status cublasStatus_t is returned by all cuBLAS library function calls. This change facilitates debugging and simplifies software development. Note that cublasStatus was renamed cublasStatus_t to be more consistent with other types in the cuBLAS library.
The cublasAlloc() and cublasFree() functions have been deprecated. This change removes these unnecessary wrappers around cudaMalloc() and cudaFree(), respectively.
The function cublasSetKernelStream() was renamed cublasSetStream() to be more consistent with the other CUDA libraries.

The legacy cuBLAS API, explained in more detail in Using the cuBLAS Legacy API, can be used by including the header file cublas.h. Since the legacy API is identical to the previously released cuBLAS library API, existing applications will work out of the box and automatically use this legacy API without any source code changes.

The current and the legacy cuBLAS APIs cannot be used simultaneously in a single translation unit: including both the cublas.h and cublas_v2.h header files will lead to compilation errors due to incompatible symbol redeclarations.

In general, new applications should not use the legacy cuBLAS API, and existing applications should convert to the new API if they require sophisticated and optimal stream parallelism, or if they call cuBLAS routines concurrently from multiple threads.

For the rest of the document, the new cuBLAS Library API will simply be referred to as the cuBLAS Library API.

As mentioned earlier, the interfaces to the legacy and the cuBLAS library APIs are the header files cublas.h and cublas_v2.h, respectively. In addition, applications using the cuBLAS library need to link against:

The DSO cublas.so for Linux,
The DLL cublas.dll for Windows, or
The dynamic library cublas.dylib for Mac OS X.

Note

The same dynamic library implements both the new and legacy cuBLAS APIs.

1.3. Example Code

For sample code references please see the two examples below. They show an application written in C using the cuBLAS library API with two indexing styles (Example 1. “Application Using C and cuBLAS: 1-based indexing” and Example 2. “Application Using C and cuBLAS: 0-based Indexing”).

//Example 1. Application Using C and cuBLAS: 1-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n,
                               int p, int q, float alpha, float beta){
    cublasSscal (handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (float)((i-1) * N + j);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed");
        free (a);
        return EXIT_FAILURE;
    }
    stat = cublasCreate (&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        free (a);
        cudaFree (devPtrA);
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        free (a);
        cudaFree (devPtrA);
        cublasDestroy (handle);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        free (a);
        cudaFree (devPtrA);
        cublasDestroy (handle);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy (handle);
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf ("%7.0f", a[IDX2F(i,j,M)]);
        }
        printf ("\n");
    }
    free (a);
    return EXIT_SUCCESS;
}

//Example 2. Application Using C and cuBLAS: 0-based indexing
//-----------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h"
#define M 6
#define N 5
#define IDX2C(i,j,ld) (((j)*(ld))+(i))

static __inline__ void modify (cublasHandle_t handle, float *m, int ldm, int n,
                               int p, int q, float alpha, float beta){
    cublasSscal (handle, n-q, &alpha, &m[IDX2C(p,q,ldm)], ldm);
    cublasSscal (handle, ldm-p, &beta, &m[IDX2C(p,q,ldm)], 1);
}

int main (void){
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float* devPtrA;
    float* a = 0;
    a = (float *)malloc (M * N * sizeof (*a));
    if (!a) {
        printf ("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            a[IDX2C(i,j,M)] = (float)(i * N + j + 1);
        }
    }
    cudaStat = cudaMalloc ((void**)&devPtrA, M*N*sizeof(*a));
    if (cudaStat != cudaSuccess) {
        printf ("device memory allocation failed");
        free (a);
        return EXIT_FAILURE;
    }
    stat = cublasCreate (&handle);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("CUBLAS initialization failed\n");
        free (a);
        cudaFree (devPtrA);
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data download failed");
        free (a);
        cudaFree (devPtrA);
        cublasDestroy (handle);
        return EXIT_FAILURE;
    }
    modify (handle, devPtrA, M, N, 1, 2, 16.0f, 12.0f);
    stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf ("data upload failed");
        free (a);
        cudaFree (devPtrA);
        cublasDestroy (handle);
        return EXIT_FAILURE;
    }
    cudaFree (devPtrA);
    cublasDestroy (handle);
    for (j = 0; j < N; j++) {
        for (i = 0; i < M; i++) {
            printf ("%7.0f", a[IDX2C(i,j,M)]);
        }
        printf ("\n");
    }
    free (a);
    return EXIT_SUCCESS;
}

1.4. Forward Compatibility

The cuBLAS library can work on future GPUs in most cases thanks to PTX JIT. However, there are certain limitations:

There are no performance guarantees: running on new hardware may be slower despite better theoretical peaks.
There is limited forward compatibility for narrow precisions (FP4 and FP8) and tiled 8-bit integer layouts.

1.5. Floating Point Emulation

Floating point emulation was first introduced in CUDA 12.9 and is used to further accelerate matrix multiplication for higher precision data types. It works by first transforming the inputs into multiple lower precision values, then using lower precision hardware units to compute partial results, and finally recombining the results back into full precision. These algorithms can provide a significant performance advantage over native precision arithmetic while maintaining the same or better accuracy; however, the results are not IEEE-754 compliant.

Floating Point Emulation Support Overview
Floating Point Emulation Algorithm	Precision Emulated	Supported compute capabilities	CUDA Version
BF16x9	FP32	10.0, 10.3	12.9+
Fixed-Point	FP64	8.x, 9.0, 10.0, 11.0, 12.x	13.0u2+

To enable floating point emulation without any code changes, the following environment variables can be used.

Floating Point Emulation Environment Variables
Environment Variable	Description
`CUBLAS_EMULATION_STRATEGY`	An environment variable for overriding the default emulation strategy. The valid values are `performant` and `eager`; see cublasEmulationStrategy_t for more details.
`CUBLAS_EMULATE_SINGLE_PRECISION`	An environment variable for enabling and disabling single precision floating point emulation using the values 1 and 0, respectively.
`CUBLAS_EMULATE_DOUBLE_PRECISION`	An environment variable for enabling and disabling double precision floating point emulation using the values 1 and 0, respectively.
`CUBLAS_FIXEDPOINT_EMULATION_MANTISSA_BIT_COUNT`	The number of mantissa bits to be used for fixed-point emulation. When set, emulated algorithms will use the specified number of mantissa bits. This is equivalent to calling cublasSetFixedPointEmulationMantissaControl() with `CUDA_EMULATION_MANTISSA_CONTROL_FIXED` (see cudaEmulationMantissaControl_t) and calling cublasSetFixedPointEmulationMaxMantissaBitCount() with the user-provided value.

1.5.1. BF16x9

The BF16x9 algorithm is used for emulating FP32 arithmetic. An FP32 value can be exactly represented as three BF16 values as follows:

\[\begin{split}a & = a_0 + 2^{-8} a_1 + 2^{-16} a_2 \\\end{split}\]

We can fully reconstruct the FP32 value from the BF16 values without any loss of accuracy. Using this, we define an FMA operation (d = ab + c) as follows:

\[\begin{split}d & = ab + c \\ & = (a_0 + 2^{-8} a_1 + 2^{-16} a_2) \cdot (b_0 + 2^{-8} b_1 + 2^{-16} b_2) + c \\ & = a_0b_0 + 2^{-8}a_0b_1 + 2^{-16}a_0b_2 \\ & \quad + 2^{-8}a_1b_0 + 2^{-16}a_1b_1 + 2^{-24}a_1b_2 \\ & \quad + 2^{-16}a_2b_0 + 2^{-24}a_2b_1 + 2^{-32}a_2b_2 + c \\\end{split}\]

In practice, the BF16 tensor cores are utilized rather than FMA units and this idea naturally extends into complex arithmetic as well.
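
The three-term decomposition can be illustrated on the host with the following C sketch. It assumes truncation (round toward zero) when narrowing to BF16 and normal FP32 inputs that are far from overflow and underflow; the text does not state how cuBLAS performs the split, so the function names and the rounding choice here are illustrative assumptions only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Truncate an FP32 value to BF16 precision by clearing its low 16 bits.
 * The result is returned as a float whose value is exactly representable
 * in BF16. Truncation is an assumption of this sketch. */
static float truncate_to_bf16(float x)
{
    uint32_t bits;
    assert(sizeof(bits) == sizeof(x));
    memcpy(&bits, &x, sizeof(bits));
    bits &= 0xFFFF0000u;
    memcpy(&x, &bits, sizeof(x));
    return x;
}

/* Split a into parts[0..2] such that
 * a = parts[0] + 2^-8 * parts[1] + 2^-16 * parts[2]. */
static void split_fp32_to_bf16x3(float a, float parts[3])
{
    parts[0] = truncate_to_bf16(a);
    float r1 = (a - parts[0]) * 256.0f;    /* remainder scaled by 2^8 */
    parts[1] = truncate_to_bf16(r1);
    float r2 = (r1 - parts[1]) * 256.0f;   /* remainder scaled by 2^16 overall */
    parts[2] = truncate_to_bf16(r2);
}

/* Recombine the three BF16-representable parts into one FP32 value. */
static float reconstruct_fp32(const float parts[3])
{
    /* Add the two small terms first so every intermediate sum is exact. */
    float low = parts[1] * (1.0f / 256.0f) + parts[2] * (1.0f / 65536.0f);
    return parts[0] + low;
}
```

Each part carries at most 8 significant bits, so the three parts together cover the 24-bit FP32 significand, which is why the reconstruction in this sketch returns the original value bit for bit.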

While BF16x9 can be supported on all hardware, it only provides a performance advantage when peak BF16 throughput is more than nine times greater than peak FP32 throughput. It also requires special hardware features to apply the additional scaling factors in a performant manner. As a result, BF16x9 is only supported on select architectures. See the Floating Point Emulation Support Overview table for more details.

1.5.2. Fixed-Point

Fixed-point emulation is used for emulating FP64 arithmetic and follows the Ozaki Scheme. Fixed-point representations emulate floating point through the addition of a shared power of two scaling factor and by encoding the remaining dynamic range of floating point within mantissa bits. The scaling factor is shared for elements in the same row of the A matrix or column of the B matrix and is used to logically scale all elements to be between -1 and 1 inclusively.

Due to the large dynamic range of FP64, there is no single configuration of fixed-point which is both performant and accurate for all floating point inputs. Therefore, we enable two flavors of fixed-point emulation: Dynamic Mantissa Control and Fixed Mantissa Control. These configurations can be set with cublasSetFixedPointEmulationMantissaControl().

1.5.2.1. Dynamic Mantissa Control

Dynamic mantissa control represents the cuBLAS library default mantissa control. Our automatic dynamic precision framework computes the proper number of fixed-point mantissa bits required to maintain equal or better accuracy than FP64. If the number of required mantissa bits exceeds a library defined default (see Default Library Configurations) or a user provided maximum number of bits (see cublasSetFixedPointEmulationMaxMantissaBitCount()), the framework dynamically dispatches to native FP64.

1.5.2.2. Fixed Mantissa Control

Fixed mantissa control can be leveraged to further accelerate fixed-point emulation. The user can provide the number of mantissa bits for the fixed-point representation via cublasSetFixedPointEmulationMaxMantissaBitCount(); however, without the automatic dynamic precision framework, it is not possible to guarantee equal or better accuracy than FP64 arithmetic.

1.5.2.3. Representation and Mappings

The fixed-point representation consists of a shared scaling factor for elements in the same row or column of a matrix, a sign bit, and mantissa bits. We store the sign bit and mantissa bits within 8-bit integers. Each matrix of 8-bit integers is referred to as a slice, and the computational cost grows quadratically with the number of slices. The formula to convert mantissa bit count to slice count is as follows:

\[\text{sliceCount} = \text{ceildiv}(\text{mantissaBitCount} + 1, 8)\]

Note

The number of mantissa bits will always be rounded up to fully occupy the least significant slice.
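
The formula and the rounding note can be checked with a short C sketch. Here `ceildiv` is assumed to mean integer division rounded up, and the helper names `slice_count` and `effective_mantissa_bits` are hypothetical, introduced only for illustration.

```c
#include <assert.h>

/* Integer division rounded up, for non-negative a and positive b. */
static int ceildiv(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* sliceCount = ceildiv(mantissaBitCount + 1, 8): one sign bit plus the
 * mantissa bits, packed into 8-bit slices. */
static int slice_count(int mantissa_bit_count)
{
    return ceildiv(mantissa_bit_count + 1, 8);
}

/* Mantissa bits available once the least significant slice is fully
 * occupied, as described in the note above. */
static int effective_mantissa_bits(int mantissa_bit_count)
{
    return slice_count(mantissa_bit_count) * 8 - 1;
}
```

With these definitions, 55 mantissa bits map to 7 slices and 79 mantissa bits map to 10 slices, while a request for 50 bits also occupies 7 slices and is therefore rounded up to 55 bits.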

1.5.2.4. Fixed-Point Workspace Requirements

To compute with fixed-point emulation, the A and B matrices are translated into a fixed-point representation in workspace memory. This leads to workspace requirements that are problem size and emulation parameter dependent. The following function will provide a safe bound (possibly overestimating) on the workspace required for fixed-point emulation:

// Note: ceildiv(a, b) denotes integer division rounded up; it is not defined in this snippet.
size_t getFixedPointWorkspaceSizeInBytes(int m, int n, int k, int batchCount, bool isComplex,
                                         cudaEmulationMantissaControl mantissaControl, int maxMantissaBitCount) {
    constexpr double MULTIPLIER = 1.25;

    int mult = isComplex ? 2 : 1;
    int numSlices = ceildiv(maxMantissaBitCount + 1, 8);

    int padded_m = ceildiv(m, 1024) * 1024;
    int padded_n = ceildiv(n, 1024) * 1024;
    int padded_k = ceildiv(k, 128) * 128;
    int num_blocks_k = ceildiv(k, 64);

    size_t gemm_workspace = sizeof(int8_t) * ((size_t)padded_m * padded_k + (size_t)padded_n * padded_k) * mult * numSlices;
    gemm_workspace += sizeof(int32_t) * ((size_t)padded_m + padded_n) * mult;
    if (isComplex) {
        gemm_workspace += sizeof(double) * (size_t)m * n * mult * mult;
    }

    size_t adp_workspace = 0;
    if (mantissaControl == CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC) {
        adp_workspace = sizeof(int32_t) * ((size_t)m * num_blocks_k + (size_t)n * num_blocks_k + (size_t)m * n) * mult;
    }

    constexpr size_t CONSTANT_SIZE = 128 * 1024 * 1024;
    return (size_t)(std::max(gemm_workspace, adp_workspace) * batchCount * MULTIPLIER) + CONSTANT_SIZE;
}

This function can be used to manage your own workspace memory with cublasSetWorkspace(), which can be used to guarantee reproducible results and improve performance.

1.5.2.5. Fixed-Point Performance Guide

Fixed-point emulation allows users to make performance and precision trade-offs for further acceleration. For dynamic mantissa control, users are able to configure the automatic dynamic precision framework to use fewer or more bits than the accuracy of native FP64 requires with cublasSetFixedPointEmulationMantissaBitOffset(). Fixed mantissa control can be similarly tuned by increasing or decreasing the number of mantissa bits with cublasSetFixedPointEmulationMaxMantissaBitCount().

Due to the large fixed-point workspace requirements, asynchronous allocation is done with cudaMallocAsync(). In cases where not enough GEMMs are called to amortize the cost of memory allocation, or very frequent CUDA stream synchronization occurs, you can improve performance by:

Reducing the number of CUDA stream synchronizations
Managing your own memory and providing workspace with cublasSetWorkspace()
Allowing the default memory pool to retain memory between synchronizations

1.5.3. Default Library Configurations

Library default values for emulation are subject to change.

Emulation Configuration Default Values
API	Mantissa Control	Default Behavior
cublasGetEmulationStrategy()	Not applicable	`CUBLAS_EMULATION_STRATEGY_DEFAULT`
cublasGetEmulationSpecialValuesSupport()	Not applicable	`CUBLAS_EMULATION_SPECIAL_VALUES_SUPPORT_DEFAULT`
cublasGetFixedPointEmulationMantissaControl()	Not applicable	`CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC`
cublasGetFixedPointEmulationMaxMantissaBitCount()	CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC	79
cublasGetFixedPointEmulationMaxMantissaBitCount()	CUDA_EMULATION_MANTISSA_CONTROL_FIXED	55
cublasGetFixedPointEmulationMantissaBitOffset()	Not applicable	0
cublasGetFixedPointEmulationMantissaBitCountPointer()	Not applicable	NULL

1.5.4. Support For Floating Point Special Values

The implementations of floating point emulation algorithms maintain the accuracy of the emulated precision for both normal and denormalized values but may not adhere to the IEEE-754 standard with respect to $\text{Inf}$, $\text{NaN}$, or signed zeros. If the underlying emulated algorithm cannot implicitly support a given special value, and the library is configured to support it (see cublasSetEmulationSpecialValuesSupport()), then extra steps are taken to support it. The following table shows which special values are implicitly supported for each emulation algorithm.

Emulation Algorithms Implicit Special Values Support
Floating Point Emulation Algorithm	Implicitly Supported Special Values
BF16x9	$\text{NaN}$
Fixed-Point	None

2. Using the cuBLAS API

2.1. General Description

This section describes how to use the cuBLAS library API.

2.1.1. Error Status

All cuBLAS library function calls return the error status cublasStatus_t.

2.1.2. cuBLAS Context

The application must initialize a handle to the cuBLAS library context by calling the cublasCreate() function. Then, the handle is explicitly passed to every subsequent library function call. Once the application finishes using the library, it must call the function cublasDestroy() to release the resources associated with the cuBLAS library context.

This approach allows the user to explicitly control the library setup when using multiple host threads and multiple GPUs. For example, the application can use cudaSetDevice() to associate different devices with different host threads and in each of those host threads it can initialize a unique handle to the cuBLAS library context, which will use the particular device associated with that host thread. Then, the cuBLAS library function calls made with different handles will automatically dispatch the computation to different devices.

The device associated with a particular cuBLAS context is assumed to remain unchanged between the corresponding cublasCreate() and cublasDestroy() calls. In order for the cuBLAS library to use a different device in the same host thread, the application must set the new device to be used by calling cudaSetDevice() and then create another cuBLAS context, which will be associated with the new device, by calling cublasCreate(). When multiple devices are available, applications must ensure that the device associated with a given cuBLAS context is current (e.g., by calling cudaSetDevice()) before invoking cuBLAS functions with this context.

A cuBLAS library context is tightly coupled with the CUDA context that is current at the time of the cublasCreate() call. An application that uses multiple CUDA contexts is required to create a cuBLAS context per CUDA context and make sure the former never outlives the latter. Starting from version 12.8, cuBLAS detects if the underlying CUDA context is tied to a graphics context and follows the shared memory size limits that are set in such case.

2.1.3. Thread Safety

The library is thread safe and its functions can be called from multiple host threads, even with the same handle. When multiple threads share the same handle, extreme care must be taken when the handle configuration is changed, because that change can affect subsequent cuBLAS calls in all threads. This applies even more to the destruction of the handle. It is therefore not recommended that multiple threads share the same cuBLAS handle.

Additional considerations apply when the same handle is used from multiple threads with a user provided workspace. See cublasSetWorkspace() for details.

2.1.4. Results Reproducibility

By design, all cuBLAS API routines from a given toolkit version generate the same bit-wise results at every run when executed on GPUs with the same architecture and the same number of SMs. However, bit-wise reproducibility is not guaranteed across toolkit versions because the implementation might change between versions.

This guarantee no longer holds when multiple CUDA streams are active or fixed-point emulation is used. If multiple concurrent streams are active, the library may optimize total performance by picking different internal implementations.

Note

The non-deterministic behavior of multi-stream execution is due to library optimizations in selecting internal workspace for the routines running in parallel streams. To avoid this effect, the user can either:

provide a separate workspace for each used stream using the cublasSetWorkspace() function, or
have one cuBLAS handle per stream, or
use cublasLtMatmul() instead of the GEMM family of functions and provide a user-owned workspace, or
set the debug environment variable CUBLAS_WORKSPACE_CONFIG to :16:8 (may limit overall performance) or :4096:8 (will increase library footprint in GPU memory by approximately 24MiB).

The non-deterministic behavior of fixed-point emulation is due to the large workspace memory requirements (see Fixed-Point Workspace Requirements for details). This requires dynamically allocating memory with cudaMallocAsync(), and allocation failures result in fallbacks to non-emulated routines. To avoid this effect, users can provide workspace via cublasSetWorkspace() to meet fixed-point emulation workspace requirements.

Any of those settings will allow for deterministic behavior even with multiple concurrent streams sharing a single cuBLAS handle.

This behavior is expected to change in a future release.

For some routines such as cublas<t>symv() and cublas<t>hemv(), an alternate significantly faster routine can be chosen using the routine cublasSetAtomicsMode(). In that case, the results are not guaranteed to be bit-wise reproducible because atomics are used for the computation.

2.1.5. Scalar Parameters

There are two categories of functions that use scalar parameters:

Functions that take alpha and/or beta parameters by reference on the host or the device as scaling factors, such as gemm.
Functions that return a scalar result on the host or the device, such as amax(), amin(), asum(), rotg(), rotmg(), dot() and nrm2().

For the functions of the first category, when the pointer mode is set to CUBLAS_POINTER_MODE_HOST, the scalar parameters alpha and/or beta can be on the stack or allocated on the heap, but should not be placed in managed memory. Underneath, the CUDA kernels related to those functions will be launched with the value of alpha and/or beta. Therefore, if they were allocated on the heap, they can be freed just after the return of the call even though the kernel launch is asynchronous. When the pointer mode is set to CUBLAS_POINTER_MODE_DEVICE, alpha and/or beta must be accessible on the device and their values should not be modified until the kernel is done. Note that since cudaFree() does an implicit cudaDeviceSynchronize(), cudaFree() can still be called on alpha and/or beta just after the call, but it would defeat the purpose of using this pointer mode in that case.

For the functions of the second category, when the pointer mode is set to CUBLAS_POINTER_MODE_HOST, these functions block the CPU until the GPU has completed its computation and the results have been copied back to the Host. When the pointer mode is set to CUBLAS_POINTER_MODE_DEVICE, these functions return immediately. In this case, similar to matrix and vector results, the scalar result is ready only when execution of the routine on the GPU has completed. This requires proper synchronization in order to read the result from the host.

In either case, the pointer mode CUBLAS_POINTER_MODE_DEVICE allows the library functions to execute completely asynchronously from the Host even when alpha and/or beta are generated by a previous kernel. For example, this situation can arise when iterative methods for solution of linear systems and eigenvalue problems are implemented using the cuBLAS library.

2.1.6. Parallelism with Streams

If the application uses the results computed by multiple independent tasks, CUDA™ streams can be used to overlap the computation performed in these tasks.

The application can conceptually associate each stream with each task. In order to achieve the overlap of computation between the tasks, the user should create CUDA™ streams using the function cudaStreamCreate() and set the stream to be used by each individual cuBLAS library routine by calling cublasSetStream() just before calling the actual cuBLAS routine. Note that cublasSetStream() resets the user-provided workspace to the default workspace pool; see cublasSetWorkspace(). Then, the computation performed in separate streams would be overlapped automatically when possible on the GPU. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work.

We recommend using the new cuBLAS API with scalar parameters and results passed by reference in the device memory to achieve maximum overlap of the computation when using streams.

A particular application of streams, batching of multiple small kernels, is described in the following section.

2.1.7. Batching Kernels

In this section, we explain how to use streams to batch the execution of small kernels. For instance, suppose that we have an application where we need to make many small independent matrix-matrix multiplications with dense matrices.

It is clear that even with millions of small independent matrices we will not be able to achieve the same GFLOPS rate as with one large matrix. For example, a single $n \times n$ large matrix-matrix multiplication performs $n^{3}$ operations for $n^{2}$ input size, while 1024 $\frac{n}{32} \times \frac{n}{32}$ small matrix-matrix multiplications perform $1024\left( \frac{n}{32} \right)^{3} = \frac{n^{3}}{32}$ operations for the same input size. However, it is also clear that we can achieve a significantly better performance with many small independent matrices compared with a single small matrix.
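
The operation counts in this comparison can be reproduced with a few lines of C. The function names are hypothetical, and the sketch follows the text's convention of counting an s x s multiplication as s^3 operations and s^2 elements of input.

```c
#include <assert.h>
#include <stdint.h>

/* Operation count used in the text: an s x s matrix-matrix multiplication
 * is counted as s^3 operations. */
static uint64_t gemm_ops(uint64_t s)
{
    return s * s * s;
}

/* Total operations for 1024 independent (n/32) x (n/32) multiplications. */
static uint64_t batched_ops(uint64_t n)
{
    assert(n % 32 == 0);
    return 1024u * gemm_ops(n / 32);
}

/* Total input size: 1024 small matrices of (n/32)^2 elements each. */
static uint64_t batched_elements(uint64_t n)
{
    assert(n % 32 == 0);
    return 1024u * (n / 32) * (n / 32);
}
```

For n = 4096, the single large multiplication counts 68,719,476,736 operations, while the 1024 multiplications of size 128 count 2,147,483,648 operations on the same 16,777,216 input elements, which is the factor of 32 stated above.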

The GPU architecture allows us to execute multiple kernels simultaneously. Hence, in order to batch the execution of independent kernels, we can run each of them in a separate stream. In particular, in the above example we could create 1024 CUDA™ streams using the function cudaStreamCreate(), then preface each call to cublas<t>gemm() with a call to cublasSetStream() with a different stream for each of the matrix-matrix multiplications (note that cublasSetStream() resets the user-provided workspace to the default workspace pool; see cublasSetWorkspace()). This will ensure that when possible the different computations will be executed concurrently. Although the user can create many streams, in practice it is not possible to have more than 32 concurrent kernels executing at the same time.

2.1.8. Cache Configuration

On some devices, L1 cache and shared memory use the same hardware resources. The cache configuration can be set directly with the CUDA Runtime function cudaDeviceSetCacheConfig. The cache configuration can also be set specifically for some functions using the routine cudaFuncSetCacheConfig. Please refer to the CUDA Runtime API documentation for details about the cache configuration settings.

Because switching from one configuration to another can affect kernel concurrency, the cuBLAS Library does not set any cache configuration preference and relies on the current setting. However, some cuBLAS routines, especially Level-3 routines, rely heavily on shared memory. Thus, the cache preference setting might adversely affect their performance.

2.1.9. Static Library Support

The cuBLAS Library is also delivered in a static form as libcublas_static.a on Linux. The static cuBLAS library and all other static math libraries depend on a common thread abstraction layer library called libculibos.a.

For example, on Linux, to compile a small application using cuBLAS, against the dynamic library, the following command can be used:

nvcc myCublasApp.c -lcublas -o myCublasApp

Whereas to compile against the static cuBLAS library, the following command must be used:

nvcc myCublasApp.c -lcublas_static -lculibos -o myCublasApp

It is also possible to use the native Host C++ compiler. Depending on the Host operating system, some additional libraries like pthread or dl might be needed on the linking line. The following command on Linux is suggested:

g++ myCublasApp.c -lcublas_static -lculibos -lcudart_static -lpthread -ldl -I <cuda-toolkit-path>/include -L <cuda-toolkit-path>/lib64 -o myCublasApp

Note that in the latter case, the library cuda is not needed. The CUDA Runtime will try to explicitly open the cuda library if needed. In the case of a system which does not have the CUDA driver installed, this allows the application to gracefully manage this issue and potentially run if a CPU-only path is available.

Starting with release 11.2, using the typed functions instead of the extension functions (cublas**Ex()) helps in reducing the binary size when linking to static cuBLAS Library.

2.1.10.GEMM Algorithms Numerical Behavior

Some GEMM algorithms split the computation along the dimension K to increase the GPU occupancy, especially when the dimension K is large compared to dimensions M and N. When this type of algorithm is chosen by the cuBLAS heuristics or explicitly by the user, the results of each split are summed deterministically into the resulting matrix to get the final result.

For the routines cublas<t>gemmEx() and cublasGemmEx(), when the compute type is greater than the output type, the sum of the split chunks can potentially lead to some intermediate overflows, thus producing a final resulting matrix with some overflows. Those overflows might not have occurred if all the dot products had been accumulated in the compute type before being converted at the end to the output type. This computation side-effect can be easily exposed when the computeType is CUDA_R_32F and Atype, Btype and Ctype are CUDA_R_16F. This behavior can be controlled using the compute precision mode CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION with cublasSetMathMode().
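
The overflow described above can be illustrated with a small host-only sketch. It is a simplified model and not cuBLAS code: FP16 is reduced to its range limit (largest finite value 65504), rounding is ignored, and all function names are hypothetical. It contrasts summing split-K partial results in the FP32 compute type with summing them in the FP16 output type.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Largest finite FP16 value. */
#define FP16_MAX 65504.0f

/* Simplified model of a store to FP16: only the range limit is applied;
 * rounding to the FP16 mantissa is ignored (assumption of this sketch). */
static float to_fp16_range(float v)
{
    if (v > FP16_MAX)  return INFINITY;
    if (v < -FP16_MAX) return -INFINITY;
    return v;  /* in-range values and NaN pass through unchanged */
}

/* Sum all split-K partial results in the compute type (FP32), convert once. */
float reduce_in_compute_type(const float *partials, size_t count)
{
    assert(partials != NULL || count == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i)
        acc += partials[i];
    return to_fp16_range(acc);
}

/* Sum the partial results in the output type: every intermediate is FP16. */
float reduce_in_output_type(const float *partials, size_t count)
{
    assert(partials != NULL || count == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < count; ++i)
        acc = to_fp16_range(acc + to_fp16_range(partials[i]));
    return acc;
}
```

With the partial results 60000, 40000 and -60000, the first function returns 40000, while the second saturates to infinity as soon as the first two chunks are added and never recovers.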

2.1.11.Tensor Core Usage

Tensor cores were first introduced with Volta GPUs (compute capability 7.0 and above) and significantly accelerate matrix multiplications. Starting with cuBLAS version 11.0.0, the library may automatically make use of Tensor Core capabilities wherever possible, unless they are explicitly disabled by selecting pedantic compute modes in cuBLAS (see cublasSetMathMode(), cublasMath_t).

It should be noted that the library will pick a Tensor Core enabled implementation wherever it determines that it would provide the best performance.

The best performance when using Tensor Cores can be achieved when the matrix dimensions and pointers meet certain memory alignment requirements. Specifically, all of the following conditions must be satisfied to get the most performance out of Tensor Cores:

((op_A == CUBLAS_OP_N ? m : k) * AtypeSize) % 16 == 0
((op_B == CUBLAS_OP_N ? k : n) * BtypeSize) % 16 == 0
(m * CtypeSize) % 16 == 0
(lda * AtypeSize) % 16 == 0
(ldb * BtypeSize) % 16 == 0
(ldc * CtypeSize) % 16 == 0
intptr_t(A) % 16 == 0
intptr_t(B) % 16 == 0
intptr_t(C) % 16 == 0

To conduct matrix multiplication with FP8 types (see 8-bit Floating Point Data Types (FP8) Usage), you must ensure that your matrix dimensions and pointers meet the optimal requirements listed above. Aside from FP8, there are no longer any restrictions on matrix dimensions and memory alignments to use Tensor Cores (starting with cuBLAS version 11.0.0).
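
The nine alignment conditions listed above can be bundled into one predicate that an application evaluates before a GEMM call. The following is a minimal host-only sketch: `op_t`, `operand_t`, and `meets_tensor_core_alignment` are hypothetical names introduced here (with `op_t` standing in for cublasOperation_t so that the code compiles without CUDA headers), and element sizes are given in bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for cublasOperation_t (hypothetical, for illustration only). */
typedef enum { OP_N, OP_T } op_t;

/* One GEMM operand: base pointer, leading dimension, element size in bytes. */
typedef struct {
    const void *ptr;
    int64_t     ld;
    int64_t     elem_size;
} operand_t;

static bool multiple_of_16(int64_t bytes) { return bytes % 16 == 0; }

/* Returns true when all nine conditions from the list hold. */
bool meets_tensor_core_alignment(op_t op_a, op_t op_b,
                                 int64_t m, int64_t n, int64_t k,
                                 operand_t a, operand_t b, operand_t c)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    return multiple_of_16((op_a == OP_N ? m : k) * a.elem_size)
        && multiple_of_16((op_b == OP_N ? k : n) * b.elem_size)
        && multiple_of_16(m * c.elem_size)
        && multiple_of_16(a.ld * a.elem_size)
        && multiple_of_16(b.ld * b.elem_size)
        && multiple_of_16(c.ld * c.elem_size)
        && (uintptr_t)a.ptr % 16 == 0
        && (uintptr_t)b.ptr % 16 == 0
        && (uintptr_t)c.ptr % 16 == 0;
}
```

For FP16 operands (2 bytes per element) the byte conditions reduce to dimensions and leading dimensions that are multiples of 8; a value such as m = 63 fails the check.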

2.1.12.CUDA Graphs Support

cuBLAS routines can be captured in CUDA Graph stream capture without restrictions in most situations.

The exceptions are routines that output results into host buffers (e.g. cublas<t>dot() while pointer mode CUBLAS_POINTER_MODE_HOST is configured), as this enforces synchronization.

For input coefficients (such as alpha, beta) behavior depends on the pointer mode setting:

In the case of CUBLAS(LT)_POINTER_MODE_HOST, coefficient values are captured in the graph.
In the case of pointer modes with device pointers, the coefficient value is accessed using the device pointer at the time of graph execution.
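
The difference between the two cases can be pictured with a host-only analogy. The sketch below is not how cuBLAS or CUDA Graphs are implemented; `captured_coeff_t`, `capture_coeff`, and `coeff_at_launch` are hypothetical names, and an ordinary host pointer plays the role that a device pointer has in the real library. It only shows that a coefficient captured by value is frozen at capture time, whereas one captured by pointer is re-read at every launch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of how a scalar coefficient is stored in a captured node. */
typedef struct {
    bool         by_value;   /* true: host pointer mode, value copied at capture */
    float        value;      /* used when by_value is true */
    const float *location;   /* used when by_value is false */
} captured_coeff_t;

/* Models what happens at capture time. */
captured_coeff_t capture_coeff(const float *coeff, bool host_pointer_mode)
{
    assert(coeff != NULL);
    captured_coeff_t node = { host_pointer_mode, 0.0f, NULL };
    if (host_pointer_mode)
        node.value = *coeff;     /* snapshot taken now */
    else
        node.location = coeff;   /* dereferenced later, at launch */
    return node;
}

/* Models what a graph launch sees. */
float coeff_at_launch(const captured_coeff_t *node)
{
    return node->by_value ? node->value : *node->location;
}
```

In practice this means that updating alpha or beta between launches of an already captured graph only has an effect when a device pointer mode was active during capture.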

Note

When captured in CUDA Graph stream capture, cuBLAS routines can create memory nodes through the use of stream-ordered allocation APIs, cudaMallocAsync and cudaFreeAsync. However, as there is currently no support for memory nodes in child graphs or graphs launched from the device, attempts to capture cuBLAS routines in such scenarios may fail. To avoid this issue, use the cublasSetWorkspace() function to provide user-owned workspace memory.

2.1.13.64-bit Integer Interface

cuBLAS version 12 introduced 64-bit integer capable functions. Each 64-bit integer function is equivalent to a 32-bit integer function with the following changes:

The function name has a _64 suffix.
The dimension (problem size) data type changed from int to int64_t. Examples of dimension: m, n, and k.
The leading dimension data type changed from int to int64_t. Examples of leading dimension: lda, ldb, and ldc.
The vector increment data type changed from int to int64_t. Examples of vector increment: incx and incy.

For example, consider the following 32-bit integer functions:

cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb);
cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result);
cublasStatus_t cublasSsyr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const float *x, int incx, float *A, int lda);

The equivalent 64-bit integer functions are:

cublasStatus_t cublasSetMatrix_64(int64_t rows, int64_t cols, int64_t elemSize, const void *A, int64_t lda, void *B, int64_t ldb);
cublasStatus_t cublasIsamax_64(cublasHandle_t handle, int64_t n, const float *x, int64_t incx, int64_t *result);
cublasStatus_t cublasSsyr_64(cublasHandle_t handle, cublasFillMode_t uplo, int64_t n, const float *alpha, const float *x, int64_t incx, float *A, int64_t lda);

Not every function has a 64-bit integer equivalent. For instance, cublasSetMathMode() doesn’t have any arguments that could meaningfully be int64_t. For documentation brevity, the 64-bit integer APIs are not explicitly listed; the documentation only mentions that they exist for the relevant functions.
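
An application that handles problems of arbitrary size may want to decide at run time whether its arguments still fit the 32-bit interface. The following is a minimal host-only sketch under that assumption; `fits_32bit_interface` is a hypothetical helper and not part of cuBLAS.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true when every dimension, leading dimension, or increment
 * in `values` can be passed through an `int` parameter. */
bool fits_32bit_interface(const int64_t *values, size_t count)
{
    assert(values != NULL || count == 0);
    for (size_t i = 0; i < count; ++i)
        if (values[i] > INT_MAX || values[i] < INT_MIN)
            return false;
    return true;
}
```

When the helper returns false for, say, a vector length of 3,000,000,000, the caller would select the `_64` variant (for example cublasIsamax_64() instead of cublasIsamax()); always calling the `_64` variant is a simpler alternative.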

2.2.cuBLAS Datatypes Reference

2.2.1.cublasHandle_t

The cublasHandle_t type is a pointer type to an opaque structure holding the cuBLAS library context. The cuBLAS library context must be initialized using cublasCreate() and the returned handle must be passed to all subsequent library function calls. The context should be destroyed at the end using cublasDestroy().

2.2.2.cublasStatus_t

The type is used for function status returns. All cuBLAS library functions return their status, which can have the following values.

Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The cuBLAS library was not initialized. This is usually caused by the lack of a prior cublasCreate() call, an error in the CUDA Runtime API called by the cuBLAS routine, or an error in the hardware setup. To correct: call cublasCreate() before the function call; and check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed.
`CUBLAS_STATUS_ALLOC_FAILED`	Resource allocation failed inside the cuBLAS library. This is usually caused by a `cudaMalloc()` failure. To correct: prior to the function call, deallocate previously allocated memory as much as possible.
`CUBLAS_STATUS_INVALID_VALUE`	An unsupported value or parameter was passed to the function (a negative vector size, for example). To correct: ensure that all the parameters being passed have valid values.
`CUBLAS_STATUS_ARCH_MISMATCH`	The function requires a feature absent from the device architecture; usually caused by compute capability lower than 5.0. To correct: compile and run the application on a device with appropriate compute capability.
`CUBLAS_STATUS_MAPPING_ERROR`	An access to GPU memory space failed, which is usually caused by a failure to bind a texture. To correct: before the function call, unbind any previously bound textures.
`CUBLAS_STATUS_EXECUTION_FAILED`	The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed.
`CUBLAS_STATUS_INTERNAL_ERROR`	An internal cuBLAS operation failed. This error is usually caused by a `cudaMemcpyAsync()` failure. To correct: check that the hardware, an appropriate version of the driver, and the cuBLAS library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine’s completion.
`CUBLAS_STATUS_NOT_SUPPORTED`	The functionality requested is not supported.
`CUBLAS_STATUS_LICENSE_ERROR`	The functionality requested requires some license and an error was detected when trying to check the current licensing. This error can happen if the license is not present or is expired or if the environment variable NVIDIA_LICENSE_FILE is not set properly.

2.2.3.cublasOperation_t

The cublasOperation_t type indicates which operation needs to be performed with the dense matrix. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-transpose), ‘T’ or ‘t’ (transpose) and ‘C’ or ‘c’ (conjugate transpose) that are often used as parameters to legacy BLAS implementations.

Value	Meaning
`CUBLAS_OP_N`	The non-transpose operation is selected.
`CUBLAS_OP_T`	The transpose operation is selected.
`CUBLAS_OP_C`	The conjugate transpose operation is selected.

2.2.4.cublasFillMode_t

The type indicates which part (lower or upper) of the dense matrix was filled and consequently should be used by the function. Its values correspond to Fortran characters ‘L’ or ‘l’ (lower) and ‘U’ or ‘u’ (upper) that are often used as parameters to legacy BLAS implementations.

Value	Meaning
`CUBLAS_FILL_MODE_LOWER`	The lower part of the matrix is filled.
`CUBLAS_FILL_MODE_UPPER`	The upper part of the matrix is filled.
`CUBLAS_FILL_MODE_FULL`	The full matrix is filled.

2.2.5.cublasDiagType_t

The type indicates whether the main diagonal of the dense matrix is unity and consequently should not be touched or modified by the function. Its values correspond to Fortran characters ‘N’ or ‘n’ (non-unit) and ‘U’ or ‘u’ (unit) that are often used as parameters to legacy BLAS implementations.

Value	Meaning
`CUBLAS_DIAG_NON_UNIT`	The matrix diagonal has non-unit elements.
`CUBLAS_DIAG_UNIT`	The matrix diagonal has unit elements.

2.2.6.cublasSideMode_t

The type indicates whether the dense matrix is on the left or right side in the matrix equation solved by a particular function. Its values correspond to Fortran characters ‘L’ or ‘l’ (left) and ‘R’ or ‘r’ (right) that are often used as parameters to legacy BLAS implementations.

Value	Meaning
`CUBLAS_SIDE_LEFT`	The matrix is on the left side in the equation.
`CUBLAS_SIDE_RIGHT`	The matrix is on the right side in the equation.

2.2.7.cublasPointerMode_t

The cublasPointerMode_t type indicates whether the scalar values are passed by reference on the host or device. It is important to point out that if several scalar values are present in the function call, all of them must conform to the same single pointer mode. The pointer mode can be set and retrieved using the cublasSetPointerMode() and cublasGetPointerMode() routines, respectively.

Value	Meaning
`CUBLAS_POINTER_MODE_HOST`	The scalars are passed by reference on the host.
`CUBLAS_POINTER_MODE_DEVICE`	The scalars are passed by reference on the device.

2.2.8.cublasAtomicsMode_t

The type indicates whether cuBLAS routines that have an alternate implementation using atomics can be used. The atomics mode can be set and queried using the cublasSetAtomicsMode() and cublasGetAtomicsMode() routines, respectively.

Value	Meaning
`CUBLAS_ATOMICS_NOT_ALLOWED`	The usage of atomics is not allowed.
`CUBLAS_ATOMICS_ALLOWED`	The usage of atomics is allowed.

2.2.9.cublasGemmAlgo_t

The cublasGemmAlgo_t type is an enumerant to specify the algorithm for matrix-matrix multiplication on GPU architectures up to sm_75. On sm_80 and newer GPU architectures, this enumerant has no effect. cuBLAS has the following algorithm options:

Value	Meaning
`CUBLAS_GEMM_DEFAULT`	Apply Heuristics to select the GEMM algorithm
`CUBLAS_GEMM_ALGO0` to `CUBLAS_GEMM_ALGO23`	Explicitly choose an Algorithm `0..23`. Note: Has no effect on NVIDIA Ampere architecture GPUs and newer.
`CUBLAS_GEMM_DEFAULT_TENSOR_OP` [DEPRECATED]	This mode is deprecated and will be removed in a future release. Apply Heuristics to select the GEMM algorithm, while allowing use of reduced precision CUBLAS_COMPUTE_32F_FAST_16F kernels (for backward compatibility).
`CUBLAS_GEMM_ALGO0_TENSOR_OP` to `CUBLAS_GEMM_ALGO15_TENSOR_OP` [DEPRECATED]	Those values are deprecated and will be removed in a future release. Explicitly choose a Tensor core GEMM Algorithm `0..15`. Allows use of reduced precision CUBLAS_COMPUTE_32F_FAST_16F kernels (for backward compatibility). Note: Has no effect on NVIDIA Ampere architecture GPUs and newer.
`CUBLAS_GEMM_AUTOTUNE`	[EXPERIMENTAL] The library will benchmark a number of available algorithms and choose the optimal one for the given problem configuration. The solution is cached in the cuBLAS handle so that subsequent calls with the same problem size will use the cached configuration. Note: To avoid overwriting the user’s data, the library will allocate the amount of memory corresponding to the size of the output. Note: The benchmarking is not supported during stream capture; CUBLAS_STATUS_NOT_SUPPORTED will be returned under stream capture if no configuration was found in the cache for the given problem size.

2.2.10.cublasMath_t

The cublasMath_t enumerate type is used in cublasSetMathMode() to choose compute precision modes as defined in the following table. Since this setting does not directly control the use of Tensor Cores, the mode CUBLAS_TENSOR_OP_MATH is being deprecated, and will be removed in a future release.

Value	Meaning
`CUBLAS_DEFAULT_MATH`	This is the default and highest-performance mode that uses compute and intermediate storage precisions with at least the same number of mantissa and exponent bits as requested. Tensor Cores will be used whenever possible.
`CUBLAS_PEDANTIC_MATH`	This mode uses the prescribed precision and standardized arithmetic for all phases of calculations and is primarily intended for numerical robustness studies, testing, and debugging. This mode might not be as performant as the other modes.
`CUBLAS_TF32_TENSOR_OP_MATH`	Enable acceleration of single-precision routines using TF32 tensor cores. Note that input conversions round to nearest even.
`CUBLAS_FP32_EMULATED_BF16X9_MATH`	Enable acceleration of single-precision routines using the BF16x9 algorithm. See Floating Point Emulation for more details. For single precision GEMM routines cuBLAS will use the `CUBLAS_COMPUTE_32F_EMULATED_16BFX9` compute type.
`CUBLAS_FP64_EMULATED_FIXEDPOINT_MATH`	Enable acceleration of double-precision routines using fixed-point emulation algorithms. See Floating Point Emulation for more details.
`CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION`	Forces any reductions during matrix multiplications to use the accumulator type (that is, compute type) and not the output type in case of mixed precision routines where output type precision is less than the compute type precision. This is a flag that can be set (using a bitwise OR operation) alongside any of the other values.
`CUBLAS_TENSOR_OP_MATH` [DEPRECATED]	This mode is deprecated and will be removed in a future release. Allows the library to use Tensor Core operations whenever possible. For single precision GEMM routines cuBLAS will use the `CUBLAS_COMPUTE_32F_FAST_16F` compute type.

2.2.11.cublasComputeType_t

The cublasComputeType_t enumerate type is used in cublasGemmEx() and cublasLtMatmul() (including all batched and strided batched variants) to choose compute precision modes as defined below.

Value	Meaning
`CUBLAS_COMPUTE_16F`	This is the default and highest-performance mode for 16-bit half precision floating point and all compute and intermediate storage precisions with at least 16-bit half precision. Tensor Cores will be used whenever possible.
`CUBLAS_COMPUTE_16F_PEDANTIC`	This mode uses 16-bit half precision floating point standardized arithmetic for all phases of calculations and is primarily intended for numerical robustness studies, testing, and debugging. This mode might not be as performant as the other modes since it disables use of tensor cores.
`CUBLAS_COMPUTE_32F`	This is the default 32-bit single precision floating point mode and uses compute and intermediate storage precisions of at least 32 bits.
`CUBLAS_COMPUTE_32F_PEDANTIC`	Uses 32-bit single precision floating point arithmetic for all phases of calculations and also disables algorithmic optimizations such as Gaussian complexity reduction (3M).
`CUBLAS_COMPUTE_32F_FAST_16F`	Allows the library to use Tensor Cores with automatic down-conversion and 16-bit half-precision compute for 32-bit input and output matrices.
`CUBLAS_COMPUTE_32F_FAST_16BF`	Allows the library to use Tensor Cores with automatic down-conversion and bfloat16 compute for 32-bit input and output matrices. See the Alternate Floating Point section for more details on bfloat16.
`CUBLAS_COMPUTE_32F_FAST_TF32`	Allows the library to use Tensor Cores with TF32 compute for 32-bit floating point input and output matrices. Note that input conversions round to nearest even. See the Alternate Floating Point section for more details on TF32 compute.
`CUBLAS_COMPUTE_32F_EMULATED_16BFX9`	Allows the library to use the BF16x9 floating point emulation algorithm for 32-bit floating point arithmetic. See Floating Point Emulation for more details.
`CUBLAS_COMPUTE_64F`	This is the default 64-bit double precision floating point mode and uses compute and intermediate storage precisions of at least 64 bits.
`CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT`	Allows the library to use fixed-point emulation algorithms for 64-bit double precision floating point arithmetic. See Floating Point Emulation for more details.
`CUBLAS_COMPUTE_64F_PEDANTIC`	Uses 64-bit double precision floating point arithmetic for all phases of calculations and also disables algorithmic optimizations such as Gaussian complexity reduction (3M).
`CUBLAS_COMPUTE_32I`	This is the default 32-bit integer mode and uses compute and intermediate storage precisions of at least 32 bits.
`CUBLAS_COMPUTE_32I_PEDANTIC`	Uses 32-bit integer arithmetic for all phases of calculations.

Note

Setting the environment variable NVIDIA_TF32_OVERRIDE=0 will override any defaults or programmatic configuration of NVIDIA libraries, and consequently, cuBLAS will not accelerate single-precision computations with TF32 tensor cores.

2.2.12.cublasEmulationStrategy_t

The cublasEmulationStrategy_t enumerate type is used in cublasSetEmulationStrategy() to choose how to leverage floating point emulation algorithms.

Value	Meaning
`CUBLAS_EMULATION_STRATEGY_DEFAULT`	This is the default emulation strategy and is equivalent to `CUBLAS_EMULATION_STRATEGY_PERFORMANT` unless the `CUBLAS_EMULATION_STRATEGY` environment variable is set.
`CUBLAS_EMULATION_STRATEGY_PERFORMANT`	A strategy which utilizes emulation whenever it provides a performance benefit.
`CUBLAS_EMULATION_STRATEGY_EAGER`	A strategy which utilizes emulation whenever possible.

Note

In general, the cublasSetEmulationStrategy() function takes precedence over the environment variable setting. However, setting the environment variable CUBLAS_EMULATION_STRATEGY to performant or eager will override the default emulation strategy with the corresponding emulation strategy, even if the default strategy was set by the function call.
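
The precedence rule in the note can be written down as a small decision function. This is one reading of the documented behavior expressed as a host-only sketch, not library code: `strategy_t` and `effective_strategy` are hypothetical names, and the sketch assumes that only the exact lowercase strings performant and eager are recognized in the environment variable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for the cublasEmulationStrategy_t values (hypothetical names). */
typedef enum { STRATEGY_DEFAULT, STRATEGY_PERFORMANT, STRATEGY_EAGER } strategy_t;

/* api_setting: strategy last set through the API (STRATEGY_DEFAULT if never set).
 * env_value:   value of CUBLAS_EMULATION_STRATEGY, or NULL when unset.
 * Assumption: only the exact strings "performant" and "eager" are recognized. */
strategy_t effective_strategy(strategy_t api_setting, const char *env_value)
{
    if (api_setting != STRATEGY_DEFAULT)
        return api_setting;                /* explicit API choice wins */
    if (env_value != NULL && strcmp(env_value, "eager") == 0)
        return STRATEGY_EAGER;             /* environment overrides the default */
    return STRATEGY_PERFORMANT;            /* default, or env set to "performant" */
}
```

Under this reading, the environment variable only matters while the handle is configured with the default strategy, whether that default was never changed or was set explicitly through the function call.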

2.3.CUDA Datatypes Reference

This chapter describes types shared by multiple CUDA Libraries and defined in the header file library_types.h.

2.3.1.cudaDataType_t

The cudaDataType_t type is an enumerant to specify the data precision. It is used when the data reference does not carry the type itself (e.g., void *).

For example, it is used in the routine cublasSgemmEx().

Value	Meaning
`CUDA_R_16F`	The data type is a 16-bit real half precision floating-point
`CUDA_C_16F`	The data type is a 32-bit structure comprised of two half precision floating-points representing a complex number.
`CUDA_R_16BF`	The data type is a 16-bit real bfloat16 floating-point
`CUDA_C_16BF`	The data type is a 32-bit structure comprised of two bfloat16 floating-points representing a complex number.
`CUDA_R_32F`	The data type is a 32-bit real single precision floating-point
`CUDA_C_32F`	The data type is a 64-bit structure comprised of two single precision floating-points representing a complex number.
`CUDA_R_64F`	The data type is a 64-bit real double precision floating-point
`CUDA_C_64F`	The data type is a 128-bit structure comprised of two double precision floating-points representing a complex number.
`CUDA_R_8I`	The data type is an 8-bit real signed integer
`CUDA_C_8I`	The data type is a 16-bit structure comprised of two 8-bit signed integers representing a complex number.
`CUDA_R_8U`	The data type is an 8-bit real unsigned integer
`CUDA_C_8U`	The data type is a 16-bit structure comprised of two 8-bit unsigned integers representing a complex number.
`CUDA_R_32I`	The data type is a 32-bit real signed integer
`CUDA_C_32I`	The data type is a 64-bit structure comprised of two 32-bit signed integers representing a complex number.
`CUDA_R_8F_E4M3`	The data type is an 8-bit real floating point in E4M3 format
`CUDA_R_8F_E5M2`	The data type is an 8-bit real floating point in E5M2 format
`CUDA_R_4F_E2M1`	The data type is a 4-bit real floating point in E2M1 format
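
Because a void * argument carries no size information, generic code often needs the element width that belongs to each enumerant. The sketch below encodes the widths stated in the table above. It is host-only and uses a hypothetical stand-in enum `dtype_t` with `DT_` prefixed members instead of cudaDataType_t so that it compiles without CUDA headers; the width is returned in bits because the 4-bit type does not fill a whole byte.

```c
#include <assert.h>

/* Stand-in for cudaDataType_t; member names mirror the table (hypothetical). */
typedef enum {
    DT_R_16F, DT_C_16F, DT_R_16BF, DT_C_16BF, DT_R_32F, DT_C_32F,
    DT_R_64F, DT_C_64F, DT_R_8I, DT_C_8I, DT_R_8U, DT_C_8U,
    DT_R_32I, DT_C_32I, DT_R_8F_E4M3, DT_R_8F_E5M2, DT_R_4F_E2M1
} dtype_t;

/* Width of one element in bits, as listed in the table; 0 for unknown values. */
int dtype_bits(dtype_t t)
{
    switch (t) {
    case DT_R_4F_E2M1:
        return 4;
    case DT_R_8I: case DT_R_8U: case DT_R_8F_E4M3: case DT_R_8F_E5M2:
        return 8;
    case DT_R_16F: case DT_R_16BF: case DT_C_8I: case DT_C_8U:
        return 16;
    case DT_C_16F: case DT_C_16BF: case DT_R_32F: case DT_R_32I:
        return 32;
    case DT_C_32F: case DT_R_64F: case DT_C_32I:
        return 64;
    case DT_C_64F:
        return 128;
    }
    return 0;
}
```

For the byte-sized types, dividing the result by 8 gives the element size in bytes, which is the quantity that appears as AtypeSize, BtypeSize, and CtypeSize in alignment calculations.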

2.3.2.cudaEmulationStrategy_t

The cudaEmulationStrategy_t is a parameter to specify how to leverage floating point emulation algorithms. This is equivalent to cublasEmulationStrategy_t.

2.3.3.cudaEmulationMantissaControl_t

The cudaEmulationMantissaControl_t is an enumerated type that specifies how the number of mantissa bits is calculated in floating point emulation algorithms. See cublasSetFixedPointEmulationMantissaControl() and cublasGetFixedPointEmulationMaxMantissaBitCount().

Value	Meaning
`CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC`	The number of retained mantissa bits is computed at runtime to ensure the same or better accuracy than the native floating-point representation.
`CUDA_EMULATION_MANTISSA_CONTROL_FIXED`	The number of retained mantissa bits is fixed at runtime.

2.3.4.cudaEmulationSpecialValuesSupport_t

The cudaEmulationSpecialValuesSupport_t is an enumerated type that specifies which floating point special values floating point emulation algorithms are required to support. See cublasSetEmulationSpecialValuesSupport() and cublasGetEmulationSpecialValuesSupport().

Value	Meaning
`CUDA_EMULATION_SPECIAL_VALUES_SUPPORT_DEFAULT`	The default special value support mask which contains support for signed infinities and NaN values.
`CUDA_EMULATION_SPECIAL_VALUES_SUPPORT_NONE`	There are no requirements for emulation algorithms to support special values.
`CUDA_EMULATION_SPECIAL_VALUES_SUPPORT_INFINITY`	Require emulation algorithms to handle signed infinity inputs and outputs.
`CUDA_EMULATION_SPECIAL_VALUES_SUPPORT_NAN`	Require emulation algorithms to handle NaN inputs and outputs.

2.3.5.libraryPropertyType_t

The libraryPropertyType_t is used as a parameter to specify which property is requested when using the routine cublasGetProperty().

Value	Meaning
`MAJOR_VERSION`	enumerant to query the major version
`MINOR_VERSION`	enumerant to query the minor version
`PATCH_LEVEL`	number to identify the patch level

2.4.cuBLAS Helper Function Reference

2.4.1.cublasCreate()

cublasStatus_t cublasCreate(cublasHandle_t *handle)

This function initializes the cuBLAS library and creates a handle to an opaque structure holding the cuBLAS library context. It allocates hardware resources on the host and device and must be called prior to making any other cuBLAS library calls.

The cuBLAS library context is tied to the current CUDA device. To use the library on multiple devices, one cuBLAS handle needs to be created for each device. See also cuBLAS Context.

For a given device, multiple cuBLAS handles with different configurations can be created. For multi-threaded applications that use the same device from different threads, the recommended programming model is to create one cuBLAS handle per thread and use that cuBLAS handle for the entire life of the thread.

Because cublasCreate() allocates some internal resources and the release of those resources by calling cublasDestroy() will implicitly call cudaDeviceSynchronize(), it is recommended to minimize the number of times these functions are called.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The initialization succeeded
`CUBLAS_STATUS_NOT_INITIALIZED`	The CUDA™ Runtime initialization failed
`CUBLAS_STATUS_ALLOC_FAILED`	The resources could not be allocated
`CUBLAS_STATUS_INVALID_VALUE`	`handle` is NULL

2.4.2.cublasDestroy()

cublasStatus_t cublasDestroy(cublasHandle_t handle)

This function releases hardware resources used by the cuBLAS library. This function is usually the last call with a particular handle to the cuBLAS library. Because cublasCreate() allocates some internal resources and the release of those resources by calling cublasDestroy() will implicitly call cudaDeviceSynchronize(), it is recommended to minimize the number of times these functions are called.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the shut down succeeded
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized

2.4.3.cublasGetVersion()

cublasStatus_t cublasGetVersion(cublasHandle_t handle, int *version)

This function returns the version number of the cuBLAS library.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	`version` is NULL

Note

This function can be safely called with handle set to NULL. This allows users to get the version of the library without a handle. Another way to do this is with cublasGetProperty().

2.4.4.cublasGetProperty()

cublasStatus_t cublasGetProperty(libraryPropertyType type, int *value)

This function returns the value of the requested property in memory pointed to by value. Refer to libraryPropertyType for supported types.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	Invalid type or value: `type` has an invalid value, or `value` is NULL

2.4.5.cublasGetStatusName()

const char *cublasGetStatusName(cublasStatus_t status)

This function returns the string representation of a given status.

Return Value	Meaning
NULL-terminated string	The string representation of the `status`

2.4.6.cublasGetStatusString()

const char *cublasGetStatusString(cublasStatus_t status)

This function returns the description string for a given status.

Return Value	Meaning
NULL-terminated string	The description of the `status`

2.4.7.cublasSetStream()

cublasStatus_t cublasSetStream(cublasHandle_t handle, cudaStream_t streamId)

This function sets the cuBLAS library stream, which will be used to execute all subsequent calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default NULL stream. In particular, this routine can be used to change the stream between kernel launches and then to reset the cuBLAS library stream back to NULL. Additionally, this function unconditionally resets the cuBLAS library workspace back to the default workspace pool (see cublasSetWorkspace()).

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the stream was set successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized

2.4.8.cublasSetWorkspace()

cublasStatus_t cublasSetWorkspace(cublasHandle_t handle, void *workspace, size_t workspaceSizeInBytes)

This function sets the cuBLAS library workspace to a user-owned device buffer, which will be used to execute all subsequent calls to the cuBLAS library functions (on the currently set stream). If the cuBLAS library workspace is not set, all kernels will use the default workspace pool allocated during the cuBLAS context creation. In particular, this routine can be used to change the workspace between kernel launches. The workspace pointer has to be aligned to at least 256 bytes, otherwise a CUBLAS_STATUS_INVALID_VALUE error is returned. The cublasSetStream() function unconditionally resets the cuBLAS library workspace back to the default workspace pool. Calling this function, including with workspaceSizeInBytes equal to 0, will prevent the cuBLAS library from utilizing the default workspace. Too small a value of workspaceSizeInBytes may cause some routines to fail with a CUBLAS_STATUS_ALLOC_FAILED error or cause large regressions in performance. A workspace size equal to or larger than 16 KiB is enough to prevent the CUBLAS_STATUS_ALLOC_FAILED error, while a larger workspace can provide performance benefits for some routines.

Note

If the stream set by cublasSetStream() is cudaStreamPerThread and there are multiple threads using the same cuBLAS library handle, then users must manually manage synchronization to avoid possible race conditions in the user-provided workspace. Alternatively, users may rely on the default workspace pool, which safely guards against race conditions.

Warning

cuBLAS functions may invoke more than one CUDA kernel, and rely on the workspace being intact between the invocations. Hence, if a cuBLAS handle is configured with a user-provided workspace and is being used from multiple threads, it is the user’s responsibility to serialize cuBLAS calls between threads, as otherwise the kernels from different cuBLAS invocations might interleave and invalidate the assumptions each of them makes regarding workspace intactness. The default workspace pool managed by cuBLAS is thread safe.

The table below shows the recommended size of user-provided workspace. This is based on the cuBLAS default workspace pool size, which is GPU architecture dependent.

GPU Architecture	Recommended workspace size
NVIDIA Hopper Architecture (sm90)	32 MiB
NVIDIA Blackwell Architecture (sm10x)	32 MiB
NVIDIA Blackwell Architecture (sm12x)	32 MiB
Other	4 MiB
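
The alignment requirement and the size table can be combined into a few host-side helpers that an application runs before it allocates and registers a workspace. This is a minimal sketch, not part of cuBLAS: all function names are hypothetical, and keying the table by compute capability major version (9 for sm90, 10 for sm10x, 12 for sm12x) is an assumption of this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define WORKSPACE_ALIGNMENT 256u
#define MIB ((size_t)1024 * 1024)

/* True when the pointer satisfies the 256-byte alignment requirement. */
bool workspace_is_aligned(const void *workspace)
{
    return (uintptr_t)workspace % WORKSPACE_ALIGNMENT == 0;
}

/* Rounds an address up to the next 256-byte boundary, e.g. when carving
 * a workspace out of a larger allocation. */
uintptr_t align_workspace_address(uintptr_t address)
{
    return (address + WORKSPACE_ALIGNMENT - 1) / WORKSPACE_ALIGNMENT
           * WORKSPACE_ALIGNMENT;
}

/* Recommended size from the table, keyed by compute capability major
 * version (assumed mapping: 9 = sm90, 10 = sm10x, 12 = sm12x). */
size_t recommended_workspace_bytes(int cc_major)
{
    assert(cc_major > 0);
    switch (cc_major) {
    case 9: case 10: case 12: return 32 * MIB;
    default:                  return 4 * MIB;
    }
}
```

A caller would typically query the device's compute capability, allocate at least the recommended number of bytes, verify the alignment, and only then pass the buffer to cublasSetWorkspace().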

Note

If the cuBLAS library is configured to utilize fixed-point emulation, which can be done by setting the corresponding math mode in cublasSetMathMode() or calling APIs with CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT, it can be beneficial to provide more workspace than recommended for the GPU architecture. See Fixed-Point Workspace Requirements for more details.

The possible error values returned by this function and their meanings are listed below.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The workspace was set successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	The `workspace` pointer wasn’t aligned to at least 256 bytes

2.4.9.cublasGetStream()

cublasStatus_t cublasGetStream(cublasHandle_t handle, cudaStream_t *streamId)

This function gets the cuBLAS library stream, which is being used to execute all calls to the cuBLAS library functions. If the cuBLAS library stream is not set, all kernels use the default NULL stream.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the stream was returned successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	`streamId` is NULL

2.4.10.cublasGetPointerMode()

cublasStatus_t cublasGetPointerMode(cublasHandle_t handle, cublasPointerMode_t *mode)

This function obtains the pointer mode used by the cuBLAS library. Please see the section on the cublasPointerMode_t type for more details.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The pointer mode was obtained successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	`mode` is NULL

2.4.11.cublasSetPointerMode()

cublasStatus_t cublasSetPointerMode(cublasHandle_t handle, cublasPointerMode_t mode)

This function sets the pointer mode used by the cuBLAS library. The default is for the values to be passed by reference on the host. Please see the section on the cublasPointerMode_t type for more details.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The pointer mode was set successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	`mode` is not `CUBLAS_POINTER_MODE_HOST` or `CUBLAS_POINTER_MODE_DEVICE`

2.4.12.cublasSetVector()

cublasStatus_t cublasSetVector(int n, int elemSize, const void *x, int incx, void *y, int incy)

This function supports the 64-bit Integer Interface.

This function copies n elements from a vector x in host memory space to a vector y in GPU memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The storage spacing between consecutive elements is given by incx for the source vector x and by incy for the destination vector y.

Since column-major format for two-dimensional matrices is assumed, if a vector is part of a matrix, a vector increment equal to 1 accesses a (partial) column of that matrix. Similarly, using an increment equal to the leading dimension of the matrix results in accesses to a (partial) row of that matrix.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters`incx`,`incy`, or`elemSize` are not positive
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory
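
The increment rule described for cublasSetVector() can be illustrated on the host without a GPU. The following is a minimal sketch in plain C, not the cuBLAS implementation: `ref_copy_strided` is a hypothetical helper that mirrors the argument list of cublasSetVector() and applies the same element addressing, under the assumption that both vectors live in host memory and offsets are 0-based.

```c
#include <stddef.h>
#include <string.h>

/* Host-only reference for the addressing used by a strided vector copy:
 * element i of the source starts at byte offset i*incx*elemSize and is
 * written at byte offset i*incy*elemSize of the destination.
 * Hypothetical helper: returns 0 on success and -1 when incx, incy or
 * elemSize is not positive. */
static int ref_copy_strided(int n, int elemSize,
                            const void *x, int incx,
                            void *y, int incy)
{
    if (incx <= 0 || incy <= 0 || elemSize <= 0) {
        return -1;
    }
    const unsigned char *src = (const unsigned char *)x;
    unsigned char *dst = (unsigned char *)y;
    for (int i = 0; i < n; ++i) {
        memcpy(dst + (size_t)i * (size_t)incy * (size_t)elemSize,
               src + (size_t)i * (size_t)incx * (size_t)elemSize,
               (size_t)elemSize);
    }
    return 0;
}
```

For a column-major matrix with leading dimension 3, passing a pointer to element (1, 0) together with `incx = 3` and `incy = 1` gathers row 1 of the matrix into a contiguous buffer.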

2.4.13.cublasGetVector()

cublasStatus_t cublasGetVector(int n, int elemSize, const void *x, int incx, void *y, int incy)

This function supports the 64-bit Integer Interface.

This function copies n elements from a vector x in GPU memory space to a vector y in host memory space. Elements in both vectors are assumed to have a size of elemSize bytes. The storage spacing between consecutive elements is given by incx for the source vector x and by incy for the destination vector y.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `incx`, `incy`, or `elemSize` are not positive
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory

2.4.14.cublasSetMatrix()

cublasStatus_t cublasSetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)

This function supports the 64-bit Integer Interface.

This function copies a tile of rows x cols elements from a matrix A in host memory space to a matrix B in GPU memory space. It is assumed that each element requires storage of elemSize bytes and that both matrices are stored in column-major format, with the leading dimension of the source matrix A and destination matrix B given in lda and ldb, respectively. The leading dimension indicates the number of rows of the allocated matrix, even if only a submatrix of it is being used.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `rows` or `cols` are negative, or `elemSize`, `lda`, or `ldb` are not positive.
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory
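
The role of the leading dimensions in a tile copy can be sketched on the host. The code below is an illustration in plain C rather than the library's implementation: `ref_copy_tile` is a hypothetical helper that mirrors the argument list of cublasSetMatrix(), assumes both matrices are in host memory, and uses the column-major rule that element (i, j) sits at linear index `i + j*ld` (0-based).

```c
#include <stddef.h>
#include <string.h>

/* Host-only reference for a column-major tile copy. Column j of the tile
 * starts at byte offset j*lda*elemSize in A and j*ldb*elemSize in B; each
 * column holds rows contiguous elements.
 * Hypothetical helper: returns 0 on success, -1 on invalid arguments. */
static int ref_copy_tile(int rows, int cols, int elemSize,
                         const void *A, int lda,
                         void *B, int ldb)
{
    if (rows < 0 || cols < 0 || elemSize <= 0 || lda <= 0 || ldb <= 0) {
        return -1;
    }
    const unsigned char *src = (const unsigned char *)A;
    unsigned char *dst = (unsigned char *)B;
    size_t colBytes = (size_t)rows * (size_t)elemSize;
    for (int j = 0; j < cols; ++j) {
        memcpy(dst + (size_t)j * (size_t)ldb * (size_t)elemSize,
               src + (size_t)j * (size_t)lda * (size_t)elemSize,
               colBytes);
    }
    return 0;
}
```

To copy a 2 x 2 submatrix that starts at row 1, column 1 of a 3 x 3 matrix, pass `A + 1 + 1*3` as the source with `lda = 3`; the destination can be a compact buffer with `ldb = 2`.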

2.4.15.cublasGetMatrix()

cublasStatus_t cublasGetMatrix(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb)

This function supports the 64-bit Integer Interface.

This function copies a tile of rows x cols elements from a matrix A in GPU memory space to a matrix B in host memory space. It is assumed that each element requires storage of elemSize bytes and that both matrices are stored in column-major format, with the leading dimension of the source matrix A and destination matrix B given in lda and ldb, respectively. The leading dimension indicates the number of rows of the allocated matrix, even if only a submatrix of it is being used.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `rows` or `cols` are negative, or `elemSize`, `lda`, or `ldb` are not positive.
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory

2.4.16.cublasSetVectorAsync()

cublasStatus_t cublasSetVectorAsync(int n, int elemSize, const void *hostPtr, int incx, void *devicePtr, int incy, cudaStream_t stream)

This function supports the 64-bit Integer Interface.

This function has the same functionality as cublasSetVector(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `incx`, `incy`, or `elemSize` are not positive
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory

2.4.17.cublasGetVectorAsync()

cublasStatus_t cublasGetVectorAsync(int n, int elemSize, const void *devicePtr, int incx, void *hostPtr, int incy, cudaStream_t stream)

This function supports the 64-bit Integer Interface.

This function has the same functionality as cublasGetVector(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `incx`, `incy`, or `elemSize` are not positive
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory

2.4.18.cublasSetMatrixAsync()

cublasStatus_t cublasSetMatrixAsync(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb, cudaStream_t stream)

This function supports the 64-bit Integer Interface.

This function has the same functionality as cublasSetMatrix(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `rows` or `cols` are negative, or `elemSize`, `lda`, or `ldb` are not positive.
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory

2.4.19.cublasGetMatrixAsync()

cublasStatus_t cublasGetMatrixAsync(int rows, int cols, int elemSize, const void *A, int lda, void *B, int ldb, cudaStream_t stream)

This function supports the 64-bit Integer Interface.

This function has the same functionality as cublasGetMatrix(), with the exception that the data transfer is done asynchronously (with respect to the host) using the given CUDA™ stream parameter.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	The parameters `rows` or `cols` are negative, or `elemSize`, `lda`, or `ldb` are not positive.
`CUBLAS_STATUS_MAPPING_ERROR`	There was an error accessing GPU memory

2.4.20.cublasSetAtomicsMode()

cublasStatus_t cublasSetAtomicsMode(cublasHandle_t handle, cublasAtomicsMode_t mode)

Some routines, such as cublas<t>symv() and cublas<t>hemv(), have an alternate implementation that uses atomics to accumulate results. This implementation is generally significantly faster but can generate results that are not strictly identical from one run to the next. Mathematically, those differences are not significant, but they can be a hindrance when debugging.

This function allows or disallows the usage of atomics in the cuBLAS library for all routines which have an alternate implementation. Unless the documentation of a cuBLAS routine explicitly states otherwise, the routine does not have an alternate implementation that uses atomics. When atomics mode is disabled, each cuBLAS routine should produce the same results from one run to the next when called with identical parameters on the same hardware.
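
The run-to-run differences mentioned above come from floating-point addition not being associative: when atomics decide the order in which partial results are accumulated, the rounding can change. The sketch below is a host-only illustration in plain C and does not use cuBLAS; `sum_in_order` is a hypothetical helper that adds the same three values in a caller-chosen order.

```c
/* Adds v[order[0]], v[order[1]], v[order[2]] in that sequence using
 * single-precision accumulation. Hypothetical helper for illustration;
 * the volatile accumulator forces rounding to float after every add. */
static float sum_in_order(const float v[3], const int order[3])
{
    volatile float acc = 0.0f;
    for (int k = 0; k < 3; ++k) {
        acc = acc + v[order[k]];
    }
    return acc;
}
```

With `v = {1e8f, -1e8f, 1.0f}`, the order 0, 1, 2 yields `1.0f`, while the order 1, 2, 0 yields `0.0f`, because `-1e8f + 1.0f` rounds back to `-1e8f` in single precision.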

The default atomics mode of a default-initialized cublasHandle_t object is CUBLAS_ATOMICS_NOT_ALLOWED. Please see the section on the cublasAtomicsMode_t type for more details.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the atomics mode was set successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized

2.4.21.cublasGetAtomicsMode()

cublasStatus_t cublasGetAtomicsMode(cublasHandle_t handle, cublasAtomicsMode_t *mode)

This function queries the atomics mode of a specific cuBLAS context.

The default atomics mode of a default-initialized cublasHandle_t object is CUBLAS_ATOMICS_NOT_ALLOWED. Please see the section on the cublasAtomicsMode_t type for more details.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The atomics mode was queried successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	The argument`mode` is a NULL pointer

2.4.22.cublasSetMathMode()

cublasStatus_t cublasSetMathMode(cublasHandle_t handle, cublasMath_t mode)

The cublasSetMathMode() function enables you to choose the compute precision modes as defined by cublasMath_t. Users are allowed to set the compute precision mode as a bitwise-OR combination of these modes (except the deprecated CUBLAS_TENSOR_OP_MATH). For example, cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH | CUBLAS_MATH_DISALLOW_REDUCED_PRECISION_REDUCTION). Please note that the default math mode is CUBLAS_DEFAULT_MATH.

For matrix and compute precisions allowed for the cublasGemmEx() and cublasLtMatmul() APIs and their strided variants, please refer to: cublasGemmEx(), cublasGemmBatchedEx(), cublasGemmStridedBatchedEx(), and cublasLtMatmul().

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The math mode was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	An invalid value for mode was specified.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.

2.4.23.cublasGetMathMode()

cublasStatus_t cublasGetMathMode(cublasHandle_t handle, cublasMath_t *mode)

This function returns the math mode used by the library routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The math type was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	`mode` is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.

2.4.24.cublasSetSmCountTarget()

cublasStatus_t cublasSetSmCountTarget(cublasHandle_t handle, int smCountTarget)

The cublasSetSmCountTarget() function allows overriding the number of multiprocessors available to the library during kernel execution.

This option can be used to improve the library performance when cuBLAS routines are known to run concurrently with other work on different CUDA streams. For example, on an NVIDIA A100 GPU, which has 108 multiprocessors, when there is a concurrent kernel running with a grid size of 8, one can use cublasSetSmCountTarget() with smCountTarget set to 100 to override the library heuristics to optimize for running on the remaining 100 multiprocessors.

When set to 0, the library returns to its default behavior. The input value should not exceed the device’s multiprocessor count, which can be obtained using cudaDeviceGetAttribute. Negative values are not accepted.

The user must ensure thread safety when modifying the library handle with this routine, similar to when using cublasSetStream(), etc.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	SM count target was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	The value of `smCountTarget` is outside of the allowed range.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
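
The arithmetic behind the A100 example can be written as a small host-side helper. This is only a sketch: `pick_sm_count_target` is a hypothetical function, not part of cuBLAS, and it assumes that the concurrent kernel occupies one multiprocessor per block, as the example above does. It returns a value that respects the documented constraints (non-negative, not larger than the device's multiprocessor count), falling back to 0, which requests the library default.

```c
/* Hypothetical helper: derive a value for smCountTarget from the device's
 * multiprocessor count and the number of multiprocessors assumed to be
 * busy with other work. Returns 0 (library default behavior) when the
 * inputs are invalid or no multiprocessor would remain. */
static int pick_sm_count_target(int deviceSmCount, int busySmCount)
{
    if (deviceSmCount <= 0 || busySmCount < 0) {
        return 0;
    }
    int remaining = deviceSmCount - busySmCount;
    return (remaining > 0) ? remaining : 0;
}
```

For a device with 108 multiprocessors and 8 assumed busy, the helper returns 100, matching the value used in the example above.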

2.4.25.cublasGetSmCountTarget()

cublasStatus_t cublasGetSmCountTarget(cublasHandle_t handle, int *smCountTarget)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	SM count target was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	smCountTarget is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.26.cublasSetEmulationStrategy()

cublasStatus_t cublasSetEmulationStrategy(cublasHandle_t handle, cublasEmulationStrategy_t emulationStrategy)

The cublasSetEmulationStrategy() function enables you to select how the library should make use of floating point emulation. For more details, please see cublasEmulationStrategy_t.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The emulation strategy was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	An invalid value for emulation strategy was specified.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.

2.4.27.cublasGetEmulationStrategy()

cublasStatus_t cublasGetEmulationStrategy(cublasHandle_t handle, cublasEmulationStrategy_t *emulationStrategy)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	emulation strategy was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	emulationStrategy is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.28.cublasGetEmulationSpecialValuesSupport()

cublasStatus_t cublasGetEmulationSpecialValuesSupport(cublasHandle_t handle, cudaEmulationSpecialValuesSupport *mask)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	emulation special values support was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mask is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.29.cublasSetEmulationSpecialValuesSupport()

cublasStatus_t cublasSetEmulationSpecialValuesSupport(cublasHandle_t handle, cudaEmulationSpecialValuesSupport mask)

This function sets the emulation special values support mask on the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	emulation special values support was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mask is outside of the allowed range.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.30.cublasGetFixedPointEmulationMantissaControl()

cublasStatus_t cublasGetFixedPointEmulationMantissaControl(cublasHandle_t handle, cudaEmulationMantissaControl *mantissaControl)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	fixed-point emulation mantissa control was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mantissaControl is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.31.cublasSetFixedPointEmulationMantissaControl()

cublasStatus_t cublasSetFixedPointEmulationMantissaControl(cublasHandle_t handle, cudaEmulationMantissaControl mantissaControl)

This function sets the fixed-point emulation mantissa control on the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	fixed-point emulation mantissa control was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mantissaControl is outside of the allowed range.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.32.cublasGetFixedPointEmulationMaxMantissaBitCount()

cublasStatus_t cublasGetFixedPointEmulationMaxMantissaBitCount(cublasHandle_t handle, int *maxMantissaBitCount)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	maxMantissaBitCount was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	maxMantissaBitCount is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.33.cublasSetFixedPointEmulationMaxMantissaBitCount()

cublasStatus_t cublasSetFixedPointEmulationMaxMantissaBitCount(cublasHandle_t handle, int maxMantissaBitCount)

This function sets the maxMantissaBitCount value for fixed-point emulation on the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	maxMantissaBitCount was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	maxMantissaBitCount is outside of the allowed range.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.34.cublasGetFixedPointEmulationMantissaBitOffset()

cublasStatus_t cublasGetFixedPointEmulationMantissaBitOffset(cublasHandle_t handle, int *mantissaBitOffset)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	mantissaBitOffset was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mantissaBitOffset is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.35.cublasSetFixedPointEmulationMantissaBitOffset()

cublasStatus_t cublasSetFixedPointEmulationMantissaBitOffset(cublasHandle_t handle, int mantissaBitOffset)

This function sets the mantissaBitOffset value for fixed-point emulation on the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	mantissaBitOffset was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mantissaBitOffset is outside of the allowed range.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.36.cublasGetFixedPointEmulationMantissaBitCountPointer()

cublasStatus_t cublasGetFixedPointEmulationMantissaBitCountPointer(cublasHandle_t handle, int **mantissaBitCount)

This function obtains the value previously programmed to the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	mantissaBitCount was returned successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mantissaBitCount is NULL.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.37.cublasSetFixedPointEmulationMantissaBitCountPointer()

cublasStatus_t cublasSetFixedPointEmulationMantissaBitCountPointer(cublasHandle_t handle, int *mantissaBitCount)

This function sets the mantissaBitCount pointer for fixed-point emulation on the library handle.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	mantissaBitCount was set successfully.
`CUBLAS_STATUS_INVALID_VALUE`	mantissaBitCount is outside of the allowed range.
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized.

2.4.38.cublasLoggerConfigure()

cublasStatus_t cublasLoggerConfigure(int logIsOn, int logToStdOut, int logToStdErr, const char *logFileName)

This function configures logging at runtime. In addition to this type of configuration, logging can be configured with special environment variables that are checked by libcublas:

CUBLAS_LOGINFO_DBG - setting this environment variable to 1 turns logging on (by default logging is off).
CUBLAS_LOGDEST_DBG - this environment variable specifies where to write the log: stdout and stderr mean to write log messages to the standard output or standard error stream, respectively. Other values are interpreted as file names.

Parameters

Param.	Memory	In/out	Meaning
logIsOn	host	input	Turns logging on or off completely. Logging is off by default, but is turned on by calling cublasSetLoggerCallback() with a user-defined callback function.
logToStdOut	host	input	Turns logging to the standard output I/O stream on or off. Off by default.
logToStdErr	host	input	Turns logging to the standard error I/O stream on or off. Off by default.
logFileName	host	input	Turns logging to a file on or off; the file is specified by its name. cublasLoggerConfigure() copies the content of `logFileName`. Provide a null pointer if you are not interested in this type of logging.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully

2.4.39.cublasGetLoggerCallback()

cublasStatus_t cublasGetLoggerCallback(cublasLogCallback *userCallback)

This function retrieves the function pointer to the custom user-defined callback function previously installed via cublasSetLoggerCallback(), or zero otherwise.

Param.	Memory	In/out	Meaning
userCallback	host	output	Pointer to user defined callback function.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_INVALID_VALUE`	`userCallback` is NULL

2.4.40.cublasSetLoggerCallback()

cublasStatus_t cublasSetLoggerCallback(cublasLogCallback userCallback)

This function installs a custom user-defined callback function via the cuBLAS C public API.

Param.	Memory	In/out	Meaning
userCallback	host	input	Pointer to user defined callback function.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully

2.5.cuBLAS Level-1 Function Reference

In this chapter we describe the Level-1 Basic Linear Algebra Subprograms (BLAS1) functions that perform scalar and vector based operations. We will use abbreviations <type> for type and <t> for the corresponding short type to make a more concise and clear presentation of the implemented functions. Unless otherwise specified <type> and <t> have the following meanings:

<type>	<t>	Meaning
`float`	`s` or `S`	real single-precision
`double`	`d` or `D`	real double-precision
`cuComplex`	`c` or `C`	complex single-precision
`cuDoubleComplex`	`z` or `Z`	complex double-precision

When the parameters and returned values of the function differ, which sometimes happens for complex input, the <t> can also be Sc, Cs, Dz and Zd.

The abbreviations $\mathbf{Re}(\cdot)$ and $\mathbf{Im}(\cdot)$ will stand for the real and imaginary part of a number, respectively. Since the imaginary part of a real number does not exist, we will consider it to be zero and can usually simply discard it from the equation where it is being used. Also, $\bar{\alpha}$ will denote the complex conjugate of $\alpha$.

In general throughout the documentation, the lower case Greek symbols $\alpha$ and $\beta$ will denote scalars, lower case English letters in bold type $\mathbf{x}$ and $\mathbf{y}$ will denote vectors, and capital English letters $A$, $B$ and $C$ will denote matrices.

2.5.1.cublasI<t>amax()

cublasStatus_t cublasIsamax(cublasHandle_t handle, int n, const float *x, int incx, int *result)
cublasStatus_t cublasIdamax(cublasHandle_t handle, int n, const double *x, int incx, int *result)
cublasStatus_t cublasIcamax(cublasHandle_t handle, int n, const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamax(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, int *result)

This function supports the 64-bit Integer Interface.

This function finds the (smallest) index of the element of the maximum magnitude. Hence, the result is the first $i$ such that $\left| \mathbf{Im}\left( {x\lbrack j\rbrack} \right) \right| + \left| \mathbf{Re}\left( {x\lbrack j\rbrack} \right) \right|$ is maximum for $i = 1,\ldots,n$ and $j = 1 + \left( {i - 1} \right)*\text{incx}$. Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector `x`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`result`	host or device	output	The resulting index, which is set to `0` if `n<=0` or `incx<=0`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	`result` is NULL

For references please refer to NETLIB documentation:

isamax(), idamax(), icamax(), izamax()
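
The 1-based result convention is easy to get wrong when calling from C, so a host-only reference may help. The following plain C sketch is not the cuBLAS kernel: `ref_isamax` is a hypothetical function that follows the definition above for real single-precision input held in host memory and returns the index instead of writing it through a pointer.

```c
#include <stddef.h>

/* Host-only reference for the semantics of cublasIsamax(): returns the
 * 1-based index i of the first element with the largest absolute value,
 * reading element i at C offset (i - 1) * incx. Returns 0 when n <= 0 or
 * incx <= 0, as the parameter table specifies. Hypothetical helper. */
static int ref_isamax(int n, const float *x, int incx)
{
    if (n <= 0 || incx <= 0) {
        return 0;
    }
    int best = 1;
    float bestMag = (x[0] < 0.0f) ? -x[0] : x[0];
    for (int i = 2; i <= n; ++i) {
        float v = x[(size_t)(i - 1) * (size_t)incx];
        float mag = (v < 0.0f) ? -v : v;
        if (mag > bestMag) { /* strict comparison keeps the first maximum */
            bestMag = mag;
            best = i;
        }
    }
    return best;
}
```

A caller that wants a C array subscript has to subtract 1 from a non-zero result and multiply by `incx`.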

2.5.2.cublasI<t>amin()

cublasStatus_t cublasIsamin(cublasHandle_t handle, int n, const float *x, int incx, int *result)
cublasStatus_t cublasIdamin(cublasHandle_t handle, int n, const double *x, int incx, int *result)
cublasStatus_t cublasIcamin(cublasHandle_t handle, int n, const cuComplex *x, int incx, int *result)
cublasStatus_t cublasIzamin(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, int *result)

This function supports the 64-bit Integer Interface.

This function finds the (smallest) index of the element of the minimum magnitude. Hence, the result is the first $i$ such that $\left| \mathbf{Im}\left( {x\lbrack j\rbrack} \right) \right| + \left| \mathbf{Re}\left( {x\lbrack j\rbrack} \right) \right|$ is minimum for $i = 1,\ldots,n$ and $j = 1 + \left( {i - 1} \right)*\text{incx}$. Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector `x`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`result`	host or device	output	The resulting index, which is set to `0` if `n<=0` or `incx<=0`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	`result` is NULL

For references please refer to NETLIB documentation:

isamin()
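
For complex input, the magnitude used by this routine is the sum of the absolute values of the real and imaginary parts, not the modulus. The host-only sketch below illustrates that distinction in plain C. The `ref_complex` type and the `ref_abs1` and `ref_icamin` functions are hypothetical stand-ins introduced for this illustration, not cuBLAS types or calls.

```c
#include <stddef.h>

/* Hypothetical stand-in for a single-precision complex number. */
typedef struct { float re; float im; } ref_complex;

/* Magnitude used in the definition above: |Re(z)| + |Im(z)|. */
static float ref_abs1(ref_complex z)
{
    float re = (z.re < 0.0f) ? -z.re : z.re;
    float im = (z.im < 0.0f) ? -z.im : z.im;
    return re + im;
}

/* Host-only reference: 1-based index of the first element with the
 * smallest |Re| + |Im|, or 0 when n <= 0 or incx <= 0. */
static int ref_icamin(int n, const ref_complex *x, int incx)
{
    if (n <= 0 || incx <= 0) {
        return 0;
    }
    int best = 1;
    float bestMag = ref_abs1(x[0]);
    for (int i = 2; i <= n; ++i) {
        float mag = ref_abs1(x[(size_t)(i - 1) * (size_t)incx]);
        if (mag < bestMag) { /* strict comparison keeps the first minimum */
            bestMag = mag;
            best = i;
        }
    }
    return best;
}
```

For the two elements 1.5 + 1.5i and 2.5i, the sketch returns 2: the second element has the smaller sum (2.5 versus 3.0) even though its modulus is larger.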

2.5.3.cublas<t>asum()

cublasStatus_t cublasSasum(cublasHandle_t handle, int n, const float *x, int incx, float *result)
cublasStatus_t cublasDasum(cublasHandle_t handle, int n, const double *x, int incx, double *result)
cublasStatus_t cublasScasum(cublasHandle_t handle, int n, const cuComplex *x, int incx, float *result)
cublasStatus_t cublasDzasum(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, double *result)

This function supports the 64-bit Integer Interface.

This function computes the sum of the absolute values of the elements of vector x. Hence, the result is $\sum_{i = 1}^{n}\left( \left| \mathbf{Im}\left( {x\lbrack j\rbrack} \right) \right| + \left| \mathbf{Re}\left( {x\lbrack j\rbrack} \right) \right| \right)$ where $j = 1 + \left( {i - 1} \right)*\text{incx}$. Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector `x`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`result`	host or device	output	The resulting sum, which is set to `0` if `n<=0` or `incx<=0`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	`result` is NULL

For references please refer to NETLIB documentation:

sasum(), dasum(), scasum(), dzasum()

2.5.4.cublas<t>axpy()

cublasStatus_t cublasSaxpy(cublasHandle_t handle, int n, const float *alpha, const float *x, int incx, float *y, int incy)
cublasStatus_t cublasDaxpy(cublasHandle_t handle, int n, const double *alpha, const double *x, int incx, double *y, int incy)
cublasStatus_t cublasCaxpy(cublasHandle_t handle, int n, const cuComplex *alpha, const cuComplex *x, int incx, cuComplex *y, int incy)
cublasStatus_t cublasZaxpy(cublasHandle_t handle, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function multiplies the vector x by the scalar $\alpha$ and adds it to the vector y, overwriting the latter vector with the result. Hence, the performed operation is $\mathbf{y}\lbrack j\rbrack = \alpha \times \mathbf{x}\lbrack k\rbrack + \mathbf{y}\lbrack j\rbrack$ for $i = 1,\ldots,n$, $k = 1 + \left( {i - 1} \right)*\text{incx}$ and $j = 1 + \left( {i - 1} \right)*\text{incy}$. Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`alpha`	host or device	input	<type> scalar used for multiplication.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

saxpy(), daxpy(), caxpy(), zaxpy()
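
The update rule can be checked against a few lines of host code. The plain C sketch below is a reference for the formula only, not the GPU implementation: `ref_saxpy` is a hypothetical function that takes `alpha` by value, keeps both vectors in host memory, and assumes positive increments so that the 1-based indices `k` and `j` map to the C offsets `i*incx` and `i*incy` for `i = 0, ..., n-1`.

```c
#include <stddef.h>

/* Host-only reference for y[j] = alpha * x[k] + y[j] with strided access.
 * Assumes incx > 0 and incy > 0; does nothing when n <= 0.
 * Hypothetical helper for illustration. */
static void ref_saxpy(int n, float alpha,
                      const float *x, int incx,
                      float *y, int incy)
{
    for (int i = 0; i < n; ++i) {
        size_t k = (size_t)i * (size_t)incx; /* 0-based offset into x */
        size_t j = (size_t)i * (size_t)incy; /* 0-based offset into y */
        y[j] = alpha * x[k] + y[j];
    }
}
```

With `alpha = 2`, `x = {1, 2, 3}` and `y = {10, 20, 30}`, the vector `y` becomes `{12, 24, 36}`.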

2.5.5.cublas<t>copy()

cublasStatus_t cublasScopy(cublasHandle_t handle, int n, const float *x, int incx, float *y, int incy)
cublasStatus_t cublasDcopy(cublasHandle_t handle, int n, const double *x, int incx, double *y, int incy)
cublasStatus_t cublasCcopy(cublasHandle_t handle, int n, const cuComplex *x, int incx, cuComplex *y, int incy)
cublasStatus_t cublasZcopy(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function copies the vector x into the vector y. Hence, the performed operation is $\mathbf{y}\lbrack j\rbrack = \mathbf{x}\lbrack k\rbrack$ for $i = 1,\ldots,n$, $k = 1 + \left( {i - 1} \right)*\text{incx}$ and $j = 1 + \left( {i - 1} \right)*\text{incy}$. Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

scopy(), dcopy(), ccopy(), zcopy()

2.5.6.cublas<t>dot()

cublasStatus_t cublasSdot(cublasHandle_t handle, int n, const float *x, int incx, const float *y, int incy, float *result)
cublasStatus_t cublasDdot(cublasHandle_t handle, int n, const double *x, int incx, const double *y, int incy, double *result)
cublasStatus_t cublasCdotu(cublasHandle_t handle, int n, const cuComplex *x, int incx, const cuComplex *y, int incy, cuComplex *result)
cublasStatus_t cublasCdotc(cublasHandle_t handle, int n, const cuComplex *x, int incx, const cuComplex *y, int incy, cuComplex *result)
cublasStatus_t cublasZdotu(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, const cuDoubleComplex *y, int incy, cuDoubleComplex *result)
cublasStatus_t cublasZdotc(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, const cuDoubleComplex *y, int incy, cuDoubleComplex *result)

This function supports the 64-bit Integer Interface.

This function computes the dot product of vectors x and y. Hence, the result is $\sum_{i = 1}^{n}\left( {\mathbf{x}\lbrack k\rbrack \times \mathbf{y}\lbrack j\rbrack} \right)$ where $k = 1 + \left( {i - 1} \right)*\text{incx}$ and $j = 1 + \left( {i - 1} \right)*\text{incy}$. Notice that in the first equation the conjugate of the element of vector x should be used if the function name ends in character ‘c’ and that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`result`	host or device	output	The resulting dot product, which is set to `0` if `n<=0`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sdot(), ddot(), cdotu(), cdotc(), zdotu(), zdotc()
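
The difference between the unconjugated ('u') and conjugated ('c') complex variants can be seen in a short host-only reference. The sketch below is plain C written for illustration. `ref_complex` and `ref_cdot` are hypothetical names, the vectors are assumed to be in host memory with positive increments, and the `conjugateX` flag selects the behavior that the function-name suffix selects in the real API.

```c
#include <stddef.h>

/* Hypothetical stand-in for a single-precision complex number. */
typedef struct { float re; float im; } ref_complex;

/* Host-only reference for the complex dot product. When conjugateX is
 * non-zero, each element of x is conjugated before the multiplication
 * (the 'c' variants); otherwise it is used as is (the 'u' variants).
 * Returns 0 + 0i when n <= 0. Assumes incx > 0 and incy > 0. */
static ref_complex ref_cdot(int n,
                            const ref_complex *x, int incx,
                            const ref_complex *y, int incy,
                            int conjugateX)
{
    ref_complex acc = {0.0f, 0.0f};
    for (int i = 0; i < n; ++i) {
        ref_complex a = x[(size_t)i * (size_t)incx];
        ref_complex b = y[(size_t)i * (size_t)incy];
        if (conjugateX) {
            a.im = -a.im;
        }
        acc.re += a.re * b.re - a.im * b.im;
        acc.im += a.re * b.im + a.im * b.re;
    }
    return acc;
}
```

For the single-element vectors x = (1 + 2i) and y = (3 + 4i), the unconjugated product is -5 + 10i and the conjugated product is 11 - 2i.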

2.5.7.cublas<t>nrm2()

cublasStatus_t cublasSnrm2(cublasHandle_t handle, int n, const float *x, int incx, float *result)
cublasStatus_t cublasDnrm2(cublasHandle_t handle, int n, const double *x, int incx, double *result)
cublasStatus_t cublasScnrm2(cublasHandle_t handle, int n, const cuComplex *x, int incx, float *result)
cublasStatus_t cublasDznrm2(cublasHandle_t handle, int n, const cuDoubleComplex *x, int incx, double *result)

This function supports the 64-bit Integer Interface.

This function computes the Euclidean norm of the vector x. The code uses a multiphase model of accumulation to avoid intermediate underflow and overflow, with the result being equivalent to $\sqrt{\sum_{i = 1}^{n}\left( {\mathbf{x}\lbrack j\rbrack \times \mathbf{x}\lbrack j\rbrack} \right)}$ where $j = 1 + \left( {i - 1} \right)*\text{incx}$ in exact arithmetic. Notice that the last equation reflects 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector `x`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`result`	host or device	output	The resulting norm, which is set to `0` if `n<=0` or `incx<=0`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	`result` is NULL

For references please refer to NETLIB documentation:

snrm2(), dnrm2(), scnrm2(), dznrm2()

2.5.8.cublas<t>rot()

cublasStatus_t cublasSrot(cublasHandle_t handle, int n, float *x, int incx, float *y, int incy, const float *c, const float *s)
cublasStatus_t cublasDrot(cublasHandle_t handle, int n, double *x, int incx, double *y, int incy, const double *c, const double *s)
cublasStatus_t cublasCrot(cublasHandle_t handle, int n, cuComplex *x, int incx, cuComplex *y, int incy, const float *c, const cuComplex *s)
cublasStatus_t cublasCsrot(cublasHandle_t handle, int n, cuComplex *x, int incx, cuComplex *y, int incy, const float *c, const float *s)
cublasStatus_t cublasZrot(cublasHandle_t handle, int n, cuDoubleComplex *x, int incx, cuDoubleComplex *y, int incy, const double *c, const cuDoubleComplex *s)
cublasStatus_t cublasZdrot(cublasHandle_t handle, int n, cuDoubleComplex *x, int incx, cuDoubleComplex *y, int incy, const double *c, const double *s)

This function supports the 64-bit Integer Interface.

This function applies the Givens rotation matrix (i.e., rotation in the x,y plane counter-clockwise by the angle defined by $\cos(\alpha) = c$, $\sin(\alpha) = s$):

$G = \begin{pmatrix}c & s \\{- s} & c \\\end{pmatrix}$

to vectors `x` and `y`.

Hence, the result is $\mathbf{x}\lbrack k\rbrack = c \times \mathbf{x}\lbrack k\rbrack + s \times \mathbf{y}\lbrack j\rbrack$ and $\mathbf{y}\lbrack j\rbrack = - s \times \mathbf{x}\lbrack k\rbrack + c \times \mathbf{y}\lbrack j\rbrack$ for $i = 1,\ldots,n$, where $k = 1 + \left( i - 1 \right)*\text{incx}$ and $j = 1 + \left( i - 1 \right)*\text{incy}$. Both right-hand sides use the values of $\mathbf{x}\lbrack k\rbrack$ and $\mathbf{y}\lbrack j\rbrack$ from before the update. Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.
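The update can be sketched on the host for the real single-precision case. The function name `ref_srot` is hypothetical and positive strides are assumed; the point of the sketch is that both outputs are computed from the old values of the pair, which is easy to get wrong when updating in place.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference, not part of cuBLAS: applies
   G = [c s; -s c] to n pairs (x[k], y[j]). Positive strides assumed. */
static void ref_srot(int n, float *x, int incx, float *y, int incy,
                     float c, float s)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i) {
        size_t k = (size_t)i * (size_t)incx;   /* zero-based offset into x */
        size_t j = (size_t)i * (size_t)incy;   /* zero-based offset into y */
        float xk = x[k];                       /* save old values: both   */
        float yj = y[j];                       /* updates depend on them  */
        x[k] = c * xk + s * yj;
        y[j] = -s * xk + c * yj;
    }
}
```

For example, with `c = 0` and `s = 1` each pair `(x, y)` becomes `(y, -x)`.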

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`c`	host or device	input	Cosine element of the rotation matrix.
`s`	host or device	input	Sine element of the rotation matrix.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

srot(), drot(), crot(), csrot(), zrot(), zdrot()

2.5.9.cublas<t>rotg()

cublasStatus_t cublasSrotg(cublasHandle_t handle, float *a, float *b, float *c, float *s)
cublasStatus_t cublasDrotg(cublasHandle_t handle, double *a, double *b, double *c, double *s)
cublasStatus_t cublasCrotg(cublasHandle_t handle, cuComplex *a, cuComplex *b, float *c, cuComplex *s)
cublasStatus_t cublasZrotg(cublasHandle_t handle, cuDoubleComplex *a, cuDoubleComplex *b, double *c, cuDoubleComplex *s)

This function supports the 64-bit Integer Interface.

This function constructs the Givens rotation matrix

$G = \begin{pmatrix}c & s \\{- s} & c \\\end{pmatrix}$

that zeros out the second entry of a $2 \times 1$ vector $\left( a,b \right)^{T}$.

Then, for real numbers we can write

$\begin{pmatrix}c & s \\{- s} & c \\\end{pmatrix}\begin{pmatrix}a \\b \\\end{pmatrix} = \begin{pmatrix}r \\0 \\\end{pmatrix}$

where $c^{2} + s^{2} = 1$ and $r = \pm \sqrt{a^{2} + b^{2}}$. The parameters $a$ and $b$ are overwritten with $r$ and $z$, respectively. The value of $z$ is such that $c$ and $s$ may be recovered using the following rules:

$\left( c,s \right) = \begin{cases}\left( \sqrt{1 - z^{2}},z \right) & \text{ if } |z| < 1 \\\left( 0.0,1.0 \right) & \text{ if } |z| = 1 \\\left( 1/z,\sqrt{1 - 1/z^{2}} \right) & \text{ if } |z| > 1 \\\end{cases}$

For complex numbers we can write

$\begin{pmatrix}c & s \\{- \bar{s}} & c \\\end{pmatrix}\begin{pmatrix}a \\b \\\end{pmatrix} = \begin{pmatrix}r \\0 \\\end{pmatrix}$

where $c^{2} + \left( \bar{s} \times s \right) = 1$ and $r = \frac{a}{|a|} \times \parallel \left( a,b \right)^{T} \parallel_{2}$ with $\parallel \left( a,b \right)^{T} \parallel_{2} = \sqrt{|a|^{2} + |b|^{2}}$ for $a \neq 0$ and $r = b$ for $a = 0$. Finally, the parameter $a$ is overwritten with $r$ on exit.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`a`	host or device	in/out	<type> scalar that is overwritten with $r$.
`b`	host or device	in/out	<type> scalar that is overwritten with $z$.
`c`	host or device	output	Cosine element of the rotation matrix.
`s`	host or device	output	Sine element of the rotation matrix.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

srotg(), drotg(), crotg(), zrotg()

2.5.10.cublas<t>rotm()

cublasStatus_t cublasSrotm(cublasHandle_t handle, int n, float *x, int incx, float *y, int incy, const float *param)
cublasStatus_t cublasDrotm(cublasHandle_t handle, int n, double *x, int incx, double *y, int incy, const double *param)

This function supports the 64-bit Integer Interface.

This function applies the modified Givens transformation

$H = \begin{pmatrix}h_{11} & h_{12} \\h_{21} & h_{22} \\\end{pmatrix}$

to vectors `x` and `y`.

Hence, the result is $\mathbf{x}\lbrack k\rbrack = h_{11} \times \mathbf{x}\lbrack k\rbrack + h_{12} \times \mathbf{y}\lbrack j\rbrack$ and $\mathbf{y}\lbrack j\rbrack = h_{21} \times \mathbf{x}\lbrack k\rbrack + h_{22} \times \mathbf{y}\lbrack j\rbrack$ for $i = 1,\ldots,n$, where $k = 1 + \left( i - 1 \right)*\text{incx}$ and $j = 1 + \left( i - 1 \right)*\text{incy}$. Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

The elements $h_{11}$, $h_{21}$, $h_{12}$ and $h_{22}$ of matrix $H$ are stored in `param[1]`, `param[2]`, `param[3]` and `param[4]`, respectively. The `flag=param[0]` defines the following predefined values for the matrix $H$ entries

`flag==-1.0`	`flag==0.0`	`flag==1.0`	`flag==-2.0`
$\begin{pmatrix}h_{11} & h_{12} \\h_{21} & h_{22} \\\end{pmatrix}$	$\begin{pmatrix}{1.0} & h_{12} \\h_{21} & {1.0} \\\end{pmatrix}$	$\begin{pmatrix}h_{11} & {1.0} \\{- 1.0} & h_{22} \\\end{pmatrix}$	$\begin{pmatrix}{1.0} & {0.0} \\{0.0} & {1.0} \\\end{pmatrix}$

Notice that the values -1.0, 0.0 and 1.0 implied by the flag are not stored in param.
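The flag handling can be sketched as a small host-side helper that expands `param` into the full matrix. The name `rotm_matrix` and the row-major output layout are hypothetical, and the sketch assumes the reference BLAS ordering in which `param[1..4]` hold $h_{11}$, $h_{21}$, $h_{12}$, $h_{22}$.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cuBLAS: expand param[0..4] into H,
   written as h[0]=h11, h[1]=h12, h[2]=h21, h[3]=h22.
   Assumes param[1..4] = h11, h21, h12, h22 (reference BLAS ordering).
   Returns 0 on success and -1 for an unrecognized flag. */
static int rotm_matrix(const double param[5], double h[4])
{
    assert(param != NULL && h != NULL);
    double flag = param[0];
    if (flag == -1.0) {              /* all four entries come from param */
        h[0] = param[1]; h[1] = param[3];
        h[2] = param[2]; h[3] = param[4];
    } else if (flag == 0.0) {        /* unit diagonal is implied */
        h[0] = 1.0;      h[1] = param[3];
        h[2] = param[2]; h[3] = 1.0;
    } else if (flag == 1.0) {        /* off-diagonal 1.0 and -1.0 are implied */
        h[0] = param[1]; h[1] = 1.0;
        h[2] = -1.0;     h[3] = param[4];
    } else if (flag == -2.0) {       /* identity: param[1..4] are ignored */
        h[0] = 1.0; h[1] = 0.0;
        h[2] = 0.0; h[3] = 1.0;
    } else {
        return -1;
    }
    return 0;
}
```

The implied entries are filled in by the helper rather than read from `param`, so whatever happens to be stored in those slots has no effect.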

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`param`	host or device	input	<type> vector of 5 elements, where `param[0]` and `param[1..4]` contain the flag and matrix $H$.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

srotm(), drotm()

2.5.11.cublas<t>rotmg()

cublasStatus_t cublasSrotmg(cublasHandle_t handle, float *d1, float *d2, float *x1, const float *y1, float *param)
cublasStatus_t cublasDrotmg(cublasHandle_t handle, double *d1, double *d2, double *x1, const double *y1, double *param)

This function supports the 64-bit Integer Interface.

This function constructs the modified Givens transformation

$H = \begin{pmatrix}h_{11} & h_{12} \\h_{21} & h_{22} \\\end{pmatrix}$

that zeros out the second entry of a $2 \times 1$ vector $\left( \sqrt{d1}*x1,\sqrt{d2}*y1 \right)^{T}$.

The `flag=param[0]` defines the following predefined values for the matrix $H$ entries

`flag==-1.0`	`flag==0.0`	`flag==1.0`	`flag==-2.0`
$\begin{pmatrix}h_{11} & h_{12} \\h_{21} & h_{22} \\\end{pmatrix}$	$\begin{pmatrix}{1.0} & h_{12} \\h_{21} & {1.0} \\\end{pmatrix}$	$\begin{pmatrix}h_{11} & {1.0} \\{- 1.0} & h_{22} \\\end{pmatrix}$	$\begin{pmatrix}{1.0} & {0.0} \\{0.0} & {1.0} \\\end{pmatrix}$

Notice that the values -1.0, 0.0 and 1.0 implied by the flag are not stored in param.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`d1`	host or device	in/out	<type> scalar that is overwritten on exit.
`d2`	host or device	in/out	<type> scalar that is overwritten on exit.
`x1`	host or device	in/out	<type> scalar that is overwritten on exit.
`y1`	host or device	input	<type> scalar.
`param`	host or device	output	<type> vector of 5 elements, where `param[0]` and `param[1..4]` contain the flag and matrix $H$.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

srotmg(), drotmg()

2.5.12.cublas<t>scal()

cublasStatus_t cublasSscal(cublasHandle_t handle, int n, const float *alpha, float *x, int incx)
cublasStatus_t cublasDscal(cublasHandle_t handle, int n, const double *alpha, double *x, int incx)
cublasStatus_t cublasCscal(cublasHandle_t handle, int n, const cuComplex *alpha, cuComplex *x, int incx)
cublasStatus_t cublasCsscal(cublasHandle_t handle, int n, const float *alpha, cuComplex *x, int incx)
cublasStatus_t cublasZscal(cublasHandle_t handle, int n, const cuDoubleComplex *alpha, cuDoubleComplex *x, int incx)
cublasStatus_t cublasZdscal(cublasHandle_t handle, int n, const double *alpha, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function scales the vector `x` by the scalar $\alpha$ and overwrites it with the result. Hence, the performed operation is $\mathbf{x}\lbrack j\rbrack = \alpha \times \mathbf{x}\lbrack j\rbrack$ for $i = 1,\ldots,n$ and $j = 1 + \left( i - 1 \right)*\text{incx}$. Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`alpha`	host or device	input	<type> scalar used for multiplication.
`n`		input	Number of elements in the vector `x`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sscal(), dscal(), csscal(), cscal(), zdscal(), zscal()

2.5.13.cublas<t>swap()

cublasStatus_t cublasSswap(cublasHandle_t handle, int n, float *x, int incx, float *y, int incy)
cublasStatus_t cublasDswap(cublasHandle_t handle, int n, double *x, int incx, double *y, int incy)
cublasStatus_t cublasCswap(cublasHandle_t handle, int n, cuComplex *x, int incx, cuComplex *y, int incy)
cublasStatus_t cublasZswap(cublasHandle_t handle, int n, cuDoubleComplex *x, int incx, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function interchanges the elements of vectors `x` and `y`. Hence, the performed operation is $\mathbf{y}\lbrack j\rbrack \Leftrightarrow \mathbf{x}\lbrack k\rbrack$ for $i = 1,\ldots,n$, $k = 1 + \left( i - 1 \right)*\text{incx}$ and $j = 1 + \left( i - 1 \right)*\text{incy}$. Notice that the last two equations reflect 1-based indexing used for compatibility with Fortran.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sswap(), dswap(), cswap(), zswap()

2.6.cuBLAS Level-2 Function Reference

In this chapter we describe the Level-2 Basic Linear Algebra Subprograms (BLAS2) functions that perform matrix-vector operations.

2.6.1.cublas<t>gbmv()

cublasStatus_t cublasSgbmv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int kl, int ku, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy)
cublasStatus_t cublasDgbmv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int kl, int ku, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)
cublasStatus_t cublasCgbmv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int kl, int ku, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy)
cublasStatus_t cublasZgbmv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int kl, int ku, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the banded matrix-vector multiplication

$\mathbf{y} = \alpha\text{ op}(A)\mathbf{x} + \beta\mathbf{y}$

where $A$ is a banded matrix with $kl$ subdiagonals and $ku$ superdiagonals, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars. Also, for matrix $A$

$\text{ op}(A) = \begin{cases}A & \text{ if trans == CUBLAS\_OP\_N} \\A^{T} & \text{ if trans == CUBLAS\_OP\_T} \\A^{H} & \text{ if trans == CUBLAS\_OP\_C} \\\end{cases}$

The banded matrix $A$ is stored column by column, with the main diagonal stored in row $ku + 1$ (starting in first position), the first superdiagonal stored in row $ku$ (starting in second position), the first subdiagonal stored in row $ku + 2$ (starting in first position), etc. In general, the element $A\left( i,j \right)$ is stored in the memory location `A(ku+1+i-j,j)` for $j = 1,\ldots,n$ and $i \in \left\lbrack \max\left( 1,j - ku \right),\min\left( m,j + kl \right) \right\rbrack$. Also, the elements in the array $A$ that do not conceptually correspond to the elements in the banded matrix (the top left $ku \times ku$ and bottom right $kl \times kl$ triangles) are not referenced.
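To make the band layout concrete, the host-side sketch below copies the band of a dense column-major matrix into band storage before it would be uploaded to the device. The helper names `band_offset` and `dense_to_band` are hypothetical, and indices `i`, `j` are 1-based as in the text, with conversion to zero-based C offsets done inside the helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cuBLAS: zero-based offset in the band
   array of element A(i,j), i.e. band location A(ku+1+i-j, j) with 1-based
   i and j. */
static size_t band_offset(int i, int j, int ku, int lda)
{
    int row = ku + 1 + i - j;        /* 1-based row inside the band array */
    assert(row >= 1 && row <= lda);
    return (size_t)(j - 1) * (size_t)lda + (size_t)(row - 1);
}

/* Copy the band of a dense column-major m x n matrix (leading dimension
   ldd) into band storage (leading dimension lda >= kl+ku+1). Entries of
   `band` outside the band are left untouched. */
static void dense_to_band(int m, int n, int kl, int ku,
                          const float *dense, int ldd,
                          float *band, int lda)
{
    assert(lda >= kl + ku + 1 && ldd >= m);
    for (int j = 1; j <= n; ++j) {
        int lo = (j - ku > 1) ? j - ku : 1;    /* max(1, j-ku) */
        int hi = (j + kl < m) ? j + kl : m;    /* min(m, j+kl) */
        for (int i = lo; i <= hi; ++i)
            band[band_offset(i, j, ku, lda)] =
                dense[(size_t)(j - 1) * (size_t)ldd + (size_t)(i - 1)];
    }
}
```

For a 3 x 3 tridiagonal matrix (`kl = ku = 1`, `lda = 3`), the first band row holds the superdiagonal starting in the second column, the second row holds the main diagonal, and the third row holds the subdiagonal; the top-left and bottom-right corner slots are never written.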

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix `A`.
`n`		input	Number of columns of matrix `A`.
`kl`		input	Number of subdiagonals of matrix `A`.
`ku`		input	Number of superdiagonals of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x n` with `lda>=kl+ku+1`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	input	<type> vector with `n` elements if `trans==CUBLAS_OP_N` and `m` elements otherwise.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `m` elements if `trans==CUBLAS_OP_N` and `n` elements otherwise.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0`, `n<0`, `kl<0` or `ku<0`, or if `lda<(kl+ku+1)`, or if `incx==0` or `incy==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T`, `CUBLAS_OP_C`, or if `alpha` or `beta` are NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgbmv(), dgbmv(), cgbmv(), zgbmv()

2.6.2.cublas<t>gemv()

cublasStatus_t cublasSgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy)
cublasStatus_t cublasDgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)
cublasStatus_t cublasCgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy)
cublasStatus_t cublasZgemv(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the matrix-vector multiplication

$\textbf{y} = \alpha\text{ op}(A)\textbf{x} + \beta\textbf{y}$

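where $A$ is an $m \times n$ matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars. Also, for matrix $A$

$\text{ op}(A) = \begin{cases}A & \text{ if trans == CUBLAS\_OP\_N} \\A^{T} & \text{ if trans == CUBLAS\_OP\_T} \\A^{H} & \text{ if trans == CUBLAS\_OP\_C} \\\end{cases}$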

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix `A`.
`n`		input	Number of columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x n` with `lda>=max(1,m)`. Before entry, the leading `m` by `n` part of the array `A` must contain the matrix of coefficients. Unchanged on exit.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`. `lda` must be at least `max(1,m)`.
`x`	device	input	<type> vector with at least `(1+(n-1)*abs(incx))` elements if `trans==CUBLAS_OP_N` and at least `(1+(m-1)*abs(incx))` elements otherwise.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with at least `(1+(m-1)*abs(incy))` elements if `trans==CUBLAS_OP_N` and at least `(1+(n-1)*abs(incy))` elements otherwise.
`incy`		input	Stride between consecutive elements of `y`.
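The minimum vector sizes in the table can be computed before allocating device buffers. The sketch below is a host-side helper with hypothetical names (`min_vector_elems`, `gemv_min_sizes`); the boolean `trans_is_n` stands in for the test `trans==CUBLAS_OP_N`, and treating a non-positive logical length as needing zero elements is an assumption of the sketch rather than something the table states.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper, not part of cuBLAS: minimum number of elements a
   strided vector must hold to expose `len` logical entries, i.e.
   1 + (len-1)*abs(inc). A non-positive len is treated as needing 0. */
static size_t min_vector_elems(int len, int inc)
{
    assert(inc != 0);                /* a zero stride is rejected by gemv */
    if (len <= 0)
        return 0;
    return 1 + (size_t)(len - 1) * (size_t)abs(inc);
}

/* x has n logical entries and y has m when op(A) = A; the roles swap for
   the transposed and conjugate-transposed cases. */
static void gemv_min_sizes(int trans_is_n, int m, int n, int incx, int incy,
                           size_t *x_elems, size_t *y_elems)
{
    assert(x_elems != NULL && y_elems != NULL);
    int xlen = trans_is_n ? n : m;
    int ylen = trans_is_n ? m : n;
    *x_elems = min_vector_elems(xlen, incx);
    *y_elems = min_vector_elems(ylen, incy);
}
```

For `m = 3`, `n = 5`, `incx = 2` and `incy = -1`, the non-transposed case needs 9 elements for `x` and 3 for `y`, while the transposed case needs 5 for each.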

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0`, or if `incx==0` or `incy==0`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgemv(), dgemv(), cgemv(), zgemv()

2.6.3.cublas<t>ger()

cublasStatus_t cublasSger(cublasHandle_t handle, int m, int n, const float *alpha, const float *x, int incx, const float *y, int incy, float *A, int lda)
cublasStatus_t cublasDger(cublasHandle_t handle, int m, int n, const double *alpha, const double *x, int incx, const double *y, int incy, double *A, int lda)
cublasStatus_t cublasCgeru(cublasHandle_t handle, int m, int n, const cuComplex *alpha, const cuComplex *x, int incx, const cuComplex *y, int incy, cuComplex *A, int lda)
cublasStatus_t cublasCgerc(cublasHandle_t handle, int m, int n, const cuComplex *alpha, const cuComplex *x, int incx, const cuComplex *y, int incy, cuComplex *A, int lda)
cublasStatus_t cublasZgeru(cublasHandle_t handle, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx, const cuDoubleComplex *y, int incy, cuDoubleComplex *A, int lda)
cublasStatus_t cublasZgerc(cublasHandle_t handle, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx, const cuDoubleComplex *y, int incy, cuDoubleComplex *A, int lda)

This function supports the 64-bit Integer Interface.

This function performs the rank-1 update

$A = \begin{cases}{\alpha\mathbf{xy}^{T} + A} & \text{if ger(),geru() is called} \\{\alpha\mathbf{xy}^{H} + A} & \text{if gerc() is called} \\\end{cases}$

where $A$ is an $m \times n$ matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.
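For checking device results on small inputs, the real rank-1 update can be written as a short host-side loop. The name `ref_sger` is hypothetical, the sketch covers only the real `ger()` case, and it assumes positive strides.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference, not part of cuBLAS:
   A = alpha * x * y^T + A for a column-major m x n matrix A with leading
   dimension lda. Positive strides are assumed. */
static void ref_sger(int m, int n, float alpha,
                     const float *x, int incx,
                     const float *y, int incy,
                     float *A, int lda)
{
    assert(m >= 0 && n >= 0 && incx > 0 && incy > 0);
    assert(lda >= (m > 1 ? m : 1));            /* lda >= max(1, m) */
    for (int j = 0; j < n; ++j) {
        /* alpha * y[j] is shared by the whole column j */
        float ayj = alpha * y[(size_t)j * (size_t)incy];
        for (int i = 0; i < m; ++i)
            A[(size_t)j * (size_t)lda + (size_t)i] +=
                x[(size_t)i * (size_t)incx] * ayj;
    }
}
```

Iterating over columns in the outer loop keeps the accesses to `A` contiguous, which matches the column-major layout.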

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`m`		input	Number of rows of matrix `A`.
`n`		input	Number of columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `m` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`A`	device	in/out	<type> array of dimension `lda x n` with `lda>=max(1,m)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0`, or if `incx==0` or `incy==0`, or if `alpha` is NULL, or if `lda<max(1,m)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sger(), dger(), cgeru(), cgerc(), zgeru(), zgerc()

2.6.4.cublas<t>sbmv()

cublasStatus_t cublasSsbmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, int k, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy)
cublasStatus_t cublasDsbmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, int k, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the symmetric banded matrix-vector multiplication

$\textbf{y} = \alpha A\textbf{x} + \beta\textbf{y}$

where $A$ is an $n \times n$ symmetric banded matrix with $k$ subdiagonals and superdiagonals, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If `uplo==CUBLAS_FILL_MODE_LOWER` then the symmetric banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), etc. In general, the element $A(i,j)$ is stored in the memory location `A(1+i-j,j)` for $j = 1,\ldots,n$ and $i \in \lbrack j,\min(n,j + k)\rbrack$. Also, the elements in the array `A` that do not conceptually correspond to the elements in the banded matrix (the bottom right $k \times k$ triangle) are not referenced.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the symmetric banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row `k+1`, the first superdiagonal in row `k` (starting at second position), the second superdiagonal in row `k-1` (starting at third position), etc. In general, the element $A(i,j)$ is stored in the memory location `A(1+k+i-j,j)` for $j = 1,\ldots,n$ and $i \in \lbrack\max(1,j - k),j\rbrack$. Also, the elements in the array `A` that do not conceptually correspond to the elements in the banded matrix (the top left $k \times k$ triangle) are not referenced.
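The two layouts can be captured in one host-side index helper. The name `sym_band_offset` and the integer `lower` flag (nonzero for the `CUBLAS_FILL_MODE_LOWER` layout) are hypothetical. Indices `i`, `j` are 1-based as in the text, and an element from the triangle that is not stored is mapped to its mirror image using $A(i,j) = A(j,i)$.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cuBLAS: zero-based offset of symmetric
   band element A(i,j), with 1-based i and j and bandwidth k. */
static size_t sym_band_offset(int lower, int i, int j, int k, int lda)
{
    assert(lda >= k + 1);
    /* Only one triangle is stored; reflect requests for the other one. */
    if (lower ? (i < j) : (i > j)) {
        int t = i; i = j; j = t;
    }
    /* Lower: A(1+i-j, j).  Upper: A(1+k+i-j, j). */
    int row = lower ? (1 + i - j) : (1 + k + i - j);
    assert(row >= 1 && row <= k + 1);   /* element must lie inside the band */
    return (size_t)(j - 1) * (size_t)lda + (size_t)(row - 1);
}
```

With `k = 1` and `lda = 2`, the lower layout places the diagonal entry of each column first, whereas the upper layout places it second, directly below the superdiagonal entry.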

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`k`		input	Number of sub- and super-diagonals of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x n` with `lda>=k+1`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `incx==0` or `incy==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `alpha` or `beta` are NULL, or if `lda<(1+k)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssbmv(), dsbmv()

2.6.5.cublas<t>spmv()

cublasStatus_t cublasSspmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const float *AP, const float *x, int incx, const float *beta, float *y, int incy)
cublasStatus_t cublasDspmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *alpha, const double *AP, const double *x, int incx, const double *beta, double *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the symmetric packed matrix-vector multiplication

$\textbf{y} = \alpha A\textbf{x} + \beta\textbf{y}$

where $A$ is an $n \times n$ symmetric matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If `uplo==CUBLAS_FILL_MODE_LOWER` then the elements in the lower triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+((2*n-j-1)*j)/2]` for $j = 0,\ldots,n - 1$ and $i \geq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the elements in the upper triangular part of the symmetric matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+(j*(j+1))/2]` for $j = 0,\ldots,n - 1$ and $i \leq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.
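The packed offsets can be derived directly from "column by column without gaps": in the lower layout, column `j` (zero-based) is preceded by columns of lengths `n, n-1, ..., n-j+1`, and in the upper layout by columns of lengths `1, 2, ..., j`. The host-side sketch below encodes that derivation with zero-based indices; the name `packed_index` and the integer `lower` flag are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of cuBLAS: zero-based offset into AP of
   symmetric element A(i,j), with zero-based i and j. Requests for the
   triangle that is not stored are reflected using A(i,j) == A(j,i). */
static size_t packed_index(int lower, int n, int i, int j)
{
    assert(0 <= i && i < n && 0 <= j && j < n);
    if (lower ? (i < j) : (i > j)) {
        int t = i; i = j; j = t;
    }
    size_t si = (size_t)i, sj = (size_t)j, sn = (size_t)n;
    if (lower)
        /* column j starts after j columns of lengths n, n-1, ..., n-j+1 */
        return si + ((2 * sn - sj - 1) * sj) / 2;
    /* column j starts after j columns of lengths 1, 2, ..., j */
    return si + (sj * (sj + 1)) / 2;
}
```

For `n = 3` both layouts use exactly six slots, and the last diagonal element `A(2,2)` lands at offset 5 in either mode.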

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix $A$ is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix $A$.
`alpha`	host or device	input	<type> scalar used for multiplication.
`AP`	device	input	<type> array with $A$ stored in packed format.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0` or `incy==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `alpha` or `beta` are NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sspmv(), dspmv()

2.6.6.cublas<t>spr()

cublasStatus_t cublasSspr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const float *x, int incx, float *AP)
cublasStatus_t cublasDspr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *alpha, const double *x, int incx, double *AP)

This function supports the 64-bit Integer Interface.

This function performs the packed symmetric rank-1 update

$A = \alpha\textbf{x}\textbf{x}^{T} + A$

where $A$ is an $n \times n$ symmetric matrix stored in packed format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix $A$ is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix $A$.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`AP`	device	in/out	<type> array with $A$ stored in packed format.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sspr(), dspr()

2.6.7.cublas<t>spr2()

cublasStatus_t cublasSspr2(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const float *x, int incx, const float *y, int incy, float *AP)
cublasStatus_t cublasDspr2(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *alpha, const double *x, int incx, const double *y, int incy, double *AP)

This function supports the 64-bit Integer Interface.

This function performs the packed symmetric rank-2 update

$A = \alpha\left( {\textbf{x}\textbf{y}^{T} + \textbf{y}\textbf{x}^{T}} \right) + A$

where $A$ is an $n \times n$ symmetric matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix $A$ is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix $A$.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`AP`	device	in/out	<type> array with $A$ stored in packed format.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0` or `incy==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sspr2(), dspr2()

2.6.8.cublas<t>symv()

cublasStatus_t cublasSsymv(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const float *A, int lda, const float *x, int incx, const float *beta, float *y, int incy)
cublasStatus_t cublasDsymv(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *alpha, const double *A, int lda, const double *x, int incx, const double *beta, double *y, int incy)
cublasStatus_t cublasCsymv(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *alpha, /* host or device pointer */ const cuComplex *A, int lda, const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy)
cublasStatus_t cublasZsymv(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the symmetric matrix-vector multiplication.

$\textbf{y} = \alpha A\textbf{x} + \beta\textbf{y}$where$A$ is a$n \times n$ symmetric matrix stored in lower or upper mode,$\mathbf{x}$ and$\mathbf{y}$ are vectors, and$\alpha$ and$\beta$ are scalars.

This function has an alternate faster implementation using atomics that can be enabled with cublasSetAtomicsMode().

Please see the section on the function cublasSetAtomicsMode() for more details about the usage of atomics.
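As a host-side illustration of the operation defined above, the following C sketch computes the same symmetric product on the CPU while reading only the stored triangle. It is not cuBLAS code: `ref_symv` and its `lower` flag are hypothetical names, unit strides (`incx = incy = 1`) are assumed, and `double` stands in for `<type>`.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for y = alpha*A*x + beta*y with A symmetric.
 * Only the triangle selected by `lower` is read; the other half is
 * obtained by symmetry. A is column-major with leading dimension lda.
 * Unit strides are assumed for brevity. */
static void ref_symv(int lower, int n, double alpha, const double *A, int lda,
                     const double *x, double beta, double *y)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int j = 0; j < n; ++j) {
            /* Map (i,j) onto the stored triangle. */
            int r = i, c = j;
            if ((lower && r < c) || (!lower && r > c)) { r = j; c = i; }
            acc += A[(size_t)r + (size_t)c * lda] * x[j];
        }
        /* With beta == 0 the incoming y is never read. */
        y[i] = (beta == 0.0) ? alpha * acc : alpha * acc + beta * y[i];
    }
}
```

Such a reference can be used to check a GPU result on small inputs; entries in the unreferenced triangle may hold arbitrary values.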

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x n` with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0` or `incy==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<n`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssymv(),dsymv()

2.6.9.cublas<t>syr()

cublasStatus_t cublasSsyr(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const float *alpha, const float *x, int incx, float *A, int lda)
cublasStatus_t cublasDsyr(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const double *alpha, const double *x, int incx, double *A, int lda)
cublasStatus_t cublasCsyr(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuComplex *alpha, const cuComplex *x, int incx, cuComplex *A, int lda)
cublasStatus_t cublasZsyr(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx, cuDoubleComplex *A, int lda)

This function supports the 64-bit Integer Interface.

This function performs the symmetric rank-1 update

$A = \alpha\textbf{x}\textbf{x}^{T} + A$

where $A$ is an $n \times n$ symmetric matrix stored in column-major format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.
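The rank-1 update can be sketched on the host as follows. This is an illustrative reference, not the library implementation: `ref_syr` and the `lower` flag are hypothetical names, `double` stands in for `<type>`, and a positive `incx` is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for A = alpha*x*x^T + A that touches only the
 * triangle selected by `lower`. A is column-major with leading
 * dimension lda; incx is assumed positive in this sketch. */
static void ref_syr(int lower, int n, double alpha,
                    const double *x, int incx, double *A, int lda)
{
    assert(n >= 0 && incx > 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;      /* first row updated in column j */
        int last  = lower ? n - 1 : j;  /* last row updated in column j  */
        for (int i = first; i <= last; ++i) {
            A[(size_t)i + (size_t)j * lda] +=
                alpha * x[(size_t)i * incx] * x[(size_t)j * incx];
        }
    }
}
```

Because only one triangle is written, the other half of the array keeps whatever values it held before the call.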

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`A`	device	in/out	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssyr(),dsyr()

2.6.10.cublas<t>syr2()

cublasStatus_t cublasSsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const float *alpha, const float *x, int incx,
    const float *y, int incy, float *A, int lda)
cublasStatus_t cublasDsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const double *alpha, const double *x, int incx,
    const double *y, int incy, double *A, int lda)
cublasStatus_t cublasCsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuComplex *alpha, const cuComplex *x, int incx,
    const cuComplex *y, int incy, cuComplex *A, int lda)
cublasStatus_t cublasZsyr2(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx,
    const cuDoubleComplex *y, int incy, cuDoubleComplex *A, int lda)

This function supports the 64-bit Integer Interface.

This function performs the symmetric rank-2 update

$A = \alpha\left( {\textbf{x}\textbf{y}^{T} + \textbf{y}\textbf{x}^{T}} \right) + A$

where $A$ is an $n \times n$ symmetric matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.
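A host-side sketch of the rank-2 update is shown below, under the same caveats as any reference code: `ref_syr2` and `lower` are hypothetical names, unit strides are assumed, and `double` stands in for `<type>`.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for A = alpha*(x*y^T + y*x^T) + A.
 * Only the triangle selected by `lower` is updated. A is
 * column-major with leading dimension lda; unit strides assumed. */
static void ref_syr2(int lower, int n, double alpha,
                     const double *x, const double *y, double *A, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;
        int last  = lower ? n - 1 : j;
        for (int i = first; i <= last; ++i) {
            /* Element (i,j) of x*y^T + y*x^T. */
            A[(size_t)i + (size_t)j * lda] += alpha * (x[i] * y[j] + y[i] * x[j]);
        }
    }
}
```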

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`A`	device	in/out	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0` or `incy==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `alpha` is NULL, or if `lda<max(1,n)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssyr2(),dsyr2()

2.6.11.cublas<t>tbmv()

cublasStatus_t cublasStbmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const float *A, int lda, float *x, int incx)
cublasStatus_t cublasDtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const double *A, int lda, double *x, int incx)
cublasStatus_t cublasCtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const cuComplex *A, int lda, cuComplex *x, int incx)
cublasStatus_t cublasZtbmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const cuDoubleComplex *A, int lda, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function performs the triangular banded matrix-vector multiplication

$\textbf{x} = \text{op}(A)\textbf{x}$

where $A$ is a triangular banded matrix, and $\mathbf{x}$ is a vector. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

If `uplo==CUBLAS_FILL_MODE_LOWER` then the triangular banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), etc. In general, the element $A(i,j)$ is stored in the memory location `A(1+i-j,j)` for $j = 1,\ldots,n$ and $i \in \lbrack j,\min(n,j + k)\rbrack$. The elements in the array `A` that do not conceptually correspond to elements of the banded matrix (the bottom right $k \times k$ triangle) are not referenced.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the triangular banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row `k+1`, the first superdiagonal in row `k` (starting at second position), the second superdiagonal in row `k-1` (starting at third position), etc. In general, the element $A(i,j)$ is stored in the memory location `A(1+k+i-j,j)` for $j = 1,\ldots,n$ and $i \in \lbrack\max(1,j - k),j\rbrack$. The elements in the array `A` that do not conceptually correspond to elements of the banded matrix (the top left $k \times k$ triangle) are not referenced.
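The one-based locations given above translate to zero-based C offsets by dropping the leading `1+`. The sketch below shows that mapping and a small packing routine; the function names and the dense source array are hypothetical and only serve to illustrate the layout.

```c
#include <assert.h>
#include <stddef.h>

/* Zero-based equivalents of the one-based locations A(1+i-j,j) and
 * A(1+k+i-j,j): offset of element (i,j) inside the band array. */
static size_t band_lower_offset(int i, int j, int k, int lda)
{
    assert(i >= j && i - j <= k && lda >= k + 1);
    return (size_t)(i - j) + (size_t)j * lda;
}

static size_t band_upper_offset(int i, int j, int k, int lda)
{
    assert(i <= j && j - i <= k && lda >= k + 1);
    return (size_t)(k + i - j) + (size_t)j * lda;
}

/* Copy the band of a dense column-major lower-triangular matrix into
 * band storage. Dense entries outside the band are ignored, and band
 * slots with no matching matrix element are left untouched. */
static void pack_lower_band(int n, int k, const double *dense, int ldd,
                            double *band, int lda)
{
    for (int j = 0; j < n; ++j) {
        int last = (j + k < n - 1) ? j + k : n - 1;
        for (int i = j; i <= last; ++i) {
            band[band_lower_offset(i, j, k, lda)] =
                dense[(size_t)i + (size_t)j * ldd];
        }
    }
}
```

With `lda = k+1` the band array holds exactly `(k+1)*n` elements, which is the reason banded routines take `lda>=k+1` rather than `lda>=n`.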

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`n`		input	Number of rows and columns of matrix `A`.
`k`		input	Number of sub- and super-diagonals of matrix `A`.
`A`	device	input	<type> array of dimension `lda x n`, with `lda>=k+1`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `incx==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`, or if `lda<(1+k)`
`CUBLAS_STATUS_ALLOC_FAILED`	The allocation of internal scratch memory failed
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

stbmv(),dtbmv(),ctbmv(),ztbmv()

2.6.12.cublas<t>tbsv()

cublasStatus_t cublasStbsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const float *A, int lda, float *x, int incx)
cublasStatus_t cublasDtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const double *A, int lda, double *x, int incx)
cublasStatus_t cublasCtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const cuComplex *A, int lda, cuComplex *x, int incx)
cublasStatus_t cublasZtbsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n, int k,
    const cuDoubleComplex *A, int lda, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function solves the triangular banded linear system with a single right-hand-side

$\text{op}(A)\textbf{x} = \textbf{b}$

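where $A$ is a triangular banded matrix, and $\mathbf{x}$ and $\mathbf{b}$ are vectors. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

The solution $\mathbf{x}$ overwrites the right-hand side $\mathbf{b}$ on exit.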

No test for singularity or near-singularity is included in this function.
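For the simplest case (lower banded storage, no transpose, non-unit diagonal, unit stride) the solve amounts to forward substitution restricted to the band. The following host-side sketch assumes exactly that case; `ref_tbsv_lower` is a hypothetical name and `double` stands in for `<type>`.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A*x = b in place for a lower banded A with k subdiagonals,
 * stored in band format with leading dimension lda. On entry x holds
 * b; on exit it holds the solution. */
static void ref_tbsv_lower(int n, int k, const double *AB, int lda, double *x)
{
    assert(n >= 0 && k >= 0 && lda >= k + 1);
    for (int i = 0; i < n; ++i) {
        double s = x[i];
        int first = (i > k) ? i - k : 0;   /* leftmost column inside the band */
        for (int j = first; j < i; ++j) {
            s -= AB[(size_t)(i - j) + (size_t)j * lda] * x[j];
        }
        /* The diagonal sits in band row 0. No singularity test is made. */
        x[i] = s / AB[(size_t)i * lda];
    }
}
```

A zero on the diagonal produces an infinity or NaN here rather than an error, which mirrors the absence of a singularity test noted above.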

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`n`		input	Number of rows and columns of matrix `A`.
`k`		input	Number of sub- and super-diagonals of matrix `A`.
`A`	device	input	<type> array of dimension `lda x n`, with `lda>=k+1`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `incx==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`, or if `lda<(1+k)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

stbsv(),dtbsv(),ctbsv(),ztbsv()

2.6.13.cublas<t>tpmv()

cublasStatus_t cublasStpmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const float *AP, float *x, int incx)
cublasStatus_t cublasDtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const double *AP, double *x, int incx)
cublasStatus_t cublasCtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuComplex *AP, cuComplex *x, int incx)
cublasStatus_t cublasZtpmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuDoubleComplex *AP, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function performs the triangular packed matrix-vector multiplication

$\textbf{x} = \text{op}(A)\textbf{x}$

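where $A$ is a triangular matrix stored in packed format, and $\mathbf{x}$ is a vector. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$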

If `uplo==CUBLAS_FILL_MODE_LOWER` then the elements in the lower triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+((2*n-j-1)*j)/2]` for $j = 0,\ldots,n-1$ and $i \geq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the elements in the upper triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+(j*(j+1))/2]` for $j = 0,\ldots,n-1$ and $i \leq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.
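The packed layout can be made concrete with two small offset helpers, assuming zero-based `i` and `j` and the standard BLAS column-by-column packing. For the lower case the offset is written as the start of column `j`, which is `((2*n-j+1)*j)/2`, plus the distance `i-j` below the diagonal. All names below are hypothetical, and `ref_tpmv_upper` covers only the no-transpose, non-unit, unit-stride case.

```c
#include <assert.h>
#include <stddef.h>

/* Offset of A(i,j), i <= j, in upper packed storage (zero-based). */
static size_t packed_upper_offset(int i, int j)
{
    assert(0 <= i && i <= j);
    return (size_t)i + (size_t)j * (j + 1) / 2;
}

/* Offset of A(i,j), i >= j, in lower packed storage (zero-based):
 * start of column j plus the distance below the diagonal.
 * Equivalent to i + j*(2*n - j - 1)/2. */
static size_t packed_lower_offset(int i, int j, int n)
{
    assert(0 <= j && j <= i && i < n);
    return (size_t)(i - j) + (size_t)j * (2 * n - j + 1) / 2;
}

/* x = A*x for an upper packed A. Row i only needs x[j] with j >= i,
 * so ascending i can overwrite x in place. */
static void ref_tpmv_upper(int n, const double *AP, double *x)
{
    for (int i = 0; i < n; ++i) {
        double acc = 0.0;
        for (int j = i; j < n; ++j) {
            acc += AP[packed_upper_offset(i, j)] * x[j];
        }
        x[i] = acc;
    }
}
```

For `n = 3` both helpers enumerate the offsets 0 through 5 exactly once, which matches the $\frac{n(n + 1)}{2}$ storage requirement.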

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`n`		input	Number of rows and columns of matrix `A`.
`AP`	device	input	<type> array with $A$ stored in packed format.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`
`CUBLAS_STATUS_ALLOC_FAILED`	The allocation of internal scratch memory failed
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

stpmv(),dtpmv(),ctpmv(),ztpmv()

2.6.14.cublas<t>tpsv()

cublasStatus_t cublasStpsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const float *AP, float *x, int incx)
cublasStatus_t cublasDtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const double *AP, double *x, int incx)
cublasStatus_t cublasCtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuComplex *AP, cuComplex *x, int incx)
cublasStatus_t cublasZtpsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuDoubleComplex *AP, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function solves the packed triangular linear system with a single right-hand-side

$\text{op}(A)\textbf{x} = \textbf{b}$

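where $A$ is a triangular matrix stored in packed format, and $\mathbf{x}$ and $\mathbf{b}$ are vectors. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

The solution $\mathbf{x}$ overwrites the right-hand side $\mathbf{b}$ on exit.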

No test for singularity or near-singularity is included in this function.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the elements in the upper triangular part of the triangular matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+(j*(j+1))/2]` for $j = 0,\ldots,n-1$ and $i \leq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.
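For upper packed storage without transposition, the solve reduces to back substitution over the packed array. The sketch below assumes zero-based indices, unit stride and `double` data; `ref_tpsv_upper` and its `unit_diag` flag are hypothetical names, with `unit_diag` playing the role of the `diag` parameter.

```c
#include <assert.h>
#include <stddef.h>

/* Solve A*x = b in place for an upper packed A. On entry x holds b.
 * When unit_diag is nonzero the diagonal is taken to be 1 and the
 * stored diagonal entries are never read. */
static void ref_tpsv_upper(int unit_diag, int n, const double *AP, double *x)
{
    assert(n >= 0);
    for (int i = n - 1; i >= 0; --i) {
        double s = x[i];
        for (int j = i + 1; j < n; ++j) {
            s -= AP[(size_t)i + (size_t)j * (j + 1) / 2] * x[j];
        }
        x[i] = unit_diag ? s : s / AP[(size_t)i + (size_t)i * (i + 1) / 2];
    }
}
```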

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`n`		input	Number of rows and columns of matrix `A`.
`AP`	device	input	<type> array with $A$ stored in packed format.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

stpsv(),dtpsv(),ctpsv(),ztpsv()

2.6.15.cublas<t>trmv()

cublasStatus_t cublasStrmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const float *A, int lda, float *x, int incx)
cublasStatus_t cublasDtrmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const double *A, int lda, double *x, int incx)
cublasStatus_t cublasCtrmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuComplex *A, int lda, cuComplex *x, int incx)
cublasStatus_t cublasZtrmv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuDoubleComplex *A, int lda, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function performs the triangular matrix-vector multiplication

$\textbf{x} = \text{op}(A)\textbf{x}$

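where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, and $\mathbf{x}$ is a vector. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$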

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`n`		input	Number of rows and columns of matrix `A`.
`A`	device	input	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
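The roles of `trans` and `diag` can be illustrated with a host-side sketch for a lower triangular real matrix. Everything here is an assumption made for illustration: `ref_op_t`, `ref_trmv_lower` and `unit_diag` are hypothetical names, unit stride is assumed, and the scratch copy of `x` is simply the easiest way to update in place on the host, not a description of how the library works internally.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { REF_OP_N, REF_OP_T } ref_op_t;  /* hypothetical stand-ins */

/* x = op(A)*x for a lower triangular column-major A.
 * Returns 0 on success and -1 if the scratch buffer cannot be
 * allocated. With unit_diag set, diagonal entries are not read. */
static int ref_trmv_lower(ref_op_t trans, int unit_diag, int n,
                          const double *A, int lda, double *x)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    if (n == 0) return 0;
    double *tmp = malloc((size_t)n * sizeof *tmp);
    if (tmp == NULL) return -1;
    memcpy(tmp, x, (size_t)n * sizeof *tmp);
    for (int i = 0; i < n; ++i) {
        double acc = unit_diag ? tmp[i]
                               : A[(size_t)i + (size_t)i * lda] * tmp[i];
        for (int j = 0; j < n; ++j) {
            if (j == i) continue;
            /* op(A)(i,j) is A(i,j) for N and A(j,i) for T. */
            int r = (trans == REF_OP_N) ? i : j;
            int c = (trans == REF_OP_N) ? j : i;
            if (r > c) acc += A[(size_t)r + (size_t)c * lda] * tmp[j];
        }
        x[i] = acc;
    }
    free(tmp);
    return 0;
}
```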

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`, or if `lda<max(1,n)`
`CUBLAS_STATUS_ALLOC_FAILED`	The allocation of internal scratch memory failed
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

strmv(),dtrmv(),ctrmv(),ztrmv()

2.6.16.cublas<t>trsv()

cublasStatus_t cublasStrsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const float *A, int lda, float *x, int incx)
cublasStatus_t cublasDtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const double *A, int lda, double *x, int incx)
cublasStatus_t cublasCtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuComplex *A, int lda, cuComplex *x, int incx)
cublasStatus_t cublasZtrsv(cublasHandle_t handle, cublasFillMode_t uplo,
    cublasOperation_t trans, cublasDiagType_t diag, int n,
    const cuDoubleComplex *A, int lda, cuDoubleComplex *x, int incx)

This function supports the 64-bit Integer Interface.

This function solves the triangular linear system with a single right-hand-side

$\text{op}(A)\textbf{x} = \textbf{b}$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, and $\mathbf{x}$ and $\mathbf{b}$ are vectors. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

The solution $\mathbf{x}$ overwrites the right-hand side $\mathbf{b}$ on exit.

No test for singularity or near-singularity is included in this function.
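The solve can be sketched on the host for a lower triangular real matrix: without transposition it is forward substitution, and with transposition the system becomes upper triangular and is solved by back substitution using the same stored entries. `ref_trsv_lower`, `transpose` and `unit_diag` are hypothetical names; unit stride and `double` data are assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Solve op(A)*x = b in place for a lower triangular column-major A.
 * transpose == 0 solves A*x = b, otherwise A^T*x = b. With unit_diag
 * set, the diagonal is taken to be 1 and is never read. */
static void ref_trsv_lower(int transpose, int unit_diag, int n,
                           const double *A, int lda, double *x)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    if (!transpose) {
        for (int i = 0; i < n; ++i) {           /* forward substitution */
            double s = x[i];
            for (int j = 0; j < i; ++j)
                s -= A[(size_t)i + (size_t)j * lda] * x[j];
            x[i] = unit_diag ? s : s / A[(size_t)i + (size_t)i * lda];
        }
    } else {
        for (int i = n - 1; i >= 0; --i) {      /* back substitution */
            double s = x[i];
            for (int j = i + 1; j < n; ++j)     /* A^T(i,j) = A(j,i) */
                s -= A[(size_t)j + (size_t)i * lda] * x[j];
            x[i] = unit_diag ? s : s / A[(size_t)i + (size_t)i * lda];
        }
    }
}
```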

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`n`		input	Number of rows and columns of matrix `A`.
`A`	device	input	<type> array of dimension `lda x n`, with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	in/out	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`, or if `lda<max(1,n)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

strsv(),dtrsv(),ctrsv(),ztrsv()

2.6.17.cublas<t>hemv()

cublasStatus_t cublasChemv(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuComplex *alpha, const cuComplex *A, int lda,
    const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy)
cublasStatus_t cublasZhemv(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda,
    const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian matrix-vector multiplication

$\textbf{y} = \alpha A\textbf{x} + \beta\textbf{y}$

where $A$ is an $n \times n$ Hermitian matrix stored in lower or upper mode, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

This function has an alternate faster implementation using atomics that can be enabled with cublasSetAtomicsMode().

Please see the section on the function cublasSetAtomicsMode() for more details about the usage of atomics.
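The Hermitian product differs from the symmetric one only in that the unreferenced triangle is the complex conjugate of the stored one and that the diagonal is treated as real. The host-side sketch below illustrates this for lower storage. `ref_cplx` is a hypothetical stand-in for a complex type with real and imaginary fields, `ref_hemv_lower` is a hypothetical name, and unit strides are assumed.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { double re, im; } ref_cplx;  /* hypothetical complex type */

static ref_cplx c_add(ref_cplx a, ref_cplx b)
{
    ref_cplx r = { a.re + b.re, a.im + b.im };
    return r;
}

static ref_cplx c_mul(ref_cplx a, ref_cplx b)
{
    ref_cplx r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

/* y = alpha*A*x + beta*y for Hermitian A with the lower triangle
 * stored column-major. Imaginary parts on the diagonal are ignored. */
static void ref_hemv_lower(int n, ref_cplx alpha, const ref_cplx *A, int lda,
                           const ref_cplx *x, ref_cplx beta, ref_cplx *y)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int i = 0; i < n; ++i) {
        ref_cplx acc = { 0.0, 0.0 };
        for (int j = 0; j < n; ++j) {
            ref_cplx a;
            if (i == j) {
                a.re = A[(size_t)i + (size_t)i * lda].re;
                a.im = 0.0;                         /* diagonal is real */
            } else if (i > j) {
                a = A[(size_t)i + (size_t)j * lda]; /* stored entry */
            } else {
                a = A[(size_t)j + (size_t)i * lda]; /* conj of mirror */
                a.im = -a.im;
            }
            acc = c_add(acc, c_mul(a, x[j]));
        }
        ref_cplx out = c_mul(alpha, acc);
        /* With beta == 0 the incoming y is never read. */
        if (beta.re != 0.0 || beta.im != 0.0)
            out = c_add(out, c_mul(beta, y[i]));
        y[i] = out;
    }
}
```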

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x n`, with `lda>=max(1,n)`. The imaginary parts of the diagonal elements are assumed to be zero.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0` or `incy==0`, or if `uplo!=CUBLAS_FILL_MODE_LOWER` and `uplo!=CUBLAS_FILL_MODE_UPPER`, or if `lda<n`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

chemv(),zhemv()

2.6.18.cublas<t>hbmv()

cublasStatus_t cublasChbmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, int k,
    const cuComplex *alpha, const cuComplex *A, int lda,
    const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy)
cublasStatus_t cublasZhbmv(cublasHandle_t handle, cublasFillMode_t uplo, int n, int k,
    const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda,
    const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian banded matrix-vector multiplication

$\textbf{y} = \alpha A\textbf{x} + \beta\textbf{y}$

where $A$ is an $n \times n$ Hermitian banded matrix with $k$ subdiagonals and superdiagonals, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If `uplo==CUBLAS_FILL_MODE_LOWER` then the Hermitian banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row 1, the first subdiagonal in row 2 (starting at first position), the second subdiagonal in row 3 (starting at first position), etc. In general, the element $A(i,j)$ is stored in the memory location `A(1+i-j,j)` for $j = 1,\ldots,n$ and $i \in \lbrack j,\min(n,j + k)\rbrack$. The elements in the array `A` that do not conceptually correspond to elements of the banded matrix (the bottom right $k \times k$ triangle) are not referenced.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the Hermitian banded matrix $A$ is stored column by column, with the main diagonal of the matrix stored in row `k+1`, the first superdiagonal in row `k` (starting at second position), the second superdiagonal in row `k-1` (starting at third position), etc. In general, the element $A(i,j)$ is stored in the memory location `A(1+k+i-j,j)` for $j = 1,\ldots,n$ and $i \in \lbrack\max(1,j - k),j\rbrack$. The elements in the array `A` that do not conceptually correspond to elements of the banded matrix (the top left $k \times k$ triangle) are not referenced.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`k`		input	Number of sub- and super-diagonals of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimensions `lda x n`, with `lda>=k+1`. The imaginary parts of the diagonal elements are assumed to be zero.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `incx==0` or `incy==0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<(1+k)`, or if `alpha` or `beta` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

chbmv(),zhbmv()

2.6.19.cublas<t>hpmv()

cublasStatus_t cublasChpmv(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuComplex *alpha, const cuComplex *AP,
    const cuComplex *x, int incx, const cuComplex *beta, cuComplex *y, int incy)
cublasStatus_t cublasZhpmv(cublasHandle_t handle, cublasFillMode_t uplo, int n,
    const cuDoubleComplex *alpha, const cuDoubleComplex *AP,
    const cuDoubleComplex *x, int incx, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian packed matrix-vector multiplication

$\textbf{y} = \alpha A\textbf{x} + \beta\textbf{y}$

where $A$ is an $n \times n$ Hermitian matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ and $\beta$ are scalars.

If `uplo==CUBLAS_FILL_MODE_LOWER` then the elements in the lower triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+((2*n-j-1)*j)/2]` for $j = 0,\ldots,n-1$ and $i \geq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.

If `uplo==CUBLAS_FILL_MODE_UPPER` then the elements in the upper triangular part of the Hermitian matrix $A$ are packed together column by column without gaps, so that, with zero-based indices, the element $A(i,j)$ is stored in the memory location `AP[i+(j*(j+1))/2]` for $j = 0,\ldots,n-1$ and $i \leq j$. Consequently, the packed format requires only $\frac{n(n + 1)}{2}$ elements for storage.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`AP`	device	input	<type> array with $A$ stored in packed format. The imaginary parts of the diagonal elements are assumed to be zero.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `y` does not have to be a valid input.
`y`	device	in/out	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0` or `incy==0`, or if `uplo!=CUBLAS_FILL_MODE_LOWER` and `uplo!=CUBLAS_FILL_MODE_UPPER`, or if `alpha` or `beta` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

chpmv(), zhpmv()

2.6.20.cublas<t>her()

cublasStatus_t cublasCher(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const cuComplex *x, int incx, cuComplex *A, int lda)
cublasStatus_t cublasZher(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *alpha, const cuDoubleComplex *x, int incx, cuDoubleComplex *A, int lda)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian rank-1 update

$A = \alpha\textbf{x}\textbf{x}^{H} + A$

where $A$ is an $n \times n$ Hermitian matrix stored in column-major format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.
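To make the update concrete, here is a minimal host-side reference of the rank-1 update, written in plain C99 rather than against the cuBLAS API. The function name `her_reference` and the integer `lower` flag (standing in for `uplo`) are hypothetical, and the sketch assumes a positive `incx`. Only the stored triangle is touched, and the diagonal is forced to be real.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Host-side reference for A = alpha * x * x^H + A on the stored triangle only.
   lower != 0 mimics CUBLAS_FILL_MODE_LOWER, lower == 0 mimics CUBLAS_FILL_MODE_UPPER.
   Assumption: incx > 0 (negative strides are not handled in this sketch). */
static void her_reference(int lower, int n, float alpha,
                          const float complex *x, int incx,
                          float complex *A, int lda) {
    assert(n >= 0 && incx > 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;      /* first row stored in column j */
        int last  = lower ? n - 1 : j;  /* last row stored in column j  */
        for (int i = first; i <= last; ++i) {
            size_t idx = (size_t)i + (size_t)j * (size_t)lda;  /* column-major */
            A[idx] += alpha * x[(size_t)i * (size_t)incx]
                            * conjf(x[(size_t)j * (size_t)incx]);
        }
        /* The diagonal of a Hermitian matrix is real: drop the imaginary part. */
        size_t d = (size_t)j + (size_t)j * (size_t)lda;
        A[d] = crealf(A[d]);
    }
}
```

Note that `alpha` is a real scalar here, matching the `const float *alpha` parameter of `cublasCher()`.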

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`A`	device	in/out	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `uplo` is neither `CUBLAS_FILL_MODE_LOWER` nor `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

cher(), zher()

2.6.21.cublas<t>her2()

cublasStatus_t cublasCher2(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *alpha, const cuComplex *x, int incx, const cuComplex *y, int incy, cuComplex *A, int lda)
cublasStatus_t cublasZher2(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx, const cuDoubleComplex *y, int incy, cuDoubleComplex *A, int lda)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian rank-2 update

$A = \alpha\textbf{x}\textbf{y}^{H} + \bar{\alpha}\textbf{y}\textbf{x}^{H} + A$

where $A$ is an $n \times n$ Hermitian matrix stored in column-major format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.
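The role of the conjugated scalar is easiest to see in a small host-side reference. The sketch below is plain C99, not a cuBLAS call; the name `her2_lower_reference` is hypothetical, and it assumes lower-triangle storage and unit strides to stay short. The two terms added to each diagonal element are complex conjugates of each other, which is why the diagonal stays real.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Host-side reference for A = alpha*x*y^H + conj(alpha)*y*x^H + A,
   updating only the lower triangle (as with CUBLAS_FILL_MODE_LOWER).
   Assumption: incx == incy == 1. */
static void her2_lower_reference(int n, float complex alpha,
                                 const float complex *x,
                                 const float complex *y,
                                 float complex *A, int lda) {
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = j; i < n; ++i) {
            size_t idx = (size_t)i + (size_t)j * (size_t)lda;  /* column-major */
            A[idx] += alpha * x[i] * conjf(y[j])
                    + conjf(alpha) * y[i] * conjf(x[j]);
        }
        /* Both diagonal contributions are conjugates, so their sum is real. */
        size_t d = (size_t)j + (size_t)j * (size_t)lda;
        A[d] = crealf(A[d]);
    }
}
```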

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`A`	device	in/out	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `uplo` is neither `CUBLAS_FILL_MODE_LOWER` nor `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

cher2(), zher2()

2.6.22.cublas<t>hpr()

cublasStatus_t cublasChpr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *alpha, const cuComplex *x, int incx, cuComplex *AP)
cublasStatus_t cublasZhpr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *alpha, const cuDoubleComplex *x, int incx, cuDoubleComplex *AP)

This function supports the 64-bit Integer Interface.

This function performs the packed Hermitian rank-1 update

$A = \alpha\textbf{x}\textbf{x}^{H} + A$

where $A$ is an $n \times n$ Hermitian matrix stored in packed format, $\mathbf{x}$ is a vector, and $\alpha$ is a scalar.
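With packed storage the update walks the array linearly instead of computing two-dimensional offsets. The following host-side sketch illustrates this for upper-triangle storage and unit stride; the name `hpr_upper_reference` is hypothetical, and the code is a plain C99 reference, not the cuBLAS implementation.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* Host-side reference for the packed update A = alpha * x * x^H + A,
   with the upper triangle packed column by column (CUBLAS_FILL_MODE_UPPER).
   Assumption: incx == 1. */
static void hpr_upper_reference(int n, float alpha,
                                const float complex *x,
                                float complex *AP) {
    assert(n >= 0);
    size_t k = 0;                        /* running offset into AP */
    for (int j = 0; j < n; ++j) {        /* column j stores rows 0..j */
        for (int i = 0; i <= j; ++i, ++k) {
            AP[k] += alpha * x[i] * conjf(x[j]);
        }
        /* The last element written in column j is the diagonal A(j,j). */
        AP[k - 1] = crealf(AP[k - 1]);
    }
}
```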

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`AP`	device	in/out	<type> array with $A$ stored in packed format. The imaginary parts of the diagonal elements are assumed and set to zero.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `uplo` is neither `CUBLAS_FILL_MODE_LOWER` nor `CUBLAS_FILL_MODE_UPPER`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

chpr(), zhpr()

2.6.23.cublas<t>hpr2()

cublasStatus_t cublasChpr2(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *alpha, const cuComplex *x, int incx, const cuComplex *y, int incy, cuComplex *AP)
cublasStatus_t cublasZhpr2(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *x, int incx, const cuDoubleComplex *y, int incy, cuDoubleComplex *AP)

This function supports the 64-bit Integer Interface.

This function performs the packed Hermitian rank-2 update

$A = \alpha\textbf{x}\textbf{y}^{H} + \bar{\alpha}\textbf{y}\textbf{x}^{H} + A$

where $A$ is an $n \times n$ Hermitian matrix stored in packed format, $\mathbf{x}$ and $\mathbf{y}$ are vectors, and $\alpha$ is a scalar.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`n`		input	Number of rows and columns of matrix `A`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`x`	device	input	<type> vector with `n` elements.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`incy`		input	Stride between consecutive elements of `y`.
`AP`	device	in/out	<type> array with $A$ stored in packed format. The imaginary parts of the diagonal elements are assumed and set to zero.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `incx==0`, or if `uplo` is neither `CUBLAS_FILL_MODE_LOWER` nor `CUBLAS_FILL_MODE_UPPER`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

chpr2(), zhpr2()

2.6.24.cublas<t>gemvBatched()

cublasStatus_t cublasSgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const float *const Aarray[], int lda, const float *const xarray[], int incx, const float *beta, float *const yarray[], int incy, int batchCount)
cublasStatus_t cublasDgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *const Aarray[], int lda, const double *const xarray[], int incx, const double *beta, double *const yarray[], int incy, int batchCount)
cublasStatus_t cublasCgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const cuComplex *alpha, const cuComplex *const Aarray[], int lda, const cuComplex *const xarray[], int incx, const cuComplex *beta, cuComplex *const yarray[], int incy, int batchCount)
cublasStatus_t cublasZgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *const Aarray[], int lda, const cuDoubleComplex *const xarray[], int incx, const cuDoubleComplex *beta, cuDoubleComplex *const yarray[], int incy, int batchCount)
#if defined(__cplusplus)
cublasStatus_t cublasHSHgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __half *const Aarray[], int lda, const __half *const xarray[], int incx, const float *beta, __half *const yarray[], int incy, int batchCount)
cublasStatus_t cublasHSSgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __half *const Aarray[], int lda, const __half *const xarray[], int incx, const float *beta, float *const yarray[], int incy, int batchCount)
cublasStatus_t cublasTSTgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __nv_bfloat16 *const Aarray[], int lda, const __nv_bfloat16 *const xarray[], int incx, const float *beta, __nv_bfloat16 *const yarray[], int incy, int batchCount)
cublasStatus_t cublasTSSgemvBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __nv_bfloat16 *const Aarray[], int lda, const __nv_bfloat16 *const xarray[], int incx, const float *beta, float *const yarray[], int incy, int batchCount)
#endif

This function supports the 64-bit Integer Interface.

$\textbf{y}\lbrack i\rbrack = \alpha\text{op}(A\lbrack i\rbrack)\textbf{x}\lbrack i\rbrack + \beta\textbf{y}\lbrack i\rbrack,\text{ for i} \in \lbrack 0,batchCount - 1\rbrack$

where $\alpha$ and $\beta$ are scalars, $A$ is an array of pointers to matrices $A\lbrack i\rbrack$ stored in column-major format with dimension $m \times n$, and $\textbf{x}$ and $\textbf{y}$ are arrays of pointers to vectors. Also, for matrix $A\lbrack i\rbrack$,

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix}{A\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\{A\lbrack i\rbrack}^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\{A\lbrack i\rbrack}^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

Note

$\textbf{y}\lbrack i\rbrack$ vectors must not overlap, i.e. the individual gemv operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemv() in different CUDA streams, rather than use this API.
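The batched semantics can be summarized by a short host-side reference. The sketch below is plain C for real single precision, not the cuBLAS implementation; the name `gemv_batched_reference` is hypothetical, the integer `transpose` flag stands in for `trans` (covering only `CUBLAS_OP_N` and `CUBLAS_OP_T`), and unit strides are assumed. Each instance writes only its own `y[i]`, which is why the output vectors must not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for y[i] = alpha * op(A[i]) * x[i] + beta * y[i].
   transpose == 0 mimics CUBLAS_OP_N, otherwise CUBLAS_OP_T.
   Assumption: incx == incy == 1. */
static void gemv_batched_reference(int transpose, int m, int n, float alpha,
                                   const float *const Aarray[], int lda,
                                   const float *const xarray[], float beta,
                                   float *const yarray[], int batchCount) {
    assert(m >= 0 && n >= 0 && batchCount >= 0 && lda >= (m > 1 ? m : 1));
    int ylen = transpose ? n : m;  /* length of each y[i] */
    int xlen = transpose ? m : n;  /* length of each x[i] */
    for (int b = 0; b < batchCount; ++b) {
        const float *A = Aarray[b];
        const float *x = xarray[b];
        float *y = yarray[b];
        for (int r = 0; r < ylen; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < xlen; ++c) {
                /* op(A)(r,c) is A(r,c) or A(c,r); A is stored column-major. */
                size_t idx = transpose ? (size_t)c + (size_t)r * (size_t)lda
                                       : (size_t)r + (size_t)c * (size_t)lda;
                acc += A[idx] * x[c];
            }
            /* With beta == 0 the previous contents of y are never read. */
            y[r] = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * y[r];
        }
    }
}
```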

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`trans`		input	Operation op(`A[i]`) that is non-transpose, transpose, or conjugate transpose.
`m`		input	Number of rows of matrix `A[i]`.
`n`		input	Number of columns of matrix `A[i]`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`Aarray`	device	input	Array of pointers to <type> arrays, each of dimensions `lda x n` with `lda>=max(1,m)`. All pointers must meet certain alignment criteria. Please see below for details.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `A[i]`.
`xarray`	device	input	Array of pointers to <type> arrays, each of dimension `n` if `trans==CUBLAS_OP_N` and `m` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
`incx`		input	Stride of each one-dimensional array `x[i]`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `y` does not have to be a valid input.
`yarray`	device	in/out	Array of pointers to <type> arrays, each of dimension `m` if `trans==CUBLAS_OP_N` and `n` otherwise. Vectors `y[i]` should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details.
`incy`		input	Stride of each one-dimensional array `y[i]`.
`batchCount`		input	Number of pointers contained in `Aarray`, `xarray` and `yarray`.

If the math mode enables fast math modes when using cublasSgemvBatched(), the pointers (not the pointer arrays) placed in GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally, all pointers are aligned to at least 16 bytes. Otherwise, it is recommended that they meet the following rule:

if `k%4==0` then ensure `intptr_t(ptr)%16==0`.
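One way to catch misaligned batch entries early is to test the pointer values on the host before copying the pointer array to the device. The sketch below only inspects the numeric pointer values and never dereferences them; the helper names `is_aligned` and `batch_is_16_byte_aligned` are hypothetical and not part of cuBLAS.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 when ptr is a multiple of `alignment` bytes.
   `alignment` must be a power of two. */
static int is_aligned(const void *ptr, uintptr_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Returns 1 when every pointer of a host-side copy of the batch
   pointer array is 16-byte aligned, 0 otherwise. */
static int batch_is_16_byte_aligned(const void *const ptrs[], int count) {
    for (int i = 0; i < count; ++i) {
        if (!is_aligned(ptrs[i], 16)) {
            return 0;
        }
    }
    return 1;
}
```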

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0`, `n<0`, or `batchCount<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

2.6.25.cublas<t>gemvStridedBatched()

cublasStatus_t cublasSgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const float *A, int lda, long long int strideA, const float *x, int incx, long long int stridex, const float *beta, float *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasDgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const double *alpha, const double *A, int lda, long long int strideA, const double *x, int incx, long long int stridex, const double *beta, double *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasCgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const cuComplex *alpha, const cuComplex *A, int lda, long long int strideA, const cuComplex *x, int incx, long long int stridex, const cuComplex *beta, cuComplex *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasZgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, long long int strideA, const cuDoubleComplex *x, int incx, long long int stridex, const cuDoubleComplex *beta, cuDoubleComplex *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasHSHgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __half *A, int lda, long long int strideA, const __half *x, int incx, long long int stridex, const float *beta, __half *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasHSSgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __half *A, int lda, long long int strideA, const __half *x, int incx, long long int stridex, const float *beta, float *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasTSTgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __nv_bfloat16 *A, int lda, long long int strideA, const __nv_bfloat16 *x, int incx, long long int stridex, const float *beta, __nv_bfloat16 *y, int incy, long long int stridey, int batchCount)
cublasStatus_t cublasTSSgemvStridedBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, const float *alpha, const __nv_bfloat16 *A, int lda, long long int strideA, const __nv_bfloat16 *x, int incx, long long int stridex, const float *beta, float *y, int incy, long long int stridey, int batchCount)

This function supports the 64-bit Integer Interface.

This function performs the matrix-vector multiplication of a batch of matrices and vectors. The batch is considered to be “uniform”, i.e. all instances have the same dimensions (m, n), leading dimension (lda), increments (incx, incy) and transposition (trans) for their respective A matrix, x and y vectors. Input matrix A and vector x, and output vector y for each instance of the batch are located at fixed offsets in number of elements from their locations in the previous instance. Pointers to A matrix, x and y vectors for the first instance are passed to the function by the user along with offsets in number of elements - strideA, stridex and stridey that determine the locations of input matrices and vectors, and output vectors in future instances.

$\textbf{y} + i*{stridey} = \alpha\text{op}(A + i*{strideA})(\textbf{x} + i*{stridex}) + \beta(\textbf{y} + i*{stridey}),\text{ for i } \in \lbrack 0,batchCount - 1\rbrack$

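where $\alpha$ and $\beta$ are scalars, $A\lbrack i\rbrack$ is the matrix of the $i$-th instance, stored in column-major format with dimension $m \times n$ at an offset of $i*{strideA}$ elements from $A$, and $\textbf{x}\lbrack i\rbrack$ and $\textbf{y}\lbrack i\rbrack$ are the vectors of the $i$-th instance, at offsets of $i*{stridex}$ and $i*{stridey}$ elements from $\textbf{x}$ and $\textbf{y}$. Also, for matrix $A\lbrack i\rbrack$,

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix}{A\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\{A\lbrack i\rbrack}^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\{A\lbrack i\rbrack}^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$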

Note

$\textbf{y}\lbrack i\rbrack$ vectors must not overlap, i.e. the individual gemv operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemv() in different CUDA streams, rather than use this API.

Note

In the table below, we use `A[i]`, `x[i]`, `y[i]` as notation for the A matrix and the x and y vectors in the ith instance of the batch, implicitly assuming they are respectively offset by `strideA`, `stridex`, `stridey` elements from `A[i-1]`, `x[i-1]`, `y[i-1]`. The unit for the offset is number of elements and must not be zero.
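In the strided variant, the per-instance pointers are obtained by pointer arithmetic on the three base pointers. The host-side sketch below illustrates that addressing for real single precision; it is a plain C reference, not the cuBLAS implementation. The name `gemv_strided_batched_reference` is hypothetical, and the sketch assumes `CUBLAS_OP_N` and unit increments.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side reference for
   (y + i*stridey) = alpha * (A + i*strideA) * (x + i*stridex) + beta * (y + i*stridey).
   Assumptions: trans == CUBLAS_OP_N and incx == incy == 1. */
static void gemv_strided_batched_reference(int m, int n, float alpha,
                                           const float *A, int lda,
                                           long long strideA,
                                           const float *x, long long stridex,
                                           float beta,
                                           float *y, long long stridey,
                                           int batchCount) {
    assert(m >= 0 && n >= 0 && batchCount >= 0 && lda >= (m > 1 ? m : 1));
    for (int b = 0; b < batchCount; ++b) {
        /* All strides are counted in elements, not bytes. */
        const float *Ab = A + b * strideA;
        const float *xb = x + b * stridex;
        float *yb = y + b * stridey;
        for (int r = 0; r < m; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < n; ++c) {
                acc += Ab[(size_t)r + (size_t)c * (size_t)lda] * xb[c];
            }
            yb[r] = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * yb[r];
        }
    }
}
```

For densely packed instances, `strideA = lda * n`, `stridex = n`, and `stridey = m` place each instance immediately after the previous one.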

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`trans`		input	Operation op(`A[i]`) that is non-transpose, transpose, or conjugate transpose.
`m`		input	Number of rows of matrix `A[i]`.
`n`		input	Number of columns of matrix `A[i]`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type>* pointer to the A matrix corresponding to the first instance of the batch, with dimensions `lda x n` with `lda>=max(1,m)`.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `A[i]`.
`strideA`		input	Value of type long long int that gives the offset in number of elements between `A[i]` and `A[i+1]`.
`x`	device	input	<type>* pointer to the x vector corresponding to the first instance of the batch, with each dimension `n` if `trans==CUBLAS_OP_N` and `m` otherwise.
`incx`		input	Stride of each one-dimensional array `x[i]`.
`stridex`		input	Value of type long long int that gives the offset in number of elements between `x[i]` and `x[i+1]`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `y` does not have to be a valid input.
`y`	device	in/out	<type>* pointer to the y vector corresponding to the first instance of the batch, with each dimension `m` if `trans==CUBLAS_OP_N` and `n` otherwise. Vectors `y[i]` should not overlap; otherwise, undefined behavior is expected.
`incy`		input	Stride of each one-dimensional array `y[i]`.
`stridey`		input	Value of type long long int that gives the offset in number of elements between `y[i]` and `y[i+1]`.
`batchCount`		input	Number of GEMVs to perform in the batch.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0`, `n<0`, or `batchCount<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

2.7.cuBLAS Level-3 Function Reference

In this chapter we describe the Level-3 Basic Linear Algebra Subprograms (BLAS3) functions that perform matrix-matrix operations.

2.7.1.cublas<t>gemm()

cublasStatus_t cublasSgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc)
cublasStatus_t cublasDgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc)
cublasStatus_t cublasCgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasZgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)
cublasStatus_t cublasHgemm(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const __half *alpha, const __half *A, int lda, const __half *B, int ldb, const __half *beta, __half *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix multiplication

$C = \alpha\text{op}(A)\text{op}(B) + \beta C$

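where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$, $\text{op}(B)$ $k \times n$ and $C$ $m \times n$, respectively. Also, for matrix $A$,

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$.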
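The following host-side sketch spells out the operation for real single precision in column-major storage. It is a plain C reference meant for checking small results, not the cuBLAS implementation; the names `op_at` and `gemm_reference` are hypothetical, and the integer flags `transa` and `transb` cover only the non-transposed (0) and transposed (non-zero) cases.

```c
#include <assert.h>
#include <stddef.h>

/* Element (r,c) of op(M) for a column-major matrix M with leading dimension ld. */
static float op_at(const float *M, int ld, int transpose, int r, int c) {
    return transpose ? M[(size_t)c + (size_t)r * (size_t)ld]
                     : M[(size_t)r + (size_t)c * (size_t)ld];
}

/* Host-side reference for C = alpha * op(A) * op(B) + beta * C. */
static void gemm_reference(int transa, int transb, int m, int n, int k,
                           float alpha, const float *A, int lda,
                           const float *B, int ldb,
                           float beta, float *C, int ldc) {
    assert(m >= 0 && n >= 0 && k >= 0 && ldc >= (m > 1 ? m : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += op_at(A, lda, transa, i, p) * op_at(B, ldb, transb, p, j);
            }
            size_t idx = (size_t)i + (size_t)j * (size_t)ldc;
            /* With beta == 0 the previous contents of C are never read. */
            C[idx] = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * C[idx];
        }
    }
}
```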

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A`) that is non-transpose, transpose, or conjugate transpose.
`transb`		input	Operation op(`B`) that is non-transpose, transpose, or conjugate transpose.
`m`		input	Number of rows of matrix op(`A`) and `C`.
`n`		input	Number of columns of matrix op(`B`) and `C`.
`k`		input	Number of columns of op(`A`) and rows of op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `A`.
`B`	device	input	<type> array of dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` and `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `alpha` or `beta` is NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_ARCH_MISMATCH`	In the case of cublasHgemm(), the device does not support math in half precision.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgemm(), dgemm(), cgemm(), zgemm()

2.7.2.cublas<t>gemm3m()

cublasStatus_t cublasCgemm3m(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasZgemm3m(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the complex matrix-matrix multiplication using the Gauss complexity reduction algorithm, which forms a complex product from three real multiplications instead of four. This can lead to an increase in performance of up to 25%.

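$C = \alpha\text{op}(A)\text{op}(B) + \beta C$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$, $\text{op}(B)$ $k \times n$ and $C$ $m \times n$, respectively. Also, for matrix $A$,

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$.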
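The idea behind the Gauss reduction can be shown on a single complex product. The scalar sketch below computes `(a + bi)(c + di)` with three real multiplications; applied to real and imaginary matrix parts, the same identity needs three real matrix products instead of four. How cuBLAS arranges this internally is not described here, so treat the code only as an illustration of the identity; the names `mul3` and `mul3_self_check` are hypothetical.

```c
#include <assert.h>
#include <complex.h>

/* Gauss's trick: (a + bi)(c + di) using three real multiplications. */
static float complex mul3(float complex u, float complex v) {
    float a = crealf(u), b = cimagf(u);
    float c = crealf(v), d = cimagf(v);
    float t1 = c * (a + b);  /* ac + bc */
    float t2 = a * (d - c);  /* ad - ac */
    float t3 = b * (c + d);  /* bc + bd */
    /* real = t1 - t3 = ac - bd, imag = t1 + t2 = bc + ad */
    return (t1 - t3) + (t1 + t2) * I;
}

/* Self-check: (1 + 2i)(3 + 4i) = -5 + 10i. */
static void mul3_self_check(void) {
    float complex p = mul3(1.0f + 2.0f * I, 3.0f + 4.0f * I);
    assert(crealf(p) == -5.0f && cimagf(p) == 10.0f);
}
```

The saving in multiplications is paid for with extra additions, and the rounding behavior differs from the four-multiplication form.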

Note

These two routines are supported only on GPUs with compute capability 5.0 or higher.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A`) that is non-transpose, transpose, or conjugate transpose.
`transb`		input	Operation op(`B`) that is non-transpose, transpose, or conjugate transpose.
`m`		input	Number of rows of matrix op(`A`) and `C`.
`n`		input	Number of columns of matrix op(`B`) and `C`.
`k`		input	Number of columns of op(`A`) and rows of op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `A`.
`B`	device	input	<type> array of dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed in the following table:

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` and `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `alpha` or `beta` is NULL, or if `C` is NULL when `beta` is not zero.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to NETLIB documentation:

cgemm(), zgemm()

2.7.3.cublas<t>gemmBatched()

cublasStatus_t cublasHgemmBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const __half *alpha, const __half *const Aarray[], int lda, const __half *const Barray[], int ldb, const __half *beta, __half *const Carray[], int ldc, int batchCount)
cublasStatus_t cublasSgemmBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *const Aarray[], int lda, const float *const Barray[], int ldb, const float *beta, float *const Carray[], int ldc, int batchCount)
cublasStatus_t cublasDgemmBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double *const Aarray[], int lda, const double *const Barray[], int ldb, const double *beta, double *const Carray[], int ldc, int batchCount)
cublasStatus_t cublasCgemmBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const cuComplex *const Aarray[], int lda, const cuComplex *const Barray[], int ldb, const cuComplex *beta, cuComplex *const Carray[], int ldc, int batchCount)
cublasStatus_t cublasZgemmBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *const Aarray[], int lda, const cuDoubleComplex *const Barray[], int ldb, const cuDoubleComplex *beta, cuDoubleComplex *const Carray[], int ldc, int batchCount)

This function supports the 64-bit Integer Interface.

$C\lbrack i\rbrack = \alpha\text{op}(A\lbrack i\rbrack)\text{op}(B\lbrack i\rbrack) + \beta C\lbrack i\rbrack,\text{ for i } \in \lbrack 0,batchCount - 1\rbrack$

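where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are arrays of pointers to matrices stored in column-major format with dimensions $\text{op}(A\lbrack i\rbrack)$ $m \times k$, $\text{op}(B\lbrack i\rbrack)$ $k \times n$ and $C\lbrack i\rbrack$ $m \times n$, respectively. Also, for matrix $A\lbrack i\rbrack$,

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix}{A\lbrack i\rbrack} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\{A\lbrack i\rbrack}^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\{A\lbrack i\rbrack}^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

and $\text{op}(B\lbrack i\rbrack)$ is defined similarly for matrix $B\lbrack i\rbrack$.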

Note

$C\lbrack i\rbrack$ matrices must not overlap, that is, the individual gemm operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemm() in different CUDA streams, rather than use this API.
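A common way to prepare the pointer arrays, shown here only as an assumed usage pattern, is to keep all matrices of one operand in a single contiguous allocation and compute one pointer per instance. The sketch below performs that pointer arithmetic on the host; in a real program `base` would be a device pointer, and the filled array would then be copied to device memory before the call. The helper name `fill_batch_pointers` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Fills ptrs[i] = base + i * stride (stride in elements) for i in [0, batchCount).
   The pointers are only computed here, never dereferenced. */
static void fill_batch_pointers(float *base, size_t stride, int batchCount,
                                float *ptrs[]) {
    assert(batchCount >= 0);
    for (int i = 0; i < batchCount; ++i) {
        ptrs[i] = base + (size_t)i * stride;
    }
}
```

Choosing a stride of at least `ldc * n` elements for the output operand keeps the `C[i]` matrices from overlapping.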

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A[i]`) that is non-transpose, transpose, or conjugate transpose.
`transb`		input	Operation op(`B[i]`) that is non-transpose, transpose, or conjugate transpose.
`m`		input	Number of rows of matrix op(`A[i]`) and `C[i]`.
`n`		input	Number of columns of op(`B[i]`) and `C[i]`.
`k`		input	Number of columns of op(`A[i]`) and rows of op(`B[i]`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`Aarray`	device	input	Array of pointers to <type> arrays, each of dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `A[i]`.
`Barray`	device	input	Array of pointers to <type> arrays, each of dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
`ldb`		input	Leading dimension of two-dimensional array used to store each matrix `B[i]`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`Carray`	device	in/out	Array of pointers to <type> arrays, each of dimensions `ldc x n` with `ldc>=max(1,m)`. Matrices `C[i]` should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details.
`ldc`		input	Leading dimension of two-dimensional array used to store each matrix `C[i]`.
`batchCount`		input	Number of pointers contained in `Aarray`, `Barray` and `Carray`.

If the math mode enables fast math modes when using cublasSgemmBatched(), the pointers (not the pointer arrays) placed in GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally, all pointers are aligned to at least 16 bytes. Otherwise, it is recommended that they meet the following rule:

if `k%4==0` then ensure `intptr_t(ptr)%16==0`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` and `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasHgemmBatched() is supported only on GPUs with compute capability 5.3 or higher

2.7.4.cublas<t>gemmStridedBatched()

cublasStatus_t cublasHgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const __half *alpha, const __half *A, int lda, long long int strideA, const __half *B, int ldb, long long int strideB, const __half *beta, __half *C, int ldc, long long int strideC, int batchCount)
cublasStatus_t cublasSgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const float *A, int lda, long long int strideA, const float *B, int ldb, long long int strideB, const float *beta, float *C, int ldc, long long int strideC, int batchCount)
cublasStatus_t cublasDgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double *A, int lda, long long int strideA, const double *B, int ldb, long long int strideB, const double *beta, double *C, int ldc, long long int strideC, int batchCount)
cublasStatus_t cublasCgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, long long int strideA, const cuComplex *B, int ldb, long long int strideB, const cuComplex *beta, cuComplex *C, int ldc, long long int strideC, int batchCount)
cublasStatus_t cublasCgemm3mStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, long long int strideA, const cuComplex *B, int ldb, long long int strideB, const cuComplex *beta, cuComplex *C, int ldc, long long int strideC, int batchCount)
cublasStatus_t cublasZgemmStridedBatched(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, long long int strideA, const cuDoubleComplex *B, int ldb, long long int strideB, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc, long long int strideC, int batchCount)

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix multiplication of a batch of matrices. The batch is considered to be “uniform”, i.e. all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. Input matrices A, B and output matrix C for each instance of the batch are located at fixed offsets in number of elements from their locations in the previous instance. Pointers to A, B and C matrices for the first instance are passed to the function by the user along with offsets in number of elements - strideA, strideB and strideC that determine the locations of input and output matrices in future instances.

$C + i*{strideC} = \alpha\text{op}(A + i*{strideA})\text{op}(B + i*{strideB}) + \beta(C + i*{strideC}),\text{ for i } \in \lbrack 0,batchCount - 1\rbrack$

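For matrix $A\lbrack i\rbrack$

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix}A\lbrack i\rbrack & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\A\lbrack i\rbrack^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\A\lbrack i\rbrack^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

and $\text{op}(B\lbrack i\rbrack)$ is defined similarly for matrix $B\lbrack i\rbrack$.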
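The strided addressing described above can be sketched on the CPU as follows. This is a minimal illustration under stated assumptions, not cuBLAS code: `ref_sgemm_strided_batched` and `IDX` are hypothetical names, only the `CUBLAS_OP_N` case is handled, and the scalars are passed by value for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Hypothetical CPU reference (not a cuBLAS function):
 * C[i] = alpha * A[i] * B[i] + beta * C[i] for i in [0, batchCount),
 * without transposition. */
void ref_sgemm_strided_batched(int m, int n, int k, float alpha,
                               const float *A, int lda, long long strideA,
                               const float *B, int ldb, long long strideB,
                               float beta,
                               float *C, int ldc, long long strideC,
                               int batchCount)
{
    assert(m >= 0 && n >= 0 && k >= 0 && batchCount >= 0);
    for (int i = 0; i < batchCount; ++i) {
        /* Instance i starts i * stride elements (not bytes) after instance 0. */
        const float *Ai = A + (long long)i * strideA;
        const float *Bi = B + (long long)i * strideB;
        float *Ci = C + (long long)i * strideC;
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += Ai[IDX(row, p, lda)] * Bi[IDX(p, col, ldb)];
                float *c = &Ci[IDX(row, col, ldc)];
                /* When beta == 0, C is not read. */
                *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
            }
        }
    }
}
```

In this sketch, choosing `strideA = lda * k` packs the A matrices back to back; a larger stride leaves padding between consecutive instances.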

Note

$C\lbrack i\rbrack$ matrices must not overlap, i.e. the individual gemm operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemm() in different CUDA streams, rather than use this API.

Note

In the table below, we use A[i], B[i], C[i] as notation for A, B and C matrices in the ith instance of the batch, implicitly assuming they are respectively offsets in number of elements strideA, strideB, strideC away from A[i-1], B[i-1], C[i-1]. The unit for the offset is number of elements and must not be zero.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A[i]`) that is non- or (conj.) transpose.
`transb`		input	Operation op(`B[i]`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix op(`A[i]`) and `C[i]`.
`n`		input	Number of columns of op(`B[i]`) and `C[i]`.
`k`		input	Number of columns of op(`A[i]`) and rows of op(`B[i]`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type>* pointer to the A matrix corresponding to the first instance of the batch, with dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `A[i]`.
`strideA`		input	Value of type long long int that gives the offset in number of elements between `A[i]` and `A[i+1]`.
`B`	device	input	<type>* pointer to the B matrix corresponding to the first instance of the batch, with dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store each matrix `B[i]`.
`strideB`		input	Value of type long long int that gives the offset in number of elements between `B[i]` and `B[i+1]`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`C`	device	in/out	<type>* pointer to the C matrix corresponding to the first instance of the batch, with dimensions `ldc x n` with `ldc>=max(1,m)`. Matrices `C[i]` should not overlap; otherwise, undefined behavior is expected.
`ldc`		input	Leading dimension of two-dimensional array used to store each matrix `C[i]`.
`strideC`		input	Value of type long long int that gives the offset in number of elements between `C[i]` and `C[i+1]`.
`batchCount`		input	Number of GEMMs to perform in the batch.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` and `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasHgemmStridedBatched() is only supported on GPUs with compute capability 5.3 or greater

2.7.5.cublas<t>gemmGroupedBatched()

cublasStatus_t cublasSgemmGroupedBatched(cublasHandle_t handle, const cublasOperation_t transa_array[], const cublasOperation_t transb_array[], const int m_array[], const int n_array[], const int k_array[], const float alpha_array[], const float *const Aarray[], const int lda_array[], const float *const Barray[], const int ldb_array[], const float beta_array[], float *const Carray[], const int ldc_array[], int group_count, const int group_size[])
cublasStatus_t cublasDgemmGroupedBatched(cublasHandle_t handle, const cublasOperation_t transa_array[], const cublasOperation_t transb_array[], const int m_array[], const int n_array[], const int k_array[], const double alpha_array[], const double *const Aarray[], const int lda_array[], const double *const Barray[], const int ldb_array[], const double beta_array[], double *const Carray[], const int ldc_array[], int group_count, const int group_size[])

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix multiplication on groups of matrices. A given group is considered to be “uniform”, i.e. all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. However, the dimensions, leading dimensions, transpositions, and scaling factors (alpha, beta) may vary between groups. The address of the input matrices and the output matrix of each instance of the batch are read from arrays of pointers passed to the function by the caller. This is functionally equivalent to the following:

idx = 0;
for i = 0:group_count - 1
    for j = 0:group_size[i] - 1
        gemm(transa_array[i], transb_array[i], m_array[i], n_array[i], k_array[i],
             alpha_array[i], Aarray[idx], lda_array[i], Barray[idx], ldb_array[i],
             beta_array[i], Carray[idx], ldc_array[i]);
        idx += 1;
    end
end

where $\mathrm{alpha\_array}$ and $\mathrm{beta\_array}$ are arrays of scaling factors, and $\text{Aarray}$, $\text{Barray}$ and $\text{Carray}$ are arrays of pointers to matrices stored in column-major format. For a given index, $\text{idx}$, that is part of group $i$, the dimensions are:

$\text{op}(\text{Aarray}\lbrack\text{idx}\rbrack)$: $\mathrm{m\_array}\lbrack i\rbrack \times \mathrm{k\_array}\lbrack i\rbrack$
$\text{op}(\text{Barray}\lbrack\text{idx}\rbrack)$: $\mathrm{k\_array}\lbrack i\rbrack \times \mathrm{n\_array}\lbrack i\rbrack$
$\text{Carray}\lbrack\text{idx}\rbrack$: $\mathrm{m\_array}\lbrack i\rbrack \times \mathrm{n\_array}\lbrack i\rbrack$

Note

This API takes arrays of two different lengths. The arrays of dimensions, leading dimensions, transpositions, and scaling factors are of length group_count and the arrays of matrices are of length problem_count where $\mathrm{problem\_count} = \sum_{i = 0}^{\mathrm{group\_count} - 1} \mathrm{group\_size}\lbrack i\rbrack$.
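The relationship between the two array lengths can be sketched with a small host-side helper. This is an illustration only: `grouped_problem_count` and `group_start` are hypothetical names, not part of cuBLAS.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuBLAS function): returns problem_count, the
 * required length of Aarray, Barray and Carray. If group_start is non-NULL,
 * it also records the first flat index idx that belongs to each group. */
int grouped_problem_count(int group_count, const int group_size[],
                          int group_start[])
{
    assert(group_count >= 0);
    int idx = 0;
    for (int i = 0; i < group_count; ++i) {
        assert(group_size[i] >= 0);
        if (group_start != NULL)
            group_start[i] = idx;
        idx += group_size[i];
    }
    return idx;
}
```

For example, with `group_size = {2, 0, 3}` the pointer arrays need five entries: entries 0 and 1 belong to group 0, and entries 2 through 4 belong to group 2.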

For matrix $A[\text{idx}]$ in group $i$

$\text{op}(A[\text{idx}]) = \left\{ \begin{matrix}A[\text{idx}] & {\text{if }\textsf{$\mathrm{transa\_array}\lbrack i\rbrack$ == $\mathrm{CUBLAS\_OP\_N}$}} \\A[\text{idx}]^{T} & {\text{if }\textsf{$\mathrm{transa\_array}\lbrack i\rbrack$ == $\mathrm{CUBLAS\_OP\_T}$}} \\A[\text{idx}]^{H} & {\text{if }\textsf{$\mathrm{transa\_array}\lbrack i\rbrack$ == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

and $\text{op}(B[\text{idx}])$ is defined similarly for matrix $B[\text{idx}]$ in group $i$.

Note

$C\lbrack\text{idx}\rbrack$ matrices must not overlap, that is, the individual gemm operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemmBatched() in different CUDA streams, rather than use this API.

Param.	Memory	In/out	Meaning	Array Length
`handle`		input	Handle to the cuBLAS library context.
`transa_array`	host	input	Operation op(`A[idx]`) that is non- or (conj.) transpose for each group.	group_count
`transb_array`	host	input	Operation op(`B[idx]`) that is non- or (conj.) transpose for each group.	group_count
`m_array`	host	input	Array containing the number of rows of matrix op(`A[idx]`) and `C[idx]` for each group.	group_count
`n_array`	host	input	Array containing the number of columns of op(`B[idx]`) and `C[idx]` for each group.	group_count
`k_array`	host	input	Array containing the number of columns of op(`A[idx]`) and rows of op(`B[idx]`) for each group.	group_count
`alpha_array`	host	input	Array containing the <type> scalar used for multiplication for each group.	group_count
`Aarray`	device	input	Array of pointers to <type> array, with each array of dim. `lda[i] x k[i]` with `lda[i]>=max(1,m[i])` if `transa[i]==CUBLAS_OP_N` and `lda[i] x m[i]` with `lda[i]>=max(1,k[i])` otherwise. All pointers must meet certain alignment criteria. Please see below for details.	problem_count
`lda_array`	host	input	Array containing the leading dimensions of two-dimensional arrays used to store each matrix `A[idx]` for each group.	group_count
`Barray`	device	input	Array of pointers to <type> array, with each array of dim. `ldb[i] x n[i]` with `ldb[i]>=max(1,k[i])` if `transb[i]==CUBLAS_OP_N` and `ldb[i] x k[i]` with `ldb[i]>=max(1,n[i])` otherwise. All pointers must meet certain alignment criteria. Please see below for details.	problem_count
`ldb_array`	host	input	Array containing the leading dimensions of two-dimensional arrays used to store each matrix `B[idx]` for each group.	group_count
`beta_array`	host	input	Array containing the <type> scalar used for multiplication for each group.	group_count
`Carray`	device	in/out	Array of pointers to <type> array. It has dimensions `ldc[i] x n[i]` with `ldc[i]>=max(1,m[i])`. Matrices `C[idx]` should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details.	problem_count
`ldc_array`	host	input	Array containing the leading dimensions of two-dimensional arrays used to store each matrix `C[idx]` for each group.	group_count
`group_count`	host	input	Number of groups
`group_size`	host	input	Array containing the number of pointers contained in Aarray, Barray and Carray for each group.	group_count

If math mode enables fast math modes when using cublasSgemmGroupedBatched(), pointers (not the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally all pointers are aligned to at least 16 Bytes. Otherwise it is required that they meet the following rule:

if k % 4 == 0 then ensure intptr_t(ptr) % 16 == 0.
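The rule above can be checked on the host before the pointer arrays are filled. The following is a minimal sketch; `satisfies_alignment_rule` is a hypothetical helper, not a cuBLAS function, and it encodes only the single rule stated here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical check (not a cuBLAS function): returns 1 if ptr meets the
 * stated rule for inner dimension k, and 0 otherwise. */
int satisfies_alignment_rule(const void *ptr, int k)
{
    assert(k >= 0);
    if (k % 4 == 0)
        return ((uintptr_t)ptr % 16) == 0;
    /* The rule as stated places no requirement on other values of k. */
    return 1;
}
```

Aligning every matrix pointer to 16 bytes satisfies the rule for all values of `k`, which matches the recommendation above.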

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `transa_array`, `transb_array`, `m_array`, `n_array`, `k_array`, `alpha_array`, `lda_array`, `ldb_array`, `beta_array`, `ldc_array`, or `group_size` are NULL, or if `group_count<0`, or if `m_array[i]<0`, `n_array[i]<0`, `k_array[i]<0`, `group_size[i]<0`, or if `transa_array[i]` or `transb_array[i]` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda_array[i]<max(1,m_array[i])` when `transa_array[i]==CUBLAS_OP_N` and `lda_array[i]<max(1,k_array[i])` otherwise, or if `ldb_array[i]<max(1,k_array[i])` when `transb_array[i]==CUBLAS_OP_N` and `ldb_array[i]<max(1,n_array[i])` otherwise, or if `ldc_array[i]<max(1,m_array[i])`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_NOT_SUPPORTED`	The pointer mode is set to`CUBLAS_POINTER_MODE_DEVICE`

2.7.6.cublas<t>symm()

cublasStatus_t cublasSsymm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, int m, int n, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc)
cublasStatus_t cublasDsymm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, int m, int n, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc)
cublasStatus_t cublasCsymm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, int m, int n, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasZsymm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the symmetric matrix-matrix multiplication

$C = \left\{ \begin{matrix}{\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A$ is a symmetric matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.
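The following CPU sketch illustrates how only the stored triangle of $A$ is read while the other triangle is inferred by symmetry. It is an illustration under stated assumptions, not cuBLAS code: `ref_ssymm`, `sym_at` and `IDX` are hypothetical names, and the integer flags `left` and `lower` stand in for `side` and `uplo`.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Reads symmetric A(r, c) while touching only the stored triangle. */
static float sym_at(const float *A, int lda, int lower, int r, int c)
{
    int stored = lower ? (r >= c) : (r <= c);
    return stored ? A[IDX(r, c, lda)] : A[IDX(c, r, lda)];
}

/* Hypothetical CPU reference (not a cuBLAS function):
 * C = alpha*A*B + beta*C if left, C = alpha*B*A + beta*C otherwise. */
void ref_ssymm(int left, int lower, int m, int n, float alpha,
               const float *A, int lda, const float *B, int ldb,
               float beta, float *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    int inner = left ? m : n; /* A is inner x inner */
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float acc = 0.0f;
            for (int p = 0; p < inner; ++p) {
                acc += left
                    ? sym_at(A, lda, lower, row, p) * B[IDX(p, col, ldb)]
                    : B[IDX(row, p, ldb)] * sym_at(A, lda, lower, p, col);
            }
            float *c = &C[IDX(row, col, ldc)];
            /* When beta == 0, C is not read. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```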

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`side`		input	Indicates if matrix `A` is on the left or right of `B`.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`m`		input	Number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
`n`		input	Number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0`, or if `side` is not one of `CUBLAS_SIDE_LEFT` and `CUBLAS_SIDE_RIGHT`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,m)` when `side==CUBLAS_SIDE_LEFT`, and `lda<max(1,n)` otherwise, or if `ldb<max(1,m)`, or if `ldc<max(1,m)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For reference, see the NETLIB documentation:

ssymm(), dsymm(), csymm(), zsymm()

2.7.7.cublas<t>syrk()

cublasStatus_t cublasSsyrk(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const float *alpha, const float *A, int lda, const float *beta, float *C, int ldc)
cublasStatus_t cublasDsyrk(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const double *alpha, const double *A, int lda, const double *beta, double *C, int ldc)
cublasStatus_t cublasCsyrk(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasZsyrk(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\\end{matrix} \right.$
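The rank-$k$ update can be sketched on the CPU as follows. This is a minimal illustration, not cuBLAS code: `ref_ssyrk` and `IDX` are hypothetical names, and the integer flags `lower` and `transposed` stand in for `uplo` and `trans`. The sketch follows the reference BLAS convention of updating only the stored triangle of $C$.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Hypothetical CPU reference (not a cuBLAS function):
 * C = alpha * op(A) * op(A)^T + beta * C on the stored triangle of C. */
void ref_ssyrk(int lower, int transposed, int n, int k, float alpha,
               const float *A, int lda, float beta, float *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int col = 0; col < n; ++col) {
        /* Visit only the rows of column col that lie in the stored triangle. */
        int row_begin = lower ? col : 0;
        int row_end = lower ? n : col + 1;
        for (int row = row_begin; row < row_end; ++row) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                /* op(A)(row, p) and op(A)(col, p) */
                float a_r = transposed ? A[IDX(p, row, lda)] : A[IDX(row, p, lda)];
                float a_c = transposed ? A[IDX(p, col, lda)] : A[IDX(col, p, lda)];
                acc += a_r * a_c;
            }
            float *c = &C[IDX(row, col, ldc)];
            /* When beta == 0, C is not read. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```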

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or transpose.
`n`		input	Number of rows of matrix op(`A`) and `C`.
`k`		input	Number of columns of matrix op(`A`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N`, and `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For reference, see the NETLIB documentation:

ssyrk(), dsyrk(), csyrk(), zsyrk()

2.7.8.cublas<t>syr2k()

cublasStatus_t cublasSsyr2k(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc)
cublasStatus_t cublasDsyr2k(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc)
cublasStatus_t cublasCsyr2k(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasZsyr2k(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the symmetric rank-$2k$ update

$C = \alpha(\text{op}(A)\text{op}(B)^{T} + \text{op}(B)\text{op}(A)^{T}) + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ and $B$ are matrices with dimensions $\text{op}(A)$ $n \times k$ and $\text{op}(B)$ $n \times k$, respectively. Also, for matrices $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix}{A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\{A^{T}\text{ and }B^{T}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\\end{matrix} \right.$
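A CPU sketch of the rank-$2k$ update follows. It is an illustration only: `ref_ssyr2k` and `IDX` are hypothetical names, the flags `lower` and `transposed` stand in for `uplo` and `trans`, and only the stored triangle of $C$ is written, following the reference BLAS convention.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Hypothetical CPU reference (not a cuBLAS function):
 * C = alpha * (op(A)*op(B)^T + op(B)*op(A)^T) + beta * C on the stored triangle. */
void ref_ssyr2k(int lower, int transposed, int n, int k, float alpha,
                const float *A, int lda, const float *B, int ldb,
                float beta, float *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int col = 0; col < n; ++col) {
        int row_begin = lower ? col : 0;
        int row_end = lower ? n : col + 1;
        for (int row = row_begin; row < row_end; ++row) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                /* The same trans flag applies to both A and B. */
                float a_r = transposed ? A[IDX(p, row, lda)] : A[IDX(row, p, lda)];
                float a_c = transposed ? A[IDX(p, col, lda)] : A[IDX(col, p, lda)];
                float b_r = transposed ? B[IDX(p, row, ldb)] : B[IDX(row, p, ldb)];
                float b_c = transposed ? B[IDX(p, col, ldb)] : B[IDX(col, p, ldb)];
                acc += a_r * b_c + b_r * a_c;
            }
            float *c = &C[IDX(row, col, ldc)];
            /* When beta == 0, C is not read. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```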

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or transpose.
`n`		input	Number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	Number of columns of matrix op(`A`) and op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimensions `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N`, and `lda<max(1,k)` otherwise, or if `ldb<max(1,n)` when `trans==CUBLAS_OP_N`, and `ldb<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For reference, see the NETLIB documentation:

ssyr2k(), dsyr2k(), csyr2k(), zsyr2k()

2.7.9.cublas<t>syrkx()

cublasStatus_t cublasSsyrkx(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc)
cublasStatus_t cublasDsyrkx(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc)
cublasStatus_t cublasCsyrkx(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasZsyrkx(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs a variation of the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{T} + \beta C$

This routine can be used when B is such that the result is guaranteed to be symmetric. A common example is when the matrix B is a scaled form of the matrix A, which is equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine cublas<t>dgmm().
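The scaled-matrix case can be checked with a small CPU sketch. The helpers below are hypothetical and not part of cuBLAS: `scale_columns` forms $B = A\,\mathrm{diag}(d)$ and `product_abt` forms the full product $P = AB^{T}$, so that both triangles can be compared. With exact arithmetic $P$ is symmetric, because $P = A\,\mathrm{diag}(d)\,A^{T}$; in floating point the two triangles may differ by rounding.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Hypothetical helper: B = A * diag(d), scaling column p of the n x k
 * matrix A by d[p]. */
void scale_columns(int n, int k, const float *A, int lda,
                   const float *d, float *B, int ldb)
{
    assert(n >= 0 && k >= 0);
    for (int p = 0; p < k; ++p)
        for (int r = 0; r < n; ++r)
            B[IDX(r, p, ldb)] = A[IDX(r, p, lda)] * d[p];
}

/* Hypothetical helper: full n x n product P = A * B^T (both triangles). */
void product_abt(int n, int k, const float *A, int lda,
                 const float *B, int ldb, float *P, int ldp)
{
    assert(n >= 0 && k >= 0);
    for (int c = 0; c < n; ++c) {
        for (int r = 0; r < n; ++r) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[IDX(r, p, lda)] * B[IDX(c, p, ldb)];
            P[IDX(r, c, ldp)] = acc;
        }
    }
}
```

Because the full product is symmetric in this case, computing only one triangle, as a rank-$k$ style update does, loses no information.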

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or transpose.
`n`		input	Number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	Number of columns of matrix op(`A`) and op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimensions `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,n)`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N`, and `lda<max(1,k)` otherwise, or if `ldb<max(1,n)` when `trans==CUBLAS_OP_N`, and `ldb<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For reference, see the NETLIB documentation:

ssyrk(), dsyrk(), csyrk(), zsyrk() and

ssyr2k(), dsyr2k(), csyr2k(), zsyr2k()

2.7.10.cublas<t>trmm()

cublasStatus_t cublasStrmm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const float *alpha, const float *A, int lda, const float *B, int ldb, float *C, int ldc)
cublasStatus_t cublasDtrmm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const double *alpha, const double *A, int lda, const double *B, int ldb, double *C, int ldc)
cublasStatus_t cublasCtrmm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, cuComplex *C, int ldc)
cublasStatus_t cublasZtrmm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the triangular matrix-matrix multiplication

$C = \left\{ \begin{matrix}{\alpha\text{op}(A)B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{\alpha B\text{op}(A)} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

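where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$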

Notice that in order to achieve better parallelism cuBLAS differs from the BLAS API only for this routine. The BLAS API assumes an in-place implementation (with results written back to B), while the cuBLAS API assumes an out-of-place implementation (with results written into C). The application can obtain the in-place functionality of BLAS in the cuBLAS API by passing the address of the matrix B in place of the matrix C. No other overlapping in the input parameters is supported.
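The out-of-place form can be sketched on the CPU as follows. This is a minimal illustration, not cuBLAS code: `ref_strmm_left` and `IDX` are hypothetical names, only the left-side, non-transposed case is handled, and the flags `lower` and `unit_diag` stand in for `uplo` and `diag`. Unlike the cuBLAS routine, this naive sketch assumes that `B` and `C` do not alias.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Hypothetical CPU reference (not a cuBLAS function): C = alpha * A * B with
 * A an m x m triangular matrix; B and C are distinct m x n matrices. */
void ref_strmm_left(int lower, int unit_diag, int m, int n, float alpha,
                    const float *A, int lda, const float *B, int ldb,
                    float *C, int ldc)
{
    assert(m >= 0 && n >= 0);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            /* Only columns inside the stored triangle contribute to this row. */
            int lo = lower ? 0 : row;
            int hi = lower ? row : m - 1;
            float acc = 0.0f;
            for (int p = lo; p <= hi; ++p) {
                /* With a unit diagonal, A(row, row) is taken as 1 and not read. */
                float a = (p == row && unit_diag) ? 1.0f : A[IDX(row, p, lda)];
                acc += a * B[IDX(p, col, ldb)];
            }
            C[IDX(row, col, ldc)] = alpha * acc;
        }
    }
}
```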

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`side`		input	Indicates if matrix `A` is on the left or right of `B`.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`m`		input	Number of rows of matrix `B`, with matrix `A` sized accordingly.
`n`		input	Number of columns of matrix `B`, with matrix `A` sized accordingly.
`alpha`	host or device	input	<type> scalar used for multiplication. If `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
`A`	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`C`	device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0`, `n<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `side` is not one of `CUBLAS_SIDE_LEFT` and `CUBLAS_SIDE_RIGHT`, or if `lda<max(1,m)` when `side==CUBLAS_SIDE_LEFT`, and `lda<max(1,n)` otherwise, or if `ldb<max(1,m)`, or if `ldc<max(1,m)`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For reference, see the NETLIB documentation:

strmm(), dtrmm(), ctrmm(), ztrmm()

2.7.11.cublas<t>trsm()

cublasStatus_t cublasStrsm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const float *alpha, const float *A, int lda, float *B, int ldb)
cublasStatus_t cublasDtrsm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const double *alpha, const double *A, int lda, double *B, int ldb)
cublasStatus_t cublasCtrsm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const cuComplex *alpha, const cuComplex *A, int lda, cuComplex *B, int ldb)
cublasStatus_t cublasZtrsm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, cublasOperation_t trans, cublasDiagType_t diag, int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, cuDoubleComplex *B, int ldb)

This function supports the 64-bit Integer Interface.

This function solves the triangular linear system with multiple right-hand-sides

$\left\{ \begin{matrix}{\text{op}(A)X = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{X\text{op}(A) = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

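where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X$ and $B$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$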

The solution$X$ overwrites the right-hand-sides$B$ on exit.

No test for singularity or near-singularity is included in this function.
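For the left-side, lower-triangular, non-transposed case, the solve reduces to forward substitution on each column of $B$. The following CPU sketch illustrates this; `ref_strsm_left_lower` and `IDX` are hypothetical names, not cuBLAS functions, and like the routine itself the sketch performs no singularity check.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major element (row r, column c) of a matrix with leading dimension ld. */
#define IDX(r, c, ld) ((size_t)(c) * (size_t)(ld) + (size_t)(r))

/* Hypothetical CPU reference (not a cuBLAS function): solves A * X = alpha * B
 * for lower-triangular A (m x m) and overwrites B (m x n) with X. */
void ref_strsm_left_lower(int unit_diag, int m, int n, float alpha,
                          const float *A, int lda, float *B, int ldb)
{
    assert(m >= 0 && n >= 0);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            float x = alpha * B[IDX(row, col, ldb)];
            /* Rows above have already been overwritten with the solution. */
            for (int p = 0; p < row; ++p)
                x -= A[IDX(row, p, lda)] * B[IDX(p, col, ldb)];
            /* With a unit diagonal, A(row, row) is taken as 1 and not read. */
            if (!unit_diag)
                x /= A[IDX(row, row, lda)];
            B[IDX(row, col, ldb)] = x;
        }
    }
}
```

Each column of $B$ is solved independently, which is why multiple right-hand sides can be processed in parallel.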

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`side`		input	Indicates if matrix `A` is on the left or right of `X`.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`diag`		input	Indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`m`		input	Number of rows of matrix `B`, with matrix `A` sized accordingly.
`n`		input	Number of columns of matrix `B`, with matrix `A` sized accordingly.
`alpha`	host or device	input	<type> scalar used for multiplication. If `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
`A`	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	in/out	<type> array. It has dimensions `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0`, `n<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `side` is not one of `CUBLAS_SIDE_LEFT` and `CUBLAS_SIDE_RIGHT`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`, or if `lda<max(1,m)` if `side==CUBLAS_SIDE_LEFT`, and `lda<max(1,n)` otherwise, or if `ldb<max(1,m)`, or if `alpha` is NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

strsm(), dtrsm(), ctrsm(), ztrsm()

2.7.12.cublas<t>trsmBatched()

    cublasStatus_t cublasStrsmBatched(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                                      cublasOperation_t trans, cublasDiagType_t diag, int m, int n,
                                      const float *alpha, const float *const A[], int lda,
                                      float *const B[], int ldb, int batchCount);
    cublasStatus_t cublasDtrsmBatched(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                                      cublasOperation_t trans, cublasDiagType_t diag, int m, int n,
                                      const double *alpha, const double *const A[], int lda,
                                      double *const B[], int ldb, int batchCount);
    cublasStatus_t cublasCtrsmBatched(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                                      cublasOperation_t trans, cublasDiagType_t diag, int m, int n,
                                      const cuComplex *alpha, const cuComplex *const A[], int lda,
                                      cuComplex *const B[], int ldb, int batchCount);
    cublasStatus_t cublasZtrsmBatched(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                                      cublasOperation_t trans, cublasDiagType_t diag, int m, int n,
                                      const cuDoubleComplex *alpha, const cuDoubleComplex *const A[], int lda,
                                      cuDoubleComplex *const B[], int ldb, int batchCount);

This function supports the 64-bit Integer Interface.

This function solves an array of triangular linear systems with multiple right-hand-sides

$\left\{ \begin{matrix}{\text{op}(A\lbrack i\rbrack)X\lbrack i\rbrack = \alpha B\lbrack i\rbrack} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{X\lbrack i\rbrack\text{op}(A\lbrack i\rbrack) = \alpha B\lbrack i\rbrack} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A\lbrack i\rbrack$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X\lbrack i\rbrack$ and $B\lbrack i\rbrack$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix}{A\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\{A^{T}\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\{A^{H}\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

The solution $X\lbrack i\rbrack$ overwrites the right-hand sides $B\lbrack i\rbrack$ on exit.

No test for singularity or near-singularity is included in this function.

This function works for any size but is intended for small matrices, where the launch overhead is a significant factor. For bigger sizes, it might be advantageous to call the regular `cublas<t>trsm()` `batchCount` times within a set of CUDA streams.

The current implementation is limited to devices with compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`side`		input	Indicates whether matrix `A[i]` is on the left or right of `X[i]`.
`uplo`		input	Indicates whether the lower or upper part of matrix `A[i]` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A[i]`) that is non- or (conj.) transpose.
`diag`		input	Indicates whether the elements on the main diagonal of matrix `A[i]` are unity and should not be accessed.
`m`		input	Number of rows of matrix `B[i]`, with matrix `A[i]` sized accordingly.
`n`		input	Number of columns of matrix `B[i]`, with matrix `A[i]` sized accordingly.
`alpha`	host or device	input	<type> scalar used for multiplication. If `alpha==0`, `A[i]` is not referenced and `B[i]` does not have to be a valid input.
`A`	device	input	Array of pointers to <type> array, with each array of dim. `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A[i]`.
`B`	device	in/out	Array of pointers to <type> array, with each array of dim. `ldb x n` with `ldb>=max(1,m)`. Matrices `B[i]` should not overlap; otherwise, undefined behavior is expected.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B[i]`.
`batchCount`		input	Number of pointers contained in `A` and `B`.
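Because `A` and `B` are arrays of pointers, the caller has to build those pointer arrays before the call. One possible approach, shown here only as a sketch, is to pack all matrices of a batch into a single allocation and derive each pointer from a fixed stride. The helper name `fill_batch_pointers` and the packed layout are assumptions made for this example, not something the library requires.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: sets ptrs[i] to the start of the i-th matrix in one
   contiguous allocation, with `stride` elements between consecutive matrices.
   For B, a stride of at least ldb * n keeps the matrices from overlapping. */
void fill_batch_pointers(double **ptrs, double *base, size_t stride, int batchCount)
{
    assert(batchCount >= 0);
    for (int i = 0; i < batchCount; ++i)
        ptrs[i] = base + (size_t)i * stride;
}
```

Since the table lists `A` and `B` as residing in device memory, a pointer array built on the host this way would still have to be copied to the device (for example with `cudaMemcpy`), and `base` would have to be a device address.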

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0`, `n<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `side` is not one of `CUBLAS_SIDE_LEFT` and `CUBLAS_SIDE_RIGHT`, or if `diag` is not one of `CUBLAS_DIAG_UNIT` and `CUBLAS_DIAG_NON_UNIT`, or if `lda<max(1,m)` if `side==CUBLAS_SIDE_LEFT`, and `lda<max(1,n)` otherwise, or if `ldb<max(1,m)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

strsm(), dtrsm(), ctrsm(), ztrsm()

2.7.13.cublas<t>hemm()

    cublasStatus_t cublasChemm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                               int m, int n, const cuComplex *alpha,
                               const cuComplex *A, int lda, const cuComplex *B, int ldb,
                               const cuComplex *beta, cuComplex *C, int ldc)
    cublasStatus_t cublasZhemm(cublasHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                               int m, int n, const cuDoubleComplex *alpha,
                               const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb,
                               const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

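This function supports the 64-bit Integer Interface.

This function performs the Hermitian matrix-matrix multiplication

$C = \left\{ \begin{matrix}{\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A$ is a Hermitian matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.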

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`side`		input	Indicates whether matrix `A` is on the left or right of `B`.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is stored; the other Hermitian part is not referenced and is inferred from the stored elements.
`m`		input	Number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
`n`		input	Number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.
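The storage rules for `A` (only one triangle is read, and the imaginary parts of the diagonal are taken as zero) can be made concrete with a small CPU-only sketch. The type `cplx` is a hypothetical stand-in that uses the same `x` = real, `y` = imaginary layout as `cuDoubleComplex`, and `herm_lower_elem` is a hypothetical helper for the `CUBLAS_FILL_MODE_LOWER` case; neither is part of the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cuDoubleComplex: x is the real part, y the imaginary part. */
typedef struct { double x, y; } cplx;

/* Returns element (i, j), 0-based, of a Hermitian matrix whose lower triangle
   is stored in the column-major array A with leading dimension lda.
   The strictly upper triangle of the array is never read. */
cplx herm_lower_elem(const cplx *A, int lda, int i, int j)
{
    assert(lda >= 1 && i >= 0 && j >= 0);
    cplx z;
    if (i >= j) {
        z = A[(size_t)j * lda + i];      /* stored element */
        if (i == j)
            z.y = 0.0;                   /* diagonal is treated as real */
    } else {
        z = A[(size_t)i * lda + j];      /* mirror of the stored element ... */
        z.y = -z.y;                      /* ... conjugated */
    }
    return z;
}
```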

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0`, or if `side` is not one of `CUBLAS_SIDE_LEFT` and `CUBLAS_SIDE_RIGHT`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,m)` when `side==CUBLAS_SIDE_LEFT`, and `lda<max(1,n)` otherwise, or if `ldb<max(1,m)`, or if `ldc<max(1,m)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

chemm(), zhemm()

2.7.14.cublas<t>herk()

    cublasStatus_t cublasCherk(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                               int n, int k, const float *alpha, const cuComplex *A, int lda,
                               const float *beta, cuComplex *C, int ldc)
    cublasStatus_t cublasZherk(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                               int n, int k, const double *alpha, const cuDoubleComplex *A, int lda,
                               const double *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`n`		input	Number of rows of matrix op(`A`) and `C`.
`k`		input	Number of columns of matrix op(`A`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N`, and `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

cherk(), zherk()

2.7.15.cublas<t>her2k()

    cublasStatus_t cublasCher2k(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                                int n, int k, const cuComplex *alpha,
                                const cuComplex *A, int lda, const cuComplex *B, int ldb,
                                const float *beta, cuComplex *C, int ldc)
    cublasStatus_t cublasZher2k(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                                int n, int k, const cuDoubleComplex *alpha,
                                const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb,
                                const double *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the Hermitian rank-$2k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \bar{\alpha}\text{op}(B)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ and $B$ are matrices with dimensions $\text{op}(A)$ $n \times k$ and $\text{op}(B)$ $n \times k$, respectively. Also, for matrix $A$ and $B$

$\text{op(}A\text{) and op(}B\text{)} = \left\{ \begin{matrix}{A\text{ and }B} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\{A^{H}\text{ and }B^{H}} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`n`		input	Number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	Number of columns of matrix op(`A`) and op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N`, and `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

cher2k(), zher2k()

2.7.16.cublas<t>herkx()

    cublasStatus_t cublasCherkx(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                                int n, int k, const cuComplex *alpha,
                                const cuComplex *A, int lda, const cuComplex *B, int ldb,
                                const float *beta, cuComplex *C, int ldc)
    cublasStatus_t cublasZherkx(cublasHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                                int n, int k, const cuDoubleComplex *alpha,
                                const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb,
                                const double *beta, cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs a variation of the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \beta C$

This routine can be used when the matrix B is such that the result is guaranteed to be Hermitian. A common example is when the matrix B is a scaled form of the matrix A: this is equivalent to B being the product of the matrix A and a diagonal matrix. For an efficient computation of the product of a regular matrix with a diagonal matrix, refer to the routine `cublas<t>dgmm()`.
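The scaled-matrix case can be checked with a short derivation. As a sketch, assume `trans==CUBLAS_OP_N`, $B = AD$ with $D$ a real diagonal matrix, and a real $\alpha$ (these assumptions are made for illustration and are narrower than what the routine accepts):

$\alpha AB^{H} = \alpha A(AD)^{H} = \alpha AD^{H}A^{H} = \alpha ADA^{H}$

$\left( \alpha ADA^{H} \right)^{H} = \alpha AD^{H}A^{H} = \alpha ADA^{H}$

Under these assumptions the update term equals its own conjugate transpose, so with a Hermitian $C$ and a real $\beta$ the result stays Hermitian.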

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`n`		input	Number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	Number of columns of matrix op(`A`) and op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`B`	device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	Real scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N`, and `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `alpha` or `beta` are NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

cherk(), zherk() and cher2k(), zher2k()

2.8.BLAS-like Extension

This section describes the BLAS-extension functions that perform matrix-matrix operations.

2.8.1.cublas<t>geam()

    cublasStatus_t cublasSgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
                               int m, int n, const float *alpha, const float *A, int lda,
                               const float *beta, const float *B, int ldb, float *C, int ldc)
    cublasStatus_t cublasDgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
                               int m, int n, const double *alpha, const double *A, int lda,
                               const double *beta, const double *B, int ldb, double *C, int ldc)
    cublasStatus_t cublasCgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
                               int m, int n, const cuComplex *alpha, const cuComplex *A, int lda,
                               const cuComplex *beta, const cuComplex *B, int ldb, cuComplex *C, int ldc)
    cublasStatus_t cublasZgeam(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb,
                               int m, int n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda,
                               const cuDoubleComplex *beta, const cuDoubleComplex *B, int ldb,
                               cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix addition/transposition

$C = \alpha\text{op}(A) + \beta\text{op}(B)$

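where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times n$, $\text{op}(B)$ $m \times n$ and $C$ $m \times n$, respectively. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{transa == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$.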

The operation is out-of-place if C does not overlap A or B.

The in-place mode supports the following two operations,

$C = \alpha\text{*}C + \beta\text{op}(B)$

$C = \alpha\text{op}(A) + \beta\text{*}C$

For in-place mode, if `C==A`, then `ldc==lda` and `transa==CUBLAS_OP_N` are required. If `C==B`, then `ldc==ldb` and `transb==CUBLAS_OP_N` are required. If these requirements are not met, `CUBLAS_STATUS_INVALID_VALUE` is returned.

The operation includes the following special cases:

the user can reset matrix C to zero by setting `*alpha=*beta=0`.

the user can transpose matrix A by setting `*alpha=1` and `*beta=0`, with `transa==CUBLAS_OP_T`.
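The transposition special case can be traced with a CPU-only reference for real data. This is a sketch of the arithmetic only: `ref_geam` is a hypothetical name, the integer flags `ta` and `tb` stand in for `transa` and `transb`, and, unlike the library, the sketch always reads both inputs even when a scalar is zero.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference sketch (hypothetical helper): C = alpha*op(A) + beta*op(B)
   for real data. ta / tb are nonzero to transpose A / B.
   All arrays are column-major and 0-based; C must not alias A or B here. */
void ref_geam(int ta, int tb, int m, int n,
              double alpha, const double *A, int lda,
              double beta, const double *B, int ldb,
              double *C, int ldc)
{
    assert(m >= 0 && n >= 0 && ldc >= (m > 1 ? m : 1));
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* op(A)(i, j) is A(i, j) without transpose and A(j, i) with it */
            double a = ta ? A[(size_t)i * lda + j] : A[(size_t)j * lda + i];
            double b = tb ? B[(size_t)i * ldb + j] : B[(size_t)j * ldb + i];
            C[(size_t)j * ldc + i] = alpha * a + beta * b;
        }
    }
}
```

Calling it with `ta` set, `alpha` equal to 1 and `beta` equal to 0 writes the transpose of a 2 x 3 matrix `A` into a 3 x 2 matrix `C`.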

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A`) that is non- or (conj.) transpose.
`transb`		input	Operation op(`B`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix op(`A`) and `C`.
`n`		input	Number of columns of matrix op(`B`) and `C`.
`alpha`	host or device	input	<type> scalar used for multiplication. If `*alpha==0`, `A` does not have to be a valid input.
`A`	device	input	<type> array of dimensions `lda x n` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,n)` otherwise.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `A`.
`B`	device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)` if `transb==CUBLAS_OP_N` and `ldb x m` with `ldb>=max(1,n)` otherwise.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `*beta==0`, `B` does not have to be a valid input.
`C`	device	output	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0`, or if `transa` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N`, and `lda<max(1,n)` otherwise, or if `ldb<max(1,m)` if `transb==CUBLAS_OP_N`, and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `A==C` and `(transa!=CUBLAS_OP_N)||(lda!=ldc)`, or if `B==C` and `(transb!=CUBLAS_OP_N)||(ldb!=ldc)`, or if `alpha` or `beta` are NULL
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

2.8.2.cublas<t>dgmm()

    cublasStatus_t cublasSdgmm(cublasHandle_t handle, cublasSideMode_t mode, int m, int n,
                               const float *A, int lda, const float *x, int incx,
                               float *C, int ldc)
    cublasStatus_t cublasDdgmm(cublasHandle_t handle, cublasSideMode_t mode, int m, int n,
                               const double *A, int lda, const double *x, int incx,
                               double *C, int ldc)
    cublasStatus_t cublasCdgmm(cublasHandle_t handle, cublasSideMode_t mode, int m, int n,
                               const cuComplex *A, int lda, const cuComplex *x, int incx,
                               cuComplex *C, int ldc)
    cublasStatus_t cublasZdgmm(cublasHandle_t handle, cublasSideMode_t mode, int m, int n,
                               const cuDoubleComplex *A, int lda, const cuDoubleComplex *x, int incx,
                               cuDoubleComplex *C, int ldc)

This function supports the 64-bit Integer Interface.

This function performs the matrix-matrix multiplication

$C = \left\{ \begin{matrix}{A \times diag(X)} & {\text{if }\textsf{mode == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\{diag(X) \times A} & {\text{if }\textsf{mode == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\\end{matrix} \right.$

where $A$ and $C$ are matrices stored in column-major format with dimensions $m \times n$. $X$ is a vector of size $n$ if `mode==CUBLAS_SIDE_RIGHT` and of size $m$ if `mode==CUBLAS_SIDE_LEFT`. $X$ is gathered from the one-dimensional array `x` with stride `incx`. The absolute value of `incx` is the stride and the sign of `incx` is the direction of the stride. If `incx` is positive, `x` is traversed forward from its first element; otherwise, it is traversed backward from its last element. The formula for $X$ is

$X\lbrack j\rbrack = \left\{ \begin{matrix}{x\lbrack j \times incx\rbrack} & {\text{if }incx \geq 0} \\{x\lbrack(\chi - 1) \times |incx| - j \times |incx|\rbrack} & {\text{if }incx < 0} \\\end{matrix} \right.$

where $\chi = m$ if `mode==CUBLAS_SIDE_LEFT` and $\chi = n$ if `mode==CUBLAS_SIDE_RIGHT`.

Example 1: if the user wants to perform $diag(diag(B)) \times A$, then $incx = ldb + 1$, where $ldb$ is the leading dimension of matrix `B`, either row-major or column-major.

Example 2: if the user wants to perform $\alpha \times A$, then there are two choices: either `cublas<t>geam()` with `*beta==0` and `transa==CUBLAS_OP_N`, or `cublas<t>dgmm()` with `incx==0` and `x[0]==alpha`.
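The gathering rule and both examples can be traced with a CPU-only sketch. The names `dgmm_x_index` and `ref_dgmm_left` are hypothetical; the code transcribes the formula for $X\lbrack j\rbrack$ and the `CUBLAS_SIDE_LEFT` product for real data, and is not meant to describe the library's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Index into x that supplies X[j] (0-based), following the formula above.
   chi is m for CUBLAS_SIDE_LEFT and n for CUBLAS_SIDE_RIGHT. */
size_t dgmm_x_index(int j, int incx, int chi)
{
    assert(j >= 0 && j < chi);
    size_t step = (size_t)abs(incx);
    return incx >= 0 ? (size_t)j * step
                     : (size_t)(chi - 1 - j) * step;
}

/* CPU reference sketch of the CUBLAS_SIDE_LEFT case: C = diag(X) * A,
   with A and C column-major. */
void ref_dgmm_left(int m, int n, const double *A, int lda,
                   const double *x, int incx, double *C, int ldc)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i)
            C[(size_t)j * ldc + i] =
                x[dgmm_x_index(i, incx, m)] * A[(size_t)j * lda + i];
}
```

Passing a matrix `B` as `x` with `incx` equal to `ldb + 1` walks the diagonal of `B`, as in Example 1, and `incx` equal to 0 maps every `X[j]` to `x[0]`, as in Example 2.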

The operation is out-of-place. In-place operation works only if `lda==ldc`.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`mode`		input	Left multiply if `mode==CUBLAS_SIDE_LEFT` or right multiply if `mode==CUBLAS_SIDE_RIGHT`.
`m`		input	Number of rows of matrix `A` and `C`.
`n`		input	Number of columns of matrix `A` and `C`.
`A`	device	input	<type> array of dimensions `lda x n` with `lda>=max(1,m)`.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `A`.
`x`	device	input	One-dimensional <type> array of size `abs(incx) x m` if `mode==CUBLAS_SIDE_LEFT` and `abs(incx) x n` if `mode==CUBLAS_SIDE_RIGHT`.
`incx`		input	Stride of one-dimensional array `x`.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	Leading dimension of a two-dimensional array used to store the matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0`, or if `mode` is not one of `CUBLAS_SIDE_LEFT` and `CUBLAS_SIDE_RIGHT`, or if `lda<max(1,m)`, or if `ldc<max(1,m)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

2.8.3.cublas<t>getrfBatched()

    cublasStatus_t cublasSgetrfBatched(cublasHandle_t handle, int n, float *const Aarray[], int lda,
                                       int *PivotArray, int *infoArray, int batchSize);
    cublasStatus_t cublasDgetrfBatched(cublasHandle_t handle, int n, double *const Aarray[], int lda,
                                       int *PivotArray, int *infoArray, int batchSize);
    cublasStatus_t cublasCgetrfBatched(cublasHandle_t handle, int n, cuComplex *const Aarray[], int lda,
                                       int *PivotArray, int *infoArray, int batchSize);
    cublasStatus_t cublasZgetrfBatched(cublasHandle_t handle, int n, cuDoubleComplex *const Aarray[], int lda,
                                       int *PivotArray, int *infoArray, int batchSize);

`Aarray` is an array of pointers to matrices stored in column-major format with dimensions `n x n` and leading dimension `lda`.

This function performs the LU factorization of each `Aarray[i]` for i = 0, …, `batchSize-1` by the following equation

$\text{P}\text{*}{Aarray}\lbrack i\rbrack = L\text{*}U$

where `P` is a permutation matrix which represents partial pivoting with row interchanges. `L` is a lower triangular matrix with unit diagonal and `U` is an upper triangular matrix.

Formally, `P` is written as a product of permutation matrices `Pj`, for `j = 1,2,...,n`, say `P = P1 * P2 * P3 * .... * Pn`. `Pj` is a permutation matrix which interchanges two rows of vector x when performing `Pj*x`. `Pj` can be constructed from the `j`-th element of `PivotArray[i]` by the following Matlab code

    // In Matlab PivotArray[i] is an array of base-1.
    // In C, PivotArray[i] is base-0.
    Pj = eye(n);
    swap Pj(j,:) and Pj(PivotArray[i][j],:)

`L` and `U` are written back to the original matrix `A`, and the diagonal elements of `L` are discarded. `L` and `U` can be constructed by the following Matlab code

    // A is a matrix of nxn after getrf.
    L = eye(n);
    for j = 1:n
        L(j+1:n,j) = A(j+1:n,j)
    end
    U = zeros(n);
    for i = 1:n
        U(i,i:n) = A(i,i:n)
    end

If matrix `A(=Aarray[i])` is singular, getrf still works and the value of `info(=infoArray[i])` reports the first row index at which the LU factorization cannot proceed. If info is `k`, `U(k,k)` is zero. The equation `P*A==L*U` still holds; however, reconstructing `L` and `U` needs different Matlab code, as follows:

    // A is a matrix of nxn after getrf.
    // info is k, which means U(k,k) is zero.
    L = eye(n);
    for j = 1:k-1
        L(j+1:n,j) = A(j+1:n,j)
    end
    U = zeros(n);
    for i = 1:k-1
        U(i,i:n) = A(i,i:n)
    end
    for i = k:n
        U(i,k:n) = A(i,k:n)
    end
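For C callers, the two Matlab fragments above can be transcribed into a single host-side routine. The following is a sketch for real data with 0-based, column-major indexing; the name `split_lu` is hypothetical, and the code assumes the packed factors of one matrix have already been copied back to host memory.

```c
#include <assert.h>
#include <stddef.h>

/* Splits the packed n x n getrf output A (column-major, leading dimension lda)
   into dense L and U (column-major, leading dimension n).
   info is 0 for a complete factorization, or k (1-based) when U(k,k) is zero. */
void split_lu(int n, const double *A, int lda, int info, double *L, double *U)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1) && info >= 0 && info <= n);
    int done = info > 0 ? info - 1 : n;   /* number of fully eliminated columns */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < n; ++i) {
            double a = A[(size_t)j * lda + i];
            /* L: unit diagonal, multipliers only in eliminated columns */
            L[(size_t)j * n + i] = (i == j) ? 1.0
                                 : (i > j && j < done) ? a : 0.0;
            /* U: upper triangle for eliminated rows, full trailing block otherwise */
            int first_col = i < done ? i : done;
            U[(size_t)j * n + i] = (j >= first_col) ? a : 0.0;
        }
    }
}
```

With `info` equal to 0 the routine reproduces the first fragment; with `info` equal to `k` it leaves columns `k` to `n` of `L` as identity columns and copies the trailing block into `U`, as in the second fragment.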

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>getrfBatched supports non-pivot LU factorization if `PivotArray` is NULL.

cublas<t>getrfBatched supports arbitrary dimension.

cublas<t>getrfBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of rows and columns of `Aarray[i]`.
`Aarray`	device	input/output	Array of pointers to <type> array, with each array of dim. `n x n` with `lda>=max(1,n)`. Matrices `Aarray[i]` should not overlap; otherwise, undefined behavior is expected.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
`PivotArray`	device	output	Array of size `n x batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `PivotArray` is NULL, pivoting is disabled.
`infoArray`	device	output	Array of size `batchSize` in which info(=infoArray[i]) contains the information about the factorization of `Aarray[i]`. If `info==0`, the execution is successful. If `info=-j`, the `j`-th parameter had an illegal value. If `info=k`, `U(k,k)==0`. The factorization has been completed, but U is exactly singular.
`batchSize`		input	Number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `batchSize<0` or `lda<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgetrf(), dgetrf(), cgetrf(), zgetrf()

2.8.4.cublas<t>getrsBatched()

    cublasStatus_t cublasSgetrsBatched(cublasHandle_t handle, cublasOperation_t trans, int n, int nrhs,
                                       const float *const Aarray[], int lda, const int *devIpiv,
                                       float *const Barray[], int ldb, int *info, int batchSize);
    cublasStatus_t cublasDgetrsBatched(cublasHandle_t handle, cublasOperation_t trans, int n, int nrhs,
                                       const double *const Aarray[], int lda, const int *devIpiv,
                                       double *const Barray[], int ldb, int *info, int batchSize);
    cublasStatus_t cublasCgetrsBatched(cublasHandle_t handle, cublasOperation_t trans, int n, int nrhs,
                                       const cuComplex *const Aarray[], int lda, const int *devIpiv,
                                       cuComplex *const Barray[], int ldb, int *info, int batchSize);
    cublasStatus_t cublasZgetrsBatched(cublasHandle_t handle, cublasOperation_t trans, int n, int nrhs,
                                       const cuDoubleComplex *const Aarray[], int lda, const int *devIpiv,
                                       cuDoubleComplex *const Barray[], int ldb, int *info, int batchSize);

This function solves an array of systems of linear equations of the form:

$\text{op}(A\lbrack i \rbrack) X\lbrack i\rbrack = B\lbrack i\rbrack$

where $A\lbrack i\rbrack$ is a matrix which has been LU factorized with pivoting, $X\lbrack i\rbrack$ and $B\lbrack i\rbrack$ are $n \times {nrhs}$ matrices. Also, for matrix $A$

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix}{A\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\{A^{T}\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\{A^{H}\lbrack i\rbrack} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>getrsBatched() supports non-pivot LU factorization if `devIpiv` is NULL.
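To make the non-pivot case concrete, the following CPU-only sketch solves one system from packed LU factors with forward and back substitution, for `trans==CUBLAS_OP_N` and real data. The name `ref_getrs_nopivot` is hypothetical, and the code illustrates the arithmetic for a single matrix of the batch rather than the library's GPU implementation.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference sketch (hypothetical helper): solves A * X = B for one matrix,
   given the packed LU factors of A (unit lower L below the diagonal, U on and
   above it), without pivoting. B is overwritten with X. Column-major, 0-based. */
void ref_getrs_nopivot(int n, int nrhs, const double *LU, int lda,
                       double *B, int ldb)
{
    assert(n >= 0 && nrhs >= 0);
    assert(lda >= (n > 1 ? n : 1) && ldb >= (n > 1 ? n : 1));
    for (int c = 0; c < nrhs; ++c) {
        double *b = B + (size_t)c * ldb;
        /* forward substitution with unit-diagonal L: L * y = b */
        for (int i = 0; i < n; ++i)
            for (int k = 0; k < i; ++k)
                b[i] -= LU[(size_t)k * lda + i] * b[k];
        /* back substitution with U: U * x = y */
        for (int i = n - 1; i >= 0; --i) {
            for (int k = i + 1; k < n; ++k)
                b[i] -= LU[(size_t)k * lda + i] * b[k];
            b[i] /= LU[(size_t)i * lda + i];
        }
    }
}
```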

cublas<t>getrsBatched() supports arbitrary dimension.

cublas<t>getrsBatched() only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`n`		input	Number of rows and columns of `Aarray[i]`.
`nrhs`		input	Number of columns of `Barray[i]`.
`Aarray`	device	input	Array of pointers to <type> array, with each array of dim. `n x n` with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
`devIpiv`	device	input	Array of size `n x batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]` stored in a linear fashion. If `devIpiv` is NULL, pivoting for all `Aarray[i]` is ignored.
`Barray`	device	input/output	Array of pointers to <type> array, with each array of dim. `n x nrhs` with `ldb>=max(1,n)`. Matrices `Barray[i]` should not overlap; otherwise, undefined behavior is expected.
`ldb`		input	Leading dimension of two-dimensional array used to store each solution matrix `Barray[i]`.
`info`	host	output	If `info==0`, the execution is successful. If `info=-j`, the `j`-th parameter had an illegal value.
`batchSize`		input	Number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `nrhs<0`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `lda<max(1,n)`, or if `ldb<max(1,n)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgetrs(), dgetrs(), cgetrs(), zgetrs()

2.8.5.cublas<t>getriBatched()

    cublasStatus_t cublasSgetriBatched(cublasHandle_t handle, int n, const float *const Aarray[], int lda,
                                       int *PivotArray, float *const Carray[], int ldc,
                                       int *infoArray, int batchSize);
    cublasStatus_t cublasDgetriBatched(cublasHandle_t handle, int n, const double *const Aarray[], int lda,
                                       int *PivotArray, double *const Carray[], int ldc,
                                       int *infoArray, int batchSize);
    cublasStatus_t cublasCgetriBatched(cublasHandle_t handle, int n, const cuComplex *const Aarray[], int lda,
                                       int *PivotArray, cuComplex *const Carray[], int ldc,
                                       int *infoArray, int batchSize);
    cublasStatus_t cublasZgetriBatched(cublasHandle_t handle, int n, const cuDoubleComplex *const Aarray[], int lda,
                                       int *PivotArray, cuDoubleComplex *const Carray[], int ldc,
                                       int *infoArray, int batchSize);

Aarray and Carray are arrays of pointers to matrices stored in column-major format with dimensions n*n and leading dimensions lda and ldc, respectively.

This function performs the inversion of matrices A[i] for i = 0, …, batchSize-1.

Prior to calling cublas<t>getriBatched, the matrix A[i] must be factorized using the routine cublas<t>getrfBatched. After the call to cublas<t>getrfBatched, the matrix pointed to by Aarray[i] contains the LU factors of the matrix A[i], and the vector pointed to by (PivotArray+i) contains the pivoting sequence.

Following the LU factorization, cublas<t>getriBatched uses forward and backward triangular solvers to complete the inversion of matrices A[i] for i = 0, …, batchSize-1. The inversion is out-of-place, so the memory space of Carray[i] cannot overlap the memory space of Aarray[i].
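Because Aarray and Carray are arrays of pointers, one common arrangement is to keep all matrices of a batch in a single contiguous buffer and derive the per-matrix pointers from a fixed stride. The host-side sketch below shows only that pointer arithmetic. The helper name `fill_batch_pointers` and the stride of `lda * n` elements are assumptions made for illustration and are not part of cuBLAS. In a real application the base address would be a device pointer, and the resulting pointer array would itself have to be copied to device memory before the call.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a cuBLAS function): store in ptrs[i] the address
 * of the i-th n-by-n column-major matrix inside one contiguous buffer.
 * Each matrix occupies lda * n elements, so consecutive matrices never overlap. */
static void fill_batch_pointers(double **ptrs, double *base,
                                int n, int lda, int batchSize)
{
    assert(n >= 0 && batchSize >= 0);
    assert(lda >= (n > 1 ? n : 1));          /* lda >= max(1, n) */
    for (int i = 0; i < batchSize; ++i) {
        ptrs[i] = base + (size_t)i * (size_t)lda * (size_t)n;
    }
}
```

Using separate buffers for the Aarray and Carray matrices keeps the out-of-place requirement satisfied by construction.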

Typically all parameters in cublas<t>getrfBatched would be passed into cublas<t>getriBatched. For example,

// step 1: perform in-place LU decomposition, P*A = L*U.
//      Aarray[i] is n*n matrix A[i]
cublasDgetrfBatched(handle, n, Aarray, lda, PivotArray, infoArray, batchSize);
//      check infoArray[i] to see if factorization of A[i] is successful or not.
//      Aarray[i] contains LU factorization of A[i]
// step 2: perform out-of-place inversion, Carray[i] = inv(A[i])
cublasDgetriBatched(handle, n, Aarray, lda, PivotArray, Carray, ldc, infoArray, batchSize);
//      check infoArray[i] to see if inversion of A[i] is successful or not.

The user can check singularity from either cublas<t>getrfBatched or cublas<t>getriBatched.
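Since infoArray lives in device memory, it has to be copied back to the host before it can be inspected. The sketch below assumes that copy has already been made (for example with a device-to-host memory copy) and only shows the scan itself. The helper name `first_singular_matrix` is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: scan a host copy of infoArray and return the index of
 * the first matrix that reported a zero pivot, or -1 if every entry is 0. */
static int first_singular_matrix(const int *hostInfo, int batchSize)
{
    assert(batchSize >= 0);
    for (int i = 0; i < batchSize; ++i) {
        if (hostInfo[i] != 0) {
            return i;   /* hostInfo[i] == k means U(k,k) == 0 for matrix i */
        }
    }
    return -1;
}
```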

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

If cublas<t>getrfBatched is performed without pivoting, PivotArray of cublas<t>getriBatched should be NULL.

cublas<t>getriBatched supports arbitrary dimension.

cublas<t>getriBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of rows and columns of `Aarray[i]`.
`Aarray`	device	input	Array of pointers to <type> array, with each array of dimension `n*n` with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
`PivotArray`	device	input	Array of size `n*batchSize` that contains the pivoting sequence of each factorization of `Aarray[i]`, stored in a linear fashion, as produced by cublas<t>getrfBatched. If `PivotArray` is NULL, pivoting is disabled.
`Carray`	device	output	Array of pointers to <type> array, with each array of dimension `n*n` with `ldc>=max(1,n)`. Matrices `Carray[i]` should not overlap; otherwise, the behavior is undefined.
`ldc`		input	Leading dimension of two-dimensional array used to store each matrix `Carray[i]`.
`infoArray`	device	output	Array of size `batchSize` in which `infoArray[i]` contains the information about the inversion of `A[i]`. If `infoArray[i]==0`, the execution is successful. If `infoArray[i]==k`, then `U(k,k)==0`: U is exactly singular and the inversion failed.
`batchSize`		input	Number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `lda<0` or `ldc<0` or `batchSize<0`, or if `lda<n` or `ldc<n`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

2.8.6.cublas<t>matinvBatched()

cublasStatus_t cublasSmatinvBatched(cublasHandle_t handle, int n, const float *const A[], int lda, float *const Ainv[], int lda_inv, int *info, int batchSize);
cublasStatus_t cublasDmatinvBatched(cublasHandle_t handle, int n, const double *const A[], int lda, double *const Ainv[], int lda_inv, int *info, int batchSize);
cublasStatus_t cublasCmatinvBatched(cublasHandle_t handle, int n, const cuComplex *const A[], int lda, cuComplex *const Ainv[], int lda_inv, int *info, int batchSize);
cublasStatus_t cublasZmatinvBatched(cublasHandle_t handle, int n, const cuDoubleComplex *const A[], int lda, cuDoubleComplex *const Ainv[], int lda_inv, int *info, int batchSize);

A and Ainv are arrays of pointers to matrices stored in column-major format with dimensions n*n and leading dimensions lda and lda_inv, respectively.

This function performs the inversion of matrices A[i] for i = 0, …, batchSize-1.

This function is a shortcut for cublas<t>getrfBatched() plus cublas<t>getriBatched(). However, it only works if n is at most 32. For larger n, the user has to go through cublas<t>getrfBatched() and cublas<t>getriBatched().

If the matrix A[i] is singular, then info[i] reports singularity, the same as cublas<t>getrfBatched().

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of rows and columns of `A[i]`.
`A`	device	input	Array of pointers to <type> array, with each array of dimension `n*n` with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `A[i]`.
`Ainv`	device	output	Array of pointers to <type> array, with each array of dimension `n*n` with `lda_inv>=max(1,n)`. Matrices `Ainv[i]` should not overlap; otherwise, the behavior is undefined.
`lda_inv`		input	Leading dimension of two-dimensional array used to store each matrix `Ainv[i]`.
`info`	device	output	Array of size `batchSize` in which `info[i]` contains the information about the inversion of `A[i]`. If `info[i]==0`, the execution is successful. If `info[i]==k`, then `U(k,k)==0`: U is exactly singular and the inversion failed.
`batchSize`		input	Number of pointers contained in `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `lda<0` or `lda_inv<0` or `batchSize<0`, or if `lda<n` or `lda_inv<n`, or if `n>32`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

2.8.7.cublas<t>geqrfBatched()

cublasStatus_t cublasSgeqrfBatched(cublasHandle_t handle, int m, int n, float *const Aarray[], int lda, float *const TauArray[], int *info, int batchSize);
cublasStatus_t cublasDgeqrfBatched(cublasHandle_t handle, int m, int n, double *const Aarray[], int lda, double *const TauArray[], int *info, int batchSize);
cublasStatus_t cublasCgeqrfBatched(cublasHandle_t handle, int m, int n, cuComplex *const Aarray[], int lda, cuComplex *const TauArray[], int *info, int batchSize);
cublasStatus_t cublasZgeqrfBatched(cublasHandle_t handle, int m, int n, cuDoubleComplex *const Aarray[], int lda, cuDoubleComplex *const TauArray[], int *info, int batchSize);

Aarray is an array of pointers to matrices stored in column-major format with dimensions m x n and leading dimension lda. TauArray is an array of pointers to vectors of dimension at least max(1,min(m,n)).

This function performs the QR factorization of each Aarray[j] for j = 0, ..., batchSize-1 using Householder reflections. Each matrix Q[j] is represented as a product of elementary reflectors and is stored in the lower part of each Aarray[j] as follows:

Q[j] = H[j][1] H[j][2] . . . H[j][k], where k = min(m,n).

Each H[j][i] has the form

H[j][i] = I - tau[j] * v * v'

where tau[j] is a real scalar, and v is a real vector with v(1:i-1) = 0 and v(i) = 1; v(i+1:m) is stored on exit in Aarray[j][i+1:m,i], and tau in TauArray[j][i].
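To make the storage convention concrete, the following host-side sketch applies one elementary reflector to a vector, reading v from below the diagonal of a factored matrix. It is a reference illustration only, not the library's implementation. It assumes the real double-precision case, a single matrix of the batch already copied to the host, and 0-based indices (so the implicit unit entry is v[i]). The function name `apply_reflector` is hypothetical.

```c
#include <assert.h>

/* Hypothetical host-side reference: apply H = I - tau * v * v' to the vector x
 * of length m, in place. Indices are 0-based: v[0..i-1] = 0, v[i] = 1, and
 * v[i+1..m-1] is read from column i of the column-major matrix A (leading
 * dimension lda), below the diagonal. */
static void apply_reflector(int m, int i, const double *A, int lda,
                            double tau, double *x)
{
    assert(0 <= i && i < m && lda >= m);
    const double *col = A + (long)i * lda;

    /* dot = v' * x, using the implicit unit entry v[i] = 1 */
    double dot = x[i];
    for (int r = i + 1; r < m; ++r) {
        dot += col[r] * x[r];
    }

    /* x = x - tau * v * dot */
    x[i] -= tau * dot;
    for (int r = i + 1; r < m; ++r) {
        x[r] -= tau * dot * col[r];
    }
}
```

Applying the reflectors for i = k-1 down to 0 to a vector multiplies it by Q, which is one way to use the factorization without ever forming Q explicitly.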

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

cublas<t>geqrfBatched supports arbitrary dimension.

cublas<t>geqrfBatched only supports compute capability 2.0 or above.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`m`		input	Number of rows of `Aarray[i]`.
`n`		input	Number of columns of `Aarray[i]`.
`Aarray`	device	input/output	Array of pointers to <type> array, with each array of dim. `m x n` with `lda>=max(1,m)`. On exit, each `Aarray[i]` holds its QR factorization.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
`TauArray`	device	output	Array of pointers to <type> vector, with each vector of dim. `max(1,min(m,n))`.
`info`	host	output	If `info==0`, the parameters passed to the function are valid. If `info<0`, the parameter in position `-info` is invalid.
`batchSize`		input	Number of pointers contained in `Aarray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `batchSize<0`, or if `lda<max(1,m)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgeqrf(), dgeqrf(), cgeqrf(), zgeqrf()

2.8.8.cublas<t>gelsBatched()

cublasStatus_t cublasSgelsBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, float *const Aarray[], int lda, float *const Carray[], int ldc, int *info, int *devInfoArray, int batchSize);
cublasStatus_t cublasDgelsBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, double *const Aarray[], int lda, double *const Carray[], int ldc, int *info, int *devInfoArray, int batchSize);
cublasStatus_t cublasCgelsBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, cuComplex *const Aarray[], int lda, cuComplex *const Carray[], int ldc, int *info, int *devInfoArray, int batchSize);
cublasStatus_t cublasZgelsBatched(cublasHandle_t handle, cublasOperation_t trans, int m, int n, int nrhs, cuDoubleComplex *const Aarray[], int lda, cuDoubleComplex *const Carray[], int ldc, int *info, int *devInfoArray, int batchSize);

Aarray is an array of pointers to matrices stored in column-major format. Carray is an array of pointers to matrices stored in column-major format.

This function finds the least squares solution of a batch of overdetermined systems. It solves the least squares problem described as follows:

minimize || Carray[i] - Aarray[i]*Xarray[i] ||, with i = 0, ..., batchSize-1

On exit, each Aarray[i] is overwritten with its QR factorization and each Carray[i] is overwritten with the least squares solution.

cublas<t>gelsBatched supports only the non-transpose operation and only solves over-determined systems (m >= n).

cublas<t>gelsBatched only supports compute capability 2.0 or above.

This function is intended to be used for matrices of small sizes where the launch overhead is a significant factor.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`trans`		input	Operation op(`Aarray[i]`) that is non- or (conj.) transpose. Only the non-transpose operation is currently supported.
`m`		input	Number of rows of each `Aarray[i]` and `Carray[i]` if `trans==CUBLAS_OP_N`, number of columns of each `Aarray[i]` otherwise (not supported currently).
`n`		input	Number of columns of each `Aarray[i]` if `trans==CUBLAS_OP_N`, and number of rows of each `Aarray[i]` and `Carray[i]` otherwise (not supported currently).
`nrhs`		input	Number of columns of each `Carray[i]`.
`Aarray`	device	input/output	Array of pointers to <type> array, with each array of dim. `m x n` with `lda>=max(1,m)` if `trans==CUBLAS_OP_N`, and `n x m` with `lda>=max(1,n)` otherwise (not supported currently). Matrices `Aarray[i]` should not overlap; otherwise, the behavior is undefined.
`lda`		input	Leading dimension of two-dimensional array used to store each matrix `Aarray[i]`.
`Carray`	device	input/output	Array of pointers to <type> array, with each array of dim. `m x nrhs` with `ldc>=max(1,m)` if `trans==CUBLAS_OP_N`, and `n x nrhs` with `ldc>=max(1,n)` otherwise (not supported currently). Matrices `Carray[i]` should not overlap; otherwise, the behavior is undefined.
`ldc`		input	Leading dimension of two-dimensional array used to store each matrix `Carray[i]`.
`info`	host	output	If `info==0`, the parameters passed to the function are valid. If `info<0`, the parameter in position `-info` is invalid.
`devInfoArray`	device	output	Optional array of integers of dimension `batchSize`. If non-null, every element `devInfoArray[i]==V` has the following meaning: `V==0`: the `i`-th problem was successfully solved; `V>0`: the `V`-th diagonal element of `Aarray[i]` is zero, so `Aarray[i]` does not have full rank.
`batchSize`		input	Number of pointers contained in `Aarray` and `Carray`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `nrhs<0` or `batchSize<0`, or if `lda<max(1,m)` or `ldc<max(1,m)`
`CUBLAS_STATUS_NOT_SUPPORTED`	If `m<n` or if `trans` is different from non-transpose.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgels(), dgels(), cgels(), zgels()

2.8.9.cublas<t>tpttr()

cublasStatus_t cublasStpttr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *AP, float *A, int lda);
cublasStatus_t cublasDtpttr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *AP, double *A, int lda);
cublasStatus_t cublasCtpttr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *AP, cuComplex *A, int lda);
cublasStatus_t cublasZtpttr(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *AP, cuDoubleComplex *A, int lda);

This function performs the conversion from the triangular packed format to the triangular format.

If uplo == CUBLAS_FILL_MODE_LOWER then the elements of AP are copied into the lower triangular part of the triangular matrix A and the upper part of A is left untouched. If uplo == CUBLAS_FILL_MODE_UPPER then the elements of AP are copied into the upper triangular part of the triangular matrix A and the lower part of A is left untouched.
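The copy can be pictured with a small host-side reference. The sketch below is not the library's implementation. It assumes the usual BLAS packed layout, in which the stored triangle is packed column by column without gaps, and it uses 0-based indices and the real double-precision case. The name `tpttr_ref` and the integer `lower` flag (standing in for the uplo argument) are hypothetical.

```c
#include <assert.h>

/* Hypothetical host-side reference of the packed-to-triangular copy.
 * lower != 0 plays the role of CUBLAS_FILL_MODE_LOWER, lower == 0 the role of
 * CUBLAS_FILL_MODE_UPPER. Entries of A on the opposite side are not written. */
static void tpttr_ref(int lower, int n, const double *AP, double *A, int lda)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    int p = 0;                          /* running position in AP */
    for (int j = 0; j < n; ++j) {       /* columns are packed one after another */
        int first = lower ? j : 0;      /* first stored row of column j */
        int last  = lower ? n - 1 : j;  /* last stored row of column j  */
        for (int i = first; i <= last; ++i) {
            A[i + (long)j * lda] = AP[p++];
        }
    }
}
```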

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates if matrix `AP` contains lower or upper part of matrix `A`.
`n`		input	Number of rows and columns of matrix `A`.
`AP`	device	input	<type> array with `A` stored in packed format.
`A`	device	output	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`. The opposite side of `A` is left untouched.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

stpttr(), dtpttr(), ctpttr(), ztpttr()

2.8.10.cublas<t>trttp()

cublasStatus_t cublasStrttp(cublasHandle_t handle, cublasFillMode_t uplo, int n, const float *A, int lda, float *AP);
cublasStatus_t cublasDtrttp(cublasHandle_t handle, cublasFillMode_t uplo, int n, const double *A, int lda, double *AP);
cublasStatus_t cublasCtrttp(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuComplex *A, int lda, cuComplex *AP);
cublasStatus_t cublasZtrttp(cublasHandle_t handle, cublasFillMode_t uplo, int n, const cuDoubleComplex *A, int lda, cuDoubleComplex *AP);

This function performs the conversion from the triangular format to the triangular packed format.

If uplo == CUBLAS_FILL_MODE_LOWER then the lower triangular part of the triangular matrix A is copied into the array AP. If uplo == CUBLAS_FILL_MODE_UPPER then the upper triangular part of the triangular matrix A is copied into the array AP.
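As a rough host-side illustration of this direction of the conversion, the sketch below gathers the selected triangle of a column-major matrix into a packed array. It assumes the usual BLAS packed layout (columns of the triangle stored one after another without gaps), 0-based indices, and the real double-precision case. The name `trttp_ref` and the integer `lower` flag are hypothetical and are not part of cuBLAS.

```c
#include <assert.h>

/* Hypothetical host-side reference of the triangular-to-packed copy.
 * lower != 0 plays the role of CUBLAS_FILL_MODE_LOWER, lower == 0 the role of
 * CUBLAS_FILL_MODE_UPPER. AP must hold at least n * (n + 1) / 2 elements. */
static void trttp_ref(int lower, int n, const double *A, int lda, double *AP)
{
    assert(n >= 0 && lda >= (n > 1 ? n : 1));
    int p = 0;                          /* running position in AP */
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;      /* first referenced row of column j */
        int last  = lower ? n - 1 : j;  /* last referenced row of column j  */
        for (int i = first; i <= last; ++i) {
            AP[p++] = A[i + (long)j * lda];
        }
    }
}
```

The packed array needs only n(n+1)/2 elements, compared with lda*n for the full storage.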

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates whether the lower or upper part of matrix `A` is referenced.
`n`		input	Number of rows and columns of matrix `A`.
`A`	device	input	<type> array of dimensions `lda x n`, with `lda>=max(1,n)`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix `A`.
`AP`	device	output	<type> array with `A` stored in packed format.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `lda<max(1,n)`
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

strttp(), dtrttp(), ctrttp(), ztrttp()

2.8.11.cublas<t>gemmEx()

cublasStatus_t cublasSgemmEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const float *alpha, const void *A, cudaDataType_t Atype, int lda, const void *B, cudaDataType_t Btype, int ldb, const float *beta, void *C, cudaDataType_t Ctype, int ldc)
cublasStatus_t cublasCgemmEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const void *A, cudaDataType_t Atype, int lda, const void *B, cudaDataType_t Btype, int ldb, const cuComplex *beta, void *C, cudaDataType_t Ctype, int ldc)

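This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemm(). In this function the input matrices and output matrices can have a lower precision, but the computation is still done in the type <t>: in the type float for cublasSgemmEx() and in the type cuComplex for cublasCgemmEx().

$C = \alpha\,\text{op}(A)\,\text{op}(B) + \beta C$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$, and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$, $\text{op}(B)$ $k \times n$, and $C$ $m \times n$, respectively. Also, for matrix $A$,

$\text{op}(A) = \left\{ \begin{matrix} A & \text{if transa == CUBLAS\_OP\_N} \\ A^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A^{H} & \text{if transa == CUBLAS\_OP\_C} \end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$.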
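The idea of low-precision storage with higher-precision arithmetic can be illustrated by a plain host-side reference. The sketch below is not a cuBLAS call and says nothing about how the library computes internally. It assumes 8-bit integer A and B with a float C and float accumulation, mirroring the `CUDA_R_8I` / `CUDA_R_32F` combination of cublasSgemmEx(). The name `gemm_i8_f32_ref` and the integer `ta`/`tb` transpose flags are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side reference: C = alpha * op(A) * op(B) + beta * C,
 * with A and B stored as 8-bit integers, C stored as float, all column-major.
 * ta / tb nonzero select the transpose; the conjugate transpose is omitted
 * because the data are real. */
static void gemm_i8_f32_ref(int ta, int tb, int m, int n, int k,
                            float alpha, const int8_t *A, int lda,
                            const int8_t *B, int ldb,
                            float beta, float *C, int ldc)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;                       /* accumulate in float */
            for (int p = 0; p < k; ++p) {
                float a = ta ? A[p + (long)i * lda] : A[i + (long)p * lda];
                float b = tb ? B[j + (long)p * ldb] : B[p + (long)j * ldb];
                acc += a * b;
            }
            float *c = &C[i + (long)j * ldc];
            /* When beta == 0, C is not read, so it need not hold valid input. */
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
        }
    }
}
```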

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A`) that is non- or (conj.) transpose.
`transb`		input	Operation op(`B`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix op(`A`) and `C`.
`n`		input	Number of columns of matrix op(`B`) and `C`.
`k`		input	Number of columns of op(`A`) and rows of op(`B`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of matrix `A`.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `A`.
`B`	device	input	<type> array of dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
`Btype`		input	Enumerant specifying the datatype of matrix `B`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`Ctype`		input	Enumerant specifying the datatype of matrix `C`.
`ldc`		input	Leading dimension of a two-dimensional array used to store the matrix `C`.

The matrix type combinations supported for cublasSgemmEx() are listed below:

C	A/B
`CUDA_R_16BF`	`CUDA_R_16BF`
`CUDA_R_16F`	`CUDA_R_16F`
`CUDA_R_32F`	`CUDA_R_8I`
	`CUDA_R_16BF`
	`CUDA_R_16F`
	`CUDA_R_32F`

The matrix type combinations supported for cublasCgemmEx() are listed below:

C	A/B
`CUDA_C_32F`	`CUDA_C_8I`
	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasCgemmEx() is only supported for GPUs with architecture capabilities equal to or greater than 5.0
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype`, `Btype` and `Ctype` is not supported
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` and `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `alpha` or `beta` is NULL, or if `C` is NULL when `beta` is not zero
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgemm()

For more information about the numerical behavior of some GEMM algorithms, refer to the GEMM Algorithms Numerical Behavior section.

2.8.12.cublasGemmEx()

cublasStatus_t cublasGemmEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const void *alpha, const void *A, cudaDataType_t Atype, int lda, const void *B, cudaDataType_t Btype, int ldb, const void *beta, void *C, cudaDataType_t Ctype, int ldc, cublasComputeType_t computeType, cublasGemmAlgo_t algo)
#if defined(__cplusplus)
cublasStatus_t cublasGemmEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const void *alpha, const void *A, cudaDataType Atype, int lda, const void *B, cudaDataType Btype, int ldb, const void *beta, void *C, cudaDataType Ctype, int ldc, cudaDataType computeType, cublasGemmAlgo_t algo)
#endif

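This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemm() that allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run. Supported combinations of arguments are listed further down in this section.

Note

The second variant of the cublasGemmEx() function is provided for backward compatibility with C++ application code, where the computeType parameter is of type cudaDataType instead of cublasComputeType_t. C applications would still compile with the updated function signature.

This function is only supported on devices with compute capability 5.0 or later.

$C = \alpha\,\text{op}(A)\,\text{op}(B) + \beta C$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$, and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$, $\text{op}(B)$ $k \times n$, and $C$ $m \times n$, respectively. Also, for matrix $A$,

$\text{op}(A) = \left\{ \begin{matrix} A & \text{if transa == CUBLAS\_OP\_N} \\ A^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A^{H} & \text{if transa == CUBLAS\_OP\_C} \end{matrix} \right.$

and $\text{op}(B)$ is defined similarly for matrix $B$.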

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A`) that is non- or (conj.) transpose.
`transb`		input	Operation op(`B`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix op(`A`) and `C`.
`n`		input	Number of columns of matrix op(`B`) and `C`.
`k`		input	Number of columns of op(`A`) and rows of op(`B`).
`alpha`	host or device	input	Scaling factor for A*B of the type that corresponds to the computeType and Ctype, see the table below for details.
`A`	device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of matrix `A`.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `A`.
`B`	device	input	<type> array of dimensions `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
`Btype`		input	Enumerant specifying the datatype of matrix `B`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host or device	input	Scaling factor for C of the type that corresponds to the computeType and Ctype, see the table below for details. If `beta==0`, `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`Ctype`		input	Enumerant specifying the datatype of matrix `C`.
`ldc`		input	Leading dimension of a two-dimensional array used to store the matrix `C`.
`computeType`		input	Enumerant specifying the computation type.
`algo`		input	Enumerant specifying the algorithm. See cublasGemmAlgo_t.

cublasGemmEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_8I`	`CUDA_R_32F`
		`CUDA_R_16BF`	`CUDA_R_32F`
		`CUDA_R_16F`	`CUDA_R_32F`
		`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_8I`	`CUDA_C_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32` or `CUBLAS_COMPUTE_32F_EMULATED_16BFX9`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

Note

CUBLAS_COMPUTE_32I and CUBLAS_COMPUTE_32I_PEDANTIC compute types are only supported with A, B being 4-byte aligned and lda, ldb being multiples of 4. For better performance, it is also recommended that the IMMA kernel requirements for a regular data ordering listed here are met.
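A caller may want to verify these constraints before issuing the call. The following host-side sketch encodes only the two conditions stated in the note above. The helper name `int8_gemm_args_ok` is hypothetical, and the check makes no claim about any additional requirements the library may impose.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pre-flight check for the CUBLAS_COMPUTE_32I constraints:
 * A and B must be 4-byte aligned and lda, ldb must be multiples of 4.
 * Returns 1 when the arguments satisfy the constraints, 0 otherwise. */
static int int8_gemm_args_ok(const void *A, int lda, const void *B, int ldb)
{
    assert(lda > 0 && ldb > 0);
    int aligned = ((uintptr_t)A % 4 == 0) && ((uintptr_t)B % 4 == 0);
    int strided = (lda % 4 == 0) && (ldb % 4 == 0);
    return aligned && strided;
}
```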

The possible error values returned by this function and their meanings are listed in the following table.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasGemmEx() is only supported for GPUs with architecture capabilities equal to or greater than 5.0.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype`, `Btype` and `Ctype` or the algorithm `algo` is not supported.
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` and `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` and `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `alpha` or `beta` is NULL, or if `C` is NULL when `beta` is not zero, or if `Atype` or `Btype` or `Ctype` or `algo` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

Starting with release 11.2, using the typed functions instead of the extension functions (cublas**Ex()) helps in reducing the binary size when linking to static cuBLAS Library.

Also refer to: sgemm()

For more information about the numerical behavior of some GEMM algorithms, refer to the GEMM Algorithms Numerical Behavior section.

2.8.13.cublasGemmBatchedEx()

cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const void *alpha, const void *const Aarray[], cudaDataType_t Atype, int lda, const void *const Barray[], cudaDataType_t Btype, int ldb, const void *beta, void *const Carray[], cudaDataType_t Ctype, int ldc, int batchCount, cublasComputeType_t computeType, cublasGemmAlgo_t algo)
#if defined(__cplusplus)
cublasStatus_t cublasGemmBatchedEx(cublasHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const void *alpha, const void *const Aarray[], cudaDataType Atype, int lda, const void *const Barray[], cudaDataType Btype, int ldb, const void *beta, void *const Carray[], cudaDataType Ctype, int ldc, int batchCount, cudaDataType computeType, cublasGemmAlgo_t algo)
#endif

This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemmBatched() that performs the matrix-matrix multiplication of a batch of matrices and allows the user to individually specify the data types for each of the A, B and C matrix arrays, the precision of computation and the GEMM algorithm to be run. Like cublas<t>gemmBatched(), the batch is considered to be “uniform”, i.e. all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. The addresses of the input matrices and the output matrix of each instance of the batch are read from arrays of pointers passed to the function by the caller. Supported combinations of arguments are listed further down in this section.

Note

The second variant of the cublasGemmBatchedEx() function is provided for backward compatibility with C++ application code, where the computeType parameter is of type cudaDataType instead of cublasComputeType_t. C applications would still compile with the updated function signature.

$C\lbrack i\rbrack = \alpha\,\text{op}(A\lbrack i\rbrack)\,\text{op}(B\lbrack i\rbrack) + \beta C\lbrack i\rbrack,\text{ for } i \in \lbrack 0,batchCount - 1\rbrack$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$, and $C$ are arrays of pointers to matrices stored in column-major format with dimensions $\text{op}(A\lbrack i\rbrack)$ $m \times k$, $\text{op}(B\lbrack i\rbrack)$ $k \times n$, and $C\lbrack i\rbrack$ $m \times n$, respectively. Also, for matrix $A\lbrack i\rbrack$,

$\text{op}(A\lbrack i\rbrack) = \left\{ \begin{matrix} A\lbrack i\rbrack & \text{if transa == CUBLAS\_OP\_N} \\ A\lbrack i\rbrack^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A\lbrack i\rbrack^{H} & \text{if transa == CUBLAS\_OP\_C} \end{matrix} \right.$

and $\text{op}(B\lbrack i\rbrack)$ is defined similarly for matrix $B\lbrack i\rbrack$.

Note

$C\lbrack i\rbrack$ matrices must not overlap, i.e. the individual gemm operations must be computable independently; otherwise, the behavior is undefined.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemm() in different CUDA streams, rather than use this API.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`Aarray[i]`) that is non- or (conj.) transpose.
`transb`		input	Operation op(`Barray[i]`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix op(`Aarray[i]`) and `Carray[i]`.
`n`		input	Number of columns of matrix op(`Barray[i]`) and `Carray[i]`.
`k`		input	Number of columns of op(`Aarray[i]`) and rows of op(`Barray[i]`).
`alpha`	host or device	input	Scaling factor for matrix products of the type that corresponds to the computeType and Ctype, see the table below for details.
`Aarray`	device	input	Array of pointers to <Atype> array, with each array of dim. `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
`Atype`		input	Enumerant specifying the datatype of `Aarray`.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix `Aarray[i]`.
`Barray`	device	input	Array of pointers to <Btype> array, with each array of dim. `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise. All pointers must meet certain alignment criteria. Please see below for details.
`Btype`		input	Enumerant specifying the datatype of `Barray`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix `Barray[i]`.
`beta`	host or device	input	Scaling factor for `Carray` of the type that corresponds to the computeType and Ctype, see the table below for details. If `beta==0`, `Carray[i]` does not have to be a valid input.
`Carray`	device	in/out	Array of pointers to <Ctype> array, with each array of dim. `ldc x n` with `ldc>=max(1,m)`. Matrices `Carray[i]` should not overlap; otherwise, the behavior is undefined. All pointers must meet certain alignment criteria. Please see below for details.
`Ctype`		input	Enumerant specifying the datatype of `Carray`.
`ldc`		input	Leading dimension of a two-dimensional array used to store each matrix `Carray[i]`.
`batchCount`		input	Number of pointers contained in `Aarray`, `Barray` and `Carray`.
`computeType`		input	Enumerant specifying the computation type.
`algo`		input	Enumerant specifying the algorithm. See cublasGemmAlgo_t.

cublasGemmBatchedEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_8I`	`CUDA_R_32F`
		`CUDA_R_16BF`	`CUDA_R_32F`
		`CUDA_R_16F`	`CUDA_R_32F`
		`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_8I`	`CUDA_C_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32` or `CUBLAS_COMPUTE_32F_EMULATED_16BFX9`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

If Atype is CUDA_R_16F or CUDA_R_16BF, or computeType is any of the FAST options, or when the math mode or algo enables fast math modes, pointers (not the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally all pointers are aligned to at least 16 bytes. Otherwise it is recommended that they meet the following rule:

if k%8==0 then ensure intptr_t(ptr)%16==0,
if k%2==0 then ensure intptr_t(ptr)%4==0.
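The rule above can be checked on the host before a batch is launched. The following is a minimal C sketch, not part of cuBLAS: `required_alignment` and `ptr_meets_rule` are hypothetical helper names, and the pointer value is only inspected, never dereferenced.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a cuBLAS API): alignment in bytes that the
 * recommendation asks for, given the shared dimension k. */
unsigned required_alignment(int k)
{
    assert(k >= 0);
    if (k % 8 == 0) return 16; /* k%8==0 -> 16-byte aligned pointers */
    if (k % 2 == 0) return 4;  /* k%2==0 -> 4-byte aligned pointers  */
    return 1;                  /* the rule states no constraint here */
}

/* Returns 1 when ptr satisfies the rule for this k, 0 otherwise. */
int ptr_meets_rule(const void *ptr, int k)
{
    return (uintptr_t)ptr % required_alignment(k) == 0;
}
```

Such a check would be applied to every matrix pointer stored in the pointer arrays, not to the arrays themselves.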

Note

Compute types CUBLAS_COMPUTE_32I and CUBLAS_COMPUTE_32I_PEDANTIC are only supported with all pointers A[i], B[i] being 4-byte aligned and lda, ldb being multiples of 4. For better performance, it is also recommended that the IMMA kernel requirements for the regular data ordering listed here are met.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasGemmBatchedEx() is only supported for GPU with architecture capabilities equal to or greater than 5.0.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters`Atype`,`Btype` and`Ctype` or the algorithm,`algo` is not supported.
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` or `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` or `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `alpha` or `beta` is NULL, or if `Atype` or `Btype` or `Ctype` or `algo` or `computeType` is not supported.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

Also refer to: sgemm()

2.8.14.cublasGemmStridedBatchedEx()

cublasStatus_t cublasGemmStridedBatchedEx(cublasHandle_t handle,
    cublasOperation_t transa, cublasOperation_t transb,
    int m, int n, int k,
    const void *alpha,
    const void *A, cudaDataType_t Atype, int lda, long long int strideA,
    const void *B, cudaDataType_t Btype, int ldb, long long int strideB,
    const void *beta,
    void *C, cudaDataType_t Ctype, int ldc, long long int strideC,
    int batchCount,
    cublasComputeType_t computeType,
    cublasGemmAlgo_t algo)

#if defined(__cplusplus)
cublasStatus_t cublasGemmStridedBatchedEx(cublasHandle_t handle,
    cublasOperation_t transa, cublasOperation_t transb,
    int m, int n, int k,
    const void *alpha,
    const void *A, cudaDataType Atype, int lda, long long int strideA,
    const void *B, cudaDataType Btype, int ldb, long long int strideB,
    const void *beta,
    void *C, cudaDataType Ctype, int ldc, long long int strideC,
    int batchCount,
    cudaDataType computeType,
    cublasGemmAlgo_t algo)
#endif

This function supports the 64-bit Integer Interface.

This function is an extension of cublas<t>gemmStridedBatched() that performs the matrix-matrix multiplication of a batch of matrices and allows the user to individually specify the data types for each of the A, B and C matrices, the precision of computation and the GEMM algorithm to be run. Like cublas<t>gemmStridedBatched(), the batch is considered to be “uniform”, i.e. all instances have the same dimensions (m, n, k), leading dimensions (lda, ldb, ldc) and transpositions (transa, transb) for their respective A, B and C matrices. Input matrices A, B and output matrix C for each instance of the batch are located at fixed offsets, in number of elements, from their locations in the previous instance. Pointers to the A, B and C matrices of the first instance are passed to the function by the user, along with the offsets in number of elements (strideA, strideB and strideC) that determine the locations of the input and output matrices in subsequent instances.

Note

The second variant of the cublasGemmStridedBatchedEx() function is provided for backward compatibility with C++ application code, where the computeType parameter is of type cudaDataType_t instead of cublasComputeType_t. C applications would still compile with the updated function signature.

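$C + i*\text{strideC} = \alpha\,\text{op}(A + i*\text{strideA})\,\text{op}(B + i*\text{strideB}) + \beta(C + i*\text{strideC}),\text{ for } i \in \lbrack 0,\text{batchCount} - 1\rbrack$

where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are pointers to matrices stored in column-major format with dimensions $\text{op}(A\lbrack i\rbrack)$ $m \times k$, $\text{op}(B\lbrack i\rbrack)$ $k \times n$ and $C\lbrack i\rbrack$ $m \times n$, respectively. Also, for matrix $A\lbrack i\rbrack$

$\text{op}(A\lbrack i\rbrack) = \begin{cases} A\lbrack i\rbrack & \text{if transa == CUBLAS\_OP\_N} \\ A\lbrack i\rbrack^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A\lbrack i\rbrack^{H} & \text{if transa == CUBLAS\_OP\_C} \end{cases}$

and $\text{op}(B\lbrack i\rbrack)$ is defined similarly for matrix $B\lbrack i\rbrack$.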
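To make the strided addressing concrete, here is a small host-side reference in C for real single-precision data. It is a sketch under stated assumptions, not how cuBLAS computes the result: `op_elem` and `ref_gemm_strided_batched` are hypothetical names, only plain transposition is modeled (no conjugation), and all data lives in host memory.

```c
#include <assert.h>
#include <stddef.h>

/* Element (row, col) of op(M) for a column-major matrix M with leading
 * dimension ld; trans != 0 selects the transpose. */
float op_elem(const float *M, int ld, int row, int col, int trans)
{
    return trans ? M[col + (size_t)row * ld] : M[row + (size_t)col * ld];
}

/* Hypothetical host reference (not a cuBLAS routine) for float data. */
void ref_gemm_strided_batched(int transa, int transb, int m, int n, int k,
                              float alpha,
                              const float *A, int lda, long long strideA,
                              const float *B, int ldb, long long strideB,
                              float beta,
                              float *C, int ldc, long long strideC,
                              int batchCount)
{
    assert(m >= 0 && n >= 0 && k >= 0 && batchCount >= 0);
    for (int i = 0; i < batchCount; ++i) {
        /* Instance i starts i*stride elements after the first instance. */
        const float *Ai = A + i * strideA;
        const float *Bi = B + i * strideB;
        float *Ci = C + i * strideC;
        for (int col = 0; col < n; ++col) {
            for (int row = 0; row < m; ++row) {
                float acc = 0.0f;
                for (int p = 0; p < k; ++p)
                    acc += op_elem(Ai, lda, row, p, transa) *
                           op_elem(Bi, ldb, p, col, transb);
                float *c = &Ci[row + (size_t)col * ldc];
                /* With beta == 0, the old value of C is not read. */
                *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * *c;
            }
        }
    }
}
```

A reference of this kind is mainly useful for validating small test cases against the GPU result.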

Note

$C\lbrack i\rbrack$ matrices must not overlap, i.e. the individual gemm operations must be computable independently; otherwise, the behavior is undefined.

On certain problem sizes, it might be advantageous to make multiple calls to cublas<t>gemm() in different CUDA streams, rather than use this API.
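One simple way to guard against overlapping outputs is to compare strideC with the footprint of a single instance. The sketch below is an assumption-laden helper, not a cuBLAS function: `c_instances_disjoint` is a hypothetical name, and the test is sufficient but not necessary (instances that interleave inside the leading-dimension padding are rejected even though they do not overlap).

```c
#include <assert.h>

/* Hypothetical helper: each C[i] spans at most ldc*n elements, so a stride
 * of at least that many elements keeps consecutive instances disjoint. */
int c_instances_disjoint(int n, int ldc, long long strideC, int batchCount)
{
    assert(n >= 0 && ldc >= 1 && batchCount >= 0);
    if (batchCount <= 1 || n == 0) return 1; /* nothing can overlap */
    long long span = (long long)ldc * n;
    long long step = strideC < 0 ? -strideC : strideC;
    return step >= span;
}
```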

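Note

In the table below, A[i], B[i] and C[i] denote the A, B and C matrices of the i-th instance of the batch, located at offsets of i*strideA, i*strideB and i*strideC elements from A, B and C, respectively.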

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`transa`		input	Operation op(`A[i]`) that is non- or (conj.) transpose.
`transb`		input	Operation op(`B[i]`) that is non- or (conj.) transpose.
`m`		input	Number of rows of matrix op(`A[i]`) and`C[i]`.
`n`		input	Number of columns of matrix op(`B[i]`) and`C[i]`.
`k`		input	Number of columns of op(`A[i]`) and rows of op(`B[i]`).
`alpha`	host or device	input	Scaling factor for A*B of the <Scale Type> that corresponds to the computeType and Ctype; see the table below for details.
`A`	device	input	Pointer to <Atype> matrix, A, corresponds to the first instance of the batch, with dimensions`ldaxk` with`lda>=max(1,m)` if`transa==CUBLAS_OP_N` and`ldaxm` with`lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of`A`.
`lda`		input	Leading dimension of two-dimensional array used to store the matrix`A[i]`.
`strideA`		input	Value of type long long int that gives the offset in number of elements between`A[i]` and`A[i+1]`.
`B`	device	input	Pointer to <Btype> matrix, B, corresponds to the first instance of the batch, with dimensions`ldbxn` with`ldb>=max(1,k)` if`transb==CUBLAS_OP_N` and`ldbxk` with`ldb>=max(1,n)` otherwise.
`Btype`		input	Enumerant specifying the datatype of`B`.
`ldb`		input	Leading dimension of two-dimensional array used to store matrix`B[i]`.
`strideB`		input	Value of type long long int that gives the offset in number of elements between`B[i]` and`B[i+1]`.
`beta`	host or device	input	Scaling factor for C of the <Scale Type> that corresponds to the computeType and Ctype, see the table below for details. If`beta==0`,`C[i]` does not have to be a valid input.
`C`	device	in/out	Pointer to <Ctype> matrix, C, corresponds to the first instance of the batch, with dimensions`ldcxn` with`ldc>=max(1,m)`. Matrices`C[i]` should not overlap; otherwise, undefined behavior is expected.
`Ctype`		input	Enumerant specifying the datatype of`C`.
`ldc`		input	Leading dimension of a two-dimensional array used to store each matrix`C[i]`.
`strideC`		input	Value of type long long int that gives the offset in number of elements between`C[i]` and`C[i+1]`.
`batchCount`		input	Number of GEMMs to perform in the batch.
`computeType`		input	Enumerant specifying the computation type.
`algo`		input	Enumerant specifying the algorithm. See cublasGemmAlgo_t.

cublasGemmStridedBatchedEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_8I`	`CUDA_R_32F`
		`CUDA_R_16BF`	`CUDA_R_32F`
		`CUDA_R_16F`	`CUDA_R_32F`
		`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_8I`	`CUDA_C_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32` or `CUBLAS_COMPUTE_32F_EMULATED_16BFX9`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ARCH_MISMATCH`	cublasGemmStridedBatchedEx() is only supported for GPU with architecture capabilities equal to or greater than 5.0.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `Atype`, `Btype` and `Ctype` or the algorithm, `algo`, is not supported.
`CUBLAS_STATUS_INVALID_VALUE`	If `m<0` or `n<0` or `k<0`, or if `transa` or `transb` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda<max(1,m)` when `transa==CUBLAS_OP_N` or `lda<max(1,k)` otherwise, or if `ldb<max(1,k)` when `transb==CUBLAS_OP_N` or `ldb<max(1,n)` otherwise, or if `ldc<max(1,m)`, or if `alpha` or `beta` is NULL, or if `Atype` or `Btype` or `Ctype` or `algo` or `computeType` is not supported.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

Also refer to: sgemm()

2.8.15.cublasGemmGroupedBatchedEx()

cublasStatus_t cublasGemmGroupedBatchedEx(cublasHandle_t handle,
    const cublasOperation_t transa_array[],
    const cublasOperation_t transb_array[],
    const int m_array[], const int n_array[], const int k_array[],
    const void *alpha_array,
    const void *const Aarray[], cudaDataType_t Atype, const int lda_array[],
    const void *const Barray[], cudaDataType_t Btype, const int ldb_array[],
    const void *beta_array,
    void *const Carray[], cudaDataType_t Ctype, const int ldc_array[],
    int group_count,
    const int group_size[],
    cublasComputeType_t computeType)

This function supports the 64-bit Integer Interface.

Each group i shares one set of dimensions, leading dimensions, transpositions and scaling factors, while every problem in the group has its own matrix pointers. The operation is functionally equivalent to the following pseudocode:

idx = 0;
for i = 0:group_count - 1
    for j = 0:group_size[i] - 1
        gemmEx(transa_array[i], transb_array[i], m_array[i], n_array[i], k_array[i],
               alpha_array[i], Aarray[idx], Atype, lda_array[i],
               Barray[idx], Btype, ldb_array[i],
               beta_array[i], Carray[idx], Ctype, ldc_array[i],
               computeType, CUBLAS_GEMM_DEFAULT);
        idx += 1;
    end
end

For a given index idx that is part of group i, the dimensions are:

$\text{op}(\text{Aarray}\lbrack\text{idx}\rbrack)$: $\text{m\_array}\lbrack i\rbrack \times \text{k\_array}\lbrack i\rbrack$
$\text{op}(\text{Barray}\lbrack\text{idx}\rbrack)$: $\text{k\_array}\lbrack i\rbrack \times \text{n\_array}\lbrack i\rbrack$
$\text{Carray}\lbrack\text{idx}\rbrack$: $\text{m\_array}\lbrack i\rbrack \times \text{n\_array}\lbrack i\rbrack$

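Note

This API takes arrays of two different lengths. The arrays of dimensions, leading dimensions, transpositions and scaling factors are of length group_count, and the arrays of matrix pointers are of length problem_count, where $\text{problem\_count} = \sum_{i=0}^{\text{group\_count}-1} \text{group\_size}\lbrack i\rbrack$.

For matrix $A\lbrack\text{idx}\rbrack$ in group $i$

$\text{op}(A\lbrack\text{idx}\rbrack) = \begin{cases} A\lbrack\text{idx}\rbrack & \text{if transa\_array}\lbrack i\rbrack \text{ == CUBLAS\_OP\_N} \\ A\lbrack\text{idx}\rbrack^{T} & \text{if transa\_array}\lbrack i\rbrack \text{ == CUBLAS\_OP\_T} \\ A\lbrack\text{idx}\rbrack^{H} & \text{if transa\_array}\lbrack i\rbrack \text{ == CUBLAS\_OP\_C} \end{cases}$

and $\text{op}(B\lbrack\text{idx}\rbrack)$ is defined similarly for matrix $B\lbrack\text{idx}\rbrack$ in group $i$.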

Note

$C\lbrack\text{idx}\rbrack$ matrices must not overlap, that is, the individual gemm operations must be computable independently; otherwise, undefined behavior is expected.

On certain problem sizes, it might be advantageous to make multiple calls to cublasGemmBatchedEx() in different CUDA streams, rather than use this API.
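The relationship between the per-group arrays and the per-problem pointer arrays can be sketched on the host as follows. The helpers `problem_count` and `map_problems_to_groups` are hypothetical names introduced for illustration; they mirror the idx loop of the pseudocode and are not part of cuBLAS.

```c
#include <assert.h>

/* Total number of GEMMs: the length the pointer arrays must have. */
int problem_count(const int *group_size, int group_count)
{
    int total = 0;
    for (int i = 0; i < group_count; ++i) {
        assert(group_size[i] >= 0);
        total += group_size[i];
    }
    return total;
}

/* Fills group_of[idx] with the group i whose parameters (m_array[i],
 * lda_array[i], alpha_array[i], ...) apply to problem idx. */
void map_problems_to_groups(const int *group_size, int group_count,
                            int *group_of)
{
    int idx = 0;
    for (int i = 0; i < group_count; ++i)
        for (int j = 0; j < group_size[i]; ++j)
            group_of[idx++] = i;
}
```

With group sizes {2, 0, 3}, for example, five pointers are expected in each of Aarray, Barray and Carray: the first two use the parameters of group 0 and the last three those of group 2.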

Param.	Memory	In/out	Meaning	Array Length
`handle`		input	Handle to the cuBLAS library context.
`transa_array`	host	input	Array containing the operations, op(`A[idx]`), that is non- or (conj.) transpose for each group.	group_count
`transb_array`	host	input	Array containing the operations, op(`B[idx]`), that is non- or (conj.) transpose for each group.	group_count
`m_array`	host	input	Array containing the number of rows of matrix op(`A[idx]`) and`C[idx]` for each group.	group_count
`n_array`	host	input	Array containing the number of columns of op(`B[idx]`) and`C[idx]` for each group.	group_count
`k_array`	host	input	Array containing the number of columns of op(`A[idx]`) and rows of op(`B[idx]`) for each group.	group_count
`alpha_array`	host	input	Array containing the <Scale Type> scalar used for multiplication for each group.	group_count
`Aarray`	device	input	Array of pointers to <Atype> array, with each array of dim.`lda[i]xk[i]` with`lda[i]>=max(1,m[i])` if`transa[i]==CUBLAS_OP_N` and`lda[i]xm[i]` with`lda[i]>=max(1,k[i])` otherwise. All pointers must meet certain alignment criteria. Please see below for details.	problem_count
`Atype`		input	Enumerant specifying the datatype of`A`.
`lda_array`	host	input	Array containing the leading dimensions of two-dimensional arrays used to store each matrix`A[idx]` for each group.	group_count
`Barray`	device	input	Array of pointers to <Btype> array, with each array of dim.`ldb[i]xn[i]` with`ldb[i]>=max(1,k[i])` if`transb[i]==CUBLAS_OP_N` and`ldb[i]xk[i]` with`ldb[i]>=max(1,n[i])` otherwise. All pointers must meet certain alignment criteria. Please see below for details.	problem_count
`Btype`		input	Enumerant specifying the datatype of`B`.
`ldb_array`	host	input	Array containing the leading dimensions of two-dimensional arrays used to store each matrix`B[idx]` for each group.	group_count
`beta_array`	host	input	Array containing the <Scale Type> scalar used for multiplication for each group.	group_count
`Carray`	device	in/out	Array of pointers to <Ctype> array. It has dimensions`ldc[i]xn[i]` with`ldc[i]>=max(1,m[i])`. Matrices`C[idx]` should not overlap; otherwise, undefined behavior is expected. All pointers must meet certain alignment criteria. Please see below for details.	problem_count
`Ctype`		input	Enumerant specifying the datatype of`C`.
`ldc_array`	host	input	Array containing the leading dimensions of two-dimensional arrays used to store each matrix`C[idx]` for each group.	group_count
`group_count`	host	input	Number of groups
`group_size`	host	input	Array containing the number of pointers contained in Aarray, Barray and Carray for each group.	group_count
`computeType`		input	Enumerant specifying the computation type.

cublasGemmGroupedBatchedEx() supports the following Compute Type, Scale Type, Atype/Btype, and Ctype:

Compute Type	Scale Type (alpha and beta)	Atype/Btype	Ctype
`CUBLAS_COMPUTE_32F`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`
		`CUDA_R_16F`	`CUDA_R_16F`
		`CUDA_R_32F`	`CUDA_R_32F`
`CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUBLAS_COMPUTE_32F_FAST_TF32`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`

If Atype is CUDA_R_16F or CUDA_R_16BF, or if the computeType is any of the FAST options, pointers (not the pointer arrays) placed in the GPU memory must be properly aligned to avoid misaligned memory access errors. Ideally all pointers are aligned to at least 16 bytes. Otherwise it is required that they meet the following rule:

if (k*AtypeSize)%16==0 then ensure intptr_t(ptr)%16==0,
if (k*AtypeSize)%4==0 then ensure intptr_t(ptr)%4==0.
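Unlike the element-count rule of the batched API, this rule is expressed in bytes. A minimal sketch of the computation, assuming AtypeSize means the size in bytes of one Atype element (2 for CUDA_R_16F and CUDA_R_16BF, 4 for CUDA_R_32F); `grouped_required_alignment` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pointer alignment in bytes demanded by the rule
 * for a group with shared dimension k. */
unsigned grouped_required_alignment(int k, size_t atype_size)
{
    assert(k >= 0 && atype_size > 0);
    size_t bytes = (size_t)k * atype_size;
    if (bytes % 16 == 0) return 16;
    if (bytes % 4 == 0) return 4;
    return 1; /* the rule states no constraint for other sizes */
}
```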

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	If `transa_array`, `transb_array`, `m_array`, `n_array`, `k_array`, `alpha_array`, `lda_array`, `ldb_array`, `beta_array`, `ldc_array`, or `group_size` is NULL, or if `group_count<0`, or if `m_array[i]<0`, `n_array[i]<0`, `k_array[i]<0` or `group_size[i]<0`, or if `transa_array[i]` or `transb_array[i]` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_C`, `CUBLAS_OP_T`, or if `lda_array[i]<max(1,m_array[i])` when `transa_array[i]==CUBLAS_OP_N` or `lda_array[i]<max(1,k_array[i])` otherwise, or if `ldb_array[i]<max(1,k_array[i])` when `transb_array[i]==CUBLAS_OP_N` or `ldb_array[i]<max(1,n_array[i])` otherwise, or if `ldc_array[i]<max(1,m_array[i])`.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.
`CUBLAS_STATUS_NOT_SUPPORTED`	The pointer mode is set to `CUBLAS_POINTER_MODE_DEVICE`, or `Atype` or `Btype` or `Ctype` or `computeType` is not supported.

2.8.16.cublasCsyrkEx()

cublasStatus_t cublasCsyrkEx(cublasHandle_t handle,
    cublasFillMode_t uplo, cublasOperation_t trans,
    int n, int k,
    const cuComplex *alpha,
    const void *A, cudaDataType Atype, int lda,
    const cuComplex *beta,
    cuComplex *C, cudaDataType Ctype, int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCsyrk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex.

This function performs the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \end{cases}$
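The update can be illustrated with a host-side reference in C. This is a sketch for checking small cases, not the library's implementation: `cfloat`, `c_mul`, `c_add` and `ref_csyrk` are hypothetical names, `cfloat` merely stands in for a single-precision complex value, and the old contents of C are always read, so C must be initialized even when beta is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a single-precision complex number (hypothetical type). */
typedef struct { float re, im; } cfloat;

cfloat c_mul(cfloat a, cfloat b)
{
    cfloat r = { a.re * b.re - a.im * b.im, a.re * b.im + a.im * b.re };
    return r;
}

cfloat c_add(cfloat a, cfloat b)
{
    cfloat r = { a.re + b.re, a.im + b.im };
    return r;
}

/* Host reference for C = alpha*op(A)*op(A)^T + beta*C (column-major).
 * lower != 0 updates the lower triangle, otherwise the upper one;
 * trans != 0 selects op(A) = A^T. The other triangle is never touched. */
void ref_csyrk(int lower, int trans, int n, int k, cfloat alpha,
               const cfloat *A, int lda, cfloat beta, cfloat *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;
        int last = lower ? n - 1 : j;
        for (int i = first; i <= last; ++i) {
            cfloat acc = { 0.0f, 0.0f };
            for (int p = 0; p < k; ++p) {
                cfloat a_ip = trans ? A[p + (size_t)i * lda]
                                    : A[i + (size_t)p * lda];
                cfloat a_jp = trans ? A[p + (size_t)j * lda]
                                    : A[j + (size_t)p * lda];
                acc = c_add(acc, c_mul(a_ip, a_jp)); /* no conjugation */
            }
            cfloat *c = &C[i + (size_t)j * ldc];
            *c = c_add(c_mul(alpha, acc), c_mul(beta, *c));
        }
    }
}
```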

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates if matrix`C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or transpose.
`n`		input	Number of rows of matrix op(`A`) and`C`.
`k`		input	Number of columns of matrix op(`A`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension`ldaxk` with`lda>=max(1,n)` if`trans==CUBLAS_OP_N` and`ldaxn` with`lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of matrix`A`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix A.
`beta`	host or device	input	<type> scalar used for multiplication. If`beta==0` then`C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension`ldcxn`, with`ldc>=max(1,n)`.
`Ctype`		input	Enumerant specifying the datatype of matrix`C`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix`C`.

The matrix type combinations supported for cublasCsyrkEx() are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N` or `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `Atype` or `Ctype` is not supported.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters`Atype` and`Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to NETLIB documentation:

ssyrk(), dsyrk(), csyrk(), zsyrk()

2.8.17.cublasCsyrk3mEx()

cublasStatus_t cublasCsyrk3mEx(cublasHandle_t handle,
    cublasFillMode_t uplo, cublasOperation_t trans,
    int n, int k,
    const cuComplex *alpha,
    const void *A, cudaDataType Atype, int lda,
    const cuComplex *beta,
    cuComplex *C, cudaDataType Ctype, int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCsyrk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex. This routine is implemented using the Gauss complexity reduction algorithm, which can lead to a performance increase of up to 25%.
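The Gauss complexity reduction rests on the identity that a complex product can be formed with three real multiplications instead of four. The scalar C sketch below only illustrates that identity; how the library applies it to whole matrices is not described here, and `cfloat` and `mul3` are hypothetical names.

```c
typedef struct { float re, im; } cfloat; /* hypothetical stand-in type */

/* Complex product x*y with three real multiplications (Gauss's trick)
 * instead of the usual four, at the cost of extra additions. */
cfloat mul3(cfloat x, cfloat y)
{
    float k1 = y.re * (x.re + x.im);
    float k2 = x.re * (y.im - y.re);
    float k3 = x.im * (y.re + y.im);
    cfloat r = { k1 - k3, k1 + k2 }; /* re = ac - bd, im = ad + bc */
    return r;
}
```

Replacing four multiplications by three is consistent with the quoted gain of up to 25%, since multiplications dominate the cost when the operands are matrices. The extra additions and subtractions can change rounding behavior relative to the four-multiplication form.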

This function performs the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{T} & \text{if trans == CUBLAS\_OP\_T} \end{cases}$

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates if matrix`C` lower or upper part is stored, the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	Operation op(`A`) that is non- or transpose.
`n`		input	Number of rows of matrix op(`A`) and`C`.
`k`		input	Number of columns of matrix op(`A`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension`ldaxk` with`lda>=max(1,n)` if`trans==CUBLAS_OP_N` and`ldaxn` with`lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of matrix`A`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix A.
`beta`	host or device	input	<type> scalar used for multiplication. If`beta==0` then`C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension`ldcxn`, with`ldc>=max(1,n)`.
`Ctype`		input	Enumerant specifying the datatype of matrix`C`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix`C`.

The matrix type combinations supported for cublasCsyrk3mEx() are listed below:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N` or `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `Atype` or `Ctype` is not supported.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters`Atype` and`Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to NETLIB documentation:

ssyrk(), dsyrk(), csyrk(), zsyrk()

2.8.18.cublasCherkEx()

cublasStatus_t cublasCherkEx(cublasHandle_t handle,
    cublasFillMode_t uplo, cublasOperation_t trans,
    int n, int k,
    const float *alpha,
    const void *A, cudaDataType Atype, int lda,
    const float *beta,
    cuComplex *C, cudaDataType Ctype, int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCherk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex.

This function performs the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$
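Compared with the symmetric update, the Hermitian one conjugates the second factor, takes real scaling factors, and keeps the diagonal of C real. A host-side C sketch of these three points follows; `cfloat` and `ref_cherk` are hypothetical names, the code is a reference for small test cases rather than the library's algorithm, and C is always read, so it must be initialized even when beta is zero.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { float re, im; } cfloat; /* hypothetical stand-in type */

/* Host reference for C = alpha*op(A)*op(A)^H + beta*C with real alpha and
 * beta. lower != 0 updates the lower triangle, otherwise the upper one;
 * conj_trans != 0 selects op(A) = A^H. */
void ref_cherk(int lower, int conj_trans, int n, int k, float alpha,
               const cfloat *A, int lda, float beta, cfloat *C, int ldc)
{
    assert(n >= 0 && k >= 0);
    for (int j = 0; j < n; ++j) {
        int first = lower ? j : 0;
        int last = lower ? n - 1 : j;
        for (int i = first; i <= last; ++i) {
            float re = 0.0f, im = 0.0f;
            for (int p = 0; p < k; ++p) {
                /* x = op(A)(i,p), y = op(A)(j,p); accumulate x*conj(y). */
                cfloat x = conj_trans ? A[p + (size_t)i * lda]
                                      : A[i + (size_t)p * lda];
                cfloat y = conj_trans ? A[p + (size_t)j * lda]
                                      : A[j + (size_t)p * lda];
                if (conj_trans) { x.im = -x.im; y.im = -y.im; }
                re += x.re * y.re + x.im * y.im;
                im += x.im * y.re - x.re * y.im;
            }
            cfloat *c = &C[i + (size_t)j * ldc];
            c->re = alpha * re + beta * c->re;
            /* Diagonal entries of a Hermitian matrix are real. */
            c->im = (i == j) ? 0.0f : alpha * im + beta * c->im;
        }
    }
}
```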

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates if matrix`C` lower or upper part is stored, the other Hermitian part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`n`		input	Number of rows of matrix op(`A`) and`C`.
`k`		input	Number of columns of matrix op(`A`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of matrix`A`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix`A`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension`ldcxn`, with`ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`Ctype`		input	Enumerant specifying the datatype of matrix`C`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix`C`.

The matrix type combinations supported for cublasCherkEx() are listed in the following table:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N` or `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `Atype` or `Ctype` is not supported.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters`Atype` and`Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to NETLIB documentation:

cherk()

2.8.19.cublasCherk3mEx()

cublasStatus_t cublasCherk3mEx(cublasHandle_t handle,
    cublasFillMode_t uplo, cublasOperation_t trans,
    int n, int k,
    const float *alpha,
    const void *A, cudaDataType Atype, int lda,
    const float *beta,
    cuComplex *C, cudaDataType Ctype, int ldc)

This function supports the 64-bit Integer Interface.

This function is an extension of cublasCherk() where the input matrix and output matrix can have a lower precision but the computation is still done in the type cuComplex. This routine is implemented using the Gauss complexity reduction algorithm, which can lead to a performance increase of up to 25%.

This function performs the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix with dimensions $\text{op}(A)$ $n \times k$. Also, for matrix $A$

$\text{op}(A) = \begin{cases} A & \text{if trans == CUBLAS\_OP\_N} \\ A^{H} & \text{if trans == CUBLAS\_OP\_C} \end{cases}$

Note

This routine is only supported on GPUs with architecture capabilities equal to or greater than 5.0

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`uplo`		input	Indicates if matrix`C` lower or upper part is stored, the other Hermitian part is not referenced.
`trans`		input	Operation op(`A`) that is non- or (conj.) transpose.
`n`		input	Number of rows of matrix op(`A`) and`C`.
`k`		input	Number of columns of matrix op(`A`).
`alpha`	host or device	input	<type> scalar used for multiplication.
`A`	device	input	<type> array of dimension`ldaxk` with`lda>=max(1,n)` if`trans==CUBLAS_OP_N` and`ldaxn` with`lda>=max(1,k)` otherwise.
`Atype`		input	Enumerant specifying the datatype of matrix`A`.
`lda`		input	Leading dimension of two-dimensional array used to store matrix`A`.
`beta`	host or device	input	<type> scalar used for multiplication. If `beta==0` then `C` does not have to be a valid input.
`C`	device	in/out	<type> array of dimension`ldcxn`, with`ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`Ctype`		input	Enumerant specifying the datatype of matrix`C`.
`ldc`		input	Leading dimension of two-dimensional array used to store matrix`C`.

The matrix type combinations supported for cublasCherk3mEx() are listed in the following table:

A	C
`CUDA_C_8I`	`CUDA_C_32F`
`CUDA_C_32F`	`CUDA_C_32F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If `n<0` or `k<0`, or if `uplo` is not one of `CUBLAS_FILL_MODE_LOWER` and `CUBLAS_FILL_MODE_UPPER`, or if `trans` is not one of `CUBLAS_OP_N`, `CUBLAS_OP_T` and `CUBLAS_OP_C`, or if `lda<max(1,n)` when `trans==CUBLAS_OP_N` or `lda<max(1,k)` otherwise, or if `ldc<max(1,n)`, or if `Atype` or `Ctype` is not supported.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters`Atype` and`Ctype` is not supported.
`CUBLAS_STATUS_ARCH_MISMATCH`	The device has a compute capability lower than 5.0.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.

For references please refer to NETLIB documentation:

cherk()

2.8.20.cublasNrm2Ex()

cublasStatus_t cublasNrm2Ex(cublasHandle_t handle,
    int n,
    const void *x, cudaDataType xType, int incx,
    void *result, cudaDataType resultType,
    cudaDataType executionType)

This function supports the 64-bit Integer Interface.

This function is an API generalization of the routine cublas<t>nrm2() where input data, output data and compute type can be specified independently.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector`x`.
`x`	device	input	<type> vector with`n` elements.
`xType`		input	Enumerant specifying the datatype of vector`x`.
`incx`		input	Stride between consecutive elements of`x`.
`result`	host or device	output	The resulting norm, which is set to `0` if `n<=0` or `incx<=0`.
`resultType`		input	Enumerant specifying the datatype of the`result`.
`executionType`		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasNrm2Ex() are listed below:

x	result	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_C_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_64F`	`CUDA_R_64F`	`CUDA_R_64F`
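A caller that wants to anticipate CUBLAS_STATUS_NOT_SUPPORTED could encode the table as data and look a combination up before calling the routine. The following C sketch assumes local stand-in enumerants (`R_16F`, `C_32F`, ...) in place of the real cudaDataType values, and `nrm2ex_combo_supported` is a hypothetical helper name; real code would use the CUDA definitions and must be kept in sync with the library version in use.

```c
#include <stddef.h>

/* Local stand-ins for the cudaDataType enumerants named in the table
 * (hypothetical; real code would use the CUDA definitions). */
typedef enum { R_16F, R_16BF, R_32F, C_32F, R_64F, C_64F } dtype;

typedef struct { dtype x, result, execution; } nrm2_combo;

/* One entry per row of the table above. */
static const nrm2_combo k_supported[] = {
    { R_16F,  R_16F,  R_32F },
    { R_16BF, R_16BF, R_32F },
    { R_32F,  R_32F,  R_32F },
    { C_32F,  R_32F,  R_32F },
    { R_64F,  R_64F,  R_64F },
    { C_64F,  R_64F,  R_64F },
};

/* Returns 1 if the combination appears in the table, 0 otherwise. */
int nrm2ex_combo_supported(dtype x, dtype result, dtype execution)
{
    size_t count = sizeof k_supported / sizeof k_supported[0];
    for (size_t i = 0; i < count; ++i)
        if (k_supported[i].x == x && k_supported[i].result == result &&
            k_supported[i].execution == execution)
            return 1;
    return 0;
}
```

Note that complex inputs pair with real result and execution types, since the norm of a complex vector is real.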

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters`xType`,`resultType` and`executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	If `xType` or `resultType` or `executionType` is not supported, or if `result` is NULL.

For references please refer to NETLIB documentation:

snrm2(), dnrm2(), scnrm2(), dznrm2()

2.8.21.cublasAxpyEx()

cublasStatus_t cublasAxpyEx(cublasHandle_t handle,
    int n,
    const void *alpha, cudaDataType alphaType,
    const void *x, cudaDataType xType, int incx,
    void *y, cudaDataType yType, int incy,
    cudaDataType executiontype);

This function supports the 64-bit Integer Interface.

This function is an API generalization of the routine cublas<t>axpy() where input data, output data and compute type can be specified independently.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector`x` and`y`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`alphaType`		input	Enumerant specifying the datatype of scalar`alpha`.
`x`	device	input	<type> vector with`n` elements.
`xType`		input	Enumerant specifying the datatype of vector`x`.
`incx`		input	Stride between consecutive elements of`x`.
`y`	device	in/out	<type> vector with`n` elements.
`yType`		input	Enumerant specifying the datatype of vector`y`.
`incy`		input	Stride between consecutive elements of`y`.
`executionType`		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasAxpyEx() are listed in the following table:

alpha	x	y	execution
`CUDA_R_32F`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`
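For reference, the strided update that axpy performs can be written on the host as below. This sketch covers only the row where every type is CUDA_R_32F, assumes positive increments, and uses the hypothetical name `ref_axpy`; it does not model the reduced-precision storage types of the other rows.

```c
#include <assert.h>
#include <stddef.h>

/* Host reference for y = alpha*x + y over n strided elements
 * (positive increments only, an assumption of this sketch). */
void ref_axpy(int n, float alpha, const float *x, int incx,
              float *y, int incy)
{
    assert(incx > 0 && incy > 0);
    for (int i = 0; i < n; ++i)
        y[(size_t)i * incy] += alpha * x[(size_t)i * incx];
}
```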

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `xType`, `yType`, and `executionType` is not supported.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.
`CUBLAS_STATUS_INVALID_VALUE`	`alphaType` or `xType` or `yType` or `executionType` is not supported.

For references please refer to NETLIB documentation:

saxpy(), daxpy(), caxpy(), zaxpy()

2.8.22.cublasDotEx()

cublasStatus_t cublasDotEx(cublasHandle_t handle,
                           int n,
                           const void *x,
                           cudaDataType xType,
                           int incx,
                           const void *y,
                           cudaDataType yType,
                           int incy,
                           void *result,
                           cudaDataType resultType,
                           cudaDataType executionType);
cublasStatus_t cublasDotcEx(cublasHandle_t handle,
                            int n,
                            const void *x,
                            cudaDataType xType,
                            int incx,
                            const void *y,
                            cudaDataType yType,
                            int incy,
                            void *result,
                            cudaDataType resultType,
                            cudaDataType executionType);

These functions support the 64-bit Integer Interface.

These functions are an API generalization of the routines cublas<t>dot() and cublas<t>dotc() where input data, output data and compute type can be specified independently. Note: cublas<t>dotc() is the conjugated dot product, and cublas<t>dotu() is the unconjugated dot product.
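The difference between the conjugated and unconjugated variants can be illustrated with a host-only reference in C99. This is a minimal sketch, not the library's implementation: the names `ref_dotu` and `ref_dotc` are hypothetical, strides are assumed to be positive, and the conjugate is taken on the elements of `x`, following the BLAS convention for cdotc().

```c
#include <complex.h>
#include <stddef.h>

/* Unconjugated dot product: sum of x[i] * y[i] (hypothetical reference helper). */
float complex ref_dotu(int n, const float complex *x, int incx,
                       const float complex *y, int incy)
{
    float complex acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += x[(ptrdiff_t)i * incx] * y[(ptrdiff_t)i * incy];
    return acc; /* stays 0 when n <= 0 */
}

/* Conjugated dot product: sum of conj(x[i]) * y[i] (hypothetical reference helper). */
float complex ref_dotc(int n, const float complex *x, int incx,
                       const float complex *y, int incy)
{
    float complex acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += conjf(x[(ptrdiff_t)i * incx]) * y[(ptrdiff_t)i * incy];
    return acc; /* stays 0 when n <= 0 */
}
```

For real-valued inputs both helpers return the same value; they differ only when `x` has a non-zero imaginary part.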

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	input	<type> vector with `n` elements.
`xType`		input	Enumerant specifying the datatype of vector `x`.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	input	<type> vector with `n` elements.
`yType`		input	Enumerant specifying the datatype of vector `y`.
`incy`		input	Stride between consecutive elements of `y`.
`result`	host or device	output	The resulting dot product, which is set to `0` if `n<=0`.
`resultType`		input	Enumerant specifying the datatype of the `result`.
`executionType`		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasDotEx() and cublasDotcEx() are listed below:

x	y	result	execution
`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by these functions and their meanings are listed in the following table:

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully.
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized.
`CUBLAS_STATUS_ALLOC_FAILED`	The reduction buffer could not be allocated.
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `xType`, `yType`, `resultType` and `executionType` is not supported.
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU.
`CUBLAS_STATUS_INVALID_VALUE`	`xType` or `yType` or `resultType` or `executionType` is not supported.

For references please refer to NETLIB documentation:

sdot(), ddot(), cdotu(), cdotc(), zdotu(), zdotc()

2.8.23.cublasRotEx()

cublasStatus_t cublasRotEx(cublasHandle_t handle,
                           int n,
                           void *x,
                           cudaDataType xType,
                           int incx,
                           void *y,
                           cudaDataType yType,
                           int incy,
                           const void *c, /* host or device pointer */
                           const void *s,
                           cudaDataType csType,
                           cudaDataType executiontype);

This function supports the 64-bit Integer Interface.

This function is an extension to the routine cublas<t>rot() where input data, output data, cosine/sine type, and compute type can be specified independently.

This function applies the Givens rotation matrix (i.e., a counter-clockwise rotation in the x,y plane by the angle defined by $\cos(\alpha) = c$, $\sin(\alpha) = s$):

$G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}$

to vectors x and y.
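Element-wise, applying $G$ to each pair $(x_i, y_i)$ gives $x_i \leftarrow c \cdot x_i + s \cdot y_i$ and $y_i \leftarrow c \cdot y_i - s \cdot x_i$. The following host-only C sketch illustrates this for single-precision real data; the name `ref_rot` is hypothetical, and positive strides are assumed.

```c
#include <stddef.h>

/* Reference Givens rotation on strided vectors (hypothetical helper):
 *   x[i] <- c * x[i] + s * y[i]
 *   y[i] <- c * y[i] - s * x[i]
 * Both updates use the values of x[i] and y[i] from before the rotation. */
void ref_rot(int n, float *x, int incx, float *y, int incy, float c, float s)
{
    for (int i = 0; i < n; ++i) {
        float *xi = &x[(ptrdiff_t)i * incx];
        float *yi = &y[(ptrdiff_t)i * incy];
        float xv = *xi;
        float yv = *yi;
        *xi = c * xv + s * yv;
        *yi = c * yv - s * xv;
    }
}
```

With `c = 0` and `s = 1`, the pair (1, 0) is mapped to (0, -1).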

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vectors `x` and `y`.
`x`	device	in/out	<type> vector with `n` elements.
`xType`		input	Enumerant specifying the datatype of vector `x`.
`incx`		input	Stride between consecutive elements of `x`.
`y`	device	in/out	<type> vector with `n` elements.
`yType`		input	Enumerant specifying the datatype of vector `y`.
`incy`		input	Stride between consecutive elements of `y`.
`c`	host or device	input	Cosine element of the rotation matrix.
`s`	host or device	input	Sine element of the rotation matrix.
`csType`		input	Enumerant specifying the datatype of `c` and `s`.
`executionType`		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasRotEx() are listed below:

executionType	xType / yType	csType
`CUDA_R_32F`	`CUDA_R_16BF` `CUDA_R_16F` `CUDA_R_32F`	`CUDA_R_16BF` `CUDA_R_16F` `CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F` `CUDA_C_32F`	`CUDA_R_32F` `CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F` `CUDA_C_64F`	`CUDA_R_64F` `CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU

For references please refer to NETLIB documentation:

srot(), drot(), crot(), csrot(), zrot(), zdrot()

2.8.24.cublasScalEx()

cublasStatus_t cublasScalEx(cublasHandle_t handle,
                            int n,
                            const void *alpha,
                            cudaDataType alphaType,
                            void *x,
                            cudaDataType xType,
                            int incx,
                            cudaDataType executionType);

This function supports the 64-bit Integer Interface.

Param.	Memory	In/out	Meaning
`handle`		input	Handle to the cuBLAS library context.
`n`		input	Number of elements in the vector `x`.
`alpha`	host or device	input	<type> scalar used for multiplication.
`alphaType`		input	Enumerant specifying the datatype of scalar `alpha`.
`x`	device	in/out	<type> vector with `n` elements.
`xType`		input	Enumerant specifying the datatype of vector `x`.
`incx`		input	Stride between consecutive elements of `x`.
`executionType`		input	Enumerant specifying the datatype in which the computation is executed.

The datatype combinations currently supported for cublasScalEx() are listed below:

alpha	x	execution
`CUDA_R_32F`	`CUDA_R_16F`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_32F`
`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`
`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`
`CUDA_C_32F`	`CUDA_C_32F`	`CUDA_C_32F`
`CUDA_C_64F`	`CUDA_C_64F`	`CUDA_C_64F`

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	The library was not initialized
`CUBLAS_STATUS_NOT_SUPPORTED`	The combination of the parameters `xType` and `executionType` is not supported
`CUBLAS_STATUS_EXECUTION_FAILED`	The function failed to launch on the GPU
`CUBLAS_STATUS_INVALID_VALUE`	`alphaType` or `xType` or `executionType` is not supported

For references please refer to NETLIB documentation:

sscal(), dscal(), csscal(), cscal(), zdscal(), zscal()

3.Using the cuBLASLt API

3.1.General Description

The cuBLASLt library is a new lightweight library dedicated to GEneral Matrix-to-matrix Multiply (GEMM) operations with a new flexible API. This new library adds flexibility in matrix data layouts, input types, compute types, and also in choosing the algorithmic implementations and heuristics through parameter programmability.

Once a set of options for the intended GEMM operation is identified by the user, these options can be used repeatedly for different inputs. This is analogous to how cuFFT and FFTW first create a plan and then reuse it for FFTs of the same size and type with different input data.

Note

The cuBLASLt library does not guarantee support for all possible sizes and configurations; however, since CUDA 12.2 Update 2, the problem size limitations on m, n, and batch size have been largely resolved. The main focus of the library is to provide the most performant kernels, which might have some implied limitations. Some non-standard configurations may require the user to handle them manually, typically by decomposing the problem into smaller parts (see Problem Size Limitations).

3.1.1.Problem Size Limitations

There are inherent problem size limitations that result from limitations in CUDA grid dimensions. For example, many kernels do not support batch sizes greater than 65535 due to a limitation on the z dimension of a grid. There are similar restrictions on the m and n values for a given problem.

In cases where a problem cannot be run by a single kernel, cuBLASLt will attempt to decompose the problem into multiple sub-problems and solve it by running the kernel on each sub-problem.

There are some restrictions on cuBLASLt internal problem decomposition which are summarized below:

Amax computations are not supported. This means that CUBLASLT_MATMUL_DESC_AMAX_D_POINTER and CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_AMAX_POINTER must be left unset (see cublasLtMatmulDescAttributes_t).
All matrix layouts must have CUBLASLT_MATRIX_LAYOUT_ORDER set to CUBLASLT_ORDER_COL (see cublasLtOrder_t).
cuBLASLt will not partition along the n dimension when CUBLASLT_MATMUL_DESC_EPILOGUE is set to CUBLASLT_EPILOGUE_DRELU_BGRAD or CUBLASLT_EPILOGUE_DGELU_BGRAD (see cublasLtEpilogue_t).

To overcome these limitations, a user may want to partition the problem themselves, launch kernels for each sub-problem, and compute any necessary reductions to combine the results.
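As an illustration of manual partitioning along the batch dimension, the following host-only C sketch splits a batch count into chunks that each stay within a given limit (for example, the 65535 limit mentioned above). The helper names `chunk_count` and `chunk_range` are hypothetical and are not part of cuBLASLt; the caller would still need to offset the matrix pointers and launch one matmul per chunk.

```c
#include <stdint.h>

/* Number of sub-problems needed to cover `total` batches with at most
 * `limit` batches each. Requires limit > 0. */
int64_t chunk_count(int64_t total, int64_t limit)
{
    return (total + limit - 1) / limit;
}

/* Computes the first batch index and the batch count of chunk `idx`
 * (0-based). Returns 0 on success and -1 if the arguments are invalid. */
int chunk_range(int64_t total, int64_t limit, int64_t idx,
                int64_t *start, int64_t *count)
{
    if (limit <= 0 || idx < 0 || idx >= chunk_count(total, limit))
        return -1;
    *start = idx * limit;
    *count = (total - *start < limit) ? (total - *start) : limit;
    return 0;
}
```

For a batch of 100000 problems and a limit of 65535, this yields two chunks: batches 0 to 65534, and the remaining 34465 batches starting at index 65535.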

3.1.2.Heuristics Cache

cuBLASLt uses heuristics to pick the most suitable matmul kernel for execution based on the problem sizes, GPU configuration, and other parameters. This requires performing some computations on the host CPU, which could take tens of microseconds. To overcome this overhead, it is recommended to query the heuristics once using cublasLtMatmulAlgoGetHeuristic() and then reuse the result for subsequent computations using cublasLtMatmul().

For the cases where querying heuristics once and then reusing them is not feasible, cuBLASLt implements a heuristics cache that maps matmul problems to kernels previously selected by heuristics. The heuristics cache uses an LRU-like eviction policy and is thread-safe.

The user can control the heuristics cache capacity with the CUBLASLT_HEURISTICS_CACHE_CAPACITY environment variable or with the cublasLtHeuristicsCacheSetCapacity() function, which has higher precedence. The capacity is measured in number of entries and might be rounded up to the nearest multiple of some factor for performance reasons. Each entry takes about 360 bytes, but this size is subject to change. The default capacity is 8192 entries.

Note

Setting capacity to zero disables the cache completely. This can be useful for workloads that do not have a steady state and for which cache operations may have higher overhead than regular heuristics computations.

Note

For performance reasons, the cache does not behave ideally, so it is sometimes necessary to increase its capacity to 1.5x-2x the anticipated number of unique matmul problems to achieve a nearly perfect hit rate.

3.1.3.cuBLASLt Logging

The cuBLASLt logging mechanism can be enabled by setting the following environment variables before launching the target application:

CUBLASLT_LOG_LEVEL=<level>, where <level> is one of the following levels:
- 0 - Off - logging is disabled (default)
- 1 - Error - only errors will be logged
- 2 - Trace - API calls that launch CUDA kernels will log their parameters and important information
- 3 - Hints - hints that can potentially improve the application’s performance
- 4 - Info - provides general information about the library execution, may contain details about heuristic status
- 5 - API Trace - API calls will log their parameter and important information
CUBLASLT_LOG_MASK=<mask>, where <mask> is a combination of the following flags:
- 0 - Off
- 1 - Error
- 2 - Trace
- 4 - Hints
- 8 - Info
- 16 - API Trace
For example, use CUBLASLT_LOG_MASK=5 to enable Error and Hints messages.
CUBLASLT_LOG_FILE=<file_name>, where <file_name> is a path to a logging file. The file name may contain %i, which will be replaced with the process ID. For example, file_name_%i.log.

If CUBLASLT_LOG_FILE is not set, the log messages are printed to stdout.
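The mask is a bitwise OR of the flag values listed above. The following C sketch shows how a mask could be composed and formatted as an environment variable assignment; the enumerator names and the `format_log_mask` helper are hypothetical and only mirror the numeric values from the list.

```c
#include <stdio.h>

/* Hypothetical names for the CUBLASLT_LOG_MASK flag values listed above. */
enum {
    LOG_MASK_ERROR     = 1,
    LOG_MASK_TRACE     = 2,
    LOG_MASK_HINTS     = 4,
    LOG_MASK_INFO      = 8,
    LOG_MASK_API_TRACE = 16
};

/* Writes "CUBLASLT_LOG_MASK=<mask>" into buf and returns the value
 * returned by snprintf (the length the full string needs). */
int format_log_mask(char *buf, size_t len, unsigned mask)
{
    return snprintf(buf, len, "CUBLASLT_LOG_MASK=%u", mask);
}
```

Combining `LOG_MASK_ERROR | LOG_MASK_HINTS` gives 5, which matches the example above.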

Another option is to use the experimental cuBLASLt logging API. See:

cublasLtLoggerSetCallback(), cublasLtLoggerSetFile(), cublasLtLoggerOpenFile(), cublasLtLoggerSetLevel(), cublasLtLoggerSetMask(), cublasLtLoggerForceDisable()

3.1.4.Narrow Precision Data Types Usage

What we call here narrow precision data types were first introduced as 8-bit floating point data types (FP8) with Ada and Hopper GPUs (compute capability 8.9 and above), and were designed to further accelerate matrix multiplications. There are two types of FP8 available:

CUDA_R_8F_E4M3 is designed to be accurate at a smaller dynamic range than half precision. The E4 and M3 indicate a 4-bit exponent and a 3-bit mantissa respectively. For more details, see __nv_fp8_e4m3.
CUDA_R_8F_E5M2 is designed to be accurate at a similar dynamic range as half precision. The E5 and M2 indicate a 5-bit exponent and a 2-bit mantissa respectively. For more information, see __nv_fp8_e5m2.

Note

Unless otherwise stated, FP8 refers to both CUDA_R_8F_E4M3 and CUDA_R_8F_E5M2.

With the Blackwell GPUs (compute capability 10.0 and above), cuBLAS adds support for the 4-bit floating point data type (FP4) CUDA_R_4F_E2M1. The E2 and M1 indicate a 2-bit exponent and a 1-bit mantissa respectively. For more details, see __nv_fp4_e2m1.
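To make the E2M1 bit layout concrete, the following host-only C sketch decodes a 4-bit value into a float. It is an illustration rather than the library's conversion routine: the function name `decode_e2m1` is hypothetical, and the encoding is assumed to follow the OCP MX definition of E2M1 (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit, no infinities or NaNs), stored in the low nibble of a byte.

```c
#include <stdint.h>

/* Decodes an E2M1 value held in the low 4 bits of `bits` (hypothetical helper).
 * Assumed layout: [sign:1][exponent:2][mantissa:1], exponent bias 1. */
float decode_e2m1(uint8_t bits)
{
    int sign = (bits >> 3) & 1;
    int exp  = (bits >> 1) & 3;
    int man  = bits & 1;
    float mag;

    if (exp == 0)
        mag = 0.5f * (float)man;                       /* subnormal: 0 or 0.5 */
    else
        mag = (1.0f + 0.5f * (float)man) * (float)(1 << (exp - 1));

    return sign ? -mag : mag;
}
```

Under these assumptions the representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, and 6, so the largest representable value is 6.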

In order to maintain accuracy, data in narrow precisions needs to be scaled or dequantized before and potentially quantized after computations. cuBLAS provides several modes for how the scaling factors are applied, defined in cublasLtMatmulMatrixScale_t and configured via the CUBLASLT_MATMUL_DESC_X_SCALE_MODE attributes (here X stands for A, B, C, D, D_OUT, or EPILOGUE_AUX; see cublasLtMatmulDescAttributes_t). The scaling modes overview is given in the next table, and more details are available in the subsequent sections.

Scaling Mode Support Overview
Mode	Supported compute capabilities	Tensor values data type	Scaling factors data type	Scaling factor layout
Tensorwide scaling	8.9+	`CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2`	`CUDA_R_32F` [1]	Scalar
Outer vector scaling	9.0	`CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2`	`CUDA_R_32F`	Vector
128-element 1D block scaling	9.0	`CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2`	`CUDA_R_32F`	Tensor
128x128-element 2D block scaling	9.0	`CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2`	`CUDA_R_32F`	Tensor
32-element 1D block scaling	10.0+	`CUDA_R_8F_E4M3` / `CUDA_R_8F_E5M2`	`CUDA_R_8F_UE8M0` [2]	Tiled tensor [4]
16-element 1D block scaling	10.0+	`CUDA_R_4F_E2M1`	`CUDA_R_8F_UE4M3` [3]	Tiled tensor [4]

NOTES:

[1] Scaling factors that have the CUDA_R_32F data type can be negative and are applied as-is without taking their absolute value first.
[2] CUDA_R_8F_UE8M0 is an 8-bit unsigned exponent-only floating point data type. For more information, see __nv_fp8_e8m0.
[3] CUDA_R_8F_UE4M3 is an unsigned version of CUDA_R_8F_E4M3. The sign bit is ignored, so this enumerant is provided for convenience.
[4] See 1D Block Scaling Factors Layout for more details.

Note

Scales are only applicable to narrow precision matmuls. If any scale is set for a non-narrow precision matmul, cuBLAS will return an error. Furthermore, scales are generally only supported for narrow precision tensors. If the corresponding scale is set for a non-narrow precision tensor, cuBLAS will return an error. The one exception is that the C tensor is allowed to have a scale for non-narrow data types with tensorwide scaling mode.

Note

Only tensorwide scaling is supported when cublasLtBatchMode_t of any matrix is set to CUBLASLT_BATCH_MODE_POINTER_ARRAY.

3.1.4.1.Tensorwide Scaling For FP8 Data Types

Tensorwide scaling is enabled when the CUBLASLT_MATMUL_DESC_X_SCALE_MODE attributes (here X stands for A, B, C, D, or EPILOGUE_AUX; see cublasLtMatmulDescAttributes_t) for all FP8-precision tensors are set to CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F (this is the default value for FP8 tensors). In this case, the matmul operation in cuBLAS is defined in the following way (assuming, for exposition, that all tensors are using an FP8 precision):

\[D = scale_D \cdot (\alpha \cdot scale_A \cdot scale_B \cdot \text{op}(A) \text{op}(B) + \beta \cdot scale_C \cdot C).\]

Here $A$, $B$, and $C$ are input tensors, and $scale_A$, $scale_B$, $scale_C$, $scale_D$, $\alpha$, and $\beta$ are input scalars. This differs from the other matrix multiplication routines because of the addition of scaling factors for each matrix. The $scale_A$, $scale_B$, and $scale_C$ are used for de-quantization, and $scale_D$ is used for quantization. Note that all the scaling factors are applied multiplicatively. This means that sometimes it is necessary to use a scaling factor or its reciprocal depending on the context in which it is applied. For more information on FP8, see cublasLtMatmul() and cublasLtMatmulDescAttributes_t.
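The following host-only C sketch evaluates the tensorwide formula for a single output element and shows why a factor and its reciprocal both appear in practice. The function names are hypothetical, and the choice of quantization scale (the largest representable value divided by the tensor's absolute maximum) is one common convention assumed here for illustration, not something cuBLAS mandates.

```c
/* One output element of the tensorwide-scaled matmul (hypothetical helper).
 * `acc` is the dot product of a row of op(A) and a column of op(B) computed
 * on the stored (quantized) values, and `c` is the stored element of C. */
float tensorwide_element(float acc, float c, float alpha, float beta,
                         float scale_a, float scale_b, float scale_c,
                         float scale_d)
{
    return scale_d * (alpha * scale_a * scale_b * acc + beta * scale_c * c);
}

/* Assumed convention: values are multiplied by type_max / amax when they are
 * quantized, so that the absolute maximum maps to the largest representable
 * value of the narrow type. */
float quant_scale(float amax, float type_max)
{
    return type_max / amax;
}

/* The matching de-quantization factor is the reciprocal, amax / type_max. */
float dequant_scale(float amax, float type_max)
{
    return amax / type_max;
}
```

Under this convention, a tensor quantized with `quant_scale(amax, type_max)` would pass `dequant_scale(amax, type_max)` as its $scale_A$, $scale_B$, or $scale_C$, while $scale_D$ would be a quantization factor for the output.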

For such matrix multiplications, epilogues and the absolute maximums of intermediate values are computed as follows:

\[\begin{split}Aux_{temp} & = \alpha \cdot scale_A \cdot scale_B \cdot \text{op}(A) \text{op}(B) + \beta \cdot scale_C \cdot C + bias, \\D_{temp} & = \mathop{Epilogue}(Aux_{temp}), \\amax_{D} & = \mathop{absmax}(D_{temp}), \\amax_{Aux} & = \mathop{absmax}(Aux_{temp}), \\D & = scale_D \cdot D_{temp}, \\Aux & = scale_{Aux} \cdot Aux_{temp}.\end{split}\]

Here $Aux$ is an auxiliary output of matmul consisting of the values that are passed to an epilogue function like GELU, $scale_{Aux}$ is an optional scaling factor that can be applied to $Aux$, and $amax_{Aux}$ is the maximum absolute value in $Aux$ before scaling. For more information, see attributes CUBLASLT_MATMUL_DESC_AMAX_D_POINTER and CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_AMAX_POINTER in cublasLtMatmulDescAttributes_t.

Note

As indicated in the equation above, bias is added as part of computing $Aux_{temp}$, that is, before the epilogue function is applied.

3.1.4.2.Outer Vector Scaling for FP8 Data Types

This scaling mode (also known as channelwise or rowwise scaling) is a refinement of the tensorwide scaling. Instead of multiplying a matrix by a single scalar, a scaling factor is associated with each row of $A$ and each column of $B$:

\[D_{ij} = \alpha \cdot scale_A^i \cdot scale_B^j \sum_{l=1}^k a_{il}\cdot b_{lj} + \beta \cdot scale_C \cdot C_{ij}.\]

Notably, $scale_D$ is not supported because the only supported precisions for $D$ are CUDA_R_16F, CUDA_R_16BF, and CUDA_R_32F.

To enable outer vector scaling, the CUBLASLT_MATMUL_DESC_A_SCALE_MODE and CUBLASLT_MATMUL_DESC_B_SCALE_MODE attributes must be set to CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F, while all the other scaling modes must not be modified.

When using this scaling mode, $scale_A$ and $scale_B$ must be vectors of length $M$ and $N$ respectively.
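A host-only reference for the outer vector scaling formula can be sketched in C as follows. This is an illustration under stated assumptions, not the library's kernel: the name `ref_outer_vec_matmul` is hypothetical, plain `float` arrays stand in for the stored FP8 values, and row-major storage is used purely to keep the indexing short.

```c
#include <stddef.h>

/* Reference for D_ij = alpha * scale_a[i] * scale_b[j] * sum_l A_il * B_lj
 *                      + beta * scale_c * C_ij          (hypothetical helper).
 * A is M x K, B is K x N, C and D are M x N, all row-major in this sketch.
 * scale_a has M entries (one per row of A); scale_b has N entries (one per
 * column of B); scale_c is a single scalar. */
void ref_outer_vec_matmul(int M, int N, int K, float alpha,
                          const float *A, const float *scale_a,
                          const float *B, const float *scale_b,
                          float beta, float scale_c,
                          const float *C, float *D)
{
    for (int i = 0; i < M; ++i) {
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int l = 0; l < K; ++l)
                acc += A[(size_t)i * K + l] * B[(size_t)l * N + j];
            D[(size_t)i * N + j] = alpha * scale_a[i] * scale_b[j] * acc
                                 + beta * scale_c * C[(size_t)i * N + j];
        }
    }
}
```

Setting every entry of `scale_a` and `scale_b` to the same value reduces this to the tensorwide case without $scale_D$.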

3.1.4.3.16/32-Element 1D Block Scaling for FP8 and FP4 Data Types

1D block scaling aims to overcome the limitations of having a single scalar to scale a whole tensor. It is described in more detail in the OCP MXFP specification, so we give just a brief overview here. Block scaling means that elements within the same 16- or 32-element block of adjacent values are assigned a shared scaling factor.

Currently, block scaling is supported for FP8-precision and FP4-precision tensors, and mixing precisions is not supported. To enable block scaling, the CUBLASLT_MATMUL_DESC_X_SCALE_MODE attributes (here X stands for A, B, C, D_OUT, or EPILOGUE_AUX; see cublasLtMatmulDescAttributes_t) must be set to CUBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0 for all FP8-precision tensors or to CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3 for all FP4-precision tensors.

With block scaling, the matmul operation in cuBLAS is defined in the following way (assuming, for exposition, that all tensors are using a narrow precision). We loosely follow the OCP MXFP specification notation.

First, a scaled block (or an MX-compliant format vector in the OCP MXFP specification) is a tuple $x = \left(S^x, \left[x^i\right]_{i=1}^k\right)$, where $S^x$ is a shared scaling factor, and each $x^i$ is stored using an FP8 or FP4 data type.

A dot product of two scaled blocks $x = \left(S^x, \left[x^i\right]_{i=1}^{k}\right)$ and $y = \left(S^y, \left[y^i\right]_{i=1}^{k}\right)$ is defined as follows:

\[Dot(x, y) = S^x S^y \cdot \sum_{i=1}^{k} x^i y^i.\]

For a sequence of $n$ blocks $X = \{x_j\}_{j=1}^n$ and $Y = \{y_j\}_{j=1}^n$, the generalized dot product is defined as:

\[DotGeneral(X, Y) = \sum_{j=1}^n Dot(x_j, y_j).\]
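The two definitions above translate directly into a short host-only C reference. The `scaled_block` type and the function names are hypothetical, and `float` values stand in for the decoded FP8 or FP4 elements and scaling factors.

```c
/* A scaled block: one shared scaling factor plus k element values
 * (hypothetical type; floats stand in for decoded narrow-precision data). */
typedef struct {
    float scale;
    const float *elems;
} scaled_block;

/* Dot(x, y) = S^x * S^y * sum_{i=1..k} x^i * y^i */
float block_dot(const scaled_block *x, const scaled_block *y, int k)
{
    float sum = 0.0f;
    for (int i = 0; i < k; ++i)
        sum += x->elems[i] * y->elems[i];
    return x->scale * y->scale * sum;
}

/* DotGeneral(X, Y) = sum_{j=1..n} Dot(x_j, y_j) */
float block_dot_general(const scaled_block *X, const scaled_block *Y,
                        int n, int k)
{
    float total = 0.0f;
    for (int j = 0; j < n; ++j)
        total += block_dot(&X[j], &Y[j], k);
    return total;
}
```

Note that the scaling factors are applied once per block, after the unscaled products within the block have been summed.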

The generalized dot product can be used to define the matrix multiplication by combining together one scaling factor per $k$ elements of $A$ and $B$ in the $K$ dimension (assuming, for simplicity, that $K$ is divisible by $k$ without a remainder):

\[\begin{split}L & = \frac{K}{k}, \\A_i & = \left\{{scale_A}_{i,b}, \left[A_{i,(b-1)k+l}\right]_{l=1}^{k}\right\}_{b=1}^L, \\B_j & = \left\{{scale_B}_{j,b}, \left[B_{(b-1)k+l,j}\right]_{l=1}^{k}\right\}_{b=1}^L, \\(\left\{scale_A, A\right\} \times \left\{scale_B, B\right\})_{i,j} & = DotGeneral(A_i, B_j).\end{split}\]

Now, the full matmul can be written as:

\[\left\{scale_D^{out}, D\right\} = Quantize\left(scale_D^{in}\left(\alpha \cdot \left\{scale_A, \text{op}(A)\right\} \times \left\{scale_B, \text{op}(B)\right\} + \beta \cdot Dequantize(\left\{scale_C, C\right\})\right)\right).\]

The $Quantize$ operation is explained in the 1D Block Quantization section, and $Dequantize$ is defined as:

\[Dequantize\left(\left\{scale_C, C\right\}\right)_{i,j} = {scale_C}_{i/k,j} \cdot C_{i,j}.\]

Note

In addition to $scale_D^{out}$ that is computed during quantization, there is also an input scalar tensor-wide scaling factor $scale_D^{in}$ for $D$ that is available only when scaling factors use the CUDA_R_8F_UE4M3 data type. It is used to ‘compress’ computed values prior to quantization.

3.1.4.3.1.1D Block Quantization

Consider a single block of $k$ elements of $D$ in the $M$ dimension: $D^b_{fp32} = \left[d^i_{fp32}\right]_{i=1}^k$. Quantization of partial blocks is performed as if the missing values are zero. Let $Amax(DType)$ be the maximal value representable in the destination precision.

The following computation steps are common to all combinations of output and scaling factor data types.

Compute the block absolute maximum value $Amax(D^b_{fp32}) = \max(\{|d^i_{fp32}|\}_{i=1}^k)$.
Compute the block scaling factor in single precision as $S^b_{fp32} = \frac{Amax(D^b_{fp32})}{Amax(DType)}$.

Computing scaling and conversion factors for FP8 with UE8M0 scales

Note

RNE rounding is assumed unless noted otherwise.

Computations consist of the following steps:

Extract the block scaling factor exponent without bias adjustment as an integer $E^b_{int}$ and mantissa as a fixed point number $M^b_{fixp}$ from $S^b_{fp32}$ (the actual implementation operates on bit representation directly).
Round the block exponent up keeping it within the range of values representable in UE8M0: $E^b_{int} = \left\{\begin{array}{ll} E^b_{int} + 1, & \text{if } S^b_{fp32} \text{ is a normal number and } E^b_{int} < 254 \text{ and } M^b_{fixp} > 0, \\ E^b_{int} + 1, & \text{if } S^b_{fp32} \text{ is a denormal number and } M^b_{fixp} > 0.5, \\ E^b_{int}, & \text{otherwise.} \end{array}\right.$
Compute the block scaling factor as $S^b_{ue8m0} = 2^{E^b_{int}}$. Note that the UE8M0 data type has an exponent bias of 127.
Compute the block conversion factor $R^b_{fp32} = \frac{1}{fp32(S^b_{ue8m0})}$.
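The steps above can be sketched in host-only C by operating on the bit representation of the single precision scaling factor. This is an interpretation of the description rather than the library's code: the function names are hypothetical, `float` is assumed to be IEEE-754 binary32, the input is assumed to be finite and non-negative, and the fixed point mantissa of a denormal input is assumed to be the 23-bit fraction field read as a value in [0, 1).

```c
#include <stdint.h>
#include <string.h>

/* Returns the biased UE8M0 exponent for the fp32 block scaling factor
 * s_fp32 (hypothetical helper following the rounding rule above). */
uint8_t ue8m0_exponent(float s_fp32)
{
    uint32_t bits;
    memcpy(&bits, &s_fp32, sizeof bits);

    uint32_t e    = (bits >> 23) & 0xFFu;  /* exponent, no bias adjustment */
    uint32_t frac = bits & 0x7FFFFFu;      /* 23-bit mantissa field */

    if (e >= 1 && e < 254 && frac > 0)
        e += 1;                            /* normal, non-zero mantissa */
    else if (e == 0 && frac > 0x400000u)
        e += 1;                            /* denormal, mantissa > 0.5 */

    return (uint8_t)e;
}

/* Returns the conversion factor R = 1 / 2^(e - 127) as an fp32 value,
 * built from its bit pattern (hypothetical helper). */
float ue8m0_conversion_factor(uint8_t e)
{
    uint32_t bits = (e <= 253)
        ? (uint32_t)(254 - e) << 23        /* normal result 2^(127 - e) */
        : 0x400000u >> (e - 254);          /* denormal 2^-127 or 2^-128 */
    float r;
    memcpy(&r, &bits, sizeof r);
    return r;
}
```

For example, a scaling factor of exactly 1.0 keeps the exponent 127 and a conversion factor of 1, while any value strictly between 1.0 and 2.0 is rounded up to the exponent 128, giving a conversion factor of 0.5.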

Note

The algorithm above differs from the OCP MXFP suggested rounding scheme.

Computing scaling and conversion factors for FP4 with UE4M3 scales

Here we assume that the algorithm is provided with a precomputed input tensorwide scaling factor $scale_D^{in}$, which in the general case is computed as

\[scale_D^{in} = \frac{Amax(e2m1) \cdot Amax(e4m3)}{Amax(D_{temp})},\]

where $Amax(D_{temp})$ is a global absolute maximum of matmul results before quantization. Since computing this value requires knowing the result of the whole computation, an approximate value from e.g. the previous iteration is used in practice.
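As a small numeric illustration of the formula above, the following C sketch evaluates $scale_D^{in}$ on the host. The function name is hypothetical, and the constants assume that the largest representable magnitudes are 6 for E2M1 and 448 for E4M3.

```c
/* Assumed largest representable magnitudes of the two formats. */
static const float AMAX_E2M1 = 6.0f;
static const float AMAX_E4M3 = 448.0f;

/* scale_D^in = Amax(e2m1) * Amax(e4m3) / Amax(D_temp) (hypothetical helper).
 * amax_d_temp is an estimate of the global absolute maximum of the matmul
 * results, for example taken from a previous iteration; it must be > 0. */
float scale_d_in(float amax_d_temp)
{
    return (AMAX_E2M1 * AMAX_E4M3) / amax_d_temp;
}
```

With these constants, an estimated global maximum of 2688 gives a scaling factor of exactly 1.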

Computations consist of the following steps:

Compute the narrow precision value of the block scaling factor $S^b_{e4m3} = e4m3(S^b_{fp32} \cdot scale_D^{in})$.
Compute the block conversion factor $R^b_{fp32} = \frac{scale_D^{in}}{fp32(S^b_{e4m3})}$.

Applying conversion factors

For each $i = 1 \ldots k$, compute $d^i = DType(d^i_{fp32} \cdot R^b_{fp32})$. The resulting quantized block is $\left(S^b, \left[d^i\right]_{i=1}^k\right)$, where $S^b$ is $S^b_{ue8m0}$ for FP8 with UE8M0 scaling factors, and $S^b_{e4m3}$ for FP4 with UE4M3 scaling factors.

3.1.4.3.2.1D Block Scaling Factors Layout

Scaling factors are stored using a tiled layout. The following figure shows how each 128x4 tile is laid out in memory. The offset in memory is increasing from left to right, and then from top to bottom.

_images/cublasLt_scaling_factors_layout_tile.png

The following pseudocode can be used to translate from inner (K for A and B, and M for C or D) and outer (M for A, and N for B, C and D) indices to a linear offset within a tile and back:

// Indices -> offset
offset = (outer % 32) * 16 + (outer / 32) * 4 + inner

// Offset -> Indices
outer = ((offset % 16) / 4) * 32 + (offset / 16)
inner = (offset % 4)
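The pseudocode can be written as two small C functions, which also makes it easy to check that the two mappings are inverses of each other. The function names are hypothetical; `outer` is assumed to be in the range 0 to 127 and `inner` in the range 0 to 3, matching a single 128x4 tile.

```c
/* Maps (outer, inner) indices within one 128x4 tile of scaling factors to
 * the linear offset inside that tile (hypothetical helper).
 * Expects 0 <= outer < 128 and 0 <= inner < 4. */
int sf_tile_offset(int outer, int inner)
{
    return (outer % 32) * 16 + (outer / 32) * 4 + inner;
}

/* Inverse mapping: recovers (outer, inner) from a linear offset within a
 * tile (hypothetical helper). Expects 0 <= offset < 512. */
void sf_tile_indices(int offset, int *outer, int *inner)
{
    *outer = ((offset % 16) / 4) * 32 + (offset / 16);
    *inner = offset % 4;
}
```

For example, the scaling factor at `outer = 32`, `inner = 0` is stored at offset 4, directly after the four factors of row 0, while row 1 starts at offset 16.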

A single tile of scaling factors is applied to a 128x64 block when the scaling mode is CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3 and to a 128x128 block when it is CUBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0.

Multiple blocks are arranged in the row-major manner. The next picture shows an example. The offset in memory is increasing from left to right, and then from top to bottom.

_images/cublasLt_scaling_factors_layout_global.png

In general, for a scaling factors tensor with sf_inner_dim scaling factors per row, the offset of a block with top left coordinate (sf_outer, sf_inner) (using the same correspondence to matrix coordinates as noted above) can be computed using the following pseudocode:

// Indices -> offset
//   note that sf_inner is a multiple of 4 due to the tiling layout
offset = (sf_inner + sf_outer * sf_inner_dim) * 128

Note

Starting addresses of scaling factors must be 16B aligned.

Note

Note that the layout described above does not allow transposition. This means that even though the input tensors can be transposed, the layout of scaling factors does not change.

Note

Note that when tensor dimensions are not multiples of the tile size above, it is still necessary to allocate a full tile for storage and fill out-of-bounds values with zeroes. Moreover, when writing output scaling factors, kernels may write additional zeroes, so it is best not to make any assumptions regarding the persistence of out-of-bounds values.
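The padding requirement can be turned into a simple size computation. The following C sketch is an illustration under stated assumptions: the function names are hypothetical, each tile is taken to hold 128x4 scaling factors as described above, and there is one scaling factor per 16- or 32-element block along the inner dimension.

```c
#include <stddef.h>

/* Rounds v up to the nearest multiple of m (m > 0). */
static size_t round_up(size_t v, size_t m)
{
    return (v + m - 1) / m * m;
}

/* Number of scaling factors to allocate for a tensor with `outer_dim`
 * elements along the outer dimension and `inner_elems` elements along the
 * inner dimension (hypothetical helper). `block` is the block size of the
 * scaling mode: 16 or 32. Whole 128x4 tiles are always allocated. */
size_t sf_storage_count(size_t outer_dim, size_t inner_elems, size_t block)
{
    size_t sf_inner = (inner_elems + block - 1) / block;
    return round_up(outer_dim, 128) * round_up(sf_inner, 4);
}
```

For a 100x64 tensor with 32-element blocks, only 100x2 scaling factors carry data, but a full 128x4 tile (512 entries) would be allocated, with the remainder filled with zeroes.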

3.1.4.4.128-element 1D and 128x128 2D Block Scaling For FP8 Data Types

These two scaling modes apply the principles of the scaling approach described in 16/32-Element 1D Block Scaling for FP8 and FP4 Data Types to the Hopper GPU architecture. However, here the scaling data type is CUDA_R_32F, different scaling modes can be used for $A$ and $B$, and the only supported precisions for $D$ are CUDA_R_16F, CUDA_R_16BF, and CUDA_R_32F.

To enable these scaling modes, the CUBLASLT_MATMUL_DESC_X_SCALE_MODE attributes (here X stands for A or B) must be set to CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F or CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F, while all the other scaling modes must not be modified. The following table shows supported combinations:

CUBLASLT_MATMUL_DESC_A_SCALE_MODE	CUBLASLT_MATMUL_DESC_B_SCALE_MODE	Supported?
`CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F`	`CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F`	Yes
`CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F`	`CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F`	Yes
`CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F`	`CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F`	Yes
`CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F`	`CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F`	No

Using the notation from 16/32-Element 1D Block Scaling for FP8 and FP4 Data Types, we can define sequences of scaled blocks for the $i$-th row of $A$ in the following way:

\[\begin{split}L & = \lceil \frac{K}{128} \rceil, \\A^{128}_i & = \left\{{scale_A}_{i,b}, \left[A_{i,(b-1)128+l}\right]_{l=1}^{128}\right\}_{b=1}^L, \text{(this is the 128-element 1D block scaling)} \\\\p & = \lceil \frac{i}{128} \rceil, \\A^{128 \times 128}_i & = \left\{{scale_A}_{p,b}, \left[A_{i,(b-1)128+l}\right]_{l=1}^{128}\right\}_{b=1}^L. \text{(this is the 128x128-element 2D block scaling)} \\\end{split}\]

Definitions for $B$ are similar. The matmul is then defined as in 16/32-Element 1D Block Scaling for FP8 and FP4 Data Types, with the notable difference that when using the 2D block scaling a single scaling factor is used for the whole 128x128 block of elements.

3.1.4.4.1.Scaling factors layouts

Note

Starting addresses of scaling factors must be 16B aligned.

Note

$M$ and $N$ must be multiples of 4.

Then for the CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F scaling mode, the scaling factors are:

$M$-major for $A$ with shape $M \times L$ ($M$-major means that elements along the $M$ dimension are contiguous in memory),
$N$-major for $B$ with shape $N \times L$.

For the CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F scaling mode, the scaling factors are $K$-major and the stride between the consecutive columns must be a multiple of 4. Let $L_4 = \lceil L \rceil_4$, where $\lceil \cdot \rceil_4$ denotes rounding up to the nearest multiple of 4. Then

for $A$, the shape of the scaling factors is $L_4 \times \lceil \frac{M}{128} \rceil$,
for $B$, the shape of the scaling factors is $L_4 \times \lceil \frac{N}{128} \rceil$.
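The shapes above determine how many CUDA_R_32F scaling factors need to be allocated. A minimal host-only C sketch of that arithmetic follows; the function names are hypothetical, and `outer` stands for $M$ when sizing the scaling factors of $A$ and for $N$ when sizing those of $B$.

```c
#include <stddef.h>

/* Ceiling division for positive b. */
static size_t ceil_div(size_t a, size_t b)
{
    return (a + b - 1) / b;
}

/* 128-element 1D block scaling: shape outer x L, with L = ceil(K / 128)
 * (hypothetical helper). Returns the number of fp32 scaling factors. */
size_t vec128_scale_count(size_t outer, size_t K)
{
    return outer * ceil_div(K, 128);
}

/* 128x128 2D block scaling: shape L_4 x ceil(outer / 128), where L_4 is L
 * rounded up to a multiple of 4 (hypothetical helper). Returns the number
 * of fp32 scaling factors. */
size_t blk128_scale_count(size_t outer, size_t K)
{
    size_t L  = ceil_div(K, 128);
    size_t L4 = ceil_div(L, 4) * 4;
    return L4 * ceil_div(outer, 128);
}
```

For $M = 256$ and $K = 300$, $L = 3$ and $L_4 = 4$, so the 1D mode uses 768 scaling factors for $A$ while the 2D mode uses 8.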

3.1.5.Disabling CPU Instructions

As mentioned in the Heuristics Cache section, cuBLASLt heuristics perform some compute-intensive operations on the host CPU. To speed up these operations, the implementation detects CPU capabilities and may use special instructions, such as Advanced Vector Extensions (AVX) on x86-64 CPUs. However, in some rare cases this might not be desirable. For instance, using advanced instructions may result in the CPU running at a lower frequency, which would affect the performance of other host code.

The user can optionally instruct the cuBLASLt library to not use some CPU instructions with the CUBLASLT_DISABLE_CPU_INSTRUCTIONS_MASK environment variable or with the cublasLtDisableCpuInstructionsSetMask() function, which has higher precedence. The default mask is 0, meaning that there are no restrictions.

Please check cublasLtDisableCpuInstructionsSetMask() for more information.

3.2.cuBLASLt Code Examples

Please visit https://github.com/NVIDIA/CUDALibrarySamples/tree/master/cuBLASLt for updated code examples.

3.3.cuBLASLt Datatypes Reference

3.3.1.cublasLtClusterShape_t

cublasLtClusterShape_t is an enumerated type used to configure thread block cluster dimensions. Thread block clusters add an optional hierarchical level and are made up of thread blocks. Similar to thread blocks, these can be one, two, or three-dimensional. See also Thread Block Clusters.

Value	Description
`CUBLASLT_CLUSTER_SHAPE_AUTO`	Cluster shape is automatically selected.
`CUBLASLT_CLUSTER_SHAPE_1x1x1`	Cluster shape is 1 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x2x1`	Cluster shape is 1 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x4x1`	Cluster shape is 1 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x1x1`	Cluster shape is 2 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x2x1`	Cluster shape is 2 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x4x1`	Cluster shape is 2 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x1x1`	Cluster shape is 4 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x2x1`	Cluster shape is 4 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x4x1`	Cluster shape is 4 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x8x1`	Cluster shape is 1 x 8 x 1.
`CUBLASLT_CLUSTER_SHAPE_8x1x1`	Cluster shape is 8 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x8x1`	Cluster shape is 2 x 8 x 1.
`CUBLASLT_CLUSTER_SHAPE_8x2x1`	Cluster shape is 8 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x16x1`	Cluster shape is 1 x 16 x 1.
`CUBLASLT_CLUSTER_SHAPE_16x1x1`	Cluster shape is 16 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x3x1`	Cluster shape is 1 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x5x1`	Cluster shape is 1 x 5 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x6x1`	Cluster shape is 1 x 6 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x7x1`	Cluster shape is 1 x 7 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x9x1`	Cluster shape is 1 x 9 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x10x1`	Cluster shape is 1 x 10 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x11x1`	Cluster shape is 1 x 11 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x12x1`	Cluster shape is 1 x 12 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x13x1`	Cluster shape is 1 x 13 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x14x1`	Cluster shape is 1 x 14 x 1.
`CUBLASLT_CLUSTER_SHAPE_1x15x1`	Cluster shape is 1 x 15 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x3x1`	Cluster shape is 2 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x5x1`	Cluster shape is 2 x 5 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x6x1`	Cluster shape is 2 x 6 x 1.
`CUBLASLT_CLUSTER_SHAPE_2x7x1`	Cluster shape is 2 x 7 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x1x1`	Cluster shape is 3 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x2x1`	Cluster shape is 3 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x3x1`	Cluster shape is 3 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x4x1`	Cluster shape is 3 x 4 x 1.
`CUBLASLT_CLUSTER_SHAPE_3x5x1`	Cluster shape is 3 x 5 x 1.
`CUBLASLT_CLUSTER_SHAPE_4x3x1`	Cluster shape is 4 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_5x1x1`	Cluster shape is 5 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_5x2x1`	Cluster shape is 5 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_5x3x1`	Cluster shape is 5 x 3 x 1.
`CUBLASLT_CLUSTER_SHAPE_6x1x1`	Cluster shape is 6 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_6x2x1`	Cluster shape is 6 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_7x1x1`	Cluster shape is 7 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_7x2x1`	Cluster shape is 7 x 2 x 1.
`CUBLASLT_CLUSTER_SHAPE_9x1x1`	Cluster shape is 9 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_10x1x1`	Cluster shape is 10 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_11x1x1`	Cluster shape is 11 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_12x1x1`	Cluster shape is 12 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_13x1x1`	Cluster shape is 13 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_14x1x1`	Cluster shape is 14 x 1 x 1.
`CUBLASLT_CLUSTER_SHAPE_15x1x1`	Cluster shape is 15 x 1 x 1.

3.3.2.cublasLtEpilogue_t

cublasLtEpilogue_t is an enumerated type that sets the postprocessing options for the epilogue.

Value	Description
`CUBLASLT_EPILOGUE_DEFAULT=1`	No special postprocessing, just scale and quantize the results if necessary.
`CUBLASLT_EPILOGUE_RELU=2`	Apply ReLU point-wise transform to the results (`x:=max(x,0)`).
`CUBLASLT_EPILOGUE_RELU_AUX=CUBLASLT_EPILOGUE_RELU | 128`	Apply ReLU point-wise transform to the results (`x:=max(x,0)`). This epilogue mode produces an extra output, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_BIAS=4`	Apply (broadcast) bias from the bias vector. Bias vector length must match matrix D rows, and it must be packed (that is, the stride between vector elements is 1). Bias vector is broadcast to all columns and added before applying the final postprocessing.
`CUBLASLT_EPILOGUE_RELU_BIAS=CUBLASLT_EPILOGUE_RELU | CUBLASLT_EPILOGUE_BIAS`	Apply bias and then ReLU transform.
`CUBLASLT_EPILOGUE_RELU_AUX_BIAS=CUBLASLT_EPILOGUE_RELU_AUX | CUBLASLT_EPILOGUE_BIAS`	Apply bias and then ReLU transform. This epilogue mode produces an extra output, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_DRELU=8 | 128`	Apply ReLU gradient to matmul output. Store ReLU gradient in the output matrix. This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_DRELU_BGRAD=CUBLASLT_EPILOGUE_DRELU | 16`	Apply independently ReLU and Bias gradient to matmul output. Store ReLU gradient in the output matrix, and Bias gradient in the bias buffer (see `CUBLASLT_MATMUL_DESC_BIAS_POINTER`). This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_GELU=32`	Apply GELU point-wise transform to the results (`x:=GELU(x)`).
`CUBLASLT_EPILOGUE_GELU_AUX=CUBLASLT_EPILOGUE_GELU | 128`	Apply GELU point-wise transform to the results (`x:=GELU(x)`). This epilogue mode outputs GELU input as a separate matrix (useful for training). See `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_GELU_BIAS=CUBLASLT_EPILOGUE_GELU | CUBLASLT_EPILOGUE_BIAS`	Apply Bias and then GELU transform (see note 5).
`CUBLASLT_EPILOGUE_GELU_AUX_BIAS=CUBLASLT_EPILOGUE_GELU_AUX | CUBLASLT_EPILOGUE_BIAS`	Apply Bias and then GELU transform (see note 5). This epilogue mode outputs GELU input as a separate matrix (useful for training). See `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_DGELU=64 | 128`	Apply GELU gradient to matmul output. Store GELU gradient in the output matrix. This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_DGELU_BGRAD=CUBLASLT_EPILOGUE_DGELU | 16`	Apply independently GELU and Bias gradient to matmul output. Store GELU gradient in the output matrix, and Bias gradient in the bias buffer (see `CUBLASLT_MATMUL_DESC_BIAS_POINTER`). This epilogue mode requires an extra input, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_BGRADA=256`	Apply Bias gradient to the input matrix A. The bias size corresponds to the number of rows of the matrix D. The reduction happens over the GEMM’s “k” dimension. Store Bias gradient in the bias buffer, see `CUBLASLT_MATMUL_DESC_BIAS_POINTER` of cublasLtMatmulDescAttributes_t.
`CUBLASLT_EPILOGUE_BGRADB=512`	Apply Bias gradient to the input matrix B. The bias size corresponds to the number of columns of the matrix D. The reduction happens over the GEMM’s “k” dimension. Store Bias gradient in the bias buffer, see `CUBLASLT_MATMUL_DESC_BIAS_POINTER` of cublasLtMatmulDescAttributes_t.

NOTES:

5: GELU (Gaussian Error Linear Unit) is approximated by: $0.5x\left(1 + \tanh\left(\sqrt{2/\pi}\left(x + 0.044715x^{3}\right)\right)\right)$

Note

Only CUBLASLT_EPILOGUE_DEFAULT is supported when cublasLtBatchMode_t of any matrix is set to CUBLASLT_BATCH_MODE_POINTER_ARRAY.
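The forward-pass epilogue values in the table are built from bit flags, so a combined mode is the bitwise OR of its parts. The sketch below illustrates this with local constants that mirror the numeric values listed in the table. The `EPI_*` names and both helper functions are hypothetical stand-ins so that the example compiles without `cublasLt.h`; real code should use the library's own enumerators.

```c
#include <stdint.h>

/* Local stand-ins mirroring the numeric values listed in the epilogue table.
   The EPI_* names are hypothetical; use the cublasLt.h enumerators in real code. */
enum {
    EPI_DEFAULT = 1,
    EPI_RELU    = 2,
    EPI_BIAS    = 4,
    EPI_GELU    = 32,
    EPI_AUX_BIT = 128  /* the "| 128" term used by the *_AUX variants */
};

/* Compose a forward-pass epilogue value.
   activation must be 0 (none), EPI_RELU, or EPI_GELU. */
static uint32_t epi_compose(uint32_t activation, int with_bias, int with_aux)
{
    uint32_t e = activation;
    if (with_aux && activation != 0) {
        e |= EPI_AUX_BIT;   /* auxiliary output only applies to an activation */
    }
    if (with_bias) {
        e |= EPI_BIAS;
    }
    return e ? e : EPI_DEFAULT; /* nothing requested: plain default epilogue */
}

/* Returns 1 if the epilogue value includes the bias bit, 0 otherwise. */
static int epi_has_bias(uint32_t epilogue)
{
    return (epilogue & EPI_BIAS) != 0;
}
```

With these values, ReLU plus bias evaluates to 6 and ReLU with auxiliary output plus bias evaluates to 134, consistent with the OR expressions in the table.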

3.3.3.cublasLtHandle_t

The cublasLtHandle_t type is a pointer to an opaque structure holding the cuBLASLt library context. Use cublasLtCreate() to initialize the cuBLASLt library context and obtain a handle to it, and use cublasLtDestroy() to destroy a previously created cuBLASLt library context and release its resources.

Note

A cuBLAS handle (cublasHandle_t) encapsulates a cuBLASLt handle. Any valid cublasHandle_t can be used in place of cublasLtHandle_t with a simple cast. However, unlike a cuBLAS handle, a cuBLASLt handle is not tied to any particular CUDA context, with the exception of CUDA contexts tied to a graphics context (starting from CUDA 12.8). If a cuBLASLt handle is created when the current CUDA context is tied to a graphics context, then cuBLASLt detects the corresponding shared memory limitations and records them in the handle.

3.3.4.cublasLtLoggerCallback_t

cublasLtLoggerCallback_t is a callback function pointer type. A callback function can be set using cublasLtLoggerSetCallback().

Parameters:

Parameter	Input / Output	Description
`logLevel`	Output	See cuBLASLt Logging.
`functionName`	Output	The name of the API that logged this message.
`message`	Output	The log message.
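A callback with the three parameters listed above might look like the following sketch. The `logger_cb_t` typedef is a local stand-in that assumes the callback returns `void` and takes an `int` level plus two C strings; `my_logger` and `last_log_line` are illustrative names. Check `cublasLt.h` for the exact declaration before relying on it.

```c
#include <stdio.h>
#include <string.h>

/* Local typedef with the three documented parameters. The name logger_cb_t
   is a stand-in so this sketch compiles without cublasLt.h. */
typedef void (*logger_cb_t)(int logLevel, const char *functionName,
                            const char *message);

/* Copy of the most recent formatted line, kept for inspection. */
static char last_log_line[256];

/* Example callback: formats one line, stores it, and echoes it to stderr. */
static void my_logger(int logLevel, const char *functionName,
                      const char *message)
{
    snprintf(last_log_line, sizeof last_log_line, "[%d] %s: %s",
             logLevel, functionName, message);
    fprintf(stderr, "%s\n", last_log_line);
}
```

With the real library, a function of this shape would be registered through cublasLtLoggerSetCallback().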

3.3.5.cublasLtMatmulAlgo_t

cublasLtMatmulAlgo_t is an opaque structure holding the description of the matrix multiplication algorithm. This structure can be trivially serialized and later restored for use with the same version of cuBLAS library to save on selecting the right configuration again.
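Because the structure can be copied byte for byte, one way to cache a selected algorithm is to store it together with the library version it was chosen under and to reject it when the version differs. The sketch below uses a stand-in type `algo_blob_t` whose size is an assumption; real code should copy `sizeof(cublasLtMatmulAlgo_t)` bytes and obtain the version from the library. All names here are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the opaque algorithm structure; the 64-byte size is an
   assumption. Real code should use cublasLtMatmulAlgo_t and its sizeof. */
typedef struct { uint64_t data[8]; } algo_blob_t;

/* Saved record: the raw bytes plus the library version they belong to. */
typedef struct {
    size_t      lib_version;
    algo_blob_t algo;
} saved_algo_t;

/* Capture an algorithm together with the version it was selected under. */
static saved_algo_t algo_save(const algo_blob_t *algo, size_t lib_version)
{
    saved_algo_t saved;
    saved.lib_version = lib_version;
    memcpy(&saved.algo, algo, sizeof saved.algo);
    return saved;
}

/* Restore only if the running library version matches.
   Returns 1 on success, 0 if the saved record must be discarded. */
static int algo_restore(const saved_algo_t *saved, size_t current_version,
                        algo_blob_t *out)
{
    if (saved->lib_version != current_version) {
        return 0;
    }
    memcpy(out, &saved->algo, sizeof *out);
    return 1;
}
```

When restoring fails, the application would fall back to selecting a configuration again.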

3.3.6.cublasLtMatmulAlgoCapAttributes_t

cublasLtMatmulAlgoCapAttributes_t enumerates matrix multiplication algorithm capability attributes that can be retrieved from an initialized cublasLtMatmulAlgo_t descriptor using cublasLtMatmulAlgoCapGetAttribute().

Value	Description	Data Type
`CUBLASLT_ALGO_CAP_SPLITK_SUPPORT`	Support for split-K. Boolean (0 or 1) to express if split-K implementation is supported. 0 means no support, and supported otherwise. See `CUBLASLT_ALGO_CONFIG_SPLITK_NUM` of cublasLtMatmulAlgoConfigAttributes_t.	`int32_t`
`CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK`	Mask to express the types of reduction schemes supported, see cublasLtReductionScheme_t. If the reduction scheme is not masked out then it is supported. For example: `int isReductionSchemeComputeTypeSupported = (reductionSchemeMask & CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE) == CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE ? 1 : 0;`	`uint32_t`
`CUBLASLT_ALGO_CAP_CTA_SWIZZLING_SUPPORT`	Support for CTA-swizzling. Boolean (0 or 1) to express if CTA-swizzling implementation is supported. 0 means no support, and 1 means supported; other values are reserved. See also `CUBLASLT_ALGO_CONFIG_CTA_SWIZZLING` of cublasLtMatmulAlgoConfigAttributes_t.	`uint32_t`
`CUBLASLT_ALGO_CAP_STRIDED_BATCH_SUPPORT`	Support strided batch. 0 means no support, supported otherwise.	`int32_t`
`CUBLASLT_ALGO_CAP_POINTER_ARRAY_BATCH_SUPPORT`	Support pointer array batch. 0 means no support, supported otherwise.	`int32_t`
`CUBLASLT_ALGO_CAP_OUT_OF_PLACE_RESULT_SUPPORT`	Support results out of place (D != C in D = alpha.A.B + beta.C). 0 means no support, supported otherwise.	`int32_t`
`CUBLASLT_ALGO_CAP_UPLO_SUPPORT`	Syrk (symmetric rank k update)/herk (Hermitian rank k update) support (on top of regular gemm). 0 means no support, supported otherwise.	`int32_t`
`CUBLASLT_ALGO_CAP_TILE_IDS`	The tile ids possible to use. See cublasLtMatmulTile_t. If no tile ids are supported then use `CUBLASLT_MATMUL_TILE_UNDEFINED`. Use cublasLtMatmulAlgoCapGetAttribute() with `sizeInBytes=0` to query the actual count.	`uint32_t[]`
`CUBLASLT_ALGO_CAP_STAGES_IDS`	The stages ids possible to use. See cublasLtMatmulStages_t. If no stages ids are supported then use `CUBLASLT_MATMUL_STAGES_UNDEFINED`. Use cublasLtMatmulAlgoCapGetAttribute() with `sizeInBytes=0` to query the actual count.	`uint32_t[]`
`CUBLASLT_ALGO_CAP_CUSTOM_OPTION_MAX`	Custom option range is from 0 to `CUBLASLT_ALGO_CAP_CUSTOM_OPTION_MAX` (inclusive). See `CUBLASLT_ALGO_CONFIG_CUSTOM_OPTION` of cublasLtMatmulAlgoConfigAttributes_t.	`int32_t`
`CUBLASLT_ALGO_CAP_MATHMODE_IMPL`	Indicates whether the algorithm is using regular compute or tensor operations. 0 means regular compute, 1 means tensor operations. DEPRECATED	`int32_t`
`CUBLASLT_ALGO_CAP_GAUSSIAN_IMPL`	Indicates whether the algorithm implements the Gaussian optimization of complex matrix multiplication. 0 means regular compute; 1 means Gaussian. See cublasMath_t. DEPRECATED	`int32_t`
`CUBLASLT_ALGO_CAP_CUSTOM_MEMORY_ORDER`	Indicates whether the algorithm supports custom (not COL or ROW) memory order. 0 means only COL and ROW memory order is allowed, non-zero means that the algorithm might have different requirements. See cublasLtOrder_t.	`int32_t`
`CUBLASLT_ALGO_CAP_POINTER_MODE_MASK`	Bitmask enumerating the pointer modes the algorithm supports. See cublasLtPointerModeMask_t.	`uint32_t`
`CUBLASLT_ALGO_CAP_EPILOGUE_MASK`	Bitmask enumerating the kinds of postprocessing the algorithm supports in the epilogue. See cublasLtEpilogue_t.	`uint32_t`
`CUBLASLT_ALGO_CAP_LD_NEGATIVE`	Support for negative leading dimension for all of the matrices. 0 means no support, supported otherwise.	`uint32_t`
`CUBLASLT_ALGO_CAP_NUMERICAL_IMPL_FLAGS`	Details about the algorithm’s implementation that affect its numerical behavior. See cublasLtNumericalImplFlags_t.	`uint64_t`
`CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_A_BYTES`	Minimum alignment required for A matrix in bytes.	`uint32_t`
`CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_B_BYTES`	Minimum alignment required for B matrix in bytes.	`uint32_t`
`CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_C_BYTES`	Minimum alignment required for C matrix in bytes.	`uint32_t`
`CUBLASLT_ALGO_CAP_MIN_ALIGNMENT_D_BYTES`	Minimum alignment required for D matrix in bytes.	`uint32_t`
`CUBLASLT_ALGO_CAP_FLOATING_POINT_EMULATION_SUPPORT`	Support for floating point emulation. See Floating Point Emulation.	`int32_t`
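The mask test described for `CUBLASLT_ALGO_CAP_REDUCTION_SCHEME_MASK` generalizes to any reduction scheme. The sketch below uses local constants so that it compiles without `cublasLt.h`; the `RS_*` names are hypothetical and their numeric values are an assumption (one bit per scheme) that should be checked against cublasLtReductionScheme_t.

```c
#include <stdint.h>

/* Assumed bit assignments for the reduction schemes; the RS_* names are
   stand-ins for the cublasLtReductionScheme_t enumerators. */
enum {
    RS_NONE         = 0,
    RS_INPLACE      = 1,
    RS_COMPUTE_TYPE = 2,
    RS_OUTPUT_TYPE  = 4
};

/* Returns 1 if every bit of `scheme` is present in `mask`, 0 otherwise.
   RS_NONE has no bits set, so it is always reported as supported. */
static int reduction_scheme_supported(uint32_t mask, uint32_t scheme)
{
    return (mask & scheme) == scheme;
}
```

Under these assumed values, a mask of 0x03 reports the in-place and compute-type schemes as supported and the output-type scheme as unsupported.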

3.3.7.cublasLtMatmulAlgoConfigAttributes_t

cublasLtMatmulAlgoConfigAttributes_t is an enumerated type that contains the configuration attributes for cuBLASLt matrix multiply algorithms. The configuration attributes are algorithm-specific and can be set by the user. The attribute configuration of a given algorithm should agree with its capability attributes. Use cublasLtMatmulAlgoConfigGetAttribute() and cublasLtMatmulAlgoConfigSetAttribute() to get and set the attribute value of a matmul algorithm descriptor.

Value	Description	Data Type
`CUBLASLT_ALGO_CONFIG_ID`	Read-only attribute. Algorithm index. See cublasLtMatmulAlgoGetIds(). Set by cublasLtMatmulAlgoInit().	`int32_t`
`CUBLASLT_ALGO_CONFIG_TILE_ID`	Tile id. See cublasLtMatmulTile_t. Default: `CUBLASLT_MATMUL_TILE_UNDEFINED`.	`uint32_t`
`CUBLASLT_ALGO_CONFIG_STAGES_ID`	Stages id. See cublasLtMatmulStages_t. Default: `CUBLASLT_MATMUL_STAGES_UNDEFINED`.	`uint32_t`
`CUBLASLT_ALGO_CONFIG_SPLITK_NUM`	Number of K splits. If the number of K splits is greater than one, SPLITK_NUM parts of matrix multiplication will be computed in parallel. The results will be accumulated according to `CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME`.	`uint32_t`
`CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME`	Reduction scheme to use when splitK value > 1. Default: `CUBLASLT_REDUCTION_SCHEME_NONE`. See cublasLtReductionScheme_t.	`uint32_t`
`CUBLASLT_ALGO_CONFIG_CTA_SWIZZLING`	Enable/Disable CTA swizzling. Change mapping from CUDA grid coordinates to parts of the matrices. Possible values: 0 and 1; other values reserved.	`uint32_t`
`CUBLASLT_ALGO_CONFIG_CUSTOM_OPTION`	Custom option value. Each algorithm can support some custom options that don’t fit the description of the other configuration attributes. See the `CUBLASLT_ALGO_CAP_CUSTOM_OPTION_MAX` of cublasLtMatmulAlgoCapAttributes_t for the accepted range for a specific case.	`uint32_t`
`CUBLASLT_ALGO_CONFIG_INNER_SHAPE_ID`	Inner shape ID. Refer to `cublasLtMatmulInnerShape_t`. Default: `CUBLASLT_MATMUL_INNER_SHAPE_UNDEFINED`.	`uint16_t`
`CUBLASLT_ALGO_CONFIG_CLUSTER_SHAPE_ID`	Cluster shape ID. Refer to `cublasLtClusterShape_t`. Default: `CUBLASLT_CLUSTER_SHAPE_AUTO`.	`uint16_t`
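The idea behind `CUBLASLT_ALGO_CONFIG_SPLITK_NUM` can be illustrated on the host for a single output element, which is a dot product over the k dimension. The sketch below is a CPU illustration of the technique only, not a description of how cuBLASLt implements it: the k range is cut into `split_num` chunks, each chunk yields a partial sum (on a GPU these would run in parallel), and the partial sums are then reduced. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product over k, computed as split_num partial sums that are then
   reduced. Illustrates split-K for one output element of C = A * B. */
static float dot_split_k(const float *a, const float *b, size_t k,
                         size_t split_num)
{
    assert(split_num >= 1);
    size_t chunk = (k + split_num - 1) / split_num; /* ceil(k / split_num) */
    float total = 0.0f;

    for (size_t s = 0; s < split_num; ++s) {
        size_t begin = s * chunk;
        size_t end = (begin + chunk < k) ? begin + chunk : k;
        float partial = 0.0f;
        for (size_t i = begin; i < end; ++i) { /* empty if begin >= k */
            partial += a[i] * b[i];
        }
        total += partial; /* reduction of the partial results */
    }
    return total;
}
```

Because floating-point addition is not associative, different split counts can produce slightly different results for general inputs, which is one reason the reduction scheme is configurable.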

3.3.8.cublasLtMatmulDesc_t

The cublasLtMatmulDesc_t is a pointer to an opaque structure holding the description of the matrix multiplication operation cublasLtMatmul(). A descriptor can be created by calling cublasLtMatmulDescCreate() and destroyed by calling cublasLtMatmulDescDestroy().

3.3.9.cublasLtMatmulDescAttributes_t

cublasLtMatmulDescAttributes_t is an enumerated type whose values identify the attributes that define the specifics of the matrix multiply operation. Use cublasLtMatmulDescGetAttribute() and cublasLtMatmulDescSetAttribute() to get and set the attribute value of a matmul descriptor.

Value	Description	Data Type
`CUBLASLT_MATMUL_DESC_COMPUTE_TYPE`	Compute type. Defines the data type used for multiply and accumulate operations, and the accumulator during the matrix multiplication. See cublasComputeType_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_SCALE_TYPE`	Scale type. Defines the data type of the scaling factors `alpha` and `beta`. The accumulator value and the value from matrix `C` are typically converted to scale type before final scaling. The value is then converted from scale type to the type of matrix `D` before storing in memory. The default value depends on `CUBLASLT_MATMUL_DESC_COMPUTE_TYPE`. See cudaDataType_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_POINTER_MODE`	Specifies how `alpha` and `beta` are passed by reference: whether they are scalars on the host or on the device, or device vectors. Default value is: `CUBLASLT_POINTER_MODE_HOST` (i.e., on the host). See cublasLtPointerMode_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_TRANSA`	Specifies the type of transformation operation that should be performed on matrix A. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_TRANSB`	Specifies the type of transformation operation that should be performed on matrix B. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_TRANSC`	Specifies the type of transformation operation that should be performed on matrix C. Currently only `CUBLAS_OP_N` is supported. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_FILL_MODE`	Indicates whether the lower or upper part of the dense matrix was filled, and consequently should be used by the function. Currently this flag is not supported for bfloat16 or FP8 data types and is not supported on the following GPUs: Hopper, Blackwell. Default value is: `CUBLAS_FILL_MODE_FULL`. See cublasFillMode_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_EPILOGUE`	Epilogue function. See cublasLtEpilogue_t. Default value is: `CUBLASLT_EPILOGUE_DEFAULT`.	`uint32_t`
`CUBLASLT_MATMUL_DESC_BIAS_POINTER`	Bias or Bias gradient vector pointer in the device memory. Input vector with length that matches the number of rows of matrix D when one of the following epilogues is used: `CUBLASLT_EPILOGUE_BIAS`, `CUBLASLT_EPILOGUE_RELU_BIAS`, `CUBLASLT_EPILOGUE_RELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_GELU_BIAS`, `CUBLASLT_EPILOGUE_GELU_AUX_BIAS`. Output vector with length that matches the number of rows of matrix D when one of the following epilogues is used: `CUBLASLT_EPILOGUE_DRELU_BGRAD`, `CUBLASLT_EPILOGUE_DGELU_BGRAD`, `CUBLASLT_EPILOGUE_BGRADA`. Output vector with length that matches the number of columns of matrix D when the following epilogue is used: `CUBLASLT_EPILOGUE_BGRADB`. Bias vector elements are the same type as `alpha` and `beta` (see `CUBLASLT_MATMUL_DESC_SCALE_TYPE` in this table) when matrix D datatype is `CUDA_R_8I` and same as matrix D datatype otherwise. See the datatypes table under cublasLtMatmul() for detailed mapping. Default value is: NULL.	`void *` / `const void *`
`CUBLASLT_MATMUL_DESC_BIAS_BATCH_STRIDE`	Stride (in elements) to the next bias or bias gradient vector for strided batch operations. The default value is 0.	`int64_t`
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`	Pointer for epilogue auxiliary buffer. Output vector for ReLU bit-mask in forward pass when `CUBLASLT_EPILOGUE_RELU_AUX` or `CUBLASLT_EPILOGUE_RELU_AUX_BIAS` epilogue is used. Input vector for ReLU bit-mask in backward pass when `CUBLASLT_EPILOGUE_DRELU` or `CUBLASLT_EPILOGUE_DRELU_BGRAD` epilogue is used. Output of GELU input matrix in forward pass when `CUBLASLT_EPILOGUE_GELU_AUX_BIAS` epilogue is used. Input of GELU input matrix for backward pass when `CUBLASLT_EPILOGUE_DGELU` or `CUBLASLT_EPILOGUE_DGELU_BGRAD` epilogue is used. For aux data type, see `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_DATA_TYPE`. Routines that don’t dereference this pointer, like cublasLtMatmulAlgoGetHeuristic(), depend on its value to determine expected pointer alignment. Requires setting the `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD` attribute.	`void *` / `const void *`
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD`	Leading dimension for epilogue auxiliary buffer. ReLU bit-mask matrix leading dimension in elements (i.e. bits) when `CUBLASLT_EPILOGUE_RELU_AUX`, `CUBLASLT_EPILOGUE_RELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_DRELU`, or `CUBLASLT_EPILOGUE_DRELU_BGRAD` epilogue is used. Must be divisible by 128 and be no less than the number of rows in the output matrix. GELU input matrix leading dimension in elements when `CUBLASLT_EPILOGUE_GELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_DGELU`, or `CUBLASLT_EPILOGUE_DGELU_BGRAD` epilogue is used. Must be divisible by 8 and be no less than the number of rows in the output matrix.	`int64_t`
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_BATCH_STRIDE`	Batch stride for epilogue auxiliary buffer. ReLU bit-mask matrix batch stride in elements (i.e. bits) when `CUBLASLT_EPILOGUE_RELU_AUX`, `CUBLASLT_EPILOGUE_RELU_AUX_BIAS` or `CUBLASLT_EPILOGUE_DRELU_BGRAD` epilogue is used. Must be divisible by 128. GELU input matrix batch stride in elements when `CUBLASLT_EPILOGUE_GELU_AUX_BIAS`, `CUBLASLT_EPILOGUE_DGELU`, or `CUBLASLT_EPILOGUE_DGELU_BGRAD` epilogue is used. Must be divisible by 8. Default value: 0.	`int64_t`
`CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE`	Batch stride for alpha vector. Used together with `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST` when matrix D’s `CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT` is greater than 1. If `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO` is set then `CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE` must be set to 0 as this mode doesn’t support batched alpha vector. If cublasLtBatchMode_t of any matrix is set to CUBLASLT_BATCH_MODE_POINTER_ARRAY then `CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE` must be set to 0. Default value: 0.	`int64_t`
`CUBLASLT_MATMUL_DESC_SM_COUNT_TARGET`	Number of SMs to target for parallel execution. Optimizes heuristics for execution on a different number of SMs when user expects a concurrent stream to be using some of the device resources. Default value: 0.	`int32_t`
`CUBLASLT_MATMUL_DESC_A_SCALE_POINTER`	Device pointer to the scale factor value that converts data in matrix A to the compute data type range. The scaling factor must have the same type as the compute type. If not specified, or set to NULL, the scaling factor is assumed to be 1. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	`const void *`
`CUBLASLT_MATMUL_DESC_B_SCALE_POINTER`	Equivalent to `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` for matrix B. Default value: NULL	`const void *`
`CUBLASLT_MATMUL_DESC_C_SCALE_POINTER`	Equivalent to `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` for matrix C. Default value: NULL	`const void *`
`CUBLASLT_MATMUL_DESC_D_SCALE_POINTER`	Equivalent to `CUBLASLT_MATMUL_DESC_A_SCALE_POINTER` for matrix D. Default value: NULL	`const void *`
`CUBLASLT_MATMUL_DESC_AMAX_D_POINTER`	Device pointer to the memory location that on completion will be set to the maximum of absolute values in the output matrix. The computed value has the same type as the compute type. If not specified, or set to NULL, the maximum absolute value is not computed. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	`void *`
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_DATA_TYPE`	The type of the data that will be stored in `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`. If unset (or set to the default value of -1), the data type is set to be the output matrix element data type (DType) with some exceptions: ReLU uses a bit-mask. For FP8 kernels with an output type (DType) of `CUDA_R_8F_E4M3`, the data type can be set to a non-default value if: (1) AType and BType are `CUDA_R_8F_E4M3`; (2) Bias Type is `CUDA_R_16F`; (3) CType is `CUDA_R_16BF` or `CUDA_R_16F`; (4) `CUBLASLT_MATMUL_DESC_EPILOGUE` is set to `CUBLASLT_EPILOGUE_GELU_AUX`. When CType is `CUDA_R_16F`, the data type may be set to `CUDA_R_16F` or `CUDA_R_8F_E4M3`. When CType is `CUDA_R_16BF`, the data type may be set to `CUDA_R_16BF`. Otherwise, the data type should be left unset or set to the default value of -1. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: -1	`int32_t` (cudaDataType_t)
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_SCALE_POINTER`	Device pointer to the scaling factor value to convert results from compute type data range to storage data range in the auxiliary matrix that is set via `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`. The scaling factor value must have the same type as the compute type. If not specified, or set to NULL, the scaling factor is assumed to be 1. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	`void *`
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_AMAX_POINTER`	Device pointer to the memory location that on completion will be set to the maximum of absolute values in the buffer that is set via `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_POINTER`. The computed value has the same type as the compute type. If not specified, or set to NULL, the maximum absolute value is not computed. If set for an unsupported matrix data, scale, and compute type combination, calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL	`void *`
`CUBLASLT_MATMUL_DESC_FAST_ACCUM`	Flag for managing FP8 fast accumulation mode. When enabled, on some GPUs problem execution might be faster but at the cost of lower accuracy because intermediate results will not periodically be promoted to a higher precision. Currently this flag has an effect on the following GPUs: Ada, Hopper. Default value: 0 - fast accumulation mode is disabled	`int8_t`
`CUBLASLT_MATMUL_DESC_BIAS_DATA_TYPE`	Type of the bias or bias gradient vector in the device memory. Bias case: see `CUBLASLT_EPILOGUE_BIAS`. If unset (or set to the default value of -1), the bias vector elements are the same type as the elements of the output matrix (Dtype) with the following exceptions: (1) IMMA kernels with computeType=`CUDA_R_32I` and `Ctype=CUDA_R_8I`, where the bias vector elements are the same type as alpha and beta (`CUBLASLT_MATMUL_DESC_SCALE_TYPE=CUDA_R_32F`); (2) FP8 kernels with an output type of `CUDA_R_32F`, `CUDA_R_8F_E4M3` or `CUDA_R_8F_E5M2`. See cublasLtMatmul() for more details. Default value: -1	`int32_t` (cudaDataType_t)
`CUBLASLT_MATMUL_DESC_A_SCALE_MODE`	Scaling mode that defines how the matrix scaling factor for matrix A is interpreted. Default value: 0. See cublasLtMatmulMatrixScale_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_B_SCALE_MODE`	Scaling mode that defines how the matrix scaling factor for matrix B is interpreted. Default value: 0. See cublasLtMatmulMatrixScale_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_C_SCALE_MODE`	Scaling mode that defines how the matrix scaling factor for matrix C is interpreted. Default value: 0. See cublasLtMatmulMatrixScale_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_D_SCALE_MODE`	Scaling mode that defines how the matrix scaling factor for matrix D is interpreted. Default value: 0. See cublasLtMatmulMatrixScale_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_SCALE_MODE`	Scaling mode that defines how the matrix scaling factor for the auxiliary matrix is interpreted. Default value: 0. See cublasLtMatmulMatrixScale_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_D_OUT_SCALE_POINTER`	Device pointer to the scale factors that are used to convert data in matrix D to the compute data type range. The scaling factor value type is defined by the scaling mode (see `CUBLASLT_MATMUL_DESC_D_OUT_SCALE_MODE`). If set for an unsupported matrix data, scale, scale mode, and compute type combination, or missing for a supported combination, then calling cublasLtMatmul() will return `CUBLAS_STATUS_INVALID_VALUE`. Default value: NULL.	`void *`
`CUBLASLT_MATMUL_DESC_D_OUT_SCALE_MODE`	Scaling mode that defines how the output matrix scaling factor for matrix D is interpreted. Default value: 0. See cublasLtMatmulMatrixScale_t.	`int32_t`
`CUBLASLT_MATMUL_DESC_EMULATION_DESCRIPTOR`	Emulation descriptor to configure floating point emulation parameters. Default value: NULL.	`int32_t`
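The divisibility rules for `CUBLASLT_MATMUL_DESC_EPILOGUE_AUX_LD` amount to rounding the number of output rows up to a multiple of 128 (ReLU bit-mask, in bits) or 8 (GELU input, in elements). A minimal sketch of that arithmetic follows; the helper names are hypothetical and not part of the library.

```c
#include <assert.h>
#include <stdint.h>

/* Round value up to the next multiple of `multiple` (both positive). */
static int64_t round_up(int64_t value, int64_t multiple)
{
    assert(value >= 0 && multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

/* Smallest leading dimension (in bits) for a ReLU bit-mask aux buffer:
   divisible by 128 and no less than the number of output rows. */
static int64_t relu_aux_min_ld(int64_t rows)
{
    return round_up(rows, 128);
}

/* Smallest leading dimension (in elements) for a GELU aux buffer:
   divisible by 8 and no less than the number of output rows. */
static int64_t gelu_aux_min_ld(int64_t rows)
{
    return round_up(rows, 8);
}
```

For an output matrix with 100 rows, these helpers give 128 for the ReLU bit-mask and 104 for the GELU input matrix.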

3.3.10.cublasLtMatmulHeuristicResult_t

cublasLtMatmulHeuristicResult_t is a descriptor that holds the configured matrix multiplication algorithm descriptor and its runtime properties.

Member	Description
cublasLtMatmulAlgo_t algo;	Must be initialized with cublasLtMatmulAlgoInit() if the preference `CUBLASLT_MATMUL_PREF_SEARCH_MODE` is set to `CUBLASLT_SEARCH_LIMITED_BY_ALGO_ID`. See cublasLtMatmulSearch_t.
`size_t` workspaceSize;	Actual size of workspace memory required.
cublasStatus_t state;	Result status. Other fields are valid only if, after the call to cublasLtMatmulAlgoGetHeuristic(), this member is set to `CUBLAS_STATUS_SUCCESS`.
`float` wavesCount;	Waves count is a device utilization metric. A`wavesCount` value of 1.0f suggests that when the kernel is launched it will fully occupy the GPU.
`int` reserved[4];	Reserved.
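When scanning an array of heuristic results, the `state` and `workspaceSize` members determine whether an entry is usable. The sketch below uses a stand-in structure that mirrors the members in the table so that it compiles without `cublasLt.h`; the type name, the 64-byte `algo` placeholder, and the assumption that the success status equals 0 are all illustrative and should be replaced by the real definitions.

```c
#include <stddef.h>

/* Assumed numeric value of the success status; stand-in only. */
enum { STATUS_SUCCESS = 0 };

/* Stand-in mirroring the documented members of the heuristic result. */
typedef struct {
    unsigned char algo[64];   /* placeholder for cublasLtMatmulAlgo_t */
    size_t        workspaceSize;
    int           state;
    float         wavesCount;
    int           reserved[4];
} heuristic_result_t;

/* Return the index of the first entry whose state is success and whose
   workspace requirement fits in workspace_available, or -1 if none. */
static int pick_first_usable(const heuristic_result_t *results, int count,
                             size_t workspace_available)
{
    for (int i = 0; i < count; ++i) {
        if (results[i].state == STATUS_SUCCESS &&
            results[i].workspaceSize <= workspace_available) {
            return i;
        }
    }
    return -1;
}
```

Entries whose `state` is not the success status are skipped because their other members are not valid.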

3.3.11.cublasLtMatmulInnerShape_t

cublasLtMatmulInnerShape_t is an enumerated type used to configure various aspects of the internal kernel design. This does not impact the CUDA grid size.

Value	Description
`CUBLASLT_MATMUL_INNER_SHAPE_UNDEFINED`	Inner shape is undefined.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA884`	Inner shape is MMA884.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA1684`	Inner shape is MMA1684.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA1688`	Inner shape is MMA1688.
`CUBLASLT_MATMUL_INNER_SHAPE_MMA16816`	Inner shape is MMA16816.

3.3.12.cublasLtMatmulPreference_t

The cublasLtMatmulPreference_t is a pointer to an opaque structure holding the description of the preferences for cublasLtMatmulAlgoGetHeuristic() configuration. Use cublasLtMatmulPreferenceCreate() to create one instance of the descriptor and cublasLtMatmulPreferenceDestroy() to destroy a previously created descriptor and release the resources.

3.3.13.cublasLtMatmulPreferenceAttributes_t

cublasLtMatmulPreferenceAttributes_t is an enumerated type used to apply algorithm search preferences while fine-tuning the heuristic function. Use cublasLtMatmulPreferenceGetAttribute() and cublasLtMatmulPreferenceSetAttribute() to get and set the attribute value of a matmul preference descriptor.

Value	Description	Data Type
`CUBLASLT_MATMUL_PREF_SEARCH_MODE`	Search mode. See cublasLtMatmulSearch_t. Default is `CUBLASLT_SEARCH_BEST_FIT`.	`uint32_t`
`CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES`	Maximum allowed workspace memory. Default is 0 (no workspace memory allowed).	`uint64_t`
`CUBLASLT_MATMUL_PREF_REDUCTION_SCHEME_MASK`	Reduction scheme mask. See cublasLtReductionScheme_t. Only algorithm configurations specifying `CUBLASLT_ALGO_CONFIG_REDUCTION_SCHEME` that is not masked out by this attribute are allowed. For example, a mask value of 0x03 will allow only `INPLACE` and `COMPUTE_TYPE` reduction schemes. Default is `CUBLASLT_REDUCTION_SCHEME_MASK` (i.e., allows all reduction schemes).	`uint32_t`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_A_BYTES`	Minimum buffer alignment for matrix A (in bytes). Selecting a smaller value will exclude algorithms that cannot work with a matrix A that is not as strictly aligned as the algorithms need. Default is 256 bytes.	`uint32_t`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_B_BYTES`	Minimum buffer alignment for matrix B (in bytes). Selecting a smaller value will exclude algorithms that cannot work with a matrix B that is not as strictly aligned as the algorithms need. Default is 256 bytes.	`uint32_t`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_C_BYTES`	Minimum buffer alignment for matrix C (in bytes). Selecting a smaller value will exclude algorithms that cannot work with a matrix C that is not as strictly aligned as the algorithms need. Default is 256 bytes.	`uint32_t`
`CUBLASLT_MATMUL_PREF_MIN_ALIGNMENT_D_BYTES`	Minimum buffer alignment for matrix D (in bytes). Selecting a smaller value will exclude algorithms that cannot work with a matrix D that is not as strictly aligned as the algorithms need. Default is 256 bytes.	`uint32_t`
`CUBLASLT_MATMUL_PREF_MAX_WAVES_COUNT`	Maximum wave count. See `cublasLtMatmulHeuristicResult_t::wavesCount`. Selecting a non-zero value will exclude algorithms that report device utilization higher than specified. Default is `0.0f`.	`float`
`CUBLASLT_MATMUL_PREF_IMPL_MASK`	Numerical implementation details mask. See cublasLtNumericalImplFlags_t. Filters heuristic result to only include algorithms that use the allowed implementations. Default: `uint64_t(-1)` (allow everything).	`uint64_t`

3.3.14.cublasLtMatmulSearch_t

cublasLtMatmulSearch_t is an enumerated type that contains the attributes for heuristics search type.

Value	Description
`CUBLASLT_SEARCH_BEST_FIT`	Request heuristics for the best algorithm for the given use case.
`CUBLASLT_SEARCH_LIMITED_BY_ALGO_ID`	Request heuristics only for the pre-configured algo id.

3.3.15.cublasLtMatmulTile_t

cublasLtMatmulTile_t is an enumerated type used to set the tile size in rows x columns. See also CUTLASS: Fast Linear Algebra in CUDA C++.

Value	Description
`CUBLASLT_MATMUL_TILE_UNDEFINED`	Tile size is undefined.
`CUBLASLT_MATMUL_TILE_8x8`	Tile size is 8 rows x 8 columns.
`CUBLASLT_MATMUL_TILE_8x16`	Tile size is 8 rows x 16 columns.
`CUBLASLT_MATMUL_TILE_16x8`	Tile size is 16 rows x 8 columns.
`CUBLASLT_MATMUL_TILE_8x32`	Tile size is 8 rows x 32 columns.
`CUBLASLT_MATMUL_TILE_16x16`	Tile size is 16 rows x 16 columns.
`CUBLASLT_MATMUL_TILE_32x8`	Tile size is 32 rows x 8 columns.
`CUBLASLT_MATMUL_TILE_8x64`	Tile size is 8 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_16x32`	Tile size is 16 rows x 32 columns.
`CUBLASLT_MATMUL_TILE_32x16`	Tile size is 32 rows x 16 columns.
`CUBLASLT_MATMUL_TILE_64x8`	Tile size is 64 rows x 8 columns.
`CUBLASLT_MATMUL_TILE_32x32`	Tile size is 32 rows x 32 columns.
`CUBLASLT_MATMUL_TILE_32x64`	Tile size is 32 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_64x32`	Tile size is 64 rows x 32 columns.
`CUBLASLT_MATMUL_TILE_32x128`	Tile size is 32 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_64x64`	Tile size is 64 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_128x32`	Tile size is 128 rows x 32 columns.
`CUBLASLT_MATMUL_TILE_64x128`	Tile size is 64 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_128x64`	Tile size is 128 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_64x256`	Tile size is 64 rows x 256 columns.
`CUBLASLT_MATMUL_TILE_128x128`	Tile size is 128 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_256x64`	Tile size is 256 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_64x512`	Tile size is 64 rows x 512 columns.
`CUBLASLT_MATMUL_TILE_128x256`	Tile size is 128 rows x 256 columns.
`CUBLASLT_MATMUL_TILE_256x128`	Tile size is 256 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_512x64`	Tile size is 512 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_64x96`	Tile size is 64 rows x 96 columns.
`CUBLASLT_MATMUL_TILE_96x64`	Tile size is 96 rows x 64 columns.
`CUBLASLT_MATMUL_TILE_96x128`	Tile size is 96 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_128x160`	Tile size is 128 rows x 160 columns.
`CUBLASLT_MATMUL_TILE_160x128`	Tile size is 160 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_192x128`	Tile size is 192 rows x 128 columns.
`CUBLASLT_MATMUL_TILE_128x192`	Tile size is 128 rows x 192 columns.
`CUBLASLT_MATMUL_TILE_128x96`	Tile size is 128 rows x 96 columns.

3.3.16.cublasLtMatmulStages_t

cublasLtMatmulStages_t is an enumerated type used to configure the size and number of shared memory buffers where input elements are staged. The number of staging buffers defines the kernel’s pipeline depth.

Value	Description
`CUBLASLT_MATMUL_STAGES_UNDEFINED`	Stage size is undefined.
`CUBLASLT_MATMUL_STAGES_16x1`	Stage size is 16, number of stages is 1.
`CUBLASLT_MATMUL_STAGES_16x2`	Stage size is 16, number of stages is 2.
`CUBLASLT_MATMUL_STAGES_16x3`	Stage size is 16, number of stages is 3.
`CUBLASLT_MATMUL_STAGES_16x4`	Stage size is 16, number of stages is 4.
`CUBLASLT_MATMUL_STAGES_16x5`	Stage size is 16, number of stages is 5.
`CUBLASLT_MATMUL_STAGES_16x6`	Stage size is 16, number of stages is 6.
`CUBLASLT_MATMUL_STAGES_32x1`	Stage size is 32, number of stages is 1.
`CUBLASLT_MATMUL_STAGES_32x2`	Stage size is 32, number of stages is 2.
`CUBLASLT_MATMUL_STAGES_32x3`	Stage size is 32, number of stages is 3.
`CUBLASLT_MATMUL_STAGES_32x4`	Stage size is 32, number of stages is 4.
`CUBLASLT_MATMUL_STAGES_32x5`	Stage size is 32, number of stages is 5.
`CUBLASLT_MATMUL_STAGES_32x6`	Stage size is 32, number of stages is 6.
`CUBLASLT_MATMUL_STAGES_64x1`	Stage size is 64, number of stages is 1.
`CUBLASLT_MATMUL_STAGES_64x2`	Stage size is 64, number of stages is 2.
`CUBLASLT_MATMUL_STAGES_64x3`	Stage size is 64, number of stages is 3.
`CUBLASLT_MATMUL_STAGES_64x4`	Stage size is 64, number of stages is 4.
`CUBLASLT_MATMUL_STAGES_64x5`	Stage size is 64, number of stages is 5.
`CUBLASLT_MATMUL_STAGES_64x6`	Stage size is 64, number of stages is 6.
`CUBLASLT_MATMUL_STAGES_128x1`	Stage size is 128, number of stages is 1.
`CUBLASLT_MATMUL_STAGES_128x2`	Stage size is 128, number of stages is 2.
`CUBLASLT_MATMUL_STAGES_128x3`	Stage size is 128, number of stages is 3.
`CUBLASLT_MATMUL_STAGES_128x4`	Stage size is 128, number of stages is 4.
`CUBLASLT_MATMUL_STAGES_128x5`	Stage size is 128, number of stages is 5.
`CUBLASLT_MATMUL_STAGES_128x6`	Stage size is 128, number of stages is 6.
`CUBLASLT_MATMUL_STAGES_32x10`	Stage size is 32, number of stages is 10.
`CUBLASLT_MATMUL_STAGES_8x4`	Stage size is 8, number of stages is 4.
`CUBLASLT_MATMUL_STAGES_16x10`	Stage size is 16, number of stages is 10.
`CUBLASLT_MATMUL_STAGES_8x5`	Stage size is 8, number of stages is 5.
`CUBLASLT_MATMUL_STAGES_8x3`	Stage size is 8, number of stages is 3.
`CUBLASLT_MATMUL_STAGES_8xAUTO`	Stage size is 8, number of stages is selected automatically.
`CUBLASLT_MATMUL_STAGES_16xAUTO`	Stage size is 16, number of stages is selected automatically.
`CUBLASLT_MATMUL_STAGES_32xAUTO`	Stage size is 32, number of stages is selected automatically.
`CUBLASLT_MATMUL_STAGES_64xAUTO`	Stage size is 64, number of stages is selected automatically.
`CUBLASLT_MATMUL_STAGES_128xAUTO`	Stage size is 128, number of stages is selected automatically.
`CUBLASLT_MATMUL_STAGES_256xAUTO`	Stage size is 256, number of stages is selected automatically.
`CUBLASLT_MATMUL_STAGES_768xAUTO`	Stage size is 768, number of stages is selected automatically.

3.3.17.cublasLtNumericalImplFlags_t

cublasLtNumericalImplFlags_t is a set of bit flags that can be specified to select implementation details that may affect the numerical behavior of algorithms.

Flags below can be combined using the bit OR operator “|”.

Value	Description
`CUBLASLT_NUMERICAL_IMPL_FLAGS_FMA`	Specify that the implementation is based on [H,F,D]FMA (fused multiply-add) family instructions.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_HMMA`	Specify that the implementation is based on HMMA (tensor operation) family instructions.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_IMMA`	Specify that the implementation is based on IMMA (integer tensor operation) family instructions.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_DMMA`	Specify that the implementation is based on DMMA (double precision tensor operation) family instructions.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_TENSOR_OP_MASK`	Mask to filter implementations using any of the above kinds of tensor operations.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_OP_TYPE_MASK`	Mask to filter implementation details about multiply-accumulate instructions used.

`CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_16F`	Specify that the implementation’s inner dot product is using half precision accumulator.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32F`	Specify that the implementation’s inner dot product is using single precision accumulator.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_64F`	Specify that the implementation’s inner dot product is using double precision accumulator.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_32I`	Specify that the implementation’s inner dot product is using 32 bit signed integer precision accumulator.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_ACCUMULATOR_TYPE_MASK`	Mask to filter implementation details about accumulator used.

`CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_16F`	Specify that the implementation’s inner dot product multiply-accumulate instruction is using half-precision inputs.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_16BF`	Specify that the implementation’s inner dot product multiply-accumulate instruction is using bfloat16 inputs.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_TF32`	Specify that the implementation’s inner dot product multiply-accumulate instruction is using TF32 inputs.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_32F`	Specify that the implementation’s inner dot product multiply-accumulate instruction is using single-precision inputs.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_64F`	Specify that the implementation’s inner dot product multiply-accumulate instruction is using double-precision inputs.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_INPUT_8I`	Specify that the implementation’s inner dot product multiply-accumulate instruction is using 8-bit integer inputs.
`CUBLASLT_NUMERICAL_IMPL_FLAGS_OP_INPUT_TYPE_MASK`	Mask to filter implementation details about accumulator input used.

`CUBLASLT_NUMERICAL_IMPL_FLAGS_GAUSSIAN`	Specify that the implementation applies the Gauss complexity reduction algorithm to reduce the arithmetic complexity of the complex matrix multiplication problem.

3.3.18.cublasLtMatrixLayout_t

The cublasLtMatrixLayout_t is a pointer to an opaque structure holding the description of a matrix layout. Use cublasLtMatrixLayoutCreate() to create one instance of the descriptor and cublasLtMatrixLayoutDestroy() to destroy a previously created descriptor and release the resources.

3.3.19.cublasLtMatrixLayoutAttribute_t

cublasLtMatrixLayoutAttribute_t is a descriptor structure containing the attributes that define the details of the matrix operation. Use cublasLtMatrixLayoutGetAttribute() and cublasLtMatrixLayoutSetAttribute() to get and set the attribute value of a matrix layout descriptor.

Value	Description	Data Type
`CUBLASLT_MATRIX_LAYOUT_TYPE`	Specifies the data precision type. See cudaDataType_t.	`uint32_t`
`CUBLASLT_MATRIX_LAYOUT_ORDER`	Specifies the memory order of the data of the matrix. Default value is `CUBLASLT_ORDER_COL`. See cublasLtOrder_t.	`int32_t`
`CUBLASLT_MATRIX_LAYOUT_ROWS`	Describes the number of rows in the matrix. Normally only values that can be expressed as `int32_t` are supported.	`uint64_t`
`CUBLASLT_MATRIX_LAYOUT_COLS`	Describes the number of columns in the matrix. Normally only values that can be expressed as `int32_t` are supported.	`uint64_t`
`CUBLASLT_MATRIX_LAYOUT_LD`	The leading dimension of the matrix. For `CUBLASLT_ORDER_COL` this is the stride (in elements) of matrix column. See also cublasLtOrder_t. Currently only non-negative values are supported. Must be large enough so that matrix memory locations are not overlapping (e.g., greater than or equal to `CUBLASLT_MATRIX_LAYOUT_ROWS` in case of `CUBLASLT_ORDER_COL`).	`int64_t`
`CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT`	Number of matmul operations to perform in the batch. Default value is 1. See also `CUBLASLT_ALGO_CAP_STRIDED_BATCH_SUPPORT` and `CUBLASLT_ALGO_CAP_POINTER_ARRAY_BATCH_SUPPORT` in cublasLtMatmulAlgoCapAttributes_t.	`int32_t`
`CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET`	Stride (in elements) to the next matrix for the strided batch operation. Default value is 0. When matrix type is planar-complex (`CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET` != 0), batch stride is interpreted by cublasLtMatmul() in number of real valued sub-elements. E.g. for data of type CUDA_C_16F, offset of 1024B is encoded as a stride of value 512 (since each element of the real and imaginary matrices is a 2B (16bit) floating point type). NOTE: A bug in cublasLtMatrixTransform() causes it to interpret the batch stride for a planar-complex matrix as if it was specified in number of complex elements. Therefore an offset of 1024B must be encoded as stride value 256 when calling cublasLtMatrixTransform() (each complex element is 4B with real and imaginary values 2B each). This behavior is expected to be corrected in the next major cuBLAS version.	`int64_t`
`CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET`	Stride (in bytes) to the imaginary plane for planar-complex layout. Default value is 0, indicating that the layout is regular (real and imaginary parts of complex numbers are interleaved in memory for each element).	`int64_t`
`CUBLASLT_MATRIX_LAYOUT_BATCH_MODE`	The batch mode of the matrix. Default value is `CUBLASLT_BATCH_MODE_STRIDED`. See cublasLtBatchMode_t.	`int32_t`
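The two interpretations of the planar-complex batch stride described for `CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET` can be checked with a little arithmetic. The following is a minimal sketch, not library code: the helper names `planar_stride_for_matmul` and `planar_stride_for_transform` are hypothetical, and the only inputs are the byte offset between consecutive matrices and the size of one real-valued sub-element.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stride value as interpreted by cublasLtMatmul() for a planar-complex
 * matrix: counted in real-valued sub-elements. (Hypothetical helper.) */
int64_t planar_stride_for_matmul(int64_t offset_bytes, size_t real_elem_bytes) {
    assert(real_elem_bytes > 0);
    assert(offset_bytes % (int64_t)real_elem_bytes == 0);
    return offset_bytes / (int64_t)real_elem_bytes;
}

/* Stride value as interpreted by cublasLtMatrixTransform() according to the
 * note in the table: counted in complex elements, each twice the size of a
 * real-valued sub-element. (Hypothetical helper.) */
int64_t planar_stride_for_transform(int64_t offset_bytes, size_t real_elem_bytes) {
    int64_t complex_bytes = 2 * (int64_t)real_elem_bytes;
    assert(complex_bytes > 0);
    assert(offset_bytes % complex_bytes == 0);
    return offset_bytes / complex_bytes;
}
```

For the CUDA_C_16F case in the table (2-byte real sub-elements, 1024-byte offset), the first helper yields 512 and the second yields 256.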

3.3.20.cublasLtMatrixTransformDesc_t

The cublasLtMatrixTransformDesc_t is a pointer to an opaque structure holding the description of a matrix transformation operation. Use cublasLtMatrixTransformDescCreate() to create one instance of the descriptor and cublasLtMatrixTransformDescDestroy() to destroy a previously created descriptor and release the resources.

3.3.21.cublasLtMatrixTransformDescAttributes_t

cublasLtMatrixTransformDescAttributes_t is a descriptor structure containing the attributes that define the specifics of the matrix transform operation. Use cublasLtMatrixTransformDescGetAttribute() and cublasLtMatrixTransformDescSetAttribute() to get and set the attribute value of a matrix transform descriptor.

Value	Description	Data Type
`CUBLASLT_MATRIX_TRANSFORM_DESC_SCALE_TYPE`	Scale type. Inputs are converted to the scale type for scaling and summation, and results are then converted to the output type to store in the memory. For the supported data types see cudaDataType_t.	`int32_t`
`CUBLASLT_MATRIX_TRANSFORM_DESC_POINTER_MODE`	Specifies whether the scalars alpha and beta are passed by reference on the host or on the device. Default value is: `CUBLASLT_POINTER_MODE_HOST` (i.e., on the host). See cublasLtPointerMode_t.	`int32_t`
`CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSA`	Specifies the type of operation that should be performed on the matrix A. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	`int32_t`
`CUBLASLT_MATRIX_TRANSFORM_DESC_TRANSB`	Specifies the type of operation that should be performed on the matrix B. Default value is: `CUBLAS_OP_N` (i.e., non-transpose operation). See cublasOperation_t.	`int32_t`

3.3.22.cublasLtOrder_t

cublasLtOrder_t is an enumerated type used to indicate the data ordering of the matrix.

Value	Description
`CUBLASLT_ORDER_COL`	Data is ordered in column-major format. The leading dimension is the stride (in elements) to the beginning of next column in memory.
`CUBLASLT_ORDER_ROW`	Data is ordered in row-major format. The leading dimension is the stride (in elements) to the beginning of next row in memory.
`CUBLASLT_ORDER_COL32`	Data is ordered in column-major ordered tiles of 32 columns. The leading dimension is the stride (in elements) to the beginning of next group of 32-columns. For example, if the matrix has 33 columns and 2 rows, then the leading dimension must be at least `32*2=64`.
`CUBLASLT_ORDER_COL4_4R2_8C`	Data is ordered in column-major ordered tiles of composite tiles with total 32 columns and 8 rows. A tile is composed of interleaved inner tiles of 4 columns within 4 even or odd rows in an alternating pattern. The leading dimension is the stride (in elements) to the beginning of the first 32 column x 8 row tile for the next 32-wide group of columns. For example, if the matrix has 33 columns and 1 row, the leading dimension must be at least `(32*8)*1=256`.
`CUBLASLT_ORDER_COL32_2R_4R4`	Data is ordered in column-major ordered tiles of composite tiles with total 32 columns and 32 rows. Element offset within the tile is calculated as `(((row%8)/2*4+row/8)*2+row%2)*32+col`. The leading dimension is the stride (in elements) to the beginning of the first 32 column x 32 row tile for the next 32-wide group of columns. For example, if the matrix has 33 columns and 1 row, then its leading dimension must be at least `(32*32)*1=1024`.
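The leading-dimension examples in this table can be reproduced with a few lines of integer arithmetic. The sketch below is illustrative only. It assumes that, for matrices with more rows than the documented examples, the row count is rounded up to the tile height (8 or 32) before multiplying by 32; that generalization is inferred from the examples and is not stated in the table. The in-tile offset is read as `(((row%8)/2*4+row/8)*2+row%2)*32+col`. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round x up to the next multiple of m. */
int64_t round_up(int64_t x, int64_t m) {
    assert(m > 0 && x >= 0);
    return (x + m - 1) / m * m;
}

/* Minimum leading dimension (in elements) for each tiled ordering.
 * Rounding rows up to the tile height is an assumption generalizing the
 * documented single-tile examples. */
int64_t min_ld_col32(int64_t rows)        { return 32 * rows; }
int64_t min_ld_col4_4r2_8c(int64_t rows)  { return 32 * round_up(rows, 8); }
int64_t min_ld_col32_2r_4r4(int64_t rows) { return 32 * round_up(rows, 32); }

/* Element offset inside one 32 column x 32 row composite tile of
 * CUBLASLT_ORDER_COL32_2R_4R4; row and col are tile-local indices. */
int64_t tile_offset_col32_2r_4r4(int64_t row, int64_t col) {
    assert(row >= 0 && row < 32 && col >= 0 && col < 32);
    return (((row % 8) / 2 * 4 + row / 8) * 2 + row % 2) * 32 + col;
}
```

With these helpers, a 2-row COL32 matrix needs a leading dimension of 64, and a 1-row matrix needs 256 for COL4_4R2_8C and 1024 for COL32_2R_4R4, matching the table. The offset function maps the 32 x 32 tile onto the range 0 to 1023.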

3.3.23.cublasLtPointerMode_t

cublasLtPointerMode_t is an enumerated type used to set the pointer mode for the scaling factors alpha and beta.

Value	Description
`CUBLASLT_POINTER_MODE_HOST` = `CUBLAS_POINTER_MODE_HOST`	Matches `CUBLAS_POINTER_MODE_HOST`, and the pointer targets a single value in host memory.
`CUBLASLT_POINTER_MODE_DEVICE` = `CUBLAS_POINTER_MODE_DEVICE`	Matches `CUBLAS_POINTER_MODE_DEVICE`, and the pointer targets a single value in device memory.
`CUBLASLT_POINTER_MODE_DEVICE_VECTOR` = 2	Pointers target device memory vectors of length equal to the number of rows of matrix D.
`CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO` = 3	`alpha` pointer targets a device memory vector of length equal to the number of rows of matrix D, and `beta` is zero.
`CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST` = 4	`alpha` pointer targets a device memory vector of length equal to the number of rows of matrix D, and `beta` is a single value in host memory.

Note

Only pointer modes CUBLASLT_POINTER_MODE_HOST and CUBLASLT_POINTER_MODE_DEVICE are supported when cublasLtBatchMode_t of any matrix is set to CUBLASLT_BATCH_MODE_POINTER_ARRAY.

3.3.24.cublasLtPointerModeMask_t

cublasLtPointerModeMask_t is an enumerated type used to define and query the pointer mode capability.

Value	Description
`CUBLASLT_POINTER_MODE_MASK_HOST` = 1	See `CUBLASLT_POINTER_MODE_HOST` in cublasLtPointerMode_t.
`CUBLASLT_POINTER_MODE_MASK_DEVICE` = 2	See `CUBLASLT_POINTER_MODE_DEVICE` in cublasLtPointerMode_t.
`CUBLASLT_POINTER_MODE_MASK_DEVICE_VECTOR` = 4	See `CUBLASLT_POINTER_MODE_DEVICE_VECTOR` in cublasLtPointerMode_t.
`CUBLASLT_POINTER_MODE_MASK_ALPHA_DEVICE_VECTOR_BETA_ZERO` = 8	See `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO` in cublasLtPointerMode_t.
`CUBLASLT_POINTER_MODE_MASK_ALPHA_DEVICE_VECTOR_BETA_HOST` = 16	See `CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST` in cublasLtPointerMode_t.
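The mask values listed here line up with the pointer mode values as `1 << mode`. The sketch below illustrates that relationship using local stand-in constants rather than the real enums from `cublasLt.h`. It assumes that `CUBLAS_POINTER_MODE_HOST` is 0 and `CUBLAS_POINTER_MODE_DEVICE` is 1, as defined in the cuBLAS header; the helper names are hypothetical.

```c
#include <assert.h>

/* Local stand-ins for the cublasLtPointerMode_t values (assumption: HOST = 0
 * and DEVICE = 1, matching CUBLAS_POINTER_MODE_HOST / _DEVICE). Real code
 * would use the enums from cublasLt.h instead. */
enum {
    MODE_HOST = 0,
    MODE_DEVICE = 1,
    MODE_DEVICE_VECTOR = 2,
    MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO = 3,
    MODE_ALPHA_DEVICE_VECTOR_BETA_HOST = 4
};

/* Convert a pointer mode to the corresponding capability mask bit. */
unsigned pointer_mode_to_mask(int mode) {
    assert(mode >= MODE_HOST && mode <= MODE_ALPHA_DEVICE_VECTOR_BETA_HOST);
    return 1u << mode;
}

/* Return 1 if a queried capability mask includes the given pointer mode. */
int mask_supports_mode(unsigned capability_mask, int mode) {
    return (capability_mask & pointer_mode_to_mask(mode)) != 0;
}
```

A capability mask of `1 | 2`, for example, reports support for the host and device scalar modes but not for any of the device-vector modes.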

3.3.25.cublasLtReductionScheme_t

cublasLtReductionScheme_t is an enumerated type used to specify a reduction scheme for the portions of the dot-product calculated in parallel (i.e., “split-K”).

Value	Description
`CUBLASLT_REDUCTION_SCHEME_NONE`	Do not apply reduction. The dot-product will be performed in one sequence.
`CUBLASLT_REDUCTION_SCHEME_INPLACE`	Reduction is performed “in place” using the output buffer, parts are added up in the output data type. Workspace is only used for counters that guarantee sequentiality.
`CUBLASLT_REDUCTION_SCHEME_COMPUTE_TYPE`	Reduction done out of place in a user-provided workspace. The intermediate results are stored in the compute type in the workspace and reduced in a separate step.
`CUBLASLT_REDUCTION_SCHEME_OUTPUT_TYPE`	Reduction done out of place in a user-provided workspace. The intermediate results are stored in the output type in the workspace and reduced in a separate step.
`CUBLASLT_REDUCTION_SCHEME_MASK`	Allows all reduction schemes.
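To make the split-K idea concrete, the following CPU sketch splits one dot product along K into slices, stores each partial sum in a caller-provided workspace, and reduces the partials in a separate step. It loosely mirrors the out-of-place schemes described above with `float` standing in for the compute type. It is a conceptual illustration under those assumptions, not how cuBLASLt kernels are implemented, and `dot_split_k` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of x and y (length k) computed in `splits` slices along K.
 * workspace must hold at least `splits` floats. */
float dot_split_k(const float *x, const float *y, size_t k,
                  size_t splits, float *workspace) {
    assert(splits > 0);
    size_t chunk = (k + splits - 1) / splits;  /* slice length, rounded up */

    /* Phase 1: each slice produces a partial sum. On a GPU the slices would
     * run in parallel; here they run one after another. */
    for (size_t s = 0; s < splits; ++s) {
        size_t begin = s * chunk;
        size_t end = (begin + chunk < k) ? begin + chunk : k;
        float partial = 0.0f;
        for (size_t i = begin; i < end; ++i) {
            partial += x[i] * y[i];
        }
        workspace[s] = partial;
    }

    /* Phase 2: separate reduction step over the stored partial sums. */
    float result = 0.0f;
    for (size_t s = 0; s < splits; ++s) {
        result += workspace[s];
    }
    return result;
}
```

Because floating-point addition is not associative, the result of a split reduction can differ slightly from a single sequential sum, which is one reason the reduction scheme is exposed as a choice.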

3.3.26.cublasLtMatmulMatrixScale_t

cublasLtMatmulMatrixScale_t is an enumerated type used to specify scaling mode that defines how scaling factor pointers are interpreted.

Value	Description
`CUBLASLT_MATMUL_MATRIX_SCALE_SCALAR_32F`	Scaling factors are single-precision scalars applied to the whole tensors (this mode is the default for fp8). This is the only value valid for `CUBLASLT_MATMUL_DESC_D_SCALE_MODE` when the D tensor uses a narrow precision data type.
`CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3`	Scaling factors are tensors that contain a dedicated scaling factor stored as an 8-bit `CUDA_R_8F_UE4M3` value for each 16-element block in the innermost dimension of the corresponding data tensor.
`CUBLASLT_MATMUL_MATRIX_SCALE_VEC32_UE8M0`	Scaling factors are tensors that contain a dedicated scaling factor stored as an 8-bit `CUDA_R_8F_UE8M0` value for each 32-element block in the innermost dimension of the corresponding data tensor.
`CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F`	Scaling factors are vectors of CUDA_R_32F values. This mode is only applicable to matrices A and B, in which case the vectors are expected to have M and N elements respectively, and each (i, j)-th element of product of A and B is multiplied by i-th element of A scale and j-th element of B scale.
`CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F`	Scaling factors are tensors that contain a dedicated CUDA_R_32F scaling factor for each 128-element block in the innermost dimension of the corresponding data tensor.
`CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F`	Scaling factors are tensors that contain a dedicated CUDA_R_32F scaling factor for each 128x128-element block in the corresponding data tensor.
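The effect of `CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F` can be written out on the CPU: element (i, j) of the product of A and B is multiplied by the i-th element of the A scale vector and the j-th element of the B scale vector. The sketch below applies that rule to an already computed M x N product stored in column-major order. The function name and the column-major storage are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Apply outer-vector scaling in place to an m x n column-major product
 * matrix with leading dimension ld:
 *   prod(i, j) *= a_scale[i] * b_scale[j]
 * a_scale has m elements and b_scale has n elements. */
void apply_outer_vec_scale(float *prod, size_t m, size_t n, size_t ld,
                           const float *a_scale, const float *b_scale) {
    assert(ld >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            prod[i + j * ld] *= a_scale[i] * b_scale[j];
        }
    }
}
```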

3.3.27.cublasLtBatchMode_t

Value	Description
`CUBLASLT_BATCH_MODE_STRIDED`	The matrices of each instance of the batch are located at fixed offsets in number of elements from their locations in the previous instance.
`CUBLASLT_BATCH_MODE_POINTER_ARRAY`	The addresses of the matrices of each instance of the batch are read from arrays of pointers.

3.3.28.cublasLtEmulationDesc_t

cublasLtEmulationDesc_t is a pointer to an opaque structure holding the emulation descriptor. Use cublasLtEmulationDescCreate() to create a new emulation descriptor, and cublasLtEmulationDescDestroy() to destroy it and release the resources.

3.3.29.cublasLtEmulationDescAttributes_t

cublasLtEmulationDescAttributes_t is an enumerated type used to configure floating point emulation parameters. See Floating Point Emulation documentation for more details.

Value	Description	Data Type
`CUBLASLT_EMULATION_DESC_STRATEGY`	Strategy, see cublasEmulationStrategy_t. Defines when to use floating point emulation algorithms. Default: EMULATION_STRATEGY_DEFAULT.	`int32_t`
`CUBLASLT_EMULATION_DESC_SPECIAL_VALUES_SUPPORT`	Special values support, see cudaEmulationSpecialValuesSupport_t. Defines a bit mask of special cases in floating-point representations that must be supported. Default: EMULATION_SPECIAL_VALUE_HANDLING_DEFAULT.	`int32_t`
`CUBLASLT_EMULATION_DESC_FIXEDPOINT_MANTISSA_CONTROL`	Mantissa control, see cudaEmulationMantissaControl_t. For fixed-point emulation, defines how to compute the number of retained mantissa bits. See Floating Point Emulation documentation for more details.	`int32_t`
`CUBLASLT_EMULATION_DESC_FIXEDPOINT_MAX_MANTISSA_BIT_COUNT`	For fixed-point emulation only. An int32_t representing the maximum (up to quantization) number of mantissa bits to retain during fixed-point emulation. A default value of 0 allows the library to select a reasonable value based on device properties. Default: 0.	`int32_t`
`CUBLASLT_EMULATION_DESC_FIXEDPOINT_MANTISSA_BIT_OFFSET`	This parameter is for fixed-point emulation with `CUDA_EMULATION_MANTISSA_CONTROL_DYNAMIC` mantissa control (see cudaEmulationMantissaControl_t). An integer which can be used to bias the number of recommended mantissa bits. Default: 0.	`int32_t`
`CUBLASLT_EMULATION_DESC_FIXEDPOINT_MANTISSA_BIT_COUNT_POINTER`	This parameter is for fixed-point emulation. A device pointer which will contain the number of mantissa bits that were retained. If emulation is not used, the pointer will contain -1. Default: nullptr.	`int32_t*`

3.4.cuBLASLt API Reference

3.4.1.cublasLtCreate()

cublasStatus_t cublasLtCreate(cublasLtHandle_t *lighthandle)

This function initializes the cuBLASLt library and creates a handle to an opaque structure holding the cuBLASLt library context. It allocates light hardware resources on the host and device, and must be called prior to making any other cuBLASLt library calls.

The cuBLASLt library context is tied to the current CUDA device. To use the library on multiple devices, one cuBLASLt handle must be created for each device. Furthermore, the device must be set as the current device before invoking cuBLASLt functions with a handle tied to that device.

3.4.2.cublasLtDestroy()

cublasStatus_t cublasLtDestroy(cublasLtHandle_t lightHandle)

This function releases hardware resources used by the cuBLASLt library. This function is usually the last call with a particular handle to the cuBLASLt library. Because cublasLtCreate() allocates some internal resources and the release of those resources by calling cublasLtDestroy() will implicitly call cudaDeviceSynchronize(), it is recommended to minimize the number of times these functions are called.

Parameters:

Parameter	Memory	Input / Output	Description
`lightHandle`		Input	Pointer to the cuBLASLt handle to be destroyed.

Returns:

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The cuBLASLt context was successfully destroyed.
`CUBLAS_STATUS_NOT_INITIALIZED`	The cuBLASLt library was not initialized.
`CUBLAS_STATUS_INVALID_VALUE`	`lightHandle` is NULL

See cublasStatus_t for a complete list of valid return codes.

3.4.3.cublasLtDisableCpuInstructionsSetMask()

unsigned cublasLtDisableCpuInstructionsSetMask(unsigned mask);

Instructs cuBLASLt library to not use CPU instructions specified by the flags in the mask. The function takes precedence over the CUBLASLT_DISABLE_CPU_INSTRUCTIONS_MASK environment variable.

Parameters: mask - the flags combined with the bitwise OR (|) operator that specify which CPU instructions should not be used.

Supported flags:

Value	Description
`0x1`	x86-64 AVX512 ISA.

Returns: the previous value of the mask.

3.4.4.cublasLtGetCudartVersion()

size_t cublasLtGetCudartVersion(void);

This function returns the version number of the CUDA Runtime library.

Parameters: None.

Returns: size_t - The version number of the CUDA Runtime library.

3.4.5.cublasLtGetProperty()

cublasStatus_t cublasLtGetProperty(libraryPropertyType type, int *value);

This function returns the value of the requested property by writing it to the memory location pointed to by the value parameter.

Parameters:

Parameter	Memory	Input / Output	Description
`type`		Input	Of the type `libraryPropertyType`, whose value is requested from the property. See libraryPropertyType_t.
`value`		Output	Pointer to the host memory location where the requested information should be written.

Returns:

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	The requested `libraryPropertyType` information is successfully written at the provided address.
`CUBLAS_STATUS_INVALID_VALUE`	The value of the `type` input argument is invalid, or `value` is NULL.

See cublasStatus_t for a complete list of valid return codes.

3.4.6.cublasLtGetStatusName()

const char *cublasLtGetStatusName(cublasStatus_t status);

Returns the string representation of a given status.

Parameters: cublasStatus_t - the status.

Returns: const char* - the NULL-terminated string.

3.4.7.cublasLtGetStatusString()

const char *cublasLtGetStatusString(cublasStatus_t status);

Returns the description string for a given status.

Parameters: cublasStatus_t - the status.

Returns: const char* - the NULL-terminated string.

3.4.8.cublasLtHeuristicsCacheGetCapacity()

cublasStatus_t cublasLtHeuristicsCacheGetCapacity(size_t *capacity);

Returns the Heuristics Cache capacity.

Parameters:

Parameter	Description
`capacity`	The pointer to the returned capacity value.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	The capacity was successfully written.
`CUBLAS_STATUS_INVALID_VALUE`	An invalid argument was passed (for example, a NULL `capacity` pointer); the capacity was not written.

3.4.9.cublasLtHeuristicsCacheSetCapacity()

cublasStatus_t cublasLtHeuristicsCacheSetCapacity(size_t capacity);

Sets the Heuristics Cache capacity. Set the capacity to 0 to disable the heuristics cache.

This function takes precedence over the CUBLASLT_HEURISTICS_CACHE_CAPACITY environment variable.

Parameters:

Parameter	Description
`capacity`	The desired heuristics cache capacity.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	The capacity was successfully set.

3.4.10.cublasLtGetVersion()

size_t cublasLtGetVersion(void);

This function returns the version number of cuBLASLt library.

Parameters: None.

Returns: size_t - The version number of the cuBLASLt library.

3.4.11.cublasLtLoggerSetCallback()

cublasStatus_t cublasLtLoggerSetCallback(cublasLtLoggerCallback_t callback);

Experimental: This function sets the logging callback function.

Parameters:

Parameter	Memory	Input / Output	Description
`callback`		Input	Pointer to a callback function. See cublasLtLoggerCallback_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the callback function was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.12.cublasLtLoggerSetFile()

cublasStatus_t cublasLtLoggerSetFile(FILE *file);

Experimental: This function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Parameters:

Parameter	Memory	Input / Output	Description
`file`		Input	Pointer to an open file. File should have write permission.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If logging file was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.13.cublasLtLoggerOpenFile()

cublasStatus_t cublasLtLoggerOpenFile(const char *logFile);

Experimental: This function opens a logging output file in the given path.

Parameters:

Parameter	Memory	Input / Output	Description
`logFile`		Input	Path of the logging output file.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the logging file was successfully opened.

See cublasStatus_t for a complete list of valid return codes.

3.4.14.cublasLtLoggerSetLevel()

cublasStatus_t cublasLtLoggerSetLevel(int level);

Experimental: This function sets the value of the logging level.

Parameters:

Parameter	Memory	Input / Output	Description
`level`		Input	Value of the logging level. See cuBLASLt Logging.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If the value was not a valid logging level. See cuBLASLt Logging.
`CUBLAS_STATUS_SUCCESS`	If the logging level was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.15.cublasLtLoggerSetMask()

cublasStatus_t cublasLtLoggerSetMask(int mask);

Experimental: This function sets the value of the logging mask.

Parameters:

Parameter	Memory	Input / Output	Description
`mask`		Input	Value of the logging mask. See cuBLASLt Logging.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the logging mask was successfully set.

See cublasStatus_t for a complete list of valid return codes.

3.4.16.cublasLtLoggerForceDisable()

cublasStatus_t cublasLtLoggerForceDisable();

Experimental: This function disables logging for the entire run.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If logging was successfully disabled.

See cublasStatus_t for a complete list of valid return codes.

3.4.17.cublasLtMatmul()

cublasStatus_t cublasLtMatmul(
    cublasLtHandle_t lightHandle,
    cublasLtMatmulDesc_t computeDesc,
    const void *alpha,
    const void *A,
    cublasLtMatrixLayout_t Adesc,
    const void *B,
    cublasLtMatrixLayout_t Bdesc,
    const void *beta,
    const void *C,
    cublasLtMatrixLayout_t Cdesc,
    void *D,
    cublasLtMatrixLayout_t Ddesc,
    const cublasLtMatmulAlgo_t *algo,
    void *workspace,
    size_t workspaceSizeInBytes,
    cudaStream_t stream);

This function computes the matrix multiplication of matrices A and B to produce the output matrix D, according to the following operation:

D = alpha*(A*B) + beta*(C),

where A, B, and C are input matrices, and alpha and beta are input scalars.

Note

This function supports both in-place matrix multiplication (C == D and Cdesc == Ddesc) and out-of-place matrix multiplication (C != D, both matrices must have the same data type, number of rows, number of columns, batch size, and memory order). In the out-of-place case, the leading dimension of C can be different from the leading dimension of D. Specifically the leading dimension of C can be 0 to achieve row or column broadcast. If Cdesc is omitted, this function assumes it to be equal to Ddesc.
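As a reading aid for the operation D = alpha*(A*B) + beta*(C) and the broadcast remark in this note, here is a naive CPU reference for the simplest case: column-major `float` matrices, no transposes, no batching. It is a sketch for checking small results by hand, not a model of how cuBLASLt computes them, and `ref_matmul` is a hypothetical name. With `ldc` set to 0, every column of C is read from the same memory, so a single column vector is broadcast across all columns of the result.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference for D = alpha*(A*B) + beta*C with column-major storage.
 * A is m x k (leading dimension lda), B is k x n (ldb), C and D are m x n
 * (ldc, ldd). ldc may be 0, in which case C holds one column of m values
 * that is reused for every column j. In-place use (C == D, ldc == ldd)
 * also works because each element is read before it is written. */
void ref_matmul(size_t m, size_t n, size_t k,
                float alpha, const float *A, size_t lda,
                const float *B, size_t ldb,
                float beta, const float *C, size_t ldc,
                float *D, size_t ldd) {
    assert(lda >= m && ldb >= k && ldd >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            D[i + j * ldd] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

For example, with A set to the 2 x 2 identity, B holding {1, 2, 3, 4}, C holding the column {10, 20} with `ldc = 0`, and alpha = beta = 1, the result D is {11, 22, 13, 24}.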

The workspace pointer must be aligned to at least 256 bytes. The recommendations on workspaceSizeInBytes are the same as mentioned in the cublasSetWorkspace() section.
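When the workspace is carved out of a larger buffer rather than allocated on its own, the 256-byte alignment requirement can be checked or enforced with plain pointer arithmetic. The helpers below are a minimal sketch with hypothetical names. They operate on the address value only, so they apply equally to a device pointer that is never dereferenced on the host.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORKSPACE_ALIGNMENT 256u

/* Return 1 if ptr satisfies the 256-byte workspace alignment. */
int is_workspace_aligned(const void *ptr) {
    return ((uintptr_t)ptr % WORKSPACE_ALIGNMENT) == 0;
}

/* Round ptr up to the next 256-byte boundary. *usable_size receives the
 * number of bytes that remain after the padding, which is the value to
 * pass as the workspace size. */
void *align_workspace(void *ptr, size_t size_in_bytes, size_t *usable_size) {
    assert(usable_size != NULL);
    uintptr_t addr = (uintptr_t)ptr;
    uintptr_t aligned = (addr + WORKSPACE_ALIGNMENT - 1)
                        / WORKSPACE_ALIGNMENT * WORKSPACE_ALIGNMENT;
    size_t padding = (size_t)(aligned - addr);
    *usable_size = (padding <= size_in_bytes) ? size_in_bytes - padding : 0;
    return (void *)aligned;
}
```

The padding is at most 255 bytes, so the usable size shrinks by less than one alignment unit.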

Datatypes Supported:

cublasLtMatmul() supports the following computeType, scaleType, Atype/Btype, and Ctype. Footnotes can be found at the end of this section.

Table 1. When A, B, C, and D are Regular Column- or Row-major Matrices
computeType	scaleType	Atype/Btype	Ctype	Bias Type [6]
`CUBLAS_COMPUTE_16F` or `CUBLAS_COMPUTE_16F_PEDANTIC`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F` [6]
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`	Epilogue is not supported.
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_8I`	`CUDA_R_8I`	Epilogue is not supported.
`CUBLAS_COMPUTE_32F` or `CUBLAS_COMPUTE_32F_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF` [6]
		`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F` [6]
		`CUDA_R_8I`	`CUDA_R_32F`	Epilogue is not supported.
		`CUDA_R_16BF`	`CUDA_R_32F`	`CUDA_R_32F` [6]
		`CUDA_R_16F`	`CUDA_R_32F`	`CUDA_R_32F` [6]
		`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F` [6]
	`CUDA_C_32F` [7]	`CUDA_C_8I` [7]	`CUDA_C_32F` [7]	Epilogue is not supported.
	`CUDA_C_32F` [7]	`CUDA_C_32F` [7]	`CUDA_C_32F` [7]	Epilogue is not supported.
`CUBLAS_COMPUTE_32F_FAST_16F` or `CUBLAS_COMPUTE_32F_FAST_16BF` or `CUBLAS_COMPUTE_32F_FAST_TF32` or `CUBLAS_COMPUTE_32F_EMULATED_16BFX9`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_32F` [6]
	`CUDA_C_32F` [7]	`CUDA_C_32F` [7]	`CUDA_C_32F` [7]	Epilogue is not supported.
`CUBLAS_COMPUTE_64F` or `CUBLAS_COMPUTE_64F_PEDANTIC` or `CUBLAS_COMPUTE_64F_EMULATED_FIXEDPOINT`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F`	`CUDA_R_64F` [6]
	`CUDA_C_64F` [7]	`CUDA_C_64F` [7]	`CUDA_C_64F` [7]	Epilogue is not supported.

To use IMMA kernels, one of the following sets of requirements, with the first being the preferred one, must be met:

Using a regular data ordering:
- All matrix pointers must be 4-byte aligned. For even better performance, this condition should hold with 16 instead of 4.
- Leading dimensions of matrices A, B, C must be multiples of 4.
- Only the “TN” format is supported - A must be transposed and B non-transposed.
- Pointer mode can be CUBLASLT_POINTER_MODE_HOST, CUBLASLT_POINTER_MODE_DEVICE or CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_HOST. With the latter mode, the kernels support the CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE attribute.
- Dimensions m and k must be multiples of 4.
Using the IMMA-specific data ordering on the Ampere (compute capability 8.0) or Turing (compute capability 7.5) architecture (but not on Hopper, compute capability 9.0, or later), with CUBLASLT_ORDER_COL32 for matrices A, C, D, and CUBLASLT_ORDER_COL4_4R2_8C (on Turing or Ampere architecture) or CUBLASLT_ORDER_COL32_2R_4R4 (on Ampere architecture) for matrix B:
- Leading dimensions of matrices A, B, C must fulfill conditions specific to the memory ordering (see cublasLtOrder_t).
- Matmul descriptor must specify CUBLAS_OP_T on matrix B and CUBLAS_OP_N (default) on matrix A and C.
- If scaleType CUDA_R_32I is used, the only supported values for alpha and beta are 0 or 1.
- Pointer mode can be CUBLASLT_POINTER_MODE_HOST, CUBLASLT_POINTER_MODE_DEVICE, CUBLASLT_POINTER_MODE_DEVICE_VECTOR or CUBLASLT_POINTER_MODE_ALPHA_DEVICE_VECTOR_BETA_ZERO. These kernels do not support CUBLASLT_MATMUL_DESC_ALPHA_VECTOR_BATCH_STRIDE.
- Only the “NT” format is supported - A must be non-transposed and B transposed.
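The first (regular data ordering) set of requirements above can be turned into a host-side pre-check. This is only a sketch: the `ImmaProblem` struct and the function names are hypothetical, the pointer-mode condition is not modelled, and passing the check does not guarantee that an IMMA kernel is selected.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of a GEMM problem; not a cuBLASLt type.
struct ImmaProblem {
    std::uintptr_t a, b, c, d;   // device addresses of the four matrices
    std::int64_t lda, ldb, ldc;  // leading dimensions, in elements
    std::int64_t m, k;           // problem dimensions
    bool transposeA;             // true when A is transposed
    bool transposeB;             // true when B is transposed
};

inline bool isMultipleOf(std::int64_t value, std::int64_t factor) {
    assert(factor > 0);
    return value % factor == 0;
}

// Checks the regular-ordering conditions: 4-byte aligned pointers, leading
// dimensions and m, k that are multiples of 4, and the "TN" format.
inline bool meetsRegularImmaRequirements(const ImmaProblem& p) {
    const bool pointersAligned = ((p.a | p.b | p.c | p.d) & 3u) == 0;
    const bool leadingDimsOk =
        isMultipleOf(p.lda, 4) && isMultipleOf(p.ldb, 4) && isMultipleOf(p.ldc, 4);
    const bool formatIsTN = p.transposeA && !p.transposeB;
    const bool dimsOk = isMultipleOf(p.m, 4) && isMultipleOf(p.k, 4);
    return pointersAligned && leadingDimsOk && formatIsTN && dimsOk;
}
```

Running such a check before building the descriptors makes it easier to tell a violated precondition apart from a genuinely unsupported configuration.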

Table 2. When A, B, C, and D Use Layouts for IMMA
computeType	scaleType	Atype/Btype	Ctype	Bias Type
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32I`	`CUDA_R_8I`	`CUDA_R_32I`	Non-default epilogue not supported.
`CUBLAS_COMPUTE_32I` or `CUBLAS_COMPUTE_32I_PEDANTIC`	`CUDA_R_32F`	`CUDA_R_8I`	`CUDA_R_8I`	`CUDA_R_32F`

To use tensor- or block-scaled FP8 kernels, the following set of requirements must be satisfied:

- All matrix dimensions must meet the optimal requirements listed in Tensor Core Usage (i.e. pointers and matrix dimensions must support 16-byte alignment).
- Scaling mode must meet the restrictions noted in the Scaling Mode Support Overview table.
- A must be transposed and B non-transposed (the “TN” format) on Ada (compute capability 8.9), Hopper (compute capability 9.0), and Blackwell GeForce (compute capability 12.x) GPUs.
- The compute type must be CUBLAS_COMPUTE_32F.
- The scale type must be CUDA_R_32F.

See the table below when using FP8 kernels:

Table 3. When A, B, C, and D Use Layouts for FP8
AType	BType	CType	DType	Bias Type
`CUDA_R_8F_E4M3`	`CUDA_R_8F_E4M3`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF`6
		`CUDA_R_16BF`	`CUDA_R_8F_E4M3`8	`CUDA_R_16BF`6
		`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`6
		`CUDA_R_16F`	`CUDA_R_8F_E4M3`8	`CUDA_R_16F`6
		`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_16BF`6
	`CUDA_R_8F_E5M2`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF`6
			`CUDA_R_8F_E4M3`8	`CUDA_R_16BF`6
			`CUDA_R_8F_E5M2`8	`CUDA_R_16BF`6
		`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`6
			`CUDA_R_8F_E4M3`8	`CUDA_R_16F`6
			`CUDA_R_8F_E5M2`8	`CUDA_R_16F`6
		`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_16BF`6
`CUDA_R_8F_E5M2`	`CUDA_R_8F_E4M3`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF`6
			`CUDA_R_8F_E4M3`8	`CUDA_R_16BF`6
			`CUDA_R_8F_E5M2`8	`CUDA_R_16BF`6
		`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`6
			`CUDA_R_8F_E4M3`8	`CUDA_R_16F`6
			`CUDA_R_8F_E5M2`8	`CUDA_R_16F`6
		`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_16BF`6

To use block-scaled FP4 kernels, the following set of requirements must be satisfied:

- All matrix dimensions must meet the optimal requirements listed in Tensor Core Usage (i.e. pointers and matrix dimensions must support 16-byte alignment).
- Scaling mode must be CUBLASLT_MATMUL_MATRIX_SCALE_VEC16_UE4M3.
- A must be transposed and B non-transposed (the “TN” format).
- The compute type must be CUBLAS_COMPUTE_32F.
- The scale type must be CUDA_R_32F.

Table 4. When A, B, C, and D Use Layouts for FP4
AType	BType	CType	DType	Bias Type
`CUDA_R_4F_E2M1`	`CUDA_R_4F_E2M1`	`CUDA_R_16BF`	`CUDA_R_16BF`	`CUDA_R_16BF`6
		`CUDA_R_16BF`	`CUDA_R_4F_E2M1`	`CUDA_R_16BF`6
		`CUDA_R_16F`	`CUDA_R_16F`	`CUDA_R_16F`6
		`CUDA_R_16F`	`CUDA_R_4F_E2M1`	`CUDA_R_16F`6
		`CUDA_R_32F`	`CUDA_R_32F`	`CUDA_R_16BF`6

Finally, see the table below when A, B, C, and D are planar-complex matrices (CUBLASLT_MATRIX_LAYOUT_PLANE_OFFSET != 0, see cublasLtMatrixLayoutAttribute_t) to make use of mixed-precision tensor core acceleration.

Table 5. When A, B, C, and D are Planar-Complex Matrices
computeType	scaleType	Atype/Btype	Ctype
`CUBLAS_COMPUTE_32F`	`CUDA_C_32F`	`CUDA_C_16F`7	`CUDA_C_16F`7
		`CUDA_C_16F`7	`CUDA_C_32F`7
		`CUDA_C_16BF`7	`CUDA_C_16BF`7
		`CUDA_C_16BF`7	`CUDA_C_32F`7

NOTES:

6: ReLU, dReLU, GELU, dGELU and Bias epilogue modes (see CUBLASLT_MATMUL_DESC_EPILOGUE in cublasLtMatmulDescAttributes_t) are not supported when the D matrix memory order is defined as CUBLASLT_ORDER_ROW. For best performance when using the bias vector, specify zero beta and set pointer mode to CUBLASLT_POINTER_MODE_HOST.
7: Use of CUBLASLT_ORDER_ROW together with CUBLAS_OP_C (Hermitian operator) is not supported unless all of the A, B, C, and D matrices use the CUBLASLT_ORDER_ROW ordering.
8: FP8 DType is not supported when the scaling mode is one of CUBLASLT_MATMUL_MATRIX_SCALE_OUTER_VEC_32F, CUBLASLT_MATMUL_MATRIX_SCALE_VEC128_32F, or CUBLASLT_MATMUL_MATRIX_SCALE_BLK128x128_32F.

Parameters:

Parameter	Memory	Input / Output	Description
`lightHandle`		Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
`computeDesc`		Input	Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
`alpha`, `beta`	Device or host	Input	Pointers to the scalars used in the multiplication.
`A`, `B`, and `C`	Device	Input	Pointers to the GPU memory associated with the corresponding descriptors `Adesc`, `Bdesc` and `Cdesc`.
`Adesc`, `Bdesc` and `Cdesc`		Input	Handles to the previously created descriptors of the type cublasLtMatrixLayout_t.
`D`	Device	Output	Pointer to the GPU memory associated with the descriptor `Ddesc`.
`Ddesc`		Input	Handle to the previously created descriptor of the type cublasLtMatrixLayout_t.
`algo`		Input	Handle for the matrix multiplication algorithm to be used. See cublasLtMatmulAlgo_t. When NULL, an implicit heuristics query with default search preferences will be performed to determine the actual algorithm to use.
`workspace`	Device		Pointer to the workspace buffer allocated in the GPU memory. Must be 256B aligned (i.e. lowest 8 bits of address must be 0).
`workspaceSizeInBytes`		Input	Size of the workspace.
`stream`	Host	Input	The CUDA stream where all the GPU work will be submitted.

Returns:

Return Value	Description
`CUBLAS_STATUS_NOT_INITIALIZED`	If cuBLASLt handle has not been initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If the parameters are unexpectedly NULL, in conflict or in an impossible configuration. For example, when `workspaceSizeInBytes` is less than the workspace required by the configured algo.
`CUBLAS_STATUS_NOT_SUPPORTED`	If the current implementation on the selected device doesn’t support the configured operation.
`CUBLAS_STATUS_ARCH_MISMATCH`	If the configured operation cannot be run using the selected device.
`CUBLAS_STATUS_EXECUTION_FAILED`	If CUDA reported an execution error from the device.
`CUBLAS_STATUS_SUCCESS`	If the operation completed successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.18.cublasLtMatmulAlgoCapGetAttribute()

cublasStatus_t cublasLtMatmulAlgoCapGetAttribute(const cublasLtMatmulAlgo_t *algo, cublasLtMatmulAlgoCapAttributes_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried capability attribute for an initialized cublasLtMatmulAlgo_t descriptor structure. The capability attribute value is retrieved from the enumerated type cublasLtMatmulAlgoCapAttributes_t.

For example, to get the list of supported tile IDs:

cublasLtMatmulTile_t tiles[CUBLASLT_MATMUL_TILE_END];
size_t num_tiles = 0, size_written = 0;
/* On success, size_written holds the number of bytes stored in tiles. */
if (cublasLtMatmulAlgoCapGetAttribute(algo, CUBLASLT_ALGO_CAP_TILE_IDS, tiles, sizeof(tiles), &size_written) == CUBLAS_STATUS_SUCCESS) {
    num_tiles = size_written / sizeof(tiles[0]);
}

Parameters:

Parameter	Input / Output	Description
`algo`	Input	Pointer to the previously created opaque structure holding the matrix multiply algorithm descriptor. See cublasLtMatmulAlgo_t.
`attr`	Input	The capability attribute whose value will be retrieved by this function. See cublasLtMatmulAlgoCapAttributes_t.
`buf`	Output	The attribute value returned by this function.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.
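The size-query behaviour of `sizeWritten` described above suggests a two-call pattern: ask for the required size with `sizeInBytes` equal to 0, then fetch the data into a buffer of that size. The sketch below shows the pattern against a stand-in getter so that it runs without a GPU. `queryArrayAttribute` and `FakeTileIdGetter` are hypothetical names, and the stand-in only imitates the documented contract; in real code the getter would forward to `cublasLtMatmulAlgoCapGetAttribute()` and compare the result with `CUBLAS_STATUS_SUCCESS`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Runs the two-call pattern against any getter with the shape
//   bool getter(void* buf, std::size_t sizeInBytes, std::size_t* sizeWritten)
// that returns true on success.
template <typename T, typename Getter>
bool queryArrayAttribute(Getter getter, std::vector<T>& out) {
    std::size_t bytesNeeded = 0;
    // First call: sizeInBytes == 0 only reports the required size.
    if (!getter(nullptr, 0, &bytesNeeded)) return false;
    out.resize(bytesNeeded / sizeof(T));
    if (out.empty()) return true;
    std::size_t bytesWritten = 0;
    // Second call: pass a buffer of exactly the reported size.
    if (!getter(out.data(), out.size() * sizeof(T), &bytesWritten)) return false;
    out.resize(bytesWritten / sizeof(T));
    return true;
}

// Stand-in getter (NOT a cuBLASLt function) that follows the contract in
// the tables above for a fixed list of values.
struct FakeTileIdGetter {
    std::vector<std::uint32_t> ids;
    bool operator()(void* buf, std::size_t sizeInBytes, std::size_t* sizeWritten) const {
        const std::size_t total = ids.size() * sizeof(std::uint32_t);
        if (sizeInBytes == 0) {
            if (sizeWritten == nullptr) return false;  // documented invalid case
            *sizeWritten = total;
            return true;
        }
        if (buf == nullptr) return false;              // documented invalid case
        const std::size_t n = sizeInBytes < total ? sizeInBytes : total;
        if (n > 0) std::memcpy(buf, ids.data(), n);
        if (sizeWritten != nullptr) *sizeWritten = n;
        return true;
    }
};
```

The same helper shape applies to the other `...GetAttribute()` functions in this section, since they document the same `sizeInBytes` and `sizeWritten` semantics.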

3.4.19.cublasLtMatmulAlgoCheck()

cublasStatus_t cublasLtMatmulAlgoCheck(cublasLtHandle_t lightHandle, cublasLtMatmulDesc_t operationDesc, cublasLtMatrixLayout_t Adesc, cublasLtMatrixLayout_t Bdesc, cublasLtMatrixLayout_t Cdesc, cublasLtMatrixLayout_t Ddesc, const cublasLtMatmulAlgo_t *algo, cublasLtMatmulHeuristicResult_t *result);

This function performs a correctness check on the matrix multiply algorithm descriptor for the matrix multiply operation performed by cublasLtMatmul() with the given input matrices A, B and C, and the output matrix D. It checks whether the descriptor is supported on the current device, and returns the result containing the required workspace and the calculated wave count.

Note

CUBLAS_STATUS_SUCCESS doesn’t fully guarantee that the algo will run. The algo will fail if, for example, the buffers are not correctly aligned. However, if cublasLtMatmulAlgoCheck() fails, the algo will not run.

Parameters:

Parameter	Input / Output	Description
`lightHandle`	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
`operationDesc`	Input	Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
`Adesc`, `Bdesc`, `Cdesc`, and `Ddesc`	Input	Handles to the previously created matrix layout descriptors of the type cublasLtMatrixLayout_t.
`algo`	Input	Descriptor which specifies which matrix multiplication algorithm should be used. See cublasLtMatmulAlgo_t. May point to `result->algo`.
`result`	Output	Pointer to the structure holding the results returned by this function. The results comprise the required workspace and the calculated wave count. The `algo` field is never updated. See cublasLtMatmulHeuristicResult_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If matrix layout descriptors or the operation descriptor do not match the `algo` descriptor.
`CUBLAS_STATUS_NOT_SUPPORTED`	If the `algo` configuration or data type combination is not currently supported on the given device.
`CUBLAS_STATUS_ARCH_MISMATCH`	If the `algo` configuration cannot be run using the selected device.
`CUBLAS_STATUS_SUCCESS`	If the check was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.20.cublasLtMatmulAlgoConfigGetAttribute()

cublasStatus_t cublasLtMatmulAlgoConfigGetAttribute(const cublasLtMatmulAlgo_t *algo, cublasLtMatmulAlgoConfigAttributes_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried configuration attribute for an initialized cublasLtMatmulAlgo_t descriptor. The configuration attribute value is retrieved from the enumerated type cublasLtMatmulAlgoConfigAttributes_t.

Parameters:

Parameter	Input / Output	Description
`algo`	Input	Pointer to the previously created opaque structure holding the matrix multiply algorithm descriptor. See cublasLtMatmulAlgo_t.
`attr`	Input	The configuration attribute whose value will be retrieved by this function. See cublasLtMatmulAlgoConfigAttributes_t.
`buf`	Output	The attribute value returned by this function.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.21.cublasLtMatmulAlgoConfigSetAttribute()

cublasStatus_t cublasLtMatmulAlgoConfigSetAttribute(cublasLtMatmulAlgo_t *algo, cublasLtMatmulAlgoConfigAttributes_t attr, const void *buf, size_t sizeInBytes);

This function sets the value of the specified configuration attribute for an initialized cublasLtMatmulAlgo_t descriptor. The configuration attribute is an enumerant of the type cublasLtMatmulAlgoConfigAttributes_t.

Parameters:

Parameter	Input / Output	Description
`algo`	Input	Pointer to the previously created opaque structure holding the matrix multiply algorithm descriptor. See cublasLtMatmulAlgo_t.
`attr`	Input	The configuration attribute whose value will be set by this function. See cublasLtMatmulAlgoConfigAttributes_t.
`buf`	Input	The value to which the configuration attribute should be set.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.22.cublasLtMatmulAlgoGetHeuristic()

cublasStatus_t cublasLtMatmulAlgoGetHeuristic(cublasLtHandle_t lightHandle, cublasLtMatmulDesc_t operationDesc, cublasLtMatrixLayout_t Adesc, cublasLtMatrixLayout_t Bdesc, cublasLtMatrixLayout_t Cdesc, cublasLtMatrixLayout_t Ddesc, cublasLtMatmulPreference_t preference, int requestedAlgoCount, cublasLtMatmulHeuristicResult_t heuristicResultsArray[], int *returnAlgoCount);

This function retrieves the possible algorithms for the matrix multiply operation performed by cublasLtMatmul() with the given input matrices A, B and C, and the output matrix D. The output is placed in heuristicResultsArray[] in the order of increasing estimated compute time.

Parameters:

Parameter	Input / Output	Description
`lightHandle`	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
`operationDesc`	Input	Handle to a previously created matrix multiplication descriptor of type cublasLtMatmulDesc_t.
`Adesc`, `Bdesc`, `Cdesc`, and `Ddesc`	Input	Handles to the previously created matrix layout descriptors of the type cublasLtMatrixLayout_t.
`preference`	Input	Pointer to the structure holding the heuristic search preferences descriptor. See cublasLtMatmulPreference_t.
`requestedAlgoCount`	Input	Size of the `heuristicResultsArray` (in elements). This is the requested maximum number of algorithms to return.
`heuristicResultsArray[]`	Output	Array containing the algorithm heuristics and associated runtime characteristics, returned by this function, in the order of increasing estimated compute time.
`returnAlgoCount`	Output	Number of algorithms returned by this function. This is the number of `heuristicResultsArray` elements written.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `requestedAlgoCount` is less than or equal to zero.
`CUBLAS_STATUS_NOT_SUPPORTED`	If no heuristic function is available for the current configuration.
`CUBLAS_STATUS_SUCCESS`	If the query was successful. Inspect `heuristicResultsArray[0 to (returnAlgoCount - 1)].state` for the status of the results.

See cublasStatus_t for a complete list of valid return codes.

Note

This function may load some kernels using the CUDA Driver API, which may fail when there is no available GPU memory. Do not allocate the entire VRAM before running cublasLtMatmulAlgoGetHeuristic().
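Because the results are ordered by increasing estimated compute time and each entry carries its own status, a caller typically walks the array and takes the first usable entry. The sketch below models that selection on the host. `HeuristicEntry` is a hypothetical stand-in that mirrors only the two pieces of information used here (the per-entry status and the required workspace); it is not the real cublasLtMatmulHeuristicResult_t.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mirror of the information this sketch reads from one
// heuristic result; not the real cuBLASLt structure.
struct HeuristicEntry {
    bool succeeded;             // stands in for state == CUBLAS_STATUS_SUCCESS
    std::size_t workspaceSize;  // bytes of workspace the algorithm requires
};

// Returns the index of the first usable entry, or -1 if none qualifies.
// The array is ordered by increasing estimated compute time, so the first
// match is also the one with the best estimate.
inline int pickFirstUsable(const std::vector<HeuristicEntry>& results,
                           int returnAlgoCount, std::size_t workspaceLimit) {
    assert(returnAlgoCount >= 0 &&
           static_cast<std::size_t>(returnAlgoCount) <= results.size());
    for (int i = 0; i < returnAlgoCount; ++i) {
        const HeuristicEntry& entry = results[static_cast<std::size_t>(i)];
        if (entry.succeeded && entry.workspaceSize <= workspaceLimit) return i;
    }
    return -1;
}
```

Only the first `returnAlgoCount` elements are examined, since the remaining elements of the array are not written by the query.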

3.4.23.cublasLtMatmulAlgoGetIds()

cublasStatus_t cublasLtMatmulAlgoGetIds(cublasLtHandle_t lightHandle, cublasComputeType_t computeType, cudaDataType_t scaleType, cudaDataType_t Atype, cudaDataType_t Btype, cudaDataType_t Ctype, cudaDataType_t Dtype, int requestedAlgoCount, int algoIdsArray[], int *returnAlgoCount);

This function retrieves the IDs of all the matrix multiply algorithms that are valid, and can potentially be run by the cublasLtMatmul() function, for given types of the input matrices A, B and C, and of the output matrix D.

Note

The IDs are returned in no particular order. To make sure the best possible algo is contained in the list, make requestedAlgoCount large enough to receive the full list. The list is guaranteed to be complete if returnAlgoCount < requestedAlgoCount.
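One way to apply the completeness rule from the note is to grow the ID buffer until fewer IDs come back than were requested. The sketch below is written against a generic callback so that it runs without a GPU; `IdQuery` and `fetchAllAlgoIds` are hypothetical names, and in real code the callback would wrap cublasLtMatmulAlgoGetIds() and return the value of `returnAlgoCount` (or a negative number on error).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Shape of the query: writes at most `requested` IDs into `ids` and returns
// how many were written, or a negative value on failure.
using IdQuery = std::function<int(int requested, int* ids)>;

// Doubles the buffer until the query returns fewer IDs than requested,
// which is the condition under which the list is known to be complete.
inline std::vector<int> fetchAllAlgoIds(const IdQuery& query, int initialCapacity = 16) {
    assert(initialCapacity > 0);
    std::vector<int> ids(static_cast<std::size_t>(initialCapacity));
    for (;;) {
        const int requested = static_cast<int>(ids.size());
        const int returned = query(requested, ids.data());
        if (returned < 0) return {};  // query failed
        if (returned < requested) {
            ids.resize(static_cast<std::size_t>(returned));
            return ids;
        }
        ids.resize(ids.size() * 2);   // list may be truncated; try again
    }
}
```

When the returned count equals the requested count the list may or may not be truncated, so the loop retries with a larger buffer rather than guessing.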

Parameters:

Parameter	Input / Output	Description
`lightHandle`	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
`computeType`, `scaleType`, `Atype`, `Btype`, `Ctype`, and `Dtype`	Input	Computation type, and data types of the scaling factors and of the operand matrices. See cublasComputeType_t and cudaDataType_t.
`requestedAlgoCount`	Input	Number of algorithms requested. Must be > 0.
`algoIdsArray[]`	Output	Array containing the algorithm IDs returned by this function.
`returnAlgoCount`	Output	Number of algorithms actually returned by this function.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `requestedAlgoCount` is less than or equal to zero.
`CUBLAS_STATUS_SUCCESS`	If the query was successful. Inspect `returnAlgoCount` to get the actual number of IDs available.

See cublasStatus_t for a complete list of valid return codes.

3.4.24.cublasLtMatmulAlgoInit()

cublasStatus_t cublasLtMatmulAlgoInit(cublasLtHandle_t lightHandle, cublasComputeType_t computeType, cudaDataType_t scaleType, cudaDataType_t Atype, cudaDataType_t Btype, cudaDataType_t Ctype, cudaDataType_t Dtype, int algoId, cublasLtMatmulAlgo_t *algo);

This function initializes the matrix multiply algorithm structure for cublasLtMatmul(), for a specified matrix multiply algorithm and input matrices A, B and C, and the output matrix D.

Parameters:

Parameter	Input / Output	Description
`lightHandle`	Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
`computeType`	Input	Compute type. See `CUBLASLT_MATMUL_DESC_COMPUTE_TYPE` of cublasLtMatmulDescAttributes_t.
`scaleType`	Input	Scale type. See `CUBLASLT_MATMUL_DESC_SCALE_TYPE` of cublasLtMatmulDescAttributes_t. Usually the same as computeType.
`Atype`, `Btype`, `Ctype`, and `Dtype`	Input	Data type precision for the input and output matrices. See cudaDataType_t.
`algoId`	Input	Specifies the algorithm being initialized. Should be a valid `algoId` returned by the cublasLtMatmulAlgoGetIds() function.
`algo`	Input	Pointer to the opaque structure to be initialized. See cublasLtMatmulAlgo_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `algo` is NULL or `algoId` is outside the recognized range.
`CUBLAS_STATUS_NOT_SUPPORTED`	If `algoId` is not supported for the given combination of data types.
`CUBLAS_STATUS_SUCCESS`	If the structure was successfully initialized.

See cublasStatus_t for a complete list of valid return codes.

3.4.25.cublasLtMatmulDescCreate()

cublasStatus_t cublasLtMatmulDescCreate(cublasLtMatmulDesc_t *matmulDesc, cublasComputeType_t computeType, cudaDataType_t scaleType);

This function creates a matrix multiply descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Input / Output	Description
`matmulDesc`	Output	Pointer to the structure holding the matrix multiply descriptor created by this function. See cublasLtMatmulDesc_t.
`computeType`	Input	Enumerant that specifies the data precision for the matrix multiply descriptor this function creates. See cublasComputeType_t.
`scaleType`	Input	Enumerant that specifies the data type of the scaling factors for the matrix multiply descriptor this function creates. See cudaDataType_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.26.cublasLtMatmulDescInit()

cublasStatus_t cublasLtMatmulDescInit(cublasLtMatmulDesc_t matmulDesc, cublasComputeType_t computeType, cudaDataType_t scaleType);

This function initializes a matrix multiply descriptor in previously allocated storage.

Parameters:

Parameter	Input / Output	Description
`matmulDesc`	Output	Pointer to the structure holding the matrix multiply descriptor initialized by this function. See cublasLtMatmulDesc_t.
`computeType`	Input	Enumerant that specifies the data precision for the matrix multiply descriptor this function initializes. See cublasComputeType_t.
`scaleType`	Input	Enumerant that specifies the data type of the scaling factors for the matrix multiply descriptor this function initializes. See cudaDataType_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was initialized successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.27.cublasLtMatmulDescDestroy()

cublasStatus_t cublasLtMatmulDescDestroy(cublasLtMatmulDesc_t matmulDesc);

This function destroys a previously created matrix multiply descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
`matmulDesc`		Input	Pointer to the structure holding the matrix multiply descriptor that should be destroyed by this function. See cublasLtMatmulDesc_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.28.cublasLtMatmulDescGetAttribute()

cublasStatus_t cublasLtMatmulDescGetAttribute(cublasLtMatmulDesc_t matmulDesc, cublasLtMatmulDescAttributes_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created matrix multiply descriptor.

Parameters:

Parameter	Input / Output	Description
`matmulDesc`	Input	Pointer to the previously created structure holding the matrix multiply descriptor queried by this function. See cublasLtMatmulDesc_t.
`attr`	Input	The attribute that will be retrieved by this function. See cublasLtMatmulDescAttributes_t.
`buf`	Output	Memory address containing the attribute value retrieved by this function.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.29.cublasLtMatmulDescSetAttribute()

cublasStatus_t cublasLtMatmulDescSetAttribute(cublasLtMatmulDesc_t matmulDesc, cublasLtMatmulDescAttributes_t attr, const void *buf, size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix multiply descriptor.

Parameters:

Parameter	Input / Output	Description
`matmulDesc`	Input	Pointer to the previously created structure holding the matrix multiply descriptor modified by this function. See cublasLtMatmulDesc_t.
`attr`	Input	The attribute that will be set by this function. See cublasLtMatmulDescAttributes_t.
`buf`	Input	The value to which the specified attribute should be set.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.30.cublasLtMatmulPreferenceCreate()

cublasStatus_t cublasLtMatmulPreferenceCreate(cublasLtMatmulPreference_t *pref);

This function creates a matrix multiply heuristic search preferences descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Memory	Input / Output	Description
`pref`		Output	Pointer to the structure holding the matrix multiply preferences descriptor created by this function. See cublasLtMatmulPreference_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.31.cublasLtMatmulPreferenceInit()

cublasStatus_t cublasLtMatmulPreferenceInit(cublasLtMatmulPreference_t pref);

This function initializes a matrix multiply heuristic search preferences descriptor in previously allocated storage.

Parameters:

Parameter	Memory	Input / Output	Description
`pref`		Output	Pointer to the structure holding the matrix multiply preferences descriptor initialized by this function. See cublasLtMatmulPreference_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was initialized successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.32.cublasLtMatmulPreferenceDestroy()

cublasStatus_t cublasLtMatmulPreferenceDestroy(cublasLtMatmulPreference_t pref);

This function destroys a previously created matrix multiply preferences descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
`pref`		Input	Pointer to the structure holding the matrix multiply preferences descriptor that should be destroyed by this function. See cublasLtMatmulPreference_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.33.cublasLtMatmulPreferenceGetAttribute()

cublasStatus_t cublasLtMatmulPreferenceGetAttribute(cublasLtMatmulPreference_t pref, cublasLtMatmulPreferenceAttributes_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created matrix multiply heuristic search preferences descriptor.

Parameters:

Parameter	Input / Output	Description
`pref`	Input	Pointer to the previously created structure holding the matrix multiply heuristic search preferences descriptor queried by this function. See cublasLtMatmulPreference_t.
`attr`	Input	The attribute that will be queried by this function. See cublasLtMatmulPreferenceAttributes_t.
`buf`	Output	Memory address containing the attribute value retrieved by this function.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.34.cublasLtMatmulPreferenceSetAttribute()

cublasStatus_t cublasLtMatmulPreferenceSetAttribute(cublasLtMatmulPreference_t pref, cublasLtMatmulPreferenceAttributes_t attr, const void *buf, size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix multiply preferences descriptor.

Parameters:

Parameter	Input / Output	Description
`pref`	Input	Pointer to the previously created structure holding the matrix multiply preferences descriptor modified by this function. See cublasLtMatmulPreference_t.
`attr`	Input	The attribute that will be set by this function. See cublasLtMatmulPreferenceAttributes_t.
`buf`	Input	The value to which the specified attribute should be set.
`sizeInBytes`	Input	Size of the `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.35.cublasLtMatrixLayoutCreate()

cublasStatus_t cublasLtMatrixLayoutCreate(cublasLtMatrixLayout_t *matLayout, cudaDataType type, uint64_t rows, uint64_t cols, int64_t ld);

This function creates a matrix layout descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Input / Output	Description
`matLayout`	Output	Pointer to the structure holding the matrix layout descriptor created by this function. See cublasLtMatrixLayout_t.
`type`	Input	Enumerant that specifies the data precision for the matrix layout descriptor this function creates. See cudaDataType_t.
`rows`, `cols`	Input	Number of rows and columns of the matrix.
`ld`	Input	The leading dimension of the matrix. In column-major layout, this is the number of elements to jump to reach the next column. Thus `ld >= rows`.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If the memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.
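The relationship between `rows`, `cols` and `ld` in column-major layout can be made concrete with a little host-side arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the padding multiple is a caller's choice rather than a value mandated by this function, and the buffer-size formula assumes `ld >= rows`.

```cpp
#include <cassert>
#include <cstdint>

// Smallest leading dimension >= rows that is a multiple of `multiple`.
inline std::int64_t paddedLeadingDimension(std::uint64_t rows, std::int64_t multiple) {
    assert(multiple > 0);
    const std::int64_t r = static_cast<std::int64_t>(rows);
    return (r + multiple - 1) / multiple * multiple;
}

// Offset (in elements) of element (row, col) in column-major storage:
// moving one column to the right jumps ld elements.
inline std::int64_t elementOffset(std::uint64_t row, std::uint64_t col, std::int64_t ld) {
    return static_cast<std::int64_t>(row) + static_cast<std::int64_t>(col) * ld;
}

// Minimum number of elements the buffer must hold so that the last element
// (rows - 1, cols - 1) is in bounds. Assumes ld >= rows.
inline std::int64_t minimumElementCount(std::uint64_t rows, std::uint64_t cols,
                                        std::int64_t ld) {
    assert(ld >= static_cast<std::int64_t>(rows));
    if (rows == 0 || cols == 0) return 0;
    return elementOffset(rows - 1, cols - 1, ld) + 1;
}
```

For a 5 x 3 matrix padded to a leading dimension of 8, the last element sits at offset 20, so the buffer needs at least 21 elements even though the matrix holds only 15 values.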

3.4.36.cublasLtMatrixLayoutInit()

cublasStatus_t cublasLtMatrixLayoutInit(cublasLtMatrixLayout_t matLayout, cudaDataType type, uint64_t rows, uint64_t cols, int64_t ld);

This function initializes a matrix layout descriptor in previously allocated storage.

Parameters:

Parameter	Input / Output	Description
`matLayout`	Output	Pointer to the structure holding the matrix layout descriptor initialized by this function. See cublasLtMatrixLayout_t.
`type`	Input	Enumerant that specifies the data precision for the matrix layout descriptor this function initializes. See cudaDataType_t.
`rows`, `cols`	Input	Number of rows and columns of the matrix.
`ld`	Input	The leading dimension of the matrix. In column-major layout, this is the number of elements to jump to reach the next column. Thus `ld >= rows`.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If the memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was initialized successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.37.cublasLtMatrixLayoutDestroy()

cublasStatus_t cublasLtMatrixLayoutDestroy(cublasLtMatrixLayout_t matLayout);

This function destroys a previously created matrix layout descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
`matLayout`		Input	Pointer to the structure holding the matrix layout descriptor that should be destroyed by this function. See cublasLtMatrixLayout_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.38.cublasLtMatrixLayoutGetAttribute()

cublasStatus_t cublasLtMatrixLayoutGetAttribute(cublasLtMatrixLayout_t matLayout, cublasLtMatrixLayoutAttribute_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried attribute belonging to the specified matrix layout descriptor.

Parameters:

Parameter	Input / Output	Description
`matLayout`	Input	Pointer to the previously created structure holding the matrix layout descriptor queried by this function. See cublasLtMatrixLayout_t.
`attr`	Input	The attribute being queried for. See cublasLtMatrixLayoutAttribute_t.
`buf`	Output	The attribute value returned by this function.
`sizeInBytes`	Input	Size of `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is 0 and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.
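The buffer rules described above (a size query with `sizeInBytes` equal to 0, followed by a read with an exactly sized buffer) can be sketched as a small host-only model. Everything in this block is hypothetical: `model_get_attribute`, `enum model_status`, and the `stored` argument stand in for the library call, its status codes, and the descriptor's internal storage. The model also assumes that `sizeWritten` may be NULL when `sizeInBytes` is non-zero, which the return-value table suggests but does not state outright.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Status codes of this model only; they stand in for cublasStatus_t values. */
enum model_status { MODEL_SUCCESS = 0, MODEL_INVALID_VALUE = 1 };

/* Hypothetical model of the documented buffer rules, not the cuBLASLt
 * implementation. `stored` / `stored_size` play the role of the internal
 * storage of the selected attribute. */
static enum model_status model_get_attribute(const void *stored, size_t stored_size,
                                             void *buf, size_t size_in_bytes,
                                             size_t *size_written)
{
    if (size_in_bytes == 0) {
        /* Size query: nothing is copied, only the required size is reported. */
        if (size_written == NULL)
            return MODEL_INVALID_VALUE;
        *size_written = stored_size;
        return MODEL_SUCCESS;
    }
    /* Read: the buffer must exist and match the stored size exactly. */
    if (buf == NULL || size_in_bytes != stored_size)
        return MODEL_INVALID_VALUE;
    memcpy(buf, stored, stored_size);
    if (size_written != NULL)
        *size_written = stored_size;
    return MODEL_SUCCESS;
}
```

A caller would typically issue the size query first, allocate or check a buffer of the reported size, and then repeat the call with that buffer.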

3.4.39.cublasLtMatrixLayoutSetAttribute()

cublasStatus_t cublasLtMatrixLayoutSetAttribute(cublasLtMatrixLayout_t matLayout, cublasLtMatrixLayoutAttribute_t attr, const void *buf, size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix layout descriptor.

Parameters:

Parameter	Input / Output	Description
`matLayout`	Input	Pointer to the previously created structure holding the matrix layout descriptor modified by this function. See cublasLtMatrixLayout_t.
`attr`	Input	The attribute that will be set by this function. See cublasLtMatrixLayoutAttribute_t.
`buf`	Input	The value to which the specified attribute should be set.
`sizeInBytes`	Input	Size of `buf`, the attribute buffer (in bytes).

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `buf` is NULL or `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.40.cublasLtMatrixTransform()

cublasStatus_t cublasLtMatrixTransform(cublasLtHandle_t lightHandle, cublasLtMatrixTransformDesc_t transformDesc, const void *alpha, const void *A, cublasLtMatrixLayout_t Adesc, const void *beta, const void *B, cublasLtMatrixLayout_t Bdesc, void *C, cublasLtMatrixLayout_t Cdesc, cudaStream_t stream);

This function computes the matrix transformation operation on the input matrices A and B to produce the output matrix C, according to the following operation:

C = alpha*transformation(A) + beta*transformation(B),

where A and B are input matrices, and alpha and beta are input scalars. The transformation operation is defined by the transformDesc pointer. This function can be used to change the memory order of data or to scale and shift the values.
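As a rough illustration of the formula above, the following host-side sketch evaluates it for single-precision, column-major matrices, with the transformation restricted to either the identity or a transpose. It is a reference model only: the function name `transform_ref` and its `trans_a` / `trans_b` flags are hypothetical, and the real routine runs on the GPU and supports further options (such as other memory orders) that are not modeled here.

```c
#include <stddef.h>

/* Host-side reference sketch, not the cuBLASLt implementation.
 * C (rows x cols, column-major, leading dimension ldc) is set to
 * alpha * T(A) + beta * T(B), where T() is either the identity or a
 * transpose, selected per input by trans_a / trans_b (non-zero = transpose).
 * A and B are column-major; a transposed input is stored as cols x rows. */
static void transform_ref(size_t rows, size_t cols,
                          float alpha, const float *A, size_t lda, int trans_a,
                          float beta, const float *B, size_t ldb, int trans_b,
                          float *C, size_t ldc)
{
    for (size_t j = 0; j < cols; ++j) {
        for (size_t i = 0; i < rows; ++i) {
            float a = 0.0f, b = 0.0f;
            /* A NULL input contributes nothing to the sum. */
            if (A != NULL)
                a = trans_a ? A[i * lda + j] : A[j * lda + i];
            if (B != NULL)
                b = trans_b ? B[i * ldb + j] : B[j * ldb + i];
            C[j * ldc + i] = alpha * a + beta * b;
        }
    }
}
```

Passing NULL for one input together with a zero scalar reduces the sketch to a scaled copy or a scaled transpose of the other input.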

Parameters:

Parameter	Memory	Input / Output	Description
`lightHandle`		Input	Pointer to the allocated cuBLASLt handle for the cuBLASLt context. See cublasLtHandle_t.
`transformDesc`		Input	Pointer to the opaque descriptor holding the matrix transformation operation. See cublasLtMatrixTransformDesc_t.
`alpha`,`beta`	Device or host	Input	Pointers to the scalars used in the multiplication.
`A`,`B`	Device	Input	Pointers to the GPU memory associated with the corresponding descriptors `Adesc` and `Bdesc`.
`C`	Device	Output	Pointer to the GPU memory associated with the `Cdesc` descriptor.
`Adesc`,`Bdesc` and `Cdesc`		Input	Handles to the previously created descriptors of the type cublasLtMatrixLayout_t. `Adesc` or `Bdesc` can be NULL if the corresponding pointer is NULL and the corresponding scalar is zero.
`stream`	Host	Input	The CUDA stream where all the GPU work will be submitted.

Returns:

Return Value	Description
`CUBLAS_STATUS_NOT_INITIALIZED`	If cuBLASLt handle has not been initialized.
`CUBLAS_STATUS_INVALID_VALUE`	If the parameters are in conflict or in an impossible configuration. For example, when `A` is not NULL, but `Adesc` is NULL.
`CUBLAS_STATUS_NOT_SUPPORTED`	If the current implementation on the selected device does not support the configured operation.
`CUBLAS_STATUS_ARCH_MISMATCH`	If the configured operation cannot be run using the selected device.
`CUBLAS_STATUS_EXECUTION_FAILED`	If CUDA reported an execution error from the device.
`CUBLAS_STATUS_SUCCESS`	If the operation completed successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.41.cublasLtMatrixTransformDescCreate()

cublasStatus_t cublasLtMatrixTransformDescCreate(cublasLtMatrixTransformDesc_t *transformDesc, cudaDataType scaleType);

This function creates a matrix transform descriptor by allocating the memory needed to hold its opaque structure.

Parameters:

Parameter	Memory	Input / Output	Description
`transformDesc`		Output	Pointer to the structure holding the matrix transform descriptor created by this function. See cublasLtMatrixTransformDesc_t.
`scaleType`		Input	Enumerant that specifies the data precision for the matrix transform descriptor this function creates. See cudaDataType_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.42.cublasLtMatrixTransformDescInit()

cublasStatus_t cublasLtMatrixTransformDescInit(cublasLtMatrixTransformDesc_t transformDesc, cudaDataType scaleType);

This function initializes a matrix transform descriptor in previously allocated memory.

Parameters:

Parameter	Memory	Input / Output	Description
`transformDesc`		Output	Pointer to the structure holding the matrix transform descriptor initialized by this function. See cublasLtMatrixTransformDesc_t.
`scaleType`		Input	Enumerant that specifies the data precision for the matrix transform descriptor this function initializes. See cudaDataType_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.43.cublasLtMatrixTransformDescDestroy()

cublasStatus_t cublasLtMatrixTransformDescDestroy(cublasLtMatrixTransformDesc_t transformDesc);

This function destroys a previously created matrix transform descriptor object.

Parameters:

Parameter	Memory	Input / Output	Description
`transformDesc`		Input	Pointer to the structure holding the matrix transform descriptor that should be destroyed by this function. See cublasLtMatrixTransformDesc_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the operation was successful.

See cublasStatus_t for a complete list of valid return codes.

3.4.44.cublasLtMatrixTransformDescGetAttribute()

cublasStatus_t cublasLtMatrixTransformDescGetAttribute(cublasLtMatrixTransformDesc_t transformDesc, cublasLtMatrixTransformDescAttributes_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created matrix transform descriptor.

Parameters:

Parameter	Input / Output	Description
`transformDesc`	Input	Pointer to the previously created structure holding the matrix transform descriptor queried by this function. See cublasLtMatrixTransformDesc_t.
`attr`	Input	The attribute that will be retrieved by this function. See cublasLtMatrixTransformDescAttributes_t.
`buf`	Output	Memory address containing the attribute value retrieved by this function.
`sizeInBytes`	Input	Size of `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is zero and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

See cublasStatus_t for a complete list of valid return codes.

3.4.45.cublasLtMatrixTransformDescSetAttribute()

cublasStatus_t cublasLtMatrixTransformDescSetAttribute(cublasLtMatrixTransformDesc_t transformDesc, cublasLtMatrixTransformDescAttributes_t attr, const void *buf, size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created matrix transform descriptor.

Parameters:

Parameter	Input / Output	Description
`transformDesc`	Input	Pointer to the previously created structure holding the matrix transform descriptor modified by this function. See cublasLtMatrixTransformDesc_t.
`attr`	Input	The attribute that will be set by this function. See cublasLtMatrixTransformDescAttributes_t.
`buf`	Input	The value to which the specified attribute should be set.
`sizeInBytes`	Input	Size of `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `buf` is NULL or `sizeInBytes` does not match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.46.cublasLtEmulationDescInit()

cublasStatus_t cublasLtEmulationDescInit(cublasLtEmulationDesc_t emulationDesc);

This function initializes a previously allocated emulation descriptor.

Parameters:

Parameter	Memory	Input / Output	Description
`emulationDesc`		Output	Pointer to the previously allocated structure holding the emulation descriptor initialized by this function. See cublasLtEmulationDesc_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If the size of the pre-allocated space is insufficient.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.47.cublasLtEmulationDescCreate()

cublasStatus_t cublasLtEmulationDescCreate(cublasLtEmulationDesc_t *emulationDesc);

This function creates a new emulation descriptor.

Parameters:

Parameter	Memory	Input / Output	Description
`emulationDesc`		Output	Pointer to the structure holding the emulation descriptor created by this function. See cublasLtEmulationDesc_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_ALLOC_FAILED`	If memory could not be allocated.
`CUBLAS_STATUS_SUCCESS`	If the descriptor was created successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.48.cublasLtEmulationDescDestroy()

cublasStatus_t cublasLtEmulationDescDestroy(cublasLtEmulationDesc_t emulationDesc);

This function destroys a previously created emulation descriptor.

Parameters:

Parameter	Memory	Input / Output	Description
`emulationDesc`		Input	Pointer to the structure holding the emulation descriptor that should be destroyed by this function. See cublasLtEmulationDesc_t.

Returns:

Return Value	Description
`CUBLAS_STATUS_SUCCESS`	If the descriptor was destroyed successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.49.cublasLtEmulationDescSetAttribute()

cublasStatus_t cublasLtEmulationDescSetAttribute(cublasLtEmulationDesc_t emulationDesc, cublasLtEmulationDescAttributes_t attr, const void *buf, size_t sizeInBytes);

This function sets the value of the specified attribute belonging to a previously created emulation descriptor.

Parameters:

Parameter	Input / Output	Description
`emulationDesc`	Input	Pointer to the previously created structure holding the emulation descriptor modified by this function. See cublasLtEmulationDesc_t.
`attr`	Input	The attribute that will be set by this function. See cublasLtEmulationDescAttributes_t.
`buf`	Input	The value to which the specified attribute should be set.
`sizeInBytes`	Input	Size of `buf` buffer (in bytes) for verification.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `buf` is NULL or `sizeInBytes` does not match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute was set successfully.

See cublasStatus_t for a complete list of valid return codes.

3.4.50.cublasLtEmulationDescGetAttribute()

cublasStatus_t cublasLtEmulationDescGetAttribute(cublasLtEmulationDesc_t emulationDesc, cublasLtEmulationDescAttributes_t attr, void *buf, size_t sizeInBytes, size_t *sizeWritten);

This function returns the value of the queried attribute belonging to a previously created emulation descriptor.

Parameters:

Parameter	Input / Output	Description
`emulationDesc`	Input	Pointer to the previously created structure holding the emulation descriptor queried by this function. See cublasLtEmulationDesc_t.
`attr`	Input	The attribute that will be retrieved by this function. See cublasLtEmulationDescAttributes_t.
`buf`	Output	Memory address containing the attribute value retrieved by this function.
`sizeInBytes`	Input	Size of `buf` buffer (in bytes) for verification.
`sizeWritten`	Output	Valid only when the return value is `CUBLAS_STATUS_SUCCESS`. If `sizeInBytes` is non-zero, `sizeWritten` is the number of bytes actually written; if `sizeInBytes` is 0, `sizeWritten` is the number of bytes needed to write the full contents.

Returns:

Return Value	Description
`CUBLAS_STATUS_INVALID_VALUE`	If `sizeInBytes` is zero and `sizeWritten` is NULL, or if `sizeInBytes` is non-zero and `buf` is NULL, or if `sizeInBytes` doesn’t match the size of the internal storage for the selected attribute.
`CUBLAS_STATUS_SUCCESS`	If the attribute’s value was successfully written to user memory.

4.Using the cuBLASXt API

4.1.General description

The cuBLASXt API of cuBLAS exposes a multi-GPU capable host interface: when using this API, the application only needs to allocate the required matrices in the host memory space. Additionally, the current implementation supports managed memory on Linux with GPU devices that have compute capability 6.x or greater, but treats it as host memory. Managed memory is not supported on Windows. There is no restriction on the sizes of the matrices as long as they fit into the host memory. The cuBLASXt API takes care of allocating the memory across the designated GPUs, dispatching the workload between them, and finally retrieving the results back to the host. The cuBLASXt API supports only the compute-intensive BLAS3 routines (e.g., matrix-matrix operations) where the PCI transfers back and forth from the GPU can be amortized. The cuBLASXt API has its own header file, cublasXt.h.

Starting with release 8.0, cuBLASXt API allows any of the matrices to be located on a GPU device.

Note

When providing matrices allocated on the GPU using the Stream Ordered Memory Allocator, ensure visibility across all devices by using cudaMemPoolSetAccess.

Note

The cuBLASXt API is only supported on 64-bit platforms.

4.1.1.Tiling design approach

To share the workload between multiple GPUs, the cuBLASXt API uses a tiling strategy: every matrix is divided into square tiles of user-controllable dimension BlockDim x BlockDim. The resulting matrix tiling defines the static scheduling policy: each resulting tile is assigned to a GPU in a round-robin fashion. One CPU thread is created per GPU and is responsible for performing the memory transfers and cuBLAS operations needed to compute all the tiles assigned to it. From a performance point of view, because of this static scheduling strategy, it is better that the compute capabilities and PCI bandwidth are the same for every GPU. The figure below illustrates the distribution of tiles between 3 GPUs. To compute the first tile G0 of C, CPU thread 0, which is responsible for GPU0, has to load 3 tiles from the first row of A and tiles from the first column of B in a pipelined fashion, so that memory transfers and computations overlap, and sum the results into the first tile G0 of C before moving on to the next tile G0.

Figure: Example of cublasXt<t>gemm() tiling for 3 GPUs
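The static round-robin assignment described above can be sketched with two small host-only helpers. Both names (`tiles_along`, `tile_owner`) are hypothetical, and the column-by-column numbering of tiles is an assumption made for the example: this guide does not specify the order in which cuBLASXt walks the tiles.

```c
#include <stddef.h>

/* Number of tiles needed to cover `extent` elements with tiles of size block_dim. */
static size_t tiles_along(size_t extent, size_t block_dim)
{
    return (extent + block_dim - 1) / block_dim;
}

/* Illustrative round-robin owner of tile (tile_row, tile_col) of the result.
 * Assumption: tiles are numbered column by column; the traversal order used
 * by cuBLASXt is not specified in this guide. */
static int tile_owner(size_t tile_row, size_t tile_col,
                      size_t tile_rows, int nb_devices)
{
    size_t linear = tile_col * tile_rows + tile_row;
    return (int)(linear % (size_t)nb_devices);
}
```

Because the assignment depends only on the tile index and the number of devices, the schedule is fixed before any computation starts, which is why GPUs of equal speed are preferable.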

When the dimensions of C are not an exact multiple of the tile dimension, some tiles are partially filled on the right border and/or the bottom border. The current implementation does not pad the incomplete tiles but simply keeps track of them and performs correspondingly reduced cuBLAS operations; this way, no extra computation is done. However, this can still lead to some load imbalance when the GPUs do not all have the same number of incomplete tiles to work on.
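The size of a partially filled border tile follows directly from the matrix dimension and the block dimension. The helper below is a hypothetical host-only sketch (`tile_extent` is not a cuBLASXt function) that returns how many rows or columns a given tile actually covers.

```c
#include <stddef.h>

/* Extent (rows or columns) actually covered by tile number `tile_index`
 * along a dimension of size `extent`. Border tiles may be smaller than
 * block_dim. Returns 0 when the tile lies entirely outside the matrix. */
static size_t tile_extent(size_t tile_index, size_t extent, size_t block_dim)
{
    size_t start = tile_index * block_dim;
    if (start >= extent)
        return 0;
    size_t remaining = extent - start;
    return remaining < block_dim ? remaining : block_dim;
}
```

For a dimension of 10 and a block dimension of 4, tiles 0 and 1 cover 4 elements each and tile 2 covers the remaining 2, so the device that owns tile 2 has less work for it than for a full tile.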

When one or more matrices are located on some GPU devices, the same tiling approach and workload sharing is applied. The memory transfers are in this case done between devices. However, when the computation of a tile and some data are located on the same GPU device, the memory transfer to/from the local data into tiles is bypassed and the GPU operates directly on the local data. This can lead to a significant performance increase, especially when only one GPU is used for the computation.

The matrices can be located on any GPU device, and do not have to be located on the same GPU device. Furthermore, the matrices can even be located on a GPU device that does not participate in the computation.

Unlike the cuBLAS API, the cuBLASXt API is a blocking API from the host point of view, even if all matrices are located on the same device: the results, wherever they are located, are valid when the call returns, and no device synchronization is required.

4.1.2.Hybrid CPU-GPU computation

In the case of very large problems, the cuBLASXt API offers the possibility to offload some of the computation to the host CPU. This feature can be set up with the routines cublasXtSetCpuRoutine() and cublasXtSetCpuRatio(). The workload assigned to the CPU is set aside: it is simply a percentage of the resulting matrix taken from the bottom or the right side, whichever dimension is bigger. The GPU tiling is then done on the reduced resulting matrix.
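The way the CPU share is carved out of the result matrix can be sketched as follows. This is a hypothetical host-only model: `split_for_cpu` and `struct gpu_part` are illustration names, the CPU share is expressed as a fraction between 0 and 1, it is rounded down, and a square result is treated like a tall one. None of these details are stated in this guide, including how cublasXtSetCpuRatio() scales its `ratio` argument.

```c
#include <stddef.h>

/* Part of the m x n result left to the GPUs after the CPU share is set aside. */
struct gpu_part { size_t rows; size_t cols; };

/* Hypothetical model: `cpu_fraction` is assumed to lie in [0, 1] and the CPU
 * share is rounded down; neither detail comes from this guide. */
static struct gpu_part split_for_cpu(size_t m, size_t n, double cpu_fraction)
{
    struct gpu_part part = { m, n };
    if (m >= n) {
        /* Taller (or square) result: the CPU takes rows from the bottom. */
        part.rows = m - (size_t)((double)m * cpu_fraction);
    } else {
        /* Wider result: the CPU takes columns from the right side. */
        part.cols = n - (size_t)((double)n * cpu_fraction);
    }
    return part;
}
```

The tiling into BlockDim x BlockDim tiles would then be applied to the `rows` x `cols` part returned here rather than to the full result.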

If any of the matrices is located on a GPU device, the feature is ignored and all computation is done on the GPUs only.

This feature should be used with caution because it could interfere with the CPU threads responsible for feeding the GPUs.

Currently, only the routine cublasXt<t>gemm() supports this feature.

4.1.3.Results reproducibility

Currently, all cuBLASXt API routines from a given toolkit version generate the same bit-wise results when the following conditions are met:

all GPUs participating in the computation have the same compute capability and the same number of SMs.
the tile size is kept the same between runs.
either the CPU hybrid computation is not used, or the CPU BLAS provided is also guaranteed to produce reproducible results.

4.2.cuBLASXt API Datatypes Reference

4.2.1.cublasXtHandle_t

The cublasXtHandle_t type is a pointer type to an opaque structure holding the cuBLASXt API context. The cuBLASXt API context must be initialized using cublasXtCreate() and the returned handle must be passed to all subsequent cuBLASXt API function calls. The context should be destroyed at the end using cublasXtDestroy().

4.2.2.cublasXtOpType_t

The cublasXtOpType_t type enumerates the four possible types supported by BLAS routines. This enum is used as a parameter of the routines cublasXtSetCpuRoutine and cublasXtSetCpuRatio to set up the hybrid configuration.

Value	Meaning
`CUBLASXT_FLOAT`	float or single precision type
`CUBLASXT_DOUBLE`	double precision type
`CUBLASXT_COMPLEX`	single precision complex
`CUBLASXT_DOUBLECOMPLEX`	double precision complex

4.2.3.cublasXtBlasOp_t

The cublasXtBlasOp_t type enumerates the BLAS3 or BLAS-like routines supported by the cuBLASXt API. This enum is used as a parameter of the routines cublasXtSetCpuRoutine and cublasXtSetCpuRatio to set up the hybrid configuration.

Value	Meaning
`CUBLASXT_GEMM`	GEMM routine
`CUBLASXT_SYRK`	SYRK routine
`CUBLASXT_HERK`	HERK routine
`CUBLASXT_SYMM`	SYMM routine
`CUBLASXT_HEMM`	HEMM routine
`CUBLASXT_TRSM`	TRSM routine
`CUBLASXT_SYR2K`	SYR2K routine
`CUBLASXT_HER2K`	HER2K routine
`CUBLASXT_SPMM`	SPMM routine
`CUBLASXT_SYRKX`	SYRKX routine
`CUBLASXT_HERKX`	HERKX routine

4.2.4.cublasXtPinningMemMode_t

This type is used to enable or disable the Pinning Memory mode through the routine cublasXtSetPinningMemMode.

Value	Meaning
`CUBLASXT_PINNING_DISABLED`	the Pinning Memory mode is disabled
`CUBLASXT_PINNING_ENABLED`	the Pinning Memory mode is enabled

4.3.cuBLASXt API Helper Function Reference

4.3.1.cublasXtCreate()

cublasStatus_t cublasXtCreate(cublasXtHandle_t *handle)

This function initializes the cuBLASXt API and creates a handle to an opaque structure holding the cuBLASXt API context. It allocates hardware resources on the host and device and must be called prior to making any other cuBLASXt API calls.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the initialization succeeded
`CUBLAS_STATUS_ALLOC_FAILED`	the resources could not be allocated
`CUBLAS_STATUS_NOT_SUPPORTED`	cuBLASXt API is only supported on 64-bit platform

4.3.2.cublasXtDestroy()

cublasStatus_t cublasXtDestroy(cublasXtHandle_t handle)

This function releases hardware resources used by the cuBLASXt API context. The release of GPU resources may be deferred until the application exits. This function is usually the last call with a particular handle to the cuBLASXt API.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the shut down succeeded
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized

4.3.3.cublasXtDeviceSelect()

cublasXtDeviceSelect(cublasXtHandle_t handle, int nbDevices, int deviceId[])

This function allows the user to provide the number of GPU devices and their respective IDs that will participate in the subsequent cuBLASXt API Math function calls. This function will create a cuBLAS context for every GPU provided in that list. Currently, the device configuration is static and cannot be changed between Math function calls. In that regard, this function should be called only once after cublasXtCreate. To be able to run multiple configurations, multiple cuBLASXt API contexts should be created.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	User call was successful
`CUBLAS_STATUS_INVALID_VALUE`	Access to at least one of the devices could not be obtained, or a cuBLAS context could not be created on at least one of the devices
`CUBLAS_STATUS_ALLOC_FAILED`	Some resources could not be allocated.

4.3.4.cublasXtSetBlockDim()

cublasXtSetBlockDim(cublasXtHandle_t handle, int blockDim)

This function allows the user to set the block dimension used for the tiling of the matrices for the subsequent Math function calls. Matrices are split into square tiles of blockDim x blockDim dimension. This function can be called at any time and will take effect for the following Math function calls. The block dimension should be chosen so as to optimize the math operation and to make sure that the PCI transfers are well overlapped with the computation.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blockDim <= 0

4.3.5.cublasXtGetBlockDim()

cublasXtGetBlockDim(cublasXtHandle_t handle, int *blockDim)

This function allows the user to query the block dimension used for the tiling of the matrices.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful

4.3.6.cublasXtSetCpuRoutine()

cublasXtSetCpuRoutine(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp, cublasXtOpType_t type, void *blasFunctor)

This function allows the user to provide a CPU implementation of the corresponding BLAS routine. This function can be used with the function cublasXtSetCpuRatio() to define a hybrid computation between the CPU and the GPUs. Currently, the hybrid feature is only supported for the xGEMM routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blasOp or type define an invalid combination
`CUBLAS_STATUS_NOT_SUPPORTED`	CPU-GPU Hybridization for that routine is not supported

4.3.7.cublasXtSetCpuRatio()

cublasXtSetCpuRatio(cublasXtHandle_t handle, cublasXtBlasOp_t blasOp, cublasXtOpType_t type, float ratio)

This function allows the user to define the percentage of workload that should be done on a CPU in the context of a hybrid computation. This function can be used with the function cublasXtSetCpuRoutine() to define a hybrid computation between the CPU and the GPUs. Currently, the hybrid feature is only supported for the xGEMM routines.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	blasOp or type define an invalid combination
`CUBLAS_STATUS_NOT_SUPPORTED`	CPU-GPU Hybridization for that routine is not supported

4.3.8.cublasXtSetPinningMemMode()

cublasXtSetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t mode)

This function allows the user to enable or disable the Pinning Memory mode. When enabled, the matrices passed in subsequent cuBLASXt API calls will be pinned/unpinned using the CUDART routines cudaHostRegister() and cudaHostUnregister() respectively, if the matrices are not already pinned. If a matrix happens to be partially pinned, it will also not be pinned. Pinning the memory improves PCI transfer performance and allows PCI memory transfers to overlap with computation. However, pinning/unpinning the memory takes some time, which might not be amortized. It is advised that the user pin the memory on their own using cudaMallocHost() or cudaHostRegister() and unpin it when the computation sequence is completed. By default, the Pinning Memory mode is disabled.

Note

The Pinning Memory mode should not be enabled when matrices used for different calls to the cuBLASXt API overlap. cuBLASXt determines whether a matrix is pinned by checking, with cudaHostGetFlags(), whether the first address of that matrix is pinned, and thus cannot know whether the matrix is already partially pinned. This is especially true in multi-threaded applications, where memory could be partially or totally pinned or unpinned while another thread is accessing that memory.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful
`CUBLAS_STATUS_INVALID_VALUE`	the mode value is different from `CUBLASXT_PINNING_DISABLED` and `CUBLASXT_PINNING_ENABLED`

4.3.9.cublasXtGetPinningMemMode()

cublasXtGetPinningMemMode(cublasXtHandle_t handle, cublasXtPinningMemMode_t *mode)

This function allows the user to query the Pinning Memory mode. By default, the Pinning Memory mode is disabled.

Return Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the call has been successful

4.4.cuBLASXt API Math Functions Reference

In this chapter we describe the actual Linear Algebra routines that cuBLASXt API supports. We will use abbreviations <type> for type and <t> for the corresponding short type to make a more concise and clear presentation of the implemented functions. Unless otherwise specified <type> and <t> have the following meanings:

<type>	<t>	Meaning
`float`	‘s’ or ‘S’	real single-precision
`double`	‘d’ or ‘D’	real double-precision
`cuComplex`	‘c’ or ‘C’	complex single-precision
`cuDoubleComplex`	‘z’ or ‘Z’	complex double-precision

4.4.1.cublasXt<t>gemm()

cublasStatus_t cublasXtSgemm(cublasXtHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, size_t m, size_t n, size_t k, const float *alpha, const float *A, int lda, const float *B, int ldb, const float *beta, float *C, int ldc)
cublasStatus_t cublasXtDgemm(cublasXtHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const double *alpha, const double *A, int lda, const double *B, int ldb, const double *beta, double *C, int ldc)
cublasStatus_t cublasXtCgemm(cublasXtHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuComplex *alpha, const cuComplex *A, int lda, const cuComplex *B, int ldb, const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasXtZgemm(cublasXtHandle_t handle, cublasOperation_t transa, cublasOperation_t transb, int m, int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda, const cuDoubleComplex *B, int ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function performs the matrix-matrix multiplication

$C = \alpha\text{op}(A)\text{op}(B) + \beta C$

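where $\alpha$ and $\beta$ are scalars, and $A$, $B$ and $C$ are matrices stored in column-major format with dimensions $\text{op}(A)$ $m \times k$, $\text{op}(B)$ $k \times n$ and $C$ $m \times n$, respectively. Also, for matrix $A$

$\text{op}(A) = \begin{cases} A & \text{if transa == CUBLAS\_OP\_N} \\ A^{T} & \text{if transa == CUBLAS\_OP\_T} \\ A^{H} & \text{if transa == CUBLAS\_OP\_C} \end{cases}$

and $\text{op}(B)$ is defined similarly for matrix $B$.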

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`transa`		input	operation op(`A`) that is non- or (conj.) transpose.
`transb`		input	operation op(`B`) that is non- or (conj.) transpose.
`m`		input	number of rows of matrix op(`A`) and `C`.
`n`		input	number of columns of matrix op(`B`) and `C`.
`k`		input	number of columns of op(`A`) and rows of op(`B`).
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimensions `lda x k` with `lda>=max(1,m)` if `transa==CUBLAS_OP_N` and `lda x m` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store the matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,k)` if `transb==CUBLAS_OP_N` and `ldb x k` with `ldb>=max(1,n)` otherwise.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	leading dimension of a two-dimensional array used to store the matrix `C`.
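To make the roles of `m`, `n`, `k` and the leading dimensions in the table above concrete, the following host-only sketch computes the same operation for real single precision. It is a plain reference loop, not the cuBLASXt code path: the name `sgemm_ref` is hypothetical, and the `trans_a` / `trans_b` integer flags stand in for the `cublasOperation_t` arguments (for real data, the conjugate transpose equals the transpose).

```c
#include <stddef.h>

/* Host reference sketch for real single precision, not the cuBLASXt code path.
 * Computes C = alpha * op(A) * op(B) + beta * C with column-major storage.
 * trans_a / trans_b: 0 selects op(X) = X, non-zero selects op(X) = X^T. */
static void sgemm_ref(int trans_a, int trans_b, size_t m, size_t n, size_t k,
                      float alpha, const float *A, size_t lda,
                      const float *B, size_t ldb,
                      float beta, float *C, size_t ldc)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                /* Element (i, p) of op(A) and element (p, j) of op(B). */
                float a = trans_a ? A[i * lda + p] : A[p * lda + i];
                float b = trans_b ? B[p * ldb + j] : B[j * ldb + p];
                acc += a * b;
            }
            /* When beta is zero, C is not read, so it need not be a valid input. */
            C[j * ldc + i] = (beta == 0.0f) ? alpha * acc
                                            : alpha * acc + beta * C[j * ldc + i];
        }
    }
}
```

A loop like this is mainly useful for checking the result of a library call on small inputs.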

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m,n,k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

sgemm(), dgemm(), cgemm(), zgemm()

4.4.2.cublasXt<t>hemm()

cublasStatus_t cublasXtChemm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, size_t m, size_t n, const cuComplex *alpha, const cuComplex *A, size_t lda, const cuComplex *B, size_t ldb, const cuComplex *beta, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZhemm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo, size_t m, size_t n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda, const cuDoubleComplex *B, size_t ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, size_t ldc)

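This function performs the Hermitian matrix-matrix multiplication

$C = \begin{cases} \alpha AB + \beta C & \text{if side == CUBLAS\_SIDE\_LEFT} \\ \alpha BA + \beta C & \text{if side == CUBLAS\_SIDE\_RIGHT} \end{cases}$

where $A$ is a Hermitian matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.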

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`side`		input	indicates if matrix `A` is on the left or right of `B`.
`uplo`		input	indicates if matrix `A` lower or upper part is stored, the other Hermitian part is not referenced and is inferred from the stored elements.
`m`		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
`n`		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise. The imaginary parts of the diagonal elements are assumed to be zero.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	<type> scalar used for multiplication. If `beta==0`, `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimensions `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

chemm(), zhemm()

4.4.3.cublasXt<t>symm()

cublasStatus_t cublasXtSsymm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const float *alpha, const float *A, size_t lda,
                             const float *B, size_t ldb, const float *beta, float *C, size_t ldc)
cublasStatus_t cublasXtDsymm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const double *alpha, const double *A, size_t lda,
                             const double *B, size_t ldb, const double *beta, double *C, size_t ldc)
cublasStatus_t cublasXtCsymm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const cuComplex *alpha, const cuComplex *A, size_t lda,
                             const cuComplex *B, size_t ldb, const cuComplex *beta, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsymm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda,
                             const cuDoubleComplex *B, size_t ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, size_t ldc)

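This function performs the symmetric matrix-matrix multiplication

$C = \left\{ \begin{matrix}{\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A$ is a symmetric matrix stored in lower or upper mode, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.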

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`side`		input	indicates if matrix `A` is on the left or right of `B`.
`uplo`		input	indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`m`		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
`n`		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	<type> scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.
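
The phrase "with matrix `A` sized accordingly" can be made concrete with a short sketch. The helpers below are hypothetical and only restate the table: `A` is square of order `m` when it multiplies from the left and of order `n` otherwise, and `lda` must be at least `max(1, order)`. The flag `side_left` stands in for `side==CUBLAS_SIDE_LEFT`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: order of the square matrix A when B and C are m x n. */
static size_t symm_order_of_a(int side_left, size_t m, size_t n)
{
    return side_left ? m : n;
}

/* Smallest valid leading dimension of A, i.e. max(1, order of A). */
static size_t symm_min_lda(int side_left, size_t m, size_t n)
{
    size_t order = symm_order_of_a(side_left, m, n);
    return order > 1 ? order : 1;
}

/* Number of elements the array A must hold: lda x (order of A). */
static size_t symm_a_elements(int side_left, size_t m, size_t n, size_t lda)
{
    assert(lda >= symm_min_lda(side_left, m, n));
    return lda * symm_order_of_a(side_left, m, n);
}
```

For example, with `m = 3`, `n = 5` and `A` on the right, `A` is 5 x 5, so a buffer with `lda = 8` needs room for 40 elements.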

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssymm(), dsymm(), csymm(), zsymm()

4.4.4.cublasXt<t>syrk()

cublasStatus_t cublasXtSsyrk(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                             int n, int k, const float *alpha, const float *A, int lda,
                             const float *beta, float *C, int ldc)
cublasStatus_t cublasXtDsyrk(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                             int n, int k, const double *alpha, const double *A, int lda,
                             const double *beta, double *C, int ldc)
cublasStatus_t cublasXtCsyrk(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                             int n, int k, const cuComplex *alpha, const cuComplex *A, int lda,
                             const cuComplex *beta, cuComplex *C, int ldc)
cublasStatus_t cublasXtZsyrk(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                             int n, int k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, int lda,
                             const cuDoubleComplex *beta, cuDoubleComplex *C, int ldc)

This function performs the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{T} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a symmetric matrix stored in lower or upper mode, and $A$ is a matrix such that $\text{op}(A)$ has dimensions $n \times k$. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\\end{matrix} \right.$

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`uplo`		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	operation op(`A`): non-transpose or transpose.
`n`		input	number of rows of matrix op(`A`) and `C`.
`k`		input	number of columns of matrix op(`A`).
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`beta`	host	input	<type> scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.
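
As a plain reference for the semantics described above, the following host-side sketch computes the rank-$k$ update for one case only: single precision, `trans==CUBLAS_OP_N`, and lower storage of `C`. The function name `ref_ssyrk_lower_n` is hypothetical and the loop is a naive CPU model, not how cuBLASXt computes the result. It shows that only the stored triangle of `C` is written and that `C` is not read when `beta` is zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference: C = alpha * A * A^T + beta * C for a
 * non-transposed n x k matrix A and lower storage of the n x n matrix C.
 * All arrays are column-major. */
static void ref_ssyrk_lower_n(size_t n, size_t k, float alpha,
                              const float *A, size_t lda, float beta,
                              float *C, size_t ldc)
{
    assert(lda >= (n > 1 ? n : 1) && ldc >= (n > 1 ? n : 1));
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = j; i < n; ++i) {      /* lower triangle only: i >= j */
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += A[i + p * lda] * A[j + p * lda];
            }
            /* With beta == 0 the old value of C is never read. */
            C[i + j * ldc] = (beta == 0.0f)
                                 ? alpha * acc
                                 : alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Elements of `C` above the diagonal keep whatever value they had before the call.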

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssyrk(), dsyrk(), csyrk(), zsyrk()

4.4.5.cublasXt<t>syr2k()

cublasStatus_t cublasXtSsyr2k(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const float *alpha, const float *A, size_t lda,
                              const float *B, size_t ldb, const float *beta, float *C, size_t ldc)
cublasStatus_t cublasXtDsyr2k(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const double *alpha, const double *A, size_t lda,
                              const double *B, size_t ldb, const double *beta, double *C, size_t ldc)
cublasStatus_t cublasXtCsyr2k(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuComplex *alpha, const cuComplex *A, size_t lda,
                              const cuComplex *B, size_t ldb, const cuComplex *beta, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsyr2k(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda,
                              const cuDoubleComplex *B, size_t ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, size_t ldc)

This function performs the symmetric rank-$2k$ update

$C = \alpha(\text{op}(A)\text{op}(B)^{T} + \text{op}(B)\text{op}(A)^{T}) + \beta C$

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`uplo`		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	operation op(`A`): non-transpose or transpose.
`n`		input	number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	number of columns of matrix op(`A`) and op(`B`).
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	<type> scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,n)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssyr2k(), dsyr2k(), csyr2k(), zsyr2k()

4.4.6.cublasXt<t>syrkx()

cublasStatus_t cublasXtSsyrkx(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const float *alpha, const float *A, size_t lda,
                              const float *B, size_t ldb, const float *beta, float *C, size_t ldc)
cublasStatus_t cublasXtDsyrkx(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const double *alpha, const double *A, size_t lda,
                              const double *B, size_t ldb, const double *beta, double *C, size_t ldc)
cublasStatus_t cublasXtCsyrkx(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuComplex *alpha, const cuComplex *A, size_t lda,
                              const cuComplex *B, size_t ldb, const cuComplex *beta, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZsyrkx(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda,
                              const cuDoubleComplex *B, size_t ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, size_t ldc)

This function performs a variation of the symmetric rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{T} + \beta C$

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`uplo`		input	indicates whether the lower or upper part of matrix `C` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`trans`		input	operation op(`A`): non-transpose or transpose.
`n`		input	number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	number of columns of matrix op(`A`) and op(`B`).
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	<type> scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,n)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssyrk(), dsyrk(), csyrk(), zsyrk() and

ssyr2k(), dsyr2k(), csyr2k(), zsyr2k()

4.4.7.cublasXt<t>herk()

cublasStatus_t cublasXtCherk(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                             int n, int k, const float *alpha, const cuComplex *A, int lda,
                             const float *beta, cuComplex *C, int ldc)
cublasStatus_t cublasXtZherk(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                             int n, int k, const double *alpha, const cuDoubleComplex *A, int lda,
                             const double *beta, cuDoubleComplex *C, int ldc)

This function performs the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(A)^{H} + \beta C$

where $\alpha$ and $\beta$ are scalars, $C$ is a Hermitian matrix stored in lower or upper mode, and $A$ is a matrix such that $\text{op}(A)$ has dimensions $n \times k$. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`uplo`		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
`trans`		input	operation op(`A`): non-transpose or conjugate transpose.
`n`		input	number of rows of matrix op(`A`) and `C`.
`k`		input	number of columns of matrix op(`A`).
`alpha`	host	input	real scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`beta`	host	input	real scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

cherk(), zherk()

4.4.8.cublasXt<t>her2k()

cublasStatus_t cublasXtCher2k(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuComplex *alpha, const cuComplex *A, size_t lda,
                              const cuComplex *B, size_t ldb, const float *beta, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZher2k(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda,
                              const cuDoubleComplex *B, size_t ldb, const double *beta, cuDoubleComplex *C, size_t ldc)

This function performs the Hermitian rank-$2k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \bar{\alpha}\text{op}(B)\text{op}(A)^{H} + \beta C$

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`uplo`		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
`trans`		input	operation op(`A`): non-transpose or conjugate transpose.
`n`		input	number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	number of columns of matrix op(`A`) and op(`B`).
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	real scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

cher2k(), zher2k()

4.4.9.cublasXt<t>herkx()

cublasStatus_t cublasXtCherkx(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuComplex *alpha, const cuComplex *A, size_t lda,
                              const cuComplex *B, size_t ldb, const float *beta, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZherkx(cublasXtHandle_t handle, cublasFillMode_t uplo, cublasOperation_t trans,
                              size_t n, size_t k, const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda,
                              const cuDoubleComplex *B, size_t ldb, const double *beta, cuDoubleComplex *C, size_t ldc)

This function performs a variation of the Hermitian rank-$k$ update

$C = \alpha\text{op}(A)\text{op}(B)^{H} + \beta C$

This routine can be used when the matrix B is such that the result is guaranteed to be Hermitian. A common example is when matrix B is a scaled form of matrix A, which is equivalent to B being the product of matrix A and a diagonal matrix.
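
The scaled-matrix case can be checked numerically with a small host-side sketch. All function names below are hypothetical, and the sketch assumes the diagonal scaling matrix is real: it forms $B = A\,\mathrm{diag}(d)$, computes the full product $A B^{H}$ on the CPU, and tests whether the result equals its own conjugate transpose.

```c
#include <assert.h>
#include <complex.h>
#include <stddef.h>

/* B = A * diag(d): scale column p of the n x k matrix A by the real d[p]. */
static void scale_columns(size_t n, size_t k, const float complex *A,
                          const float *d, float complex *B)
{
    for (size_t p = 0; p < k; ++p)
        for (size_t i = 0; i < n; ++i)
            B[i + p * n] = A[i + p * n] * d[p];
}

/* P = A * B^H for n x k column-major A and B (leading dimension n). */
static void ref_a_bh(size_t n, size_t k, const float complex *A,
                     const float complex *B, float complex *P)
{
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            float complex acc = 0.0f;
            for (size_t p = 0; p < k; ++p)
                acc += A[i + p * n] * conjf(B[j + p * n]);
            P[i + j * n] = acc;
        }
    }
}

/* Returns 1 if |P(i,j) - conj(P(j,i))| <= tol for all i, j, else 0. */
static int is_hermitian(size_t n, const float complex *P, float tol)
{
    assert(tol >= 0.0f);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = 0; i < n; ++i) {
            float complex diff = P[i + j * n] - conjf(P[j + i * n]);
            float dr = crealf(diff), di = cimagf(diff);
            if (dr * dr + di * di > tol * tol)
                return 0;
        }
    }
    return 1;
}
```

With an unrelated `B` the same product is generally not Hermitian, which is why the routine is only appropriate when the structure of `B` guarantees it.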

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`uplo`		input	indicates whether the lower or upper part of matrix `C` is stored; the other Hermitian part is not referenced.
`trans`		input	operation op(`A`): non-transpose or conjugate transpose.
`n`		input	number of rows of matrix op(`A`), op(`B`) and `C`.
`k`		input	number of columns of matrix op(`A`) and op(`B`).
`alpha`	host	input	<type> scalar used for multiplication.
`A`	host or device	input	<type> array of dimension `lda x k` with `lda>=max(1,n)` if `trans==CUBLAS_OP_N` and `lda x n` with `lda>=max(1,k)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x k` with `ldb>=max(1,n)` if `trans==CUBLAS_OP_N` and `ldb x n` with `ldb>=max(1,k)` otherwise.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	real scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n`, with `ldc>=max(1,n)`. The imaginary parts of the diagonal elements are assumed and set to zero.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `n<0` or `k<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

cherk(), zherk() and

cher2k(), zher2k()

4.4.10.cublasXt<t>trsm()

cublasStatus_t cublasXtStrsm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const float *alpha, const float *A, size_t lda, float *B, size_t ldb)
cublasStatus_t cublasXtDtrsm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const double *alpha, const double *A, size_t lda, double *B, size_t ldb)
cublasStatus_t cublasXtCtrsm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const cuComplex *alpha, const cuComplex *A, size_t lda, cuComplex *B, size_t ldb)
cublasStatus_t cublasXtZtrsm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda, cuDoubleComplex *B, size_t ldb)

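This function solves the triangular linear system with multiple right-hand sides

$\left\{ \begin{matrix}{\text{op}(A)X = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{X\text{op}(A) = \alpha B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $X$ and $B$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$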

The solution $X$ overwrites the right-hand sides $B$ on exit.

No test for singularity or near-singularity is included in this function.
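
The in-place solve can be sketched on the host for one configuration: single precision, `A` on the left, lower storage, no transpose. The function `ref_strsm_left_lower_n` is a hypothetical CPU reference using forward substitution and does not describe the cuBLASXt implementation. Like the routine it models, it performs no singularity check, so a zero on the diagonal leads to a division by zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference: solves A * X = alpha * B for a lower
 * triangular m x m matrix A and overwrites the m x n matrix B with X.
 * unit_diag != 0 means the diagonal of A is taken as 1 and never read.
 * All arrays are column-major. */
static void ref_strsm_left_lower_n(int unit_diag, size_t m, size_t n,
                                   float alpha, const float *A, size_t lda,
                                   float *B, size_t ldb)
{
    assert(lda >= (m > 1 ? m : 1) && ldb >= (m > 1 ? m : 1));
    for (size_t j = 0; j < n; ++j) {          /* one right-hand side per column */
        for (size_t i = 0; i < m; ++i) {
            float x = alpha * B[i + j * ldb];
            for (size_t p = 0; p < i; ++p) {
                /* B(p, j) already holds the solved value X(p, j). */
                x -= A[i + p * lda] * B[p + j * ldb];
            }
            if (!unit_diag) {
                x /= A[i + i * lda];
            }
            B[i + j * ldb] = x;
        }
    }
}
```

The strictly upper part of `A` is never read, so it may contain arbitrary values.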

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`side`		input	indicates if matrix `A` is on the left or right of `X`.
`uplo`		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	operation op(`A`): non-transpose or (conjugate) transpose.
`diag`		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`m`		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
`n`		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
`alpha`	host	input	<type> scalar used for multiplication; if `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
`A`	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	in/out	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

strsm(), dtrsm(), ctrsm(), ztrsm()

4.4.11.cublasXt<t>trmm()

cublasStatus_t cublasXtStrmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const float *alpha, const float *A, size_t lda,
                             const float *B, size_t ldb, float *C, size_t ldc)
cublasStatus_t cublasXtDtrmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const double *alpha, const double *A, size_t lda,
                             const double *B, size_t ldb, double *C, size_t ldc)
cublasStatus_t cublasXtCtrmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const cuComplex *alpha, const cuComplex *A, size_t lda,
                             const cuComplex *B, size_t ldb, cuComplex *C, size_t ldc)
cublasStatus_t cublasXtZtrmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             cublasOperation_t trans, cublasDiagType_t diag, size_t m, size_t n,
                             const cuDoubleComplex *alpha, const cuDoubleComplex *A, size_t lda,
                             const cuDoubleComplex *B, size_t ldb, cuDoubleComplex *C, size_t ldc)

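This function performs the triangular matrix-matrix multiplication

$C = \left\{ \begin{matrix}{\alpha\text{op}(A)B} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{\alpha B\text{op}(A)} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A$ is a triangular matrix stored in lower or upper mode with or without the main diagonal, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ is a scalar. Also, for matrix $A$

$\text{op}(A) = \left\{ \begin{matrix}A & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_N}$}} \\A^{T} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_T}$}} \\A^{H} & {\text{if }\textsf{trans == $\mathrm{CUBLAS\_OP\_C}$}} \\\end{matrix} \right.$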

Note that, to achieve better parallelism, the cuBLASXt API, like the cuBLAS API, differs from the BLAS API for this routine. The BLAS API assumes an in-place implementation (with results written back to B), while the cuBLASXt API assumes an out-of-place implementation (with results written into C). The application can still obtain the in-place functionality of BLAS in the cuBLASXt API by passing the address of matrix B in place of matrix C. No other overlap among the input parameters is supported.
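
The difference between out-of-place and in-place use can be modeled with a host-side sketch for one configuration: single precision, `A` on the left, lower storage, no transpose, non-unit diagonal. The function `ref_strmm_left_lower_n` is hypothetical and says nothing about how cuBLASXt schedules the work. It visits rows from the bottom up so that the same buffer can be passed as both `B` and `C`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference: C = alpha * A * B for a lower triangular
 * m x m matrix A; B and C are m x n, column-major. Row i of the result
 * only needs rows 0..i of B, so walking i downwards never reads a row
 * that has already been overwritten when C aliases B. */
static void ref_strmm_left_lower_n(size_t m, size_t n, float alpha,
                                   const float *A, size_t lda,
                                   const float *B, size_t ldb,
                                   float *C, size_t ldc)
{
    assert(lda >= m && ldb >= m && ldc >= m);
    for (size_t j = 0; j < n; ++j) {
        for (size_t i = m; i-- > 0; ) {
            float acc = 0.0f;
            for (size_t p = 0; p <= i; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = alpha * acc;
        }
    }
}
```

Calling it with distinct `B` and `C` leaves `B` untouched, while calling it with `C` equal to `B` (and `ldc` equal to `ldb`) reproduces the BLAS-style in-place result.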

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`side`		input	indicates if matrix `A` is on the left or right of `B`.
`uplo`		input	indicates whether the lower or upper part of matrix `A` is stored; the other part is not referenced and is inferred from the stored elements.
`trans`		input	operation op(`A`): non-transpose or (conjugate) transpose.
`diag`		input	indicates if the elements on the main diagonal of matrix `A` are unity and should not be accessed.
`m`		input	number of rows of matrix `B`, with matrix `A` sized accordingly.
`n`		input	number of columns of matrix `B`, with matrix `A` sized accordingly.
`alpha`	host	input	<type> scalar used for multiplication; if `alpha==0` then `A` is not referenced and `B` does not have to be a valid input.
`A`	host or device	input	<type> array of dimension `lda x m` with `lda>=max(1,m)` if `side==CUBLAS_SIDE_LEFT` and `lda x n` with `lda>=max(1,n)` otherwise.
`lda`		input	leading dimension of two-dimensional array used to store matrix `A`.
`B`	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`C`	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

strmm(), dtrmm(), ctrmm(), ztrmm()

4.4.12.cublasXt<t>spmm()

cublasStatus_t cublasXtSspmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const float *alpha, const float *AP,
                             const float *B, size_t ldb, const float *beta, float *C, size_t ldc);
cublasStatus_t cublasXtDspmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const double *alpha, const double *AP,
                             const double *B, size_t ldb, const double *beta, double *C, size_t ldc);
cublasStatus_t cublasXtCspmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const cuComplex *alpha, const cuComplex *AP,
                             const cuComplex *B, size_t ldb, const cuComplex *beta, cuComplex *C, size_t ldc);
cublasStatus_t cublasXtZspmm(cublasXtHandle_t handle, cublasSideMode_t side, cublasFillMode_t uplo,
                             size_t m, size_t n, const cuDoubleComplex *alpha, const cuDoubleComplex *AP,
                             const cuDoubleComplex *B, size_t ldb, const cuDoubleComplex *beta, cuDoubleComplex *C, size_t ldc);

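This function performs the symmetric packed matrix-matrix multiplication

$C = \left\{ \begin{matrix}{\alpha AB + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_LEFT}$}} \\{\alpha BA + \beta C} & {\text{if }\textsf{side == $\mathrm{CUBLAS\_SIDE\_RIGHT}$}} \\\end{matrix} \right.$

where $A$ is an $n \times n$ symmetric matrix stored in packed format, $B$ and $C$ are $m \times n$ matrices, and $\alpha$ and $\beta$ are scalars.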

Note

The packed matrix AP must be located in host or managed memory, whereas the other matrices can be located on the host or on any GPU device.

Param.	Memory	In/out	Meaning
`handle`		input	handle to the cuBLASXt API context.
`side`		input	indicates if matrix `A` is on the left or right of `B`.
`uplo`		input	indicates whether the lower or upper part of matrix `A` is stored; the other symmetric part is not referenced and is inferred from the stored elements.
`m`		input	number of rows of matrix `C` and `B`, with matrix `A` sized accordingly.
`n`		input	number of columns of matrix `C` and `B`, with matrix `A` sized accordingly.
`alpha`	host	input	<type> scalar used for multiplication.
`AP`	host	input	<type> array with $A$ stored in packed format.
`B`	host or device	input	<type> array of dimension `ldb x n` with `ldb>=max(1,m)`.
`ldb`		input	leading dimension of two-dimensional array used to store matrix `B`.
`beta`	host	input	<type> scalar used for multiplication; if `beta==0` then `C` does not have to be a valid input.
`C`	host or device	in/out	<type> array of dimension `ldc x n` with `ldc>=max(1,m)`.
`ldc`		input	leading dimension of two-dimensional array used to store matrix `C`.
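
This section does not spell out the packed layout of `AP`. The sketch below assumes the conventional BLAS packed scheme, in which the stored triangle is laid out column by column without gaps, and uses 0-based indices. The helper names are hypothetical and not part of cuBLASXt.

```c
#include <assert.h>
#include <stddef.h>

/* Number of elements in the packed array for a matrix of order n. */
static size_t packed_size(size_t n)
{
    return n * (n + 1) / 2;
}

/* Offset of element (i, j) inside AP, assuming column-by-column packing
 * of the stored triangle. (i, j) must lie in that triangle. */
static size_t packed_index(int upper, size_t n, size_t i, size_t j)
{
    assert(i < n && j < n);
    if (upper) {
        assert(i <= j);
        return i + j * (j + 1) / 2;
    }
    assert(i >= j);
    return i + j * (2 * n - j - 1) / 2;
}

/* Element (i, j) of the full symmetric matrix, using A(i, j) == A(j, i)
 * to redirect accesses that fall outside the stored triangle. */
static float sym_packed_at(int upper, size_t n, const float *AP,
                           size_t i, size_t j)
{
    int in_stored = upper ? (i <= j) : (i >= j);
    return in_stored ? AP[packed_index(upper, n, i, j)]
                     : AP[packed_index(upper, n, j, i)];
}
```

Under this assumption a matrix of order 3 occupies 6 elements instead of 9, which is the storage saving the packed format is meant to provide.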

The possible error values returned by this function and their meanings are listed below.

Error Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_INVALID_VALUE`	the parameters `m<0` or `n<0`
`CUBLAS_STATUS_NOT_SUPPORTED`	the matrix AP is located on a GPU device
`CUBLAS_STATUS_EXECUTION_FAILED`	the function failed to launch on the GPU

For references please refer to NETLIB documentation:

ssymm(), dsymm(), csymm(), zsymm()

5.Using the cuBLASDx API

The cuBLASDx library (preview) is a device-side API extension for performing BLAS calculations inside CUDA kernels. By fusing numerical operations, you can decrease latency and further improve the performance of your applications.

You can access the cuBLASDx documentation here.
cuBLASDx is not a part of the CUDA Toolkit. You can download cuBLASDx separately from here.

6.Using the cuBLAS Legacy API

This section does not provide a full reference of each Legacy API datatype and entry point. Instead, it describes how to use the API, especially where this is different from the regular cuBLAS API.

Note that in this section, all references to the “cuBLAS Library” refer to the Legacy cuBLAS API only.

Warning

The legacy cuBLAS API is deprecated and will be removed in a future release.

6.1.Error Status

The cublasStatus type is used for function status returns. The cuBLAS Library helper functions return status directly, while the status of core functions can be retrieved using cublasGetError(). Note that reading the error status via cublasGetError() resets the internal error state to CUBLAS_STATUS_SUCCESS. Currently, the following values are defined:

Value	Meaning
`CUBLAS_STATUS_SUCCESS`	the operation completed successfully
`CUBLAS_STATUS_NOT_INITIALIZED`	the library was not initialized
`CUBLAS_STATUS_ALLOC_FAILED`	the resource allocation failed
`CUBLAS_STATUS_INVALID_VALUE`	an invalid numerical value was used as an argument
`CUBLAS_STATUS_ARCH_MISMATCH`	an absent device architectural feature is required
`CUBLAS_STATUS_MAPPING_ERROR`	an access to GPU memory space failed
`CUBLAS_STATUS_EXECUTION_FAILED`	the GPU program failed to execute
`CUBLAS_STATUS_INTERNAL_ERROR`	an internal operation failed
`CUBLAS_STATUS_NOT_SUPPORTED`	the feature required is not supported

This legacy type corresponds to the type cublasStatus_t in the cuBLAS library API.
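
The read-and-reset behavior of the error state can be modeled with a small mock. Every name below is hypothetical and none of it calls the legacy library; the sketch also assumes that only failing core calls update the stored status, which the text above does not specify.

```c
#include <assert.h>

/* Hypothetical stand-ins for two cublasStatus values. */
typedef enum {
    MOCK_STATUS_SUCCESS = 0,
    MOCK_STATUS_EXECUTION_FAILED = 1
} mock_status;

/* Internal error state, as kept by the mocked library. */
static mock_status g_last_error = MOCK_STATUS_SUCCESS;

/* Models a core function: it returns nothing and records a failure. */
static void mock_core_call(mock_status outcome)
{
    if (outcome != MOCK_STATUS_SUCCESS) {
        g_last_error = outcome;   /* assumption: only failures are recorded */
    }
}

/* Models cublasGetError(): returns the state and resets it to success. */
static mock_status mock_get_error(void)
{
    mock_status s = g_last_error;
    g_last_error = MOCK_STATUS_SUCCESS;
    return s;
}

/* Usage: the first read reports the failure, the second read does not. */
static void demo_read_resets(void)
{
    mock_core_call(MOCK_STATUS_EXECUTION_FAILED);
    assert(mock_get_error() == MOCK_STATUS_EXECUTION_FAILED);
    assert(mock_get_error() == MOCK_STATUS_SUCCESS);
}
```

The practical consequence is that the status should be read once and stored if it is needed in more than one place.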

6.2.Initialization and Shutdown

The functions cublasInit() and cublasShutdown() are used to initialize and shut down the cuBLAS library. It is recommended that cublasInit() be called before any other function is invoked. It allocates hardware resources on the GPU device that is currently bound to the host thread from which it was invoked.

The legacy initialization and shutdown functions are similar to the cuBLAS library API routines cublasCreate() and cublasDestroy().

6.3.Thread Safety

The legacy API is not thread safe when used with multiple host threads and devices. It is recommended only when utmost compatibility with Fortran is required and when a single host thread is used to set up the library and make all the function calls.

6.4.Memory Management

The memory used by the legacy cuBLAS library API is allocated and released using the functions cublasAlloc() and cublasFree(), respectively. These functions create and destroy an object in the GPU memory space capable of holding an array of n elements, where each element requires elemSize bytes of storage. Please see the legacy cuBLAS API header file “cublas.h” for the prototypes of these functions.

The function cublasAlloc() is a wrapper around the function cudaMalloc(); therefore, device pointers returned by cublasAlloc() can be passed to any CUDA™ device kernel functions. However, these device pointers cannot be dereferenced in the host code. The function cublasFree() is a wrapper around the function cudaFree().
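
Because cublasAlloc() takes an element count and an element size while cudaMalloc() takes a size in bytes, the conversion is a single multiplication. The helper below is a hypothetical host-side sketch of that computation with an overflow guard added; it does not call either allocation function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: computes n * elemSize, the byte count that would be
 * requested for an array of n elements of elemSize bytes each.
 * Returns 0 and stores the result in *bytes, or returns -1 if the
 * product does not fit in size_t. */
static int alloc_size_bytes(size_t n, size_t elemSize, size_t *bytes)
{
    assert(bytes != NULL);
    if (elemSize != 0 && n > SIZE_MAX / elemSize) {
        return -1;                /* n * elemSize would overflow */
    }
    *bytes = n * elemSize;
    return 0;
}
```

For instance, 1000 elements of 4 bytes each correspond to a request of 4000 bytes.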

6.5.Scalar Parameters

In the legacy cuBLAS API, scalar parameters are passed by value from the host. Also, the few functions that do return a scalar result, such as dot() and nrm2(), return the resulting value on the host, and hence these routines will wait for kernel execution on the device to complete before returning, which makes parallelism with streams impractical. However, the majority of functions do not return any value, in order to be more compatible with Fortran and the existing BLAS libraries.

6.6.Helper Functions

In this section we list the helper functions provided by the legacy cuBLAS API and their functionality. For the exact prototypes of these functions please refer to the legacy cuBLAS API header file “cublas.h”.

Helper function	Meaning
`cublasInit()`	initializes the library
`cublasShutdown()`	shuts down the library
`cublasGetError()`	retrieves the error status of the library
`cublasSetKernelStream()`	sets the stream to be used by the library
`cublasAlloc()`	allocates the device memory for the library
`cublasFree()`	releases the device memory allocated for the library
`cublasSetVector()`	copies a vector `x` on the host to a vector on the GPU
`cublasGetVector()`	copies a vector `x` on the GPU to a vector on the host
`cublasSetMatrix()`	copies an $m \times n$ tile from a matrix on the host to the GPU
`cublasGetMatrix()`	copies an $m \times n$ tile from a matrix on the GPU to the host
`cublasSetVectorAsync()`	similar to `cublasSetVector()`, but the copy is asynchronous
`cublasGetVectorAsync()`	similar to `cublasGetVector()`, but the copy is asynchronous
`cublasSetMatrixAsync()`	similar to `cublasSetMatrix()`, but the copy is asynchronous
`cublasGetMatrixAsync()`	similar to `cublasGetMatrix()`, but the copy is asynchronous

6.7.Level-1,2,3 Functions

The Level-1,2,3 cuBLAS functions (also called core functions) have the same name and behavior as the ones listed in the chapters 3, 4 and 5 in this document. Please refer to the legacy cuBLAS API header file “cublas.h” for their exact prototype. Also, the next section talks a bit more about the differences between the legacy and the cuBLAS API prototypes, more specifically how to convert the function calls from one API to another.

6.8.Converting Legacy to the cuBLAS API

There are a few general rules that can be used to convert from legacy to the cuBLAS API:

Exchange the header file “cublas.h” for “cublas_v2.h”.
Exchange the type cublasStatus for cublasStatus_t.
Exchange the function cublasSetKernelStream() for cublasSetStream().
Exchange the functions cublasAlloc() and cublasFree() for cudaMalloc() and cudaFree(), respectively. Notice that cudaMalloc() expects the size of the allocated memory to be provided in bytes (usually simply provide n * elemSize to allocate n elements, each of size elemSize bytes).
Declare the cublasHandle_t cuBLAS library handle.
Initialize the handle using cublasCreate(). Also, release the handle once finished using cublasDestroy().
Add the handle as the first parameter to all the cuBLAS library function calls.
Change the scalar parameters to be passed by reference, instead of by value (usually simply adding the “&” symbol in C/C++ is enough, because the parameters are passed by reference on the host by default). However, note that if the routine is running asynchronously, then the variable holding the scalar parameter cannot be changed until the kernels that the routine dispatches are completed. See the CUDA C++ Programming Guide for a detailed discussion of how to use streams.
Change the parameter characters N or n (non-transpose operation), T or t (transpose operation) and C or c (conjugate transpose operation) to CUBLAS_OP_N, CUBLAS_OP_T and CUBLAS_OP_C, respectively.
Change the parameter characters L or l (lower part filled) and U or u (upper part filled) to CUBLAS_FILL_MODE_LOWER and CUBLAS_FILL_MODE_UPPER, respectively.
Change the parameter characters N or n (non-unit diagonal) and U or u (unit diagonal) to CUBLAS_DIAG_NON_UNIT and CUBLAS_DIAG_UNIT, respectively.
Change the parameter characters L or l (left side) and R or r (right side) to CUBLAS_SIDE_LEFT and CUBLAS_SIDE_RIGHT, respectively.
If the legacy API function returns a scalar value, add an extra scalar parameter of the same type passed by reference, as the last parameter to the same function.
Instead of using cublasGetError(), use the return value of the function itself to check for errors.
Finally, please use the function prototypes in the header files cublas.h and cublas_v2.h to check the code for correctness.

6.9.Examples

For sample code references that use the legacy cuBLAS API please see the two examples below. They show an application written in C using the legacy cuBLAS library API with two indexing styles (Example A.1. “Application Using C and cuBLAS: 1-based indexing” and Example A.2. “Application Using C and cuBLAS: 0-based Indexing”). This application is analogous to the one using the cuBLAS library API that is shown in the Introduction chapter.

Example A.1. Application Using C and cuBLAS: 1-based indexing

    //-----------------------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "cublas.h"
    #define M 6
    #define N 5
    /* 1-based (Fortran-style) index into a column-major array */
    #define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

    /* Scale row p (columns q..n) by alpha, then column q (rows p..ldm) by beta. */
    static __inline__ void modify (float *m, int ldm, int n, int p, int q, float alpha, float beta){
        cublasSscal (n-q+1, alpha, &m[IDX2F(p,q,ldm)], ldm);
        cublasSscal (ldm-p+1, beta, &m[IDX2F(p,q,ldm)], 1);
    }

    int main (void){
        int i, j;
        cublasStatus stat;
        float* devPtrA;
        float* a = 0;
        a = (float *)malloc (M * N * sizeof (*a));
        if (!a) {
            printf ("host memory allocation failed");
            return EXIT_FAILURE;
        }
        for (j = 1; j <= N; j++) {
            for (i = 1; i <= M; i++) {
                a[IDX2F(i,j,M)] = (float)((i-1) * M + j);
            }
        }
        cublasInit();
        stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("device memory allocation failed");
            cublasShutdown();
            return EXIT_FAILURE;
        }
        /* Copy the host matrix to the device. */
        stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("data download failed");
            cublasFree (devPtrA);
            cublasShutdown();
            return EXIT_FAILURE;
        }
        modify (devPtrA, M, N, 2, 3, 16.0f, 12.0f);
        /* Copy the modified matrix back to the host. */
        stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("data upload failed");
            cublasFree (devPtrA);
            cublasShutdown();
            return EXIT_FAILURE;
        }
        cublasFree (devPtrA);
        cublasShutdown();
        for (j = 1; j <= N; j++) {
            for (i = 1; i <= M; i++) {
                printf ("%7.0f", a[IDX2F(i,j,M)]);
            }
            printf ("\n");
        }
        free(a);
        return EXIT_SUCCESS;
    }

Example A.2. Application Using C and cuBLAS: 0-based indexing

    //-----------------------------------------------------------
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>
    #include "cublas.h"
    #define M 6
    #define N 5
    /* 0-based (C-style) index into a column-major array */
    #define IDX2C(i,j,ld) (((j)*(ld))+(i))

    /* Scale row p (columns q..n-1) by alpha, then column q (rows p..ldm-1) by beta. */
    static __inline__ void modify (float *m, int ldm, int n, int p, int q, float alpha, float beta){
        cublasSscal (n-q, alpha, &m[IDX2C(p,q,ldm)], ldm);
        cublasSscal (ldm-p, beta, &m[IDX2C(p,q,ldm)], 1);
    }

    int main (void){
        int i, j;
        cublasStatus stat;
        float* devPtrA;
        float* a = 0;
        a = (float *)malloc (M * N * sizeof (*a));
        if (!a) {
            printf ("host memory allocation failed");
            return EXIT_FAILURE;
        }
        for (j = 0; j < N; j++) {
            for (i = 0; i < M; i++) {
                a[IDX2C(i,j,M)] = (float)(i * M + j + 1);
            }
        }
        cublasInit();
        stat = cublasAlloc (M*N, sizeof(*a), (void**)&devPtrA);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("device memory allocation failed");
            cublasShutdown();
            return EXIT_FAILURE;
        }
        /* Copy the host matrix to the device. */
        stat = cublasSetMatrix (M, N, sizeof(*a), a, M, devPtrA, M);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("data download failed");
            cublasFree (devPtrA);
            cublasShutdown();
            return EXIT_FAILURE;
        }
        modify (devPtrA, M, N, 1, 2, 16.0f, 12.0f);
        /* Copy the modified matrix back to the host. */
        stat = cublasGetMatrix (M, N, sizeof(*a), devPtrA, M, a, M);
        if (stat != CUBLAS_STATUS_SUCCESS) {
            printf ("data upload failed");
            cublasFree (devPtrA);
            cublasShutdown();
            return EXIT_FAILURE;
        }
        cublasFree (devPtrA);
        cublasShutdown();
        for (j = 0; j < N; j++) {
            for (i = 0; i < M; i++) {
                printf ("%7.0f", a[IDX2C(i,j,M)]);
            }
            printf ("\n");
        }
        free(a);
        return EXIT_SUCCESS;
    }

7.cuBLAS Fortran Bindings

The cuBLAS library is implemented using the C-based CUDA toolchain. Thus, it provides a C-style API. This makes interfacing to applications written in C and C++ trivial, but the library can also be used by applications written in Fortran. In particular, the cuBLAS library uses 1-based indexing and Fortran-style column-major storage for multidimensional data to simplify interfacing to Fortran applications. Unfortunately, Fortran-to-C calling conventions are not standardized and differ by platform and toolchain. In particular, differences may exist in the following areas:

symbol names (capitalization, name decoration)
argument passing (by value or reference)
passing of string arguments (length information)
passing of pointer arguments (size of the pointer)
returning floating-point or compound data types (for example single-precision or complex data types)

To provide maximum flexibility in addressing those differences, the cuBLAS Fortran interface is provided in the form of wrapper functions and is part of the Toolkit delivery. The C source code of those wrapper functions is located in the src directory and provided in two different forms:

the thunking wrapper interface located in the file fortran_thunking.c
the direct wrapper interface located in the file fortran.c

The code of one of those two files needs to be compiled into an application for it to call the cuBLAS API functions. Providing source code allows users to make any changes necessary for a particular platform and toolchain.

The code in those two C files has been used to demonstrate interoperability with the compilers g77 3.2.3 and g95 0.91 on 32-bit Linux, g77 3.4.5 and g95 0.91 on 64-bit Linux, Intel Fortran 9.0 and Intel Fortran 10.0 on 32-bit and 64-bit Microsoft Windows XP, and g77 3.4.0 and g95 0.92 on Mac OS X.

Note that for g77, use of the compiler flag -fno-second-underscore is required to use these wrappers as provided. Also, the use of the default calling conventions with regard to argument and return value passing is expected. Using the flag -fno-f2c changes the default calling convention with respect to these two items.

The thunking wrappers allow interfacing to existing Fortran applications without any changes to the application. During each call, the wrappers allocate GPU memory, copy source data from CPU memory space to GPU memory space, call cuBLAS, and finally copy back the results to CPU memory space and deallocate the GPU memory. As this process causes very significant call overhead, these wrappers are intended for light testing, not for production code. To use the thunking wrappers, the application needs to be compiled with the file fortran_thunking.c.

The direct wrappers, intended for production code, substitute device pointers for vector and matrix arguments in all BLAS functions. To use these interfaces, existing applications need to be modified slightly to allocate and deallocate data structures in GPU memory space (using cuBLAS_ALLOC and cuBLAS_FREE) and to copy data between GPU and CPU memory spaces (using cuBLAS_SET_VECTOR, cuBLAS_GET_VECTOR, cuBLAS_SET_MATRIX, and cuBLAS_GET_MATRIX). The sample wrappers provided in fortran.c map device pointers to the OS-dependent type size_t, which is 32 bits wide on 32-bit platforms and 64 bits wide on 64-bit platforms.

One approach to dealing with index arithmetic on device pointers in Fortran code is to use C-style macros and have the C preprocessor expand them, as shown in the example below. On Linux and Mac OS X, one way of pre-processing is to use the option -E -x f77-cpp-input when using the g77 compiler, or simply the option -cpp when using g95 or gfortran. On Windows platforms with Microsoft Visual C/C++, using ’cl -EP’ achieves similar results.

    ! Example B.1. Fortran 77 Application Executing on the Host
    ! ----------------------------------------------------------
          subroutine modify (m, ldm, n, p, q, alpha, beta)
          implicit none
          integer ldm, n, p, q
          real*4 m(ldm,*), alpha, beta
          external cublas_sscal
    ! scale row p, columns q..n (n-q+1 elements, as in Example A.1)
          call cublas_sscal (n-q+1, alpha, m(p,q), ldm)
    ! scale column q, rows p..ldm
          call cublas_sscal (ldm-p+1, beta, m(p,q), 1)
          return
          end

          program matrixmod
          implicit none
          integer M, N
          parameter (M=6, N=5)
          real*4 a(M,N)
          integer i, j
          external cublas_init
          external cublas_shutdown

          do j = 1, N
            do i = 1, M
              a(i,j) = (i-1)*M + j
            enddo
          enddo
          call cublas_init
          call modify (a, M, N, 2, 3, 16.0, 12.0)
          call cublas_shutdown
          do j = 1, N
            do i = 1, M
              write(*,"(F7.0$)") a(i,j)
            enddo
            write (*,*) ""
          enddo
          stop
          end

When traditional fixed-form Fortran 77 code is ported to use the cuBLAS library, line length often increases when the BLAS calls are exchanged for cuBLAS calls. Longer function names and possible macro expansion are contributing factors. Inadvertently exceeding the maximum line length can lead to run-time errors that are difficult to find, so care should be taken not to exceed the 72-column limit if fixed form is retained.

The examples in this chapter show a small application implemented in Fortran 77 on the host and the same application with the non-thunking wrappers after it has been ported to use the cuBLAS library.

The second example should be compiled with ARCH_64 defined as 1 on a 64-bit OS and as 0 on a 32-bit OS. For example, for g95 or gfortran, this can be done directly on the command line by using the option -cpp -DARCH_64=1.

    ! Example B.2. Same Application Using Non-thunking cuBLAS Calls
    !-------------------------------------------------------------
    #define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

          subroutine modify (devPtrM, ldm, n, p, q, alpha, beta)
          implicit none
          integer sizeof_real
          parameter (sizeof_real=4)
          integer ldm, n, p, q
    #if ARCH_64
          integer*8 devPtrM
    #else
          integer*4 devPtrM
    #endif
          real*4 alpha, beta
    ! scale row p, columns q..n (n-q+1 elements, as in Example A.1)
          call cublas_sscal (n-q+1, alpha,
         1                   devPtrM+IDX2F(p,q,ldm)*sizeof_real,
         2                   ldm)
    ! scale column q, rows p..ldm
          call cublas_sscal (ldm-p+1, beta,
         1                   devPtrM+IDX2F(p,q,ldm)*sizeof_real,
         2                   1)
          return
          end

          program matrixmod
          implicit none
          integer M, N, sizeof_real
    #if ARCH_64
          integer*8 devPtrA
    #else
          integer*4 devPtrA
    #endif
          parameter (M=6, N=5, sizeof_real=4)
          real*4 a(M,N)
          integer i, j, stat
          external cublas_init, cublas_set_matrix, cublas_get_matrix
          external cublas_shutdown, cublas_alloc
          integer cublas_alloc, cublas_set_matrix, cublas_get_matrix

          do j = 1, N
            do i = 1, M
              a(i,j) = (i-1)*M + j
            enddo
          enddo
          call cublas_init
          stat = cublas_alloc(M*N, sizeof_real, devPtrA)
          if (stat .NE. 0) then
            write(*,*) "device memory allocation failed"
            call cublas_shutdown
            stop
          endif
          stat = cublas_set_matrix(M,N,sizeof_real,a,M,devPtrA,M)
          if (stat .NE. 0) then
            call cublas_free (devPtrA)
            write(*,*) "data download failed"
            call cublas_shutdown
            stop
          endif
          call modify (devPtrA, M, N, 2, 3, 16.0, 12.0)
          stat = cublas_get_matrix(M,N,sizeof_real,devPtrA,M,a,M)
          if (stat .NE. 0) then
            call cublas_free (devPtrA)
            write(*,*) "data upload failed"
            call cublas_shutdown
            stop
          endif
          call cublas_free (devPtrA)
          call cublas_shutdown
          do j = 1, N
            do i = 1, M
              write(*,"(F7.0$)") a(i,j)
            enddo
            write (*,*) ""
          enddo
          stop
          end

8.Interaction with Other Libraries and Tools

This section describes important requirements and recommendations that ensure correct use of cuBLAS with other libraries and utilities.

8.1.nvprune

nvprune enables pruning relocatable host objects and static libraries to only contain device code for the specific target architectures. In the case of cuBLAS, particular care must be taken when using nvprune with compute capabilities whose minor revision number is different from 0. To reduce binary size, cuBLAS may only store major revision equivalents of CUDA binary files for kernels reused between different minor revision versions. Therefore, to ensure that a pruned library does not fail for arbitrary problems, the user must keep binaries for a selected architecture and all prior minor architectures in its major architecture.

For example, the following call prunes libcublasLt_static.a to contain only sm_75 (Turing) and sm_70 (Volta) cubins:

    nvprune --generate-code code=sm_70 --generate-code code=sm_75 libcublasLt_static.a -o libcublasLt_static_sm70_sm75.a

which should be used instead of:

    nvprune -arch=sm_75 libcublasLt_static.a -o libcublasLt_static_sm75.a

9.Acknowledgements

NVIDIA would like to thank the following individuals and institutions for their contributions:

Portions of the SGEMM, DGEMM, CGEMM and ZGEMM library routines were written by Vasily Volkov of the University of California.
Portions of the SGEMM, DGEMM and ZGEMM library routines were written by Davide Barbieri of the University of Rome Tor Vergata.
Portions of the DGEMM and SGEMM library routines optimized for Fermi architecture were developed by the University of Tennessee. Subsequently, several other routines that are optimized for the Fermi architecture have been derived from these initial DGEMM and SGEMM implementations.
The substantial optimizations of the STRSV, DTRSV, CTRSV and ZTRSV library routines were developed by Jonathan Hogg of The Science and Technology Facilities Council (STFC). Subsequently, some optimizations of the STRSM, DTRSM, CTRSM and ZTRSM have been derived from these TRSV implementations.
Substantial optimizations of the SYMV and HEMV library routines were developed by Ahmad Abdelfattah, David Keyes and Hatem Ltaief of King Abdullah University of Science and Technology (KAUST).
Substantial optimizations of the TRMM and TRSM library routines were developed by Ali Charara, David Keyes and Hatem Ltaief of King Abdullah University of Science and Technology (KAUST).
This product includes {fmt} - A modern formatting library. https://fmt.dev Copyright (c) 2012 - present, Victor Zverovich.
This product includes spdlog - Fast C++ logging library. https://github.com/gabime/spdlog The MIT License (MIT).
This product includes SIMD Library for Evaluating Elementary Functions, vectorized libm and DFT. https://sleef.org Boost Software License - Version 1.0 - August 17th, 2003.
This product includes Frozen - a header-only, constexpr alternative to gperf for C++14 users. https://github.com/serge-sans-paille/frozen Apache License - Version 2.0, January 2004.
This product includes Boost C++ Libraries - free peer-reviewed portable C++ source libraries. https://www.boost.org/ Boost Software License - Version 1.0 - August 17th, 2003.
This product includes Zstandard - a fast lossless compression algorithm, targeting real-time compression scenarios at zlib-level and better compression ratios. https://github.com/facebook/zstd The BSD License.

10.Notices

10.1.Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

10.2.OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

10.3.Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.
