cuSPARSE

The API reference guide for cuSPARSE, the CUDA sparse matrix library.

1.Introduction

The cuSPARSE library contains a set of GPU-accelerated basic linear algebra subroutines for handling sparse matrices that perform significantly faster than CPU-only alternatives. Depending on the specific operation, the library targets matrices with sparsity ratios between 70% and 99.9%. It is implemented on top of the NVIDIA® CUDA™ runtime (which is part of the CUDA Toolkit) and is designed to be called from C and C++.

See also cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication

cuSPARSE Release Notes: cuda-toolkit-release-notes

cuSPARSE GitHub Samples: CUDALibrarySamples

NVIDIA Developer Forum: GPU-Accelerated Libraries

Provide Feedback: Math-Libs-Feedback@nvidia.com

The library routines provide the following functionalities:

  • Operations between a sparse vector and a dense vector: sum, dot product, scatter, gather

  • Operations between a dense matrix and a sparse vector: multiplication

  • Operations between a sparse matrix and a dense vector: multiplication, triangular solver, tridiagonal solver, pentadiagonal solver

  • Operations between a sparse matrix and a dense matrix: multiplication, triangular solver, tridiagonal solver, pentadiagonal solver

  • Operations between a sparse matrix and a sparse matrix: sum, multiplication

  • Operations between dense matrices with a sparse matrix as output: multiplication

  • Sparse matrix preconditioners: Incomplete Cholesky Factorization (level 0), Incomplete LU Factorization (level 0)

  • Reordering and conversion operations between different sparse matrix storage formats


1.1.Library Organization and Features

The cuSPARSE library is organized in two sets of APIs:

  • The Legacy APIs, inspired by the Sparse BLAS standard, provide a limited set of functionalities and will not be improved in future releases, although standard maintenance is still ensured. Some routines in this category could be deprecated and removed in the short term. A replacement will be provided for the most important of them during the deprecation process.

  • The Generic APIs provide the standard interface layer of cuSPARSE. They allow computing the most common sparse linear algebra operations, such as sparse matrix-vector (SpMV) and sparse matrix-matrix multiplication (SpMM), in a flexible way. The new APIs have the following capabilities and features:

    • Set matrix data layouts, number of batches, and storage formats (for example, CSR, COO, and so on).

    • Set input/output/compute data types. This also allows mixed data-type computation.

    • Set types of sparse vector/matrix indices (e.g. 32-bit, 64-bit).

    • Choose the algorithm for the computation.

    • Guarantee external device memory for internal operations.

    • Provide extensive consistency checks across input matrices and vectors. This includes the validation of sizes, data types, layout, allowed operations, etc.

    • Provide constant descriptors for vector and matrix inputs to support a const-safe interface and guarantee that the APIs do not modify their inputs.


1.2.Static Library Support

Starting with CUDA 6.5, the cuSPARSE library is also delivered in a static form as libcusparse_static.a on Linux.

For example, to compile a small application using cuSPARSE against the dynamic library, the following command can be used:

nvcc my_cusparse_app.cu -lcusparse -o my_cusparse_app

Whereas to compile against the static library, the following command has to be used:

nvcc my_cusparse_app.cu -lcusparse_static -o my_cusparse_app

It is also possible to use the native host C++ compiler. Depending on the host operating system, some additional libraries such as pthread or dl might be needed on the link line. The following command is suggested on Linux:

gcc my_cusparse_app.c -lcusparse_static -lcudart_static -lpthread -ldl -I <cuda-toolkit-path>/include -L <cuda-toolkit-path>/lib64 -o my_cusparse_app

Note that in the latter case, the library cuda is not needed. The CUDA Runtime will try to open the cuda library explicitly if needed. On a system that does not have the CUDA driver installed, this allows the application to manage the issue gracefully and potentially run if a CPU-only path is available.


1.3.Library Dependencies

Starting with CUDA 12.0, cuSPARSE depends on the nvJitLink library for JIT (Just-In-Time) LTO (Link-Time-Optimization) capabilities; refer to the cusparseSpMMOp() APIs for more information.

If the user links to the dynamic library, the environment variables for loading the libraries at run-time (such as LD_LIBRARY_PATH on Linux and PATH on Windows) must include the path where libnvjitlink.so is located. If it is in the same directory as cuSPARSE, the user doesn’t need to take any action.

If linking to the static library, the user needs to link with -lnvjitlink and set the environment variables used to locate libraries at compile-time (LIBRARY_PATH/PATH) accordingly.



2.Using the cuSPARSE API

This chapter describes how to use the cuSPARSE library API. It is not a reference for the cuSPARSE API data types and functions; that is provided in subsequent chapters.

2.1.APIs Usage Notes

The cuSPARSE library allows developers to access the computational resources of the NVIDIA graphics processing unit (GPU).

  • The cuSPARSE APIs assume that input and output data (vectors and matrices) reside in GPU (device) memory.

  • The input and output scalars (e.g. \(\alpha\) and \(\beta\)) can be passed by reference on the host or the device, instead of only being allowed to be passed by value on the host. This allows library functions to execute asynchronously using streams even when the scalars are generated by a previous kernel, resulting in maximum parallelism.

  • The handle to the cuSPARSE library context is initialized using the function cusparseCreate() and is explicitly passed to every subsequent library function call. This allows the user to have more control over the library setup when using multiple host threads and multiple GPUs.

  • The error status cusparseStatus_t is returned by all cuSPARSE library function calls.

It is the responsibility of the developer to allocate memory and to copy data between GPU memory and CPU memory using standard CUDA runtime API routines, such as cudaMalloc(), cudaFree(), cudaMemcpy(), and cudaMemcpyAsync().

The cuSPARSE library functions are executed asynchronously with respect to the host and may return control to the application on the host before the result is ready. Developers can use the cudaDeviceSynchronize() function to ensure that the execution of a particular cuSPARSE library routine has completed.

A developer can also use the cudaMemcpy() routine to copy data from the device to the host and vice versa, using the cudaMemcpyDeviceToHost and cudaMemcpyHostToDevice parameters, respectively. In this case there is no need to add a call to cudaDeviceSynchronize() because the call to cudaMemcpy() with the above parameters is blocking and completes only when the results are ready on the host.


2.2.Deprecated APIs

The cuSPARSE library documentation explicitly indicates the set of APIs/enumerators/data structures that are deprecated. The library policy for deprecated APIs is the following:

  1. An API is marked [[DEPRECATED]] on a release X.Y (e.g. 11.2)

    • The documentation indicates a replacement if available

    • Otherwise, the functionality will not be maintained in the future

  2. The API will be removed in the release X+1.0 (e.g. 12.0)

Correctness bugs are still addressed even for deprecated APIs, while fixes for performance issues are not guaranteed.

In addition to the documentation, deprecated APIs generate a compile-time warning on most platforms when used. Deprecation warnings can be disabled by defining the macro DISABLE_CUSPARSE_DEPRECATED before including cusparse.h or by passing the flag -DDISABLE_CUSPARSE_DEPRECATED to the compiler.

2.3.Thread Safety

The library is thread safe. It is safe to call any function from any thread at any time, as long as none of the data it is using is being written to from another thread at the same time. Whether or not a cuSPARSE function writes to an object is typically indicated via const parameters.

It is not recommended to share the same cuSPARSE handle across multiple threads. It is possible to do so, but changes to the handle (e.g. set stream or destroy) will affect all threads and introduce global synchronization issues.


2.4.Result Reproducibility

The design of cuSPARSE prioritizes performance over bit-wise reproducibility.

Operations using transpose or conjugate-transpose cusparseOperation_t have no reproducibility guarantees.

For the remaining operations, performing the same API call twice with the exact same arguments, on the same machine, with the same executable will produce bit-wise identical results. This bit-wise reproducibility can be disrupted by changes to: hardware, CUDA drivers, cuSPARSE version, memory alignment of the data, or algorithm selection.


2.5.NaN and Inf Propagation

Floating-point numbers have special values for NaN (not-a-number) and Inf (infinity). Functions in cuSPARSE make no guarantees about the propagation of NaN and Inf.

The cuSPARSE algorithms evaluate assuming all finite floating-point values. NaN and Inf appear in the output only if the algorithms happen to generate or propagate them. Because the algorithms are subject to change based on toolkit version and runtime considerations, so too are the propagation behaviours of NaN and Inf.

NaN propagation is different in cuSPARSE than in typical dense numerical linear algebra, such as cuBLAS. The dot product between vectors [0,1,0] and [1,1,NaN] is NaN when using typical dense numerical algorithms, but will be 1.0 with typical sparse numerical algorithms.
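The dot-product example above can be reproduced on the host. The following is a minimal C++ sketch, not cuSPARSE code: the function names dense_dot and sparse_dot are hypothetical, and the sparse version simply assumes that the first vector is stored as (indices, values) pairs so that zero entries are never multiplied.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Dense dot product: every element of both vectors takes part,
// so the term 0 * NaN evaluates to NaN and contaminates the sum.
double dense_dot(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Sparse dot product (hypothetical helper): only the stored nonzeros of x
// are visited, so entries of y at positions where x is zero are never read.
double sparse_dot(const std::vector<int>& x_indices,
                  const std::vector<double>& x_values,
                  const std::vector<double>& y) {
    assert(x_indices.size() == x_values.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < x_indices.size(); ++k) {
        sum += x_values[k] * y[static_cast<std::size_t>(x_indices[k])];
    }
    return sum;
}
```

With x = [0,1,0] stored as the single pair (index 1, value 1.0) and y = [1,1,NaN], dense_dot returns NaN while sparse_dot returns 1.0, because the NaN entry of y is never touched.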


2.6.Parallelism with Streams

If the application performs several small independent computations, or if it makes data transfers in parallel with the computation, CUDA streams can be used to overlap these tasks.

The application can conceptually associate a stream with each task. To achieve the overlap of computation between the tasks, the developer should create CUDA streams using the function cudaStreamCreate() and set the stream to be used by each individual cuSPARSE library routine by calling cusparseSetStream() just before calling the actual cuSPARSE routine. Then, computations performed in separate streams are overlapped automatically on the GPU, when possible. This approach is especially useful when the computation performed by a single task is relatively small and is not enough to fill the GPU with work, or when there is a data transfer that can be performed in parallel with the computation.

When streams are used, we recommend using the new cuSPARSE API with scalar parameters and results passed by reference in the device memory to achieve maximum computational overlap.

Although a developer can create many streams, in practice it is not possible to have more than 16 concurrent kernels executing at the same time.


2.7.Compatibility and Versioning

The cuSPARSE APIs are intended to be backward compatible at the source level with future releases (unless stated otherwise in the release notes of a specific future release). In other words, if a program uses cuSPARSE, it should continue to compile and work correctly with newer versions of cuSPARSE without source code changes. cuSPARSE is not guaranteed to be backward compatible at the binary level. Using different versions of the cusparse.h header file and shared library is not supported. Using different versions of cuSPARSE and the CUDA runtime is not supported.


The library uses the standard semantic versioning convention to identify different releases.

The version takes the form of four fields joined by periods: MAJOR.MINOR.PATCH.BUILD

These version fields are incremented based on the following rules:

  • MAJOR: API breaking changes or new CUDA major version (breaking changes at lower level, e.g. drivers, compilers, libraries)

  • MINOR: new APIs and functionalities

  • PATCH: Bug fixes or performance improvements (or* new CUDA release)

  • BUILD: Internal build number

* Different CUDA toolkit releases ensure distinct library versions even if there are no changes at library level.
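As an illustration of the four-field scheme, the following host-side C++ sketch splits a MAJOR.MINOR.PATCH.BUILD string into its fields. It is not part of cuSPARSE: LibraryVersion, parse_version, and may_break_api are hypothetical names, and the only rule encoded is the one stated above, that API-breaking changes are signalled by a change of MAJOR.

```cpp
#include <cassert>
#include <cstdio>

// The four fields of a MAJOR.MINOR.PATCH.BUILD version string.
struct LibraryVersion {
    int major_field;
    int minor_field;
    int patch_field;
    int build_field;
};

// Parses "MAJOR.MINOR.PATCH.BUILD". Returns false when the text does not
// consist of exactly four non-negative integer fields joined by periods.
bool parse_version(const char* text, LibraryVersion* out) {
    assert(text != nullptr && out != nullptr);
    LibraryVersion v{};
    char extra = '\0';
    // The trailing %c only matches if something follows the fourth field.
    const int matched = std::sscanf(text, "%d.%d.%d.%d%c",
                                    &v.major_field, &v.minor_field,
                                    &v.patch_field, &v.build_field, &extra);
    if (matched != 4) {
        return false;
    }
    if (v.major_field < 0 || v.minor_field < 0 ||
        v.patch_field < 0 || v.build_field < 0) {
        return false;
    }
    *out = v;
    return true;
}

// Per the rules above, only a MAJOR change signals API-breaking changes.
bool may_break_api(const LibraryVersion& from, const LibraryVersion& to) {
    return from.major_field != to.major_field;
}
```

The version strings used with this sketch are made up for illustration; real version values should be taken from the installed toolkit.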


2.8.Optimization Notes

Most of the cuSPARSE routines can be optimized by exploiting CUDA Graphs capture and Hardware Memory Compression features.

In more detail, a single cuSPARSE call or a sequence of calls can be captured by a CUDA Graph and executed at a later time. This minimizes kernel launch overhead and allows the CUDA runtime to optimize the whole workflow. A full example of CUDA graphs capture applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - CUDA Graph.

Secondly, the data types and functionalities involved in cuSPARSE are suitable for Hardware Memory Compression, available in Ampere GPU devices (compute capability 8.0) or above. The feature allows memory compression for data with enough zero bytes without loss of information. The device memory must be allocated with the CUDA driver APIs. A full example of Hardware Memory Compression applied to a cuSPARSE routine can be found in cuSPARSE Library Samples - Memory Compression.



3.cuSPARSE Storage Formats

The cuSPARSE library supports dense and sparse vector, and dense and sparse matrix formats.

3.1.Index Base

The library supports zero- and one-based indexing to ensure compatibility with the C/C++ and Fortran languages, respectively. The index base is selected through the cusparseIndexBase_t type.


3.2.Vector Formats

This section describes dense and sparse vector formats.

3.2.1.Dense Vector Format

Dense vectors are represented with a single data array that is stored linearly in memory, such as the following \(7 \times 1\) dense vector.

[Figure: Dense vector representation]


3.2.2.Sparse Vector Format

Sparse vectors are represented with two arrays.

  • The values array stores the nonzero values from the equivalent array in dense format.

  • The indices array represents the positions of the corresponding nonzero values in the equivalent array in dense format.

For example, the dense vector in section 3.2.1 can be stored as a sparse vector with zero-based or one-based indexing.

[Figure: Sparse vector representation]

Note

The cuSPARSE routines assume that the indices are provided in increasing order and that each index appears only once. Otherwise, the correctness of the computation is not always ensured.
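The two-array representation and the ordering requirement can be sketched on the host as follows. This is an illustrative C++ sketch, not a cuSPARSE routine: SparseVector, dense_to_sparse, and has_valid_indices are hypothetical names, and the input values used with it are made up rather than taken from the figure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container mirroring the two arrays described above.
struct SparseVector {
    int size;                    // length of the equivalent dense vector
    int index_base;              // 0 (C/C++) or 1 (Fortran)
    std::vector<int> indices;    // positions of the nonzero values
    std::vector<double> values;  // the nonzero values themselves
};

// Keeps only the nonzero entries; indices come out in increasing order.
SparseVector dense_to_sparse(const std::vector<double>& dense, int index_base) {
    assert(index_base == 0 || index_base == 1);
    SparseVector sv{static_cast<int>(dense.size()), index_base, {}, {}};
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0) {
            sv.indices.push_back(static_cast<int>(i) + index_base);
            sv.values.push_back(dense[i]);
        }
    }
    return sv;
}

// Checks the assumption stated in the note: indices are strictly
// increasing (hence unique) and lie within the dense vector.
bool has_valid_indices(const SparseVector& sv) {
    if (sv.indices.size() != sv.values.size()) {
        return false;
    }
    int previous = sv.index_base - 1;
    for (int index : sv.indices) {
        if (index <= previous || index >= sv.size + sv.index_base) {
            return false;
        }
        previous = index;
    }
    return true;
}
```

For a dense input {1, 0, 0, 2, 3, 0, 4}, the indices array is {0, 3, 4, 6} with zero-based indexing and {1, 4, 5, 7} with one-based indexing; the values array is {1, 2, 3, 4} in both cases.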


3.3.Matrix Formats

Dense and several sparse formats for matrices are discussed in this section.

3.3.1.Dense Matrix Format

A dense matrix can be stored in either row-major or column-major memory layout (ordering) and is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The leading dimension, which must be

    • Greater than or equal to the number of columns in the row-major layout

    • Greater than or equal to the number of rows in the column-major layout

  • The pointer to the values array of length

    • \(rows \times leading\; dimension\) in the row-major layout

    • \(columns \times leading\; dimension\) in the column-major layout

The following figure represents a \(5 \times 2\) dense matrix with both memory layouts.

[Figure: Dense matrix representations]

The indices within the matrix represent the contiguous locations in memory.

The leading dimension is useful to represent a sub-matrix within the original one.

[Figure: Sub-matrix representations]
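The role of the leading dimension can be sketched with a few host-side helpers. This is an illustrative C++ sketch under the definitions above; Layout, dense_offset, values_length, and submatrix are hypothetical names, not cuSPARSE types or functions.

```cpp
#include <cassert>
#include <cstddef>

enum class Layout { RowMajor, ColumnMajor };

// Linear offset of element (row, col), both zero-based, for a given
// leading dimension.
std::size_t dense_offset(int row, int col, int leading_dimension, Layout layout) {
    assert(row >= 0 && col >= 0 && leading_dimension > 0);
    const std::size_t r = static_cast<std::size_t>(row);
    const std::size_t c = static_cast<std::size_t>(col);
    const std::size_t ld = static_cast<std::size_t>(leading_dimension);
    return layout == Layout::RowMajor ? r * ld + c : c * ld + r;
}

// Length of the values array: rows x ld (row-major) or columns x ld
// (column-major), as listed above.
std::size_t values_length(int rows, int cols, int leading_dimension, Layout layout) {
    assert(leading_dimension >= (layout == Layout::RowMajor ? cols : rows));
    const std::size_t ld = static_cast<std::size_t>(leading_dimension);
    return ld * static_cast<std::size_t>(layout == Layout::RowMajor ? rows : cols);
}

// A sub-matrix is the parent storage viewed from a different starting
// element while keeping the parent's leading dimension.
const double* submatrix(const double* parent, int first_row, int first_col,
                        int leading_dimension, Layout layout) {
    return parent + dense_offset(first_row, first_col, leading_dimension, layout);
}
```

For the \(5 \times 2\) case, element (3, 1) sits at offset 7 in row-major layout with leading dimension 2, and at offset 8 in column-major layout with leading dimension 5. A sub-matrix obtained through submatrix() is then indexed with dense_offset() and the parent's leading dimension.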


3.3.2.Coordinate (COO)

A sparse matrix stored in COO format is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The pointer to the row indices array of length nnz that contains the row indices of the corresponding elements in the values array.

  • The pointer to the column indices array of length nnz that contains the column indices of the corresponding elements in the values array.

  • The pointer to the values array of length nnz that holds all nonzero values of the matrix in row-major ordering.

  • Each entry of the COO representation consists of a <row, column> pair.

  • The COO format is assumed to be sorted by row.

The following example shows a \(5 \times 4\) matrix represented in COO format.

[Figures: COO representation (coo.png) and its one-based variant (coo_one_base.png)]

Note

cuSPARSE supports both sorted and unsorted column indices within a given row.

Note

If the column indices within a given row are not unique, the correctness of the computation is not always ensured.

Given an entry in the COO format (zero-base), the corresponding position in the dense matrix is computed as:

// row-major
rows_indices[i] * leading_dimension + column_indices[i]

// column-major
column_indices[i] * leading_dimension + rows_indices[i]
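The position formulas above can be exercised with a small host-side conversion. This is an illustrative C++ sketch, not a cuSPARSE routine: coo_to_dense is a hypothetical helper, and it assumes zero-based indices and a tightly packed dense output (leading dimension equal to the number of columns in row-major layout, or the number of rows in column-major layout).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scatters zero-based COO entries into a tightly packed dense matrix.
std::vector<double> coo_to_dense(int rows, int cols,
                                 const std::vector<int>& row_indices,
                                 const std::vector<int>& column_indices,
                                 const std::vector<double>& values,
                                 bool row_major) {
    assert(rows >= 0 && cols >= 0);
    assert(row_indices.size() == values.size());
    assert(column_indices.size() == values.size());
    const std::size_t ld = static_cast<std::size_t>(row_major ? cols : rows);
    std::vector<double> dense(static_cast<std::size_t>(rows) * cols, 0.0);
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(0 <= row_indices[i] && row_indices[i] < rows);
        assert(0 <= column_indices[i] && column_indices[i] < cols);
        const std::size_t r = static_cast<std::size_t>(row_indices[i]);
        const std::size_t c = static_cast<std::size_t>(column_indices[i]);
        // Same formulas as above, one per layout.
        const std::size_t position = row_major ? r * ld + c : c * ld + r;
        dense[position] = values[i];
    }
    return dense;
}
```

For a \(2 \times 3\) matrix with entries (0,0)=1, (0,2)=2, and (1,1)=3, the row-major result is {1, 0, 2, 0, 3, 0} and the column-major result is {1, 0, 0, 3, 2, 0}.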

3.3.3.Compressed Sparse Row (CSR)

The CSR format is similar to COO, where the row indices are compressed and replaced by an array of offsets.

A sparse matrix stored in CSR format is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The pointer to the row offsets array of length number of rows + 1 that represents the starting position of each row in the column indices and values arrays.

  • The pointer to the column indices array of length nnz that contains the column indices of the corresponding elements in the values array.

  • The pointer to the values array of length nnz that holds all nonzero values of the matrix in row-major ordering.

The following example shows a \(5 \times 4\) matrix represented in CSR format.

[Figures: CSR representation (csr.png) and its one-based variant (csr_one_base.png)]

Note

cuSPARSE supports both sorted and unsorted column indices within a given row.

Note

If the column indices within a given row are not unique, the correctness of the computation is not always ensured.

Given an entry in the CSR format (zero-base), the corresponding position in the dense matrix is computed as:

// row-major
row * leading_dimension + column_indices[row_offsets[row] + k]

// column-major
column_indices[row_offsets[row] + k] * leading_dimension + row
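The relationship between COO row indices and CSR row offsets can be sketched on the host. This is an illustrative C++ sketch with hypothetical helper names (compress_row_indices, expand_row_offsets); it assumes zero-based indexing and COO entries already sorted by row, so that the column indices and values arrays can be reused unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Compresses COO row indices (sorted by row, zero-based) into CSR row
// offsets: count the entries per row, then take a running sum.
std::vector<int> compress_row_indices(const std::vector<int>& coo_rows, int num_rows) {
    assert(num_rows >= 0);
    std::vector<int> offsets(static_cast<std::size_t>(num_rows) + 1, 0);
    for (int r : coo_rows) {
        assert(0 <= r && r < num_rows);
        ++offsets[static_cast<std::size_t>(r) + 1];
    }
    for (std::size_t r = 0; r + 1 < offsets.size(); ++r) {
        offsets[r + 1] += offsets[r];
    }
    return offsets;
}

// Inverse operation: entry k belongs to the row whose offset range
// [offsets[row], offsets[row + 1]) contains k.
std::vector<int> expand_row_offsets(const std::vector<int>& offsets) {
    assert(!offsets.empty());
    std::vector<int> coo_rows;
    for (std::size_t row = 0; row + 1 < offsets.size(); ++row) {
        for (int k = offsets[row]; k < offsets[row + 1]; ++k) {
            coo_rows.push_back(static_cast<int>(row));
        }
    }
    return coo_rows;
}
```

For row indices {0, 0, 1, 3, 3, 3} in a matrix with 4 rows, the offsets are {0, 2, 3, 3, 6}: the empty row 2 shows up as two equal consecutive offsets, and the last offset equals nnz.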

3.3.4.Compressed Sparse Column (CSC)

The CSC format is similar to COO, where the column indices are compressed and replaced by an array of offsets.

A sparse matrix stored in CSC format is represented by the following parameters.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The pointer to the column offsets array of length number of columns + 1 that represents the starting position of each column in the row indices and values arrays.

  • The pointer to the row indices array of length nnz that contains the row indices of the corresponding elements in the values array.

  • The pointer to the values array of length nnz that holds all nonzero values of the matrix in column-major ordering.

The following example shows a \(5 \times 4\) matrix represented in CSC format.

[Figures: CSC representation (csc.png) and its one-based variant (csc_one_base.png)]

Note

The CSR format has exactly the same memory layout as its transpose in CSC format (and vice versa).

Note

cuSPARSE supports both sorted and unsorted row indices within a given column.

Note

If the row indices within a given column are not unique, the correctness of the computation is not always ensured.

Given an entry in the CSC format (zero-base), the corresponding position in the dense matrix is computed as:

// row-major
row_indices[column_offsets[column] + k] * leading_dimension + column

// column-major
column * leading_dimension + row_indices[column_offsets[column] + k]
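The note that a CSR matrix has the same memory layout as its transpose in CSC can be checked with two small host-side expansions. This is an illustrative C++ sketch with hypothetical helper names (csr_to_dense, csc_to_dense); both assume zero-based indexing and produce a tightly packed row-major dense matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands zero-based CSR arrays into a row-major dense matrix.
std::vector<double> csr_to_dense(int rows, int cols,
                                 const std::vector<int>& row_offsets,
                                 const std::vector<int>& column_indices,
                                 const std::vector<double>& values) {
    assert(row_offsets.size() == static_cast<std::size_t>(rows) + 1);
    std::vector<double> dense(static_cast<std::size_t>(rows) * cols, 0.0);
    for (int row = 0; row < rows; ++row) {
        for (int k = row_offsets[row]; k < row_offsets[row + 1]; ++k) {
            dense[static_cast<std::size_t>(row) * cols + column_indices[k]] = values[k];
        }
    }
    return dense;
}

// Expands zero-based CSC arrays into a row-major dense matrix.
std::vector<double> csc_to_dense(int rows, int cols,
                                 const std::vector<int>& column_offsets,
                                 const std::vector<int>& row_indices,
                                 const std::vector<double>& values) {
    assert(column_offsets.size() == static_cast<std::size_t>(cols) + 1);
    std::vector<double> dense(static_cast<std::size_t>(rows) * cols, 0.0);
    for (int column = 0; column < cols; ++column) {
        for (int k = column_offsets[column]; k < column_offsets[column + 1]; ++k) {
            dense[static_cast<std::size_t>(row_indices[k]) * cols + column] = values[k];
        }
    }
    return dense;
}
```

Passing the CSR arrays of a \(2 \times 3\) matrix to csc_to_dense with the dimensions swapped (3 rows, 2 columns) yields the transpose of the matrix, with no change to the three arrays.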

3.3.5.Sliced Ellpack (SELL)

The Sliced Ellpack format is standardized and well-known as the state of the art. This format significantly improves the performance of problems that involve low variability in the number of nonzero elements per row.

A matrix in the Sliced Ellpack format is divided into slices of an exact number of rows (\(sliceSize\)), defined by the user. The maximum row length (i.e., the maximum number of non-zeros per row) is found for each slice, and every row in the slice is padded to the maximum row length. The value -1 is used for padding.

An \(m \times n\) sparse matrix \(A\) is equivalent to a sliced sparse matrix \(A_{s}\) with \(nslices = \left \lceil{\frac{m}{sliceSize}}\right \rceil\) slice rows and \(n\) columns. To improve memory coalescing and memory utilization, each slice is stored in column-major order.

A sparse matrix stored in SELL format is represented by the following parameters.

  • The number of slices.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of non-zero elements (nnz) in the matrix.

  • The total number of elements (sellValuesSize), including non-zero values and padded elements.

  • The pointer to the slice offsets array of length \(nslices + 1\) that holds the offsets of the slices into the column indices and values arrays.

  • The pointer to the column indices array of length sellValuesSize that contains the column indices of the corresponding elements in the values array. The column indices are stored in column-major layout. The value -1 refers to padding.

  • The pointer to the values array of length sellValuesSize that holds all non-zero values and padding in column-major layout.

The following example shows a \(5 \times 4\) matrix represented in SELL format.

[Figures: SELL representation (sell.png) and its one-based variant (sell_one_base.png)]
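The slicing and padding steps can be sketched as a host-side conversion from CSR. This is an illustrative C++ sketch, not the cuSPARSE implementation: SellMatrix and csr_to_sell are hypothetical names, indexing is zero-based, padded values are set to 0, and a final partial slice is assumed to be padded with empty rows up to sliceSize.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side container for the SELL arrays described above.
struct SellMatrix {
    int rows;
    int cols;
    int slice_size;
    std::vector<int> slice_offsets;  // nslices + 1 entries
    std::vector<int> col_indices;    // sellValuesSize entries, -1 = padding
    std::vector<double> values;      // sellValuesSize entries
};

SellMatrix csr_to_sell(int rows, int cols, int slice_size,
                       const std::vector<int>& row_offsets,
                       const std::vector<int>& column_indices,
                       const std::vector<double>& csr_values) {
    assert(slice_size > 0);
    assert(row_offsets.size() == static_cast<std::size_t>(rows) + 1);
    SellMatrix sell{rows, cols, slice_size, {0}, {}, {}};
    const int num_slices = (rows + slice_size - 1) / slice_size;
    for (int s = 0; s < num_slices; ++s) {
        const int first = s * slice_size;
        const int last = std::min(first + slice_size, rows);
        // The slice width is the longest row in the slice.
        int width = 0;
        for (int r = first; r < last; ++r) {
            width = std::max(width, row_offsets[r + 1] - row_offsets[r]);
        }
        const int base = sell.slice_offsets.back();
        const int end = base + width * slice_size;
        sell.col_indices.resize(static_cast<std::size_t>(end), -1);
        sell.values.resize(static_cast<std::size_t>(end), 0.0);
        for (int r = first; r < last; ++r) {
            const int length = row_offsets[r + 1] - row_offsets[r];
            for (int j = 0; j < length; ++j) {
                // Column-major inside the slice: j-th entries of all rows
                // of the slice are adjacent in memory.
                const std::size_t dst =
                    static_cast<std::size_t>(base + j * slice_size + (r - first));
                sell.col_indices[dst] = column_indices[row_offsets[r] + j];
                sell.values[dst] = csr_values[row_offsets[r] + j];
            }
        }
        sell.slice_offsets.push_back(end);
    }
    return sell;
}
```

For a \(3 \times 3\) matrix with rows {(0, 1.0), (2, 2.0)}, {(1, 3.0)}, and {(0, 4.0)} and sliceSize = 2, the sketch produces slice offsets {0, 4, 6} and column indices {0, 1, 2, -1, 0, -1}.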

3.3.6.Block Sparse Row (BSR)

The BSR format is similar to CSR, where the column indices represent two-dimensional blocks instead of a single matrix entry.

A matrix in the Block Sparse Row format is organized into blocks of size \(blockSize\), defined by the user.

An \(m \times n\) sparse matrix \(A\) is equivalent to a block sparse matrix \(A_{B}\): \(mb \times nb\) with \(mb = \frac{m}{blockSize}\) block rows and \(nb = \frac{n}{blockSize}\) block columns. If \(m\) or \(n\) is not a multiple of \(blockSize\), the user needs to pad the matrix with zeros.

Note

cuSPARSE currently supports only square blocks.

The BSR format stores the blocks in row-major ordering. However, the internal storage format of blocks can be column-major (cusparseDirection_t=CUSPARSE_DIRECTION_COLUMN) or row-major (cusparseDirection_t=CUSPARSE_DIRECTION_ROW), independently of the base index.

A sparse matrix stored in BSR format is represented by the following parameters.

  • The block size.

  • The number of row blocks in the matrix.

  • The number of column blocks in the matrix.

  • The number of non-zero blocks (nnzb) in the matrix.

  • The pointer to the row block offsets array of length number of row blocks + 1 that represents the starting position of each row block in the column block indices and values arrays.

  • The pointer to the column block indices array of length nnzb that contains the location of the corresponding elements in the values array.

  • The pointer to the values array of length \(nnzb \times blockSize^{2}\) that holds all the elements of the nonzero blocks of the matrix.

The following example shows a \(4 \times 7\) matrix represented in BSR format.

[Figures: BSR representation (bsr.png) and its one-based variant (bsr_one_base.png)]
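The effect of the block storage direction can be sketched with a host-side expansion of BSR into a dense matrix. This is an illustrative C++ sketch, not a cuSPARSE routine: BlockDirection and bsr_to_dense are hypothetical names standing in for the two cusparseDirection_t settings, indexing is zero-based, and each block is assumed to occupy blockSize × blockSize consecutive entries of the values array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for the row-major / column-major block storage directions.
enum class BlockDirection { Row, Column };

// Expands zero-based BSR arrays into a row-major dense matrix of
// (mb * block_size) x (nb * block_size) elements.
std::vector<double> bsr_to_dense(int mb, int nb, int block_size,
                                 const std::vector<int>& row_block_offsets,
                                 const std::vector<int>& col_block_indices,
                                 const std::vector<double>& values,
                                 BlockDirection direction) {
    assert(mb >= 0 && nb >= 0 && block_size > 0);
    assert(row_block_offsets.size() == static_cast<std::size_t>(mb) + 1);
    const std::size_t bs = static_cast<std::size_t>(block_size);
    assert(values.size() == col_block_indices.size() * bs * bs);
    const std::size_t cols = static_cast<std::size_t>(nb) * bs;
    std::vector<double> dense(static_cast<std::size_t>(mb) * bs * cols, 0.0);
    for (int bi = 0; bi < mb; ++bi) {
        for (int k = row_block_offsets[bi]; k < row_block_offsets[bi + 1]; ++k) {
            const int bj = col_block_indices[k];
            assert(0 <= bj && bj < nb);
            const double* block = values.data() + static_cast<std::size_t>(k) * bs * bs;
            for (std::size_t r = 0; r < bs; ++r) {
                for (std::size_t c = 0; c < bs; ++c) {
                    // Only the layout inside the block depends on the direction.
                    const double v = direction == BlockDirection::Row
                                         ? block[r * bs + c]
                                         : block[c * bs + r];
                    dense[(bi * bs + r) * cols + bj * bs + c] = v;
                }
            }
        }
    }
    return dense;
}
```

The same \(4 \times 4\) matrix with two \(2 \times 2\) blocks can be described with values {1, 2, 3, 4, 5, 6, 7, 8} in row direction or {1, 3, 2, 4, 5, 7, 6, 8} in column direction; the offsets and block column indices are identical in both cases.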

3.3.7.Blocked Ellpack (BLOCKED-ELL)

The Blocked Ellpack format is similar to the standard Ellpack, where the column indices represent two-dimensional blocks instead of a single matrix entry.

A matrix in the Blocked Ellpack format is organized into blocks of size \(blockSize\), defined by the user. The number of columns per row \(nEllCols\) is also defined by the user (\(nEllCols \le n\)).

An \(m \times n\) sparse matrix \(A\) is equivalent to a Blocked-ELL matrix \(A_{B}\): \(mb \times nb\) with \(mb = \left \lceil{\frac{m}{blockSize}}\right \rceil\) block rows, and \(nb = \left \lceil{\frac{nEllCols}{blockSize}}\right \rceil\) block columns. If \(m\) or \(n\) is not a multiple of \(blockSize\), then the remaining elements are zero.

A sparse matrix stored in Blocked-ELL format is represented by the following parameters.

  • The block size.

  • The number of rows in the matrix.

  • The number of columns in the matrix.

  • The number of columns per row (nEllCols) in the matrix.

  • The pointer to the column block indices array of length \(mb \times nb\) that contains the location of the corresponding elements in the values array. Empty blocks can be represented with the index -1.

  • The pointer to the values array of length \(m \times nEllCols\) that holds all nonzero values of the matrix in row-major ordering.

The following example shows a \(9 \times 9\) matrix represented in Blocked-ELL format.

[Figures: Blocked-ELL representation (blockedell.png) and its one-based variant (blockedell_one_base.png)]
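The way the column block indices and the values array fit together can be sketched with a host-side expansion. This is an illustrative C++ sketch, not a cuSPARSE routine: blocked_ell_to_dense is a hypothetical helper that assumes zero-based indexing, dimensions that are multiples of the block size, and a values array laid out as a plain row-major \(m \times nEllCols\) matrix, as described above.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Expands zero-based Blocked-ELL arrays into a row-major dense matrix.
std::vector<double> blocked_ell_to_dense(int rows, int cols, int block_size, int ell_cols,
                                         const std::vector<int>& col_block_indices,
                                         const std::vector<double>& values) {
    assert(block_size > 0 && rows >= 0 && cols >= 0 && ell_cols >= 0);
    // Simplifying assumption of this sketch: no partial blocks.
    assert(rows % block_size == 0 && cols % block_size == 0 && ell_cols % block_size == 0);
    const int mb = rows / block_size;
    const int ell_blocks = ell_cols / block_size;  // block columns per block row
    assert(col_block_indices.size() == static_cast<std::size_t>(mb) * ell_blocks);
    assert(values.size() == static_cast<std::size_t>(rows) * ell_cols);
    std::vector<double> dense(static_cast<std::size_t>(rows) * cols, 0.0);
    for (int i = 0; i < rows; ++i) {
        for (int e = 0; e < ell_cols; ++e) {
            const int slot = (i / block_size) * ell_blocks + e / block_size;
            const int block_col = col_block_indices[static_cast<std::size_t>(slot)];
            if (block_col < 0) {
                continue;  // -1 marks an empty (padding) block
            }
            const int j = block_col * block_size + e % block_size;
            assert(j < cols);
            dense[static_cast<std::size_t>(i) * cols + j] =
                values[static_cast<std::size_t>(i) * ell_cols + e];
        }
    }
    return dense;
}
```

Every block row stores the same number of block slots; block rows with fewer nonzero blocks mark the unused slots with -1, and the corresponding entries of the values array are ignored.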

3.3.8.Extended BSR Format (BSRX) [DEPRECATED]

BSRX is the same as the BSR format, but the array bsrRowPtrA is separated into two parts. The first nonzero block of each row is still specified by the array bsrRowPtrA, which is the same as in BSR, but the position next to the last nonzero block of each row is specified by the array bsrEndPtrA. In short, the BSRX format is a 4-vector variant of the BSR format.

Matrix A is represented in BSRX format by the following parameters.

| Parameter | Type | Description |
|---|---|---|
| blockDim | (integer) | Block dimension of matrix A. |
| mb | (integer) | The number of block rows of A. |
| nb | (integer) | The number of block columns of A. |
| nnzb | (integer) | The number of nonzero blocks in the matrix A. |
| bsrValA | (pointer) | Points to the data array of length \(nnzb \ast blockDim^{2}\) that holds all the elements of the nonzero blocks of A. The block elements are stored in either column-major order or row-major order. |
| bsrRowPtrA | (pointer) | Points to the integer array of length mb that holds indices into the arrays bsrColIndA and bsrValA; bsrRowPtrA(i) is the position of the first nonzero block of the ith block row in bsrColIndA and bsrValA. |
| bsrEndPtrA | (pointer) | Points to the integer array of length mb that holds indices into the arrays bsrColIndA and bsrValA; bsrEndPtrA(i) is the position next to the last nonzero block of the ith block row in bsrColIndA and bsrValA. |
| bsrColIndA | (pointer) | Points to the integer array of length nnzb that contains the column indices of the corresponding blocks in array bsrValA. |

A simple conversion between BSR and BSRX can be done as follows. Suppose the developer has a \(2 \times 3\) block sparse matrix \(A_{b}\) represented as shown.

\[\begin{split}A_{b} = \begin{bmatrix} A_{00} & A_{01} & A_{02} \\ A_{10} & A_{11} & A_{12} \\ \end{bmatrix}\end{split}\]

Assume it has this BSR format:

\[\begin{split}\begin{matrix} \text{bsrValA of BSR} & = & \begin{bmatrix} A_{00} & A_{01} & A_{10} & A_{11} & A_{12} \\ \end{bmatrix} \\ \text{bsrRowPtrA of BSR} & = & \begin{bmatrix} {0\phantom{.0}} & {2\phantom{.0}} & 5 \\ \end{bmatrix} \\ \text{bsrColIndA of BSR} & = & \begin{bmatrix} {0\phantom{.0}} & {1\phantom{.0}} & {0\phantom{.0}} & {1\phantom{.0}} & 2 \\ \end{bmatrix} \\ \end{matrix}\end{split}\]

The bsrRowPtrA of the BSRX format is simply the first two elements of the bsrRowPtrA of the BSR format. The bsrEndPtrA of the BSRX format is the last two elements of the bsrRowPtrA of the BSR format.

\[\begin{split}\begin{matrix} \text{bsrRowPtrA of BSRX} & = & \begin{bmatrix} {0\phantom{.0}} & 2 \\ \end{bmatrix} \\ \text{bsrEndPtrA of BSRX} & = & \begin{bmatrix} {2\phantom{.0}} & 5 \\ \end{bmatrix} \\ \end{matrix}\end{split}\]

The advantage of the BSRX format is that the developer can specify a submatrix in the original BSR format by modifying bsrRowPtrA and bsrEndPtrA while keeping bsrColIndA and bsrValA unchanged.

For example, to create another block matrix \(\widetilde{A} = \begin{bmatrix}O & O & O \\O & A_{11} & O \\\end{bmatrix}\) that is slightly different from \(A\), the developer can keep bsrColIndA and bsrValA, but reconstruct \(\widetilde{A}\) by properly setting bsrRowPtrA and bsrEndPtrA. The following 4-vector characterizes \(\widetilde{A}\).

\[\begin{split}\begin{matrix} {\text{bsrValA of }\widetilde{A}} & = & \begin{bmatrix} A_{00} & A_{01} & A_{10} & A_{11} & A_{12} \\ \end{bmatrix} \\ {\text{bsrColIndA of }\widetilde{A}} & = & \begin{bmatrix} {0\phantom{.0}} & {1\phantom{.0}} & {0\phantom{.0}} & {1\phantom{.0}} & 2 \\ \end{bmatrix} \\ {\text{bsrRowPtrA of }\widetilde{A}} & = & \begin{bmatrix} {0\phantom{.0}} & 3 \\ \end{bmatrix} \\ {\text{bsrEndPtrA of }\widetilde{A}} & = & \begin{bmatrix} {0\phantom{.0}} & 4 \\ \end{bmatrix} \\ \end{matrix}\end{split}\]
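The conversion and the submatrix selection above can be sketched on the host. This is an illustrative C++ sketch, not a cuSPARSE routine: BsrxPointers, bsr_to_bsrx, and block_columns_of_row are hypothetical names, and indexing is zero-based as in the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The two pointer arrays that replace bsrRowPtrA in the BSRX format.
struct BsrxPointers {
    std::vector<int> row_ptr;  // first nonzero block of each block row
    std::vector<int> end_ptr;  // position next to the last nonzero block
};

// Splits a BSR row pointer array of length mb + 1 into the two BSRX
// arrays of length mb: all but the last entry, and all but the first.
BsrxPointers bsr_to_bsrx(const std::vector<int>& bsr_row_ptr) {
    assert(!bsr_row_ptr.empty());
    BsrxPointers p;
    p.row_ptr.assign(bsr_row_ptr.begin(), bsr_row_ptr.end() - 1);
    p.end_ptr.assign(bsr_row_ptr.begin() + 1, bsr_row_ptr.end());
    return p;
}

// Lists the block columns that a BSRX description selects in one block row.
std::vector<int> block_columns_of_row(const BsrxPointers& p,
                                      const std::vector<int>& col_ind,
                                      int block_row) {
    assert(p.row_ptr.size() == p.end_ptr.size());
    assert(0 <= block_row && static_cast<std::size_t>(block_row) < p.row_ptr.size());
    std::vector<int> columns;
    for (int k = p.row_ptr[block_row]; k < p.end_ptr[block_row]; ++k) {
        columns.push_back(col_ind[static_cast<std::size_t>(k)]);
    }
    return columns;
}
```

With bsrRowPtrA = [0, 2, 5] in BSR, the sketch yields [0, 2] and [2, 5]. With the pointers of \(\widetilde{A}\), [0, 3] and [0, 4], block row 0 selects no blocks and block row 1 selects only block column 1, that is \(A_{11}\).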


4.cuSPARSE Basic APIs

4.1.cuSPARSE Types Reference

4.1.1.cudaDataType_t

This section describes the types shared by multiple CUDA Libraries and defined in the header file library_types.h. The cudaDataType type is an enumerator to specify the data precision. It is used when the data reference does not carry the type itself (e.g. void*). For example, it is used in the routine cusparseSpMM().

| Value | Meaning | Data Type | Header |
|---|---|---|---|
| CUDA_R_16F | The data type is 16-bit IEEE-754 floating-point | __half | cuda_fp16.h |
| CUDA_C_16F [DEPRECATED] | The data type is 16-bit complex IEEE-754 floating-point | __half2 | cuda_fp16.h |
| CUDA_R_16BF | The data type is 16-bit bfloat floating-point | __nv_bfloat16 | cuda_bf16.h |
| CUDA_C_16BF [DEPRECATED] | The data type is 16-bit complex bfloat floating-point | __nv_bfloat162 | cuda_bf16.h |
| CUDA_R_32F | The data type is 32-bit IEEE-754 floating-point | float | |
| CUDA_C_32F | The data type is 32-bit complex IEEE-754 floating-point | cuComplex | cuComplex.h |
| CUDA_R_64F | The data type is 64-bit IEEE-754 floating-point | double | |
| CUDA_C_64F | The data type is 64-bit complex IEEE-754 floating-point | cuDoubleComplex | cuComplex.h |
| CUDA_R_8I | The data type is 8-bit integer | int8_t | stdint.h |
| CUDA_R_32I | The data type is 32-bit integer | int32_t | stdint.h |

IMPORTANT: The Generic API routines allow all data types reported in the respective section of the documentation only on GPU architectures with native support for them. If a specific GPU model does not provide native support for a given data type, the routine returns the CUSPARSE_STATUS_ARCH_MISMATCH error.

Unsupported data types and Compute Capability (CC):

  • __half on GPUs with CC < 53 (e.g. Kepler)

  • __nv_bfloat16 on GPUs with CC < 80 (e.g. Kepler, Maxwell, Pascal, Volta, Turing)

See https://developer.nvidia.com/cuda-gpus


4.1.2.cusparseStatus_t

This data type represents the status returned by the library functions and it can have the following values:

Value

Description

CUSPARSE_STATUS_SUCCESS

The operation completed successfully

CUSPARSE_STATUS_NOT_INITIALIZED

The cuSPARSE library was not initialized. This is usually caused by the lack of a prior call, an error in the CUDA Runtime API called by the cuSPARSE routine, or an error in the hardware setup

To correct: callcusparseCreate() prior to the function call; and check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed

The error also applies to generic APIs (cuSPARSE Generic APIs) for indicating a matrix/vector descriptor not initialized

CUSPARSE_STATUS_ALLOC_FAILED

Resource allocation failed inside the cuSPARSE library. This is usually caused by a device memory allocation (cudaMalloc()) or by a host memory allocation failure

To correct: prior to the function call, deallocate previously allocated memory as much as possible

CUSPARSE_STATUS_INVALID_VALUE

An unsupported value or parameter was passed to the function (a negative vector size, for example)

To correct: ensure that all the parameters being passed have valid values

CUSPARSE_STATUS_ARCH_MISMATCH

The function requires a feature absent from the device architecture

To correct: compile and run the application on a device with appropriate compute capability

CUSPARSE_STATUS_EXECUTION_FAILED

The GPU program failed to execute. This is often caused by a launch failure of the kernel on the GPU, which can be caused by multiple reasons

To correct: check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed

CUSPARSE_STATUS_INTERNAL_ERROR

An internal cuSPARSE operation failed

To correct: check that the hardware, an appropriate version of the driver, and the cuSPARSE library are correctly installed. Also, check that the memory passed as a parameter to the routine is not being deallocated prior to the routine completion

CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED

The matrix type is not supported by this function. This is usually caused by passing an invalid matrix descriptor to the function

To correct: check that the fields in cusparseMatDescr_t descrA were set correctly

CUSPARSE_STATUS_NOT_SUPPORTED

The operation or data type combination is currently not supported by the function

CUSPARSE_STATUS_INSUFFICIENT_RESOURCES

The resources for the computation, such as GPU global or shared memory, are not sufficient to complete the operation. The error can also indicate that the current computation mode (e.g. bit size of sparse matrix indices) cannot handle the given input


4.1.3.cusparseHandle_t

This is a pointer type to an opaque cuSPARSE context, which the user must initialize by calling cusparseCreate() prior to calling any other library function. The handle created and returned by cusparseCreate() must be passed to every cuSPARSE function.


4.1.4.cusparsePointerMode_t

This type indicates whether the scalar values are passed by reference on the host or device. If several scalar values are passed by reference in the function call, all of them will conform to the same single pointer mode. The pointer mode can be set and retrieved using the cusparseSetPointerMode() and cusparseGetPointerMode() routines, respectively.

Value

Meaning

CUSPARSE_POINTER_MODE_HOST

The scalars are passed by reference on the host.

CUSPARSE_POINTER_MODE_DEVICE

The scalars are passed by reference on the device.


4.1.5.cusparseOperation_t

This type indicates which operation is applied to the related input (e.g. sparse matrix, or vector).

Value

Meaning

CUSPARSE_OPERATION_NON_TRANSPOSE

The non-transpose operation is selected.

CUSPARSE_OPERATION_TRANSPOSE

The transpose operation is selected.

CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE

The conjugate transpose operation is selected.


4.1.6.cusparseDiagType_t

This type indicates if the matrix diagonal entries are unity. The diagonal elements are always assumed to be present, but if CUSPARSE_DIAG_TYPE_UNIT is passed to an API routine, then the routine assumes that all diagonal entries are unity and will not read or modify those entries. Note that in this case the routine assumes the diagonal entries are equal to one, regardless of what those entries are actually set to in memory.

Value

Meaning

CUSPARSE_DIAG_TYPE_NON_UNIT

The matrix diagonal has non-unit elements.

CUSPARSE_DIAG_TYPE_UNIT

The matrix diagonal has unit elements.
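The effect of the unit-diagonal setting can be illustrated with a small CPU forward substitution. This is a minimal sketch rather than cuSPARSE code: the function name `csr_lower_solve`, the `unit_diag` flag, and the zero-based CSR layout are assumptions made for the illustration. With `unit_diag` set, the stored diagonal values are never read, which mirrors the behavior described above.

```c
#include <stdbool.h>

/* Hypothetical CPU reference (not a cuSPARSE routine): solves L * x = b for a
 * lower-triangular matrix L stored in zero-based CSR format.
 * The diagonal entry of every row is assumed to be present in storage.
 * When unit_diag is true, stored diagonal values are ignored and treated as 1. */
void csr_lower_solve(int n, const int *rowPtr, const int *colInd,
                     const double *val, bool unit_diag,
                     const double *b, double *x)
{
    for (int i = 0; i < n; ++i) {
        double sum  = b[i];
        double diag = 1.0;                      /* value used for a unit diagonal */
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            int j = colInd[k];
            if (j < i)
                sum -= val[k] * x[j];           /* strictly lower entries */
            else if (j == i && !unit_diag)
                diag = val[k];                  /* read the diagonal only if non-unit */
        }
        x[i] = sum / diag;
    }
}
```

For the matrix [[2, 0], [1, 4]] and b = (2, 6), the non-unit solve gives x = (1, 1.25), while the unit solve ignores the stored 2 and 4 and gives x = (2, 4).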


4.1.7.cusparseFillMode_t

This type indicates if the lower or upper part of a matrix is stored in sparse storage.

Value

Meaning

CUSPARSE_FILL_MODE_LOWER

The lower triangular part is stored.

CUSPARSE_FILL_MODE_UPPER

The upper triangular part is stored.


4.1.8.cusparseIndexBase_t

This type indicates if the base of the matrix indices is zero or one.

Value

Meaning

CUSPARSE_INDEX_BASE_ZERO

The base index is zero (C compatibility).

CUSPARSE_INDEX_BASE_ONE

The base index is one (Fortran compatibility).
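As a minimal sketch of what the index base means for stored arrays, the following C helpers shift one-based indices to zero-based ones and compute the number of nonzeros from a CSR row-pointer array. The names `to_zero_based` and `csr_nnz` are hypothetical and are not part of cuSPARSE.

```c
#include <stddef.h>

/* Hypothetical helper (not part of cuSPARSE): convert an index array
 * (row pointers or column indices) from one-based to zero-based, in place. */
void to_zero_based(int *idx, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        idx[i] -= 1;
}

/* Number of nonzeros described by a CSR row-pointer array with m + 1 entries.
 * The difference rowPtr[m] - rowPtr[0] is the same for either index base. */
int csr_nnz(const int *rowPtr, int m)
{
    return rowPtr[m] - rowPtr[0];
}
```

Both the row-pointer array and the column-index array must be shifted together; the value array is unaffected by the index base.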


4.1.9.cusparseDirection_t

This type indicates whether the elements of a dense matrix should be parsed by rows or by columns (assuming column-major storage in memory of the dense matrix) in function cusparse[S|D|C|Z]nnz. In addition, the storage format of blocks in BSR format is also controlled by this type.

Value

Meaning

CUSPARSE_DIRECTION_ROW

The matrix should be parsed by rows.

CUSPARSE_DIRECTION_COLUMN

The matrix should be parsed by columns.
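For BSR blocks, the direction decides whether each dense block is laid out row-major or column-major. The sketch below shows the resulting address computation; `block_dir_t`, `DIR_ROW`, `DIR_COLUMN`, and `bsr_value_offset` are hypothetical stand-ins introduced for this example, not cuSPARSE identifiers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cusparseDirection_t, used only in this sketch. */
typedef enum { DIR_ROW = 0, DIR_COLUMN = 1 } block_dir_t;

/* Offset of element (r, c) of the blk-th nonzero block inside a BSR value
 * array whose blocks are blockDim x blockDim and stored contiguously. */
size_t bsr_value_offset(block_dir_t dir, int blockDim, int blk, int r, int c)
{
    assert(blockDim > 0 && blk >= 0);
    assert(r >= 0 && r < blockDim && c >= 0 && c < blockDim);
    size_t base = (size_t)blk * blockDim * blockDim;     /* start of the block */
    return dir == DIR_ROW ? base + (size_t)r * blockDim + c   /* row-major    */
                          : base + (size_t)c * blockDim + r;  /* column-major */
}
```

With blockDim = 2, element (0, 1) of block 1 sits at offset 5 for row-major blocks and at offset 6 for column-major blocks.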



4.2.cuSPARSE Management API

The cuSPARSE functions for managing the library are described in this section.


4.2.1.cusparseCreate()

cusparseStatus_t cusparseCreate(cusparseHandle_t* handle)

This function initializes the cuSPARSE library and creates a handle on the cuSPARSE context. It must be called before any other cuSPARSE API function is invoked. It allocates hardware resources necessary for accessing the GPU.

Param.

In/out

Meaning

handle

IN

The pointer to the handle to the cuSPARSE context

Refer to cusparseStatus_t for the description of the return status.


4.2.2.cusparseDestroy()

cusparseStatus_t cusparseDestroy(cusparseHandle_t handle)

This function releases CPU-side resources used by the cuSPARSE library. The release of GPU-side resources may be deferred until the application shuts down.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

Refer to cusparseStatus_t for the description of the return status.


4.2.3.cusparseGetErrorName()

const char* cusparseGetErrorName(cusparseStatus_t status)

The function returns the string representation of an error code enum name. If the error code is not recognized, “unrecognized error code” is returned.

Param.

In/out

Meaning

status

IN

Error code to convert to string

const char*

OUT

Pointer to a NULL-terminated string


4.2.4.cusparseGetErrorString()

const char* cusparseGetErrorString(cusparseStatus_t status)

Returns the description string for an error code. If the error code is not recognized, “unrecognized error code” is returned.

Param.

In/out

Meaning

status

IN

Error code to convert to string

const char*

OUT

Pointer to a NULL-terminated string


4.2.5.cusparseGetProperty()

cusparseStatus_t cusparseGetProperty(libraryPropertyType type, int* value)

The function returns the value of the requested property. Refer to libraryPropertyType for supported types.

Param.

In/out

Meaning

type

IN

Requested property

value

OUT

Value of the requested property

libraryPropertyType (defined in library_types.h):

Value

Meaning

MAJOR_VERSION

Enumerator to query the major version

MINOR_VERSION

Enumerator to query the minor version

PATCH_LEVEL

Number to identify the patch level

Refer to cusparseStatus_t for the description of the return status.


4.2.6.cusparseGetVersion()

cusparseStatus_t cusparseGetVersion(cusparseHandle_t handle, int* version)

This function returns the version number of the cuSPARSE library.

Param.

In/out

Meaning

handle

IN

cuSPARSE handle

version

OUT

The version number of the library

Refer to cusparseStatus_t for the description of the return status.


4.2.7.cusparseGetPointerMode()

cusparseStatus_t cusparseGetPointerMode(cusparseHandle_t handle, cusparsePointerMode_t* mode)

This function obtains the pointer mode used by the cuSPARSE library. Please see the section on the cusparsePointerMode_t type for more details.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

mode

OUT

One of the enumerated pointer mode types

Refer to cusparseStatus_t for the description of the return status.


4.2.8.cusparseSetPointerMode()

cusparseStatus_t cusparseSetPointerMode(cusparseHandle_t handle, cusparsePointerMode_t mode)

This function sets the pointer mode used by the cuSPARSE library. The default is for the values to be passed by reference on the host. Please see the section on the cusparsePointerMode_t type for more details.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

mode

IN

One of the enumerated pointer mode types

Refer to cusparseStatus_t for the description of the return status.


4.2.9.cusparseGetStream()

cusparseStatus_t cusparseGetStream(cusparseHandle_t handle, cudaStream_t* streamId)

This function gets the cuSPARSE library stream, which is used to execute all calls to the cuSPARSE library functions. If the cuSPARSE library stream is not set, all kernels use the default NULL stream.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

streamId

OUT

The stream used by the library

Refer to cusparseStatus_t for the description of the return status.


4.2.10.cusparseSetStream()

cusparseStatus_t cusparseSetStream(cusparseHandle_t handle, cudaStream_t streamId)

This function sets the stream to be used by the cuSPARSE library to execute its routines.

Param.

In/out

Meaning

handle

IN

The handle to the cuSPARSE context

streamId

IN

The stream to be used by the library

Refer to cusparseStatus_t for the description of the return status.



4.3.cuSPARSE Logging API

The cuSPARSE logging mechanism can be enabled by setting the following environment variables before launching the target application:

CUSPARSE_LOG_LEVEL=<level> - where level is one of the following levels:

  • 0 - Off - logging is disabled (default)

  • 1 - Error - only errors will be logged

  • 2 - Trace - API calls that launch CUDA kernels will log their parameters and important information

  • 3 - Hints - hints that can potentially improve the application’s performance

  • 4 - Info - provides general information about the library execution, may contain details about heuristic status

  • 5 - API Trace - API calls will log their parameters and important information

CUSPARSE_LOG_MASK=<mask> - where mask is a combination of the following masks:

  • 0 - Off

  • 1 - Error

  • 2 - Trace

  • 4 - Hints

  • 8 - Info

  • 16 - API Trace

CUSPARSE_LOG_FILE=<file_name> - where file_name is a path to a logging file. The file name may contain %i, which will be replaced with the process id, e.g. <file_name>_%i.log.

If CUSPARSE_LOG_FILE is not defined, the log messages are printed to stdout.
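The mask values listed for CUSPARSE_LOG_MASK are powers of two, which suggests that a combination is formed with a bitwise OR. The sketch below works under that assumption; the enumerator names and the helper `log_mask_has` are hypothetical and are not defined by cuSPARSE.

```c
/* Hypothetical names for the documented mask values (not cuSPARSE symbols). */
enum {
    LOG_MASK_OFF       = 0,
    LOG_MASK_ERROR     = 1,
    LOG_MASK_TRACE     = 2,
    LOG_MASK_HINTS     = 4,
    LOG_MASK_INFO      = 8,
    LOG_MASK_API_TRACE = 16
};

/* Returns nonzero when the given category bit is set in mask. */
int log_mask_has(int mask, int category)
{
    return (mask & category) != 0;
}
```

Under this assumption, selecting errors and hints together corresponds to the value 1 | 4 = 5, that is, CUSPARSE_LOG_MASK=5.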

Starting from CUDA 12.3, it is also possible to dump sparse matrices (CSR, CSC, COO, SELL, BSR) to binary files during their creation by setting the environment variable CUSPARSE_STORE_INPUT_MATRIX. Later on, the binary files can be sent to Math-Libs-Feedback@nvidia.com for debugging and reproducibility purposes of a specific correctness/performance issue.

Another option is to use the experimental cuSPARSE logging API, described in the subsections of this section.

Note

The logging mechanism is not available for the legacy APIs.


4.3.1.cusparseLoggerSetCallback()

cusparseStatus_t cusparseLoggerSetCallback(cusparseLoggerCallback_t callback)

Experimental: The function sets the logging callback function.

Param.

In/out

Meaning

callback

IN

Pointer to a callback function

where cusparseLoggerCallback_t has the following signature:

void (*cusparseLoggerCallback_t)(int logLevel, const char* functionName, const char* message)

Param.

In/out

Meaning

logLevel

IN

Selected log level

functionName

IN

The name of the API that logged this message

message

IN

The log message

See cusparseStatus_t for the description of the return status.


4.3.2.cusparseLoggerSetFile()

cusparseStatus_t cusparseLoggerSetFile(FILE* file)

Experimental: The function sets the logging output file. Note: once registered using this function call, the provided file handle must not be closed unless the function is called again to switch to a different file handle.

Param.

In/out

Meaning

file

IN

Pointer to an open file. File should have write permission

See cusparseStatus_t for the description of the return status.


4.3.3.cusparseLoggerOpenFile()

cusparseStatus_t cusparseLoggerOpenFile(const char* logFile)

Experimental: The function opens a logging output file in the given path.

Param.

In/out

Meaning

logFile

IN

Path of the logging output file

See cusparseStatus_t for the description of the return status.


4.3.4.cusparseLoggerSetLevel()

cusparseStatus_t cusparseLoggerSetLevel(int level)

Experimental: The function sets the value of the logging level.

Param.

In/out

Meaning

level

IN

Value of the logging level

See cusparseStatus_t for the description of the return status.


4.3.5.cusparseLoggerSetMask()

cusparseStatus_t cusparseLoggerSetMask(int mask)

Experimental: The function sets the value of the logging mask.

Param.

In/out

Meaning

mask

IN

Value of the logging mask

See cusparseStatus_t for the description of the return status.



5.cuSPARSE Legacy APIs

5.1.Naming Conventions

The cuSPARSE legacy functions are available for data types float, double, cuComplex, and cuDoubleComplex. The sparse Level 2 and Level 3 functions follow this naming convention:

cusparse<t>[<matrix data format>]<operation>[<output matrix data format>]

where <t> can be S, D, C, Z, or X, corresponding to the data types float, double, cuComplex, cuDoubleComplex, and the generic type, respectively.

The <matrix data format> can be dense, coo, csr, or csc, corresponding to the dense, coordinate, compressed sparse row, and compressed sparse column formats, respectively.

5.2.cuSPARSE Legacy Types Reference

5.2.1.cusparseAction_t

This type indicates whether the operation is performed only on indices or on data and indices.

Value

Meaning

CUSPARSE_ACTION_SYMBOLIC

the operation is performed only on indices.

CUSPARSE_ACTION_NUMERIC

the operation is performed on data and indices.

5.2.2.cusparseMatDescr_t

This structure is used to describe the shape and properties of a matrix.

typedef struct {
    cusparseMatrixType_t MatrixType;
    cusparseFillMode_t   FillMode;
    cusparseDiagType_t   DiagType;
    cusparseIndexBase_t  IndexBase;
} cusparseMatDescr_t;

5.2.3.cusparseMatrixType_t

This type indicates the type of matrix stored in sparse storage. Notice that for symmetric, Hermitian and triangular matrices only their lower or upper part is assumed to be stored.

The purpose of matrix type and fill mode is to minimize storage for symmetric/Hermitian matrices and to take advantage of the symmetric property in SpMV (Sparse Matrix Vector multiplication). To compute y=A*x when A is symmetric and only the lower triangular part is stored, two steps are needed. The first step is to compute y=(L+D)*x and the second step is to compute y=L^T*x+y. Given that the transpose operation y=L^T*x is 10x slower than the non-transpose version y=L*x, the symmetric property does not yield any performance gain. It is better for the user to extend the symmetric matrix to a general matrix and apply y=A*x with matrix type CUSPARSE_MATRIX_TYPE_GENERAL.
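The two-step symmetric product can be written out on the CPU as follows. This is only an illustrative sketch, not how cuSPARSE implements it: the function name `csr_symv_lower`, the double-precision type, and the zero-based CSR layout holding the lower triangle (diagonal included) are assumptions chosen for the example.

```c
/* Hypothetical CPU reference (not a cuSPARSE routine): y = A * x for a
 * symmetric A of which only the lower triangle, including the diagonal,
 * is stored in zero-based CSR format. */
void csr_symv_lower(int n, const int *rowPtr, const int *colInd,
                    const double *val, const double *x, double *y)
{
    /* Step 1: y = (L + D) * x, an ordinary row-wise (non-transpose) product. */
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[colInd[k]];
        y[i] = sum;
    }
    /* Step 2: y = L^T * x + y, where each strictly lower entry (i, j)
     * contributes to row j of the result. */
    for (int i = 0; i < n; ++i)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            if (colInd[k] < i)
                y[colInd[k]] += val[k] * x[i];
}
```

Step 2 scatters into y at positions given by the column indices. On a GPU such scattered updates generally need atomics or a separate transpose, which is one plausible reason for the slowdown of the transpose product mentioned above.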

In general, SpMV, preconditioners (incomplete Cholesky or incomplete LU) and triangular solvers are combined together in iterative solvers, for example PCG and GMRES. If the user always uses a general matrix (instead of a symmetric matrix), there is no need to support matrix types other than general in preconditioners. Therefore the new routines, [bsr|csr]sv2 (triangular solver), [bsr|csr]ilu02 (incomplete LU) and [bsr|csr]ic02 (incomplete Cholesky), only support matrix type CUSPARSE_MATRIX_TYPE_GENERAL.

Value

Meaning

CUSPARSE_MATRIX_TYPE_GENERAL

the matrix is general.

CUSPARSE_MATRIX_TYPE_SYMMETRIC

the matrix is symmetric.

CUSPARSE_MATRIX_TYPE_HERMITIAN

the matrix is Hermitian.

CUSPARSE_MATRIX_TYPE_TRIANGULAR

the matrix is triangular.

5.2.4.cusparseColorInfo_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in csrcolor().

5.2.5.cusparseSolvePolicy_t [DEPRECATED]

This type indicates whether level information is generated and used in csrsv2, csric02, csrilu02, bsrsv2, bsric02 and bsrilu02.

Value

Meaning

CUSPARSE_SOLVE_POLICY_NO_LEVEL

no level information is generated and used.

CUSPARSE_SOLVE_POLICY_USE_LEVEL

generate and use level information.

5.2.6.bsric02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsric02_bufferSize(), bsric02_analysis(), and bsric02().

5.2.7.bsrilu02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsrilu02_bufferSize(), bsrilu02_analysis(), and bsrilu02().

5.2.8.bsrsm2Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsrsm2_bufferSize(), bsrsm2_analysis(), and bsrsm2_solve().

5.2.9.bsrsv2Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in bsrsv2_bufferSize(), bsrsv2_analysis(), and bsrsv2_solve().

5.2.10.csric02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in csric02_bufferSize(), csric02_analysis(), and csric02().

5.2.11.csrilu02Info_t [DEPRECATED]

This is a pointer type to an opaque structure holding the information used in csrilu02_bufferSize(), csrilu02_analysis(), and csrilu02().

5.3.cuSPARSE Helper Function Reference

The cuSPARSE helper functions are described in this section.

5.3.1.cusparseCreateColorInfo() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateColorInfo(cusparseColorInfo_t* info)

This function creates and initializes the cusparseColorInfo_t structure to default values.

Input

info

the pointer to the cusparseColorInfo_t structure

See cusparseStatus_t for the description of the return status.

5.3.2.cusparseCreateMatDescr()

cusparseStatus_t cusparseCreateMatDescr(cusparseMatDescr_t* descrA)

This function initializes the matrix descriptor. It sets the fields MatrixType and IndexBase to the default values CUSPARSE_MATRIX_TYPE_GENERAL and CUSPARSE_INDEX_BASE_ZERO, respectively, while leaving other fields uninitialized.

Input

descrA

the pointer to the matrix descriptor.

See cusparseStatus_t for the description of the return status.

5.3.3.cusparseDestroyColorInfo() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyColorInfo(cusparseColorInfo_t info)

This function destroys and releases any memory required by the structure.

Input

info

the pointer to the structure of csrcolor()

See cusparseStatus_t for the description of the return status.

5.3.4.cusparseDestroyMatDescr()

cusparseStatus_t cusparseDestroyMatDescr(cusparseMatDescr_t descrA)

This function releases the memory allocated for the matrix descriptor.

Input

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status.

5.3.5.cusparseGetMatDiagType()

cusparseDiagType_t cusparseGetMatDiagType(const cusparseMatDescr_t descrA)

This function returns the DiagType field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated diagType types.

5.3.6.cusparseGetMatFillMode()

cusparseFillMode_t cusparseGetMatFillMode(const cusparseMatDescr_t descrA)

This function returns the FillMode field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated fillMode types.

5.3.7.cusparseGetMatIndexBase()

cusparseIndexBase_t cusparseGetMatIndexBase(const cusparseMatDescr_t descrA)

This function returns the IndexBase field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated indexBase types.

5.3.8.cusparseGetMatType()

cusparseMatrixType_t cusparseGetMatType(const cusparseMatDescr_t descrA)

This function returns the MatrixType field of the matrix descriptor descrA.

Input

descrA

the matrix descriptor.

Returned

 

One of the enumerated matrix types.

5.3.9.cusparseSetMatDiagType()

cusparseStatus_t cusparseSetMatDiagType(cusparseMatDescr_t descrA, cusparseDiagType_t diagType)

This function sets the DiagType field of the matrix descriptor descrA.

Input

diagType

One of the enumerated diagType types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status.

5.3.10.cusparseSetMatFillMode()

cusparseStatus_t cusparseSetMatFillMode(cusparseMatDescr_t descrA, cusparseFillMode_t fillMode)

This function sets the FillMode field of the matrix descriptor descrA.

Input

fillMode

One of the enumerated fillMode types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status.

5.3.11.cusparseSetMatIndexBase()

cusparseStatus_t cusparseSetMatIndexBase(cusparseMatDescr_t descrA, cusparseIndexBase_t base)

This function sets the IndexBase field of the matrix descriptor descrA.

Input

base

One of the enumerated indexBase types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status.

5.3.12.cusparseSetMatType()

cusparseStatus_t cusparseSetMatType(cusparseMatDescr_t descrA, cusparseMatrixType_t type)

This function sets the MatrixType field of the matrix descriptor descrA.

Input

type

One of the enumerated matrix types.

Output

descrA

the matrix descriptor.

See cusparseStatus_t for the description of the return status.

5.3.13.cusparseCreateCsric02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateCsric02Info(csric02Info_t* info);

This function creates and initializes the solve and analysis structure of incomplete Cholesky to default values.

Input

info

the pointer to the solve and analysis structure of incomplete Cholesky.

See cusparseStatus_t for the description of the return status.

5.3.14.cusparseDestroyCsric02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyCsric02Info(csric02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (csric02_solve) and analysis (csric02_analysis) structure.

See cusparseStatus_t for the description of the return status.

5.3.15.cusparseCreateCsrilu02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateCsrilu02Info(csrilu02Info_t* info);

This function creates and initializes the solve and analysis structure of incomplete LU to default values.

Input

info

the pointer to the solve and analysis structure of incomplete LU.

See cusparseStatus_t for the description of the return status.

5.3.16.cusparseDestroyCsrilu02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyCsrilu02Info(csrilu02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (csrilu02_solve) and analysis (csrilu02_analysis) structure.

See cusparseStatus_t for the description of the return status.

5.3.17.cusparseCreateBsrsv2Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateBsrsv2Info(bsrsv2Info_t* info);

This function creates and initializes the solve and analysis structure of bsrsv2 to default values.

Input

info

the pointer to the solve and analysis structure of bsrsv2.

See cusparseStatus_t for the description of the return status.

5.3.18.cusparseDestroyBsrsv2Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyBsrsv2Info(bsrsv2Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsrsv2_solve) and analysis (bsrsv2_analysis) structure.

See cusparseStatus_t for the description of the return status.

5.3.19.cusparseCreateBsrsm2Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateBsrsm2Info(bsrsm2Info_t* info);

This function creates and initializes the solve and analysis structure of bsrsm2 to default values.

Input

info

the pointer to the solve and analysis structure of bsrsm2.

See cusparseStatus_t for the description of the return status.

5.3.20.cusparseDestroyBsrsm2Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyBsrsm2Info(bsrsm2Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsrsm2_solve) and analysis (bsrsm2_analysis) structure.

See cusparseStatus_t for the description of the return status.

5.3.21.cusparseCreateBsric02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateBsric02Info(bsric02Info_t* info);

This function creates and initializes the solve and analysis structure of block incomplete Cholesky to default values.

Input

info

the pointer to the solve and analysis structure of block incomplete Cholesky.

See cusparseStatus_t for the description of the return status.

5.3.22.cusparseDestroyBsric02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyBsric02Info(bsric02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsric02_solve) and analysis (bsric02_analysis) structure.

See cusparseStatus_t for the description of the return status.

5.3.23.cusparseCreateBsrilu02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateBsrilu02Info(bsrilu02Info_t* info);

This function creates and initializes the solve and analysis structure of block incomplete LU to default values.

Input

info

the pointer to the solve and analysis structure of block incomplete LU.

See cusparseStatus_t for the description of the return status.

5.3.24.cusparseDestroyBsrilu02Info() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyBsrilu02Info(bsrilu02Info_t info);

This function destroys and releases any memory required by the structure.

Input

info

the solve (bsrilu02_solve) and analysis (bsrilu02_analysis) structure.

See cusparseStatus_t for the description of the return status.

5.3.25.cusparseCreatePruneInfo() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreatePruneInfo(pruneInfo_t* info);

This function creates and initializes the structure of prune to default values.

Input

info

the pointer to the structure of prune.

See cusparseStatus_t for the description of the return status.

5.3.26.cusparseDestroyPruneInfo() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseDestroyPruneInfo(pruneInfo_t info);

This function destroys and releases any memory required by the structure.

Input

info

the structure of prune.

See cusparseStatus_t for the description of the return status.

5.4.cuSPARSE Level 2 Function Reference

This chapter describes the sparse linear algebra functions that perform operations between sparse matrices and dense vectors.

5.4.1.cusparse<t>bsrmv() [DEPRECATED]

cusparseStatus_t cusparseSbsrmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int mb, int nb, int nnzb, const float* alpha, const cusparseMatDescr_t descr, const float* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int blockDim, const float* x, const float* beta, float* y)

cusparseStatus_t cusparseDbsrmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int mb, int nb, int nnzb, const double* alpha, const cusparseMatDescr_t descr, const double* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int blockDim, const double* x, const double* beta, double* y)

cusparseStatus_t cusparseCbsrmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int mb, int nb, int nnzb, const cuComplex* alpha, const cusparseMatDescr_t descr, const cuComplex* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int blockDim, const cuComplex* x, const cuComplex* beta, cuComplex* y)

cusparseStatus_t cusparseZbsrmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int mb, int nb, int nnzb, const cuDoubleComplex* alpha, const cusparseMatDescr_t descr, const cuDoubleComplex* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int blockDim, const cuDoubleComplex* x, const cuDoubleComplex* beta, cuDoubleComplex* y)

This function performs the matrix-vector operation

\[\text{y} = \alpha \ast \text{op}(A) \ast \text{x} + \beta \ast \text{y}\]

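where \(A\text{ is an }(mb \ast blockDim) \times (nb \ast blockDim)\) sparse matrix that is defined in BSR storage format by the three arrays bsrVal, bsrRowPtr, and bsrColInd; x and y are vectors; \(\alpha\text{ and }\beta\) are scalars; and

\[\text{op}(A) = \begin{cases} A & \text{if trans is CUSPARSE\_OPERATION\_NON\_TRANSPOSE} \\ A^{T} & \text{if trans is CUSPARSE\_OPERATION\_TRANSPOSE} \\ A^{H} & \text{if trans is CUSPARSE\_OPERATION\_CONJUGATE\_TRANSPOSE} \end{cases}\]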

bsrmv() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Several comments on bsrmv():

  • Only blockDim > 1 is supported.

  • Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported, that is

\[\text{y} = \alpha \ast A \ast \text{x} + \beta{} \ast \text{y}\]

  • Only CUSPARSE_MATRIX_TYPE_GENERAL is supported.

  • The size of vector x should be at least \((nb \ast blockDim)\), and the size of vector y should be at least \((mb \ast blockDim)\); otherwise, the kernel may return CUSPARSE_STATUS_EXECUTION_FAILED because of an out-of-bounds array access.

For example, suppose the user has a matrix in CSR format and wants to try bsrmv(). The following code demonstrates how to use the csr2bsr() conversion and the bsrmv() multiplication in single precision.

// Suppose that A is an m x n sparse matrix represented in CSR format,
// hx is a host vector of size n, and hy is a host vector of size m.
// m and n need not be multiples of blockDim.
// step 1: transform CSR to BSR with column-major order
int base, nnz;
int nnzb;
cusparseDirection_t dirA = CUSPARSE_DIRECTION_COLUMN;
int mb = (m + blockDim - 1) / blockDim;
int nb = (n + blockDim - 1) / blockDim;
cudaMalloc((void**)&bsrRowPtrC, sizeof(int) * (mb + 1));
cusparseXcsr2bsrNnz(handle, dirA, m, n,
                    descrA, csrRowPtrA, csrColIndA, blockDim,
                    descrC, bsrRowPtrC, &nnzb);
cudaMalloc((void**)&bsrColIndC, sizeof(int) * nnzb);
cudaMalloc((void**)&bsrValC, sizeof(float) * (blockDim * blockDim) * nnzb);
cusparseScsr2bsr(handle, dirA, m, n,
                 descrA, csrValA, csrRowPtrA, csrColIndA, blockDim,
                 descrC, bsrValC, bsrRowPtrC, bsrColIndC);
// step 2: allocate vector x and vector y large enough for bsrmv
cudaMalloc((void**)&x, sizeof(float) * (nb * blockDim));
cudaMalloc((void**)&y, sizeof(float) * (mb * blockDim));
cudaMemcpy(x, hx, sizeof(float) * n, cudaMemcpyHostToDevice);
cudaMemcpy(y, hy, sizeof(float) * m, cudaMemcpyHostToDevice);
// step 3: perform bsrmv
cusparseSbsrmv(handle, dirA, transA, mb, nb, nnzb, &alpha,
               descrC, bsrValC, bsrRowPtrC, bsrColIndC, blockDim, x, &beta, y);
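When checking results from bsrmv(), a plain CPU reference of the same operation can be handy. The following is a minimal sketch under stated assumptions: it is not cuSPARSE code, the name `bsrmv_ref` and the `row_major_blocks` flag (nonzero standing in for CUSPARSE_DIRECTION_ROW) are hypothetical, indices are zero-based, and only the non-transpose case is covered.

```c
#include <stddef.h>

/* Hypothetical CPU reference (not a cuSPARSE routine):
 * y = alpha * A * x + beta * y for a matrix A in zero-based BSR format.
 * row_major_blocks != 0 means each block is stored row-major,
 * otherwise column-major. */
void bsrmv_ref(int row_major_blocks, int mb, float alpha,
               const float *bsrVal, const int *bsrRowPtr, const int *bsrColInd,
               int blockDim, const float *x, float beta, float *y)
{
    for (int ib = 0; ib < mb; ++ib) {                 /* block row            */
        for (int r = 0; r < blockDim; ++r) {          /* row inside the block */
            float sum = 0.0f;
            for (int k = bsrRowPtr[ib]; k < bsrRowPtr[ib + 1]; ++k) {
                const float *blk = bsrVal + (size_t)k * blockDim * blockDim;
                const float *xb  = x + (size_t)bsrColInd[k] * blockDim;
                for (int c = 0; c < blockDim; ++c) {
                    float a = row_major_blocks ? blk[r * blockDim + c]
                                               : blk[c * blockDim + r];
                    sum += a * xb[c];
                }
            }
            float *yi = &y[(size_t)ib * blockDim + r];
            /* When beta is zero, the previous content of y is not read. */
            *yi = alpha * sum + (beta == 0.0f ? 0.0f : beta * (*yi));
        }
    }
}
```

The vectors x and y are expected to hold nb * blockDim and mb * blockDim elements, the padded sizes computed in the example above.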

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

trans

the operation \(\text{op}(A)\). Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported.

mb

number of block rows of matrix\(A\).

nb

number of block columns of matrix\(A\).

nnzb

number of nonzero blocks of matrix\(A\).

alpha

<type> scalar used for multiplication.

descr

the descriptor of matrix \(A\). The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrVal

<type> array of nnzb \(( =\) bsrRowPtr(mb) \(-\) bsrRowPtr(0) \()\) nonzero blocks of matrix \(A\).

bsrRowPtr

integer array of mb \(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColInd

integer array of nnzb \(( =\) bsrRowPtr(mb) \(-\) bsrRowPtr(0) \()\) column indices of the nonzero blocks of matrix \(A\).

blockDim

block dimension of sparse matrix\(A\), larger than zero.

x

<type> vector of\(nb \ast blockDim\) elements.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

y

<type> vector of\(mb \ast blockDim\) elements.

Output

y

<type> updated vector.

See cusparseStatus_t for the description of the return status.

5.4.2.cusparse<t>bsrxmv() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseSbsrxmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int sizeOfMask, int mb, int nb, int nnzb, const float* alpha, const cusparseMatDescr_t descr, const float* bsrVal, const int* bsrMaskPtr, const int* bsrRowPtr, const int* bsrEndPtr, const int* bsrColInd, int blockDim, const float* x, const float* beta, float* y)

cusparseStatus_t cusparseDbsrxmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int sizeOfMask, int mb, int nb, int nnzb, const double* alpha, const cusparseMatDescr_t descr, const double* bsrVal, const int* bsrMaskPtr, const int* bsrRowPtr, const int* bsrEndPtr, const int* bsrColInd, int blockDim, const double* x, const double* beta, double* y)

cusparseStatus_t cusparseCbsrxmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int sizeOfMask, int mb, int nb, int nnzb, const cuComplex* alpha, const cusparseMatDescr_t descr, const cuComplex* bsrVal, const int* bsrMaskPtr, const int* bsrRowPtr, const int* bsrEndPtr, const int* bsrColInd, int blockDim, const cuComplex* x, const cuComplex* beta, cuComplex* y)

cusparseStatus_t cusparseZbsrxmv(cusparseHandle_t handle, cusparseDirection_t dir, cusparseOperation_t trans, int sizeOfMask, int mb, int nb, int nnzb, const cuDoubleComplex* alpha, const cusparseMatDescr_t descr, const cuDoubleComplex* bsrVal, const int* bsrMaskPtr, const int* bsrRowPtr, const int* bsrEndPtr, const int* bsrColInd, int blockDim, const cuDoubleComplex* x, const cuDoubleComplex* beta, cuDoubleComplex* y)

This function performs a bsrmv and a mask operation

\[\text{y(mask)} = (\alpha \ast \text{op}(A) \ast \text{x} + \beta \ast \text{y})\text{(mask)}\]

where \(A\) is an \((mb \ast blockDim) \times (nb \ast blockDim)\) sparse matrix that is defined in BSRX storage format by the four arrays bsrVal, bsrRowPtr, bsrEndPtr, and bsrColInd; x and y are vectors; \(\alpha\) and \(\beta\) are scalars; and

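\[\begin{split}\text{op}(A) = \begin{cases} A & \text{(non-transpose)} \\ A^{T} & \text{(transpose)} \\ A^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

selected by trans = CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_TRANSPOSE, or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE, respectively.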

The mask operation is defined by the array bsrMaskPtr, which contains the indices of the block rows of \(y\) to update. If row \(i\) is not specified in bsrMaskPtr, then bsrxmv() does not touch row block \(i\) of \(A\) and \(y\).

For example, consider the \(2 \times 3\) block matrix \(A\):

\[\begin{split}\begin{matrix}{A = \begin{bmatrix}A_{11} & A_{12} & O \\A_{21} & A_{22} & A_{23} \\\end{bmatrix}} \\\end{matrix}\end{split}\]

and its one-based BSR format (three vector form) is:

\[\begin{split}\begin{matrix}\text{bsrVal} & = & \begin{bmatrix}A_{11} & A_{12} & A_{21} & A_{22} & A_{23} \\\end{bmatrix} \\\text{bsrRowPtr} & = & \begin{bmatrix}{1\phantom{.0}} & {3\phantom{.0}} & 6 \\\end{bmatrix} \\\text{bsrColInd} & = & \begin{bmatrix}{1\phantom{.0}} & {2\phantom{.0}} & {1\phantom{.0}} & {2\phantom{.0}} & 3 \\\end{bmatrix} \\\end{matrix}\end{split}\]

Suppose we want to perform the following bsrmv operation on a matrix \(\bar{A}\) that is slightly different from \(A\):

\[\begin{split}\begin{bmatrix}y_{1} \\y_{2} \\\end{bmatrix} := \alpha \ast (\bar{A} = \begin{bmatrix}O & O & O \\O & A_{22} & O \\\end{bmatrix}) \ast \begin{bmatrix}x_{1} \\x_{2} \\x_{3} \\\end{bmatrix} + \begin{bmatrix}y_{1} \\{\beta \ast y_{2}} \\\end{bmatrix}\end{split}\]

We don’t need to create another BSR format for the new matrix \(\bar{A}\). It is enough to keep bsrVal and bsrColInd unchanged, modify bsrRowPtr, and add an additional array bsrEndPtr that points to the last nonzero block of each block row of \(\bar{A}\) plus 1.

For example, the following bsrRowPtr and bsrEndPtr can represent matrix \(\bar{A}\):

\[\begin{split}\begin{matrix} \text{bsrRowPtr} & = & \begin{bmatrix} {1\phantom{.0}} & 4 \\ \end{bmatrix} \\ \text{bsrEndPtr} & = & \begin{bmatrix} {1\phantom{.0}} & 5 \\ \end{bmatrix} \\ \end{matrix}\end{split}\]

Further, we can use a mask operator (specified by the array bsrMaskPtr) to update only particular block rows of \(y\), because \(y_{1}\) is never changed. In this case, bsrMaskPtr \(=\) [2] and sizeOfMask=1.

The mask operator is equivalent to the following operation:

\[\begin{split}\begin{bmatrix} ? \\ y_{2} \\ \end{bmatrix} := \alpha \ast \begin{bmatrix} ? & ? & ? \\ O & A_{22} & O \\ \end{bmatrix} \ast \begin{bmatrix} x_{1} \\ x_{2} \\ x_{3} \\ \end{bmatrix} + \beta \ast \begin{bmatrix} ? \\ y_{2} \\ \end{bmatrix}\end{split}\]

If a block row is not present in bsrMaskPtr, then no calculation is performed on that row, and the corresponding value in y is unmodified. The question mark “?” is used to indicate row blocks not in bsrMaskPtr.

In this case, the first row block is not present in bsrMaskPtr, so bsrRowPtr[0] and bsrEndPtr[0] are not touched either.

\[\begin{split}\begin{matrix} \text{bsrRowPtr} & = & \begin{bmatrix} {?\phantom{.0}} & 4 \\ \end{bmatrix} \\ \text{bsrEndPtr} & = & \begin{bmatrix} {?\phantom{.0}} & 5 \\ \end{bmatrix} \\ \end{matrix}\end{split}\]
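The masked update described above can be sketched on the CPU as follows. This is only an illustrative reference, not the library's implementation: the function name bsrxmv_ref is hypothetical, and the sketch assumes real double-precision data, column-major blocks, op(A) = A, and a single index base shared by all four index arrays.

```c
#include <stddef.h>

/* CPU reference for y(mask) = (alpha*A*x + beta*y)(mask) with A in BSRX form.
 * Assumptions of this sketch: column-major blocks, op(A) = A, and `base`
 * (0 or 1) applies to bsrMaskPtr, bsrRowPtr, bsrEndPtr and bsrColInd alike. */
void bsrxmv_ref(int sizeOfMask, int blockDim, int base,
                double alpha, const double *bsrVal,
                const int *bsrMaskPtr, const int *bsrRowPtr,
                const int *bsrEndPtr, const int *bsrColInd,
                const double *x, double beta, double *y)
{
    for (int s = 0; s < sizeOfMask; ++s) {
        int i = bsrMaskPtr[s] - base;          /* block row to update */
        for (int r = 0; r < blockDim; ++r) {
            double acc = 0.0;
            /* only blocks in [bsrRowPtr[i], bsrEndPtr[i]) take part */
            for (int p = bsrRowPtr[i] - base; p < bsrEndPtr[i] - base; ++p) {
                int j = bsrColInd[p] - base;   /* block column */
                const double *blk = bsrVal + (size_t)p * blockDim * blockDim;
                for (int c = 0; c < blockDim; ++c)
                    acc += blk[r + c * blockDim] * x[j * blockDim + c];
            }
            double *yi = &y[i * blockDim + r];
            /* when beta is zero, y is never read */
            *yi = alpha * acc + (beta == 0.0 ? 0.0 : beta * *yi);
        }
    }
}
```

Because only the block rows listed in bsrMaskPtr are visited, the entries of bsrRowPtr and bsrEndPtr that belong to unmasked rows are never read, which is why they can hold arbitrary values ("?") in the example.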

bsrxmv() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

A couple of comments on bsrxmv():

  • Only blockDim>1 is supported.

  • Only CUSPARSE_OPERATION_NON_TRANSPOSE and CUSPARSE_MATRIX_TYPE_GENERAL are supported.

  • Parameters bsrMaskPtr, bsrRowPtr, bsrEndPtr, and bsrColInd must all use the same index base, either one-based or zero-based. The above example is one-based.

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

trans

the operation \(\text{op}(A)\). Only CUSPARSE_OPERATION_NON_TRANSPOSE is supported.

sizeOfMask

number of updated block rows of \(y\).

mb

number of block rows of matrix \(A\).

nb

number of block columns of matrix \(A\).

nnzb

number of nonzero blocks of matrix \(A\).

alpha

<type> scalar used for multiplication.

descr

the descriptor of matrix \(A\). The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrVal

<type> array of nnzb nonzero blocks of matrix \(A\).

bsrMaskPtr

integer array of sizeOfMask elements that contains the indices corresponding to updated block rows.

bsrRowPtr

integer array of mb elements that contains the start of every block row.

bsrEndPtr

integer array of mb elements that contains the end of every block row plus one.

bsrColInd

integer array of nnzb column indices of the nonzero blocks of matrix \(A\).

blockDim

block dimension of sparse matrix \(A\), larger than zero.

x

<type> vector of \(nb \ast blockDim\) elements.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

y

<type> vector of \(mb \ast blockDim\) elements.

See cusparseStatus_t for the description of the return status.

5.4.3.cusparse<t>bsrsv2_bufferSize() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsv2_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, float* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsrsv2_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, double* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsrsv2_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, cuComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsrsv2_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, cuDoubleComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, int* pBufferSizeInBytes)

This function returns the size of the buffer used in bsrsv2, a new sparse triangular linear system op(A)*y = \(\alpha\)x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; x and y are the right-hand-side and the solution vectors; \(\alpha\) is a scalar; and

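\[\begin{split}\text{op}(A) = \begin{cases} A & \text{(non-transpose)} \\ A^{T} & \text{(transpose)} \\ A^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

selected by transA = CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_TRANSPOSE, or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE, respectively.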

Although there are six combinations in terms of parameter trans and the upper (lower) triangular part of A, bsrsv2_bufferSize() returns the maximum buffer size among these combinations. The buffer size depends on the dimensions mb and blockDim, and on the number of nonzero blocks of the matrix, nnzb. If the user changes the matrix, it is necessary to call bsrsv2_bufferSize() again to obtain the correct buffer size; otherwise a segmentation fault may occur.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation \(\text{op}(A)\).

mb

number of block rows of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb \(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

Output

info

record of internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsrsv2_analysis() and bsrsv2_solve().

See cusparseStatus_t for the description of the return status.

5.4.4.cusparse<t>bsrsv2_analysis() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsv2_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, const float* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseDbsrsv2_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, const double* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseCbsrsv2_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, const cuComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseZbsrsv2_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb,
    const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

This function performs the analysis phase of bsrsv2, a new sparse triangular linear system op(A)*y = \(\alpha\)x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; x and y are the right-hand-side and the solution vectors; \(\alpha\) is a scalar; and

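\[\begin{split}\text{op}(A) = \begin{cases} A & \text{(non-transpose)} \\ A^{T} & \text{(transpose)} \\ A^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

selected by transA = CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_TRANSPOSE, or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE, respectively.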

Each block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires a buffer whose size is returned by bsrsv2_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.
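The 128-byte requirement on pBuffer can be checked on the host before the call. The two helpers below are hypothetical (they are not part of cuSPARSE) and only sketch the arithmetic; the second one is meant for the case where pBuffer is carved out of a larger allocation and over-allocated by up to 127 bytes.

```c
#include <stdint.h>

/* Returns 1 if the address p is a multiple of 128 bytes, 0 otherwise. */
int is_aligned_128(const void *p)
{
    return ((uintptr_t)p % 128u) == 0;
}

/* Rounds p up to the next 128-byte boundary (p itself if already aligned).
 * The caller must make sure the underlying allocation is large enough. */
void *align_up_128(void *p)
{
    uintptr_t a = ((uintptr_t)p + 127u) & ~(uintptr_t)127u;
    return (void *)a;
}
```

A caller might test is_aligned_128(pBuffer) right before bsrsv2_analysis() to catch a misaligned sub-allocation early instead of waiting for CUSPARSE_STATUS_INVALID_VALUE.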

Function bsrsv2_analysis() reports a structural zero and computes level information, which is stored in the opaque structure info. The level information can extract more parallelism for a triangular solver. However, bsrsv2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrsv2_analysis() always reports the first structural zero, even when parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. No structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if block A(j,j) is missing for some j. The user needs to call cusparseXbsrsv2_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call bsrsv2_solve() if bsrsv2_analysis() reports a structural zero. In this case, the user can still call bsrsv2_solve(), which will return a numerical zero at the same position as a structural zero. However, the result y is meaningless.
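To make the notion of a structural zero concrete, the following host-side sketch scans a BSR pattern for the first block row whose diagonal block A(j,j) is absent. The function name is hypothetical and the scan only illustrates the definition; it says nothing about how the library performs its analysis, and it ignores the CUSPARSE_DIAG_TYPE_UNIT case in which no structural zero is reported.

```c
/* Returns the index (in the matrix's own index base) of the first block row
 * whose diagonal block is missing from the BSR pattern, or -1 if every
 * diagonal block is present. `base` is 0 or 1. */
int first_structural_zero(int mb, int base,
                          const int *bsrRowPtr, const int *bsrColInd)
{
    for (int i = 0; i < mb; ++i) {
        int found = 0;
        for (int p = bsrRowPtr[i] - base; p < bsrRowPtr[i + 1] - base; ++p) {
            if (bsrColInd[p] - base == i) {
                found = 1;
                break;
            }
        }
        if (!found)
            return i + base;   /* same convention as `position` */
    }
    return -1;
}
```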

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation \(\text{op}(A)\).

mb

number of block rows of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb \(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure initialized using cusparseCreateBsrsv2Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrsv2_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.4.5.cusparse<t>bsrsv2_solve() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsv2_solve(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb, const float* alpha,
    const cusparseMatDescr_t descrA, const float* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, const float* x, float* y,
    cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseDbsrsv2_solve(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb, const double* alpha,
    const cusparseMatDescr_t descrA, const double* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, const double* x, double* y,
    cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseCbsrsv2_solve(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb, const cuComplex* alpha,
    const cusparseMatDescr_t descrA, const cuComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, const cuComplex* x, cuComplex* y,
    cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseZbsrsv2_solve(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, int mb, int nnzb, const cuDoubleComplex* alpha,
    const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    bsrsv2Info_t info, const cuDoubleComplex* x, cuDoubleComplex* y,
    cusparseSolvePolicy_t policy, void* pBuffer)

This function performs the solve phase of bsrsv2, a new sparse triangular linear system op(A)*y = \(\alpha\)x.

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; x and y are the right-hand-side and the solution vectors; \(\alpha\) is a scalar; and

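\[\begin{split}\text{op}(A) = \begin{cases} A & \text{(non-transpose)} \\ A^{T} & \text{(transpose)} \\ A^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

selected by transA = CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_TRANSPOSE, or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE, respectively.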

Each block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrsv2_solve() can support an arbitrary blockDim.

This function may be executed multiple times for a given matrix and a particular operation type.

This function requires a buffer whose size is returned by bsrsv2_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrsv2_solve() can be done without level information, the user still needs to be aware of consistency. If bsrsv2_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrsv2_solve() can be run with or without levels. On the other hand, if bsrsv2_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrsv2_solve() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Level information does not always improve performance, and computing it costs extra analysis time. For example, a tridiagonal matrix has no parallelism. In this case, CUSPARSE_SOLVE_POLICY_NO_LEVEL performs better than CUSPARSE_SOLVE_POLICY_USE_LEVEL. If the user has an iterative solver, the best approach is to do bsrsv2_analysis() with CUSPARSE_SOLVE_POLICY_USE_LEVEL once. Then do bsrsv2_solve() with CUSPARSE_SOLVE_POLICY_NO_LEVEL in the first run and with CUSPARSE_SOLVE_POLICY_USE_LEVEL in the second run, and pick the faster one to perform the remaining iterations.

Function bsrsv2_solve() has the same behavior as csrsv2_solve(). That is, bsr2csr(bsrsv2(A)) = csrsv2(bsr2csr(A)). The numerical zero of csrsv2_solve() means there exists some zero A(j,j). The numerical zero of bsrsv2_solve() means there exists some block A(j,j) that is not invertible.

Function bsrsv2_solve() reports the first numerical zero, including a structural zero. No numerical zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if A(j,j) is not invertible for some j. The user needs to call cusparseXbsrsv2_zeroPivot() to know where the numerical zero is.
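As a rough host-side illustration of what the solve phase computes and of what a numerical zero means at block level, the sketch below performs a block forward substitution for L*y = alpha*x. It is not the library's algorithm. The function name is hypothetical, and the sketch assumes zero-based indices, column-major blocks, a non-unit diagonal, op(L) = L, real double-precision data, and that x and y do not overlap.

```c
#include <stddef.h>

/* CPU reference for L*y = alpha*x with L lower triangular in BSR form.
 * Returns -1 on success, or the block row i whose diagonal block is missing
 * or not invertible (y is then only partially computed). */
int bsr_lower_solve_ref(int mb, int blockDim, double alpha,
                        const double *bsrVal, const int *bsrRowPtr,
                        const int *bsrColInd, const double *x, double *y)
{
    int bd = blockDim;
    for (int i = 0; i < mb; ++i) {
        const double *diag = NULL;
        for (int r = 0; r < bd; ++r)
            y[i * bd + r] = alpha * x[i * bd + r];
        for (int p = bsrRowPtr[i]; p < bsrRowPtr[i + 1]; ++p) {
            int j = bsrColInd[p];
            const double *blk = bsrVal + (size_t)p * bd * bd;
            if (j == i) { diag = blk; continue; }
            if (j > i) continue;   /* this sketch ignores the upper part */
            /* subtract the contribution of already solved block row j */
            for (int r = 0; r < bd; ++r)
                for (int c = 0; c < bd; ++c)
                    y[i * bd + r] -= blk[r + c * bd] * y[j * bd + c];
        }
        /* forward substitution inside the lower triangular diagonal block */
        for (int r = 0; r < bd; ++r) {
            if (diag == NULL || diag[r + r * bd] == 0.0)
                return i;
            for (int c = 0; c < r; ++c)
                y[i * bd + r] -= diag[r + c * bd] * y[i * bd + c];
            y[i * bd + r] /= diag[r + r * bd];
        }
    }
    return -1;
}
```

For a lower triangular diagonal block, "not invertible" reduces to a zero on the block's own diagonal, which is the condition the sketch tests.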

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

For example, suppose L is a lower triangular matrix with unit diagonal. The following code solves L*y=x using level information.

// Suppose that L is m x m sparse matrix represented by BSR format,
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// L is lower triangular with unit diagonal.
// Assumption:
// - dimension of matrix L is m(=mb*blockDim),
// - matrix L has nnz(=nnzb*blockDim*blockDim) nonzero elements,
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of L on device memory,
// - d_x is right hand side vector on device memory.
// - d_y is solution vector on device memory.
// - d_x and d_y are of size m.
cusparseMatDescr_t descr = 0;
bsrsv2Info_t info = 0;
cusparseStatus_t status;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
const double alpha = 1.;
const cusparseSolvePolicy_t policy = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal, specified by parameter CUSPARSE_DIAG_TYPE_UNIT
//   (L may not have all diagonal elements.)
cusparseCreateMatDescr(&descr);
cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatFillMode(descr, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr, CUSPARSE_DIAG_TYPE_UNIT);

// step 2: create an empty info structure
cusparseCreateBsrsv2Info(&info);

// step 3: query how much memory is used in bsrsv2, and allocate the buffer
cusparseDbsrsv2_bufferSize(handle, dir, trans, mb, nnzb, descr,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info, &pBufferSize);

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis
cusparseDbsrsv2_analysis(handle, dir, trans, mb, nnzb, descr,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info, policy, pBuffer);
// L has unit diagonal, so no structural zero is reported.
status = cusparseXbsrsv2_zeroPivot(handle, info, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
    printf("L(%d,%d) is missing\n", structural_zero, structural_zero);
}

// step 5: solve L*y = x
cusparseDbsrsv2_solve(handle, dir, trans, mb, nnzb, &alpha, descr,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info,
    d_x, d_y, policy, pBuffer);
// L has unit diagonal, so no numerical zero is reported.
status = cusparseXbsrsv2_zeroPivot(handle, info, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
    printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: free resources
cudaFree(pBuffer);
cusparseDestroyBsrsv2Info(info);
cusparseDestroyMatDescr(descr);
cusparseDestroy(handle);

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation \(\text{op}(A)\).

mb

number of block rows and block columns of matrix A.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb \(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

x

<type> right-hand-side vector of size m.

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrsv2_bufferSize().

Output

y

<type> solution vector of size m.

See cusparseStatus_t for the description of the return status.

5.4.6.cusparseXbsrsv2_zeroPivot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsrsv2_zeroPivot(cusparseHandle_t handle, bsrsv2Info_t info,
    int* position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) is either a structural zero or a numerical zero (singular block). Otherwise position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsrsv2_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or numerical zero if the user already called bsrsv2_analysis() or bsrsv2_solve().

Output

position

if no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or is not invertible, position=j.

See cusparseStatus_t for the description of the return status.

5.4.7.cusparse<t>gemvi() [DEPRECATED]

>This routine will be removed in a future major release.

cusparseStatus_t
cusparseSgemvi_bufferSize(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, int nnz, int* pBufferSize)

cusparseStatus_t
cusparseDgemvi_bufferSize(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, int nnz, int* pBufferSize)

cusparseStatus_t
cusparseCgemvi_bufferSize(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, int nnz, int* pBufferSize)

cusparseStatus_t
cusparseZgemvi_bufferSize(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, int nnz, int* pBufferSize)

cusparseStatus_t
cusparseSgemvi(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, const float* alpha, const float* A, int lda, int nnz,
    const float* x, const int* xInd, const float* beta, float* y,
    cusparseIndexBase_t idxBase, void* pBuffer)

cusparseStatus_t
cusparseDgemvi(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, const double* alpha, const double* A, int lda, int nnz,
    const double* x, const int* xInd, const double* beta, double* y,
    cusparseIndexBase_t idxBase, void* pBuffer)

cusparseStatus_t
cusparseCgemvi(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, const cuComplex* alpha, const cuComplex* A, int lda, int nnz,
    const cuComplex* x, const int* xInd, const cuComplex* beta, cuComplex* y,
    cusparseIndexBase_t idxBase, void* pBuffer)

cusparseStatus_t
cusparseZgemvi(cusparseHandle_t handle, cusparseOperation_t transA,
    int m, int n, const cuDoubleComplex* alpha, const cuDoubleComplex* A,
    int lda, int nnz, const cuDoubleComplex* x, const int* xInd,
    const cuDoubleComplex* beta, cuDoubleComplex* y,
    cusparseIndexBase_t idxBase, void* pBuffer)

This function performs the matrix-vector operation

\[\text{y} = \alpha \ast \text{op}(A) \ast \text{x} + \beta \ast \text{y}\]

A is an \(m \times n\) dense matrix; x is a sparse vector defined in a sparse storage format by the two arrays x (values) and xInd (indices), each of length nnz; y is a dense vector; \(\alpha\) and \(\beta\) are scalars; and

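\[\begin{split}\text{op}(A) = \begin{cases} A & \text{(non-transpose)} \\ A^{T} & \text{(transpose)} \\ A^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

selected by transA = CUSPARSE_OPERATION_NON_TRANSPOSE, CUSPARSE_OPERATION_TRANSPOSE, or CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE, respectively.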

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

The function cusparse<t>gemvi_bufferSize() returns the size of the buffer used in cusparse<t>gemvi().
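A minimal host-side sketch of the gemvi operation may help when validating results. It is only a reference under stated assumptions, not the library's implementation: the name gemvi_ref is hypothetical, xVal stands for the values array that the API receives as x, the dense matrix is taken to be column-major with leading dimension lda, data are real double precision, and op(A) = A.

```c
#include <stddef.h>

/* CPU reference for y = alpha*A*x + beta*y with a dense m x n matrix A
 * (column-major, leading dimension lda) and a sparse vector x given by
 * nnz values xVal[k] at positions xInd[k]. idxBase is 0 or 1. */
void gemvi_ref(int m, double alpha, const double *A, int lda,
               int nnz, const double *xVal, const int *xInd,
               double beta, double *y, int idxBase)
{
    /* when beta is zero, the old content of y is never read */
    for (int i = 0; i < m; ++i)
        y[i] = (beta == 0.0) ? 0.0 : beta * y[i];

    /* each stored entry of x selects one column of A */
    for (int k = 0; k < nnz; ++k) {
        int j = xInd[k] - idxBase;
        for (int i = 0; i < m; ++i)
            y[i] += alpha * xVal[k] * A[i + (size_t)j * lda];
    }
}
```

Only the nnz columns of A selected by xInd are read, which is where the saving over a dense matrix-vector product comes from.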

Input

handle

Handle to the cuSPARSE library context.

transA

The operation \(\text{op}(A)\).

m

Number of rows of matrix A.

n

Number of columns of matrix A.

alpha

<type> scalar used for multiplication.

A

The pointer to dense matrix A.

lda

Size of the leading dimension of A.

nnz

Number of nonzero elements of vector x.

x

<type> sparse vector of nnz elements of size n if \(\text{op}(A)=A\), and size m if \(\text{op}(A)=A^{T}\).

xInd

Indices of non-zero values in x.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

y

<type> dense vector of m elements if \(\text{op}(A)=A\), and n elements if \(\text{op}(A)=A^{T}\).

idxBase

0 or 1, for 0 based or 1 based indexing, respectively.

pBufferSize

Number of elements needed for the buffer used in cusparse<t>gemvi().

pBuffer

Working space buffer.

Output

y

<type> updated dense vector.

See cusparseStatus_t for the description of the return status.

5.5.cuSPARSE Level 3 Function Reference

This chapter describes sparse linear algebra functions that perform operations between sparse and (usually tall) dense matrices.

5.5.1.cusparse<t>bsrmm() [DEPRECATED]

>This routine will be removed in a future major release. Use cusparseSpMM() with BSR matrices instead.

cusparseStatus_t
cusparseSbsrmm(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, cusparseOperation_t transB,
    int mb, int n, int kb, int nnzb, const float* alpha,
    const cusparseMatDescr_t descrA, const float* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    const float* B, int ldb, const float* beta, float* C, int ldc)

cusparseStatus_t
cusparseDbsrmm(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, cusparseOperation_t transB,
    int mb, int n, int kb, int nnzb, const double* alpha,
    const cusparseMatDescr_t descrA, const double* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    const double* B, int ldb, const double* beta, double* C, int ldc)

cusparseStatus_t
cusparseCbsrmm(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, cusparseOperation_t transB,
    int mb, int n, int kb, int nnzb, const cuComplex* alpha,
    const cusparseMatDescr_t descrA, const cuComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    const cuComplex* B, int ldb, const cuComplex* beta, cuComplex* C, int ldc)

cusparseStatus_t
cusparseZbsrmm(cusparseHandle_t handle, cusparseDirection_t dirA,
    cusparseOperation_t transA, cusparseOperation_t transB,
    int mb, int n, int kb, int nnzb, const cuDoubleComplex* alpha,
    const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA,
    const int* bsrRowPtrA, const int* bsrColIndA, int blockDim,
    const cuDoubleComplex* B, int ldb, const cuDoubleComplex* beta,
    cuDoubleComplex* C, int ldc)

This function performs one of the following matrix-matrix operations:

\[C = \alpha \ast \text{op}(A) \ast \text{op}(B) + \beta \ast C\]

A is an \(mb \times kb\) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; B and C are dense matrices; \(\alpha\) and \(\beta\) are scalars; and

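\[\begin{split}\text{op}(A) = \begin{cases} A & \text{(non-transpose)} \\ A^{T} & \text{(transpose)} \\ A^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

and

\[\begin{split}\text{op}(B) = \begin{cases} B & \text{(non-transpose)} \\ B^{T} & \text{(transpose)} \\ B^{H} & \text{(conjugate transpose)} \\ \end{cases}\end{split}\]

selected by transA and transB, respectively.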

The function has the following limitations:

  • only CUSPARSE_MATRIX_TYPE_GENERAL matrix type is supported

  • only blockDim>1 is supported

  • if blockDim ≤ 4, then max(mb)/max(n) = 524,272

  • if 4 < blockDim ≤ 8, then max(mb) = 524,272, max(n) = 262,136

  • if blockDim > 8, then m < 65,535 and max(n) = 262,136

The motivation for transpose(B) is to improve memory access of matrix B. The computational pattern of A*transpose(B) with matrix B in column-major order is equivalent to A*B with matrix B in row-major order.
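The storage equivalence stated above can be verified with a few lines of host code. The helper names below are hypothetical and the sketch assumes real double-precision data; it shows that the column-major transpose Bt (with leading dimension n) occupies memory exactly as a row-major copy of B would.

```c
#include <stddef.h>

/* Offset of element (i,j) in a column-major matrix with leading dimension ld. */
size_t idx_col_major(int i, int j, int ld)
{
    return (size_t)i + (size_t)j * (size_t)ld;
}

/* Offset of element (i,j) in a row-major matrix with leading dimension ld. */
size_t idx_row_major(int i, int j, int ld)
{
    return (size_t)i * (size_t)ld + (size_t)j;
}

/* Bt := transpose(B), both column-major.
 * B is k x n with ldb >= k; Bt is n x k with ldbt >= n. */
void transpose_col_major(int k, int n, const double *B, int ldb,
                         double *Bt, int ldbt)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < k; ++i)
            Bt[idx_col_major(j, i, ldbt)] = B[idx_col_major(i, j, ldb)];
}
```

After the transposition, Bt[idx_row_major(i, j, ldbt)] equals B(i,j): walking along a row of B touches consecutive addresses in Bt.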

In practice, no operation in an iterative solver or eigenvalue solver uses A*transpose(B). However, we can perform A*transpose(transpose(B)), which is the same as A*B. For example, suppose A is mb*kb, B is k*n, and C is m*n. The following code shows usage of cusparseDbsrmm().

// A is mb*kb, B is k*n and C is m*n
const int m = mb*blockSize;
const int k = kb*blockSize;
const int ldb_B = k; // leading dimension of B
const int ldc   = m; // leading dimension of C
// perform C:=alpha*A*B + beta*C
cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseDbsrmm(cusparse_handle,
    CUSPARSE_DIRECTION_COLUMN,
    CUSPARSE_OPERATION_NON_TRANSPOSE,
    CUSPARSE_OPERATION_NON_TRANSPOSE,
    mb, n, kb, nnzb, alpha,
    descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockSize,
    B, ldb_B,
    beta, C, ldc);

Instead of using A*B, our proposal is to transpose B to Bt by first calling cublas<t>geam(), and then to perform A*transpose(Bt).

// step 1: Bt := transpose(B)
const int m = mb*blockSize;
const int k = kb*blockSize;
double *Bt;
const int ldb_Bt = n; // leading dimension of Bt
cudaMalloc((void**)&Bt, sizeof(double)*ldb_Bt*k);
double one  = 1.0;
double zero = 0.0;
cublasSetPointerMode(cublas_handle, CUBLAS_POINTER_MODE_HOST);
cublasDgeam(cublas_handle, CUBLAS_OP_T, CUBLAS_OP_T,
    n, k, &one, B, ldb_B, &zero, B, ldb_B, Bt, ldb_Bt);

// step 2: perform C:=alpha*A*transpose(Bt) + beta*C
cusparseDbsrmm(cusparse_handle,
    CUSPARSE_DIRECTION_COLUMN,
    CUSPARSE_OPERATION_NON_TRANSPOSE,
    CUSPARSE_OPERATION_TRANSPOSE,
    mb, n, kb, nnzb, alpha,
    descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockSize,
    Bt, ldb_Bt,
    beta, C, ldc);

bsrmm() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transB

the operation op(B).

mb

number of block rows of sparse matrix A.

n

number of columns of dense matrix op(B) and C.

kb

number of block columns of sparse matrix A.

nnzb

number of non-zero blocks of sparse matrix A.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb \(+ 1\) elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb \(( =\) bsrRowPtrA(mb) \(-\) bsrRowPtrA(0) \()\) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

B

array of dimensions (ldb, n) if op(B)=B and (ldb, k) otherwise.

ldb

leading dimension of B. If op(B)=B, it must be at least \(\max\text{(1,\ k)}\). If op(B)!=B, it must be at least max(1,n).

beta

<type> scalar used for multiplication. If beta is zero, C does not have to be a valid input.

C

array of dimensions (ldc, n).

ldc

leading dimension of C. It must be at least \(\max\text{(1,\ m)}\) if op(A)=A and at least \(\max\text{(1,\ k)}\) otherwise.

Output

C

<type> updated array of dimensions (ldc, n).

See cusparseStatus_t for the description of the return status.
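The leading-dimension rules for B and C given in the parameter list above can be collected into two small helpers. These functions are hypothetical conveniences, not cuSPARSE calls; the integer flags stand in for op(B) != B and op(A) != A, and m = mb*blockDim, k = kb*blockDim as in the examples.

```c
/* Larger of two ints. */
static int max_int(int a, int b)
{
    return a > b ? a : b;
}

/* Smallest valid ldb: max(1,k) if op(B) = B, max(1,n) otherwise. */
int bsrmm_min_ldb(int transB_is_transpose, int k, int n)
{
    return transB_is_transpose ? max_int(1, n) : max_int(1, k);
}

/* Smallest valid ldc: max(1,m) if op(A) = A, max(1,k) otherwise. */
int bsrmm_min_ldc(int transA_is_transpose, int m, int k)
{
    return transA_is_transpose ? max_int(1, k) : max_int(1, m);
}
```

A caller could compare its ldb and ldc against these minimums before invoking bsrmm() to catch undersized leading dimensions on the host.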

5.5.2.cusparse<t>bsrsm2_bufferSize() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_tcusparseSbsrsm2_bufferSize(cusparseHandle_thandle,cusparseDirection_tdirA,cusparseOperation_ttransA,cusparseOperation_ttransX,intmb,intn,intnnzb,constcusparseMatDescr_tdescrA,float*bsrSortedValA,constint*bsrSortedRowPtrA,constint*bsrSortedColIndA,intblockDim,bsrsm2Info_tinfo,int*pBufferSizeInBytes)cusparseStatus_tcusparseDbsrsm2_bufferSize(cusparseHandle_thandle,cusparseDirection_tdirA,cusparseOperation_ttransA,cusparseOperation_ttransX,intmb,intn,intnnzb,constcusparseMatDescr_tdescrA,double*bsrSortedValA,constint*bsrSortedRowPtrA,constint*bsrSortedColIndA,intblockDim,bsrsm2Info_tinfo,int*pBufferSizeInBytes)cusparseStatus_tcusparseCbsrsm2_bufferSize(cusparseHandle_thandle,cusparseDirection_tdirA,cusparseOperation_ttransA,cusparseOperation_ttransX,intmb,intn,intnnzb,constcusparseMatDescr_tdescrA,cuComplex*bsrSortedValA,constint*bsrSortedRowPtrA,constint*bsrSortedColIndA,intblockDim,bsrsm2Info_tinfo,int*pBufferSizeInBytes)cusparseStatus_tcusparseZbsrsm2_bufferSize(cusparseHandle_thandle,cusparseDirection_tdirA,cusparseOperation_ttransA,cusparseOperation_ttransX,intmb,intn,intnnzb,constcusparseMatDescr_tdescrA,cuDoubleComplex*bsrSortedValA,constint*bsrSortedRowPtrA,constint*bsrSortedColIndA,intblockDim,bsrsm2Info_tinfo,int*pBufferSizeInBytes)

This function returns the size of the buffer used in bsrsm2(), a new sparse triangular linear system op(A)*op(X)=\(\alpha\)op(B).

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; B and X are the right-hand-side and the solution matrices; \(\alpha\) is a scalar; and

[image9: definition of op(A)]

Although there are six combinations in terms of parameter trans and the upper (and lower) triangular part of A, bsrsm2_bufferSize() returns the maximum size of the buffer among these combinations. The buffer size depends on the dimensions mb and blockDim and on the number of nonzero blocks of the matrix, nnzb. If the user changes the matrix, it is necessary to call bsrsm2_bufferSize() again to get the correct buffer size; otherwise, a segmentation fault may occur.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transX

the operation op(X).

mb

number of block rows of matrix A.

n

number of columns of matrix op(B) and op(X).

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; larger than zero.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsrsm2_analysis() and bsrsm2_solve().

See cusparseStatus_t for the description of the return status.

5.5.3.cusparse<t>bsrsm2_analysis() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsm2_analysis(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const cusparseMatDescr_t descrA,
    const float* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseDbsrsm2_analysis(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const cusparseMatDescr_t descrA,
    const double* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseCbsrsm2_analysis(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const cusparseMatDescr_t descrA,
    const cuComplex* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseZbsrsm2_analysis(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const cusparseMatDescr_t descrA,
    const cuDoubleComplex* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

This function performs the analysis phase of bsrsm2(), a new sparse triangular linear system op(A)*op(X)=\(\alpha\)op(B).

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; B and X are the right-hand-side and the solution matrices; \(\alpha\) is a scalar; and

[image9: definition of op(A)]

and

[image5: definition of op(X)]

and op(B) and op(X) are the same operation.

The block of BSR format is of size blockDim*blockDim, stored in column-major or row-major order as determined by parameter dirA, which is either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

It is expected that this function will be executed only once for a given matrix and a particular operation type.

This function requires the buffer size returned by bsrsm2_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.
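The 128-byte requirement on pBuffer can be checked on the host before calling the routine. The following is a minimal sketch in plain C; the helper names is_aligned_128() and align_up_128() are hypothetical and are not part of cuSPARSE. Pointers returned directly by cudaMalloc() are aligned to at least 256 bytes, so a check like this mostly matters when pBuffer is carved out of a larger allocation.

```c
#include <stddef.h>
#include <stdint.h>

/* Alignment, in bytes, that the text above requires for pBuffer. */
#define PBUFFER_ALIGNMENT 128u

/* Hypothetical helper: returns 1 if ptr sits on a 128-byte boundary. */
static int is_aligned_128(const void *ptr)
{
    return ((uintptr_t)ptr % PBUFFER_ALIGNMENT) == 0;
}

/* Hypothetical helper: rounds a byte offset up to the next multiple of 128,
   for example when sub-allocating pBuffer from an already aligned pool. */
static size_t align_up_128(size_t offset)
{
    return (offset + PBUFFER_ALIGNMENT - 1u) / PBUFFER_ALIGNMENT
           * PBUFFER_ALIGNMENT;
}
```

A sub-allocator would add align_up_128(offset) to the aligned base address of the pool and pass the result as pBuffer.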

Function bsrsm2_analysis() reports a structural zero and computes the level information stored in the opaque structure info. The level information can extract more parallelism during a triangular solve. However, bsrsm2_solve() can be done without level information. To disable level information, the user needs to specify the policy of the triangular solver as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrsm2_analysis() always reports the first structural zero, even if the parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. Besides, no structural zero is reported if CUSPARSE_DIAG_TYPE_UNIT is specified, even if block A(j,j) is missing for some j. The user must call cusparseXbsrsm2_zeroPivot() to know where the structural zero is.

If bsrsm2_analysis() reports a structural zero, the solve will return a numerical zero in the same position as the structural zero, but the result X is meaningless.
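To make the notion of a structural zero concrete, the sketch below scans a BSR pattern on the host and returns the first block row whose diagonal block is not stored. It is an illustration only, not the algorithm cuSPARSE uses internally; the function name and the unit_diag flag (standing in for CUSPARSE_DIAG_TYPE_UNIT) are assumptions made for this example.

```c
/* Hypothetical host-side check. Returns the first block row j, expressed in
   the matrix's own index base (0 or 1), whose diagonal block A(j,j) is not
   stored, or -1 if every diagonal block is present. When unit_diag is
   nonzero, no structural zero is reported, mirroring the text above. */
static int first_structural_zero_bsr(int mb, int index_base, int unit_diag,
                                     const int *bsrRowPtr,
                                     const int *bsrColInd)
{
    if (unit_diag) {
        return -1;
    }
    for (int j = 0; j < mb; ++j) {
        int found = 0;
        /* Walk the stored blocks of block row j. */
        for (int p = bsrRowPtr[j] - index_base;
             p < bsrRowPtr[j + 1] - index_base; ++p) {
            if (bsrColInd[p] - index_base == j) {
                found = 1;
                break;
            }
        }
        if (!found) {
            return j + index_base;
        }
    }
    return -1;
}
```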

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transX

the operation op(B) and op(X).

mb

number of block rows of matrix A.

n

number of columns of matrix op(B) and op(X).

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; larger than zero.

info

structure initialized using cusparseCreateBsrsm2Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrsm2_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.5.4.cusparse<t>bsrsm2_solve() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrsm2_solve(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const float* alpha,
    const cusparseMatDescr_t descrA,
    const float* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    const float* B, int ldb,
    float* X, int ldx,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseDbsrsm2_solve(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const double* alpha,
    const cusparseMatDescr_t descrA,
    const double* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    const double* B, int ldb,
    double* X, int ldx,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseCbsrsm2_solve(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const cuComplex* alpha,
    const cusparseMatDescr_t descrA,
    const cuComplex* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    const cuComplex* B, int ldb,
    cuComplex* X, int ldx,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseZbsrsm2_solve(cusparseHandle_t handle,
    cusparseDirection_t dirA,
    cusparseOperation_t transA,
    cusparseOperation_t transX,
    int mb, int n, int nnzb,
    const cuDoubleComplex* alpha,
    const cusparseMatDescr_t descrA,
    const cuDoubleComplex* bsrSortedVal,
    const int* bsrSortedRowPtr,
    const int* bsrSortedColInd,
    int blockDim,
    bsrsm2Info_t info,
    const cuDoubleComplex* B, int ldb,
    cuDoubleComplex* X, int ldx,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

This function performs the solve phase of the solution of a sparse triangular linear system:

\[\text{op}(A) \ast \text{op(X)} = \alpha \ast \text{op(B)}\]

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA; B and X are the right-hand-side and the solution matrices; \(\alpha\) is a scalar; and

[image9: definition of op(A)]

and

[image6: definition of op(X)]

Only op(A)=A is supported.

op(B) and op(X) must be performed in the same way. In other words, if op(B)=B, then op(X)=X.

The block of BSR format is of size blockDim*blockDim, stored in column-major or row-major order as determined by parameter dirA, which is either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrsm2_solve() can support an arbitrary blockDim.

This function may be executed multiple times for a given matrix and a particular operation type.

This function requires the buffer size returned by bsrsm2_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrsm2_solve() can be done without level information, the user still needs to be aware of consistency. If bsrsm2_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrsm2_solve() can be run with or without levels. On the other hand, if bsrsm2_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrsm2_solve() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.
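The consistency rule between the analysis policy and the solve policy can be summarized as a small predicate. The sketch below uses a stand-in enum so that it compiles without cusparse.h; the type policy_t, its enumerators, and the function name are hypothetical and only mirror CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

```c
/* Stand-ins for the two cuSPARSE solve policies (hypothetical names). */
typedef enum {
    POLICY_NO_LEVEL = 0,
    POLICY_USE_LEVEL = 1
} policy_t;

/* Returns 1 if a solve with solve_policy may follow an analysis done with
   analysis_policy, and 0 if the combination is the one the text above says
   is rejected with CUSPARSE_STATUS_INVALID_VALUE. */
static int solve_policy_is_consistent(policy_t analysis_policy,
                                      policy_t solve_policy)
{
    if (analysis_policy == POLICY_USE_LEVEL) {
        return 1; /* levels were computed: the solve may use them or not */
    }
    /* No levels were computed, so the solve cannot ask for them. */
    return solve_policy == POLICY_NO_LEVEL;
}
```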

Function bsrsm2_solve() has the same behavior as bsrsv2_solve(), reporting the first numerical zero, including a structural zero. The user must call cusparseXbsrsm2_zeroPivot() to know where the numerical zero is.

The motivation of transpose(X) is to improve the memory access of matrix X. The computational pattern of transpose(X) with matrix X in column-major order is equivalent to X with matrix X in row-major order.
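The equivalence between transpose(X) in column-major order and X in row-major order comes down to index arithmetic. The sketch below is plain C with hypothetical helper names; it shows that element X(i,j) lands at the same offset under both views when the same leading dimension ldx is used.

```c
#include <stddef.h>

/* Offset of element (row, col) in a column-major array with leading
   dimension ld. */
static size_t idx_col_major(int row, int col, int ld)
{
    return (size_t)row + (size_t)col * (size_t)ld;
}

/* Offset of element (row, col) in a row-major array with row stride ld. */
static size_t idx_row_major(int row, int col, int ld)
{
    return (size_t)row * (size_t)ld + (size_t)col;
}

/* Offset of X(i,j) when the buffer stores transpose(X) in column-major
   order: X(i,j) is element (j,i) of transpose(X). */
static size_t idx_x_in_transposed(int i, int j, int ldx)
{
    return idx_col_major(j, i, ldx);
}
```

For any i, j, and ldx, idx_x_in_transposed(i, j, ldx) equals idx_row_major(i, j, ldx), so consecutive entries of one row of X are contiguous in memory.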

In-place is supported and requires that B and X point to the same memory block, and ldb=ldx.

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

transA

the operation op(A).

transX

the operation op(B) and op(X).

mb

number of block rows of matrix A.

n

number of columns of matrix op(B) and op(X).

nnzb

number of nonzero blocks of matrix A.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL, while the supported diagonal types are CUSPARSE_DIAG_TYPE_UNIT and CUSPARSE_DIAG_TYPE_NON_UNIT.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; larger than zero.

info

structure initialized using cusparseCreateBsrsm2Info().

B

<type> right-hand-side array.

ldb

leading dimension of B. If op(B)=B, ldb>=(mb*blockDim); otherwise, ldb>=n.

ldx

leading dimension of X. If op(X)=X, ldx>=(mb*blockDim); otherwise, ldx>=n.

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrsm2_bufferSize().

Output

X

<type> solution array with leading dimension ldx.

See cusparseStatus_t for the description of the return status.

5.5.5.cusparseXbsrsm2_zeroPivot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsrsm2_zeroPivot(cusparseHandle_t handle,
    bsrsm2Info_t info,
    int* position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) is either a structural zero or a numerical zero (singular block). Otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.
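Because the reported position follows the index base of the matrix, host code that indexes C arrays usually needs to normalize it. The helper below is a hypothetical sketch, not a cuSPARSE function; index_base is assumed to be 0 for CUSPARSE_INDEX_BASE_ZERO and 1 for CUSPARSE_INDEX_BASE_ONE.

```c
/* Hypothetical helper: converts the position reported by the zero-pivot
   query into a zero-based block-row index. The value -1 (no zero pivot)
   is passed through unchanged. */
static int zero_pivot_to_zero_based(int position, int index_base)
{
    if (position == -1) {
        return -1;
    }
    return position - index_base;
}
```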

Function cusparseXbsrsm2_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or a numerical zero if the user already called bsrsm2_analysis() or bsrsm2_solve().

Output

position

if no structural or numerical zero, position is -1; otherwise, if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status.

5.6.cuSPARSE Extra Function Reference

This chapter describes the extra routines used to manipulate sparse matrices.

5.6.1.cusparse<t>csrgeam2()

cusparseStatus_t
cusparseScsrgeam2_bufferSizeExt(cusparseHandle_t handle,
    int m, int n,
    const float* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const float* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const float* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const float* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    const float* csrSortedValC,
    const int* csrSortedRowPtrC,
    const int* csrSortedColIndC,
    size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsrgeam2_bufferSizeExt(cusparseHandle_t handle,
    int m, int n,
    const double* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const double* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const double* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const double* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    const double* csrSortedValC,
    const int* csrSortedRowPtrC,
    const int* csrSortedColIndC,
    size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsrgeam2_bufferSizeExt(cusparseHandle_t handle,
    int m, int n,
    const cuComplex* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const cuComplex* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const cuComplex* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const cuComplex* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    const cuComplex* csrSortedValC,
    const int* csrSortedRowPtrC,
    const int* csrSortedColIndC,
    size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsrgeam2_bufferSizeExt(cusparseHandle_t handle,
    int m, int n,
    const cuDoubleComplex* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const cuDoubleComplex* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const cuDoubleComplex* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const cuDoubleComplex* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    const cuDoubleComplex* csrSortedValC,
    const int* csrSortedRowPtrC,
    const int* csrSortedColIndC,
    size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseXcsrgeam2Nnz(cusparseHandle_t handle,
    int m, int n,
    const cusparseMatDescr_t descrA, int nnzA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const cusparseMatDescr_t descrB, int nnzB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    int* csrSortedRowPtrC,
    int* nnzTotalDevHostPtr,
    void* workspace)

cusparseStatus_t
cusparseScsrgeam2(cusparseHandle_t handle,
    int m, int n,
    const float* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const float* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const float* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const float* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    float* csrSortedValC,
    int* csrSortedRowPtrC,
    int* csrSortedColIndC,
    void* pBuffer)

cusparseStatus_t
cusparseDcsrgeam2(cusparseHandle_t handle,
    int m, int n,
    const double* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const double* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const double* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const double* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    double* csrSortedValC,
    int* csrSortedRowPtrC,
    int* csrSortedColIndC,
    void* pBuffer)

cusparseStatus_t
cusparseCcsrgeam2(cusparseHandle_t handle,
    int m, int n,
    const cuComplex* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const cuComplex* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const cuComplex* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const cuComplex* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    cuComplex* csrSortedValC,
    int* csrSortedRowPtrC,
    int* csrSortedColIndC,
    void* pBuffer)

cusparseStatus_t
cusparseZcsrgeam2(cusparseHandle_t handle,
    int m, int n,
    const cuDoubleComplex* alpha,
    const cusparseMatDescr_t descrA, int nnzA,
    const cuDoubleComplex* csrSortedValA,
    const int* csrSortedRowPtrA,
    const int* csrSortedColIndA,
    const cuDoubleComplex* beta,
    const cusparseMatDescr_t descrB, int nnzB,
    const cuDoubleComplex* csrSortedValB,
    const int* csrSortedRowPtrB,
    const int* csrSortedColIndB,
    const cusparseMatDescr_t descrC,
    cuDoubleComplex* csrSortedValC,
    int* csrSortedRowPtrC,
    int* csrSortedColIndC,
    void* pBuffer)

This function performs the following matrix-matrix operation:

\[C = \alpha \ast A + \beta \ast B\]

where A, B, and C are \(m \times n\) sparse matrices (defined in CSR storage format by the three arrays csrValA|csrValB|csrValC, csrRowPtrA|csrRowPtrB|csrRowPtrC, and csrColIndA|csrColIndB|csrColIndC respectively), and \(\alpha\) and \(\beta\) are scalars. Since A and B have different sparsity patterns, cuSPARSE adopts a two-step approach to complete sparse matrix C. In the first step, the user allocates csrRowPtrC of m+1 elements and uses function cusparseXcsrgeam2Nnz() to determine csrRowPtrC and the total number of nonzero elements. In the second step, the user gathers nnzC (number of nonzero elements of matrix C) from either (nnzC=*nnzTotalDevHostPtr) or (nnzC=csrRowPtrC(m)-csrRowPtrC(0)) and allocates csrValC, csrColIndC of nnzC elements respectively, then finally calls function cusparse[S|D|C|Z]csrgeam2() to complete matrix C.

The general procedure is as follows:

int baseC, nnzC;
/* alpha and nnzTotalDevHostPtr point to host memory */
size_t bufferSizeInBytes;
char *buffer = NULL;
int *nnzTotalDevHostPtr = &nnzC;
cusparseSetPointerMode(handle, CUSPARSE_POINTER_MODE_HOST);
cudaMalloc((void**)&csrRowPtrC, sizeof(int)*(m+1));
/* prepare buffer */
cusparseScsrgeam2_bufferSizeExt(handle, m, n,
    alpha,
    descrA, nnzA, csrValA, csrRowPtrA, csrColIndA,
    beta,
    descrB, nnzB, csrValB, csrRowPtrB, csrColIndB,
    descrC, csrValC, csrRowPtrC, csrColIndC,
    &bufferSizeInBytes);
cudaMalloc((void**)&buffer, sizeof(char)*bufferSizeInBytes);
cusparseXcsrgeam2Nnz(handle, m, n,
    descrA, nnzA, csrRowPtrA, csrColIndA,
    descrB, nnzB, csrRowPtrB, csrColIndB,
    descrC, csrRowPtrC, nnzTotalDevHostPtr,
    buffer);
if (NULL != nnzTotalDevHostPtr) {
    nnzC = *nnzTotalDevHostPtr;
} else {
    cudaMemcpy(&nnzC, csrRowPtrC+m, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&baseC, csrRowPtrC, sizeof(int), cudaMemcpyDeviceToHost);
    nnzC -= baseC;
}
cudaMalloc((void**)&csrColIndC, sizeof(int)*nnzC);
cudaMalloc((void**)&csrValC, sizeof(float)*nnzC);
cusparseScsrgeam2(handle, m, n,
    alpha,
    descrA, nnzA, csrValA, csrRowPtrA, csrColIndA,
    beta,
    descrB, nnzB, csrValB, csrRowPtrB, csrColIndB,
    descrC, csrValC, csrRowPtrC, csrColIndC,
    buffer);
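To see why two passes are needed, it can help to look at a host-side reference of the same computation. The sketch below is plain C and is not how cuSPARSE implements csrgeam2(); the function names csr_add_count() and csr_add_fill() are hypothetical. It assumes zero-based indexing and column indices sorted within each row. The first pass only merges the sparsity patterns to produce rowPtrC and nnzC, and the second pass fills the column indices and values once the output arrays have been sized.

```c
/* Pass 1: merge the patterns of A and B row by row. Fills rowPtrC with
   m+1 zero-based offsets and returns nnzC. */
static int csr_add_count(int m,
                         const int *rowPtrA, const int *colIndA,
                         const int *rowPtrB, const int *colIndB,
                         int *rowPtrC)
{
    int nnz = 0;
    for (int i = 0; i < m; ++i) {
        int a = rowPtrA[i], aEnd = rowPtrA[i + 1];
        int b = rowPtrB[i], bEnd = rowPtrB[i + 1];
        rowPtrC[i] = nnz;
        while (a < aEnd || b < bEnd) {
            if (b >= bEnd || (a < aEnd && colIndA[a] < colIndB[b])) {
                ++a;            /* column present only in A */
            } else if (a >= aEnd || colIndB[b] < colIndA[a]) {
                ++b;            /* column present only in B */
            } else {
                ++a; ++b;       /* column present in both */
            }
            ++nnz;
        }
    }
    rowPtrC[m] = nnz;
    return nnz;
}

/* Pass 2: fill colIndC and valC for C = alpha*A + beta*B, using the
   rowPtrC computed by csr_add_count(). */
static void csr_add_fill(int m,
                         float alpha, const float *valA,
                         const int *rowPtrA, const int *colIndA,
                         float beta, const float *valB,
                         const int *rowPtrB, const int *colIndB,
                         const int *rowPtrC, int *colIndC, float *valC)
{
    for (int i = 0; i < m; ++i) {
        int a = rowPtrA[i], aEnd = rowPtrA[i + 1];
        int b = rowPtrB[i], bEnd = rowPtrB[i + 1];
        int c = rowPtrC[i];
        while (a < aEnd || b < bEnd) {
            if (b >= bEnd || (a < aEnd && colIndA[a] < colIndB[b])) {
                colIndC[c] = colIndA[a];
                valC[c] = alpha * valA[a];
                ++a;
            } else if (a >= aEnd || colIndB[b] < colIndA[a]) {
                colIndC[c] = colIndB[b];
                valC[c] = beta * valB[b];
                ++b;
            } else {
                colIndC[c] = colIndA[a];
                valC[c] = alpha * valA[a] + beta * valB[b];
                ++a; ++b;
            }
            ++c;
        }
    }
}
```

In this sketch the pattern of C is the union of the patterns of A and B and does not depend on the values of alpha and beta, so an entry is stored even when its value happens to be zero.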

Several comments on csrgeam2():

  • The other three combinations, NT, TN, and TT, are not supported by cuSPARSE. In order to do any one of the three, the user should use the routine csr2csc() to convert \(A\) | \(B\) to \(A^{T}\) | \(B^{T}\).

  • Only CUSPARSE_MATRIX_TYPE_GENERAL is supported. If either A or B is symmetric or Hermitian, then the user must extend the matrix to a full one and reconfigure the MatrixType field of the descriptor to CUSPARSE_MATRIX_TYPE_GENERAL.

  • If the sparsity pattern of matrix C is known, the user can skip the call to function cusparseXcsrgeam2Nnz(). For example, suppose that the user has an iterative algorithm which would update A and B iteratively but keep the sparsity patterns. The user can call function cusparseXcsrgeam2Nnz() once to set up the sparsity pattern of C, then call function cusparse[S|D|C|Z]csrgeam2() only for each iteration.

  • The pointers alpha and beta must be valid.

  • When alpha or beta is zero, it is not considered a special case by cuSPARSE. The sparsity pattern of C is independent of the value of alpha and beta. If the user wants \(C = 0 \times A + 1 \times B^{T}\), then csr2csc() is better than csrgeam2().

  • csrgeam2() is the same as csrgeam() except csrgeam2() needs an explicit buffer, whereas csrgeam() allocates the buffer internally.

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

m

number of rows of sparse matrices A, B, and C.

n

number of columns of sparse matrices A, B, and C.

alpha

<type> scalar used for multiplication.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL only.

nnzA

number of nonzero elements of sparse matrix A.

csrValA

<type> array of nnzA (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnzA (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

beta

<type> scalar used for multiplication. If beta is zero, y does not have to be a valid input.

descrB

the descriptor of matrix B. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL only.

nnzB

number of nonzero elements of sparse matrix B.

csrValB

<type> array of nnzB (= csrRowPtrB(m) - csrRowPtrB(0)) nonzero elements of matrix B.

csrRowPtrB

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndB

integer array of nnzB (= csrRowPtrB(m) - csrRowPtrB(0)) column indices of the nonzero elements of matrix B.

descrC

the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL only.

Output

csrValC

<type> array of nnzC (= csrRowPtrC(m) - csrRowPtrC(0)) nonzero elements of matrix C.

csrRowPtrC

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndC

integer array of nnzC (= csrRowPtrC(m) - csrRowPtrC(0)) column indices of the nonzero elements of matrix C.

nnzTotalDevHostPtr

total number of nonzero elements in device or host memory. It is equal to (csrRowPtrC(m)-csrRowPtrC(0)).

See cusparseStatus_t for the description of the return status.

5.7.cuSPARSE Preconditioners Reference

This chapter describes the routines that implement different preconditioners.

5.7.1.Incomplete Cholesky Factorization: level 0 [DEPRECATED]

Different algorithms for ic0 are discussed in this section.

5.7.1.1.cusparse<t>csric02_bufferSize() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsric02_bufferSize(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    float* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    int* pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsric02_bufferSize(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    double* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    int* pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsric02_bufferSize(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    cuComplex* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    int* pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsric02_bufferSize(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    cuDoubleComplex* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    int* pBufferSizeInBytes)

This function returns the size of the buffer used in computing the incomplete-Cholesky factorization with \(0\) fill-in and no pivoting:

\[A \approx LL^{H}\]

A is an \(m \times m\) sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

The buffer size depends on the dimension m and on nnz, the number of nonzeros of the matrix. If the user changes the matrix, it is necessary to call csric02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in csric02_analysis() and csric02().

See cusparseStatus_t for the description of the return status.

5.7.1.2.cusparse<t>csric02_analysis() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsric02_analysis(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    const float* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseDcsric02_analysis(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    const double* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseCcsric02_analysis(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    const cuComplex* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseZcsric02_analysis(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    const cuDoubleComplex* csrValA,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

This function performs the analysis phase of the incomplete-Cholesky factorization with \(0\) fill-in and no pivoting:

\[A \approx LL^{H}\]

A is an \(m \times m\) sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

This function requires the buffer size returned by csric02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csric02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete Cholesky factorization. However, csric02() can be done without level information. To disable level information, the user must specify the policy of csric02_analysis() and csric02() as CUSPARSE_SOLVE_POLICY_NO_LEVEL.
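The idea behind level information can be illustrated with a generic level-scheduling pass over the lower-triangular pattern. The sketch below is an assumption-laden illustration, not the analysis cuSPARSE performs internally: it assumes zero-based CSR, the function name compute_levels() is hypothetical, and row i is taken to depend on every row j < i for which entry (i, j) is stored. Rows that share a level have no dependencies on each other and could be processed in parallel.

```c
/* Hypothetical host-side level scheduling. Writes the level of each row
   into level[0..m-1] and returns the number of levels. Entries with
   column index >= row index are ignored, so only the strictly lower
   triangular pattern creates dependencies. */
static int compute_levels(int m, const int *rowPtr, const int *colInd,
                          int *level)
{
    int num_levels = 0;
    for (int i = 0; i < m; ++i) {
        int lvl = 0;
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            int j = colInd[p];
            /* Row i must wait for row j, so it sits one level deeper. */
            if (j < i && level[j] + 1 > lvl) {
                lvl = level[j] + 1;
            }
        }
        level[i] = lvl;
        if (lvl + 1 > num_levels) {
            num_levels = lvl + 1;
        }
    }
    return num_levels;
}
```

A matrix with few levels relative to m exposes a lot of parallelism, whereas a pattern in which each row depends on the previous one yields m levels and no parallelism across rows.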

Function csric02_analysis() always reports the first structural zero, even if the policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. The user needs to call cusparseXcsric02_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call csric02() if csric02_analysis() reports a structural zero. In this case, the user can still call csric02(), which will return a numerical zero at the same position as the structural zero. However, the result is meaningless.

  • This function requires temporary extra storage that is allocated internally

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

info

structure initialized using cusparseCreateCsric02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csric02_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.7.1.3.cusparse<t>csric02() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsric02(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    float* csrValA_valM,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseDcsric02(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    double* csrValA_valM,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseCcsric02(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    cuComplex* csrValA_valM,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseZcsric02(cusparseHandle_t handle,
    int m, int nnz,
    const cusparseMatDescr_t descrA,
    cuDoubleComplex* csrValA_valM,
    const int* csrRowPtrA,
    const int* csrColIndA,
    csric02Info_t info,
    cusparseSolvePolicy_t policy,
    void* pBuffer)

This function performs the solve phase of the computing the incomplete-Cholesky factorization with\(0\) fill-in and no pivoting:

\[A \approx LL^{H}\]

This function requires a buffer size returned bycsric02_bufferSize(). The address ofpBuffer must be a multiple of 128 bytes. If not,CUSPARSE_STATUS_INVALID_VALUE is returned.

Althoughcsric02() can be done without level information, the user still needs to be aware of consistency. Ifcsric02_analysis() is called with policyCUSPARSE_SOLVE_POLICY_USE_LEVEL,csric02() can be run with or without levels. On the other hand, ifcsric02_analysis() is called withCUSPARSE_SOLVE_POLICY_NO_LEVEL,csric02() can only acceptCUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise,CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csric02() reports the first numerical zero, including a structural zero. The user must call cusparseXcsric02_zeroPivot() to know where the numerical zero is.

Function csric02() only takes the lower triangular part of matrix A to perform factorization. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, the fill mode and diagonal type are ignored, and the strictly upper triangular part is ignored and never touched. It does not matter if A is Hermitian or not. In other words, from the point of view of csric02(), A is Hermitian and only the lower triangular part is provided.

Note

In practice, a positive definite matrix may not have an incomplete Cholesky factorization. To the best of our knowledge, only an M-matrix can guarantee the existence of an incomplete Cholesky factorization. If csric02() fails the Cholesky factorization and reports a numerical zero, it is possible that the incomplete Cholesky factorization does not exist.

For example, suppose A is a real \(m \times m\) matrix. The following code solves the preconditioned system M*y=x, where M is the product of the Cholesky factor L and its transpose.

\[M = LL^{H}\]
// Suppose that A is an m x m sparse matrix represented in CSR format.
// Assumptions:
// - handle is already created by cusparseCreate(),
// - (d_csrRowPtr, d_csrColInd, d_csrVal) is CSR of A on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory,
// - d_z is intermediate result on device memory.

cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
csric02Info_t info_M = 0;
csrsv2Info_t info_L = 0;
csrsv2Info_t info_Lt = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_Lt;
int pBufferSize;
void* pBuffer = 0;
int structural_zero;
int numerical_zero;
cusparseStatus_t status;
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_Lt = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_Lt = CUSPARSE_OPERATION_TRANSPOSE;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create an empty info structure
// we need one info for csric02 and two info's for csrsv2
cusparseCreateCsric02Info(&info_M);
cusparseCreateCsrsv2Info(&info_L);
cusparseCreateCsrsv2Info(&info_Lt);

// step 3: query how much memory is used in csric02 and csrsv2, and allocate the buffer
cusparseDcsric02_bufferSize(handle, m, nnz,
    descr_M, d_csrVal, d_csrRowPtr, d_csrColInd, info_M, &pBufferSize_M);
cusparseDcsrsv2_bufferSize(handle, trans_L, m, nnz,
    descr_L, d_csrVal, d_csrRowPtr, d_csrColInd, info_L, &pBufferSize_L);
cusparseDcsrsv2_bufferSize(handle, trans_Lt, m, nnz,
    descr_L, d_csrVal, d_csrRowPtr, d_csrColInd, info_Lt, &pBufferSize_Lt);

pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_Lt));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete Cholesky on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on L'
// The lower triangular part of M has the same sparsity pattern as L, so
// we can do analysis of csric02 and csrsv2 simultaneously.
cusparseDcsric02_analysis(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M,
    policy_M, pBuffer);
status = cusparseXcsric02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDcsrsv2_analysis(handle, trans_L, m, nnz, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_L, policy_L, pBuffer);

cusparseDcsrsv2_analysis(handle, trans_Lt, m, nnz, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_Lt, policy_Lt, pBuffer);

// step 5: M = L * L'
cusparseDcsric02(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M, policy_M, pBuffer);
status = cusparseXcsric02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x (replace with cusparseSpSV)
cusparseDcsrsv2_solve(handle, trans_L, m, nnz, &alpha, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_L,
    d_x, d_z, policy_L, pBuffer);

// step 7: solve L'*y = z (replace with cusparseSpSV)
cusparseDcsrsv2_solve(handle, trans_Lt, m, nnz, &alpha, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_Lt,
    d_z, d_y, policy_Lt, pBuffer);

// step 8: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyCsric02Info(info_M);
cusparseDestroyCsrsv2Info(info_L);
cusparseDestroyCsrsv2Info(info_Lt);
cusparseDestroy(handle);
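Steps 6 and 7 of the listing apply the preconditioner as two triangular solves: first L*z = x, then L'*y = z. The following host-only sketch shows the same arithmetic on a small dense, real lower-triangular factor. It is an illustration under stated assumptions, not library code: the name solve_llt and the fixed size LLT_N are hypothetical, and dense storage replaces CSR for brevity.

```c
#include <assert.h>

#define LLT_N 2  /* hypothetical fixed size, chosen only for the sketch */

/* Solves (L * L^T) y = x for a real lower-triangular L with a nonzero
 * diagonal: forward substitution for z, then backward substitution for y. */
static void solve_llt(const double L[LLT_N][LLT_N],
                      const double x[LLT_N], double y[LLT_N])
{
    double z[LLT_N];
    for (int i = 0; i < LLT_N; ++i) {        /* forward: L z = x */
        assert(L[i][i] != 0.0);
        double s = x[i];
        for (int j = 0; j < i; ++j)
            s -= L[i][j] * z[j];
        z[i] = s / L[i][i];
    }
    for (int i = LLT_N - 1; i >= 0; --i) {   /* backward: L^T y = z */
        double s = z[i];
        for (int j = i + 1; j < LLT_N; ++j)
            s -= L[j][i] * y[j];             /* L^T(i,j) = L(j,i) */
        y[i] = s / L[i][i];
    }
}
```

For L = [[2, 0], [1, 3]] the product M = L*L' is [[4, 2], [2, 10]], so the right-hand side x = (6, 12) gives y = (1, 1).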

The function supports the following properties if pBuffer != NULL:

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA_valM

<type> array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csric02_bufferSize().

Output

csrValA_valM

<type> matrix containing the incomplete-Cholesky lower triangular factor.

See cusparseStatus_t for the description of the return status.

5.7.1.4.cusparseXcsric02_zeroPivot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseXcsric02_zeroPivot(cusparseHandle_t handle, csric02Info_t info,
    int* position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero; otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXcsric02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().
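Because position follows the index base of the matrix, host code that indexes C arrays may want to normalize it. The helper below is a hypothetical sketch (pivot_to_zero_based is not a cuSPARSE function) and assumes position has already been copied to host memory.

```c
#include <assert.h>

/* Converts a reported pivot position to a zero-based row index.
 * index_base is 0 or 1, matching the matrix descriptor.
 * A position of -1 means "no zero pivot" and is passed through. */
static int pivot_to_zero_based(int position, int index_base)
{
    assert(index_base == 0 || index_base == 1);
    if (position == -1)
        return -1;
    return position - index_base;
}
```

With a base-1 descriptor, a reported position of 3 refers to the third row, which is row index 2 in a C array.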

  • The routine requires no extra storage.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or a numerical zero if the user already called csric02_analysis() or csric02().

Output

position

if no structural or numerical zero, position is -1; otherwise, if A(j,j) is missing or L(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status.

5.7.1.5.cusparse<t>bsric02_bufferSize() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsric02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseDbsric02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseCbsric02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseZbsric02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, int* pBufferSizeInBytes)

This function returns the size of a buffer used in computing the incomplete-Cholesky factorization with 0 fill-in and no pivoting:

\[A \approx LL^{H}\]

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.

The buffer size depends on mb, blockDim, and nnzb, the number of nonzero blocks of the matrix. If the user changes the matrix, it is necessary to call bsric02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsric02_analysis() and bsric02().

See cusparseStatus_t for the description of the return status.

5.7.1.6.cusparse<t>bsric02_analysis() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsric02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    const float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseDbsric02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    const double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseCbsric02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    const cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseZbsric02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    const cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

This function performs the analysis phase of the incomplete-Cholesky factorization with 0 fill-in and no pivoting:

\[A \approx LL^{H}\]

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.
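The effect of dirA on the layout of each blockDim*blockDim block can be made concrete with an offset computation. The sketch below is illustrative only: block_dir and bsr_elem_offset are hypothetical names, and it assumes the blocks are stored contiguously one after another in bsrValA, in the order given by bsrColIndA.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for cusparseDirection_t. */
typedef enum { DIR_ROW = 0, DIR_COLUMN = 1 } block_dir;

/* Offset (in elements) into the value array of entry (r, c) inside the
 * k-th stored block, with zero-based k, r and c. */
static size_t bsr_elem_offset(size_t k, int blockDim, int r, int c,
                              block_dir dir)
{
    assert(blockDim > 0 && r >= 0 && r < blockDim && c >= 0 && c < blockDim);
    size_t bd = (size_t)blockDim;
    size_t within = (dir == DIR_ROW) ? (size_t)r * bd + (size_t)c   /* row-major */
                                     : (size_t)c * bd + (size_t)r;  /* column-major */
    return k * bd * bd + within;
}
```

For a 2x2 block, entry (0, 1) sits at offset 1 in row-major order and at offset 2 in column-major order.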

This function requires a buffer whose size is returned by bsric02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsric02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete Cholesky factorization. However, bsric02() can be done without level information. To disable level information, the user needs to specify the parameter policy of bsric02_analysis() and bsric02() as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsric02_analysis() always reports the first structural zero, even when parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. The user must call cusparseXbsric02_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call bsric02() if bsric02_analysis() reports a structural zero. In this case, the user can still call bsric02(), which returns a numerical zero in the same position as the structural zero. However, the result is meaningless.
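A structural zero means that a diagonal entry (or diagonal block) is absent from the sparsity pattern. The host-side sketch below shows what such a check looks like on row-pointer and column-index arrays. It is an assumption about the concept, not the library's implementation, and first_missing_diagonal is a hypothetical name. For BSR, pass mb, bsrRowPtrA and bsrColIndA; for CSR, pass m, csrRowPtrA and csrColIndA.

```c
#include <assert.h>

/* Returns the first row j (expressed in the matrix's index base) whose
 * diagonal entry is absent from the pattern, or -1 if every row has one. */
static int first_missing_diagonal(int m, const int *rowPtr,
                                  const int *colInd, int base)
{
    assert(m >= 0 && (base == 0 || base == 1));
    for (int row = 0; row < m; ++row) {
        int found = 0;
        for (int p = rowPtr[row] - base; p < rowPtr[row + 1] - base; ++p) {
            if (colInd[p] - base == row) {
                found = 1;
                break;
            }
        }
        if (!found)
            return row + base;
    }
    return -1;
}
```

The return convention mirrors the position value described for the zeroPivot routines: -1 when nothing is missing, otherwise the offending index in the same base as the matrix.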

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

info

structure initialized using cusparseCreateBsric02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsric02_bufferSize().

Output

info

Structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.7.1.7.cusparse<t>bsric02() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsric02(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseDbsric02(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseCbsric02(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

cusparseStatus_t
cusparseZbsric02(cusparseHandle_t handle, cusparseDirection_t dirA,
    int mb, int nnzb, const cusparseMatDescr_t descrA,
    cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
    int blockDim, bsric02Info_t info, cusparseSolvePolicy_t policy,
    void* pBuffer)

This function performs the solve phase of the incomplete-Cholesky factorization with 0 fill-in and no pivoting:

\[A \approx LL^{H}\]

A is an (mb*blockDim)x(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.

This function requires a buffer whose size is returned by bsric02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsric02() can be done without level information, the user must be aware of consistency. If bsric02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsric02() can be run with or without levels. On the other hand, if bsric02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsric02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsric02() has the same behavior as csric02(). That is, bsr2csr(bsric02(A)) = csric02(bsr2csr(A)). The numerical zero of csric02() means there exists some zero L(j,j). The numerical zero of bsric02() means there exists some block L(j,j) that is not invertible.

Function bsric02() reports the first numerical zero, including a structural zero. The user must call cusparseXbsric02_zeroPivot() to know where the numerical zero is.

The bsric02() function only takes the lower triangular part of matrix A to perform factorization. The strictly upper triangular part is ignored and never touched. It does not matter if A is Hermitian or not. In other words, from the point of view of bsric02(), A is Hermitian and only the lower triangular part is provided. Moreover, the imaginary part of diagonal elements of diagonal blocks is ignored.

For example, suppose A is a real m-by-m matrix, where m=mb*blockDim. The following code solves the preconditioned system M*y=x, where M is the product of the Cholesky factor L and its transpose.

\[M = LL^{H}\]
// Suppose that A is an m x m sparse matrix represented in BSR format.
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// Assumptions:
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of A on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory,
// - d_z is intermediate result on device memory,
// - d_x, d_y and d_z are of size m.

cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
bsric02Info_t info_M = 0;
bsrsv2Info_t info_L = 0;
bsrsv2Info_t info_Lt = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_Lt;
int pBufferSize;
void* pBuffer = 0;
int structural_zero;
int numerical_zero;
cusparseStatus_t status;
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_Lt = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_Lt = CUSPARSE_OPERATION_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create an empty info structure
// we need one info for bsric02 and two info's for bsrsv2
cusparseCreateBsric02Info(&info_M);
cusparseCreateBsrsv2Info(&info_L);
cusparseCreateBsrsv2Info(&info_Lt);

// step 3: query how much memory is used in bsric02 and bsrsv2, and allocate the buffer
cusparseDbsric02_bufferSize(handle, dir, mb, nnzb,
    descr_M, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, &pBufferSize_M);
cusparseDbsrsv2_bufferSize(handle, dir, trans_L, mb, nnzb,
    descr_L, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L, &pBufferSize_L);
cusparseDbsrsv2_bufferSize(handle, dir, trans_Lt, mb, nnzb,
    descr_L, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_Lt, &pBufferSize_Lt);

pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_Lt));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete Cholesky on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on L'
// The lower triangular part of M has the same sparsity pattern as L, so
// we can do analysis of bsric02 and bsrsv2 simultaneously.
cusparseDbsric02_analysis(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M,
    policy_M, pBuffer);
status = cusparseXbsric02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDbsrsv2_analysis(handle, dir, trans_L, mb, nnzb, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_L, policy_L, pBuffer);

cusparseDbsrsv2_analysis(handle, dir, trans_Lt, mb, nnzb, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_Lt, policy_Lt, pBuffer);

// step 5: M = L * L'
cusparseDbsric02(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, policy_M, pBuffer);
status = cusparseXbsric02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("L(%d,%d) is not positive definite\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDbsrsv2_solve(handle, dir, trans_L, mb, nnzb, &alpha, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L,
    d_x, d_z, policy_L, pBuffer);

// step 7: solve L'*y = z
cusparseDbsrsv2_solve(handle, dir, trans_Lt, mb, nnzb, &alpha, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_Lt,
    d_z, d_y, policy_Lt, pBuffer);

// step 8: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyBsric02Info(info_M);
cusparseDestroyBsrsv2Info(info_L);
cusparseDestroyBsrsv2Info(info_Lt);
cusparseDestroy(handle);

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsric02_bufferSize().

Output

bsrValA

<type> matrix containing the incomplete-Cholesky lower triangular factor.

See cusparseStatus_t for the description of the return status.

5.7.1.8.cusparseXbsric02_zeroPivot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsric02_zeroPivot(cusparseHandle_t handle, bsric02Info_t info,
    int* position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero (the block is not positive definite). Otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsric02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or a numerical zero if the user already called bsric02_analysis() or bsric02().

Output

position

if no structural or numerical zero, position is -1; otherwise, if A(j,j) is missing or L(j,j) is not positive definite, position=j.

See cusparseStatus_t for the description of the return status.

5.7.2.Incomplete LU Factorization: level 0 [DEPRECATED]

Different algorithms for ilu0 are discussed in this section.

5.7.2.1.cusparse<t>csrilu02_numericBoost() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02_numericBoost(cusparseHandle_t handle, csrilu02Info_t info,
    int enable_boost, double* tol, float* boost_val)

cusparseStatus_t
cusparseDcsrilu02_numericBoost(cusparseHandle_t handle, csrilu02Info_t info,
    int enable_boost, double* tol, double* boost_val)

cusparseStatus_t
cusparseCcsrilu02_numericBoost(cusparseHandle_t handle, csrilu02Info_t info,
    int enable_boost, double* tol, cuComplex* boost_val)

cusparseStatus_t
cusparseZcsrilu02_numericBoost(cusparseHandle_t handle, csrilu02Info_t info,
    int enable_boost, double* tol, cuDoubleComplex* boost_val)

The user can use a boost value to replace a numerical value in incomplete LU factorization. The parameter tol is used to determine a numerical zero, and boost_val is used to replace a numerical zero. The behavior is

if tol >= fabs(A(j,j)), then A(j,j)=boost_val.

To enable a boost value, the user has to set parameter enable_boost to 1 before calling csrilu02(). To disable a boost value, the user can call csrilu02_numericBoost() again with parameter enable_boost=0.

If enable_boost=0, tol and boost_val are ignored.
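The boost rule stated above amounts to a one-line test per pivot. The following host-side sketch covers the real-valued case only. The function boost_pivot is a hypothetical illustration of the rule, not how the library applies it internally, and the non-negativity check on tol is an assumption of the sketch.

```c
#include <assert.h>
#include <math.h>

/* Returns the value to use as pivot A(j,j): boost_val when boosting is
 * enabled and |pivot| <= tol, otherwise the pivot itself. When
 * enable_boost is 0, tol and boost_val are ignored. */
static double boost_pivot(double pivot, int enable_boost,
                          double tol, double boost_val)
{
    assert(tol >= 0.0);  /* assumption of this sketch */
    if (enable_boost && tol >= fabs(pivot))
        return boost_val;
    return pivot;
}
```

With tol = 1e-8 and boost_val = 1e-4, a pivot of 1e-12 is replaced by 1e-4, while a pivot of 2.0 is left unchanged.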

Both tol and boost_val can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context

info

structure initialized using cusparseCreateCsrilu02Info()

enable_boost

disable boost by enable_boost=0; otherwise, boost is enabled

tol

tolerance to determine a numerical zero

boost_val

boost value to replace a numerical zero

See cusparseStatus_t for the description of the return status.

5.7.2.2.cusparse<t>csrilu02_bufferSize() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02_bufferSize(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsrilu02_bufferSize(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsrilu02_bufferSize(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, cuComplex* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, int* pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsrilu02_bufferSize(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, cuDoubleComplex* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, int* pBufferSizeInBytes)

This function returns the size of the buffer used in computing the incomplete-LU factorization with \(0\) fill-in and no pivoting:

\[A \approx LU\]

A is an \(m \times m\) sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

The buffer size depends on the dimension m and nnz, the number of nonzeros of the matrix. If the user changes the matrix, it is necessary to call csrilu02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.
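Since the buffer size depends on m and nnz, it can help to see how nnz follows from the CSR row-pointer array, using the relation nnz = csrRowPtrA(m) - csrRowPtrA(0) given in the parameter descriptions. The helper csr_nnz below is a hypothetical host-side sketch and assumes the row pointers are available in host memory.

```c
#include <assert.h>

/* Number of stored nonzeros implied by a CSR row-pointer array of
 * m + 1 entries. The result is the same for base-0 and base-1 indexing
 * because the base cancels in the difference. */
static int csr_nnz(int m, const int *csrRowPtr)
{
    assert(m >= 0);
    return csrRowPtr[m] - csrRowPtr[0];
}
```

For a 3x3 tridiagonal matrix with 2, 3 and 2 entries per row, the base-1 row pointers are {1, 3, 6, 8} and the base-0 row pointers are {0, 2, 5, 7}; both give nnz = 7.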

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

Output

info

record internal states based on different algorithms

pBufferSizeInBytes

number of bytes of the buffer used in csrilu02_analysis() and csrilu02()

See cusparseStatus_t for the description of the return status.

5.7.2.3.cusparse<t>csrilu02_analysis() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02_analysis(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseDcsrilu02_analysis(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseCcsrilu02_analysis(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, const cuComplex* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseZcsrilu02_analysis(cusparseHandle_t handle, int m, int nnz,
    const cusparseMatDescr_t descrA, const cuDoubleComplex* csrValA,
    const int* csrRowPtrA, const int* csrColIndA,
    csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

This function performs the analysis phase of the incomplete-LU factorization with \(0\) fill-in and no pivoting:

\[A \approx LU\]

A is an \(m \times m\) sparse matrix that is defined in CSR storage format by the three arrays csrValA, csrRowPtrA, and csrColIndA.

This function requires the buffer size returned by csrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function csrilu02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete LU factorization; however, csrilu02() can be done without level information. To disable level information, the user must specify the policy of csrilu02() as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

It is the user’s choice whether to call csrilu02() if csrilu02_analysis() reports a structural zero. In this case, the user can still call csrilu02(), which will return a numerical zero at the same position as the structural zero. However, the result is meaningless.
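To make "incomplete LU with 0 fill-in and no pivoting" concrete, the sketch below runs the textbook ILU(0) recurrence on a small dense array with an explicit sparsity mask. It is a CPU illustration under stated assumptions, not the cuSPARSE algorithm: the name ilu0_inplace, the fixed size ILU0_N and the pattern array are hypothetical, and entries outside the pattern are assumed to be zero on input.

```c
#include <assert.h>

#define ILU0_N 3  /* hypothetical fixed size, chosen only for the sketch */

/* In-place ILU(0). On return the strictly lower part of a holds L (its
 * unit diagonal is implied) and the upper part, including the diagonal,
 * holds U. Updates are applied only where pattern[i][j] is nonzero, so no
 * fill-in is created. Returns the first row with a zero pivot, or -1. */
static int ilu0_inplace(double a[ILU0_N][ILU0_N],
                        const int pattern[ILU0_N][ILU0_N])
{
    for (int i = 0; i < ILU0_N; ++i) {
        for (int k = 0; k < i; ++k) {
            if (!pattern[i][k])
                continue;
            a[i][k] /= a[k][k];               /* multiplier L(i,k) */
            for (int j = k + 1; j < ILU0_N; ++j)
                if (pattern[i][j])
                    a[i][j] -= a[i][k] * a[k][j];
        }
        if (a[i][i] == 0.0)
            return i;                         /* numerical zero at U(i,i) */
    }
    return -1;
}
```

For a tridiagonal matrix the pattern already contains every entry that elimination would create, so ILU(0) coincides with the exact LU factorization in that case.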

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

info

structure initialized using cusparseCreateCsrilu02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csrilu02_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.7.2.4.cusparse<t>csrilu02() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseScsrilu02(cusparseHandle_t handle, int m, int nnz,
                  const cusparseMatDescr_t descrA,
                  float* csrValA_valM, const int* csrRowPtrA, const int* csrColIndA,
                  csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseDcsrilu02(cusparseHandle_t handle, int m, int nnz,
                  const cusparseMatDescr_t descrA,
                  double* csrValA_valM, const int* csrRowPtrA, const int* csrColIndA,
                  csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseCcsrilu02(cusparseHandle_t handle, int m, int nnz,
                  const cusparseMatDescr_t descrA,
                  cuComplex* csrValA_valM, const int* csrRowPtrA, const int* csrColIndA,
                  csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseZcsrilu02(cusparseHandle_t handle, int m, int nnz,
                  const cusparseMatDescr_t descrA,
                  cuDoubleComplex* csrValA_valM, const int* csrRowPtrA, const int* csrColIndA,
                  csrilu02Info_t info, cusparseSolvePolicy_t policy, void* pBuffer)

This function performs the solve phase of the incomplete-LU factorization with \(0\) fill-in and no pivoting:

\[A \approx LU\]

A is an \(m \times m\) sparse matrix that is defined in CSR storage format by the three arrays csrValA_valM, csrRowPtrA, and csrColIndA.

This function requires a buffer size returned by csrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL. The fill mode and diagonal type are ignored.

Although csrilu02() can be done without level information, the user still needs to be aware of consistency. If csrilu02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, csrilu02() can be run with or without levels. On the other hand, if csrilu02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, csrilu02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.
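The consistency rule between the analysis policy and the factorization policy can be written as a small predicate. This is a sketch only: the enum policy_t and its values are hypothetical stand-ins for cusparseSolvePolicy_t so that the snippet compiles without the cuSPARSE headers.

```c
/* Hypothetical stand-ins for CUSPARSE_SOLVE_POLICY_NO_LEVEL and
 * CUSPARSE_SOLVE_POLICY_USE_LEVEL. */
typedef enum { POLICY_NO_LEVEL = 0, POLICY_USE_LEVEL = 1 } policy_t;

/* Returns 1 when the factorization policy is compatible with the policy
 * that was used in the analysis phase, 0 otherwise. Level information is
 * only available if the analysis computed it. */
static int factor_policy_is_valid(policy_t analysis, policy_t factor)
{
    return analysis == POLICY_USE_LEVEL || factor == POLICY_NO_LEVEL;
}
```

Only the combination of a NO_LEVEL analysis followed by a USE_LEVEL factorization is rejected.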

Function csrilu02() reports the first numerical zero, including a structural zero. The user must call cusparseXcsrilu02_zeroPivot() to know where the numerical zero is.
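To make the notions of 0 fill-in and numerical zero concrete, the following host-side reference sketch factors a small CSR matrix in place. It is not the algorithm cuSPARSE runs on the GPU; it assumes zero-based indices and column indices sorted within each row, and the name ilu0_csr_host and the diagPos workspace are hypothetical. Updates are applied only at positions already present in the pattern, which is what 0 fill-in means.

```c
/* Hypothetical host-side ILU(0) on an m x m CSR matrix (zero-based,
 * column indices sorted in each row). On return, val holds the strictly
 * lower part of L (unit diagonal implied) and the upper part of U,
 * including its diagonal. diagPos is a caller-provided workspace of m ints.
 * Returns -1 on success, or the row j of the first pivot U(j,j) that is
 * missing from the pattern or exactly zero. */
static int ilu0_csr_host(int m, const int *rowPtr, const int *colInd,
                         double *val, int *diagPos)
{
    for (int i = 0; i < m; ++i) {
        int d = -1;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            int j = colInd[k];
            if (j < i) {
                /* L(i,j) = A(i,j) / U(j,j) */
                val[k] /= val[diagPos[j]];
                /* Subtract L(i,j) * U(j,:) from the rest of row i,
                 * but only where row i already stores an entry. */
                int p = k + 1;
                for (int q = diagPos[j] + 1; q < rowPtr[j + 1]; ++q) {
                    while (p < rowPtr[i + 1] && colInd[p] < colInd[q]) {
                        ++p;
                    }
                    if (p < rowPtr[i + 1] && colInd[p] == colInd[q]) {
                        val[p] -= val[k] * val[q];
                    }
                }
            } else if (j == i) {
                d = k;
            }
        }
        diagPos[i] = d;
        if (d < 0 || val[d] == 0.0) {
            return i; /* structural or numerical zero at (i,i) */
        }
    }
    return -1;
}
```

Unlike the library routine, this sketch stops at the first zero pivot instead of continuing with a meaningless result.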

For example, suppose A is a real \(m \times m\) matrix. The following code solves the preconditioned system M*y = x, where M is the product of the LU factors L and U.

// Suppose that A is m x m sparse matrix represented by CSR format,
// Assumption:
// - handle is already created by cusparseCreate(),
// - (d_csrRowPtr, d_csrColInd, d_csrVal) is CSR of A on device memory,
// - d_x is right hand side vector on device memory,
// - d_y is solution vector on device memory.
// - d_z is intermediate result on device memory.

cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
cusparseMatDescr_t descr_U = 0;
csrilu02Info_t info_M = 0;
csrsv2Info_t info_L = 0;
csrsv2Info_t info_U = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_U;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
cusparseStatus_t status; // return status of the zero-pivot queries
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_U = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_U = CUSPARSE_OPERATION_NON_TRANSPOSE;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal
// - matrix U is base-1
// - matrix U is upper triangular
// - matrix U has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_UNIT);

cusparseCreateMatDescr(&descr_U);
cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_U, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_U, CUSPARSE_FILL_MODE_UPPER);
cusparseSetMatDiagType(descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create an empty info structure
// we need one info for csrilu02 and two info's for csrsv2
cusparseCreateCsrilu02Info(&info_M);
cusparseCreateCsrsv2Info(&info_L);
cusparseCreateCsrsv2Info(&info_U);

// step 3: query how much memory used in csrilu02 and csrsv2, and allocate the buffer
cusparseDcsrilu02_bufferSize(handle, m, nnz,
    descr_M, d_csrVal, d_csrRowPtr, d_csrColInd, info_M, &pBufferSize_M);
cusparseDcsrsv2_bufferSize(handle, trans_L, m, nnz,
    descr_L, d_csrVal, d_csrRowPtr, d_csrColInd, info_L, &pBufferSize_L);
cusparseDcsrsv2_bufferSize(handle, trans_U, m, nnz,
    descr_U, d_csrVal, d_csrRowPtr, d_csrColInd, info_U, &pBufferSize_U);

pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete LU factorization on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on U
// The lower(upper) triangular part of M has the same sparsity pattern as L(U),
// we can do analysis of csrilu02 and csrsv2 simultaneously.
cusparseDcsrilu02_analysis(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M,
    policy_M, pBuffer);
status = cusparseXcsrilu02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDcsrsv2_analysis(handle, trans_L, m, nnz, descr_L,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_L, policy_L, pBuffer);

cusparseDcsrsv2_analysis(handle, trans_U, m, nnz, descr_U,
    d_csrVal, d_csrRowPtr, d_csrColInd,
    info_U, policy_U, pBuffer);

// step 5: M = L * U
cusparseDcsrilu02(handle, m, nnz, descr_M,
    d_csrVal, d_csrRowPtr, d_csrColInd, info_M, policy_M, pBuffer);
status = cusparseXcsrilu02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("U(%d,%d) is zero\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDcsrsv2_solve(handle, trans_L, m, nnz, &alpha, descr_L, // replace with cusparseSpSV
   d_csrVal, d_csrRowPtr, d_csrColInd, info_L,
   d_x, d_z, policy_L, pBuffer);

// step 7: solve U*y = z
cusparseDcsrsv2_solve(handle, trans_U, m, nnz, &alpha, descr_U, // replace with cusparseSpSV
   d_csrVal, d_csrRowPtr, d_csrColInd, info_U,
   d_z, d_y, policy_U, pBuffer);

// step 8: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyMatDescr(descr_U);
cusparseDestroyCsrilu02Info(info_M);
cusparseDestroyCsrsv2Info(info_L);
cusparseDestroyCsrsv2Info(info_U);
cusparseDestroy(handle);
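Steps 6 and 7 of the listing above amount to a forward substitution with the unit lower factor followed by a backward substitution with the upper factor, both read from the single value array that csrilu02() overwrote. The host-side sketch below can be used to cross-check results on small matrices. It assumes zero-based indices and a stored, nonzero diagonal in every row; the name lu_solve_csr_host is hypothetical and not part of cuSPARSE.

```c
/* Hypothetical host-side reference: solves L*z = x, then U*y = z.
 * lu stores the strictly lower part of L (unit diagonal implied) and the
 * upper part of U (diagonal included) in one zero-based CSR matrix. */
static void lu_solve_csr_host(int m, const int *rowPtr, const int *colInd,
                              const double *lu, const double *x,
                              double *z, double *y)
{
    /* Forward substitution: only entries left of the diagonal belong to L. */
    for (int i = 0; i < m; ++i) {
        double s = x[i];
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            if (colInd[k] < i) {
                s -= lu[k] * z[colInd[k]];
            }
        }
        z[i] = s;
    }
    /* Backward substitution: the diagonal and entries right of it belong to U. */
    for (int i = m - 1; i >= 0; --i) {
        double s = z[i];
        double diag = 1.0; /* overwritten by the stored diagonal entry */
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k) {
            if (colInd[k] > i) {
                s -= lu[k] * y[colInd[k]];
            } else if (colInd[k] == i) {
                diag = lu[k];
            }
        }
        y[i] = s / diag;
    }
}
```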

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

m

number of rows and columns of matrix A.

nnz

number of nonzeros of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA_valM

<type> array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) nonzero elements of matrix A.

csrRowPtrA

integer array of m + 1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz (= csrRowPtrA(m) - csrRowPtrA(0)) column indices of the nonzero elements of matrix A.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by csrilu02_bufferSize().

Output

csrValA_valM

<type> matrix containing the incomplete-LU lower and upper triangular factors.

See cusparseStatus_t for the description of the return status.

5.7.2.5.cusparseXcsrilu02_zeroPivot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseXcsrilu02_zeroPivot(cusparseHandle_t handle,
                            csrilu02Info_t info,
                            int* position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero; otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXcsrilu02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

handle

Handle to the cuSPARSE library context.

info

info contains a structural zero or numerical zero if the user already called csrilu02_analysis() or csrilu02().

Output

position

If no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or U(j,j) is zero, position=j.

See cusparseStatus_t for the description of the return status.

5.7.2.6.cusparse<t>bsrilu02_numericBoost() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02_numericBoost(cusparseHandle_t handle, bsrilu02Info_t info,
                               int enable_boost, double* tol, float* boost_val)

cusparseStatus_t
cusparseDbsrilu02_numericBoost(cusparseHandle_t handle, bsrilu02Info_t info,
                               int enable_boost, double* tol, double* boost_val)

cusparseStatus_t
cusparseCbsrilu02_numericBoost(cusparseHandle_t handle, bsrilu02Info_t info,
                               int enable_boost, double* tol, cuComplex* boost_val)

cusparseStatus_t
cusparseZbsrilu02_numericBoost(cusparseHandle_t handle, bsrilu02Info_t info,
                               int enable_boost, double* tol, cuDoubleComplex* boost_val)

The user can use a boost value to replace a numerical value in incomplete LU factorization. Parameter tol is used to determine a numerical zero, and boost_val is used to replace a numerical zero. The behavior is as follows:

if tol >= fabs(A(j,j)), then reset each diagonal element of block A(j,j) by boost_val.

To enable a boost value, the user sets parameter enable_boost to 1 before calling bsrilu02(). To disable the boost value, the user can call bsrilu02_numericBoost() with parameter enable_boost=0.

If enable_boost=0, tol and boost_val are ignored.
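For intuition, the boost rule can be sketched on the host for the scalar case (blockDim equal to 1, real values). This is an illustration of the rule as stated above, not the library's code; the function name boost_pivot is hypothetical, and how the magnitude of a larger block is measured is not specified here.

```c
/* Hypothetical scalar illustration of the boost rule: when boosting is
 * enabled and the pivot magnitude does not exceed tol, the pivot is
 * replaced by boost_val; otherwise it is returned unchanged. */
static double boost_pivot(double pivot, int enable_boost,
                          double tol, double boost_val)
{
    double magnitude = pivot < 0.0 ? -pivot : pivot;
    if (enable_boost && tol >= magnitude) {
        return boost_val;
    }
    return pivot;
}
```

With enable_boost equal to 0, both tol and boost_val have no effect, matching the statement above.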

Both tol and boost_val can be in host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

info

structure initialized using cusparseCreateBsrilu02Info().

enable_boost

disable boost by setting enable_boost=0. Otherwise, boost is enabled.

tol

tolerance to determine a numerical zero.

boost_val

boost value to replace a numerical zero.

See cusparseStatus_t for the description of the return status.

5.7.2.7.cusparse<t>bsrilu02_bufferSize() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
                             int mb, int nnzb, const cusparseMatDescr_t descrA,
                             float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                             int blockDim, bsrilu02Info_t info, int* pBufferSizeInBytes);

cusparseStatus_t
cusparseDbsrilu02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
                             int mb, int nnzb, const cusparseMatDescr_t descrA,
                             double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                             int blockDim, bsrilu02Info_t info, int* pBufferSizeInBytes);

cusparseStatus_t
cusparseCbsrilu02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
                             int mb, int nnzb, const cusparseMatDescr_t descrA,
                             cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                             int blockDim, bsrilu02Info_t info, int* pBufferSizeInBytes);

cusparseStatus_t
cusparseZbsrilu02_bufferSize(cusparseHandle_t handle, cusparseDirection_t dirA,
                             int mb, int nnzb, const cusparseMatDescr_t descrA,
                             cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                             int blockDim, bsrilu02Info_t info, int* pBufferSizeInBytes);

This function returns the size of the buffer used in computing the incomplete-LU factorization with 0 fill-in and no pivoting.

\[A \approx LU\]

A is an (mb*blockDim)×(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA.

The buffer size depends on the dimensions of mb, blockDim, and the number of nonzero blocks of the matrix nnzb. If the user changes the matrix, it is necessary to call bsrilu02_bufferSize() again to have the correct buffer size; otherwise, a segmentation fault may occur.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

Output

info

record internal states based on different algorithms.

pBufferSizeInBytes

number of bytes of the buffer used in bsrilu02_analysis() and bsrilu02().

Status Returned

CUSPARSE_STATUS_SUCCESS

the operation completed successfully.

CUSPARSE_STATUS_NOT_INITIALIZED

the library was not initialized.

CUSPARSE_STATUS_ALLOC_FAILED

the resources could not be allocated.

CUSPARSE_STATUS_INVALID_VALUE

invalid parameters were passed (mb, nnzb <= 0), or the base index is not 0 or 1.

CUSPARSE_STATUS_ARCH_MISMATCH

the device only supports compute capability 2.0 and above.

CUSPARSE_STATUS_INTERNAL_ERROR

an internal operation failed.

CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED

the matrix type is not supported.

5.7.2.8.cusparse<t>bsrilu02_analysis() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
                           int mb, int nnzb, const cusparseMatDescr_t descrA,
                           float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                           int blockDim, bsrilu02Info_t info,
                           cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseDbsrilu02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
                           int mb, int nnzb, const cusparseMatDescr_t descrA,
                           double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                           int blockDim, bsrilu02Info_t info,
                           cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseCbsrilu02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
                           int mb, int nnzb, const cusparseMatDescr_t descrA,
                           cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                           int blockDim, bsrilu02Info_t info,
                           cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseZbsrilu02_analysis(cusparseHandle_t handle, cusparseDirection_t dirA,
                           int mb, int nnzb, const cusparseMatDescr_t descrA,
                           cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                           int blockDim, bsrilu02Info_t info,
                           cusparseSolvePolicy_t policy, void* pBuffer)

This function performs the analysis phase of the incomplete-LU factorization with 0 fill-in and no pivoting.

\[A \approx LU\]

A is an (mb*blockDim)×(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored.
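The effect of dirA on the layout of each block can be sketched as an offset computation into bsrValA. The snippet assumes that blocks are stored contiguously, one after another, with blockDim*blockDim values each; the enum block_dir_t is a hypothetical stand-in for cusparseDirection_t, and bsr_elem_offset is not a cuSPARSE function.

```c
#include <stddef.h>

/* Hypothetical stand-ins for CUSPARSE_DIRECTION_ROW and
 * CUSPARSE_DIRECTION_COLUMN. */
typedef enum { DIR_ROW = 0, DIR_COLUMN = 1 } block_dir_t;

/* Returns the zero-based offset into the BSR value array of element (r, c)
 * inside the k-th stored block, where 0 <= r, c < blockDim. */
static size_t bsr_elem_offset(size_t k, int blockDim, int r, int c,
                              block_dir_t dir)
{
    size_t bd = (size_t)blockDim;
    size_t in_block = (dir == DIR_ROW) ? (size_t)r * bd + (size_t)c
                                       : (size_t)c * bd + (size_t)r;
    return k * bd * bd + in_block;
}
```

For a 2×2 block, element (0,1) of the second stored block sits at offset 5 in row-major order and at offset 6 in column-major order.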

This function requires a buffer size returned by bsrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrilu02_analysis() reports a structural zero and computes level information stored in the opaque structure info. The level information can extract more parallelism during incomplete LU factorization. However, bsrilu02() can be done without level information. To disable level information, the user needs to specify the parameter policy of bsrilu02[_analysis|] as CUSPARSE_SOLVE_POLICY_NO_LEVEL.

Function bsrilu02_analysis() always reports the first structural zero, even when parameter policy is CUSPARSE_SOLVE_POLICY_NO_LEVEL. The user must call cusparseXbsrilu02_zeroPivot() to know where the structural zero is.

It is the user’s choice whether to call bsrilu02() if bsrilu02_analysis() reports a structural zero. In this case, the user can still call bsrilu02(), which will return a numerical zero at the same position as the structural zero. However, the result is meaningless.

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A, larger than zero.

info

structure initialized using cusparseCreateBsrilu02Info().

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrilu02_bufferSize().

Output

info

structure filled with information collected during the analysis phase (that should be passed to the solve phase unchanged).

See cusparseStatus_t for the description of the return status.

5.7.2.9.cusparse<t>bsrilu02() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseSbsrilu02(cusparseHandle_t handle, cusparseDirection_t dirA,
                  int mb, int nnzb, const cusparseMatDescr_t descrA,
                  float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                  int blockDim, bsrilu02Info_t info,
                  cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseDbsrilu02(cusparseHandle_t handle, cusparseDirection_t dirA,
                  int mb, int nnzb, const cusparseMatDescr_t descrA,
                  double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                  int blockDim, bsrilu02Info_t info,
                  cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseCbsrilu02(cusparseHandle_t handle, cusparseDirection_t dirA,
                  int mb, int nnzb, const cusparseMatDescr_t descrA,
                  cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                  int blockDim, bsrilu02Info_t info,
                  cusparseSolvePolicy_t policy, void* pBuffer)

cusparseStatus_t
cusparseZbsrilu02(cusparseHandle_t handle, cusparseDirection_t dirA,
                  int mb, int nnzb, const cusparseMatDescr_t descrA,
                  cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA,
                  int blockDim, bsrilu02Info_t info,
                  cusparseSolvePolicy_t policy, void* pBuffer)

This function performs the solve phase of the incomplete-LU factorization with 0 fill-in and no pivoting.

\[A \approx LU\]

A is an (mb*blockDim)×(mb*blockDim) sparse matrix that is defined in BSR storage format by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA. The block in BSR format is of size blockDim*blockDim, stored as column-major or row-major as determined by parameter dirA, which is either CUSPARSE_DIRECTION_COLUMN or CUSPARSE_DIRECTION_ROW. The matrix type must be CUSPARSE_MATRIX_TYPE_GENERAL, and the fill mode and diagonal type are ignored. Function bsrilu02() supports an arbitrary blockDim.

This function requires a buffer size returned by bsrilu02_bufferSize(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

Although bsrilu02() can be used without level information, the user must be aware of consistency. If bsrilu02_analysis() is called with policy CUSPARSE_SOLVE_POLICY_USE_LEVEL, bsrilu02() can be run with or without levels. On the other hand, if bsrilu02_analysis() is called with CUSPARSE_SOLVE_POLICY_NO_LEVEL, bsrilu02() can only accept CUSPARSE_SOLVE_POLICY_NO_LEVEL; otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Function bsrilu02() has the same behavior as csrilu02(). That is, bsr2csr(bsrilu02(A)) = csrilu02(bsr2csr(A)). The numerical zero of csrilu02() means there exists some zero U(j,j). The numerical zero of bsrilu02() means there exists some block U(j,j) that is not invertible.

Function bsrilu02() reports the first numerical zero, including a structural zero. The user must call cusparseXbsrilu02_zeroPivot() to know where the numerical zero is.

For example, suppose A is a real m-by-m matrix where m=mb*blockDim. The following code solves the preconditioned system M*y = x, where M is the product of the LU factors L and U.

// Suppose that A is m x m sparse matrix represented by BSR format,
// The number of block rows/columns is mb, and
// the number of nonzero blocks is nnzb.
// Assumption:
// - handle is already created by cusparseCreate(),
// - (d_bsrRowPtr, d_bsrColInd, d_bsrVal) is BSR of A on device memory,
// - d_x is right hand side vector on device memory.
// - d_y is solution vector on device memory.
// - d_z is intermediate result on device memory.
// - d_x, d_y and d_z are of size m.

cusparseMatDescr_t descr_M = 0;
cusparseMatDescr_t descr_L = 0;
cusparseMatDescr_t descr_U = 0;
bsrilu02Info_t info_M = 0;
bsrsv2Info_t info_L = 0;
bsrsv2Info_t info_U = 0;
int pBufferSize_M;
int pBufferSize_L;
int pBufferSize_U;
int pBufferSize;
void *pBuffer = 0;
int structural_zero;
int numerical_zero;
cusparseStatus_t status; // return status of the zero-pivot queries
const double alpha = 1.;
const cusparseSolvePolicy_t policy_M = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_L = CUSPARSE_SOLVE_POLICY_NO_LEVEL;
const cusparseSolvePolicy_t policy_U = CUSPARSE_SOLVE_POLICY_USE_LEVEL;
const cusparseOperation_t trans_L = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseOperation_t trans_U = CUSPARSE_OPERATION_NON_TRANSPOSE;
const cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;

// step 1: create a descriptor which contains
// - matrix M is base-1
// - matrix L is base-1
// - matrix L is lower triangular
// - matrix L has unit diagonal
// - matrix U is base-1
// - matrix U is upper triangular
// - matrix U has non-unit diagonal
cusparseCreateMatDescr(&descr_M);
cusparseSetMatIndexBase(descr_M, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_M, CUSPARSE_MATRIX_TYPE_GENERAL);

cusparseCreateMatDescr(&descr_L);
cusparseSetMatIndexBase(descr_L, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_L, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_L, CUSPARSE_FILL_MODE_LOWER);
cusparseSetMatDiagType(descr_L, CUSPARSE_DIAG_TYPE_UNIT);

cusparseCreateMatDescr(&descr_U);
cusparseSetMatIndexBase(descr_U, CUSPARSE_INDEX_BASE_ONE);
cusparseSetMatType(descr_U, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatFillMode(descr_U, CUSPARSE_FILL_MODE_UPPER);
cusparseSetMatDiagType(descr_U, CUSPARSE_DIAG_TYPE_NON_UNIT);

// step 2: create an empty info structure
// we need one info for bsrilu02 and two info's for bsrsv2
cusparseCreateBsrilu02Info(&info_M);
cusparseCreateBsrsv2Info(&info_L);
cusparseCreateBsrsv2Info(&info_U);

// step 3: query how much memory used in bsrilu02 and bsrsv2, and allocate the buffer
cusparseDbsrilu02_bufferSize(handle, dir, mb, nnzb,
    descr_M, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, &pBufferSize_M);
cusparseDbsrsv2_bufferSize(handle, dir, trans_L, mb, nnzb,
    descr_L, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L, &pBufferSize_L);
cusparseDbsrsv2_bufferSize(handle, dir, trans_U, mb, nnzb,
    descr_U, d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_U, &pBufferSize_U);

pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_U));

// pBuffer returned by cudaMalloc is automatically aligned to 128 bytes.
cudaMalloc((void**)&pBuffer, pBufferSize);

// step 4: perform analysis of incomplete LU factorization on M
//         perform analysis of triangular solve on L
//         perform analysis of triangular solve on U
// The lower(upper) triangular part of M has the same sparsity pattern as L(U),
// we can do analysis of bsrilu02 and bsrsv2 simultaneously.
cusparseDbsrilu02_analysis(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_M, policy_M, pBuffer);
status = cusparseXbsrilu02_zeroPivot(handle, info_M, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("A(%d,%d) is missing\n", structural_zero, structural_zero);
}

cusparseDbsrsv2_analysis(handle, dir, trans_L, mb, nnzb, descr_L,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_L, policy_L, pBuffer);

cusparseDbsrsv2_analysis(handle, dir, trans_U, mb, nnzb, descr_U,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim,
    info_U, policy_U, pBuffer);

// step 5: M = L * U
cusparseDbsrilu02(handle, dir, mb, nnzb, descr_M,
    d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_M, policy_M, pBuffer);
status = cusparseXbsrilu02_zeroPivot(handle, info_M, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){
   printf("block U(%d,%d) is not invertible\n", numerical_zero, numerical_zero);
}

// step 6: solve L*z = x
cusparseDbsrsv2_solve(handle, dir, trans_L, mb, nnzb, &alpha, descr_L,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_L,
   d_x, d_z, policy_L, pBuffer);

// step 7: solve U*y = z
cusparseDbsrsv2_solve(handle, dir, trans_U, mb, nnzb, &alpha, descr_U,
   d_bsrVal, d_bsrRowPtr, d_bsrColInd, blockDim, info_U,
   d_z, d_y, policy_U, pBuffer);

// step 8: free resources
cudaFree(pBuffer);
cusparseDestroyMatDescr(descr_M);
cusparseDestroyMatDescr(descr_L);
cusparseDestroyMatDescr(descr_U);
cusparseDestroyBsrilu02Info(info_M);
cusparseDestroyBsrsv2Info(info_L);
cusparseDestroyBsrsv2Info(info_U);
cusparseDestroy(handle);

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

dirA

storage format of blocks: either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows and block columns of matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) nonzero blocks of matrix A.

bsrRowPtrA

integer array of mb + 1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColIndA

integer array of nnzb (= bsrRowPtrA(mb) - bsrRowPtrA(0)) column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A; must be larger than zero.

info

structure with information collected during the analysis phase (that should have been passed to the solve phase unchanged).

policy

the supported policies are CUSPARSE_SOLVE_POLICY_NO_LEVEL and CUSPARSE_SOLVE_POLICY_USE_LEVEL.

pBuffer

buffer allocated by the user; the size is returned by bsrilu02_bufferSize().

Output

bsrValA

<type> matrix containing the incomplete-LU lower and upper triangular factors.

See cusparseStatus_t for the description of the return status.

5.7.2.10.cusparseXbsrilu02_zeroPivot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseXbsrilu02_zeroPivot(cusparseHandle_t handle,
                            bsrilu02Info_t info,
                            int* position)

If the returned error code is CUSPARSE_STATUS_ZERO_PIVOT, position=j means A(j,j) has either a structural zero or a numerical zero (the block is not invertible). Otherwise, position=-1.

The position can be 0-based or 1-based, the same as the matrix.

Function cusparseXbsrilu02_zeroPivot() is a blocking call. It calls cudaDeviceSynchronize() to make sure all previous kernels are done.

The position can be in the host memory or device memory. The user can set the proper mode with cusparseSetPointerMode().

  • The routine requires no extra storage.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

info

info contains a structural zero or numerical zero if the user already called bsrilu02_analysis() or bsrilu02().

Output

position

if no structural or numerical zero, position is -1; otherwise if A(j,j) is missing or U(j,j) is not invertible, position=j.

See cusparseStatus_t for the description of the return status.

5.7.3.Tridiagonal Solve

Different algorithms for tridiagonal solve are discussed in this section.

5.7.3.1.cusparse<t>gtsv2_bufferSizeExt()

cusparseStatus_t
cusparseSgtsv2_bufferSizeExt(cusparseHandle_t handle, int m, int n,
                             const float* dl, const float* d, const float* du,
                             const float* B, int ldb, size_t* bufferSizeInBytes)

cusparseStatus_t
cusparseDgtsv2_bufferSizeExt(cusparseHandle_t handle, int m, int n,
                             const double* dl, const double* d, const double* du,
                             const double* B, int ldb, size_t* bufferSizeInBytes)

cusparseStatus_t
cusparseCgtsv2_bufferSizeExt(cusparseHandle_t handle, int m, int n,
                             const cuComplex* dl, const cuComplex* d, const cuComplex* du,
                             const cuComplex* B, int ldb, size_t* bufferSizeInBytes)

cusparseStatus_t
cusparseZgtsv2_bufferSizeExt(cusparseHandle_t handle, int m, int n,
                             const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du,
                             const cuDoubleComplex* B, int ldb, size_t* bufferSizeInBytes)

This function returns the size of the buffer used in gtsv2, which computes the solution of a tridiagonal linear system with multiple right-hand sides.

\[A \ast X = B\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb,n).

ldb

leading dimension of B (that is ≥ \(\max(1, m)\)).

Output

bufferSizeInBytes

number of bytes of the buffer used in gtsv2.

See cusparseStatus_t for the description of the return status.

5.7.3.2.cusparse<t>gtsv2()

cusparseStatus_t cusparseSgtsv2(cusparseHandle_t handle, int m, int n, const float* dl, const float* d, const float* du, float* B, int ldb, void* pBuffer)

cusparseStatus_t cusparseDgtsv2(cusparseHandle_t handle, int m, int n, const double* dl, const double* d, const double* du, double* B, int ldb, void* pBuffer)

cusparseStatus_t cusparseCgtsv2(cusparseHandle_t handle, int m, int n, const cuComplex* dl, const cuComplex* d, const cuComplex* du, cuComplex* B, int ldb, void* pBuffer)

cusparseStatus_t cusparseZgtsv2(cusparseHandle_t handle, int m, int n, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, cuDoubleComplex* B, int ldb, void* pBuffer)

This function computes the solution of a tridiagonal linear system with multiple right-hand sides:

\[A \ast X = B\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

Assuming A is of size m and base-1, dl, d and du are defined by the following formulas:

dl(i) := A(i,i-1) for i=1,2,...,m

The first element of dl is out-of-bound (dl(1) := A(1,0)), so dl(1) = 0.

d(i) = A(i,i) for i=1,2,...,m

du(i) = A(i,i+1) for i=1,2,...,m

The last element of du is out-of-bound (du(m) := A(m,m+1)), so du(m) = 0.
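The definitions above can be illustrated with a small host-side sketch. It assumes the matrix is available as a dense, column-major, double-precision array, and it uses zero-based C indexing, so dl(1) in the text is dl[0] in the code. The function name dense_to_tridiag is hypothetical and is not part of cuSPARSE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a cuSPARSE routine: copy the three diagonals of a
 * dense m-by-m matrix A (column-major, leading dimension lda) into zero-based
 * arrays of length m, padding the two out-of-bound entries with zero. */
void dense_to_tridiag(int m, const double *A, int lda,
                      double *dl, double *d, double *du)
{
    assert(m >= 3 && lda >= m);
    for (int i = 0; i < m; ++i) {
        /* Zero-based A(i,j) is stored at A[i + j*lda]. */
        dl[i] = (i > 0)     ? A[i + (size_t)(i - 1) * (size_t)lda] : 0.0;
        d[i]  =               A[i + (size_t)i * (size_t)lda];
        du[i] = (i < m - 1) ? A[i + (size_t)(i + 1) * (size_t)lda] : 0.0;
    }
}
```

The three arrays would then be copied to device memory before being passed to the solver.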

The routine does perform pivoting, which usually results in more accurate and more stable results than cusparse<t>gtsv_nopivot() or cusparse<t>gtsv2_nopivot() at the expense of some execution time.

This function requires a buffer whose size is returned by gtsv2_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.
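The alignment requirement can be checked on the host before the call. The sketch below is an illustration only; both helper names are hypothetical. Pointers returned directly by cudaMalloc are documented in the CUDA programming guide as aligned to at least 256 bytes, so the check mostly matters when pBuffer is carved out of a larger workspace at some offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when the address is a multiple of 128 bytes. */
int is_aligned_128(const void *p)
{
    return ((uintptr_t)p % 128u) == 0;
}

/* Hypothetical helper: round a byte offset up to the next multiple of 128,
 * for sub-allocating pBuffer from a workspace whose base is already aligned. */
size_t round_up_128(size_t offset)
{
    assert(offset <= SIZE_MAX - 127u);
    return (offset + 127u) & ~(size_t)127u;
}
```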

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb,n).

ldb

leading dimension of B (that is ≥ \(\max(1, m)\)).

pBuffer

buffer allocated by the user; the size is returned by gtsv2_bufferSizeExt().

Output

B

<type> dense solution array of dimensions (ldb,n).

See cusparseStatus_t for the description of the return status.

5.7.3.3.cusparse<t>gtsv2_nopivot_bufferSizeExt()

cusparseStatus_t cusparseSgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle, int m, int n, const float* dl, const float* d, const float* du, const float* B, int ldb, size_t* bufferSizeInBytes)

cusparseStatus_t cusparseDgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle, int m, int n, const double* dl, const double* d, const double* du, const double* B, int ldb, size_t* bufferSizeInBytes)

cusparseStatus_t cusparseCgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle, int m, int n, const cuComplex* dl, const cuComplex* d, const cuComplex* du, const cuComplex* B, int ldb, size_t* bufferSizeInBytes)

cusparseStatus_t cusparseZgtsv2_nopivot_bufferSizeExt(cusparseHandle_t handle, int m, int n, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, const cuDoubleComplex* B, int ldb, size_t* bufferSizeInBytes)

This function returns the size of the buffer used in gtsv2_nopivot, which computes the solution of a tridiagonal linear system with multiple right-hand sides.

\[A \ast X = B\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb,n).

ldb

leading dimension of B (that is ≥ \(\max(1, m)\)).

Output

bufferSizeInBytes

number of bytes of the buffer used in gtsv2_nopivot.

See cusparseStatus_t for the description of the return status.

5.7.3.4.cusparse<t>gtsv2_nopivot()

cusparseStatus_t cusparseSgtsv2_nopivot(cusparseHandle_t handle, int m, int n, const float* dl, const float* d, const float* du, float* B, int ldb, void* pBuffer)

cusparseStatus_t cusparseDgtsv2_nopivot(cusparseHandle_t handle, int m, int n, const double* dl, const double* d, const double* du, double* B, int ldb, void* pBuffer)

cusparseStatus_t cusparseCgtsv2_nopivot(cusparseHandle_t handle, int m, int n, const cuComplex* dl, const cuComplex* d, const cuComplex* du, cuComplex* B, int ldb, void* pBuffer)

cusparseStatus_t cusparseZgtsv2_nopivot(cusparseHandle_t handle, int m, int n, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, cuDoubleComplex* B, int ldb, void* pBuffer)

This function computes the solution of a tridiagonal linear system with multiple right-hand sides:

\[A \ast X = B\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B. Notice that solution X overwrites right-hand-side matrix B on exit.

The routine does not perform any pivoting and uses a combination of the Cyclic Reduction (CR) and the Parallel Cyclic Reduction (PCR) algorithms to find the solution. It achieves better performance when m is a power of 2.
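Because no pivoting is performed, it can be prudent to check the quality of a computed solution. The following host-side sketch computes the largest absolute residual of one system; it assumes the original right-hand side was saved before the solve (the routine overwrites B) and that the solution has been copied back to the host. The function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: max over i of |b(i) - (A*x)(i)| for one tridiagonal
 * system given by zero-based diagonals dl, d, du of length m. */
double tridiag_residual_max(int m, const double *dl, const double *d,
                            const double *du, const double *x,
                            const double *b)
{
    assert(m >= 3);
    double worst = 0.0;
    for (int i = 0; i < m; ++i) {
        double ax = d[i] * x[i];
        if (i > 0)     ax += dl[i] * x[i - 1];   /* sub-diagonal term   */
        if (i < m - 1) ax += du[i] * x[i + 1];   /* super-diagonal term */
        double r = b[i] - ax;
        if (r < 0.0) r = -r;
        if (r > worst) worst = r;
    }
    return worst;
}
```

A residual that is large relative to the magnitude of b suggests switching to the pivoting variant.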

This function requires a buffer whose size is returned by gtsv2_nopivot_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

n

number of right-hand sides, columns of matrix B.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

B

<type> dense right-hand-side array of dimensions (ldb,n).

ldb

leading dimension of B (that is ≥ \(\max(1, m)\)).

pBuffer

buffer allocated by the user; the size is returned by gtsv2_nopivot_bufferSizeExt().

Output

B

<type> dense solution array of dimensions (ldb,n).

See cusparseStatus_t for the description of the return status.

5.7.4.Batched Tridiagonal Solve

Different algorithms for batched tridiagonal solve are discussed in this section.

5.7.4.1.cusparse<t>gtsv2StridedBatch_bufferSizeExt()

cusparseStatus_t cusparseSgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle, int m, const float* dl, const float* d, const float* du, const float* x, int batchCount, int batchStride, size_t* bufferSizeInBytes)

cusparseStatus_t cusparseDgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle, int m, const double* dl, const double* d, const double* du, const double* x, int batchCount, int batchStride, size_t* bufferSizeInBytes)

cusparseStatus_t cusparseCgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle, int m, const cuComplex* dl, const cuComplex* d, const cuComplex* du, const cuComplex* x, int batchCount, int batchStride, size_t* bufferSizeInBytes)

cusparseStatus_t cusparseZgtsv2StridedBatch_bufferSizeExt(cusparseHandle_t handle, int m, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, const cuDoubleComplex* x, int batchCount, int batchStride, size_t* bufferSizeInBytes)

This function returns the size of the buffer used in gtsv2StridedBatch, which computes the solution of multiple tridiagonal linear systems for i=0,…,batchCount-1:

\[A^{(i)} \ast \text{y}^{(i)} = \text{x}^{(i)}\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix X. Notice that solution Y overwrites right-hand-side matrix X on exit. The different matrices are assumed to be of the same size and are stored with a fixed batchStride in memory.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The lower diagonal \(dl^{(i)}\) that corresponds to the i-th linear system starts at location dl+batchStride×i in memory. Also, the first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system. The main diagonal \(d^{(i)}\) that corresponds to the i-th linear system starts at location d+batchStride×i in memory.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The upper diagonal \(du^{(i)}\) that corresponds to the i-th linear system starts at location du+batchStride×i in memory. Also, the last element of each upper diagonal must be zero.

x

<type> dense array that contains the right-hand-side of the tri-diagonal linear system. The right-hand-side \(x^{(i)}\) that corresponds to the i-th linear system starts at location x+batchStride×i in memory.

batchCount

number of systems to solve.

batchStride

stride (number of elements) that separates the vectors of every system (must be at least m).

Output

bufferSizeInBytes

number of bytes of the buffer used in gtsv2StridedBatch.

See cusparseStatus_t for the description of the return status.

5.7.4.2.cusparse<t>gtsv2StridedBatch()

cusparseStatus_t cusparseSgtsv2StridedBatch(cusparseHandle_t handle, int m, const float* dl, const float* d, const float* du, float* x, int batchCount, int batchStride, void* pBuffer)

cusparseStatus_t cusparseDgtsv2StridedBatch(cusparseHandle_t handle, int m, const double* dl, const double* d, const double* du, double* x, int batchCount, int batchStride, void* pBuffer)

cusparseStatus_t cusparseCgtsv2StridedBatch(cusparseHandle_t handle, int m, const cuComplex* dl, const cuComplex* d, const cuComplex* du, cuComplex* x, int batchCount, int batchStride, void* pBuffer)

cusparseStatus_t cusparseZgtsv2StridedBatch(cusparseHandle_t handle, int m, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, cuDoubleComplex* x, int batchCount, int batchStride, void* pBuffer)

This function computes the solution of multiple tridiagonal linear systems for i=0,…,batchCount-1:

\[A^{(i)} \ast \text{y}^{(i)} = \text{x}^{(i)}\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix X. Notice that solution Y overwrites right-hand-side matrix X on exit. The different matrices are assumed to be of the same size and are stored with a fixed batchStride in memory.
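The strided layout can be sketched on the host as follows: element k of system i lives at index batchStride×i + k of each array. The helper names below are hypothetical, and the allocation size batchCount×batchStride is a simple safe choice rather than a documented requirement.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: a safe element count for each of dl, d, du and x. */
size_t strided_batch_elems(int batchCount, int batchStride)
{
    assert(batchCount >= 0 && batchStride >= 0);
    return (size_t)batchCount * (size_t)batchStride;
}

/* Hypothetical helper: copy one length-m vector (a diagonal or a right-hand
 * side) of system i into its slot of the strided array dst. */
void pack_strided(int m, int batchStride, int i, const float *src, float *dst)
{
    assert(m >= 3 && batchStride >= m && i >= 0);
    float *slot = dst + (size_t)batchStride * (size_t)i;
    for (int k = 0; k < m; ++k)
        slot[k] = src[k];
}
```

With batchStride larger than m, the padding elements between systems are left as they were.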

The routine does not perform any pivoting and uses a combination of the Cyclic Reduction (CR) and the Parallel Cyclic Reduction (PCR) algorithms to find the solution. It achieves better performance when m is a power of 2.

This function requires a buffer whose size is returned by gtsv2StridedBatch_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

m

the size of the linear system (must be ≥ 3).

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The lower diagonal \(dl^{(i)}\) that corresponds to the i-th linear system starts at location dl+batchStride×i in memory. Also, the first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system. The main diagonal \(d^{(i)}\) that corresponds to the i-th linear system starts at location d+batchStride×i in memory.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The upper diagonal \(du^{(i)}\) that corresponds to the i-th linear system starts at location du+batchStride×i in memory. Also, the last element of each upper diagonal must be zero.

x

<type> dense array that contains the right-hand-side of the tri-diagonal linear system. The right-hand-side \(x^{(i)}\) that corresponds to the i-th linear system starts at location x+batchStride×i in memory.

batchCount

number of systems to solve.

batchStride

stride (number of elements) that separates the vectors of every system (must be at least m).

pBuffer

buffer allocated by the user; the size is returned by gtsv2StridedBatch_bufferSizeExt().

Output

x

<type> dense array that contains the solution of the tri-diagonal linear system. The solution \(x^{(i)}\) that corresponds to the i-th linear system starts at location x+batchStride×i in memory.

See cusparseStatus_t for the description of the return status.

5.7.4.3.cusparse<t>gtsvInterleavedBatch()

cusparseStatus_t cusparseSgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const float* dl, const float* d, const float* du, const float* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseDgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const double* dl, const double* d, const double* du, const double* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseCgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const cuComplex* dl, const cuComplex* d, const cuComplex* du, const cuComplex* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseZgtsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, const cuDoubleComplex* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseSgtsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, float* dl, float* d, float* du, float* x, int batchCount, void* pBuffer)

cusparseStatus_t cusparseDgtsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, double* dl, double* d, double* du, double* x, int batchCount, void* pBuffer)

cusparseStatus_t cusparseCgtsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, cuComplex* dl, cuComplex* d, cuComplex* du, cuComplex* x, int batchCount, void* pBuffer)

cusparseStatus_t cusparseZgtsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, cuDoubleComplex* dl, cuDoubleComplex* d, cuDoubleComplex* du, cuDoubleComplex* x, int batchCount, void* pBuffer)

This function computes the solution of multiple tridiagonal linear systems for i=0,…,batchCount-1:

\[A^{(i)} \ast \text{x}^{(i)} = \text{b}^{(i)}\]

The coefficient matrix A of each of these tri-diagonal linear systems is defined with three vectors corresponding to its lower (dl), main (d), and upper (du) matrix diagonals; the right-hand sides are stored in the dense matrix B, which is passed in the array x. Notice that solution X overwrites right-hand-side matrix B on exit.

Assuming A is of size m and base-1, dl, d and du are defined by the following formulas:

dl(i) := A(i,i-1) for i=1,2,...,m

The first element of dl is out-of-bound (dl(1) := A(1,0)), so dl(1) = 0.

d(i) = A(i,i) for i=1,2,...,m

du(i) = A(i,i+1) for i=1,2,...,m

The last element of du is out-of-bound (du(m) := A(m,m+1)), so du(m) = 0.

The data layout is different from gtsvStridedBatch, which aggregates all matrices one after another. Instead, gtsvInterleavedBatch gathers the same element of different matrices in a contiguous manner. If dl is regarded as a 2-D array of size m-by-batchCount, dl(:,j) stores the j-th matrix. gtsvStridedBatch uses column-major while gtsvInterleavedBatch uses row-major.

The routine provides three different algorithms, selected by parameter algo. The first algorithm is cuThomas, provided by Barcelona Supercomputing Center. The second algorithm is LU with partial pivoting and the last algorithm is QR. From a stability perspective, cuThomas is not numerically stable because it does not have pivoting. LU with partial pivoting and QR are stable. From a performance perspective, LU with partial pivoting and QR are about 10% to 20% slower than cuThomas.
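To make the stability remark concrete, the sketch below shows the textbook serial Thomas algorithm for a single system on the host. It is an illustration of elimination without pivoting, under the assumption that cuThomas is based on the same recurrence; it is not the library's GPU implementation, and the function name and the scratch argument are hypothetical.

```c
#include <assert.h>

/* Textbook Thomas algorithm (no pivoting) for one tridiagonal system.
 * dl, d, du are zero-based diagonals of length m; x holds the right-hand
 * side on entry and the solution on exit; scratch needs m elements.
 * Returns 0 on success, -1 when an exactly zero pivot is met. */
int thomas_solve(int m, const double *dl, const double *d, const double *du,
                 double *x, double *scratch)
{
    assert(m >= 1);
    if (d[0] == 0.0) return -1;
    scratch[0] = du[0] / d[0];
    x[0] = x[0] / d[0];
    /* Forward elimination: remove the sub-diagonal row by row. */
    for (int i = 1; i < m; ++i) {
        double pivot = d[i] - dl[i] * scratch[i - 1];
        if (pivot == 0.0) return -1;
        scratch[i] = du[i] / pivot;
        x[i] = (x[i] - dl[i] * x[i - 1]) / pivot;
    }
    /* Back substitution. */
    for (int i = m - 2; i >= 0; --i)
        x[i] -= scratch[i] * x[i + 1];
    return 0;
}
```

Each step divides by a pivot that is never exchanged for a larger entry, so a small pivot amplifies rounding errors; this is the reason the pivoting LU and QR variants are described as stable.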

This function requires a buffer whose size is returned by gtsvInterleavedBatch_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

If the user prepares aggregate format, one can use cublasXgeam to get interleaved format. However, such a transformation takes time comparable to the solver itself. To reach the best performance, the user must prepare interleaved format explicitly.
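The relationship between the two layouts amounts to a transpose. The host-side sketch below assumes an aggregate array with stride m (no padding) and writes the interleaved equivalent, where element k of system j is stored at index k×batchCount + j. The function name is hypothetical; on the device the same effect would be obtained with a transpose kernel or the geam routine mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: convert one array (dl, d, du or x) from aggregate
 * layout (system j occupies agg[j*m .. j*m + m-1]) to interleaved layout
 * (element k of all systems occupies inter[k*batchCount .. +batchCount-1]). */
void aggregate_to_interleaved(int m, int batchCount,
                              const float *agg, float *inter)
{
    assert(m >= 1 && batchCount >= 1);
    for (int j = 0; j < batchCount; ++j)
        for (int k = 0; k < m; ++k)
            inter[(size_t)k * (size_t)batchCount + (size_t)j] =
                agg[(size_t)j * (size_t)m + (size_t)k];
}
```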

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

algo

algo = 0: cuThomas (unstable algorithm); algo = 1: LU with pivoting (stable algorithm); algo = 2: QR (stable algorithm)

m

the size of the linear system.

dl

<type> dense array containing the lower diagonal of the tri-diagonal linear system. The first element of each lower diagonal must be zero.

d

<type> dense array containing the main diagonal of the tri-diagonal linear system.

du

<type> dense array containing the upper diagonal of the tri-diagonal linear system. The last element of each upper diagonal must be zero.

x

<type> dense right-hand-side array of dimensions (batchCount,m).

batchCount

number of systems to solve.

pBuffer

buffer allocated by the user; the size is returned by gtsvInterleavedBatch_bufferSizeExt().

Output

x

<type> dense solution array of dimensions (batchCount,m).

See cusparseStatus_t for the description of the return status.

5.7.5.Batched Pentadiagonal Solve

Different algorithms for batched pentadiagonal solve are discussed in this section.

5.7.5.1.cusparse<t>gpsvInterleavedBatch()

cusparseStatus_t cusparseSgpsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const float* ds, const float* dl, const float* d, const float* du, const float* dw, const float* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseDgpsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const double* ds, const double* dl, const double* d, const double* du, const double* dw, const double* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseCgpsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const cuComplex* ds, const cuComplex* dl, const cuComplex* d, const cuComplex* du, const cuComplex* dw, const cuComplex* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseZgpsvInterleavedBatch_bufferSizeExt(cusparseHandle_t handle, int algo, int m, const cuDoubleComplex* ds, const cuDoubleComplex* dl, const cuDoubleComplex* d, const cuDoubleComplex* du, const cuDoubleComplex* dw, const cuDoubleComplex* x, int batchCount, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseSgpsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, float* ds, float* dl, float* d, float* du, float* dw, float* x, int batchCount, void* pBuffer)

cusparseStatus_t cusparseDgpsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, double* ds, double* dl, double* d, double* du, double* dw, double* x, int batchCount, void* pBuffer)

cusparseStatus_t cusparseCgpsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, cuComplex* ds, cuComplex* dl, cuComplex* d, cuComplex* du, cuComplex* dw, cuComplex* x, int batchCount, void* pBuffer)

cusparseStatus_t cusparseZgpsvInterleavedBatch(cusparseHandle_t handle, int algo, int m, cuDoubleComplex* ds, cuDoubleComplex* dl, cuDoubleComplex* d, cuDoubleComplex* du, cuDoubleComplex* dw, cuDoubleComplex* x, int batchCount, void* pBuffer)

This function computes the solution of multiple penta-diagonal linear systems for i=0,…,batchCount-1:

\[A^{(i)} \ast \text{x}^{(i)} = \text{b}^{(i)}\]

The coefficient matrix A of each of these penta-diagonal linear systems is defined with five vectors corresponding to its lower (ds, dl), main (d), and upper (du, dw) matrix diagonals; the right-hand sides are stored in the dense matrix B, which is passed in the array x. Notice that solution X overwrites right-hand-side matrix B on exit.

Assuming A is of size m and base-1, ds, dl, d, du and dw are defined by the following formulas:

ds(i) := A(i,i-2) for i=1,2,...,m

The first two elements of ds are out-of-bound (ds(1) := A(1,-1), ds(2) := A(2,0)), so ds(1) = 0 and ds(2) = 0.

dl(i) := A(i,i-1) for i=1,2,...,m

The first element of dl is out-of-bound (dl(1) := A(1,0)), so dl(1) = 0.

d(i) = A(i,i) for i=1,2,...,m

du(i) = A(i,i+1) for i=1,2,...,m

The last element of du is out-of-bound (du(m) := A(m,m+1)), so du(m) = 0.

dw(i) = A(i,i+2) for i=1,2,...,m

The last two elements of dw are out-of-bound (dw(m-1) := A(m-1,m+1), dw(m) := A(m,m+2)), so dw(m-1) = 0 and dw(m) = 0.
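As a hedged illustration of the five definitions, the host-side sketch below fills ds, dl, d, du and dw for a single system from a dense column-major matrix, using zero-based C indices. Both function names are hypothetical; arranging several systems into the batched layout is a separate step.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: zero-based A(i,j) of a column-major matrix, or 0 when
 * column j falls outside the matrix (the out-of-bound entries in the text). */
double band_entry(int m, const double *A, int lda, int i, int j)
{
    return (j >= 0 && j < m) ? A[(size_t)i + (size_t)j * (size_t)lda] : 0.0;
}

/* Hypothetical helper: extract the five diagonals of one m-by-m system. */
void dense_to_pentadiag(int m, const double *A, int lda,
                        double *ds, double *dl, double *d,
                        double *du, double *dw)
{
    assert(m >= 1 && lda >= m);
    for (int i = 0; i < m; ++i) {
        ds[i] = band_entry(m, A, lda, i, i - 2);  /* distance 2 below */
        dl[i] = band_entry(m, A, lda, i, i - 1);  /* distance 1 below */
        d[i]  = band_entry(m, A, lda, i, i);      /* main diagonal    */
        du[i] = band_entry(m, A, lda, i, i + 1);  /* distance 1 above */
        dw[i] = band_entry(m, A, lda, i, i + 2);  /* distance 2 above */
    }
}
```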

The data layout is the same as gtsvInterleavedBatch.

The routine is numerically stable because it uses QR to solve the linear system.

This function requires a buffer whose size is returned by gpsvInterleavedBatch_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If it is not, CUSPARSE_STATUS_INVALID_VALUE is returned.

The function supports the following properties if pBuffer != NULL:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

handle

handle to the cuSPARSE library context.

algo

only algo = 0 (QR) is supported.

m

the size of the linear system.

ds

<type> dense array containing the lower diagonal (distance 2 to the diagonal) of the penta-diagonal linear system. The first two elements must be zero.

dl

<type> dense array containing the lower diagonal (distance 1 to the diagonal) of the penta-diagonal linear system. The first element must be zero.

d

<type> dense array containing the main diagonal of the penta-diagonal linear system.

du

<type> dense array containing the upper diagonal (distance 1 to the diagonal) of the penta-diagonal linear system. The last element must be zero.

dw

<type> dense array containing the upper diagonal (distance 2 to the diagonal) of the penta-diagonal linear system. The last two elements must be zero.

x

<type> dense right-hand-side array of dimensions (batchCount,m).

batchCount

number of systems to solve.

pBuffer

buffer allocated by the user; the size is returned by gpsvInterleavedBatch_bufferSizeExt().

Output

x

<type> dense solution array of dimensions (batchCount,m).

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSgpsvInterleavedBatch for a code example.

5.8.cuSPARSE Reorderings Reference

This chapter describes the reordering routines used to manipulate sparse matrices.

5.8.1.cusparse<t>csrcolor() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseScsrcolor(cusparseHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const float* csrValA, const int* csrRowPtrA, const int* csrColIndA, const float* fractionToColor, int* ncolors, int* coloring, int* reordering, cusparseColorInfo_t info)

cusparseStatus_t cusparseDcsrcolor(cusparseHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const double* csrValA, const int* csrRowPtrA, const int* csrColIndA, const double* fractionToColor, int* ncolors, int* coloring, int* reordering, cusparseColorInfo_t info)

cusparseStatus_t cusparseCcsrcolor(cusparseHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuComplex* csrValA, const int* csrRowPtrA, const int* csrColIndA, const cuComplex* fractionToColor, int* ncolors, int* coloring, int* reordering, cusparseColorInfo_t info)

cusparseStatus_t cusparseZcsrcolor(cusparseHandle_t handle, int m, int nnz, const cusparseMatDescr_t descrA, const cuDoubleComplex* csrValA, const int* csrRowPtrA, const int* csrColIndA, const cuDoubleComplex* fractionToColor, int* ncolors, int* coloring, int* reordering, cusparseColorInfo_t info)

This function performs the coloring of the adjacency graph associated with the matrix A stored in CSR format. The coloring is an assignment of colors (integer numbers) to nodes, such that neighboring nodes have distinct colors. An approximate coloring algorithm is used in this routine, and is stopped when a certain percentage of nodes has been colored. The rest of the nodes are assigned distinct colors (an increasing sequence of integer numbers, starting from the last integer used previously). The last two auxiliary routines can be used to extract the resulting number of colors, their assignment and the associated reordering. The reordering is such that nodes that have been assigned the same color are reordered to be next to each other.

The matrix A passed to this routine must be stored as a general matrix and have a symmetric sparsity pattern. If the matrix is nonsymmetric, the user should pass A+A^T as a parameter to this routine.
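The notion of a valid coloring can be illustrated on the host. The sketch below uses a simple sequential greedy coloring, which is a swapped-in textbook technique and not the approximate parallel algorithm used by the routine, together with a check that neighboring nodes have distinct colors. It assumes zero-based CSR indices and a symmetric sparsity pattern; both function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Sequential greedy coloring of the adjacency graph of a CSR pattern.
 * Returns the number of colors used, or -1 if allocation fails. */
int greedy_color_csr(int m, const int *rowPtr, const int *colInd,
                     int *coloring)
{
    int ncolors = 0;
    /* mark[c] == i means color c is already taken by a neighbor of node i. */
    int *mark = malloc(sizeof *mark * (size_t)(m > 0 ? m : 1));
    if (!mark) return -1;
    for (int c = 0; c < m; ++c) mark[c] = -1;
    for (int i = 0; i < m; ++i) coloring[i] = -1;
    for (int i = 0; i < m; ++i) {
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            int j = colInd[p];
            if (j != i && coloring[j] >= 0) mark[coloring[j]] = i;
        }
        int c = 0;
        while (mark[c] == i) ++c;   /* smallest free color, always <= i */
        coloring[i] = c;
        if (c + 1 > ncolors) ncolors = c + 1;
    }
    free(mark);
    return ncolors;
}

/* Nonzero when no off-diagonal entry connects two nodes of equal color. */
int coloring_is_valid(int m, const int *rowPtr, const int *colInd,
                      const int *coloring)
{
    for (int i = 0; i < m; ++i)
        for (int p = rowPtr[i]; p < rowPtr[i + 1]; ++p) {
            int j = colInd[p];
            if (j != i && coloring[j] == coloring[i]) return 0;
        }
    return 1;
}
```

The validity check can also be applied to a coloring array copied back from the device, provided the index base matches.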

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle

handle to the cuSPARSE library context.

m

number of rows of matrix A.

nnz

number of nonzero elements of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) nonzero elements of matrix A.

csrRowPtrA

integer array of m+1 elements that contains the start of every row and the end of the last row plus one.

csrColIndA

integer array of nnz ( = csrRowPtrA(m) - csrRowPtrA(0) ) column indices of the nonzero elements of matrix A.

fractionToColor

fraction of nodes to be colored, which should be in the interval [0.0,1.0], for example 0.8 implies that 80 percent of nodes will be colored.

info

structure with information to be passed to the coloring.

Output

ncolors

The number of distinct colors used (at most the size of the matrix, but likely much smaller).

coloring

The resulting coloring permutation

reordering

The resulting reordering permutation (untouched if NULL)

See cusparseStatus_t for the description of the return status.

5.9.cuSPARSE Format Conversion Reference

This chapter describes the conversion routines between different sparse and dense storage formats.

coosort, csrsort, cscsort, and csru2csr are sorting routines without malloc inside; the following table estimates the buffer size.

routine               buffer size                           maximum problem size if buffer is limited by 2 GB
coosort               > 16*n bytes                          125M
csrsort or cscsort    > 20*n bytes                          100M
csru2csr              'd' > 28*n bytes; 'z' > 36*n bytes    71M for 'd' and 55M for 'z'
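The last column of the table follows from dividing the buffer limit by the per-element estimate, with 2 GB taken as 2×10^9 bytes. The sketch below reproduces that arithmetic; the enum and function names are hypothetical, and the result is only an estimate, since the table gives lower bounds and the actual size should be queried from the corresponding buffer-size routine.

```c
#include <assert.h>

/* Hypothetical labels for the rows of the table above. */
typedef enum {
    SORT_COOSORT,
    SORT_CSRSORT_OR_CSCSORT,
    SORT_CSRU2CSR_D,
    SORT_CSRU2CSR_Z
} sort_routine_t;

/* Estimated bytes of buffer per element n, taken from the table. */
unsigned long long bytes_per_element(sort_routine_t r)
{
    switch (r) {
    case SORT_COOSORT:            return 16;
    case SORT_CSRSORT_OR_CSCSORT: return 20;
    case SORT_CSRU2CSR_D:         return 28;
    case SORT_CSRU2CSR_Z:         return 36;
    }
    assert(!"unknown routine");
    return 0;
}

/* Largest n whose estimated buffer still fits in budgetBytes. */
unsigned long long max_problem_size(sort_routine_t r,
                                    unsigned long long budgetBytes)
{
    unsigned long long per = bytes_per_element(r);
    return per ? budgetBytes / per : 0;
}
```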

5.9.1.cusparse<t>bsr2csr() [DEPRECATED]

>This routine will be removed in a future major release.

cusparseStatus_t cusparseSbsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int blockDim, const cusparseMatDescr_t descrC, float* csrValC, int* csrRowPtrC, int* csrColIndC)

cusparseStatus_t cusparseDbsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int blockDim, const cusparseMatDescr_t descrC, double* csrValC, int* csrRowPtrC, int* csrColIndC)

cusparseStatus_t cusparseCbsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int blockDim, const cusparseMatDescr_t descrC, cuComplex* csrValC, int* csrRowPtrC, int* csrColIndC)

cusparseStatus_t cusparseZbsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int blockDim, const cusparseMatDescr_t descrC, cuDoubleComplex* csrValC, int* csrRowPtrC, int* csrColIndC)

This function converts a sparse matrix in BSR format (defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA) into a sparse matrix in CSR format that is defined by arrays csrValC, csrRowPtrC, and csrColIndC.

Let m (=mb*blockDim) be the number of rows of A and n (=nb*blockDim) be the number of columns of A; then A and C are m*n sparse matrices. The BSR format of A contains nnzb (=bsrRowPtrA[mb]-bsrRowPtrA[0]) nonzero blocks, whereas the sparse matrix A contains nnz (=nnzb*blockDim*blockDim) elements. The user must allocate enough space for arrays csrRowPtrC, csrColIndC, and csrValC. The requirements are as follows:

csrRowPtrC of m+1 elements

csrValC of nnz elements

csrColIndC of nnz elements

The general procedure is as follows:

// Given BSR format (bsrRowPtrA, bsrColIndA, bsrValA) and
// blocks of BSR format are stored in column-major order.
// bsrRowPtrA is read on the host here; if it lives in device memory,
// copy the two entries back first.
cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;
int m = mb * blockDim;
int nnzb = bsrRowPtrA[mb] - bsrRowPtrA[0]; // number of blocks
int nnz = nnzb * blockDim * blockDim; // number of elements
cudaMalloc((void**)&csrRowPtrC, sizeof(int) * (m + 1));
cudaMalloc((void**)&csrColIndC, sizeof(int) * nnz);
cudaMalloc((void**)&csrValC, sizeof(float) * nnz);
cusparseSbsr2csr(handle, dir, mb, nb, descrA, bsrValA, bsrRowPtrA, bsrColIndA, blockDim, descrC, csrValC, csrRowPtrC, csrColIndC);

  • The routine requires no extra storage

  • The routine supports asynchronous execution if blockDim != 1 or the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if blockDim != 1 or the Stream Ordered Memory Allocator is available

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows of sparse matrix A.

nb

number of block columns of sparse matrix A.

descrA

the descriptor of matrix A.

bsrValA

<type> array of nnzb*blockDim*blockDim nonzero elements of matrix A.

bsrRowPtrA

integer array of mb+1 elements that contains the start of every block row and the end of the last block row plus one of matrix A.

bsrColIndA

integer array of nnzb column indices of the nonzero blocks of matrix A.

blockDim

block dimension of sparse matrix A.

descrC

the descriptor of matrix C.

Output

csrValC

<type> array of nnz (=csrRowPtrC[m]-csrRowPtrC[0]) nonzero elements of matrix C.

csrRowPtrC

integer array of m+1 elements that contains the start of every row and the end of the last row plus one of matrix C.

csrColIndC

integer array of nnz column indices of the nonzero elements of matrix C.

See cusparseStatus_t for the description of the return status.
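For checking results on small inputs, the conversion can be mirrored by a host-side reference. The sketch below is an assumption-laden illustration, not the library's implementation: it assumes zero-based indices for both matrices, single precision, and host-resident arrays, and it takes a hypothetical flag rowMajorBlocks in place of cusparseDirection_t. Every element of every stored block is emitted, which matches nnz = nnzb*blockDim*blockDim.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for BSR -> CSR with square blocks. */
void bsr2csr_ref(int rowMajorBlocks, int mb, int blockDim,
                 const float *bsrVal, const int *bsrRowPtr,
                 const int *bsrColInd,
                 float *csrVal, int *csrRowPtr, int *csrColInd)
{
    assert(mb >= 0 && blockDim >= 1 && bsrRowPtr[0] == 0);
    const int bd = blockDim;
    for (int ib = 0; ib < mb; ++ib) {
        int kBegin = bsrRowPtr[ib], kEnd = bsrRowPtr[ib + 1];
        int rowLen = (kEnd - kBegin) * bd;      /* entries per CSR row */
        for (int r = 0; r < bd; ++r) {
            int out = kBegin * bd * bd + r * rowLen;
            csrRowPtr[ib * bd + r] = out;
            for (int k = kBegin; k < kEnd; ++k)
                for (int c = 0; c < bd; ++c) {
                    /* Position of element (r,c) inside block k. */
                    int inBlock = rowMajorBlocks ? r * bd + c : c * bd + r;
                    csrVal[out] = bsrVal[(size_t)k * bd * bd + inBlock];
                    csrColInd[out] = bsrColInd[k] * bd + c;
                    ++out;
                }
        }
    }
    csrRowPtr[mb * bd] = bsrRowPtr[mb] * bd * bd;
}
```

Comparing this output against arrays copied back from the device is a quick way to confirm that the dir argument matches how the blocks were actually stored.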

5.9.2.cusparse<t>gebsr2gebsc()

cusparseStatus_t cusparseSgebsr2gebsc_bufferSize(cusparseHandle_t handle, int mb, int nb, int nnzb, const float* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, int* pBufferSize)
cusparseStatus_t cusparseDgebsr2gebsc_bufferSize(cusparseHandle_t handle, int mb, int nb, int nnzb, const double* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, int* pBufferSize)
cusparseStatus_t cusparseCgebsr2gebsc_bufferSize(cusparseHandle_t handle, int mb, int nb, int nnzb, const cuComplex* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, int* pBufferSize)
cusparseStatus_t cusparseZgebsr2gebsc_bufferSize(cusparseHandle_t handle, int mb, int nb, int nnzb, const cuDoubleComplex* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, int* pBufferSize)

cusparseStatus_t cusparseSgebsr2gebsc(cusparseHandle_t handle, int mb, int nb, int nnzb, const float* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, float* bscVal, int* bscRowInd, int* bscColPtr, cusparseAction_t copyValues, cusparseIndexBase_t baseIdx, void* pBuffer)
cusparseStatus_t cusparseDgebsr2gebsc(cusparseHandle_t handle, int mb, int nb, int nnzb, const double* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, double* bscVal, int* bscRowInd, int* bscColPtr, cusparseAction_t copyValues, cusparseIndexBase_t baseIdx, void* pBuffer)
cusparseStatus_t cusparseCgebsr2gebsc(cusparseHandle_t handle, int mb, int nb, int nnzb, const cuComplex* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, cuComplex* bscVal, int* bscRowInd, int* bscColPtr, cusparseAction_t copyValues, cusparseIndexBase_t baseIdx, void* pBuffer)
cusparseStatus_t cusparseZgebsr2gebsc(cusparseHandle_t handle, int mb, int nb, int nnzb, const cuDoubleComplex* bsrVal, const int* bsrRowPtr, const int* bsrColInd, int rowBlockDim, int colBlockDim, cuDoubleComplex* bscVal, int* bscRowInd, int* bscColPtr, cusparseAction_t copyValues, cusparseIndexBase_t baseIdx, void* pBuffer)

This function can be seen as the same as csr2csc() when each block of size rowBlockDim*colBlockDim is regarded as a scalar.

The sparsity pattern of the result matrix can also be seen as the transpose of the original sparse matrix, but the memory layout of a block does not change.

The user must call gebsr2gebsc_bufferSize() to determine the size of the buffer required by gebsr2gebsc(), allocate the buffer, and pass the buffer pointer to gebsr2gebsc().

  • The routine requires no extra storage if pBuffer != NULL

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available
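
To illustrate the "block as a scalar" view, the following CPU reference sketch transposes the block pattern with a counting sort and copies each block unchanged. It is only an illustration under stated assumptions: zero-based indices on both sides, `float` values, and a hypothetical function name `gebsr2gebsc_ref`. It does not describe how the library implements the routine.

```c
#include <assert.h>
#include <string.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * Zero-based indices are assumed for input and output.
 * Each block holds blockSize = rowBlockDim * colBlockDim floats and is
 * copied as a whole, so its internal memory layout is preserved. */
static void gebsr2gebsc_ref(int mb, int nb, int nnzb,
                            const float *bsrVal, const int *bsrRowPtr,
                            const int *bsrColInd, int blockSize,
                            float *bscVal, int *bscRowInd, int *bscColPtr)
{
    assert(mb >= 0 && nb >= 0 && nnzb >= 0 && blockSize > 0);

    /* Count the blocks in each block column. */
    memset(bscColPtr, 0, sizeof(int) * (size_t)(nb + 1));
    for (int k = 0; k < nnzb; ++k)
        bscColPtr[bsrColInd[k] + 1]++;

    /* Prefix sum turns the counts into block-column start offsets. */
    for (int j = 0; j < nb; ++j)
        bscColPtr[j + 1] += bscColPtr[j];

    /* Scatter blocks; bscColPtr[j] serves as the running insert cursor. */
    for (int i = 0; i < mb; ++i) {
        for (int k = bsrRowPtr[i]; k < bsrRowPtr[i + 1]; ++k) {
            int dst = bscColPtr[bsrColInd[k]]++;
            bscRowInd[dst] = i;
            memcpy(bscVal + (size_t)dst * blockSize,
                   bsrVal + (size_t)k * blockSize,
                   sizeof(float) * (size_t)blockSize);
        }
    }

    /* Undo the cursor advance by shifting the offsets back one slot. */
    for (int j = nb; j > 0; --j)
        bscColPtr[j] = bscColPtr[j - 1];
    bscColPtr[0] = 0;
}
```

Skipping the `memcpy` would correspond to computing only the structure, in the spirit of CUSPARSE_ACTION_SYMBOLIC.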

Input

handle

handle to the cuSPARSE library context.

mb

number of block rows of sparse matrix A.

nb

number of block columns of sparse matrix A.

nnzb

number of nonzero blocks of matrix A.

bsrVal

<type> array of nnzb*rowBlockDim*colBlockDim nonzero elements of matrix A.

bsrRowPtr

integer array of mb+1 elements that contains the start of every block row and the end of the last block row plus one.

bsrColInd

integer array of nnzb column indices of the non-zero blocks of matrix A.

rowBlockDim

number of rows within a block of A.

colBlockDim

number of columns within a block of A.

copyValues

CUSPARSE_ACTION_SYMBOLIC or CUSPARSE_ACTION_NUMERIC.

baseIdx

CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.

pBufferSize

host pointer containing the number of bytes of the buffer used in gebsr2gebsc().

pBuffer

buffer allocated by the user; the size is returned by gebsr2gebsc_bufferSize().

Output

bscVal

<type> array of nnzb*rowBlockDim*colBlockDim non-zero elements of matrix A. It is only filled in if copyValues is set to CUSPARSE_ACTION_NUMERIC.

bscRowInd

integer array of nnzb row indices of the non-zero blocks of matrix A.

bscColPtr

integer array of nb+1 elements that contains the start of every block column and the end of the last block column plus one.

See cusparseStatus_t for the description of the return status.

5.9.3.cusparse<t>gebsr2gebsr() [DEPRECATED]

>This routine will be removed in a future major release.

cusparseStatus_t cusparseSgebsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, int rowBlockDimC, int colBlockDimC, int* pBufferSize)
cusparseStatus_t cusparseDgebsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, int rowBlockDimC, int colBlockDimC, int* pBufferSize)
cusparseStatus_t cusparseCgebsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, int rowBlockDimC, int colBlockDimC, int* pBufferSize)
cusparseStatus_t cusparseZgebsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, int rowBlockDimC, int colBlockDimC, int* pBufferSize)

cusparseStatus_t cusparseXgebsr2gebsrNnz(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, const cusparseMatDescr_t descrC, int* bsrRowPtrC, int rowBlockDimC, int colBlockDimC, int* nnzTotalDevHostPtr, void* pBuffer)
cusparseStatus_t cusparseSgebsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, const cusparseMatDescr_t descrC, float* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDimC, int colBlockDimC, void* pBuffer)
cusparseStatus_t cusparseDgebsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, const cusparseMatDescr_t descrC, double* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDimC, int colBlockDimC, void* pBuffer)
cusparseStatus_t cusparseCgebsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, const cusparseMatDescr_t descrC, cuComplex* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDimC, int colBlockDimC, void* pBuffer)
cusparseStatus_t cusparseZgebsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, int nnzb, const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDimA, int colBlockDimA, const cusparseMatDescr_t descrC, cuDoubleComplex* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDimC, int colBlockDimC, void* pBuffer)

This function converts a sparse matrix in general BSR format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in another general BSR format that is defined by arrays bsrValC, bsrRowPtrC, and bsrColIndC.

If rowBlockDimA=1 and colBlockDimA=1, cusparse[S|D|C|Z]gebsr2gebsr() is the same as cusparse[S|D|C|Z]csr2gebsr().

If rowBlockDimC=1 and colBlockDimC=1, cusparse[S|D|C|Z]gebsr2gebsr() is the same as cusparse[S|D|C|Z]gebsr2csr().

A is an m*n sparse matrix where m (= mb*rowBlockDimA) is the number of rows of A, and n (= nb*colBlockDimA) is the number of columns of A. The general BSR format of A contains nnzb (= bsrRowPtrA[mb] - bsrRowPtrA[0]) nonzero blocks. The matrix C is also in general BSR format with a different block size, rowBlockDimC*colBlockDimC. If m is not a multiple of rowBlockDimC, or n is not a multiple of colBlockDimC, zeros are filled in. The number of block rows of C is mc (= (m+rowBlockDimC-1)/rowBlockDimC). The number of block columns of C is nc (= (n+colBlockDimC-1)/colBlockDimC). The number of nonzero blocks of C is nnzc.
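
The block counts mc and nc are ceiling divisions of the matrix dimensions by the new block dimensions. The sketch below shows that arithmetic, together with the number of zero-padded rows or columns in the last block. The helper names `num_blocks` and `padding` are hypothetical and serve only as illustration.

```c
#include <assert.h>

/* Hypothetical helper: number of blocks needed to cover `extent` rows
 * (or columns) with blocks of size `blockDim`, i.e. ceiling division. */
static int num_blocks(int extent, int blockDim)
{
    assert(extent >= 0 && blockDim > 0);
    return (extent + blockDim - 1) / blockDim;
}

/* Number of zero-filled rows (or columns) in the last block. */
static int padding(int extent, int blockDim)
{
    return num_blocks(extent, blockDim) * blockDim - extent;
}
```

For instance, a matrix with m = 6 rows converted to rowBlockDimC = 4 needs mc = 2 block rows, and the last block row carries 2 rows of zeros.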

The implementation adopts a two-step approach to do the conversion. First, the user allocates bsrRowPtrC of mc+1 elements and uses function cusparseXgebsr2gebsrNnz() to determine the number of nonzero block columns per block row of matrix C. Second, the user gathers nnzc (number of non-zero block columns of matrix C) from either (nnzc = *nnzTotalDevHostPtr) or (nnzc = bsrRowPtrC[mc] - bsrRowPtrC[0]) and allocates bsrValC of nnzc*rowBlockDimC*colBlockDimC elements and bsrColIndC of nnzc integers. Finally, the function cusparse[S|D|C|Z]gebsr2gebsr() is called to complete the conversion.

The user must call gebsr2gebsr_bufferSize() to know the size of the buffer required by gebsr2gebsr(), allocate the buffer, and pass the buffer pointer to gebsr2gebsr().

The general procedure is as follows:

// Given general BSR format (bsrRowPtrA, bsrColIndA, bsrValA) and
// blocks of BSR format are stored in column-major order.
cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;
int base, nnzc;
int m = mb*rowBlockDimA;
int n = nb*colBlockDimA;
int mc = (m+rowBlockDimC-1)/rowBlockDimC;
int nc = (n+colBlockDimC-1)/colBlockDimC;
int bufferSize;
void *pBuffer;
cusparseSgebsr2gebsr_bufferSize(handle, dir, mb, nb, nnzb,
    descrA, bsrValA, bsrRowPtrA, bsrColIndA,
    rowBlockDimA, colBlockDimA,
    rowBlockDimC, colBlockDimC,
    &bufferSize);
cudaMalloc((void**)&pBuffer, bufferSize);
cudaMalloc((void**)&bsrRowPtrC, sizeof(int)*(mc+1));
// nnzTotalDevHostPtr points to host memory
int *nnzTotalDevHostPtr = &nnzc;
cusparseXgebsr2gebsrNnz(handle, dir, mb, nb, nnzb,
    descrA, bsrRowPtrA, bsrColIndA,
    rowBlockDimA, colBlockDimA,
    descrC, bsrRowPtrC,
    rowBlockDimC, colBlockDimC,
    nnzTotalDevHostPtr,
    pBuffer);
if (NULL != nnzTotalDevHostPtr){
    nnzc = *nnzTotalDevHostPtr;
}else{
    cudaMemcpy(&nnzc, bsrRowPtrC+mc, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&base, bsrRowPtrC, sizeof(int), cudaMemcpyDeviceToHost);
    nnzc -= base;
}
cudaMalloc((void**)&bsrColIndC, sizeof(int)*nnzc);
cudaMalloc((void**)&bsrValC, sizeof(float)*(rowBlockDimC*colBlockDimC)*nnzc);
cusparseSgebsr2gebsr(handle, dir, mb, nb, nnzb,
    descrA, bsrValA, bsrRowPtrA, bsrColIndA,
    rowBlockDimA, colBlockDimA,
    descrC, bsrValC, bsrRowPtrC, bsrColIndC,
    rowBlockDimC, colBlockDimC,
    pBuffer);

  • The routines require no extra storage if pBuffer != NULL

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routines do not support CUDA graph capture

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows of sparse matrix A.

nb

number of block columns of sparse matrix A.

nnzb

number of nonzero blocks of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb*rowBlockDimA*colBlockDimA non-zero elements of matrix A.

bsrRowPtrA

integer array of mb+1 elements that contains the start of every block row and the end of the last block row plus one of matrix A.

bsrColIndA

integer array of nnzb column indices of the non-zero blocks of matrix A.

rowBlockDimA

number of rows within a block of A.

colBlockDimA

number of columns within a block of A.

descrC

the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

rowBlockDimC

number of rows within a block of C.

colBlockDimC

number of columns within a block of C.

pBufferSize

host pointer containing the number of bytes of the buffer used in gebsr2gebsr().

pBuffer

buffer allocated by the user; the size is returned by gebsr2gebsr_bufferSize().

Output

bsrValC

<type> array of nnzc*rowBlockDimC*colBlockDimC non-zero elements of matrix C.

bsrRowPtrC

integer array of mc+1 elements that contains the start of every block row and the end of the last block row plus one of matrix C.

bsrColIndC

integer array of nnzc block column indices of the nonzero blocks of matrix C.

nnzTotalDevHostPtr

total number of nonzero blocks of C. *nnzTotalDevHostPtr is the same as bsrRowPtrC[mc] - bsrRowPtrC[0].

See cusparseStatus_t for the description of the return status.

5.9.4.cusparse<t>gebsr2csr() [DEPRECATED]

>This routine will be removed in a future major release.

cusparseStatus_t cusparseSgebsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const float* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDim, int colBlockDim, const cusparseMatDescr_t descrC, float* csrValC, int* csrRowPtrC, int* csrColIndC)
cusparseStatus_t cusparseDgebsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const double* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDim, int colBlockDim, const cusparseMatDescr_t descrC, double* csrValC, int* csrRowPtrC, int* csrColIndC)
cusparseStatus_t cusparseCgebsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const cuComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDim, int colBlockDim, const cusparseMatDescr_t descrC, cuComplex* csrValC, int* csrRowPtrC, int* csrColIndC)
cusparseStatus_t cusparseZgebsr2csr(cusparseHandle_t handle, cusparseDirection_t dir, int mb, int nb, const cusparseMatDescr_t descrA, const cuDoubleComplex* bsrValA, const int* bsrRowPtrA, const int* bsrColIndA, int rowBlockDim, int colBlockDim, const cusparseMatDescr_t descrC, cuDoubleComplex* csrValC, int* csrRowPtrC, int* csrColIndC)

This function converts a sparse matrix in general BSR format that is defined by the three arrays bsrValA, bsrRowPtrA, and bsrColIndA into a sparse matrix in CSR format that is defined by arrays csrValC, csrRowPtrC, and csrColIndC.

Let m (= mb*rowBlockDim) be the number of rows of A and n (= nb*colBlockDim) be the number of columns of A; then A and C are m*n sparse matrices. The general BSR format of A contains nnzb (= bsrRowPtrA[mb] - bsrRowPtrA[0]) non-zero blocks, whereas sparse matrix A contains nnz (= nnzb*rowBlockDim*colBlockDim) elements. The user must allocate enough space for arrays csrRowPtrC, csrColIndC, and csrValC. The requirements are as follows:

csrRowPtrC of m+1 elements

csrValC of nnz elements

csrColIndC of nnz elements

The general procedure is as follows:

// Given general BSR format (bsrRowPtrA, bsrColIndA, bsrValA) and
// blocks of BSR format are stored in column-major order.
cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;
int m = mb*rowBlockDim;
int n = nb*colBlockDim;
// bsrRowPtrA is dereferenced on the host here; if it resides in device
// memory, copy these two entries back with cudaMemcpy first.
int nnzb = bsrRowPtrA[mb] - bsrRowPtrA[0];   // number of blocks
int nnz  = nnzb * rowBlockDim * colBlockDim; // number of elements
cudaMalloc((void**)&csrRowPtrC, sizeof(int)*(m+1));
cudaMalloc((void**)&csrColIndC, sizeof(int)*nnz);
cudaMalloc((void**)&csrValC, sizeof(float)*nnz);
cusparseSgebsr2csr(handle, dir, mb, nb,
                   descrA,
                   bsrValA, bsrRowPtrA, bsrColIndA,
                   rowBlockDim, colBlockDim,
                   descrC,
                   csrValC, csrRowPtrC, csrColIndC);

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture
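
As an illustration of what the conversion produces, the CPU reference sketch below expands every stored block into CSR entries, including any explicit zeros inside a block, so the output holds nnzb*rowBlockDim*colBlockDim elements. It assumes zero-based indices and `float` values; the name `gebsr2csr_ref` and the `colMajor` flag (standing in for CUSPARSE_DIRECTION_COLUMN) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * Assumes zero-based indices; colMajor != 0 corresponds to
 * column-major storage inside each block, 0 to row-major storage. */
static void gebsr2csr_ref(int mb, const float *bsrVal,
                          const int *bsrRowPtr, const int *bsrColInd,
                          int rowBlockDim, int colBlockDim, int colMajor,
                          float *csrVal, int *csrRowPtr, int *csrColInd)
{
    assert(mb >= 0 && rowBlockDim > 0 && colBlockDim > 0);
    size_t blockSize = (size_t)rowBlockDim * (size_t)colBlockDim;
    int out = 0;  /* next free slot in csrVal / csrColInd */

    for (int ib = 0; ib < mb; ++ib) {
        for (int r = 0; r < rowBlockDim; ++r) {
            csrRowPtr[ib * rowBlockDim + r] = out;
            /* Row r of every block in block row ib, left to right. */
            for (int k = bsrRowPtr[ib]; k < bsrRowPtr[ib + 1]; ++k) {
                const float *blk = bsrVal + (size_t)k * blockSize;
                for (int c = 0; c < colBlockDim; ++c) {
                    csrColInd[out] = bsrColInd[k] * colBlockDim + c;
                    csrVal[out] = colMajor ? blk[c * rowBlockDim + r]
                                           : blk[r * colBlockDim + c];
                    ++out;
                }
            }
        }
    }
    csrRowPtr[mb * rowBlockDim] = out;  /* total number of CSR entries */
}
```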

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

mb

number of block rows of sparse matrix A.

nb

number of block columns of sparse matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

bsrValA

<type> array of nnzb*rowBlockDim*colBlockDim non-zero elements of matrix A.

bsrRowPtrA

integer array of mb+1 elements that contains the start of every block row and the end of the last block row plus one of matrix A.

bsrColIndA

integer array of nnzb column indices of the non-zero blocks of matrix A.

rowBlockDim

number of rows within a block of A.

colBlockDim

number of columns within a block of A.

descrC

the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

Output

csrValC

<type> array of nnz non-zero elements of matrix C.

csrRowPtrC

integer array of m+1 elements that contains the start of every row and the end of the last row plus one of matrix C.

csrColIndC

integer array of nnz column indices of the non-zero elements of matrix C.

See cusparseStatus_t for the description of the return status.

5.9.5.cusparse<t>csr2gebsr()

cusparseStatus_t cusparseScsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const float* csrValA, const int* csrRowPtrA, const int* csrColIndA, int rowBlockDim, int colBlockDim, int* pBufferSize)
cusparseStatus_t cusparseDcsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const double* csrValA, const int* csrRowPtrA, const int* csrColIndA, int rowBlockDim, int colBlockDim, int* pBufferSize)
cusparseStatus_t cusparseCcsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const cuComplex* csrValA, const int* csrRowPtrA, const int* csrColIndA, int rowBlockDim, int colBlockDim, int* pBufferSize)
cusparseStatus_t cusparseZcsr2gebsr_bufferSize(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const cuDoubleComplex* csrValA, const int* csrRowPtrA, const int* csrColIndA, int rowBlockDim, int colBlockDim, int* pBufferSize)

cusparseStatus_t cusparseXcsr2gebsrNnz(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const int* csrRowPtrA, const int* csrColIndA, const cusparseMatDescr_t descrC, int* bsrRowPtrC, int rowBlockDim, int colBlockDim, int* nnzTotalDevHostPtr, void* pBuffer)
cusparseStatus_t cusparseScsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const float* csrValA, const int* csrRowPtrA, const int* csrColIndA, const cusparseMatDescr_t descrC, float* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDim, int colBlockDim, void* pBuffer)
cusparseStatus_t cusparseDcsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const double* csrValA, const int* csrRowPtrA, const int* csrColIndA, const cusparseMatDescr_t descrC, double* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDim, int colBlockDim, void* pBuffer)
cusparseStatus_t cusparseCcsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const cuComplex* csrValA, const int* csrRowPtrA, const int* csrColIndA, const cusparseMatDescr_t descrC, cuComplex* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDim, int colBlockDim, void* pBuffer)
cusparseStatus_t cusparseZcsr2gebsr(cusparseHandle_t handle, cusparseDirection_t dir, int m, int n, const cusparseMatDescr_t descrA, const cuDoubleComplex* csrValA, const int* csrRowPtrA, const int* csrColIndA, const cusparseMatDescr_t descrC, cuDoubleComplex* bsrValC, int* bsrRowPtrC, int* bsrColIndC, int rowBlockDim, int colBlockDim, void* pBuffer)

This function converts a sparse matrix A in CSR format (that is defined by arrays csrValA, csrRowPtrA, and csrColIndA) into a sparse matrix C in general BSR format (that is defined by the three arrays bsrValC, bsrRowPtrC, and bsrColIndC).

The matrix A is an m*n sparse matrix and matrix C is a (mb*rowBlockDim)*(nb*colBlockDim) sparse matrix, where mb (= (m+rowBlockDim-1)/rowBlockDim) is the number of block rows of C, and nb (= (n+colBlockDim-1)/colBlockDim) is the number of block columns of C.

The block of C is of size rowBlockDim*colBlockDim. If m is not a multiple of rowBlockDim or n is not a multiple of colBlockDim, zeros are filled in.

The implementation adopts a two-step approach to do the conversion. First, the user allocates bsrRowPtrC of mb+1 elements and uses function cusparseXcsr2gebsrNnz() to determine the number of nonzero block columns per block row. Second, the user gathers nnzb (number of nonzero block columns of matrix C) from either (nnzb = *nnzTotalDevHostPtr) or (nnzb = bsrRowPtrC[mb] - bsrRowPtrC[0]) and allocates bsrValC of nnzb*rowBlockDim*colBlockDim elements and bsrColIndC of nnzb integers. Finally, the function cusparse[S|D|C|Z]csr2gebsr() is called to complete the conversion.

The user must obtain the size of the buffer required by csr2gebsr() by calling csr2gebsr_bufferSize(), allocate the buffer, and pass the buffer pointer to csr2gebsr().

The general procedure is as follows:

// Given CSR format (csrRowPtrA, csrColIndA, csrValA) and
// blocks of BSR format are stored in column-major order.
cusparseDirection_t dir = CUSPARSE_DIRECTION_COLUMN;
int base, nnzb;
int mb = (m+rowBlockDim-1)/rowBlockDim;
int nb = (n+colBlockDim-1)/colBlockDim;
int bufferSize;
void *pBuffer;
cusparseScsr2gebsr_bufferSize(handle, dir, m, n,
    descrA, csrValA, csrRowPtrA, csrColIndA,
    rowBlockDim, colBlockDim,
    &bufferSize);
cudaMalloc((void**)&pBuffer, bufferSize);
cudaMalloc((void**)&bsrRowPtrC, sizeof(int)*(mb+1));
// nnzTotalDevHostPtr points to host memory
int *nnzTotalDevHostPtr = &nnzb;
cusparseXcsr2gebsrNnz(handle, dir, m, n,
    descrA, csrRowPtrA, csrColIndA,
    descrC, bsrRowPtrC, rowBlockDim, colBlockDim,
    nnzTotalDevHostPtr,
    pBuffer);
if (NULL != nnzTotalDevHostPtr){
    nnzb = *nnzTotalDevHostPtr;
}else{
    cudaMemcpy(&nnzb, bsrRowPtrC+mb, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&base, bsrRowPtrC, sizeof(int), cudaMemcpyDeviceToHost);
    nnzb -= base;
}
cudaMalloc((void**)&bsrColIndC, sizeof(int)*nnzb);
cudaMalloc((void**)&bsrValC, sizeof(float)*(rowBlockDim*colBlockDim)*nnzb);
cusparseScsr2gebsr(handle, dir, m, n,
    descrA,
    csrValA, csrRowPtrA, csrColIndA,
    descrC,
    bsrValC, bsrRowPtrC, bsrColIndC,
    rowBlockDim, colBlockDim,
    pBuffer);

The routine cusparseXcsr2gebsrNnz() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

The routine cusparse<t>csr2gebsr() has the following properties:

  • The routine requires no extra storage if pBuffer != NULL.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.
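
The counting step of the two-step approach can be illustrated on the CPU. The sketch below fills the block row pointer array and returns the number of nonzero blocks, which is the information cusparseXcsr2gebsrNnz() provides. It assumes zero-based indices; the name `csr2gebsr_nnz_ref` and its `lastSeen` work array are hypothetical and say nothing about the library's internal algorithm.

```c
#include <assert.h>
#include <stdlib.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * Fills bsrRowPtr (mb + 1 entries) and returns nnzb.
 * Zero-based indices are assumed. Returns -1 if allocation fails. */
static int csr2gebsr_nnz_ref(int m, int n, const int *csrRowPtr,
                             const int *csrColInd,
                             int rowBlockDim, int colBlockDim,
                             int *bsrRowPtr)
{
    assert(m >= 0 && n >= 0 && rowBlockDim > 0 && colBlockDim > 0);
    int mb = (m + rowBlockDim - 1) / rowBlockDim;
    int nb = (n + colBlockDim - 1) / colBlockDim;

    /* lastSeen[jb] = last block row in which block column jb appeared. */
    int *lastSeen = (int *)malloc(sizeof(int) * (size_t)(nb > 0 ? nb : 1));
    if (lastSeen == NULL) return -1;
    for (int jb = 0; jb < nb; ++jb) lastSeen[jb] = -1;

    int nnzb = 0;
    for (int ib = 0; ib < mb; ++ib) {
        bsrRowPtr[ib] = nnzb;
        int rowEnd = (ib + 1) * rowBlockDim;
        if (rowEnd > m) rowEnd = m;        /* last block row may be partial */
        for (int i = ib * rowBlockDim; i < rowEnd; ++i) {
            for (int k = csrRowPtr[i]; k < csrRowPtr[i + 1]; ++k) {
                int jb = csrColInd[k] / colBlockDim;
                if (lastSeen[jb] != ib) {  /* first hit in this block row */
                    lastSeen[jb] = ib;
                    ++nnzb;
                }
            }
        }
    }
    bsrRowPtr[mb] = nnzb;
    free(lastSeen);
    return nnzb;
}
```

The returned value plays the role of nnzb in the procedure above: it determines the sizes of bsrValC (nnzb*rowBlockDim*colBlockDim elements) and bsrColIndC (nnzb integers).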

Input

handle

handle to the cuSPARSE library context.

dir

storage format of blocks, either CUSPARSE_DIRECTION_ROW or CUSPARSE_DIRECTION_COLUMN.

m

number of rows of sparse matrix A.

n

number of columns of sparse matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

csrValA

<type> array of nnz nonzero elements of matrix A.

csrRowPtrA

integer array of m+1 elements that contains the start of every row and the end of the last row plus one of matrix A.

csrColIndA

integer array of nnz column indices of the nonzero elements of matrix A.

descrC

the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

rowBlockDim

number of rows within a block of C.

colBlockDim

number of columns within a block of C.

pBuffer

buffer allocated by the user; the size is returned by csr2gebsr_bufferSize().

Output

bsrValC

<type> array of nnzb*rowBlockDim*colBlockDim nonzero elements of matrix C.

bsrRowPtrC

integer array of mb+1 elements that contains the start of every block row and the end of the last block row plus one of matrix C.

bsrColIndC

integer array of nnzb column indices of the nonzero blocks of matrix C.

nnzTotalDevHostPtr

total number of nonzero blocks of matrix C. Pointer nnzTotalDevHostPtr can point to device memory or host memory.

See cusparseStatus_t for the description of the return status.

5.9.6.cusparse<t>coo2csr()

cusparseStatus_t cusparseXcoo2csr(cusparseHandle_t handle, const int* cooRowInd, int nnz, int m, int* csrRowPtr, cusparseIndexBase_t idxBase)

This function converts the array containing the uncompressed row indices (corresponding to COO format) into an array of compressed row pointers (corresponding to CSR format).

It can also be used to convert the array containing the uncompressed column indices (corresponding to COO format) into an array of column pointers (corresponding to CSC format).

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.
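
The compression can be sketched on the CPU as a histogram of row indices followed by a prefix sum. The function below is a hypothetical reference (`coo2csr_ref` is not a cuSPARSE API) that takes the index base as a plain integer, 0 or 1, in place of cusparseIndexBase_t.

```c
#include <assert.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * idxBase is 0 or 1 and applies to both cooRowInd and csrRowPtr. */
static void coo2csr_ref(const int *cooRowInd, int nnz, int m,
                        int *csrRowPtr, int idxBase)
{
    assert(nnz >= 0 && m >= 0 && (idxBase == 0 || idxBase == 1));
    for (int i = 0; i <= m; ++i) csrRowPtr[i] = 0;

    /* Histogram: csrRowPtr[r + 1] counts the entries of row r. */
    for (int k = 0; k < nnz; ++k)
        csrRowPtr[cooRowInd[k] - idxBase + 1]++;

    /* Prefix sum gives zero-based row starts; then apply the index base. */
    for (int i = 0; i < m; ++i) csrRowPtr[i + 1] += csrRowPtr[i];
    for (int i = 0; i <= m; ++i) csrRowPtr[i] += idxBase;
}
```

Passing an array of column indices together with the number of columns yields CSC column pointers in the same way.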

Input

handle

handle to the cuSPARSE library context.

cooRowInd

integer array of nnz uncompressed row indices.

nnz

number of non-zeros of the sparse matrix (that is also the length of array cooRowInd).

m

number of rows of matrix A.

idxBase

CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.

Output

csrRowPtr

integer array of m+1 elements that contains the start of every row and the end of the last row plus one.

See cusparseStatus_t for the description of the return status.

5.9.7.cusparse<t>csr2coo()

cusparseStatus_t cusparseXcsr2coo(cusparseHandle_t handle, const int* csrRowPtr, int nnz, int m, int* cooRowInd, cusparseIndexBase_t idxBase)

This function converts the array containing the compressed row pointers (corresponding to CSR format) into an array of uncompressed row indices (corresponding to COO format).

It can also be used to convert the array containing the compressed column pointers (corresponding to CSC format) into an array of uncompressed column indices (corresponding to COO format).

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.
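
The expansion is the inverse of the compression step: every row index is repeated once per entry of that row. A minimal CPU sketch, using the hypothetical name `csr2coo_ref` and a plain integer index base (0 or 1), could look as follows.

```c
#include <assert.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * idxBase is 0 or 1 and applies to both csrRowPtr and cooRowInd. */
static void csr2coo_ref(const int *csrRowPtr, int nnz, int m,
                        int *cooRowInd, int idxBase)
{
    assert(m >= 0 && (idxBase == 0 || idxBase == 1));
    /* The pointer array must describe exactly nnz entries. */
    assert(m == 0 || csrRowPtr[m] - csrRowPtr[0] == nnz);

    for (int i = 0; i < m; ++i) {
        /* Entries csrRowPtr[i] .. csrRowPtr[i + 1] - 1 belong to row i. */
        for (int k = csrRowPtr[i] - idxBase; k < csrRowPtr[i + 1] - idxBase; ++k)
            cooRowInd[k] = i + idxBase;
    }
}
```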

Input

handle

handle to the cuSPARSE library context.

csrRowPtr

integer array of m+1 elements that contains the start of every row and the end of the last row plus one.

nnz

number of nonzeros of the sparse matrix (that is also the length of array cooRowInd).

m

number of rows of matrix A.

idxBase

CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE.

Output

cooRowInd

integer array of nnz uncompressed row indices.

See cusparseStatus_t for the description of the return status.

5.9.8.cusparseCsr2cscEx2()

cusparseStatus_t cusparseCsr2cscEx2_bufferSize(cusparseHandle_t handle, int m, int n, int nnz, const void* csrVal, const int* csrRowPtr, const int* csrColInd, void* cscVal, int* cscColPtr, int* cscRowInd, cudaDataType valType, cusparseAction_t copyValues, cusparseIndexBase_t idxBase, cusparseCsr2CscAlg_t alg, size_t* bufferSize)

cusparseStatus_t cusparseCsr2cscEx2(cusparseHandle_t handle, int m, int n, int nnz, const void* csrVal, const int* csrRowPtr, const int* csrColInd, void* cscVal, int* cscColPtr, int* cscRowInd, cudaDataType valType, cusparseAction_t copyValues, cusparseIndexBase_t idxBase, cusparseCsr2CscAlg_t alg, void* buffer)

This function converts a sparse matrix in CSR format (that is defined by the three arrays csrVal, csrRowPtr, and csrColInd) into a sparse matrix in CSC format (that is defined by arrays cscVal, cscRowInd, and cscColPtr). The resulting matrix can also be seen as the transpose of the original sparse matrix. Notice that this routine can also be used to convert a matrix in CSC format into a matrix in CSR format.

The routine requires extra storage proportional to the number of nonzero values nnz. It always produces the same output matrix for the same input.

It is executed asynchronously with respect to the host, and it may return control to the application on the host before the result is ready.

The function cusparseCsr2cscEx2_bufferSize() returns the size of the workspace needed by cusparseCsr2cscEx2(). The user needs to allocate a buffer of this size and pass that buffer to cusparseCsr2cscEx2() as an argument.

If nnz == 0, then csrColInd, csrVal, cscVal, and cscRowInd may be NULL. In this case, cscColPtr is set to idxBase for all values.

If m == 0 or n == 0, the pointers are not checked and the routine returns CUSPARSE_STATUS_SUCCESS.
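
The conversion can be pictured as a counting sort of the entries by column index. The CPU sketch below is a hypothetical reference (`csr2csc_ref` is not a cuSPARSE API) restricted to `float` values; the `numeric` flag stands in for the choice between CUSPARSE_ACTION_NUMERIC and CUSPARSE_ACTION_SYMBOLIC, and `idxBase` is a plain integer, 0 or 1. It also reproduces the nnz == 0 behavior described above, where every entry of cscColPtr equals the index base.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * numeric != 0 copies values as well; numeric == 0 builds only the
 * structure (column offsets and row indices). */
static void csr2csc_ref(int m, int n, int nnz,
                        const float *csrVal, const int *csrRowPtr,
                        const int *csrColInd,
                        float *cscVal, int *cscColPtr, int *cscRowInd,
                        int numeric, int idxBase)
{
    assert(m >= 0 && n >= 0 && nnz >= 0 && (idxBase == 0 || idxBase == 1));

    /* Count the entries of each column, then prefix-sum into offsets. */
    for (int j = 0; j <= n; ++j) cscColPtr[j] = 0;
    for (int k = 0; k < nnz; ++k) cscColPtr[csrColInd[k] - idxBase + 1]++;
    for (int j = 0; j < n; ++j) cscColPtr[j + 1] += cscColPtr[j];

    /* Scatter row by row, so rows stay ordered inside each column. */
    for (int i = 0; i < m; ++i) {
        for (int k = csrRowPtr[i] - idxBase; k < csrRowPtr[i + 1] - idxBase; ++k) {
            int dst = cscColPtr[csrColInd[k] - idxBase]++;
            cscRowInd[dst] = i + idxBase;
            if (numeric) cscVal[dst] = csrVal[k];
        }
    }

    /* Restore the start offsets and apply the index base. */
    for (int j = n; j > 0; --j) cscColPtr[j] = cscColPtr[j - 1];
    cscColPtr[0] = 0;
    for (int j = 0; j <= n; ++j) cscColPtr[j] += idxBase;
}
```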

Input

handle

Handle to the cuSPARSE library context

m

Number of rows of the CSR input matrix; number of columns of the CSC output matrix

n

Number of columns of the CSR input matrix; number of rows of the CSC output matrix

nnz

Number of nonzero elements of the CSR and CSC matrices

csrVal

Value array of size nnz of the CSR matrix; of same type as valType

csrRowPtr

Integer array of size m+1 that contains the CSR row offsets

csrColInd

Integer array of size nnz that contains the CSR column indices

cscVal

Value array of size nnz of the CSC matrix; of same type as valType

cscColPtr

Integer array of size n+1 that contains the CSC column offsets

cscRowInd

Integer array of size nnz that contains the CSC row indices

valType

Value type for both CSR and CSC matrices

copyValues

CUSPARSE_ACTION_SYMBOLIC or CUSPARSE_ACTION_NUMERIC

idxBase

Index base CUSPARSE_INDEX_BASE_ZERO or CUSPARSE_INDEX_BASE_ONE

alg

Algorithm implementation. See cusparseCsr2CscAlg_t for possible values.

bufferSize

Number of bytes of workspace needed by cusparseCsr2cscEx2()

buffer

Pointer to workspace buffer

cusparseCsr2cscEx2() supports the following data types:

X/Y

CUDA_R_8I

CUDA_R_16F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

cusparseCsr2cscEx2() supports the following algorithms (cusparseCsr2CscAlg_t):

Algorithm

Notes

CUSPARSE_CSR2CSC_ALG_DEFAULT, CUSPARSE_CSR2CSC_ALG1

Default algorithm

Action

Notes

CUSPARSE_ACTION_SYMBOLIC

Compute the “structure” of the CSC output matrix (offset, row indices)

CUSPARSE_ACTION_NUMERIC

Compute the “structure” of the CSC output matrix and copy the values

cusparseCsr2cscEx2() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

cusparseCsr2cscEx2() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

5.9.9.cusparse<t>nnz()

cusparseStatus_t cusparseSnnz(cusparseHandle_t handle, cusparseDirection_t dirA, int m, int n, const cusparseMatDescr_t descrA, const float* A, int lda, int* nnzPerRowColumn, int* nnzTotalDevHostPtr)
cusparseStatus_t cusparseDnnz(cusparseHandle_t handle, cusparseDirection_t dirA, int m, int n, const cusparseMatDescr_t descrA, const double* A, int lda, int* nnzPerRowColumn, int* nnzTotalDevHostPtr)
cusparseStatus_t cusparseCnnz(cusparseHandle_t handle, cusparseDirection_t dirA, int m, int n, const cusparseMatDescr_t descrA, const cuComplex* A, int lda, int* nnzPerRowColumn, int* nnzTotalDevHostPtr)
cusparseStatus_t cusparseZnnz(cusparseHandle_t handle, cusparseDirection_t dirA, int m, int n, const cusparseMatDescr_t descrA, const cuDoubleComplex* A, int lda, int* nnzPerRowColumn, int* nnzTotalDevHostPtr)

This function computes the number of nonzero elements per row or column and the total number of nonzero elements in a dense matrix.

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.
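
A CPU sketch of the counting is shown below. It assumes that the dense matrix is stored in column-major order with leading dimension lda, so that element (i, j) lives at A[i + j*lda]; this matches the (lda, n) array shape given in the parameter list but is stated here as an assumption. The name `dense_nnz_ref` and the `byRow` flag (standing in for CUSPARSE_DIRECTION_ROW versus CUSPARSE_DIRECTION_COLUMN) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* CPU reference sketch (hypothetical helper, not a cuSPARSE API).
 * A is assumed column-major with leading dimension lda >= m.
 * byRow != 0 counts per row (m counters), otherwise per column
 * (n counters). Returns the total number of nonzero elements. */
static int dense_nnz_ref(int byRow, int m, int n, const float *A, int lda,
                         int *nnzPerRowColumn)
{
    assert(m >= 0 && n >= 0 && lda >= m);
    int counters = byRow ? m : n;
    for (int i = 0; i < counters; ++i) nnzPerRowColumn[i] = 0;

    int total = 0;
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            /* Rows m .. lda-1 of each column are padding and are skipped. */
            if (A[(size_t)j * (size_t)lda + (size_t)i] != 0.0f) {
                nnzPerRowColumn[byRow ? i : j]++;
                ++total;
            }
        }
    }
    return total;
}
```

The per-row counts are what a subsequent dense-to-CSR conversion needs in order to size its row pointer and value arrays.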

Input

handle

handle to the cuSPARSE library context.

dirA

direction that specifies whether to count nonzero elements by CUSPARSE_DIRECTION_ROW or by CUSPARSE_DIRECTION_COLUMN.

m

number of rows of matrix A.

n

number of columns of matrix A.

descrA

the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.

A

array of dimensions (lda, n).

lda

leading dimension of dense array A.

Output

nnzPerRowColumn

array of size m or n containing the number of nonzero elements per row or column, respectively.

nnzTotalDevHostPtr

total number of nonzero elements in device or host memory.

See cusparseStatus_t for the description of the return status.

5.9.10.cusparseCreateIdentityPermutation() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseCreateIdentityPermutation(cusparseHandle_t handle, int n, int* p);

This function creates an identity map. The output parameter p represents this map as p = 0:1:(n-1).

This function is typically used with coosort, csrsort, and cscsort.

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
n | host | size of the map.

Output

parameter | device or host | description
p | device | integer array of dimension n.

See cusparseStatus_t for the description of the return status.
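For reference, the notation p = 0:1:(n-1) simply means p[i] = i. A minimal host-side equivalent is sketched here; identity_permutation_ref is a hypothetical name, and the library routine writes to a device array instead.

```c
/* Host-side sketch of the identity map p = 0:1:(n-1).
   Hypothetical helper; the cuSPARSE routine fills a device array. */
static void identity_permutation_ref(int n, int *p)
{
    for (int i = 0; i < n; ++i)
        p[i] = i;
}
```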

5.9.11.cusparseXcoosort()

cusparseStatus_t
cusparseXcoosort_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    const int* cooRows, const int* cooCols, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseXcoosortByRow(cusparseHandle_t handle, int m, int n, int nnz,
    int* cooRows, int* cooCols, int* P, void* pBuffer)

cusparseStatus_t
cusparseXcoosortByColumn(cusparseHandle_t handle, int m, int n, int nnz,
    int* cooRows, int* cooCols, int* P, void* pBuffer);

This function sorts a matrix in COO format. The sorting is in-place, and the user can sort by row or by column.

A is an \(m \times n\) sparse matrix that is defined in COO storage format by the three arrays cooVals, cooRows, and cooCols.

There is no assumption about the base index of the matrix. coosort uses a stable sort on signed integers, so the values of cooRows or cooCols can be negative.

The function coosort() requires a buffer whose size is returned by coosort_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

The parameter P is both input and output. If the user wants to compute the sorted cooVal, P must be set to 0:1:(nnz-1) before coosort(); after coosort(), the new sorted value array satisfies cooVal_sorted = cooVal(P).

Remark: the dimensions m and n are not used. If the user does not know the value of m or n, any positive value can be passed. This usually happens when the user reads a COO array first and needs to decide the dimension m or n later.

  • The routine requires no extra storage if pBuffer != NULL

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
nnz | host | number of nonzero elements of matrix A.
cooRows | device | integer array of nnz unsorted row indices of A.
cooCols | device | integer array of nnz unsorted column indices of A.
P | device | integer array of nnz unsorted map indices. To construct cooVal, the user has to set P = 0:1:(nnz-1).
pBuffer | device | buffer allocated by the user; the size is returned by coosort_bufferSizeExt().

Output

parameter | device or host | description
cooRows | device | integer array of nnz sorted row indices of A.
cooCols | device | integer array of nnz sorted column indices of A.
P | device | integer array of nnz sorted map indices.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseXcoosortByRow for a code example.
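The role of P and the relation cooVal_sorted = cooVal(P) can be sketched on the host as follows. This is an illustrative reference under stated assumptions, not the library algorithm: the function names are hypothetical, a simple stable insertion sort stands in for whatever the library uses, and the sort key is the row index only (how ties between equal rows are ordered by the library is not specified in the text above).

```c
/* Host-side sketch (hypothetical helpers, not cuSPARSE calls).
   Stable insertion sort of COO index pairs by row index. P must hold
   0..nnz-1 on entry and records, for each sorted slot, the original slot. */
static void coosort_by_row_ref(int nnz, int *cooRows, int *cooCols, int *P)
{
    for (int i = 1; i < nnz; ++i) {
        int r = cooRows[i], c = cooCols[i], p = P[i];
        int j = i - 1;
        /* strict '>' keeps entries with equal rows in their original order */
        while (j >= 0 && cooRows[j] > r) {
            cooRows[j + 1] = cooRows[j];
            cooCols[j + 1] = cooCols[j];
            P[j + 1] = P[j];
            --j;
        }
        cooRows[j + 1] = r;
        cooCols[j + 1] = c;
        P[j + 1] = p;
    }
}

/* Gather: val_sorted[i] = val[P[i]], i.e. cooVal_sorted = cooVal(P). */
static void gather_ref(int nnz, const double *val, const int *P,
                       double *val_sorted)
{
    for (int i = 0; i < nnz; ++i)
        val_sorted[i] = val[P[i]];
}
```

Keeping the values out of the sort and gathering them afterwards through P is what lets one index sort serve any value type.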

5.9.12.cusparseXcsrsort()

cusparseStatus_t
cusparseXcsrsort_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    const int* csrRowPtr, const int* csrColInd, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseXcsrsort(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, const int* csrRowPtr, int* csrColInd,
    int* P, void* pBuffer)

This function sorts a matrix in CSR format. The sort is stable and in-place.

The matrix type is regarded as CUSPARSE_MATRIX_TYPE_GENERAL implicitly. In other words, any symmetric property is ignored.

The function csrsort() requires a buffer whose size is returned by csrsort_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

The parameter P is both input and output. If the user wants to compute the sorted csrVal, P must be set to 0:1:(nnz-1) before csrsort(); after csrsort(), the new sorted value array satisfies csrVal_sorted = csrVal(P).

The general procedure is as follows:

// A is a 3x3 sparse matrix, base-0
//     | 1 2 3 |
// A = | 4 5 6 |
//     | 7 8 9 |
const int m = 3;
const int n = 3;
const int nnz = 9;
csrRowPtr[m+1] = { 0, 3, 6, 9 };                // on device
csrColInd[nnz] = { 2, 1, 0, 0, 2, 1, 1, 2, 0 }; // on device
csrVal[nnz]    = { 3, 2, 1, 4, 6, 5, 8, 9, 7 }; // on device
size_t pBufferSizeInBytes = 0;
void *pBuffer = NULL;
int *P = NULL;

// step 1: allocate buffer
cusparseXcsrsort_bufferSizeExt(handle, m, n, nnz, csrRowPtr, csrColInd, &pBufferSizeInBytes);
cudaMalloc(&pBuffer, sizeof(char) * pBufferSizeInBytes);

// step 2: setup permutation vector P to identity
cudaMalloc((void**)&P, sizeof(int) * nnz);
cusparseCreateIdentityPermutation(handle, nnz, P);

// step 3: sort CSR format
cusparseXcsrsort(handle, m, n, nnz, descrA, csrRowPtr, csrColInd, P, pBuffer);

// step 4: gather sorted csrVal
cusparseDgthr(handle, nnz, csrVal, csrVal_sorted, P, CUSPARSE_INDEX_BASE_ZERO);

  • The routine requires no extra storage if pBuffer != NULL

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
nnz | host | number of nonzero elements of matrix A.
csrRowPtr | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColInd | device | integer array of nnz unsorted column indices of A.
P | device | integer array of nnz unsorted map indices. To construct csrVal, the user has to set P = 0:1:(nnz-1).
pBuffer | device | buffer allocated by the user; the size is returned by csrsort_bufferSizeExt().

Output

parameter | device or host | description
csrColInd | device | integer array of nnz sorted column indices of A.
P | device | integer array of nnz sorted map indices.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.
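To make the expected result of the csrsort procedure concrete, the following host-side sketch sorts the column indices within each row and tracks the permutation. It is an assumption-laden illustration, not the library algorithm: csrsort_ref is a hypothetical name, base-0 indexing is assumed, and insertion sort is used only because it is short and stable.

```c
/* Host-side sketch (hypothetical helper, not a cuSPARSE call).
   Sorts column indices inside every CSR row with a stable insertion sort.
   Assumes base-0 csrRowPtr. P must hold 0..nnz-1 on entry; on exit
   csrVal_sorted[i] = csrVal[P[i]] gives the matching values. */
static void csrsort_ref(int m, const int *csrRowPtr, int *csrColInd, int *P)
{
    for (int row = 0; row < m; ++row) {
        int begin = csrRowPtr[row];
        int end = csrRowPtr[row + 1];
        for (int i = begin + 1; i < end; ++i) {
            int c = csrColInd[i], p = P[i];
            int j = i - 1;
            while (j >= begin && csrColInd[j] > c) {
                csrColInd[j + 1] = csrColInd[j];
                P[j + 1] = P[j];
                --j;
            }
            csrColInd[j + 1] = c;
            P[j + 1] = p;
        }
    }
}
```

For the 3x3 matrix used in the procedure above, this reference yields P = {2,1,0,3,5,4,8,6,7}, so gathering csrVal through P gives {1,2,3,4,5,6,7,8,9}.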

5.9.13.cusparseXcscsort()

cusparseStatus_t
cusparseXcscsort_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    const int* cscColPtr, const int* cscRowInd, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseXcscsort(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, const int* cscColPtr, int* cscRowInd,
    int* P, void* pBuffer)

This function sorts a matrix in CSC format. The sort is stable and in-place.

The matrix type is regarded as CUSPARSE_MATRIX_TYPE_GENERAL implicitly. In other words, any symmetric property is ignored.

The function cscsort() requires a buffer whose size is returned by cscsort_bufferSizeExt(). The address of pBuffer must be a multiple of 128 bytes. If not, CUSPARSE_STATUS_INVALID_VALUE is returned.

The parameter P is both input and output. If the user wants to compute the sorted cscVal, P must be set to 0:1:(nnz-1) before cscsort(); after cscsort(), the new sorted value array satisfies cscVal_sorted = cscVal(P).

The general procedure is as follows:

// A is a 3x2 sparse matrix, base-0
//     | 1 2 |
// A = | 4 0 |
//     | 0 8 |
const int m = 3;
const int n = 2;
const int nnz = 4;
cscColPtr[n+1] = { 0, 2, 4 };             // on device
cscRowInd[nnz] = { 1, 0, 2, 0 };          // on device
cscVal[nnz]    = { 4.0, 1.0, 8.0, 2.0 };  // on device
size_t pBufferSizeInBytes = 0;
void *pBuffer = NULL;
int *P = NULL;

// step 1: allocate buffer
cusparseXcscsort_bufferSizeExt(handle, m, n, nnz, cscColPtr, cscRowInd, &pBufferSizeInBytes);
cudaMalloc(&pBuffer, sizeof(char) * pBufferSizeInBytes);

// step 2: setup permutation vector P to identity
cudaMalloc((void**)&P, sizeof(int) * nnz);
cusparseCreateIdentityPermutation(handle, nnz, P);

// step 3: sort CSC format
cusparseXcscsort(handle, m, n, nnz, descrA, cscColPtr, cscRowInd, P, pBuffer);

// step 4: gather sorted cscVal
cusparseDgthr(handle, nnz, cscVal, cscVal_sorted, P, CUSPARSE_INDEX_BASE_ZERO);

  • The routine requires no extra storage if pBuffer != NULL

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
nnz | host | number of nonzero elements of matrix A.
cscColPtr | device | integer array of n+1 elements that contains the start of every column and the end of the last column plus one.
cscRowInd | device | integer array of nnz unsorted row indices of A.
P | device | integer array of nnz unsorted map indices. To construct cscVal, the user has to set P = 0:1:(nnz-1).
pBuffer | device | buffer allocated by the user; the size is returned by cscsort_bufferSizeExt().

Output

parameter | device or host | description
cscRowInd | device | integer array of nnz sorted row indices of A.
P | device | integer array of nnz sorted map indices.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.
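The sort routines require the address of pBuffer to be a multiple of 128 bytes. Pointers returned directly by cudaMalloc are, per the CUDA programming guide, aligned to at least 256 bytes, so the requirement mostly matters when pBuffer is carved out of a larger user-managed allocation. A possible way to handle that case is sketched below; align_up and is_aligned are hypothetical helper names, and the pooling scenario itself is an assumption rather than something the library prescribes.

```c
#include <stddef.h>
#include <stdint.h>

/* Round a byte offset up to the next multiple of alignment
   (hypothetical helper; alignment must be nonzero). */
static size_t align_up(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

/* Check whether an address is a multiple of alignment
   (hypothetical helper, e.g. alignment = 128 for pBuffer). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr % alignment) == 0;
}
```

With these helpers, a sub-buffer would start at base + align_up(offset, 128), where base is itself a suitably aligned device pointer.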

5.9.14.cusparseXcsru2csr() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseCreateCsru2csrInfo(csru2csrInfo_t* info);

cusparseStatus_t
cusparseDestroyCsru2csrInfo(csru2csrInfo_t info);

cusparseStatus_t
cusparseScsru2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    float* csrVal, const int* csrRowPtr, int* csrColInd,
    csru2csrInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseDcsru2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    double* csrVal, const int* csrRowPtr, int* csrColInd,
    csru2csrInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseCcsru2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    cuComplex* csrVal, const int* csrRowPtr, int* csrColInd,
    csru2csrInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseZcsru2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnz,
    cuDoubleComplex* csrVal, const int* csrRowPtr, int* csrColInd,
    csru2csrInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseScsru2csr(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, float* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseDcsru2csr(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, double* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseCcsru2csr(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, cuComplex* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseZcsru2csr(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, cuDoubleComplex* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseScsr2csru(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, float* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseDcsr2csru(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, double* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseCcsr2csru(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, cuComplex* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseZcsr2csru(cusparseHandle_t handle, int m, int n, int nnz,
    const cusparseMatDescr_t descrA, cuDoubleComplex* csrVal, const int* csrRowPtr,
    int* csrColInd, csru2csrInfo_t info, void* pBuffer)

This function converts unsorted CSR format to sorted CSR format, and vice versa. The operation is in-place.

This function is a wrapper of csrsort and gthr. The use case is the following scenario.

Suppose the user has a matrix A in CSR format that is unsorted, and has implemented custom code (a CPU or GPU kernel) that relies on this special order (for example, diagonal first, then lower triangle, then upper triangle). The user wants to convert A to sorted CSR format when calling the cuSPARSE library, and then convert it back before running the custom kernel again. For example, suppose the user wants to solve a linear system Ax=b by the following iterative scheme

\[x^{(k+1)} = x^{(k)} + L^{-1} (b - A x^{(k)})\]

The code heavily uses SpMV (sparse matrix-vector multiplication) and triangular solve. Assume that the user has an in-house SpMV design based on the special order of A, but wants to use the cuSPARSE library for the triangular solver. Then the following scheme can work:

do
    step 1: compute residual vector \(r = b - A x^{(k)}\) by in-house SpMV
    step 2: B := sort(A), and L is the lower triangular part of B
            (only sort A once and keep the permutation vector)
    step 3: solve \(z = L^{-1} (b - A x^{(k)})\) by cusparseXcsrsv
    step 4: add correction \(x^{(k+1)} = x^{(k)} + z\)
    step 5: A := unsort(B)
            (use permutation vector to get back the unsorted CSR)
until convergence

The requirements of step 2 and step 5 are

  1. In-place operation.

  2. The permutation vector P is hidden in an opaque structure.

  3. No cudaMalloc inside the conversion routine. Instead, the user has to provide the buffer explicitly.

  4. The conversion between unsorted CSR and sorted CSR may be needed several times, but the function only generates the permutation vector P once.

  5. The function is based on csrsort, gather, and scatter operations.

The operation is called csru2csr, which means unsorted CSR to sorted CSR. The inverse operation, called csr2csru, is also provided.

In order to keep the permutation vector invisible, an opaque structure called csru2csrInfo is needed. Two functions (cusparseCreateCsru2csrInfo, cusparseDestroyCsru2csrInfo) are used to initialize and to destroy the opaque structure.

cusparse[S|D|C|Z]csru2csr_bufferSizeExt returns the size of the buffer. The permutation vector P is also allocated inside csru2csrInfo. The lifetime of the permutation vector is the same as the lifetime of csru2csrInfo.

cusparse[S|D|C|Z]csru2csr performs the forward transformation from unsorted CSR to sorted CSR. The first call uses csrsort to generate the permutation vector P, and subsequent calls use P to do the transformation.

cusparse[S|D|C|Z]csr2csru performs the backward transformation from sorted CSR to unsorted CSR. P is used to get the unsorted form back.
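Once P is known, the forward and backward transformations on the value array amount to a gather and a scatter through P. The host-side sketch below illustrates this under simplifying assumptions: the function names are hypothetical, the arrays are separate (out-of-place) for clarity whereas the library routines work in place using pBuffer, and P is passed explicitly although the library keeps it hidden inside csru2csrInfo.

```c
/* Host-side sketch (hypothetical helpers, not cuSPARSE calls).
   Forward, unsorted -> sorted: sorted[i] = unsorted[P[i]] (gather). */
static void csru2csr_values_ref(int nnz, const int *P,
                                const double *unsorted, double *sorted)
{
    for (int i = 0; i < nnz; ++i)
        sorted[i] = unsorted[P[i]];
}

/* Backward, sorted -> unsorted: unsorted[P[i]] = sorted[i] (scatter). */
static void csr2csru_values_ref(int nnz, const int *P,
                                const double *sorted, double *unsorted)
{
    for (int i = 0; i < nnz; ++i)
        unsorted[P[i]] = sorted[i];
}
```

Because P is a permutation, applying the scatter after the gather restores the original order exactly; the column index array can be handled the same way.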

The routine cusparse<t>csru2csr() has the following properties:

  • The routine requires no extra storage if pBuffer != NULL

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available

The routine cusparse<t>csr2csru() has the following properties if pBuffer != NULL:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine supports CUDA graph capture

The following tables describe the parameters of csru2csr_bufferSizeExt and csru2csr.

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
nnz | host | number of nonzero elements of matrix A.
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
csrVal | device | <type> array of nnz unsorted nonzero elements of matrix A.
csrRowPtr | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColInd | device | integer array of nnz unsorted column indices of A.
info | host | opaque structure initialized using cusparseCreateCsru2csrInfo().
pBuffer | device | buffer allocated by the user; the size is returned by csru2csr_bufferSizeExt().

Output

parameter | device or host | description
csrVal | device | <type> array of nnz sorted nonzero elements of matrix A.
csrColInd | device | integer array of nnz sorted column indices of A.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.

5.9.15.cusparseXpruneDense2csr() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseHpruneDense2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n,
    const __half* A, int lda, const __half* threshold,
    const cusparseMatDescr_t descrC, const __half* csrValC,
    const int* csrRowPtrC, const int* csrColIndC, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseSpruneDense2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n,
    const float* A, int lda, const float* threshold,
    const cusparseMatDescr_t descrC, const float* csrValC,
    const int* csrRowPtrC, const int* csrColIndC, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseDpruneDense2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n,
    const double* A, int lda, const double* threshold,
    const cusparseMatDescr_t descrC, const double* csrValC,
    const int* csrRowPtrC, const int* csrColIndC, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseHpruneDense2csrNnz(cusparseHandle_t handle, int m, int n,
    const __half* A, int lda, const __half* threshold,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, void* pBuffer)

cusparseStatus_t
cusparseSpruneDense2csrNnz(cusparseHandle_t handle, int m, int n,
    const float* A, int lda, const float* threshold,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, void* pBuffer)

cusparseStatus_t
cusparseDpruneDense2csrNnz(cusparseHandle_t handle, int m, int n,
    const double* A, int lda, const double* threshold,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, void* pBuffer)

cusparseStatus_t
cusparseHpruneDense2csr(cusparseHandle_t handle, int m, int n,
    const __half* A, int lda, const __half* threshold,
    const cusparseMatDescr_t descrC, __half* csrValC,
    const int* csrRowPtrC, int* csrColIndC, void* pBuffer)

cusparseStatus_t
cusparseSpruneDense2csr(cusparseHandle_t handle, int m, int n,
    const float* A, int lda, const float* threshold,
    const cusparseMatDescr_t descrC, float* csrValC,
    const int* csrRowPtrC, int* csrColIndC, void* pBuffer)

cusparseStatus_t
cusparseDpruneDense2csr(cusparseHandle_t handle, int m, int n,
    const double* A, int lda, const double* threshold,
    const cusparseMatDescr_t descrC, double* csrValC,
    const int* csrRowPtrC, int* csrColIndC, void* pBuffer)

This function prunes a dense matrix to a sparse matrix in CSR format.

Given a dense matrix A and a non-negative value threshold, the function returns a sparse matrix C, defined by

\[\begin{split}\begin{matrix} {{C(i,j)} = {A(i,j)}} & \text{if\ |A(i,j)|\ >\ threshold} \\\end{matrix}\end{split}\]

The implementation adopts a two-step approach to do the conversion. First, the user allocates csrRowPtrC of m+1 elements and uses the function pruneDense2csrNnz() to determine the number of nonzero columns per row. Second, the user gathers nnzC (the number of nonzeros of matrix C) from either (nnzC = *nnzTotalDevHostPtr) or (nnzC = csrRowPtrC[m] - csrRowPtrC[0]) and allocates csrValC of nnzC elements and csrColIndC of nnzC integers. Finally, the function pruneDense2csr() is called to complete the conversion.

The user must obtain the size of the buffer required by pruneDense2csr() by calling pruneDense2csr_bufferSizeExt(), allocate the buffer, and pass the buffer pointer to pruneDense2csr().
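The two-step pattern described above (count first, then fill) can be sketched on the host as follows. This is a reference illustration only: the function names are hypothetical, the dense matrix is assumed column-major with leading dimension lda, and base-0 indexing is assumed for C.

```c
#include <stddef.h>

/* Step 1 (hypothetical helper): fill csrRowPtrC (m+1 entries, base-0)
   and return nnzC, counting entries with |A(i,j)| > threshold.
   A is assumed column-major with leading dimension lda. */
static int prune_dense_nnz_ref(int m, int n, const float *A, int lda,
                               float threshold, int *csrRowPtrC)
{
    csrRowPtrC[0] = 0;
    for (int i = 0; i < m; ++i) {
        int count = 0;
        for (int j = 0; j < n; ++j) {
            float a = A[(size_t)j * lda + i];
            float mag = a < 0.0f ? -a : a;
            if (mag > threshold)
                ++count;
        }
        csrRowPtrC[i + 1] = csrRowPtrC[i] + count;
    }
    return csrRowPtrC[m];
}

/* Step 2 (hypothetical helper): fill csrValC and csrColIndC, which the
   caller allocated with nnzC entries after step 1. */
static void prune_dense_fill_ref(int m, int n, const float *A, int lda,
                                 float threshold, const int *csrRowPtrC,
                                 float *csrValC, int *csrColIndC)
{
    for (int i = 0; i < m; ++i) {
        int pos = csrRowPtrC[i];
        for (int j = 0; j < n; ++j) {
            float a = A[(size_t)j * lda + i];
            float mag = a < 0.0f ? -a : a;
            if (mag > threshold) {
                csrValC[pos] = a;
                csrColIndC[pos] = j;
                ++pos;
            }
        }
    }
}
```

Splitting the work this way lets the caller size csrValC and csrColIndC exactly, which is the reason the API exposes a separate Nnz routine.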

The routine cusparse<t>pruneDense2csrNnz() has the following properties:

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

The routine cusparse<t>pruneDense2csr() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
A | device | array of dimension (lda, n).
lda | host | leading dimension of A. It must be at least max(1, m).
threshold | host or device | a value to drop the entries of A. threshold can point to device memory or host memory.
descrC | host | the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
pBuffer | device | buffer allocated by the user; the size is returned by pruneDense2csr_bufferSizeExt().

Output

parameter | device or host | description
nnzTotalDevHostPtr | device or host | total number of nonzeros of matrix C. nnzTotalDevHostPtr can point to device memory or host memory.
csrValC | device | <type> array of nnzC nonzero elements of matrix C.
csrRowPtrC | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColIndC | device | integer array of nnzC column indices of C.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.

5.9.16.cusparseXpruneCsr2csr() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseHpruneCsr2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const __half* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const __half* threshold,
    const cusparseMatDescr_t descrC, const __half* csrValC,
    const int* csrRowPtrC, const int* csrColIndC, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseSpruneCsr2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const float* threshold,
    const cusparseMatDescr_t descrC, const float* csrValC,
    const int* csrRowPtrC, const int* csrColIndC, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseDpruneCsr2csr_bufferSizeExt(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const double* threshold,
    const cusparseMatDescr_t descrC, const double* csrValC,
    const int* csrRowPtrC, const int* csrColIndC, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseHpruneCsr2csrNnz(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const __half* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const __half* threshold,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, void* pBuffer)

cusparseStatus_t
cusparseSpruneCsr2csrNnz(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const float* threshold,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, void* pBuffer)

cusparseStatus_t
cusparseDpruneCsr2csrNnz(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const double* threshold,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, void* pBuffer)

cusparseStatus_t
cusparseHpruneCsr2csr(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const __half* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const __half* threshold,
    const cusparseMatDescr_t descrC, __half* csrValC,
    const int* csrRowPtrC, int* csrColIndC, void* pBuffer)

cusparseStatus_t
cusparseSpruneCsr2csr(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const float* threshold,
    const cusparseMatDescr_t descrC, float* csrValC,
    const int* csrRowPtrC, int* csrColIndC, void* pBuffer)

cusparseStatus_t
cusparseDpruneCsr2csr(cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, const double* threshold,
    const cusparseMatDescr_t descrC, double* csrValC,
    const int* csrRowPtrC, int* csrColIndC, void* pBuffer)

This function prunes a sparse matrix to a sparse matrix in CSR format.

Given a sparse matrix A and a non-negative value threshold, the function returns a sparse matrix C, defined by

\[\begin{split}\begin{matrix} {{C(i,j)} = {A(i,j)}} & \text{if |A(i,j)| > threshold} \\\end{matrix}\end{split}\]

The implementation adopts a two-step approach to do the conversion. First, the user allocates csrRowPtrC of m+1 elements and uses the function pruneCsr2csrNnz() to determine the number of nonzero columns per row. Second, the user gathers nnzC (the number of nonzeros of matrix C) from either (nnzC = *nnzTotalDevHostPtr) or (nnzC = csrRowPtrC[m] - csrRowPtrC[0]) and allocates csrValC of nnzC elements and csrColIndC of nnzC integers. Finally, the function pruneCsr2csr() is called to complete the conversion.

The user must obtain the size of the buffer required by pruneCsr2csr() by calling pruneCsr2csr_bufferSizeExt(), allocate the buffer, and pass the buffer pointer to pruneCsr2csr().
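For a CSR input, the same count-then-fill idea only has to walk the stored entries of A. The sketch below folds both passes into one hypothetical host function, prune_csr_ref: calling it with NULL output arrays plays the role of the Nnz pass, and calling it again with allocated arrays fills C. Base-0 indexing for both A and C is an assumption of this sketch.

```c
#include <stddef.h>

/* Host-side sketch (hypothetical helper, not a cuSPARSE call).
   Keeps entries of CSR matrix A with |value| > threshold.
   Always fills csrRowPtrC (m+1 entries, base-0) and returns nnzC.
   If csrValC and csrColIndC are non-NULL they are filled as well. */
static int prune_csr_ref(int m, const float *csrValA, const int *csrRowPtrA,
                         const int *csrColIndA, float threshold,
                         int *csrRowPtrC, float *csrValC, int *csrColIndC)
{
    int nnzC = 0;
    csrRowPtrC[0] = 0;
    for (int i = 0; i < m; ++i) {
        for (int k = csrRowPtrA[i]; k < csrRowPtrA[i + 1]; ++k) {
            float a = csrValA[k];
            float mag = a < 0.0f ? -a : a;
            if (mag > threshold) {
                if (csrValC != NULL && csrColIndC != NULL) {
                    csrValC[nnzC] = a;
                    csrColIndC[nnzC] = csrColIndA[k];
                }
                ++nnzC;
            }
        }
        csrRowPtrC[i + 1] = nnzC;
    }
    return nnzC;
}
```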

The routine cusparse<t>pruneCsr2csrNnz() has the following properties:

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

The routine cusparse<t>pruneCsr2csr() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
nnzA | host | number of nonzeros of matrix A.
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
csrValA | device | <type> array of nnzA nonzero elements of matrix A.
csrRowPtrA | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColIndA | device | integer array of nnzA column indices of A.
threshold | host or device | a value to drop the entries of A. threshold can point to device memory or host memory.
descrC | host | the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
pBuffer | device | buffer allocated by the user; the size is returned by pruneCsr2csr_bufferSizeExt().

Output

parameter | device or host | description
nnzTotalDevHostPtr | device or host | total number of nonzeros of matrix C. nnzTotalDevHostPtr can point to device memory or host memory.
csrValC | device | <type> array of nnzC nonzero elements of matrix C.
csrRowPtrC | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColIndC | device | integer array of nnzC column indices of C.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.

5.9.17.cusparseXpruneDense2csrPercentage() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t
cusparseHpruneDense2csrByPercentage_bufferSizeExt(cusparseHandle_t handle, int m, int n,
    const __half* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, const __half* csrValC,
    const int* csrRowPtrC, const int* csrColIndC,
    pruneInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseSpruneDense2csrByPercentage_bufferSizeExt(cusparseHandle_t handle, int m, int n,
    const float* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, const float* csrValC,
    const int* csrRowPtrC, const int* csrColIndC,
    pruneInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseDpruneDense2csrByPercentage_bufferSizeExt(cusparseHandle_t handle, int m, int n,
    const double* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, const double* csrValC,
    const int* csrRowPtrC, const int* csrColIndC,
    pruneInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t
cusparseHpruneDense2csrNnzByPercentage(cusparseHandle_t handle, int m, int n,
    const __half* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, pruneInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseSpruneDense2csrNnzByPercentage(cusparseHandle_t handle, int m, int n,
    const float* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, pruneInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseDpruneDense2csrNnzByPercentage(cusparseHandle_t handle, int m, int n,
    const double* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, pruneInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseHpruneDense2csrByPercentage(cusparseHandle_t handle, int m, int n,
    const __half* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, __half* csrValC,
    const int* csrRowPtrC, int* csrColIndC,
    pruneInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseSpruneDense2csrByPercentage(cusparseHandle_t handle, int m, int n,
    const float* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, float* csrValC,
    const int* csrRowPtrC, int* csrColIndC,
    pruneInfo_t info, void* pBuffer)

cusparseStatus_t
cusparseDpruneDense2csrByPercentage(cusparseHandle_t handle, int m, int n,
    const double* A, int lda, float percentage,
    const cusparseMatDescr_t descrC, double* csrValC,
    const int* csrRowPtrC, int* csrColIndC,
    pruneInfo_t info, void* pBuffer)

This function prunes a dense matrix to a sparse matrix by percentage.

Given a dense matrix A and a non-negative value percentage, the function computes the sparse matrix C by the following three steps:

Step 1: sort the absolute values of A in ascending order.

\[\begin{split}\begin{matrix} {key\ :=\ sort(\ |A|\ )} \\\end{matrix}\end{split}\]

Step 2: choose the threshold from the parameter percentage.

\[\begin{split}\begin{matrix} {pos\ =\ ceil(m*n*(percentage/100))\ -\ 1} \\ {pos\ =\ min(pos,\ m*n-1)} \\ {pos\ =\ max(pos,\ 0)} \\ {threshold\ =\ key\lbrack pos\rbrack} \\\end{matrix}\end{split}\]

Step 3: call pruneDense2csr() with the parameter threshold.
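Steps 1 and 2 can be reproduced on the host to see which threshold a given percentage selects. The sketch below is only an illustration of the formulas above: threshold_by_percentage_ref is a hypothetical name, A is assumed column-major with leading dimension lda, qsort stands in for the library's sort, and a return value of -1 is this sketch's own convention for invalid sizes or allocation failure.

```c
#include <stdlib.h>

/* Ascending comparison for qsort on float keys. */
static int cmp_float_asc(const void *a, const void *b)
{
    float x = *(const float *)a, y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Host-side sketch (hypothetical helper, not a cuSPARSE call).
   Returns key[pos] with key = sort(|A|) and
   pos = clamp(ceil(m*n*(percentage/100)) - 1, 0, m*n-1).
   Returns -1.0f on invalid size or allocation failure. */
static float threshold_by_percentage_ref(int m, int n, const float *A,
                                         int lda, float percentage)
{
    int total = m * n;
    if (total <= 0)
        return -1.0f;

    float *key = (float *)malloc(sizeof(float) * (size_t)total);
    if (key == NULL)
        return -1.0f;

    /* Step 1: collect absolute values (zeros included) and sort them. */
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float a = A[(size_t)j * lda + i];
            key[(size_t)j * m + i] = a < 0.0f ? -a : a;
        }
    }
    qsort(key, (size_t)total, sizeof(float), cmp_float_asc);

    /* Step 2: pos = ceil(total * percentage / 100) - 1, then clamp. */
    double x = (double)total * ((double)percentage / 100.0);
    int pos = (int)x;
    if ((double)pos < x)
        ++pos;                 /* ceil for non-negative x */
    pos -= 1;
    if (pos > total - 1)
        pos = total - 1;
    if (pos < 0)
        pos = 0;

    float threshold = key[pos];
    free(key);
    return threshold;
}
```

For a 2x2 matrix with absolute values {1, 2, 3, 4}, a percentage of 50 gives pos = 1 and threshold = 2, so only the entries of magnitude 3 and 4 survive the strict comparison in step 3.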

The implementation adopts a two-step approach to do the conversion. First, the user allocates csrRowPtrC of m+1 elements and uses the function pruneDense2csrNnzByPercentage() to determine the number of nonzero columns per row. Second, the user gathers nnzC (the number of nonzeros of matrix C) from either (nnzC = *nnzTotalDevHostPtr) or (nnzC = csrRowPtrC[m] - csrRowPtrC[0]) and allocates csrValC of nnzC elements and csrColIndC of nnzC integers. Finally, the function pruneDense2csrByPercentage() is called to complete the conversion.

The user must obtain the size of the buffer required by pruneDense2csrByPercentage() by calling pruneDense2csrByPercentage_bufferSizeExt(), allocate the buffer, and pass the buffer pointer to pruneDense2csrByPercentage().

Remark 1: the value of percentage must not be greater than 100. Otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

Remark 2: the zeros of A are not ignored. All entries are sorted, including zeros. This is different from pruneCsr2csrByPercentage().

The routine cusparse<t>pruneDense2csrNnzByPercentage() has the following properties:

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

The routine cusparse<t>pruneDense2csrByPercentage() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
A | device | array of dimension (lda, n).
lda | host | leading dimension of A. It must be at least max(1, m).
percentage | host | percentage >= 0 and percentage <= 100.
descrC | host | the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
pBuffer | device | buffer allocated by the user; the size is returned by pruneDense2csrByPercentage_bufferSizeExt().

Output

parameter | device or host | description
nnzTotalDevHostPtr | device or host | total number of nonzeros of matrix C. nnzTotalDevHostPtr can point to device memory or host memory.
csrValC | device | <type> array of nnzC nonzero elements of matrix C.
csrRowPtrC | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColIndC | device | integer array of nnzC column indices of C.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.

5.9.18.cusparseXpruneCsr2csrByPercentage() [DEPRECATED]

The routine will be removed in the next major release.

cusparseStatus_t cusparseHpruneCsr2csrByPercentage_bufferSizeExt(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const __half* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, const __half* csrValC,
    const int* csrRowPtrC, const int* csrColIndC,
    pruneInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseSpruneCsr2csrByPercentage_bufferSizeExt(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, const float* csrValC,
    const int* csrRowPtrC, const int* csrColIndC,
    pruneInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseDpruneCsr2csrByPercentage_bufferSizeExt(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, const double* csrValC,
    const int* csrRowPtrC, const int* csrColIndC,
    pruneInfo_t info, size_t* pBufferSizeInBytes)

cusparseStatus_t cusparseHpruneCsr2csrNnzByPercentage(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const __half* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, pruneInfo_t info, void* pBuffer)

cusparseStatus_t cusparseSpruneCsr2csrNnzByPercentage(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, pruneInfo_t info, void* pBuffer)

cusparseStatus_t cusparseDpruneCsr2csrNnzByPercentage(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, int* csrRowPtrC,
    int* nnzTotalDevHostPtr, pruneInfo_t info, void* pBuffer)

cusparseStatus_t cusparseHpruneCsr2csrByPercentage(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const __half* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, __half* csrValC,
    const int* csrRowPtrC, int* csrColIndC,
    pruneInfo_t info, void* pBuffer)

cusparseStatus_t cusparseSpruneCsr2csrByPercentage(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const float* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, float* csrValC,
    const int* csrRowPtrC, int* csrColIndC,
    pruneInfo_t info, void* pBuffer)

cusparseStatus_t cusparseDpruneCsr2csrByPercentage(
    cusparseHandle_t handle, int m, int n, int nnzA,
    const cusparseMatDescr_t descrA, const double* csrValA,
    const int* csrRowPtrA, const int* csrColIndA, float percentage,
    const cusparseMatDescr_t descrC, double* csrValC,
    const int* csrRowPtrC, int* csrColIndC,
    pruneInfo_t info, void* pBuffer)

This function prunes a sparse matrix to a sparse matrix by percentage.

Given a sparse matrix A and a non-negative value percentage, the function computes sparse matrix C by the following three steps:

Step 1: Sort the absolute values of A in ascending order:

\[ key := \mathrm{sort}(|csrValA|) \]

Step 2: Choose the threshold by the parameter percentage:

\[\begin{split}\begin{matrix} {pos = \mathrm{ceil}(nnzA \cdot (percentage/100)) - 1} \\ {pos = \min(pos,\ nnzA-1)} \\ {pos = \max(pos,\ 0)} \\ {threshold = key\lbrack pos\rbrack} \\ \end{matrix}\end{split}\]

Step 3: Call pruneCsr2csr() with the parameter threshold.
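The threshold selection in Steps 1 and 2 can be checked on the host with a small reference routine. The sketch below is an illustration only: prune_threshold_by_percentage() is a hypothetical helper, not a cuSPARSE function, it handles only the float case, and it assumes nnzA > 0. Step 3 would then pass the returned value to pruneCsr2csr().

```c
#include <assert.h>
#include <stdlib.h>

/* qsort comparison callback: ascending order of floats. */
static int cmp_float_asc(const void *a, const void *b) {
    float x = *(const float *)a;
    float y = *(const float *)b;
    return (x > y) - (x < y);
}

/* Hypothetical host-side helper (not a cuSPARSE function): returns the
 * threshold selected by Steps 1 and 2 for the nnzA values in csrValA. */
float prune_threshold_by_percentage(const float *csrValA, int nnzA,
                                    float percentage) {
    assert(nnzA > 0 && percentage >= 0.0f && percentage <= 100.0f);

    /* Step 1: key := sort(|csrValA|) in ascending order. */
    float *key = malloc((size_t)nnzA * sizeof *key);
    assert(key != NULL);
    for (int i = 0; i < nnzA; ++i)
        key[i] = csrValA[i] < 0.0f ? -csrValA[i] : csrValA[i];
    qsort(key, (size_t)nnzA, sizeof *key, cmp_float_asc);

    /* Step 2: pos = ceil(nnzA * (percentage / 100)) - 1, clamped to [0, nnzA-1]. */
    double t = (double)nnzA * ((double)percentage / 100.0);
    int pos = (int)t;            /* truncation equals floor because t >= 0 */
    if ((double)pos < t) ++pos;  /* round up to obtain ceil(t) */
    pos -= 1;
    if (pos > nnzA - 1) pos = nnzA - 1;
    if (pos < 0) pos = 0;

    float threshold = key[pos];
    free(key);
    return threshold;
}
```

For the five values {3, -1, 4, -2, 5} and percentage = 50, the sorted keys are {1, 2, 3, 4, 5}, pos = ceil(2.5) - 1 = 2, and the threshold is 3.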

The implementation adopts a two-step approach to do the conversion. First, the user allocates csrRowPtrC of m+1 elements and uses the function pruneCsr2csrNnzByPercentage() to determine the number of nonzero columns per row. Second, the user gathers nnzC (the number of nonzeros of matrix C) from either (nnzC=*nnzTotalDevHostPtr) or (nnzC=csrRowPtrC[m]-csrRowPtrC[0]) and allocates csrValC of nnzC elements and csrColIndC of nnzC integers. Finally, the function pruneCsr2csrByPercentage() is called to complete the conversion.

The user must obtain the size of the buffer required by pruneCsr2csrByPercentage() by calling pruneCsr2csrByPercentage_bufferSizeExt(), allocate the buffer, and pass the buffer pointer to pruneCsr2csrByPercentage().

Remark 1: the value of percentage must not be greater than 100. Otherwise, CUSPARSE_STATUS_INVALID_VALUE is returned.

The routine cusparse<t>pruneCsr2csrNnzByPercentage() has the following properties:

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

The routine cusparse<t>pruneCsr2csrByPercentage() has the following properties:

  • The routine requires no extra storage.

  • The routine supports asynchronous execution.

  • The routine supports CUDA graph capture.

Input

parameter | device or host | description
handle | host | handle to the cuSPARSE library context.
m | host | number of rows of matrix A.
n | host | number of columns of matrix A.
nnzA | host | number of nonzeros of matrix A.
descrA | host | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
csrValA | device | <type> array of nnzA nonzero elements of matrix A.
csrRowPtrA | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColIndA | device | integer array of nnzA column indices of A.
percentage | host | percentage <= 100 and percentage >= 0.
descrC | host | the descriptor of matrix C. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
pBuffer | device | buffer allocated by the user; the size is returned by pruneCsr2csrByPercentage_bufferSizeExt().

Output

parameter | device or host | description
nnzTotalDevHostPtr | device or host | total number of nonzeros of matrix C. nnzTotalDevHostPtr can point to device memory or host memory.
csrValC | device | <type> array of nnzC nonzero elements of matrix C.
csrRowPtrC | device | integer array of m+1 elements that contains the start of every row and the end of the last row plus one.
csrColIndC | device | integer array of nnzC column indices of C.
pBufferSizeInBytes | host | number of bytes of the buffer.

See cusparseStatus_t for the description of the return status.

5.9.19.cusparse<t>nnz_compress() [DEPRECATED]

The routine will be removed in the next major release.

cusparseStatus_t cusparseSnnz_compress(
    cusparseHandle_t handle, int m, const cusparseMatDescr_t descr,
    const float* csrValA, const int* csrRowPtrA,
    int* nnzPerRow, int* nnzC, float tol)

cusparseStatus_t cusparseDnnz_compress(
    cusparseHandle_t handle, int m, const cusparseMatDescr_t descr,
    const double* csrValA, const int* csrRowPtrA,
    int* nnzPerRow, int* nnzC, double tol)

cusparseStatus_t cusparseCnnz_compress(
    cusparseHandle_t handle, int m, const cusparseMatDescr_t descr,
    const cuComplex* csrValA, const int* csrRowPtrA,
    int* nnzPerRow, int* nnzC, cuComplex tol)

cusparseStatus_t cusparseZnnz_compress(
    cusparseHandle_t handle, int m, const cusparseMatDescr_t descr,
    const cuDoubleComplex* csrValA, const int* csrRowPtrA,
    int* nnzPerRow, int* nnzC, cuDoubleComplex tol)

This function is the first step in converting from CSR format to compressed CSR format.

Given a sparse matrix A and a non-negative value threshold (the parameter tol), the function returns nnzPerRow (the number of nonzero columns per row) and nnzC (the total number of nonzeros) of a sparse matrix C, defined by

\[ C(i,j) = A(i,j) \quad \text{if } |A(i,j)| > threshold \]

A key assumption for the cuComplex and cuDoubleComplex cases is that this tolerance is given as the real part. For example, for tol = 1e-8 + 0*i, the function extracts cureal, that is, the x component of this struct.
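The counting rule above can be reproduced on the host to validate results. The following is a minimal sketch for the float case only; nnz_compress_ref() and its idxBase argument (0 or 1, standing in for the index base stored in the matrix descriptor) are hypothetical and not part of cuSPARSE.

```c
#include <assert.h>

/* Hypothetical host-side reference (not a cuSPARSE function).
 * For each of the m rows, counts the stored values with |a| > tol,
 * writes the count to nnzPerRow[row], and returns the total (nnzC). */
int nnz_compress_ref(int m, int idxBase, const float *csrValA,
                     const int *csrRowPtrA, int *nnzPerRow, float tol) {
    assert(m >= 0 && tol >= 0.0f);
    assert(idxBase == 0 || idxBase == 1);
    int nnzC = 0;
    for (int row = 0; row < m; ++row) {
        int count = 0;
        int begin = csrRowPtrA[row] - idxBase;
        int end = csrRowPtrA[row + 1] - idxBase;
        for (int k = begin; k < end; ++k) {
            float a = csrValA[k] < 0.0f ? -csrValA[k] : csrValA[k];
            if (a > tol) ++count;  /* strict comparison, as in the definition of C */
        }
        nnzPerRow[row] = count;
        nnzC += count;
    }
    return nnzC;
}
```

With values {1.0, 0.0, 1e-9, -2.0}, row pointers {0, 2, 4}, and tol = 1e-8, each of the two rows keeps one element, so nnzC is 2.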

  • This function requires temporary extra storage that is allocated internally.

  • The routine supports asynchronous execution if the Stream Ordered Memory Allocator is available.

  • The routine supports CUDA graph capture if the Stream Ordered Memory Allocator is available.

Input

handle | handle to the cuSPARSE library context.
m | number of rows of matrix A.
descr | the descriptor of matrix A. The supported matrix type is CUSPARSE_MATRIX_TYPE_GENERAL. Also, the supported index bases are CUSPARSE_INDEX_BASE_ZERO and CUSPARSE_INDEX_BASE_ONE.
csrValA | CSR noncompressed values array.
csrRowPtrA | the corresponding input noncompressed row pointer.
tol | non-negative tolerance; elements whose absolute value is less than or equal to tol are not counted.

Output

nnzPerRow | this array contains the number of elements whose absolute values are greater than tol per row.
nnzC | host/device pointer of the total number of elements whose absolute values are greater than tol.

See cusparseStatus_t for the description of the return status.


6.cuSPARSE Generic APIs

The cuSPARSE Generic APIs allow computing the most common sparse linear algebra operations, such as sparse matrix-vector (SpMV) and sparse matrix-matrix multiplication (SpMM), in a flexible way. The new APIs have the following capabilities and features:

  • Set matrix data layouts, number of batches, and storage formats (for example, CSR, COO, and so on).

  • Set input/output/compute data types. This also allows mixed data-type computation.

  • Set types of sparse vector/matrix indices (for example, 32-bit, 64-bit).

  • Choose the algorithm for the computation.

  • Rely on external (user-provided) device memory for internal operations.

  • Provide extensive consistency checks across input matrices and vectors. This includes the validation of sizes, data types, layout, allowed operations, etc.

  • Provide constant descriptors for vector and matrix inputs to support const-safe interface and guarantee that the APIs do not modify their inputs.

6.1.Generic Types Reference

The cuSPARSE generic type references are described in this section.

6.1.1.cusparseFormat_t

This type indicates the format of the sparse matrix. See cuSPARSE Storage Formats for their description.

Value | Meaning
CUSPARSE_FORMAT_COO | The matrix is stored in Coordinate (COO) format organized in Structure of Arrays (SoA) layout
CUSPARSE_FORMAT_CSR | The matrix is stored in Compressed Sparse Row (CSR) format
CUSPARSE_FORMAT_CSC | The matrix is stored in Compressed Sparse Column (CSC) format
CUSPARSE_FORMAT_BLOCKED_ELL | The matrix is stored in Blocked-Ellpack (Blocked-ELL) format
CUSPARSE_FORMAT_SLICED_ELL | The matrix is stored in Sliced-Ellpack (Sliced-ELL) format
CUSPARSE_FORMAT_BSR | The matrix is stored in Block Sparse Row (BSR) format


6.1.2.cusparseOrder_t

This type indicates the memory layout of a dense matrix.

Value | Meaning
CUSPARSE_ORDER_ROW | The matrix is stored in row-major order
CUSPARSE_ORDER_COL | The matrix is stored in column-major order
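As an illustration of the two layouts, the sketch below computes the linear offset of element (row, col) for a given leading dimension ld. The enum order_t and the function dense_offset() are hypothetical stand-ins used for this example; they are not cuSPARSE definitions, and the usual convention that ld is the distance between consecutive rows (row-major) or columns (column-major) is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for cusparseOrder_t, used only in this sketch. */
typedef enum { ORDER_ROW, ORDER_COL } order_t;

/* Linear offset (in elements) of entry (row, col), zero-based.
 * Row-major:    consecutive elements of a row are adjacent, rows are ld apart.
 * Column-major: consecutive elements of a column are adjacent, columns are ld apart. */
int64_t dense_offset(int64_t row, int64_t col, int64_t ld, order_t order) {
    assert(row >= 0 && col >= 0 && ld > 0);
    return order == ORDER_ROW ? row * ld + col : col * ld + row;
}
```

For example, with ld = 4, element (1, 2) sits at offset 6 in row-major order and at offset 9 in column-major order.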


6.1.3.cusparseIndexType_t

This type indicates the index type for representing the sparse matrix indices.

Value | Meaning
CUSPARSE_INDEX_32I | 32-bit signed integer [0, 2^31 - 1]
CUSPARSE_INDEX_64I | 64-bit signed integer [0, 2^63 - 1]
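The ranges above determine which index type a given matrix needs. A minimal sketch, assuming the caller knows the largest index value that must be stored; min_index_bits() is a hypothetical helper, not a cuSPARSE function:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 32 if max_index fits in the range
 * [0, 2^31 - 1] listed for CUSPARSE_INDEX_32I, otherwise 64. */
int min_index_bits(int64_t max_index) {
    assert(max_index >= 0);
    return max_index <= INT32_MAX ? 32 : 64;
}
```

For an offsets array, the largest stored value is typically the number of nonzeros plus the index base, so a matrix with small dimensions may still need 64-bit offsets.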



6.2.Dense Vector APIs

The cuSPARSE helper functions for dense vector descriptor are described in this section.

See the Dense Vector Format section for the detailed description of the storage format.


6.2.1.cusparseCreateDnVec()

cusparseStatus_t cusparseCreateDnVec(cusparseDnVecDescr_t* dnVecDescr, int64_t size, void* values, cudaDataType valueType)

cusparseStatus_t cusparseCreateConstDnVec(cusparseConstDnVecDescr_t* dnVecDescr, int64_t size, const void* values, cudaDataType valueType)

This function initializes the dense vector descriptor dnVecDescr.

Param. | Memory | In/out | Meaning
dnVecDescr | HOST | OUT | Dense vector descriptor
size | HOST | IN | Size of the dense vector
values | DEVICE | IN | Values of the dense vector. Array with size elements
valueType | HOST | IN | Enumerator specifying the datatype of values

cusparseCreateDnVec() has the following constraints:

  • values must be aligned to the size of the datatype specified by valueType. Refer to cudaDataType_t for the description of the datatypes.

Refer to cusparseStatus_t for the description of the return status.
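The alignment constraint can be checked before a pointer is handed to a descriptor, which is mostly relevant when values points into the middle of a larger allocation. The sketch below is a generic pointer check under that assumption; is_aligned() is a hypothetical helper, and elem_size is the size in bytes of the element type selected by valueType (for example, 4 for CUDA_R_32F).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is a multiple of elem_size bytes,
 * 0 otherwise (including when elem_size is 0). */
int is_aligned(const void *ptr, size_t elem_size) {
    return elem_size != 0 && ((uintptr_t)ptr % elem_size) == 0;
}
```

A byte offset that is not a multiple of the element size, such as (char*)base + 1 for a float array, fails this check.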


6.2.2.cusparseDestroyDnVec()

cusparseStatus_t cusparseDestroyDnVec(cusparseConstDnVecDescr_t dnVecDescr) // non-const descriptor supported

This function releases the host memory allocated for the dense vector descriptor dnVecDescr.

Param. | Memory | In/out | Meaning
dnVecDescr | HOST | IN | Dense vector descriptor

Refer to cusparseStatus_t for the description of the return status.


6.2.3.cusparseDnVecGet()

cusparseStatus_t cusparseDnVecGet(cusparseDnVecDescr_t dnVecDescr, int64_t* size, void** values, cudaDataType* valueType)

cusparseStatus_t cusparseConstDnVecGet(cusparseConstDnVecDescr_t dnVecDescr, int64_t* size, const void** values, cudaDataType* valueType)

This function returns the fields of the dense vector descriptor dnVecDescr.

Param. | Memory | In/out | Meaning
dnVecDescr | HOST | IN | Dense vector descriptor
size | HOST | OUT | Size of the dense vector
values | DEVICE | OUT | Values of the dense vector. Array with size elements
valueType | HOST | OUT | Enumerator specifying the datatype of values

Refer to cusparseStatus_t for the description of the return status.


6.2.4.cusparseDnVecGetValues()

cusparseStatus_t cusparseDnVecGetValues(cusparseDnVecDescr_t dnVecDescr, void** values)

cusparseStatus_t cusparseConstDnVecGetValues(cusparseConstDnVecDescr_t dnVecDescr, const void** values)

This function returns the values field of the dense vector descriptor dnVecDescr.

Param. | Memory | In/out | Meaning
dnVecDescr | HOST | IN | Dense vector descriptor
values | DEVICE | OUT | Values of the dense vector

Refer to cusparseStatus_t for the description of the return status.


6.2.5.cusparseDnVecSetValues()

cusparseStatus_t cusparseDnVecSetValues(cusparseDnVecDescr_t dnVecDescr, void* values)

This function sets the values field of the dense vector descriptor dnVecDescr.

Param. | Memory | In/out | Meaning
dnVecDescr | HOST | IN | Dense vector descriptor
values | DEVICE | IN | Values of the dense vector. Array with size elements

cusparseDnVecSetValues() has the following constraints:

  • values must be aligned to the size of the datatype specified in dnVecDescr. Refer to cudaDataType_t for the description of the datatypes.

Refer to cusparseStatus_t for the description of the return status.



6.3.Sparse Vector APIs

The cuSPARSE helper functions for sparse vector descriptor are described in this section.

See the Sparse Vector Format section for the detailed description of the storage format.

6.3.1.cusparseCreateSpVec()

cusparseStatus_t cusparseCreateSpVec(
    cusparseSpVecDescr_t* spVecDescr, int64_t size, int64_t nnz,
    void* indices, void* values, cusparseIndexType_t idxType,
    cusparseIndexBase_t idxBase, cudaDataType valueType)

cusparseStatus_t cusparseCreateConstSpVec(
    cusparseConstSpVecDescr_t* spVecDescr, int64_t size, int64_t nnz,
    const void* indices, const void* values, cusparseIndexType_t idxType,
    cusparseIndexBase_t idxBase, cudaDataType valueType)

This function initializes the sparse vector descriptor spVecDescr.

Param. | Memory | In/out | Meaning
spVecDescr | HOST | OUT | Sparse vector descriptor
size | HOST | IN | Size of the sparse vector
nnz | HOST | IN | Number of non-zero entries of the sparse vector
indices | DEVICE | IN | Indices of the sparse vector. Array with nnz elements
values | DEVICE | IN | Values of the sparse vector. Array with nnz elements
idxType | HOST | IN | Enumerator specifying the data type of indices
idxBase | HOST | IN | Enumerator specifying the index base of indices
valueType | HOST | IN | Enumerator specifying the datatype of values

cusparseCreateSpVec() has the following constraints:

  • indices and values must be aligned to the size of the datatypes specified by idxType and valueType, respectively. See cudaDataType_t for the description of the datatypes.

See cusparseStatus_t for the description of the return status.
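To make the roles of size, nnz, indices, values, and idxBase concrete, the host-side sketch below expands a sparse vector into a dense one. It is an illustration only: spvec_scatter() is a hypothetical helper, not a cuSPARSE routine, and it assumes 32-bit indices, float values, and idxBase given as the integer 0 or 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side helper (not a cuSPARSE function).
 * Writes the nnz (index, value) pairs into a zero-filled dense array
 * of `size` elements. idxBase is 0 or 1. */
void spvec_scatter(int64_t size, int64_t nnz, const int32_t *indices,
                   const float *values, int idxBase, float *dense) {
    assert(idxBase == 0 || idxBase == 1);
    for (int64_t i = 0; i < size; ++i)
        dense[i] = 0.0f;
    for (int64_t k = 0; k < nnz; ++k) {
        int64_t pos = (int64_t)indices[k] - idxBase;  /* convert to zero-based */
        assert(pos >= 0 && pos < size);
        dense[pos] = values[k];
    }
}
```

With size = 4, one-based indices {1, 3}, and values {5, 7}, the dense result is {5, 0, 7, 0}.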


6.3.2.cusparseDestroySpVec()

cusparseStatus_t cusparseDestroySpVec(cusparseConstSpVecDescr_t spVecDescr) // non-const descriptor supported

This function releases the host memory allocated for the sparse vector descriptor spVecDescr.

Param. | Memory | In/out | Meaning
spVecDescr | HOST | IN | Sparse vector descriptor

See cusparseStatus_t for the description of the return status.


6.3.3.cusparseSpVecGet()

cusparseStatus_t cusparseSpVecGet(
    cusparseSpVecDescr_t spVecDescr, int64_t* size, int64_t* nnz,
    void** indices, void** values, cusparseIndexType_t* idxType,
    cusparseIndexBase_t* idxBase, cudaDataType* valueType)

cusparseStatus_t cusparseConstSpVecGet(
    cusparseConstSpVecDescr_t spVecDescr, int64_t* size, int64_t* nnz,
    const void** indices, const void** values, cusparseIndexType_t* idxType,
    cusparseIndexBase_t* idxBase, cudaDataType* valueType)

This function returns the fields of the sparse vector descriptor spVecDescr.

Param. | Memory | In/out | Meaning
spVecDescr | HOST | IN | Sparse vector descriptor
size | HOST | OUT | Size of the sparse vector
nnz | HOST | OUT | Number of non-zero entries of the sparse vector
indices | DEVICE | OUT | Indices of the sparse vector. Array with nnz elements
values | DEVICE | OUT | Values of the sparse vector. Array with nnz elements
idxType | HOST | OUT | Enumerator specifying the data type of indices
idxBase | HOST | OUT | Enumerator specifying the index base of indices
valueType | HOST | OUT | Enumerator specifying the datatype of values

See cusparseStatus_t for the description of the return status.


6.3.4.cusparseSpVecGetIndexBase()

cusparseStatus_t cusparseSpVecGetIndexBase(
    cusparseConstSpVecDescr_t spVecDescr, // non-const descriptor supported
    cusparseIndexBase_t* idxBase)

This function returns the idxBase field of the sparse vector descriptor spVecDescr.

Param. | Memory | In/out | Meaning
spVecDescr | HOST | IN | Sparse vector descriptor
idxBase | HOST | OUT | Enumerator specifying the index base of indices

See cusparseStatus_t for the description of the return status.


6.3.5.cusparseSpVecGetValues()

cusparseStatus_t cusparseSpVecGetValues(cusparseSpVecDescr_t spVecDescr, void** values)

cusparseStatus_t cusparseConstSpVecGetValues(cusparseConstSpVecDescr_t spVecDescr, const void** values)

This function returns the values field of the sparse vector descriptor spVecDescr.

Param. | Memory | In/out | Meaning
spVecDescr | HOST | IN | Sparse vector descriptor
values | DEVICE | OUT | Values of the sparse vector. Array with nnz elements

See cusparseStatus_t for the description of the return status.


6.3.6.cusparseSpVecSetValues()

cusparseStatus_t cusparseSpVecSetValues(cusparseSpVecDescr_t spVecDescr, void* values)

This function sets the values field of the sparse vector descriptor spVecDescr.

Param. | Memory | In/out | Meaning
spVecDescr | HOST | IN | Sparse vector descriptor
values | DEVICE | IN | Values of the sparse vector. Array with nnz elements

cusparseSpVecSetValues() has the following constraints:

  • values must be aligned to the size of the datatype specified in spVecDescr. See cudaDataType_t for the description of the datatypes.

See cusparseStatus_t for the description of the return status.



6.4.Dense Matrix APIs

The cuSPARSE helper functions for dense matrix descriptor are described in this section.

See the Dense Matrix Format section for the detailed description of the storage format.

6.4.1.cusparseCreateDnMat()

cusparseStatus_t cusparseCreateDnMat(
    cusparseDnMatDescr_t* dnMatDescr, int64_t rows, int64_t cols, int64_t ld,
    void* values, cudaDataType valueType, cusparseOrder_t order)

cusparseStatus_t cusparseCreateConstDnMat(
    cusparseConstDnMatDescr_t* dnMatDescr, int64_t rows, int64_t cols, int64_t ld,
    const void* values, cudaDataType valueType, cusparseOrder_t order)

This function initializes the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | OUT | Dense matrix descriptor
rows | HOST | IN | Number of rows of the dense matrix
cols | HOST | IN | Number of columns of the dense matrix
ld | HOST | IN | Leading dimension of the dense matrix
values | DEVICE | IN | Values of the dense matrix. Array with ld*cols elements (column-major layout) or ld*rows elements (row-major layout)
valueType | HOST | IN | Enumerator specifying the datatype of values
order | HOST | IN | Enumerator specifying the memory layout of the dense matrix

cusparseCreateDnMat() has the following constraints:

  • values must be aligned to the size of the datatype specified by valueType. See cudaDataType_t for the description of the datatypes.

Refer to cusparseStatus_t for the description of the return status.


6.4.2.cusparseDestroyDnMat()

cusparseStatus_t cusparseDestroyDnMat(cusparseConstDnMatDescr_t dnMatDescr) // non-const descriptor supported

This function releases the host memory allocated for the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | IN | Dense matrix descriptor

Refer to cusparseStatus_t for the description of the return status.


6.4.3.cusparseDnMatGet()

cusparseStatus_t cusparseDnMatGet(
    cusparseDnMatDescr_t dnMatDescr, int64_t* rows, int64_t* cols, int64_t* ld,
    void** values, cudaDataType* type, cusparseOrder_t* order)

cusparseStatus_t cusparseConstDnMatGet(
    cusparseConstDnMatDescr_t dnMatDescr, int64_t* rows, int64_t* cols, int64_t* ld,
    const void** values, cudaDataType* type, cusparseOrder_t* order)

This function returns the fields of the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | IN | Dense matrix descriptor
rows | HOST | OUT | Number of rows of the dense matrix
cols | HOST | OUT | Number of columns of the dense matrix
ld | HOST | OUT | Leading dimension of the dense matrix
values | DEVICE | OUT | Values of the dense matrix. Array with ld*cols elements
type | HOST | OUT | Enumerator specifying the datatype of values
order | HOST | OUT | Enumerator specifying the memory layout of the dense matrix

Refer to cusparseStatus_t for the description of the return status.


6.4.4.cusparseDnMatGetValues()

cusparseStatus_t cusparseDnMatGetValues(cusparseDnMatDescr_t dnMatDescr, void** values)

cusparseStatus_t cusparseConstDnMatGetValues(cusparseConstDnMatDescr_t dnMatDescr, const void** values)

This function returns the values field of the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | IN | Dense matrix descriptor
values | DEVICE | OUT | Values of the dense matrix. Array with ld*cols elements

Refer to cusparseStatus_t for the description of the return status.


6.4.5.cusparseDnMatSetValues()

cusparseStatus_t cusparseDnMatSetValues(cusparseDnMatDescr_t dnMatDescr, void* values)

This function sets the values field of the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | IN | Dense matrix descriptor
values | DEVICE | IN | Values of the dense matrix. Array with ld*cols elements

cusparseDnMatSetValues() has the following constraints:

  • values must be aligned to the size of the datatype specified in dnMatDescr. See cudaDataType_t for the description of the datatypes.

Refer to cusparseStatus_t for the description of the return status.


6.4.6.cusparseDnMatGetStridedBatch()

cusparseStatus_t cusparseDnMatGetStridedBatch(
    cusparseConstDnMatDescr_t dnMatDescr, // non-const descriptor supported
    int* batchCount, int64_t* batchStride)

This function returns the number of batches and the batch stride of the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | IN | Dense matrix descriptor
batchCount | HOST | OUT | Number of batches of the dense matrix
batchStride | HOST | OUT | Address offset between a matrix and the next one in the batch

Refer to cusparseStatus_t for the description of the return status.


6.4.7.cusparseDnMatSetStridedBatch()

cusparseStatus_t cusparseDnMatSetStridedBatch(cusparseDnMatDescr_t dnMatDescr, int batchCount, int64_t batchStride)

This function sets the number of batches and the batch stride of the dense matrix descriptor dnMatDescr.

Param. | Memory | In/out | Meaning
dnMatDescr | HOST | IN | Dense matrix descriptor
batchCount | HOST | IN | Number of batches of the dense matrix
batchStride | HOST | IN | Address offset between a matrix and the next one in the batch. batchStride >= ld*cols if the matrix uses column-major layout, batchStride >= ld*rows otherwise

Refer to cusparseStatus_t for the description of the return status.
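The batch stride rule can be expressed as a small host-side computation. The sketch below reads the constraint as a lower bound (batchStride at least ld*cols for column-major and ld*rows for row-major), which is an interpretation of the table entry above; min_batch_stride() is a hypothetical helper, not a cuSPARSE function.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: smallest batch stride (in elements) that keeps
 * consecutive matrices of a strided batch from overlapping.
 * column_major != 0 selects column-major layout, 0 selects row-major. */
int64_t min_batch_stride(int64_t rows, int64_t cols, int64_t ld,
                         int column_major) {
    assert(rows >= 0 && cols >= 0 && ld > 0);
    return column_major ? ld * cols : ld * rows;
}
```

For a 3x2 matrix with ld = 4, the minimum stride is 8 elements in column-major layout and 12 elements in row-major layout.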

6.5.Sparse Matrix APIs

The cuSPARSE helper functions for sparse matrix descriptor are described in this section.

See the COO, CSR, CSC, SELL, BSR, and Blocked-Ell sections for the detailed description of the storage formats.

6.5.1.Coordinate (COO)

6.5.1.1.cusparseCreateCoo()

cusparseStatus_t cusparseCreateCoo(
    cusparseSpMatDescr_t* spMatDescr, int64_t rows, int64_t cols, int64_t nnz,
    void* cooRowInd, void* cooColInd, void* cooValues,
    cusparseIndexType_t cooIdxType, cusparseIndexBase_t idxBase,
    cudaDataType valueType)

cusparseStatus_t cusparseCreateConstCoo(
    cusparseConstSpMatDescr_t* spMatDescr, int64_t rows, int64_t cols, int64_t nnz,
    const void* cooRowInd, const void* cooColInd, const void* cooValues,
    cusparseIndexType_t cooIdxType, cusparseIndexBase_t idxBase,
    cudaDataType valueType)

This function initializes the sparse matrix descriptor spMatDescr in the COO format (Structure of Arrays layout).

Param. | Memory | In/out | Meaning
spMatDescr | HOST | OUT | Sparse matrix descriptor
rows | HOST | IN | Number of rows of the sparse matrix
cols | HOST | IN | Number of columns of the sparse matrix
nnz | HOST | IN | Number of non-zero entries of the sparse matrix
cooRowInd | DEVICE | IN | Row indices of the sparse matrix. Array with nnz elements
cooColInd | DEVICE | IN | Column indices of the sparse matrix. Array with nnz elements
cooValues | DEVICE | IN | Values of the sparse matrix. Array with nnz elements
cooIdxType | HOST | IN | Data type of cooRowInd and cooColInd
idxBase | HOST | IN | Index base of cooRowInd and cooColInd
valueType | HOST | IN | Datatype of cooValues

cusparseCreateCoo() has the following constraints:

  • cooRowInd, cooColInd, and cooValues must be aligned to the size of the datatypes specified by cooIdxType, cooIdxType, and valueType, respectively. See cudaDataType_t for the description of the datatypes.

See cusparseStatus_t for the description of the return status.
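The three COO arrays in Structure of Arrays layout can be illustrated by building them on the host from a small dense matrix. This is a sketch under stated assumptions: dense_to_coo() is a hypothetical helper (not a cuSPARSE routine), the dense input is row-major with no padding, indices are 32-bit, values are float, and idxBase is the integer 0 or 1. In real use the arrays would then be copied to device memory before being passed to cusparseCreateCoo().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side helper (not a cuSPARSE function).
 * Scans a rows x cols row-major dense matrix and fills the three COO
 * arrays, which must each have room for every nonzero. Returns nnz. */
int64_t dense_to_coo(int rows, int cols, const float *dense, int idxBase,
                     int32_t *cooRowInd, int32_t *cooColInd,
                     float *cooValues) {
    assert(idxBase == 0 || idxBase == 1);
    int64_t nnz = 0;
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            float v = dense[(int64_t)r * cols + c];
            if (v != 0.0f) {
                cooRowInd[nnz] = r + idxBase;  /* entry k: row index    */
                cooColInd[nnz] = c + idxBase;  /* entry k: column index */
                cooValues[nnz] = v;            /* entry k: value        */
                ++nnz;
            }
        }
    }
    return nnz;
}
```

For the 2x2 matrix [[1, 0], [0, 2]] with zero-based indexing, the result is cooRowInd = {0, 1}, cooColInd = {0, 1}, cooValues = {1, 2}, and nnz = 2.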


6.5.1.2.cusparseCooGet()

cusparseStatus_t cusparseCooGet(
    cusparseSpMatDescr_t spMatDescr, int64_t* rows, int64_t* cols, int64_t* nnz,
    void** cooRowInd, void** cooColInd, void** cooValues,
    cusparseIndexType_t* idxType, cusparseIndexBase_t* idxBase,
    cudaDataType* valueType)

cusparseStatus_t cusparseConstCooGet(
    cusparseConstSpMatDescr_t spMatDescr, int64_t* rows, int64_t* cols, int64_t* nnz,
    const void** cooRowInd, const void** cooColInd, const void** cooValues,
    cusparseIndexType_t* idxType, cusparseIndexBase_t* idxBase,
    cudaDataType* valueType)

This function returns the fields of the sparse matrix descriptor spMatDescr stored in COO format (Structure of Arrays layout).

Param. | Memory | In/out | Meaning
spMatDescr | HOST | IN | Sparse matrix descriptor
rows | HOST | OUT | Number of rows of the sparse matrix
cols | HOST | OUT | Number of columns of the sparse matrix
nnz | HOST | OUT | Number of non-zero entries of the sparse matrix
cooRowInd | DEVICE | OUT | Row indices of the sparse matrix. Array with nnz elements
cooColInd | DEVICE | OUT | Column indices of the sparse matrix. Array with nnz elements
cooValues | DEVICE | OUT | Values of the sparse matrix. Array with nnz elements
idxType | HOST | OUT | Data type of cooRowInd and cooColInd
idxBase | HOST | OUT | Index base of cooRowInd and cooColInd
valueType | HOST | OUT | Datatype of cooValues

See cusparseStatus_t for the description of the return status.


6.5.1.3.cusparseCooSetPointers()

cusparseStatus_t cusparseCooSetPointers(cusparseSpMatDescr_t spMatDescr, void* cooRows, void* cooColumns, void* cooValues)

This function sets the pointers of the sparse matrix descriptor spMatDescr.

Param. | Memory | In/out | Meaning
spMatDescr | HOST | IN | Sparse matrix descriptor
cooRows | DEVICE | IN | Row indices of the sparse matrix. Array with nnz elements
cooColumns | DEVICE | IN | Column indices of the sparse matrix. Array with nnz elements
cooValues | DEVICE | IN | Values of the sparse matrix. Array with nnz elements

cusparseCooSetPointers() has the following constraints:

  • cooRows, cooColumns, and cooValues must be aligned to the size of their corresponding datatypes specified in spMatDescr. See cudaDataType_t for the description of the datatypes.

See cusparseStatus_t for the description of the return status.


6.5.1.4.cusparseCooSetStridedBatch()

cusparseStatus_t cusparseCooSetStridedBatch(cusparseSpMatDescr_t spMatDescr, int batchCount, int64_t batchStride)

This function sets the batchCount and the batchStride fields of the sparse matrix descriptor spMatDescr.

Param. | Memory | In/out | Meaning
spMatDescr | HOST | IN | Sparse matrix descriptor
batchCount | HOST | IN | Number of batches of the sparse matrix
batchStride | HOST | IN | Address offset between consecutive batches

See cusparseStatus_t for the description of the return status.


6.5.2.Compressed Sparse Row (CSR)

6.5.2.1.cusparseCreateCsr()

cusparseStatus_t cusparseCreateCsr(
    cusparseSpMatDescr_t* spMatDescr, int64_t rows, int64_t cols, int64_t nnz,
    void* csrRowOffsets, void* csrColInd, void* csrValues,
    cusparseIndexType_t csrRowOffsetsType, cusparseIndexType_t csrColIndType,
    cusparseIndexBase_t idxBase, cudaDataType valueType)

cusparseStatus_t cusparseCreateConstCsr(
    cusparseConstSpMatDescr_t* spMatDescr, int64_t rows, int64_t cols, int64_t nnz,
    const void* csrRowOffsets, const void* csrColInd, const void* csrValues,
    cusparseIndexType_t csrRowOffsetsType, cusparseIndexType_t csrColIndType,
    cusparseIndexBase_t idxBase, cudaDataType valueType)

This function initializes the sparse matrix descriptor spMatDescr in the CSR format.

Param. | Memory | In/out | Meaning
spMatDescr | HOST | OUT | Sparse matrix descriptor
rows | HOST | IN | Number of rows of the sparse matrix
cols | HOST | IN | Number of columns of the sparse matrix
nnz | HOST | IN | Number of non-zero entries of the sparse matrix
csrRowOffsets | DEVICE | IN | Row offsets of the sparse matrix. Array with rows+1 elements
csrColInd | DEVICE | IN | Column indices of the sparse matrix. Array with nnz elements
csrValues | DEVICE | IN | Values of the sparse matrix. Array with nnz elements
csrRowOffsetsType | HOST | IN | Data type of csrRowOffsets
csrColIndType | HOST | IN | Data type of csrColInd
idxBase | HOST | IN | Index base of csrRowOffsets and csrColInd
valueType | HOST | IN | Datatype of csrValues

cusparseCreateCsr() has the following constraints:

  • csrRowOffsets, csrColInd, and csrValues must be aligned to the size of the datatypes specified by csrRowOffsetsType, csrColIndType, and valueType, respectively. See cudaDataType_t for the description of the datatypes.

See cusparseStatus_t for the description of the return status.
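The relationship between csrRowOffsets (rows+1 elements), csrColInd, and csrValues (nnz elements each) can be seen by building the arrays on the host from a small dense matrix. The sketch below is illustrative only: dense_to_csr() is a hypothetical helper (not a cuSPARSE routine), the dense input is assumed row-major with no padding, indices are 32-bit, values are float, and idxBase is the integer 0 or 1.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side helper (not a cuSPARSE function).
 * Fills csrRowOffsets (rows+1 entries) plus csrColInd and csrValues
 * (one entry per nonzero) from a rows x cols row-major dense matrix.
 * Returns nnz. */
int64_t dense_to_csr(int rows, int cols, const float *dense, int idxBase,
                     int32_t *csrRowOffsets, int32_t *csrColInd,
                     float *csrValues) {
    assert(idxBase == 0 || idxBase == 1);
    int32_t nnz = 0;
    for (int r = 0; r < rows; ++r) {
        csrRowOffsets[r] = nnz + idxBase;  /* where row r starts */
        for (int c = 0; c < cols; ++c) {
            float v = dense[(int64_t)r * cols + c];
            if (v != 0.0f) {
                csrColInd[nnz] = c + idxBase;
                csrValues[nnz] = v;
                ++nnz;
            }
        }
    }
    csrRowOffsets[rows] = nnz + idxBase;   /* end of the last row */
    return nnz;
}
```

For the 2x3 matrix [[1, 0, 2], [0, 0, 3]] with zero-based indexing, the result is csrRowOffsets = {0, 2, 3}, csrColInd = {0, 2, 2}, csrValues = {1, 2, 3}, so nnz equals csrRowOffsets[rows] - csrRowOffsets[0] = 3.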


6.5.2.2.cusparseCsrGet()

cusparseStatus_t cusparseCsrGet(
    cusparseSpMatDescr_t spMatDescr, int64_t* rows, int64_t* cols, int64_t* nnz,
    void** csrRowOffsets, void** csrColInd, void** csrValues,
    cusparseIndexType_t* csrRowOffsetsType, cusparseIndexType_t* csrColIndType,
    cusparseIndexBase_t* idxBase, cudaDataType* valueType)

cusparseStatus_t cusparseConstCsrGet(
    cusparseConstSpMatDescr_t spMatDescr, int64_t* rows, int64_t* cols, int64_t* nnz,
    const void** csrRowOffsets, const void** csrColInd, const void** csrValues,
    cusparseIndexType_t* csrRowOffsetsType, cusparseIndexType_t* csrColIndType,
    cusparseIndexBase_t* idxBase, cudaDataType* valueType)

This function returns the fields of the sparse matrix descriptor spMatDescr stored in CSR format.

Param. | Memory | In/out | Meaning
spMatDescr | HOST | IN | Sparse matrix descriptor
rows | HOST | OUT | Number of rows of the sparse matrix
cols | HOST | OUT | Number of columns of the sparse matrix
nnz | HOST | OUT | Number of non-zero entries of the sparse matrix
csrRowOffsets | DEVICE | OUT | Row offsets of the sparse matrix. Array with rows+1 elements
csrColInd | DEVICE | OUT | Column indices of the sparse matrix. Array with nnz elements
csrValues | DEVICE | OUT | Values of the sparse matrix. Array with nnz elements
csrRowOffsetsType | HOST | OUT | Data type of csrRowOffsets
csrColIndType | HOST | OUT | Data type of csrColInd
idxBase | HOST | OUT | Index base of csrRowOffsets and csrColInd
valueType | HOST | OUT | Datatype of csrValues

See cusparseStatus_t for the description of the return status.


6.5.2.3.cusparseCsrSetPointers()

cusparseStatus_t cusparseCsrSetPointers(cusparseSpMatDescr_t spMatDescr, void* csrRowOffsets, void* csrColInd, void* csrValues)

This function sets the pointers of the sparse matrix descriptor spMatDescr.

Param. | Memory | In/out | Meaning
spMatDescr | HOST | IN | Sparse matrix descriptor
csrRowOffsets | DEVICE | IN | Row offsets of the sparse matrix. Array with rows+1 elements
csrColInd | DEVICE | IN | Column indices of the sparse matrix. Array with nnz elements
csrValues | DEVICE | IN | Values of the sparse matrix. Array with nnz elements

cusparseCsrSetPointers() has the following constraints:

  • csrRowOffsets, csrColInd, and csrValues must be aligned to the size of their corresponding datatypes specified in spMatDescr. See cudaDataType_t for the description of the datatypes.

See cusparseStatus_t for the description of the return status.


6.5.2.4.cusparseCsrSetStridedBatch()

cusparseStatus_t cusparseCsrSetStridedBatch(
    cusparseSpMatDescr_t spMatDescr,
    int batchCount,
    int64_t offsetsBatchStride,
    int64_t columnsValuesBatchStride)

This function sets the batchCount and the batchStride fields of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| batchCount | HOST | IN | Number of batches of the sparse matrix |
| offsetsBatchStride | HOST | IN | Address offset between consecutive batches for the row offset array |
| columnsValuesBatchStride | HOST | IN | Address offset between consecutive batches for the column and value arrays |

See cusparseStatus_t for the description of the return status.


6.5.3.Compressed Sparse Column (CSC)

6.5.3.1.cusparseCreateCsc()

cusparseStatus_t cusparseCreateCsc(
    cusparseSpMatDescr_t* spMatDescr,
    int64_t rows,
    int64_t cols,
    int64_t nnz,
    void* cscColOffsets,
    void* cscRowInd,
    void* cscValues,
    cusparseIndexType_t cscColOffsetsType,
    cusparseIndexType_t cscRowIndType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType)

cusparseStatus_t cusparseCreateConstCsc(
    cusparseConstSpMatDescr_t* spMatDescr,
    int64_t rows,
    int64_t cols,
    int64_t nnz,
    const void* cscColOffsets,
    const void* cscRowInd,
    const void* cscValues,
    cusparseIndexType_t cscColOffsetsType,
    cusparseIndexType_t cscRowIndType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType)

This function initializes the sparse matrix descriptor spMatDescr in the CSC format.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | OUT | Sparse matrix descriptor |
| rows | HOST | IN | Number of rows of the sparse matrix |
| cols | HOST | IN | Number of columns of the sparse matrix |
| nnz | HOST | IN | Number of non-zero entries of the sparse matrix |
| cscColOffsets | DEVICE | IN | Column offsets of the sparse matrix. Array with cols+1 elements |
| cscRowInd | DEVICE | IN | Row indices of the sparse matrix. Array with nnz elements |
| cscValues | DEVICE | IN | Values of the sparse matrix. Array with nnz elements |
| cscColOffsetsType | HOST | IN | Data type of cscColOffsets |
| cscRowIndType | HOST | IN | Data type of cscRowInd |
| idxBase | HOST | IN | Index base of cscColOffsets and cscRowInd |
| valueType | HOST | IN | Data type of cscValues |

cusparseCreateCsc() has the following constraints:

  • cscColOffsets, cscRowInd, and cscValues must be aligned to the size of the data types specified by cscColOffsetsType, cscRowIndType, and valueType, respectively. See cudaDataType_t for the description of the data types.

See cusparseStatus_t for the description of the return status.
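To show how the CSC arrays relate to the CSR arrays of the same matrix, the following host-side sketch converts one layout into the other with a counting pass and a prefix sum. It is only an illustration: csr_to_csc is a hypothetical name, not a cuSPARSE routine, and the sketch assumes zero-based 32-bit indices and float values.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: build zero-based CSC arrays from zero-based CSR arrays.
   csc_col_offsets must hold cols + 1 entries; csc_row_ind and csc_values
   must hold nnz entries. */
void csr_to_csc(int64_t rows, int64_t cols,
                const int32_t *csr_row_offsets, const int32_t *csr_col_ind,
                const float *csr_values,
                int32_t *csc_col_offsets, int32_t *csc_row_ind, float *csc_values)
{
    assert(rows >= 0 && cols >= 0);
    int64_t nnz = csr_row_offsets[rows];

    /* Count the entries of each column, shifted by one position. */
    for (int64_t c = 0; c <= cols; ++c) csc_col_offsets[c] = 0;
    for (int64_t k = 0; k < nnz; ++k) csc_col_offsets[csr_col_ind[k] + 1]++;

    /* The prefix sum turns the counts into column start offsets. */
    for (int64_t c = 0; c < cols; ++c) csc_col_offsets[c + 1] += csc_col_offsets[c];

    /* Place each entry, using the offsets as per-column write cursors. */
    for (int64_t r = 0; r < rows; ++r) {
        for (int32_t k = csr_row_offsets[r]; k < csr_row_offsets[r + 1]; ++k) {
            int32_t dst = csc_col_offsets[csr_col_ind[k]]++;
            csc_row_ind[dst] = (int32_t)r;
            csc_values[dst]  = csr_values[k];
        }
    }

    /* Each cursor now points at the next column's start; shift them back. */
    for (int64_t c = cols; c > 0; --c) csc_col_offsets[c] = csc_col_offsets[c - 1];
    csc_col_offsets[0] = 0;
}
```

Because rows are visited in increasing order, the row indices within each column come out sorted.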


6.5.3.2.cusparseCscGet()

cusparseStatus_t cusparseCscGet(
    cusparseSpMatDescr_t spMatDescr,
    int64_t* rows,
    int64_t* cols,
    int64_t* nnz,
    void** cscColOffsets,
    void** cscRowInd,
    void** cscValues,
    cusparseIndexType_t* cscColOffsetsType,
    cusparseIndexType_t* cscRowIndType,
    cusparseIndexBase_t* idxBase,
    cudaDataType* valueType)

cusparseStatus_t cusparseConstCscGet(
    cusparseConstSpMatDescr_t spMatDescr,
    int64_t* rows,
    int64_t* cols,
    int64_t* nnz,
    const void** cscColOffsets,
    const void** cscRowInd,
    const void** cscValues,
    cusparseIndexType_t* cscColOffsetsType,
    cusparseIndexType_t* cscRowIndType,
    cusparseIndexBase_t* idxBase,
    cudaDataType* valueType)

This function returns the fields of the sparse matrix descriptor spMatDescr stored in CSC format.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| rows | HOST | OUT | Number of rows of the sparse matrix |
| cols | HOST | OUT | Number of columns of the sparse matrix |
| nnz | HOST | OUT | Number of non-zero entries of the sparse matrix |
| cscColOffsets | DEVICE | OUT | Column offsets of the sparse matrix. Array with cols+1 elements |
| cscRowInd | DEVICE | OUT | Row indices of the sparse matrix. Array with nnz elements |
| cscValues | DEVICE | OUT | Values of the sparse matrix. Array with nnz elements |
| cscColOffsetsType | HOST | OUT | Data type of cscColOffsets |
| cscRowIndType | HOST | OUT | Data type of cscRowInd |
| idxBase | HOST | OUT | Index base of cscColOffsets and cscRowInd |
| valueType | HOST | OUT | Data type of cscValues |

See cusparseStatus_t for the description of the return status.


6.5.3.3.cusparseCscSetPointers()

cusparseStatus_t cusparseCscSetPointers(
    cusparseSpMatDescr_t spMatDescr,
    void* cscColOffsets,
    void* cscRowInd,
    void* cscValues)

This function sets the pointers of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| cscColOffsets | DEVICE | IN | Column offsets of the sparse matrix. Array with cols+1 elements |
| cscRowInd | DEVICE | IN | Row indices of the sparse matrix. Array with nnz elements |
| cscValues | DEVICE | IN | Values of the sparse matrix. Array with nnz elements |

cusparseCscSetPointers() has the following constraints:

  • cscColOffsets, cscRowInd, and cscValues must be aligned to the size of their corresponding data types specified in spMatDescr. See cudaDataType_t for the description of the data types.

See cusparseStatus_t for the description of the return status.


6.5.4.Blocked-Ellpack (Blocked-ELL)

6.5.4.1.cusparseCreateBlockedEll()

cusparseStatus_t cusparseCreateBlockedEll(
    cusparseSpMatDescr_t* spMatDescr,
    int64_t rows,
    int64_t cols,
    int64_t ellBlockSize,
    int64_t ellCols,
    void* ellColInd,
    void* ellValue,
    cusparseIndexType_t ellIdxType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType)

cusparseStatus_t cusparseCreateConstBlockedEll(
    cusparseConstSpMatDescr_t* spMatDescr,
    int64_t rows,
    int64_t cols,
    int64_t ellBlockSize,
    int64_t ellCols,
    const void* ellColInd,
    const void* ellValue,
    cusparseIndexType_t ellIdxType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType)

This function initializes the sparse matrix descriptor spMatDescr for the Blocked-Ellpack (ELL) format.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | OUT | Sparse matrix descriptor |
| rows | HOST | IN | Number of rows of the sparse matrix |
| cols | HOST | IN | Number of columns of the sparse matrix |
| ellBlockSize | HOST | IN | Size of the ELL-Block |
| ellCols | HOST | IN | Actual number of columns of the Blocked-Ellpack format (ellValue columns) |
| ellColInd | DEVICE | IN | Blocked-ELL column indices. Array with [ellCols/ellBlockSize][rows/ellBlockSize] elements |
| ellValue | DEVICE | IN | Values of the sparse matrix. Array with rows*ellCols elements |
| ellIdxType | HOST | IN | Data type of ellColInd |
| idxBase | HOST | IN | Index base of ellColInd |
| valueType | HOST | IN | Data type of ellValue |

Blocked-ELL column indices (ellColInd) are in the range [0, cols/ellBlockSize - 1]. The array can contain -1 values to indicate empty blocks.

See cusparseStatus_t for the description of the return status.


6.5.4.2.cusparseBlockedEllGet()

cusparseStatus_t cusparseBlockedEllGet(
    cusparseSpMatDescr_t spMatDescr,
    int64_t* rows,
    int64_t* cols,
    int64_t* ellBlockSize,
    int64_t* ellCols,
    void** ellColInd,
    void** ellValue,
    cusparseIndexType_t* ellIdxType,
    cusparseIndexBase_t* idxBase,
    cudaDataType* valueType)

cusparseStatus_t cusparseConstBlockedEllGet(
    cusparseConstSpMatDescr_t spMatDescr,
    int64_t* rows,
    int64_t* cols,
    int64_t* ellBlockSize,
    int64_t* ellCols,
    const void** ellColInd,
    const void** ellValue,
    cusparseIndexType_t* ellIdxType,
    cusparseIndexBase_t* idxBase,
    cudaDataType* valueType)

This function returns the fields of the sparse matrix descriptor spMatDescr stored in Blocked-Ellpack (ELL) format.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| rows | HOST | OUT | Number of rows of the sparse matrix |
| cols | HOST | OUT | Number of columns of the sparse matrix |
| ellBlockSize | HOST | OUT | Size of the ELL-Block |
| ellCols | HOST | OUT | Actual number of columns of the Blocked-Ellpack format |
| ellColInd | DEVICE | OUT | Column indices for the ELL-Block. Array with [ellCols/ellBlockSize][rows/ellBlockSize] elements |
| ellValue | DEVICE | OUT | Values of the sparse matrix. Array with rows*ellCols elements |
| ellIdxType | HOST | OUT | Data type of ellColInd |
| idxBase | HOST | OUT | Index base of ellColInd |
| valueType | HOST | OUT | Data type of ellValue |

See cusparseStatus_t for the description of the return status.


6.5.5.Sliced-Ellpack (SELL)

6.5.5.1.cusparseCreateSlicedEll()

cusparseStatus_t cusparseCreateSlicedEll(
    cusparseSpMatDescr_t* spMatDescr,
    int64_t rows,
    int64_t cols,
    int64_t nnz,
    int64_t sellValuesSize,
    int64_t sliceSize,
    void* sellSliceOffsets,
    void* sellColInd,
    void* sellValues,
    cusparseIndexType_t sellSliceOffsetsType,
    cusparseIndexType_t sellColIndType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType)

cusparseStatus_t cusparseCreateConstSlicedEll(
    cusparseConstSpMatDescr_t* spMatDescr,
    int64_t rows,
    int64_t cols,
    int64_t nnz,
    int64_t sellValuesSize,
    int64_t sliceSize,
    const void* sellSliceOffsets,
    const void* sellColInd,
    const void* sellValues,
    cusparseIndexType_t sellSliceOffsetsType,
    cusparseIndexType_t sellColIndType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType)

This function initializes the sparse matrix descriptor spMatDescr for the Sliced Ellpack (SELL) format.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | OUT | Sparse matrix descriptor |
| rows | HOST | IN | Number of rows of the sparse matrix |
| cols | HOST | IN | Number of columns of the sparse matrix |
| nnz | HOST | IN | Number of non-zero elements in the sparse matrix |
| sellValuesSize | HOST | IN | Total number of elements in the sellValues array (non-zero and padding) |
| sliceSize | HOST | IN | Number of rows per slice |
| sellSliceOffsets | DEVICE | IN | Slice offsets of the sparse matrix. Array of size \(\left \lceil{\frac{rows}{sliceSize}}\right \rceil + 1\) |
| sellColInd | DEVICE | IN | Column indices of the sparse matrix. Array of size sellValuesSize |
| sellValues | DEVICE | IN | Values of the sparse matrix. Array of size sellValuesSize |
| sellSliceOffsetsType | HOST | IN | Data type of sellSliceOffsets |
| sellColIndType | HOST | IN | Data type of sellColInd |
| idxBase | HOST | IN | Index base of sellColInd |
| valueType | HOST | IN | Data type of sellValues |

Note

The Sliced Ellpack column array sellColInd contains -1 values to indicate padded entries.

cusparseCreateSlicedEll() has the following constraints:

  • sellSliceOffsets, sellColInd, and sellValues must be aligned to the size of the data types specified by sellSliceOffsetsType, sellColIndType, and valueType, respectively. See cudaDataType_t for the description of the data types.

See cusparseStatus_t for the description of the return status.
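The sizes of sellSliceOffsets and sellValues can be worked out on the host from per-row non-zero counts. The sketch below is illustrative only: sell_slice_offsets is a hypothetical name, and it assumes that every slice, including a final partial one, is padded to sliceSize rows and to the width of its longest row. That padding rule is an assumption of this sketch, not something stated in this section.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: compute zero-based slice offsets from CSR row offsets.
   slice_offsets must hold ceil(rows / slice_size) + 1 entries.
   Returns the total element count (non-zeros plus padding). */
int64_t sell_slice_offsets(const int32_t *csr_row_offsets, int64_t rows,
                           int64_t slice_size, int32_t *slice_offsets)
{
    assert(rows >= 0 && slice_size > 0);
    int64_t num_slices = (rows + slice_size - 1) / slice_size;
    int64_t total = 0;
    for (int64_t s = 0; s < num_slices; ++s) {
        slice_offsets[s] = (int32_t)total;
        int64_t first = s * slice_size;
        int64_t last  = first + slice_size < rows ? first + slice_size : rows;
        int64_t width = 0;                     /* longest row in this slice */
        for (int64_t r = first; r < last; ++r) {
            int64_t len = csr_row_offsets[r + 1] - csr_row_offsets[r];
            if (len > width) width = len;
        }
        total += width * slice_size;           /* assumed padding rule */
    }
    slice_offsets[num_slices] = (int32_t)total;
    return total;
}
```

Under that assumption, the returned value is what would be passed as sellValuesSize.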


6.5.6.Block Sparse Row (BSR)

6.5.6.1.cusparseCreateBsr()

cusparseStatus_t cusparseCreateBsr(
    cusparseSpMatDescr_t* spMatDescr,
    int64_t brows,
    int64_t bcols,
    int64_t bnnz,
    int64_t rowBlockSize,
    int64_t colBlockSize,
    void* bsrRowOffsets,
    void* bsrColInd,
    void* bsrValues,
    cusparseIndexType_t bsrRowOffsetsType,
    cusparseIndexType_t bsrColIndType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType,
    cusparseOrder_t order)

cusparseStatus_t cusparseCreateConstBsr(
    cusparseConstSpMatDescr_t* spMatDescr,
    int64_t brows,
    int64_t bcols,
    int64_t bnnz,
    int64_t rowBlockSize,
    int64_t colBlockSize,
    const void* bsrRowOffsets,
    const void* bsrColInd,
    const void* bsrValues,
    cusparseIndexType_t bsrRowOffsetsType,
    cusparseIndexType_t bsrColIndType,
    cusparseIndexBase_t idxBase,
    cudaDataType valueType,
    cusparseOrder_t order)

This function initializes the sparse matrix descriptor spMatDescr for the Block Sparse Row (BSR) format.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | OUT | Sparse matrix descriptor |
| brows | HOST | IN | Number of block rows of the sparse matrix |
| bcols | HOST | IN | Number of block columns of the sparse matrix |
| bnnz | HOST | IN | Number of blocks of the sparse matrix |
| rowBlockSize | HOST | IN | Number of rows of each block |
| colBlockSize | HOST | IN | Number of columns of each block |
| bsrRowOffsets | DEVICE | IN | Block row offsets of the sparse matrix. Array of size brows+1 |
| bsrColInd | DEVICE | IN | Block column indices of the sparse matrix. Array of size bnnz |
| bsrValues | DEVICE | IN | Values of the sparse matrix. Array of size bnnz*rowBlockSize*colBlockSize |
| bsrRowOffsetsType | HOST | IN | Data type of bsrRowOffsets |
| bsrColIndType | HOST | IN | Data type of bsrColInd |
| idxBase | HOST | IN | Index base of bsrRowOffsets and bsrColInd |
| valueType | HOST | IN | Data type of bsrValues |
| order | HOST | IN | Enumerator specifying the memory layout of values in each block |

cusparseCreateBsr() has the following constraints:

  • bsrRowOffsets, bsrColInd, and bsrValues must be aligned to the size of the data types specified by bsrRowOffsetsType, bsrColIndType, and valueType, respectively. See cudaDataType_t for the description of the data types.

See cusparseStatus_t for the description of the return status.
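As an illustration of how bsrValues, with bnnz*rowBlockSize*colBlockSize elements, can be addressed, the following sketch computes the position of one element inside one stored block. The helper bsr_value_index and its row_major flag are hypothetical. The sketch assumes blocks are stored contiguously one after another, and that the order parameter selects a row-major or column-major layout inside each block.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index into the values array of element (i, j) of the
   stored block number `block`, for a dense block of
   row_block_size x col_block_size values. */
int64_t bsr_value_index(int64_t block, int64_t i, int64_t j,
                        int64_t row_block_size, int64_t col_block_size,
                        int row_major)
{
    assert(i >= 0 && i < row_block_size && j >= 0 && j < col_block_size);
    int64_t base = block * row_block_size * col_block_size;
    return base + (row_major ? i * col_block_size + j   /* rows contiguous */
                             : j * row_block_size + i); /* columns contiguous */
}
```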


6.5.6.2.cusparseBsrSetStridedBatch()

cusparseStatus_t cusparseBsrSetStridedBatch(
    cusparseSpMatDescr_t spMatDescr,
    int batchCount,
    int64_t offsetsBatchStride,
    int64_t columnsBatchStride,
    int64_t valuesBatchStride)

This function sets the batchCount and the batchStride fields of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| batchCount | HOST | IN | Number of batches of the sparse matrix |
| offsetsBatchStride | HOST | IN | Address offset between consecutive batches for the row offset array |
| columnsBatchStride | HOST | IN | Address offset between consecutive batches for the column array |
| valuesBatchStride | HOST | IN | Address offset between consecutive batches for the values array |

See cusparseStatus_t for the description of the return status.


6.5.7.All Sparse Formats

6.5.7.1.cusparseDestroySpMat()

cusparseStatus_t cusparseDestroySpMat(
    cusparseConstSpMatDescr_t spMatDescr)  // non-const descriptor supported

This function releases the host memory allocated for the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |

See cusparseStatus_t for the description of the return status.


6.5.7.2.cusparseSpMatGetSize()

cusparseStatus_t cusparseSpMatGetSize(
    cusparseConstSpMatDescr_t spMatDescr,  // non-const descriptor supported
    int64_t* rows,
    int64_t* cols,
    int64_t* nnz)

This function returns the sizes of the sparse matrix spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| rows | HOST | OUT | Number of rows of the sparse matrix |
| cols | HOST | OUT | Number of columns of the sparse matrix |
| nnz | HOST | OUT | Number of non-zero entries of the sparse matrix |

See cusparseStatus_t for the description of the return status.


6.5.7.3.cusparseSpMatGetFormat()

cusparseStatus_t cusparseSpMatGetFormat(
    cusparseConstSpMatDescr_t spMatDescr,  // non-const descriptor supported
    cusparseFormat_t* format)

This function returns the format field of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| format | HOST | OUT | Storage format of the sparse matrix |

See cusparseStatus_t for the description of the return status.


6.5.7.4.cusparseSpMatGetIndexBase()

cusparseStatus_t cusparseSpMatGetIndexBase(
    cusparseConstSpMatDescr_t spMatDescr,  // non-const descriptor supported
    cusparseIndexBase_t* idxBase)

This function returns the idxBase field of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| idxBase | HOST | OUT | Index base of the sparse matrix |

See cusparseStatus_t for the description of the return status.


6.5.7.5.cusparseSpMatGetValues()

cusparseStatus_t cusparseSpMatGetValues(
    cusparseSpMatDescr_t spMatDescr,
    void** values)

cusparseStatus_t cusparseConstSpMatGetValues(
    cusparseConstSpMatDescr_t spMatDescr,
    const void** values)

This function returns the values field of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| values | DEVICE | OUT | Values of the sparse matrix. Array with nnz elements |

See cusparseStatus_t for the description of the return status.


6.5.7.6.cusparseSpMatSetValues()

cusparseStatus_t cusparseSpMatSetValues(
    cusparseSpMatDescr_t spMatDescr,
    void* values)

This function sets the values field of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| values | DEVICE | IN | Values of the sparse matrix. Array with nnz elements |

cusparseSpMatSetValues() has the following constraints:

  • values must be aligned to the size of its corresponding data type specified in spMatDescr. See cudaDataType_t for the description of the data types.

See cusparseStatus_t for the description of the return status.


6.5.7.7.cusparseSpMatGetStridedBatch()

cusparseStatus_t cusparseSpMatGetStridedBatch(
    cusparseConstSpMatDescr_t spMatDescr,  // non-const descriptor supported
    int* batchCount)

This function returns the batchCount field of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| batchCount | HOST | OUT | Number of batches of the sparse matrix |

See cusparseStatus_t for the description of the return status.


6.5.7.8.cusparseSpMatGetAttribute()

cusparseStatus_t cusparseSpMatGetAttribute(
    cusparseConstSpMatDescr_t spMatDescr,  // non-const descriptor supported
    cusparseSpMatAttribute_t attribute,
    void* data,
    size_t dataSize)

The function gets the attributes of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | IN | Sparse matrix descriptor |
| attribute | HOST | IN | Attribute enumerator |
| data | HOST | OUT | Attribute value |
| dataSize | HOST | IN | Size of the attribute in bytes for safety |

| Attribute | Meaning | Possible Values |
|---|---|---|
| CUSPARSE_SPMAT_FILL_MODE | Indicates if the lower or upper part of a matrix is stored in sparse storage | CUSPARSE_FILL_MODE_LOWER, CUSPARSE_FILL_MODE_UPPER |
| CUSPARSE_SPMAT_DIAG_TYPE | Indicates if the matrix diagonal entries are unity | CUSPARSE_DIAG_TYPE_NON_UNIT, CUSPARSE_DIAG_TYPE_UNIT |

See cusparseStatus_t for the description of the return status.


6.5.7.9.cusparseSpMatSetAttribute()

cusparseStatus_t cusparseSpMatSetAttribute(
    cusparseSpMatDescr_t spMatDescr,
    cusparseSpMatAttribute_t attribute,
    const void* data,
    size_t dataSize)

The function sets the attributes of the sparse matrix descriptor spMatDescr.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| spMatDescr | HOST | OUT | Sparse matrix descriptor |
| attribute | HOST | IN | Attribute enumerator |
| data | HOST | IN | Attribute value |
| dataSize | HOST | IN | Size of the attribute in bytes for safety |

| Attribute | Meaning | Possible Values |
|---|---|---|
| CUSPARSE_SPMAT_FILL_MODE | Indicates if the lower or upper part of a matrix is stored in sparse storage | CUSPARSE_FILL_MODE_LOWER, CUSPARSE_FILL_MODE_UPPER |
| CUSPARSE_SPMAT_DIAG_TYPE | Indicates if the matrix diagonal entries are unity | CUSPARSE_DIAG_TYPE_NON_UNIT, CUSPARSE_DIAG_TYPE_UNIT |

See cusparseStatus_t for the description of the return status.



6.6.Generic API Functions

6.6.1.cusparseAxpby() [DEPRECATED]

>This routine will be removed in a future major release.

cusparseStatus_t cusparseAxpby(
    cusparseHandle_t handle,
    const void* alpha,
    cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
    const void* beta,
    cusparseDnVecDescr_t vecY)

The function computes the sum of a sparse vector vecX and a dense vector vecY.

\[\mathbf{Y} = \alpha\mathbf{X} + \beta\mathbf{Y}\]

In other words,

    for i = 0 to n-1
        Y[i] = beta * Y[i]
    for i = 0 to nnz-1
        Y[X_indices[i]] += alpha * X_values[i]

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of compute type |
| vecX | HOST | IN | Sparse vector X |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication of compute type |
| vecY | HOST | IN/OUT | Dense vector Y |

cusparseAxpby supports the following index types for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseAxpby supports the following data types:

Uniform-precision computation:

| X/Y/compute |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |

Mixed-precision computation:

| X/Y | compute | Notes |
|---|---|---|
| CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

cusparseAxpby() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseAxpby() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine provides deterministic (bit-wise) results for each run if the indices of the sparse vector vecX are distinct

  • The routine allows the indices of vecX to be unsorted

cusparseAxpby() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseAxpby for a code example.
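For checking results, the two loops given for cusparseAxpby() can be written as a plain host-side reference. This is a sketch under stated assumptions, not the library implementation: axpby_ref is a hypothetical name, and it assumes float values and zero-based 32-bit indices.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference for Y = alpha * X + beta * Y,
   where X is sparse (x_indices, x_values) and Y is dense with n elements. */
void axpby_ref(float alpha, int64_t nnz, const int32_t *x_indices,
               const float *x_values, float beta, int64_t n, float *y)
{
    /* Scale every element of the dense vector first. */
    for (int64_t i = 0; i < n; ++i)
        y[i] = beta * y[i];
    /* Then accumulate the scaled sparse entries. */
    for (int64_t i = 0; i < nnz; ++i) {
        assert(x_indices[i] >= 0 && x_indices[i] < n);
        y[x_indices[i]] += alpha * x_values[i];
    }
}
```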


6.6.2.cusparseGather()

cusparseStatus_t cusparseGather(
    cusparseHandle_t handle,
    cusparseConstDnVecDescr_t vecY,  // non-const descriptor supported
    cusparseSpVecDescr_t vecX)

The function gathers the elements of the dense vector vecY into the sparse vector vecX.

In other words,

    for i = 0 to nnz-1
        X_values[i] = Y[X_indices[i]]

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| vecX | HOST | OUT | Sparse vector X |
| vecY | HOST | IN | Dense vector Y |

cusparseGather supports the following index types for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseGather supports the following data types:

| X/Y |
|---|
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F [DEPRECATED] |
| CUDA_C_16BF [DEPRECATED] |
| CUDA_C_32F |
| CUDA_C_64F |

cusparseGather() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseGather() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine provides deterministic (bit-wise) results for each run if the indices of the sparse vector vecX are distinct

  • The routine allows the indices of vecX to be unsorted

cusparseGather() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseGather for a code example.
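The gather loop can be mirrored on the host to validate device results. The function below is a hypothetical reference, gather_ref, assuming float values and zero-based 32-bit indices. It is not the cuSPARSE implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference: copy the elements of the dense vector y that
   are selected by x_indices into the sparse vector's value array x_values. */
void gather_ref(int64_t nnz, const int32_t *x_indices, float *x_values,
                int64_t n, const float *y)
{
    for (int64_t i = 0; i < nnz; ++i) {
        assert(x_indices[i] >= 0 && x_indices[i] < n);
        x_values[i] = y[x_indices[i]];
    }
}
```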


6.6.3.cusparseScatter()

cusparseStatus_t cusparseScatter(
    cusparseHandle_t handle,
    cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
    cusparseDnVecDescr_t vecY)

The function scatters the elements of the sparse vector vecX into the dense vector vecY.

In other words,

    for i = 0 to nnz-1
        Y[X_indices[i]] = X_values[i]

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| vecX | HOST | IN | Sparse vector X |
| vecY | HOST | OUT | Dense vector Y |

cusparseScatter supports the following index types for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseScatter supports the following data types:

| X/Y |
|---|
| CUDA_R_8I |
| CUDA_R_16F |
| CUDA_R_16BF |
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_16F [DEPRECATED] |
| CUDA_C_16BF [DEPRECATED] |
| CUDA_C_32F |
| CUDA_C_64F |

cusparseScatter() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseScatter() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine provides deterministic (bit-wise) results for each run if the indices of the sparse vector vecX are distinct

  • The routine allows the indices of vecX to be unsorted

cusparseScatter() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseScatter for a code example.
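A host-side counterpart of the scatter loop is sketched below. The name scatter_ref is hypothetical, and float values with zero-based 32-bit indices are assumed. Elements of Y that are not addressed by an index are left untouched, which follows directly from the loop given for cusparseScatter().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference: write each sparse value into the dense
   vector y at the position given by its index. */
void scatter_ref(int64_t nnz, const int32_t *x_indices, const float *x_values,
                 int64_t n, float *y)
{
    for (int64_t i = 0; i < nnz; ++i) {
        assert(x_indices[i] >= 0 && x_indices[i] < n);
        y[x_indices[i]] = x_values[i];
    }
}
```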


6.6.4.cusparseRot() [DEPRECATED]

>The routine will be removed in the next major release

cusparseStatus_t cusparseRot(
    cusparseHandle_t handle,
    const void* c_coeff,
    const void* s_coeff,
    cusparseSpVecDescr_t vecX,
    cusparseDnVecDescr_t vecY)

The function applies the Givens rotation matrix

\[\begin{split}G = \begin{bmatrix} c & s \\ {- s} & c \\ \end{bmatrix}\end{split}\]

to a sparse vector vecX and a dense vector vecY.

In other words, with both updates computed from the values held before the rotation,

    for i = 0 to nnz-1
        x = X_values[i]
        y = Y[X_indices[i]]
        Y[X_indices[i]] = c * y - s * x
        X_values[i]     = c * x + s * y

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| c_coeff | HOST or DEVICE | IN | Cosine element of the rotation matrix |
| s_coeff | HOST or DEVICE | IN | Sine element of the rotation matrix |
| vecX | HOST | IN/OUT | Sparse vector X |
| vecY | HOST | IN/OUT | Dense vector Y |

cusparseRot supports the following index types for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseRot supports the following data types:

Uniform-precision computation:

| X/Y/compute |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |

Mixed-precision computation:

| X/Y | compute | Notes |
|---|---|---|
| CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

cusparseRot() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseRot() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine provides deterministic (bit-wise) results for each run if the indices of the sparse vector vecX are distinct

cusparseRot() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseRot for a code example.
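The rotation can be sketched on the host as follows. The name rot_ref is hypothetical, and the sketch assumes float values and zero-based 32-bit indices. It reads both operands into temporaries before writing, on the assumption that each update uses the pre-rotation values of X and Y, as in a standard Givens rotation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference: apply the rotation [c s; -s c] to the pairs
   (X_values[i], Y[X_indices[i]]). */
void rot_ref(float c, float s, int64_t nnz, const int32_t *x_indices,
             float *x_values, int64_t n, float *y)
{
    for (int64_t i = 0; i < nnz; ++i) {
        assert(x_indices[i] >= 0 && x_indices[i] < n);
        float xv = x_values[i];         /* old sparse value */
        float yv = y[x_indices[i]];     /* old dense value  */
        y[x_indices[i]] = c * yv - s * xv;
        x_values[i]     = c * xv + s * yv;
    }
}
```

With c = 0 and s = 1 the rotation swaps the two operands and negates the value written to Y, which makes a convenient sanity check.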


6.6.5.cusparseSpVV() [DEPRECATED]

>This routine will be removed in a future major release.

cusparseStatus_t cusparseSpVV_bufferSize(
    cusparseHandle_t handle,
    cusparseOperation_t opX,
    cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecY,  // non-const descriptor supported
    void* result,
    cudaDataType computeType,
    size_t* bufferSize)

cusparseStatus_t cusparseSpVV(
    cusparseHandle_t handle,
    cusparseOperation_t opX,
    cusparseConstSpVecDescr_t vecX,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecY,  // non-const descriptor supported
    void* result,
    cudaDataType computeType,
    void* externalBuffer)

The function computes the inner dot product of a sparse vector vecX and a dense vector vecY.

\[result = op\left(\mathbf{X}\right) \cdot \mathbf{Y}\]

In other words,

    result = 0;
    for i = 0 to nnz-1
        result += op(X_values[i]) * Y[X_indices[i]]

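where op(X) is X if opX == CUSPARSE_OPERATION_NON_TRANSPOSE, and the complex conjugate of X if opX == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE.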

The function cusparseSpVV_bufferSize() returns the size of the workspace needed by cusparseSpVV().

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opX | HOST | IN | Operation op(X) that is non-transpose or conjugate transpose |
| vecX | HOST | IN | Sparse vector X |
| vecY | HOST | IN | Dense vector Y |
| result | HOST or DEVICE | OUT | The resulting dot product |
| computeType | HOST | IN | Data type in which the computation is executed |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpVV |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

cusparseSpVV supports the following index types for representing the sparse vector vecX:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

The data type combinations currently supported for cusparseSpVV are listed below:

Uniform-precision computation:

| X/Y/computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |

Mixed-precision computation:

| X/Y | computeType/result | Notes |
|---|---|---|
| CUDA_R_8I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

cusparseSpVV() has the following constraints:

  • The arrays representing the sparse vector vecX must be aligned to 16 bytes

cusparseSpVV() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine provides deterministic (bit-wise) results for each run if the indices of the sparse vector vecX are distinct

  • The routine allows the indices of vecX to be unsorted

cusparseSpVV() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpVV for a code example.
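For real-valued data, where op(X) leaves the values unchanged, the dot product loop reduces to the host-side sketch below. The name spvv_ref is hypothetical, and the sketch assumes float values, zero-based 32-bit indices, and accumulation in float. It is meant as a reference for checking results, not as a description of the device algorithm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference: dot product of a sparse vector
   (x_indices, x_values) with a dense vector y of n elements. */
float spvv_ref(int64_t nnz, const int32_t *x_indices, const float *x_values,
               int64_t n, const float *y)
{
    float result = 0.0f;
    for (int64_t i = 0; i < nnz; ++i) {
        assert(x_indices[i] >= 0 && x_indices[i] < n);
        result += x_values[i] * y[x_indices[i]];
    }
    return result;
}
```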


6.6.6.cusparseSpMV()

cusparseStatus_t cusparseSpMV_bufferSize(
    cusparseHandle_t handle,
    cusparseOperation_t opA,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
    const void* beta,
    cusparseDnVecDescr_t vecY,
    cudaDataType computeType,
    cusparseSpMVAlg_t alg,
    size_t* bufferSize)

cusparseStatus_t cusparseSpMV_preprocess(
    cusparseHandle_t handle,
    cusparseOperation_t opA,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
    const void* beta,
    cusparseDnVecDescr_t vecY,
    cudaDataType computeType,
    cusparseSpMVAlg_t alg,
    void* externalBuffer)

cusparseStatus_t cusparseSpMV(
    cusparseHandle_t handle,
    cusparseOperation_t opA,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
    const void* beta,
    cusparseDnVecDescr_t vecY,
    cudaDataType computeType,
    cusparseSpMVAlg_t alg,
    void* externalBuffer)

This function performs the multiplication of a sparse matrix matA and a dense vector vecX.

\[\mathbf{Y} = \alpha op\left( \mathbf{A} \right) \cdot \mathbf{X} + \beta\mathbf{Y}\]

where

  • op(A) is a sparse matrix of size \(m \times k\)

  • X is a dense vector of size \(k\)

  • Y is a dense vector of size \(m\)

  • \(\alpha\) and \(\beta\) are scalars

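Also, for matrix A, op(A) is A if opA == CUSPARSE_OPERATION_NON_TRANSPOSE, the transpose of A if opA == CUSPARSE_OPERATION_TRANSPOSE, and the conjugate transpose of A if opA == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE.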
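To make the operation concrete, the following host-side sketch evaluates Y = alpha * A * X + beta * Y for a CSR matrix in the non-transpose case. It is a reference for checking results under stated assumptions, not the algorithm cuSPARSE uses: csr_spmv_ref is a hypothetical name, and float values with zero-based 32-bit indices are assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host reference for y = alpha * A * x + beta * y,
   with A stored in zero-based CSR and op(A) = A. */
void csr_spmv_ref(int64_t rows, int64_t cols,
                  const int32_t *row_offsets, const int32_t *col_ind,
                  const float *values, float alpha, const float *x,
                  float beta, float *y)
{
    for (int64_t r = 0; r < rows; ++r) {
        float sum = 0.0f;
        /* Accumulate the dot product of row r with x. */
        for (int32_t k = row_offsets[r]; k < row_offsets[r + 1]; ++k) {
            assert(col_ind[k] >= 0 && col_ind[k] < cols);
            sum += values[k] * x[col_ind[k]];
        }
        y[r] = alpha * sum + beta * y[r];
    }
}
```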

The function cusparseSpMV_bufferSize() returns the size of the workspace needed by cusparseSpMV_preprocess() and cusparseSpMV().

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Sparse matrix A |
| vecX | HOST | IN | Dense vector X |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication of type computeType |
| vecY | HOST | IN/OUT | Dense vector Y |
| computeType | HOST | IN | Data type in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMV |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

The sparse matrix formats currently supported are listed below:

  • CUSPARSE_FORMAT_COO

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_CSC

  • CUSPARSE_FORMAT_BSR

  • CUSPARSE_FORMAT_SLICED_ELL

cusparseSpMV supports the following index types for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpMV supports the following data types:

Uniform-precision computation:

| A/X/Y/computeType |
|---|
| CUDA_R_32F |
| CUDA_R_64F |
| CUDA_C_32F |
| CUDA_C_64F |

Mixed-precision computation:

A/X

Y

computeType

Notes

CUDA_R_8I

CUDA_R_32I

CUDA_R_32I

CUDA_R_8I

CUDA_R_32F

CUDA_R_32F

CUDA_R_16F

CUDA_R_16BF

CUDA_R_16F

CUDA_R_16F

CUDA_R_16BF

CUDA_R_16BF

CUDA_C_32F

CUDA_C_32F

CUDA_C_32F

CUDA_C_16F

CUDA_C_16F

[DEPRECATED]

CUDA_C_16BF

CUDA_C_16BF

[DEPRECATED]

| A | X/Y/computeType |
|---|---|
| CUDA_R_32F | CUDA_R_64F |

Mixed Regular/Complex computation:

| A | X/Y/computeType |
|---|---|
| CUDA_R_32F | CUDA_C_32F |
| CUDA_R_64F | CUDA_C_64F |

NOTE: CUDA_R_16F, CUDA_R_16BF, CUDA_C_16F, and CUDA_C_16BF data types always imply mixed-precision computation.

cusparseSpMV() supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPMV_ALG_DEFAULT

Default algorithm for any sparse matrix format.

CUSPARSE_SPMV_COO_ALG1

Default algorithm for COO sparse matrix format. May produce slightly different results during different runs with the same input parameters.

CUSPARSE_SPMV_COO_ALG2

Provides deterministic (bit-wise) results for each run. If opA != CUSPARSE_OPERATION_NON_TRANSPOSE, it is identical to CUSPARSE_SPMV_COO_ALG1.

CUSPARSE_SPMV_CSR_ALG1

Default algorithm for CSR/CSC sparse matrix format. May produce slightly different results during different runs with the same input parameters.

CUSPARSE_SPMV_CSR_ALG2

Provides deterministic (bit-wise) results for each run. If opA != CUSPARSE_OPERATION_NON_TRANSPOSE, it is identical to CUSPARSE_SPMV_CSR_ALG1.

CUSPARSE_SPMV_SELL_ALG1

Default algorithm for Sliced Ellpack sparse matrix format. Provides deterministic (bit-wise) results for each run.

CUSPARSE_SPMV_BSR_ALG1

Default algorithm for BSR sparse matrix format. Provides deterministic (bit-wise) results for each run. Supports only opA == CUSPARSE_OPERATION_NON_TRANSPOSE. Supports both row-major and column-major block layouts in A.
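The phrase "slightly different results during different runs" can be illustrated on the host. A common cause of run-to-run differences in parallel reductions is that floating-point addition is not associative, so the result depends on the order in which partial sums are combined. Whether this is the exact mechanism inside the non-deterministic cuSPARSE algorithms is an assumption here; the helper sum_in_order below is hypothetical and only demonstrates the arithmetic effect.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper: sums `values` in the order given by `order`,
// accumulating in single precision.
float sum_in_order(const std::vector<float>& values,
                   const std::vector<int>& order) {
    assert(values.size() == order.size());
    float acc = 0.0f;
    for (int idx : order) {
        acc += values[idx];  // each addition rounds to the nearest float
    }
    return acc;
}
```

With the values {1e8f, -1e8f, 1.0f} and IEEE-754 single precision, the order 0, 1, 2 yields 1.0f while the order 1, 2, 0 yields 0.0f, because 1.0f is absorbed when added to -1e8f. A deterministic algorithm fixes the accumulation order, which is why it returns bit-wise identical results.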

Calling cusparseSpMV_preprocess() is optional. It may accelerate subsequent calls to cusparseSpMV(). It is useful when cusparseSpMV() is called multiple times with the same sparsity pattern (matA).

Calling cusparseSpMV_preprocess() with a buffer makes that buffer “active” for matA SpMV calls. Subsequent calls to cusparseSpMV() with matA and the active buffer must use the same values for all parameters as the call to cusparseSpMV_preprocess(). The exceptions are alpha, beta, vecX, vecY, and the values (but not the indices) of matA, which may be different. Importantly, the buffer contents must be unmodified since the call to cusparseSpMV_preprocess(). When cusparseSpMV() is called with matA and its active buffer, it may read acceleration data from the buffer.

Calling cusparseSpMV_preprocess() again with matA and a new buffer makes the new buffer active and the previously active buffer inactive. For cusparseSpMV(), there can only be one active buffer per sparse matrix at a time. To get the effect of multiple active buffers for a single sparse matrix, create multiple matrix handles that all point to the same index and value buffers, and call cusparseSpMV_preprocess() once per handle with different workspace buffers.

Calling cusparseSpMV() with an inactive buffer is always permitted. However, there may be no acceleration from the preprocessing in that case.

For the purposes of thread safety, cusparseSpMV_preprocess() writes to the internal state of matA.

Performance notes:

  • CUSPARSE_SPMV_COO_ALG1 and CUSPARSE_SPMV_CSR_ALG1 provide higher performance than CUSPARSE_SPMV_COO_ALG2 and CUSPARSE_SPMV_CSR_ALG2.

  • In general, opA == CUSPARSE_OPERATION_NON_TRANSPOSE is 3x faster than opA != CUSPARSE_OPERATION_NON_TRANSPOSE.

  • Using cusparseSpMV_preprocess() helps improve the performance of cusparseSpMV() in CSR. It is beneficial when cusparseSpMV() is run multiple times with the same matrix (cusparseSpMV_preprocess() is executed only once).

cusparseSpMV() has the following properties:

  • The routine requires extra storage for CSR/CSC format (all algorithms) and for COO format with the CUSPARSE_SPMV_COO_ALG2 algorithm.

  • Provides deterministic (bit-wise) results for each run only for the CUSPARSE_SPMV_COO_ALG2, CUSPARSE_SPMV_CSR_ALG2, and CUSPARSE_SPMV_BSR_ALG1 algorithms, and opA == CUSPARSE_OPERATION_NON_TRANSPOSE.

  • The routine supports asynchronous execution.

  • compute-sanitizer could report false race conditions for this routine when beta == 0. This is for optimization purposes and does not affect the correctness of the computation.

  • The routine allows the indices of matA to be unsorted.

cusparseSpMV() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpMV CSR and cusparseSpMV COO for a code example.


6.6.7.cusparseSpSV()

cusparseStatus_t cusparseSpSV_createDescr(cusparseSpSVDescr_t* spsvDescr);

cusparseStatus_t cusparseSpSV_destroyDescr(cusparseSpSVDescr_t spsvDescr);

cusparseStatus_t cusparseSpSV_bufferSize(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
    cusparseDnVecDescr_t      vecY,
    cudaDataType              computeType,
    cusparseSpSVAlg_t         alg,
    cusparseSpSVDescr_t       spsvDescr,
    size_t*                   bufferSize);

cusparseStatus_t cusparseSpSV_analysis(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
    cusparseDnVecDescr_t      vecY,
    cudaDataType              computeType,
    cusparseSpSVAlg_t         alg,
    cusparseSpSVDescr_t       spsvDescr,
    void*                     externalBuffer);

cusparseStatus_t cusparseSpSV_solve(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnVecDescr_t vecX,  // non-const descriptor supported
    cusparseDnVecDescr_t      vecY,
    cudaDataType              computeType,
    cusparseSpSVAlg_t         alg,
    cusparseSpSVDescr_t       spsvDescr);

cusparseStatus_t cusparseSpSV_updateMatrix(
    cusparseHandle_t     handle,
    cusparseSpSVDescr_t  spsvDescr,
    void*                newValues,
    cusparseSpSVUpdate_t updatePart);

The function solves a system of linear equations whose coefficients are represented in a sparse triangular matrix:

\[op\left( \mathbf{A} \right) \cdot \mathbf{Y} = \alpha\mathbf{X}\]

where

  • op(A) is a sparse square matrix of size \(m \times m\)

  • X is a dense vector of size \(m\)

  • Y is a dense vector of size \(m\)

  • \(\alpha\) is a scalar

Also, for matrix A

image11

The function cusparseSpSV_bufferSize() returns the size of the workspace needed by cusparseSpSV_analysis() and cusparseSpSV_solve(). The function cusparseSpSV_analysis() performs the analysis phase, while cusparseSpSV_solve() executes the solve phase for a sparse triangular linear system. The opaque data structure spsvDescr is used to share information among all functions. The function cusparseSpSV_updateMatrix() updates spsvDescr with new matrix values.

The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation.

NOTE: All parameters must be consistent across cusparseSpSV API calls, and the matrix descriptions and externalBuffer must not be modified between cusparseSpSV_analysis() and cusparseSpSV_solve(). The function cusparseSpSV_updateMatrix() can be used to update the values of the sparse matrix stored inside the opaque data structure spsvDescr.
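The statement that only the upper or lower triangular part is taken into account can be made concrete with a host-side reference solver. The sketch below solves A · Y = alpha · X by forward or backward substitution for a CSR matrix with zero-based indices and opA equal to non-transpose. The name spsv_reference and its boolean parameters are hypothetical stand-ins for the fill mode and diagonal type attributes; it is meant for validating results on small inputs, not as a description of the GPU algorithm.

```cpp
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper, not part of cuSPARSE) for
// A * y = alpha * x, where only one triangle of the CSR matrix A is used.
// lower    : true  -> use the lower triangle (forward substitution)
//            false -> use the upper triangle (backward substitution)
// unitDiag : true  -> the diagonal is assumed to be 1 and stored values are ignored
std::vector<double> spsv_reference(int m,
                                   const std::vector<int>& rowOffsets,
                                   const std::vector<int>& colInd,
                                   const std::vector<double>& values,
                                   bool lower, bool unitDiag,
                                   double alpha,
                                   const std::vector<double>& x) {
    assert(static_cast<int>(rowOffsets.size()) == m + 1);
    assert(static_cast<int>(x.size()) == m);
    std::vector<double> y(m, 0.0);
    for (int step = 0; step < m; ++step) {
        const int i = lower ? step : m - 1 - step;
        double sum = alpha * x[i];
        double diag = 1.0;  // kept as-is for a unit diagonal
        // Column indices within a row may appear in any order.
        for (int p = rowOffsets[i]; p < rowOffsets[i + 1]; ++p) {
            const int j = colInd[p];
            if (j == i) {
                if (!unitDiag) diag = values[p];
            } else if (lower ? (j < i) : (j > i)) {
                sum -= values[p] * y[j];  // y[j] was computed in an earlier step
            }
            // Entries in the opposite triangle are skipped.
        }
        y[i] = sum / diag;
    }
    return y;
}
```

For example, with the 3 x 3 matrix [[2, 5, 0], [1, 4, 0], [0, 3, 2]], the lower fill mode, a non-unit diagonal, alpha = 2 and X = (1, 3, 7), the entry 5 above the diagonal is ignored and the sketch returns Y = (1, 1.25, 5.125).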

| Param. | Memory | In/out | Meaning |
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| vecX | HOST | IN | Dense vector X |
| vecY | HOST | IN/OUT | Dense vector Y |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpSV_analysis() and cusparseSpSV_solve() |
| externalBuffer | DEVICE | IN/OUT | Pointer to a workspace buffer of at least bufferSize bytes. It is used by cusparseSpSV_analysis() and cusparseSpSV_solve() |
| spsvDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |

The sparse matrix formats currently supported are listed below:

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_COO

  • CUSPARSE_FORMAT_SLICED_ELL

cusparseSpSV() supports the following shapes and properties:

  • CUSPARSE_FILL_MODE_LOWER and CUSPARSE_FILL_MODE_UPPER fill modes

  • CUSPARSE_DIAG_TYPE_NON_UNIT and CUSPARSE_DIAG_TYPE_UNIT diagonal types

The fill mode and diagonal type can be set by cusparseSpMatSetAttribute().

cusparseSpSV() supports the following index types for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpSV() supports the following data types:

Uniform-precision computation:

A/X/Y/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

cusparseSpSV() supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPSV_ALG_DEFAULT

Default algorithm

cusparseSpSV() has the following properties:

  • The routine requires extra storage for the analysis phase, proportional to the number of non-zero entries of the sparse matrix

  • Provides deterministic (bit-wise) results for each run for the solving phase cusparseSpSV_solve()

  • The routine supports in-place operation

  • The cusparseSpSV_solve() routine supports asynchronous execution

  • The cusparseSpSV_bufferSize() and cusparseSpSV_analysis() routines accept NULL for vecX and vecY

  • The routine allows the indices of matA to be unsorted

cusparseSpSV() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

cusparseSpSV_updateMatrix() updates the sparse matrix after calling the analysis phase. This function supports the following update strategies (updatePart):

Strategy

Notes

CUSPARSE_SPSV_UPDATE_GENERAL

Updates the sparse matrix values with the values of the newValues array

CUSPARSE_SPSV_UPDATE_DIAGONAL

Updates the diagonal part of the matrix with the diagonal values stored in the newValues array. That is, newValues holds the new diagonal values only

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpSV CSR and cuSPARSE Library Samples - cusparseSpSV COO for code examples.


6.6.8.cusparseSpMM()

cusparseStatus_t cusparseSpMM_bufferSize(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    const void*               beta,
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpMMAlg_t         alg,
    size_t*                   bufferSize);

cusparseStatus_t cusparseSpMM_preprocess(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    const void*               beta,
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpMMAlg_t         alg,
    void*                     externalBuffer);

cusparseStatus_t cusparseSpMM(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    const void*               beta,
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpMMAlg_t         alg,
    void*                     externalBuffer);

The function performs the multiplication of a sparse matrix matA and a dense matrix matB.

\[\mathbf{C} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\]

where

  • op(A) is a sparse matrix of size \(m \times k\)

  • op(B) is a dense matrix of size \(k \times n\)

  • C is a dense matrix of size \(m \times n\)

  • \(\alpha\) and \(\beta\) are scalars
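As a way to check results on small inputs, the operation defined above can be written as a short host-side loop. The sketch below assumes opA and opB are both non-transpose, A is stored in CSR with zero-based indices, and B and C are column-major with leading dimensions ldb and ldc. The name spmm_reference is hypothetical; this is a CPU reference, not the cuSPARSE implementation.

```cpp
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper, not part of cuSPARSE) for
// C = alpha * A * B + beta * C.
// A: m x k, CSR, zero-based. B: k x n, column-major, ldb >= k.
// C: m x n, column-major, ldc >= m.
void spmm_reference(int m, int n, int k,
                    const std::vector<int>& rowOffsets,
                    const std::vector<int>& colInd,
                    const std::vector<float>& values,
                    const std::vector<float>& B, int ldb,
                    std::vector<float>& C, int ldc,
                    float alpha, float beta) {
    assert(static_cast<int>(rowOffsets.size()) == m + 1);
    assert(ldb >= k && ldc >= m);
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            // Only the stored entries of row i of A contribute.
            for (int p = rowOffsets[i]; p < rowOffsets[i + 1]; ++p) {
                acc += values[p] * B[colInd[p] + j * ldb];
            }
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
    }
}
```

Comparing a GPU result against such a loop should use a tolerance rather than exact equality when a non-deterministic algorithm or a reduced-precision data type is selected.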

The routine can also be used to perform the multiplication of a dense matrix and a sparse matrix by switching the layout of the dense matrices:

\[\begin{split}\begin{array}{l} \left. \mathbf{C}_{C} = \mathbf{B}_{C} \cdot \mathbf{A} + \beta\mathbf{C}_{C}\rightarrow \right. \\ {\mathbf{C}_{R} = \mathbf{A}^{T} \cdot \mathbf{B}_{R} + \beta\mathbf{C}_{R}} \\ \end{array}\end{split}\]

where \(\mathbf{B}_{C}\), \(\mathbf{C}_{C}\) indicate column-major layout, while \(\mathbf{B}_{R}\), \(\mathbf{C}_{R}\) refer to row-major layout.
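The layout switch works because a buffer holding an m x n matrix in column-major order is, byte for byte, the row-major storage of its n x m transpose. Transposing both sides of C = B · A gives C^T = A^T · B^T, so reading the same B and C buffers as row-major turns a dense-times-sparse product into a sparse-times-dense one. The small host-side sketch below checks that storage identity; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element (i, j) of a column-major matrix with leading dimension ld.
inline float at_col_major(const std::vector<float>& buf, int ld, int i, int j) {
    return buf[static_cast<std::size_t>(i) + static_cast<std::size_t>(j) * ld];
}

// Element (i, j) of a row-major matrix with leading dimension ld.
inline float at_row_major(const std::vector<float>& buf, int ld, int i, int j) {
    return buf[static_cast<std::size_t>(i) * ld + static_cast<std::size_t>(j)];
}

// Returns true when the m x n column-major view of buf equals the transpose
// of the n x m row-major view of the same buffer (both with ld = m).
bool col_major_is_transposed_row_major(const std::vector<float>& buf, int m, int n) {
    assert(buf.size() >= static_cast<std::size_t>(m) * n);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            if (at_col_major(buf, m, i, j) != at_row_major(buf, m, j, i)) {
                return false;
            }
        }
    }
    return true;
}
```

No data is moved by the switch; only the layout recorded in the dense matrix descriptors and the operation applied to A change.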

Also, for matrix A and B

image11

image12

When using the (conjugate) transpose of the sparse matrix A, this routine may produce slightly different results during different runs with the same input parameters.

The function cusparseSpMM_bufferSize() returns the size of the workspace needed by cusparseSpMM().

Calling cusparseSpMM_preprocess() is optional. It may accelerate subsequent calls to cusparseSpMM(). It is useful when cusparseSpMM() is called multiple times with the same sparsity pattern (matA). It provides performance advantages with CUSPARSE_SPMM_CSR_ALG1 or CUSPARSE_SPMM_CSR_ALG3. For all other formats and algorithms, it has no effect.

Calling cusparseSpMM_preprocess() with a buffer makes that buffer “active” for matA SpMM calls. Subsequent calls to cusparseSpMM() with matA and the active buffer must use the same values for all parameters as the call to cusparseSpMM_preprocess(). The exceptions are alpha, beta, matB, matC, and the values (but not the indices) of matA, which may be different. Importantly, the buffer contents must be unmodified since the call to cusparseSpMM_preprocess(). When cusparseSpMM() is called with matA and its active buffer, it may read acceleration data from the buffer.

Calling cusparseSpMM_preprocess() again with matA and a new buffer makes the new buffer active and the previously active buffer inactive. For cusparseSpMM(), there can only be one active buffer per sparse matrix at a time. To get the effect of multiple active buffers for a single sparse matrix, create multiple matrix handles that all point to the same index and value buffers, and call cusparseSpMM_preprocess() once per handle with different workspace buffers.

Calling cusparseSpMM() with an inactive buffer is always permitted. However, there may be no acceleration from the preprocessing in that case.

For the purposes of thread safety, cusparseSpMM_preprocess() writes to the internal state of matA.

| Param. | Memory | In/out | Meaning |
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication, of type computeType |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMM |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

cusparseSpMM supports the following sparse matrix formats:

  • CUSPARSE_FORMAT_COO

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_CSC

  • CUSPARSE_FORMAT_BSR

  • CUSPARSE_FORMAT_BLOCKED_ELL

(1) COO/CSR/CSC/BSR FORMATS

cusparseSpMM supports the following index types for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpMM supports the following data types:

Uniform-precision computation:

A/B/C/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

Mixed-precision computation:

| A/B | C | computeType | Notes |
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F | |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | |
| CUDA_C_16F | CUDA_C_16F | CUDA_C_32F | [DEPRECATED] |
| CUDA_C_16BF | CUDA_C_16BF | CUDA_C_32F | [DEPRECATED] |

NOTE: CUDA_R_16F, CUDA_R_16BF, CUDA_C_16F, and CUDA_C_16BF data types always imply mixed-precision computation.

cusparseSpMM supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPMM_ALG_DEFAULT

Default algorithm for any sparse matrix format

CUSPARSE_SPMM_COO_ALG1

Algorithm 1 for COO sparse matrix format

  • May provide better performance for small number of nnz

  • Provides the best performance with column-major layout

  • It supports batched computation

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_COO_ALG2

Algorithm 2 for COO sparse matrix format

  • It provides deterministic result

  • Provides the best performance with column-major layout

  • In general, slower than Algorithm 1

  • It supports batched computation

  • It requires additional memory

  • If opA != CUSPARSE_OPERATION_NON_TRANSPOSE, it is identical to CUSPARSE_SPMM_COO_ALG1

CUSPARSE_SPMM_COO_ALG3

Algorithm 3 for COO sparse matrix format

  • May provide better performance for large number of nnz

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_COO_ALG4

Algorithm 4 for COO sparse matrix format

  • Provides better performance with row-major layout

  • It supports batched computation

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_CSR_ALG1

Algorithm 1 for CSR/CSC sparse matrix format

  • Provides the best performance with column-major layout

  • It supports batched computation

  • It requires additional memory

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_CSR_ALG2

Algorithm 2 for CSR/CSC sparse matrix format

  • Provides the best performance with row-major layout

  • It supports batched computation

  • It requires additional memory

  • May produce slightly different results during different runs with the same input parameters

CUSPARSE_SPMM_CSR_ALG3

Algorithm 3 for CSR/CSC sparse matrix format

  • It provides deterministic result

  • It requires additional memory

  • It supports only opA == CUSPARSE_OPERATION_NON_TRANSPOSE

  • It does not support opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE

  • It does not support CUDA_C_16F and CUDA_C_16BF data types

CUSPARSE_SPMM_BSR_ALG1

Algorithm 1 for BSR sparse matrix format

  • It provides deterministic result

  • It requires no additional memory

  • It supports only opA == CUSPARSE_OPERATION_NON_TRANSPOSE

  • It does not support CUDA_C_16F and CUDA_C_16BF data types

  • It does not support column-major blocks in A

Performance notes:

  • Row-major layout provides higher performance than column-major

  • CUSPARSE_SPMM_COO_ALG4 and CUSPARSE_SPMM_CSR_ALG2 should be used with row-major layout, while CUSPARSE_SPMM_COO_ALG1, CUSPARSE_SPMM_COO_ALG2, CUSPARSE_SPMM_COO_ALG3, and CUSPARSE_SPMM_CSR_ALG1 should be used with column-major layout

  • For beta != 1, most algorithms scale the output matrix before the main computation

  • For n == 1, the routine may use cusparseSpMV()

All cusparseSpMM() algorithms except CUSPARSE_SPMM_CSR_ALG3 support the following batch modes:

  • \(C_{i} = A \cdot B_{i}\)

  • \(C_{i} = A_{i} \cdot B\)

  • \(C_{i} = A_{i} \cdot B_{i}\)

The number of batches and their strides can be set by using cusparseCooSetStridedBatch, cusparseCsrSetStridedBatch, and cusparseDnMatSetStridedBatch. The maximum number of batches for cusparseSpMM() is 65,535.

cusparseSpMM() has the following properties:

  • The routine requires no extra storage for CUSPARSE_SPMM_COO_ALG1, CUSPARSE_SPMM_COO_ALG3, CUSPARSE_SPMM_COO_ALG4, and CUSPARSE_SPMM_BSR_ALG1

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run only for the CUSPARSE_SPMM_COO_ALG2, CUSPARSE_SPMM_CSR_ALG3, and CUSPARSE_SPMM_BSR_ALG1 algorithms

  • compute-sanitizer could report false race conditions for this routine. This is for optimization purposes and does not affect the correctness of the computation

  • The routine allows the indices of matA to be unsorted

cusparseSpMM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

Please visit cuSPARSE Library Samples - cusparseSpMM CSR and cusparseSpMM COO for a code example. For batched computation, please visit cusparseSpMM CSR Batched and cusparseSpMM COO Batched.

(2) BLOCKED-ELLPACK FORMAT

cusparseSpMM supports the following data types for the CUSPARSE_FORMAT_BLOCKED_ELL format and the following GPU architectures for exploiting NVIDIA Tensor Cores:

| A/B | C | computeType | opB | Compute Capability |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_16F | N, T | 70 |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F | N, T | 70 |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F | N, T | 70 |
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I | N column-major; T row-major | 75 |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F | N, T | 80 |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F | N, T | 80 |
| CUDA_R_32F | CUDA_R_32F | CUDA_R_32F | N, T | 80 |
| CUDA_R_64F | CUDA_R_64F | CUDA_R_64F | N, T | 80 |

cusparseSpMM supports the following algorithms with the CUSPARSE_FORMAT_BLOCKED_ELL format:

Algorithm

Notes

CUSPARSE_SPMM_ALG_DEFAULT

Default algorithm for any sparse matrix format

CUSPARSE_SPMM_BLOCKED_ELL_ALG1

Default algorithm for Blocked-ELL format

Performance notes:

  • Blocked-ELL SpMM provides the best performance with Power-of-2 Block-Sizes.

  • Large Block-Sizes (e.g. ≥ 64) provide the best performance.

The function has the following limitations:

  • The pointer mode must be equal to CUSPARSE_POINTER_MODE_HOST.

  • Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is supported.

  • opB == CUSPARSE_OPERATION_CONJUGATE_TRANSPOSE is not supported.

  • Only CUSPARSE_INDEX_32I is supported.

Please visit cuSPARSE Library Samples - cusparseSpMM Blocked-ELL for a code example.

See cusparseStatus_t for the description of the return status.


6.6.9.cusparseSpMMOp()

cusparseStatus_t CUSPARSEAPI cusparseSpMMOp_createPlan(
    cusparseHandle_t          handle,
    cusparseSpMMOpPlan_t*     plan,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpMMOpAlg_t       alg,
    const void*               addOperationNvvmBuffer,
    size_t                    addOperationBufferSize,
    const void*               mulOperationNvvmBuffer,
    size_t                    mulOperationBufferSize,
    const void*               epilogueNvvmBuffer,
    size_t                    epilogueBufferSize,
    size_t*                   SpMMWorkspaceSize);

cusparseStatus_t cusparseSpMMOp_destroyPlan(cusparseSpMMOpPlan_t plan);

cusparseStatus_t cusparseSpMMOp(
    cusparseSpMMOpPlan_t plan,
    void*                externalBuffer);

NOTE 1: NVRTC and nvJitLink are not currently available on Arm64 Android platforms.

NOTE 2: The routine does not support Android and Tegra platforms except Judy (sm87).

Experimental: The function performs the multiplication of a sparse matrix matA and a dense matrix matB with custom operators.

\[{C^{\prime}}_{ij} = \text{epilogue}\left( {\sum_{k}^{\oplus}{op\left( A_{ik} \right) \otimes op\left( B_{kj} \right),C_{ij}}} \right)\]

where

  • op(A) is a sparse matrix of size \(m \times k\)

  • op(B) is a dense matrix of size \(k \times n\)

  • C is a dense matrix of size \(m \times n\)

  • \(\oplus\), \(\otimes\), and \(\text{epilogue}\) are custom add, mul, and epilogue operators respectively.

Also, for matrix A and B

image13

image7

Only opA == CUSPARSE_OPERATION_NON_TRANSPOSE is currently supported.

The function cusparseSpMMOp_createPlan() returns the size of the workspace and the compiled kernel needed by cusparseSpMMOp().

| Param. | Memory | In/out | Meaning |
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| addOperationNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom add operator |
| addOperationBufferSize | HOST | IN | Size in bytes of addOperationNvvmBuffer |
| mulOperationNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom mul operator |
| mulOperationBufferSize | HOST | IN | Size in bytes of mulOperationNvvmBuffer |
| epilogueNvvmBuffer | HOST | IN | Pointer to the NVVM buffer containing the custom epilogue operator |
| epilogueBufferSize | HOST | IN | Size in bytes of epilogueNvvmBuffer |
| SpMMWorkspaceSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpMMOp |

The operators must have the following signature and return type:

__device__ <computetype> add_op(<computetype> value1, <computetype> value2);
__device__ <computetype> mul_op(<computetype> value1, <computetype> value2);
__device__ <computetype> epilogue(<computetype> value1, <computetype> value2);

<computetype> is one of float, double, cuComplex, cuDoubleComplex, or int.
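The effect of the three custom operators can be sketched on the host with ordinary C++ callables in place of NVVM buffers. The function spmm_op_reference below is hypothetical and evaluates the formula above for a CSR matrix A and row-major B and C with opA and opB equal to non-transpose. Two choices are assumptions, since the text does not specify them: the reduction is seeded with the first stored product of each row, and rows of A without stored entries leave C unchanged.

```cpp
#include <cassert>
#include <vector>

// Host-side reference (hypothetical helper, not part of cuSPARSE) for
// C'_ij = epilogue( add-reduce over k of mul(A_ik, B_kj), C_ij ).
// A: m x k, CSR, zero-based. B: k x n and C: m x n, row-major with ld = n.
template <typename T, typename AddOp, typename MulOp, typename Epilogue>
void spmm_op_reference(int m, int n,
                       const std::vector<int>& rowOffsets,
                       const std::vector<int>& colInd,
                       const std::vector<T>& values,
                       const std::vector<T>& B,
                       std::vector<T>& C,
                       AddOp add_op, MulOp mul_op, Epilogue epilogue) {
    assert(static_cast<int>(rowOffsets.size()) == m + 1);
    for (int i = 0; i < m; ++i) {
        const int begin = rowOffsets[i];
        const int end = rowOffsets[i + 1];
        if (begin == end) continue;  // assumption: empty rows leave C untouched
        for (int j = 0; j < n; ++j) {
            // Assumption: the reduction starts from the first stored product.
            T acc = mul_op(values[begin], B[colInd[begin] * n + j]);
            for (int p = begin + 1; p < end; ++p) {
                acc = add_op(acc, mul_op(values[p], B[colInd[p] * n + j]));
            }
            C[i * n + j] = epilogue(acc, C[i * n + j]);
        }
    }
}
```

Passing a minimum for add_op and epilogue and an addition for mul_op gives a min-plus product, the kind of non-standard semiring that custom operators make possible.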

cusparseSpMMOp supports the following sparse matrix formats:

  • CUSPARSE_FORMAT_CSR

cusparseSpMMOp supports the following index types for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpMMOp supports the following data types:

Uniform-precision computation:

A/B/C/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

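Mixed-precision computation:

| A/B | C | computeType |
| CUDA_R_8I | CUDA_R_32I | CUDA_R_32I |
| CUDA_R_8I | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |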

cusparseSpMMOp supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPMM_OP_ALG_DEFAULT

Default algorithm for any sparse matrix format

Performance notes:

  • Row-major layout provides higher performance than column-major.

cusparseSpMMOp() has the following properties:

  • The routine requires extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run

  • The routine allows the indices of matA to be unsorted

cusparseSpMMOp() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

Please visit cuSPARSE Library Samples - cusparseSpMMOp for a code example.

See cusparseStatus_t for the description of the return status.


6.6.10.cusparseSpSM()

cusparseStatus_t cusparseSpSM_createDescr(cusparseSpSMDescr_t* spsmDescr);

cusparseStatus_t cusparseSpSM_destroyDescr(cusparseSpSMDescr_t spsmDescr);

cusparseStatus_t cusparseSpSM_bufferSize(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpSMAlg_t         alg,
    cusparseSpSMDescr_t       spsmDescr,
    size_t*                   bufferSize);

cusparseStatus_t cusparseSpSM_analysis(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpSMAlg_t         alg,
    cusparseSpSMDescr_t       spsmDescr,
    void*                     externalBuffer);

cusparseStatus_t cusparseSpSM_solve(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    cusparseDnMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSpSMAlg_t         alg,
    cusparseSpSMDescr_t       spsmDescr);

cusparseStatus_t cusparseSpSM_updateMatrix(
    cusparseHandle_t     handle,
    cusparseSpSMDescr_t  spsmDescr,
    void*                newValues,
    cusparseSpSMUpdate_t updatePart);

The function solves a system of linear equations whose coefficients are represented in a sparse triangular matrix:

\[op\left( \mathbf{A} \right) \cdot \mathbf{C} = \mathbf{\alpha}op\left( \mathbf{B} \right)\]

where

  • op(A) is a sparse square matrix of size \(m \times m\)

  • op(B) is a dense matrix of size \(m \times n\)

  • C is a dense matrix of size \(m \times n\)

  • \(\alpha\) is a scalar

Also, for matrix A

image11

image14

The function cusparseSpSM_bufferSize() returns the size of the workspace needed by cusparseSpSM_analysis() and cusparseSpSM_solve(). The function cusparseSpSM_analysis() performs the analysis phase, while cusparseSpSM_solve() executes the solve phase for a sparse triangular linear system. The opaque data structure spsmDescr is used to share information among all functions. The function cusparseSpSM_updateMatrix() updates spsmDescr with new matrix values.

The routine supports arbitrary sparsity for the input matrix, but only the upper or lower triangular part is taken into account in the computation.

The buffer size returned by cusparseSpSM_bufferSize() for the analysis phase is proportional to the number of non-zero entries of the sparse matrix.

The externalBuffer pointer is stored in spsmDescr and used by cusparseSpSM_solve(). For this reason, the device memory buffer must be deallocated only after cusparseSpSM_solve().

NOTE: All parameters must be consistent across cusparseSpSM API calls, and the matrix descriptions and externalBuffer must not be modified between cusparseSpSM_analysis() and cusparseSpSM_solve().

| Param. | Memory | In/out | Meaning |
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication, of type computeType |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Dense matrix B |
| matC | HOST | IN/OUT | Dense matrix C |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSpSM_analysis() and cusparseSpSM_solve() |
| externalBuffer | DEVICE | IN/OUT | Pointer to a workspace buffer of at least bufferSize bytes. It is used by cusparseSpSM_analysis() and cusparseSpSM_solve() |
| spsmDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |

The sparse matrix formats currently supported are listed below:

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_COO

cusparseSpSM() supports the following shapes and properties:

  • CUSPARSE_FILL_MODE_LOWER and CUSPARSE_FILL_MODE_UPPER fill modes

  • CUSPARSE_DIAG_TYPE_NON_UNIT and CUSPARSE_DIAG_TYPE_UNIT diagonal types

The fill mode and diagonal type can be set by cusparseSpMatSetAttribute().

cusparseSpSM() supports the following index types for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSpSM() supports the following data types:

Uniform-precision computation:

A/B/C/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

cusparseSpSM() supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPSM_ALG_DEFAULT

Default algorithm

cusparseSpSM() has the following properties:

  • The routine requires no extra storage

  • Provides deterministic (bit-wise) results for each run for the solving phase cusparseSpSM_solve()

  • The cusparseSpSM_solve() routine supports asynchronous execution

  • The routine supports in-place operation. The same device pointer must be provided to the values parameter of the dense matrices matB and matC. All other dense matrix descriptor parameters (e.g., order) can be set independently

  • The cusparseSpSM_bufferSize() and cusparseSpSM_analysis() routines accept matB and matC descriptors whose values pointers are NULL. These two routines do not accept NULL descriptors

  • The routine allows the indices of matA to be unsorted

cusparseSpSM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

cusparseSpSM_updateMatrix() updates the sparse matrix after calling the analysis phase. This function supports the following update strategies (updatePart):

Strategy

Notes

CUSPARSE_SPSM_UPDATE_GENERAL

Updates the sparse matrix values with the values of the newValues array

CUSPARSE_SPSM_UPDATE_DIAGONAL

Updates the diagonal part of the matrix with the diagonal values stored in the newValues array. That is, newValues holds the new diagonal values only

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpSM CSR and cuSPARSE Library Samples - cusparseSpSM COO for code examples.


6.6.11.cusparseSDDMM()

cusparseStatus_t cusparseSDDMM_bufferSize(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    const void*               beta,
    cusparseSpMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSDDMMAlg_t        alg,
    size_t*                   bufferSize);

cusparseStatus_t cusparseSDDMM_preprocess(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    const void*               beta,
    cusparseSpMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSDDMMAlg_t        alg,
    void*                     externalBuffer);

cusparseStatus_t cusparseSDDMM(
    cusparseHandle_t          handle,
    cusparseOperation_t       opA,
    cusparseOperation_t       opB,
    const void*               alpha,
    cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
    cusparseConstDnMatDescr_t matB,  // non-const descriptor supported
    const void*               beta,
    cusparseSpMatDescr_t      matC,
    cudaDataType              computeType,
    cusparseSDDMMAlg_t        alg,
    void*                     externalBuffer);

This function performs the multiplication of matA and matB, followed by an element-wise multiplication with the sparsity pattern of matC. Formally, it performs the following operation:

\[\mathbf{C} = \alpha({op}(\mathbf{A}) \cdot {op}(\mathbf{B})) \circ {spy}(\mathbf{C}) + \beta\mathbf{C}\]

where

  • op(A) is a dense matrix of size \(m \times k\)

  • op(B) is a dense matrix of size \(k \times n\)

  • C is a sparse matrix of size \(m \times n\)

  • \(\alpha\) and \(\beta\) are scalars

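  • \(\circ\) denotes the Hadamard (entry-wise) matrix product, and \({spy}\left( \mathbf{C} \right)\) is the structural sparsity pattern matrix of C, defined as:

\[{spy}(\mathbf{C})_{ij} = \begin{cases} 1 & \text{if } C_{ij} \text{ is a stored entry of } \mathbf{C} \\ 0 & \text{otherwise} \end{cases}\]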

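Also, for matrices A and B:

\[{op}(\mathbf{A}) = \begin{cases} \mathbf{A} & \text{if opA is CUSPARSE\_OPERATION\_NON\_TRANSPOSE} \\ \mathbf{A}^{T} & \text{if opA is CUSPARSE\_OPERATION\_TRANSPOSE} \end{cases}\]

\[{op}(\mathbf{B}) = \begin{cases} \mathbf{B} & \text{if opB is CUSPARSE\_OPERATION\_NON\_TRANSPOSE} \\ \mathbf{B}^{T} & \text{if opB is CUSPARSE\_OPERATION\_TRANSPOSE} \end{cases}\]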
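To make the SDDMM formula concrete, the following is a minimal CPU reference sketch in plain C. It is not the cuSPARSE implementation; the function name `sddmm_csr_ref` is hypothetical, and the sketch assumes non-transposed operations, row-major dense A and B, single precision, and a CSR matrix C with zero-based 32-bit indices.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for C = alpha * (A * B) o spy(C) + beta * C.
 * Assumptions: op(A) = A and op(B) = B, A is m x k and B is k x n in
 * row-major order, C is m x n in CSR with zero-based indices. Only the
 * stored entries of C are read and written, which is the effect of the
 * spy(C) mask. */
static void sddmm_csr_ref(int m, int k, int n,
                          float alpha, const float *A, const float *B,
                          float beta,
                          const int *csrRowOffsets, const int *csrColInd,
                          float *csrValues)
{
    assert(m >= 0 && k >= 0 && n >= 0);
    for (int i = 0; i < m; ++i) {
        for (int p = csrRowOffsets[i]; p < csrRowOffsets[i + 1]; ++p) {
            int j = csrColInd[p];
            assert(j >= 0 && j < n);
            float dot = 0.0f;
            for (int l = 0; l < k; ++l)
                dot += A[(size_t)i * k + l] * B[(size_t)l * n + j];
            csrValues[p] = alpha * dot + beta * csrValues[p];
        }
    }
}
```

Because the loop visits only the stored positions of C, the cost scales with nnz(C) times k rather than with the full m x n product, which is the point of sampling the dense product with the sparsity pattern.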

The function cusparseSDDMM_bufferSize() returns the size of the workspace needed by cusparseSDDMM() or cusparseSDDMM_preprocess().

Calling cusparseSDDMM_preprocess() is optional. It may accelerate subsequent calls to cusparseSDDMM(). It is useful when cusparseSDDMM() is called multiple times with the same sparsity pattern (matC).

Calling cusparseSDDMM_preprocess() with a buffer makes that buffer “active” for matC SDDMM calls. Subsequent calls to cusparseSDDMM() with matC and the active buffer must use the same values for all parameters as the call to cusparseSDDMM_preprocess(). The exceptions are: alpha, beta, matA, matB, and the values (but not indices) of matC may be different. Importantly, the buffer contents must be unmodified since the call to cusparseSDDMM_preprocess(). When cusparseSDDMM() is called with matC and its active buffer, it may read acceleration data from the buffer.

Calling cusparseSDDMM_preprocess() again with matC and a new buffer makes the new buffer active, forgetting about the previously active buffer and making it inactive. For cusparseSDDMM(), there can only be one active buffer per sparse matrix at a time. To get the effect of multiple active buffers for a single sparse matrix, create multiple matrix handles that all point to the same index and value buffers, and call cusparseSDDMM_preprocess() once per handle with different workspace buffers.

Calling cusparseSDDMM() with an inactive buffer is always permitted. However, there may be no acceleration from the preprocessing in that case.

For the purposes of thread safety, cusparseSDDMM_preprocess() writes to the internal state of matC.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication of type computeType |
| matA | HOST | IN | Dense matrix matA |
| matB | HOST | IN | Dense matrix matB |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication of type computeType |
| matC | HOST | IN/OUT | Sparse matrix matC |
| computeType | HOST | IN | Datatype in which the computation is executed |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSDDMM |
| externalBuffer | DEVICE | IN | Pointer to a workspace buffer of at least bufferSize bytes |

Currently supported sparse matrix formats:

  • CUSPARSE_FORMAT_CSR

  • CUSPARSE_FORMAT_BSR

cusparseSDDMM() supports the following index types for representing the sparse matrix matC:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

The data type combinations currently supported for cusparseSDDMM are listed below:

Uniform-precision computation:

A/B/C/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_32F

CUDA_C_64F

Mixed-precision computation:

| A/B | C | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |

cusparseSDDMM for CUSPARSE_FORMAT_BSR also supports the following mixed-precision computation:

| A/B | C | computeType |
|---|---|---|
| CUDA_R_16BF | CUDA_R_32F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |

NOTE: The CUDA_R_16F and CUDA_R_16BF data types always imply mixed-precision computation.

cusparseSDDMM() for CUSPARSE_FORMAT_BSR supports block sizes of 2, 4, 8, 16, 32, 64 and 128.

cusparseSDDMM() supports the following algorithms:

| Algorithm | Notes |
|---|---|
| CUSPARSE_SDDMM_ALG_DEFAULT | Default algorithm. It supports batched computation. |

Performance notes: cusparseSDDMM() for CUSPARSE_FORMAT_CSR provides the best performance when matA and matB satisfy:

  • matA:

    • matA is in row-major order and opA is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matA is in col-major order and opA is not CUSPARSE_OPERATION_NON_TRANSPOSE

  • matB:

    • matB is in col-major order and opB is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matB is in row-major order and opB is not CUSPARSE_OPERATION_NON_TRANSPOSE

cusparseSDDMM() for CUSPARSE_FORMAT_BSR provides the best performance when matA and matB satisfy:

  • matA:

    • matA is in row-major order and opA is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matA is in col-major order and opA is not CUSPARSE_OPERATION_NON_TRANSPOSE

  • matB:

    • matB is in row-major order and opB is CUSPARSE_OPERATION_NON_TRANSPOSE, or

    • matB is in col-major order and opB is not CUSPARSE_OPERATION_NON_TRANSPOSE

cusparseSDDMM() supports the following batch modes:

  • \(C_{i} = (A \cdot B) \circ C_{i}\)

  • \(C_{i} = \left( A_{i} \cdot B \right) \circ C_{i}\)

  • \(C_{i} = \left( A \cdot B_{i} \right) \circ C_{i}\)

  • \(C_{i} = \left( A_{i} \cdot B_{i} \right) \circ C_{i}\)

The number of batches and their strides can be set by using cusparseCsrSetStridedBatch and cusparseDnMatSetStridedBatch. The maximum number of batches for cusparseSDDMM() is 65,535.

cusparseSDDMM() has the following properties:

  • The routine requires no extra storage

  • Provides deterministic (bit-wise) results for each run

  • The routine supports asynchronous execution

  • The routine allows the indices of matC to be unsorted

cusparseSDDMM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSDDMM for a code example. For batched computation, please visit cusparseSDDMM CSR Batched.


6.6.12.cusparseSpGEMM()

cusparseStatus_t
cusparseSpGEMM_createDescr(cusparseSpGEMMDescr_t* descr)

cusparseStatus_t
cusparseSpGEMM_destroyDescr(cusparseSpGEMMDescr_t descr)

cusparseStatus_t
cusparseSpGEMM_workEstimation(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
    const void* beta,
    cusparseSpMatDescr_t matC,
    cudaDataType computeType,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr,
    size_t* bufferSize1,
    void* externalBuffer1)

cusparseStatus_t
cusparseSpGEMM_getNumProducts(cusparseSpGEMMDescr_t spgemmDescr,
    int64_t* num_prods)

cusparseStatus_t
cusparseSpGEMM_estimateMemory(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
    const void* beta,
    cusparseSpMatDescr_t matC,
    cudaDataType computeType,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr,
    float chunk_fraction,
    size_t* bufferSize3,
    void* externalBuffer3,
    size_t* bufferSize2)

cusparseStatus_t
cusparseSpGEMM_compute(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
    const void* beta,
    cusparseSpMatDescr_t matC,
    cudaDataType computeType,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr,
    size_t* bufferSize2,
    void* externalBuffer2)

cusparseStatus_t
cusparseSpGEMM_copy(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    const void* alpha,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseConstSpMatDescr_t matB,  // non-const descriptor supported
    const void* beta,
    cusparseSpMatDescr_t matC,
    cudaDataType computeType,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr)

This function performs the multiplication of two sparse matrices matA and matB.

\[\mathbf{C^{\prime}} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\]

where \(\alpha\), \(\beta\) are scalars, and \(\mathbf{C}\), \(\mathbf{C^{\prime}}\) have the same sparsity pattern.

The functions cusparseSpGEMM_workEstimation(), cusparseSpGEMM_estimateMemory(), and cusparseSpGEMM_compute() are used both for determining the buffer size and for performing the actual computation.

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| opA | HOST | IN | Operation op(A) |
| opB | HOST | IN | Operation op(B) |
| alpha | HOST or DEVICE | IN | \(\alpha\) scalar used for multiplication |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | IN | Sparse matrix B |
| beta | HOST or DEVICE | IN | \(\beta\) scalar used for multiplication |
| matC | HOST | IN/OUT | Sparse matrix C |
| computeType | HOST | IN | Enumerator specifying the datatype in which the computation is executed |
| alg | HOST | IN | Enumerator specifying the algorithm for the computation |
| spgemmDescr | HOST | IN/OUT | Opaque descriptor for storing internal data used across the three steps |
| num_prods | HOST | OUT | Pointer to a 64-bit integer that stores the number of intermediate products calculated by cusparseSpGEMM_workEstimation |
| chunk_fraction | HOST | IN | The fraction of total intermediate products being computed in a chunk. Used by CUSPARSE_SPGEMM_ALG3 only. Value is in range (0,1]. |
| bufferSize1 | HOST | IN/OUT | Number of bytes of workspace requested by cusparseSpGEMM_workEstimation |
| bufferSize2 | HOST | IN/OUT | Number of bytes of workspace requested by cusparseSpGEMM_compute |
| bufferSize3 | HOST | IN/OUT | Number of bytes of workspace requested by cusparseSpGEMM_estimateMemory |
| externalBuffer1 | DEVICE | IN | Pointer to workspace buffer needed by cusparseSpGEMM_workEstimation and cusparseSpGEMM_compute |
| externalBuffer2 | DEVICE | IN | Pointer to workspace buffer needed by cusparseSpGEMM_compute and cusparseSpGEMM_copy |
| externalBuffer3 | DEVICE | IN | Pointer to workspace buffer needed by cusparseSpGEMM_estimateMemory |

cusparseSpGEMM supports the following index types for representing the sparse matrices A, B and C (all matrices must have the same index type):

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

Currently, the function has the following limitations:

  • Only the CSR format (CUSPARSE_FORMAT_CSR) is supported

  • Only opA, opB equal to CUSPARSE_OPERATION_NON_TRANSPOSE are supported

The data type combinations currently supported for cusparseSpGEMM are listed below:

Uniform-precision computation:

A/B/C/computeType

CUDA_R_16F [DEPRECATED]

CUDA_R_16BF [DEPRECATED]

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

cusparseSpGEMM supports the following algorithms:

Algorithm

Notes

CUSPARSE_SPGEMM_DEFAULT

Default algorithm. Currently, it is CUSPARSE_SPGEMM_ALG1.

CUSPARSE_SPGEMM_ALG1

Algorithm 1

  • Invokes cusparseSpGEMM_compute twice. The first invocation provides an upper bound of the memory required for the computation.

  • The required memory is generally several times larger than the memory actually used.

  • The user can provide an arbitrary buffer size bufferSize2 in the second invocation. If it is not sufficient, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.

  • Provides better performance than other algorithms.

  • Provides deterministic (bit-wise) results for each run.

CUSPARSE_SPGEMM_ALG2

Algorithm 2

  • Invokes cusparseSpGEMM_estimateMemory to get the amount of memory required for the computation.

  • Requires less memory for the computation than Algorithm 1.

  • Performance is lower than Algorithm 1, higher than Algorithm 3.

  • Provides deterministic (bit-wise) results for each run.

CUSPARSE_SPGEMM_ALG3

Algorithm 3

  • Computes the intermediate products in chunks, one chunk at a time.

  • Invokes cusparseSpGEMM_estimateMemory to get the amount of memory required for the computation.

  • The user can control the amount of required memory by changing the chunk size via chunk_fraction.

  • The chunk size is a fraction of the total intermediate products: chunk_fraction * (*num_prods).

  • Provides deterministic (bit-wise) results for each run.
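The notion of intermediate products, and how chunk_fraction relates to it, can be sketched on the CPU in plain C. The helper names below are hypothetical and are not cuSPARSE functions; the sketch assumes CSR inputs with zero-based indices, and the rounding of the chunk size (truncation, with a minimum of one product when any exist) is an assumption of this example.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the intermediate products of A * B for CSR inputs with zero-based
 * indices: every stored A(i,l) is multiplied with every stored entry in
 * row l of B. Hypothetical helper, not a cuSPARSE function. */
static int64_t count_intermediate_products(int rowsA,
                                           const int *aRowOffsets,
                                           const int *aColInd,
                                           const int *bRowOffsets)
{
    int64_t numProds = 0;
    for (int p = 0; p < aRowOffsets[rowsA]; ++p) {
        int l = aColInd[p];
        numProds += bRowOffsets[l + 1] - bRowOffsets[l];
    }
    return numProds;
}

/* Intermediate products handled per chunk for a fraction in (0, 1].
 * Assumption: truncate, but process at least one product per chunk. */
static int64_t chunk_size(int64_t numProds, float chunkFraction)
{
    assert(chunkFraction > 0.0f && chunkFraction <= 1.0f);
    int64_t size = (int64_t)((double)chunkFraction * (double)numProds);
    if (size < 1 && numProds > 0)
        size = 1;
    return size;
}
```

A smaller fraction lowers the amount of memory needed at any one time, at the price of more passes over the intermediate products.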

cusparseSpGEMM() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine allows the indices of matA and matB to be unsorted

  • The routine guarantees the indices of matC to be sorted

cusparseSpGEMM() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpGEMM for a code example for CUSPARSE_SPGEMM_DEFAULT and CUSPARSE_SPGEMM_ALG1, and cuSPARSE Library Samples - memory-optimized cusparseSpGEMM for a code example for CUSPARSE_SPGEMM_ALG2 and CUSPARSE_SPGEMM_ALG3.


6.6.13.cusparseSpGEMMreuse()

cusparseStatus_t
cusparseSpGEMM_createDescr(cusparseSpGEMMDescr_t* descr)

cusparseStatus_t
cusparseSpGEMM_destroyDescr(cusparseSpGEMMDescr_t descr)

cusparseStatus_t
cusparseSpGEMMreuse_workEstimation(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    cusparseSpMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,  // non-const descriptor supported
    cusparseSpMatDescr_t matC,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr,
    size_t* bufferSize1,
    void* externalBuffer1)

cusparseStatus_t
cusparseSpGEMMreuse_nnz(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    cusparseSpMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,  // non-const descriptor supported
    cusparseSpMatDescr_t matC,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr,
    size_t* bufferSize2,
    void* externalBuffer2,
    size_t* bufferSize3,
    void* externalBuffer3,
    size_t* bufferSize4,
    void* externalBuffer4)

cusparseStatus_t CUSPARSEAPI
cusparseSpGEMMreuse_copy(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    cusparseSpMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,  // non-const descriptor supported
    cusparseSpMatDescr_t matC,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr,
    size_t* bufferSize5,
    void* externalBuffer5)

cusparseStatus_t CUSPARSEAPI
cusparseSpGEMMreuse_compute(cusparseHandle_t handle,
    cusparseOperation_t opA,
    cusparseOperation_t opB,
    const void* alpha,
    cusparseSpMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,  // non-const descriptor supported
    const void* beta,
    cusparseSpMatDescr_t matC,
    cudaDataType computeType,
    cusparseSpGEMMAlg_t alg,
    cusparseSpGEMMDescr_t spgemmDescr)

This function performs the multiplication of two sparse matrices matA and matB, where the structure of the output matrix matC can be reused for multiple computations with different values.

\[\mathbf{C^{\prime}} = \alpha op\left( \mathbf{A} \right) \cdot op\left( \mathbf{B} \right) + \beta\mathbf{C}\]

where \(\alpha\) and \(\beta\) are scalars.

The functions cusparseSpGEMMreuse_workEstimation(), cusparseSpGEMMreuse_nnz(), and cusparseSpGEMMreuse_copy() are used for determining the buffer size and performing the actual computation.

Note: The output CSR matrix (matC) of cusparseSpGEMMreuse() is sorted by column indices.

MEMORY REQUIREMENT: cusparseSpGEMMreuse must keep all intermediate products in memory in order to reuse the structure of the output matrix. In general, however, the number of intermediate products is orders of magnitude higher than the number of non-zero entries. To minimize the memory requirements, the routine uses multiple buffers that can be deallocated once they are no longer needed. If the number of intermediate products exceeds 2^31-1, the routine returns the CUSPARSE_STATUS_INSUFFICIENT_RESOURCES status.
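The idea of reusing the output structure can be sketched on the CPU as a symbolic phase that runs once and a numeric phase that runs for every new set of values. The following plain C example is a simplified illustration under stated assumptions (CSR inputs, zero-based indices, single precision, a dense per-row accumulator); the function names are hypothetical and the code does not reflect how cuSPARSE stores intermediate products internally.

```c
#include <assert.h>
#include <stdlib.h>

/* Symbolic phase (hypothetical sketch): computes the sparsity pattern of
 * C = A * B. cOff must hold rowsA + 1 entries. When cCol is NULL only the
 * row offsets are filled, so the caller can size cCol from cOff[rowsA].
 * Column indices are emitted in ascending order within each row. */
static void spgemm_symbolic(int rowsA, int colsB,
                            const int *aOff, const int *aCol,
                            const int *bOff, const int *bCol,
                            int *cOff, int *cCol)
{
    assert(colsB > 0);
    char *mark = calloc((size_t)colsB, 1);
    assert(mark != NULL);
    int nnz = 0;
    cOff[0] = 0;
    for (int i = 0; i < rowsA; ++i) {
        for (int p = aOff[i]; p < aOff[i + 1]; ++p)
            for (int q = bOff[aCol[p]]; q < bOff[aCol[p] + 1]; ++q)
                mark[bCol[q]] = 1;
        for (int j = 0; j < colsB; ++j) {
            if (mark[j]) {
                if (cCol != NULL) cCol[nnz] = j;
                ++nnz;
                mark[j] = 0;
            }
        }
        cOff[i + 1] = nnz;
    }
    free(mark);
}

/* Numeric phase: reuses the pattern from spgemm_symbolic and recomputes
 * only the values, C' = alpha * A * B + beta * C. */
static void spgemm_numeric(int rowsA, int colsB, float alpha,
                           const int *aOff, const int *aCol, const float *aVal,
                           const int *bOff, const int *bCol, const float *bVal,
                           float beta,
                           const int *cOff, const int *cCol, float *cVal)
{
    assert(colsB > 0);
    float *acc = calloc((size_t)colsB, sizeof *acc);
    assert(acc != NULL);
    for (int i = 0; i < rowsA; ++i) {
        for (int p = aOff[i]; p < aOff[i + 1]; ++p)
            for (int q = bOff[aCol[p]]; q < bOff[aCol[p] + 1]; ++q)
                acc[bCol[q]] += aVal[p] * bVal[q];
        for (int p = cOff[i]; p < cOff[i + 1]; ++p) {
            cVal[p] = alpha * acc[cCol[p]] + beta * cVal[p];
            acc[cCol[p]] = 0.0f;  /* reset for the next row */
        }
    }
    free(acc);
}
```

A caller would run spgemm_symbolic twice (once to size the column array, once to fill it) and then call spgemm_numeric each time the values of A or B change while their patterns stay the same.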

Currently, the function has the following limitations:

  • Only 32-bit indices (CUSPARSE_INDEX_32I) are supported

  • Only the CSR format (CUSPARSE_FORMAT_CSR) is supported

  • Only opA, opB equal to CUSPARSE_OPERATION_NON_TRANSPOSE are supported

The data type combinations currently supported for cusparseSpGEMMreuse are listed below.

Uniform-precision computation:

A/B/C/computeType

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

Mixed-precision computation: [DEPRECATED]

| A/B | C | computeType |
|---|---|---|
| CUDA_R_16F | CUDA_R_16F | CUDA_R_32F |
| CUDA_R_16BF | CUDA_R_16BF | CUDA_R_32F |

cusparseSpGEMMreuse supports the following algorithms:

| Algorithm | Notes |
|---|---|
| CUSPARSE_SPGEMM_DEFAULT, CUSPARSE_SPGEMM_CSR_ALG_NONDETERMINITIC | Default algorithm. Provides deterministic (bit-wise) structure for the output matrix for each run, while value computation is not deterministic. |
| CUSPARSE_SPGEMM_CSR_ALG_DETERMINITIC | Provides deterministic (bit-wise) structure for the output matrix and value computation for each run. |

cusparseSpGEMMreuse() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • The routine allows the indices of matA and matB to be unsorted

  • The routine guarantees the indices of matC to be sorted

cusparseSpGEMMreuse() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

Refer to cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSpGEMMreuse for a code example.


6.6.14.cusparseSparseToDense()

cusparseStatus_t
cusparseSparseToDense_bufferSize(cusparseHandle_t handle,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseDnMatDescr_t matB,
    cusparseSparseToDenseAlg_t alg,
    size_t* bufferSize)

cusparseStatus_t
cusparseSparseToDense(cusparseHandle_t handle,
    cusparseConstSpMatDescr_t matA,  // non-const descriptor supported
    cusparseDnMatDescr_t matB,
    cusparseSparseToDenseAlg_t alg,
    void* buffer)

The function converts the sparse matrix matA in CSR, CSC, or COO format into its dense representation matB. Blocked-ELL is not currently supported.

The function cusparseSparseToDense_bufferSize() returns the size of the workspace needed by cusparseSparseToDense().
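As a point of reference for what the conversion produces, here is a minimal CPU sketch in plain C. It is an illustration only: the function name `csr_to_dense_colmajor` is hypothetical, and the sketch assumes a CSR input with zero-based indices and a column-major dense output with leading dimension `ld`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference: expands a CSR matrix with zero-based indices
 * into a column-major dense matrix with leading dimension ld >= rows.
 * Entries that are not stored in the CSR matrix become 0. The column
 * indices within a row do not need to be sorted. */
static void csr_to_dense_colmajor(int rows, int cols,
                                  const int *rowOffsets, const int *colInd,
                                  const float *values,
                                  float *dense, int ld)
{
    assert(ld >= rows);
    for (int j = 0; j < cols; ++j)
        for (int i = 0; i < rows; ++i)
            dense[(size_t)j * ld + i] = 0.0f;
    for (int i = 0; i < rows; ++i)
        for (int p = rowOffsets[i]; p < rowOffsets[i + 1]; ++p)
            dense[(size_t)colInd[p] * ld + i] = values[p];
}
```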

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| matA | HOST | IN | Sparse matrix A |
| matB | HOST | OUT | Dense matrix B |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseSparseToDense() |
| buffer | DEVICE | IN | Pointer to workspace buffer |

cusparseSparseToDense() supports the following index types for representing the sparse matrix matA:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseSparseToDense() supports the following data types:

A/B

CUDA_R_8I

CUDA_R_16F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

cusparseSparseToDense() supports the following algorithm:

| Algorithm | Notes |
|---|---|
| CUSPARSE_SPARSETODENSE_ALG_DEFAULT | Default algorithm |

cusparseSparseToDense() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run

  • The routine allows the indices of matA to be unsorted

cusparseSparseToDense() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseSparseToDense for a code example.


6.6.15.cusparseDenseToSparse()

cusparseStatus_t
cusparseDenseToSparse_bufferSize(cusparseHandle_t handle,
    cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,
    cusparseDenseToSparseAlg_t alg,
    size_t* bufferSize)

cusparseStatus_t
cusparseDenseToSparse_analysis(cusparseHandle_t handle,
    cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,
    cusparseDenseToSparseAlg_t alg,
    void* buffer)

cusparseStatus_t
cusparseDenseToSparse_convert(cusparseHandle_t handle,
    cusparseConstDnMatDescr_t matA,  // non-const descriptor supported
    cusparseSpMatDescr_t matB,
    cusparseDenseToSparseAlg_t alg,
    void* buffer)

The function converts the dense matrix matA into a sparse matrix matB in CSR, CSC, COO, or Blocked-ELL format.

The function cusparseDenseToSparse_bufferSize() returns the size of the workspace needed by cusparseDenseToSparse_analysis().

The function cusparseDenseToSparse_analysis() updates the number of non-zero elements in the sparse matrix descriptor matB. The user is responsible for allocating the memory required by the sparse matrix:

  • Row indices (CSC) or column indices (CSR) and value arrays

  • Row, column, and value arrays for COO

  • Column (ellColInd) and value (ellValue) arrays for Blocked-ELL

Finally, cusparseDenseToSparse_convert() fills the arrays allocated in the previous step.
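The analysis/allocate/convert sequence described above can be mirrored by a two-pass CPU sketch in plain C. The function names are hypothetical and the code is not the library implementation; it assumes a row-major dense input with leading dimension `ld` and a CSR output with zero-based indices.

```c
#include <assert.h>
#include <stddef.h>

/* Analysis step (hypothetical CPU sketch): fills the CSR row offsets for a
 * row-major dense matrix and returns the number of non-zero elements. */
static int dense_to_csr_analysis(int rows, int cols, const float *dense,
                                 int ld, int *rowOffsets)
{
    assert(ld >= cols);
    int nnz = 0;
    rowOffsets[0] = 0;
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j)
            if (dense[(size_t)i * ld + j] != 0.0f)
                ++nnz;
        rowOffsets[i + 1] = nnz;
    }
    return nnz;
}

/* Conversion step: fills the column index and value arrays that the caller
 * allocated using the nnz returned by the analysis step. */
static void dense_to_csr_convert(int rows, int cols, const float *dense,
                                 int ld, const int *rowOffsets,
                                 int *colInd, float *values)
{
    assert(ld >= cols);
    for (int i = 0; i < rows; ++i) {
        int p = rowOffsets[i];
        for (int j = 0; j < cols; ++j) {
            float v = dense[(size_t)i * ld + j];
            if (v != 0.0f) {
                colInd[p] = j;
                values[p] = v;
                ++p;
            }
        }
    }
}
```

The split into two passes exists because the size of the index and value arrays is unknown until the non-zero count is available, which is why the allocation happens between the two calls.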

| Param. | Memory | In/out | Meaning |
|---|---|---|---|
| handle | HOST | IN | Handle to the cuSPARSE library context |
| matA | HOST | IN | Dense matrix A |
| matB | HOST | OUT | Sparse matrix B |
| alg | HOST | IN | Algorithm for the computation |
| bufferSize | HOST | OUT | Number of bytes of workspace needed by cusparseDenseToSparse_analysis() |
| buffer | DEVICE | IN | Pointer to workspace buffer |

cusparseDenseToSparse() supports the following index types for representing the sparse matrix matB:

  • 32-bit indices (CUSPARSE_INDEX_32I)

  • 64-bit indices (CUSPARSE_INDEX_64I)

cusparseDenseToSparse() supports the following data types:

A/B

CUDA_R_8I

CUDA_R_16F

CUDA_R_16BF

CUDA_R_32F

CUDA_R_64F

CUDA_C_16F [DEPRECATED]

CUDA_C_16BF [DEPRECATED]

CUDA_C_32F

CUDA_C_64F

cusparseDenseToSparse() supports the following algorithm:

| Algorithm | Notes |
|---|---|
| CUSPARSE_DENSETOSPARSE_ALG_DEFAULT | Default algorithm |

cusparseDenseToSparse() has the following properties:

  • The routine requires no extra storage

  • The routine supports asynchronous execution

  • Provides deterministic (bit-wise) results for each run

  • The routine does not guarantee the indices of matB to be sorted

cusparseDenseToSparse() supports the following optimizations:

  • CUDA graph capture

  • Hardware Memory Compression

See cusparseStatus_t for the description of the return status.

Please visit cuSPARSE Library Samples - cusparseDenseToSparse (CSR) and cuSPARSE Library Samples - cusparseDenseToSparse (Blocked-ELL) for code examples.



7.cuSPARSE Fortran Bindings

The cuSPARSE library is implemented using the C-based CUDA toolchain, and it thus provides a C-style API that makes interfacing to applications written in C or C++ trivial. There are also many applications implemented in Fortran that would benefit from using cuSPARSE, and therefore a cuSPARSE Fortran interface has been developed.

Unfortunately, Fortran-to-C calling conventions are not standardized and differ by platform and toolchain. In particular, differences may exist in the following areas:

  • Symbol names (capitalization, name decoration)

  • Argument passing (by value or reference)

  • Passing of pointer arguments (size of the pointer)

To provide maximum flexibility in addressing those differences, the cuSPARSE Fortran interface is provided in the form of wrapper functions, which are written in C and are located in the file cusparse_fortran.c. This file also contains a few additional wrapper functions (for cudaMalloc(), cudaMemset(), and so on) that can be used to allocate memory on the GPU.
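The three areas of difference listed above can be seen in a small C sketch of a Fortran-callable wrapper. This is not the content of cusparse_fortran.c: the macro `FORTRAN_NAME`, the trailing-underscore decoration, and the functions `example_alloc`, `example_malloc_` and `example_free_` are assumptions made for illustration, and host `malloc()` stands in for `cudaMalloc()` so that the sketch runs without a GPU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Symbol names: decoration is toolchain dependent. Many Fortran compilers
 * append one underscore to lower-case external names; that convention is
 * an assumption here. */
#define FORTRAN_NAME(lower) lower##_

/* Stand-in for a C API that returns a pointer through an out parameter,
 * the way cudaMalloc() does. Returns 0 on success. */
static int example_alloc(void **ptr, size_t bytes)
{
    *ptr = malloc(bytes);
    return *ptr == NULL ? 1 : 0;
}

/* Argument passing: Fortran passes arguments by reference by default, so
 * every parameter is a pointer. Pointer size: the address is handed back
 * as an integer of type size_t, which the Fortran side must declare with a
 * matching width (integer*4 or integer*8). */
int FORTRAN_NAME(example_malloc)(size_t *handle, const int *bytes)
{
    void *ptr = NULL;
    assert(*bytes >= 0);
    int status = example_alloc(&ptr, (size_t)*bytes);
    *handle = (size_t)(uintptr_t)ptr;
    return status;
}

void FORTRAN_NAME(example_free)(const size_t *handle)
{
    free((void *)(uintptr_t)*handle);
}
```

From Fortran, such a wrapper would be declared as an integer function and called as `example_malloc(handle, nbytes)`, with the compiler supplying the decorated symbol name.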

The cuSPARSE Fortran wrapper code is provided as an example only and needs to be compiled into an application for it to call the cuSPARSE API functions. Providing this source code allows users to make any changes necessary for a particular platform and toolchain.

The cuSPARSE Fortran wrapper code has been used to demonstrate interoperability with the compilers g95 0.91 (on 32-bit and 64-bit Linux) and g95 0.92 (on 32-bit and 64-bit Mac OS X). In order to use other compilers, users have to make any changes to the wrapper code that may be required.

The direct wrappers, intended for production code, substitute device pointers for vector and matrix arguments in all cuSPARSE functions. To use these interfaces, existing applications need to be modified slightly to allocate and deallocate data structures in GPU memory space (using CUDA_MALLOC() and CUDA_FREE()) and to copy data between GPU and CPU memory spaces (using the CUDA_MEMCPY() routines). The sample wrappers provided in cusparse_fortran.c map device pointers to the OS-dependent type size_t, which is 32 bits wide on 32-bit platforms and 64 bits wide on 64-bit platforms.

One approach to dealing with index arithmetic on device pointers in Fortran code is to use C-style macros and to use the C preprocessor to expand them. On Linux and Mac OS X, preprocessing can be done by using the option '-cpp' with g95 or gfortran. The function GET_SHIFTED_ADDRESS(), provided with the cuSPARSE Fortran wrappers, can also be used, as shown in example B.
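The kind of helper described here can be sketched in a few lines of C. The function below is a hypothetical illustration in the spirit of GET_SHIFTED_ADDRESS(), not the code shipped with the wrappers; it assumes the device address is held as a size_t integer and that the offset is given in bytes, which matches a call such as `get_shifted_address(y, n*8, ynp1)` for skipping n double-precision elements.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: the Fortran side holds a device address as an
 * integer, so an element offset is applied as a byte offset on that
 * integer. All arguments are passed by reference, as Fortran does by
 * default. */
void get_shifted_address_sketch(const size_t *base, const int *byteOffset,
                                size_t *shifted)
{
    assert(*byteOffset >= 0);
    *shifted = *base + (size_t)*byteOffset;
}
```

Doing the arithmetic on the C side keeps the Fortran code independent of the pointer width, which would otherwise differ between integer*4 and integer*8 builds.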

Example B shows the C++ code of example A implemented in Fortran 77 on the host. This example should be compiled with ARCH_64 defined as 1 on a 64-bit OS and left undefined on a 32-bit OS. For example, with g95 or gfortran, this can be done directly on the command line using the options -cpp -DARCH_64=1.

7.1.Fortran Application

c     #define ARCH_64 0c     #define ARCH_64 1      program cusparse_fortran_example      implicit none      integer cuda_malloc      external cuda_free      integer cuda_memcpy_c2fort_int      integer cuda_memcpy_c2fort_real      integer cuda_memcpy_fort2c_int      integer cuda_memcpy_fort2c_real      integer cuda_memset      integer cusparse_create      external cusparse_destroy      integer cusparse_get_version      integer cusparse_create_mat_descr      external cusparse_destroy_mat_descr      integer cusparse_set_mat_type      integer cusparse_get_mat_type      integer cusparse_get_mat_fill_mode      integer cusparse_get_mat_diag_type      integer cusparse_set_mat_index_base      integer cusparse_get_mat_index_base      integer cusparse_xcoo2csr      integer cusparse_dsctr      integer cusparse_dcsrmv      integer cusparse_dcsrmm      external get_shifted_address#if ARCH_64      integer*8 handle      integer*8 descrA      integer*8 cooRowIndex      integer*8 cooColIndex      integer*8 cooVal      integer*8 xInd      integer*8 xVal      integer*8 y      integer*8 z      integer*8 csrRowPtr      integer*8 ynp1#else      integer*4 handle      integer*4 descrA      integer*4 cooRowIndex      integer*4 cooColIndex      integer*4 cooVal      integer*4 xInd      integer*4 xVal      integer*4 y      integer*4 z      integer*4 csrRowPtr      integer*4 ynp1#endif      integer status      integer cudaStat1,cudaStat2,cudaStat3      integer cudaStat4,cudaStat5,cudaStat6      integer n, nnz, nnz_vector      parameter (n=4, nnz=9, nnz_vector=3)      integer cooRowIndexHostPtr(nnz)      integer cooColIndexHostPtr(nnz)      real*8  cooValHostPtr(nnz)      integer xIndHostPtr(nnz_vector)      real*8  xValHostPtr(nnz_vector)      real*8  yHostPtr(2*n)      real*8  zHostPtr(2*(n+1))      integer i, j      integer version, mtype, fmode, dtype, ibase      real*8  dzero,dtwo,dthree,dfive      real*8  epsilon      write(*,*) "testing fortran example"c     predefined constants (need 
to be careful with them)      dzero = 0.0      dtwo  = 2.0      dthree= 3.0      dfive = 5.0c     create the following sparse test matrix in COO formatc     (notice one-based indexing)c     |1.0     2.0 3.0|c     |    4.0        |c     |5.0     6.0 7.0|c     |    8.0     9.0|      cooRowIndexHostPtr(1)=1      cooColIndexHostPtr(1)=1      cooValHostPtr(1)     =1.0      cooRowIndexHostPtr(2)=1      cooColIndexHostPtr(2)=3      cooValHostPtr(2)     =2.0      cooRowIndexHostPtr(3)=1      cooColIndexHostPtr(3)=4      cooValHostPtr(3)     =3.0      cooRowIndexHostPtr(4)=2      cooColIndexHostPtr(4)=2      cooValHostPtr(4)     =4.0      cooRowIndexHostPtr(5)=3      cooColIndexHostPtr(5)=1      cooValHostPtr(5)     =5.0      cooRowIndexHostPtr(6)=3      cooColIndexHostPtr(6)=3      cooValHostPtr(6)     =6.0      cooRowIndexHostPtr(7)=3      cooColIndexHostPtr(7)=4      cooValHostPtr(7)     =7.0      cooRowIndexHostPtr(8)=4      cooColIndexHostPtr(8)=2      cooValHostPtr(8)     =8.0      cooRowIndexHostPtr(9)=4      cooColIndexHostPtr(9)=4      cooValHostPtr(9)     =9.0c     print the matrix      write(*,*) "Input data:"      do i=1,nnz         write(*,*) "cooRowIndexHostPtr[",i,"]=",cooRowIndexHostPtr(i)         write(*,*) "cooColIndexHostPtr[",i,"]=",cooColIndexHostPtr(i)         write(*,*) "cooValHostPtr[",     i,"]=",cooValHostPtr(i)      enddoc     create a sparse and dense vectorc     xVal= [100.0 200.0 400.0]   (sparse)c     xInd= [0     1     3    ]c     y   = [10.0 20.0 30.0 40.0 | 50.0 60.0 70.0 80.0] (dense)c     (notice one-based indexing)      yHostPtr(1) = 10.0      yHostPtr(2) = 20.0      yHostPtr(3) = 30.0      yHostPtr(4) = 40.0      yHostPtr(5) = 50.0      yHostPtr(6) = 60.0      yHostPtr(7) = 70.0      yHostPtr(8) = 80.0      xIndHostPtr(1)=1      xValHostPtr(1)=100.0      xIndHostPtr(2)=2      xValHostPtr(2)=200.0      xIndHostPtr(3)=4      xValHostPtr(3)=400.0c     print the vectors      do j=1,2         do i=1,n            write(*,*) 
"yHostPtr[",i,",",j,"]=",yHostPtr(i+n*(j-1))         enddo      enddo      do i=1,nnz_vector         write(*,*) "xIndHostPtr[",i,"]=",xIndHostPtr(i)         write(*,*) "xValHostPtr[",i,"]=",xValHostPtr(i)      enddoc     allocate GPU memory and copy the matrix and vectors into itc     cudaSuccess=0c     cudaMemcpyHostToDevice=1      cudaStat1 = cuda_malloc(cooRowIndex,nnz*4)      cudaStat2 = cuda_malloc(cooColIndex,nnz*4)      cudaStat3 = cuda_malloc(cooVal,     nnz*8)      cudaStat4 = cuda_malloc(y,          2*n*8)      cudaStat5 = cuda_malloc(xInd,nnz_vector*4)      cudaStat6 = cuda_malloc(xVal,nnz_vector*8)      if ((cudaStat1 /= 0) .OR.     $    (cudaStat2 /= 0) .OR.     $    (cudaStat3 /= 0) .OR.     $    (cudaStat4 /= 0) .OR.     $    (cudaStat5 /= 0) .OR.     $    (cudaStat6 /= 0)) then         write(*,*) "Device malloc failed"         write(*,*) "cudaStat1=",cudaStat1         write(*,*) "cudaStat2=",cudaStat2         write(*,*) "cudaStat3=",cudaStat3         write(*,*) "cudaStat4=",cudaStat4         write(*,*) "cudaStat5=",cudaStat5         write(*,*) "cudaStat6=",cudaStat6         stop 2      endif      cudaStat1 = cuda_memcpy_fort2c_int(cooRowIndex,cooRowIndexHostPtr,     $                                   nnz*4,1)      cudaStat2 = cuda_memcpy_fort2c_int(cooColIndex,cooColIndexHostPtr,     $                                   nnz*4,1)      cudaStat3 = cuda_memcpy_fort2c_real(cooVal,    cooValHostPtr,     $                                    nnz*8,1)      cudaStat4 = cuda_memcpy_fort2c_real(y,      yHostPtr,     $                                    2*n*8,1)      cudaStat5 = cuda_memcpy_fort2c_int(xInd,       xIndHostPtr,     $                                   nnz_vector*4,1)      cudaStat6 = cuda_memcpy_fort2c_real(xVal,      xValHostPtr,     $                                    nnz_vector*8,1)      if ((cudaStat1 /= 0) .OR.     $    (cudaStat2 /= 0) .OR.     $    (cudaStat3 /= 0) .OR.     $    (cudaStat4 /= 0) .OR.     $    (cudaStat5 /= 0) .OR.     
     $    (cudaStat6 /= 0)) then
         write(*,*) "Memcpy from Host to Device failed"
         write(*,*) "cudaStat1=",cudaStat1
         write(*,*) "cudaStat2=",cudaStat2
         write(*,*) "cudaStat3=",cudaStat3
         write(*,*) "cudaStat4=",cudaStat4
         write(*,*) "cudaStat5=",cudaStat5
         write(*,*) "cudaStat6=",cudaStat6
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         stop 1
      endif
c     initialize cusparse library
c     CUSPARSE_STATUS_SUCCESS=0
      status = cusparse_create(handle)
      if (status /= 0) then
         write(*,*) "CUSPARSE Library initialization failed"
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         stop 1
      endif
c     get version
c     CUSPARSE_STATUS_SUCCESS=0
      status = cusparse_get_version(handle,version)
      if (status /= 0) then
         write(*,*) "CUSPARSE Library initialization failed"
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cusparse_destroy(handle)
         stop 1
      endif
      write(*,*) "CUSPARSE Library version",version
c     create and setup the matrix descriptor
c     CUSPARSE_STATUS_SUCCESS=0
c     CUSPARSE_MATRIX_TYPE_GENERAL=0
c     CUSPARSE_INDEX_BASE_ONE=1
      status= cusparse_create_mat_descr(descrA)
      if (status /= 0) then
         write(*,*) "Creating matrix descriptor failed"
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cusparse_destroy(handle)
         stop 1
      endif
      status = cusparse_set_mat_type(descrA,0)
      status = cusparse_set_mat_index_base(descrA,1)
c     print the matrix descriptor
      mtype = cusparse_get_mat_type(descrA)
      fmode = cusparse_get_mat_fill_mode(descrA)
      dtype = cusparse_get_mat_diag_type(descrA)
      ibase = cusparse_get_mat_index_base(descrA)
      write (*,*) "matrix descriptor:"
      write (*,*) "t=",mtype,"m=",fmode,"d=",dtype,"b=",ibase
c     exercise conversion routines (convert matrix from COO 2 CSR format)
c     cudaSuccess=0
c     CUSPARSE_STATUS_SUCCESS=0
c     CUSPARSE_INDEX_BASE_ONE=1
      cudaStat1 = cuda_malloc(csrRowPtr,(n+1)*4)
      if (cudaStat1 /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Device malloc failed (csrRowPtr)"
         stop 2
      endif
      status= cusparse_xcoo2csr(handle,cooRowIndex,nnz,n,
     $                          csrRowPtr,1)
      if (status /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Conversion from COO to CSR format failed"
         stop 1
      endif
c     csrRowPtr = [0 3 4 7 9]
c     exercise Level 1 routines (scatter vector elements)
c     CUSPARSE_STATUS_SUCCESS=0
c     CUSPARSE_INDEX_BASE_ONE=1
      call get_shifted_address(y,n*8,ynp1)
      status= cusparse_dsctr(handle, nnz_vector, xVal, xInd,
     $                       ynp1, 1)
      if (status /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Scatter from sparse to dense vector failed"
         stop 1
      endif
c     y = [10 20 30 40 | 100 200 70 400]
c     exercise Level 2 routines (csrmv)
c     CUSPARSE_STATUS_SUCCESS=0
c     CUSPARSE_OPERATION_NON_TRANSPOSE=0
      status= cusparse_dcsrmv(handle, 0, n, n, nnz, dtwo,
     $                       descrA, cooVal, csrRowPtr, cooColIndex,
     $                       y, dthree, ynp1)
      if (status /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Matrix-vector multiplication failed"
         stop 1
      endif
c     print intermediate results (y)
c     y = [10 20 30 40 | 680 760 1230 2240]
c     cudaSuccess=0
c     cudaMemcpyDeviceToHost=2
      cudaStat1 = cuda_memcpy_c2fort_real(yHostPtr, y, 2*n*8, 2)
      if (cudaStat1 /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Memcpy from Device to Host failed"
         stop 1
      endif
      write(*,*) "Intermediate results:"
      do j=1,2
         do i=1,n
             write(*,*) "yHostPtr[",i,",",j,"]=",yHostPtr(i+n*(j-1))
         enddo
      enddo
c     exercise Level 3 routines (csrmm)
c     cudaSuccess=0
c     CUSPARSE_STATUS_SUCCESS=0
c     CUSPARSE_OPERATION_NON_TRANSPOSE=0
      cudaStat1 = cuda_malloc(z, 2*(n+1)*8)
      if (cudaStat1 /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Device malloc failed (z)"
         stop 2
      endif
      cudaStat1 = cuda_memset(z, 0, 2*(n+1)*8)
      if (cudaStat1 /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(z)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Memset on Device failed"
         stop 1
      endif
      status= cusparse_dcsrmm(handle, 0, n, 2, n, nnz, dfive,
     $                        descrA, cooVal, csrRowPtr, cooColIndex,
     $                        y, n, dzero, z, n+1)
      if (status /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(z)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Matrix-matrix multiplication failed"
         stop 1
      endif
c     print final results (z)
c     cudaSuccess=0
c     cudaMemcpyDeviceToHost=2
      cudaStat1 = cuda_memcpy_c2fort_real(zHostPtr, z, 2*(n+1)*8, 2)
      if (cudaStat1 /= 0) then
         call cuda_free(cooRowIndex)
         call cuda_free(cooColIndex)
         call cuda_free(cooVal)
         call cuda_free(xInd)
         call cuda_free(xVal)
         call cuda_free(y)
         call cuda_free(z)
         call cuda_free(csrRowPtr)
         call cusparse_destroy_mat_descr(descrA)
         call cusparse_destroy(handle)
         write(*,*) "Memcpy from Device to Host failed"
         stop 1
      endif
c     z = [950 400 2550 2600 0 | 49300 15200 132300 131200 0]
      write(*,*) "Final results:"
      do j=1,2
         do i=1,n+1
            write(*,*) "z[",i,",",j,"]=",zHostPtr(i+(n+1)*(j-1))
         enddo
      enddo
c     check the results
      epsilon = 0.00000000000001
      if ((DABS(zHostPtr(1) - 950.0)   .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(2) - 400.0)   .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(3) - 2550.0)  .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(4) - 2600.0)  .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(5) - 0.0)     .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(6) - 49300.0) .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(7) - 15200.0) .GT. epsilon)  .OR.
     $    (DABS(zHostPtr(8) - 132300.0).GT. epsilon)  .OR.
     $    (DABS(zHostPtr(9) - 131200.0).GT. epsilon)  .OR.
     $    (DABS(zHostPtr(10) - 0.0)    .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(1) - 10.0)    .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(2) - 20.0)    .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(3) - 30.0)    .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(4) - 40.0)    .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(5) - 680.0)   .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(6) - 760.0)   .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(7) - 1230.0)  .GT. epsilon)  .OR.
     $    (DABS(yHostPtr(8) - 2240.0)  .GT. epsilon)) then
         write(*,*) "fortran example test FAILED"
      else
         write(*,*) "fortran example test PASSED"
      endif
c     deallocate GPU memory and exit
      call cuda_free(cooRowIndex)
      call cuda_free(cooColIndex)
      call cuda_free(cooVal)
      call cuda_free(xInd)
      call cuda_free(xVal)
      call cuda_free(y)
      call cuda_free(z)
      call cuda_free(csrRowPtr)
      call cusparse_destroy_mat_descr(descrA)
      call cusparse_destroy(handle)
      stop 0
      end
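
The values quoted in the listing's comments (csrRowPtr, y, and z) can be cross-checked on the host without a GPU. The following is a minimal sketch in plain C, not part of the cuSPARSE API: the function names coo2csr_rows, csrmv_ref, and run_reference are hypothetical, and the matrix entries, sparse vector, and initial y are assumptions chosen to be consistent with the comments in the listing, since that part of the example is not shown here. The sketch uses zero-based indices; with CUSPARSE_INDEX_BASE_ONE, as in the Fortran listing, each row pointer would be one larger.

```c
#include <assert.h>

/* Build CSR row pointers from sorted, zero-based COO row indices. */
void coo2csr_rows(const int *cooRow, int nnz, int n, int *csrRowPtr)
{
    for (int i = 0; i <= n; ++i)
        csrRowPtr[i] = 0;
    for (int k = 0; k < nnz; ++k) {
        assert(cooRow[k] >= 0 && cooRow[k] < n);
        csrRowPtr[cooRow[k] + 1]++;          /* count entries per row */
    }
    for (int i = 0; i < n; ++i)
        csrRowPtr[i + 1] += csrRowPtr[i];    /* prefix sum gives row starts */
}

/* y = alpha * A * x + beta * y for a CSR matrix A with n rows. */
void csrmv_ref(int n, double alpha, const double *val, const int *rowPtr,
               const int *colInd, const double *x, double beta, double *y)
{
    for (int i = 0; i < n; ++i) {
        double sum = 0.0;
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            sum += val[k] * x[colInd[k]];
        y[i] = alpha * sum + beta * y[i];
    }
}

/* Repeat the listing's sequence on the host: COO to CSR, scatter,
   matrix-vector product, matrix-matrix product.
   csrRowPtr has 5 entries, y has 8, z has 10 (leading dimension n+1). */
void run_reference(int *csrRowPtr, double *y, double *z)
{
    enum { N = 4, NNZ = 9, NNZ_VECTOR = 3 };
    /* Assumed input data (not shown in this part of the listing). */
    const int    cooRow[NNZ] = {0, 0, 0, 1, 2, 2, 2, 3, 3};
    const int    cooCol[NNZ] = {0, 2, 3, 1, 0, 2, 3, 1, 3};
    const double cooVal[NNZ] = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    const int    xInd[NNZ_VECTOR] = {0, 1, 3};
    const double xVal[NNZ_VECTOR] = {100, 200, 400};
    const double y0[2 * N] = {10, 20, 30, 40, 50, 60, 70, 80};

    for (int i = 0; i < 2 * N; ++i)
        y[i] = y0[i];
    for (int i = 0; i < 2 * (N + 1); ++i)
        z[i] = 0.0;

    coo2csr_rows(cooRow, NNZ, N, csrRowPtr);

    /* Scatter the sparse vector into the second column of y. */
    for (int k = 0; k < NNZ_VECTOR; ++k)
        y[N + xInd[k]] = xVal[k];

    /* Level 2: y(:,2) = 2 * A * y(:,1) + 3 * y(:,2). */
    csrmv_ref(N, 2.0, cooVal, csrRowPtr, cooCol, y, 3.0, y + N);

    /* Level 3: z(:,j) = 5 * A * y(:,j), one column at a time. */
    for (int j = 0; j < 2; ++j)
        csrmv_ref(N, 5.0, cooVal, csrRowPtr, cooCol,
                  y + j * N, 0.0, z + j * (N + 1));
}
```

Calling run_reference with arrays of 5, 8, and 10 elements fills them with the row pointers and the y and z values that the listing's comments describe. The padding row of z stays zero because the leading dimension is n+1.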

8.Acknowledgements

NVIDIA would like to thank the following individuals and institutions for their contributions:

  • The cusparse<t>gtsv implementation is derived from a version developed by Li-Wen Chang from the University of Illinois.

  • The cusparse<t>gtsvInterleavedBatch adopts cuThomasBatch developed by Pedro Valero-Lara and Ivan Martínez-Pérez from Barcelona Supercomputing Center and BSC/UPC NVIDIA GPU Center of Excellence.

  • This product includes {fmt} - A modern formatting library (https://fmt.dev), Copyright (c) 2012 - present, Victor Zverovich.

9.Bibliography

[1] N. Bell and M. Garland, “Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors”, Supercomputing, 2009.

[2] R. Grimes, D. Kincaid, and D. Young, “ITPACK 2.0 User’s Guide”, Technical Report CNA-150, Center for Numerical Analysis, University of Texas, 1979.

[3] M. Naumov, “Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS”, Technical Report and White Paper, 2011.

[4] P. Valero-Lara, I. Martínez-Pérez, R. Sirvent, X. Martorell, and A. J. Peña, “NVIDIA GPUs Scalability to Solve Multiple (Batch) Tridiagonal Systems. Implementation of cuThomasBatch”, Parallel Processing and Applied Mathematics - 12th International Conference (PPAM), 2017.

10.Notices

10.1.Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no representations or warranties, expressed or implied, as to the accuracy or completeness of the information contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or deliver any Material (defined below), code, or functionality.

NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to this document, at any time without notice.

Customer should obtain the latest relevant information before placing orders and should verify that such information is current and complete.

NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations are formed either directly or indirectly by this document.

NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at customer’s own risk.

NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to evaluate and determine the applicability of any information contained in this document, ensure the product is suitable and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product designs.

No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual property right under this document. Information published by NVIDIA regarding third-party products or services does not constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such information may require a license from a third party under the patents or other intellectual property rights of the third party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.

Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all associated conditions, limitations, and notices.

THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be limited in accordance with the Terms of Sale for the product.

10.2.OpenCL

OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.

10.3.Trademarks

NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the U.S. and other countries. Other company and product names may be trademarks of the respective companies with which they are associated.