# GPUtils

A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood).
The `DTensor` class is for manipulating data on a GPU. It manages the underlying device memory and facilitates various algebraic operations. A tensor has three axes: [rows (m) x columns (n) x matrices (k)]. An (m,n,1)-tensor stores a matrix, and an (m,1,1)-tensor stores a vector.
We first need to decide on a data type between `float` and `double`. We will use `float` in the following examples.
The simplest way to create an empty `DTensor` object is by constructing a vector:

```c++
size_t n = 100;
DTensor<float> myTensor(n);
```
> [!IMPORTANT]
> This creates an n-dimensional vector as an (n,1,1)-tensor on the device.
A `DTensor` can be instantiated from host memory:

```c++
std::vector<float> h_a{4., -5., 6., 9., 8., 5., 9., -10.2, 9., 11.};
DTensor<float> myTensor(h_a, h_a.size());
std::cout << myTensor << "\n";
```
> [!CAUTION]
> Printing a `DTensor` to `std::cout` will slow down your program (it requires the data to be downloaded from the device). Printing was designed for quick debugging.
We will often need to create slices (or shallow copies) of a `DTensor` given a range of values. We can then do:

```c++
size_t axis = 0;  // rows=0, cols=1, mats=2
size_t from = 3;
size_t to = 5;
DTensor<float> mySlice(myTensor, axis, from, to);
std::cout << mySlice << "\n";
```
Sometimes we need to reuse an already allocated `DTensor` by uploading new data from the host using the method `upload`. Here is a short example:

```c++
std::vector<float> h_a{1., 2., 3.};   // host data a
DTensor<float> myVec(h_a, 3);         // create vector in tensor on device
std::vector<float> h_b{4., -5., 6.};  // host data b
myVec.upload(h_b);
std::cout << myVec << "\n";
```
We can upload some host data to a particular position of a `DTensor` as follows:

```c++
std::vector<float> hostData{1., 2., 3.};
// here, `true` tells the constructor to set all allocated elements to zero
DTensor<float> x(7, 1, 1, true);     // x = [0, 0, 0, 0, 0, 0, 0]'
DTensor<float> mySlice(x, 0, 3, 5);
mySlice.upload(hostData);
std::cout << x << "\n";              // x = [0, 0, 0, 1, 2, 3, 0]'
```
If necessary, the data can be downloaded from the device to the host using `download`.
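As a minimal sketch (assuming `download` copies the device data into a preallocated host `std::vector`; check the header for the exact signature):

```c++
std::vector<float> h_a{1., 2., 3.};
DTensor<float> x(h_a, 3);       // device tensor
std::vector<float> h_out(3);    // preallocated host buffer
x.download(h_out);              // assumed usage: copies device data into h_out
std::cout << h_out[0] << "\n";  // 1
```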
Very often we will also need to copy data from an existing `DTensor` to another `DTensor` (without passing through the host). To do this we can use `deviceCopyTo`. Here is an example:

```c++
DTensor<float> x(10);
DTensor<float> y(10);
x.deviceCopyTo(y);  // x ---> y (device memory to device memory)
```
The copy constructor has also been implemented; to hard-copy a `DTensor`, just do `DTensor<float> myCopy(existingTensor)`.
Lastly, a not-so-efficient method that should only be used for debugging, if at all, is the `()` operator (e.g., `x(i, j, k)`), which fetches one element of the `DTensor` to the host. This cannot be used to set a value, so don't do anything like `x(0, 0, 0) = 4.5`!
> [!CAUTION]
> For the love of god, do not put this `()` operator in a loop.
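For illustration, a single debug fetch looks like this (each call triggers a device-to-host transfer):

```c++
std::vector<float> h_a{1., 2., 3.};
DTensor<float> x(h_a, 3);
float xi = x(1, 0, 0);    // fetches one element from the device (slow)
std::cout << xi << "\n";  // 2
```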
The following scalar quantities can be computed (internally, we use `cublas` functions):

- `.normF()`: the Frobenius norm of a tensor $x$, using `nrm2` (i.e., the 2-norm, or Euclidean norm, if $x$ is a vector)
- `.sumAbs()`: the sum of the absolute values of all the elements, using `asum` (i.e., the 1-norm if $x$ is a vector)
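For example (assuming, as a sketch, that both methods return a host-side scalar of the element type):

```c++
std::vector<float> h_z{3.0f, -4.0f};
DTensor<float> z(h_z, 2);
float nrm = z.normF();  // 5 (Euclidean norm of a vector)
float s1 = z.sumAbs();  // 7 (1-norm of a vector)
```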
We can element-wise add `DTensor`s on the device as follows:

```c++
std::vector<float> host_x{1., 2., 3., 4., 5., 6., 7.};
std::vector<float> host_y{1., 3., 5., 7., 9., 11., 13.};
DTensor<float> x(host_x, host_x.size());
DTensor<float> y(host_y, host_y.size());
x += y;  // x = [2, 5, 8, 11, 14, 17, 20]'
std::cout << x << "\n";
```
To element-wise subtract `y` from `x` we can use `x -= y`.
We can also scale a `DTensor` by a scalar with `*=` (e.g., `x *= 5.0f`). To negate the values of a `DTensor` we can do `x *= -1.0f`.
We can also compute the inner product (as a (1,1,1)-tensor) of two vectors as follows:

```c++
std::vector<float> host_x{1., 2., 3., 4., 5., 6., 7.};
std::vector<float> host_y{1., 3., 5., 7., 9., 11., 13.};
DTensor<float> xtr(host_x, 1, host_x.size());  // row vector
DTensor<float> y(host_y, host_y.size());       // column vector
DTensor<float> innerProduct = xtr * y;
```
If necessary, we can also use the following element-wise operations:

```c++
DTensor<float> x(host_x, host_x.size());  // column vector
auto sum = x + y;
auto diff = x - y;
auto scaledX = 3.0f * x;
```
To store a matrix in a `DTensor` we need to provide the data in an array; we can use either column-major (default) or row-major format.
Suppose we need to store the 5-by-3 matrix

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 13 & 14 & 15 \end{bmatrix},$$

where this data is stored in row-major format. Then, we do

```c++
size_t rows = 5;
size_t cols = 3;
std::vector<float> h_data{1.0f, 2.0f, 3.0f,
                          4.0f, 5.0f, 6.0f,
                          7.0f, 8.0f, 9.0f,
                          10.0f, 11.0f, 12.0f,
                          13.0f, 14.0f, 15.0f};
DTensor<float> myTensor(h_data, rows, cols, 1, rowMajor);
```
Choose `rowMajor` or `columnMajor` as appropriate.
We can also preallocate memory for a `DTensor` as follows:

```c++
DTensor<float> a(rows, cols, 1);
```
Then, we can upload the data as follows:

```c++
a.upload(h_data, rowMajor);
```
The copy constructor has also been implemented; to hard-copy a `DTensor` just do `DTensor<float> myCopy(existingTensor)`.
The number of rows and columns of a `DTensor` can be retrieved using the methods `.numRows()` and `.numCols()`, respectively.
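For instance:

```c++
DTensor<float> a(5, 3, 1);
std::cout << a.numRows() << " x " << a.numCols() << "\n";  // prints: 5 x 3
```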
The operators `+=` and `-=` are supported for device matrices.
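A quick sketch:

```c++
std::vector<float> d{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
DTensor<float> P(d, 2, 3, 1);
DTensor<float> Q(P);  // hard copy via the copy constructor
P += Q;               // element-wise addition, on the device
P -= Q;               // restores the original values
```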
Matrix-matrix multiplication is as simple as:

```c++
size_t m = 2, k = 3, n = 5;
std::vector<float> aData{1.0f, 2.0f, 3.0f,
                         4.0f, 5.0f, 6.0f};
std::vector<float> bData{1.0f, 2.0f, 3.0f, 4.0f, 5.0f,
                         6.0f, 7.0f, 8.0f, 9.0f, 10.0f,
                         11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
DTensor<float> A(aData, m, k, 1, rowMajor);
DTensor<float> B(bData, k, n, 1, rowMajor);
auto X = A * B;
std::cout << A << B << X << "\n";
```
As you would expect, all operations mentioned so far are supported by actual tensors as batched operations (that is, (m,n)-matrix-wise).
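For example, here is a sketch of a batched multiplication (assuming, per the statement above, that `*` multiplies matrix-wise when k > 1):

```c++
size_t m = 2, p = 3, n = 5, k = 4;
DTensor<float> A(m, p, k, true);  // k zeroed (2x3)-matrices
DTensor<float> B(p, n, k, true);  // k zeroed (3x5)-matrices
auto X = A * B;                   // k (2x5)-matrices, computed matrix-wise
```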
Also, we can create the transpose of a `DTensor` using `.tr()`. This transposes each (m,n)-matrix and stores it in a new `DTensor` at the same k-index. In-place transposition is not possible.
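For instance, transposing a single (2,3)-matrix:

```c++
std::vector<float> aData{1.0f, 2.0f, 3.0f,
                         4.0f, 5.0f, 6.0f};
DTensor<float> A(aData, 2, 3, 1, rowMajor);
auto At = A.tr();  // a new (3,2,1)-tensor; A itself is unchanged
```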
The solution of least-squares problems has been implemented as a tensor method. Say we want to solve `A\b` using least squares. We first create

```c++
size_t m = 4;
size_t n = 3;
std::vector<float> aData{1.0f, 2.0f, 4.0f,
                         2.0f, 13.0f, 23.0f,
                         4.0f, 23.0f, 77.0f,
                         6.0f, 7.0f, 8.0f};
std::vector<float> bData{1.0f, 2.0f, 3.0f, 4.0f};
DTensor<float> A(aData, m, n, 1, rowMajor);
DTensor<float> B(bData, m);
```
Then, we can solve the system by

```c++
A.leastSquaresBatched(B);
```
The `DTensor` `B` will be overwritten with the solution.
> [!IMPORTANT]
> This particular example demonstrates how the solution may overwrite only part of the given `B`, as `B` is a (4,1,1)-tensor and the solution is a (3,1,1)-tensor.
Tensor data can be stored in simple text files or binary files. The text-based format has the following structure:

```
number_of_rows
number_of_columns
number_of_matrices
data (one entry per line)
```
To save a tensor to a file, simply call `DTensor::saveToFile(filename)`.
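For example (the file names here are just placeholders):

```c++
myTensor.saveToFile("myTensor.dtensor");  // text format
myTensor.saveToFile("myTensor.bt");       // binary format (see below)
```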
If the file extension is `.bt` (binary tensor), the data will be stored in binary format. The structure of the binary encoding is similar to that of the text encoding: the first three `uint64_t`-sized positions correspond to the number of rows, columns, and matrices, followed by the elements of the tensor.
To load a tensor from a file, the static function `DTensor<T>::parseFromFile(filename)` can be used. For example:

```c++
auto z = DTensor<double>::parseFromFile("path/to/my.dtensor");
```
If necessary, you can provide a second argument to `parseFromFile` to specify the order in which the data are stored (the `StorageMode`).
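For example, assuming the `StorageMode` values are the `rowMajor`/`columnMajor` constants used earlier:

```c++
auto z = DTensor<double>::parseFromFile("path/to/my.dtensor", rowMajor);
```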
Soon we will release a Python API for reading and serialising (numpy) arrays to `.bt` files.
> [!WARNING]
> Cholesky factorisation only works with positive-definite matrices.

Here is an example with the matrix

$$A = \begin{bmatrix} 1 & 2 & 4 \\ 2 & 13 & 23 \\ 4 & 23 & 77 \end{bmatrix}.$$

This is how to perform a Cholesky factorisation:

```c++
size_t n = 3;
std::vector<float> aData{1.0f, 2.0f, 4.0f,
                         2.0f, 13.0f, 23.0f,
                         4.0f, 23.0f, 77.0f};
DTensor<float> A(aData, n, n, 1, rowMajor);
CholeskyFactoriser<float> cfEngine(A);
auto status = cfEngine.factorise();
```
Then, you can solve the system `A\b`:

```c++
std::vector<float> bData{1.0f, 2.0f, 3.0f};
DTensor<float> B(bData, n);
cfEngine.solve(B);
```
The `DTensor` `B` will be overwritten with the solution.
> [!WARNING]
> The SVD implementation only works with square or tall matrices.

Here is an example with the 4-by-3 matrix

$$B = \begin{bmatrix} 1 & 2 & 3 \\ 6 & 7 & 8 \\ 6 & 7 & 8 \\ 6 & 7 & 8 \end{bmatrix}.$$

Evidently, the rank of $B$ is 2. This is how to perform an SVD decomposition:

```c++
size_t m = 4;
size_t n = 3;
std::vector<float> bData{1.0f, 2.0f, 3.0f,
                         6.0f, 7.0f, 8.0f,
                         6.0f, 7.0f, 8.0f,
                         6.0f, 7.0f, 8.0f};
DTensor<float> B(bData, m, n, 1, rowMajor);
SvdFactoriser<float> svdEngine(B);
auto status = svdEngine.factorise();
```
By default, `SvdFactoriser` will not compute the matrix $U$. If you need it, construct your `SvdFactoriser` as follows:

```c++
SvdFactoriser<float> svdEngine(B, true);  // computes U
```
Note that the default behaviour of `.factorise()` is to destroy the given matrix; to prevent this, set the corresponding constructor argument to `false`.
After you have factorised the matrix, you can access the results:

```c++
std::cout << "S = " << svdEngine.singularValues() << "\n";
std::cout << "V' = " << svdEngine.rightSingularVectors() << "\n";
```
Note that `.leftSingularVectors()` returns an object of type `std::optional<DeviceMatrix<TElement>>`. Here is an example:

```c++
auto U = svdEngine.leftSingularVectors();
if (U) std::cout << "U = " << U.value();
```
The nullspace of a matrix is computed by SVD. The user provides a `DTensor` made of (padded) matrices. Then, `Nullspace` computes, possibly pads, and returns the nullspace matrices `N = (N1, ..., Nk)` in another `DTensor`.
```c++
size_t m = 4, n = 3, k = 2;         // example dimensions
DTensor<float> paddedMatrices(m, n, k);
Nullspace N(paddedMatrices);        // computes N and NN'
DTensor<float> ns = N.nullspace();  // returns N
```
Each padded nullspace matrix `Ni` is orthogonal, and `Nullspace` further computes and stores the nullspace projection operators `NN' = (N1N1', ..., NkNk')`. This allows the user to project in place onto the nullspace:
```c++
DTensor<float> vectors(m, 1, k);
N.project(vectors);
std::cout << vectors << "\n";
```
We can get the total allocated bytes (on the GPU) with:

```c++
size_t allocatedBytes = Session::getInstance().totalAllocatedBytes();
```