# GPUtils

A C++ header-only library for parallel linear algebra on GPUs (CUDA/cuBLAS under the hood).
The `DTensor` class is for manipulating data on a GPU. It manages the underlying device memory and facilitates various algebraic operations. A tensor has three axes: [rows (m) x columns (n) x matrices (k)]. An (m,n,1)-tensor stores a matrix, and an (m,1,1)-tensor stores a vector.
We first need to decide on a data type between `float` and `double`. We will use `float` in the following examples.
The simplest way to create an empty `DTensor` object is by constructing a vector:

```c++
size_t n = 100;
DTensor<float> myTensor(n);
```
> [!IMPORTANT]
> This creates an n-dimensional vector as an (n,1,1)-tensor on the device.
A `DTensor` can be instantiated from host memory:

```c++
std::vector<float> h_a{4., -5., 6., 9., 8., 5., 9., -10.2, 9., 11.};
DTensor<float> myTensor(h_a, h_a.size());
std::cout << myTensor << "\n";
```
> [!CAUTION]
> Printing a `DTensor` to `std::cout` will slow down your program (it requires the data to be downloaded from the device). Printing was designed for quick debugging.
We will often need to create slices (or shallow copies) of a `DTensor` given a range of values. We can then do:

```c++
size_t axis = 0;  // rows=0, cols=1, mats=2
size_t from = 3;
size_t to = 5;
DTensor<float> mySlice(myTensor, axis, from, to);
std::cout << mySlice << "\n";
```
Sometimes we need to reuse an already allocated `DTensor` by uploading new data from the host using the method `upload`. Here is a short example:

```c++
std::vector<float> h_a{1., 2., 3.};   // host data a
DTensor<float> myVec(h_a, 3);         // create vector in tensor on device
std::vector<float> h_b{4., -5., 6.};  // host data b
myVec.upload(h_b);
std::cout << myVec << "\n";
```
We can upload some host data to a particular position of a `DTensor` as follows:

```c++
std::vector<float> hostData{1., 2., 3.};
// here, `true` tells the constructor to set all allocated elements to zero
DTensor<float> x(7, 1, 1, true);     // x = [0, 0, 0, 0, 0, 0, 0]'
DTensor<float> mySlice(x, 0, 3, 5);
mySlice.upload(hostData);
std::cout << x << "\n";              // x = [0, 0, 0, 1, 2, 3, 0]'
```
If necessary, the data can be downloaded from the device to the host using `download`.
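As a minimal sketch (assuming `download` copies the device data into a preallocated host `std::vector`; check the header for the exact signature):

```c++
std::vector<float> h_a{1., 2., 3.};
DTensor<float> x(h_a, 3);       // device tensor
std::vector<float> h_out(3);    // preallocated host buffer
x.download(h_out);              // assumed usage: copies device data into h_out
std::cout << h_out[0] << "\n";  // 1
```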
Very often we will also need to copy data from an existing `DTensor` to another `DTensor` (without passing through the host). To do this we can use `deviceCopyTo`. Here is an example:

```c++
DTensor<float> x(10);
DTensor<float> y(10);
x.deviceCopyTo(y);  // x ---> y (device memory to device memory)
```
The copy constructor has also been implemented; to hard-copy a `DTensor`, just do `DTensor<float> myCopy(existingTensor)`.
Lastly, a not-so-efficient method that should only be used for debugging, if at all, is the `()` operator (e.g., `x(i, j, k)`), which fetches one element of the `DTensor` to the host. This cannot be used to set a value, so don't do anything like `x(0, 0, 0) = 4.5`!
> [!CAUTION]
> For the love of god, do not put this `()` operator in a loop.
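For illustration, a single debug fetch looks like this (each call triggers a device-to-host transfer):

```c++
std::vector<float> h_a{1., 2., 3.};
DTensor<float> x(h_a, 3);
float xi = x(1, 0, 0);    // fetches one element from the device (slow)
std::cout << xi << "\n";  // 2
```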
The following scalar quantities can be computed (internally, we use `cublas` functions):

- `.normF()`: the Frobenius norm of a tensor $x$, using `nrm2` (i.e., the 2-norm, or Euclidean norm, if $x$ is a vector)
- `.sumAbs()`: the sum of the absolute values of all the elements, using `asum` (i.e., the 1-norm if $x$ is a vector)
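For example (assuming, as a sketch, that both methods return a host-side scalar of the element type):

```c++
std::vector<float> h_z{3.0f, -4.0f};
DTensor<float> z(h_z, 2);
float nrm = z.normF();  // 5 (Euclidean norm of a vector)
float s1 = z.sumAbs();  // 7 (1-norm of a vector)
```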
We can element-wise add `DTensor`s on the device as follows:

```c++
std::vector<float> host_x{1., 2., 3., 4., 5., 6., 7.};
std::vector<float> host_y{1., 3., 5., 7., 9., 11., 13.};
DTensor<float> x(host_x, host_x.size());
DTensor<float> y(host_y, host_y.size());
x += y;  // x = [2, 5, 8, 11, 14, 17, 20]'
std::cout << x << "\n";
```
To element-wise subtract `y` from `x` we can use `x -= y`.
We can also scale a `DTensor` by a scalar with `*=` (e.g., `x *= 5.0f`). To negate the values of a `DTensor` we can do `x *= -1.0f`.
We can also compute the inner product (as a (1,1,1)-tensor) of two vectors as follows:

```c++
std::vector<float> host_x{1., 2., 3., 4., 5., 6., 7.};
std::vector<float> host_y{1., 3., 5., 7., 9., 11., 13.};
DTensor<float> xtr(host_x, 1, host_x.size());  // row vector
DTensor<float> y(host_y, host_y.size());       // column vector
DTensor<float> innerProduct = xtr * y;
```
If necessary, we can also use the following element-wise operations:

```c++
DTensor<float> x(host_x, host_x.size());  // column vector
auto sum = x + y;
auto diff = x - y;
auto scaledX = 3.0f * x;
```
To store a matrix in a `DTensor` we need to provide the data in an array; we can use either column-major (default) or row-major format.
Suppose we need to store the 5-by-3 matrix

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \\ 10 & 11 & 12 \\ 13 & 14 & 15 \end{bmatrix},$$

where this data is stored in row-major format. Then, we do

```c++
size_t rows = 5;
size_t cols = 3;
std::vector<float> h_data{1.0f, 2.0f, 3.0f,
                          4.0f, 5.0f, 6.0f,
                          7.0f, 8.0f, 9.0f,
                          10.0f, 11.0f, 12.0f,
                          13.0f, 14.0f, 15.0f};
DTensor<float> myTensor(h_data, rows, cols, 1, rowMajor);
```
Choose `rowMajor` or `columnMajor` as appropriate.
We can also preallocate memory for a `DTensor` as follows:

```c++
DTensor<float> a(rows, cols, 1);
```
Then, we can upload the data as follows:

```c++
a.upload(h_data, rowMajor);
```
The copy constructor has also been implemented; to hard-copy a `DTensor` just do `DTensor<float> myCopy(existingTensor)`.
The number of rows and columns of a `DTensor` can be retrieved using the methods `.numRows()` and `.numCols()`, respectively.
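For instance:

```c++
DTensor<float> a(5, 3, 1);
std::cout << a.numRows() << " x " << a.numCols() << "\n";  // prints: 5 x 3
```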
The operators `+=` and `-=` are supported for device matrices.
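A quick sketch:

```c++
std::vector<float> d{1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
DTensor<float> P(d, 2, 3, 1);
DTensor<float> Q(P);  // hard copy via the copy constructor
P += Q;               // element-wise addition, on the device
P -= Q;               // restores the original values
```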
Matrix-matrix multiplication is as simple as:

```c++
size_t m = 2, k = 3, n = 5;
std::vector<float> aData{1.0f, 2.0f, 3.0f,
                         4.0f, 5.0f, 6.0f};
std::vector<float> bData{1.0f, 2.0f, 3.0f, 4.0f, 5.0f,
                         6.0f, 7.0f, 8.0f, 9.0f, 10.0f,
                         11.0f, 12.0f, 13.0f, 14.0f, 15.0f};
DTensor<float> A(aData, m, k, 1, rowMajor);
DTensor<float> B(bData, k, n, 1, rowMajor);
auto X = A * B;
std::cout << A << B << X << "\n";
```
As you would expect, all operations mentioned so far are supported by actual tensors as batched operations (that is, (m,n)-matrix-wise).
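For example, here is a sketch of a batched multiplication (assuming, per the statement above, that `*` multiplies matrix-wise when k > 1):

```c++
size_t m = 2, p = 3, n = 5, k = 4;
DTensor<float> A(m, p, k, true);  // k zeroed (2x3)-matrices
DTensor<float> B(p, n, k, true);  // k zeroed (3x5)-matrices
auto X = A * B;                   // k (2x5)-matrices, computed matrix-wise
```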
Also, we can create the transpose of a `DTensor` using `.tr()`. This transposes each (m,n)-matrix and stores it in a new `DTensor` at the same k-index. In-place transposition is not possible.
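For instance, transposing a single (2,3)-matrix:

```c++
std::vector<float> aData{1.0f, 2.0f, 3.0f,
                         4.0f, 5.0f, 6.0f};
DTensor<float> A(aData, 2, 3, 1, rowMajor);
auto At = A.tr();  // a new (3,2,1)-tensor; A itself is unchanged
```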
The solution of least-squares problems has been implemented as a tensor method. Say we want to solve `A\b` using least squares. We first create

```c++
size_t m = 4;
size_t n = 3;
std::vector<float> aData{1.0f, 2.0f, 4.0f,
                         2.0f, 13.0f, 23.0f,
                         4.0f, 23.0f, 77.0f,
                         6.0f, 7.0f, 8.0f};
std::vector<float> bData{1.0f, 2.0f, 3.0f, 4.0f};
DTensor<float> A(aData, m, n, 1, rowMajor);
DTensor<float> B(bData, m);
```
Then, we can solve the system by

```c++
A.leastSquaresBatched(B);
```
The `DTensor` `B` will be overwritten with the solution.
> [!IMPORTANT]
> This particular example demonstrates how the solution may overwrite only part of the given `B`, as `B` is a (4,1,1)-tensor and the solution is a (3,1,1)-tensor.
Tensor data can be stored in simple text files or binary files. The text-based format has the following structure:

```
number_of_rows
number_of_columns
number_of_matrices
data (one entry per line)
```
To save a tensor to a file, simply call `DTensor::saveToFile(filename)`.
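For example (the file names here are just placeholders):

```c++
myTensor.saveToFile("myTensor.dtensor");  // text format
myTensor.saveToFile("myTensor.bt");       // binary format (see below)
```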
If the file extension is `.bt` (binary tensor), the data will be stored in binary format. The structure of the binary encoding is similar to that of the text encoding: the first three `uint64_t`-sized positions correspond to the number of rows, columns, and matrices, followed by the elements of the tensor.
To load a tensor from a file, the static function `DTensor<T>::parseFromFile(filename)` can be used. For example:

```c++
auto z = DTensor<double>::parseFromFile("path/to/my.dtensor");
```
If necessary, you can provide a second argument to `parseFromFile` to specify the order in which the data are stored (the `StorageMode`).
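For example, assuming the `StorageMode` values are the `rowMajor`/`columnMajor` constants used earlier:

```c++
auto z = DTensor<double>::parseFromFile("path/to/my.dtensor", rowMajor);
```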
Soon we will release a Python API for reading and serialising (numpy) arrays to `.bt` files.
> [!WARNING]
> Cholesky factorisation only works with positive-definite matrices.

Here is an example with the matrix

$$A = \begin{bmatrix} 1 & 2 & 4 \\ 2 & 13 & 23 \\ 4 & 23 & 77 \end{bmatrix}.$$

This is how to perform a Cholesky factorisation:

```c++
size_t n = 3;
std::vector<float> aData{1.0f, 2.0f, 4.0f,
                         2.0f, 13.0f, 23.0f,
                         4.0f, 23.0f, 77.0f};
DTensor<float> A(aData, n, n, 1, rowMajor);
CholeskyFactoriser<float> cfEngine(A);
auto status = cfEngine.factorise();
```
Then, you can solve the system `A\b`:

```c++
std::vector<float> bData{1.0f, 2.0f, 3.0f};
DTensor<float> B(bData, n);
cfEngine.solve(B);
```
The `DTensor` `B` will be overwritten with the solution.
> [!WARNING]
> The SVD implementation only works with square or tall matrices.

Here is an example with the 4-by-3 matrix

$$B = \begin{bmatrix} 1 & 2 & 3 \\ 6 & 7 & 8 \\ 6 & 7 & 8 \\ 6 & 7 & 8 \end{bmatrix}.$$

Evidently, the rank of $B$ is 2. This is how to perform an SVD decomposition:

```c++
size_t m = 4;
size_t n = 3;
std::vector<float> bData{1.0f, 2.0f, 3.0f,
                         6.0f, 7.0f, 8.0f,
                         6.0f, 7.0f, 8.0f,
                         6.0f, 7.0f, 8.0f};
DTensor<float> B(bData, m, n, 1, rowMajor);
SvdFactoriser<float> svdEngine(B);
auto status = svdEngine.factorise();
```
By default, `SvdFactoriser` will not compute the matrix $U$. If you need it, construct your `SvdFactoriser` as follows:

```c++
SvdFactoriser<float> svdEngine(B, true);  // computes U
```
Note that the default behaviour of `.factorise()` is to destroy the given matrix; to prevent this, set the corresponding constructor argument to `false`.
After you have factorised the matrix, you can access the results:

```c++
std::cout << "S = " << svdEngine.singularValues() << "\n";
std::cout << "V' = " << svdEngine.rightSingularVectors() << "\n";
```
Note that `.leftSingularVectors()` returns an object of type `std::optional<DeviceMatrix<TElement>>`. Here is an example:

```c++
auto U = svdEngine.leftSingularVectors();
if (U) std::cout << "U = " << U.value();
```
The nullspace of a matrix is computed by SVD. The user provides a `DTensor` made of (padded) matrices. Then, `Nullspace` computes, possibly pads, and returns the nullspace matrices `N = (N1, ..., Nk)` in another `DTensor`.
```c++
size_t m = 4, n = 3, k = 2;         // example dimensions
DTensor<float> paddedMatrices(m, n, k);
Nullspace N(paddedMatrices);        // computes N and NN'
DTensor<float> ns = N.nullspace();  // returns N
```
Each padded nullspace matrix `Ni` is orthogonal, and `Nullspace` further computes and stores the nullspace projection operators `NN' = (N1N1', ..., NkNk')`. This allows the user to project in place onto the nullspace:
```c++
DTensor<float> vectors(m, 1, k);
N.project(vectors);
std::cout << vectors << "\n";
```
We can get the total allocated bytes (on the GPU) with:

```c++
size_t allocatedBytes = Session::getInstance().totalAllocatedBytes();
```