Embed presentation


























































































![For example: matrix multiplication• We would write it like this:void MatrixMul_sequential(int dim, float *A, float *B, float *C) {for(int iRow=0; iRow<dim;++iRow) {for(int iCol=0; iCol<dim;++iCol) {float result = 0.f;for(int i=0; i<dim;++i) {result += A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}}}91](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-91-2048.jpg&f=jpg&w=240)

![For example: matrix multiplication• So on GPU:void MatrixMul_kernel_basic(int dim,__global float *A, __global float *B, __global float *C) {//Get the index of the work-itemint iCol = get_global_id(0);int iRow = get_global_id(1);float result = 0.0;for(int i=0;i< dim;++i) {result += A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}93](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-93-2048.jpg&f=jpg&w=240)
![For example: matrix multiplication• So on GPU:void MatrixMul_kernel_basic(int dim,__global float *A, __global float *B, __global float *C) {//Get the index of the work-itemint iCol = get_global_id(0);int iRow = get_global_id(1);float result = 0.0;for(int i=0;i< dim;++i) {result += A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}94](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-94-2048.jpg&f=jpg&w=240)





![CUDA kernel#define N 10__global__ void add( int *a, int *b, int *c ) {int tid = blockIdx.x; // this thread handles the data at its thread idif (tid < N)c[tid] = a[tid] + b[tid];}100](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-100-2048.jpg&f=jpg&w=240)
![CUDA setupint a[N], b[N], c[N];int *dev_a, *dev_b, *dev_c;// allocate the memory on the GPUcudaMalloc( (void**)&dev_a, N * sizeof(int) );cudaMalloc( (void**)&dev_b, N * sizeof(int) );cudaMalloc( (void**)&dev_c, N * sizeof(int) );// fill the arrays 'a' and 'b' on the CPUfor (int i=0; i<N; i++) {a[i] = -i;b[i] = i * i;}101](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-101-2048.jpg&f=jpg&w=240)

![CUDA get results// display the resultsfor (int i=0; i<N; i++) {printf( "%d + %d = %dn", a[i], b[i], c[i] );}// free the memory allocated on the GPUcudaFree( dev_a );cudaFree( dev_b );cudaFree( dev_c );103](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-103-2048.jpg&f=jpg&w=240)

![For example we need a good randint n = 100;curandGenerator generator = new curandGenerator();float hostData[] = new float[n];Pointer deviceData = new Pointer();cudaMalloc(deviceData, n * Sizeof.FLOAT);curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);curandSetPseudoRandomGeneratorSeed(generator, 1234);curandGenerateUniform(generator, deviceData, n);cudaMemcpy(Pointer.to(hostData), deviceData,n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);System.out.println(Arrays.toString(hostData));curandDestroyGenerator(generator);cudaFree(deviceData);105](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-105-2048.jpg&f=jpg&w=240)



![Optimizations…__kernel void MatrixMul_kernel_basic(int dim,__global float *A,__global float *B,__global float *C){int iCol = get_global_id(0);int iRow = get_global_id(1);float result = 0.0;for(int i=0;i< dim;++i){result +=A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}109](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-109-2048.jpg&f=jpg&w=240)
![<—Optimizations#define VECTOR_SIZE 4__kernel void MatrixMul_kernel_basic_vector4(int dim,__global float4 *A,__global float4 *B,__global float *C)int localIdx = get_global_id(0);int localIdy = get_global_id(1);float result = 0.0;float4 Bvector[4];float4 Avector, temp;float4 resultVector[4] = {0,0,0,0};int rowElements = dim/VECTOR_SIZE;for(int i=0; i<rowElements; ++i){Avector = A[localIdy*rowElements + i];Bvector[0] = B[dim*i + localIdx];Bvector[1] = B[dim*i + rowElements + localIdx];Bvector[2] = B[dim*i + 2*rowElements + localIdx];Bvector[3] = B[dim*i + 3*rowElements + localIdx];temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);resultVector[0] += Avector * temp;temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);resultVector[1] += Avector * temp;temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);resultVector[2] += Avector * temp;temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);resultVector[3] += Avector * temp;}C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;} 110](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-110-2048.jpg&f=jpg&w=240)
![<—Optimizations#define VECTOR_SIZE 4__kernel void MatrixMul_kernel_basic_vector4(int dim,__global float4 *A,__global float4 *B,__global float *C)int localIdx = get_global_id(0);int localIdy = get_global_id(1);float result = 0.0;float4 Bvector[4];float4 Avector, temp;float4 resultVector[4] = {0,0,0,0};int rowElements = dim/VECTOR_SIZE;for(int i=0; i<rowElements; ++i){Avector = A[localIdy*rowElements + i];Bvector[0] = B[dim*i + localIdx];Bvector[1] = B[dim*i + rowElements + localIdx];Bvector[2] = B[dim*i + 2*rowElements + localIdx];Bvector[3] = B[dim*i + 3*rowElements + localIdx];temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);resultVector[0] += Avector * temp;temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);resultVector[1] += Avector * temp;temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);resultVector[2] += Avector * temp;temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);resultVector[3] += Avector * temp;}C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;} 111](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-111-2048.jpg&f=jpg&w=240)













![IBM patched JVM for GPUImagine:void fooJava(float A[], float B[], int n) {// similar to for (idx = 0; i < n; i++)IntStream.range(0, N).parallel().forEach(i -> { b[i] = a[i] * 2.0; });}125](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-125-2048.jpg&f=jpg&w=240)
![IBM patched JVM for GPUImagine:void fooJava(float A[], float B[], int n) {// similar to for (idx = 0; i < n; i++)IntStream.range(0, N).parallel().forEach(i -> { b[i] = a[i] * 2.0; });}… we would like the lambda to be automatically converted to GPU code…126](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-126-2048.jpg&f=jpg&w=240)
![IBM patched JVM for GPUWhen n is big the lambda code is executed on GPU:class Par {void foo(float[] a, float[] b, float[] c, int n) {IntStream.range(0, n).parallel().forEach(i -> {b[i] = a[i] * 2.0;c[i] = a[i] * 3.0;});}}*only lambdas with primitive types in one dimension arrays.127](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-127-2048.jpg&f=jpg&w=240)















![Aparapi – now its so much simple!public static void main(String[] _args) {final int size = 512;final float[] a = new float[size];final float[] b = new float[size];for (int i = 0; i < size; i++) {a[i] = (float) (Math.random() * 100);b[i] = (float) (Math.random() * 100);}final float[] sum = new float[size];Kernel kernel = new Kernel(){@Override public void run() {int gid = getGlobalId();sum[gid] = a[gid] + b[gid];}};kernel.execute(Range.create(size));for (int i = 0; i < size; i++) {System.out.printf("%6.2f + %6.2f = %8.2fn", a[i], b[i], sum[i]);}kernel.dispose();}143](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-143-2048.jpg&f=jpg&w=240)






















1. The document discusses the history and evolution of GPUs and GPGPU programming, describing how GPUs began as dedicated graphics cards and gained general-purpose programmability through programmable shaders.
2. It explains the key concepts of GPGPU programming, including the host/device model, the memory model, and the execution model, using concepts such as work-items, work-groups, and ND-ranges.
3. The document uses OpenCL as an example programming model, covering memory transfers between host and device, data types, and how a matrix-multiplication kernel can be implemented with the OpenCL execution model.


























































































![For example: matrix multiplication• We would write it like this:void MatrixMul_sequential(int dim, float *A, float *B, float *C) {for(int iRow=0; iRow<dim;++iRow) {for(int iCol=0; iCol<dim;++iCol) {float result = 0.f;for(int i=0; i<dim;++i) {result += A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}}}91](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-91-2048.jpg&f=jpg&w=240)

![For example: matrix multiplication• So on GPU:void MatrixMul_kernel_basic(int dim,__global float *A, __global float *B, __global float *C) {//Get the index of the work-itemint iCol = get_global_id(0);int iRow = get_global_id(1);float result = 0.0;for(int i=0;i< dim;++i) {result += A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}93](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-93-2048.jpg&f=jpg&w=240)
![For example: matrix multiplication• So on GPU:void MatrixMul_kernel_basic(int dim,__global float *A, __global float *B, __global float *C) {//Get the index of the work-itemint iCol = get_global_id(0);int iRow = get_global_id(1);float result = 0.0;for(int i=0;i< dim;++i) {result += A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}94](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-94-2048.jpg&f=jpg&w=240)





![CUDA kernel#define N 10__global__ void add( int *a, int *b, int *c ) {int tid = blockIdx.x; // this thread handles the data at its thread idif (tid < N)c[tid] = a[tid] + b[tid];}100](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-100-2048.jpg&f=jpg&w=240)
![CUDA setupint a[N], b[N], c[N];int *dev_a, *dev_b, *dev_c;// allocate the memory on the GPUcudaMalloc( (void**)&dev_a, N * sizeof(int) );cudaMalloc( (void**)&dev_b, N * sizeof(int) );cudaMalloc( (void**)&dev_c, N * sizeof(int) );// fill the arrays 'a' and 'b' on the CPUfor (int i=0; i<N; i++) {a[i] = -i;b[i] = i * i;}101](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-101-2048.jpg&f=jpg&w=240)

![CUDA get results// display the resultsfor (int i=0; i<N; i++) {printf( "%d + %d = %dn", a[i], b[i], c[i] );}// free the memory allocated on the GPUcudaFree( dev_a );cudaFree( dev_b );cudaFree( dev_c );103](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-103-2048.jpg&f=jpg&w=240)

![For example we need a good randint n = 100;curandGenerator generator = new curandGenerator();float hostData[] = new float[n];Pointer deviceData = new Pointer();cudaMalloc(deviceData, n * Sizeof.FLOAT);curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);curandSetPseudoRandomGeneratorSeed(generator, 1234);curandGenerateUniform(generator, deviceData, n);cudaMemcpy(Pointer.to(hostData), deviceData,n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);System.out.println(Arrays.toString(hostData));curandDestroyGenerator(generator);cudaFree(deviceData);105](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-105-2048.jpg&f=jpg&w=240)



![Optimizations…__kernel void MatrixMul_kernel_basic(int dim,__global float *A,__global float *B,__global float *C){int iCol = get_global_id(0);int iRow = get_global_id(1);float result = 0.0;for(int i=0;i< dim;++i){result +=A[iRow*dim + i]*B[i*dim + iCol];}C[iRow*dim + iCol] = result;}109](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-109-2048.jpg&f=jpg&w=240)
![<—Optimizations#define VECTOR_SIZE 4__kernel void MatrixMul_kernel_basic_vector4(int dim,__global float4 *A,__global float4 *B,__global float *C)int localIdx = get_global_id(0);int localIdy = get_global_id(1);float result = 0.0;float4 Bvector[4];float4 Avector, temp;float4 resultVector[4] = {0,0,0,0};int rowElements = dim/VECTOR_SIZE;for(int i=0; i<rowElements; ++i){Avector = A[localIdy*rowElements + i];Bvector[0] = B[dim*i + localIdx];Bvector[1] = B[dim*i + rowElements + localIdx];Bvector[2] = B[dim*i + 2*rowElements + localIdx];Bvector[3] = B[dim*i + 3*rowElements + localIdx];temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);resultVector[0] += Avector * temp;temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);resultVector[1] += Avector * temp;temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);resultVector[2] += Avector * temp;temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);resultVector[3] += Avector * temp;}C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;} 110](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-110-2048.jpg&f=jpg&w=240)
![<—Optimizations#define VECTOR_SIZE 4__kernel void MatrixMul_kernel_basic_vector4(int dim,__global float4 *A,__global float4 *B,__global float *C)int localIdx = get_global_id(0);int localIdy = get_global_id(1);float result = 0.0;float4 Bvector[4];float4 Avector, temp;float4 resultVector[4] = {0,0,0,0};int rowElements = dim/VECTOR_SIZE;for(int i=0; i<rowElements; ++i){Avector = A[localIdy*rowElements + i];Bvector[0] = B[dim*i + localIdx];Bvector[1] = B[dim*i + rowElements + localIdx];Bvector[2] = B[dim*i + 2*rowElements + localIdx];Bvector[3] = B[dim*i + 3*rowElements + localIdx];temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);resultVector[0] += Avector * temp;temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);resultVector[1] += Avector * temp;temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);resultVector[2] += Avector * temp;temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);resultVector[3] += Avector * temp;}C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;} 111](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-111-2048.jpg&f=jpg&w=240)













![IBM patched JVM for GPUImagine:void fooJava(float A[], float B[], int n) {// similar to for (idx = 0; i < n; i++)IntStream.range(0, N).parallel().forEach(i -> { b[i] = a[i] * 2.0; });}125](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-125-2048.jpg&f=jpg&w=240)
![IBM patched JVM for GPUImagine:void fooJava(float A[], float B[], int n) {// similar to for (idx = 0; i < n; i++)IntStream.range(0, N).parallel().forEach(i -> { b[i] = a[i] * 2.0; });}… we would like the lambda to be automatically converted to GPU code…126](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-126-2048.jpg&f=jpg&w=240)
![IBM patched JVM for GPUWhen n is big the lambda code is executed on GPU:class Par {void foo(float[] a, float[] b, float[] c, int n) {IntStream.range(0, n).parallel().forEach(i -> {b[i] = a[i] * 2.0;c[i] = a[i] * 3.0;});}}*only lambdas with primitive types in one dimension arrays.127](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-127-2048.jpg&f=jpg&w=240)















![Aparapi – now its so much simple!public static void main(String[] _args) {final int size = 512;final float[] a = new float[size];final float[] b = new float[size];for (int i = 0; i < size; i++) {a[i] = (float) (Math.random() * 100);b[i] = (float) (Math.random() * 100);}final float[] sum = new float[size];Kernel kernel = new Kernel(){@Override public void run() {int gid = getGlobalId();sum[gid] = a[gid] + b[gid];}};kernel.execute(Range.create(size));for (int i = 0; i < size; i++) {System.out.printf("%6.2f + %6.2f = %8.2fn", a[i], b[i], sum[i]);}kernel.dispose();}143](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fgpuandjava-171110065815%2f75%2fJava-on-the-GPU-Where-are-we-now-143-2048.jpg&f=jpg&w=240)




















