Java on the GPU: Where are we now?

1. The document discusses the history and evolution of GPUs and GPGPU programming. It describes how GPUs started as dedicated graphics cards but now have programmable capabilities through shaders.
2. It explains the key concepts of GPGPU, including the host/device model, memory models, and execution models, using concepts like work items, work groups, and ND ranges.
3. The document uses OpenCL as an example programming model, covering memory transfers between host and device, data types, and how a matrix multiplication kernel could be implemented in OpenCL using the execution model.

Java and GPU: where are we now? And why?
Dmitry Alexandrov, T-Systems | @bercut2000
What is a video card?
A video card (also called a display card, graphics card, display adapter or graphics adapter) is an expansion card which generates a feed of output images to a display (such as a computer monitor). Frequently, these are advertised as discrete or dedicated graphics cards, emphasizing the distinction between these and integrated graphics.

What is a video card?
But today: video cards are not limited to simple image output. They have a built-in graphics processor that can perform additional processing, taking this work off the computer's central processor.

So what does it do?
What is a GPU?
• Graphics Processing Unit
• The term was first used by Nvidia in 1999
• The GeForce 256 was marketed as «The world’s first GPU»
• Defined as a “single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines” capable of processing a minimum of 10 million polygons per second
• ATI called them VPUs (visual processing units)
Conceptually, it looks like this

GPGPU
• General-purpose computing on graphics processing units
• Performs not only graphics calculations...
• … but also those usually performed on the CPU

So cool! We have to use them!
Let’s look at the hardware!
Based on “From Shader Code to a Teraflop: How GPU Shader Cores Work” by Kayvon Fatahalian, Stanford University
The CPU in general looks like this
How to convert?
Let’s simplify!
Then let’s just clone them
To make a lot of them!
But we are doing the same calculation, just with different data
So we come to the SIMD (single instruction, multiple data) paradigm
So we use this paradigm
And here we start to talk about vectors...
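The "same calculation, different data" idea can be sketched in plain Java. This is only an illustration that runs as an ordinary CPU loop; on SIMD hardware the lanes would be processed together rather than one after another. The names `SimdSketch` and `saxpy` are made up for this example and do not come from the slides.

```java
import java.util.Arrays;

/** Toy illustration of SIMD: one operation applied to every lane of a vector. */
class SimdSketch {

    /** Applies the same multiply-add (a * x + y) to every element ("lane"). */
    static float[] saxpy(float a, float[] x, float[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        float[] out = new float[x.length];
        for (int lane = 0; lane < x.length; lane++) {
            // Identical instruction for every lane; only the data differs.
            out[lane] = a * x[lane] + y[lane];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] x = {1f, 2f, 3f, 4f};
        float[] y = {10f, 20f, 30f, 40f};
        System.out.println(Arrays.toString(saxpy(2f, x, y)));
    }
}
```

Because no lane depends on the result of another lane, the iterations can be executed in any order or all at once, which is what makes this pattern a good fit for a GPU.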
… and in the end we are here:
Nice! But how on earth can we code here?!
It all started with a shader
• Cool video cards were able to offload some of the tasks from the CPU
• But most of the algorithms were just “hardcoded”
• They were considered “standard”
• Developers could only call them

It all started with a shader
• But obviously, not everything can be done with “hardcoded” algorithms
• That’s why some of the vendors “opened access” for developers to run their own algorithms in their own programs
• These programs are called shaders
• From this moment on, the video card could work on transformations, geometry and textures the way developers wanted!

It all started with a shader
• The first shaders came in different kinds:
  • Vertex
  • Geometry
  • Pixel
• Then they were united into a common (unified) shader architecture
There are several shader languages
• RenderMan
• OSL
• GLSL
• Cg
• DirectX ASM
• HLSL
• …

As an example:
With or without them
But they are so low level...
Keeping in mind that it all started with gaming…
Several abstractions were created:
• OpenGL
  • A cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. The API is typically used to interact with a graphics processing unit (GPU) to achieve hardware-accelerated rendering.
  • Silicon Graphics Inc. (SGI) started developing OpenGL in 1991 and released it in January 1992.
• DirectX
  • A collection of application programming interfaces (APIs) for handling tasks related to multimedia, especially game programming and video, on Microsoft platforms. Originally, the names of these APIs all began with Direct, such as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The name DirectX was coined as a shorthand term for all of these APIs (the X standing in for the particular API names) and soon became the name of the collection.
By the way, what about Java?

OpenGL in Java
• JSR 231
• Started in 2003
• Latest release in 2008
• Supports OpenGL 2.0

OpenGL
• Now an independent project: JOGL
• Supports OpenGL up to 4.5
• Provides support for GLU and GLUT
• Accesses the low-level C API via JNI

But somewhere around 2005 it was finally realized that this can be used for general computations as well
BrookGPU
• An early GPGPU effort
• Its own subset of ANSI C
• Brook streaming language
• Made at Stanford University

GPGPU
• CUDA – Nvidia’s proprietary platform, programmed in a C-based language
• DirectCompute – Microsoft’s proprietary compute shader API, part of Direct3D, introduced with DirectX 11 (it also runs on DirectX 10 class hardware)
• AMD FireStream – ATI proprietary technology
• OpenACC – a multivendor consortium standard
• C++ AMP – Microsoft proprietary language
• OpenCL – a common standard controlled by the Khronos Group
Why should we ever use the GPU from Java?
• Why Java
  • Safe and secure
  • Portability (“write once, run anywhere”)
  • Used on 3 billion devices
• Where can we apply the GPU
  • Data analytics and data science (Hadoop, Spark, …)
  • Security analytics (log processing)
  • Finance/banking
For this we have:
But Java runs on the JVM... and here we need something low level...

For low level we use:
• JNI (Java Native Interface)
• JNA (Java Native Access)

But we can go crazy there...
Someone actually did this…
But maybe something has been done already?

For OpenCL:
• JOCL
• JogAmp
• JavaCL (no longer supported)

... and for CUDA
• JCuda
• JCublas
• JCufft
• JCurand
• JCusparse
• JCusolver
• JNvgraph
• JCudpp
• JNpp
• JCudnn

Disclaimer: it’s hard to work with the GPU!
• It’s not just running a program
• You need to know your hardware!
• It’s low level...
Let’s start with OpenCL:

What’s that?
• Short for Open Computing Language
• Backed by a consortium of Apple, nVidia, AMD, IBM, Intel, ARM, Motorola and many more
• Very abstract model
• Works on both GPU and CPU

Should work on everything

All in all it works like this:
• The HOST sends the data and the program/kernel to the DEVICE
• The DEVICE sends the result back to the HOST

Typical lifecycle of an OpenCL app
• Create context
• Create command queue
• Create memory buffers/fill with data
• Create program from sources/load binaries
• Compile (if required)
• Create kernel from the program
• Supply kernel arguments
• Define ND range
• Execute
• Return resulting data
• Release resources
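To make the order of these steps concrete, here is a plain-Java mock of the flow. It makes no real OpenCL calls: the "device" is just a second set of arrays, the "kernel" is a lambda, and the "ND range" is a parallel index stream. All names (`LifecycleSketch`, `scaleOnDevice`) are invented for this sketch; a real host program would use a binding such as JOCL instead.

```java
import java.util.Arrays;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

/** Plain-Java mock of the OpenCL host/device flow; no real OpenCL calls are made. */
class LifecycleSketch {

    static float[] scaleOnDevice(float[] hostInput, float factor) {
        // "Create memory buffers / fill with data": copy host data into separate "device" arrays.
        float[] deviceIn = Arrays.copyOf(hostInput, hostInput.length);
        float[] deviceOut = new float[hostInput.length];

        // "Create kernel" and "supply kernel arguments": the lambda captures its buffers.
        IntConsumer kernel = gid -> deviceOut[gid] = deviceIn[gid] * factor;

        // "Define ND range" and "execute": one work-item per element, in no guaranteed order.
        IntStream.range(0, hostInput.length).parallel().forEach(kernel);

        // "Release resources" is left to the garbage collector in this mock.
        // "Return resulting data": copy the result back into host memory.
        return Arrays.copyOf(deviceOut, deviceOut.length);
    }

    public static void main(String[] args) {
        float[] result = scaleOnDevice(new float[]{1f, 2f, 3f}, 2f);
        System.out.println(Arrays.toString(result));
    }
}
```

The two explicit copies are the point of the sketch: in the real model the host and the device do not share the arrays, so data has to be moved in both directions around every kernel run.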
Better take a look
1. There is the host code. It’s in Java.
2. There is the device code. It’s written in a specific subset of C.
3. Communication between the host and the device is done via memory buffers.
So what can we actually transfer?
The data is not quite the same...
Datatypes: scalars
Datatypes: vectors

Datatypes: vectors
    float f = 4.0f;
    float3 f3 = (float3)(1.0f, 2.0f, 3.0f);
    float4 f4 = (float4)(f3, f);
    // f4.x = 1.0f,
    // f4.y = 2.0f,
    // f4.z = 3.0f,
    // f4.w = 4.0f
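Java has no built-in counterpart to these vector types, so host code usually passes flat `float[]` arrays. The composition rule from the slide can still be mimicked with small value classes. This is a sketch with invented names (`VectorTypesSketch`, `Float3`, `Float4`), not part of any OpenCL binding.

```java
/** Plain-Java stand-ins for OpenCL's float3/float4; illustrative only. */
class VectorTypesSketch {

    static final class Float3 {
        final float x, y, z;
        Float3(float x, float y, float z) { this.x = x; this.y = y; this.z = z; }
    }

    static final class Float4 {
        final float x, y, z, w;
        Float4(float x, float y, float z, float w) {
            this.x = x; this.y = y; this.z = z; this.w = w;
        }

        /** Mirrors (float4)(f3, f): three components from the vector, the fourth from the scalar. */
        static Float4 of(Float3 v, float w) {
            return new Float4(v.x, v.y, v.z, w);
        }

        /** Flattens the vector into the layout a host-side float[] buffer would carry. */
        float[] toArray() {
            return new float[]{x, y, z, w};
        }
    }

    public static void main(String[] args) {
        Float4 f4 = Float4.of(new Float3(1.0f, 2.0f, 3.0f), 4.0f);
        System.out.println(java.util.Arrays.toString(f4.toArray()));
    }
}
```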
So how are they stored there? In a hard way...

Memory Model
• __global
• __constant
• __local
• __private

But that’s not all
Remember SIMD?

Execution model
• We’ve got a lot of data
• We need to perform the same computations over them
• So we can just shard them
• OpenCL is here to help us

ND Range – what is that?
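As a rough answer: an ND range is the index space of all work-items, split into equally sized work-groups. In one dimension, and ignoring any global offset, a work-item's global id is its group id times the group size plus its local id. The sketch below only does this arithmetic in plain Java; `NdRangeSketch` and its methods are illustrative names, and the divisibility check reflects the OpenCL 1.x rule that the global size must be a multiple of the work-group size.

```java
/** Plain-Java sketch of how a 1-D ND range is split into work-groups and work-items. */
class NdRangeSketch {

    /** Global id of a work-item, given its group id, the group size and its local id. */
    static int globalId(int groupId, int localSize, int localId) {
        return groupId * localSize + localId;
    }

    /** Number of work-groups needed to cover the whole range. */
    static int numGroups(int globalSize, int localSize) {
        if (globalSize % localSize != 0) {
            throw new IllegalArgumentException("global size must be a multiple of local size");
        }
        return globalSize / localSize;
    }

    public static void main(String[] args) {
        int globalSize = 8;
        int localSize = 4;
        for (int group = 0; group < numGroups(globalSize, localSize); group++) {
            for (int local = 0; local < localSize; local++) {
                System.out.println("group " + group + ", local " + local
                        + " -> global " + globalId(group, localSize, local));
            }
        }
    }
}
```

In two or three dimensions the same rule applies independently per dimension, which is why a kernel can ask for `get_global_id(0)` and `get_global_id(1)` separately.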
For example: matrix multiplication
• We would write it like this:
    void MatrixMul_sequential(int dim, float *A, float *B, float *C) {
        for (int iRow = 0; iRow < dim; ++iRow) {
            for (int iCol = 0; iCol < dim; ++iCol) {
                float result = 0.f;
                for (int i = 0; i < dim; ++i) {
                    result += A[iRow*dim + i] * B[i*dim + iCol];
                }
                C[iRow*dim + iCol] = result;
            }
        }
    }

For example: matrix multiplication

For example: matrix multiplication
• So on GPU:
    __kernel void MatrixMul_kernel_basic(int dim,
            __global float *A, __global float *B, __global float *C) {
        // Get the index of the work-item
        int iCol = get_global_id(0);
        int iRow = get_global_id(1);
        float result = 0.0f;
        for (int i = 0; i < dim; ++i) {
            result += A[iRow*dim + i] * B[i*dim + iCol];
        }
        C[iRow*dim + iCol] = result;
    }
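The step from the sequential version to the kernel is that the two outer loops disappear: each work-item computes exactly one output element, and the ND range supplies the indices. A CPU-only Java imitation of that idea is sketched below. `MatMulSketch` is an invented name, and the parallel stream merely stands in for a 2-D ND range; no GPU is involved.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** CPU-only imitation of the kernel above: one "work-item" per output element. */
class MatMulSketch {

    /** Kernel body: computes the single element C[iRow][iCol]. */
    static void kernel(int iCol, int iRow, int dim, float[] a, float[] b, float[] c) {
        float result = 0.0f;
        for (int i = 0; i < dim; ++i) {
            result += a[iRow * dim + i] * b[i * dim + iCol];
        }
        c[iRow * dim + iCol] = result;
    }

    /** "Launches" dim x dim work-items; a flat id is unpacked into (column, row). */
    static float[] multiply(int dim, float[] a, float[] b) {
        float[] c = new float[dim * dim];
        IntStream.range(0, dim * dim).parallel()
                 .forEach(gid -> kernel(gid % dim, gid / dim, dim, a, b, c));
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f, 4f};   // row-major 2x2 matrix
        float[] b = {5f, 6f, 7f, 8f};
        System.out.println(Arrays.toString(multiply(2, a, b)));
    }
}
```

Every work-item writes a different element of `c` and only reads `a` and `b`, so no synchronization is needed between them.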
Typical GPU
--- Info for device GeForce GT 650M: ---
CL_DEVICE_NAME: GeForce GT 650M
CL_DEVICE_VENDOR: NVIDIA
CL_DRIVER_VERSION: 10.14.20 355.10.05.15f03
CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
CL_DEVICE_MAX_COMPUTE_UNITS: 2
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 900 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: local
CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
CL_DEVICE_2D_MAX_WIDTH 16384
CL_DEVICE_2D_MAX_HEIGHT 16384
CL_DEVICE_3D_MAX_WIDTH 2048
CL_DEVICE_3D_MAX_HEIGHT 2048
CL_DEVICE_3D_MAX_DEPTH 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
Typical CPU
--- Info for device Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: ---
CL_DEVICE_NAME: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
CL_DEVICE_VENDOR: Intel
CL_DRIVER_VERSION: 1.1
CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
CL_DEVICE_MAX_COMPUTE_UNITS: 8
CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1
CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
CL_DEVICE_MAX_CLOCK_FREQUENCY: 2600 MHz
CL_DEVICE_ADDRESS_BITS: 64
CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2048 MByte
CL_DEVICE_GLOBAL_MEM_SIZE: 8192 MByte
CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
CL_DEVICE_LOCAL_MEM_TYPE: global
CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
CL_DEVICE_IMAGE_SUPPORT: 1
CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
CL_DEVICE_2D_MAX_WIDTH 8192
CL_DEVICE_2D_MAX_HEIGHT 8192
CL_DEVICE_3D_MAX_WIDTH 2048
CL_DEVICE_3D_MAX_HEIGHT 2048
CL_DEVICE_3D_MAX_DEPTH 2048
CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
And what about CUDA?
Well... It looks to be easier for C developers…

CUDA kernel
    #define N 10
    __global__ void add(int *a, int *b, int *c) {
        // each block has a single thread, so the block index selects the element
        int tid = blockIdx.x;
        if (tid < N)
            c[tid] = a[tid] + b[tid];
    }

CUDA setup
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;
    // allocate the memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_c, N * sizeof(int));
    // fill the arrays 'a' and 'b' on the CPU
    for (int i = 0; i < N; i++) {
        a[i] = -i;
        b[i] = i * i;
    }

CUDA copy to memory and run
    // copy the arrays 'a' and 'b' to the GPU
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
    // launch N blocks of one thread each
    add<<<N,1>>>(dev_a, dev_b, dev_c);
    // copy the array 'c' back from the GPU to the CPU
    cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

CUDA get results
    // display the results
    for (int i = 0; i < N; i++) {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }
    // free the memory allocated on the GPU
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
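The launch `add<<<N,1>>>` uses N blocks with one thread each, which is why the kernel can index by `blockIdx.x` alone. With more than one thread per block the usual index is `blockIdx.x * blockDim.x + threadIdx.x`, and the `if (tid < N)` guard starts to matter because the grid is rounded up to whole blocks. The following plain-Java loop imitates such a launch on the CPU. `CudaLaunchSketch`, `launchAdd` and `blocksFor` are invented for this illustration and are not CUDA or JCuda API.

```java
import java.util.Arrays;

/** CPU imitation of a 1-D CUDA launch: nested loops play the roles of blocks and threads. */
class CudaLaunchSketch {

    /** Smallest number of blocks of size blockDim that covers n elements. */
    static int blocksFor(int n, int blockDim) {
        return (n + blockDim - 1) / blockDim;
    }

    static int[] launchAdd(int gridDim, int blockDim, int[] a, int[] b) {
        int n = a.length;
        int[] c = new int[n];
        for (int blockIdx = 0; blockIdx < gridDim; blockIdx++) {
            for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) {
                int tid = blockIdx * blockDim + threadIdx;
                // Guard: the last block may contain threads past the end of the data.
                if (tid < n) {
                    c[tid] = a[tid] + b[tid];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 10;
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
        int blockDim = 4;
        System.out.println(Arrays.toString(launchAdd(blocksFor(n, blockDim), blockDim, a, b)));
    }
}
```

With 10 elements and blocks of 4 threads, three blocks are launched and the last two threads of the third block do nothing.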
But CUDA has some other superpowers
• JCublas – all about matrices
• JCufft – fast Fourier transform
• JCurand – all about random numbers
• JCusparse – sparse matrices
• JCusolver – factorization and some other crazy stuff
• JNvgraph – all about graphs
• JCudpp – CUDA Data Parallel Primitives Library, and some sorting
• JNpp – image processing on the GPU
• JCudnn – Deep Neural Network library (that’s scary)

For example, we need a good random number generator
    int n = 100;
    curandGenerator generator = new curandGenerator();
    float hostData[] = new float[n];
    Pointer deviceData = new Pointer();
    cudaMalloc(deviceData, n * Sizeof.FLOAT);
    curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(generator, 1234);
    curandGenerateUniform(generator, deviceData, n);
    cudaMemcpy(Pointer.to(hostData), deviceData,
            n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
    System.out.println(Arrays.toString(hostData));
    curandDestroyGenerator(generator);
    cudaFree(deviceData);

For example, we need a good random number generator
• With strong theory underneath
• Developed by the Russian mathematician Ilya Sobol back in 1967
• https://en.wikipedia.org/wiki/Sobol_sequence
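To give a feel for what a quasi-random (low-discrepancy) sequence looks like, the sketch below generates the base-2 van der Corput sequence in plain Java. This is a swapped-in, simpler technique and not the cuRAND or JCurand Sobol generator: it is the classic one-dimensional construction that Sobol sequences generalize to many dimensions, and an actual Sobol implementation may visit its points in a different order. `LowDiscrepancySketch` is an invented name.

```java
/** Base-2 van der Corput sequence: a simple one-dimensional low-discrepancy sequence. */
class LowDiscrepancySketch {

    /** Radical inverse in base 2: mirrors the binary digits of i around the binary point. */
    static double vanDerCorput(int i) {
        double result = 0.0;
        double weight = 0.5;          // value of the first digit after the binary point
        while (i > 0) {
            if ((i & 1) == 1) {
                result += weight;
            }
            weight /= 2;
            i >>>= 1;
        }
        return result;
    }

    public static void main(String[] args) {
        // Successive points keep subdividing the largest remaining gaps in [0, 1).
        for (int i = 1; i <= 8; i++) {
            System.out.println(i + " -> " + vanDerCorput(i));
        }
    }
}
```

Unlike pseudo-random numbers, these points are deterministic and deliberately spread out, which is the property that makes such sequences attractive for numerical integration.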
nVidia memory looks like this
Btw... Talking about memory (© Wikipedia)

Optimizations…
    __kernel void MatrixMul_kernel_basic(int dim,
            __global float *A,
            __global float *B,
            __global float *C)
    {
        int iCol = get_global_id(0);
        int iRow = get_global_id(1);
        float result = 0.0;
        for (int i = 0; i < dim; ++i)
        {
            result += A[iRow*dim + i] * B[i*dim + iCol];
        }
        C[iRow*dim + iCol] = result;
    }

<- Optimizations
    #define VECTOR_SIZE 4
    __kernel void MatrixMul_kernel_basic_vector4(int dim,
            __global float4 *A,
            __global float4 *B,
            __global float *C)
    {
        int localIdx = get_global_id(0);
        int localIdy = get_global_id(1);
        float result = 0.0;
        float4 Bvector[4];
        float4 Avector, temp;
        float4 resultVector[4] = {0,0,0,0};
        int rowElements = dim/VECTOR_SIZE;
        for (int i = 0; i < rowElements; ++i)
        {
            Avector = A[localIdy*rowElements + i];
            Bvector[0] = B[dim*i + localIdx];
            Bvector[1] = B[dim*i + rowElements + localIdx];
            Bvector[2] = B[dim*i + 2*rowElements + localIdx];
            Bvector[3] = B[dim*i + 3*rowElements + localIdx];
            temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);
            resultVector[0] += Avector * temp;
            temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);
            resultVector[1] += Avector * temp;
            temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);
            resultVector[2] += Avector * temp;
            temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);
            resultVector[3] += Avector * temp;
        }
        C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;
        C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;
        C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;
        C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;
    }
But we don’t want to have C at all…
We don’t want to think about those hosts and devices…
We can use the GPU partially...

Project Sumatra
• Research project
• Focused on Java 8
• … to be more precise, on streams
• … and even more precisely, on lambdas and .forEach()

AMD HSAIL
• Detects the forEach() block
• Gets HSAIL code with Graal
• At the low level, supplies the kernel generated from the lambda to the GPU

AMD APU tries to solve the main issue... (© Wikipedia)
But if we want some more general solution...

IBM patched JVM for GPU
• Focused on CUDA (for now)
• Focused on the Stream API
• Created their own .parallel()
IBM patched JVM for GPU
Imagine:
    void fooJava(float[] a, float[] b, int n) {
        // similar to for (int i = 0; i < n; i++)
        IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
    }
… we would like the lambda to be automatically converted to GPU code…

IBM patched JVM for GPU
When n is big, the lambda code is executed on the GPU:
    class Par {
        void foo(float[] a, float[] b, float[] c, int n) {
            IntStream.range(0, n).parallel().forEach(i -> {
                b[i] = a[i] * 2.0f;
                c[i] = a[i] * 3.0f;
            });
        }
    }
* Only lambdas over one-dimensional arrays of primitive types.
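The same pattern compiles and runs on a stock JDK, where `parallel()` simply spreads the indices over CPU threads in the common fork/join pool. The self-contained version below is a sketch of that baseline (the class name `ParallelLambdaSketch` is made up); whether and when a GPU is used is entirely up to the patched JVM described on the slides.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

/** Stock-JDK version of the slide's lambda; runs on CPU threads, not on a GPU. */
class ParallelLambdaSketch {

    static void scale(float[] a, float[] b, float[] c) {
        // Each index touches only its own slots of b and c, so iterations are independent.
        IntStream.range(0, a.length).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f};
        float[] b = new float[a.length];
        float[] c = new float[a.length];
        scale(a, b, c);
        System.out.println(Arrays.toString(b) + " " + Arrays.toString(c));
    }
}
```

The independence of the iterations is what makes such a lambda a candidate for offloading: it has the same shape as a kernel with one work-item per index.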
IBM patched JVM for GPU
Optimized IBM JIT compiler:
• Use the read-only cache
  • Fewer writes to global GPU memory
• Optimized host-to-device data copy rate
  • Less data to be copied
• Eliminate exceptions as much as possible
  • In the GPU kernel

IBM patched JVM for GPU
• Success story:

IBM patched JVM for GPU
• Officially:

IBM patched JVM for GPU
• More info: https://github.com/IBMSparkGPU/GPUEnabler

But can we just write in Java and have it converted to OpenCL/CUDA?
Yes, you can!
Aparapi is there for you!

Aparapi
• Short for «A PARallel API»
• Works like Hibernate for databases
• Dynamically converts JVM bytecode to code for host and device
• OpenCL under the covers

Aparapi
• Started by AMD
• Then abandoned…
• Five years later, open-sourced under the Apache 2.0 license
• Back to life!!!
Aparapi – now it’s so much simpler!
    public static void main(String[] _args) {
        final int size = 512;
        final float[] a = new float[size];
        final float[] b = new float[size];
        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }
        final float[] sum = new float[size];
        // the body of run() is what gets translated and executed per work-item
        Kernel kernel = new Kernel() {
            @Override
            public void run() {
                int gid = getGlobalId();
                sum[gid] = a[gid] + b[gid];
            }
        };
        kernel.execute(Range.create(size));
        for (int i = 0; i < size; i++) {
            System.out.printf("%6.2f + %6.2f = %8.2f\n", a[i], b[i], sum[i]);
        }
        kernel.dispose();
    }
But what about the clouds?
We can’t sell our product if it’s not cloud native!
nVidia is your friend!

nVidia GRID
• Announced in 2012
• Already in production
• Works on most hypervisors
• ... and in the clouds!

… AMD is a bit behind…
Anyway, it’s here!
It’s here: Nvidia GPU
It’s here: ATI Radeon
It’s here: AMD APU
It’s here: Intel Skylake
It’s here: Nvidia Tegra Parker
Intel with VEGA??
But first read:
So use it! If the task is suitable
… it’s hard, but worth it!
You will rule ’em all!
Thanks! Dank je! Merci beaucoup!
164

Recommended

PDF
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
PDF
GPU Virtualization on VMware's Hosted I/O Architecture
PDF
0xdroid -- community-developed Android distribution by 0xlab
PDF
Newbie’s guide to_the_gpgpu_universe
PDF
The GPGPU Continuum
PDF
[02][cuda c 프로그래밍 소개] gateau intro to_cuda_c
PDF
PyKinect: Body Iteration Application Development Using Python
PDF
[Ubucon Europe 2018] Introduction to mesa, the open-source graphics API imple...
PDF
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
KEY
Using Smalltalk for controlling robotics systems
PDF
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
PPTX
Hands on OpenCL
PDF
Kinect Hacks for Dummies
PDF
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
PDF
Introduction to OpenCL, 2010
PDF
You Can’t Do That With Smalltalk!
 
PDF
OpenCL - The Open Standard for Heterogeneous Parallel Programming
PDF
GPU Ecosystem
PPTX
Masked Occlusion Culling
PDF
Perceptual Computing Workshop à Paris
PDF
Ultra HD Video Scaling: Low-Power HW FF vs. CNN-based Super-Resolution
PPT
nodebots presentation @seekjobs
PDF
Perceptual Computing Workshop in Munich
PDF
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
PPTX
Optimizing Total War*: WARHAMMER II
PPTX
Getting Space Pirate Trainer* to Perform on Intel® Graphics
PDF
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
PDF
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
PDF
HPC DAY 2017 | HPE Strategy And Portfolio for AI, BigData and HPC
PDF
HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability

More Related Content

PDF
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
PDF
GPU Virtualization on VMware's Hosted I/O Architecture
PDF
0xdroid -- community-developed Android distribution by 0xlab
PDF
Newbie’s guide to_the_gpgpu_universe
PDF
The GPGPU Continuum
PDF
[02][cuda c 프로그래밍 소개] gateau intro to_cuda_c
PDF
PyKinect: Body Iteration Application Development Using Python
PDF
[Ubucon Europe 2018] Introduction to mesa, the open-source graphics API imple...
WT-4071, GPU accelerated 3D graphics for Java, by Kevin Rushforth, Chien Yang...
GPU Virtualization on VMware's Hosted I/O Architecture
0xdroid -- community-developed Android distribution by 0xlab
Newbie’s guide to_the_gpgpu_universe
The GPGPU Continuum
[02][cuda c 프로그래밍 소개] gateau intro to_cuda_c
PyKinect: Body Iteration Application Development Using Python
[Ubucon Europe 2018] Introduction to mesa, the open-source graphics API imple...

What's hot

PDF
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
KEY
Using Smalltalk for controlling robotics systems
PDF
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
PPTX
Hands on OpenCL
PDF
Kinect Hacks for Dummies
PDF
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
PDF
Introduction to OpenCL, 2010
PDF
You Can’t Do That With Smalltalk!
 
PDF
OpenCL - The Open Standard for Heterogeneous Parallel Programming
PDF
GPU Ecosystem
PPTX
Masked Occlusion Culling
PDF
Perceptual Computing Workshop à Paris
PDF
Ultra HD Video Scaling: Low-Power HW FF vs. CNN-based Super-Resolution
PPT
nodebots presentation @seekjobs
PDF
Perceptual Computing Workshop in Munich
PDF
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
PPTX
Optimizing Total War*: WARHAMMER II
PPTX
Getting Space Pirate Trainer* to Perform on Intel® Graphics
PDF
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
PDF
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
Using Smalltalk for controlling robotics systems
[Harvard CS264] 06 - CUDA Ninja Tricks: GPU Scripting, Meta-programming & Aut...
 
Hands on OpenCL
Kinect Hacks for Dummies
[Harvard CS264] 10a - Easy, Effective, Efficient: GPU Programming in Python w...
 
Introduction to OpenCL, 2010
You Can’t Do That With Smalltalk!
 
OpenCL - The Open Standard for Heterogeneous Parallel Programming
GPU Ecosystem
Masked Occlusion Culling
Perceptual Computing Workshop à Paris
Ultra HD Video Scaling: Low-Power HW FF vs. CNN-based Super-Resolution
nodebots presentation @seekjobs
Perceptual Computing Workshop in Munich
[Harvard CS264] 02 - Parallel Thinking, Architecture, Theory & Patterns
 
Optimizing Total War*: WARHAMMER II
Getting Space Pirate Trainer* to Perform on Intel® Graphics
CC-4001, Aparapi and HSA: Easing the developer path to APU/GPU accelerated Ja...
Keynote (Nandini Ramani) - The Role of Java in Heterogeneous Computing & How ...

Viewers also liked

PDF
HPC DAY 2017 | HPE Strategy And Portfolio for AI, BigData and HPC
PDF
HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability
PDF
Latency tracing in distributed Java applications
PDF
Libnetwork updates
PDF
Model Simulation, Graphical Animation, and Omniscient Debugging with EcoreToo...
PDF
HPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC Computing
PDF
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
PDF
Raspberry home server
PDF
Database Security Threats - MariaDB Security Best Practices
PDF
HPC DAY 2017 | Prometheus - energy efficient supercomputing
PDF
LinuxKit and OpenOverlay
PDF
HPC DAY 2017 | HPE Storage and Data Management for Big Data
PDF
GPU databases - How to use them and what the future holds
PDF
Design patterns in Java - Monitis 2017
PDF
Getting Started with Embedded Python: MicroPython and CircuitPython
PPTX
An Introduction to OMNeT++ 5.1
PPT
Drive into calico architecture
PDF
세션1. block chain as a platform
PDF
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...
HPC DAY 2017 | HPE Strategy And Portfolio for AI, BigData and HPC
HPC DAY 2017 | NVIDIA Volta Architecture. Performance. Efficiency. Availability
Latency tracing in distributed Java applications
Libnetwork updates
Model Simulation, Graphical Animation, and Omniscient Debugging with EcoreToo...
HPC DAY 2017 | Altair's PBS Pro: Your Gateway to HPC Computing
HPC DAY 2017 | Accelerating tomorrow's HPC and AI workflows with Intel Archit...
Raspberry home server
Database Security Threats - MariaDB Security Best Practices
HPC DAY 2017 | Prometheus - energy efficient supercomputing
LinuxKit and OpenOverlay
HPC DAY 2017 | HPE Storage and Data Management for Big Data
GPU databases - How to use them and what the future holds
Design patterns in Java - Monitis 2017
Getting Started with Embedded Python: MicroPython and CircuitPython
An Introduction to OMNeT++ 5.1
Drive into calico architecture
세션1. block chain as a platform
Scylla Summit 2017: Repair, Backup, Restore: Last Thing Before You Go to Prod...

Similar to Java on the GPU: Where are we now?

PDF
GPU Programming with Java
PPTX
GPU Computing: A brief overview
PPTX
SeaJUG 5 15-2018
PDF
VisionizeBeforeVisulaize_IEVC_Final
PDF
Low Level Graphics & OpenGL
PPT
Cuda intro
PPT
Advanced Graphics Workshop - GFX2011
PPT
CS 354 Introduction
PPT
NVIDIA CUDA
PPT
01 first
PPTX
Graphics processing uni computer archiecture
PPTX
2D graphics
PDF
Compute API –Past & Future
PPTX
Cgp lecture2 graphics_standard, opengl
PPTX
Slideshare
PPTX
GPU in Computer Science advance topic .pptx
PPTX
Graphics pipelining
PPTX
Computer Graphics
PDF
Managed DirectX
 
DOCX
Ha4 displaying 3 d polygon animations
GPU Programming with Java
GPU Computing: A brief overview
SeaJUG 5 15-2018
VisionizeBeforeVisulaize_IEVC_Final
Low Level Graphics & OpenGL
Cuda intro
Advanced Graphics Workshop - GFX2011
CS 354 Introduction
NVIDIA CUDA
01 first
Graphics processing uni computer archiecture
2D graphics
Compute API –Past & Future
Cgp lecture2 graphics_standard, opengl
Slideshare
GPU in Computer Science advance topic .pptx
Graphics pipelining
Computer Graphics
Managed DirectX
 
Ha4 displaying 3 d polygon animations

Recently uploaded

PDF
Java on the GPU: Where are we now?

  • 2.
    Java and GPU: where are we now? And why?
  • 3.
  • 4.
  • 5.
  • 6.
    What is a video card?
    A video card (also called a display card, graphics card, display adapter or graphics adapter) is an expansion card which generates a feed of output images to a display (such as a computer monitor). Frequently, these are advertised as discrete or dedicated graphics cards, emphasizing the distinction between these and integrated graphics.
  • 7.
    What is a video card?
    But today: video cards are not limited to simple image output. They have a built-in graphics processor that can perform additional processing, taking this work off the computer's central processor.
  • 8.
  • 9.
  • 10.
    What is a GPU?
    • Graphics Processing Unit
  • 11.
    What is a GPU?
    • Graphics Processing Unit
    • Term first used by Nvidia in 1999
  • 12.
    What is a GPU?
    • Graphics Processing Unit
    • Term first used by Nvidia in 1999
    • The GeForce 256 was marketed as "the world's first GPU"
  • 13.
    What is a GPU?
    • Defined as a "single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"
  • 14.
    What is a GPU?
    • Defined as a "single-chip processor with integrated transform, lighting, triangle setup/clipping, and rendering engines that is capable of processing a minimum of 10 million polygons per second"
    • ATI called them VPUs (visual processing units)..
  • 15.
    In principle it looks like this:
  • 16.
    GPGPU
    • General-purpose computing on graphics processing units
  • 17.
    GPGPU
    • General-purpose computing on graphics processing units
    • Performs not only graphics calculations..
  • 18.
    GPGPU
    • General-purpose computing on graphics processing units
    • Performs not only graphics calculations..
    • … but also those usually performed on the CPU
  • 19.
    So cool! We have to use them!
  • 20.
    Let’s look at the hardware!
    Based on “From Shader Code to a Teraflop: How GPU Shader Cores Work” by Kayvon Fatahalian, Stanford University
  • 21.
    The CPU in general looks like this:
  • 22.
  • 23.
  • 24.
    Then let’s just clone them
  • 25.
    To make a lot of them!
  • 26.
    But we are doing the same calculation, just with different data
  • 27.
    So we come to the SIMD (single instruction, multiple data) paradigm
  • 28.
    So we use this paradigm
  • 29.
    And here we start to talk about vectors..
  • 30.
    … and in the end we are here:
  • 31.
    Nice! But how on earth can we write code for this?!
  • 32.
    It all started with a shader
    • Cool video cards were able to offload some of the tasks from the CPU
  • 33.
    It all started with a shader
    • Cool video cards were able to offload some of the tasks from the CPU
    • But most of the algorithms were simply “hardcoded”
  • 34.
    It all started with a shader
    • Cool video cards were able to offload some of the tasks from the CPU
    • But most of the algorithms were simply “hardcoded”
    • They were considered “standard”
  • 35.
    It all started with a shader
    • Cool video cards were able to offload some of the tasks from the CPU
    • But most of the algorithms were simply “hardcoded”
    • They were considered “standard”
    • Developers could only call them
  • 36.
    It all started with a shader
    • But obviously, not everything can be done with “hardcoded” algorithms
  • 37.
    It all started with a shader
    • But obviously, not everything can be done with “hardcoded” algorithms
    • That’s why some vendors “opened access” so developers could run their own algorithms in their own programs
  • 38.
    It all started with a shader
    • But obviously, not everything can be done with “hardcoded” algorithms
    • That’s why some vendors “opened access” so developers could run their own algorithms in their own programs
    • These programs are called shaders
  • 39.
    It all started with a shader
    • But obviously, not everything can be done with “hardcoded” algorithms
    • That’s why some vendors “opened access” so developers could run their own algorithms in their own programs
    • These programs are called shaders
    • From this moment on, a video card could process transformations, geometry and textures the way developers wanted!
  • 40.
    It all started with a shader
    • The first shaders came in different kinds:
      • Vertex
      • Geometry
      • Pixel
    • Then they were merged into a unified shader architecture
  • 41.
    There are several shader languages
    • RenderMan
    • OSL
    • GLSL
    • Cg
    • DirectX ASM
    • HLSL
    • …
  • 42.
  • 43.
  • 44.
    But they are so low level..
  • 45.
    Keeping in mind that it all started with gaming…
  • 46.
    Several abstractions were created:
    • OpenGL
      • is a cross-language, cross-platform application programming interface (API) for rendering 2D and 3D vector graphics. The API is typically used to interact with a graphics processing unit (GPU), to achieve hardware-accelerated rendering.
      • Silicon Graphics Inc. (SGI) started developing OpenGL in 1991 and released it in January 1992.
    • DirectX
      • is a collection of application programming interfaces (APIs) for handling tasks related to multimedia, especially game programming and video, on Microsoft platforms. Originally, the names of these APIs all began with Direct, such as Direct3D, DirectDraw, DirectMusic, DirectPlay, DirectSound, and so forth. The name DirectX was coined as a shorthand term for all of these APIs (the X standing in for the particular API names) and soon became the name of the collection.
  • 47.
    By the way, what about Java?
  • 48.
    OpenGL in Java
    • JSR 231
    • Started in 2003
    • Latest release in 2008
    • Supports OpenGL 2.0
  • 49.
    OpenGL
    • Now an independent project: JOGL
    • Supports OpenGL up to 4.5
    • Provides support for GLU and GLUT
    • Access to the low-level C API via JNI
  • 50.
  • 51.
    But somewhere around 2005 it was finally realized that this can be used for general computations as well
  • 52.
    BrookGPU
    • An early effort at GPGPU
    • Its own subset of ANSI C
    • Brook Streaming Language
    • Developed at Stanford University
  • 53.
    GPGPU
    • CUDA: Nvidia's proprietary platform, programmed in a C-based language.
    • DirectCompute: Microsoft's proprietary shader language, part of Direct3D, starting from DirectX 10.
    • AMD FireStream: ATI's proprietary technology.
    • OpenACC: a standard from a multivendor consortium.
    • C++ AMP: Microsoft's proprietary language.
    • OpenCL: a common standard controlled by the Khronos Group.
  • 54.
    Why should we ever use the GPU from Java?
    • Why Java
      • Safe and secure
      • Portability (“write once, run everywhere”)
      • Used on 3 000 000 000 devices
  • 55.
    Why should we ever use the GPU from Java?
    • Why Java
      • Safe and secure
      • Portability (“write once, run everywhere”)
      • Used on 3 000 000 000 devices
    • Where can we apply the GPU
      • Data Analytics and Data Science (Hadoop, Spark …)
      • Security analytics (log processing)
      • Finance/Banking
  • 56.
  • 57.
    But Java runs on the JVM.. and down there we need something low level..
  • 58.
    For low-level access we use:
    • JNI (Java Native Interface)
    • JNA (Java Native Access)
  • 59.
    But we can go crazy there..
  • 60.
  • 61.
    But maybe something has been done already?
  • 62.
    For OpenCL:
    • JOCL
    • JogAmp
    • JavaCL (not supported anymore)
  • 63.
    .. and for CUDA
    • JCuda
      • JCublas
      • JCufft
      • JCurand
      • JCusparse
      • JCusolver
      • JNvgraph
      • JCudpp
      • JNpp
      • JCudnn
  • 64.
    Disclaimer: it's hard to work with the GPU!
    • It's not just running a program
    • You need to know your hardware!
    • It's low level..
  • 65.
  • 66.
    What’s that?
    • OpenCL is short for Open Computing Language
    • Backed by a consortium of Apple, Nvidia, AMD, IBM, Intel, ARM, Motorola and many more
    • Very abstract model
    • Works on both GPU and CPU
  • 67.
    Should work on everything
  • 68.
    All in all it works like this:
    [diagram: the HOST sends the data and the program/kernel to the DEVICE]
  • 69.
    All in all it works like this:
    [diagram: HOST]
  • 70.
    All in all it works like this:
    [diagram: the DEVICE returns the result to the HOST]
  • 71.
    Typical lifecycle of an OpenCL app
    • Create context
    • Create command queue
    • Create memory buffers/fill with data
    • Create program from sources/load binaries
    • Compile (if required)
    • Create kernel from the program
    • Supply kernel arguments
    • Define ND range
    • Execute
    • Return resulting data
    • Release resources
  • 72.
  • 73.
  • 74.
    1. There is the host code. It is written in Java.
  • 75.
    2. There is the device code. It is written in a specific subset of C.
  • 76.
    3. Communication between the host and the device is done via memory buffers.
  • 77.
    So what can we actually transfer?
  • 78.
    The data is not quite the same..
  • 79.
  • 80.
  • 81.
    Data types: vectors
        float f = 4.0f;
        float3 f3 = (float3)(1.0f, 2.0f, 3.0f);
        float4 f4 = (float4)(f3, f);
        // f4.x = 1.0f,
        // f4.y = 2.0f,
        // f4.z = 3.0f,
        // f4.w = 4.0f
  • 82.
    So how are they stored there?
  • 83.
    So how are they stored there?
    The hard way..
  • 84.
    Memory Model
    • __global
    • __constant
    • __local
    • __private
  • 85.
  • 86.
  • 87.
  • 88.
    Execution model
    • We’ve got a lot of data
    • We need to perform the same computation over all of it
    • So we can just shard it
    • OpenCL is here to help us
  • 89.
  • 90.
    ND Range: what is that?
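
To make the ND-range idea more concrete, here is a minimal plain-Java sketch that only enumerates the indices of a 2-D range; it does not talk to OpenCL or to a GPU. The class `NdRangeSketch`, its methods and the sizes used are illustrative assumptions. The one rule taken from OpenCL is that, with a zero global offset, global id = group id × local size + local id.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration of a 2-D ND range; nothing here runs on a device.
class NdRangeSketch {

    // With a zero global offset: global id = group id * local size + local id.
    static int globalId(int groupId, int localSize, int localId) {
        return groupId * localSize + localId;
    }

    // Lists every work-item of a globalX x globalY range that is split into
    // work-groups of localX x localY items. Each entry holds
    // {globalX, globalY, groupX, groupY, localX, localY}.
    static List<int[]> enumerate(int globalX, int globalY, int localX, int localY) {
        if (globalX % localX != 0 || globalY % localY != 0) {
            throw new IllegalArgumentException("global size must be a multiple of local size");
        }
        List<int[]> items = new ArrayList<>();
        for (int gy = 0; gy < globalY / localY; gy++) {
            for (int gx = 0; gx < globalX / localX; gx++) {
                for (int ly = 0; ly < localY; ly++) {
                    for (int lx = 0; lx < localX; lx++) {
                        items.add(new int[] {
                            globalId(gx, localX, lx), globalId(gy, localY, ly),
                            gx, gy, lx, ly });
                    }
                }
            }
        }
        return items;
    }

    public static void main(String[] args) {
        // A 4 x 4 range in 2 x 2 work-groups: 16 work-items in 4 groups.
        for (int[] it : enumerate(4, 4, 2, 2)) {
            System.out.printf("global=(%d,%d) group=(%d,%d) local=(%d,%d)%n",
                    it[0], it[1], it[2], it[3], it[4], it[5]);
        }
    }
}
```

On a real device the work-items of a range run in no guaranteed order; the nested loops here only show which indices exist.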
  • 91.
    For example: matrix multiplication
    • We would write it like this:
        void MatrixMul_sequential(int dim, float *A, float *B, float *C) {
            for (int iRow = 0; iRow < dim; ++iRow) {
                for (int iCol = 0; iCol < dim; ++iCol) {
                    float result = 0.f;
                    for (int i = 0; i < dim; ++i) {
                        result += A[iRow*dim + i] * B[i*dim + iCol];
                    }
                    C[iRow*dim + iCol] = result;
                }
            }
        }
  • 92.
    For example: matrix multiplication
  • 93.
    For example: matrix multiplication
    • So on the GPU:
        __kernel void MatrixMul_kernel_basic(int dim,
                __global float *A, __global float *B, __global float *C) {
            // Get the index of the work-item
            int iCol = get_global_id(0);
            int iRow = get_global_id(1);
            float result = 0.0f;
            for (int i = 0; i < dim; ++i) {
                result += A[iRow*dim + i] * B[i*dim + iCol];
            }
            C[iRow*dim + iCol] = result;
        }
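
The step from the sequential loops to the kernel can be tried out in plain Java. The following is a sketch only: it emulates a launch over a dim × dim range with a parallel stream on the CPU, and the names `MatMulSketch`, `kernel` and `multiply` are made up for illustration. The point it shows is that the two outer loops disappear and each (column, row) pair becomes one independent call.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// CPU-only emulation of the matrix multiplication kernel; not OpenCL host code.
class MatMulSketch {

    // Body of one "work-item": computes a single element of C
    // for flattened, row-major dim x dim matrices.
    static void kernel(int dim, float[] a, float[] b, float[] c, int col, int row) {
        float result = 0.0f;
        for (int i = 0; i < dim; i++) {
            result += a[row * dim + i] * b[i * dim + col];
        }
        c[row * dim + col] = result;
    }

    // Emulates launching the kernel over a dim x dim range.
    // Every id writes a different element of c, so the calls are independent.
    static float[] multiply(int dim, float[] a, float[] b) {
        float[] c = new float[dim * dim];
        IntStream.range(0, dim * dim).parallel()
                 .forEach(id -> kernel(dim, a, b, c, id % dim, id / dim));
        return c;
    }

    public static void main(String[] args) {
        float[] a = {1, 2, 3, 4};
        float[] b = {5, 6, 7, 8};
        System.out.println(Arrays.toString(multiply(2, a, b)));
        // prints [19.0, 22.0, 43.0, 50.0]
    }
}
```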
  • 94.
  • 95.
    Typical GPU
    --- Info for device GeForce GT 650M: ---
    CL_DEVICE_NAME: GeForce GT 650M
    CL_DEVICE_VENDOR: NVIDIA
    CL_DRIVER_VERSION: 10.14.20 355.10.05.15f03
    CL_DEVICE_TYPE: CL_DEVICE_TYPE_GPU
    CL_DEVICE_MAX_COMPUTE_UNITS: 2
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
    CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1024 / 64
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
    CL_DEVICE_MAX_CLOCK_FREQUENCY: 900 MHz
    CL_DEVICE_ADDRESS_BITS: 64
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 256 MByte
    CL_DEVICE_GLOBAL_MEM_SIZE: 1024 MByte
    CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
    CL_DEVICE_LOCAL_MEM_TYPE: local
    CL_DEVICE_LOCAL_MEM_SIZE: 48 KByte
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
    CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
    CL_DEVICE_IMAGE_SUPPORT: 1
    CL_DEVICE_MAX_READ_IMAGE_ARGS: 256
    CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 16
    CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
    CL_DEVICE_2D_MAX_WIDTH 16384
    CL_DEVICE_2D_MAX_HEIGHT 16384
    CL_DEVICE_3D_MAX_WIDTH 2048
    CL_DEVICE_3D_MAX_HEIGHT 2048
    CL_DEVICE_3D_MAX_DEPTH 2048
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 1, SHORT 1, INT 1, LONG 1, FLOAT 1, DOUBLE 1
  • 96.
    Typical CPU
    --- Info for device Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz: ---
    CL_DEVICE_NAME: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
    CL_DEVICE_VENDOR: Intel
    CL_DRIVER_VERSION: 1.1
    CL_DEVICE_TYPE: CL_DEVICE_TYPE_CPU
    CL_DEVICE_MAX_COMPUTE_UNITS: 8
    CL_DEVICE_MAX_WORK_ITEM_DIMENSIONS: 3
    CL_DEVICE_MAX_WORK_ITEM_SIZES: 1024 / 1 / 1
    CL_DEVICE_MAX_WORK_GROUP_SIZE: 1024
    CL_DEVICE_MAX_CLOCK_FREQUENCY: 2600 MHz
    CL_DEVICE_ADDRESS_BITS: 64
    CL_DEVICE_MAX_MEM_ALLOC_SIZE: 2048 MByte
    CL_DEVICE_GLOBAL_MEM_SIZE: 8192 MByte
    CL_DEVICE_ERROR_CORRECTION_SUPPORT: no
    CL_DEVICE_LOCAL_MEM_TYPE: global
    CL_DEVICE_LOCAL_MEM_SIZE: 32 KByte
    CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE: 64 KByte
    CL_DEVICE_QUEUE_PROPERTIES: CL_QUEUE_PROFILING_ENABLE
    CL_DEVICE_IMAGE_SUPPORT: 1
    CL_DEVICE_MAX_READ_IMAGE_ARGS: 128
    CL_DEVICE_MAX_WRITE_IMAGE_ARGS: 8
    CL_DEVICE_SINGLE_FP_CONFIG: CL_FP_DENORM CL_FP_INF_NAN CL_FP_ROUND_TO_NEAREST CL_FP_ROUND_TO_ZERO CL_FP_ROUND_TO_INF CL_FP_FMA CL_FP_CORRECTLY_ROUNDED_DIVIDE_SQRT
    CL_DEVICE_2D_MAX_WIDTH 8192
    CL_DEVICE_2D_MAX_HEIGHT 8192
    CL_DEVICE_3D_MAX_WIDTH 2048
    CL_DEVICE_3D_MAX_HEIGHT 2048
    CL_DEVICE_3D_MAX_DEPTH 2048
    CL_DEVICE_PREFERRED_VECTOR_WIDTH_<t> CHAR 16, SHORT 8, INT 4, LONG 2, FLOAT 4, DOUBLE 2
  • 97.
  • 98.
    And what about CUDA?
    Well.. It looks to be easier
  • 99.
    And what about CUDA?
    Well.. It looks to be easier
    for C developers…
  • 100.
    CUDA kernel
        #define N 10
        __global__ void add(int *a, int *b, int *c) {
            // one thread per block, so the block index selects the element
            int tid = blockIdx.x;
            if (tid < N)
                c[tid] = a[tid] + b[tid];
        }
  • 101.
    CUDA setup
        int a[N], b[N], c[N];
        int *dev_a, *dev_b, *dev_c;
        // allocate the memory on the GPU
        cudaMalloc((void**)&dev_a, N * sizeof(int));
        cudaMalloc((void**)&dev_b, N * sizeof(int));
        cudaMalloc((void**)&dev_c, N * sizeof(int));
        // fill the arrays 'a' and 'b' on the CPU
        for (int i = 0; i < N; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
  • 102.
    CUDA copy to memory and run
        // copy the arrays 'a' and 'b' to the GPU
        cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
        // launch N blocks of 1 thread each
        add<<<N,1>>>(dev_a, dev_b, dev_c);
        // copy the array 'c' back from the GPU to the CPU
        cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
  • 103.
    CUDA get results
        // display the results
        for (int i = 0; i < N; i++) {
            printf("%d + %d = %d\n", a[i], b[i], c[i]);
        }
        // free the memory allocated on the GPU
        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);
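
The launch `add<<<N,1>>>` uses N blocks of one thread each, which is why the kernel can index with `blockIdx.x` alone. As a rough illustration in plain Java (a sequential CPU emulation, not JCuda; the class `CudaIndexSketch` and its parameter names are assumptions that merely mirror CUDA's built-in variables), the more general index is blockIdx.x * blockDim.x + threadIdx.x, guarded by a bounds check:

```java
import java.util.Arrays;

// Sequential CPU emulation of a 1-D CUDA launch; no GPU is involved.
class CudaIndexSketch {

    // Emulates add<<<gridDim, blockDim>>>(a, b, c).
    static int[] add(int[] a, int[] b, int gridDim, int blockDim) {
        int n = a.length;
        int[] c = new int[n];
        for (int blockIdx = 0; blockIdx < gridDim; blockIdx++) {
            for (int threadIdx = 0; threadIdx < blockDim; threadIdx++) {
                // Global index of this emulated thread.
                int tid = blockIdx * blockDim + threadIdx;
                // Guard: the grid may contain more threads than elements.
                if (tid < n) {
                    c[tid] = a[tid] + b[tid];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 10;
        int[] a = new int[n];
        int[] b = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = -i;
            b[i] = i * i;
        }
        // Same shape as the slide: 10 blocks of 1 thread.
        System.out.println(Arrays.toString(add(a, b, n, 1)));
        // 3 blocks of 4 threads cover 12 indices; the guard skips the last two.
        System.out.println(Arrays.toString(add(a, b, 3, 4)));
    }
}
```

Both calls produce the same array; only the way the indices are grouped into blocks differs.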
  • 104.
    But CUDA has some other superpowers
    • JCublas: all about matrices
    • JCufft: Fast Fourier Transform
    • JCurand: all about random numbers
    • JCusparse: sparse matrices
    • JCusolver: factorization and some other crazy stuff
    • JNvgraph: all about graphs
    • JCudpp: CUDA Data Parallel Primitives Library, and some sorting
    • JNpp: image processing on GPU
    • JCudnn: Deep Neural Network library (that’s scary)
  • 105.
    For example, we need a good random generator
        int n = 100;
        curandGenerator generator = new curandGenerator();
        float hostData[] = new float[n];
        Pointer deviceData = new Pointer();
        cudaMalloc(deviceData, n * Sizeof.FLOAT);
        curandCreateGenerator(generator, CURAND_RNG_PSEUDO_DEFAULT);
        curandSetPseudoRandomGeneratorSeed(generator, 1234);
        curandGenerateUniform(generator, deviceData, n);
        cudaMemcpy(Pointer.to(hostData), deviceData,
                n * Sizeof.FLOAT, cudaMemcpyDeviceToHost);
        System.out.println(Arrays.toString(hostData));
        curandDestroyGenerator(generator);
        cudaFree(deviceData);
  • 106.
    For example, we need a good random generator
    • With a strong theory underneath
    • Sobol sequences, introduced by the Russian mathematician Ilya M. Sobol in 1967
    • https://en.wikipedia.org/wiki/Sobol_sequence
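
To give a feeling for what a low-discrepancy sequence looks like, here is a small plain-Java sketch of the base-2 van der Corput sequence. This is a deliberate simplification and not cuRAND or JCurand code: in the usual construction the first dimension of a Sobol sequence coincides with this sequence, while the higher dimensions need direction numbers that are not shown here. The class and method names are illustrative.

```java
// Base-2 van der Corput sequence: a one-dimensional low-discrepancy sequence.
class VanDerCorputSketch {

    // n-th point: mirror the binary digits of n around the binary point,
    // e.g. 1 = 1b -> 0.1b = 0.5, 2 = 10b -> 0.01b = 0.25, 3 = 11b -> 0.11b = 0.75.
    static double point(int n) {
        long reversed = Integer.reverse(n) & 0xFFFFFFFFL; // read the bits as unsigned
        return reversed / 4294967296.0;                   // divide by 2^32
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            System.out.println(n + " -> " + point(n));
        }
    }
}
```

Unlike pseudo-random numbers, successive points keep filling the largest remaining gaps in [0, 1), which is the property quasi-random generators are chosen for.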
  • 107.
  • 108.
    By the way.. talking about memory
    [image © Wikipedia]
  • 109.
    Optimizations…
        __kernel void MatrixMul_kernel_basic(int dim,
                                             __global float *A,
                                             __global float *B,
                                             __global float *C)
        {
            int iCol = get_global_id(0);
            int iRow = get_global_id(1);
            float result = 0.0;
            for (int i = 0; i < dim; ++i)
            {
                result += A[iRow*dim + i] * B[i*dim + iCol];
            }
            C[iRow*dim + iCol] = result;
        }
  • 110.
    <- Optimizations
        #define VECTOR_SIZE 4
        __kernel void MatrixMul_kernel_basic_vector4(int dim,
                                                     __global float4 *A,
                                                     __global float4 *B,
                                                     __global float *C)
        {
            int localIdx = get_global_id(0);
            int localIdy = get_global_id(1);
            float result = 0.0;
            float4 Bvector[4];
            float4 Avector, temp;
            float4 resultVector[4] = {0,0,0,0};
            int rowElements = dim/VECTOR_SIZE;
            for (int i = 0; i < rowElements; ++i)
            {
                Avector = A[localIdy*rowElements + i];
                Bvector[0] = B[dim*i + localIdx];
                Bvector[1] = B[dim*i + rowElements + localIdx];
                Bvector[2] = B[dim*i + 2*rowElements + localIdx];
                Bvector[3] = B[dim*i + 3*rowElements + localIdx];
                temp = (float4)(Bvector[0].x, Bvector[1].x, Bvector[2].x, Bvector[3].x);
                resultVector[0] += Avector * temp;
                temp = (float4)(Bvector[0].y, Bvector[1].y, Bvector[2].y, Bvector[3].y);
                resultVector[1] += Avector * temp;
                temp = (float4)(Bvector[0].z, Bvector[1].z, Bvector[2].z, Bvector[3].z);
                resultVector[2] += Avector * temp;
                temp = (float4)(Bvector[0].w, Bvector[1].w, Bvector[2].w, Bvector[3].w);
                resultVector[3] += Avector * temp;
            }
            C[localIdy*dim + localIdx*VECTOR_SIZE] = resultVector[0].x + resultVector[0].y + resultVector[0].z + resultVector[0].w;
            C[localIdy*dim + localIdx*VECTOR_SIZE + 1] = resultVector[1].x + resultVector[1].y + resultVector[1].z + resultVector[1].w;
            C[localIdy*dim + localIdx*VECTOR_SIZE + 2] = resultVector[2].x + resultVector[2].y + resultVector[2].z + resultVector[2].w;
            C[localIdy*dim + localIdx*VECTOR_SIZE + 3] = resultVector[3].x + resultVector[3].y + resultVector[3].z + resultVector[3].w;
        }
  • 111.
  • 112.
    But we don’t want to write C at all…
  • 113.
    We don’t want to think about those hosts and devices…
  • 114.
    We can use the GPU partially..
  • 115.
  • 116.
    Project Sumatra
    • Research project
    • Focused on Java 8
  • 117.
    Project Sumatra
    • Research project
    • Focused on Java 8
    • … to be more precise, on streams
  • 118.
    Project Sumatra
    • Research project
    • Focused on Java 8
    • … to be more precise, on streams
    • … and even more precisely, on lambdas and .forEach()
  • 119.
  • 120.
  • 121.
    AMD HSAIL
    • Detects a forEach() block
    • Generates HSAIL code with Graal
    • At a low level, supplies the kernel generated from the lambda to the GPU
  • 122.
    The AMD APU tries to solve the main issue..
    [image © Wikipedia]
  • 123.
    But if we want a more general solution..
  • 124.
    IBM patched JVM for GPU
    • Focused on CUDA (for now)
    • Focused on Stream API
    • Created their own .parallel()
  • 125.
    IBM patched JVM for GPU
    Imagine:
        void fooJava(float[] a, float[] b, int n) {
            // similar to for (int i = 0; i < n; i++)
            IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
        }
  • 126.
    IBM patched JVM for GPU
    Imagine:
        void fooJava(float[] a, float[] b, int n) {
            // similar to for (int i = 0; i < n; i++)
            IntStream.range(0, n).parallel().forEach(i -> { b[i] = a[i] * 2.0f; });
        }
    … we would like the lambda to be automatically converted to GPU code…
  • 127.
    IBM patched JVM for GPU
    When n is big, the lambda code is executed on the GPU:
        class Par {
            void foo(float[] a, float[] b, float[] c, int n) {
                IntStream.range(0, n).parallel().forEach(i -> {
                    b[i] = a[i] * 2.0f;
                    c[i] = a[i] * 3.0f;
                });
            }
        }
    * only lambdas over primitive types in one-dimensional arrays.
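
For reference, here is a self-contained version of that lambda that compiles and runs on any standard JVM. It is a sketch under the assumption that no GPU-enabled JVM is present, in which case the parallel stream simply runs on CPU threads; whether and when a patched JVM offloads such a loop is up to that JVM. The class name `ParallelLambdaSketch` and the method name `scale` are made up for illustration.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Runs on a stock JVM using CPU threads; nothing here requests a GPU.
class ParallelLambdaSketch {

    // Each index is independent and touches only primitive arrays,
    // which is the shape of loop the slide describes as offloadable.
    static void scale(float[] a, float[] b, float[] c, int n) {
        IntStream.range(0, n).parallel().forEach(i -> {
            b[i] = a[i] * 2.0f;
            c[i] = a[i] * 3.0f;
        });
    }

    public static void main(String[] args) {
        float[] a = {1.0f, 2.0f, 3.0f};
        float[] b = new float[a.length];
        float[] c = new float[a.length];
        scale(a, b, c, a.length);
        System.out.println(Arrays.toString(b)); // [2.0, 4.0, 6.0]
        System.out.println(Arrays.toString(c)); // [3.0, 6.0, 9.0]
    }
}
```

Note the `f` suffix on the literals: `a[i] * 2.0` would be a double and could not be assigned to a float element without a cast.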
  • 128.
    IBM patched JVM for GPU
    Optimized IBM JIT compiler:
    • Use read-only cache
    • Fewer writes to global GPU memory
    • Optimized host-to-device data copy rate
    • Less data to be copied
    • Eliminate exceptions as much as possible
    • In the GPU kernel
  • 129.
    IBM patched JVM for GPU
    • Success story: [logos]
  • 130.
    IBM patched JVM for GPU
    • Officially:
  • 131.
    IBM patched JVM for GPU
    • More info: https://github.com/IBMSparkGPU/GPUEnabler
  • 132.
    But can we just write in Java and have it converted to OpenCL/CUDA?
  • 133.
  • 134.
    Aparapi is there for you!
  • 135.
    Aparapi
    • Short for «A PARallel API»
  • 136.
    Aparapi
    • Short for «A PARallel API»
    • Works like Hibernate for databases
  • 137.
    Aparapi
    • Short for «A PARallel API»
    • Works like Hibernate for databases
    • Dynamically converts JVM bytecode to code for host and device
  • 138.
    Aparapi
    • Short for «A PARallel API»
    • Works like Hibernate for databases
    • Dynamically converts JVM bytecode to code for host and device
    • OpenCL under the covers
  • 139.
  • 140.
    Aparapi
    • Started by AMD
    • Then abandoned…
  • 141.
    Aparapi
    • Started by AMD
    • Then abandoned…
    • Five years later, open-sourced under the Apache 2.0 license
  • 142.
    Aparapi
    • Started by AMD
    • Then abandoned…
    • Five years later, open-sourced under the Apache 2.0 license
    • Back to life!!!
  • 143.
    Aparapi: now it's so much simpler!
        public static void main(String[] _args) {
            final int size = 512;
            final float[] a = new float[size];
            final float[] b = new float[size];
            for (int i = 0; i < size; i++) {
                a[i] = (float) (Math.random() * 100);
                b[i] = (float) (Math.random() * 100);
            }
            final float[] sum = new float[size];
            Kernel kernel = new Kernel() {
                @Override public void run() {
                    int gid = getGlobalId();
                    sum[gid] = a[gid] + b[gid];
                }
            };
            kernel.execute(Range.create(size));
            for (int i = 0; i < size; i++) {
                System.out.printf("%6.2f + %6.2f = %8.2f\n", a[i], b[i], sum[i]);
            }
            kernel.dispose();
        }
  • 144.
    But what about the clouds?
  • 145.
    We can’t sell our product if it's not cloud native!
  • 146.
    Nvidia is your friend!
  • 147.
    Nvidia GRID
    • Announced in 2012
    • Already in production
    • Works on most hypervisors
    • .. and in the clouds!
  • 148.
  • 149.
  • 150.
    … AMD is a bit behind…
  • 151.
  • 152.
  • 153.
    It's here: ATI Radeon
  • 154.
  • 155.
    It's here: Intel Skylake
  • 156.
    It's here: Nvidia Tegra Parker
  • 157.
  • 158.
  • 159.
  • 160.
    So use it! If the task is suitable
  • 161.
  • 162.
  • 163.
  • 164.
