








![Then and Now. 1992: OpenGL 1.0, per-vertex lighting. 2012: OpenGL 4.3, real-time global illumination [Crassin]](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-10-2048.jpg&f=jpg&w=240)








![Classic OpenGL State Machine From 1991-2007 * vertex & fragment processing got programmable 2001 & 2003 [source: GL 1.0 specification, 1992]](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-19-2048.jpg&f=jpg&w=240)









![Single Program, Multiple Data Example Standard C Code, running single-threadedvoid SAXPY_CPU(int n, float alpha, float x[256], float y[256]){ if (n > 256) n = 256; for (int i = 0; i < n; i++) // loop over each element explicitly y[i] = alpha*x[i] + y[i];}#version 430layout(local_size_x=256) in; // spawn groups of 256 threads!buffer xBuffer { float x[]; }; buffer yBuffer { float y[]; };uniform float alpha;void main(){ int i = int(gl_GlobalInvocationID.x); if (i < x.length()) // derive size from buffer bound y[i] = alpha*x[i] + y[i];} OpenGL Compute Shader, running SPMD SAXPY = BLAS library's single-precision alpha times x plus y](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-29-2048.jpg&f=jpg&w=240)
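The SAXPY slide contrasts an explicit CPU loop with a shader body that runs once per invocation. As a rough illustration of that SPMD model, the C sketch below emulates the dispatch serially: it rounds the element count up to whole groups of 256 and relies on the same bounds check the shader uses to discard the surplus invocations. The function names (`saxpy_group_count`, `saxpy_invocation`, `saxpy_dispatch_emulated`) are hypothetical and are not part of OpenGL; on a GPU the two loops would be replaced by a single `glDispatchCompute` call.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE_X 256u  /* mirrors layout(local_size_x=256) in the shader */

/* Number of work groups needed to cover n elements (rounds up). */
unsigned saxpy_group_count(unsigned n)
{
    return (n + LOCAL_SIZE_X - 1u) / LOCAL_SIZE_X;
}

/* Body of one invocation: the same bounds check the shader performs. */
void saxpy_invocation(unsigned global_id, unsigned n, float alpha,
                      const float *x, float *y)
{
    if (global_id < n)
        y[global_id] = alpha * x[global_id] + y[global_id];
}

/* Serial stand-in (an assumption for illustration) for dispatching
 * saxpy_group_count(n) work groups along x. */
void saxpy_dispatch_emulated(unsigned n, float alpha,
                             const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    unsigned groups = saxpy_group_count(n);
    for (unsigned g = 0; g < groups; g++)          /* gl_WorkGroupID.x */
        for (unsigned l = 0; l < LOCAL_SIZE_X; l++) /* gl_LocalInvocationID.x */
            saxpy_invocation(g * LOCAL_SIZE_X + l, n, alpha, x, y);
}
```

With 300 elements this launches two groups (512 invocations), and the last 212 invocations fail the bounds check and do nothing.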



![Per-Work Group Shared Variables. Any thread in a work group can read/write shared variables; the typical idiom is to index by each thread’s invocation #. No access to shared variables of a different work group. Use shared memory barriers to synchronize access to shared variables. Compute shader source code: shared float a[3][4]; unsigned int x = gl_LocalInvocationID.x unsigned int y = gl_LocalInvocationID.y a[y][x] = 2*ndx; a[y][x^1] += a[y][x]; memoryBarrierShared(); a[y][x^2] += a[y][x]; Diagram: two work groups, each with its own copy of a[0][0] through a[2][3]](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-33-2048.jpg&f=jpg&w=240)



![Compiles into NV_compute_program5 Assembly!!NVcp5.0 # NV_compute_program5 assemblyGROUP_SIZE 16 16; # work group is 16x16 so 256 threadsPARAM c[2] = { program.local[0..1] }; # internal constantsTEMP R0, R1; # temporariesIMAGE images[] = { image[0..7] }; # input & output imagesMAD.S R1.xy,invocation.groupid,{16,16,0,0}.x,invocation.localid;MOV.S R0.x, c[0];LOADIM.U32 R0.x, R1, images[R0.x], 2D; # load from input imageMOV.S R1.z, c[1].x;UP4UB.F R0, R0.x; # unpack RGBA pixel into float4 vectorSTOREIM.F images[R1.z], R0, R1, 2D; # store to output imageEND](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-37-2048.jpg&f=jpg&w=240)




![Copy Compute Shader Tiling gl_WorkGroupID=[x,y] [0,4] [1,4] [2,4] [3,4] [4,4] [0,4] [1,4] [2,4] [3,4] [4,4] [0,3] [1,3] [2,3] [3,3] [4,3] [0,3] [1,3] [2,3] [3,3] [4,3] [0,2] [1,2] [2,2] [3,2] [4,2] [0,2] [1,2] [2,2] [3,2] [4,2] [0,1] [1,1] [2,1] [3,1] [4,1] [0,1] [1,1] [2,1] [3,1] [4,1] [0,0] [1,0] [2,0] [3,0] [4,0] [0,0] [1,0] [2,0] [3,0] [4,0] Input (source) image 76x76 Output (destination) image 76x76](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-42-2048.jpg&f=jpg&w=240)
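The tiling diagram shows a 5x5 grid of work groups covering a 76x76 image. A small C sketch of the arithmetic behind that grid follows, assuming 16x16 work groups as in the assembly listing on the earlier slide. The type `int2` and the functions `groups_for`, `pixel_for` and `in_bounds` are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

typedef struct { int x, y; } int2;

/* Work groups needed along one axis: image extent divided by tile
 * extent, rounded up so a partial tile still gets a group. */
int groups_for(int image_extent, int tile_extent)
{
    assert(image_extent > 0 && tile_extent > 0);
    return (image_extent + tile_extent - 1) / tile_extent;
}

/* Pixel addressed by one invocation: group origin plus local offset. */
int2 pixel_for(int2 group_id, int2 local_id, int2 tile)
{
    int2 p = { group_id.x * tile.x + local_id.x,
               group_id.y * tile.y + local_id.y };
    return p;
}

/* Invocations in the last row and column of groups can fall outside
 * the image and must be skipped or clamped. */
int in_bounds(int2 p, int width, int height)
{
    return p.x >= 0 && p.y >= 0 && p.x < width && p.y < height;
}
```

For 76 pixels and 16-pixel tiles, `groups_for` yields 5, which matches work group IDs [0,0] through [4,4] in the diagram; the fifth group along each axis covers only 12 valid pixels.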



![Implementing a General Convolution Basic algorithm Tile-oriented: generate MxM pixel tiles So operating on a (M+2N)x(M+2N) region of the image Phase 1: Read all the pixels for a region from input image Phase 2: Perform weighted sum of pixels in [-N,N]x[-N,N] region around each output pixel Phase 3: Output the result pixel to output image](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-46-2048.jpg&f=jpg&w=240)
![General Convolution: Preliminaries// Various kernel-wide constantsconst int tileWidth = 16, tileHeight = 16;const int filterWidth = 5, filterHeight = 5;const ivec2 tileSize = ivec2(tileWidth,tileHeight);const ivec2 filterOffset = ivec2(filterWidth/2,filterHeight/2);const ivec2 neighborhoodSize = tileSize + 2*filterOffset;// Declare the input and output images.layout(binding=0,rgba8) uniform image2D input_image;layout(binding=1,rgba8) uniform image2D output_image;uniform vec4 weight[filterHeight][filterWidth];uniform ivec4 imageBounds; // Bounds of the input image for pixel coordinate clamping.void retirePhase() { memoryBarrierShared(); barrier(); }ivec2 clampLocation(ivec2 xy) { // Clamp the image pixel location to the image boundary. return clamp(xy, imageBounds.xy, imageBounds.zw);}](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-47-2048.jpg&f=jpg&w=240)
![General Convolution: Phase 1layout(local_size_x=TILE_WIDTH,local_size_y=TILE_HEIGHT) in;shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH];void main() { const ivec2 tile_xy = ivec2(gl_WorkGroupID); const ivec2 thread_xy = ivec2(gl_LocalInvocationID); const ivec2 pixel_xy = tile_xy*tileSize + thread_xy; const uint x = thread_xy.x; const uint y = thread_xy.y; // Phase 1: Read the image's neighborhood into shared pixel arrays. for (int j=0; j<neighborhoodSize.y; j += tileHeight) { for (int i=0; i<neighborhoodSize.x; i += tileWidth) { if (x+i < neighborhoodSize.x && y+j < neighborhoodSize.y) { const ivec2 read_at = clampLocation(pixel_xy+ivec2(i,j)-filterOffset); pixel[y+j][x+i] = imageLoad(input_image, read_at); } } } retirePhase();](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-48-2048.jpg&f=jpg&w=240)
![General Convolution: Phases 2 & 3 // Phase 2: Compute general convolution. vec4 result = vec4(0); for (int j=0; j<filterHeight; j++) { for (int i=0; i<filterWidth; i++) { result += pixel[y+j][x+i] * weight[j][i]; } } // Phase 3: Store result to output image. imageStore(output_image, pixel_xy, result);}](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-49-2048.jpg&f=jpg&w=240)
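The three phases of the general convolution can be checked on the CPU with a serial reference. The sketch below is a simplified stand-in, not the slides' shader: it handles one float channel instead of `vec4`, replaces the shared `pixel` array with a local array, loops over the tile instead of running one thread per pixel, and adds a bounds guard on the store that the slide code does not show. The names `convolve_tile` and `clamp_int` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { TILE = 16, FILTER = 5, OFFSET = FILTER / 2,
       NEIGHBORHOOD = TILE + 2 * OFFSET };

int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Convolve one TILE x TILE tile of a single-channel, row-major image.
 * weight holds FILTER*FILTER values in row-major order. */
void convolve_tile(const float *src, float *dst, int width, int height,
                   int tile_x, int tile_y, const float *weight)
{
    float pixel[NEIGHBORHOOD][NEIGHBORHOOD];
    assert(src != NULL && dst != NULL && weight != NULL);
    assert(width > 0 && height > 0);

    /* Phase 1: read the (M+2N)x(M+2N) neighborhood, clamping
     * coordinates to the image boundary. */
    for (int j = 0; j < NEIGHBORHOOD; j++)
        for (int i = 0; i < NEIGHBORHOOD; i++) {
            int sx = clamp_int(tile_x * TILE + i - OFFSET, 0, width - 1);
            int sy = clamp_int(tile_y * TILE + j - OFFSET, 0, height - 1);
            pixel[j][i] = src[sy * width + sx];
        }

    /* Phases 2 and 3: weighted sum around each output pixel, then store. */
    for (int y = 0; y < TILE; y++)
        for (int x = 0; x < TILE; x++) {
            int px = tile_x * TILE + x, py = tile_y * TILE + y;
            if (px >= width || py >= height)
                continue; /* guard for partial tiles (added assumption) */
            float result = 0.0f;
            for (int j = 0; j < FILTER; j++)
                for (int i = 0; i < FILTER; i++)
                    result += pixel[y + j][x + i] * weight[j * FILTER + i];
            dst[py * width + px] = result;
        }
}
```

A filter whose only non-zero weight is a 1 at the center should reproduce the input image, which makes a convenient sanity check before comparing against GPU output.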




![GLSL Separable Filter Implementation<< assume preliminaries from earlier general convolution example>>layout(local_size_x=TILE_WIDTH,local_size_y=NEIGHBORHOOD_HEIGHT) in;shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH]; // values read from input imageshared vec4 row[NEIGHBORHOOD_HEIGHT][TILE_WIDTH]; // weighted row sumsvoid main() // separable convolution{ const ivec2 tile_xy = ivec2(gl_WorkGroupID); const ivec2 thread_xy = ivec2(gl_LocalInvocationID); const ivec2 pixel_xy = tile_xy*tileSize + (thread_xy-ivec2(0,filterOffset.y)); const uint x = thread_xy.x; const uint y = thread_xy.y; // Phase 1: Read the image's neighborhood into shared pixel arrays. for (int i=0; i<NEIGHBORHOOD_WIDTH; i += TILE_WIDTH) { if (x+i < NEIGHBORHOOD_WIDTH) { const ivec2 read_at = clampLocation(pixel_xy+ivec2(i-filterOffset.x,0)); pixel[y][x+i] = imageLoad(input_image, read_at); } } retirePhase();](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-54-2048.jpg&f=jpg&w=240)
![GLSL Separable Filter Implementation // Phase 2: Weighted sum the rows horizontally. row[y][x] = vec4(0); for (int i=0; i<filterWidth; i++) { row[y][x] += pixel[y][x+i] * rowWeight[i]; } retirePhase(); // Phase 3: Weighted sum the row sums vertically and write result to output image. // Does this thread correspond to a tile pixel? // Recall: There are more threads in the Y direction than tileHeight. if (y < tileHeight) { vec4 result = vec4(0); for (int i=0; i<filterHeight; i++) { result += row[y+i][x] * columnWeight[i]; } // Phase 4: Store result to output image. const ivec2 pixel_xy = tile_xy*tileSize + thread_xy; imageStore(output_image, pixel_xy, result); }}](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-55-2048.jpg&f=jpg&w=240)
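The separable version works because a 2D filter whose weights factor as `columnWeight[j] * rowWeight[i]` can be applied as a horizontal pass followed by a vertical pass. The C sketch below compares both forms for a single output pixel over a 5x5 neighborhood; it is an illustration under that factorization assumption, and the function names `general_sum` and `separable_sum` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum { FILTER = 5 };

/* Full 2D weighted sum with weight[j][i] = col_w[j] * row_w[i].
 * nb holds the FILTER x FILTER neighborhood in row-major order. */
float general_sum(const float *nb, const float *row_w, const float *col_w)
{
    assert(nb != NULL && row_w != NULL && col_w != NULL);
    float result = 0.0f;
    for (int j = 0; j < FILTER; j++)
        for (int i = 0; i < FILTER; i++)
            result += nb[j * FILTER + i] * (col_w[j] * row_w[i]);
    return result;
}

/* Same result in two passes: weighted row sums first (Phase 2 on the
 * slide), then a weighted sum of those row sums (Phase 3). */
float separable_sum(const float *nb, const float *row_w, const float *col_w)
{
    assert(nb != NULL && row_w != NULL && col_w != NULL);
    float row[FILTER];
    for (int j = 0; j < FILTER; j++) {
        row[j] = 0.0f;
        for (int i = 0; i < FILTER; i++)
            row[j] += nb[j * FILTER + i] * row_w[i];
    }
    float result = 0.0f;
    for (int j = 0; j < FILTER; j++)
        result += row[j] * col_w[j];
    return result;
}
```

In the shader the row sums are shared across a column of threads, so each thread does on the order of `filterWidth + filterHeight` multiply-adds instead of `filterWidth * filterHeight`. With general floating-point weights the two forms can differ by rounding, so comparisons against a reference should use a tolerance.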











![GLSL 4.3 new functionality ARB_arrays_of_arrays Allows multi-dimensional arrays in GLSL. float f[4][3]; ARB_shader_image_size Query size of an image in a shader ARB_explicit_uniform_location Set location of a default-block uniform in the shader ARB_texture_query_levels Query number of mipmap levels accessible through a sampler uniform ARB_fragment_layer_viewport gl_Layer and gl_ViewportIndex now available to fragment shader](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-67-2048.jpg&f=jpg&w=240)




























![Writing GLSL for Bindless Textures Request GLSL to understand bindless textures #version 400 // or later #extension GL_NV_bindless_texture : require Declare a sampler in the normal way in sampler2D bindless_texture; Alternatively, access bindless samplers in big array: uniform Samplers { sampler2D lotsOfSamplers[256]; } Exciting: 256 samplers exceeds the available texture units!](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-96-2048.jpg&f=jpg&w=240)






![OpenGL Technology GL 3.x uniform samplerBuffer matrixBuffer; // need helper functions Texture Buffers (TexBO, unsized mat4 getMatrix (samplerBuffer buf, int i){ return mat4( texelFetch (buf,(i*4)+0), 1D array of basic vector types) texelFetch (buf,(i*4)+1) ... Uniform Buffer Objects (UBO, } arbitrary types, size limitation uniform viewBuffer { 64kb) mat4 viewInvTM; mat4 viewProjTM; Texture Arrays (pack multiple float time; same-sized textures in one array) ... } GL 4.x // NEW 4.3 allows unsized arrays as last entry Shader Storage Buffer (SSBO, // and tighter array packing layout(std430) buffer matrixBuffer { unsized arbitrary buffer access) int willCostOnly16BytesNow[4]; (NEW 4.3) mat4 matrices[]; }](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-103-2048.jpg&f=jpg&w=240)
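The `willCostOnly16BytesNow[4]` comment refers to the array packing difference between the `std140` and `std430` layouts: under `std140` each array element is padded to a 16-byte stride, while `std430` lets an array of 4-byte scalars use its natural stride. The C sketch below computes both sizes. It covers only arrays of 4-byte scalars such as `int` or `float`, and the helper names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next multiple of alignment. */
size_t round_up(size_t v, size_t alignment)
{
    assert(alignment > 0);
    return (v + alignment - 1) / alignment * alignment;
}

/* std140: the stride of every array element is rounded up to the
 * alignment of a vec4 (16 bytes). */
size_t scalar_array_size_std140(size_t elem_size, size_t count)
{
    return round_up(elem_size, 16) * count;
}

/* std430: arrays of scalars keep the scalar's own size as stride. */
size_t scalar_array_size_std430(size_t elem_size, size_t count)
{
    return elem_size * count;
}
```

For `int[4]` this gives 64 bytes under `std140` and 16 bytes under `std430`, which is the saving the slide's identifier alludes to.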
![OpenGL Technology GL 4.x glGenTextures (2,tex); // create texture with complete mipchain ARB_texture_storage helps driver glTexStorage (GL_..,levels, GL_RGBA8, w, h); // subimage data in later to create immutable “complete” glTexSubImage (GL_.., 0,..., mipData[0]) texture at once // NEW 4.3 // create another texture that references the // same data and interprets a single mip ARB_texture_view (NEW 4.3) // slightly differently allows multiple views (internal- glTextureView (tex[1], GL_TEXTURE_2D, tex[2], format casting with same texel bit GL_R32UI, minlevel, numlevels, minlayer, numlayers); size) on same texture data // NEW 4.3 bind range of buffer ARB_texture_buffer_range (NEW glTexBufferRange (GL_..., GL_RGBA32F, buffer, offset, size); 4.3)](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-104-2048.jpg&f=jpg&w=240)
![NVIDIA Technology // GLSL with true pointers NVIDIA Bindless Graphics uniform mat4* matrixBuffer; Exposes gpu resources directly // API glUniformui64NV (shd->matrixLocation, (pointers or objects) scene->matrixADDR); Reduces CPU cache thrashing greatly mat->diffuse = glGetTextureHandleNV (texobj); GL 3.x // later instead of glBindTexture NV_shader_buffer_load (SBL) for glUniformHandleui64NV (shd->diffuseLocation, arbitrary unsized cross buffer access mat->diffuse) // GLSL NV_vertex_buffer_unified_memory // can also store textures in resources, (VBUM) separates vertex data from // virtually no restrictions on # format uniform materialBuffer { sampler2D howManyTexturesIWant[LARGE]; GL 4.x } // virtual sparse texturing NV_bindless_texture allows sampler uniform usampler2D virtualTex; references anywhere ... sampler2D (packUint2x32 ( texelFetch (virtualTex, coord).xy));](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-105-2048.jpg&f=jpg&w=240)
![Data Transfer // NEW 4.3 copy rectangles of textures Textures glCopyImageSubData ( srcName, srcTarget, srcLevel,srcX,srcY,srcZ, ARB_pixel_buffer_object (GL 2.x) dstName, dstTarget, dstLevel,dstX,dstY,dstZ, srcWidth, srcHeight, srcDepth); Now ARB_copy_image (NEW 4.3) Buffers // EXT_direct_state_access style usage shown // classic functions exists as well ARB_map_buffer_range (GL 3.x) for // range map and invalidate fast mapping void* data; ARB_invalidate_subdata (NEW 4.3) data = glMapNamedBufferRangeEXT (textBuffer, 0, sizeof(MyChar) * textLength, ARB_clear_buffer_object (NEW 4.3) GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT); allows memset() operations ARB_sync (GL 3.x) for efficient // NEW 4.3 clearing a buffer GLuint zero[1] = {0}; threaded streaming glClearNamedBufferDataEXT (visibleBuffer, GTC 2012 “Optimizing Texture Transfers” GL_R32UI, GL_RED, GL_UNSIGNED_INT, zero);](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-106-2048.jpg&f=jpg&w=240)

![Drawing the Objects struct MyMaterial { Enhanced: vec4 diffuse; int shadeType; Grow buffers and dynamically index ... TexBO: can be large, but ugly to fetch }; uniform materialBuffer { UBO: fast, but size limited MyMaterial materials[128]; SSBO: large }; buffer transformBuffer { mat4 transforms[]; Pass assignment index as glUniform , }; glVertexAttribI (faster) ... gl_FragColor = materials[assign.x].diffuse; Go bindless beyond VBUM // bindless pointer datatypes struct Object { Bypassing binding completely MyMaterial* material; mat4* transform; };](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-108-2048.jpg&f=jpg&w=240)


![BaseInstance as Unique Object ID MultiDrawIndirect uses instanced drawing Can replicate a vertex attribute (material... assignment) Regular : VArray[ gl_VertexID + baseVertex ] Divisor != 0 : VArray[ gl_InstanceID / VAttribDivisor + baseInstance ] AssignBuffer (divisor) Combined Attributes VertexBuffer (regular) MultiDrawBuffer count = 9 count = 6 baseVertex = 0 baseVertex = 9 baseInstance = 1 baseInstance = 0](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-111-2048.jpg&f=jpg&w=240)
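The two fetch formulas on the BaseInstance slide can be written out directly. The C sketch below mirrors them as stated on the slide; the struct `DrawCommand` is a hypothetical, trimmed-down stand-in holding only the fields the slide shows, not the full indirect command layout.

```c
#include <assert.h>

/* Illustrative subset of an indirect draw command (assumption: only
 * the fields shown on the slide). */
typedef struct {
    unsigned count;
    unsigned baseVertex;
    unsigned baseInstance;
} DrawCommand;

/* Regular attribute: VArray[ gl_VertexID + baseVertex ]. */
unsigned regular_index(unsigned vertex_id, unsigned base_vertex)
{
    return vertex_id + base_vertex;
}

/* Attribute with a non-zero divisor:
 * VArray[ gl_InstanceID / divisor + baseInstance ]. */
unsigned instanced_index(unsigned instance_id, unsigned divisor,
                         unsigned base_instance)
{
    assert(divisor != 0);
    return instance_id / divisor + base_instance;
}
```

When each draw renders a single instance, `gl_InstanceID` is 0, so the instanced attribute is fetched at `baseInstance` for every vertex of that draw. That is how the slide's two commands (`baseInstance = 1` and `baseInstance = 0`) select different entries of the assignment buffer.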

![Shader Switching. Could use indexed subroutines. If shaders are similar in resource consumption (register usage) might want to combine them (GL 4.x). Initialize the subroutine array once, then dynamically index. NVIDIA Bindless Graphics pointers allow casting buffer addresses. Code: subroutine vec4 shade_fn (); subroutine (shade_fn) vec4 metal() ... subroutine (shade_fn) vec4 wood () ... // content of array set by gl api call subroutine uniform shade_fn shadeFuncs[2]; flat in ivec4 assigns; void main(){ gl_FragColor = shadeFuncs[assigns.y](); } // bindless pointer casting and texture sampling vec4 metal() { MetalParams* metal = packPtr (assigns.zw); ... texture (metal->roughnessMap, uv); } vec4 wood() { WoodParams* wood = packPtr (assigns.zw); ... }](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-113-2048.jpg&f=jpg&w=240)

![OpenGL Computing. ARB_compute_shader (NEW 4.3): Dispatch threads with shared memory support (as in CUDA/CL). Access to ALL resources, textures, no interop, all in GLSL. NV_BINDLESS benefit from pointer access. ARB_framebuffer_no_attachments (NEW 4.3): Use rasterizer to spawn threads and SSBO/imageStores to record your results. GL 3.x Transform Feedback (XFB): Allows simple 1D Kernels. Code: // API - COMPUTE glDispatchCompute (gx, gy, gz); // GLSL shared float s_mem[SOMESIZE]; ... s_mem[gl_LocalInvocationIndex] = ... // API - FBO glBindFramebuffer (...); glFramebufferParameteri (..., GL_FRAMEBUFFER_DEFAULT_WIDTH, 2048); glFramebufferParameteri (..., GL_FRAMEBUFFER_DEFAULT_HEIGHT, 2048); glDrawArrays(...) // GLSL ... imageStore(...ivec2(gl_FragCoord),.); // API - XFB glEnable (GL_RASTERIZER_DISCARD); glDrawArrays (GL_POINTS,0, count); // GLSL buffer indirectBuffer { DrawIndirect commands[]; } ... commands[gl_VertexID].instanceCount = visible ? 1 : 0;](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-115-2048.jpg&f=jpg&w=240)


![Culling Techniques. Raster Occlusion (GL 4.x): Depth-Pass; Raster “invisible” bounding boxes; Geometry Shader to create the 3 sides; depth buffer discards occluded fragments; Fragment Shader does visible[objindex] = 1. Temporal Coherence (vertex-bound): Render last visible; Test all bboxes against current depth; Render newly added visible: (~last) & (visible); Each object drawn only once. Diagram label: Passing bbox fragments enable object. Code: // GLSL fragment shader // from ARB_shader_image_load_store layout(early_fragment_tests) in; buffer indirectBuffer { DrawIndirect commands[]; }; flat in int objID; void main(){ commands[objID].instanceCount = 1; } // some other shader would have // cleared to 0 before](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-118-2048.jpg&f=jpg&w=240)
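The temporal coherence step on the culling slide reduces to a bitwise expression: objects drawn in the second pass are those visible now that were not visible in the last frame, `(~last) & (visible)`. A minimal C sketch follows, assuming visibility is packed one bit per object into 32-bit words; that packing and the function name `newly_visible` are assumptions for illustration, not something the slide specifies.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each 32-object word, keep the bits that are set in visible but
 * were clear in last: (~last) & visible. */
void newly_visible(const uint32_t *last, const uint32_t *visible,
                   uint32_t *out, size_t words)
{
    assert(last != NULL && visible != NULL && out != NULL);
    for (size_t w = 0; w < words; w++)
        out[w] = ~last[w] & visible[w];
}
```

Objects set in both masks were already rendered in the "last visible" pass, so masking them out is what keeps each object drawn only once per frame.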







This document presents a detailed overview of OpenGL 4.3, highlighting its new features such as compute shaders, advancements in texture functionality, and compatibility with existing standards. It discusses the importance of OpenGL as an open industry standard, its evolution over two decades, and the implications of its advancements for developers using NVIDIA GPUs. The document also emphasizes NVIDIA's role in the further development of OpenGL, demonstrating how applications can leverage these new capabilities to enhance graphical processing.









![Then and Now 2012 OpenGL 4.3: Real-time Global IlluminationOpenGL 1.0: Per-vertex lighting 1992 [Crassin]](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-10-2048.jpg&f=jpg&w=240)








![Classic OpenGL State Machine From 1991-2007 * vertex & fragment processing got programmable 2001 & 2003 [source: GL 1.0 specification, 1992]](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-19-2048.jpg&f=jpg&w=240)









![Single Program, Multiple Data Example Standard C Code, running single-threadedvoid SAXPY_CPU(int n, float alpha, float x[256], float y[256]){ if (n > 256) n = 256; for (int i = 0; i < n; i++) // loop over each element explicitly y[i] = alpha*x[i] + y[i];}#version 430layout(local_size_x=256) in; // spawn groups of 256 threads!buffer xBuffer { float x[]; }; buffer yBuffer { float y[]; };uniform float alpha;void main(){ int i = int(gl_GlobalInvocationID.x); if (i < x.length()) // derive size from buffer bound y[i] = alpha*x[i] + y[i];} OpenGL Compute Shader, running SPMD SAXPY = BLAS library's single-precision alpha times x plus y](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-29-2048.jpg&f=jpg&w=240)



![Per-Work Group Shared Variables Any thread in a work group can read/write shared variables Typical idiom is to index by each thread’s invocation # Compute Shader a[0][0] a[0][1] a[0][2] a[0][3] source code shared float a[3][4]; a[1][0] a[1][1] a[1][2] a[1][3] unsigned int x = gl_LocalInvocationID.x a[2][0] a[2][1] a[2][2] a[2][3] unsigned int y = gl_LocalInvocationID.y no access to shared variables of a different work group a[y][x] = 2*ndx; a[y][x^1] += a[y][x]; a[0][0] a[0][1] a[0][2] a[0][3] memoryBarrierShared(); a[y][x^2] += a[y][x]; a[1][0] a[1][1] a[1][2] a[1][3] use shared memory barriers a[2][0] a[2][1] a[2][2] a[2][3] to synchronize access to shared variables](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-33-2048.jpg&f=jpg&w=240)



![Compiles into NV_compute_program5 Assembly!!NVcp5.0 # NV_compute_program5 assemblyGROUP_SIZE 16 16; # work group is 16x16 so 256 threadsPARAM c[2] = { program.local[0..1] }; # internal constantsTEMP R0, R1; # temporariesIMAGE images[] = { image[0..7] }; # input & output imagesMAD.S R1.xy,invocation.groupid,{16,16,0,0}.x,invocation.localid;MOV.S R0.x, c[0];LOADIM.U32 R0.x, R1, images[R0.x], 2D; # load from input imageMOV.S R1.z, c[1].x;UP4UB.F R0, R0.x; # unpack RGBA pixel into float4 vectorSTOREIM.F images[R1.z], R0, R1, 2D; # store to output imageEND](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-37-2048.jpg&f=jpg&w=240)




![Copy Compute Shader Tiling gl_WorkGroupID=[x,y] [0,4] [1,4] [2,4] [3,4] [4,4] [0,4] [1,4] [2,4] [3,4] [4,4] [0,3] [1,3] [2,3] [3,3] [4,3] [0,3] [1,3] [2,3] [3,3] [4,3] [0,2] [1,2] [2,2] [3,2] [4,2] [0,2] [1,2] [2,2] [3,2] [4,2] [0,1] [1,1] [2,1] [3,1] [4,1] [0,1] [1,1] [2,1] [3,1] [4,1] [0,0] [1,0] [2,0] [3,0] [4,0] [0,0] [1,0] [2,0] [3,0] [4,0] Input (source) image 76x76 Output (destination) image 76x76](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-42-2048.jpg&f=jpg&w=240)



![Implementing a General Convolution Basic algorithm Tile-oriented: generate MxM pixel tiles So operating on a (M+2N)x(M+2N) region of the image Phase 1: Read all the pixels for a region from input image Phase 2: Perform weighted sum of pixels in [-N,N]x[-N,N] region around each output pixel Phase 3: Output the result pixel to output image](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-46-2048.jpg&f=jpg&w=240)
![General Convolution: Preliminaries// Various kernel-wide constantsconst int tileWidth = 16, tileHeight = 16;const int filterWidth = 5, filterHeight = 5;const ivec2 tileSize = ivec2(tileWidth,tileHeight);const ivec2 filterOffset = ivec2(filterWidth/2,filterHeight/2);const ivec2 neighborhoodSize = tileSize + 2*filterOffset;// Declare the input and output images.layout(binding=0,rgba8) uniform image2D input_image;layout(binding=1,rgba8) uniform image2D output_image;uniform vec4 weight[filterHeight][filterWidth];uniform ivec4 imageBounds; // Bounds of the input image for pixel coordinate clamping.void retirePhase() { memoryBarrierShared(); barrier(); }ivec2 clampLocation(ivec2 xy) { // Clamp the image pixel location to the image boundary. return clamp(xy, imageBounds.xy, imageBounds.zw);}](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-47-2048.jpg&f=jpg&w=240)
![General Convolution: Phase 1layout(local_size_x=TILE_WIDTH,local_size_y=TILE_HEIGHT) in;shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH];void main() { const ivec2 tile_xy = ivec2(gl_WorkGroupID); const ivec2 thread_xy = ivec2(gl_LocalInvocationID); const ivec2 pixel_xy = tile_xy*tileSize + thread_xy; const uint x = thread_xy.x; const uint y = thread_xy.y; // Phase 1: Read the image's neighborhood into shared pixel arrays. for (int j=0; j<neighborhoodSize.y; j += tileHeight) { for (int i=0; i<neighborhoodSize.x; i += tileWidth) { if (x+i < neighborhoodSize.x && y+j < neighborhoodSize.y) { const ivec2 read_at = clampLocation(pixel_xy+ivec2(i,j)-filterOffset); pixel[y+j][x+i] = imageLoad(input_image, read_at); } } } retirePhase();](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-48-2048.jpg&f=jpg&w=240)
![General Convolution: Phases 2 & 3 // Phase 2: Compute general convolution. vec4 result = vec4(0); for (int j=0; j<filterHeight; j++) { for (int i=0; i<filterWidth; i++) { result += pixel[y+j][x+i] * weight[j][i]; } } // Phase 3: Store result to output image. imageStore(output_image, pixel_xy, result);}](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-49-2048.jpg&f=jpg&w=240)




![GLSL Separable Filter Implementation<< assume preliminaries from earlier general convolution example>>layout(local_size_x=TILE_WIDTH,local_size_y=NEIGHBORHOOD_HEIGHT) in;shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH]; // values read from input imageshared vec4 row[NEIGHBORHOOD_HEIGHT][TILE_WIDTH]; // weighted row sumsvoid main() // separable convolution{ const ivec2 tile_xy = ivec2(gl_WorkGroupID); const ivec2 thread_xy = ivec2(gl_LocalInvocationID); const ivec2 pixel_xy = tile_xy*tileSize + (thread_xy-ivec2(0,filterOffset.y)); const uint x = thread_xy.x; const uint y = thread_xy.y; // Phase 1: Read the image's neighborhood into shared pixel arrays. for (int i=0; i<NEIGHBORHOOD_WIDTH; i += TILE_WIDTH) { if (x+i < NEIGHBORHOOD_WIDTH) { const ivec2 read_at = clampLocation(pixel_xy+ivec2(i-filterOffset.x,0)); pixel[y][x+i] = imageLoad(input_image, read_at); } } retirePhase();](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-54-2048.jpg&f=jpg&w=240)
![GLSL Separable Filter Implementation // Phase 2: Weighted sum the rows horizontally. row[y][x] = vec4(0); for (int i=0; i<filterWidth; i++) { row[y][x] += pixel[y][x+i] * rowWeight[i]; } retirePhase(); // Phase 3: Weighted sum the row sums vertically and write result to output image. // Does this thread correspond to a tile pixel? // Recall: There are more threads in the Y direction than tileHeight. if (y < tileHeight) { vec4 result = vec4(0); for (int i=0; i<filterHeight; i++) { result += row[y+i][x] * columnWeight[i]; } // Phase 4: Store result to output image. const ivec2 pixel_xy = tile_xy*tileSize + thread_xy; imageStore(output_image, pixel_xy, result); }}](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-55-2048.jpg&f=jpg&w=240)











![GLSL 4.3 new functionality ARB_arrays_of_arrays Allows multi-dimensional arrays in GLSL. float f[4][3]; ARB_shader_image_size Query size of an image in a shader ARB_explicit_uniform_location Set location of a default-block uniform in the shader ARB_texture_query_levels Query number of mipmap levels accessible through a sampler uniform ARB_fragment_layer_viewport gl_Layer and gl_ViewportIndex now available to fragment shader](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-67-2048.jpg&f=jpg&w=240)




























![Writing GLSL for Bindless Textures Request GLSL to understand bindless textures #version 400 // or later #extension GL_NV_bindless_texture : require Declare a sampler in the normal way in sampler2D bindless_texture; Alternatively, access bindless samplers in big array: uniform Samplers { sampler2D lotsOfSamplers[256]; } Exciting: 256 samplers exceeds the available texture units!](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-96-2048.jpg&f=jpg&w=240)






![OpenGL Technology GL 3.x uniform samplerBuffer matrixBuffer; // need helper functions Texture Buffers (TexBO, unsized mat4 getMatrix (samplerBuffer buf, int i){ return mat4( texelFetch (buf,(i*4)+0), 1D array of basic vector types) texelFetch (buf,(i*4)+1) ... Uniform Buffer Objects (UBO, } arbitrary types, size limitation uniform viewBuffer { 64kb) mat4 viewInvTM; mat4 viewProjTM; Texture Arrays (pack multiple float time; same-sized textures in one array) ... } GL 4.x // NEW 4.3 allows unsized arrays as last entry Shader Storage Buffer (SSBO, // and tighter array packing layout(std430) buffer matrixBuffer { unsized arbitrary buffer access) int willCostOnly16BytesNow[4]; (NEW 4.3) mat4 matrices[]; }](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-103-2048.jpg&f=jpg&w=240)
![OpenGL Technology GL 4.x glGenTextures (2,tex); // create texture with complete mipchain ARB_texture_storage helps driver glTexStorage (GL_..,levels, GL_RGBA8, w, h); // subimage data in later to create immutable “complete” glTexSubImage (GL_.., 0,..., mipData[0]) texture at once // NEW 4.3 // create another texture that references the // same data and interprets a single mip ARB_texture_view (NEW 4.3) // slightly differently allows multiple views (internal- glTextureView (tex[1], GL_TEXTURE_2D, tex[2], format casting with same texel bit GL_R32UI, minlevel, numlevels, minlayer, numlayers); size) on same texture data // NEW 4.3 bind range of buffer ARB_texture_buffer_range (NEW glTexBufferRange (GL_..., GL_RGBA32F, buffer, offset, size); 4.3)](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-104-2048.jpg&f=jpg&w=240)
![NVIDIA Technology // GLSL with true pointers NVIDIA Bindless Graphics uniform mat4* matrixBuffer; Exposes gpu resources directly // API glUniformui64NV (shd->matrixLocation, (pointers or objects) scene->matrixADDR); Reduces CPU cache thrashing greatly mat->diffuse = glGetTextureHandleNV (texobj); GL 3.x // later instead of glBindTexture NV_shader_buffer_load (SBL) for glUniformHandleui64NV (shd->diffuseLocation, arbitrary unsized cross buffer access mat->diffuse) // GLSL NV_vertex_buffer_unified_memory // can also store textures in resources, (VBUM) separates vertex data from // virtually no restrictions on # format uniform materialBuffer { sampler2D howManyTexturesIWant[LARGE]; GL 4.x } // virtual sparse texturing NV_bindless_texture allows sampler uniform usampler2D virtualTex; references anywhere ... sampler2D (packUint2x32 ( texelFetch (virtualTex, coord).xy));](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-105-2048.jpg&f=jpg&w=240)
![Data Transfer // NEW 4.3 copy rectangles of textures Textures glCopyImageSubData ( srcName, srcTarget, srcLevel,srcX,srcY,srcZ, ARB_pixel_buffer_object (GL 2.x) dstName, dstTarget, dstLevel,dstX,dstY,dstZ, srcWidth, srcHeight, srcDepth); Now ARB_copy_image (NEW 4.3) Buffers // EXT_direct_state_access style usage shown // classic functions exists as well ARB_map_buffer_range (GL 3.x) for // range map and invalidate fast mapping void* data; ARB_invalidate_subdata (NEW 4.3) data = glMapNamedBufferRangeEXT (textBuffer, 0, sizeof(MyChar) * textLength, ARB_clear_buffer_object (NEW 4.3) GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT); allows memset() operations ARB_sync (GL 3.x) for efficient // NEW 4.3 clearing a buffer GLuint zero[1] = {0}; threaded streaming glClearNamedBufferDataEXT (visibleBuffer, GTC 2012 “Optimizing Texture Transfers” GL_R32UI, GL_RED, GL_UNSIGNED_INT, zero);](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-106-2048.jpg&f=jpg&w=240)

![Drawing the Objects. Enhanced: grow buffers and dynamically index. TexBO: can be large, but ugly to fetch; UBO: fast, but size limited; SSBO: large. Pass assignment index as glUniform, glVertexAttribI (faster). Go bindless beyond VBUM: bypassing binding completely. Code: struct MyMaterial { vec4 diffuse; int shadeType; ... }; uniform materialBuffer { MyMaterial materials[128]; }; buffer transformBuffer { mat4 transforms[]; }; ... gl_FragColor = materials[assign.x].diffuse; // bindless pointer datatypes struct Object { MyMaterial* material; mat4* transform; };](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-108-2048.jpg&f=jpg&w=240)


![BaseInstance as Unique Object ID. MultiDrawIndirect uses instanced drawing. Can replicate a vertex attribute (material... assignment). Regular: VArray[ gl_VertexID + baseVertex ]. Divisor != 0: VArray[ gl_InstanceID / VAttribDivisor + baseInstance ]. Diagram: AssignBuffer (divisor) and VertexBuffer (regular) feed the Combined Attributes; the MultiDrawBuffer holds two commands: { count = 9, baseVertex = 0, baseInstance = 1 } and { count = 6, baseVertex = 9, baseInstance = 0 }.](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-111-2048.jpg&f=jpg&w=240)
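
The two fetch rules on this slide can be checked on the CPU. What follows is a minimal sketch in plain C, not GL code; the function names `regular_fetch_index` and `instanced_fetch_index` are hypothetical and only mirror the slide's formulas. When each indirect command draws a single instance, gl_InstanceID is 0, so an attribute with a non-zero divisor resolves to element `baseInstance`, which is what lets baseInstance act as a per-draw object ID.

```c
#include <assert.h>
#include <stdint.h>

/* Regular attribute (divisor == 0), as written on the slide:
   VArray[gl_VertexID + baseVertex]. */
uint32_t regular_fetch_index(uint32_t vertex_id, uint32_t base_vertex)
{
    return vertex_id + base_vertex;
}

/* Instanced attribute (divisor != 0), as written on the slide:
   VArray[gl_InstanceID / VAttribDivisor + baseInstance]. */
uint32_t instanced_fetch_index(uint32_t instance_id, uint32_t divisor,
                               uint32_t base_instance)
{
    assert(divisor != 0); /* divisor 0 means "regular", handled above */
    return instance_id / divisor + base_instance;
}
```

With the slide's two commands, `instanced_fetch_index(0, 1, 1)` selects entry 1 of the assignment buffer for the first draw and `instanced_fetch_index(0, 1, 0)` selects entry 0 for the second.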

![Shader Switching. Could use indexed subroutines. If shaders are similar in resource consumption (register usage), might want to combine them (GL 4.x). Initialize the subroutine array once, then dynamically index. NVIDIA Bindless Graphics pointers allow casting buffer addresses. Code: subroutine vec4 shade_fn (); subroutine (shade_fn) vec4 metal() ... subroutine (shade_fn) vec4 wood () ... // content of array set by gl api call subroutine uniform shade_fn shadeFuncs[2]; flat in ivec4 assigns; void main(){ gl_FragColor = shadeFuncs[assigns.y](); } // bindless pointer casting and texture sampling vec4 metal() { MetalParams* metal = packPtr (assigns.zw); ... texture (metal->roughnessMap, uv); } vec4 wood() { WoodParams* wood = packPtr (assigns.zw); ... }](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-113-2048.jpg&f=jpg&w=240)

![OpenGL Computing. ARB_compute_shader (NEW 4.3): dispatch threads with shared memory support (as in CUDA/CL); access to ALL resources, textures, no interop, all in GLSL; NV_BINDLESS benefits from pointer access. ARB_framebuffer_no_attachments (NEW 4.3): use rasterizer to spawn threads and SSBO/imageStores to record your results. GL 3.x Transform Feedback (XFB): allows simple 1D kernels. Code: // API - COMPUTE glDispatchCompute (gx, gy, gz); // GLSL shared float s_mem[SOMESIZE]; ... s_mem[gl_LocalInvocationIndex] = ... // API - FBO glBindFramebuffer (...); glFramebufferParameteri (..., GL_FRAMEBUFFER_DEFAULT_WIDTH, 2048); glFramebufferParameteri (..., GL_FRAMEBUFFER_DEFAULT_HEIGHT, 2048); glDrawArrays(...) // GLSL ... imageStore(...ivec2(gl_FragCoord),.); // API - XFB glEnable (GL_RASTERIZER_DISCARD); glDrawArrays (GL_POINTS,0, count); // GLSL buffer indirectBuffer { DrawIndirect commands[]; } ... commands[gl_VertexID].instanceCount = visible ? 1 : 0;](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-115-2048.jpg&f=jpg&w=240)
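
The last snippet on this slide runs one thread per object and writes `instanceCount = visible ? 1 : 0` into an indirect command buffer. A CPU stand-in for that 1D kernel might look like the sketch below. It assumes the slide's `DrawIndirect` type has the field layout of a `glDrawArraysIndirect` command (the slide does not say which variant it uses), and `apply_visibility` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the four 32-bit fields of a glDrawArraysIndirect command. */
typedef struct {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t first;
    uint32_t baseInstance;
} DrawArraysIndirectCommand;

/* One loop iteration plays the role of one GPU thread (gl_VertexID == i).
   Returns how many draws remain enabled. */
size_t apply_visibility(DrawArraysIndirectCommand *commands,
                        const uint8_t *visible, size_t n)
{
    assert(n == 0 || (commands != NULL && visible != NULL));
    size_t enabled = 0;
    for (size_t i = 0; i < n; ++i) {
        commands[i].instanceCount = visible[i] ? 1u : 0u;
        enabled += commands[i].instanceCount;
    }
    return enabled;
}
```

A command with `instanceCount == 0` stays in the buffer but draws nothing, so the command list never needs compacting and the CPU never reads the visibility results back.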


![Culling Techniques. Raster Occlusion (GL 4.x): depth pass; raster “invisible” bounding boxes; Geometry Shader to create the 3 sides; depth buffer discards occluded fragments; Fragment Shader does visible[objindex] = 1. Temporal Coherence (vertex-bound): render last visible; test all bboxes against current depth; render newly added visible: (~last) & (visible); each object drawn only once. Figure caption: passing bbox fragments enable object. Code: // GLSL fragment shader // from ARB_shader_image_load_store layout(early_fragment_tests) in; buffer indirectBuffer { DrawIndirect commands[]; }; flat in int objID; void main(){ commands[objID].instanceCount = 1; } // some other shader would have // cleared to 0 before](/image.pl?url=https%3a%2f%2fimage.slidesharecdn.com%2fsiggraph2012nvidiaopenglfor2012-120820103157-phpapp01%2f75%2fSIGGRAPH-2012-NVIDIA-OpenGL-for-2012-118-2048.jpg&f=jpg&w=240)
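
The temporal-coherence step on this slide, `(~last) & (visible)`, is a per-object bit operation. The sketch below shows it in plain C under one assumption that is not on the slide: visibility flags are packed 32 objects per `uint32_t` word. The function name `newly_visible` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For each word, keep the objects that are visible now but were not
   already rendered in the "last visible" pass of this frame. */
void newly_visible(const uint32_t *last, const uint32_t *visible,
                   uint32_t *out, size_t words)
{
    assert(words == 0 || (last != NULL && visible != NULL && out != NULL));
    for (size_t w = 0; w < words; ++w)
        out[w] = ~last[w] & visible[w];
}
```

Because `out` never shares a set bit with `last`, the two render passes cover disjoint object sets, which matches the slide's claim that each object is drawn only once.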





