Uploaded by Mark Kilgard

SIGGRAPH 2012: NVIDIA OpenGL for 2012

This document presents a detailed overview of OpenGL 4.3, highlighting its new features such as compute shaders, advancements in texture functionality, and compatibility with existing standards. It discusses the importance of OpenGL as an open industry standard, its evolution over two decades, and the implications of its advancements for developers using NVIDIA GPUs. The document also emphasizes NVIDIA's role in the further development of OpenGL, demonstrating how applications can leverage these new capabilities to enhance graphical processing.

NVIDIA OpenGL in 2012: Version 4.3 is here! (unabridged slide deck)
Mark Kilgard
Mark Kilgard
• Principal System Software Engineer
    - OpenGL driver and API evolution
    - Cg (“C for graphics”) shading language
    - GPU-accelerated path rendering
• OpenGL Utility Toolkit (GLUT) implementer
• Author of OpenGL for the X Window System
• Co-author of Cg Tutorial
• Worked on OpenGL for 20+ years
Talk Details
Location: West Hall Meeting Room 503, Los Angeles Convention Center
Date: Wednesday, August 8, 2012
Time: 11:50 AM - 12:50 PM
Mark Kilgard (Principal Software Engineer, NVIDIA)
Abstract: Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForce GPUs. Learn about the new features in OpenGL 4.3, particularly Compute Shaders. Other topics include bindless graphics; Linux improvements; and how to best use the modern OpenGL graphics pipeline. Learn how your application can benefit from NVIDIA's leadership driving OpenGL as a cross-platform, open industry standard.
Topic Areas: Computer Graphics; Development Tools & Libraries; Visualization; Image and Video Processing
Level: Intermediate
Watch video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS104.html
Outline
• State of OpenGL & OpenGL’s importance to NVIDIA
• Compute Shaders explored
• Other stuff in OpenGL 4.3
• Further NVIDIA OpenGL Work
• How to exploit OpenGL’s modern graphics pipeline
State of OpenGL & OpenGL’s importance to NVIDIA
OpenGL Standard is 20 Years and Strong
Think back to Computing in 1992
• Programming Languages
    - ANSI C (C 89) was just 3 years old
    - C++ still implemented as a front-end to C
    - OpenGL in 1992 provided FORTRAN and Pascal bindings
• One year before NCSA Mosaic web browser first written
    - Now WebGL standard in almost every browser
• Windows version
    - Windows 3.1 ships!
    - NT 3.1 still a year away
• Entertainment
    - Great video game graphics? Mortal Kombat?
    - Top grossing movie (Aladdin) was animated
    - Back when animated movies were still hand-drawn
20 Years Ago: Enter OpenGL
20 Years in Print
Then and Now
• 1992: OpenGL 1.0: Per-vertex lighting
• 2012: OpenGL 4.3: Real-time Global Illumination [Crassin]
Big News
• OpenGL 4.3 announced Monday here at SIGGRAPH
    - August 6, 2012
    - Moments later… NVIDIA beta OpenGL 4.3 driver on the web
    - http://www.nvidia.com/content/devzone/opengl-driver-4.3.html
• OpenGL 4.3 brings substantial new features
    - Compute Shaders! (marquee feature)
    - OpenGL Shading Language (GLSL) updates (multi-dimensional arrays, etc.)
    - New texture functionality (stencil texturing, more queries)
    - New buffer functionality (clear buffers, invalidate buffers, etc.)
    - More Direct3D-isms (texture views, parity with DirectX compute shaders)
    - OpenGL ES 3.0 compatibility
NVIDIA’s OpenGL Leverage
• GeForce
• Quadro
• Tegra
• Programmable Graphics (GLSL, Cg)
• Debugging with Parallel Nsight
• OptiX
Single 3D API for Every Platform
• Windows • OS X • Linux • Android • Solaris • FreeBSD
OpenGL 3D Graphics API • cross-platform • most functional • peak performance • open standard • inter-operable • well specified & documented • 20 years of compatibility
OpenGL Spawns Closely Related Standards
• Congratulations: WebGL officially approved, February 2012
    - “The web is now 3D enabled”
Accelerating OpenGL Innovation
[Timeline figure, 2004 to 2012]
    - DirectX: 9.0c, 10.0, 10.1, 11
    - OpenGL: 2.0, 2.1, 3.0, 3.1, 3.2, 3.3 + 4.0, 4.1, 4.2, 4.3 (now with compute shaders!)
• OpenGL has fast innovation + standardization
    - Pace is 7 new spec versions in four years
    - Actual implementations following specifications closely
• OpenGL 4.3 is a superset of DirectX 11 functionality
    - While retaining backwards compatibility
OpenGL Today – DirectX 11 Superset
• First-class graphics + compute solution
    - OpenGL 4.3 = graphics + compute shaders
    - NVIDIA still has existing inter-op with CUDA / OpenCL
• Shaders can be saved to and loaded from binary blobs
    - Ability to query a binary shader, and save it for reuse later
• Flow of content between desktop and mobile
    - Brings ES 2.0 and 3.0 API and capabilities to desktop
    - WebGL bridging desktop and mobile
• Cross platform
    - Mac, Windows, Linux, Android, Solaris, FreeBSD
    - Result of being an open standard
[Figure label: Buffer and Event Interop]
Increasing Functional Scope of OpenGL
• 4.3: First-class Compute Shaders
• 4.0: Tessellation Features
• 3.X: Geometry Shaders
• 2.X: Vertex and Fragment Shaders
• 1.X: Fixed Function
Arguably, OpenGL 4.3 is a 5.0 worthy feature-set!
Classic OpenGL State Machine
• From 1991-2007*
    - * vertex & fragment processing got programmable 2001 & 2003
[Figure source: GL 1.0 specification, 1992]
Complicated from inception
[Figure: OpenGL 3.0 Conceptual Processing Flow (2008). Block diagram of vertex, geometric primitive, fragment, raster, texture, and pixel processing; legend distinguishes programmable operations from fixed-function operations.]
[Figure: OpenGL 4.0 Conceptual Processing Flow (2010). Same diagram with patch stages added: control point processing, patch assembly & processing, patch tessellation generation, patch evaluation processing.]
[Figure: OpenGL 4.3 Conceptual Processing Flow (2012). Same diagram with compute processing added.]
OpenGL 4.3 Processing Pipelines
[Figure: data-flow diagram of the OpenGL 4.3 pipelines]
• Graphics pipeline: From Application, Vertex Puller, Vertex Shader, Tessellation Control Shader, Tessellation Primitive Generator, Tessellation Evaluation Shader, Geometry Shader, Transform Feedback, Rasterization, Fragment Shader, Raster Operations, Framebuffer
• Compute pipeline: From Application, Dispatch, Compute Shader
• Pixel path: From Application, Pixel Assembly, Pixel Operations, Pixel Pack
• Bound resources: Element Array Buffer (b), Draw Indirect Buffer (b), Vertex Buffer Object (b), Dispatch Indirect Buffer (b), Image Load / Store (t/b), Atomic Counter (b), Shader Storage (b), Texture Fetch (t/b), Uniform Block (b), Transform Feedback Buffer (b), Pixel Unpack Buffer (b), Texture Image (t), Pixel Pack Buffer (b)
• Legend: Fixed Function Stage; Programmable Stage; b – Buffer Binding; t – Texture Binding; arrows indicate data flow
OpenGL 4.3 Compute Shaders explored
Why Compute Shaders?
• Execute algorithmically general-purpose GLSL shaders
    - Read and write uniforms and images
    - Grid-oriented Single Program, Multiple Data (SPMD) execution model with communication via shared variables
• Process graphics data in context of the graphics pipeline
    - Easier than interoperating with a compute API when processing ‘close to the pixel’
    - Avoids involved “inter-op” APIs to connect OpenGL objects to CUDA or OpenCL
• Complementary to OpenCL
    - Gives full access to OpenGL objects (multisample buffers, etc.)
    - Same GLSL language used for graphics shaders
    - In contrast to CUDA C/C++, not a full heterogeneous (CPU/GPU) programming framework using full ANSI C
• Standard part of all OpenGL 4.3 implementations
    - Matches DirectX 11 functionality
[Example images: particle physics, fluid behavior, crowd simulation, ray tracing, global illumination]
Compute Shader Particle System Demo
• Mike Bailey @ Oregon State University
• Co-author of a book (cover shown on slide); 2nd edition now available
OpenGL 4.3 Compute Shaders
• Single Program, Multiple Data (SPMD) Execution Model
    - Mental model: “scores of threads jump into same function at once”
• Hierarchical thread structure
    - Threads in Groups in Dispatches
[Figure: invocation (thread), work group, dispatch]
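The "threads in groups in dispatches" hierarchy implies a small sizing calculation on the host: given a problem size and a work group size, how many groups must one dispatch launch? The following is a minimal sketch in plain C, not taken from the slides. The helper names `work_groups_for` and `total_invocations` are hypothetical, and a 1D dispatch is assumed.

```c
#include <assert.h>

/* Hypothetical helper: smallest number of work groups such that
   groups * local_size >= n (ceiling division). */
unsigned int work_groups_for(unsigned int n, unsigned int local_size)
{
    assert(local_size > 0);
    return n / local_size + (n % local_size != 0);
}

/* Hypothetical helper: total invocations (threads) launched by a
   1D dispatch of num_groups work groups. */
unsigned int total_invocations(unsigned int num_groups, unsigned int local_size)
{
    return num_groups * local_size;
}
```

For 1000 elements and groups of 256 threads this gives 4 groups and 1024 invocations, so 24 invocations have no element to process and need a bounds check in the shader. In a real application the group count would be passed as the first argument of `glDispatchCompute`.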
Single Program, Multiple Data Example

Standard C Code, running single-threaded:

    void SAXPY_CPU(int n, float alpha, float x[256], float y[256])
    {
        if (n > 256) n = 256;
        for (int i = 0; i < n; i++) // loop over each element explicitly
            y[i] = alpha*x[i] + y[i];
    }

OpenGL Compute Shader, running SPMD:

    #version 430
    layout(local_size_x=256) in; // spawn groups of 256 threads!
    buffer xBuffer { float x[]; };
    buffer yBuffer { float y[]; };
    uniform float alpha;
    void main()
    {
        int i = int(gl_GlobalInvocationID.x);
        if (i < x.length()) // derive size from buffer bound
            y[i] = alpha*x[i] + y[i];
    }

SAXPY = BLAS library's single-precision alpha times x plus y
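To make the SPMD mental model concrete, the dispatch can be emulated sequentially on the CPU: every invocation runs the same function body and differs only in its ID. This is a sketch under stated assumptions, not how a GPU schedules work. The names `saxpy_invocation` and `saxpy_dispatch_emulated` are hypothetical, and the global ID is formed as group number times group size plus local ID, which is how GLSL defines `gl_GlobalInvocationID`.

```c
#include <assert.h>
#include <stddef.h>

#define LOCAL_SIZE_X 256u  /* mirrors layout(local_size_x=256) in the shader */

/* What a single invocation does, given its global ID. */
static void saxpy_invocation(unsigned int global_id, size_t n,
                             float alpha, const float *x, float *y)
{
    if (global_id < n)  /* plays the role of the i < x.length() guard */
        y[global_id] = alpha * x[global_id] + y[global_id];
}

/* Sequential stand-in for a 1D dispatch of num_groups work groups.
   On a GPU these iterations run in parallel and in no fixed order. */
void saxpy_dispatch_emulated(unsigned int num_groups, size_t n,
                             float alpha, const float *x, float *y)
{
    assert(x != NULL && y != NULL);
    for (unsigned int group = 0; group < num_groups; group++)
        for (unsigned int local = 0; local < LOCAL_SIZE_X; local++)
            saxpy_invocation(group * LOCAL_SIZE_X + local, n, alpha, x, y);
}
```

The emulation is only valid because no invocation reads a value that another invocation writes; that independence is what lets the real shader run all invocations at once.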
Examples of Single-threaded Execution vs. SPMD Programming Systems

    Single-threaded                   Single Program, Multiple Data
    C/C++                             CUDA C/C++
    FORTRAN                           DirectCompute
    Pascal                            OpenCL
                                      OpenGL Compute Shaders
    CPU-centric, hard to make         GPU-centric, naturally
    multi-threaded & parallel         expresses parallelism
Per-Thread Local Variables
• Each thread can read/write variables “private” to its execution
    - Each thread gets its own unique storage for each local variable
    - No access to locals of other threads

Compute Shader source code:

    int i;
    float v;
    i++;
    v = 2*v + i;

[Figure: a work group in which thread #1 and thread #2 each hold their own local copies of v and i]
Special Per-thread Variables
• Work group can have a 1D, 2D or 3D “shape”
    - Specified via Compute Shader input declarations
    - Compute Shader syntax examples
        - 1D, 256 threads: layout(local_size_x=256) in;
        - 2D, 8x10 thread shape: layout(local_size_x=8,local_size_y=10) in;
        - 3D, 4x4x4 thread shape: layout(local_size_x=4,local_size_y=4,local_size_z=4) in;
• Every thread in work group has its own invocation #
    - Accessed through built-in variable: in uvec3 gl_LocalInvocationID;
    - Think of every thread having a “who am I?” variable
    - Using these variables, threads are expected to
        - Index arrays
        - Determine their flow control
        - Compute thread-dependent computations
[Figure: 6x3 work group with one thread highlighted, gl_LocalInvocationID=(4,1,0)]
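When a thread uses its invocation ID to index a flat array, the 3D ID has to be linearized. GLSL exposes this as the built-in `gl_LocalInvocationIndex`, defined as z * size.x * size.y + y * size.x + x. The C sketch below reproduces that arithmetic; the `uvec3` struct and the function names are stand-ins introduced here for illustration.

```c
#include <assert.h>

/* Stand-in for GLSL's uvec3 type. */
typedef struct { unsigned int x, y, z; } uvec3;

/* Flattens a 3D local invocation ID into a single index, using the
   same formula GLSL specifies for gl_LocalInvocationIndex. */
unsigned int local_invocation_index(uvec3 id, uvec3 size)
{
    assert(id.x < size.x && id.y < size.y && id.z < size.z);
    return id.z * size.x * size.y + id.y * size.x + id.x;
}

/* Number of threads in a work group of the given shape. */
unsigned int work_group_thread_count(uvec3 size)
{
    return size.x * size.y * size.z;
}
```

For the 6x3 work group on the slide, the thread with ID (4,1,0) gets flat index 1*6 + 4 = 10 out of 18 threads.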
Per-Work Group Shared Variables
 Any thread in a work group can read/write shared variables
    Typical idiom is to index by each thread’s invocation #
    No access to shared variables of a different work group
    Use shared memory barriers to synchronize access to shared variables
 Compute Shader source code:
    shared float a[3][4];
    uint x = gl_LocalInvocationID.x;
    uint y = gl_LocalInvocationID.y;
    a[y][x] = 2*ndx;
    a[y][x^1] += a[y][x];
    memoryBarrierShared();
    a[y][x^2] += a[y][x];
 [Figure: two work groups, each with its own 3x4 shared array a[0][0] … a[2][3]]
Reading and Writing Global Resources
 In addition to local and shared variables…
 Compute Shaders can also access global resources
    Read-only
       Textures
       Uniform buffer objects
    Read-write
       Texture images
       Uniform buffers
       Shader storage buffers
       Atomic counters
       Bindless buffers
    Take care updating shared read-write resources
 [Figure: work groups accessing global OpenGL resources: textures, an image (within a texture), atomic counters, and a buffer object holding per-vertex color (red, green, blue) and position (x, y, z)]
Simple Compute Shader
 Let’s just copy from one 2D texture image to another…
    Pseudo-code:
       for each pixel in source image
          copy pixel to destination image
    Pixels could be copied fully in parallel
 How would we write this as a compute shader...
Simple Compute Shader
 Let’s just copy from one 2D texture image to another…
   #version 430 // use OpenGL 4.3's GLSL with Compute Shaders
   #define TILE_WIDTH 16
   #define TILE_HEIGHT 16
   const ivec2 tileSize = ivec2(TILE_WIDTH,TILE_HEIGHT);
   layout(binding=0,rgba8) uniform image2D input_image;
   layout(binding=1,rgba8) uniform image2D output_image;
   layout(local_size_x=TILE_WIDTH,local_size_y=TILE_HEIGHT) in;
   void main() {
     const ivec2 tile_xy = ivec2(gl_WorkGroupID);
     const ivec2 thread_xy = ivec2(gl_LocalInvocationID);
     const ivec2 pixel_xy = tile_xy*tileSize + thread_xy;
     vec4 pixel = imageLoad(input_image, pixel_xy);
     imageStore(output_image, pixel_xy, pixel);
   }
Compiles into NV_compute_program5 Assembly
   !!NVcp5.0 # NV_compute_program5 assembly
   GROUP_SIZE 16 16; # work group is 16x16 so 256 threads
   PARAM c[2] = { program.local[0..1] }; # internal constants
   TEMP R0, R1; # temporaries
   IMAGE images[] = { image[0..7] }; # input & output images
   MAD.S R1.xy,invocation.groupid,{16,16,0,0}.x,invocation.localid;
   MOV.S R0.x, c[0];
   LOADIM.U32 R0.x, R1, images[R0.x], 2D; # load from input image
   MOV.S R1.z, c[1].x;
   UP4UB.F R0, R0.x; # unpack RGBA pixel into float4 vector
   STOREIM.F images[R1.z], R0, R1, 2D; # store to output image
   END
What is NV_compute_program5? NVIDIA has always provided assembly-level interfaces to GPU programmability in OpenGL    NV_gpu_program5 is Shader Model 5.0 assembly       And NV_gpu_program4 was for Shader Model 4.0    NV_tessellation_program5 is programmable tessellation extension    NV_compute_program5 is further extension for Compute Shaders Advantages of assembly extensions    Faster load-time for shaders    Easier target for dynamic shader generation       Allows other languages/tools, such as Cg, to target the underlying hardware    Provides concrete underlying execution model       You don’t have to guess if your GLSL compiles well or not
Launching a Compute Shader
 First write your compute shader
    Request GLSL 4.30 in your source code: #version 430
    More on this later…
 Second compile your compute shader
    Same compilation process as standard GLSL graphics shaders…
    glCreateShader/glShaderSource with Compute Shader token
       GLuint compute_shader = glCreateShader(GL_COMPUTE_SHADER);
    glCreateProgram/glAttachShader/glLinkProgram
       (compute and graphics shaders cannot mix in the same program)
 Bind to your program object (the linked program returned by glCreateProgram, not the shader object)
    glUseProgram(compute_program);
 Dispatch a grid of work groups
    glDispatchCompute(4, 4, 3);   // dispatches a 4x4x3 grid of work groups
Launching the Copy Compute Shader
 Setup for copying from source to destination texture
    Create an input (source) texture object (OpenGL 4.2 or ARB_texture_storage plus EXT_direct_state_access)
       glTextureStorage2DEXT(input_texobj, GL_TEXTURE_2D,
          1, GL_RGBA8, width, height);
       glTextureSubImage2DEXT(input_texobj, GL_TEXTURE_2D,
          /*level*/0, /*x,y*/0,0, width, height,
          GL_RGBA, GL_UNSIGNED_BYTE, image);
    Create an empty output (destination) texture object
       glTextureStorage2DEXT(output_texobj, GL_TEXTURE_2D,
          1, GL_RGBA8, width, height);
    Bind level zero of both textures to texture images 0 and 1 (OpenGL 4.2 or ARB_shader_image_load_store)
       GLboolean is_not_layered = GL_FALSE;
       glBindImageTexture(/*image*/0, input_texobj, /*level*/0,
          is_not_layered, /*layer*/0, GL_READ_ONLY, GL_RGBA8);
       glBindImageTexture(/*image*/1, output_texobj, /*level*/0,
          is_not_layered, /*layer*/0, GL_READ_WRITE, GL_RGBA8);
    Use the copy compute shader
       glUseProgram(compute_program);   // program object linked from the compute shader
 Dispatch sufficient work group instances of the copy compute shader (OpenGL 4.3)
       glDispatchCompute((width + 15) / 16, (height + 15) / 16, 1);
Copy Compute Shader Execution  Input (source) image     Output (destination) image
Copy Compute Shader Tiling
 [Figure: input (source) image 76x76 and output (destination) image 76x76, each covered by a 5x5 grid of 16x16 tiles; every tile is labeled with its gl_WorkGroupID=[x,y], from [0,0] at the bottom left to [4,4] at the top right]
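The tiling arithmetic can be checked on the CPU. The following is a minimal sketch in plain C, not the deck's code: `work_groups_for` and `emulate_copy` are hypothetical helper names, the image is assumed to hold one packed 32-bit value per pixel, and the emulation simply skips invocations that land outside the image.

```c
#include <assert.h>
#include <string.h>

enum { TILE_WIDTH = 16, TILE_HEIGHT = 16 };

/* Number of work groups needed to cover `size` pixels with tiles of
   `tile` pixels, rounding up (the same (size + 15) / 16 idiom for tile = 16). */
int work_groups_for(int size, int tile) {
    assert(size > 0 && tile > 0);
    return (size + tile - 1) / tile;
}

/* CPU emulation of the copy dispatch: one innermost iteration per shader
   invocation. Returns how many invocations fell outside the image. */
int emulate_copy(const unsigned *src, unsigned *dst, int width, int height) {
    int groups_x = work_groups_for(width, TILE_WIDTH);
    int groups_y = work_groups_for(height, TILE_HEIGHT);
    int out_of_bounds = 0;
    for (int gy = 0; gy < groups_y; gy++) {             /* gl_WorkGroupID.y */
        for (int gx = 0; gx < groups_x; gx++) {         /* gl_WorkGroupID.x */
            for (int ly = 0; ly < TILE_HEIGHT; ly++) {  /* gl_LocalInvocationID.y */
                for (int lx = 0; lx < TILE_WIDTH; lx++) {
                    int px = gx * TILE_WIDTH + lx;      /* pixel_xy.x */
                    int py = gy * TILE_HEIGHT + ly;     /* pixel_xy.y */
                    if (px >= width || py >= height) {
                        out_of_bounds++;                /* edge tile overhang */
                        continue;
                    }
                    dst[py * width + px] = src[py * width + px];
                }
            }
        }
    }
    return out_of_bounds;
}
```

For the 76x76 image this gives a 5x5 grid of work groups, so the edge tiles overhang the image and some invocations have no pixel to copy.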
Next Example: General Convolution Discrete convolution: common image processing operation    Building block for blurs, sharpening, edge detection, etc. Example: 5x5 convolution (N=5) of source (input) image s    Generates destination (output) image d, given NxN matrix of weights w
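The slide's formula did not survive as text. One form consistent with the shader code on the later slides, assuming zero-based weight indices and a filter centered on the output pixel, would be:

```latex
d(x,y) \;=\; \sum_{j=0}^{N-1} \sum_{i=0}^{N-1}
  w_{j,i}\; s\!\left(x + i - \lfloor N/2 \rfloor,\; y + j - \lfloor N/2 \rfloor\right)
```

Reads of s that fall outside the image are assumed to be clamped to the image boundary, as clampLocation does in the shader.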
Input Image
Output Image after 5x5 Gaussian Blur                                       sigma=2.0
Implementing a General Convolution
 Basic algorithm
    Tile-oriented: generate MxM pixel tiles
    So operating on a (M+2R)x(M+2R) region of the image, where R = floor(N/2) is the filter radius (R=2 for the 5x5 filter)
 Phase 1: Read all the pixels for a region from input image
 Phase 2: Perform weighted sum of pixels in [-R,R]x[-R,R] region around each output pixel
 Phase 3: Output the result pixel to output image
General Convolution: Preliminaries
// Various kernel-wide constants
const int tileWidth = 16,
          tileHeight = 16;
const int filterWidth = 5,
          filterHeight = 5;
const ivec2 tileSize = ivec2(tileWidth,tileHeight);
const ivec2 filterOffset = ivec2(filterWidth/2,filterHeight/2);
const ivec2 neighborhoodSize = tileSize + 2*filterOffset;
// Declare the input and output images.
layout(binding=0,rgba8) uniform image2D input_image;
layout(binding=1,rgba8) uniform image2D output_image;
uniform vec4 weight[filterHeight][filterWidth];
uniform ivec4 imageBounds;   // Bounds of the input image for pixel coordinate clamping.
void retirePhase() { memoryBarrierShared(); barrier(); }
ivec2 clampLocation(ivec2 xy) {
    // Clamp the image pixel location to the image boundary.
    return clamp(xy, imageBounds.xy, imageBounds.zw);
}
General Convolution: Phase 1
// TILE_WIDTH/TILE_HEIGHT and NEIGHBORHOOD_WIDTH/NEIGHBORHOOD_HEIGHT are preprocessor
// macros with the same values as tileSize and neighborhoodSize.
layout(local_size_x=TILE_WIDTH,local_size_y=TILE_HEIGHT) in;
shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH];
void main() {
    const ivec2 tile_xy = ivec2(gl_WorkGroupID);
    const ivec2 thread_xy = ivec2(gl_LocalInvocationID);
    const ivec2 pixel_xy = tile_xy*tileSize + thread_xy;
    const uint x = thread_xy.x;
    const uint y = thread_xy.y;
    // Phase 1: Read the image's neighborhood into shared pixel arrays.
    for (int j=0; j<neighborhoodSize.y; j += tileHeight) {
        for (int i=0; i<neighborhoodSize.x; i += tileWidth) {
            if (x+i < neighborhoodSize.x && y+j < neighborhoodSize.y) {
                const ivec2 read_at = clampLocation(pixel_xy+ivec2(i,j)-filterOffset);
                pixel[y+j][x+i] = imageLoad(input_image, read_at);
            }
        }
    }
    retirePhase();
General Convolution: Phases 2 & 3
    // Phase 2: Compute general convolution.
    vec4 result = vec4(0);
    for (int j=0; j<filterHeight; j++) {
        for (int i=0; i<filterWidth; i++) {
            result += pixel[y+j][x+i] * weight[j][i];
        }
    }
    // Phase 3: Store result to output image.
    imageStore(output_image, pixel_xy, result);
}
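A CPU reference is handy for validating the shader's output. The sketch below is an assumption-laden stand-in rather than the deck's code: `convolve_reference` is a hypothetical name, the image is single-channel float instead of RGBA8, and border reads are clamped in the same spirit as clampLocation.

```c
#include <assert.h>

static int clamp_int(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Reference n x n convolution (n odd) of a single-channel image.
   weights is row-major: weights[j * n + i] plays the role of weight[j][i].
   Reads outside the image are clamped to the nearest edge pixel. */
void convolve_reference(const float *src, float *dst, int width, int height,
                        const float *weights, int n) {
    assert(width > 0 && height > 0);
    assert(n > 0 && n % 2 == 1);
    int r = n / 2;                         /* filter radius, like filterOffset */
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++) {
                for (int i = 0; i < n; i++) {
                    int sx = clamp_int(x + i - r, 0, width - 1);
                    int sy = clamp_int(y + j - r, 0, height - 1);
                    sum += src[sy * width + sx] * weights[j * n + i];
                }
            }
            dst[y * width + x] = sum;
        }
    }
}
```

Unlike the shader, this version has no tiling or shared memory; it only reproduces the arithmetic of Phases 2 and 3 for every output pixel.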
Separable Convolution Many important convolutions expressible in “separable” form    More efficient to evaluate    Allows two step process: 1) blur rows, then 2) blur columns    Two sets of weights: column vector weights c and row vector weights r Practical example for demonstrating Compute Shader shared variables…
Example Separable Convolutions   Original         Original     Original
Example Separable ConvolutionsGaussian filter, sigma=2.25   Sobel filter, horizontal   Sobel filter, vertical
Example Separable Convolutions Weights
 5x5 Gaussian filter, sigma=2.25
    0.026232  0.035279  0.038941  0.035279  0.026232
    0.035279  0.047446  0.052371  0.047446  0.035279
    0.038941  0.052371  0.057807  0.052371  0.038941
    0.035279  0.047446  0.052371  0.047446  0.035279
    0.026232  0.035279  0.038941  0.035279  0.026232
    = column vector [0.161964 0.217820 0.240432 0.217820 0.161964]^T
      times row vector [0.161964 0.217820 0.240432 0.217820 0.161964]
 Sobel filter, horizontal
    -1  0  1
    -2  0  2
    -1  0  1
    = column vector [1 2 1]^T times row vector [-1 0 1]
 Sobel filter, vertical
    -1 -2 -1
     0  0  0
     1  2  1
    = column vector [-1 0 1]^T times row vector [1 2 1]
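The separable form can be verified numerically: the 2D weight matrix is the outer product of the column and row vectors. A minimal C sketch, with `outer_product` and `gaussian_1d` as illustrative names and the 1D weights taken from the slide:

```c
#include <assert.h>

enum { FILTER_SIZE = 5 };

/* 1D Gaussian weights (sigma = 2.25) as listed on the slide. */
const float gaussian_1d[FILTER_SIZE] = {
    0.161964f, 0.217820f, 0.240432f, 0.217820f, 0.161964f
};

/* out[j * n + i] = column[j] * row[i]: the n x n matrix that a separable
   filter with these column and row weights is equivalent to. */
void outer_product(const float *column, const float *row, int n, float *out) {
    assert(n > 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            out[j * n + i] = column[j] * row[i];
        }
    }
}
```

Calling outer_product(gaussian_1d, gaussian_1d, FILTER_SIZE, w) reproduces the 5x5 Gaussian matrix to within rounding of the printed digits. The payoff of separability is cost: two 1D passes need 2N multiply-adds per pixel instead of N*N.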
GLSL Separable Filter Implementation
<< assume preliminaries from earlier general convolution example >>
layout(local_size_x=TILE_WIDTH,local_size_y=NEIGHBORHOOD_HEIGHT) in;
shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH]; // values read from input image
shared vec4 row[NEIGHBORHOOD_HEIGHT][TILE_WIDTH];           // weighted row sums
void main() // separable convolution
{
  const ivec2 tile_xy = ivec2(gl_WorkGroupID);
  const ivec2 thread_xy = ivec2(gl_LocalInvocationID);
  const ivec2 pixel_xy = tile_xy*tileSize + (thread_xy-ivec2(0,filterOffset.y));
  const uint x = thread_xy.x;
  const uint y = thread_xy.y;
  // Phase 1: Read the image's neighborhood into shared pixel arrays.
  for (int i=0; i<NEIGHBORHOOD_WIDTH; i += TILE_WIDTH) {
    if (x+i < NEIGHBORHOOD_WIDTH) {
      const ivec2 read_at = clampLocation(pixel_xy+ivec2(i-filterOffset.x,0));
      pixel[y][x+i] = imageLoad(input_image, read_at);
    }
  }
  retirePhase();
GLSL Separable Filter Implementation
  // Phase 2: Weighted sum the rows horizontally.
  row[y][x] = vec4(0);
  for (int i=0; i<filterWidth; i++) {
    row[y][x] += pixel[y][x+i] * rowWeight[i];
  }
  retirePhase();
  // Phase 3: Weighted sum the row sums vertically and write result to output image.
  // Does this thread correspond to a tile pixel?
  // Recall: There are more threads in the Y direction than tileHeight.
  if (y < tileHeight) {
    vec4 result = vec4(0);
    for (int i=0; i<filterHeight; i++) {
      result += row[y+i][x] * columnWeight[i];
    }
    // Phase 4: Store result to output image.
    const ivec2 pixel_xy = tile_xy*tileSize + thread_xy;
    imageStore(output_image, pixel_xy, result);
  }
}
Compute Shader Median Filter Simple idea    “For each pixel, replace it with the median-valued pixel in its NxN    neighborhood”    Non-linear, good for image enhancement through noise reduction    Expensive: naively, requires lots of sorting to find median       Very expensive when the neighborhood is large Reasonably efficient with Compute Shaders
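To make the "naive" cost concrete, here is a minimal C sketch of the sort-based approach for a single-channel 8-bit image. It is an illustration only: `median_at` is a hypothetical name, borders are assumed to be clamped, and this is not the compute shader technique the deck goes on to use.

```c
#include <assert.h>
#include <stdlib.h>

static int compare_bytes(const void *a, const void *b) {
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

static int clamp_coord(int v, int hi) {
    return v < 0 ? 0 : (v > hi ? hi : v);
}

/* Median of the n x n neighborhood centered on (x, y); n odd, at most 7.
   Gathers n*n values and sorts them, so every output pixel pays for a
   full sort: the expense the slide warns about. */
unsigned char median_at(const unsigned char *img, int width, int height,
                        int x, int y, int n) {
    unsigned char window[7 * 7];
    assert(n >= 1 && n <= 7 && n % 2 == 1);
    int r = n / 2;
    int count = 0;
    for (int j = -r; j <= r; j++) {
        for (int i = -r; i <= r; i++) {
            int sx = clamp_coord(x + i, width - 1);
            int sy = clamp_coord(y + j, height - 1);
            window[count++] = img[sy * width + sx];
        }
    }
    qsort(window, (size_t)count, sizeof window[0], compare_bytes);
    return window[count / 2];
}
```

A single bright outlier in a flat region is discarded entirely rather than smeared, which is why the median filter removes noise while keeping edges sharp.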
Median Filter Example   Original: noisy appearance in candy
Median Filter Example   Gaussian 5x5 blur: noise lost in blur, but text is blurry too
Median Filter Example   Median filter 5x5: noise gone, text still sharp
Large Median Filters for Impressionistic Effect        Original                 7x7 Estimated Median Filter
Other stuff in OpenGL 4.3
OpenGL Evolves Modularly
 Each core revision is specified as a set of extensions
    Example: ARB_compute_shader
       Puts together all the functionality for compute shaders
       Described in its own text file
    May have dependencies on other extensions
       Dependencies are stated explicitly
 A core OpenGL revision (such as OpenGL 4.3) “bundles” a set of agreed extensions, and mandates their mutual support
    Note: implementations can also “unbundle” ARB extensions for hardware unable to support the latest core revision
 So easiest to describe OpenGL 4.3 based on its bundled extensions…
 [Figure: OpenGL 4.3 as a bundle containing ARB_compute_shader, ARB_ES3_compatibility, many more …]
OpenGL 4.3 debugging support ARB_debug_output    OpenGL can present debug information back to developer ARB_debug_output2    Easier enabling of debug output ARB_debug_group    Hierarchical grouping of debug tagging ARB_debug_label    Label OpenGL objects for debugging
OpenGL 4.3 new texture functionality ARB_texture_view    Provide different ways to interpret texture data without duplicating the texture    Match DX11 functionality ARB_internalformat_query2    Find out actual supported limits for most texture parameters ARB_copy_image    Direct copy of pixels between textures and render buffers ARB_texture_buffer_range    Create texture buffer object corresponding to a sub-range of a buffer’s    data store ARB_stencil_texturing    Read stencil bits of a packed depth-stencil texture ARB_texture_storage_multisample    Immutable storage objects for multisampled textures
OpenGL 4.3 new buffer functionality ARB_shader_storage_buffer_object  Enables shader stages to read & write to very large buffers      NVIDIA hardware allows every shader stage to read & write  structs, arrays, scalars, etc. ARB_invalidate_subdata  Invalidate all or some of the contents of textures and buffers ARB_clear_buffer_object  Clear a buffer object with a constant value ARB_vertex_attrib_binding  Separate vertex attribute state from the data stores of each array ARB_robust_buffer_access_behavior  Shader read/write to an object only allowed to data owned by the application  Applies to out of bounds accesses
OpenGL 4.3 new pipeline functionality ARB_compute_shader   Introduces new shader stage   Enables advanced processing algorithms that harness the parallelism of GPUs ARB_multi_draw_indirect   Draw many GPU generated objects with one call ARB_program_interface_query   Generic API to enumerate active variables and interface blocks for each stage   Enumerate active variables in interfaces between separable program objects ARB_ES3_compatibility   Adds OpenGL ES 3.0 features not previously present in OpenGL   Brings EAC and ETC2 texture compression formats ARB_framebuffer_no_attachments   Render to an arbitrary sized framebuffer without actual populated pixels
GLSL 4.3 new functionality  ARB_arrays_of_arrays    Allows multi-dimensional arrays in GLSL. float f[4][3];  ARB_shader_image_size    Query size of an image in a shader  ARB_explicit_uniform_location    Set location of a default-block uniform in the shader  ARB_texture_query_levels    Query number of mipmap levels accessible through a sampler uniform  ARB_fragment_layer_viewport    gl_Layer and gl_ViewportIndex now available to fragment shader
New KHR and ARB extensions Not part of core but important and standardized at same time as OpenGL 4.3… KHR_texture_compression_astc_ldr    Adaptive Scalable Texture Compression (ASTC)    1-4 component, low bit rate < 1 bit/pixel – 8 bit/pixel ARB_robustness_isolation    If application causes GPU reset, no other application will be affected    For WebGL and other un-trusted 3D content sources
Getting at OpenGL 4.3 Easiest approach… Use OpenGL Extension Wrangler (GLEW)    Release 1.9.0 already has OpenGL 4.3 support    http://glew.sourceforge.net
Further NVIDIA OpenGL Work
Further NVIDIA OpenGL Work Linux enhancements Path Rendering for Resolution-independent 2D graphics Bindless Graphics Commitment to API Compatibility
OpenGL-related Linux Improvements  Support for X Resize, Rotate, and Reflect Extension    Also known as RandR    Version 1.2 and 1.3  OpenGL enables, by default, “Sync to Vertical Blank”    Locks your glXSwapBuffers to the monitor refresh rate    Matches Windows default now    Previously disabled by default
OpenGL-related Linux Improvements Expose additional full-scene antialiasing (FSAA) modes    16x multisample FSAA on all GeForce GPUs      2x2 supersampling of 4x multisampling    Ultra high-quality FSAA modes for Quadro GPUs      32x multisample FSAA       – 2x2 supersampling of 8x multisampling      64x multisample FSAA       – 4x4 supersampling of 4x multisampling Coverage sample FSAA on GeForce 8 series and better    4 color/depth samples + 12 depth samples
Multisampling FSAA Patterns
   Mode                Samples/pixel   Bits/pixel
   aliased             1               64
   2x multisampling    2               128
   4x multisampling    4               256
   8x multisampling    8               512
 Assume: 32-bit RGBA + 24-bit Z + 8-bit Stencil = 64 bits/sample
Supersampling FSAA Patterns
   Mode                                     Samples/pixel   Bits/pixel
   2x2 supersampling of 4x multisampling    16              1024
   2x2 supersampling of 8x multisampling    32              2048   (Quadro GPUs)
   4x4 supersampling of 4x multisampling    64              4096   (Quadro GPUs)
 Assume: 32-bit RGBA + 24-bit Z + 8-bit Stencil = 64 bits/sample
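The per-pixel storage figures follow directly from the stated assumption of 64 bits per sample. A small C sketch of the arithmetic, with hypothetical helper names:

```c
#include <assert.h>

/* 32-bit RGBA + 24-bit Z + 8-bit stencil, as assumed on the slides. */
enum { BITS_PER_SAMPLE = 32 + 24 + 8 };

/* Samples per pixel for ss_x by ss_y supersampling of ms-times multisampling.
   Pass ss_x = ss_y = 1 for plain multisampling. */
int fsaa_samples_per_pixel(int ss_x, int ss_y, int ms) {
    assert(ss_x > 0 && ss_y > 0 && ms > 0);
    return ss_x * ss_y * ms;
}

/* Framebuffer storage per pixel, in bits, for the same configuration. */
int fsaa_bits_per_pixel(int ss_x, int ss_y, int ms) {
    return fsaa_samples_per_pixel(ss_x, ss_y, ms) * BITS_PER_SAMPLE;
}
```

For example, 4x4 supersampling of 4x multisampling gives 64 samples and 4096 bits (512 bytes) per pixel, which illustrates why the highest modes are costly in memory.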
Image Quality Evolved: NVIDIA Fast Approximate Anti-Aliasing (FXAA)    Supported on Windows for several driver releases…    Now enabled for Linux in 304.xx drivers
NVIDIA X Server Settings for Linux Control Panel
GLX Protocol
 Network transparent OpenGL
    Run OpenGL app on one machine, display the X and 3D on a different machine
 [Figure: a 3D app with OpenGL and a GLX client on one machine, joined by a network connection to an X server with a GLX server and OpenGL on another machine]
OpenGL-related Linux Improvements    Official GLX Protocol support for OpenGL extensions ARB_half_float_pixel             EXT_point_parameters ARB_transpose_matrix                                  EXT_stencil_two_side EXT_blend_equation_separate EXT_depth_bounds_test            NV_copy_image EXT_framebuffer_blit             NV_depth_buffer_float EXT_framebuffer_multisample      NV_half_float EXT_packed_depth_stencil         NV_occlusion_query                                  NV_point_sprite                                  NV_register_combiners2                                  NV_texture_barrier
OpenGL-related Linux Improvements     Tentative GLX Protocol support for OpenGL extensions ARB_map_buffer_range       EXT_vertex_attrib_64bit ARB_shader_subroutine      NV_conditional_render ARB_stencil_two_side       NV_framebuffer_multisample_coverage EXT_transform_feedback2    NV_texture_barrier                           NV_transform_feedback2
Synchronizing X11-based OpenGL Streams New extension     GL_EXT_x11_sync_object Bridges the X Synchronization Extension with OpenGL 3.2 “sync” objects (ARB_sync) Introduces new OpenGL command     GLintptr sync_handle;     GLsync glImportSyncEXT (GLenum external_sync_type, GLintptr external_sync,     GLbitfield flags);         external_sync_type must be GL_SYNC_X11_FENCE_EXT         flags must be zero
Other Linux Updates GL_CLAMP behaves in conformant way now    Long-standing workaround for original Quake 3 Enabled 10-bit per component X desktop support    GeForce 8 and better GPUs Support for 3D Vision Pro stereo now
What is 3D Vision Pro? For Professionals All of 3D Vision support, plus  •   Radio frequency (RF) glasses,      Bidirectional  •   Query compass, accelerometer,      battery  •   Many RF channels – no collision  •   Up to ~120 feet  •   No line of sight needed to emitter  •   NVAPI to control
NV_path_rendering An NVIDIA OpenGL extension    GPU-accelerates resolution-independent 2D graphics       Very fast!    Supports PostScript-style rendering Come to my afternoon talk to learn more    “GPU-Accelerated 2D and Web Rendering”    This room     2:40 PM - 3:40 PM
[Figure: OpenGL pipeline with three paths from the Application to the Framebuffer and Display]
 Pixel pipeline: Pixel assembly (unpack), Pixel operations, Pixel pack (read back to the Application)
 Vertex pipeline: Vertex assembly, Vertex operations (with transform feedback), Primitive assembly, Primitive operations, Rasterization, Fragment operations, Raster operations
 Path pipeline: Path specification, Transform path, Fill/Stroke Covering, Fill/Stroke Stenciling
 Also shown: Texture memory
Teaser Scene: 2D and 3D mix!
NVIDIA’s Vision of Bindless Graphics Problem: Binding to different objects (textures, buffers) takes a lot of validation time in driver     And applications are limited to a     small palette of bound buffers     and textures     Approach of OpenGL, but also     Direct3D Solution: Exposes GPU virtual addresses     Let shaders and vertex puller     access buffer and texture memory     by its virtual address!             Kepler GPUs support                                           bindless texture
Prior to Bindless Graphics Traditional OpenGL    GPU memory reads are “indirected” through bindings       Limited number of texture units and vertex array attributes    glBindTexture—for texture images and buffers    glBindBuffer—for vertex arrays
Buffer-centric Evolution
 Data moves onto GPU, away from CPU
    Apps on CPUs just too slow at moving data otherwise
 [Figure: buffer objects attached to the pipeline stages Vertex Puller, Vertex Shading, Geometry Shading, Fragment Shading, Texturing, Pixel Pipeline and Framebuffer]
    Vertex data (glBegin, glDrawElements, etc.): Array Element Buffer Object (VeBO), Vertex Array Buffer Object (VaBO), Transform Feedback Buffer (XBO)
    Texel data: Texture Buffer Object (TexBO)
    Pixel data: Pixel Unpack Buffer (PuBO) for glDrawPixels, glTexImage2D, etc.; Pixel Pack Buffer (PpBO) for glReadPixels, etc.
    Parameter data: Parameter Buffer Object (PaBO), Uniform Buffer Object (UBO)
Kepler – Bindless Textures
 Enormous increase in the number of unique textures available to shaders at run-time
 More different materials and richer texture detail in a scene
 Pre-Kepler texture binding model: shader code selects among bound textures #0, #1, #2, … #127
 Kepler bindless textures: shader code can access over 1 million unique textures
Kepler – Bindless Textures
 Pre-Kepler texture binding model
    CPU
       Load texture A
       Load texture B
       Load texture C
       Bind texture A to slot I
       Bind texture B to slot J
       Draw()
          GPU
             Read from texture at slot I
             Read from texture at slot J
    CPU
       Bind texture C to slot K
       Draw()
          GPU
             Read from texture at slot K
 Kepler bindless textures
    CPU
       Load textures A, B, C
       Draw()
          GPU
             Read from texture A
             Read from texture B
             Read from texture C
 Bindless model reduces CPU overhead and improves GPU access efficiency
Bindless Textures        Apropos for ray-tracing and advanced rendering where        textures cannot be “bound” in advance   Shader code
Bindless performance benefit      Numbers obtained with a directed test
More Information on Bindless Texture Kepler has new NV_bindless_texture extension     Texture companion to         NV_vertex_buffer_unified_memory for bindless vertex arrays         NV_shader_buffer_load for bindless shader buffer reads         NV_shader_buffer_store (also NEW) for bindless shader buffer writes API specification publicly available     http://developer.download.nvidia.com/opengl/specs/GL_NV_bindless_texture.txt
API Usage to Initialize Bindless Texture Make a conventional OpenGL texture object     With a 32-bit GLuint name Query a 64-bit texture handle from 32-bit texture name     GLuint64 glGetTextureHandleNV(GLuint); Make handle resident in context’s GPU address space     void glMakeTextureHandleResidentNV(GLuint64);
Writing GLSL for Bindless Textures
 Request GLSL to understand bindless textures
    #version 400 // or later
    #extension GL_NV_bindless_texture : require
 Declare a sampler in the normal way
    uniform sampler2D bindless_texture;
 Alternatively, access bindless samplers in big array:
    uniform Samplers {
      sampler2D lotsOfSamplers[256];
    };
    Exciting: 256 samplers exceeds the available texture units!
Update Sampler Uniforms with Bindless Texture Handle
 Get a location for a sampler or image uniform
    GLint loc = glGetUniformLocation(program, "bindless_texture");
    GLint loc_array = glGetUniformLocation(program, "lotsOfSamplers");
 Then set sampler to the bindless texture handle
    glProgramUniformHandleui64vNV(program, loc, 1, &bindless_handle);
NVIDIA’s Position on OpenGL Deprecation: Core vs. Compatibility Profiles
 OpenGL 3.1 introduced notion of “core” profile
 Idea was remove stuff from core to make OpenGL “good-er”
    Well-intentioned perhaps but…
    Throws API backward compatibility out the window
 Lots of useful functionality got removed that is in fast hardware
    Examples: Polygon mode, line width, GL_QUADS
 Lots of easy-to-use, effective API got labeled deprecated
    Immediate mode
    Display lists
 Best advice for real developers
    Simply use the “compatibility” profile
    Easiest course of action
       Requesting the core profile requires special context creation gymnastics
    Avoids the frustration of “they decided to remove what??”
    Allows you to use existing OpenGL libraries and code as-is
 No, your program won’t go faster for using the “core” profile
    It may go slower because of extra “is this allowed to work?” checks
 Nothing changes with OpenGL 4.3
    NVIDIA still committed to compatibility without compromise
How to exploit OpenGL’smodern graphics pipeline               Albrecht Dürer’s less-than-modern rendering approaches
Modern OpenGL Pipeline Ideas Case Study: CAD assemblies    Geometry complexity less problematic    than scene complexity       Hardware can render billions of triangles,       but doesn‘t like spoon feeding    Many parts for individual pieces /    geometry features (bevels, chamfers,    joints...)       Must remain addressable as individual to       select, colorize, hide, transform...    Need to lower CPU overhead so that GPU    can plow through large chunks of work                                                    models courtesy of PTC
Concepts
  Minimize CPU/GPU interaction
    Allow GPU to update its own data
    Lower API usage when the scene changes little
  Avoid data redundancy
    Data stored once on GPU, referenced multiple times
    Update only once (fewer host-to-GPU transfers)
  Increase batching potential
    Further cuts API calls
    Less driver CPU work
Data Organization
  Large Scene Buffers
    Matrices, Bounding Boxes, data intended for culling and drawing
  Grouped Content Buffers
    Materials belonging to same shader, lights, view...
    When not using bindless, higher numbers might improve batching, but also more “work” to organize data.
  Draw Command Buffer
    Scratch buffer that is sometimes rebuilt
    Hosts data of all objects or references of “active objects”
    OpenGL 4.x allows most data to be stored and consumed on GPU (draw indirect)
OpenGL Technology
  GL 3.x
    Texture Buffers (TexBO, unsized 1D array of basic vector types)
    Uniform Buffer Objects (UBO, arbitrary types, size limitation 64 KB)
    Texture Arrays (pack multiple same-sized textures in one array)
  GL 4.x
    Shader Storage Buffer (SSBO, unsized arbitrary buffer access) (NEW 4.3)

    uniform samplerBuffer matrixBuffer;
    // need helper functions
    mat4 getMatrix (samplerBuffer buf, int i){
      return mat4( texelFetch (buf,(i*4)+0),
                   texelFetch (buf,(i*4)+1) ...
    }
    uniform viewBuffer {
      mat4    viewInvTM;
      mat4    viewProjTM;
      float   time;
      ...
    }
    // NEW 4.3 allows unsized arrays as last entry
    // and tighter array packing
    layout(std430) buffer matrixBuffer {
      int   willCostOnly16BytesNow[4];
      mat4  matrices[];
    }
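The "willCostOnly16BytesNow" comment refers to array packing: under std140 every array element is padded to at least the size of a vec4 (16 bytes), while std430 drops that rounding. The following is a minimal C sketch of that arithmetic, not GL code. The helper names are hypothetical, and it only covers arrays of 4-byte scalars or vectors.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not part of OpenGL). They model the byte stride of
   one array element for arrays of 4-byte scalars/vectors (int, uint, float). */

/* Base alignment of a scalar or vector: N, 2N, 4N, 4N for 1..4 components. */
static size_t base_alignment(size_t components)
{
    assert(components >= 1 && components <= 4);
    size_t n = 4; /* bytes per component */
    return (components == 3 ? 4 : components) * n;
}

/* std430: the element stride is the element's own base alignment. */
size_t array_stride_std430(size_t components)
{
    return base_alignment(components);
}

/* std140: same, but rounded up to the base alignment of a vec4 (16 bytes). */
size_t array_stride_std140(size_t components)
{
    size_t a = base_alignment(components);
    return a < 16 ? 16 : a;
}

/* Total bytes occupied by `count` elements with the given stride. */
size_t array_bytes(size_t stride, size_t count)
{
    return stride * count;
}
```

With these helpers, `array_bytes(array_stride_std140(1), 4)` models an `int[4]` in a std140 block and `array_bytes(array_stride_std430(1), 4)` models the same array in a std430 block.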
OpenGL Technology
  GL 4.x
    ARB_texture_storage helps driver to create immutable “complete” texture at once
    ARB_texture_view (NEW 4.3) allows multiple views (internal-format casting with same texel bit size) on same texture data
    ARB_texture_buffer_range (NEW 4.3)

    glGenTextures (2,tex);
    // create texture with complete mipchain
    glTexStorage2D (GL_..,levels, GL_RGBA8, w, h);
    // subimage data in later
    glTexSubImage2D (GL_.., 0,..., mipData[0]);
    // NEW 4.3
    // create another texture that references the
    // same data and interprets a single mip
    // slightly differently
    // (tex[1] is the new view, tex[0] the original texture)
    glTextureView (tex[1], GL_TEXTURE_2D, tex[0],
      GL_R32UI, minlevel, numlevels, minlayer,
      numlayers);
    // NEW 4.3 bind range of buffer
    glTexBufferRange (GL_..., GL_RGBA32F, buffer,
      offset, size);
NVIDIA Technology
  NVIDIA Bindless Graphics
    Exposes GPU resources directly (pointers or objects)
    Reduces CPU cache thrashing greatly
  GL 3.x
    NV_shader_buffer_load (SBL) for arbitrary unsized cross buffer access
    NV_vertex_buffer_unified_memory (VBUM) separates vertex data from format
  GL 4.x
    NV_bindless_texture allows sampler references anywhere

    // GLSL with true pointers
    uniform mat4* matrixBuffer;
    // API
    glUniformui64NV (shd->matrixLocation,
                     scene->matrixADDR);
    mat->diffuse = glGetTextureHandleNV (texobj);
    // later instead of glBindTexture
    glUniformHandleui64NV (shd->diffuseLocation,
                           mat->diffuse)
    // GLSL
    // can also store textures in resources,
    // virtually no restrictions on #
    uniform materialBuffer {
      sampler2D howManyTexturesIWant[LARGE];
    }
    // virtual sparse texturing
    uniform usampler2D virtualTex;
    ... sampler2D (packUint2x32 (
           texelFetch (virtualTex, coord).xy));
Data Transfer
  Textures
    ARB_pixel_buffer_object (GL 2.x)
    Now ARB_copy_image (NEW 4.3)
  Buffers
    ARB_map_buffer_range (GL 3.x) for fast mapping
    ARB_invalidate_subdata (NEW 4.3)
    ARB_clear_buffer_object (NEW 4.3) allows memset() operations
    ARB_sync (GL 3.x) for efficient threaded streaming
  See also: GTC 2012 “Optimizing Texture Transfers”

    // NEW 4.3 copy rectangles of textures
    glCopyImageSubData (
      srcName, srcTarget, srcLevel,srcX,srcY,srcZ,
      dstName, dstTarget, dstLevel,dstX,dstY,dstZ,
      srcWidth, srcHeight, srcDepth);
    // EXT_direct_state_access style usage shown
    // classic functions exist as well
    // range map and invalidate
    // (a write mapping; invalidation requires GL_MAP_WRITE_BIT)
    void* data;
    data = glMapNamedBufferRangeEXT (textBuffer,
      0, sizeof(MyChar) * textLength,
      GL_MAP_WRITE_BIT |
      GL_MAP_INVALIDATE_RANGE_BIT |
      GL_MAP_UNSYNCHRONIZED_BIT);
    // NEW 4.3 clearing a buffer
    // (integer internal format, so the integer pixel format is used)
    GLuint zero[1] = {0};
    glClearNamedBufferDataEXT (visibleBuffer,
      GL_R32UI, GL_RED_INTEGER, GL_UNSIGNED_INT, zero);
Drawing the Objects
  Classic: bind buffers/textures and draw
    NVIDIA Bindless Graphics allows very fast switching
    Group by bindings
  Enhanced:
    ARB_vertex_attrib_binding (NEW 4.3): allows fewer buffer changes
      Similar to VBUM, it separates format from data
      Map multiple vertex attributes to one buffer

    /* setup once, similar to glVertexAttribPointer but
       with relative offset last */
    glVertexAttribFormat (ATTR_NORMAL, 3,
      GL_FLOAT, GL_TRUE, offsetof(Vertex,normal));
    glVertexAttribFormat (ATTR_POS, 3,
      GL_FLOAT, GL_FALSE, offsetof(Vertex,pos));
    // bind to stream
    glVertexAttribBinding (ATTR_NORMAL, 0);
    glVertexAttribBinding (ATTR_POS, 0);
    // switch single stream buffer
    glBindVertexBuffer (0, bufID, 0, sizeof(Vertex));

    // NV_vertex_buffer_unified_memory
    // enable once and set stride
    glEnableClientState (GL_VERTEX...NV);...
    glBindVertexBuffer (0, 0, 0, sizeof(Vertex));
    // switch buffer
    glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR,
      bufSize);
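To illustrate how the format/data split resolves an address, here is a small C sketch that needs no GL context. The Vertex layout, the AttribFormat record and attrib_byte_offset are assumptions made for illustration; the slide does not show the Vertex definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical interleaved vertex; the slide does not define Vertex. */
typedef struct Vertex {
    float pos[3];
    float normal[3];
} Vertex;

enum { ATTR_POS = 0, ATTR_NORMAL = 1 }; /* assumed attribute locations */

/* One attribute's format, mirroring the arguments of glVertexAttribFormat. */
typedef struct AttribFormat {
    unsigned attribIndex;
    int      components;
    size_t   relativeOffset; /* bytes from the start of one Vertex */
} AttribFormat;

/* Byte position of an attribute of vertex `vertexIndex`, given the offset
   and stride that would be passed to glBindVertexBuffer. The format is set
   once; only bufferOffset/stride (and the buffer itself) change per bind. */
size_t attrib_byte_offset(const AttribFormat *fmt, size_t bufferOffset,
                          size_t stride, size_t vertexIndex)
{
    assert(fmt != NULL);
    return bufferOffset + vertexIndex * stride + fmt->relativeOffset;
}
```

Because the relative offset lives in the format, switching to another buffer with the same vertex layout only changes the binding, not the attribute setup.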
Drawing the Objects
  Enhanced:
    Grow buffers and dynamically index
      TexBO: can be large, but ugly to fetch
      UBO: fast, but size limited
      SSBO: large
      Pass assignment index as glUniform, glVertexAttribI (faster)
    Go bindless beyond VBUM
      Bypassing binding completely

    struct MyMaterial {
      vec4 diffuse;
      int shadeType;
      ...
    };
    uniform materialBuffer {
      MyMaterial materials[128];
    };
    buffer transformBuffer {
      mat4 transforms[];
    };
    ...
    gl_FragColor = materials[assign.x].diffuse;
    // bindless pointer datatypes
    struct Object {
      MyMaterial* material;
      mat4*       transform;
    };
Draw Call Reduction
  MultiDraw
    Render ranges from current VBO/IBO
    Single draw call for many distinct objects
    Reduces overhead for low complexity objects
  DrawIndirect
    Store drawcall information on GPU as well (primitiveCount...)
    Let GPU create/modify such buffers to generate frame’s drawcall buffers
  MultiDrawIndirect (NEW 4.3)

    DrawElementsIndirect
    {
      GLuint    count;
      GLuint    instanceCount;
      GLuint    firstIndex;
      GLint     baseVertex;
      GLuint    baseInstance;
    }
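A minimal C sketch of filling such a command array on the CPU follows. It uses fixed-width integer types in place of GLuint/GLint so that no GL headers are needed. ObjectRange and build_draw_commands are hypothetical names, and using baseInstance as the object index is one possible convention, not the only one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same field order as the structure on the slide. */
typedef struct DrawElementsIndirectCommand {
    uint32_t count;         /* number of indices to draw */
    uint32_t instanceCount; /* 0 skips the object, 1 draws it once */
    uint32_t firstIndex;    /* start position in the index buffer, in indices */
    int32_t  baseVertex;    /* added to every fetched index */
    uint32_t baseInstance;  /* used here as a per-object ID */
} DrawElementsIndirectCommand;

/* Hypothetical description of one object inside shared vertex/index buffers. */
typedef struct ObjectRange {
    uint32_t indexCount;
    uint32_t firstIndex;
    int32_t  firstVertex;
} ObjectRange;

/* Writes one command per object and returns the number of commands written. */
size_t build_draw_commands(const ObjectRange *objects, size_t objectCount,
                           DrawElementsIndirectCommand *out, size_t outCapacity)
{
    assert(objects != NULL && out != NULL);
    size_t n = objectCount < outCapacity ? objectCount : outCapacity;
    for (size_t i = 0; i < n; ++i) {
        out[i].count         = objects[i].indexCount;
        out[i].instanceCount = 1;
        out[i].firstIndex    = objects[i].firstIndex;
        out[i].baseVertex    = objects[i].firstVertex;
        out[i].baseInstance  = (uint32_t)i;
    }
    return n;
}
```

The resulting array would then be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER and consumed by a single glMultiDrawElementsIndirect call; that part is omitted here.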
Drawing the Objects
  Combine drawcalls with MultiDraw
    How to find object’s transform, material... assignment?
    GL 3.x sacrifice a vertex attribute
      Inside main vertex buffer encode object index
      Fetch assign indices from samplerBuffer
      Matrix/material ... assignments independent of geometry data
    GL 4.x MultiDrawIndirect exposes “baseInstance” to get assignments
  [Figure: per-vertex object indices stored in the main vertex buffer]

    in vec4 oPos;
    in int objID;
    flat out ivec4 assigns;
    uniform isamplerBuffer assignBuffer;
    uniform samplerBuffer matrixBuffer;
    ...
      assigns = texelFetch (assignBuffer,
                            objID);
      worldTM = getMatrix (matrixBuffer,
                           assigns.x);
      vec4 wPos = worldTM * oPos;
    ...
BaseInstance as Unique Object ID
  MultiDrawIndirect uses instanced drawing
    Can replicate a vertex attribute (material... assignment)
      Regular      : VArray[ gl_VertexID + baseVertex ]
      Divisor != 0 : VArray[ gl_InstanceID / VAttribDivisor + baseInstance ]
  [Figure: AssignBuffer (divisor) and VertexBuffer (regular) feed the Combined Attributes]
  MultiDrawBuffer
    count = 9    baseVertex = 0    baseInstance = 1
    count = 6    baseVertex = 9    baseInstance = 0
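The two fetch rules on this slide can be written as plain index arithmetic. The C sketch below is illustrative only; the function names are hypothetical, and it follows the slide's formulas as given.

```c
#include <assert.h>
#include <stdint.h>

/* Element fetched from a regular (per-vertex) attribute array,
   following the slide: VArray[vertexID + baseVertex]. */
uint32_t regular_attrib_index(uint32_t vertexID, int32_t baseVertex)
{
    int64_t index = (int64_t)vertexID + baseVertex;
    assert(index >= 0);
    return (uint32_t)index;
}

/* Element fetched from an instanced attribute array (divisor != 0):
   VArray[instanceID / divisor + baseInstance]. */
uint32_t instanced_attrib_index(uint32_t instanceID, uint32_t divisor,
                                uint32_t baseInstance)
{
    assert(divisor != 0);
    return instanceID / divisor + baseInstance;
}
```

When every command draws a single instance (instanceID is always 0), the instanced fetch reduces to baseInstance, which is why baseInstance can serve as a unique per-object index into the assignment buffer.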
Recap
  Addressed:
    Can render many low complexity objects stored in same VBO
      Buffers with indexable content to lower overhead
      MultiDraw/Indirect for keeping objects independent
      baseInstance to provide unique index/assignments
    NVIDIA Bindless Graphics
      Instead of passing “index” can pass pointer to buffer
      Buffer can store texture references
      Lowers CPU work even further for hot loop
  What remains:
    State and Shader switching (can’t do much about state, but...)
Shader Switching
  Could use indexed subroutines
    If shaders are similar in resource consumption (register usage) might want to combine them (GL 4.x)
    Initialize the subroutine array once, then dynamically index
    NVIDIA Bindless Graphics pointers allow casting buffer addresses

    // subroutine type returns vec4 to match metal()/wood()
    subroutine   vec4 shade_fn ();
    subroutine   (shade_fn) vec4 metal() ...
    subroutine   (shade_fn) vec4 wood () ...
    // content of array set by gl api call
    subroutine   uniform shade_fn shadeFuncs[2];
    flat in ivec4 assigns;
    void main(){
      gl_FragColor = shadeFuncs[assigns.y]();
    }
    // bindless pointer casting and texture sampling
    vec4 metal() {
      MetalParams* metal = packPtr (assigns.zw);
      ... texture (metal->roughnessMap, uv);
    }
    vec4 wood() {
      WoodParams* wood = packPtr (assigns.zw);
      ...
    }
Let the GPU do More Work
  So far CPU still responsible for most decision making and hot loop
  (Multi) DrawIndirect allows GPU to generate its own work
    Scene (objects, bboxes, materials...) described in buffers / indices / pointers
    Process scene data and build command lists for active objects
      E.g. do culling, LOD picking, selection highlighting...
OpenGL Computing
  ARB_compute_shader (NEW 4.3)
    Dispatch threads with shared memory support (as in CUDA/CL)
    Access to ALL resources, textures, no interop, all in GLSL
    NV_BINDLESS benefits from pointer access
  ARB_framebuffer_no_attachments (NEW 4.3)
    Use rasterizer to spawn threads and SSBO/imageStores to record your results
  GL 3.x Transform Feedback (XFB)
    Allows simple 1D Kernels

    // API - COMPUTE
    glDispatchCompute (gx, gy, gz);
    // GLSL
    shared float s_mem[SOMESIZE];
    ... s_mem[gl_LocalInvocationIndex] = ...

    // API - FBO
    glBindFramebuffer (...);
    glFramebufferParameteri (...,
       GL_FRAMEBUFFER_DEFAULT_WIDTH, 2048);
    glFramebufferParameteri (...,
       GL_FRAMEBUFFER_DEFAULT_HEIGHT, 2048);
    glDrawArrays(...)
    // GLSL
    ... imageStore(...ivec2(gl_FragCoord),.);

    // API - XFB
    glEnable (GL_RASTERIZER_DISCARD);
    glDrawArrays (GL_POINTS,0, count);
    // GLSL
    buffer indirectBuffer {
      DrawIndirect commands[];
    }
    ... commands[gl_VertexID].instanceCount =
        visible ? 1 : 0;
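The gx, gy, gz arguments of glDispatchCompute count work groups, not individual invocations, so the application usually rounds up from the number of items to process. A minimal C sketch of that rounding follows. The name group_count is hypothetical, and the local size is assumed to match whatever the compute shader declares (for example via layout(local_size_x = 64) in;).

```c
#include <assert.h>
#include <stdint.h>

/* Smallest number of work groups such that groups * localSize >= itemCount.
   The shader still has to skip invocations whose index is >= itemCount. */
uint32_t group_count(uint32_t itemCount, uint32_t localSize)
{
    assert(localSize > 0);
    return itemCount / localSize + (itemCount % localSize != 0 ? 1u : 0u);
}
```

For a 1D kernel over objectCount objects with an assumed local size of 64, the call would be along the lines of glDispatchCompute(group_count(objectCount, 64), 1, 1).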
Culling
  Processing
    Matrix and bounding box (bbox) buffer, object buffer (which matrix to use with which bbox)
    XFB or “invisible” rendering to create output buffer
    Key: Single draw call for ALL active objects! No state changes
  Results
    “Readback” GPU to Host
      Can use XFB to pack into a single bit stream for all active objects (e.g. 0,1,0,1,1,1,0,0,0)
    “Indirect” GPU to GPU
      Set DrawIndirect’s instanceCount to 0 or 1
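As a CPU-side reference for the "readback" variant, the C sketch below packs one visibility flag per object into 32-bit words. The function name and the bit order (object i goes to bit i % 32 of word i / 32, least significant bit first) are assumptions; the slide does not specify the layout of the bit stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs objectCount flags (0 or non-zero) into ceil(objectCount / 32) words.
   Assumed bit order: object i is stored in bit (i % 32) of word (i / 32). */
void pack_visibility(const uint8_t *visible, size_t objectCount, uint32_t *words)
{
    assert(visible != NULL && words != NULL);
    size_t wordCount = (objectCount + 31) / 32;
    for (size_t w = 0; w < wordCount; ++w)
        words[w] = 0;
    for (size_t i = 0; i < objectCount; ++i) {
        if (visible[i])
            words[i / 32] |= (uint32_t)1 << (i % 32);
    }
}
```

For the "indirect" variant no packing is needed: the same flag is written straight into the instanceCount field of each object's draw command.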
Culling Techniques
  Frustum (GL 3.x)
    XFB, VertexShader output 1 or 0
      VertexAttributes are bbox index, matrix index, data fetched via TBO
      (alternatively can feed bboxes directly, 2x vec3)
  HiZ Occlusion (GL 3.x)
    Depth-Pass (useful for fragment bound scenes anyway)
    Create mipmap pyramid, MAX depth
    XFB, VertexShader
      Compare object’s closest clipspace bbox against z value of depth mip
      Mip level chosen by clipspace 2D area
  [Figure: projected size determines depth mip level; mip texel covers object]
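The frustum test that the vertex shader would perform per object can be sketched on the CPU as follows. This is a hedged reference version in C, not the shader from the talk: the function names are hypothetical, the matrix is assumed to be column-major as in GLSL, and OpenGL's clip volume (-w <= x, y, z <= w) is assumed. The test is conservative, so a box near a frustum corner may be reported visible although it is not.

```c
#include <assert.h>
#include <stddef.h>

/* Column-major 4x4 matrix times (x, y, z, 1), like GLSL's mat4 * vec4. */
static void transform_point(const float m[16], const float p[3], float out[4])
{
    for (int r = 0; r < 4; ++r)
        out[r] = m[0 * 4 + r] * p[0] + m[1 * 4 + r] * p[1]
               + m[2 * 4 + r] * p[2] + m[3 * 4 + r];
}

/* Returns 0 if all 8 corners of the box lie outside the same clip plane
   (the box cannot be visible), otherwise 1. */
int bbox_in_frustum(const float mvp[16], const float bmin[3], const float bmax[3])
{
    assert(mvp != NULL && bmin != NULL && bmax != NULL);
    int outside[6] = {0, 0, 0, 0, 0, 0}; /* -x, +x, -y, +y, -z, +z */
    for (int c = 0; c < 8; ++c) {
        float p[3], clip[4];
        p[0] = (c & 1) ? bmax[0] : bmin[0];
        p[1] = (c & 2) ? bmax[1] : bmin[1];
        p[2] = (c & 4) ? bmax[2] : bmin[2];
        transform_point(mvp, p, clip);
        for (int a = 0; a < 3; ++a) {
            if (clip[a] < -clip[3]) ++outside[2 * a];
            if (clip[a] >  clip[3]) ++outside[2 * a + 1];
        }
    }
    for (int k = 0; k < 6; ++k)
        if (outside[k] == 8)
            return 0;
    return 1;
}
```

In the GPU version the returned 0 or 1 is what gets written through transform feedback or into instanceCount.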
Culling Techniques
  Raster Occlusion (GL 4.x)
    Depth-Pass
    Raster “invisible” bounding boxes
      Geometry Shader to create the 3 sides
      Depth buffer discards occluded fragments
      Fragment Shader does visible[objindex] = 1
  Temporal Coherence (vertex-bound)
    Render last visible
    Test all bboxes against current depth
    Render newly added visible: (~last) & (visible)
    Each object drawn only once
  [Figure: passing bbox fragments enable object]

    // GLSL fragment shader
    // from ARB_shader_image_load_store
    layout(early_fragment_tests) in;
    buffer indirectBuffer {
      DrawIndirect commands[];
    };
    flat in int objID;
    void main(){
      commands[objID].instanceCount = 1;
    }
    // some other shader would have
    // cleared to 0 before
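The temporal coherence step "(~last) & (visible)" is a per-bit operation on two visibility bit sets. A minimal C sketch, assuming the bits are stored in arrays of 32-bit words (the storage format and the function name are assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* newly[w] holds the objects that pass the current visibility test but were
   not among the "last visible" objects already rendered this frame. */
void newly_visible(const uint32_t *last, const uint32_t *visible,
                   uint32_t *newly, size_t wordCount)
{
    assert(last != NULL && visible != NULL && newly != NULL);
    for (size_t w = 0; w < wordCount; ++w)
        newly[w] = ~last[w] & visible[w];
}
```

Objects set in both `last` and `visible` are not drawn a second time, which is how each object ends up drawn only once per frame.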
Realtime Global Illumination
  Octree-Based Sparse Voxelization for Real-Time Global Illumination
    Technique by Cyril Crassin et al. (GTC 2012, also presenting here at SIGGRAPH)
    http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseV
    Uses only OpenGL! Generates a voxelization of the scene and traces it for global effects (indirect lighting, glossy reflections)
Realtime Global Illumination
  Octree management
    NV_shader_buffer_load
    Pointers allow efficient management of access to the octree memory cells
    Casting is also possible, to easily interpret the data per node type
    Not limited to buffer bindings, can access many buffers at once
  [Figure: octree nodes mapped into a linear memory pool]
Realtime Global Illumination
  Scene Voxelization
    ARB_shader_atomic_counters to generate work queues
    (Draw/Dispatch)Indirect to construct tree asynchronously, with glMemoryBarrier providing dependency information
  Attachment-less FBO used to rasterize triangles to voxels
    Material attributes (color, normal) contribute to a voxel cell (NV_shader_atomic_float)
    NV_shader_buffer_store, ARB_shader_image_load_store to write to voxels/octree cells
Questions?
Don’t Forget the 20th Anniversary Party
  Date: August 8th 2012 (today!)
  Location: JW Marriott Los Angeles at LA Live
  Venue: Gold Ballroom – Salon 1
Other OpenGL-related NVIDIA Sessions at SIGGRAPH
  GPU-Accelerated 2D and Web Rendering
    Wednesday in West Hall 503 (this room), 2:40 PM - 3:40 PM
    Mark Kilgard, Principal Software Engineer, NVIDIA
  GPU Ray Tracing and OptiX
    Wednesday in West Hall 503, 3:50 PM - 4:50 PM
    David McAllister, OptiX Manager, NVIDIA
    Phillip Miller, Director, Workstation Software Product Management, NVIDIA
  Voxel Cone Tracing & Sparse Voxel Octree for Real-time Global Illumination
    Wednesday in NVIDIA Booth, 3:50 PM - 4:50 PM
    Cyril Crassin, Postdoctoral Research Scientist, NVIDIA Research
  OpenSubdiv: High Performance GPU Subdivision Surface Drawing
    Wednesday in NVIDIA Booth, 3:00 PM - 3:30 PM
    Thursday in NVIDIA Booth, 10:00 AM - 10:30 AM (2nd time)
    Pixar Animation Studios GPU Team, Pixar
  nvFX : A New Scene & Material Effect Framework for OpenGL and DirectX
    Thursday in NVIDIA Booth, 2:00 PM - 2:30 PM
    Tristan Lorach, Developer Relations Senior Engineer, NVIDIA

SIGGRAPH 2012: NVIDIA OpenGL for 2012

  • 1.
    (unabridged slide deck)NVIDIA OpenGL in 2012: Version 4.3 is here! Mark Kilgard
  • 2.
    Mark Kilgard PrincipalSystem Software Engineer OpenGL driver and API evolution Cg (“C for graphics”) shading language GPU-accelerated path rendering OpenGL Utility Toolkit (GLUT) implementer Author of OpenGL for the X Window System Co-author of Cg Tutorial Worked on OpenGL for 20+ years
  • 3.
    Talk DetailsLocation: WestHall Meeting Room 503, Los Angeles Convention CenterDate: Wednesday, August 8, 2012Time: 11:50 AM - 12:50 PMMark Kilgard (Principal Software Engineer, NVIDIA)Abstract: Attend this session to get the most out of OpenGL on NVIDIA Quadro and GeForceGPUs. Learn about the new features in OpenGL 4.3, particularly Compute Shaders. Other topicsinclude bindless graphics; Linux improvements; and how to best use the modern OpenGLgraphics pipeline. Learn how your application can benefit from NVIDIA's leadership drivingOpenGL as a cross-platform, open industry standard.Topic Areas: Computer Graphics; Development Tools & Libraries; Visualization; Image andVideo ProcessingLevel: IntermediateWatch video replay: http://nvidia.fullviewmedia.com/siggraph2012/ondemand/SS104.html
  • 4.
    Outline State ofOpenGL & OpenGL’s importance to NVIDIA Compute Shaders explored Other stuff in OpenGL 4.3 Further NVIDIA OpenGL Work How to exploit OpenGL’s modern graphics pipeline
  • 5.
    State of OpenGL&OpenGL’s importance to NVIDIA
  • 6.
    OpenGL Standard is20 Years and Strong
  • 7.
    Think back toComputing in 1992Programming Languages ANSI C (C 89) was just 3 years old C++ still implemented as a front-end to C OpenGL in 1992 provided FORTRAN and Pascal bindingsOne year before NCSA Mosaic web browser first written Now WebGL standard in almost every browserWindows version Windows 3.1 ships! NT 3.1 still a year awayEntertainment Great video game graphics? Mortal Kombat? Top grossing movie (Aladdin) was animated Back when animated movies were still hand-drawn
  • 8.
    20 Years Ago:Enter OpenGL
  • 9.
  • 10.
    Then and Now 2012 OpenGL 4.3: Real-time Global IlluminationOpenGL 1.0: Per-vertex lighting 1992 [Crassin]
  • 11.
    Big News 4.3 OpenGL 4.3 announced Monday here at SIGGRAPH August 6, 2012 Moments later… NVIDIA beta OpenGL 4.3 driver on the web http://www.nvidia.com/content/devzone/opengl-driver-4.3.html OpenGL 4.3 brings substantial new features Compute Shaders! OpenGL Shading Language (GLSL) updates (multi-dimensional arrays, etc.) New texture functionality (stencil texturing, more queries)MarqueeFeature New buffer functionality (clear buffers, invalidate buffers, etc.) More Direct3D-isms (texture views, parity with DirectX compute shaders) OpenGL ES 3.0 compatibility
  • 12.
    NVIDIA’s OpenGL Leverage
    [Diagram labels: GeForce; Quadro; Tegra; Programmable Graphics (GLSL, Cg); Debugging with Parallel Nsight; OptiX]
  • 13.
    Single 3D API for Every Platform: Windows, OS X, Linux, Android, Solaris, FreeBSD
  • 14.
    OpenGL 3D Graphics API • cross-platform • most functional • peak performance • open standard • inter-operable • well specified & documented • 20 years of compatibility
  • 15.
    OpenGL Spawns Closely Related Standards
    Congratulations: WebGL officially approved, February 2012
    “The web is now 3D enabled”
  • 16.
    Accelerating OpenGL Innovation
    [Timeline, 2004 to 2012. DirectX: 9.0c, 10.0, 10.1, 11. OpenGL: 2.0, 2.1, 3.0, 3.1, 3.2, 3.3 + 4.0, 4.1, 4.2, 4.3 (now with compute shaders!)]
      • OpenGL has fast innovation + standardization
        - Pace is 7 new spec versions in four years
        - Actual implementations following specifications closely
      • OpenGL 4.3 is a superset of DirectX 11 functionality
        - While retaining backwards compatibility
  • 17.
    OpenGL Today – DirectX 11 Superset
      • First-class graphics + compute solution
        - OpenGL 4.3 = graphics + compute shaders
        - NVIDIA still has existing inter-op with CUDA / OpenCL (buffer and event interop)
      • Shaders can be saved to and loaded from binary blobs
        - Ability to query a binary shader, and save it for reuse later
      • Flow of content between desktop and mobile
        - Brings ES 2.0 and 3.0 API and capabilities to desktop
        - WebGL bridging desktop and mobile
      • Cross platform
        - Mac, Windows, Linux, Android, Solaris, FreeBSD
        - Result of being an open standard
  • 18.
    Increasing Functional Scope of OpenGL
      • 4.3: First-class Compute Shaders
      • 4.0: Tessellation Features
      • 3.X: Geometry Shaders
      • 2.X: Vertex and Fragment Shaders
      • 1.X: Fixed Function
    Arguably, OpenGL 4.3 is a 5.0 worthy feature-set!
  • 19.
    Classic OpenGL State Machine
    From 1991-2007*
    * vertex & fragment processing got programmable 2001 & 2003
    [source: GL 1.0 specification, 1992]
  • 20.
  • 21.
    OpenGL 3.0 Conceptual Processing Flow (2008)
    [Diagram: data flow among command parser, buffer store, vertex assembly, vertex processing, geometric primitive assembly & processing, transform feedback, texture mapping, fragment processing, raster operations, framebuffer, pixel unpacking, pixel processing, pixel packing, and image primitive processing. Legend: programmable operations, fixed-function operations.]
  • 22.
    OpenGL 4.0 Conceptual Processing Flow (2010)
    [Diagram: the OpenGL 3.0 flow extended with patch assembly & processing, control point processing, patch tessellation generation, and patch evaluation processing. Legend: programmable operations, fixed-function operations.]
  • 23.
    OpenGL 4.3 Conceptual Processing Flow (2012)
    [Diagram: the same stages as the OpenGL 4.0 flow (patch, vertex, geometry, fragment, raster, and pixel paths) with a Compute processing stage added. Legend: programmable operations, fixed-function operations.]
  • 24.
    OpenGL 4.3 Processing Pipelines
    [Diagram, arrows indicate data flow]
      • Graphics pipeline (from application): Vertex Puller, Vertex Shader, Tessellation Control Shader, Tessellation Primitive Generator, Tessellation Evaluation Shader, Geometry Shader, Transform Feedback, Rasterization, Fragment Shader, Raster Operations, Framebuffer
      • Compute pipeline (from application): Dispatch, Compute Shader
      • Pixel path (from application): Pixel Assembly, Pixel Operations, Texture Image (t); Framebuffer to Pixel Pack
      • Bindings shown: Element Array Buffer (b), Draw Indirect Buffer (b), Dispatch Indirect Buffer (b), Vertex Buffer Object (b), Image Load / Store (t/b), Atomic Counter (b), Shader Storage (b), Texture Fetch (t/b), Uniform Block (b), Transform Feedback Buffer (b), Pixel Unpack Buffer (b), Pixel Pack Buffer (b)
      • Legend: Fixed Function Stage, Programmable Stage, b = Buffer Binding, t = Texture Binding
  • 25.
  • 26.
    Why Compute Shaders?
      • Execute algorithmically general-purpose GLSL shaders
        - Read and write uniforms and images
        - Grid-oriented Single Program, Multiple Data (SPMD) execution model with communication via shared variables
      • Process graphics data in context of the graphics pipeline
        - Easier than interoperating with a compute API when processing ‘close to the pixel’
        - Avoids involved “inter-op” APIs to connect OpenGL objects to CUDA or OpenCL
      • Complementary to OpenCL
        - Gives full access to OpenGL objects (multisample buffers, etc.)
        - Same GLSL language used for graphic shaders
        - In contrast to CUDA C/C++, not a full heterogeneous (CPU/GPU) programming framework using full ANSI C
      • Standard part of all OpenGL 4.3 implementations
        - Matches DirectX 11 functionality
    [Pictured example applications: particle physics, fluid behavior, crowd simulation, ray tracing, global illumination]
  • 27.
    Compute Shader Particle System Demo
    Mike Bailey @ Oregon State University
    co-author of 2nd edition now available
  • 28.
    OpenGL 4.3 Compute Shaders
      • Single Program, Multiple Data (SPMD) Execution Model
        - Mental model: “scores of threads jump into same function at once”
      • Hierarchical thread structure
        - Threads in Groups in Dispatches
    [Diagram: invocation (thread) inside a work group inside a dispatch]
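    The hierarchy on this slide (invocations in work groups in a dispatch) can be emulated on the CPU to see how each thread derives its identity. The sketch below is plain C, not OpenGL code; the `uvec3` struct and both function names are illustrative. The relation it encodes is the one GLSL defines for `gl_GlobalInvocationID`: work group ID times work group size, plus local invocation ID.

```c
#include <assert.h>

/* Illustrative stand-in for GLSL's uvec3; not an OpenGL type. */
typedef struct { unsigned x, y, z; } uvec3;

/* gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID */
uvec3 global_invocation_id(uvec3 group, uvec3 size, uvec3 local)
{
    uvec3 g;
    assert(local.x < size.x && local.y < size.y && local.z < size.z);
    g.x = group.x * size.x + local.x;
    g.y = group.y * size.y + local.y;
    g.z = group.z * size.z + local.z;
    return g;
}

/* Total number of invocations (threads) launched by one dispatch. */
unsigned long count_invocations(uvec3 num_groups, uvec3 size)
{
    return (unsigned long)num_groups.x * num_groups.y * num_groups.z
         * size.x * size.y * size.z;
}
```

    For example, a 4x4x3 dispatch of 16x16x1 work groups launches 48 groups of 256 threads each.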
  • 29.
    Single Program, Multiple Data Example
    Standard C Code, running single-threaded:

        void SAXPY_CPU(int n, float alpha, float x[256], float y[256])
        {
            if (n > 256) n = 256;
            for (int i = 0; i < n; i++)  // loop over each element explicitly
                y[i] = alpha*x[i] + y[i];
        }

    OpenGL Compute Shader, running SPMD:

        #version 430
        layout(local_size_x=256) in;  // spawn groups of 256 threads!
        buffer xBuffer { float x[]; };
        buffer yBuffer { float y[]; };
        uniform float alpha;
        void main()
        {
            int i = int(gl_GlobalInvocationID.x);
            if (i < x.length())  // derive size from buffer bound
                y[i] = alpha*x[i] + y[i];
        }

    SAXPY = BLAS library's single-precision alpha times x plus y
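    To make the SPMD reading of the shader concrete, here is a serial C emulation of it: the shader body becomes a function run once per emulated thread, and a nested loop stands in for the dispatch. This is only a sketch; `saxpy_invocation`, `saxpy_dispatch`, and the explicit `n` parameter (standing in for `x.length()`) are illustrative names, and a GPU would run the invocations in parallel rather than in loop order.

```c
#include <assert.h>

#define LOCAL_SIZE_X 256  /* mirrors layout(local_size_x=256) */

/* Body of the compute shader's main(), run once per emulated thread. */
void saxpy_invocation(unsigned global_id, int n, float alpha,
                      const float *x, float *y)
{
    if ((int)global_id < n)  /* same bounds test as the shader */
        y[global_id] = alpha * x[global_id] + y[global_id];
}

/* Serial stand-in for dispatching num_groups work groups along x. */
void saxpy_dispatch(unsigned num_groups, int n, float alpha,
                    const float *x, float *y)
{
    for (unsigned group = 0; group < num_groups; group++)
        for (unsigned local = 0; local < LOCAL_SIZE_X; local++)
            saxpy_invocation(group * LOCAL_SIZE_X + local, n, alpha, x, y);
}
```

    With 300 elements, two groups (512 threads) are needed, and the bounds test keeps the last 212 threads idle.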
  • 30.
    Examples of Single-threaded Execution vs. SPMD Programming Systems
      • Single-threaded: C/C++, FORTRAN, Pascal
        - CPU-centric, hard to make multi-threaded & parallel
      • Single Program, Multiple Data: CUDA C/C++, DirectCompute, OpenCL, OpenGL Compute Shaders
        - GPU-centric, naturally expresses parallelism
  • 31.
    Per-Thread Local Variables
      • Each thread can read/write variables “private” to its execution
        - Each thread gets its own unique storage for each local variable
        - No access to locals of other threads
    Compute Shader source code:

        int i;
        float v;
        i++;
        v = 2*v + i;

    [Diagram: a work group in which thread #1 and thread #2 each hold their own copies of v and i]
  • 32.
    Special Per-thread Variables
      • Work group can have a 1D, 2D or 3D “shape”
        - Specified via Compute Shader input declarations
        - Compute Shader syntax examples
            1D, 256 threads: layout(local_size_x=256) in;
            2D, 8x10 thread shape: layout(local_size_x=8,local_size_y=10) in;
            3D, 4x4x4 thread shape: layout(local_size_x=4,local_size_y=4,local_size_z=4) in;
      • Every thread in work group has its own invocation #
        - Accessed through built-in variable: in uvec3 gl_LocalInvocationID;
        - Think of every thread having a “who am I?” variable
      • Using these variables, threads are expected to
        - Index arrays
        - Determine their flow control
        - Compute thread-dependent computations
    [Diagram: 6x3 work group with the thread at gl_LocalInvocationID=(4,1,0) highlighted]
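    A 2D or 3D invocation ID is often flattened into one array index. GLSL exposes this as the built-in `gl_LocalInvocationIndex`; the C sketch below reproduces that definition so the arithmetic can be checked by hand. The function name is illustrative.

```c
#include <assert.h>

/* Flatten a 3D local invocation ID into a single index, following the
   GLSL definition of gl_LocalInvocationIndex:
   z * size_x * size_y + y * size_x + x. */
unsigned local_invocation_index(unsigned x, unsigned y, unsigned z,
                                unsigned size_x, unsigned size_y,
                                unsigned size_z)
{
    assert(x < size_x && y < size_y && z < size_z);
    return z * size_x * size_y + y * size_x + x;
}
```

    In the 6x3 work group pictured on this slide, the thread with ID (4,1,0) gets index 1*6 + 4 = 10.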
  • 33.
    Per-Work Group Shared Variables
      • Any thread in a work group can read/write shared variables
        - No access to shared variables of a different work group
      • Typical idiom is to index by each thread’s invocation #
      • Use shared memory barriers to synchronize access to shared variables
    Compute Shader source code:

        shared float a[3][4];
        uint x = gl_LocalInvocationID.x;
        uint y = gl_LocalInvocationID.y;
        a[y][x] = 2*ndx;  // ndx: a per-thread value not defined on the slide
        a[y][x^1] += a[y][x];
        memoryBarrierShared();
        a[y][x^2] += a[y][x];

    [Diagram: each work group has its own 3x4 array, a[0][0] through a[2][3]]
  • 34.
    Reading and Writing Global Resources
    In addition to local and shared variables, Compute Shaders can also access global resources
      • Read-only
        - Textures
        - Uniform buffer objects
      • Read-write
        - Texture images
        - Uniform buffers
        - Shader storage buffers
        - Atomic counters
        - Bindless buffers
      • Take care updating shared read-write resources
    [Diagram: work groups accessing global OpenGL resources: an image (within a texture), atomic counters, and a buffer object holding color (red, green, blue) and position (x, y, z) for vertex 0, vertex 1, ...]
  • 35.
    Simple Compute Shader
    Let’s just copy from one 2D texture image to another…
    Pseudo-code:

        for each pixel in source image
            copy pixel to destination image

    Pixels could be copied fully in parallel
    How would we write this as a compute shader...
  • 36.
    Simple Compute Shader
    Let’s just copy from one 2D texture image to another…

        #version 430  // use OpenGL 4.3’s GLSL with Compute Shaders
        #define TILE_WIDTH 16
        #define TILE_HEIGHT 16
        const ivec2 tileSize = ivec2(TILE_WIDTH,TILE_HEIGHT);

        layout(binding=0,rgba8) uniform image2D input_image;
        layout(binding=1,rgba8) uniform image2D output_image;

        layout(local_size_x=TILE_WIDTH,local_size_y=TILE_HEIGHT) in;

        void main() {
            const ivec2 tile_xy = ivec2(gl_WorkGroupID);
            const ivec2 thread_xy = ivec2(gl_LocalInvocationID);
            const ivec2 pixel_xy = tile_xy*tileSize + thread_xy;

            vec4 pixel = imageLoad(input_image, pixel_xy);
            imageStore(output_image, pixel_xy, pixel);
        }
  • 37.
    Compiles into NV_compute_program5 Assembly

        !!NVcp5.0  # NV_compute_program5 assembly
        GROUP_SIZE 16 16;  # work group is 16x16 so 256 threads
        PARAM c[2] = { program.local[0..1] };  # internal constants
        TEMP R0, R1;  # temporaries
        IMAGE images[] = { image[0..7] };  # input & output images
        MAD.S R1.xy,invocation.groupid,{16,16,0,0}.x,invocation.localid;
        MOV.S R0.x, c[0];
        LOADIM.U32 R0.x, R1, images[R0.x], 2D;  # load from input image
        MOV.S R1.z, c[1].x;
        UP4UB.F R0, R0.x;  # unpack RGBA pixel into float4 vector
        STOREIM.F images[R1.z], R0, R1, 2D;  # store to output image
        END
  • 38.
    What is NV_compute_program5?
      • NVIDIA has always provided assembly-level interfaces to GPU programmability in OpenGL
        - NV_gpu_program5 is Shader Model 5.0 assembly
        - And NV_gpu_program4 was for Shader Model 4.0
        - NV_tessellation_program5 is programmable tessellation extension
        - NV_compute_program5 is further extension for Compute Shaders
      • Advantages of assembly extensions
        - Faster load-time for shaders
        - Easier target for dynamic shader generation
        - Allows other languages/tools, such as Cg, to target the underlying hardware
        - Provides concrete underlying execution model
        - You don’t have to guess if your GLSL compiles well or not
  • 39.
    Launching a Compute Shader
      • First write your compute shader
        - Request GLSL 4.30 in your source code: #version 430
        - More on this later…
      • Second compile your compute shader
        - Same compilation process as standard GLSL graphics shaders…
        - glCreateShader/glShaderSource/glCompileShader with Compute Shader token
            GLuint compute_shader = glCreateShader(GL_COMPUTE_SHADER);
        - glCreateProgram/glAttachShader/glLinkProgram (compute and graphics shaders cannot mix in the same program)
      • Bind to your program object
            glUseProgram(compute_program);  // the linked program object, not the shader object
      • Dispatch a grid of work groups
            glDispatchCompute(4, 4, 3);  // dispatches a 4x4x3 grid of work groups
  • 40.
    Launching the Copy Compute Shader
    Setup for copying from source to destination texture
      • Create an input (source) texture object (OpenGL 4.2 or ARB_texture_storage, plus EXT_direct_state_access)
            glTextureStorage2DEXT(input_texobj, GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
            glTextureSubImage2DEXT(input_texobj, GL_TEXTURE_2D, /*level*/0, /*x,y*/0,0, width, height, GL_RGBA, GL_UNSIGNED_BYTE, image);
      • Create an empty output (destination) texture object
            glTextureStorage2DEXT(output_texobj, GL_TEXTURE_2D, 1, GL_RGBA8, width, height);
      • Bind level zero of both textures to texture images 0 and 1 (OpenGL 4.2 or ARB_shader_image_load_store)
            GLboolean is_not_layered = GL_FALSE;
            glBindImageTexture(/*image*/0, input_texobj, /*level*/0, is_not_layered, /*layer*/0, GL_READ_ONLY, GL_RGBA8);
            glBindImageTexture(/*image*/1, output_texobj, /*level*/0, is_not_layered, /*layer*/0, GL_READ_WRITE, GL_RGBA8);
      • Use the copy compute shader
            glUseProgram(compute_program);  // the linked program object, not the shader object
      • Dispatch sufficient work group instances of the copy compute shader (OpenGL 4.3)
            glDispatchCompute((width + 15) / 16, (height + 15) / 16, 1);
  • 41.
    Copy Compute Shader Execution
    [Figure: input (source) image and output (destination) image]
  • 42.
    Copy Compute Shader Tiling
    [Diagram: the 76x76 input (source) image and the 76x76 output (destination) image are each covered by a 5x5 grid of work groups, labeled gl_WorkGroupID=[x,y] from [0,0] through [4,4]]
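    The 5x5 grid for a 76x76 image follows from rounding up: 76 / 16 = 4.75, so five 16-pixel tiles are needed per axis, and the last tile overhangs the image by 4 pixels. The C sketch below emulates the tiled copy serially to check that arithmetic. The function names are illustrative, and the explicit skip of overhanging threads is an addition of this sketch; the slide's shader contains no such test.

```c
#include <assert.h>

#define TILE 16  /* work group is TILE x TILE threads, as in the copy shader */

/* Number of work groups needed to cover `size` pixels: ceil(size / TILE). */
unsigned groups_for(unsigned size)
{
    return (size + TILE - 1) / TILE;
}

/* Serial emulation of the tiled copy; returns the number of pixels written.
   Threads whose pixel falls outside the image are skipped explicitly. */
unsigned tiled_copy(const unsigned *src, unsigned *dst,
                    unsigned width, unsigned height)
{
    unsigned written = 0;
    for (unsigned gy = 0; gy < groups_for(height); gy++)
        for (unsigned gx = 0; gx < groups_for(width); gx++)
            for (unsigned ly = 0; ly < TILE; ly++)
                for (unsigned lx = 0; lx < TILE; lx++) {
                    unsigned px = gx * TILE + lx;
                    unsigned py = gy * TILE + ly;
                    if (px >= width || py >= height)
                        continue;  /* edge tile overhang */
                    dst[py * width + px] = src[py * width + px];
                    written++;
                }
    return written;
}
```

    Every pixel is written exactly once because each pixel belongs to exactly one (work group, local thread) pair.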
  • 43.
    Next Example: General Convolution
      • Discrete convolution: common image processing operation
        - Building block for blurs, sharpening, edge detection, etc.
      • Example: 5x5 convolution (N=5) of source (input) image s
        - Generates destination (output) image d, given NxN matrix of weights w
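    As a CPU reference for what the compute shader is meant to produce, the sketch below computes one output pixel d(x,y) as the weighted sum of the NxN neighborhood of s centered on (x,y). It is a simplification under stated assumptions: a single float channel instead of RGBA, weights applied without flipping, and border reads clamped to the image edge. The names `clampi` and `convolve_at` are illustrative.

```c
#include <assert.h>

/* Clamp v to [lo, hi]; used to clamp reads at the image border. */
int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* One output pixel of an n x n convolution over a single-channel image.
   w holds n*n weights in row-major order; n is assumed odd. */
float convolve_at(const float *s, int width, int height,
                  const float *w, int n, int x, int y)
{
    float sum = 0.0f;
    int r = n / 2;  /* filter radius (2 for a 5x5 filter) */
    assert(n >= 1 && (n & 1));
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            int sx = clampi(x + i - r, 0, width - 1);
            int sy = clampi(y + j - r, 0, height - 1);
            sum += w[j * n + i] * s[sy * width + sx];
        }
    return sum;
}
```

    A kernel with a single 1 at its center reproduces the input pixel, which makes a convenient sanity check.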
  • 44.
  • 45.
    Output Image after 5x5 Gaussian Blur sigma=2.0
  • 46.
    Implementing a General Convolution
      • Basic algorithm
        - Tile-oriented: generate MxM pixel tiles
        - So operating on a (M+2R)x(M+2R) region of the image, where R = N/2 rounded down is the filter radius (R=2 for the 5x5 filter)
      • Phase 1: Read all the pixels for a region from input image
      • Phase 2: Perform weighted sum of pixels in [-R,R]x[-R,R] region around each output pixel
      • Phase 3: Output the result pixel to output image
  • 47.
    General Convolution: Preliminaries

        // Various kernel-wide constants
        const int tileWidth = 16, tileHeight = 16;
        const int filterWidth = 5, filterHeight = 5;
        const ivec2 tileSize = ivec2(tileWidth,tileHeight);
        const ivec2 filterOffset = ivec2(filterWidth/2,filterHeight/2);
        const ivec2 neighborhoodSize = tileSize + 2*filterOffset;

        // The following slides use these macros without showing their definitions;
        // the values here are assumed to mirror the constants above.
        #define TILE_WIDTH 16
        #define TILE_HEIGHT 16
        #define NEIGHBORHOOD_WIDTH 20   // tileWidth + 2*(filterWidth/2)
        #define NEIGHBORHOOD_HEIGHT 20  // tileHeight + 2*(filterHeight/2)

        // Declare the input and output images.
        layout(binding=0,rgba8) uniform image2D input_image;
        layout(binding=1,rgba8) uniform image2D output_image;

        uniform vec4 weight[filterHeight][filterWidth];
        uniform ivec4 imageBounds;  // Bounds of the input image for pixel coordinate clamping.

        void retirePhase() { memoryBarrierShared(); barrier(); }

        ivec2 clampLocation(ivec2 xy) {
            // Clamp the image pixel location to the image boundary.
            return clamp(xy, imageBounds.xy, imageBounds.zw);
        }
  • 48.
    General Convolution: Phase 1

        layout(local_size_x=TILE_WIDTH,local_size_y=TILE_HEIGHT) in;

        shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH];

        void main() {
            const ivec2 tile_xy = ivec2(gl_WorkGroupID);
            const ivec2 thread_xy = ivec2(gl_LocalInvocationID);
            const ivec2 pixel_xy = tile_xy*tileSize + thread_xy;
            const uint x = thread_xy.x;
            const uint y = thread_xy.y;

            // Phase 1: Read the image's neighborhood into shared pixel arrays.
            for (int j=0; j<neighborhoodSize.y; j += tileHeight) {
                for (int i=0; i<neighborhoodSize.x; i += tileWidth) {
                    if (x+i < neighborhoodSize.x && y+j < neighborhoodSize.y) {
                        const ivec2 read_at = clampLocation(pixel_xy+ivec2(i,j)-filterOffset);
                        pixel[y+j][x+i] = imageLoad(input_image, read_at);
                    }
                }
            }
            retirePhase();
  • 49.
    General Convolution: Phases 2 & 3

            // Phase 2: Compute general convolution.
            vec4 result = vec4(0);
            for (int j=0; j<filterHeight; j++) {
                for (int i=0; i<filterWidth; i++) {
                    result += pixel[y+j][x+i] * weight[j][i];
                }
            }

            // Phase 3: Store result to output image.
            imageStore(output_image, pixel_xy, result);
        }
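    Phase 1 relies on 256 threads cooperatively filling a 20x20 shared array, with some threads loading more than one pixel. A quick way to convince yourself that the strided loop covers every element exactly once is to replay it on the CPU. The C sketch below does that for the 16x16 tile and 5x5 filter used on these slides; the enum names and the function name are illustrative.

```c
#include <assert.h>

enum { TILE_W = 16, TILE_H = 16, FILTER_W = 5, FILTER_H = 5,
       NBR_W = TILE_W + 2 * (FILTER_W / 2),    /* 20 */
       NBR_H = TILE_H + 2 * (FILTER_H / 2) };  /* 20 */

/* Replays the Phase 1 loop for every thread of one work group and counts
   how many times each shared-array element is written. Returns 1 if every
   element is written exactly once, 0 otherwise. */
int phase1_covers_neighborhood(void)
{
    int writes[NBR_H][NBR_W] = {{0}};
    for (int y = 0; y < TILE_H; y++)        /* each thread (x, y) */
        for (int x = 0; x < TILE_W; x++)
            for (int j = 0; j < NBR_H; j += TILE_H)
                for (int i = 0; i < NBR_W; i += TILE_W)
                    if (x + i < NBR_W && y + j < NBR_H)
                        writes[y + j][x + i]++;
    for (int y = 0; y < NBR_H; y++)
        for (int x = 0; x < NBR_W; x++)
            if (writes[y][x] != 1)
                return 0;
    return 1;
}
```

    Threads with x or y below 4 perform the extra loads for the 4-pixel border; no element is loaded twice, so no work is wasted.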
  • 50.
    Separable Convolution
      • Many important convolutions expressible in “separable” form
        - More efficient to evaluate
        - Allows two step process: 1) blur rows, then 2) blur columns
      • Two sets of weights: column vector weights c and row vector weights r
      • Practical example for demonstrating Compute Shader shared variables…
  • 51.
    Example Separable Convolutions Original Original Original
  • 52.
    Example Separable Convolutions
    [Figure: Gaussian filter, sigma=2.25; Sobel filter, horizontal; Sobel filter, vertical]
  • 53.
    Example Separable Convolutions Weights
    5x5 Gaussian filter, sigma=2.25:
        0.026232 0.035279 0.038941 0.035279 0.026232
        0.035279 0.047446 0.052371 0.047446 0.035279
        0.038941 0.052371 0.057807 0.052371 0.038941
        0.035279 0.047446 0.052371 0.047446 0.035279
        0.026232 0.035279 0.038941 0.035279 0.026232
      = column [0.161964 0.217820 0.240432 0.217820 0.161964] times row [0.161964 0.217820 0.240432 0.217820 0.161964]
    Sobel filter, horizontal:
        -1  0  1
        -2  0  2
        -1  0  1
      = column [1 2 1] times row [-1 0 1]
    Sobel filter, vertical:
        -1 -2 -1
         0  0  0
         1  2  1
      = column [-1 0 1] times row [1 2 1]
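    Separability means the NxN weight matrix is the outer product of a column vector and a row vector, w[j][i] = c[j] * r[i]. The C sketch below rebuilds a matrix from its two vectors and compares the per-pixel multiply counts of the two forms; the function names are illustrative.

```c
#include <assert.h>

/* Build an n x n weight matrix as the outer product of a column vector c
   and a row vector r: w[j][i] = c[j] * r[i] (row-major output). */
void outer_product(const float *c, const float *r, int n, float *w)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++)
            w[j * n + i] = c[j] * r[i];
}

/* Multiplies per output pixel: n*n for the full matrix, 2*n when separated. */
int multiplies_full(int n)      { return n * n; }
int multiplies_separable(int n) { return 2 * n; }
```

    For the horizontal Sobel filter, c = [1 2 1] and r = [-1 0 1] reproduce the 3x3 matrix; for a 5x5 filter the separable form needs 10 multiplies per pixel instead of 25.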
  • 54.
    GLSL Separable Filter Implementation
    << assume preliminaries from earlier general convolution example >>

        layout(local_size_x=TILE_WIDTH,local_size_y=NEIGHBORHOOD_HEIGHT) in;

        shared vec4 pixel[NEIGHBORHOOD_HEIGHT][NEIGHBORHOOD_WIDTH];  // values read from input image
        shared vec4 row[NEIGHBORHOOD_HEIGHT][TILE_WIDTH];            // weighted row sums

        // rowWeight and columnWeight are used in Phases 2 and 3 but are not
        // declared on the slides; these declarations are an assumption.
        uniform vec4 rowWeight[filterWidth];
        uniform vec4 columnWeight[filterHeight];

        void main()  // separable convolution
        {
            const ivec2 tile_xy = ivec2(gl_WorkGroupID);
            const ivec2 thread_xy = ivec2(gl_LocalInvocationID);
            const ivec2 pixel_xy = tile_xy*tileSize + (thread_xy-ivec2(0,filterOffset.y));
            const uint x = thread_xy.x;
            const uint y = thread_xy.y;

            // Phase 1: Read the image's neighborhood into shared pixel arrays.
            for (int i=0; i<NEIGHBORHOOD_WIDTH; i += TILE_WIDTH) {
                if (x+i < NEIGHBORHOOD_WIDTH) {
                    const ivec2 read_at = clampLocation(pixel_xy+ivec2(i-filterOffset.x,0));
                    pixel[y][x+i] = imageLoad(input_image, read_at);
                }
            }
            retirePhase();
  • 55.
    GLSL Separable Filter Implementation

            // Phase 2: Weighted sum the rows horizontally.
            row[y][x] = vec4(0);
            for (int i=0; i<filterWidth; i++) {
                row[y][x] += pixel[y][x+i] * rowWeight[i];
            }
            retirePhase();

            // Phase 3: Weighted sum the row sums vertically and write result to output image.
            // Does this thread correspond to a tile pixel?
            // Recall: There are more threads in the Y direction than tileHeight.
            if (y < tileHeight) {
                vec4 result = vec4(0);
                for (int i=0; i<filterHeight; i++) {
                    result += row[y+i][x] * columnWeight[i];
                }

                // Phase 4: Store result to output image.
                const ivec2 pixel_xy = tile_xy*tileSize + thread_xy;
                imageStore(output_image, pixel_xy, result);
            }
        }
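    The same two passes (row sums first, then a column sum over the row sums) can be written as a whole-image CPU routine, which is handy as a reference when checking shader output. The sketch below assumes a single float channel, an odd filter size, and clamped border reads; `separable_convolve` and `clamp_index` are illustrative names.

```c
#include <assert.h>
#include <stdlib.h>

/* Clamp v to [0, hi]. */
int clamp_index(int v, int hi)
{
    return v < 0 ? 0 : (v > hi ? hi : v);
}

/* Two-pass separable convolution of a single-channel image.
   Pass 1 applies row weights r horizontally; pass 2 applies column
   weights c vertically to the row sums. Returns 0 on success. */
int separable_convolve(const float *src, float *dst, int width, int height,
                       const float *r, const float *c, int n)
{
    int rad = n / 2;
    float *rows = malloc(sizeof *rows * (size_t)width * (size_t)height);
    if (!rows)
        return -1;
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int i = 0; i < n; i++)
                sum += src[y * width + clamp_index(x + i - rad, width - 1)] * r[i];
            rows[y * width + x] = sum;
        }
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++) {
            float sum = 0.0f;
            for (int j = 0; j < n; j++)
                sum += rows[clamp_index(y + j - rad, height - 1) * width + x] * c[j];
            dst[y * width + x] = sum;
        }
    free(rows);
    return 0;
}
```

    On a horizontal ramp s(x,y) = x, the horizontal Sobel weights give a row sum of 2 at interior pixels, and the column weights [1 2 1] scale it to 8.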
  • 56.
    Compute Shader Median Filter
      • Simple idea
        - “For each pixel, replace it with the median-valued pixel in its NxN neighborhood”
        - Non-linear, good for image enhancement through noise reduction
      • Expensive: naively, requires lots of sorting to find median
        - Very expensive when the neighborhood is large
      • Reasonably efficient with Compute Shaders
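    The naive approach mentioned on this slide can be sketched in a few lines of C: gather the NxN neighborhood, sort it, and take the middle value. This is the expensive baseline, not the compute shader method from the talk. It assumes 8-bit single-channel pixels, an odd N no larger than 9, and clamped border reads; all names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_N 9  /* largest neighborhood width this sketch supports */

int cmp_uchar(const void *a, const void *b)
{
    return (int)*(const unsigned char *)a - (int)*(const unsigned char *)b;
}

int clamp_coord(int v, int hi)
{
    return v < 0 ? 0 : (v > hi ? hi : v);
}

/* Median of the n x n neighborhood centered on (x, y). */
unsigned char median_at(const unsigned char *img, int width, int height,
                        int n, int x, int y)
{
    unsigned char vals[MAX_N * MAX_N];
    int r = n / 2, count = 0;
    assert(n >= 1 && n <= MAX_N && (n & 1));
    for (int j = -r; j <= r; j++)
        for (int i = -r; i <= r; i++) {
            int sx = clamp_coord(x + i, width - 1);
            int sy = clamp_coord(y + j, height - 1);
            vals[count++] = img[sy * width + sx];
        }
    qsort(vals, (size_t)count, sizeof vals[0], cmp_uchar);
    return vals[count / 2];  /* middle of the sorted values */
}
```

    A single bright outlier in an otherwise uniform neighborhood is discarded entirely, which is why the median removes noise without blurring edges the way a weighted average does.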
  • 57.
    Median Filter Example Noisy appearance in candy Original
  • 58.
    Median Filter Example: Noise lost in blur, but text is blurry too (Gaussian 5x5 blur)
  • 59.
    Median Filter Example: Noise gone, text still sharp (Median filter 5x5)
  • 60.
    Large Median Filters for Impressionistic Effect
    [Figure: Original; 7x7 Estimated Median Filter]
  • 61.
    Other stuff in OpenGL 4.3
  • 62.
    OpenGL Evolves Modularly
      • Each core revision is specified as a set of extensions
        - Example: ARB_compute_shader
        - Puts together all the functionality for compute shaders
        - Described in its own text file
        - May have dependencies on other extensions
        - Dependencies are stated explicitly
      • A core OpenGL revision (such as OpenGL 4.3) “bundles” a set of agreed extensions, and mandates their mutual support
        - Note: implementations can also “unbundle” ARB extensions for hardware unable to support the latest core revision
      • So easiest to describe OpenGL 4.3 based on its bundled extensions…
    [Diagram: 4.3 bundle containing ARB_compute_shader, ARB_ES3_compatibility, many more …]
  • 63.
    OpenGL 4.3 debugging support
      • ARB_debug_output
        - OpenGL can present debug information back to developer
      • ARB_debug_output2
        - Easier enabling of debug output
      • ARB_debug_group
        - Hierarchical grouping of debug tagging
      • ARB_debug_label
        - Label OpenGL objects for debugging
  • 64.
    OpenGL 4.3 new texture functionality
      • ARB_texture_view
        - Provide different ways to interpret texture data without duplicating the texture
        - Match DX11 functionality
      • ARB_internalformat_query2
        - Find out actual supported limits for most texture parameters
      • ARB_copy_image
        - Direct copy of pixels between textures and render buffers
      • ARB_texture_buffer_range
        - Create texture buffer object corresponding to a sub-range of a buffer’s data store
      • ARB_stencil_texturing
        - Read stencil bits of a packed depth-stencil texture
      • ARB_texture_storage_multisample
        - Immutable storage objects for multisampled textures
  • 65.
    OpenGL 4.3 new buffer functionality
      • ARB_shader_storage_buffer_object
        - Enables shader stages to read & write to very large buffers
        - NVIDIA hardware allows every shader stage to read & write structs, arrays, scalars, etc.
      • ARB_invalidate_subdata
        - Invalidate all or some of the contents of textures and buffers
      • ARB_clear_buffer_object
        - Clear a buffer object with a constant value
      • ARB_vertex_attrib_binding
        - Separate vertex attribute state from the data stores of each array
      • ARB_robust_buffer_access_behavior
        - Shader read/write to an object only allowed to data owned by the application
        - Applies to out of bounds accesses
  • 66.
    OpenGL 4.3 new pipeline functionality
      • ARB_compute_shader
        - Introduces new shader stage
        - Enables advanced processing algorithms that harness the parallelism of GPUs
      • ARB_multi_draw_indirect
        - Draw many GPU generated objects with one call
      • ARB_program_interface_query
        - Generic API to enumerate active variables and interface blocks for each stage
        - Enumerate active variables in interfaces between separable program objects
      • ARB_ES3_compatibility
        - Features not previously present in OpenGL
        - Brings EAC and ETC2 texture compression formats
      • ARB_framebuffer_no_attachments
        - Render to an arbitrary sized framebuffer without actual populated pixels
  • 67.
    GLSL 4.3 new functionality
      • ARB_arrays_of_arrays
        - Allows multi-dimensional arrays in GLSL: float f[4][3];
      • ARB_shader_image_size
        - Query size of an image in a shader
      • ARB_explicit_uniform_location
        - Set location of a default-block uniform in the shader
      • ARB_texture_query_levels
        - Query number of mipmap levels accessible through a sampler uniform
      • ARB_fragment_layer_viewport
        - gl_Layer and gl_ViewportIndex now available to fragment shader
  • 68.
    New KHR and ARB extensions
    Not part of core but important and standardized at same time as OpenGL 4.3…
      • KHR_texture_compression_astc_ldr
        - Adaptive Scalable Texture Compression (ASTC)
        - 1-4 component, low bit rate < 1 bit/pixel – 8 bit/pixel
      • ARB_robustness_isolation
        - If application causes GPU reset, no other application will be affected
        - For WebGL and other un-trusted 3D content sources
  • 69.
    Getting at OpenGL 4.3
      • Easiest approach…
        - Use OpenGL Extension Wrangler (GLEW)
        - Release 1.9.0 already has OpenGL 4.3 support
        - http://glew.sourceforge.net
  • 70.
  • 71.
    Further NVIDIA OpenGL Work
      • Linux enhancements
      • Path Rendering for Resolution-independent 2D graphics
      • Bindless Graphics
      • Commitment to API Compatibility
  • 72.
    OpenGL-relatedLinux Improvements Support for X Resize, Rotate, and Reflect Extension  Also known as RandR  Version 1.2 and 1.3  OpenGL enables, by default, “Sync to Vertical Blank”  Locks your glXSwapBuffers to the monitor refresh rates  Matches Windows default now  Previously disabled by default
  • 73.
    OpenGL-related Linux Improvements
      • Expose additional full-scene antialiasing (FSAA) modes
        - 16x multisample FSAA on all GeForce GPUs: 2x2 supersampling of 4x multisampling
        - Ultra high-quality FSAA modes for Quadro GPUs
            32x multisample FSAA: 2x2 supersampling of 8x multisampling
            64x multisample FSAA: 4x4 supersampling of 4x multisampling
        - Coverage sample FSAA on GeForce 8 series and better: 4 color/depth samples + 12 depth samples
  • 74.
    Multisampling FSAA Patterns
      • aliased: 1 sample/pixel, 64 bits/pixel
      • 2x multisampling: 2 samples/pixel, 128 bits/pixel
      • 4x multisampling: 4 samples/pixel, 256 bits/pixel
      • 8x multisampling: 8 samples/pixel, 512 bits/pixel
    Assume: 32-bit RGBA + 24-bit Z + 8-bit Stencil = 64 bits/sample
  • 75.
    Supersampling FSAA Patterns
      • 2x2 supersampling of 4x multisampling: 16 samples/pixel, 1024 bits/pixel
      • 2x2 supersampling of 8x multisampling: 32 samples/pixel, 2048 bits/pixel (Quadro GPUs)
      • 4x4 supersampling of 4x multisampling: 64 samples/pixel, 4096 bits/pixel (Quadro GPUs)
    Assume: 32-bit RGBA + 24-bit Z + 8-bit Stencil = 64 bits/sample
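    The sample and storage figures on the FSAA slides follow from two multiplications: an SxS supersampling grid over M-sample multisampling stores S*S*M samples per pixel, and each sample costs 64 bits under the stated assumption. The C sketch below spells that out; the function names are illustrative.

```c
#include <assert.h>

enum { BITS_PER_SAMPLE = 32 + 24 + 8 };  /* RGBA + Z + stencil = 64 */

/* Samples per pixel for ss x ss supersampling of ms-sample multisampling. */
unsigned samples_per_pixel(unsigned ss, unsigned ms)
{
    return ss * ss * ms;
}

/* Framebuffer storage per pixel for a given sample count. */
unsigned bits_per_pixel(unsigned samples)
{
    return samples * BITS_PER_SAMPLE;
}
```

    For instance, 4x4 supersampling of 4x multisampling gives 64 samples per pixel and 4096 bits (512 bytes) per pixel, which matches the 64x mode listed among the Quadro FSAA modes.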
  • 76.
    Image Quality Evolved
      • NVIDIA Fast Approximate Anti-Aliasing (FXAA)
        - Supported on Windows for several driver releases…
        - Now enabled for Linux in 304.xx drivers
  • 77.
    NVIDIA X Server Settings for Linux Control Panel
  • 78.
    GLX Protocol
      • Network transparent OpenGL
        - Run OpenGL app on one machine, display the X and 3D on a different machine
    [Diagram: 3D app with OpenGL and GLX client, linked by a network connection to an X server with GLX server and OpenGL]
  • 79.
    OpenGL-related Linux Improvements
      • Official GLX Protocol support for OpenGL extensions: ARB_half_float_pixel, ARB_transpose_matrix, EXT_blend_equation_separate, EXT_depth_bounds_test, EXT_framebuffer_blit, EXT_framebuffer_multisample, EXT_packed_depth_stencil, EXT_point_parameters, EXT_stencil_two_side, NV_copy_image, NV_depth_buffer_float, NV_half_float, NV_occlusion_query, NV_point_sprite, NV_register_combiners2, NV_texture_barrier
  • 80.
    OpenGL-related Linux Improvements
      • Tentative GLX Protocol support for OpenGL extensions: ARB_map_buffer_range, ARB_shader_subroutine, ARB_stencil_two_side, EXT_transform_feedback2, EXT_vertex_attrib_64bit, NV_conditional_render, NV_framebuffer_multisample_coverage, NV_texture_barrier, NV_transform_feedback2
  • 81.
    Synchronizing X11-based OpenGL Streams
      • New extension: GL_EXT_x11_sync_object
        - Bridges the X Synchronization Extension with OpenGL 3.2 “sync” objects (ARB_sync)
      • Introduces new OpenGL command
            GLintptr sync_handle;
            GLsync glImportSyncEXT(GLenum external_sync_type, GLintptr external_sync, GLbitfield flags);
        - external_sync_type must be GL_SYNC_X11_FENCE_EXT
        - flags must be zero
  • 82.
    Other Linux Updates
      • GL_CLAMP behaves in conformant way now
        - Long-standing work around for original Quake 3
      • Enabled 10-bit per component X desktop support
        - GeForce 8 and better GPUs
      • Support for 3D Vision Pro stereo now
  • 83.
    What is 3D Vision Pro?
    For Professionals: all of 3D Vision support, plus • Radio frequency (RF) glasses, Bidirectional • Query compass, accelerometer, battery • Many RF channels – no collision • Up to ~120 feet • No line of sight needed to emitter • NVAPI to control
  • 84.
    NV_path_rendering
      • An NVIDIA OpenGL extension
        - GPU-accelerates resolution-independent 2D graphics
        - Very fast!
        - Supports PostScript-style rendering
      • Come to my afternoon talk to learn more
        - “GPU-Accelerated 2D and Web Rendering”
        - This room
        - 2:40 PM - 3:40 PM
  • 85.
    [Diagram: three pipelines fed by the application. Pixel pipeline: pixel assembly (unpack), pixel operations, pixel pack (read back to application). Vertex pipeline: vertex assembly, vertex operations, primitive assembly, primitive operations (with transform feedback), rasterization. Path pipeline: path specification, transform path, fill/stroke stenciling, fill/stroke covering. All feed fragment operations (with texture memory), raster operations, framebuffer, and display.]
  • 86.
    Teaser Scene: 2D and 3D mix!
  • 87.
    NVIDIA’s Vision of Bindless Graphics
      • Problem: Binding to different objects (textures, buffers) takes a lot of validation time in driver
        - And applications are limited to a small palette of bound buffers and textures
        - Approach of OpenGL, but also Direct3D
      • Solution: Exposes GPU virtual addresses
        - Let shaders and vertex puller access buffer and texture memory by its virtual address!
      • Kepler GPUs support bindless texture
  • 88.
    Prior to Bindless Graphics
      • Traditional OpenGL
        - GPU memory reads are “indirected” through bindings
        - Limited number of texture units and vertex array attributes
      • glBindTexture: for texture images and buffers
      • glBindBuffer: for vertex arrays
  • 89.
    Buffer-centric Evolution
      • Data moves onto GPU, away from CPU
        - Apps on CPUs just too slow at moving data otherwise
    [Diagram: buffer objects attached to the pipeline. Array Element Buffer Object (VeBO) and Vertex Array Buffer Object (VaBO) feed the Vertex Puller (glBegin, glDrawElements, etc.); Texture Buffer Object (TexBO) supplies texel data to Texturing; Transform Feedback Buffer (XBO) captures vertex data; Pixel Unpack Buffer (PuBO) feeds the Pixel Pipeline (glDrawPixels, glTexImage2D, etc.); Pixel Pack Buffer (PpBO) receives pixel data (glReadPixels, etc.); Parameter Buffer Object (PaBO) and Uniform Buffer Object (UBO) supply parameter data to Vertex, Geometry, and Fragment Shading; output goes to the Framebuffer.]
  • 90.
    Kepler – Bindless Textures
      • Enormous increase in the number of unique textures available to shaders at run-time
      • More different materials and richer texture detail in a scene
    [Diagram: in the pre-Kepler texture binding model, shader code reaches texture #0 through texture #127; with Kepler bindless textures, shader code reaches over 1 million unique textures]
  • 91.
    Kepler – Bindless Textures
      • Pre-Kepler texture binding model
        - CPU: Load texture A; Load texture B; Load texture C; Bind texture A to slot I; Bind texture B to slot J; Draw()
        - GPU: Read from texture at slot I; Read from texture at slot J
        - CPU: Bind texture C to slot K; Draw()
        - GPU: Read from texture at slot K
      • Kepler bindless textures
        - CPU: Load textures A, B, C; Draw()
        - GPU: Read from texture A; Read from texture B; Read from texture C
      • Bindless model reduces CPU overhead and improves GPU access efficiency
  • 92.
    Bindless Textures Apropos for ray-tracing and advanced rendering where textures cannot be “bound” in advance Shader code
  • 93.
    Bindless performance benefit Numbers obtained with a directed test
  • 94.
    More Information on Bindless Texture
      • Kepler has new NV_bindless_texture extension
        - Texture companion to
            NV_vertex_buffer_unified_memory for bindless vertex arrays
            NV_shader_buffer_load for bindless shader buffer reads
            NV_shader_buffer_store (also NEW) for bindless shader buffer writes
      • API specification publicly available
        - http://developer.download.nvidia.com/opengl/specs/GL_NV_bindless_texture.txt
  • 95.
    API Usage to Initialize Bindless Texture
    Make a conventional OpenGL texture object
      With a 32-bit GLuint name
    Query a 64-bit texture handle from 32-bit texture name
      GLuint64 glGetTextureHandleNV(GLuint);
    Make handle resident in context’s GPU address space
      void glMakeTextureHandleResidentNV(GLuint64);
  • 96.
    Writing GLSL for Bindless Textures
    Request GLSL to understand bindless textures
      #version 400 // or later
      #extension GL_NV_bindless_texture : require
    Declare a sampler in the normal way
      uniform sampler2D bindless_texture;
    Alternatively, access bindless samplers in big array:
      uniform Samplers { sampler2D lotsOfSamplers[256]; }
    Exciting: 256 samplers exceeds the available texture units!
  • 97.
    Update Sampler Uniforms with Bindless Texture Handle
    Get a location for a sampler or image uniform
      GLint loc = glGetUniformLocation(program, "bindless_texture");
      GLint loc_array = glGetUniformLocation(program, "lotsOfSamplers");
    Then set sampler to the bindless texture handle (the count-and-pointer form is the "v" variant)
      glProgramUniformHandleui64vNV(program, loc, 1, &bindless_handle);
  • 98.
    NVIDIA’s Position on OpenGL Deprecation: Core vs. Compatibility Profiles
    OpenGL 3.1 introduced notion of “core” profile
      Idea was remove stuff from core to make OpenGL “good-er”
      Well-intentioned perhaps but…
        Throws API backward compatibility out the window
    Lots of useful functionality got removed that is in fast hardware
      Examples: Polygon mode, line width, GL_QUADS
    Lots of easy-to-use, effective API got labeled deprecated
      Immediate mode
      Display lists
    Best advice for real developers
      Simply use the “compatibility” profile
      Easiest course of action
        Requesting the core profile requires special context creation gymnastics
      Avoids the frustration of “they decided to remove what??”
      Allows you to use existing OpenGL libraries and code as-is
    No, your program won’t go faster for using the “core” profile
      It may go slower because of extra “is this allowed to work?” checks
    Nothing changes with OpenGL 4.3
      NVIDIA still committed to compatibility without compromise
  • 99.
    How to exploit OpenGL’s modern graphics pipeline
    [Image caption: Albrecht Dürer’s less-than-modern rendering approaches]
  • 100.
    Modern OpenGL Pipeline Ideas
    Case Study: CAD assemblies
    Geometry complexity less problematic than scene complexity
      Hardware can render billions of triangles, but doesn't like spoon feeding
      Many parts for individual pieces / geometry features (bevels, chamfers, joints...)
      Must remain addressable as individual to select, colorize, hide, transform...
    Need to lower CPU overhead so that GPU can plow through large chunks of work
    [Image credit: models courtesy of PTC]
  • 101.
    Concepts
    Minimize CPU/GPU interaction
      Allow GPU to update its own data
      Lower API usage when the scene changes little
    Avoid data redundancy
      Data stored once on GPU, referenced multiple times
      Update only once (fewer host-to-GPU transfers)
    Increase batching potential
      Further cuts API calls
      Less driver CPU work
  • 102.
    Data Organization
    Large Scene Buffers
      Matrices, Bounding Boxes, data intended for culling and drawing
    Grouped Content Buffers
      Materials belonging to same shader, lights, view...
      When not using bindless, higher numbers might improve batching, but also more “work” to organize data.
    Draw Command Buffer
      Scratch buffer that is sometimes rebuilt
      Hosts data of all objects or references of “active objects”
    OpenGL 4.x allows most data to be stored and consumed on GPU (draw indirect)
  • 103.
    OpenGL Technology
    GL 3.x
      Texture Buffers (TexBO, unsized 1D array of basic vector types)
      Uniform Buffer Objects (UBO, arbitrary types, size limitation 64kb)
      Texture Arrays (pack multiple same-sized textures in one array)
    GL 4.x
      Shader Storage Buffer (SSBO, unsized arbitrary buffer access) (NEW 4.3)
    Code shown on slide:
      uniform samplerBuffer matrixBuffer;
      // need helper functions
      mat4 getMatrix (samplerBuffer buf, int i){
        return mat4( texelFetch (buf,(i*4)+0),
                     texelFetch (buf,(i*4)+1) ...
      }
      uniform viewBuffer {
        mat4 viewInvTM;
        mat4 viewProjTM;
        float time;
        ...
      }
      // NEW 4.3 allows unsized arrays as last entry
      // and tighter array packing
      layout(std430) buffer matrixBuffer {
        int willCostOnly16BytesNow[4];
        mat4 matrices[];
      }
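The "willCostOnly16BytesNow" comment refers to array packing: under std140 every element of a scalar array is padded out to the size of a vec4, while std430 keeps the scalar's natural stride. The arithmetic can be sketched in C as below. This covers arrays of 4-byte scalars only, and the function names are hypothetical helpers, not part of any OpenGL API.

```c
#include <stddef.h>

/* Byte stride of one element in an array of 4-byte scalars (int, uint, float).
   std140 rounds the element stride up to the size of a vec4 (16 bytes);
   std430 keeps the scalar's natural 4-byte stride. */
static size_t scalar_array_stride(int use_std430)
{
    const size_t scalar_size = 4;
    const size_t vec4_size = 16;
    return use_std430 ? scalar_size : vec4_size;
}

/* Total bytes occupied by an array of `count` 4-byte scalars. */
static size_t scalar_array_bytes(size_t count, int use_std430)
{
    return count * scalar_array_stride(use_std430);
}
```

For the four-element int array on the slide this gives 64 bytes under std140 and 16 bytes under std430, which matches the slide's comment.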
  • 104.
    OpenGL Technology
    GL 4.x
      ARB_texture_storage helps driver to create immutable “complete” texture at once
      ARB_texture_view (NEW 4.3) allows multiple views (internal-format casting with same texel bit size) on same texture data
      ARB_texture_buffer_range (NEW 4.3)
    Code shown on slide:
      glGenTextures (2,tex);
      // create texture with complete mipchain
      glTexStorage (GL_..,levels, GL_RGBA8, w, h);
      // subimage data in later
      glTexSubImage (GL_.., 0,..., mipData[0])
      // NEW 4.3
      // create another texture that references the
      // same data and interprets a single mip
      // slightly differently
      // (tex[1] is the new view, tex[0] the original; only two names were generated)
      glTextureView (tex[1], GL_TEXTURE_2D, tex[0], GL_R32UI, minlevel, numlevels, minlayer, numlayers);
      // NEW 4.3 bind range of buffer
      glTexBufferRange (GL_..., GL_RGBA32F, buffer, offset, size);
  • 105.
    NVIDIA Technology
    NVIDIA Bindless Graphics
      Exposes GPU resources directly (pointers or objects)
      Reduces CPU cache thrashing greatly
    GL 3.x
      NV_shader_buffer_load (SBL) for arbitrary unsized cross buffer access
      NV_vertex_buffer_unified_memory (VBUM) separates vertex data from format
    GL 4.x
      NV_bindless_texture allows sampler references anywhere
    Code shown on slide:
      // GLSL with true pointers
      uniform mat4* matrixBuffer;
      // API
      glUniformui64NV (shd->matrixLocation, scene->matrixADDR);
      mat->diffuse = glGetTextureHandleNV (texobj);
      // later instead of glBindTexture
      glUniformHandleui64NV (shd->diffuseLocation, mat->diffuse)
      // GLSL
      // can also store textures in resources,
      // virtually no restrictions on #
      uniform materialBuffer {
        sampler2D howManyTexturesIWant[LARGE];
      }
      // virtual sparse texturing
      uniform usampler2D virtualTex;
      ... sampler2D (packUint2x32 ( texelFetch (virtualTex, coord).xy));
  • 106.
    Data Transfer
    Textures
      ARB_pixel_buffer_object (GL 2.x)
      Now ARB_copy_image (NEW 4.3)
    Buffers
      ARB_map_buffer_range (GL 3.x) for fast mapping
      ARB_invalidate_subdata (NEW 4.3)
      ARB_clear_buffer_object (NEW 4.3) allows memset() operations
      ARB_sync (GL 3.x) for efficient threaded streaming
    See also: GTC 2012 “Optimizing Texture Transfers”
    Code shown on slide:
      // NEW 4.3 copy rectangles of textures
      glCopyImageSubData (
        srcName, srcTarget, srcLevel,srcX,srcY,srcZ,
        dstName, dstTarget, dstLevel,dstX,dstY,dstZ,
        srcWidth, srcHeight, srcDepth);
      // EXT_direct_state_access style usage shown
      // classic functions exist as well
      // range map and invalidate (the invalidate flags require write access)
      void* data;
      data = glMapNamedBufferRangeEXT (textBuffer, 0, sizeof(MyChar) * textLength,
        GL_MAP_WRITE_BIT | GL_MAP_INVALIDATE_RANGE_BIT | GL_MAP_UNSYNCHRONIZED_BIT);
      // NEW 4.3 clearing a buffer (integer internal formats take an _INTEGER format)
      GLuint zero[1] = {0};
      glClearNamedBufferDataEXT (visibleBuffer, GL_R32UI, GL_RED_INTEGER, GL_UNSIGNED_INT, zero);
  • 107.
    Drawing the Objects
    Classic: bind buffers/textures and draw
      NVIDIA Bindless Graphics allows very fast switching
      Group by bindings
    Enhanced:
      ARB_vertex_attrib_binding (NEW 4.3): allows fewer buffer changes
        Similar to VBUM it separates format from data
        Map multiple vertex attributes to one buffer
    Code shown on slide:
      /* setup once, similar to glVertexAttribPointer but with relative offset last */
      glVertexAttribFormat (ATTR_NORMAL, 3, GL_FLOAT, GL_TRUE, offsetof(Vertex,normal));
      glVertexAttribFormat (ATTR_POS, 3, GL_FLOAT, GL_FALSE, offsetof(Vertex,pos));
      // bind to stream
      glVertexAttribBinding (ATTR_NORMAL, 0);
      glVertexAttribBinding (ATTR_POS, 0);
      // switch single stream buffer
      glBindVertexBuffer (0, bufID, 0, sizeof(Vertex));
      // NV_vertex_buffer_unified_memory
      // enable once and set stride
      glEnableClientState (GL_VERTEX...NV);...
      glBindVertexBuffer (0, 0, 0, sizeof(Vertex));
      // switch buffer
      glBufferAddressRangeNV (GL_VERTEX...,0,bufADDR, bufSize);
  • 108.
    Drawing the Objects
    Enhanced:
      Grow buffers and dynamically index
        TexBO: can be large, but ugly to fetch
        UBO: fast, but size limited
        SSBO: large
      Pass assignment index as glUniform, glVertexAttribI (faster)
    Go bindless beyond VBUM
      Bypassing binding completely
    Code shown on slide:
      struct MyMaterial {
        vec4 diffuse;
        int shadeType;
        ...
      };
      uniform materialBuffer {
        MyMaterial materials[128];
      };
      buffer transformBuffer {
        mat4 transforms[];
      };
      ...
      gl_FragColor = materials[assign.x].diffuse;
      // bindless pointer datatypes
      struct Object {
        MyMaterial* material;
        mat4* transform;
      };
  • 109.
    Draw Call Reduction
    MultiDraw
      Render ranges from current VBO/IBO
      Single draw call for many distinct objects
      Reduces overhead for low complexity objects
    DrawIndirect
      Store drawcall information on GPU as well (primitiveCount...)
      Let GPU create/modify such buffers to generate frame's drawcall buffers
    MultiDrawIndirect (NEW 4.3)
    Structure shown on slide:
      DrawElementsIndirect {
        GLuint count;
        GLuint instanceCount;
        GLuint firstIndex;
        GLint baseVertex;
        GLuint baseInstance;
      }
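As a rough illustration of what a draw-command buffer holds, the C sketch below mirrors the five fields listed on the slide and fills one command per object. Fixed-width integer types stand in for GLuint/GLint so that it compiles without GL headers. `ObjectRange` and `build_commands` are hypothetical names introduced here, and using the object's array position as `baseInstance` is an assumption made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Same five-field layout as the DrawElementsIndirect structure on the slide. */
typedef struct {
    uint32_t count;          /* number of indices to draw */
    uint32_t instanceCount;  /* 0 skips the object, 1 draws it once */
    uint32_t firstIndex;     /* start offset in the index buffer, in indices */
    int32_t  baseVertex;     /* added to every fetched index */
    uint32_t baseInstance;   /* can double as a per-object ID */
} DrawElementsIndirectCommand;

/* Hypothetical description of where one object lives in the shared buffers. */
typedef struct {
    uint32_t indexCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
} ObjectRange;

/* Write one command per object into `out`; returns the number written. */
static size_t build_commands(const ObjectRange *objects, size_t n,
                             DrawElementsIndirectCommand *out)
{
    for (size_t i = 0; i < n; ++i) {
        out[i].count         = objects[i].indexCount;
        out[i].instanceCount = 1;
        out[i].firstIndex    = objects[i].firstIndex;
        out[i].baseVertex    = objects[i].baseVertex;
        out[i].baseInstance  = (uint32_t)i;  /* assumption: object ID = array position */
    }
    return n;
}
```

An array built this way would be uploaded to a buffer bound to GL_DRAW_INDIRECT_BUFFER and consumed by a single glMultiDrawElementsIndirect call. Because the struct is five tightly packed 32-bit words, a stride of 20 bytes (or 0, meaning tightly packed) applies.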
  • 110.
    Drawing the Objects
    Combine drawcalls with MultiDraw
      [Diagram: per-vertex object IDs along one shared vertex buffer: 1111 00000000 2 3 44444 1]
    How to find object's transform, material... assignment?
    GL 3.x sacrifice a vertex attribute
      Inside main vertex buffer encode object index
      Fetch assign indices from samplerBuffer
      Matrix/material assignments independent of geometry data
    GL 4.x MultiDrawIndirect exposes “baseInstance” to get assignments
    Code shown on slide:
      in vec4 oPos;
      in int objID;
      flat out ivec4 assigns;
      uniform isamplerBuffer assignBuffer;
      uniform samplerBuffer matrixBuffer;
      ...
      assigns = texelFetch (assignBuffer, objID);
      ...
      worldTM = getMatrix (matrixBuffer, assigns.x);
      vec4 wPos = worldTM * oPos;
      ...
  • 111.
    BaseInstance as Unique Object ID
    MultiDrawIndirect uses instanced drawing
    Can replicate a vertex attribute (material... assignment)
      Regular : VArray[ gl_VertexID + baseVertex ]
      Divisor != 0 : VArray[ gl_InstanceID / VAttribDivisor + baseInstance ]
    [Diagram: AssignBuffer (divisor) and VertexBuffer (regular) feed the Combined Attributes. MultiDrawBuffer holds two commands: { count = 9, baseVertex = 0, baseInstance = 1 } and { count = 6, baseVertex = 9, baseInstance = 0 }.]
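The two fetch rules on this slide can be written out as plain index arithmetic. The C sketch below is only a CPU-side model of the formulas as stated on the slide, with hypothetical function names. It assumes each command draws a single instance, so the instance ID is 0 and the instanced attribute index collapses to `baseInstance`.

```c
#include <stdint.h>

/* Regular (per-vertex) attribute: the fetched index plus the command's baseVertex. */
static int64_t regular_attrib_index(uint32_t fetched_index, int32_t base_vertex)
{
    return (int64_t)fetched_index + base_vertex;
}

/* Instanced attribute (divisor != 0): instance ID divided by the divisor,
   offset by the command's baseInstance. */
static uint32_t instanced_attrib_index(uint32_t instance_id, uint32_t divisor,
                                       uint32_t base_instance)
{
    return instance_id / divisor + base_instance;
}
```

With the two commands from the diagram, the first one (baseInstance = 1) reads element 1 of the assignment buffer and the second one (baseInstance = 0) reads element 0, while their vertices come from offsets 0 and 9 of the shared vertex buffer.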
  • 112.
    Recap
    Addressed:
      Can render many low complexity objects stored in same VBO
      Buffers with indexable content to lower overhead
      MultiDraw/Indirect for keeping objects independent
      baseInstance to provide unique index/assignments
    NVIDIA Bindless Graphics
      Instead of passing "index" can pass pointer to buffer
      Buffer can store texture references
      Lowers CPU work even further for hot loop
    What remains:
      State and Shader switching (can’t do much about state, but...)
  • 113.
    Shader Switching
    Could use indexed subroutines
      If shaders are similar in resource consumption (register usage) might want to combine them (GL 4.x)
      Initialize the subroutine array once, then dynamically index
    NVIDIA Bindless Graphics pointers allow casting buffer addresses
    Code shown on slide:
      // subroutine type returns vec4 to match the functions below
      subroutine vec4 shade_fn ();
      subroutine (shade_fn) vec4 metal() ...
      subroutine (shade_fn) vec4 wood () ...
      // content of array set by gl api call
      subroutine uniform shade_fn shadeFuncs[2];
      flat in ivec4 assigns;
      void main(){
        gl_FragColor = shadeFuncs[assigns.y]();
      }
      // bindless pointer casting and texture sampling
      vec4 metal() {
        MetalParams* metal = packPtr (assigns.zw);
        ... texture (metal->roughnessMap, uv);
      }
      vec4 wood() {
        WoodParams* wood = packPtr (assigns.zw);
        ...
      }
  • 114.
    Let the GPU do More Work
    So far CPU still responsible for most decision making and hot loop
    (Multi) DrawIndirect allows GPU to generate its own work
      Scene (objects, bboxes, materials...) described in buffers / indices / pointers
      Process scene data and build command lists for active objects
      E.g. do culling, LOD picking, selection highlighting...
  • 115.
    OpenGL Computing
    ARB_compute_shader (NEW 4.3)
      Dispatch threads with shared memory support (as in CUDA/CL)
      Access to ALL resources, textures, no interop, all in GLSL
      NV_BINDLESS benefit from pointer access
    ARB_framebuffer_no_attachments (NEW 4.3)
      Use rasterizer to spawn threads and SSBO/imageStores to record your results
    GL 3.x Transform Feedback (XFB)
      Allows simple 1D Kernels
    Code shown on slide:
      // API - COMPUTE
      glDispatchCompute (gx, gy, gz);
      // GLSL
      shared float s_mem[SOMESIZE];
      ... s_mem[gl_LocalInvocationIndex] = ...
      // API - FBO
      glBindFramebuffer (...);
      glFramebufferParameteri (..., GL_FRAMEBUFFER_DEFAULT_WIDTH, 2048);
      glFramebufferParameteri (..., GL_FRAMEBUFFER_DEFAULT_HEIGHT, 2048);
      glDrawArrays(...)
      // GLSL
      ... imageStore(...ivec2(gl_FragCoord),.);
      // API - XFB
      glEnable (GL_RASTERIZER_DISCARD);
      glDrawArrays (GL_POINTS,0, count);
      // GLSL
      buffer indirectBuffer {
        DrawIndirect commands[];
      }
      ... commands[gl_VertexID].instanceCount = visible ? 1 : 0;
  • 116.
    Culling
    Processing
      Matrix and bounding box (bbox) buffer, object buffer (which matrix to use with which bbox)
      XFB or “invisible” rendering to create output buffer
      Key: Single draw call for ALL active objects! No state changes
    Results
      “Readback” GPU to Host
        Can use XFB to pack into a single bit stream for all active objects: 0,1,0,1,1,1,0,0,0
      “Indirect” GPU to GPU
        Set DrawIndirect's instanceCount to 0 or 1
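The two result paths can be modeled on the CPU to make the data shapes concrete. The C sketch below assumes a per-object visibility flag has already been computed. `apply_visibility` mimics the "indirect" path by writing 0 or 1 into each command's instanceCount slot, and `pack_visibility` mimics the "readback" path by packing one bit per object. The function names and the bit order (object i maps to bit i % 32 of word i / 32) are assumptions for illustration, not something the slides specify.

```c
#include <stddef.h>
#include <stdint.h>

/* "Indirect" path: one instanceCount per draw command, 1 = draw, 0 = skip. */
static void apply_visibility(const uint8_t *visible, size_t n,
                             uint32_t *instance_counts)
{
    for (size_t i = 0; i < n; ++i)
        instance_counts[i] = visible[i] ? 1u : 0u;
}

/* "Readback" path: pack one bit per object into 32-bit words.
   `words` must hold at least (n + 31) / 32 entries. */
static void pack_visibility(const uint8_t *visible, size_t n, uint32_t *words)
{
    size_t word_count = (n + 31) / 32;
    for (size_t w = 0; w < word_count; ++w)
        words[w] = 0;
    for (size_t i = 0; i < n; ++i)
        if (visible[i])
            words[i / 32] |= (uint32_t)1 << (i % 32);
}
```

The indirect path never leaves the GPU in the real pipeline, which is why the slides favor it. The packed form is only useful when the host needs the result.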
  • 117.
    Culling Techniques
    Frustum (GL 3.x)
      XFB, VertexShader output 1 or 0
      VertexAttributes are bbox index, matrix index, data fetched via TBO (alternatively can feed bboxes directly, 2x vec3)
    HiZ Occlusion (GL 3.x)
      Depth-Pass (useful for fragment bound scenes anyway)
      Create mipmap pyramid, MAX depth
      XFB, VertexShader
        Compare object's closest clipspace bbox against z value of depth mip
        Mip level chosen by clipspace 2D area
    [Diagram caption: Projected size determines depth mip level, so that a mip texel covers the object]
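The slide only says that the mip level is chosen from the projected 2D area. One common way to do that, shown here as an assumption rather than as the presenter's exact rule, is to take the smallest level whose texel size is at least the larger screen-space extent of the bounding box, so the box overlaps only a few texels of the MAX-depth pyramid. The C sketch below uses integer pixel extents and a conventional depth range where larger values are farther away. Both function names are hypothetical.

```c
/* Smallest mip level L (clamped to max_level) with 2^L >= max(width, height).
   Assumes max_level < 32 so the shift stays in range. */
static unsigned hiz_mip_level(unsigned width_px, unsigned height_px,
                              unsigned max_level)
{
    unsigned extent = width_px > height_px ? width_px : height_px;
    unsigned level = 0;
    while (level < max_level && (1u << level) < extent)
        ++level;
    return level;
}

/* The object is occluded when even its closest depth lies behind the
   farthest depth stored in the pyramid texel(s) it covers. */
static int hiz_occluded(float bbox_nearest_depth, float pyramid_max_depth)
{
    return bbox_nearest_depth > pyramid_max_depth;
}
```

Storing the MAX depth per texel keeps the test conservative: an object is rejected only if it is behind everything that was drawn in the covered region.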
  • 118.
    Culling Techniques
    Raster Occlusion (GL 4.x)
      Depth-Pass
      Raster “invisible” bounding boxes
        Geometry Shader to create the 3 sides
        depth buffer discards occluded fragments
        Fragment Shader does visible[objindex] = 1
      [Diagram caption: Passing bbox fragments enable object]
    Temporal Coherence (vertex-bound)
      Render last visible
      Test all bboxes against current depth
      Render newly added visible: (~last) & (visible)
      Each object drawn only once
    Code shown on slide:
      // GLSL fragment shader
      // from ARB_shader_image_load_store
      layout(early_fragment_tests) in;
      buffer indirectBuffer {
        DrawIndirect commands[];
      };
      flat in int objID;
      void main(){
        commands[objID].instanceCount = 1;
      }
      // some other shader would have
      // cleared to 0 before
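The temporal-coherence step reduces to a bitwise expression over the visibility masks of the previous and current frame. A minimal C sketch follows, assuming visibility is kept as packed 32-bit words with one bit per object. The function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Objects to draw in the second pass: visible now, but not drawn as
   "last visible" in the first pass. Computes (~last) & visible per word. */
static void newly_visible(const uint32_t *last, const uint32_t *visible,
                          size_t word_count, uint32_t *out)
{
    for (size_t i = 0; i < word_count; ++i)
        out[i] = ~last[i] & visible[i];
}
```

Objects that were visible last frame are drawn first and fill the depth buffer. Only the bits left set by this mask need a second draw, which is why each object is drawn at most once.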
  • 119.
    Realtime Global Illumination
    Octree-Based Sparse Voxelization for Real-Time Global Illumination
      Technique by Cyril Crassin et al. (GTC 2012, also presents here at SIGGRAPH)
      http://www.seas.upenn.edu/~pcozzi/OpenGLInsights/OpenGLInsights-SparseV
    Uses only OpenGL!
      Generates voxelization of the scene as well as tracing it for global effects (indirect lighting, glossy reflections)
  • 120.
    Realtime Global Illumination
    Octree management
      NV_shader_buffer_load
        Pointers allow efficient access to the octree memory cells
        Casting is also possible, making it easy to interpret the data per node type
        Not limited to buffer bindings, can access many buffers at once
    [Diagram: octree nodes 1 through 9 stored in a linear memory pool]
  • 121.
    Realtime Global Illumination
    Scene Voxelization
      ARB_atomic_counter to generate work queues
      (Draw/Dispatch)Indirect to construct tree asynchronously, with glMemoryBarrier providing dependency information
      Attachment-less FBO used to rasterize triangles to voxels
      Material attributes (color, normal) contribute to a voxel cell (NV_shader_atomic_float)
      NV_shader_buffer_store, ARB_shader_image_load_store to write to voxels/octree cells
  • 122.
  • 123.
    Don’t Forget the 20th Anniversary Party
    Date: August 8th 2012 (today!)
    Location: JW Marriott Los Angeles at LA Live
    Venue: Gold Ballroom – Salon 1
  • 124.
    Other OpenGL-related NVIDIA Sessions at SIGGRAPH
    GPU-Accelerated 2D and Web Rendering
      Wednesday in West Hall 503 (this room), 2:40 PM - 3:40 PM
      Mark Kilgard, Principal Software Engineer, NVIDIA
    GPU Ray Tracing and OptiX
      Wednesday in West Hall 503, 3:50 PM - 4:50 PM
      David McAllister, OptiX Manager, NVIDIA
      Phillip Miller, Director, Workstation Software Product Management, NVIDIA
    Voxel Cone Tracing & Sparse Voxel Octree for Real-time Global Illumination
      Wednesday in NVIDIA Booth, 3:50 PM - 4:50 PM
      Cyril Crassin, Postdoctoral Research Scientist, NVIDIA Research
    OpenSubdiv: High Performance GPU Subdivision Surface Drawing
      Wednesday in NVIDIA Booth, 3:00 PM - 3:30 PM
      Thursday in NVIDIA Booth, 10:00 AM - 10:30 AM (2nd time)
      Pixar Animation Studios GPU Team, Pixar
    nvFX: A New Scene & Material Effect Framework for OpenGL and DirectX
      Thursday in NVIDIA Booth, 2:00 PM - 2:30 PM
      Tristan Lorach, Developer Relations Senior Engineer, NVIDIA

Editor's Notes

  • #13 One way to view the OpenGL offerings from NVIDIA is as a tool to enable awesome visual applications to be developed through a comprehensive stack of software solutions. We offer: Cg: A shading language and effects system to develop rendering effects and techniques, and deploy them on various platforms and APIs, including OpenGL. SceniX: An OpenGL-based professional scene graph used in many markets today: automotive styling, visualization, simulation, broadcast graphics, and more. Applications can quickly add features such as stereo, SDI, 30-bit color, scene distribution and interactive ray tracing. OptiX: An interactive ray-tracing engine built on top of CUDA and OpenGL. Hybrid rendering, the mixing of traditional graphics rendering and ray tracing, is also enabled. OptiX integrates with SceniX. With OptiX, you accelerate an existing renderer, or build a new one yourself. CompleX: Maintains interactivity for large scenes as they exceed the limits of a single GPU, allowing massive data sets to be explored, using multiple GPUs in a system. The CompleX engine can be adopted by any product using OpenGL, and can be enabled immediately by an application using SceniX. Parallel Nsight: An advanced debugger and analyzer fully integrated with Visual Studio. It enables you to better see what is going on with your OpenGL, CUDA, Direct3D, DirectCompute or OpenCL application.
  • #14 OpenGL is the only Cross Platform 3D API. Every major Operating System provides a version or flavor of OpenGL. Windows, Mac OS, iOS, Linux and Android.
  • #15 After Mark’s deep-dive, I’ll pull you back up into the higher level view of where OpenGL fits in as the 3D graphics API.
  • #16 OpenGL is a desktop API, as you’ve seen by now. OpenGL ES is the sister API for mobile and embedded devices. By keeping both APIs closely aligned, content can flow from there up into OpenGL enabled platforms and back down to OpenGL ES enabled platforms. WebGL was announced last year. It provides JavaScript bindings to OpenGL ES and provides plug-in less 3D graphics in a web browser. WebGL gives you access to the GPU in the system inside of a browser. Beta implementations of WebGL are already available from Mozilla, Google and Opera. For NVIDIA this means we will support and enhance OpenGL and WebGL on GeForce and OpenGL ES and WebGL on Tegra for the mobile market.
  • #65 Internal format query 2 allows an application to find out actual supported limits for most texture parameters. Examples: Query if a particular internal format is actually supported, if a texture is renderable, or can be used to texture from in a vertex/tess/geom/fragment/compute shader, etc. Support is indicated as fully supported, not supported, or supported with caveats. If there is a caveat, a debug output message will be generated (if enabled). Provides the ability to directly copy pixels between textures and renderbuffers without requiring the use of an intermediate buffer object or rendering using a framebuffer object. Immutable storage for all types of textures besides multisample and buffer textures was introduced by ARB_texture_storage. For completeness, this extension introduces immutable storage for multisampled textures.
  • #66 Allows applications to invalidate all or some of the contents of textures, buffers, and framebuffers to permit implementations to perform optimizations that avoid any extra work needed to keep resources up to date. For example, one might invalidate the contents of a multisample framebuffer after a downsample operation since the individual samples may not be used again. Provides the ability to fill the contents of a buffer object's data store with a constant value, like C's memset() function. The only way to do this previously was to copy from a fully initialized scratch buffer via glBufferSubData(). Provides the ability to have multiple generic vertex attribute arrays to share a single data store, splitting the vertex attribute array state. In this model, there is a collection of "buffer binding" state that include strides (for interleaved arrays). There is also a collection of "vertex attribute" state, that includes the format of the attribute and the location of the data relative to one of the bindings. When switching between two sets of interleaved arrays with the same format but different buffer objects, it's only necessary to change a single piece of binding state.
  • #67 Provide generic APIs allowing applications to enumerate active variables and interface blocks, which will be used as the sole enumeration API for ARB_shader_storage_buffer_object. Provides some new enumeration rules to avoid enumerating every array element/member in arrays of structures for ARB_shader_storage_buffer_object. Consolidates and obsoletes existing APIs such as GetActiveUniforms, GetActiveAttrib. Provides new support for enumerating inputs and outputs of separate shader objects. Provides missing enumeration support for fragment shader outputs.
  • #68 For applications using the "robustness" APIs, specifies additional constraints on buffer object accesses. When accessing outside the bounds of a buffer object, the extension promises that crashes should not occur and that reads/writes access nothing other than the contents of the buffer object being accessed. Adds a new layout qualifier allowing shaders to specify explicit locations for default uniforms. This allows applications to set their values without first having to call GetUniformLocation(). Removes the restriction forbidding multi-dimensional arrays in GLSL. As arrays are first-class objects in GLSL, a declaration like "float f[4][3]" is considered to be an array of four objects, each of which is an array of three floats. Useful for shared variables in compute shaders, which arrange threads into multi-dimensional groups and may naturally want to access shared values for a specific thread.
  • #69 Adaptive Scalable Texture Compression (ASTC) is a new texture compression technology that offers unprecedented flexibility, while producing better or comparable results than existing texture compressions at all bit rates. It includes support for 2D and 3D textures, with low and high dynamic range, at bitrates from below 1 bit/pixel up to 8 bits/pixel in fine steps. The goal of this extension is to support the 2D, LDR-only profile of the ASTC texture compression specification. Provides a new texture parameter allowing textures with an sRGB internal format to be decoded as though they had a non-sRGB format with the same texel values.
  • #77 NVIDIA FXAA technology harnesses the power of the GPU’s CUDA Cores to reduce visible aliasing. FXAA is a pixel shader-based image filter that is applied along with other post processing steps like motion blur and bloom. For game engines making use of deferred shading, FXAA provides a performance and memory advantage over deferred shading with multi-sample anti-aliasing (MSAA). FXAA targets edge aliasing and also aliasing on single-pixel and sub-pixel sized features, which tend to flicker as they move from frame to frame. FXAA reduces the visual contrast of these features so that they are less jarring to the eye. Note that FXAA cannot completely solve the sub-pixel aliasing problem, but it does substantially reduce it. The overall effect is smoother visual quality. FXAA reduces but does not completely eliminate shader aliasing. FXAA’s chief advantage over traditional MSAA is higher performance. In many cases, FXAA can be applied at a cost of 1ms per frame or less, resulting in frame rates that are often 2x higher than 4xMSAA with comparable image quality.
  • #84 Finally, Siggraph is where we introduce 3D Vision Pro as well. Again, come to the booth for more information. Now on to OpenGL!
  • #86 NV_path_rendering provides a new third pipeline—in addition to the vertex and pixel pipelines—for rendering pixels
  • #91 Before Kepler, in order to make a texture available for the GPU to reference, it had to be assigned a “slot” in a fixed-size binding table. The number of slots in that table ultimately limits how many unique textures a shader can read from at run time. On Kepler, no such additional setup is necessary: shaders can reference textures in memory directly and there is no need to go through the binding tables anymore. This effectively eliminates any limit on the number of unique textures a shader can use to render a scene.
  • #92 Bindless textures reduce CPU work and provide more efficient access for the GPU
  • #93 In advanced rendering apps such as ray tracing it is impossible to know in advance which textures a given ray may hit, so it is impossible to pre-“bind” them. The bindless model solves this problem by allowing shaders to reference textures directly.
