Uploaded by Cass Everitt

Beyond porting

This presentation covers techniques in modern OpenGL that reduce driver overhead: generating dynamic buffers in place, managing textures efficiently, and issuing many more draw calls per frame. It explains how persistent, coherent buffer mappings (ARB_buffer_storage) avoid map/unmap stalls, and how ARB_sparse_texture and ARB_bindless_texture let an application manage texture memory itself and eliminate texture binds. It stresses that state validation dominates the cost of a draw call and presents ARB_multi_draw_indirect as a way to render many small objects efficiently.

Beyond Porting
How Modern OpenGL can Radically Reduce Driver Overhead

Who are we?
  • Cass Everitt, NVIDIA Corporation
  • John McDonald, NVIDIA Corporation

What will we cover?
  • Dynamic Buffer Generation
  • Efficient Texture Management
  • Increasing Draw Call Count

Dynamic Buffer Generation
Problem
  • Our goal is to generate dynamic geometry directly in place.
  • It will be used one time, and will be completely regenerated next frame.
  • Particle systems are the most common example.
  • Vegetation / foliage is also common.

Typical Solution

    void UpdateParticleData(uint _dstBuf) {
        BindBuffer(ARRAY_BUFFER, _dstBuf);
        access = MAP_UNSYNCHRONIZED_BIT | MAP_WRITE_BIT;
        for particle in allParticles {
            dataSize = GetParticleSize(particle);
            // Map only the range for this one particle.
            void* dst = MapBufferRange(ARRAY_BUFFER, offset, dataSize, access);
            (*(Particle*)dst) = *particle;
            UnmapBuffer(ARRAY_BUFFER);
            offset += dataSize;
        }
    }
    // Now render with everything.

The horror
  • Same code as above, with the per-particle map/unmap pair called out: "This is so slow."

Driver interlude
  • First, a quick interlude on modern GL drivers
  • In the application (client) thread, the driver is very thin.
      • It simply packages work to hand off to the server thread.
  • The server thread does the real processing.
      • It turns command sequences into push buffer fragments.

Healthy Driver Interaction Visualized
[Diagram: timeline lanes for Application, Driver (Client), Driver (Server), and GPU. Legend: thread separator, component separator, state change, action method (draw, clear, etc.), present.]

MAP_UNSYNCHRONIZED
  • Avoids an application-GPU sync point (a CPU-GPU sync point)
  • But causes the Client and Server threads to serialize
      • This forces all pending work in the server thread to complete
      • It’s quite expensive (almost always needs to be avoided)

Healthy Driver Interaction Visualized
[Diagram repeated: same lanes and legend as before.]

Client-Server Stall of Sadness
[Diagram: the same lanes (Application, Driver (Client), Driver (Server), GPU) and legend, showing the client and server threads stalled.]

It’s okay
  • Q: What’s better than mapping in an unsynchronized manner?
  • A: Keeping around a pointer to GPU-visible memory forever.
  • Introducing: ARB_buffer_storage

ARB_buffer_storage
  • Conceptually similar to ARB_texture_storage (but for buffers)
  • Creates an immutable pointer to storage for a buffer
      • The pointer is immutable, the contents are not.
      • So BufferData cannot be called; BufferSubData is still okay (provided the storage was created with DYNAMIC_STORAGE_BIT).
  • Allows for extra information at create time.
  • For our usage, we care about the PERSISTENT and COHERENT bits.
      • PERSISTENT: Allow this buffer to be mapped while the GPU is using it.
      • COHERENT: Client writes to this buffer should be immediately visible to the GPU.
  • http://www.opengl.org/registry/specs/ARB/buffer_storage.txt

ARB_buffer_storage cont’d
  • Also affects the mapping behavior (pass persistent and coherent bits to MapBufferRange)
  • Persistently mapped buffers are good for:
      • Dynamic VB / IB data
      • Highly dynamic (~per draw call) uniform data
      • Multi_draw_indirect command buffers (more on this later)
  • Not a good fit for:
      • Static geometry buffers
      • Long lived uniform data (still should use BufferData or BufferSubData for this)

Armed with persistently mapped buffers

    // At the beginning of time
    flags = MAP_WRITE_BIT | MAP_PERSISTENT_BIT | MAP_COHERENT_BIT;
    BufferStorage(ARRAY_BUFFER, allParticleSize, NULL, flags);
    mParticleDst = MapBufferRange(ARRAY_BUFFER, 0, allParticleSize, flags);
    mOffset = 0;
    // allParticleSize should be ~3x one frame’s worth of particles
    // to avoid stalling.

Update Loop (old and busted)
  • The per-particle map/unmap loop from "Typical Solution", repeated for comparison.

Update Loop (new hotness)

    void UpdateParticleData() {
        for particle in allParticles {
            dataSize = GetParticleSize(particle);
            // mOffset is a byte offset, so copy through a byte pointer.
            memcpy((char*)mParticleDst + mOffset, particle, dataSize);
            mOffset += dataSize; // Wrapping not shown
        }
    }
    // Now render with everything.

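The slide leaves the wrap-around of mOffset as "not shown". A minimal sketch of one way to handle it follows, assuming a simple policy of jumping back to offset 0 whenever the next write would run past the end of the mapped range. The RingCursor class and its Allocate method are hypothetical names for illustration and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: hands out byte offsets inside a persistently mapped
// buffer of `capacity` bytes, wrapping to the start when a write would not fit.
class RingCursor {
public:
    explicit RingCursor(std::size_t capacity) : capacity_(capacity), offset_(0) {}

    // Returns the offset at which `size` contiguous bytes may be written.
    std::size_t Allocate(std::size_t size) {
        assert(size <= capacity_ && "allocation larger than the whole buffer");
        if (offset_ + size > capacity_) {
            offset_ = 0;  // Wrap: never split one write across the end.
        }
        const std::size_t result = offset_;
        offset_ += size;
        return result;
    }

private:
    std::size_t capacity_;
    std::size_t offset_;
};
```

Wrapping on its own does not stop the CPU from overwriting data the GPU has not consumed yet; that still needs the sizing and fencing discussed on the following slides.
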
Test App
Performance results
  • 160,000 point sprites
  • Specified in groups of 6 vertices (one particle at a time)
  • Synthetic (naturally)

    Method                      FPS      Particles / s
    Map(UNSYNCHRONIZED)         1.369    219,040
    BufferSubData               17.65    2,824,000
    D3D11 Map(NO_OVERWRITE)     20.25    3,240,000
    Map(COHERENT|PERSISTENT)    79.9     12,784,000

  • Room for improvement still, but much, much better.

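The two numeric columns are related by the particle count: particles per second is the frame rate times the 160,000 particles drawn each frame. The small check below makes that arithmetic explicit; ParticlesPerSecond is a hypothetical helper name, not something from the talk.

```cpp
#include <cassert>
#include <cmath>

// Particles per second = frames per second * particles drawn per frame.
// Rounded to the nearest integer to absorb floating-point error.
inline long long ParticlesPerSecond(double fps, long long particlesPerFrame) {
    assert(fps >= 0.0 && particlesPerFrame >= 0);
    return std::llround(fps * static_cast<double>(particlesPerFrame));
}
```

Read this way, the persistent, coherent mapping is roughly 3.9x the D3D11 Map(NO_OVERWRITE) result and about 58x the Map(UNSYNCHRONIZED) result in this synthetic test.
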
The other shoe
  • You are responsible for not stomping on data in flight.
  • Why 3x?
      • 1x: What the GPU is using right now.
      • 2x: What the driver is holding, getting ready for the GPU to use.
      • 3x: What you are writing to.
  • 3x should ~ guarantee enough buffer room*…
  • Use fences to ensure that rendering is complete before you begin to write new data.

Fencing
  • Use FenceSync to place a new fence.
  • When ready to scribble over that memory again, use ClientWaitSync to ensure that memory is done.
      • ClientWaitSync will block the client thread until it is ready
      • So you should wrap this function with a performance counter
      • And complain to your log file (or resize the underlying buffer) if you frequently see stalls here
  • For complete details on correct management of buffers with fencing, see Efficient Buffer Management [McDonald 2012]

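The 3x sizing and the fencing rule can be combined into a small piece of bookkeeping: split the mapped buffer into three regions, place a fence after the draws that read a region, and wait on that fence before writing the region again. The sketch below is one possible arrangement, not the presenters' implementation. The class name and the two callbacks are hypothetical; in a real renderer placeFence would presumably wrap glFenceSync(GL_SYNC_GPU_COMMANDS_COMPLETE, 0) and waitAndDelete would wrap glClientWaitSync followed by glDeleteSync. They are injected here so the logic can run without a GL context.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical helper: guards three equal regions of one persistently mapped
// buffer with one fence each. `Fence` stands in for GLsync.
template <typename Fence>
class TripleBufferedRegions {
public:
    TripleBufferedRegions(std::size_t regionSize,
                          std::function<Fence()> placeFence,
                          std::function<void(Fence)> waitAndDelete)
        : regionSize_(regionSize),
          placeFence_(std::move(placeFence)),
          waitAndDelete_(std::move(waitAndDelete)) {}

    // Waits until the GPU is done with the current region (if a fence is
    // pending), then returns the byte offset this frame may write to.
    std::size_t BeginFrame() {
        if (pending_[current_]) {
            waitAndDelete_(fences_[current_]);  // May block: worth timing.
            pending_[current_] = false;
        }
        return current_ * regionSize_;
    }

    // Call after issuing the draws that read the current region.
    void EndFrame() {
        assert(!pending_[current_] && "EndFrame called without BeginFrame");
        fences_[current_] = placeFence_();
        pending_[current_] = true;
        current_ = (current_ + 1) % kRegions;
    }

private:
    static constexpr std::size_t kRegions = 3;
    std::size_t regionSize_;
    std::function<Fence()> placeFence_;
    std::function<void(Fence)> waitAndDelete_;
    std::array<Fence, kRegions> fences_{};
    std::array<bool, kRegions> pending_{};
    std::size_t current_ = 0;
};
```

With three regions, the first wait only happens on the fourth frame, by which time the GPU has usually finished with the oldest region. The wait callback is the natural place for the performance counter the slide recommends.
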
Efficient Texture Management
Or “how to manage all texture memory myself”

Problem
  • Changing textures breaks batches.
  • Not all texture data is needed all the time
      • Texture data is large (typically the largest memory bucket for games)
  • Bindless solves this, but can hurt GPU performance
      • Too many different textures can fall out of TexHdr$
      • Not a bindless problem per se

Terminology
  • Reserve – The act of allocating virtual memory
  • Commit – Tying a virtual memory allocation to a physical backing store (physical memory)
  • Texture Shape – The characteristics of a texture that affect its memory consumption
      • Specifically: Height, Width, Depth, Surface Format, Mipmap Level Count

Old Solution
  • Texture Atlases
  • Problems
      • Can impact art pipeline
      • Texture wrap, border filtering
      • Color bleeding in mip maps

Texture Arrays
  • Introduced in GL 3.0, and D3D 10.
  • Arrays of textures that are the same shape and format
  • Typically can contain many “layers” (2048+)
  • Filtering works as expected
  • As does mipmapping!

Sparse Bindless Texture Arrays
  • Organize loose textures into Texture Arrays.
  • Sparsely allocate Texture Arrays
      • Introducing ARB_sparse_texture
      • Consume virtual memory, but not physical memory
  • Use Bindless handles to deal with as many arrays as needed!
      • Introducing ARB_bindless_texture
[Diagram: a texture array whose layers are mostly uncommitted.]

ARB_sparse_texture
  • Applications get fine-grained control of physical memory for textures with large virtual allocations
  • Inspired by Mega Texture
  • Primary expected use cases:
      • Sparse texture data
      • Texture paging
      • Delayed-loading assets
  • http://www.opengl.org/registry/specs/ARB/sparse_texture.txt

ARB_bindless_texture
  • Textures specified by GPU-visible “handle” (really an address)
      • Rather than by name and binding point
  • Can come from ~anywhere
      • Uniforms
      • Varying
      • SSBO
      • Other textures
  • Texture residency also application-controlled
      • Residency is “does this live on the GPU or in sysmem?”
  • https://www.opengl.org/registry/specs/ARB/bindless_texture.txt

Advantages
  • Artists work naturally
  • No preprocessing required (no bake-step required)
      • Although preprocessing is helpful if ARB_sparse_texture is unavailable
  • Reduce or eliminate TexHdr$ thrashing
      • Even as compared to traditional texturing
  • Programmers manage texture residency
      • Works well with arbitrary streaming
  • Faster on the CPU
  • Faster on the GPU

Disadvantages
  • Texture addresses are now structs (96 bits).
      • 64 bits for bindless handle
      • 32 bits for slice index (could reduce this to 10 bits at a perf cost)
  • ARB_sparse_texture implementations are a bit immature
      • Early adopters: please bring us your bugs.
  • ARB_sparse_texture requires base level be a multiple of tile size
      • (Smaller is okay)
      • Tile size is queried at runtime
      • Textures that are power-of-2 should almost always be safe.

Implementation Overview
  • When creating a new texture…
  • Check to see if any suitable texture array exists
      • Texture arrays can contain a large number of textures of the same shape
      • Ex. Many TEXTURE_2Ds grouped into a single TEXTURE_2D_ARRAY
  • If no suitable texture, create a new one.

Texture Container Creation (example)
  • GetIntegerv( MAX_SPARSE_ARRAY_TEXTURE_LAYERS, maxLayers );
  • Choose a reasonable size (e.g. array size ~100MB virtual)
  • If new internalFormat, choose page size
      • GetInternalformativ( …, internalformat, NUM_VIRTUAL_PAGE_SIZES, 1, &numIndexes);
      • Note: numIndexes can be 0, so have a plan
      • Iterate, select suitable pageSizeIndex
  • BindTexture( TEXTURE_2D_ARRAY, newTexArray );
  • TexParameteri( TEXTURE_SPARSE, TRUE );
  • TexParameteri( VIRTUAL_PAGE_SIZE_INDEX, pageSizeIndex );
  • Allocate the texture’s virtual memory using TexStorage3D

Specifying Texture Data
  • Using the located/created texture array from the previous step
  • Allocate a layer as the location of our data
  • For each mipmap level of the allocated layer:
      • Commit the entire mipmap level (using TexPageCommitment)
      • Specify actual texel data as usual for arrays
          • gl(Compressed|Copy|)TexSubImage3D
          • PBO updates are fine too
[Diagram: a texture array with one allocated layer (slice); the remaining layers are free or uncommitted.]

Freeing Textures
  • To free the texture, reverse the process:
      • Use TexPageCommitment to mark the entire layer (slice) as free.
      • Do once for each mipmap level
  • Add the layer to the free list for future allocation
[Diagram: the same texture array with the freed layer (slice) returned to the uncommitted state.]

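The allocate and free steps above imply some CPU-side bookkeeping of which layers in a container are in use. A minimal sketch of such a free list follows, assuming one allocator per texture array. LayerAllocator and its methods are hypothetical names; the actual commit and de-commit calls (TexPageCommitment per mipmap level) are deliberately left to the caller.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical CPU-side bookkeeping for one sparse texture array container.
class LayerAllocator {
public:
    explicit LayerAllocator(std::uint32_t layerCount) {
        // Push in reverse so that layer 0 is handed out first.
        for (std::uint32_t i = layerCount; i > 0; --i) {
            free_.push_back(i - 1);
        }
    }

    bool HasFreeLayer() const { return !free_.empty(); }

    // Returns a layer index; the caller then commits each mip level of it.
    std::uint32_t Allocate() {
        assert(HasFreeLayer() && "container full: create another texture array");
        const std::uint32_t layer = free_.back();
        free_.pop_back();
        return layer;
    }

    // Call after de-committing every mip level of `layer`.
    void Free(std::uint32_t layer) { free_.push_back(layer); }

private:
    std::vector<std::uint32_t> free_;
};
```

When HasFreeLayer() returns false, the "Implementation Overview" rule applies: create another texture array of the same shape and give it its own allocator.
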
Combining with Bindless to eliminate binds
  • At container create time:
      • Specify sampling parameters via SamplerParameter calls first
      • Call GetTextureSamplerHandleARB to return a GPU-visible pointer to the texture+sampler container
      • Call MakeTextureHandleResident to ensure the resource lives on the GPU
  • At delete time, call MakeTextureHandleNonResident
  • With bindless, you explicitly manage the GPU’s working set

Using texture data in shaders
  • When a texture is needed with the default sampling parameters
      • Create a GLSL-visible TextureRef object:

            struct TextureRef {
                sampler2DArray container;
                float slice;
            };

  • When a texture is needed with custom sampling parameters
      • Create a separate sampler object for the shader with the parameters
      • Create a bindless handle to the pair using GetTextureSamplerHandle, then call MakeTextureHandleResident with the new value
      • And fill out a TextureRef as above for usage by GLSL

C++ Code
  • Basic implementation (some details missing)
  • BSD licensed (use as you will)
  • https://github.com/nvMcJohn/apitest/blob/pdoane_newtests/sparse_bindless_texarray.h
  • https://github.com/nvMcJohn/apitest/blob/pdoane_newtests/sparse_bindless_texarray.cpp

Increasing Draw Call Count
Let’s draw all the calls!

All the Draw Calls!
Problem
  • You want more draw calls of smaller objects.
  • D3D is slow at this.
  • Naïve GL is faster than D3D, but not fast enough.

XY Problem
  • Y: How can I have more draw calls?
  • X: You don’t really care if it’s more draw calls, right?
      • Really what you want is to be able to draw more small geometry groupings. More objects.

Well why didn’t you just say so??
  • First, some background.
      • What makes draw calls slow?
      • Real world API usage
      • Draw Call Cost Visualization

Some background
  • What causes slow draw calls?
      • Validation is the biggest bucket (by far).
  • Pre-validation is “difficult”
  • “Every application does the same things.”
      • Not really. Most applications are in completely disjoint states
      • Try this experiment: What is important to you?
      • Now ask your neighbor what’s important to him.

Why is prevalidation difficult?
  • The GPU is an exceedingly complex state machine.
      • (Honestly, it’s probably the most complex state machine in all of CS)
  • Any one of those states may have a problem that requires a WAR (workaround)
      • Usually the only problem is overall performance
      • But sometimes not.
  • There are millions of tests covering NVIDIA GPU functionality.

FINE.
  • How can app devs mitigate these costs?
      • Minimize state changes.
      • All state changes are not created equal!
  • Cost of a draw call:
      • Small fixed cost + Cost of validation of changed state

Feels limiting…
  • Artists want lots of materials, and small amounts of geometry
  • Even better: What if artists just didn’t have to care about this?
  • Ideal Programmer->Artist Interaction
      • “You make pretty art. I’ll make it fit.”

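Relative costs of State Changes
  • In decreasing cost…
      • Render Target
      • Program
      • ROP
      • Texture Bindings
      • Vertex Format
      • UBO Bindings
      • Vertex Bindings
      • Uniform Updates
  • Approximate rates marked along the chart, from most to least expensive: ~60K / s, ~300K / s, ~1.5M / s, ~10M / s
  • Note: Not to scale
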
Real World API frequency
  • API usage looks roughly like this (in increasing frequency of change):
      • Render Target (scene)
      • Per Scene Uniform Buffer + Textures
      • IB / VB and Input Layout
      • Shader (Material)
      • Per-material Uniform Buffer + Textures
      • Per-object Uniform Buffer + Textures
      • Per-piece Uniform Buffer + Textures
      • Draw

Draw Calls visualized
[Diagram: grid of draw calls, color-coded by the state changed before each one. Legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format.]

Draw Calls visualized (cont’d)
  • Read down, then right
  • Black: no change
[Diagram: same grid and legend as above.]

Goals
  • Let’s minimize validation costs without affecting artists
  • Things we need to be fast (per app call frequency):
      • Uniform Updates and binding
      • Texture Updates and binding
  • These happen most often in app, ergo driving them to ~0 should be a win.

Textures
  • Using Sparse Bindless Texture Arrays (as previously described) solves this.
      • All textures are set before any drawing begins
      • (No need to change textures between draw calls)
  • Note that from the CPU’s perspective, just using bindless is sufficient.
  • That was easy.

Eliminating Texture Binds -- visualized
  • In increasing frequency of change:
      • Render Target (scene)
      • Per Scene Uniform Buffer + Textures
      • IB / VB and Input Layout
      • Shader (Material)
      • Per-material Uniform Buffer + Textures
      • Per-object Uniform Buffer + Textures
      • Per-piece Uniform Buffer + Textures
      • Draw
[Diagram: draw call grid. Legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format.]

Boom!
  • In increasing frequency of change:
      • Render Target (scene)
      • Per Scene Uniform Buffer
      • IB / VB and Input Layout
      • Shader (Material)
      • Per-material Uniform Buffer
      • Per-object Uniform Buffer
      • Per-piece Uniform Buffer
      • Draw
[Diagram: draw call grid with texture changes removed. Same legend.]

Buffer updates (old and busted)
Typical Scene Graph Traversal

    for obj in visibleObjectSet {
        update(buffer, obj);
        draw(obj);
    }

Buffer updates (new hotness)
Typical Scene Graph Traversal

    for obj in visibleObjectSet {
        update(bufferFragment, obj);
    }
    for obj in visibleObjectSet {
        draw(obj);
    }

bufferFragma-wha?
  • Rather than one buffer per object, we share UBOs for many objects.
  • i.e., given struct ObjectUniforms { /* … */ };

        // Old (probably not explicitly instantiated,
        // just scattered in GLSL)
        ObjectUniforms uniformData;

        // New
        ObjectUniforms uniformData[ObjectsPerKickoff];

  • Use persistent mapping for even more win here!
  • For large amounts of data (bones) consider SSBO.
      • Introducing ARB_shader_storage_buffer_object

SSBO?
  • Like “large” uniform buffer objects.
      • Minimum required size to claim support is 16M.
  • Accessed like uniforms in shader
  • Support for better packing (std430)
  • Caveat: They are typically implemented in hardware as textures (and can introduce dependent texture reads)
      • Just one of a laundry list of things to consider, not to discourage use.
  • http://www.opengl.org/registry/specs/ARB/shader_storage_buffer_object.txt

Eliminating Buffer Update Overhead
  • In increasing frequency of change:
      • Render Target (scene)
      • Per Scene Uniform Buffer
      • IB / VB and Input Layout
      • Shader (Material)
      • Per-material Uniform Buffer
      • Per-object Uniform Buffer
      • Per-piece Uniform Buffer
      • Draw
[Diagram: draw call grid. Legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format.]

Sweet!
  • In increasing frequency of change:
      • Render Target (scene)
      • IB / VB and Input Layout
      • Shader (Material)
      • Draw ( * each object )
  • Hrrrrmmmmmm….
[Diagram: draw call grid with uniform buffer updates removed. Same legend.]

So now…
  • It’d be awesome if we could do all of those kickoffs at once.
  • Validation is already only paid once
  • But we could just pay the constant startup cost once.
  • If only…….
  • Introducing ARB_multi_draw_indirect

ARB_multi_draw_indirect
  • Allows you to specify parameters to draw commands from a buffer.
  • This means you can generate those parameters wide (on the CPU)
  • Or even on the GPU, via compute program.
  • http://www.opengl.org/registry/specs/ARB/multi_draw_indirect.txt

ARB_multi_draw_indirect cont’d

    void MultiDrawElementsIndirect(enum mode,
                                   enum type,
                                   const void* indirect,
                                   sizei primcount,
                                   sizei stride);

ARB_multi_draw_indirect cont’d
Equivalent behavior, shown here for the Arrays variant (MultiDrawArraysIndirect):

    const ubyte* ptr = (const ubyte*)indirect;
    for (i = 0; i < primcount; i++) {
        DrawArraysIndirect(mode, (DrawArraysIndirectCommand*)ptr);
        if (stride == 0) {
            // A stride of 0 means the commands are tightly packed.
            ptr += sizeof(DrawArraysIndirectCommand);
        } else {
            ptr += stride;
        }
    }

DrawArraysIndirectCommand

    typedef struct {
        uint count;
        uint primCount;
        uint first;
        uint baseInstance;
    } DrawArraysIndirectCommand;

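To make the stride rule concrete, the sketch below walks an indirect command buffer on the CPU the same way the pseudocode loop does, collecting the commands rather than drawing them. It is an illustration only: ReadIndirectCommands is a hypothetical helper, and a real application would hand the buffer to the driver instead of reading it back.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same field order as the DrawArraysIndirectCommand struct on the slide.
struct DrawArraysIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t first;
    std::uint32_t baseInstance;
};

// CPU-side stand-in for the driver loop: reads `primcount` commands from
// `indirect`, advancing by `stride` bytes (0 means tightly packed).
inline std::vector<DrawArraysIndirectCommand> ReadIndirectCommands(
    const void* indirect, std::size_t primcount, std::size_t stride) {
    assert(indirect != nullptr || primcount == 0);
    const unsigned char* ptr = static_cast<const unsigned char*>(indirect);
    const std::size_t step =
        (stride == 0) ? sizeof(DrawArraysIndirectCommand) : stride;
    std::vector<DrawArraysIndirectCommand> out;
    out.reserve(primcount);
    for (std::size_t i = 0; i < primcount; ++i) {
        DrawArraysIndirectCommand cmd;
        std::memcpy(&cmd, ptr, sizeof(cmd));  // memcpy avoids alignment assumptions
        out.push_back(cmd);
        ptr += step;
    }
    return out;
}
```

A non-zero stride lets an application interleave its own per-draw data between commands in the same buffer, as long as each command starts `stride` bytes after the previous one.
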
Knowing which shader data is mine
  • Use ARB_shader_draw_parameters, a necessary companion to ARB_multi_draw_indirect
  • Adds a builtin to the VS: DrawID (InstanceID already available)
      • This tells you which command of a MultiDraw command is being executed.
      • When not using MultiDraw, the builtin is specified to be 0.
  • Caveat: Right now, you have to pass this down to other shader stages as an interpolant.
      • Hoping to have that rectified via ARB or EXT extension “real soon now.”
  • http://www.opengl.org/registry/specs/ARB/shader_draw_parameters.txt

Applying everything
  • CPU Perf is massively better
      • 5-30x increase in number of distinct objects / s
  • Interaction with driver is decreased ~75%
  • Note: GPU perf can be affected negatively (although not too badly)
  • As always: Profile, profile, profile.

Previous Results
[Diagram: draw call grid. Legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format.]

Visualized Results
[Diagram: draw call grid after applying everything. Legend as above, plus MultiDraw.]

Where we came from
[Diagram: the original draw call grid, for comparison. Legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format.]

Conclusion
  • Go forth and work magnify.

Questions?
  • jmcdonald at nvidia dot com
  • cass at nvidia dot com

Beyond porting

  • 1.
    Beyond PortingHow ModernOpenGL canRadically Reduce Driver Overhead
  • 2.
    Who are we?CassEveritt, NVIDIA CorporationJohn McDonald, NVIDIA Corporation
  • 3.
    What will wecover?Dynamic Buffer GenerationEfficient Texture ManagementIncreasing Draw Call Count
  • 4.
    Dynamic Buffer GenerationProblemOurgoal is to generate dynamic geometry directly in place.It will be used one time, and will be completely regenerated next frame.Particle systems are the most common exampleVegetation / foliage also common
  • 5.
    Typical Solutionvoid UpdateParticleData(uint_dstBuf) {BindBuffer(ARRAY_BUFFER, _dstBuf);access = MAP_UNSYNCHRONIZED | MAP_WRITE_BIT;for particle in allParticles {dataSize = GetParticleSize(particle);void* dst = MapBuffer(ARRAY_BUFFER, offset, dataSize, access);(*(Particle*)dst) = *particle;UnmapBuffer(ARRAY_BUFFER);offset += dataSize;}};// Now render with everything.
  • 6.
    The horrorvoid UpdateParticleData(uint_dstBuf) {BindBuffer(ARRAY_BUFFER, _dstBuf);access = MAP_UNSYNCHRONIZED | MAP_WRITE_BIT;for particle in allParticles {dataSize = GetParticleSize(particle);void* dst = MapBuffer(ARRAY_BUFFER, offset, dataSize, access);(*(Particle*)dst) = *particle;UnmapBuffer(ARRAY_BUFFER);This is so slow.offset += dataSize;}};// Now render with everything.
  • 7.
    Driver interludeFirst, aquick interlude on modern GL driversIn the application (client) thread, the driver is very thin.It simply packages work to hand off to the server thread.The server thread does the real processingIt turns command sequences into push buffer fragments.
  • 8.
    Healthy Driver InteractionVisualizedApplicationDriver (Client)Driver (Server)GPUThread separatorComponent separatorState ChangeAction Method (draw, clear, etc)Present
  • 9.
    MAP_UNSYNCHRONIZEDAvoids an application-GPUsync point (a CPU-GPU sync point)But causes the Client and Server threads to serializeThis forces all pending work in the server thread to completeIt’s quite expensive (almost always needs to be avoided)
  • 10.
    Healthy Driver InteractionVisualizedApplicationDriver (Client)Driver (Server)GPUThread separatorComponent separatorState ChangeAction Method (draw, clear, etc)Present
  • 11.
    Client-Server Stall ofSadnessApplicationDriver (Client)Driver (Server)GPUThread separatorComponent separatorState ChangeAction Method (draw, clear, etc)Present
  • 12.
    It’s okayQ: What’sbetter than mapping in an unsynchronized manner?A: Keeping around a pointer to GPU-visible memory forever.Introducing: ARB_buffer_storage
  • 13.
    ARB_buffer_storageConceptually similar toARB_texture_storage (but for buffers)Creates an immutable pointer to storage for a bufferThe pointer is immutable, the contents are not.So BufferData cannot be called—BufferSubData is still okay.Allows for extra information at create time.For our usage, we care about the PERSISTENT and COHERENTbits.PERSISTENT: Allow this buffer to be mapped while the GPU is using it.COHERENT: Client writes to this buffer should be immediately visible tothe GPU.http://www.opengl.org/registry/specs/ARB/buffer_storage.txt
  • 14.
    ARB_buffer_storage cont’dAlso affectsthe mapping behavior (pass persistent and coherentbits to MapBufferRange)Persistently mapped buffers are good for:Dynamic VB / IB dataHighly dynamic (~per draw call) uniform dataMulti_draw_indirect command buffers (more on this later)Not a good fit for:Static geometry buffersLong lived uniform data (still should use BufferData or BufferSubData forthis)
  • 15.
    Armed with persistently mapped buffers
        // At the beginning of time
        flags = MAP_WRITE_BIT | MAP_PERSISTENT_BIT | MAP_COHERENT_BIT;
        BufferStorage(ARRAY_BUFFER, allParticleSize, NULL, flags);
        mParticleDst = MapBufferRange(ARRAY_BUFFER, 0, allParticleSize, flags);
        mOffset = 0;
        // allParticleSize should be ~3x one frame’s worth of particles
        // to avoid stalling.
  • 16.
    Update Loop (old and busted)
        void UpdateParticleData(uint _dstBuf) {
            BindBuffer(ARRAY_BUFFER, _dstBuf);
            access = MAP_UNSYNCHRONIZED_BIT | MAP_WRITE_BIT;
            offset = 0;
            for particle in allParticles {
                dataSize = GetParticleSize(particle);
                // One map/unmap pair per particle: this is the slow part.
                void* dst = MapBufferRange(ARRAY_BUFFER, offset, dataSize, access);
                (*(Particle*)dst) = *particle;
                offset += dataSize;
                UnmapBuffer(ARRAY_BUFFER);
            }
        }
        // Now render with everything.
  • 17.
    Update Loop (new hotness)
        void UpdateParticleData() {
            for particle in allParticles {
                dataSize = GetParticleSize(particle);
                // mOffset is a byte offset into the persistently mapped range.
                memcpy((char*)mParticleDst + mOffset, particle, dataSize);
                mOffset += dataSize; // Wrapping not shown
            }
        }
        // Now render with everything.
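The wrapping that the loop leaves out might look something like this. It is a CPU-only sketch under stated assumptions: `RingCursor`, `writeFrame`, and the `Particle` layout are hypothetical names invented for illustration, a plain `std::vector` stands in for the persistently mapped pointer, and no fencing is done, so nothing here proves the GPU has finished with a region before it is reused.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct Particle { float x, y, size; };  // hypothetical layout

// Hypothetical helper: hands out byte offsets inside a fixed-size mapped range
// and wraps to the start when a request would run past the end.
struct RingCursor {
    std::size_t capacity;  // size of the mapped range in bytes
    std::size_t offset;    // next unwritten byte

    explicit RingCursor(std::size_t cap) : capacity(cap), offset(0) {}

    // Returns the offset at which `bytes` contiguous bytes may be written.
    std::size_t reserve(std::size_t bytes) {
        assert(bytes <= capacity);
        if (offset + bytes > capacity) {
            offset = 0;  // wrap; a real renderer must first know the GPU is done here
        }
        const std::size_t at = offset;
        offset += bytes;
        return at;
    }
};

// Writes one frame of particles and returns the byte offset the draw should read from.
// `mapped` stands in for the persistently mapped pointer (mParticleDst).
inline std::size_t writeFrame(std::vector<unsigned char>& mapped, RingCursor& cursor,
                              const std::vector<Particle>& particles) {
    assert(cursor.capacity <= mapped.size());
    const std::size_t bytes = particles.size() * sizeof(Particle);
    const std::size_t at = cursor.reserve(bytes);
    if (bytes != 0) {
        std::memcpy(mapped.data() + at, particles.data(), bytes);
    }
    return at;
}
```

Reserving a whole frame at once keeps each frame's data contiguous, so a wrap never splits a frame across the end of the buffer.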
  • 18.
  • 19.
    Performance results
    160,000 point sprites
    Specified in groups of 6 vertices (one particle at a time)
    Synthetic (naturally)
        Method                      FPS      Particles / s
        Map(UNSYNCHRONIZED)         1.369    219,040
        BufferSubData               17.65    2,824,000
        D3D11 Map(NO_OVERWRITE)     20.25    3,240,000
  • 20.
    Performance results
    160,000 point sprites
    Specified in groups of 6 vertices (one particle at a time)
    Synthetic (naturally)
        Method                      FPS      Particles / s
        Map(UNSYNCHRONIZED)         1.369    219,040
        BufferSubData               17.65    2,824,000
        D3D11 Map(NO_OVERWRITE)     20.25    3,240,000
        Map(COHERENT|PERSISTENT)    79.9     12,784,000
    Room for improvement still, but much, much better.
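The "Particles / s" column is simply FPS multiplied by the 160,000 particles drawn per frame. A small sketch of that arithmetic, with hypothetical helper names, for anyone who wants to plug in their own measurements:

```cpp
#include <cassert>
#include <cmath>

// Particles per second = frames per second * particles per frame.
inline double particlesPerSecond(double fps, double particlesPerFrame) {
    assert(fps >= 0.0 && particlesPerFrame >= 0.0);
    return fps * particlesPerFrame;
}

// Ratio between two frame rates measured on the same workload.
inline double speedup(double fasterFps, double slowerFps) {
    assert(slowerFps > 0.0);
    return fasterFps / slowerFps;
}
```

With the figures from the table, persistent mapping comes out roughly 58x faster than Map(UNSYNCHRONIZED) and a little under 4x faster than D3D11 Map(NO_OVERWRITE).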
  • 21.
    The other shoe
    You are responsible for not stomping on data in flight.
    Why 3x?
        1x: What the GPU is using right now.
        2x: What the driver is holding, getting ready for the GPU to use.
        3x: What you are writing to.
    3x should ~ guarantee enough buffer room*…
    Use fences to ensure that rendering is complete before you begin to write new data.
  • 22.
    Fencing
    Use FenceSync to place a new fence.
    When ready to scribble over that memory again, use ClientWaitSync to ensure that memory is done.
    ClientWaitSync will block the client thread until it is ready.
        So you should wrap this function with a performance counter.
        And complain to your log file (or resize the underlying buffer) if you frequently see stalls here.
    For complete details on correct management of buffers with fencing, see Efficient Buffer Management [McDonald 2012].
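One way to wrap the wait with a performance counter is sketched here. It assumes the blocking call is passed in as a callable, so a real program would hand it a lambda that calls ClientWaitSync on the fence guarding the region. `WaitStats`, `timedWait`, and the threshold parameter are hypothetical names, not part of any GL API.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>

// Hypothetical counters for how often the client thread had to wait.
struct WaitStats {
    unsigned waits = 0;   // total number of waits
    unsigned stalls = 0;  // waits that took longer than the threshold
};

// Runs `wait` (a stand-in for the blocking fence wait), measures how long it
// blocked, and logs a complaint when it exceeds `stallThresholdMs`.
inline double timedWait(const std::function<void()>& wait, double stallThresholdMs,
                        WaitStats& stats) {
    assert(stallThresholdMs >= 0.0);
    const auto start = std::chrono::steady_clock::now();
    wait();
    const auto stop = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
    ++stats.waits;
    if (ms > stallThresholdMs) {
        ++stats.stalls;
        std::fprintf(stderr, "fence wait stalled for %.3f ms\n", ms);
    }
    return ms;
}
```

If `stalls` grows frequently relative to `waits`, that is the signal to enlarge the underlying buffer.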
  • 23.
    Efficient Texture Management
    Or “how to manage all texture memory myself”
  • 24.
    Problem
    Changing textures breaks batches.
    Not all texture data is needed all the time.
        Texture data is large (typically the largest memory bucket for games).
    Bindless solves this, but can hurt GPU performance.
        Too many different textures can fall out of TexHdr$.
        Not a bindless problem per se.
  • 25.
    Terminology
    Reserve – The act of allocating virtual memory.
    Commit – Tying a virtual memory allocation to a physical backing store (physical memory).
    Texture Shape – The characteristics of a texture that affect its memory consumption.
        Specifically: Height, Width, Depth, Surface Format, Mipmap Level Count
  • 26.
    Old Solution
    Texture Atlases
    Problems:
        Can impact art pipeline
        Texture wrap, border filtering
        Color bleeding in mip maps
  • 27.
    Texture Arrays
    Introduced in GL 3.0, and D3D 10.
    Arrays of textures that are the same shape and format.
    Typically can contain many “layers” (2048+).
    Filtering works as expected.
    As does mipmapping!
  • 28.
    Sparse Bindless Texture Arrays
    Organize loose textures into Texture Arrays.
    Sparsely allocate Texture Arrays.
        Introducing ARB_sparse_texture
        Consume virtual memory, but not physical memory.
    Use Bindless handles to deal with as many arrays as needed!
        Introducing ARB_bindless_texture
    [Diagram: texture array slots labeled “layer” and “uncommitted”.]
  • 29.
    ARB_sparse_texture
    Applications get fine-grained control of physical memory for textures with large virtual allocations.
    Inspired by Mega Texture.
    Primary expected use cases:
        Sparse texture data
        Texture paging
        Delayed-loading assets
    http://www.opengl.org/registry/specs/ARB/sparse_texture.txt
  • 30.
    ARB_bindless_texture
    Textures specified by GPU-visible “handle” (really an address).
        Rather than by name and binding point.
    Can come from ~anywhere:
        Uniforms
        Varying
        SSBO
        Other textures
    Texture residency also application-controlled.
        Residency is “does this live on the GPU or in sysmem?”
    https://www.opengl.org/registry/specs/ARB/bindless_texture.txt
  • 31.
    Advantages
    Artists work naturally.
    No preprocessing required (no bake-step required).
        Although preprocessing is helpful if ARB_sparse_texture is unavailable.
    Reduce or eliminate TexHdr$ thrashing.
        Even as compared to traditional texturing.
    Programmers manage texture residency.
        Works well with arbitrary streaming.
    Faster on the CPU.
    Faster on the GPU.
  • 32.
    Disadvantages
    Texture addresses are now structs (96 bits).
        64 bits for bindless handle
        32 bits for slice index (could reduce this to 10 bits at a perf cost)
    ARB_sparse_texture implementations are a bit immature.
        Early adopters: please bring us your bugs.
    ARB_sparse_texture requires base level be a multiple of tile size.
        (Smaller is okay)
        Tile size is queried at runtime.
        Textures that are power-of-2 should almost always be safe.
  • 33.
    Implementation Overview
    When creating a new texture…
    Check to see if any suitable texture array exists.
        Texture arrays can contain a large number of textures of the same shape.
        Ex. Many TEXTURE_2Ds grouped into a single TEXTURE_2D_ARRAY
    If no suitable texture, create a new one.
  • 34.
    Texture Container Creation (example)
    GetIntegerv( MAX_SPARSE_ARRAY_TEXTURE_LAYERS, maxLayers );
    Choose a reasonable size (e.g. array size ~100MB virtual)
    If new internalFormat, choose page size
        GetInternalformativ( …, internalformat, NUM_VIRTUAL_PAGE_SIZES, 1, &numIndexes);
        Note: numIndexes can be 0, so have a plan
        Iterate, select suitable pageSizeIndex
    BindTexture( TEXTURE_2D_ARRAY, newTexArray );
    TexParameteri( TEXTURE_SPARSE, TRUE );
    TexParameteri( VIRTUAL_PAGE_SIZE_INDEX, pageSizeIndex );
    Allocate the texture’s virtual memory using TexStorage3D
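The "iterate, select suitable pageSizeIndex" and "choose a reasonable size" steps can be sketched on the CPU side without any GL calls. The following assumes the page sizes have already been queried into a list (for example from the extension's per-format page-size queries). `PageSize`, `selectPageSizeIndex`, and `layersForBudget` are hypothetical names, and the selection rule (first size that evenly divides the base level, with depth 1) is a conservative assumption rather than the presenters' exact policy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One virtual page size as reported for an internal format (values assumed queried elsewhere).
struct PageSize { int x, y, z; };

// Returns the index of the first page size whose x/y evenly divide the base
// level and whose z is 1, or -1 when none fits (the "have a plan" case).
inline int selectPageSizeIndex(const std::vector<PageSize>& sizes, int width, int height) {
    for (std::size_t i = 0; i < sizes.size(); ++i) {
        const PageSize& s = sizes[i];
        if (s.x <= 0 || s.y <= 0) continue;  // ignore malformed entries
        if (s.z == 1 && width % s.x == 0 && height % s.y == 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}

// Picks a layer count so the array's virtual size stays near `budgetBytes`,
// clamped to [1, maxLayers].
inline int layersForBudget(std::uint64_t budgetBytes, std::uint64_t bytesPerLayer, int maxLayers) {
    assert(bytesPerLayer > 0 && maxLayers > 0);
    std::uint64_t n = budgetBytes / bytesPerLayer;
    if (n < 1) n = 1;
    if (n > static_cast<std::uint64_t>(maxLayers)) n = static_cast<std::uint64_t>(maxLayers);
    return static_cast<int>(n);
}
```

A return of -1 from `selectPageSizeIndex` covers both the empty list (numIndexes of 0) and a shape that no page size fits, so the caller can fall back to a non-sparse array.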
  • 35.
    Specifying Texture Data
    Using the located/created texture array from the previous step:
    Allocate a layer as the location of our data.
    For each mipmap level of the allocated layer:
        Commit the entire mipmap level (using TexPageCommitment).
        Specify actual texel data as usual for arrays.
            gl(Compressed|Copy|)TexSubImage3D
            PBO updates are fine too.
    [Diagram: array layers (slices) marked uncommitted, free, or allocated, with one layer highlighted as the allocated layer.]
  • 36.
    Freeing Textures
    To free the texture, reverse the process:
    Use TexPageCommitment to mark the entire layer (slice) as free.
        Do once for each mipmap level.
    Add the layer to the free list for future allocation.
    [Diagram: array layers (slices) marked uncommitted or free, with one layer highlighted as the freed layer.]
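The allocate/free bookkeeping for layers can be sketched independently of GL. This is a minimal free-list sketch, assuming one allocator per texture array. `LayerAllocator` is a hypothetical name, and the comments mark where the commit and decommit calls (TexPageCommitment per mipmap level) would go in a real implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-container bookkeeping: tracks which layers of one texture array are free.
class LayerAllocator {
public:
    explicit LayerAllocator(int layerCount) {
        assert(layerCount > 0);
        // Push in reverse so the lowest layer index is handed out first.
        for (int i = layerCount - 1; i >= 0; --i) {
            free_.push_back(i);
        }
    }

    // Returns a free layer index, or -1 when the container is full
    // (the caller would then create another texture array).
    int allocate() {
        if (free_.empty()) return -1;
        const int layer = free_.back();
        free_.pop_back();
        // Real code: commit every mipmap level of `layer` here, then upload texels.
        return layer;
    }

    // Returns a layer to the free list for future allocation.
    void release(int layer) {
        assert(layer >= 0);
        // Real code: decommit every mipmap level of `layer` here.
        free_.push_back(layer);
    }

    std::size_t freeCount() const { return free_.size(); }

private:
    std::vector<int> free_;
};
```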
  • 37.
    Combining with Bindless to eliminate binds
    At container create time:
        Specify sampling parameters via SamplerParameter calls first.
        Call GetTextureSamplerHandleARB to return a GPU-visible pointer to the texture+sampler container.
        Call MakeTextureHandleResident to ensure the resource lives on the GPU.
    At delete time, call MakeTextureHandleNonResident.
    With bindless, you explicitly manage the GPU’s working set.
  • 38.
    Using texture data in shaders
    When a texture is needed with the default sampling parameters:
        Create a GLSL-visible TextureRef object:
            struct TextureRef {
                sampler2DArray container;
                float slice;
            };
    When a texture is needed with custom sampling parameters:
        Create a separate sampler object for the shader with the parameters.
        Create a bindless handle to the pair using GetTextureSamplerHandle, then call MakeTextureHandleResident with the new value.
        And fill out a TextureRef as above for usage by GLSL.
  • 39.
    C++ Code
    Basic implementation (some details missing)
    BSD licensed (use as you will)
    https://github.com/nvMcJohn/apitest/blob/pdoane_newtests/sparse_bindless_texarray.h
    https://github.com/nvMcJohn/apitest/blob/pdoane_newtests/sparse_bindless_texarray.cpp
  • 40.
    Increasing Draw Call Count
    Let’s draw all the calls!
  • 41.
    All the Draw Calls!
    Problem
    You want more draw calls of smaller objects.
    D3D is slow at this.
    Naïve GL is faster than D3D, but not fast enough.
  • 42.
    XY Problem
    Y: How can I have more draw calls?
    X: You don’t really care if it’s more draw calls, right?
    Really what you want is to be able to draw more small geometry groupings. More objects.
  • 43.
    Well why didn’t you just say so??
    First, some background.
        What makes draw calls slow?
        Real world API usage
        Draw Call Cost Visualization
  • 44.
    Some background
    What causes slow draw calls?
    Validation is the biggest bucket (by far).
    Pre-validation is “difficult”.
    “Every application does the same things.”
        Not really. Most applications are in completely disjoint states.
        Try this experiment: What is important to you?
        Now ask your neighbor what’s important to him.
  • 45.
    Why is prevalidation difficult?
    The GPU is an exceedingly complex state machine.
        (Honestly, it’s probably the most complex state machine in all of CS)
    Any one of those states may have a problem that requires WAR.
        Usually the only problem is overall performance.
        But sometimes not.
    There are millions of tests covering NVIDIA GPU functionality.
  • 46.
    FINE.
    How can app devs mitigate these costs?
    Minimize state changes.
        All state changes are not created equal!
    Cost of a draw call:
        Small fixed cost + Cost of validation of changed state
  • 47.
    Feels limiting…
    Artists want lots of materials, and small amounts of geometry.
    Even better: What if artists just didn’t have to care about this?
    Ideal Programmer->Artist Interaction:
        “You make pretty art. I’ll make it fit.”
  • 48.
    Relative costs of State Changes
    In decreasing cost…
        Render Target
        Program
        ROP
        Texture Bindings
        Vertex Format
        UBO Bindings
        Vertex Bindings
        Uniform Updates
    Rate markers on the chart: ~60K / s, ~300K / s, ~1.5M / s, ~10M / s
    Note: Not to scale
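One common way to act on this ordering is to sort draws so that the most expensive state changes least often. The sketch below is an illustration of that idea, not something the slides prescribe. `DrawItem` and its integer state IDs are hypothetical, and only three of the listed states are modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <vector>

// Hypothetical draw record: small integer IDs identify each piece of state.
struct DrawItem {
    int renderTarget;  // most expensive to change
    int program;       // next most expensive
    int texture;       // cheaper still
    int mesh;          // payload, not part of the sort key
};

// Orders draws so the costliest state is the most significant sort key,
// which groups draws sharing that state next to each other.
inline void sortByStateCost(std::vector<DrawItem>& items) {
    std::stable_sort(items.begin(), items.end(), [](const DrawItem& a, const DrawItem& b) {
        return std::tie(a.renderTarget, a.program, a.texture) <
               std::tie(b.renderTarget, b.program, b.texture);
    });
}

// Counts how many adjacent draws require a program switch.
inline std::size_t countProgramChanges(const std::vector<DrawItem>& items) {
    std::size_t changes = 0;
    for (std::size_t i = 1; i < items.size(); ++i) {
        if (items[i].program != items[i - 1].program) ++changes;
    }
    return changes;
}
```

Sorting only reduces how often validation is triggered; it does not remove the per-draw fixed cost.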
  • 49.
    Real World API frequency
    API usage looks roughly like this…
    Increasing Frequency of Change:
        Render Target (scene)
        Per Scene Uniform Buffer + Textures
        IB / VB and Input Layout
        Shader (Material)
        Per-material Uniform Buffer + Textures
        Per-object Uniform Buffer + Textures
        Per-piece Uniform Buffer + Textures
        Draw
  • 50.
    Draw Calls visualized
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 51.
    Draw Calls visualized (cont’d)
    Read down, then right.
    Black: no change
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 52.
    Goals
    Let’s minimize validation costs without affecting artists.
    Things we need to be fast (per app call frequency):
        Uniform Updates and binding
        Texture Updates and binding
    These happen most often in app, ergo driving them to ~0 should be a win.
  • 53.
    Textures
    Using Sparse Bindless Texture Arrays (as previously described) solves this.
        All textures are set before any drawing begins.
        (No need to change textures between draw calls)
    Note that from the CPU’s perspective, just using bindless is sufficient.
    That was easy.
  • 54.
    Eliminating Texture Binds, visualized
    Increasing Frequency of Change:
        Render Target (scene)
        Per Scene Uniform Buffer + Textures
        IB / VB and Input Layout
        Shader (Material)
        Per-material Uniform Buffer + Textures
        Per-object Uniform Buffer + Textures
        Per-piece Uniform Buffer + Textures
        Draw
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 55.
    Boom!
    Increasing Frequency of Change:
        Render Target (scene)
        Per Scene Uniform Buffer
        IB / VB and Input Layout
        Shader (Material)
        Per-material Uniform Buffer
        Per-object Uniform Buffer
        Per-piece Uniform Buffer
        Draw
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 56.
    Buffer updates (old and busted)
    Typical Scene Graph Traversal
        for obj in visibleObjectSet {
            update(buffer, obj);
            draw(obj);
        }
  • 57.
    Buffer updates (new hotness)
    Typical Scene Graph Traversal
        for obj in visibleObjectSet {
            update(bufferFragment, obj);
        }
        for obj in visibleObjectSet {
            draw(obj);
        }
  • 58.
    bufferFragma-wha?
    Rather than one buffer per object, we share UBOs for many objects.
    i.e., given struct ObjectUniforms { /* … */ };
        // Old (probably not explicitly instantiated,
        // just scattered in GLSL)
        ObjectUniforms uniformData;
        // New
        ObjectUniforms uniformData[ObjectsPerKickoff];
    Use persistent mapping for even more win here!
    For large amounts of data (bones) consider SSBO.
    Introducing ARB_shader_storage_buffer_object
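The two-pass traversal with a shared uniform array might be sketched as follows on the CPU side. Everything here is illustrative: `SceneObject`, the single-matrix `ObjectUniforms` layout, `updateAll`, and `drawAll` are hypothetical, and a `std::vector` stands in for the shared (ideally persistently mapped) UBO or SSBO.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct SceneObject { float x, y, z; };       // hypothetical: position only
struct ObjectUniforms { float model[16]; };  // hypothetical: one column-major matrix

// Builds a translation matrix for one object.
inline ObjectUniforms makeUniforms(const SceneObject& obj) {
    ObjectUniforms u = {};
    u.model[0] = u.model[5] = u.model[10] = u.model[15] = 1.0f;
    u.model[12] = obj.x;
    u.model[13] = obj.y;
    u.model[14] = obj.z;
    return u;
}

// Pass 1: write every visible object's uniforms into its own slot of the shared array.
inline void updateAll(const std::vector<SceneObject>& visible,
                      std::vector<ObjectUniforms>& shared) {
    assert(shared.size() >= visible.size());
    for (std::size_t i = 0; i < visible.size(); ++i) {
        shared[i] = makeUniforms(visible[i]);
    }
}

// Pass 2: issue the draws; each draw only needs to know its slot index.
template <typename DrawFn>
inline void drawAll(std::size_t objectCount, DrawFn draw) {
    for (std::size_t i = 0; i < objectCount; ++i) {
        draw(i);
    }
}
```

Separating the passes means no buffer update sits between consecutive draws, which is what later makes it possible to batch the draws themselves.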
  • 59.
    SSBO?
    Like “large” uniform buffer objects.
        Minimum required size to claim support is 16M.
    Accessed like uniforms in shader.
    Support for better packing (std430).
    Caveat: They are typically implemented in hardware as textures (and can introduce dependent texture reads).
        Just one of a laundry list of things to consider, not to discourage use.
    http://www.opengl.org/registry/specs/ARB/shader_storage_buffer_object.txt
  • 60.
    Eliminating Buffer Update Overhead
    Increasing Frequency of Change:
        Render Target (scene)
        Per Scene Uniform Buffer
        IB / VB and Input Layout
        Shader (Material)
        Per-material Uniform Buffer
        Per-object Uniform Buffer
        Per-piece Uniform Buffer
        Draw
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 61.
    Sweet!
    Increasing Frequency of Change:
        Render Target (scene)
        IB / VB and Input Layout
        Shader (Material)
        Draw ( * each object )
    Hrrrrmmmmmm….
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 62.
    So now…
    It’d be awesome if we could do all of those kickoffs at once.
    Validation is already only paid once.
    But we could just pay the constant startup cost once.
    If only…….
  • 63.
    So now…
    It’d be awesome if we could do all of those kickoffs at once.
    Validation is already only paid once.
    But we could just pay the constant startup cost once.
    If only…….
    Introducing ARB_multi_draw_indirect
  • 64.
    ARB_multi_draw_indirect
    Allows you to specify parameters to draw commands from a buffer.
    This means you can generate those parameters wide (on the CPU).
    Or even on the GPU, via compute program.
    http://www.opengl.org/registry/specs/ARB/multi_draw_indirect.txt
  • 65.
    ARB_multi_draw_indirect cont’d
        void MultiDrawElementsIndirect(enum mode,
                                       enum type,
                                       const void* indirect,
                                       sizei primcount,
                                       sizei stride);
  • 66.
    ARB_multi_draw_indirect cont’d
        const ubyte* ptr = (const ubyte *)indirect;
        for (i = 0; i < primcount; i++) {
            DrawArraysIndirect(mode, (DrawArraysIndirectCommand*)ptr);
            if (stride == 0) {
                ptr += sizeof(DrawArraysIndirectCommand);
            } else {
                ptr += stride;
            }
        }
  • 67.
    DrawArraysIndirectCommand
        typedef struct {
            uint count;
            uint primCount;
            uint first;
            uint baseInstance;
        } DrawArraysIndirectCommand;
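To make the stride rule concrete, here is a CPU-only sketch that builds a small command array and walks it the same way the pseudocode loop does. No GL calls are made: `walkCommands` is a hypothetical helper that collects the commands a multi-draw would visit, and the command values in the usage note are invented for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Same field order as the command structure above, using fixed-width types.
struct DrawArraysIndirectCommand {
    std::uint32_t count;
    std::uint32_t primCount;
    std::uint32_t first;
    std::uint32_t baseInstance;
};

// Visits `drawCount` commands starting at `indirect`. A stride of 0 means the
// commands are tightly packed; otherwise `stride` is the byte distance between them.
inline std::vector<DrawArraysIndirectCommand> walkCommands(const unsigned char* indirect,
                                                           std::size_t drawCount,
                                                           std::size_t stride) {
    assert(indirect != nullptr || drawCount == 0);
    std::vector<DrawArraysIndirectCommand> visited;
    const unsigned char* ptr = indirect;
    for (std::size_t i = 0; i < drawCount; ++i) {
        DrawArraysIndirectCommand cmd;
        std::memcpy(&cmd, ptr, sizeof(cmd));  // a real driver would issue the draw here
        visited.push_back(cmd);
        ptr += (stride == 0) ? sizeof(DrawArraysIndirectCommand) : stride;
    }
    return visited;
}
```

For example, with three packed commands whose `first` values are 0, 6, and 12, a stride of 0 visits all three, while a stride of twice the struct size with a draw count of 2 visits the first and third. A non-zero stride is what lets per-draw data be interleaved with the commands in the same buffer.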
  • 68.
    Knowing which shader data is mine
    Use ARB_shader_draw_parameters, a necessary companion to ARB_multi_draw_indirect.
    Adds a builtin to the VS: DrawID (InstanceID already available).
        This tells you which command of a MultiDraw command is being executed.
        When not using MultiDraw, the builtin is specified to be 0.
    Caveat: Right now, you have to pass this down to other shader stages as an interpolant.
        Hoping to have that rectified via ARB or EXT extension “real soon now.”
    http://www.opengl.org/registry/specs/ARB/shader_draw_parameters.txt
  • 69.
    Applying everything
    CPU Perf is massively better.
        5-30x increase in number of distinct objects / s
    Interaction with driver is decreased ~75%.
    Note: GPU perf can be affected negatively (although not too badly).
    As always: Profile, profile, profile.
  • 70.
    Previous Results
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 71.
    Visualized Results
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format, MultiDraw]
  • 72.
    Where we came from
    [Chart legend: Render Target, Texture, Uniform Updates, Program, UBO Binding, Draw, ROP, Vertex Format]
  • 73.
  • 74.
    Questions?
    jmcdonald at nvidia dot com
    cass at nvidia dot com

Editor's Notes

  • #16 mParticleDst should be treated as a ring buffer.
  • #19 Lamest point sprites imaginable. Triangles, but specified as 2-D coordinates. Spaced for obviousness.
  • #20 Could obviously improve efficiency by only calling regular Map once per frame. This performance is worse than BufferSubData, but better than Map(UNSYNCHRONIZED) otherwise.
  • #21 Could obviously improve efficiency by only calling regular Map once per frame. This performance is worse than BufferSubData, but better than Map(UNSYNCHRONIZED) otherwise.
  • #38 With bindless, sampler* can be passed around like any other 64-bit integer. Note that if MakeTextureHandleResident succeeds, the texture is definitely on the GPU. But MakeTextureHandleNonResident only tells the driver that it may choose to free the GPU memory consumed by the texture.
  • #39 M
  • #46 Speed and power come from fixed-functionality.But that means that the GPU is rigid in behaviors.
