Vertex Shader Tricks by Bill Bilodeau - AMD at GDC14

The document discusses advanced techniques and optimizations for vertex shaders in DirectX 11, highlighting common bottlenecks and providing various performance-enhancing strategies. It covers the use of features such as advanced vertex outputs, instancing methods, and UAVs (Unordered Access Views) in shaders. Additionally, it presents examples and code snippets demonstrating how to implement full-screen triangles, point sprites, and a particle system utilizing these techniques.

Vertex Shader Tricks
New Ways to Use the Vertex Shader to Improve Performance
Bill Bilodeau
Developer Technology Engineer, AMD
Topics Covered
  ● Overview of the DX11 front-end pipeline
  ● Common bottlenecks
  ● Advanced Vertex Shader Features
  ● Vertex Shader Techniques
  ● Samples and Results
Graphics Hardware: DX11 Front-End Pipeline
  ● VS – vertex data
  ● HS – control points
  ● Tessellator
  ● DS – generated vertices
  ● GS – primitives
  ● Write to UAV at all stages
    ● Starting with DX11.1
  [Figure: GPU compute unit, repeated several times: vector GPRs (256 2048-bit registers), vector ALU (one 64-way single-precision operation every 4 clocks), scalar ALU (one operation every 4 clocks), scalar GPRs (256 64-bit registers), vector/scalar cross-communication bus]
  [Figure: pipeline stages Input Assembler, Vertex Shader, Hull Shader, Tessellator, Domain Shader, Geometry Shader, Stream Out, with access to CB, SRV, or UAV]
Bottlenecks - VS
  ● VS Attributes
    ● Limit outputs to 4 attributes (AMD)
      ● This applies to all shader stages (except PS)
  ● VS Texture Fetches
    ● Too many texture fetches can add latency
      ● Especially dependent texture fetches
      ● Group fetches together for better performance
      ● Hide latency with ALU instructions
Bottlenecks - VS
  ● Use the caches wisely
    ● Avoid large vertex formats that waste pre-VS cache space
    ● DrawIndexed() allows for reuse of processed vertices saved in the post-VS cache
      ● Vertices with the same index only need to get processed once
  [Figure: Input Assembler, Pre-VS Cache (hides latency), Vertex Shader, Post-VS Cache (vertex reuse)]
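To make the vertex-reuse point concrete, here is a rough CPU-side sketch in C++ that counts vertex shader invocations for a list of quads. It assumes an idealized post-VS cache in which every repeated index is a hit; real caches are small and hardware-specific, so the indexed figure is a best case rather than a prediction. The function names and the 0, 1, 2, 2, 1, 3 index pattern are illustrative choices, not taken from the talk.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <initializer_list>
#include <unordered_set>
#include <vector>

// Build a triangle-list index buffer for quadCount quads:
// 4 unique vertices and 2 triangles (6 indices) per quad.
std::vector<std::uint32_t> makeQuadIndices(std::uint32_t quadCount) {
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(quadCount) * 6);
    for (std::uint32_t q = 0; q < quadCount; ++q) {
        const std::uint32_t base = q * 4;
        for (std::uint32_t corner : {0u, 1u, 2u, 2u, 1u, 3u}) {
            indices.push_back(base + corner);
        }
    }
    return indices;
}

// Assumption: an ideal post-VS cache, so each distinct index is shaded once.
std::size_t shadedWithIdealCache(const std::vector<std::uint32_t>& indices) {
    return std::unordered_set<std::uint32_t>(indices.begin(), indices.end()).size();
}

// Non-indexed drawing has no index to key the cache on: every vertex is shaded.
std::size_t shadedWithoutReuse(const std::vector<std::uint32_t>& indices) {
    return indices.size();
}
```

Under these assumptions an indexed quad list shades 4 vertices per quad, while the same geometry drawn without indices shades 6.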
Bottlenecks - GS
  ● GS
    ● Can add or remove primitives
    ● Adding new primitives requires storing new vertices
      ● Going off chip to store data can be a bandwidth issue
    ● Using the GS means another shader stage
      ● This means more competition for shader resources
      ● Better if you can do everything in the VS
Advanced Vertex Shader Features
  ● SV_VertexID, SV_InstanceID
  ● UAV output (DX11.1)
  ● NULL vertex buffer
  ● VS can create its own vertex data
SV_VertexID
  ● Can use the vertex ID to decide what vertex data to fetch
  ● Fetch from SRV, or procedurally create a vertex

    VSOut VertexShader(uint id : SV_VertexID)
    {
        float3 vertex = g_VertexBuffer[id];
        …
    }
UAV buffers
  ● Write to UAVs from a Vertex Shader
  ● New feature in DX11.1 (UAV at any stage)
  ● Can be used instead of stream-out for writing vertex data
  ● Triangle output not limited to strips
    ● You can use whatever format you want
  ● Can output anything useful to a UAV
NULL Vertex Buffer
  ● DX11/DX10 allows this
  ● Just set the number of vertices in Draw()
  ● VS will execute without a vertex buffer bound
  ● Can be used for instancing
    ● Call Draw() with the total number of vertices
    ● Bind mesh and instance data as SRVs
Vertex Shader Techniques
  ● Full Screen Triangle
  ● Vertex Shader Instancing
  ● Merged Instancing
  ● Vertex Shader UAVs
Full Screen Triangle
  ● For post-processing effects
  ● Triangle has better performance than quad
  ● Fast and easy with VS generated coordinates
  ● No IB or VB is necessary
  ● Something you should be using for full screen effects
  [Figure: triangle with clip space coordinates (-1, -1, 0), (-1, 3, 0), (3, -1, 0)]
Full Screen Triangle: C++ code

    // Null VB, IB
    pd3dImmediateContext->IASetVertexBuffers( 0, 0, NULL, NULL, NULL );
    pd3dImmediateContext->IASetIndexBuffer( NULL, (DXGI_FORMAT)0, 0 );
    pd3dImmediateContext->IASetInputLayout( NULL );

    // Set Shaders
    pd3dImmediateContext->VSSetShader( g_pFullScreenVS, NULL, 0 );
    pd3dImmediateContext->PSSetShader( … );
    pd3dImmediateContext->PSSetShaderResources( … );
    pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );

    // Render 3 vertices for the triangle
    pd3dImmediateContext->Draw(3, 0);
Full Screen Triangle: HLSL Code

    VSOutput VSFullScreenTest(uint id : SV_VERTEXID)
    {
        VSOutput output;

        // generate clip space position
        output.pos.x = (float)(id / 2) * 4.0 - 1.0;
        output.pos.y = (float)(id % 2) * 4.0 - 1.0;
        output.pos.z = 0.0;
        output.pos.w = 1.0;

        // texture coordinates
        output.tex.x = (float)(id / 2) * 2.0;
        output.tex.y = 1.0 - (float)(id % 2) * 2.0;

        // color
        output.color = float4(1, 1, 1, 1);

        return output;
    }

  [Figure: triangle with clip space coordinates (-1, -1, 0), (-1, 3, 0), (3, -1, 0)]
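The arithmetic in the shader above can be checked on the CPU. The following C++ sketch mirrors the HLSL expressions one for one; the struct and function names are made up for illustration and nothing here touches Direct3D.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical CPU-side mirror of the VSOutput fields that depend on the vertex ID.
struct FullScreenVertex {
    float x, y, z, w;  // clip space position
    float u, v;        // texture coordinates
};

// Same integer divide/modulo trick as the HLSL: id 0, 1, 2 map to the
// three corners of a triangle that covers the whole viewport.
FullScreenVertex fullScreenVertex(std::uint32_t id) {
    assert(id < 3);
    FullScreenVertex out{};
    out.x = static_cast<float>(id / 2) * 4.0f - 1.0f;
    out.y = static_cast<float>(id % 2) * 4.0f - 1.0f;
    out.z = 0.0f;
    out.w = 1.0f;
    out.u = static_cast<float>(id / 2) * 2.0f;
    out.v = 1.0f - static_cast<float>(id % 2) * 2.0f;
    return out;
}
```

Vertex IDs 0, 1, 2 yield (-1, -1), (-1, 3), (3, -1), matching the slide. The midpoint of the long edge is clip space (1, 1) with texture coordinate (1, 0), so the visible [-1, 1] square receives texture coordinates in [0, 1] and the parts of the triangle outside it are clipped.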
VS Instancing: Point Sprites
  ● Often done on GS, but can be faster on VS
  ● Create an SRV point buffer and bind to VS
  ● Call Draw or DrawIndexed to render the full triangle list.
  ● Read the location from the point buffer and expand to vertex location in quad
  ● Can be used for particles or Bokeh DOF sprites
  ● Don’t use DrawInstanced for a small mesh
Point Sprites: C++ Code

    pd3dImmediateContext->IASetIndexBuffer( g_pParticleIndexBuffer, DXGI_FORMAT_R32_UINT, 0 );
    pd3dImmediateContext->IASetPrimitiveTopology( D3D11_PRIMITIVE_TOPOLOGY_TRIANGLELIST );
    pd3dImmediateContext->DrawIndexed( g_particleCount * 6, 0, 0 );
Point Sprites: HLSL Code

    VSInstancedParticleDrawOut VSIndexBuffer(uint id : SV_VERTEXID)
    {
        VSInstancedParticleDrawOut output;

        uint particleIndex = id / 4;
        uint vertexInQuad = id % 4;

        // calculate the position of the vertex
        float3 position;
        position.x = (vertexInQuad % 2) ? 1.0 : -1.0;
        position.y = (vertexInQuad & 2) ? -1.0 : 1.0;
        position.z = 0.0;
        position.xy *= PARTICLE_RADIUS;
        position = mul( position, (float3x3)g_mInvView ) + g_bufPosColor[particleIndex].pos.xyz;

        output.pos = mul( float4(position, 1.0), g_mWorldViewProj );
        output.color = g_bufPosColor[particleIndex].color;

        // texture coordinate
        output.tex.x = (vertexInQuad % 2) ? 1.0 : 0.0;
        output.tex.y = (vertexInQuad & 2) ? 1.0 : 0.0;

        return output;
    }
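The shader above expects an index buffer in which each particle owns four consecutive vertex IDs, drawn as six indices. The slides do not show how that buffer is filled, so the C++ sketch below is one plausible way to do it: the 0, 1, 2, 2, 1, 3 corner order is an assumption (the right winding depends on the cull mode in use), and both function names are hypothetical. The decode function repeats the shader's corner selection so the mapping can be checked on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Fill a 32-bit index buffer for DrawIndexed(particleCount * 6, 0, 0).
// Assumed corner order per quad: two triangles sharing the 1-2 diagonal.
std::vector<std::uint32_t> buildSpriteIndexBuffer(std::uint32_t particleCount) {
    static const std::uint32_t kCorners[6] = {0, 1, 2, 2, 1, 3};
    std::vector<std::uint32_t> indices;
    indices.reserve(static_cast<std::size_t>(particleCount) * 6);
    for (std::uint32_t p = 0; p < particleCount; ++p) {
        for (std::uint32_t c = 0; c < 6; ++c) {
            indices.push_back(p * 4 + kCorners[c]);
        }
    }
    return indices;
}

// CPU mirror of the vertex ID decode in the HLSL above (before scaling by
// the particle radius and rotating into view space).
struct SpriteCorner {
    std::uint32_t particleIndex;
    float x, y;  // unit corner offset
    float u, v;  // texture coordinate
};

SpriteCorner decodeSpriteVertex(std::uint32_t id) {
    const std::uint32_t vertexInQuad = id % 4;
    SpriteCorner corner{};
    corner.particleIndex = id / 4;
    corner.x = (vertexInQuad % 2) ? 1.0f : -1.0f;
    corner.y = (vertexInQuad & 2) ? -1.0f : 1.0f;
    corner.u = (vertexInQuad % 2) ? 1.0f : 0.0f;
    corner.v = (vertexInQuad & 2) ? 1.0f : 0.0f;
    return corner;
}
```

Because the index values themselves arrive in the shader as SV_VertexID, each quad shades at most four unique vertices when the post-VS cache hits, which fits the indexed path being the fastest in the measurements that follow.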
Point Sprite Performance
  Times in ms

  Method          Sprites   AMD Radeon R9 290x   Nvidia Titan
  Indexed         500K      0.52                 0.52
  Non-Indexed     500K      0.77                 0.87
  GS              500K      1.38                 0.83
  DrawInstanced   500K      1.77                 5.1
  Indexed         1M        1.02                 1.5
  Non-Indexed     1M        1.53                 1.92
  GS              1M        2.7                  1.6
  DrawInstanced   1M        3.54                 10.3
Point Sprite Performance
  ● DrawIndexed() is the fastest method
  ● Draw() is slower but doesn’t need an IB
  ● Don’t use DrawInstanced() for creating sprites on either AMD or Nvidia hardware
    ● Not recommended for a small number of vertices
Merge Instancing
  ● Combine multiple meshes that can be instanced many times
    ● Better than normal instancing which renders only one mesh
    ● Instance nearby meshes for smaller bounding box
  ● Each mesh is a page in the vertex data
    ● Fixed vertex count for each mesh
      ● Meshes smaller than page size use degenerate triangles
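Merge Instancing
  [Figure: Mesh Vertex Data is a sequence of fixed-length pages (Mesh Data 0, Mesh Data 1, Mesh Data 2, ...). Each page holds Vertex 0, Vertex 1, Vertex 2, Vertex 3, ... and is padded with zeroed entries (0, 0, 0) that form a degenerate triangle. Mesh Instance Data maps instances to pages: Instance 0 uses Mesh Index 2, Instance 1 uses Mesh Index 0, ...]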
Merged Instancing using VS
  ● Use the vertex ID to look up the mesh to instance
  ● All meshes are the same size, so (id / SIZE) can be used as an offset to the mesh
  ● Faster than using DrawInstanced()
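The (id / SIZE) lookup can be sketched on the CPU as follows. This C++ model assumes a page size of 6 vertices and an instance table that stores one mesh index per instance, as in the diagram; the names `kPageSize`, `MergedFetch`, and `resolveMergedVertex` are hypothetical, and a real shader would read both tables from SRVs.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical fixed page size: every mesh occupies exactly this many vertices,
// with smaller meshes padded by degenerate triangles.
constexpr std::uint32_t kPageSize = 6;

struct MergedFetch {
    std::uint32_t instanceIndex;  // which instance this vertex belongs to
    std::uint32_t meshIndex;      // which page of mesh vertex data to read
    std::uint32_t vertexOffset;   // element offset into the merged vertex data
};

// Resolve a flat vertex ID from a single Draw(instanceCount * kPageSize, 0).
MergedFetch resolveMergedVertex(std::uint32_t id,
                                const std::vector<std::uint32_t>& instanceMeshIndex) {
    const std::uint32_t instanceIndex = id / kPageSize;
    assert(instanceIndex < instanceMeshIndex.size());
    const std::uint32_t vertexInPage = id % kPageSize;
    const std::uint32_t meshIndex = instanceMeshIndex[instanceIndex];
    return {instanceIndex, meshIndex, meshIndex * kPageSize + vertexInPage};
}
```

With the instance table from the diagram (instance 0 uses mesh 2, instance 1 uses mesh 0), vertex ID 7 resolves to instance 1, mesh 0, element 1 of the merged vertex data.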
Merge Instancing Performance
  ● Instancing performance test by Cloud Imperium Games for Star Citizen
    ● Renders 13.5M triangles (~40M verts)
  ● DrawInstanced version calls DrawInstanced() and uses instance data in a vertex buffer
  ● Soft Instancing version uses vertex instancing with Draw() calls and fetches instance data from SRV
  [Figure: bar chart of time in ms (axis 0 to 30) comparing DrawInstanced and Soft Instancing on AMD Radeon R9 290X and Nvidia GTX 780; the bar values are not present in the transcript]
Vertex Shader UAVs
  ● Random access Read/Write in a VS
  ● Can be used to store transformed vertex data for use in multi-pass algorithms
  ● Can be used for passing constant attributes between any shader stage (not just from VS)
Skinning to UAV
  ● Skin vertex data then output to UAV
    ● Instance the skinned UAV data multiple times
    ● Can also be used for non-instanced data
  ● Multiple passes can reuse the transformed vertex data – Shadow map rendering
  ● Performance is about the same as stream-out, but you can do more …
Bounding Box to UAV
  ● Can calculate and store Bbox in the VS
    ● Use a UAV to store the min/max values (6)
    ● InterlockedMin/InterlockedMax determine min and max of the bbox
      ● Need to use integer values with atomics
  ● Use the stored bbox in later passes
    ● GPU physics (collision)
    ● Tile based processing
Bounding Box: HLSL Code

    void UAVBBoxSkinVS(VSSkinnedIn input, uint id : SV_VERTEXID)
    {
        // skin the vertex
        . . .

        // output the max and min for the bounding box
        int x = (int) (vSkinned.Pos.x * FLOAT_SCALE);  // convert to integer
        int y = (int) (vSkinned.Pos.y * FLOAT_SCALE);
        int z = (int) (vSkinned.Pos.z * FLOAT_SCALE);

        InterlockedMin(g_BBoxUAV[0], x);
        InterlockedMin(g_BBoxUAV[1], y);
        InterlockedMin(g_BBoxUAV[2], z);
        InterlockedMax(g_BBoxUAV[3], x);
        InterlockedMax(g_BBoxUAV[4], y);
        InterlockedMax(g_BBoxUAV[5], z);
        . . .
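The same fixed-point min/max accumulation can be modeled on the CPU with `std::atomic<int>`. In this C++ sketch the scale of 1000 standing in for FLOAT_SCALE is an assumed value, the struct and function names are hypothetical, and the compare-exchange loops stand in for InterlockedMin/InterlockedMax. One detail the slide leaves implicit is made explicit here: the six slots must start at INT_MAX (mins) and INT_MIN (maxes) before accumulation begins.

```cpp
#include <atomic>
#include <cassert>
#include <climits>

// Assumed fixed-point scale; the slides use FLOAT_SCALE without giving a value.
constexpr float kFloatScale = 1000.0f;

struct AtomicBBox {
    std::atomic<int> minV[3];
    std::atomic<int> maxV[3];
    AtomicBBox() {
        for (int i = 0; i < 3; ++i) {
            minV[i].store(INT_MAX);  // any real coordinate is smaller
            maxV[i].store(INT_MIN);  // any real coordinate is larger
        }
    }
};

// Stand-in for InterlockedMin: retry until the stored value is <= value.
void atomicMin(std::atomic<int>& target, int value) {
    int current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {
    }
}

// Stand-in for InterlockedMax: retry until the stored value is >= value.
void atomicMax(std::atomic<int>& target, int value) {
    int current = target.load();
    while (value > current && !target.compare_exchange_weak(current, value)) {
    }
}

// Called once per skinned vertex; safe to call from several threads at once.
void accumulate(AtomicBBox& box, float x, float y, float z) {
    const int q[3] = {static_cast<int>(x * kFloatScale),
                      static_cast<int>(y * kFloatScale),
                      static_cast<int>(z * kFloatScale)};
    for (int i = 0; i < 3; ++i) {
        atomicMin(box.minV[i], q[i]);
        atomicMax(box.maxV[i], q[i]);
    }
}
```

A later pass would divide the stored integers by the same scale to recover float extents; the scale trades precision against the coordinate range that fits in a 32-bit integer.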
Particle System UAV
  ● Single pass GPU-only particle system
  ● In the VS:
    ● Generate sprites for rendering
    ● Do Euler integration and update the particle system state to a UAV
Particle System: HLSL Code

    uint particleIndex = id / 4;
    uint vertexInQuad = id % 4;

    // calculate the new position of the vertex
    float3 oldPosition = g_bufPosColor[particleIndex].pos.xyz;
    float3 oldVelocity = g_bufPosColor[particleIndex].velocity.xyz;

    // Euler integration to find new position and velocity
    float3 acceleration = normalize(oldVelocity) * ACCELLERATION;
    float3 newVelocity = acceleration * g_deltaT + oldVelocity;
    float3 newPosition = newVelocity * g_deltaT + oldPosition;

    g_particleUAV[particleIndex].pos = float4(newPosition, 1.0);
    g_particleUAV[particleIndex].velocity = float4(newVelocity, 0.0);

    // Generate sprite vertices
    . . .
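The integration step can be tried out on the CPU with the C++ sketch below. It follows the slide's update order, in which the position is advanced with the already-updated velocity (the semi-implicit form of Euler integration). The acceleration constant of 2.0 and all type and function names are assumptions, and the guard for zero velocity is an addition of this sketch: normalizing a zero vector is undefined, which the slide code does not address.

```cpp
#include <cassert>
#include <cmath>

struct Float3 {
    float x, y, z;
};

struct Particle {
    Float3 pos;
    Float3 velocity;
};

// Assumed magnitude; the slides use an ACCELLERATION constant without a value.
constexpr float kAcceleration = 2.0f;

// One time step: accelerate along the current direction of travel,
// then move using the new velocity.
Particle integrate(const Particle& p, float dt) {
    const Float3 v = p.velocity;
    const float len = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z);

    // Guard added in this sketch: a particle at rest gets no acceleration.
    Float3 dir{0.0f, 0.0f, 0.0f};
    if (len > 0.0f) {
        dir = {v.x / len, v.y / len, v.z / len};
    }

    Particle out{};
    out.velocity = {v.x + dir.x * kAcceleration * dt,
                    v.y + dir.y * kAcceleration * dt,
                    v.z + dir.z * kAcceleration * dt};
    out.pos = {p.pos.x + out.velocity.x * dt,
               p.pos.y + out.velocity.y * dt,
               p.pos.z + out.velocity.z * dt};
    return out;
}
```

In the shader, four vertex IDs map to each particle, so all four invocations compute and write the same new state; the duplicate writes are redundant but produce identical values.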
Conclusion
  ● Vertex shader “tricks” can be more efficient than more commonly used methods
  ● Use SV_VertexID for smarter instancing
    ● Sprites
    ● Merge Instancing
  ● UAVs add lots of freedom to vertex shaders
    ● Bounding box calculation
    ● Single pass VS particle system
Demos
  ● Particle System
  ● UAV Skinning
  ● Bbox
Acknowledgements
  ● Merge Instancing
    ● Emil Persson, “Graphics Gems for Games”, SIGGRAPH 2011
    ● Brendan Jackson, Cloud Imperium
  ● Thanks to
    ● Nick Thibieroz, AMD
    ● Raul Aguaviva (particle system UAV), AMD
    ● Alex Kharlamov, AMD
Questions
  ● bill.bilodeau@amd.com


Editor's Notes

  • #9 The value of SV_VertexID depends on the draw call. For non-indexed Draw, the vertex ID starts with 0 and increments by 1 for every vertex processed by the shader. For DrawIndexed(), the vertexID is the value of the index in the index buffer for that vertex.
  • #17 For indexed Draw calls, create an index buffer which contains (index location + index number). That way you can calculate (vertexID/vertsPerMesh) to get the instance index, and (vertexID % vertsPerMesh) to get the index value which you can use to look up the vertex.
  • #27 If the mesh is being reused many times, then calculating the bounding box has little overhead. The bounding box can be used for collision detection.
  • #30 Could read and write from the UAV instead of binding an input SRV
