Uploaded by Mark Kilgard

CS 354 Performance Analysis

This document contains notes from a CS 354 Performance Analysis lecture on April 26, 2012. It discusses topics including an in-class quiz, the lecturer's office hours, an upcoming project deadline, and the day's lecture on graphics performance analysis using concepts like Amdahl's law, Gustafson's law, and modeling pipeline efficiency. It also provides examples and diagrams related to graphics hardware architecture.

CS 354: Performance Analysis
Mark Kilgard, University of Texas
April 26, 2012
Slide 2: Today's Material
- In-class quiz
  - On the acceleration structures lecture
- Lecture topic
  - Graphics performance analysis
Slide 3: My Office Hours
- Tuesday, before class: Painter (PAI) 5.35, 8:45 a.m. to 9:15 a.m.
- Thursday, after class: ACE 6.302, 11:00 a.m. to 12:00
- Randy's office hours: Monday & Wednesday, 11:00 a.m. to 12:00, Painter (PAI) 5.33
Slide 4: Last Time, This Time
- Last lecture, we discussed
  - Acceleration structures
- This lecture
  - Graphics performance analysis
- Projects
  - Project 4, on ray tracing, is on Piazza
  - Due May 2, 2012
  - Get started!
Slide 5: Daily Quiz (on a sheet of paper)
- Write your EID, name, and date
- Write #1, #2, #3 followed by its answer
1. Multiple choice: Which is NOT a bounding volume representation?
   a) sphere
   b) axis-aligned bounding box
   c) object-aligned bounding box
   d) bounding graph point
   e) convex polyhedron
2. True or false: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.
3. True or false: Placing objects within a uniform grid is easier than placing objects within a KD tree.
Slide 6: Graphics Performance Analysis
- Generating synthetic images by computer is computationally, and bandwidth, intensive
- Achieving interactive rates is key
  - 60 frames/second ≈ real-time interactivity
- Worth optimizing
  - Entertainment and intuition are tied to interactivity
- How do we think about graphics performance analysis?
Slide 7: Framing Amdahl's Law
- Assume a workload with two parts
  - The first part is A%
  - The second part is B%
  - Such that A% + B% = 100%
- Suppose we have a technique to speed up the second part by N times
  - But have no speedup for the first part
  - What overall speedup can we expect?
Slide 8: Amdahl's Equation
- Assume A% + B% = 100%
- If the un-optimized effort is 100%, the optimized effort should be smaller:

    OptimizedEffort = A% + B%/N

- Speedup is the ratio of unoptimized effort to optimized effort (with A and B as fractions):

    Speedup = 100% / (A% + B%/N) = 1 / ((1 - B) + B/N)
Slide 9: Who Was Amdahl?
- Gene Amdahl
  - CPU architect for IBM in the 1960s
    - Helped design IBM's System/360 mainframe architecture
  - Left IBM to found Amdahl Corporation
    - Building IBM-compatible mainframes
- Why the law?
  - He was evaluating whether to invest in parallel processing or not
Slide 10: Parallelization
- Broadly speaking, computer tasks can be broken into two portions
  - Sequential sub-tasks
    - Naturally require steps to be done in a particular order
    - Examples: text layout, entropy decoding
  - Parallel sub-tasks
    - The problem splits into lots of independent chunks of work
    - Chunks of work can be done by separate processing units simultaneously: parallelization
    - Examples: tracing rays, shading pixels, transforming vertices
Slide 11: Serial Work Sandwiching Parallel Work
[Diagram: serial stages alternating with parallel stages]
Slide 12: Example of Amdahl's Law
- Say a task is 50% serial and 50% parallel
- Consider using 4 parallel processors on the parallel portion
  - Speedup: 1.6x
- Consider using 40 parallel processors on the parallel portion
  - Speedup: 1.951x
- Consider the limit:

    lim (n → ∞) 1 / (0.5 + 0.5/n) = 2
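The slide's numbers follow directly from Amdahl's equation; a minimal sketch (the helper name `amdahl_speedup` is illustrative, not from the lecture):

```python
def amdahl_speedup(parallel_fraction: float, n: float) -> float:
    """Amdahl's law: overall speedup when only the parallel fraction
    of the work is accelerated by a factor of n."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n)

# 50% serial / 50% parallel workload from the slide
print(round(amdahl_speedup(0.5, 4), 3))      # 1.6
print(round(amdahl_speedup(0.5, 40), 3))     # 1.951
print(round(amdahl_speedup(0.5, 1e9), 3))    # 2.0, the n -> infinity limit
```

Note how going from 4 to 40 processors (10x the hardware) buys only 1.6x to 1.951x, which is exactly the pessimism the later slides discuss.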
Slide 13: Graph of Amdahl's Law
[Figure: speedup curves]
Slide 14: Pessimism about Parallelism?
- Amdahl's Law can instill pessimism about parallel processing
- If the serial work percentage is high, adding parallel units has low benefit
  - Assumes a fixed "problem" size
  - So the workload stays the same size even as parallel execution resources are added
- So why do GPUs offer hundreds of cores then?
Slide 15: Gustafson's Law
- Observation by John Gustafson
  - With N parallel units, bigger problems can be attacked
- Great example
  - Increasing GPU rendering resolution
  - Was 640x480 pixels, now 1920x1200
  - More parallel units mean more pixels can be processed simultaneously
    - Supporting rendering resolutions previously unattainable
- Problem size improvement (A = serial fraction):

    problemScale = N - A(N - 1)
Slide 16: Example
- Say a task is 50% serial and 50% parallel
- Consider using 4 parallel processors on the parallel portion
  - Problem scales up: 2.5x
- Consider 100 parallel processors
  - Problem scales up: 50.5x
- Also consider the heterogeneous nature of graphics processing units
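The scaled-problem numbers above fall out of the problemScale formula from the previous slide; a minimal sketch (function name is illustrative):

```python
def gustafson_scale(serial_fraction: float, n: int) -> float:
    """Gustafson's law: how much larger a problem can grow with
    n parallel units, given serial fraction A of the work."""
    return n - serial_fraction * (n - 1)

# 50% serial / 50% parallel workload from the slide
print(gustafson_scale(0.5, 4))    # 2.5
print(gustafson_scale(0.5, 100))  # 50.5
```

Unlike Amdahl's fixed-size view, the attainable problem size keeps growing roughly linearly in N, which is why adding hundreds of GPU cores still pays off.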
Slide 17: Coherent Work vs. Incoherent Work
- Not all parallel work is created equal
- Coherent work: "adjacent" chunks of work performing similar operations and memory accesses
  - Examples: camera rays, pixel shading
  - Allows sharing control of instruction execution
  - Good for caches
- Incoherent work: "adjacent" chunks of work performing dissimilar operations and memory accesses
  - Examples: reflection, shadow, and refraction rays
  - Bad for caches
Slide 18: Coherent vs. Incoherent Rays
[Diagram: camera rays and light rays are coherent; reflected rays are incoherent]
Slide 19: Keeping Work Coherent?
- How do we keep work coherent?
- Pipelines
  - Be careful: they can introduce latency
- Data structures
- SPMD (or SIMD) execution
  - Single Program, Multiple Data
  - To exploit Single Instruction, Multiple Data (SIMD) units
  - Bundling "adjacent" work elements helps cache and memory access efficiency
Slide 20: Pipeline Processing
- Parallel and naturally coherent
[Diagram]
Slide 21: A Simplified Graphics Pipeline
- Application
- (application / OpenGL API boundary)
- Vertex batching & assembly
- Triangle assembly
- Triangle clipping
- NDC to window space
- Triangle rasterization
- Fragment shading
- Depth testing (reads/writes the depth buffer)
- Color update (writes the framebuffer)
Slide 22: Another View of the Graphics Pipeline (OpenGL 3.3)
- A 3D application or game issues OpenGL API calls across the CPU/GPU boundary
- GPU stages: front end, vertex assembly, vertex shader, primitive assembly, geometry program, clipping/setup/rasterization, fragment shader, raster operations
- The memory interface serves attribute fetch, parameter buffer reads, texture fetches, and framebuffer access
- Legend: shader stages are programmable; the remaining stages are fixed-function
Slide 23: Modeling Pipeline Efficiency
- Rate of processing for sequential tasks
  - Assume three tasks
  - Run time is the sum of each operation's time: A + B + C
- Rate of processing in a pipeline
  - Assume three tasks, treated as stages
  - Performance is gated by the slowest operation
  - Three operations in the pipeline: A, B, C
  - Run time per result = max(A, B, C)
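The contrast between the sum-of-times and slowest-stage models can be sketched with made-up stage costs (all numbers below are hypothetical, not from the lecture):

```python
# Hypothetical per-item costs (in ms) for three operations A, B, C.
stage_times = {"A": 2.0, "B": 5.0, "C": 1.0}
items = 100

# Sequential processing: every item pays the full sum of stage times.
sequential_ms = items * sum(stage_times.values())

# Pipelined processing: after the pipeline fills (one pass through
# all stages), one item completes every max(A, B, C) ms.
bottleneck = max(stage_times.values())
pipelined_ms = sum(stage_times.values()) + (items - 1) * bottleneck

print(sequential_ms)  # 800.0
print(pipelined_ms)   # 503.0
```

The pipeline's throughput is set entirely by the 5 ms bottleneck stage; speeding up A or C alone would not change it, which is the key intuition for balancing graphics pipelines.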
Slide 24: Hardware Clocks
- The heartbeat of hardware
  - Measured in frequency
  - Hertz (Hz) = cycles per second
  - Megahertz, gigahertz = million, billion Hz
- Faster clocks = faster computation and data transfer
- So why not simply raise clocks?
  - Higher clocks consume more power
  - Circuits are only rated to a maximum clock speed before becoming unreliable
Slide 25: Clock Domains
- A given chip may have multiple clocks running
- Three key domains (GPU-centric)
  - Graphics clock: for fixed-function units
    - Example uses: rasterization, texture filtering, blending
    - Optimized for throughput, not latency
      - Can often instance more units instead of raising clocks
  - Processor clock: for programmable shader units
    - Example: shader instruction execution
    - Generally higher than the graphics clock
      - Because optimized for latency rather than throughput
  - Memory clock: for talking to external memory
    - Depends on the speed rating of the external memory
- Other domains too
  - Display clock, PCI Express bus clock
  - Generally not crucial to rendering performance
Slide 26: 3D Pipeline Programmable Domains Run on Unified Hardware
- The unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains
  - Plus tessellation and compute (not shown in the diagram)
- The vertex, primitive, and fragment programs in the pipeline diagram can all run on unified hardware
Slide 27: Memory Bandwidth
- Raw memory bandwidth depends on
  - Physical clock rate
    - Example: 3 GHz
  - Memory bus width
    - 64-bit, 128-bit, 192-bit, 256-bit, 384-bit
    - Wider buses are faster, but routing all those wires is more expensive
  - Signaling rate
    - Double data rate (DDR) means signals are sent on both the rising and falling clock edges
    - Often the logical memory clock rate includes the signaling rate
- Computing raw memory bandwidth:

    bandwidth = physicalClock x signalsPerClock x busWidth
Slide 28: Latency vs. Throughput
- Raw bandwidth is reduced by imperfect memory utilization
  - Unrealistic to expect 100% utilization
  - GPUs generally achieve much better utilization than CPUs
- Trade-off
  - Maximizing throughput (utilization) increases latency
  - Minimizing latency reduces utilization
Slide 29: Computing Bandwidth
[Images: GeForce GTX 680 board, GK104 die]
- Example: GeForce GTX 680
  - Latest NVIDIA generation
  - 3.54 billion transistors in a 28 nm process
- Memory characteristics
  - 6 GHz memory clock (includes the signaling rate)
  - 256-bit memory interface
  - 6 billion/s x 256 bits/clock x 1 byte/8 bits = 192 gigabytes/second
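The slide's arithmetic can be checked with a tiny helper (the function name is illustrative; the effective clock already folds in the signaling rate, as the slide notes):

```python
def raw_bandwidth_bytes_per_s(effective_clock_hz: float,
                              bus_width_bits: int) -> float:
    """Raw memory bandwidth = effective clock (signaling rate
    included) x bus width, converted from bits to bytes."""
    return effective_clock_hz * bus_width_bits / 8

# GeForce GTX 680: 6 GHz effective memory clock, 256-bit interface
gb_per_s = raw_bandwidth_bytes_per_s(6e9, 256) / 1e9
print(gb_per_s)  # 192.0
```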
Slide 30: GeForce Peak Memory Bandwidth Trends
[Chart: gigabytes per second (0 to 200) for GeForce2 GTS through GeForce 7800 GTX, plotting raw bandwidth and effective raw bandwidth with compression, each with an exponential trend line; the transition from 128-bit to 256-bit memory interfaces is marked]
Slide 31: Effective GPU Memory Bandwidth
- Compression schemes
  - Lossless depth and color compression (when multisampling)
  - Lossy texture compression (S3TC / DXTC)
  - Typically assumes 4:1 compression
- Avoidance of useless work
  - Early killing of fragments (Z cull)
  - Avoiding useless blending and texture fetches
- Very clever memory controller designs
  - Combining memory accesses for improved coherency
  - Caches for texture fetches
Slide 32: Other Metrics
- Host bandwidth
- Vertex pulling
- Vertex transformation
- Triangle setup and rasterization
- Fragment shading rate
- Shader instruction rate
- Raster (blending) operation rate
- Early Z reject rate
Slide 33: Kepler GeForce GTX 680 High-level Block Diagram
- 8 streaming multiprocessors (SMX)
- 1536 CUDA cores
- 8 geometry units
- 4 raster units
- 128 texture units
- 32 raster operations
- 256-bit GDDR5 memory
Slide 34: Kepler Streaming Multiprocessor
[Diagram of one SMX, labeled "8 more copies of this"]
Slide 35: Prior Generation Streaming Multiprocessor (SM)
- Multi-processor execution unit (Fermi)
  - 32 scalar processor cores
  - A warp is a unit of thread execution of up to 32 threads
- Two workloads
  - Graphics
    - Vertex shader
    - Tessellation
    - Geometry shader
    - Fragment shader
  - Compute
Slide 36: Power Gating
- Computer architecture has hit the "power wall"
- Low-power operation is at a premium
  - Battery-powered devices
  - Thermal constraints
  - Economic constraints
- Power management (PM) works to reduce power by
  - Lowering clocks when performance isn't required
  - Disabling hardware units
    - Avoids leakage
Slide 37: Scene Graph Labor
- High-level division of scene graph labor
- Four pipeline stages
  - App (application)
    - Code that manipulates/modifies the scene graph in response to user input or other events
  - Isect (intersection)
    - Geometric queries such as collision detection or picking
  - Cull
    - Traverse the scene graph to find the nodes to be rendered
      - Best example: eliminate objects out of view
    - Optimize the ordering of nodes
      - Sort objects to minimize graphics hardware state changes
  - Draw
    - Communicate drawing commands to the hardware
    - Generally through a graphics API (OpenGL or Direct3D)
- Can map well to multi-processor CPU systems
Slide 38: App-Cull-Draw Threading
[Diagrams: app-cull-draw processing on one CPU core vs. on multiple CPUs]
Slide 39: Scene Graph Profiling
- The scene graph should help provide insight into performance
- Process statistics
  - What's going on?
  - Time stamps
- Database statistics
  - How complex is the scene in any frame?
Slide 40: Example: Depth Complexity Visualization
- How many pixels are being rendered?
  - Pixels can be rasterized by multiple objects
  - Depth complexity is the average number of times a pixel or color sample is updated per frame
[Image: yellow and black indicate higher depth complexity]
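Depth complexity as defined above is just total pixel updates divided by framebuffer size; a toy sketch with a hypothetical, hand-picked list of rasterized fragments (not from the lecture):

```python
from collections import Counter

WIDTH, HEIGHT = 4, 4  # toy 4x4 framebuffer

# Hypothetical fragments produced during one frame, as (x, y) pixel
# updates; overlapping objects touch the same pixel repeatedly.
fragments = [(0, 0), (0, 0), (0, 0), (1, 0), (1, 0), (2, 3)]

updates = Counter(fragments)
total_updates = sum(updates.values())

# Average number of updates per pixel across the whole framebuffer.
depth_complexity = total_updates / (WIDTH * HEIGHT)
print(depth_complexity)       # 0.375
print(max(updates.values()))  # 3: the hottest pixel was updated 3 times
```

A visualization like the slide's simply colors each pixel by its per-pixel count in `updates` instead of averaging.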
Slide 41: Example: Heads-up Display of Statistics
- Process statistics
  - How long is everything taking?
- Database statistics
  - What is being rendered?
- Overlaying statistics on the active scene is often valuable
  - Dynamic update
Slide 42: Benchmarking
- Synthetic benchmarks focus on rendering particular operations in isolation
  - Example: what is the blended pixel performance?
- Application benchmarks
  - Try to reflect what a real application would do
Slide 43: Tips for Interactive Performance Analysis
- Vary things you can control
  - Change the window resolution
    - Make it smaller and see whether performance improves
- Null driver analysis
  - Skip the actual rendering calls
  - What if the driver were "infinitely" fast?
- Use occlusion queries to monitor how many samples (pixels) actually need to be rendered
- Keep data on the GPU
  - Let the GPU do direct memory access (DMA)
  - Avoid swapping textures and buffers
    - Easy when multi-gigabyte graphics cards are available
Slide 44: Next Class
- Next lecture
  - Surfaces
  - Programmable tessellation
- Reading
  - None
- Project 4
  - Project 4 is a simple ray tracer
  - Due Wednesday, May 2, 2012

Recommended

PPT
CS 354 Ray Casting & Tracing
PPT
CS 354 Shadows (cont'd) and Scene Graphs
PPT
CS 354 Final Exam Review
PPT
CS 354 Programmable Shading
PPT
CS 354 Vector Graphics & Path Rendering
PPT
CS 354 Procedural Methods
PPT
CS 354 Project 2 and Compression
PPT
CS 354 Interaction
PPT
Real-time Shadowing Techniques: Shadow Volumes
PPT
CS 354 Shadows
PPT
CS 354 Introduction
PPT
Shadow Mapping with Today's OpenGL Hardware
PPT
CS 354 GPU Architecture
PPT
CS 354 Acceleration Structures
PPT
CS 354 Transformation, Clipping, and Culling
PPT
CS 354 Understanding Color
PPT
CS 354 Texture Mapping
PDF
Mesh Generation and Topological Data Analysis
PPT
CS 354 Viewing Stuff
PPT
CS 354 Surfaces, Programmable Tessellation, and NPR Graphics
PPT
CS 354 Blending, Compositing, Anti-aliasing
PPT
CS 354 More Graphics Pipeline
PPT
An Introduction to NV_path_rendering
PDF
Clustered defered and forward shading
PPT
CS 354 Typography
PPT
CS 354 Pixel Updating
PDF
Practical and Robust Stenciled Shadow Volumes for Hardware-Accelerated Rendering
PDF
A Video Watermarking Scheme to Hinder Camcorder Piracy
PDF
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
PPTX
COMPUTER GRAPHICS

More Related Content

PPT
CS 354 Ray Casting & Tracing
PPT
CS 354 Shadows (cont'd) and Scene Graphs
PPT
CS 354 Final Exam Review
PPT
CS 354 Programmable Shading
PPT
CS 354 Vector Graphics & Path Rendering
PPT
CS 354 Procedural Methods
PPT
CS 354 Project 2 and Compression
PPT
CS 354 Interaction
CS 354 Ray Casting & Tracing
CS 354 Shadows (cont'd) and Scene Graphs
CS 354 Final Exam Review
CS 354 Programmable Shading
CS 354 Vector Graphics & Path Rendering
CS 354 Procedural Methods
CS 354 Project 2 and Compression
CS 354 Interaction

What's hot

PPT
Real-time Shadowing Techniques: Shadow Volumes
PPT
CS 354 Shadows
PPT
CS 354 Introduction
PPT
Shadow Mapping with Today's OpenGL Hardware
PPT
CS 354 GPU Architecture
PPT
CS 354 Acceleration Structures
PPT
CS 354 Transformation, Clipping, and Culling
PPT
CS 354 Understanding Color
PPT
CS 354 Texture Mapping
PDF
Mesh Generation and Topological Data Analysis
PPT
CS 354 Viewing Stuff
PPT
CS 354 Surfaces, Programmable Tessellation, and NPR Graphics
PPT
CS 354 Blending, Compositing, Anti-aliasing
PPT
CS 354 More Graphics Pipeline
PPT
An Introduction to NV_path_rendering
PDF
Clustered defered and forward shading
PPT
CS 354 Typography
PPT
CS 354 Pixel Updating
PDF
Practical and Robust Stenciled Shadow Volumes for Hardware-Accelerated Rendering
PDF
A Video Watermarking Scheme to Hinder Camcorder Piracy
Real-time Shadowing Techniques: Shadow Volumes
CS 354 Shadows
CS 354 Introduction
Shadow Mapping with Today's OpenGL Hardware
CS 354 GPU Architecture
CS 354 Acceleration Structures
CS 354 Transformation, Clipping, and Culling
CS 354 Understanding Color
CS 354 Texture Mapping
Mesh Generation and Topological Data Analysis
CS 354 Viewing Stuff
CS 354 Surfaces, Programmable Tessellation, and NPR Graphics
CS 354 Blending, Compositing, Anti-aliasing
CS 354 More Graphics Pipeline
An Introduction to NV_path_rendering
Clustered defered and forward shading
CS 354 Typography
CS 354 Pixel Updating
Practical and Robust Stenciled Shadow Volumes for Hardware-Accelerated Rendering
A Video Watermarking Scheme to Hinder Camcorder Piracy

Similar to CS 354 Performance Analysis

PDF
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
PPTX
COMPUTER GRAPHICS
PPTX
Performance measures
PPT
CS 354 Bezier Curves
PDF
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
PPT
Nbvtalkatjntuvizianagaram
PPTX
Full introduction to_parallel_computing
PPTX
Amdahl's Law is a formula that predicts the theoretical speedup of a task whe...
PDF
Parallel Algorithms
PPTX
1.1 Introduction.pptx about the design thinking of the engineering students
PDF
Big Data Systems Lecture -2 for Cloud Computing.pdf
PDF
Advanced Computer Architecture - Lec1.pdf
PDF
CSTalks - GPGPU - 19 Jan
PPTX
Super computer
PDF
PPTX
Thinking in parallel ab tuladev
PPT
Lecture1
PPTX
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
PPT
Lecture 1
PDF
Architectures for parallel
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics
 
COMPUTER GRAPHICS
Performance measures
CS 354 Bezier Curves
Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis
Nbvtalkatjntuvizianagaram
Full introduction to_parallel_computing
Amdahl's Law is a formula that predicts the theoretical speedup of a task whe...
Parallel Algorithms
1.1 Introduction.pptx about the design thinking of the engineering students
Big Data Systems Lecture -2 for Cloud Computing.pdf
Advanced Computer Architecture - Lec1.pdf
CSTalks - GPGPU - 19 Jan
Super computer
Thinking in parallel ab tuladev
Lecture1
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
Lecture 1
Architectures for parallel

More from Mark Kilgard

PPT
SIGGRAPH 2012: NVIDIA OpenGL for 2012
PPTX
Migrating from OpenGL to Vulkan
PPT
OpenGL for 2015
PPT
Computers, Graphics, Engineering, Math, and Video Games for High School Students
PPT
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
PPT
NVIDIA OpenGL in 2016
PPT
NVIDIA OpenGL 4.6 in 2017
PPTX
OpenGL 4.5 Update for NVIDIA GPUs
PPT
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
PPT
Virtual Reality Features of NVIDIA GPUs
PDF
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
PPT
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering
PPT
NVIDIA OpenGL and Vulkan Support for 2017
PPT
NV_path rendering Functional Improvements
PDF
GPU-accelerated Path Rendering
PDF
D11: a high-performance, protocol-optional, transport-optional, window system...
PDF
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...
PPT
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
PPT
GPU accelerated path rendering fastforward
PPT
EXT_window_rectangles
SIGGRAPH 2012: NVIDIA OpenGL for 2012
Migrating from OpenGL to Vulkan
OpenGL for 2015
Computers, Graphics, Engineering, Math, and Video Games for High School Students
SIGGRAPH Asia 2012 Exhibitor Talk: OpenGL 4.3 and Beyond
NVIDIA OpenGL in 2016
NVIDIA OpenGL 4.6 in 2017
OpenGL 4.5 Update for NVIDIA GPUs
SIGGRAPH 2012: GPU-Accelerated 2D and Web Rendering
Virtual Reality Features of NVIDIA GPUs
Accelerating Vector Graphics Rendering using the Graphics Hardware Pipeline
SIGGRAPH Asia 2012: GPU-accelerated Path Rendering
NVIDIA OpenGL and Vulkan Support for 2017
NV_path rendering Functional Improvements
GPU-accelerated Path Rendering
D11: a high-performance, protocol-optional, transport-optional, window system...
Programming with NV_path_rendering: An Annex to the SIGGRAPH Asia 2012 paper...
Slides: Accelerating Vector Graphics Rendering using the Graphics Hardware Pi...
GPU accelerated path rendering fastforward
EXT_window_rectangles

Recently uploaded

PDF
CS 354 Performance Analysis

  • 1.
    CS 354 Performance Analysis. Mark Kilgard, University of Texas, April 26, 2012
  • 2.
    CS 354 2 Today’s material  In-class quiz  On acceleration structures lecture  Lecture topic  Graphics Performance Analysis
  • 3.
    CS 354 3 My Office Hours  Tuesday, before class  Painter (PAI) 5.35  8:45 a.m. to 9:15  Thursday, after class  ACE 6.302  11:00 a.m. to 12  Randy’s office hours  Monday & Wednesday  11 a.m. to 12:00  Painter (PAI) 5.33
  • 4.
    CS 354 4 Last time, this time  Last lecture, we discussed  Acceleration structures  This lecture  Graphics Performance Analysis  Projects  Project 4 on ray tracing on Piazza  Due May 2, 2012  Get started!
  • 5.
    CS 354 5 Daily Quiz  On a sheet of paper: • Write your EID, name, and date • Write #1, #2, #3 followed by its answer  Multiple choice: Which is NOT a bounding volume representation? a) sphere b) axis-aligned bounding box c) object-aligned bounding box d) bounding graph point e) convex polyhedron  True or False: Volume rendering can be accelerated by the GPU by drawing blended slices of the volume.  True or False: Placing objects within a uniform grid is easier than placing objects within a KD tree.
  • 6.
    CS 354 6 Graphics Performance Analysis  Generating synthetic images by computer is computationally and bandwidth intensive  Achieving interactive rates is key  60 frames/second ≈ real-time interactivity  Worth optimizing  Entertainment and intuition are tied to interactivity  How do we think about graphics performance analysis?
  • 7.
    CS 354 7 Framing Amdahl’s Law  Assume a workload with two parts  First part is A%  Second part is B%  Such that A% + B% = 100%  If we have a technique to speed up the second part by N times  But have no speedup for the first part  What overall speedup can we expect?
  • 8.
    CS 354 8 Amdahl’s Equation  Assume A% + B% = 100%  If the un-optimized effort is 100%, the optimized effort should be smaller: OptimizedEffort = A% + B%/N  Speedup is the ratio of UnoptimizedEffort to OptimizedEffort: Speedup = 100% / (A% + B%/N) = 1 / (A + B/N)
  • 9.
    CS 354 9 Who was Amdahl?  Gene Amdahl  CPU architect for IBM in the 1960s  Helped design IBM’s System/360 mainframe architecture  Left IBM to found Amdahl Corporation  Building IBM-compatible mainframes  Why the law?  He was evaluating whether to invest in parallel processing or not
  • 10.
    CS 354 10 Parallelization  Broadly speaking, computer tasks can be broken into two portions  Sequential sub-tasks  Naturally requires steps to be done in a particular order  Examples: text layout, entropy decoding  Parallel sub-tasks  Problem splits into lots of independent chunks of work  Chunks of work can be done by separate processing units simultaneously: parallelization  Examples: tracing rays, shading pixels, transforming vertices
  • 11.
    CS 354 11 Serial Work Sandwiching Parallel Work
  • 12.
    CS 354 12 Example of Amdahl’s Law  Say a task is 50% serial and 50% parallel  Consider using 4 parallel processors on the parallel portion  Speedup: 1.6x  Consider using 40 parallel processors on the parallel portion  Speedup: 1.951x  Consider the limit: lim (n→∞) 1 / (0.5 + 0.5/n) = 2
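The arithmetic on this slide is easy to check in a few lines of Python. The helper name `amdahl_speedup` is introduced here for illustration; it just evaluates the speedup formula from the previous slide.

```python
def amdahl_speedup(serial_fraction, n):
    """Amdahl's law: overall speedup when only the parallel
    fraction of a fixed-size task is spread across n units."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n)

print(amdahl_speedup(0.5, 4))      # 1.6
print(amdahl_speedup(0.5, 40))     # ~1.951
print(amdahl_speedup(0.5, 10**9))  # approaches the limit of 2
```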
  • 13.
    CS 354 13 Graph of Amdahl’s Law
  • 14.
    CS 354 14 Pessimism about Parallelism?  Amdahl’s Law can instill pessimism about parallel processing  If the serial work percent is high, adding parallel units has low benefit  Assumes fixed “problem” size  So workload stays same size even as parallel execution resources are added  So why do GPUs offer 100’s of cores then?
  • 15.
    CS 354 15 Gustafson's Law  Observation by John Gustafson  With N parallel units, bigger problems can be attacked  Great example  Increasing GPU rendering resolution  Was 640x480 pixels, now 1920x1200  More parallel units means more pixels can be processed simultaneously  Supporting rendering resolutions previously unattainable  Problem size improvement: problemScale = N − A(N − 1), where A is the serial fraction
  • 16.
    CS 354 16 Example  Say a task is 50% serial and 50% parallel  Consider using 4 parallel processors on the parallel portion  Problem scales up: 2.5x  Consider 100 parallel processors  Problem scales up: 50.5x  Also consider the heterogeneous nature of graphics processing units
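The scaled-problem numbers above follow directly from the slide's formula problemScale = N − A(N − 1); `gustafson_scale` is an illustrative helper name, not from the slides.

```python
def gustafson_scale(serial_fraction, n):
    """Gustafson's law: how much larger a problem can be attacked
    with n parallel units (problemScale = N - A(N - 1))."""
    return n - serial_fraction * (n - 1)

print(gustafson_scale(0.5, 4))    # 2.5
print(gustafson_scale(0.5, 100))  # 50.5
```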
  • 17.
    CS 354 17 Coherent Work vs. Incoherent Work  Not all parallel work is created equal  Coherent work = “adjacent” chunks of work performing similar operations and memory accesses  Example: camera rays, pixel shading  Allows sharing control of instruction execution  Good for caches  Incoherent work = “adjacent” chunks of work performing dissimilar operations and memory accesses  Examples: reflection, shadow, and refraction rays  Bad for caches
  • 18.
    CS 354 18 Coherent vs. Incoherent Rays coherent = camera rays coherent = light rays incoherent = reflected rays
  • 19.
    CS 354 19 Keeping Work Coherent?  How do we keep work coherent?  Pipelines  Careful: pipelines can introduce latency  Data structures  SPMD (or SIMD) execution  Single Program, Multiple Data  To exploit Single Instruction, Multiple Data (SIMD) units  Bundling “adjacent” work elements helps cache and memory access efficiency
  • 20.
    CS 354 20 Pipeline Processing  Parallel and naturally coherent
  • 21.
    CS 354 21 A Simplified Graphics Pipeline  Application → (application/OpenGL API boundary) → Vertex batching & assembly → Triangle assembly → Triangle clipping → NDC to window space → Triangle rasterization → Fragment shading → Depth testing (reads/writes the depth buffer) → Color update → Framebuffer
  • 22.
    CS 354 22 Another View of the Graphics Pipeline  3D Application or Game → OpenGL API → (CPU/GPU boundary) → GPU Front End → Vertex Assembly → Vertex Shader → Primitive Assembly → Geometry Shader → Clipping, Setup, and Rasterization → Fragment Shader → Raster Operations  Memory paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, and Framebuffer Access, all through the Memory Interface  Legend: programmable vs. fixed-function stages  (OpenGL 3.3)
  • 23.
    CS 354 23 Modeling Pipeline Efficiency  Rate of processing for sequential tasks  Assume three tasks  Run time is sum of each operation’s time  A+B+C  Rate of processing in a pipeline  Assume three tasks, treated as stages  Performance gated by slowest operation  Three operations in pipeline: A, B, C  Run time = max(A,B,C)
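The two processing models on this slide can be sketched with three hypothetical stage times (the numbers are made up for illustration):

```python
# Hypothetical per-item costs for three tasks/stages (arbitrary time units)
A, B, C = 3.0, 5.0, 2.0

# Sequential processing: each item pays the sum of all stage times
sequential_time = A + B + C      # 10.0

# Pipelined processing: once the pipeline is full, throughput is
# gated by the slowest stage
pipelined_time = max(A, B, C)    # 5.0

print(sequential_time, pipelined_time)
```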
  • 24.
    CS 354 24 Hardware Clocks  Heart beat of hardware  Measured in frequency  Hertz (Hz) = cycles per second  Megahertz, gigahertz = million, billion Hz  Faster clocks = faster computation and data transfer  So why not simply raise clocks?  High clocks consume more power  Circuits are only rated to a maximum clock speed before becoming unreliable
  • 25.
    CS 354 25 Clock Domains  A given chip may have multiple clocks running  Three key domains (GPU-centric)  Graphics clock—for fixed-function units  Example uses: rasterization, texture filtering, blending  Optimize for throughput, not latency  Can often instance more units instead of raising clocks  Processor clock—for programmable shader units  Example: shader instruction execution  Generally higher than the graphics clock  Because optimized for latency rather than throughput  Memory clock—for talking to external memory  Depends on speed rating of the external memory  Other domains too  Display clock, PCI-Express bus clock  Generally not crucial to rendering performance
  • 26.
    CS 354 26 3D Pipeline Programmable Domains Run on Unified Hardware  Unified Streaming Processor Array (SPA) architecture means the same capabilities for all domains  Plus tessellation + compute (not shown below)  GPU pipeline: Front End → Vertex Assembly → Primitive Assembly → Clipping, Setup, and Rasterization → Raster Operations, with the Vertex, Primitive, and Fragment Programs able to run on unified hardware  Memory paths: Attribute Fetch, Parameter Buffer Read, Texture Fetch, and Framebuffer Access, all through the Memory Interface
  • 27.
    CS 354 27 Memory Bandwidth  Raw memory bandwidth  Physical clock rate  Example: 3 GHz  Memory bus width  64-bit, 128-bit, 192-bit, 256-bit, 384-bit  Wider buses are faster but it is more expensive to route all those wires  Signaling rate  Double data rate (DDR) means signals are sent on both the rising and falling clock edges  Often the logical memory clock rate includes the signaling rate  Computing raw memory bandwidth: bandwidth = physicalClock × signalsPerClock × busWidth
  • 28.
    CS 354 28 Latency vs. Throughput  Raw bandwidth is reduced by imperfect memory utilization  Unrealistic to expect 100% utilization  GPUs are generally much better at this than CPUs  Trade-off  Maximizing throughput (utilization) increases latency  Minimizing latency reduces utilization
  • 29.
    CS 354 29 Computing Bandwidth [GeForce GTX 680 board]  Example: GeForce GTX 680  Latest NVIDIA generation  3.54 billion transistors in a 28 nm process  Memory characteristics  6 GHz memory clock (includes signaling rate)  256-bit memory interface  6 billion clocks/second × 256 bits/clock × 1 byte/8 bits = 192 gigabytes/second [GK104 die]
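The slide's bandwidth arithmetic, expressed as code. The helper name is illustrative; note that the quoted 6 GHz "memory clock" already folds in the signaling rate, as the slide states.

```python
def raw_bandwidth_gbytes(effective_clock_hz, bus_width_bits):
    """Raw memory bandwidth in gigabytes/second, where the quoted
    clock already includes the signaling rate."""
    return effective_clock_hz * bus_width_bits / 8 / 1e9

# GeForce GTX 680: 6 GHz effective memory clock, 256-bit interface
print(raw_bandwidth_gbytes(6e9, 256))  # 192.0
```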
  • 30.
    CS 354 30 GeForce Peak Memory Bandwidth Trends  [Chart: raw bandwidth and effective raw bandwidth with compression, in gigabytes per second (0 to 200), with exponential trend lines, across GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX; marks the move from a 128-bit to a 256-bit interface]
  • 31.
    CS 354 31 Effective GPU Memory Bandwidth  Compression schemes  Lossless depth and color (when multisampling) compression  Lossy texture compression (S3TC / DXTC)  Typically assumes 4:1 compression  Avoiding useless work  Early killing of fragments (Z cull)  Avoiding useless blending and texture fetches  Very clever memory controller designs  Combining memory accesses for improved coherency  Caches for texture fetches
  • 32.
    CS 354 32 Other Metrics  Host bandwidth  Vertex pulling  Vertex transformation  Triangle rasterization and setup  Fragment shading rate  Shader instruction rate  Raster (blending) operation rate  Early Z reject rate
  • 33.
    CS 354 33 Kepler GeForce GTX 680 High-level Block Diagram  8 Streaming Multiprocessors (SMX)  1536 CUDA Cores  8 Geometry Units  4 Raster Units  128 Texture units  32 Raster operations  256-bit GDDR5 memory
  • 34.
    CS 354 34 Kepler Streaming Multiprocessor 8 more copies of this
  • 35.
    CS 354 35 Prior Generation Streaming Multiprocessor (SM)  Multi-processor execution unit (Fermi)  32 scalar processor cores  Warp is a unit of thread execution of up to 32 threads  Two workloads  Graphics  Vertex shader  Tessellation  Geometry shader  Fragment shader  Compute
  • 36.
    CS 354 36 Power Gating  Computer architecture has hit the “power wall”  Low-power operation is at a premium  Battery-powered devices  Thermal constraints  Economic constraints  Power Management (PM) works to reduce power by  Lower clocks when performance isn’t required  Disabling hardware units  Avoids leakage
  • 37.
    CS 354 37 Scene Graph Labor  High-level division of scene graph labor  Four pipeline stages  App (application)  Code that manipulates/modifies the scene graph in response to user input or other events  Isect (intersection)  Geometric queries such as collision detection or picking  Cull  Traverse the scene graph to find the nodes to be rendered  Best example: eliminate objects out of view  Optimize the ordering of nodes  Sort objects to minimize graphics hardware state changes  Draw  Communicating drawing commands to the hardware  Generally through graphics API (OpenGL or Direct3D)  Can map well to multi-processor CPU systems
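The four stages above can be sketched as a single-threaded frame loop. All function and field names below are hypothetical stand-ins for real scene-graph operations, not an actual scene-graph API.

```python
def app(scene):
    # App stage: respond to input/events by mutating the scene graph
    scene["frame"] += 1

def isect(scene):
    # Isect stage: geometric queries (collision, picking); a no-op stand-in
    return []

def cull(scene):
    # Cull stage: keep only nodes that should be rendered this frame
    return [obj for obj in scene["objects"] if obj["visible"]]

def draw(visible):
    # Draw stage: stand-in for issuing OpenGL/Direct3D commands
    return len(visible)

scene = {"frame": 0, "objects": [{"visible": True}, {"visible": False}]}
app(scene)
isect(scene)
drawn = draw(cull(scene))
print(drawn)  # 1
```

On a multi-processor CPU system, each stage could instead run as its own thread, with the stages handing frames down the pipeline.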
  • 38.
    CS 354 38 App-cull-draw Threading  App-cull-draw processing on one CPU core  App-cull-draw processing on multiple CPUs
  • 39.
    CS 354 39 Scene Graph Profiling  Scene graph should help provide insight into performance  Process statistics  What’s going on?  Time stamps  Database statistics  How complex is the scene in any frame?
  • 40.
    CS 354 40 Example: Depth Complexity Visualization  How many pixels are being rendered?  Pixels can be rasterized by multiple objects  Depth complexity is the average number of times a pixel or color sample is updated per frame yellow and black indicate higher depth complexity
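Depth complexity as defined on this slide is easy to compute from per-pixel update counts; the counts below are made up for illustration.

```python
# Update counts per pixel for a tiny 2x3 framebuffer (hypothetical numbers)
updates = [
    [1, 2, 3],
    [0, 4, 2],
]

pixel_count = sum(len(row) for row in updates)
total_updates = sum(sum(row) for row in updates)

# Average number of times each pixel was updated this frame
depth_complexity = total_updates / pixel_count
print(depth_complexity)  # 2.0
```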
  • 41.
    CS 354 41 Example: Heads-up Display of Statistics  Process statistics  How long is everything taking?  Database statistics  What is being rendered?  Overlaying statistics on the active scene is often valuable  Dynamic update
  • 42.
    CS 354 42 Benchmarking  Synthetic benchmarks focus on rendering particular operations in isolation  Example: what is the blended pixel performance?  Application benchmarks  Try to reflect what a real application would do
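A minimal synthetic-benchmark harness along these lines, timing one operation in isolation. The workload here is a placeholder computation, not a real rendering call, and the function name is chosen for illustration.

```python
import time

def frames_per_second(render, frames=200):
    """Time repeated calls to a frame function and report frames/second."""
    start = time.perf_counter()
    for _ in range(frames):
        render()
    elapsed = time.perf_counter() - start
    return frames / elapsed

# Placeholder "frame": a real synthetic benchmark would issue draw calls here
fps = frames_per_second(lambda: sum(range(10000)))
print(f"{fps:.1f} frames/second")
```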
  • 43.
    CS 354 43 Tips for Interactive Performance Analysis  Vary things you can control  Change window resolution  Making it smaller and seeing better performance suggests a fill-rate limit  Null driver analysis  Skip the actual rendering calls  What if the driver were infinitely fast?  Use occlusion queries to monitor how many samples (pixels) actually get rendered  Keep data on the GPU  Let the GPU do Direct Memory Access (DMA)  Keep from swapping textures and buffers  Easier when multi-gigabyte graphics cards are available
  • 44.
    CS 354 44 Next Class  Next lecture  Surfaces  Programmable tessellation  Reading  None  Project 4  Project 4 is a simple ray tracer  Due Wednesday, May 2, 2012
