Uploaded by UA Mobile

Dmitriy Vovk - Learn iOS Game Optimization. Ultimate Guide

This document provides optimization recommendations for iOS game development. It discusses techniques for optimizing vertex processing, texture uploading, and shader code. Some key recommendations include batching draw calls, avoiding synchronizations, culling unused geometry, using compressed textures, and leveraging the NEON SIMD engine for vertex processing on the CPU. Precise data types, avoiding branching in shaders, and minimizing state changes can further improve performance.

Learn iOS Game Optimization. Ultimate Guide
by Dmitriy Vovk

Want to achieve the same level of technology speed? Welcome!
(Image is used without any permissions)

General Recommendations

What you might know
• Batch, Batch, Batch!
  http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergrappapers/BatchBatchBatch.pdf
• Render from one thread only
• Avoid synchronizations:
  1. glFlush/glFinish
  2. Querying GL states
  3. Accessing render targets

Vertex Data Recommendations

What you might know
• Pixel-perfect HSR (Hidden Surface Removal),
• But you still need to sort opaque geometry!
• Avoid doing alpha test. Use alpha blend instead

What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on the CPU!
• Prefer the stencil test over scissor.
  – The stencil test is performed in hardware on PowerVR GPUs, resulting in dramatically increased performance.
  – A stencil can be of any shape, in contrast to the rectangular scissor.
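The CPU-side culling mentioned above could look roughly like the sketch below. It assumes each mesh has a bounding sphere and that the frustum is given as planes with normalized normals pointing inward; the names Plane, Sphere and sphere_in_frustum are illustrative and do not come from the talk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Plane nx*x + ny*y + nz*z + d = 0; the normal is assumed to be
   normalized and to point towards the inside of the frustum. */
typedef struct { float nx, ny, nz, d; } Plane;
typedef struct { float x, y, z, radius; } Sphere;

/* Returns false only when the sphere lies completely behind at least one
   plane, i.e. when the mesh cannot be visible and can be skipped. */
static bool sphere_in_frustum(const Plane *planes, size_t plane_count,
                              const Sphere *s)
{
    assert(planes != NULL && s != NULL);
    for (size_t i = 0; i < plane_count; ++i) {
        float dist = planes[i].nx * s->x + planes[i].ny * s->y
                   + planes[i].nz * s->z + planes[i].d;
        if (dist < -s->radius) {
            return false;
        }
    }
    return true;
}
```

A renderer would call this once per mesh before submitting its draw call, so rejected meshes never reach the vertex stage.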
What you might not know
• Why no alpha test?!
  o Alpha test/discard requires the fragment shader to run before visibility of the current fragment can be determined. This removes the benefits of HSR.
  o Even more! If shader code contains discard, then any geometry rendered with this shader suffers from the alpha test drawbacks. Even if the keyword is inside a condition, the USSE assumes that the condition may be hit.
  o Move discard into a separate shader.
  o Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry last.
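The draw order recommended above (opaque, then alpha-tested, then alpha-blended) can be enforced by sorting the draw list by pass before submission. This is a minimal sketch; RenderPass, DrawItem and sort_draw_list are hypothetical names, and qsort is not stable, so items within one pass may be reordered.

```c
#include <assert.h>
#include <stdlib.h>

/* Enum values are chosen so that ascending order is the submission order. */
typedef enum {
    PASS_OPAQUE = 0,
    PASS_ALPHA_TESTED = 1,
    PASS_ALPHA_BLENDED = 2
} RenderPass;

typedef struct {
    RenderPass pass;
    int mesh_id;   /* placeholder for whatever identifies the draw call */
} DrawItem;

static int compare_by_pass(const void *a, const void *b)
{
    const DrawItem *da = (const DrawItem *)a;
    const DrawItem *db = (const DrawItem *)b;
    return (int)da->pass - (int)db->pass;
}

/* Groups the list into opaque, alpha-tested and alpha-blended runs. */
static void sort_draw_list(DrawItem *items, size_t count)
{
    assert(items != NULL || count == 0);
    if (count > 1) {
        qsort(items, count, sizeof items[0], compare_by_pass);
    }
}
```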
What you might know
• Bandwidth matters
  1. Use a constant color per object instead of per vertex
  2. Simplify your models. Use smaller data types.
  3. Use indexed triangles or non-indexed triangle strips
  4. Use VBOs instead of client arrays
  5. Use VAOs

What you might not know
  – The VAO implementation on at least iOS 4.0 harmed performance.
  – VBOs are allocated in multiples of the 4 KB page size. Be aware of that: a large number of small VBOs can fragment and waste your memory.
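To get a feel for the cost of the 4 KB granularity, the arithmetic can be sketched as follows. It assumes the simple model stated on the slide (every VBO is rounded up to a whole number of 4096-byte pages); the function names are illustrative, and real driver bookkeeping may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { VBO_PAGE_SIZE = 4096 };

/* Bytes actually reserved for a VBO of `requested` bytes under the
   "rounded up to 4 KB pages" model. */
static size_t vbo_allocated_bytes(size_t requested)
{
    assert(requested <= SIZE_MAX - (VBO_PAGE_SIZE - 1));
    return (requested + VBO_PAGE_SIZE - 1) / VBO_PAGE_SIZE * VBO_PAGE_SIZE;
}

/* Bytes reserved but never used by the vertex data. */
static size_t vbo_wasted_bytes(size_t requested)
{
    return vbo_allocated_bytes(requested) - requested;
}
```

Under this model, 1000 buffers of 96 bytes each would occupy 4,096,000 bytes to hold 96,000 bytes of data, which is why merging small meshes into a shared VBO tends to pay off.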
What you might not know
• Updating your VBO data each frame:
  1. glBufferSubData that updates a big part of the original data harms performance. Try not to update a buffer that is currently in use.
  2. glBufferData that completely overwrites the original data is OK. The old data is orphaned by the driver and storage for the new data is allocated.
  3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data.
  4. EXT_map_buffer_range (iOS 6 only), when you need to update only a subset of a buffer object.

What you might not know
    int bufferID = 0;
    // initialization
    for (int i = 0; i < 3; ++i)
    {
        // only allocate storage for 3 VBOs, do not upload any data
        glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
        // BUFFER_SIZE is a placeholder for the size in bytes of one frame of vertex data
        glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE, NULL, GL_DYNAMIC_DRAW);
    }
    //...
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
    void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
    // update data here
    glUnmapBufferOES(GL_ARRAY_BUFFER);
    ++bufferID;
    if (bufferID == 3) // cycling through 3 buffers
    {
        bufferID = 0;
    }

What you might not know
• This scheme gives you the best performance possible: no blocking of the CPU by the GPU (or vice versa), no redundant memcpy operations, and lower CPU load, at the cost of extra memory. (Note that you need no extra temporary buffer to store your data before sending it to the VBO.)

    update(1), draw(1), gpuworking(................)
        update(2), draw(2), gpuworking(................)
            update(3), draw(3), gpuworking(................)

What you might not know
• The float type is native to the GPU
• …that means any other type will be converted to float by the USSE
• …resulting in a few additional cycles
• Thus the tradeoff between bandwidth/storage and additional cycles is your choice

What you might know
• Use interleaved vertex data
  – Align each vertex attribute to a 4-byte boundary

What you might not know
• Why do you have to do this?!
  – You don’t. The driver can do it for you
  – …resulting in slower performance.
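One way to keep every attribute of an interleaved vertex on a 4-byte boundary is to describe the layout as a struct and check the offsets in code. The layout below (float position, 16-bit texture coordinates, byte color, byte normal plus one pad byte) is only an assumed example, not the format used in the talk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Interleaved vertex: each attribute starts at a multiple of 4 bytes. */
typedef struct {
    float   position[3];  /* offset 0,  12 bytes */
    int16_t texcoord[2];  /* offset 12,  4 bytes */
    uint8_t color[4];     /* offset 16,  4 bytes */
    int8_t  normal[3];    /* offset 20,  3 bytes */
    uint8_t pad;          /* offset 23, keeps the stride a multiple of 4 */
} Vertex;

/* Fails at compile time if the stride stops being a multiple of 4. */
static_assert(sizeof(Vertex) % 4 == 0, "vertex stride must be 4-byte aligned");

/* Returns 1 when every attribute offset is a multiple of 4 bytes. */
static int vertex_layout_is_aligned(void)
{
    return offsetof(Vertex, position) % 4 == 0
        && offsetof(Vertex, texcoord) % 4 == 0
        && offsetof(Vertex, color) % 4 == 0
        && offsetof(Vertex, normal) % 4 == 0;
}
```

The same offsetof values and sizeof(Vertex) as the stride are what would be passed to glVertexAttribPointer when the VBO is bound.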
What you might know
• Split your vertex data into two parts:
  1. Static VBO: the one that will never be changed
  2. Dynamic VBO: the one that needs to be updated frequently
• Split your vertex data into several VBOs when several meshes share the same set of attributes

Texture Data Recommendations

What you might know
• Bandwidth matters
  1. Use lower precision formats, e.g. RGB565
  2. Use PVRTC compressed textures
  3. Use atlases
  4. Use mipmaps. They improve texture cache efficiency and quality.

What you might not know
• iOS OpenGL ES drivers from version 4.0 up to (but not including) 6.0 have a bug: they ALWAYS reserve memory for mipmaps, regardless of whether you requested them or not. And you don’t need mipmaps for 2D graphics.
• …but there is one workaround: make your textures NPOT (non-power-of-two).

What you might not know
• NPOT textures work only with the GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable; they give you the best performance possible
• Use NPOT textures with dimensions that are multiples of 32 pixels for best performance
• The driver will pad the data of your NPOT texture to match the size of the closest POT values.
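The memory effects described above can be estimated with a little arithmetic. The sketch below assumes an uncompressed texture with a fixed number of bytes per pixel and reads "closest POT" as the next power of two that is not smaller than the dimension; both helper names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Smallest power of two that is >= v (v must be in 1..2^31). */
static uint32_t next_pot(uint32_t v)
{
    assert(v >= 1 && v <= 0x80000000u);
    uint32_t p = 1;
    while (p < v) {
        p <<= 1;
    }
    return p;
}

/* Total bytes of a full mip chain, from the base level down to 1x1. */
static size_t mip_chain_bytes(uint32_t width, uint32_t height,
                              size_t bytes_per_pixel)
{
    assert(width >= 1 && height >= 1);
    size_t total = 0;
    for (;;) {
        total += (size_t)width * height * bytes_per_pixel;
        if (width == 1 && height == 1) {
            break;
        }
        if (width > 1) width /= 2;
        if (height > 1) height /= 2;
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture the base level takes 4,194,304 bytes and the full chain 5,592,404 bytes, so mipmaps that are reserved but never used cost about a third of extra memory per texture.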
What you might not know
• Why do I have to use PVRTC? It looks ugly!
  1. PVRTC provides great compression, resulting in smaller texture size, improved cache efficiency, saved bandwidth, and decreased power consumption
  2. PVRTC stores pixel data in the GPU’s native order, i.e. BGRA instead of RGBA

What you might not know
• BGRA vs RGBA
  1. RGBA:
     • Requires pixel data to be shuffled by the driver into BGRA
     • Has options for RGB422, RGB565, RGBA4444, RGBA5551
  2. BGRA:
     • Stores data in the GPU’s native order
     • Has only the BGRA8888 option for upload, and BGRA8888, BGRA5551, BGRA4444 for ReadPixels

What you might not know
• Prefer OES_texture_half_float over OES_texture_float
• A texture read fetches only 32 bits per texel, so an RGBA float texture results in 4 texture reads

What you might know
• Prefer multitexturing over multiple passes
• Configure texture parameters before feeding image data to the driver

What you might not know
• Texture uploading to the GPU is a mess!
• The usual way to do this:
  1. Load the texture into a temporary buffer in RAM
  2. Feed this buffer to glTexImage2D
  3. Draw!
• Looks simple and fast, right?

What you might not know
• …NO!
    void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
    LoadTexture(textureName, buf);    // loader from the slide; passing buf as the destination is an assumption
    glBindTexture(GL_TEXTURE_2D, textureID);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
    // buf is copied into an internal buffer created by the driver (that's obvious)
    free(buf); // because the buffer can be freed immediately after glTexImage2D
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
    // the driver will do some additional work to fully upload the texture the first time it is actually used!
• Textures are finally uploaded only when they are used for the first time. So draw them off screen immediately after glTexImage2D
• A lot of redundant work!

What you might not know
• Jedi way to upload textures:
    void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
    glBindTexture(GL_TEXTURE_2D, textureID);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
    // the mapped file data is copied into an internal buffer created by the driver
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
    // the driver will do some additional work to fully upload the texture the first time it is actually used!
    munmap(ptr, TEXTURE_SIZE);
• File mapping does not copy your file data into RAM up front! It loads file data page by page, as it is accessed.
• Thus we eliminate one redundant copy, dramatically reduce texture upload time, and decrease memory fragmentation
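The file-mapping step can be wrapped with basic error handling, roughly as in the sketch below. It assumes a POSIX system and a file that holds raw, already decoded pixel data; MappedFile, map_file and unmap_file are hypothetical helper names.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

typedef struct {
    const void *data;  /* start of the read-only mapping */
    size_t size;       /* mapping length in bytes */
} MappedFile;

/* Maps a whole file read-only. Returns 0 on success, -1 on failure. */
static int map_file(const char *path, MappedFile *out)
{
    assert(path != NULL && out != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0) {
        return -1;
    }
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED) {
        return -1;
    }
    out->data = p;
    out->size = (size_t)st.st_size;
    return 0;
}

/* Releases the mapping; safe to call on an already released MappedFile. */
static void unmap_file(MappedFile *f)
{
    assert(f != NULL);
    if (f->data != NULL) {
        munmap((void *)f->data, f->size);
    }
    f->data = NULL;
    f->size = 0;
}
```

In the upload path, the data pointer would be handed to glTexImage2D and unmap_file called once the texture has been drawn for the first time.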
What you might not know
• Always use glClear at the beginning of the frame…
• …and EXT_discard_framebuffer at the end.
• PowerVR GPUs have a fast on-chip depth buffer for each tile. If you forget to clear/discard the depth buffer, it will be transferred from the hardware tile memory to system memory

Shaders Best Practices

What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Do not use discard. If you must, place the discard instruction as early as possible to avoid useless computations

What you might not know
• Code inside a dynamic branch (one whose condition is evaluated against a value calculated in the shader) is executed anyway, and its result is then discarded if the condition is false

What you might not know
• highp: 32-bit floating point value
• mediump: 16-bit floating point value in the range [-65520, 65520]
• lowp: 10-bit fixed point value in the range [-2, 2] with a step of 1/256
• Try to give the same precision to all your operands, because conversion takes some time
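The lowp numbers above can be checked on the CPU with a small model: a range of [-2, 2] in steps of 1/256 gives 4 * 256 = 1024 steps, which matches 10 bits. The function below only illustrates those figures; it is not a description of how the GPU actually rounds, and the name quantize_lowp is made up.

```c
#include <assert.h>

/* Clamps x to [-2, 2] and rounds it to the nearest multiple of 1/256. */
static float quantize_lowp(float x)
{
    assert(x == x);  /* NaN has no fixed-point representation */
    if (x > 2.0f) x = 2.0f;
    if (x < -2.0f) x = -2.0f;
    float scaled = x * 256.0f;
    int steps = (int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));
    return (float)steps / 256.0f;
}
```

Values smaller than half a step (about 0.002) collapse to zero and anything outside the range is clamped, which is why lowp suits colors and normalized values but not texture coordinates or positions.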
What you might not know
• highp values are calculated on a scalar processor on USSE1 only:
    highp vec4 v1, v2;
    highp float s1, s2;
    // Bad
    v2 = (v1 * s1) * s2;
    // the scalar processor executes v1 * s1 (4 operations), and then this result
    // is multiplied by s2 on the scalar processor again (4 additional operations)
    // Good
    v2 = v1 * (s1 * s2);
    // s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations on the scalar processor

Hardware features

What you might know
• Typical CPU found in iOS devices:
  1. ARMv7 architecture
  2. Cortex A8 / Cortex A9 / custom Apple cores
  3. 600 – 1300 MHz
  4. 1-2 cores
  5. Thumb-2 instruction set

What you might not know
• ARMv7 has no hardware support for integer division
• VFPv3 FPU / VFPv4 on Apple A6 (rumored)
• NEON SIMD engine
• Unaligned access is done in software on Cortex A8. That means a hundred times slower
• Cortex A8 is an in-order CPU. Cortex A9+ are out of order

What you might not know
• The Cortex A9 core has a full VFPv3 FPU, while Cortex A8 has a VFPLite. That means that float operations take 1 cycle on A9 and 10 cycles on A8!

What you might not know
• NEON: 16 registers, each 128 bits wide. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values
• NEON can be used for:
  – Software geometry instancing;
  – Skinning on ES 1.1;
  – As a general vertex processor;
  – Other typical applications for SIMD.

What you might not know
• The USSE1 architecture is scalar, NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations*
• ???????
• PROFIT!!!111
• *NOTE. That doesn’t apply to USSE2 hardware

What you might not know
• The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on this. The PowerVR USSE engine doesn’t care what it does, vertex or fragment processing. Moving your vertex processing to the CPU (NEON) leaves some room for fragment processing. It has more effect on USSE1, which is scalar hardware.

What you might not know
• There are 3 ways to use the NEON engine in your code:
  1. Intrinsics
     1.1 GLKMath
  2. Handwritten NEON assembly
  3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C Flags in the project settings and you are ready to go.

What you might not know
• Intrinsics:

What you might not know
• Assembly:

What you might not know
• Summary:

                          Running time, ms   CPU usage, %
    Intrinsics            2764               19
    Assembly              3664               20
    FPU                   6209               25-28
    FPU autovectorized    5028               22-24

• Intrinsics got me a 25% speedup over assembly. Let’s see the code!
• Note that the speed of intrinsics code varies from compiler to compiler.

What you might not know
    __attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat,
        const float32x4_t* __restrict__ vec, float32x4_t* __restrict__ result)
    {
        (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
    }
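To check results off-device, the same matrix-by-vector product can be written so that it compiles with or without NEON. The sketch below is an assumption-laden variant, not the author's code: it takes plain float arrays in column-major order, uses the NEON intrinsics when the compiler defines __ARM_NEON, and falls back to scalar code otherwise. The name mat4_mul_vec4 is illustrative.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* mat: 16 floats, column-major (element of column c, row r is mat[4*c + r]).
   vec and out: 4 floats each; out must not alias mat or vec. */
static void mat4_mul_vec4(const float *mat, const float *vec, float *out)
{
    assert(mat != NULL && vec != NULL && out != NULL);
#if HAVE_NEON
    /* Scale each matrix column by one vector component and accumulate. */
    float32x4_t r = vmulq_n_f32(vld1q_f32(mat + 0), vec[0]);
    r = vmlaq_n_f32(r, vld1q_f32(mat + 4), vec[1]);
    r = vmlaq_n_f32(r, vld1q_f32(mat + 8), vec[2]);
    r = vmlaq_n_f32(r, vld1q_f32(mat + 12), vec[3]);
    vst1q_f32(out, r);
#else
    /* Scalar fallback computing the same sum one row at a time. */
    for (int row = 0; row < 4; ++row) {
        out[row] = mat[0 + row] * vec[0] + mat[4 + row] * vec[1]
                 + mat[8 + row] * vec[2] + mat[12 + row] * vec[3];
    }
#endif
}
```

Keeping a scalar path next to the intrinsics makes it possible to unit test the math on a desktop machine and to compare timings on the device.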
What you might not know
    __attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
        const float32x4x4_t* __restrict__ m2, float32x4x4_t* __restrict__ r)
    {
    #ifdef INTRINSICS
        (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
        (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
        (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
        (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
    #endif
    }

What you might not know
    __asm__ volatile (
      "vldmia %6, { q0-q3 } \n\t"
      "vldmia %0, { q8-q11 }\n\t"
      "vmul.f32 q12, q8, d0[0]\n\t"
      "vmul.f32 q13, q8, d2[0]\n\t"
      "vmul.f32 q14, q8, d4[0]\n\t"
      "vmul.f32 q15, q8, d6[0]\n\t"
      "vmla.f32 q12, q9, d0[1]\n\t"
      "vmla.f32 q13, q9, d2[1]\n\t"
      "vmla.f32 q14, q9, d4[1]\n\t"
      "vmla.f32 q15, q9, d6[1]\n\t"
      "vmla.f32 q12, q10, d1[0]\n\t"
      "vmla.f32 q13, q10, d3[0]\n\t"
      "vmla.f32 q14, q10, d5[0]\n\t"
      "vmla.f32 q15, q10, d7[0]\n\t"
      "vmla.f32 q12, q11, d1[1]\n\t"
      "vmla.f32 q13, q11, d3[1]\n\t"
      "vmla.f32 q14, q11, d5[1]\n\t"
      "vmla.f32 q15, q11, d7[1]\n\t"
      "vldmia %1, { q0-q3 } \n\t"
      "vmul.f32 q8, q12, d0[0]\n\t"
      "vmul.f32 q9, q12, d2[0]\n\t"
      "vmul.f32 q10, q12, d4[0]\n\t"
      "vmul.f32 q11, q12, d6[0]\n\t"
      "vmla.f32 q8, q13, d0[1]\n\t"
      "vmla.f32 q8, q14, d1[0]\n\t"
      "vmla.f32 q8, q15, d1[1]\n\t"
      "vmla.f32 q9, q13, d2[1]\n\t"
      "vmla.f32 q9, q14, d3[0]\n\t"
      "vmla.f32 q9, q15, d3[1]\n\t"
      "vmla.f32 q10, q13, d4[1]\n\t"
      "vmla.f32 q10, q14, d5[0]\n\t"
      "vmla.f32 q10, q15, d5[1]\n\t"
      "vmla.f32 q11, q13, d6[1]\n\t"
      "vmla.f32 q11, q14, d7[0]\n\t"
      "vmla.f32 q11, q15, d7[1]\n\t"
      "vstmia %2, { q8 }\n\t"
      "vstmia %3, { q9 }\n\t"
      "vstmia %4, { q10 }\n\t"
      "vstmia %5, { q11 }"
      :
      : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
      : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
    );

What you might not know
• For a detailed explanation of intrinsics/assembly see:
  http://infocenter.arm.com/help/index.jsp?
  com.arm.doc.dui0491e/CIHJBEFE.html

Contact me
http://www.linkedin.com/in/dvovk/
http://nukecode.blogspot.com/

Дмитрий Вовк - Learn iOS Game Optimization. Ultimate Guide

  • 1.
    Learn iOS GameOptimization. Ultimate Guide by Dmitriy Vovk
  • 2.
    Want to achievethe same level oftechnology speed? Welcome! Image is used without any permissions 
  • 3.
  • 4.
    What you mightknow• Batch, Batch, Batch!http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergrappapers/BatchBatchBatch.pdf• Render from one thread only• Avoid synchronizations: 1. glFlush/glFinish; 2. Querying GL states; 3. Accessing render targets;
  • 5.
  • 6.
    What you mightknow• Pixel perfect HSR (Hidden Surface Removal),• But still need to sort opaque geometry!• Avoid doing alpha test. Use alpha blend instead
  • 7.
    What you mightnot know• HSR still requires vertices to be processed!• …thus don’t forget to cull your geometry on CPU!• Prefer Stencil Test before Scissor. – Stencil test is performed in hardware on PowerVR GPUs, thus resulting in dramatically increased performance. – Stencil can be of any form in contrast to the rectangular Scissor
  • 8.
    What you mightnot know• Why no alpha test?!o Alpha testdiscard requires fragment shader to run, before visibility for current fragment can be determined. This will remove benefits of HSRo Even more! If shader code contains discard, than any geometry rendered with this shader will suffer from alpha test drawbacks. Even if this key-word is under condition, USSE does assumes, that this condition may be hit.o Move discard into separate shadero Draw opaque geometry, than alpha tested one and alpha blended in the end
  • 9.
    What you mightknow• Bandwidth matters 1. Use constant color per object, instead of per vertex 2. Simplify your models. Use smaller data types. 3. Use indexed triangles or non-indexed triangle strips 4. Use VBO instead of client arrays 5. Use VAO
  • 10.
    What you mightnot know– VAO implementation on at least iOS 4.0 did harmed your performance– VBOs are allocated at 4KB page size multiples. Be aware of that. Large amount of small VBOs can defragment and waste you memory.
  • 11.
    What you mightnot know• Updating your VBO data each frame: 1. glBufferSubData, that updates big part of the original data do harm performance. Try not to update buffer, that is used now 2. glBufferData, that will completely overwrite original data is OK. Old data will be orphaned by driver and storage for new one will be allocated 3. glMapBuffer with triple buffered VBO is preferred way to update your data 4. EXT_map_buffer_range (iOS 6 only), when you need to update only a subset of a buffer object.
  • 12.
    What you mightnot knowint bufferID = 0; //initializationfor (int i = 0; i < 3; ++i)// only allocate data for 3 vbo, do not upload it{glBindBuffer(vertexBuffer[i]);glBufferData(GL_ARRAY_BUFFER, 0, 0, GL_DYNAMIC_DRAW);}//...glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);//update data hereglUnmapBufferOES(GL_ARRAY_BUFFER);++bufferID;if (bufferID == 3) //cycling through 3 buffers{bufferID = 0;}
  • 13.
    What you mightnot know• This scheme will give you the best performance possible – no blocking CPU by GPU (or vice versa), no redundant memcpy operations, lower CPU load, but extra memory is used (note, that you will need no extra temporal buffer to store your data before sending it to VBO). update(1), draw(1), gpuworking(................) update(2), draw(2), gpuworking(................) update(3), draw(3), gpuworking(................)
  • 14.
    What you mightnot know• Float type is native to GPU• …that means any other type will be converted to float by USSE• …resulting in few additional cycles• Thus it’s your choice in tradeoff between bandwidthstorage and additional cycles
  • 15.
    What you mightknow• Use interleaved vertex data – Align each vertex attribute by 4 bytes boundaries
  • 16.
    What you mightnot know• Why you have to do this?! – You don’t. Driver can do this instead of you – …resulting in slower performance.
  • 17.
    What you mightknow• Split your vertex data into two parts: 1. Static VBO - the one, that never will be changed 2. Dynamic VBO – the one, that needs to be updated frequently• Split your vertex data into few VBOs, when few meshes share the same set of attributes
  • 18.
  • 19.
    What you mightknow• Bandwidth matters 1. Use lower precision formats i.e. RGB565 2. Use PVRTC compressed textures 3. Use atlases 4. Use mipmaps. They improve texture cache efficiency and quality.
  • 20.
    What you mightnot know• iOS OpenGL ES drivers from 4.0 version prior to 6.0 has a bug, that will ALWAYS reserve memory for mipmaps, regardless, whether you requested to create them, or not. And you don’t need mip maps for 2D graphics.• …but there are one workaround – make your textures NPOT (non-power of two).
  • 21.
    What you mightnot know• NPOT textures works only with the GL_CLAMP_TO_EDGE warp mode• POT are preferable, they gives you the best performance possible• Use NPOT textures with dimensions multiple to 32 pixels for best performance• Driver will pad data of your NPOT texture to match the size of the closes POT values.
  • 22.
    What you mightnot know• Why do I have to use PVRTC? It looks ugly! 1.PVRTC provides great compression, resulting in smaller texture size, improved cache, saved bandwidth and decreased power consumption 2.PVRTC stores pixel data in GPU’s native order i.e BGRA, instead of RGBA
  • 23.
    What you mightnot know• BGRA vs RGBA1. RGBA: • Requires pixel data to be shuffled by driver into BGRA • Has options for RGB422, RGB565, RGBA4444, RGBA55512. BGRA: • Stores data in GPU’s native order • Has option only for BGRA8888 for upload and BGRA888, BGRA5551, BGRA4444 for ReadPixels
  • 24.
    What you mightnot know• Prefer OES_texture_half_float instead of OES_texture_float• Texture reads read only 32 bits per texel, thus RGBA float texture will result in 4 texture reads
  • 25.
    What you mightknow• Prefer multitexturing instead of multiple passes• Configure texture parameters before feeding image data to driver
  • 26.
    What you mightnot know• Texture uploading to the GPU is a mess!• Usual way to do this: 1. Load texture to temporal buffer in RAM 2. Feed this buffer to glTexImage2D 3. Draw!• Looks simple and fast, right?
  • 27.
    What you mightnot know• …NO!void* buf = malloc(TEXTURE_SIZE); //4mb for RGBA8 1024x1024 textureLoadTexture(textureName);glBindTexture(GL_TEXTURE_2D, textureID);glTexImage2D(GL_TEXTURE_2D, 0, 4, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, &buf);// buf is copied into internal buffer, created by driver (that's obvious)free(buf); // because buffer can be freed immediately after glTexImage2DglDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);// driver will do some additional work to fully upload texture first time it is actually used!• Textures are finally uploaded only when they are used first time. So draw them off screen immediately after glTexImage2D• A lot of redundant work!
  • 28.
    What you mightnot know• Jedi way to upload textures:void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); //file mappingglBindTexture(GL_TEXTURE_2D, textureID);glTexImage2D(GL_TEXTURE_2D, 0, 4, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);// buf is copied into internal buffer, created by driver (that's obvious)free(buf); // because buffer can be freed immediately after glTexImage2DglDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);// driver will do some additional work to fully upload texture first time it is actually used!munmap(ptr, TEXTURE_SIZE);• File mapping does not copy your file data into RAM! It does load file data page by page, when it’s accessed.• Thus we eliminated one redundant copy, dramatically increased texture upload time and decreased memory fragmentation
  • 29.
    What you mightnot know• Always use glClear at the beginning of the frame…• … and EXT_discard_framebuffer at the end.• PVR GPU series have a fast on chip depth buffer for each tile. If you forget to cleardiscard depth buffer, it will be uploaded from HW to SW
  • 30.
  • 31.
    What you mightknow• Be wise with precision hints• Avoid branching• Eliminate loops• Do not use discard. Place discard instruction as early, as possible to avoid useless computations
  • 32.
    What you mightnot know• Code inside of dynamic branch (it’s condition is evaluated against value calculated in shader) will be executed anyway and than it will be orphaned if condition is false
  • 33.
    What you mightnot know• highp – represents 32 bit floating point value• mediump – represents 16 bit floating point value in range of [-65520, 65520]• lowp – 10 bit fixed point values in range of [-2, 2] with step of 1/256• Try to give the same precision to all you operands, because conversion takes some time
  • 34.
    What you might not know
    • On USSE1 (only), highp values are calculated on a scalar processor:
      highp vec4 v1, v2;
      highp float s1, s2;
      // Bad
      v2 = (v1 * s1) * s2; // the scalar processor executes v1 * s1 (4 operations), and then this result
                           // is multiplied by s2 on the scalar processor again (4 additional operations)
      // Good
      v2 = v1 * (s1 * s2); // s1 * s2 is 1 operation on the scalar processor;
                           // result * v1 is 4 operations on the scalar processor
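The operation counts in the comments above (8 versus 5) can be checked with a small CPU-side model. This is an illustrative sketch only: the `vec4` type, the counter and the function names are made up for the example, and every scalar multiplication is simply counted as one operation.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } vec4;

/* Counts scalar multiplications, standing in for work on a scalar ALU. */
static int g_scalar_muls = 0;

static float mul(float a, float b) { ++g_scalar_muls; return a * b; }

/* Vector-by-scalar product done one component at a time: 4 multiplications. */
static vec4 vec4_scale(vec4 v, float s)
{
    vec4 r;
    r.x = mul(v.x, s);
    r.y = mul(v.y, s);
    r.z = mul(v.z, s);
    r.w = mul(v.w, s);
    return r;
}

/* "Bad" ordering: (v * s1) * s2, two vector-by-scalar products. */
static vec4 scale_twice(vec4 v, float s1, float s2)
{
    return vec4_scale(vec4_scale(v, s1), s2);
}

/* "Good" ordering: v * (s1 * s2), scalars combined first. */
static vec4 scale_once(vec4 v, float s1, float s2)
{
    return vec4_scale(v, mul(s1, s2));
}
```

Resetting `g_scalar_muls` before each call shows 8 multiplications for `scale_twice` and 5 for `scale_once`. The two orderings can differ in the last bit of floating point rounding, which rarely matters for this kind of shader math.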
  • 35.
  • 36.
    What you might know
    • Typical CPU found in iOS devices:
      1. ARMv7 architecture
      2. Cortex A8 / Cortex A9 / custom Apple cores
      3. 600 – 1300 MHz
      4. 1-2 cores
      5. Thumb-2 instruction set
  • 37.
    What you might not know
    • ARMv7 has no hardware support for integer division
    • VFPv3 FPU / VFPv4 on Apple A6 (rumored)
    • NEON SIMD engine
    • Unaligned access is done in software on Cortex A8. That means it is a hundred times slower
    • Cortex A8 is an in-order CPU. Cortex A9+ are out of order
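On a core without a hardware divider, `/` and `%` with a divisor that is not a compile-time constant are typically lowered to a runtime library routine. One common workaround, sketched below, is to keep sizes at powers of two so that a remainder becomes a bit mask. `wrap_index` is a hypothetical helper, not something from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* True when n has exactly one bit set. */
static int is_power_of_two(uint32_t n) { return n != 0 && (n & (n - 1)) == 0; }

/* Computes index % size without a division.
 * Valid only when size is a power of two. */
static uint32_t wrap_index(uint32_t index, uint32_t size)
{
    assert(is_power_of_two(size));
    return index & (size - 1);
}
```

This fits ring buffers, texture atlases and tile maps whose dimensions can be chosen freely; for arbitrary divisors, precomputing a reciprocal outside the hot loop is the usual alternative.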
  • 38.
    What you might not know
    • The Cortex A9 core has a full VFPv3 FPU, while the Cortex A8 has VFPLite. That means float operations take 1 cycle on the A9 and 10 cycles on the A8!
  • 39.
    What you might not know
    • NEON: 16 registers, 128 bits wide each. Supports operations on 8-, 16-, 32- and 64-bit integers and 32-bit float values
    • NEON can be used for:
      – Software geometry instancing;
      – Skinning on ES 1.1;
      – As a general vertex processor;
      – Other typical applications for SIMD.
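As an illustration of the software instancing use case, the sketch below adds one per-instance offset to a batch of packed xyzw vertices, four floats per instruction. The function name and data layout are assumptions made for this example; the scalar fallback is there only so the sketch also builds on machines without NEON.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Adds the same xyzw offset to `count` vertices stored as packed xyzw floats.
 * Hypothetical helper for software instancing: one mesh, many offsets. */
static void translate_vertices(float *xyzw, size_t count, const float offset[4])
{
    assert(xyzw != NULL && offset != NULL);
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    const float32x4_t off = vld1q_f32(offset);      /* load the 4 offset lanes once */
    for (size_t i = 0; i < count; ++i) {
        float32x4_t v = vld1q_f32(xyzw + 4 * i);    /* load one vertex */
        vst1q_f32(xyzw + 4 * i, vaddq_f32(v, off)); /* 4 adds in one instruction */
    }
#else
    /* Scalar fallback so the sketch also builds where NEON is unavailable. */
    for (size_t i = 0; i < count; ++i)
        for (size_t k = 0; k < 4; ++k)
            xyzw[4 * i + k] += offset[k];
#endif
}
```

Calling it once per instance on a copy of the base mesh and appending the result to a single vertex buffer lets all instances go out in one draw call.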
  • 40.
    What you might not know
    • The USSE1 architecture is scalar, NEON is vector by nature. Move your vertex processing from the GPU to the CPU to speed up calculations*
    • ???????
    • PROFIT!!!111
    • *NOTE: this doesn’t apply to USSE2 hardware
  • 41.
    What you might not know
    • The weakest side of mobile GPUs is fill rate. Fill rate is quickly killed by blending, and 2D games are heavy on this. The PowerVR USSE engine doesn’t care what it is doing, vertex or fragment processing. Moving your vertex processing to the CPU (NEON) will leave more room for fragment processing. It will have more effect on USSE1, the scalar hardware.
  • 42.
    What you might not know
    • There are 3 ways to use the NEON engine in your code:
      1. Intrinsics
         1.1 GLKMath
      2. Handwritten NEON assembly
      3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C Flags in project settings and you are ready to go.
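For the autovectorization route, the shape of the loop matters at least as much as the flags. Below is a sketch of the kind of loop vectorizers generally handle well: a simple counted loop over contiguous arrays, with `__restrict__` (the GCC/Clang spelling used elsewhere in these slides) promising that the arrays do not overlap. `scale_add` is a made-up example function, and whether a given compiler actually vectorizes it is not guaranteed.

```c
#include <assert.h>
#include <stddef.h>

/* dst[i] = a[i] * scale + b[i]
 * __restrict__ promises the arrays do not overlap, which removes the
 * aliasing question an autovectorizer would otherwise have to answer. */
static void scale_add(float *__restrict__ dst, const float *__restrict__ a,
                      const float *__restrict__ b, float scale, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = a[i] * scale + b[i];
}
```

Checking the generated assembly for NEON instructions such as vmla.f32 or vmul.f32 is the reliable way to see whether the loop was vectorized.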
  • 44.
    What you might not know
    • Intrinsics:
  • 45.
    What you might not know
    • Assembly:
  • 46.
    What you might not know
    • Summary:
                            Running time, ms    CPU usage, %
      Intrinsics            2764                19
      Assembly              3664                20
      FPU                   6209                25-28
      FPU autovectorized    5028                22-24
    • Intrinsics got me a 25% speedup over assembly. Let’s see the code!
    • Note that the speed of intrinsics code varies from compiler to compiler.
  • 47.
    What you might not know
    __attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat,
                                                      const float32x4_t* __restrict__ vec,
                                                      float32x4_t* __restrict__ result)
    {
        (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
    }
  • 48.
    What you might not know
    __attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                                                         const float32x4x4_t* __restrict__ m2,
                                                         float32x4x4_t* __restrict__ r)
    {
    #ifdef INTRINSICS
        (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
        (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
        (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
        (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
    #endif
    }
  • 49.
    What you might not know
    __asm__ volatile (
        "vldmia %6, { q0-q3 } \n\t"
        "vldmia %0, { q8-q11 }\n\t"
        "vmul.f32 q12, q8, d0[0]\n\t"
        "vmul.f32 q13, q8, d2[0]\n\t"
        "vmul.f32 q14, q8, d4[0]\n\t"
        "vmul.f32 q15, q8, d6[0]\n\t"
        "vmla.f32 q12, q9, d0[1]\n\t"
        "vmla.f32 q13, q9, d2[1]\n\t"
        "vmla.f32 q14, q9, d4[1]\n\t"
        "vmla.f32 q15, q9, d6[1]\n\t"
        "vmla.f32 q12, q10, d1[0]\n\t"
        "vmla.f32 q13, q10, d3[0]\n\t"
        "vmla.f32 q14, q10, d5[0]\n\t"
        "vmla.f32 q15, q10, d7[0]\n\t"
        "vmla.f32 q12, q11, d1[1]\n\t"
        "vmla.f32 q13, q11, d3[1]\n\t"
        "vmla.f32 q14, q11, d5[1]\n\t"
        "vmla.f32 q15, q11, d7[1]\n\t"
        "vldmia %1, { q0-q3 } \n\t"
        "vmul.f32 q8, q12, d0[0]\n\t"
        "vmul.f32 q9, q12, d2[0]\n\t"
        "vmul.f32 q10, q12, d4[0]\n\t"
        "vmul.f32 q11, q12, d6[0]\n\t"
        "vmla.f32 q8, q13, d0[1]\n\t"
        "vmla.f32 q8, q14, d1[0]\n\t"
        "vmla.f32 q8, q15, d1[1]\n\t"
        "vmla.f32 q9, q13, d2[1]\n\t"
        "vmla.f32 q9, q14, d3[0]\n\t"
        "vmla.f32 q9, q15, d3[1]\n\t"
        "vmla.f32 q10, q13, d4[1]\n\t"
        "vmla.f32 q10, q14, d5[0]\n\t"
        "vmla.f32 q10, q15, d5[1]\n\t"
        "vmla.f32 q11, q13, d6[1]\n\t"
        "vmla.f32 q11, q14, d7[0]\n\t"
        "vmla.f32 q11, q15, d7[1]\n\t"
        "vstmia %2, { q8 }\n\t"
        "vstmia %3, { q9 }\n\t"
        "vstmia %4, { q10 }\n\t"
        "vstmia %5, { q11 }"
        :
        : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
        : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
    );
  • 50.
    What you might not know
    • For a detailed explanation of intrinsics/assembly see: http://infocenter.arm.com/help/index.jsp? com.arm.doc.dui0491e/CIHJBEFE.html
  • 51.
