
Optimising games for mobiles
by Dmytro Vovk

Mobile GPU architectures
• There are 3 major mobile GPU architectures on the market:
– IMR (Immediate Mode Renderer)
– TBR (Tile Based Renderer)
– TBDR (Tile Based Deferred Renderer)

IMR
• Renders anything sent to the GPU immediately. It makes no assumption about what is going to be submitted next.
• The application has to sort opaque geometry front to back.
• It is basically brute force.
• Nvidia, AMD.

TBR
• Improves on IMR, but is still an IMR.
• Bandwidth is a precious resource on mobiles, and TBR tries to reduce data transfers as much as possible.
• Your geometry is split into tiles and then processed per tile. Each tile has a small amount of memory for the colour and depth/stencil buffers, so there is no need to transfer them from/to system memory.
• Qualcomm Adreno, ARM Mali

TBDR
• It is deferred, i.e. all the graphics are drawn at some later point.
• And this is where all the magic happens!
• The GPU is aware of context: it knows what is going to be drawn in the future, and this allows it to employ some awesome optimisations and to reduce power consumption, bandwidth and fillrate.
• Imagination PowerVR.

What you might know
• Batch, Batch, Batch! http://ce.u-sys.org/Veranstaltungen/Interaktive%20Computergraphik%20(Stamminger)/papers/BatchBatchBatch.pdf
• Render from one thread only
• Avoid synchronisations:
1. glFlush/glFinish;
2. Querying GL states;
3. Accessing render targets;

What you might know
• Pixel-perfect HSR (Hidden Surface Removal). Adreno and ARM do not feature this.
• You still need to sort transparent geometry!
• Avoid alpha test. Use alpha blending instead.

What you might not know
• HSR still requires vertices to be processed!
• …thus don’t forget to cull your geometry on the CPU!
• Prefer Stencil Test over Scissor.
– Stencil test is performed in hardware on PowerVR GPUs.
– Stencil mask is stored in fast on-chip memory.
– Stencil can be of any shape, in contrast to the rectangular Scissor.

What you might not know
• Why no alpha test?!
– Alpha test/discard requires the fragment shader to run before visibility of the current fragment can be determined. This removes the benefits of HSR.
– Even more! If shader code contains discard, then any geometry rendered with this shader will suffer from the alpha test drawbacks. Even if this keyword is under a condition, USSE (PVR’s shader engine) assumes that the condition may be hit.
– Move discard into a separate shader.
– Draw opaque geometry first, then alpha-tested geometry, and alpha-blended geometry at the end.
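The draw order recommended above (opaque, then alpha tested, then alpha blended) can be sketched as a CPU-side sort of the draw list. This is a minimal illustration, not the author's code: `RenderPass`, `DrawItem` and `sort_draw_items` are hypothetical names, and sorting blended items back to front is the usual convention for transparency rather than something the slides spell out.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pass tags, numbered in the order the passes should be drawn. */
typedef enum {
    PASS_OPAQUE = 0,
    PASS_ALPHA_TESTED = 1,
    PASS_ALPHA_BLENDED = 2
} RenderPass;

typedef struct {
    RenderPass pass;  /* group the draw call belongs to */
    float      depth; /* distance from the camera       */
    int        id;    /* illustrative mesh handle       */
} DrawItem;

/* Orders by pass first; blended items are then ordered farthest first. */
static int compare_draw_items(const void *a, const void *b)
{
    const DrawItem *x = (const DrawItem *)a;
    const DrawItem *y = (const DrawItem *)b;
    if (x->pass != y->pass)
        return (x->pass < y->pass) ? -1 : 1;
    if (x->pass == PASS_ALPHA_BLENDED && x->depth != y->depth)
        return (x->depth > y->depth) ? -1 : 1;
    return 0;
}

/* Sorts the draw list in place before it is submitted to the GPU. */
void sort_draw_items(DrawItem *items, size_t count)
{
    if (count == 0)
        return;
    assert(items != NULL);
    qsort(items, count, sizeof(DrawItem), compare_draw_items);
}
```

Opaque items are left unsorted within their pass here, on the assumption that HSR makes their order irrelevant on PowerVR; on an IMR you would add a front-to-back ordering for them.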

What you might know
• Bandwidth matters
1. Use a constant colour per object instead of per vertex
2. Simplify your models. Use smaller data types.
3. Use indexed triangles or non-indexed triangle strips
4. Use VBOs instead of client arrays
5. Use VAOs

What you might not know
• VBO allocations are aligned to the 4KB page size. That means your small buffer for just a couple of triangles will occupy 4KB in memory. A large number of small VBOs can fragment and waste your memory.
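The cost of page-aligned allocations is easy to estimate on the CPU. The sketch below assumes only the 4KB figure quoted above; the function names are made up for illustration and nothing here queries a real driver.

```c
#include <assert.h>
#include <stddef.h>

#define VBO_PAGE_BYTES 4096u /* page size quoted on the slide */

/* Bytes reserved for a buffer of `size` bytes once rounded up to whole pages. */
size_t vbo_footprint(size_t size)
{
    assert(size > 0);
    return (size + VBO_PAGE_BYTES - 1) / VBO_PAGE_BYTES * VBO_PAGE_BYTES;
}

/* Padding lost when `count` buffers of `size` bytes are allocated separately. */
size_t vbo_waste_separate(size_t size, size_t count)
{
    return (vbo_footprint(size) - size) * count;
}

/* Padding lost when the same data is packed into one shared buffer. */
size_t vbo_waste_packed(size_t size, size_t count)
{
    assert(count > 0);
    return vbo_footprint(size * count) - size * count;
}
```

For example, 100 buffers of 96 bytes each (two triangles with 16-byte vertices) lose 400000 bytes to padding when allocated separately, but only 2688 bytes when packed into a single buffer.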

What you might not know
• Updating your VBO data each frame:
1. glBufferSubData. If it is used to update a big part of the original data, it will harm performance. Try to avoid updating buffers that are currently in use.
2. glBufferData. It’s OK to completely overwrite the original data. The old data will be orphaned by the driver and new data storage will be allocated.
3. glMapBuffer with a triple-buffered VBO is the preferred way to update your data.
• EXT_map_buffer_range (iOS 6+ only), when you need to update only a subset of a buffer object.

What you might not know
    int bufferID = 0; // initialization
    for (int i = 0; i < 3; ++i) // allocate storage for 3 VBOs only, do not upload any data
    {
        glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[i]);
        // BUFFER_SIZE is a placeholder for the size in bytes of your vertex data
        glBufferData(GL_ARRAY_BUFFER, BUFFER_SIZE, NULL, GL_DYNAMIC_DRAW);
    }
    //...
    glBindBuffer(GL_ARRAY_BUFFER, vertexBuffer[bufferID]);
    void* ptr = glMapBufferOES(GL_ARRAY_BUFFER, GL_WRITE_ONLY_OES);
    // update data here
    glUnmapBufferOES(GL_ARRAY_BUFFER);
    ++bufferID;
    if (bufferID == 3) // cycling through 3 buffers
    {
        bufferID = 0;
    }

What you might not know
• This scheme will give you the best performance possible: no blocking of the CPU or GPU, no redundant memcpy operations and a lower CPU load, at the cost of extra memory (note that you need no extra temporary buffer to store your data before sending it to the VBO). This is ideal for dynamic batching of sprites.
    update(1), draw(1), gpuworking(..............)
    update(2), draw(2), gpuworking(..............)
    update(3), draw(3), gpuworking(..............)

What you might not know
• Float is the GPU’s native type
• …that means any other type will be converted to float by USSE
• …resulting in a few additional cycles
• Thus the tradeoff between bandwidth/storage and additional cycles is your choice

What you might know
• Use interleaved vertex data
– Align each vertex attribute on a 4-byte boundary

What you might not know
• If you don’t align your data, the driver will do it for you.
• …resulting in slower performance.
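One way to keep interleaved attributes on 4-byte boundaries is to let the compiler check the layout. The struct below is a hypothetical vertex format chosen for illustration (position, 16-bit texture coordinates, RGBA colour); only the 4-byte rule comes from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interleaved vertex: every attribute starts on a 4-byte boundary. */
typedef struct {
    float   position[3]; /* offset 0,  12 bytes */
    int16_t uv[2];       /* offset 12,  4 bytes */
    uint8_t color[4];    /* offset 16,  4 bytes (a 3-byte RGB colour would break the rule) */
} Vertex;

/* Compile-time checks: the build fails if the layout drifts. */
_Static_assert(offsetof(Vertex, uv) % 4 == 0, "uv must start on a 4-byte boundary");
_Static_assert(offsetof(Vertex, color) % 4 == 0, "color must start on a 4-byte boundary");
_Static_assert(sizeof(Vertex) % 4 == 0, "stride must be a multiple of 4 bytes");

/* Returns 1 when an offset or size sits on a 4-byte boundary. */
int is_4_byte_aligned(size_t bytes)
{
    return (bytes % 4u) == 0;
}

/* Size of an attribute after padding it up to the next 4-byte boundary. */
size_t padded_attribute_size(size_t size)
{
    assert(size > 0);
    return (size + 3u) / 4u * 4u;
}
```

With this layout `sizeof(Vertex)` is the stride you would pass to glVertexAttribPointer, and `offsetof` gives each attribute's offset.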

What you might not know
• PowerVR SGX 5XT series GPUs have a vertex cache for the last 12 vertex indices. Optimise your indexed geometry for this cache size.
• PowerVR Series 6 (XT) has 16k of vertex cache
• Take a look at optimisers that use Tom Forsyth’s algorithm: http://home.comcast.net/~tom_forsyth/papers/fast_vert_cache_opt.html
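To judge how well an index buffer suits a 12-entry cache, you can replay it through a small software model and count the misses. The sketch below assumes a simple FIFO replacement policy, which is an assumption for illustration and not a documented property of the hardware; `count_cache_misses` is a made-up name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CACHE_SIZE 64 /* upper bound for the modelled cache */

/* Counts the indices that miss a FIFO cache holding `cache_size` entries. */
size_t count_cache_misses(const uint16_t *indices, size_t index_count, size_t cache_size)
{
    uint16_t cache[MAX_CACHE_SIZE];
    size_t used = 0;   /* number of valid entries          */
    size_t next = 0;   /* slot that will be overwritten next */
    size_t misses = 0;

    assert(cache_size > 0 && cache_size <= MAX_CACHE_SIZE);

    for (size_t i = 0; i < index_count; ++i) {
        int hit = 0;
        for (size_t j = 0; j < used; ++j) {
            if (cache[j] == indices[i]) {
                hit = 1;
                break;
            }
        }
        if (hit)
            continue;
        ++misses; /* the vertex has to be transformed again */
        cache[next] = indices[i];
        next = (next + 1) % cache_size;
        if (used < cache_size)
            ++used;
    }
    return misses;
}
```

Dividing the miss count by the triangle count gives the average number of vertices transformed per triangle, which is the figure index optimisers try to minimise.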

What you might know
• Split your vertex data into two parts:
1. Static VBO: the one that will never be changed
2. Dynamic VBO: the one that needs to be updated frequently
• Split your vertex data into several VBOs when several meshes share the same set of attributes

What you might know
• Bandwidth matters
1. Use lower precision formats: RGBA4444, RGBA5551
2. Use PVRTC compressed textures
3. Use atlases
4. Use mipmaps. They improve texture cache efficiency and quality.

What you might not know
• Avoid the RGB8 format: texture data has to be aligned, so the driver will pad RGB8 to RGBA8.
• Try to replace it with RGB565
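Converting RGB8 pixel data to RGB565 before upload is a few shifts per pixel. The sketch below assumes the bit layout of GL_UNSIGNED_SHORT_5_6_5 (red in the top 5 bits) and truncates the low bits rather than rounding or dithering; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Packs one 8-bit-per-channel colour into 16 bits: RRRRRGGG GGGBBBBB. */
uint16_t pack_rgb565(uint8_t r, uint8_t g, uint8_t b)
{
    return (uint16_t)(((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3));
}

/* Converts a tightly packed RGB8 image (3 bytes per pixel) to RGB565. */
void convert_rgb8_to_rgb565(const uint8_t *src, uint16_t *dst, size_t pixel_count)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < pixel_count; ++i)
        dst[i] = pack_rgb565(src[3 * i], src[3 * i + 1], src[3 * i + 2]);
}
```

The converted buffer is half the size of the padded RGBA8 data and would be uploaded with format GL_RGB and type GL_UNSIGNED_SHORT_5_6_5.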

What you might not know
• Why PVRTC?
1. PVRTC provides great compression, resulting in smaller texture size, improved cache efficiency, saved bandwidth and decreased power consumption
2. PVRTC stores pixel data in the GPU’s native order, i.e. BGRA instead of RGBA, in blocks optimised for the data access pattern.

What you might not know
• It doesn’t matter whether your textures are in RGBA or BGRA format: the driver will still do internal processing on the texture data to improve memory access locality and cache efficiency.

What you might not know
• On PVR 6 (XT) the driver will reserve memory for both the texture and its mip map chain, but it will commit memory only for mip level 0.
• If you decide to generate mip maps, the driver will commit the pages reserved for the mip chain.
• That’s expected.

What you might not know
• On PVR 5/5MP (tested on iOS 4 to 7.1.1) the driver will ALWAYS commit memory for mip maps, regardless of whether you requested them or not.
• That means you’ll waste 33% of memory!
• In most cases you don’t need mip maps for 2D games, but you are forced to pay this overhead.
• That’s too bad for 2D games. However, there is one workaround: make your textures NPOT (non-power of two).

What you might not know
• Luckily, there is one solution to this problem.
• Core OpenGL ES 2.0 doesn’t support mip maps for NPoT (non-power of two) textures, so if you make your textures NPoT, you will not pay this memory overhead.
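The 33% figure for a forced mip chain follows from summing the sizes of all mip levels: each level has a quarter of the texels of the one above it, so the chain adds roughly 1/4 + 1/16 + … ≈ 1/3 on top of level 0. A small sketch that does the sum exactly (the function name is made up for illustration):

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes for an uncompressed texture plus its full mip chain down to 1x1. */
size_t mip_chain_bytes(size_t width, size_t height, size_t bytes_per_pixel)
{
    size_t total = 0;
    assert(width > 0 && height > 0 && bytes_per_pixel > 0);
    for (;;) {
        total += width * height * bytes_per_pixel;
        if (width == 1 && height == 1)
            break;
        if (width > 1)
            width /= 2;  /* each level halves both dimensions... */
        if (height > 1)
            height /= 2; /* ...until they reach 1 */
    }
    return total;
}
```

For a 1024x1024 RGBA8 texture, level 0 takes 4194304 bytes and the full chain takes 5592404 bytes, so the mip levels add 1398100 bytes, or about 33%.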

What you might not know
• Interesting notes:
– The glTexImage2D driver implementation has a function CheckFastPath. When you upload a PoT texture you hit this fast path. NPoT textures miss it.
– When you upload a lot of textures your VRAM gets fragmented, so the driver will remap memory, i.e. it will create one big buffer for several small textures and move them into that buffer.

What you might not know
• Let’s take a look at the texture upload process.
• The usual way to do this:
1. Load the texture into a temporary buffer in RAM
– Decode the texture if it is stored in a compressed file format (JPG/PNG)
2. Feed this buffer to glTexImage2D
3. Draw!
• Looks simple, but is it the fastest way?

What you might not know
• …NO!
    void* buf = malloc(TEXTURE_SIZE); // 4 MB for an RGBA8 1024x1024 texture
    LoadTexture(textureName, buf); // placeholder loader, assumed to fill buf with the pixel data
    glBindTexture(GL_TEXTURE_2D, textureID);
    // OpenGL ES 2.0 requires the internal format to match the format, so GL_RGBA rather than 4
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, buf);
    // buf is copied into an internal buffer created by the driver (that's obvious)
    free(buf); // because the buffer can be freed immediately after glTexImage2D
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
    // the driver will do some additional work to fully upload the texture the first time it is actually used!
• A lot of redundant work!

What you might not know
• Jedi way to upload textures:
    int fileHandle = open(filename, O_RDONLY);
    void* ptr = mmap(NULL, TEXTURE_SIZE, PROT_READ, MAP_PRIVATE, fileHandle, 0); // file mapping
    glBindTexture(GL_TEXTURE_2D, textureID);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA, 1024, 1024, 0, GL_RGBA, GL_UNSIGNED_BYTE, ptr);
    glDrawElements(GL_TRIANGLES, 6, GL_UNSIGNED_BYTE, 0);
    // the driver will do some additional work to fully upload the texture the first time it is actually used!
    munmap(ptr, TEXTURE_SIZE);
    close(fileHandle);
• File mapping does not copy your file data into RAM! It loads the file data page by page, as it is accessed.
• Thus we eliminated one redundant copy, dramatically decreased texture upload time and decreased memory fragmentation
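The file-mapping step above can be wrapped in a small helper that also handles failures. This is a sketch using only POSIX calls; `map_file` and `unmap_file` are hypothetical names, and the GL upload itself is left out so the helper stays independent of a rendering context.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Maps a whole file read-only. Returns NULL on failure; *size receives the length. */
const void *map_file(const char *path, size_t *size)
{
    struct stat st;
    void *ptr;
    int fd;

    assert(path != NULL && size != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }
    ptr = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    if (ptr == MAP_FAILED)
        return NULL;
    *size = (size_t)st.st_size;
    return ptr;
}

/* Releases a mapping obtained from map_file. */
void unmap_file(const void *ptr, size_t size)
{
    if (ptr != NULL)
        munmap((void *)ptr, size);
}
```

The returned pointer would be passed straight to glTexImage2D, and the mapping released once the texture has been drawn for the first time, as in the slide's snippet.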

What you might not know
• Keep in mind that textures are finally wired only when they are used for the first time. So draw them off screen immediately after glTexImage2D, otherwise it will take too long to render the first frame and it will be nearly impossible to track down the cause.

What you might not know
• NPOT textures work only with the GL_CLAMP_TO_EDGE wrap mode
• POT textures are preferable, they give you the best performance possible
• Use NPOT textures with dimensions that are multiples of 32 pixels for best performance
• The driver will pad the data of your NPOT texture to match the size of the closest POT values.
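If the driver pads NPOT textures up to the next power of two as described above, the resulting footprint can be estimated as follows. This is a sketch under that assumption only; the function names are illustrative and "closest POT" is read here as the next power of two that is not smaller than the dimension.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest power of two that is greater than or equal to v. */
size_t next_pot(size_t v)
{
    size_t p = 1;
    assert(v > 0);
    while (p < v)
        p *= 2;
    return p;
}

/* Bytes occupied by a w x h texture after both dimensions are padded to POT. */
size_t npot_padded_bytes(size_t w, size_t h, size_t bytes_per_pixel)
{
    return next_pot(w) * next_pot(h) * bytes_per_pixel;
}
```

A 640x480 RGBA8 texture holds 1228800 bytes of pixel data, but padded to 1024x512 it would occupy 2097152 bytes.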

What you might not know
• Prefer OES_texture_half_float to OES_texture_float
• Texture reads fetch only 32 bits per texel, thus an RGBA float texture will result in 4 texture reads

What you might not know
• Always use glClear at the beginning of the frame…
• …and EXT_discard_framebuffer at the end.
• PVR GPUs have a fast on-chip depth/stencil buffer for each tile. If you forget to clear/discard the depth buffer, it will be uploaded from HW to SW

What you might know
• Prefer multi-texturing to multiple passes
• Configure texture parameters before feeding image data to the driver

What you might know
• Be wise with precision hints
• Avoid branching
• Eliminate loops
• Do not use discard. If you must, place the discard instruction as early as possible to avoid useless computations

What you might not know
• Code inside a dynamic branch (one whose condition is a non-constant value) will be executed anyway, and then its result will be discarded if the condition is false

What you might not know
• highp: represents a 32-bit floating point value
• mediump: represents a 16-bit floating point value in the range [-65520, 65520]
• lowp: 10-bit fixed point values in the range [-2, 2] with a step of 1/256
• Try to give the same precision to all your operands, because conversion takes some time
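The lowp numbers fit together: a range of width 4 divided into steps of 1/256 gives 4 × 256 = 1024 values, i.e. 10 bits. To get a feel for what survives in lowp, the quantisation can be emulated on the CPU. The sketch below assumes round-to-nearest and clamping to the range, which is an assumption for illustration and not a statement about how the hardware rounds; `quantize_lowp` is a made-up name.

```c
#include <assert.h>

/* Emulates lowp storage: clamp to [-2, 2], then snap to the nearest multiple of 1/256. */
float quantize_lowp(float x)
{
    int steps;
    assert(x == x); /* reject NaN */
    if (x < -2.0f)
        x = -2.0f;
    if (x > 2.0f)
        x = 2.0f;
    /* Round half away from zero to a whole number of 1/256 steps. */
    steps = (int)(x * 256.0f + (x >= 0.0f ? 0.5f : -0.5f));
    return (float)steps / 256.0f;
}
```

Values such as colours in [0, 1] pass through with at most 1/512 of error, while anything outside [-2, 2] (most positions and many texture coordinates) is clamped, which is why lowp suits colours better than coordinates.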

What you might not know
• highp values are calculated on a scalar processor only on USSE1 (that’s PVR 5):
    highp vec4 v1, v2;
    highp float s1, s2;
    v2 = (v1 * s1) * s2;
    // the scalar processor executes v1 * s1 (4 operations), and then this result is
    // multiplied by s2 on the scalar processor again (4 additional operations)
    v2 = v1 * (s1 * s2);
    // s1 * s2 is 1 operation on the scalar processor; result * v1 is 4 operations on the scalar processor

What you might know
• Typical CPU found in mobile devices:
1. ARMv7/ARMv8 architecture
2. Cortex AX/Krait/Swift or Cyclone
3. Up to 2300 MHz
4. Up to 8 cores
5. Thumb-2 instruction set

What you might not know
• ARMv7 has no hardware support for integer division
• VFPv3, VFPv4 FPU
• NEON SIMD engine
• Unaligned access is done in software on Cortex A8. That means it is a hundred times slower
• Cortex A8 is an in-order CPU. Cortex A9+ are out of order

What you might not know
• A Cortex A9+ core has a full VFPv3 FPU, while Cortex A8 has a VFPLite. That means that float operations take 1 cycle on A9 and 10 cycles on A8!

What you might not know
• NEON: 16 registers, 128 bits wide each. Supports operations on 8, 16, 32 and 64-bit integers and 32-bit float values
• NEON can be used for:
– Software geometry instancing;
– Skinning;
– As a general vertex processor;
– Other typical applications for SIMD.

What you might not know
• There are 3 ways to use the NEON engine in your code:
1. Intrinsics
1.1 GLKMath
2. Handwritten NEON assembly
3. Autovectorization. Add -mllvm -vectorize -mllvm -bb-vectorize-aligned-only to Other C/C++ Flags in project settings and you are ready to go.

What you might not know
• Intrinsics:

What you might not know
• Assembly:

What you might not know
• Summary:
                        Running time, ms    CPU usage, %
    Intrinsics          2764                19
    Assembly            3664                20
    FPU                 6209                25-28
    FPU autovectorized  5028                22-24
• Intrinsics got me a 25% speedup over assembly.
• Note that the speed of code generated from intrinsics will vary from compiler to compiler. Modern compilers are really good at this.

What you might not know
• Intrinsics advantages over assembly:
– Higher level code;
– Much simpler;
– No need to manage registers;
– You can vectorize basic blocks and build a solution for every new problem from these blocks. With assembly, in contrast, you have to solve each new problem from scratch;

What you might not know
• Assembly advantages over intrinsics:
– Code generated from intrinsics varies from compiler to compiler and can give you a really big difference in speed. Assembly code will always be the same.

What you might not know
    __attribute__((always_inline)) void Matrix4ByVec4(const float32x4x4_t* __restrict__ mat,
                                                      const float32x4_t* __restrict__ vec,
                                                      float32x4_t* __restrict__ result)
    {
        (*result) = vmulq_n_f32((*mat).val[0], (*vec)[0]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[1], (*vec)[1]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[2], (*vec)[2]);
        (*result) = vmlaq_n_f32((*result), (*mat).val[3], (*vec)[3]);
    }
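When porting a routine like Matrix4ByVec4 to NEON it helps to keep a plain scalar version around to check results against. The sketch below is such a reference, written under the assumption that the matrix is stored as four columns of four floats, which is the layout the intrinsics version implies (each `val[i]` is scaled by one component of the vector). The name `matrix4_by_vec4_ref` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar reference: result = mat * vec, with mat stored column by column.
   result must not overlap mat or vec. */
void matrix4_by_vec4_ref(const float mat[16], const float vec[4], float result[4])
{
    assert(mat != NULL && vec != NULL && result != NULL);
    for (int row = 0; row < 4; ++row) {
        /* Same order as the NEON version: start with column 0, then accumulate. */
        result[row] = mat[0 * 4 + row] * vec[0];
        for (int col = 1; col < 4; ++col)
            result[row] += mat[col * 4 + row] * vec[col];
    }
}
```

Running both versions over the same inputs and comparing the outputs within a small tolerance is a cheap way to catch lane or operand ordering mistakes.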

What you might not know
    __attribute__((always_inline)) void Matrix4ByMatrix4(const float32x4x4_t* __restrict__ m1,
                                                         const float32x4x4_t* __restrict__ m2,
                                                         float32x4x4_t* __restrict__ r)
    {
    #ifdef INTRINSICS
        // first column of m1 scaled by lane 0 of each column of m2
        (*r).val[0] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[0], 0));
        (*r).val[1] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[1], 0));
        (*r).val[2] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[2], 0));
        (*r).val[3] = vmulq_n_f32((*m1).val[0], vgetq_lane_f32((*m2).val[3], 0));
        // accumulate the remaining three columns of m1
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[1], vgetq_lane_f32((*m2).val[0], 1));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[1], vgetq_lane_f32((*m2).val[1], 1));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[1], vgetq_lane_f32((*m2).val[2], 1));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[1], vgetq_lane_f32((*m2).val[3], 1));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[2], vgetq_lane_f32((*m2).val[0], 2));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[2], vgetq_lane_f32((*m2).val[1], 2));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[2], vgetq_lane_f32((*m2).val[2], 2));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[2], vgetq_lane_f32((*m2).val[3], 2));
        (*r).val[0] = vmlaq_n_f32((*r).val[0], (*m1).val[3], vgetq_lane_f32((*m2).val[0], 3));
        (*r).val[1] = vmlaq_n_f32((*r).val[1], (*m1).val[3], vgetq_lane_f32((*m2).val[1], 3));
        (*r).val[2] = vmlaq_n_f32((*r).val[2], (*m1).val[3], vgetq_lane_f32((*m2).val[2], 3));
        (*r).val[3] = vmlaq_n_f32((*r).val[3], (*m1).val[3], vgetq_lane_f32((*m2).val[3], 3));
    #endif
    }

What you might not know
    __asm__ volatile
    (
        "vldmia %6, { q0-q3 } \n\t"
        "vldmia %0, { q8-q11 }\n\t"
        "vmul.f32 q12, q8, d0[0]\n\t"
        "vmul.f32 q13, q8, d2[0]\n\t"
        "vmul.f32 q14, q8, d4[0]\n\t"
        "vmul.f32 q15, q8, d6[0]\n\t"
        "vmla.f32 q12, q9, d0[1]\n\t"
        "vmla.f32 q13, q9, d2[1]\n\t"
        "vmla.f32 q14, q9, d4[1]\n\t"
        "vmla.f32 q15, q9, d6[1]\n\t"
        "vmla.f32 q12, q10, d1[0]\n\t"
        "vmla.f32 q13, q10, d3[0]\n\t"
        "vmla.f32 q14, q10, d5[0]\n\t"
        "vmla.f32 q15, q10, d7[0]\n\t"
        "vmla.f32 q12, q11, d1[1]\n\t"
        "vmla.f32 q13, q11, d3[1]\n\t"
        "vmla.f32 q14, q11, d5[1]\n\t"
        "vmla.f32 q15, q11, d7[1]\n\t"
        "vldmia %1, { q0-q3 } \n\t"
        "vmul.f32 q8, q12, d0[0]\n\t"
        "vmul.f32 q9, q12, d2[0]\n\t"
        "vmul.f32 q10, q12, d4[0]\n\t"
        "vmul.f32 q11, q12, d6[0]\n\t"
        "vmla.f32 q8, q13, d0[1]\n\t"
        "vmla.f32 q8, q14, d1[0]\n\t"
        "vmla.f32 q8, q15, d1[1]\n\t"
        "vmla.f32 q9, q13, d2[1]\n\t"
        "vmla.f32 q9, q14, d3[0]\n\t"
        "vmla.f32 q9, q15, d3[1]\n\t"
        "vmla.f32 q10, q13, d4[1]\n\t"
        "vmla.f32 q10, q14, d5[0]\n\t"
        "vmla.f32 q10, q15, d5[1]\n\t"
        "vmla.f32 q11, q13, d6[1]\n\t"
        "vmla.f32 q11, q14, d7[0]\n\t"
        "vmla.f32 q11, q15, d7[1]\n\t"
        "vstmia %2, { q8 }\n\t"
        "vstmia %3, { q9 }\n\t"
        "vstmia %4, { q10 }\n\t"
        "vstmia %5, { q11 }"
        :
        : "r" (proj), "r" (squareVertices), "r" (v1), "r" (v2), "r" (v3), "r" (v4), "r" (modelView)
        : "memory", "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q11", "q12", "q13", "q14", "q15"
    );

What you might not know
• For a detailed explanation of intrinsics/assembly see: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0491e/CIHJBEFE.html
