UPDATE: This article was written for Quest 1 developers. While Quest 2 is built on the same architecture, it is now much easier to budget for shadow maps (and other simple techniques that require reading from a texture generated in a prior render pass).
I’m Trevor Dasch, Developer Relations Engineer at Oculus. I work with many developers who are building games for the Oculus Quest. My job is to help these developers ship high quality games, and to make sure they run at a solid 72 frames per second. We see a lot of developers with a background in PC game development approach their Quest games the same way they do on PC, which can make it much harder to optimize for maximum performance. So, we thought we’d provide a list of rendering techniques that simply don’t help you on your journey to make framerate on a mobile chipset.
While most of the techniques outlined below are technically possible on a mobile chipset, we strongly suggest not using them. This isn’t always a hard-and-fast rule; I’ve seen developers do most of them and still make framerate. But you will save yourself a lot of pain by avoiding everything on this list.
Deferred rendering
Deferred rendering (or deferred shading) is a technique where the data needed to compute scene lighting is first rendered into a set of buffers (e.g., diffuse color, normals, roughness, etc.), and a second pass then uses those buffers as input to do the actual lighting calculations. This technique is effective on PC because it decouples geometry from lighting: you only need to update the pixels touched by each light, which allows many more lights to be rendered in fewer GPU cycles.
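To make the buffer count concrete, here is a minimal sketch of the geometry pass in GLSL ES, assuming a deliberately tiny G-buffer (the attribute, uniform, and output names are hypothetical; real G-buffers usually pack more data):

    #version 300 es
    precision mediump float;
    // Pass 1 (geometry pass): write surface data into multiple render targets.
    in vec3 vNormal;
    in vec2 vUV;
    uniform sampler2D uAlbedoTex;
    layout(location = 0) out vec4 gAlbedo;       // render target 0: diffuse color
    layout(location = 1) out vec4 gNormalRough;  // render target 1: normal + roughness
    void main() {
        gAlbedo = texture(uAlbedoTex, vUV);
        gNormalRough = vec4(normalize(vNormal) * 0.5 + 0.5, 0.5);
    }
    // Pass 2 (lighting pass): a full-screen shader samples gAlbedo, gNormalRough and
    // the depth buffer as textures and accumulates the lighting per pixel.

On a tiled GPU, every one of those intermediate render targets has to be written out to main memory before the lighting pass can read it, which is exactly the resolve cost discussed below.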
Why this doesn’t work on mobile:
There are a number of reasons this doesn’t work on mobile, but the main one is resolve cost. What is resolve cost? Before I can explain that, you first need to understand how tiled GPUs work.
In order to achieve higher throughput with much lower power utilization, mobile GPUs (such as the Snapdragon 835 used in Oculus Quest) often use a tiled architecture, where every render target is broken up into a grid of chunks, or ‘tiles’ (anywhere from 16x16 pixels to 256x256 pixels depending on hardware and pixel format). Your geometry is first ‘binned’ into these chunks (by running the vertex shader and placing each primitive ID into a list for its bin), and then submitted to asynchronous processors, which perform the rendering work to compute the image result for each tile (pulling only the primitives in the list generated by the binning process). Once each tile image is computed, the GPU must copy that chunk from its on-chip tile memory (GMEM) back out to general system memory. This is actually pretty slow, since it requires transferring data over a bus. We refer to this transfer as the ‘resolve’, and the time it takes is therefore the ‘resolve cost’. Wikipedia has a more detailed description of tiled rendering for further reading.
Because every texture you render to requires a resolve (back-of-napkin math puts it at about half a millisecond for each eye buffer on Oculus Quest), and deferred rendering requires a number of textures to be rendered before the lighting is computed, you’ve bumped your resolve cost from ~1ms for forward rendering to 3ms or more. As you can see, you probably don’t have time for that.
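To put rough numbers on it, assuming the ~0.5ms per-eye-buffer resolve above and a deliberately small G-buffer of just two extra targets:

    Forward:  2 eyes x (1 color target)                    x ~0.5ms ≈ 1ms
    Deferred: 2 eyes x (2 G-buffer targets + 1 lit target) x ~0.5ms ≈ 3ms

Every additional G-buffer attachment costs roughly another millisecond across both eyes, so a fatter G-buffer only makes this worse.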
Besides resolve cost, deferred rendering is only an advantage if your geometry is complex and you have many lights, neither of which is really achievable on mobile anyway, because the GPU has only limited power to both push a large number of vertices and fill pixels.
The answer for the time being is to stick to forward rendering. There may also be a place for a good forward+ implementation, though I haven’t seen one yet.
Depth pre-pass
A depth/Z pre-pass is a common technique where all of your scene geometry is rendered as a first step without filling in your frame buffer, producing only the depth buffer value. You then render a second pass, which checks if the calculated depth for each pixel is equal to the value for that pixel in the depth buffer. If it isn’t, you can skip shading that fragment. Since processing vertices tends to be much faster than shading pixels on PC, this can save a lot of time.
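In rough GLES terms the technique looks like this (a sketch, with the relevant state changes noted in comments):

    #version 300 es
    // Pass 1: depth-only pre-pass. There are no color outputs, so only the depth
    // buffer is written; the depth value comes straight from rasterization.
    void main() { }

    // Pass 2: submit the same geometry again with the full shading fragment shader,
    // with depth writes off and the depth test set to "equal"
    // (glDepthMask(GL_FALSE); glDepthFunc(GL_EQUAL); on the API side), so only the
    // front-most fragment at each pixel actually gets shaded.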
Here’s why you shouldn’t do this on mobile:
First of all, the amount of time you save on fragment fill by doing a depth pre-pass should be minimal if you sort your geometry before submitting your draw calls. Drawing front to back lets the regular depth test reject your pixels anyway, so you’d only avoid the pixel fill work for geometry that wasn’t sorted correctly, or for objects that mutually overlap (each in front of the other at different pixels).
Second, it requires doubling your draw calls, since everything has to be submitted first for the depth pre-pass and then again for the forward pass. Since draw calls are quite heavy on the CPU, this is something you will want to avoid.
Third, all the vertices need to be processed twice, which will usually add more GPU time than you save by avoiding filling a few pixels twice. This is because vertex processing is relatively more time-intensive on mobile than on PC, and fragment processing relatively less so (since the framebuffer is usually smaller and the fragment programs tend to be less complex).
HDR textures
Resolve cost is directly related to the number of bytes in your image, not the number of pixels. While we usually think in terms of 32-bit RGBA pixels, these days a lot of developers are using HDR textures, which are 64 bits per pixel for half-precision values. This doubles your resolve cost, and since the display only supports 8 bits per channel, you would be wasting a lot of time resolving HDR textures. Not to mention that a mobile GPU is optimized for 32-bit framebuffers; rendering to anything else will take more time.
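For a rough sense of scale, assuming an eye buffer around the Quest default of 1216x1344:

    RGBA8   (32 bits per pixel): 1216 x 1344 x 4 bytes ≈ 6.5 MB resolved per eye
    RGBA16F (64 bits per pixel): 1216 x 1344 x 8 bytes ≈ 13 MB resolved per eye

Twice the bytes over the bus means roughly twice the resolve cost, and the extra precision is thrown away anyway when the image reaches the 8-bit-per-channel display.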
Post-processing
Post-processing is a technique often used to apply a number of effects to games, such as color grading, bloom and motion blur. It works by taking the output of your game’s rendering and running a full-screen pass over that image to produce a new image, which is then presented to the player. Some post-process effects are performed as a single extra pass (e.g., color grading), while others require multiple passes (blur often requires producing several downsampled images).
The main problem with post-processing on mobile is, once again, resolve cost. Producing a second image causes another resolve, which immediately takes about 1ms out of your GPU budget. That’s before you count the time it takes to compute the post-processing effect itself, which can be quite resource intensive depending on the effect. It’s better to avoid post-processing altogether.
Here are some alternatives for common post-processing effects:
Color grading
Instead of performing color grading as a post-process, add a function call to the end of every fragment shader to perform the same math. This will produce the same visual result without the extra resolve.
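As a sketch of what that can look like, here is a fragment shader that applies a simple lift/gain/gamma grade as a stand-in for whatever math your post-process was doing (uLift, uGain, uGamma and the albedo sample are placeholders for your own shading code; a 3D LUT lookup would work here just as well):

    #version 300 es
    precision mediump float;
    in vec2 vUV;
    uniform sampler2D uAlbedo;
    uniform vec3 uLift;   // hypothetical grading parameters
    uniform vec3 uGain;
    uniform vec3 uGamma;
    out vec4 fragColor;

    vec3 colorGrade(vec3 c) {
        c = uGain * (c + uLift);                 // lift and gain
        return pow(max(c, vec3(0.0)), uGamma);   // per-channel gamma
    }

    void main() {
        vec3 lit = texture(uAlbedo, vUV).rgb;    // stand-in for your existing lighting result
        fragColor = vec4(colorGrade(lit), 1.0);  // grade inline, no extra pass or resolve
    }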
Bloom
True bloom is going to be extremely time consuming on your GPU. Your best bet is to fake it. Using billboarding sprites with a blob texture can produce something that looks pretty close to the real thing.
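For example, a minimal camera-facing billboard vertex shader for a glow sprite might look like the sketch below (attribute and uniform names are hypothetical); pair it with a soft radial-falloff ‘blob’ texture and additive blending:

    #version 300 es
    layout(location = 0) in vec3 aCenterWorld;  // world-space center of the glow
    layout(location = 1) in vec2 aCorner;       // quad corner in [-1, 1]
    uniform mat4 uView;
    uniform mat4 uProj;
    uniform float uSize;                        // half-size of the sprite in view units
    out vec2 vUV;

    void main() {
        vec4 centerView = uView * vec4(aCenterWorld, 1.0);
        centerView.xy += aCorner * uSize;       // offset in view space, so the quad always faces the camera
        vUV = aCorner * 0.5 + 0.5;
        gl_Position = uProj * centerView;
    }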
Realtime shadows
I would consider this the most controversial item on this list. Many apps have successfully shipped on mobile with full realtime shadows. However, doing so comes with significant trade-offs that, in my opinion, are worth avoiding.
A common technique for realtime shadows (and Unity’s default) is cascaded shadow maps, which means your scene is rendered multiple times into different regions of the shadow map, one per cascade. This adds 1-4x the number of times your geometry must be processed by the GPU, which inherently limits the number of vertices your scene can support. It also adds the resolve cost of the shadow map texture, which is proportional to the size of that texture.

At the other end of the GPU pipeline, you have two options when sampling your shadow maps: hard shadows and soft shadows. Hard shadows are quicker to render, but they have an unavoidable aliasing problem. Because of the way shadow maps work (testing the depth of the pixel against the depth stored in your shadow map), only a binary result can come from the test: in shadow or not in shadow. You can’t bilinearly sample your shadow map, because it represents a depth value, not a color value. Soft shadows should be avoided, because they require multiple samples into the shadow map, which of course is slow.

Your best bet is to bake all the shadows that you can, and if you need realtime shadows, figure out a different technique. Blob shadows are generally acceptable if your lighting is mostly diffuse. Geometry shadows can also work quite well if you need hard lighting and your shadow surface is a flat plane.
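For reference, the hard-shadow lookup described above boils down to a single depth comparison, something like this sketch (uBias and the varying names are placeholders):

    #version 300 es
    precision highp float;
    in vec4 vShadowCoord;                // fragment position in the light's clip space
    uniform highp sampler2D uShadowMap;  // depth rendered from the light's point of view
    uniform float uBias;                 // small offset to fight shadow acne
    out vec4 fragColor;

    void main() {
        vec3 coord = vShadowCoord.xyz / vShadowCoord.w;    // light-space NDC
        coord = coord * 0.5 + 0.5;                         // remap to [0, 1]
        float closestDepth = texture(uShadowMap, coord.xy).r;
        // The comparison can only ever produce 0 or 1, which is where the aliasing
        // comes from; softening it means taking more shadow map samples.
        float lit = (coord.z - uBias) > closestDepth ? 0.0 : 1.0;
        fragColor = vec4(vec3(lit), 1.0);                  // stand-in: modulate your lighting by 'lit'
    }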
Depth (and framebuffer) sampling
On PC, it’s possible to sample the current depth texture in your shaders (Unity exposes this as _CameraDepthTexture). This works because the depth texture is just another texture on PC, and since each draw call happens one after the other, the state of the depth texture is simply its state after the last draw call. With a tiled renderer, however, the current depth isn’t in a texture at all; it only exists in tile memory, so it can’t be sampled as a normal texture.
With the above in mind, there are actually GLES extensions that will let you query the current state of the depth buffer (and the framebuffer). The problem is that they are very slow, they only let you sample the value for the same pixel (so you can’t query nearby pixels), and they bring their own set of problems when MSAA is enabled (which it always should be for VR apps!).
When MSAA is enabled, your tile actually has a buffer large enough to hold all of your samples (i.e., 2x the pixels for 2xMSAA, 4x for 4xMSAA). This means that by default, if you sample the depth buffer, your fragment shader has to execute on a per-sample basis, which makes it 2x or 4x more time-intensive than you would expect. There is a way to ‘fix’ this, which is to call glDisable(FETCH_PER_SAMPLE_ARM). The problem is that it will then only retrieve the value for the first sample, not the result of blending the samples, which means MSAA is functionally disabled while this workaround is in effect.
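In shader terms, the depth flavor of these extensions is typically used along these lines (a sketch assuming GL_ARM_shader_framebuffer_fetch_depth_stencil is available; the fade logic is just an illustrative soft-particle-style use):

    #version 300 es
    #extension GL_ARM_shader_framebuffer_fetch_depth_stencil : require
    precision highp float;
    uniform float uFadeScale;      // hypothetical fade factor for the effect
    out vec4 fragColor;

    void main() {
        // gl_LastFragDepthARM is the current depth buffer value for *this* pixel only;
        // neighboring pixels cannot be read, and with MSAA on this shader runs per
        // sample unless fetch-per-sample is disabled on the API side (which sacrifices MSAA).
        float sceneDepth = gl_LastFragDepthARM;
        float fade = clamp((sceneDepth - gl_FragCoord.z) * uFadeScale, 0.0, 1.0);
        fragColor = vec4(1.0, 1.0, 1.0, fade);  // fade the fragment as it nears the scene geometry
    }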
Unless it’s absolutely necessary, avoiding these will be a win for your frame time.
Geometry shaders
Geometry shaders allow you to generate extra vertices at runtime, which is useful for things like dynamic tessellation. However, there is a problem with geometry shaders on tiled GPUs: generating extra vertices at runtime breaks the binning process, so the GPU can’t tile the work and instead switches to ‘immediate’ mode, skipping the tiling process completely. As you can guess, this is very slow (tiled rendering was invented for a reason). It is therefore best to avoid geometry shaders and instead generate your vertices on the CPU if necessary.
Mirrors/Portals*
*If you do them the naive way. And by the naive way, I mean allocating two eye buffer-sized textures, calculating the reflection matrix, and rendering the scene into both of them. Your mirror geometry would then do a screen-space texture sample to display the reflection. There are a number of obvious flaws with this approach:
- You’ve just tripled your draw calls.
- You’re filling way more pixels than will be visible on screen.
- You have to resolve an additional two textures.
The minimum improvement I’ve found is to constrain the viewport of your mirror cameras, and adjust the corresponding projection matrix, so that you only render the screen-space bounding box of the mirror quad within your view frustum. This helps a bit with number 2. Ideally, you would also use multiview to render both left and right eyes with a single set of draw calls; however, this currently isn’t supported in Unity, it doesn’t solve problem 3, and it makes number 2 worse, because you can only use a single viewport for both eyes, so you have to cover the combined bounding box of the mirror in both eyes (and because of the asymmetric FOV, this will be most of the width of the eye buffer for a mirror that is roughly centered). The ideal solution would therefore solve number 3 first, which means drawing both your mirror scene and your non-mirror scene in a single pass.
There is a solution that takes advantage of modified shaders and the stencil buffer. Every material in your scene would have two versions of your shader, one that only draws if a certain bit in your stencil buffer is 0, and one that only draws if it’s 1. Then what you would do is draw the mirror mesh with a material that sets that bit in your stencil buffer, draw your scene using the first set of shaders, set up your camera with the reflection matrices, and finally, draw the scene with your second set of shaders. This will produce the reflection you’re looking for without filling any more pixels than necessary and avoiding an unnecessary resolve. What it won’t do is avoid drawing a bunch of objects twice (which is unavoidable with any solution).
While this sounds easy enough, there are a number of problems if you are using Unity (I haven’t tried this in Unreal, but you’ll probably face similar challenges). First of all, Unity doesn’t let you modify the projection matrix of cameras when Single Pass Stereo (multiview) is enabled (and you most definitely should have it enabled if you’re at all concerned with CPU performance), so you can’t use reflection cameras. Secondly, this doesn’t take into account late-latching, a technique where the camera matrix is updated when the render thread starts (after the main thread completes) in order to reduce latency as much as possible. Normally this is a pure win, but if you have a mirror camera, the transform of your reflection camera no longer matches the transform of your head, so you will get strange artifacts where things in the mirror don’t line up as you would expect.
The actual simplest solution to mirrors is to fake it. If your mirror is static, just create a reflected copy of all of your world geometry, and put it in your scene. You’ll need to have scripts to move around the ‘reflected’ copy of any dynamic objects to mimic the ‘real’ version’s position, including the player, but it’s going to be the fastest and least complex rendering solution (no tricky matrix math, no futzing with multiple cameras, etc.). You will still have to use two different sets of shaders with different stencil masks if it’s possible to look behind the mirror, but if the player can’t see behind it because of a wall or something, you could just keep one set of shaders for both worlds and literally just punch a hole in the wall. Throw up a transparent ‘dirty glass’ sprite and none shall be the wiser.
Conclusion
Whether you are starting a new project from scratch, or undertaking a port from PC to mobile, knowing what the device can do well and what it can’t will be key to getting the best looking game possible while still maintaining full framerate. However, you don’t have to follow my advice to the letter. Find your vision, and make compromises that work for you.
My final piece of advice (and the best I can give) is to measure, measure, measure as you go. When you’re adding features that might be too expensive, or removing effects you think will save significant frame time, it’s incredibly useful to determine the actual impact of your changes. Maybe that new shader isn’t as GPU-heavy as you expected, or the new particle effect is causing a big drop in framerate. Having that information will help you make the important decisions. As we all learned from G.I. Joe: “Knowing is half the battle!”
We hope you found Trevor’s overview helpful. If you’re looking for more performance-focused learnings, check out his other articles below: