There are many different measurements people refer to when talking about latency: the time from a button press until the code detects it, the time from when a frame is rendered until it appears on screen, and so on. For the purposes of this document, the measurement we want to focus on is the time from when predicted tracking is sampled by our game logic until the frame rendered with that gameplay state is displayed on the screen.
For a traditional game, processing a frame starts by sampling input, performing all logic updates (physics, score calculations, etc.), rendering all objects to the frame, and then swapping the back buffer with the front buffer to display the new frame on the screen. Games designed for conventional displays may try to stick to a steady framerate (e.g., 30fps, 60fps), but a missed frame here and there can usually go unnoticed, since the camera position and rotation in game are decoupled from the display's position and rotation in real life. With VR, missed frames have serious consequences for user comfort: any time the rendered world doesn't match the real world, it ruins the experience. Therefore, we have a system called
Asynchronous TimeWarp that takes the most recently rendered frame and modifies it just before display so that the eye view is as close as possible to the corresponding real-world orientation. This means our render pipeline is just a little different. The first part of the frame update loop is the same: input is queried, game logic is updated, and the scene is rendered. However, instead of swapping to the screen, the frame is submitted to TimeWarp along with the view pose used when rendering it, so that it can be modified at the very last moment to match an updated view pose. The real magic of TimeWarp is what happens when you miss a frame. Instead of keeping the display locked to the last thing shown, TimeWarp takes the previous frame and applies the same view-pose update logic to it, so even if the state of the world is out of date, your view of it should feel affixed to the real world.
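To make the pipeline concrete, here is a minimal sketch of that submission flow. Every type and function here is a hypothetical stand-in for whatever your engine or the Oculus SDK actually provides; the point is the shape of the loop and the fact that the render pose travels with the frame:

```cpp
#include <cstdint>

// Minimal stand-in types; a real SDK provides richer equivalents.
struct Pose  { float position[3]; float orientation[4]; };
struct Frame { std::uint32_t textureId; };

// Hypothetical stubs for the steps described above.
Pose  SamplePredictedTracking(double displayTimeSec) { return Pose{}; }
void  UpdateGameLogic(const Pose&)                   {}
Frame RenderScene(const Pose&)                       { return Frame{}; }

// TimeWarp keeps the pose the frame was rendered with, so it can
// re-project the image to the newest head pose right before scan-out --
// and can re-warp this same frame again if the next one is late.
void  SubmitFrameToTimeWarp(const Frame&, const Pose& renderPose) {}

int main() {
    double predictedDisplayTimeSec = 0.0; // when this frame should hit the screen
    for (int i = 0; i < 3; ++i) {         // stand-in for the app's frame loop
        Pose  renderPose = SamplePredictedTracking(predictedDisplayTimeSec);
        UpdateGameLogic(renderPose);              // physics, score, etc.
        Frame frame = RenderScene(renderPose);
        SubmitFrameToTimeWarp(frame, renderPose); // pose travels with the frame
    }
}
```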
VSync, or Vertical Sync (a holdover from the days when the scan direction of the screen mattered; today it could more accurately be called “Frame Sync”), is a system that game engines have used for many years to match the physical display's refresh rate. For VR, since TimeWarp does the actual drawing to the display, true VSync is handled by TimeWarp. TimeWarp takes a fixed amount of time each frame, so by calculating back from that point, we can define something we'll call Virtual VSync, which all game processing can be based around.
Take a look at the above timeline, which ignores game processing for the time being. You can see that every frame, TimeWarp takes a small amount of CPU time, then a length of GPU time in the run-up to true VSync. Therefore, the game must make its frames available by V-VSync (Virtual VSync) for TimeWarp to be able to process them. This is of course a simplified model, but for the purposes of this document it is a useful representation of the processing TimeWarp requires.
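As a rough illustration of this model, the V-VSync deadline can be derived by subtracting TimeWarp's fixed cost from the true VSync time. The numbers below are illustrative assumptions, not measured values:

```cpp
#include <cstdio>

int main() {
    const double frameIntervalMs = 1000.0 / 72.0;  // 72 Hz -> ~13.89 ms per frame
    const double timeWarpCostMs  = 2.0;            // assumed fixed TimeWarp cost
    const double trueVSyncMs     = frameIntervalMs;            // relative to frame start
    const double virtualVSyncMs  = trueVSyncMs - timeWarpCostMs;

    std::printf("Game must submit within %.2f ms of frame start (V-VSync),\n"
                "leaving %.2f ms for TimeWarp before true VSync.\n",
                virtualVSyncMs, timeWarpCostMs);
}
```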
The simplest game loop has a single CPU thread that performs the game logic, sends the rendering commands to the GPU, then calls SubmitFrame, which waits for the next V-VSync. It looks something like this:
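In code form, that loop is roughly the following sketch. The function names are hypothetical stand-ins for your engine or SDK calls; the key detail is that the blocking SubmitFrame call is what paces the whole loop:

```cpp
// Hypothetical stand-ins for engine/SDK calls.
void SampleInput() {}  // controllers + predicted head tracking
void UpdateGame()  {}  // physics, score, gameplay logic
void RenderScene() {}  // issue this frame's GPU draw calls
void SubmitFrame() {}  // hand the frame to TimeWarp; blocks until next V-VSync

int main() {
    for (;;) {          // one iteration per display frame
        SampleInput();
        UpdateGame();
        RenderScene();
        SubmitFrame();  // the wait here paces the game to V-VSync
    }
}
```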
As you can see, the game logic and rendering happen within the length of one frame, and TimeWarp is able to use the rendered frame immediately. This gives the lowest possible latency from a gameplay perspective; however, if your GPU doesn't finish rendering in time, TimeWarp has to reuse the last frame, and the frame we were rendering is discarded (assuming the next frame completes on time). Because the next frame's CPU work can run while the current frame's GPU work is still in flight, you can end up running at full frame rate while still showing a number of stale frames. That makes the number of stale frames a much more important metric to monitor than raw frame rate.
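Here is a toy illustration of why those two metrics diverge. The GPU finish times below are invented for the example; the point is that the CPU submitted every frame on time, so the frame rate looks full even though several frames went stale:

```cpp
#include <cstdio>
#include <vector>

int main() {
    const double deadlineMs = 1000.0 / 72.0;  // per-frame deadline (~13.89 ms)
    // Hypothetical GPU finish times, measured from the start of each frame.
    std::vector<double> gpuFinishMs = {12.0, 14.5, 13.0, 15.2, 11.8};

    int stale = 0;
    for (double t : gpuFinishMs)
        if (t > deadlineMs) ++stale;          // TimeWarp had to reuse the old frame

    // The CPU kept submitting every frame, so FPS still *looks* full here.
    std::printf("%d stale frames out of %zu\n", stale, gpuFinishMs.size());
}
```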
Even worse is what happens when the GPU render time runs past the next V-VSync. The previous frame will be reused twice, and the next frame's SubmitFrame call will block until the current frame finishes rendering. This gives the GPU time to catch up with the CPU, but it means that when frame N + 1 is finally displayed, it will have a whole extra frame of latency!
It turns out that performing a full frame of rendering within the limits of one VSync interval is a very difficult target to hit, since CPU and GPU time together must add up to less than a single frame (i.e., 13.89ms at 72Hz or 16.67ms at 60Hz). Realistically, almost every game will need more time. Therefore, the Oculus API supports something called Extra Latency Mode, which makes missing this tiny window expected behavior: TimeWarp always displays the frame that was submitted during the previous frame interval. The diagram for this mode looks like this:
The big advantage of this is that it lets you use the whole frame interval for both CPU and GPU, so you can get closer to 100% utilization. The downside, of course, is one extra frame of latency. This tradeoff is almost always worth it, so much so that Unity and UE4 both force Extra Latency Mode on by default.
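To put rough numbers on that tradeoff, here is some illustrative arithmetic (the split of the budget is the idealized model from the diagrams, not a measurement):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    for (double hz : {72.0, 60.0}) {
        const double frameMs = 1000.0 / hz;
        std::printf("%.0f Hz -> %.2f ms per frame\n", hz, frameMs);
        std::printf("  ELM off: CPU + GPU together must fit in %.2f ms\n", frameMs);
        std::printf("  ELM on:  CPU and GPU each get ~%.2f ms (pipelined), "
                    "at +%.2f ms of latency\n", frameMs, frameMs);
    }
}
```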
While the results are clear when everything runs on time, what happens if your CPU or GPU takes too long to finish a frame? The GPU case is actually very similar to the Extra-Latency-Mode-off case where a frame takes more than two V-VSyncs to finish rendering. The late frame causes the next frame's SubmitFrame call to block until the following V-VSync (at which point the same frame will still be the one expected to complete). As we saw with Extra Latency Mode off, by the time we get back to the expected frame cycle, we will have presented at least three high-latency frames (the previous repeated frame, the current frame, and the next frame, which was pushed back when SubmitFrame blocked). Therefore, keeping the GPU from running long is crucial to smooth gameplay.
The CPU case is actually less problematic: with Extra Latency Mode on, calling SubmitFrame after V-VSync simply returns immediately (assuming your previous frame was still ready in time). For example:
As you can see, if the CPU keeps running over the frame time, the GPU will eventually take too long and SubmitFrame will block until the next V-VSync. However, if the CPU time drops back down, the game recovers and the app never misses a frame.
Multithreaded Applications
While single-threaded applications are the simplest, the mobile devices running Oculus software (Gear VR, Oculus Go, and Oculus Quest) all have chipsets with multiple CPU cores, and taking advantage of those cores requires a multithreaded application. Thus, Unity and UE4 both offer a multithreaded rendering mode, in which the game logic runs on the main thread and the rendering logic runs on a separate thread. These threads are synchronized by the Render Thread calling SubmitFrame and thereby waiting for V-VSync. When V-VSync triggers the start of a frame, the Render Thread signals the Game Thread to begin the logic for the next frame while it renders the current frame (whose state the Game Thread produced last frame). The net effect is one extra frame of latency between the gameplay logic and the frame being rendered to the screen. Here is an illustration of this:
Similarly, if the Render Thread is running late, the Game Thread will not be signaled to begin the next frame:
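To make the handoff concrete, here is a minimal sketch of the two-thread pattern described above, using standard C++ threads. Real engines use their own job systems, and the names here are hypothetical, but the signal pattern is the same: at V-VSync the Render Thread kicks the Game Thread to start frame N + 1, then renders frame N in parallel:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-ins for the engine work described above.
void UpdateGame()  {}  // gameplay logic that produces state for frame N + 1
void RenderScene() {}  // render frame N from the state produced last interval
void SubmitFrame() {}  // hands the frame to TimeWarp; blocks until next V-VSync

std::mutex              gMutex;
std::condition_variable gCv;
bool gFrameKick = false;  // Render Thread -> Game Thread: start the next frame

void GameThread() {
    for (;;) {
        {   // Wait for the Render Thread's kick at V-VSync.
            std::unique_lock<std::mutex> lock(gMutex);
            gCv.wait(lock, [] { return gFrameKick; });
            gFrameKick = false;
        }
        // This state won't be rendered until *next* frame -> +1 frame latency.
        UpdateGame();
    }
}

void RenderThread() {
    for (;;) {
        SubmitFrame();  // V-VSync: the previous frame goes to TimeWarp
        {   // Kick the Game Thread to start logic for frame N + 1 ...
            std::lock_guard<std::mutex> lock(gMutex);
            gFrameKick = true;
        }
        gCv.notify_one();
        RenderScene();  // ... while we render frame N in parallel.
        // If RenderScene runs long, SubmitFrame happens late, so the kick
        // above is delayed -- the late-Render-Thread case shown above.
    }
}

int main() {
    std::thread game(GameThread);
    RenderThread();  // never returns in this sketch
    game.join();
}
```

The sketch elides the hand-off of the game state itself (a real engine double-buffers it), but the signaling shows where the extra frame of latency comes from.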
UE4 has recently introduced something called RHIThread, which separates the actual submission of graphics API calls (for Oculus Mobile, OpenGL ES or Vulkan) from all the other work done by the render thread (e.g., culling, sorting, etc.). For some applications this can improve performance, as it allows the rendering work to be spread across two frame intervals instead of one. However, it comes at the cost of one additional frame of latency, and that cost is not mitigated by late latching (which stays on the render thread and is also pushed back one frame). For most applications, especially those where latency matters, it is probably better to avoid RHIThread unless it is truly necessary, as the total latency can be well over 50ms.
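As a rough sanity check on that figure, here is the arithmetic, treating each pipeline stage as one full 72Hz frame interval (an idealized assumption):

```cpp
#include <cstdio>

int main() {
    const double frameMs = 1000.0 / 72.0;  // ~13.89 ms at 72 Hz

    // One frame each: game thread, render thread, RHIThread,
    // plus the extra frame from Extra Latency Mode.
    const int pipelineStages = 4;
    const double totalMs = pipelineStages * frameMs;

    std::printf("%d stages x %.2f ms = ~%.1f ms motion-to-photon\n",
                pipelineStages, frameMs, totalMs);  // ~55.6 ms: "well over 50ms"
}
```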
Understanding how the CPU and GPU synchronize to render frames is key to achieving optimal performance. If your game starts to drop frames, figuring out which thread is the bottleneck is the first step to solving the problem. If you have the opposite problem, where your game runs lightning quick but your motion-to-photon latency is high, you can improve latency by reducing your threading complexity.
Best of luck achieving best-in-class performance for your next game/application. Thanks for reading, and feel free to contact us with any questions.
- Trevor Dasch