Hello, I’m Andrew Melim! I’m an Engineering Manager here at Facebook, leading the efforts around Input Tracking for our AR/VR devices.
Our team of computer vision engineers has been working hard to tackle the problem of providing high-fidelity interaction for the Oculus Quest and Rift S. We proved that Constellation tracking can deliver a great user experience on Oculus Rift; however, with Quest and Rift S we faced a dramatically different sensing configuration and needed to reimagine the stack from the ground up. With cameras mounted on the HMD instead of externally on one's desk or table, detecting and tracking the infrared LEDs becomes much more difficult. These problems multiplied due to the ease with which the controller tracking rings can become occluded or leave the view of the headset cameras.
With the latest tracking update to the Oculus Quest and Rift S, we tackled the many challenges that are inherent to inside-out tracking systems, especially for controllers. These updates were released with update 1.39 for Rift S and build v7 for Quest. Over the next series of blog posts we will provide a look under the hood at how we were able to enhance tracking fidelity and ensure the Insight tracking system meets the needs of your favorite VR games and content.
Challenges
The key underlying principle that drives Constellation tracking is the detection and triangulation of infrared LEDs within camera images that have extremely short exposure times. For each controller, the tracking system attempts to solve for the 3D pose, that is, its 3D position and orientation. Within each frame, the system executes the following steps, sketched in code after the list:
- Search camera images for bright volumes of infrared light
- Determine a matching scheme between image projections and the underlying 3D model of the controller
- Compute the 3D pose of the controller with respect to the headset and fuse with inertial data
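To make that flow concrete, here is a rough structural sketch in Python. None of these names are Insight's actual APIs; they are placeholders for the three stages above, and the rest of this post focuses on the first one.

```python
# Structural sketch only: placeholder names, not Insight's real code.

def detect_led_blobs(image):
    ...  # Step 1: bright-blob segmentation (covered below)

def match_blobs_to_model(blobs_per_camera, led_model_3d, prior_pose):
    ...  # Step 2: blob-to-LED correspondence (topic of the next post)

def estimate_pose(matches, led_model_3d):
    ...  # Step 3a: position + orientation from the matched projections

def fuse_with_imu(optical_pose, imu_sample, prior_pose):
    ...  # Step 3b: blend the optical solution with inertial data

def track_controller(frame_images, led_model_3d, imu_sample, prior_pose):
    blobs_per_camera = [detect_led_blobs(img) for img in frame_images]
    matches = match_blobs_to_model(blobs_per_camera, led_model_3d, prior_pose)
    optical_pose = estimate_pose(matches, led_model_3d)
    return fuse_with_imu(optical_pose, imu_sample, prior_pose)
```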
In order for the system to obtain a sufficient number of constraints to solve for the position and orientation, we need a minimum number of LED observations: each detected LED contributes only two image-space constraints toward the six degrees of freedom of the pose. One of the major issues we faced with tracking the Quest controllers was that the typical number of LEDs visible in any given camera image is quite low. Due to lower camera resolutions and various other constraints, the Quest controllers have fewer LEDs placed on them for tracking (15 vs. 22 on the Rift Touch controllers). This issue was further compounded by the fact that there are very few poses where more than one camera can see a controller at a time, unlike Rift, where a controller is typically viewed by two or three cameras at a time, depending on your setup.
In this post we will focus on how we improved the detection and segmentation algorithms that make up the first step in the Constellation pipeline. Gaining more precision in the blobs we detect, reducing false positives, and extending the range over which our detection algorithms work are key steps in improving how we track the controllers.
New blob segmentation
Blob segmentation is a relatively simple concept: for each image, we want to identify contiguous blocks of bright pixels (blobs) to determine potential locations of LEDs. This simplicity hides a very challenging technical problem: how do you identify blobs that correspond to LEDs (true positives) and ignore blobs that correspond to other bright objects in the scene (false positives)?
If we simply took every point of IR light, it would become quite difficult to locate the controller in environments with many light sources or reflections. Knowing which light sources to reject and ignore is just as important as knowing which ones to use when solving for the controller pose. Therefore, a large number of edge cases, constraints, and heuristics are used to reject any blob that is unlikely to have been emitted by the controllers.
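As a point of reference, the core of a blob detector can be written in a few lines with off-the-shelf tools. The sketch below uses OpenCV thresholding and connected components, with made-up brightness and size thresholds; our production detector is considerably more involved, but the basic shape is similar.

```python
import cv2

def detect_led_blobs(gray, brightness_threshold=200, min_area=2, max_area=400):
    """Return centroids of bright, LED-sized regions in a short-exposure frame."""
    # With very short exposures, LED pixels stand far above the background.
    _, binary = cv2.threshold(gray, brightness_threshold, 255, cv2.THRESH_BINARY)

    # Group contiguous bright pixels into labeled regions.
    num_labels, _, stats, centroids = cv2.connectedComponentsWithStats(
        binary, connectivity=8)

    blobs = []
    for label in range(1, num_labels):  # label 0 is the background
        area = stats[label, cv2.CC_STAT_AREA]
        # Crude size gate: drop regions unlikely to be LED projections,
        # such as bright windows or overhead lights.
        if min_area <= area <= max_area:
            blobs.append(tuple(centroids[label]))
    return blobs
```

The hard part, as noted above, is everything around this core: deciding which of those regions to trust and which to throw away.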
The first step in improving tracking was raising the quality of the LED blob segmentation algorithm. This covered three main technical problems:
- Enable the detection of LED blobs that merge together into a single, bimodal blob rather than throwing them away
- Extend the blob detector to work on multiple image pyramid levels to ensure we track blobs at close distances
- Allow the detector to find faint blobs (as happens at large distances from the cameras)
When a user holds the controller at arm's length, LEDs that are close to the user can appear as one continuous blob, which corrupts the detected LED location. Incorrectly computed LED locations dramatically increase the error in pose estimation and lead to poor or lost tracking. By utilizing new heuristics around the ratio of blob pixels to the size of a bounding box computed around the region, we found we could very accurately detect whether there were two distinct blobs at a high incidence angle to the camera. This helped in many common poses, such as aiming a blaster or holding the trigger towards your face.
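To give a flavor of that check, here is a minimal sketch of the bounding-box statistic. The "expected band" for a single LED is a made-up placeholder; the shipped heuristics combine this ratio with other cues and carefully tuned thresholds.

```python
import numpy as np

def merged_blob_candidate(blob_pixels, expected_band=(0.55, 0.85)):
    """blob_pixels: NumPy (N, 2) integer array of (row, col) coordinates in one blob."""
    rows, cols = blob_pixels[:, 0], blob_pixels[:, 1]
    height = rows.max() - rows.min() + 1
    width = cols.max() - cols.min() + 1

    # Ratio of lit pixels to the area of the axis-aligned bounding box.
    fill_ratio = len(blob_pixels) / float(height * width)

    # A single LED projects to a compact, roughly elliptical spot; a region
    # whose fill ratio falls outside the expected band is a candidate for
    # being two LEDs fused together.
    low, high = expected_band
    return not (low <= fill_ratio <= high)
```

When a region is flagged, it can be split into two centroids, for example along its long axis, instead of being thrown away.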
As shown in the image below, which was captured from a distance, the bottom controller's LEDs were not accurately detected, and two of the LEDs from the top controller were merged into a single blob (single red centroid). With the new methods, shown on the right, we accurately detect three blobs from each controller, providing tracking in these edge cases.

Another issue we tackled was large variation in detected blob size, a new challenge unique to the Insight system. When a controller is held at arm's length, the projection of an LED may only span a few pixels, whereas it could be as large as several dozen pixels when held right next to the camera. The system initially rejected blobs that were either too large or too small, e.g., as may occur with bright windows or overhead lights. Although this initially functioned as intended, it led to compounding challenges as the range of detected blob sizes grew.

The above is an example of improved segmentation up close, including a reflection from a table. On the right, with the new segmentation code, we accurately detect the large blob without detecting a false blob.
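Some back-of-the-envelope pinhole arithmetic shows why any fixed size gate struggles here. The emitter size and focal length below are illustrative numbers, not the real hardware parameters, but the trend is what matters: the same LED legitimately spans anywhere from a couple of pixels to a few dozen pixels across, depending on distance.

```python
# Illustrative pinhole projection: apparent size ~ focal_length * size / distance.
LED_DIAMETER_M = 0.005   # hypothetical ~5 mm apparent emitter/bloom size
FOCAL_LENGTH_PX = 450    # hypothetical focal length, in pixels

for distance_m in (0.08, 0.3, 0.7, 1.2):
    diameter_px = FOCAL_LENGTH_PX * LED_DIAMETER_M / distance_m
    print(f"{distance_m:4.2f} m -> ~{diameter_px:4.1f} px across")
```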
Our solution was to implement image pyramids, a well-known method for scaling image information, although it can be a computationally expensive operation (see the following MIT paper on the concept). Downsampling the images allowed us to run the same camera frames through our feature detector, providing scale invariance and allowing us to detect blobs much closer to the HMD. Image pyramids are not computed at every frame or even for every camera - instead we use estimated controller poses and heuristics to decide when to compute pyramids. This gives the system the flexibility to reduce compute when the controller location is known with high confidence, and boost compute to search for the controller when its location is unknown.
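Here is a minimal sketch of that idea, assuming OpenCV's pyrDown (Gaussian blur plus 2x downsample) and the hypothetical blob detector sketched earlier; the number of levels and the trigger for building a pyramid at all are placeholders for the pose-confidence heuristics described above.

```python
import cv2

def build_pyramid(gray, num_levels=3):
    """Return [full-res, 1/2, 1/4, ...] copies of a camera frame."""
    levels = [gray]
    for _ in range(num_levels - 1):
        levels.append(cv2.pyrDown(levels[-1]))
    return levels

def detect_across_scales(gray, detect_fn, need_pyramid):
    # Run the same detector at each scale. A blob that spans dozens of
    # pixels at full resolution shrinks to a detector-friendly size at a
    # coarser level, which is what restores detection close to the HMD.
    images = build_pyramid(gray) if need_pyramid else [gray]
    detections = []
    for level, img in enumerate(images):
        scale = 2 ** level
        for (x, y) in detect_fn(img):
            # Map centroids back to full-resolution pixel coordinates.
            detections.append((x * scale, y * scale, level))
    return detections
```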
Putting these blobs to good work
Detecting these new blobs is a solid first step in addressing the tracking challenges, but it's only the first of many steps in improving tracking on Quest and Rift S. In the next blog post I'll address how we put this new data to work in solving the interesting problems of mapping blobs to LEDs, and using projective geometry to solve for positions and orientations.
- Andrew Melim