Display
When creating content for stereoscopic immersive experience headsets, developers must consider unique challenges that are not present in traditional, non-XR content. To produce the highest-quality content, it is essential to have knowledge and techniques specific to XR display technology. The following sections will cover various aspects of XR display technology and provide tips on how to author content for optimal results.
This section provides tips and explanations on displaying the virtual world to users.
Using monocular depth cues
Failing to properly represent the depth of objects will break a fully immersive experience. Stereopsis, the perception of depth based on disparity between the viewpoints of each eye, is the most salient depth cue, but it is only one of many ways the brain processes depth information.
Many visual depth cues are monocular; that is, they convey depth even when they are viewed by only one eye or appear in a flat image viewed by both eyes.
One such depth cue is motion parallax, the degree to which objects at different distances appear to move at different rates during head movement.
Other depth cues include: curvilinear perspective (straight lines converge as they extend into the distance), relative scale (objects get smaller when they are farther away), occlusion (closer objects block our view of more distant objects), aerial perspective (distant objects appear fainter than close objects due to light scattering in the atmosphere), texture gradients (repeating patterns get more densely packed as they recede into the distance), and lighting (highlights and shadows help us perceive the shape and position of objects).
Current-generation computer-generated content (such as content created in Unreal and Unity) leverages many of these depth cues, but we mention them because it is easy to neglect their importance. If implemented improperly, conflicting depth signals can make the experience uncomfortable or difficult to view.
Comfortable viewing distances
Two issues are of primary importance to understanding eye comfort when the eyes focus on an object in fully immersive experiences: accommodative demand and vergence demand. Accommodative demand refers to how the eyes have to adjust the shape of their lenses to bring a depth plane into focus (a process known as accommodation). Vergence demand refers to the degree to which the eyes have to rotate inwards so their lines of sight intersect at a particular depth plane. In the physical world, these two are strongly correlated; so much so that we have what is known as the accommodation-convergence reflex: the degree of convergence of the eyes influences the accommodation of the lenses, and vice versa.
Fully immersive experiences create an unusual situation that decouples accommodative and vergence demands, where accommodative demand is fixed but vergence demand can change. This is because the actual images for creating stereoscopic 3D are always presented on a screen that remains at the same distance optically, but the different images presented to each eye still require the eyes to rotate so their lines of sight converge on objects at a variety of different depth planes.
The degree to which the accommodative and vergence demands can differ before the experience becomes uncomfortable to the viewer varies. In order to prevent eye strain, objects that the user will be fixating their eyes on for an extended period of time (e.g., a menu, an object of interest in the environment) should be rendered at least 0.5 meters away. Many have found that 1 meter is a comfortable distance for menus and GUIs that users may focus on for extended periods of time.
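As a trivial illustration of these numbers, the helper below clamps the placement distance of persistent, gaze-anchored UI such as a menu. This is a hedged sketch: the constants and function name are ours, not part of any SDK.

```cpp
#include <algorithm>

// Illustrative helper (not from any SDK): choose a comfortable placement
// distance for a gaze-anchored panel. Values follow the guidance above:
// never closer than 0.5 m, with ~1 m as a good default for menus and GUIs.
constexpr float kMinComfortableMeters = 0.5f;
constexpr float kDefaultMenuMeters    = 1.0f;

float ComfortableUiDistance(float requestedMeters)
{
    // Fall back to the 1 m default if the caller passes a non-positive value.
    if (requestedMeters <= 0.0f)
        return kDefaultMenuMeters;

    // Never place persistent UI closer than the 0.5 m comfort limit.
    return std::max(requestedMeters, kMinComfortableMeters);
}
```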
A complete virtual environment requires rendering some objects outside the comfortable range. As long as users are not required to focus on those objects for extended periods, they should not cause discomfort for most individuals.
Some developers have found that depth-of-field effects can be both immersive and comfortable in situations where the point of the user's gaze is known. For example, artificially blurring the background behind a menu the user brings up, or blurring objects that fall outside the depth plane of an object being held up for examination, can achieve this. This not only simulates the natural function of vision in the physical world, it also prevents salient objects outside the user's focus from distracting the eyes.
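A minimal sketch of one way to drive such a blur is shown below, assuming the focus depth (for example, the menu or held-object depth) is known; the function and parameter names are illustrative only, and a real implementation would feed this value into the engine's blur pass.

```cpp
#include <algorithm>
#include <cmath>

// Map a fragment's depth to a blur amount in [0, 1], given an assumed focus
// depth (e.g. the depth of a held object or an open menu). Everything within
// +/- focusRangeMeters of the focus plane stays sharp; blur then ramps up
// linearly and saturates. Names and constants are illustrative only.
float BlurAmount(float fragmentDepthMeters,
                 float focusDepthMeters,
                 float focusRangeMeters = 0.25f,
                 float rampMeters       = 1.0f)
{
    const float offset      = std::fabs(fragmentDepthMeters - focusDepthMeters);
    const float beyondFocus = std::max(0.0f, offset - focusRangeMeters);
    return std::min(1.0f, beyondFocus / rampMeters);
}
```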
A user may choose to behave in an unreasonable manner, such as standing with their eyes inches away from an object and staring at it all day. Avoid designs that require scenarios likely to cause discomfort.
Viewing objects at a distance
At a certain distance depth perception becomes less sensitive. Up close, stereopsis might allow one to tell which of two objects on a desk is closer on the scale of millimeters. This becomes more difficult further out. If one looks at two trees on the opposite side of a park, they might have to be meters apart before one can confidently tell which is closer or farther away. At even larger scales, one might have trouble telling which of two mountains in a mountain range is closer until the difference reaches kilometers.
Use this relative insensitivity to depth in the distance to free up computational power by using impostor or billboard textures in place of fully 3D scenery. For instance, rather than rendering a distant hill in 3D, simply render a flat image of the hill onto a single polygon that appears in the left and right eye images. To the eyes, this image looks the same in fully immersive experiences as it does in traditional 3D games.
The effectiveness of these impostors will vary depending on the size of the objects involved, the depth cues inside of and around those objects, and the context in which they appear. Test with your app's actual assets to ensure the impostors look and feel right. Be sure that the impostors are sufficiently distant from the camera to blend in inconspicuously, and that interfaces between fully rendered scene elements and impostor elements do not break immersion.
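For reference, a minimal sketch of the usual camera-facing billboard math follows. The vector type and function names are illustrative, and the single shared orientation assumes the impostor is distant enough that per-eye differences are negligible.

```cpp
#include <cmath>

// Minimal vector type for illustration.
struct Vec3 { float x, y, z; };

static Vec3 Sub(Vec3 a, Vec3 b)   { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
static Vec3 Cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static Vec3 Normalize(Vec3 v)
{
    const float len = std::sqrt(v.x*v.x + v.y*v.y + v.z*v.z);
    return {v.x/len, v.y/len, v.z/len};
}

// Build an orthonormal basis for a Y-up, camera-facing billboard. 'forward'
// points from the impostor toward the camera. A single orientation is shared
// by both eyes, which is acceptable when the impostor is far enough away that
// stereo disparity is negligible. Assumes the impostor is not directly above
// or below the camera (otherwise worldUp and forward become parallel).
void BillboardBasis(Vec3 impostorPos, Vec3 cameraPos,
                    Vec3& right, Vec3& up, Vec3& forward)
{
    const Vec3 worldUp = {0.0f, 1.0f, 0.0f};
    forward = Normalize(Sub(cameraPos, impostorPos));
    right   = Normalize(Cross(worldUp, forward));
    up      = Cross(forward, right);
}
```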
Rendering stereoscopic images
We often face situations in the physical world where each eye gets a very different viewpoint, and we generally have little problem with it. Peeking around a corner with one eye works in fully immersive experiences just as well as it does in the physical world. In fact, the eyes’ different viewpoints can be beneficial: say there is a special agent (in the physical world or in a fully immersive experience) trying to stay hidden in some tall grass. Their eyes’ different viewpoints allow them to look “through” the grass to monitor surroundings as if the grass weren’t even there in front of them. Doing the same in a video game on a 2D screen may leave the world completely occluded behind each blade of grass.
Still, fully immersive experiences (like any other stereoscopic imagery) can give rise to situations that are uncomfortable for the user. For instance, rendering effects (such as light distortion, particle effects, or light bloom) should always appear in both eyes and with correct disparity. Improperly rendered effects can appear to flicker or shimmer (when something appears in only one eye) or to float at the wrong depth (if the disparity is off, or if a post-processing effect, such as a specular shading pass, is not rendered at the depth of the object it applies to). It is important to ensure that the images presented to the two eyes do not differ aside from the slightly different viewing positions inherent to binocular disparity.
It’s typically not a problem in a complex 3D environment, but be sure to give the user’s brain enough information to fuse the stereoscopic images together. The lines and edges that make up a 3D scene are generally sufficient. However, be very cautious of using wide swaths of repeating patterns or textures, which could cause people to fuse the images differently than intended. Be aware also that optical illusions of depth (such as the hollow mask illusion, where concave surfaces appear convex) can sometimes lead to misperceptions, particularly in situations where monocular depth cues are sparse.
We discourage the use of traditional HUDs to display information in fully immersive experiences. Instead, embed the information into the environment or the user's avatar. Although certain traditional conventions can work with thoughtful redesign, simply porting a HUD from a non-immersive game into fully immersive content introduces issues that make it impractical or even uncomfortable.
When incorporating some HUD elements, be aware of the following issues:
- Occlusion of the scene with the HUD. This isn’t a problem in non-stereoscopic games, because the user can easily assume that the HUD actually is in front of everything else. Adding binocular disparity (the slight differences between the images projected to each eye) as a depth cue can create a contradiction if a scene element comes closer to the user than the depth plane of the HUD. Based on occlusion, the HUD is perceived to be closer than the scene element because it covers everything behind it, yet binocular disparity indicates that the HUD is farther away than the scene element it occludes. This can lead to difficulty and/or discomfort when trying to fuse the images for either the HUD or the environment.
- HUD elements “behind” anything in the scene. This effect is extremely common with reticles, subtitles, and other sorts of floating UI elements. It’s common for an object that should be “behind” a wall (in terms of distance from the camera) to be drawn “in front” of the wall because it’s been implemented as an overlay. This sends conflicting cues about the depth of these objects, which can be uncomfortable.
Traditional heads-up displays pose problems in immersive scenes due to conflicting depth cues.
Instead, it is recommended to build the information into the environment, where users can move their heads to retrieve it in an intuitive way. For instance, rather than including a mini-map and compass in a HUD, the player might get their bearings by glancing down at an actual map and compass in their avatar's hands or cockpit, or at a watch that displays the player's vital information. This is not to say realism is necessary; enemy health gauges might still float over their heads. What is important is presenting information in a clear and comfortable way that does not interfere with the player's ability to perceive a clear, single image of the environment or of the information they are trying to gather.
Targeting reticles are common elements in games and are a good example of adapting an old information paradigm to fully immersive experiences. While a reticle is critical for accurate aiming, simply pasting it over the scene at a fixed depth plane will not yield the behavior players expect: if the reticle is rendered at a depth different from where the eyes are converged, it is perceived as a double image. For the targeting reticle to work the way it does in traditional video games, it must be drawn directly onto the object it is targeting, presumably where the user's eyes are converged when aiming. The reticle itself can be rendered at a fixed world-space size, so it appears bigger or smaller with distance, or programmed to maintain a constant apparent size to the user; this is largely an aesthetic decision.
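A minimal, engine-agnostic sketch of this approach is shown below. It assumes the engine supplies a raycast hit distance along the aim ray; all names are illustrative rather than taken from any SDK.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Place the reticle at the depth of whatever the aim ray hits, and scale it
// so that it keeps a constant angular size for the user. 'hitDistanceMeters'
// is assumed to come from the engine's raycast against the scene (clamped to
// a far default when nothing is hit).
void PlaceReticle(Vec3 aimOrigin, Vec3 aimDirection,  // aimDirection normalized
                  float hitDistanceMeters,
                  float apparentRadiusRadians,
                  Vec3& outPosition, float& outWorldRadius)
{
    // Keep the reticle inside a sane range so it never sits uncomfortably close.
    const float distance = std::clamp(hitDistanceMeters, 0.5f, 100.0f);

    outPosition = {aimOrigin.x + aimDirection.x * distance,
                   aimOrigin.y + aimDirection.y * distance,
                   aimOrigin.z + aimDirection.z * distance};

    // World-space size grows with distance so the angular size stays constant.
    outWorldRadius = distance * std::tan(apparentRadiusRadians);
}
```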
Place critical gameplay elements in the user's immediate line of sight. UI or other elements displayed outside the user's field of view are more likely to be missed.
Camera origin and user perspective
Consider the altitude, or height, of the user's point of view (POV), as this can be a factor in causing discomfort. The lower the user's POV, the more rapidly the ground plane changes, creating a more intense display of optic flow. This can create an uncomfortable sensation for the same reason that moving up staircases can: it creates intense optic flow across the visual field.
When developing a fully immersive app, the camera’s origin can rest on a user’s floor or on their eyes (these are called “floor” and “eye” origins, respectively). Both options have certain advantages and disadvantages.
Using the floor as the origin causes people's viewpoint to be at the same height off the ground as it is in the physical world. Aligning the virtual viewpoint height with the physical one can increase the sense of immersion. However, people's heights vary, so to render a virtual body you need a solution that scales to different users' heights.
Using the user’s eyes as the camera’s origin affords control over their height within the virtual world. This is useful for rendering virtual bodies that are a specific height and also for offering perspectives that differ from people’s physical world experience (for example, showing people what the world looks like from the eyes of a child). However, by using the eye point as the origin, the physical floor’s location is no longer known. This complicates interactions that involve ducking low or picking things up from the ground. Since the user’s height isn’t known, adding a recentering step at the beginning of an app to accurately record the user’s physical world height can help.
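In OpenXR terms, the floor and eye origins correspond roughly to the STAGE and LOCAL reference spaces. The following is a minimal sketch, assuming an already-initialized XrSession and omitting error handling, of how an app might create both spaces; check the runtime's supported reference space types before relying on STAGE.

```cpp
#include <openxr/openxr.h>

// Create both a floor-origin and an eye-origin reference space for an already
// initialized XrSession. Which one the app tracks against determines whether
// the camera origin sits on the user's floor (STAGE) or at their eyes (LOCAL).
// Minimal sketch: production code should check every XrResult and verify that
// the runtime supports the STAGE space.
void CreateOriginSpaces(XrSession session,
                        XrSpace& floorOriginSpace,
                        XrSpace& eyeOriginSpace)
{
    XrReferenceSpaceCreateInfo createInfo{XR_TYPE_REFERENCE_SPACE_CREATE_INFO};
    createInfo.poseInReferenceSpace.orientation.w = 1.0f;  // identity pose

    // Floor origin: (0, 0, 0) lies on the physical floor.
    createInfo.referenceSpaceType = XR_REFERENCE_SPACE_TYPE_STAGE;
    xrCreateReferenceSpace(session, &createInfo, &floorOriginSpace);

    // Eye origin: (0, 0, 0) starts at the user's initial head position.
    createInfo.referenceSpaceType = XR_REFERENCE_SPACE_TYPE_LOCAL;
    xrCreateReferenceSpace(session, &createInfo, &eyeOriginSpace);
}
```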
Color
Color is how we perceive light interacting with our environment. The set of colors used for a GUI, a photo, or a scene helps convey a window into a reality, set a mood, express a message, or even connect to broader themes such as a brand or a concept. Since the color appearance of an immersive experience is critical to the desired artistic intent, it is crucial that users experience color as the creators intended. Meta Quest devices support color management in order to faithfully preserve that artistic intent.
This section introduces some basic color science terminology. You may skip ahead to Color on Meta Quest devices to access the recommendations directly.
While it is common to speak of objects as having a color (e.g., “this apple is red”), colors are not a property of objects but of our perception of them: the visible light spectrum reaching our eyes is convolved with the response of the photoreceptor cells in our retina and reconstructed by our neural system. In an immersive experience, the goal is often that the image seen through the lenses of a headset yields a perception similar to that of a physical location, or that an imaginary world is perceived the way its creator intended, even when the creator and the end user use different devices.
As we transition from the realm of color perception to the technical aspects of color spaces, it’s essential to understand the fundamental connection between light and our visual experience. The colors we perceive are a result of the interaction between light, our environment, and our eyes. When light is emitted or reflected, it travels through space as electromagnetic radiation, comprising a vast spectrum of wavelengths.
Interestingly, different combinations of wavelengths can generate the same perceived color, a phenomenon known as metamerism. This occurs because our eyes have a limited set of color receptors, each responding to a specific range of wavelengths. For instance, long wavelengths (around 620-750 nanometers) are perceived as red, medium wavelengths (around 520-570 nanometers) as green, and short wavelengths (around 450-495 nanometers) as blue. These distinct clusters of wavelengths, when combined in various proportions, create the rich tapestry of colors we experience in the world around us.
In essence, colored light can be thought of as a combination of pure “laser” colors, similar to how sound waves combine to form complex sounds. Just as different sound waves can produce the same perceived pitch or tone, different light waves can generate the same perceived color. This understanding sets the stage for exploring the technical aspects of color spaces, where we’ll delve into the specifics of how these wavelengths are represented and managed in Meta Quest devices.
Since the human visual system is trichromatic, emissive displays need three colored lights to generate many other shades as mixtures of them. Displays use red, green, and blue (RGB) primaries since they align with the response functions of our photoreceptor cells. The range of colors perceivable by mixing those three lights at different intensities, the actual white shade obtained by mixing all three lights at full power, and the functions that map mixture values into a form more efficient for storage and processing together define a color space. Think of a color space as the units for defining a color: just as a distance is meaningless without a unit (e.g., 1 mile vs. 1.609 kilometers), colors expressed as RGB triplets are only meaningful as part of a color space.
A color space may be defined by the following characteristics:
- Chromaticities: For each of the red, green, and blue primaries, their actual hue and saturation (note that luminance is not included). This is typically expressed as x,y coordinates in the CIE 1931 space.
- White point: The chromaticity of the white obtained by mixing the three primaries at 100% each (again, luminance is not included). This is typically expressed as x,y coordinates in the CIE 1931 space or as a well-known illuminant, e.g., D65.
- Transfer function: An invertible, monotonic function that transforms linear scene light values (e.g., values proportional to the photon count) of each primary into a form more efficient for transmission or processing. For digital images, transfer functions enable efficient quantization. Closely related terms are the OETF (opto-electronic transfer function), which converts from linear light to encoded values, and the EOTF (electro-optical transfer function), which converts from encoded values to linear light.
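As a concrete reference, the sketch below implements the standard sRGB transfer function pair (the piecewise IEC 61966-2-1 curves). This is generic color math rather than a Meta Quest-specific API; it is included only to make the OETF/EOTF terminology tangible.

```cpp
#include <cmath>

// sRGB OETF: linear light in [0, 1] -> encoded value in [0, 1].
float SrgbEncode(float linear)
{
    if (linear <= 0.0031308f)
        return 12.92f * linear;
    return 1.055f * std::pow(linear, 1.0f / 2.4f) - 0.055f;
}

// sRGB EOTF (inverse of the above): encoded value -> linear light.
float SrgbDecode(float encoded)
{
    if (encoded <= 0.04045f)
        return encoded / 12.92f;
    return std::pow((encoded + 0.055f) / 1.055f, 2.4f);
}
```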
Note: For standard dynamic range (SDR) devices such as Meta Quest headsets, the intensity of a 100% white (e.g., all three primaries at full intensity) is relative to the brightness of the optical system (light seen through the headset lenses). For Meta Quest devices, this is typically 100 nits at the 100% brightness setting.
Color on Meta Quest devices
Meta Quest headsets are factory-calibrated to ensure consistent results across units and to properly depict the desired colors. Calibration measures the light emitted from the display and transmitted through the lenses. For headsets with two independent displays (Meta Quest Pro, Meta Quest 3), calibration avoids binocular rivalry (the adverse effect where, if the colors presented to each eye are too different, the left and right images cannot be fused) and ensures brightness uniformity (the left and right eyes have the same brightness response). Factory calibration abstracts this complexity from developers, so the device can be treated as a single, cohesive unit.
As of August 2024, current Meta Quest headsets cover the sRGB color space gamut. Meta Quest Pro offers expanded color capabilities by covering the Display P3 gamut as well as increased contrast via a local dimming backlight.
Mastering recommendations
We recommend mastering content targeting the wide Display P3 color space. Display P3 offers about 50% more colors, more richly saturated, than sRGB.
Illustration showing the more saturated shades of red, blue, and green available with Display P3 compared to sRGB (source).
Do keep in mind that mainstream devices only support the standard sRGB color gamut; out-of-gamut values are simply clipped. Use Display P3-only colors (i.e., those exceeding the sRGB gamut) carefully, e.g., for highlights and purposefully highly saturated content.
Display P3-mastered content (left) will get clipped when shown on an sRGB-only capable device (center). Even though the clipped pixels look different than the reference (right), they represent a graceful degradation: Display P3-capable devices will show the full, saturated colors while sRGB-only devices still preserve enough detail.
Since all current Meta Quest headsets use LCD technology, dark pixel values do not get as dim as expected, and the optical stack further reduces the usable contrast. These limitations prevent the displays from meaningfully differentiating brightness levels below 13 out of 255 for 8-bit sRGB values, or about 0.0015 out of 1.0 for linear-RGB shader output values. If a fully immersive app makes extensive use of these dark ranges, author the content in a higher brightness range as much as possible.
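As a rough illustration of that guidance, the snippet below flags linear-RGB output values that fall under the dark floor quoted above. The constant and function are ours, not part of any SDK, and the threshold is only the approximate figure given in this section.

```cpp
// Illustrative check (not an SDK call): flag linear-RGB shader output values
// that fall below the usable dark floor described above, where LCD panels and
// the optical stack can no longer differentiate brightness levels.
constexpr float kDarkFloorLinear = 0.0015f;  // roughly 13/255 in 8-bit sRGB terms

bool BelowUsableDarkRange(float r, float g, float b)
{
    return r < kDarkFloorLinear && g < kDarkFloorLinear && b < kDarkFloorLinear;
}
```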
Images and video captured via the in-headset Camera app are composed in the sRGB color space to facilitate sharing; wide-gamut content will be clipped. Other rectilinear mirrored outputs such as the live view in Meta Quest Developer Hub are composed as sRGB images too.
Captures using the raw display output, such as scrcpy and adb shell screencap, use the same color adjustment that was applied for the HMD output. These post-distortion mirror outputs may not have acceptable color-space accuracy when viewed externally, since the captured images effectively use the native color space of the display and optics module.
DO use a color-managed workflow when developing assets.
DO use a properly calibrated monitor when working with colors.
DO specify in an immersive app the same color space that was used to develop the assets. OpenXR rendering pipelines and textures do not have an inherent color space; it is the developer's responsibility to keep track of the color space used throughout, so that all inputs, processing, and outputs are in agreement. Use the XR_FB_color_space OpenXR extension to set the color space of the application output (see the sketch at the end of this section).
DON'T re-interpret sRGB content as P3. Explicitly set the correct color space instead.
DO: sRGB content correctly displayed.
DON'T: sRGB content improperly shown (reinterpreted) as Display P3 values. This leads to over-saturation and loss of detail.
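For OpenXR apps, the color-space declaration mentioned above is made through the XR_FB_color_space extension. The sketch below shows the general shape of that call, assuming the extension was enabled when the XrInstance was created; consult the current OpenXR headers and the extension documentation for the exact enum value your content needs.

```cpp
#include <openxr/openxr.h>

// Declare the app's output color space via XR_FB_color_space, so the runtime
// does not have to guess how to interpret submitted swapchain values. Minimal
// sketch: the extension must have been enabled on the XrInstance, and real
// code should check every XrResult.
void SetOutputColorSpaceToRec709(XrInstance instance, XrSession session)
{
    PFN_xrSetColorSpaceFB xrSetColorSpaceFB = nullptr;
    xrGetInstanceProcAddr(instance, "xrSetColorSpaceFB",
                          reinterpret_cast<PFN_xrVoidFunction*>(&xrSetColorSpaceFB));

    if (xrSetColorSpaceFB != nullptr)
    {
        // Rec.709 shares its primaries with sRGB; choose the P3 value instead
        // if the app's assets were authored for Display P3.
        xrSetColorSpaceFB(session, XR_COLOR_SPACE_REC709_FB);
    }
}
```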