Sound design for immersive experiences
Sound design is the creative process of crafting a soundscape in which to place your end user, and immersive experiences allow for a more captivating audio experience than any other medium. This section will explain the core concepts of immersive sound design and how you can use them to build your next immersive app.
Most spatialization techniques model sound sources as infinitely small point sources; that is, sound is treated as if it were coming from a single point in space as opposed to a large area, or a pair of discrete speakers. As a result, sounds should be authored as monophonic (single channel) sources.
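If an asset only exists in stereo, it can be collapsed to a single channel before spatialization. The sketch below is a minimal illustration, assuming the audio is already loaded as a NumPy array; a plain channel average is used here, though some pipelines apply pan-law compensation instead.

```python
import numpy as np

def downmix_to_mono(stereo: np.ndarray) -> np.ndarray:
    """Average the channels of a (num_samples, num_channels) buffer into one channel.

    A plain mean is the simplest reasonable choice; some pipelines apply a
    -3 dB pan-law compensation instead.
    """
    if stereo.ndim == 1:
        return stereo  # already mono
    return stereo.mean(axis=1)

# Example: a fake one-second stereo buffer at 48 kHz
stereo = np.random.uniform(-1.0, 1.0, size=(48000, 2))
mono = downmix_to_mono(stereo)
```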
Pure tones such as sine waves lack harmonics or overtones, which present several issues:
- Pure tones do not commonly occur in the real world, so they often sound unnatural. This does not mean you should avoid them entirely, since many immersive experiences are abstract, but it is worth keeping in mind.
- HRTFs work by filtering a sound’s frequency content, and since pure tones contain only a single frequency, they give the HRTF little to work with and are difficult to spatialize.
- Any glitches or discontinuities in the HRTF process will be more audible since there is no additional frequency content to mask the artifacts. A moving sine wave will often bring out the worst in a spatialization implementation.
Use wide spectrum sources
For the same reasons that pure tones are not ideal for spatialization, broad spectrum sounds (such as noise, rushing water, and wind) spatialize very effectively because they provide lots of frequencies for the HRTF to work with. They also help mask audible glitches that result from dynamic changes to HRTFs, pan, and attenuation. In addition to a broad spectrum of frequencies, ensure that there is significant frequency content above 1500 Hz, since humans rely heavily on these frequencies for sound localization.
Low frequency sounds are difficult for humans to locate; this is why home theater systems use a monophonic subwoofer channel. If a sound is predominantly low frequency (rumbles, drones, shakes, etc.), then you can avoid the overhead of spatialization and use pan/attenuation instead.
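As a rough way to apply both guidelines, the sketch below estimates how a sound’s energy is split across the spectrum. The 120 Hz and 1500 Hz cutoffs and the FFT-based analysis are illustrative assumptions, not values or tooling from any particular SDK.

```python
import numpy as np

def spectral_energy_split(mono: np.ndarray, sample_rate: int,
                          low_cutoff: float = 120.0,
                          localization_cutoff: float = 1500.0):
    """Return the fraction of spectral energy below `low_cutoff` and above
    `localization_cutoff` for a mono buffer. Thresholds are illustrative."""
    spectrum = np.abs(np.fft.rfft(mono)) ** 2
    freqs = np.fft.rfftfreq(len(mono), d=1.0 / sample_rate)
    total = spectrum.sum() + 1e-12
    low_fraction = spectrum[freqs < low_cutoff].sum() / total
    high_fraction = spectrum[freqs > localization_cutoff].sum() / total
    return low_fraction, high_fraction

# A sound that is mostly rumble is a candidate for plain pan/attenuation;
# a sound with plenty of energy above ~1500 Hz will localize well with HRTFs.
rumble = np.sin(2 * np.pi * 40.0 * np.arange(48000) / 48000.0)
low_frac, high_frac = spectral_energy_split(rumble, 48000)
print(f"below 120 Hz: {low_frac:.2f}, above 1500 Hz: {high_frac:.2f}")
```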
When it comes to sound design for immersive experiences, it may come as a surprise, but realism is not necessarily the end goal. Keep this in mind at all times. As with lighting in computer environments, what is consistent and/or “correct” may not be aesthetically desirable. Audio teams must be careful not to back themselves into a corner by enforcing rigid notions of lifelike accuracy on an immersive experience. This is especially true when considering issues such as dynamic range, attenuation curves, and direct time of arrival.
Accurate 3D positioning of sources
For more traditional media, sound is positioned on the horizontal plane with 3D panning, so sound designers working on non-immersive experiences don’t need to concern themselves with the height of sounds and can simply place sound emitters on the root node of an object. HRTF (Head-Related Transfer Function) spatialization provides much more accurate spatial cues, including height, and with this improved accuracy it is especially noticeable when sound emanates from the wrong part of a character. It is important to position the sound emitter at the correct location on a character (e.g. footsteps from the feet, voices from the mouth) to avoid weird phenomena like “crotch steps” or “foot voices”.
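A minimal sketch of that idea follows; the `Character`, `AudioEmitter`, and joint names here are hypothetical stand-ins for illustration, not any engine’s API.

```python
from dataclasses import dataclass, field

@dataclass
class AudioEmitter:
    """Hypothetical stand-in for an engine audio source; only tracks a position."""
    position: tuple = (0.0, 0.0, 0.0)

    def set_position(self, xyz):
        self.position = xyz

@dataclass
class Character:
    """Hypothetical stand-in for a rigged character exposing joint positions."""
    joints: dict = field(default_factory=dict)

    def joint_world_position(self, name):
        return self.joints[name]

# Map each emitter to the joint it should follow, so footsteps come from the
# feet and dialogue comes from the head rather than from the root node.
EMITTER_JOINTS = {"footstep_l": "foot_l", "footstep_r": "foot_r", "voice": "head"}

def update_character_emitters(character, emitters):
    """Call once per frame after animation so emitters track the animated joints."""
    for emitter_name, joint_name in EMITTER_JOINTS.items():
        emitters[emitter_name].set_position(character.joint_world_position(joint_name))

character = Character(joints={"foot_l": (0.1, 0.0, 0.0),
                              "foot_r": (-0.1, 0.0, 0.0),
                              "head": (0.0, 1.7, 0.0)})
emitters = {name: AudioEmitter() for name in EMITTER_JOINTS}
update_character_emitters(character, emitters)
```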
The Oculus Spatializer does not support sound source directivity patterns (speakers, human voice, car horns, et cetera). However, higher-level SDKs often model these using angle-based attenuation that controls how tightly the sound is focused along its facing direction. This directional attenuation should be applied before the spatialization effect.
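A sketch of such angle-based (cone) attenuation is shown below, applied to the source gain before spatialization. The cone angles and outer gain are illustrative assumptions, not parameters of any specific SDK.

```python
import numpy as np

def directivity_gain(source_pos, source_forward, listener_pos,
                     inner_angle_deg=60.0, outer_angle_deg=180.0,
                     outer_gain=0.25):
    """Angle-based attenuation: full gain inside the inner cone, `outer_gain`
    outside the outer cone, and a linear blend in between."""
    to_listener = np.asarray(listener_pos, float) - np.asarray(source_pos, float)
    to_listener /= (np.linalg.norm(to_listener) + 1e-9)
    forward = np.asarray(source_forward, float)
    forward /= (np.linalg.norm(forward) + 1e-9)
    angle = np.degrees(np.arccos(np.clip(np.dot(forward, to_listener), -1.0, 1.0)))
    if angle <= inner_angle_deg / 2:
        return 1.0
    if angle >= outer_angle_deg / 2:
        return outer_gain
    t = (angle - inner_angle_deg / 2) / ((outer_angle_deg - inner_angle_deg) / 2)
    return 1.0 + t * (outer_gain - 1.0)

# A listener standing behind and to the side of the source gets the reduced gain.
# Multiply this gain into the dry signal *before* handing it to the spatializer.
gain = directivity_gain((0, 0, 0), (0, 0, 1), (2, 0, -1))
```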
Not all sounds are point sources. The Oculus Spatializer provides volumetric sound sources to model sounds that need to be more spread out, such as waterfalls, rivers, crowds, and so on. This is controlled with the source radius parameter; read more here:
Volumetric Sounds.
The Doppler effect is the apparent change of a sound’s pitch as the source approaches or recedes. Immersive experiences can emulate this by altering the playback rate based on the relative speed of a sound source and the listener; however, it is very easy to inadvertently introduce artifacts in the process.
The Oculus Spatializer does not have native support for the Doppler effect, but most sound systems/middleware provide the ability to implement the Doppler effect.
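If you do roll your own, the sketch below shows one way to compute a Doppler pitch factor from the textbook formula; the clamping, and smoothing the result over time before applying it as a playback-rate multiplier, are the practical details that keep artifacts at bay. The function and constants here are assumptions for illustration, not part of the Oculus Spatializer.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def doppler_pitch(source_pos, source_vel, listener_pos, listener_vel,
                  max_shift=2.0):
    """Textbook Doppler factor: f_heard = f_emitted * (c + v_listener) / (c - v_source),
    with velocities projected onto the listener->source axis (positive means
    moving toward the other party). Clamp, then smooth over time before using
    it as a playback-rate multiplier to avoid zipper-like artifacts."""
    axis = np.asarray(source_pos, float) - np.asarray(listener_pos, float)
    distance = np.linalg.norm(axis) + 1e-9
    axis /= distance
    v_listener = np.dot(np.asarray(listener_vel, float), axis)   # toward source
    v_source = np.dot(np.asarray(source_vel, float), -axis)      # toward listener
    factor = (SPEED_OF_SOUND + v_listener) / max(SPEED_OF_SOUND - v_source, 1e-3)
    return float(np.clip(factor, 1.0 / max_shift, max_shift))

# A source closing on the listener at 30 m/s is heard roughly 10% sharp.
pitch = doppler_pitch((10, 0, 0), (-30, 0, 0), (0, 0, 0), (0, 0, 0))
```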
In the real world, sound takes time to travel, so there is often a noticeable delay between seeing and hearing something. For example, during a thunderstorm you will see lightning flash before you hear the clap of thunder. Modeling time of arrival delay may paradoxically make things seem less realistic, because it introduces additional latency and can make it feel like the sound is out of sync with the visuals. Also keep in mind that we are conditioned by popular media to believe that loud distant actions are immediately audible.
The Oculus Spatializer does not have native support for time-of-arrival, but if desired for dramatic effect it can be added to specific sounds (like thunder) by adding a short delay in the sound system/middleware.
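The delay itself is simply distance divided by the speed of sound, as in this small sketch (values are for air at roughly room temperature):

```python
SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 degrees C

def time_of_arrival_delay(distance_m: float) -> float:
    """Seconds between an event and its sound reaching the listener."""
    return distance_m / SPEED_OF_SOUND

# Lightning striking 1 km away is heard about 2.9 seconds after the flash.
print(f"{time_of_arrival_delay(1000.0):.1f} s")
```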
A great deal of content, such as music, is mixed in stereo. Since immersive experiences are delivered over stereo headphones, it’s tempting to play stereo sounds without spatialization. The drawback is that these stereo sounds will not be positioned in the virtual world and will not respond to head tracking. This makes the sounds appear “head locked”: they follow the user’s head movements rather than feeling grounded in the virtual world. This can detract from the spatial audio experience and should generally be avoided when possible.
For original compositions it’s best to mix to ambisonics, which can be rotated with head tracking and won’t be head locked. If that is not an option, try to be mindful of how the music impacts the spatial audio.
Performance is an important consideration for any real-time application. The Oculus Spatializer is highly optimized and extremely efficient, but there is some overhead for spatializing sounds compared to traditional 3D panning methods. Even in cases where there is a significant amount of audio processing, it should not impact frame rate, because real-time audio systems process audio on a separate thread from the main graphics render thread. In general you shouldn’t be too limited by the performance overhead of spatialization, but it’s important to know your audio performance budget and measure performance throughout development.
While latency affects all aspects of immersive experiences, it is often viewed as a graphical issue. However, audio latency can be disruptive and immersion-breaking as well. Depending on the speed of the host system and the underlying audio layer, the latency from buffer submission to audible output may be as short as 2 ms on high-performance PCs using high-end, low-latency audio interfaces, or, in the worst case, as long as hundreds of milliseconds.
High system latency becomes an issue as the relative speed between an audio source and the listener’s head increases. In a relatively static scene with a slow-moving viewer, audio latency is harder to detect. As a ballpark, around 100 ms is the threshold at which the delay on head rotations becomes noticeable for most users.
Effects such as filtering, equalization, distortion, flanging, and so on can be an important part of the immersive experience. For example, a low pass filter can emulate the sound of swimming underwater, where high frequencies lose energy much more quickly than in air, or distortion may be used to simulate disorientation.
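As an illustration of the underwater example, even a crude one-pole low-pass filter is enough to hear the effect; production code would normally use the filter built into your sound system/middleware, and ramp the cutoff when entering and leaving the water, rather than this hand-rolled sketch.

```python
import numpy as np

def one_pole_lowpass(signal: np.ndarray, cutoff_hz: float, sample_rate: int) -> np.ndarray:
    """A simple one-pole low-pass filter. A cutoff of a few hundred Hz gives a
    muffled, 'underwater' character; real implementations usually use a
    steeper filter with a smoothly ramped cutoff."""
    # Smoothing coefficient derived from the desired cutoff frequency.
    alpha = 1.0 - np.exp(-2.0 * np.pi * cutoff_hz / sample_rate)
    out = np.empty_like(signal)
    state = 0.0
    for i, x in enumerate(signal):
        state += alpha * (x - state)
        out[i] = state
    return out

# Muffle a buffer of white noise as if heard underwater.
noise = np.random.uniform(-1.0, 1.0, 48000)
underwater = one_pole_lowpass(noise, cutoff_hz=400.0, sample_rate=48000)
```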
With this knowledge you are one step closer to designing a soundscape for your next immersive app. If you’re looking to learn more about immersive audio, we recommend checking out the next guide, which covers
audio localization.
If you’re ready to kick off the technical side of immersive audio design and engineering, be sure to review the following documentation: