3D audio spatialization
An essential element of fully immersive audio is spatialization: the ability to play a sound as if it is positioned at a specific point in three-dimensional space. Spatialization is essential to delivering an immersive experience because it provides powerful cues that make the user feel they are actually in the 3D environment.
The Audio Localization page discussed how humans localize audio sources in three dimensions. We will now explore methods for synthesizing those localization cues to produce spatial audio.
The two key components of spatialization are direction and distance. In this guide we’ll cover both of these topics and the technologies that enable them.
The sounds we experience are directly shaped by the geometry of our body (especially our ears) as well as by the direction of the incoming sound. These two elements, our body and the direction of the audio source, form the basis of head-related transfer functions (HRTFs): the filters we use to localize sound.
The most accurate method of HRTF capture is to take an individual, place a pair of microphones in their ears (just outside the ear canal), place the subject in an anechoic chamber (a room with no echo), play sounds in the chamber from every direction we care about, and record those sounds from the ear-mounted microphones. We can then compare the original sound with the captured sound and compute the transfer function between them: this is the Head-Related Transfer Function.
We have to do this for both ears, and we have to capture sounds from a sufficient number of discrete directions to build a usable sample set.
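As a rough illustration of the comparison step, the transfer function can be estimated by dividing the spectrum of the in-ear recording by the spectrum of the excitation signal. The sketch below assumes both signals are already available as NumPy arrays; real measurement pipelines (swept-sine deconvolution, windowing, free-field compensation) are considerably more involved.

```python
import numpy as np

def estimate_hrtf(excitation, ear_recording, n_fft=512, eps=1e-8):
    """Estimate a single-ear HRTF by frequency-domain deconvolution.

    excitation:    the sound played from the loudspeaker (1-D array)
    ear_recording: the signal captured by the in-ear microphone (1-D array)
    Returns the complex transfer function H(f) = Recorded(f) / Played(f).
    """
    X = np.fft.rfft(excitation, n_fft)
    Y = np.fft.rfft(ear_recording, n_fft)
    # Regularize the division so near-zero bins in the excitation
    # spectrum do not blow up the estimate.
    H = Y * np.conj(X) / (np.abs(X) ** 2 + eps)
    return H

# The head-related impulse response (HRIR) is the inverse FFT of the HRTF:
# hrir = np.fft.irfft(H)
```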
But wait — we have only captured HRTFs for a specific person. If our brains are conditioned to interpret the HRTFs of our own bodies, why would that work? Don’t we have to go to a lab and capture a personalized HRTF set?
In a perfect world, yes, we’d all have custom HRTFs measured that match our own body and ear geometry precisely, but in reality this isn’t practical. While our HRTFs are personal, they are similar enough to each other that a generic reference set is adequate for most situations, especially when combined with head tracking.
Most HRTF-based spatialization implementations use one of a few publicly available data sets like those outlined below. These are captured either from a range of human test subjects or from a synthetic head model such as the KEMAR.
Most HRTF databases do not have HRTFs in all directions. For example, there is often a large gap representing the area beneath the subject’s head, as it is difficult, if not impossible, to place a speaker one meter directly below an individual’s head. Some HRTF databases are sparsely sampled, with HRTFs captured only every 5 or 15 degrees.
Most implementations either snap to the nearest acquired HRTF (which exhibits audible discontinuities) or use some method of HRTF interpolation. This is an ongoing area of research, but for immersive applications on desktops, it is often adequate to find and use a sufficiently-dense data set.
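As a minimal illustration of the “snap to nearest” approach, the sketch below assumes the measured HRIRs are stored alongside the unit vector of each measurement direction, and simply picks the entry whose direction is closest to the requested one; interpolation schemes are more involved and out of scope here.

```python
import numpy as np

def nearest_hrir(target_dir, measured_dirs, hrirs):
    """Pick the measured HRIR pair whose direction is closest to target_dir.

    target_dir:    unit vector (3,) pointing from the listener to the source
    measured_dirs: (N, 3) array of unit vectors, one per measured direction
    hrirs:         (N, 2, L) array of left/right impulse responses
    """
    # For unit vectors, the largest dot product corresponds to the
    # smallest angle between the two directions.
    idx = np.argmax(measured_dirs @ target_dir)
    return hrirs[idx]
```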
Given an HRTF set, if we know the direction we want a sound to appear to come from, we can select an appropriate HRTF and apply it to the sound. This is usually done either as a time-domain convolution or as a frequency-domain convolution using an FFT. If you don’t know what these are, don’t worry - those details are only relevant if you are implementing the HRTF system yourself. Our discussion glosses over a lot of the implementation details (e.g., how we store an HRTF, how we use it when processing a sound). For our purposes, what matters is the high-level concept: we are simply filtering an audio signal to make it sound like it’s coming from a specific direction.
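To make that high-level concept concrete, here is a minimal sketch of the filtering step using SciPy’s fftconvolve (a frequency-domain convolution). It assumes a mono source signal and a left/right pair of head-related impulse responses (the time-domain form of the HRTF) already selected for the desired direction.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(mono, hrir_left, hrir_right):
    """Render a mono signal binaurally by convolving it with an HRIR pair.

    fftconvolve performs the convolution in the frequency domain, which is
    much faster than direct time-domain convolution for long signals.
    """
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)  # (samples, 2) stereo buffer
```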
Since HRTFs take the listener’s head geometry into account, it is important to use headphones when performing spatialization. Without headphones, you are effectively applying two HRTFs: the simulated one, and the actual HRTF caused by the geometry of your body.
Listeners instinctively use head motion to disambiguate and fix sound in space. If we take this ability away, our capacity to locate sounds in space is diminished, particularly with respect to elevation and front/back. Even ignoring localization, if we are unable to compensate for head motion, then sound reproduction is mediocre at best. When a listener turns their head 45 degrees to the side, we must provide an accurate audio response, or immersion will be lost.
Immersive headsets such as the Meta Quest provide the ability to track a listener’s head orientation and position. By providing this information to a sound package, we can update the spatial audio to make it feel like the sound is grounded in the virtual world as the user moves through the space. (This assumes that the listener is wearing headphones.) It is possible to mimic this with a speaker array, but it is significantly less reliable, more cumbersome, and more difficult to implement, and thus impractical for most immersive applications.
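The sketch below shows the kind of bookkeeping head tracking adds: converting a world-space source position into a head-relative direction using the tracked listener pose, which is then used to look up the HRTF. The function name and the scalar-last quaternion convention are illustrative assumptions; the actual pose data would come from the headset’s tracking API.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def head_relative_direction(source_pos, listener_pos, listener_quat):
    """Convert a world-space source position into a head-relative unit vector.

    source_pos, listener_pos: world-space positions as (3,) arrays
    listener_quat:            head orientation quaternion (x, y, z, w)
    The result must be recomputed every time the tracker reports a new
    head pose, so the sound stays grounded in the virtual world.
    """
    offset = np.asarray(source_pos) - np.asarray(listener_pos)
    # Rotate the world-space offset into the listener's local frame by
    # applying the inverse of the head orientation.
    local = Rotation.from_quat(listener_quat).inv().apply(offset)
    norm = np.linalg.norm(local)
    return local / norm if norm > 0 else local
```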
HRTFs help us identify a sound’s direction, but they do not model our localization of distance. Humans use several factors to infer the distance to a sound source. These can be simulated with varying degrees of complexity:
- Loudness: Our most reliable distance cue. It is generally easy to model with simple attenuation based on the distance between the source and the listener (see the sketch after this list).
- Initial time delay: Significantly harder to model than loudness, as it requires computing the early reflections for a given set of geometry, along with that geometry’s characteristics. This is both computationally expensive and awkward to implement architecturally (sending world geometry to a lower level API is often complex). Even so, several packages have made attempts at this, ranging from simple “shoebox models” to elaborate, full scene geometric models.
- Direct to reverberant sound (wet and dry mix): A natural byproduct of any system that attempts to accurately model reflections and late reverberations. Unfortunately, such systems tend to be very expensive computationally. With ad hoc models based on artificial reverberators, the mix setting can be adjusted in software, but these are strictly empirical models.
- Motion parallax: The magnitude of directional changes for small head movements helps inform us of the distance of sounds. We get this “for free” due to positional head tracking.
- Air absorption: High-frequency attenuation due to air absorption is a minor effect, but it is also reasonably easy to model by applying a simple low-pass filter, as shown in the sketch after this list. In practice, HF attenuation is not very significant in comparison to the other distance cues.
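The sketch below combines the two cheapest cues from the list above, loudness attenuation and air absorption, into a single helper. The 1/r gain law is standard; the cutoff-versus-distance mapping is an illustrative assumption rather than a physical air-absorption model.

```python
import numpy as np
from scipy.signal import butter, lfilter

def apply_distance_cues(signal, distance, fs, ref_distance=1.0):
    """Apply two simple distance cues to a mono signal.

    signal:   mono source signal (1-D array)
    distance: listener-to-source distance in meters
    fs:       sample rate in Hz
    """
    # Loudness: 1/r attenuation relative to a reference distance,
    # capped so sources closer than ref_distance are not boosted.
    gain = ref_distance / max(distance, ref_distance)
    out = signal * gain

    # Air absorption: lower the low-pass cutoff as the source gets
    # farther away. The constants here are illustrative guesses.
    cutoff = np.clip(20000.0 / (1.0 + 0.05 * distance), 2000.0, 20000.0)
    if cutoff < fs / 2:
        b, a = butter(1, cutoff / (fs / 2), btype="low")
        out = lfilter(b, a, out)
    return out
```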
If you’re ready to kick off the technical side of fully immersive audio design and engineering, be sure to review the following documentation: