Discover how voice can be used to enhance experiences. This page focuses on voice as an input method. It covers the benefits and challenges of this input method, how it works, design principles, and dos and don’ts.
Usage
Voice input enables users to interact with devices and applications using spoken language. By leveraging advancements in speech recognition and natural language processing, voice input allows for hands-free, efficient, and accessible user experiences. This modality is especially important for users with disabilities, those in situations where manual input is impractical, or when multitasking.
Voice input enhances inclusivity, reduces cognitive load, and eases friction in user interfaces.
Terminology
These are the frequently used terms to be familiar with:
Term
Definition
Anaphora Resolution
In NLP and linguistics, anaphora refers to the use of a word or phrase that refers back to a previous word or phrase, often using pronouns like “he,” “she,” or “it.” The goal of anaphora resolution is to determine what the anaphoric expression is referring to, which can be crucial for understanding the meaning of a sentence or text. Example: “John gave his book to Mary. She was very happy with it.” In this example, the anaphora “She” refers back to “Mary,” and the anaphora “it” refers back to “his book.”
Automatic Speech Recognition (ASR)
ASR is a technology that enables computers to transcribe spoken words into text. It uses machine learning algorithms and statistical models to recognize patterns in speech and convert them into written language. ASR systems can be trained on large datasets of speech recordings and their corresponding transcripts to improve their accuracy.
Barge In
In conversational systems, barge-in refers to the ability of a user to interrupt a system’s response or prompt with their own input, usually by speaking over the system. This feature allows users to quickly correct errors, provide additional information, or change the direction of the conversation.
Beamforming
In audio signal processing, beamforming is a technique used to enhance the quality of audio signals captured by multiple microphones. By combining the signals from multiple microphones, beamforming creates a virtual microphone that focuses on a specific area or direction, reducing background noise and improving the signal-to-noise ratio.
Confirmation
Feedback provided to the user after a voice command has been executed, confirming the action taken.
Contextual Understanding
The ability of a voice input system to understand the context in which a command is given, including previous interactions and environmental factors.
Disambiguation
The process of clarifying ambiguous or unclear voice commands, often through follow-up questions or prompts.
Error Handling
Strategies for dealing with errors or misinterpretations in voice input, such as providing feedback or offering alternatives.
Large Language Models (LLMs)
Large Language Models (LLMs) are a type of artificial intelligence (AI) model that is trained on vast amounts of text data to learn patterns, relationships, and structures within language. These models are designed to process and generate human-like language, enabling them to perform a wide range of natural language processing (NLP) tasks.
Latency
The delay between the user speaking and the system responding to their voice input.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that deals with the interaction between computers and humans in natural language. It involves the development of algorithms, statistical models, and machine learning techniques to process, understand, and generate natural language data.
Speech Recognition
Speech recognition, also known as speech-to-text or voice recognition, is a technology that enables computers to transcribe spoken words into text. It uses machine learning algorithms and statistical models to recognize patterns in speech and convert them into written language.
Speech to Text
The ability of a computer system to transcribe spoken words into written text, using speech recognition technology.
Generative speech - Text to Speech (TTS)
The ability of a computer system to synthesize spoken words from text, using techniques such as concatenative synthesis, parametric synthesis, and generative AI.
Turn-based Interaction
A type of interaction where the user speaks and then waits for the system to respond before speaking again.
Voice User Interface (VUI)
A Voice User Interface (VUI) is a type of user interface that allows users to interact with a system, device, or application using voice commands. VUIs use speech recognition technology to interpret and process spoken language, enabling users to perform tasks, access information, and control devices without the need for physical input such as typing or clicking.
Wake Word
A specific word or phrase used to activate a voice input system, such as “Hey Meta”.
Technology and how it works
This section provides an overview of voice input technology, covering its functionality, accuracy, limitations, and strategies to overcome them. At a high level, voice input is facilitated by the Voice SDK, which utilizes on-device microphones to capture user speech. The process involves:
Audio Capture: Player voice input is captured through the headset’s microphone.
Speech Recognition: The audio is then processed using machine learning algorithms to transcribe spoken words into text.
Intent Recognition: The transcribed text is analyzed to identify the player’s intent, such as executing a voice command or interacting with an NPC. A simplified sketch of this flow follows.
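The Voice SDK handles these stages for you, but the data flow can be illustrated conceptually. The following minimal Python sketch represents the three stages as plain functions; `capture_audio`, `transcribe`, and `resolve_intent` and their keyword-matching logic are illustrative placeholders, not Voice SDK APIs.

```python
# Conceptual sketch of the voice input pipeline: capture -> transcribe -> intent.
# All function bodies are placeholders; a real app delegates these stages to the
# Voice SDK and a speech/NLU backend.

def capture_audio() -> bytes:
    """Stand-in for reading a buffer from the headset microphone."""
    return b"\x00" * 16000  # 1 second of 8-bit silence at 16 kHz, for illustration

def transcribe(audio: bytes) -> str:
    """Stand-in for automatic speech recognition (ASR)."""
    return "reload the cannon"  # hard-coded transcript for the sketch

def resolve_intent(transcript: str) -> dict:
    """Very naive intent matching; real systems use a trained NLU model."""
    if "reload" in transcript:
        return {"intent": "reload_weapon", "entities": {"weapon": "cannon"}}
    return {"intent": "unknown", "entities": {}}

if __name__ == "__main__":
    audio = capture_audio()
    text = transcribe(audio)
    print(resolve_intent(text))  # {'intent': 'reload_weapon', 'entities': {'weapon': 'cannon'}}
```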
Activation model
Users need a simple and discoverable way to activate or invoke voice interactions on your app. This should align with the aesthetic and theme of the app or game, as well as the interaction model. Below are several examples of common custom activation methods.
UI affordance
A UI affordance is an option within the user interface that can be used for activation such as a mic button, exclamation point above a character’s head, and so on. Common use cases include keyboard dictation or search activities.
Pros
High discoverability and reliability mean users can easily find and activate features.
A low learning curve ensures new users can learn quickly.
High input device adaptation allows the affordance to transfer easily to different input devices.
Cons
Easy to overload the paradigm: Too many icons or affordances can cause confusion.
Adds another element to the interface, increasing its visual density.
Not always accessible: Visual elements may not be accessible to those with visual impairments or when using hand-tracking.
Examples
Controller click on microphone button
Hand pinch click on UI button
Hand direct touch on UI button
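However the affordance is triggered (controller click, pinch, or direct touch), it ultimately toggles a listening state and the associated status cue. The sketch below is a conceptual illustration only; `start_listening` and `stop_listening` are assumed placeholders for whatever your voice integration exposes, not real SDK calls.

```python
# Minimal sketch: a mic-button affordance that toggles listening state.

class MicButton:
    def __init__(self):
        self.listening = False

    def start_listening(self):
        print("mic on: capturing audio")         # show a mic-on status cue here

    def stop_listening(self):
        print("mic off: audio capture stopped")  # show a mic-off status cue here

    def on_pressed(self):
        """Called on controller click, hand pinch, or direct touch."""
        self.listening = not self.listening
        if self.listening:
            self.start_listening()
        else:
            self.stop_listening()

button = MicButton()
button.on_pressed()   # mic on
button.on_pressed()   # mic off
```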
Immersion
This option is often very effective and natural when used within the context of a game. Some common examples are rubbing a magic lamp, standing in front of a magic mirror, or talking to a non-player character (NPC). The primary use case is voice-driven gameplay.
Pros
Preserves immersion by integrating seamlessly into normal user activity, ensuring the game flow remains uninterrupted.
Enhances voice-driven gameplay and novel interactions, maintaining the user's focus on the game.
Contextual availability ensures it is only accessible when needed, minimizing accidental activation.
Cons
Discoverability can be low. Users might miss an option if they don't know it exists. For instance, they could pass by an NPC who would otherwise engage them.
Conveying state is challenging.
Transparency is crucial for maintaining confidence in an app or game, but virtual activation makes this difficult.
Gaze
This option uses the first person perspective and eye tracking as the activation method. It can be combined with other gestures or game elements to provide an immersive experience. For example, gazing at a non-player character and waving to speak or locking onto a target during game play to use a voice command.
Pros
Accessible when hands are not available, this feature enhances accessibility and offers more flexibility in interacting with non-player characters.
It is especially useful when hand-tracking is in use.
By requiring multiple steps for activation, combining gaze with other inputs reduces accidental activations and increases certainty.
This natural interaction form has a lower learning curve, providing a faster and more intuitive activation method.
Cons
Accidental activation is common when using head or eye gaze control alone, as precise control is notoriously difficult; this reduces accuracy and increases the chance of unintentional activation.
Clear listening affordance is essential. Providing a clear indication of successful activation is crucial, as false activation without system state indication poses a privacy risk.
Examples
Look at a world locked target
Look at a body locked target
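One way to reduce the accidental activations described above is to require two signals within a short window, for example gaze on a target plus a confirming gesture. The sketch below is a conceptual gate, not an eye-tracking API; `gaze_on_target` and `pinch_detected` are hypothetical inputs that your tracking layer would supply.

```python
import time

# Conceptual gate: activate voice input only when the user gazes at a target
# AND performs a confirming gesture within a short window.

GESTURE_WINDOW_S = 1.5   # how long a gaze "arms" the activation

class GazeGestureActivation:
    def __init__(self):
        self.gaze_started_at = None

    def on_gaze(self, gaze_on_target: bool):
        self.gaze_started_at = time.monotonic() if gaze_on_target else None

    def on_gesture(self, pinch_detected: bool) -> bool:
        """Returns True when voice input should be activated."""
        if not pinch_detected or self.gaze_started_at is None:
            return False
        return (time.monotonic() - self.gaze_started_at) <= GESTURE_WINDOW_S

gate = GazeGestureActivation()
gate.on_gaze(True)            # user looks at the NPC
print(gate.on_gesture(True))  # True: gaze plus pinch inside the window
```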
Gesture
Activation by hand gesture is useful within the context of voice-driven gaming. It can be combined with a head/eye gaze or used separately, as when making a gesture with a wand. Use cases include voice-driven gameplay and multi-sequence activation.
Pros
Being accessible without controllers enhances accessibility and allows more flexible interactions with non-player characters.
It preserves immersion by integrating seamlessly into normal user activity, maintaining the game's flow.
Cons
Could be activated by mistake. Hand gestures are easier to control precisely than head or eye gaze, but many users find it difficult to keep their hands still and often move them unconsciously, which increases the chance of unintentional activation.
It’s crucial to provide a clear indication when a gesture has successfully activated listening, as false activation without a system state indication poses a real privacy risk. Gesture recognition might also be constrained by lighting conditions.
Repeated use can cause fatigue and takes time to repeat.
Example
Based on hand distance and gesture to activate voice feature
Interaction model
The interaction model is the core of the experience and occurs after activation. It can be a single-turn or multi-turn interaction, depending on the specific use case for your app. Regardless of the form of interaction, three principles should always be considered during the design:
Transparency: Your app should transparently communicate what the system is doing to the user. Surprises can be a good thing in a game, but not when interacting with the system.
Preserve immersion: In immersive games, immersion is the goal and a significant part of the user experience. Don’t take the user out of that experience when interacting with them. This might mean using a more immersive voice, dialog, or situation in which the player communicates with the system.
Minimize Disruption: Ensure that you move smoothly from the game into an interaction model and then smoothly move out again. An interaction with the player may not be expected by them, but it shouldn’t feel out of character for the game. By minimizing the disruption that interactions can cause, the player will remain immersed, enjoying the game more, and be more likely to return to the game at a later time.
Single turn
This form of interaction consists of a single request. Typically there would also be a single response to conclude the interaction. It is generally used to complete a single task, e.g. turning off the lights, getting stock updates, setting timers, etc. This interaction is also referred to as a one-shot utterance or one-shot request.
Multi-turn
This form of interaction is usually conversation driven and consists of multiple exchanges. It is generally used for scenarios that require back-and-forth, e.g. speaking to a non-player character, ordering something, guided activity, etc. During multi-turn interactions, it’s important that there is an option for users to dismiss the interaction, or that the system times out when there hasn’t been a reply for a certain amount of time. A minimal loop illustrating both behaviors is sketched below.
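The sketch is synchronous and uses a placeholder `listen_for_reply` function (an assumption, not a real API); a real implementation would hook into your voice integration’s callbacks and threading model.

```python
# Sketch of a multi-turn interaction loop with a dismiss command and a
# no-reply timeout. `listen_for_reply` is a placeholder for real voice input.

NO_REPLY_TIMEOUT_S = 8.0
DISMISS_PHRASES = {"never mind", "cancel", "stop"}

def listen_for_reply(timeout_s: float) -> str | None:
    """Placeholder: return the user's utterance, or None if nothing was heard."""
    return "never mind"

def run_conversation(prompts: list[str]) -> None:
    for prompt in prompts:
        print(f"NPC: {prompt}")
        reply = listen_for_reply(NO_REPLY_TIMEOUT_S)
        if reply is None:
            print("NPC: (no reply heard, ending the conversation)")
            return
        if reply.lower() in DISMISS_PHRASES:
            print("NPC: As you wish. Come back any time.")
            return
        print(f"Player said: {reply}")

run_conversation(["What supplies do you need?", "Anything else?"])
```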
Attention model
Attention systems provide your users with an easy, integrated way to understand what’s happening with their voice interactions within an app. This can indicate what they can interact with using voice within a game or how to get feedback on their voice input. Creating an effective attention system for your app is essential for providing a good experience for your users. It’s also the best way to address users’ concerns that the game is always listening to them.
What is an attention system?
Creating an attention system is the first step in designing an efficient voice system for your app. It makes things easier for your users by providing audio and visual cues that let them know the microphone is active. It also creates a way to provide feedback about how your user’s voice commands are received, including responding to errors, reducing the perception of latency, and even just indicating that a voice command has been received. It’s a way for you, as the developer, to deepen the immersive experience that lets your users enjoy themselves more.
Additionally, attention systems are critical to helping your users know when the microphone is “active” and “listening.”
Core components
Some basic attention system components are available for you to use in the Voice SDK Toolkit available on GitHub. These can be valuable to use while you customize the look and feel of your attention system to fit the app style and experience you’re creating.
Mic status cues (required)
Status cues can be used when mapped with the attention system states to visually guide your user and let them know when their mic is on and when the system is ready to receive their voice command input. This helps prevent frustration and misunderstanding about when to speak.
Some basic techniques can be used to show when the mic is on, so your user knows when their audio input is being collected.
Important: The Mic on status should be accurately represented on screen as soon as your app calls for audio input and maintained for the full duration that audio collection is enabled, until the mic is turned off.
A basic mic status attention system can be added to the user’s headset view or to individual objects or characters. More elaborate systems can be used, but the following cover the basic states that should be communicated:
Mic on: This status covers the Listening (On), Inactive, Processing, and Response mic states.
Mic off: This status covers the Not Listening mic state.
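A small state mapping can keep visual cues in sync with the mic states listed above. In the sketch below, the `AttentionState` names mirror the states mentioned in this section, and the cue strings are placeholders for whatever icons, animations, or earcons your app actually uses.

```python
from enum import Enum, auto

# Map attention-system states to a simple mic-on / mic-off status cue.

class AttentionState(Enum):
    LISTENING = auto()      # mic on, actively capturing audio
    INACTIVE = auto()       # mic on, waiting for speech
    PROCESSING = auto()     # mic on, utterance being transcribed / understood
    RESPONDING = auto()     # mic on, system is replying
    NOT_LISTENING = auto()  # mic off

MIC_ON_STATES = {
    AttentionState.LISTENING,
    AttentionState.INACTIVE,
    AttentionState.PROCESSING,
    AttentionState.RESPONDING,
}

def status_cue(state: AttentionState) -> str:
    return "mic_on_icon" if state in MIC_ON_STATES else "mic_off_icon"

print(status_cue(AttentionState.PROCESSING))     # mic_on_icon
print(status_cue(AttentionState.NOT_LISTENING))  # mic_off_icon
```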
Audio feedback cues
Using a visual cue or icons as a way to provide audio feedback is an easy way to help your user to gauge and calibrate the volume from their mic so voice commands can be heard and processed properly.
Before using a visual cue like this, you should test it to ensure the animation syncs correctly with audio input devices. You should also test it in the type of environment where you expect your user to use your app.
Earcons
You can also use audio feedback to reinforce a visual attention system that shows mic states, or as an alternate cue when the mic is open, for increased accessibility and immersion.
A selection of custom crafted earcon sounds is available to start. When using them, remember that activation earcons will be played frequently in an app, so use one that is short and easy to listen to repeatedly for a better user experience.
Combining these elements
The following status cues and icons, combined with animation and earcons, provide an example of how you can use these tools in combination to effectively show voice interaction states.
These elements can be seen working in combination in the following video:
Advanced examples
As you continue to explore ways to bring voice interaction into your app experiences, you may consider integrating attention systems directly into character and environment design. For example:
Character animations: A character has specific expressions and gestures to show active listening or comprehension.
Environment design: An object glows and moves, indicating interaction or response.
Dialogue action prompts: The user is prompted and issues voice commands by way of conversational dialogue with an NPC.
Transcription: Displaying the text transcription feedback of audio input can provide an additional indication as to when the mic is open and receiving input. This can also serve as light user guidance about how voice commands are heard by the system so that your user can then adjust input.
For additional information on using Attention Systems with Voice SDK, see the Voice SDK Toolkit.
Error responses
Errors are inevitable in any app and voice systems are prone to errors that don’t exist in other apps. Users may speak too quickly or mumble, causing the system not to understand. Alternatively, the system may ask a question but not receive a response because the user doesn’t hear, notice, or understand.
Common error types
No Match Error. These errors occur when the user says something that the system doesn’t understand. This is the most common error when there’s an ASR (Automatic Speech Recognition) error. However, it can also occur if the user is in a noisy environment, their utterance isn’t included in the app’s recognition grammar, or when the user doesn’t respond fully or clearly enough for the system to understand. For example:
System (hears): “Refrat the camon!”
System (responds): “Sorry captain, what was that?”
System (hears): “I said reload the cannon!”
System (responds): “Aye captain! Cannoneers, reload!”
Error response: A quick response asking the user to repeat themselves is usually the best way to handle a No Match error. However, if the user’s utterance is already in response to a question, don’t repeat the question. This only serves to sound robotic or break the user’s immersion. Rephrase the response, instead.
No Input Error. These errors occur when the system expects a response and doesn’t receive one. This can happen for a variety of reasons, such as if the user doesn’t speak loudly enough for the microphone or when they don’t say anything while the microphone is active. The user might be thinking about their answer, unsure about how to respond, distracted, or paused for some other reason. For example:
System (statement): “You got it, Player 1! Where should we go?”
System (hears): “...”
System (responds): “Player 1, which direction shall we go?”
System (hears): “Oh, let’s take troops south to D1”
System (responds): “Got it, south to D1. On our way!”
Error response: Rephrasing the question with additional detail is usually the best option with this error. As with the No Match error, it’s important not to repeat the question verbatim, so as not to sound robotic.
Out of Domain Error. This error occurs when the system understands what the user said, but can’t act upon it for some (game-oriented) reason. This is sometimes known as an Intent error. This can happen because a game rule prevents the requested action from occurring, or because the functionality has not been built into the app. For example:
System (hears): “Bring me that sword.”
System (responds): “I wish I could, but I’m actually not able to pick up any weapons.”
Error response: Wherever possible, you want to guide the user toward a solution, or provide them with information so they can contextualize the response. The most effective response is to explain what the app can’t do, for example, “I’m actually not able to pick up any weapons,” so the user doesn’t try it again. However, you should take care that the system’s response doesn’t break the user’s immersion.
System Error. This is a catch-all category for errors that are out of the user’s or system’s control. These can include client errors, server errors, or virtually any other error that might happen in an unexpected way or in the system backend. For example:
User (speaks): “Take me to the starting zone.”
<System error>
System (responds): “Sorry, something weird happened. Try saying that again.”
Error response: To the extent possible, you should have one or more error responses available that can be used for this type of error. An appropriate response should be as broad as possible to cover many situations, but not so generic that it sounds like a mechanical response. If you have specific information on the error that can be acted upon by the users, such as “Sorry, the internet is not connected,” it can be added to the response, but you should make an effort to prevent it from breaking the user’s immersive experience as much as possible.
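The error types above lend themselves to a small lookup of varied responses, which also supports the “don’t repeat the same response” guidance in the next subsection. The sketch below cycles through per-error variants; the categories follow this section, and the phrasing is illustrative only.

```python
import itertools

# Sketch: rotate through varied error responses per error type so that a
# repeated error never produces the exact same line twice in a row.

ERROR_RESPONSES = {
    "no_match": ["Sorry captain, what was that?",
                 "The wind took your words, say it again?"],
    "no_input": ["Which direction shall we go?",
                 "Still with me? Point me somewhere."],
    "out_of_domain": ["I wish I could, but I can't pick up weapons."],
    "system": ["Sorry, something weird happened. Try saying that again."],
}

# One independent cycling iterator per error type.
_response_cycles = {kind: itertools.cycle(lines)
                    for kind, lines in ERROR_RESPONSES.items()}

def error_response(kind: str) -> str:
    return next(_response_cycles.get(kind, _response_cycles["system"]))

print(error_response("no_match"))  # first variant
print(error_response("no_match"))  # second variant
```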
Best practices
Errors are a lot more common than many developers realize or expect, and so designing your voice experience with a strong and broad set of error responses will greatly increase your user’s enjoyment.
Several best practices to keep in mind when designing error responses include the following:
The best error responses are context-specific. Evaluate the context of the scene when creating the error response. Look at the environment, previous questions asked, the user’s position, and other factors. Ask yourself “Would it be weird if a character said this to me?” before applying a given error response.
Make the error sound like part of the game. If the user is playing an action game, consider creating a visual cue that communicates that the user should repeat themselves if there is a No Match error. When designing NPC dialog, expect the user to pause before responding, so provide them enough time to answer, and have follow-up questions ready for No Input errors.
Don’t repeat the same response over and over. Users will often repeat what they said, but “carefully.” If it’s a No Match or No Input error, this may work, but for other errors, it might not. There may truly be an ASR or NLU error that prevents the system from understanding. Repeating the same message such as “Sorry, I didn’t get that” will quickly become tiresome to users. Make your app more dynamic for your users by creating variant responses or additional error messages for when the same error is repeated.
Review your metrics. You can use errors as a signal for ways to improve the user voice experience. For example:
Are users asking for things you didn’t anticipate?
Where are these errors happening most often? Try and figure out why these errors are happening where they do.
Are users asking for things they can do, but in ways you didn’t anticipate? Look at failed utterances to help address future user scenarios.
Are people asking for supported features, but the app keeps misunderstanding the requests?
User education responses
User education within an app is a constant challenge when developing voice experiences. The set of possible things a user can say in the app is often not the same as the set of things they might want to say. This is especially true for fully immersive experiences.
User education buckets
First-time users of an app have slightly different needs from returning users. The mechanisms for both user education and discovery differ, depending on how much prior knowledge users have about the experience.
First-time users (Discovery)
These users are sometimes provided a new user experience at the beginning of the game. User education given here should be relatively high-level. Providing a couple of voice input examples is helpful, but you should focus on categories and interaction models such as Move Objects, Order Attacks, Cast Spells, Navigation, or Actions. These are categories of things the user can do, rather than things spoken to the game itself. You should consider the following things when designing a first-time user experience:
How does the user invoke the voice experience?
How does the user know when to talk?
How do they know when the microphone is on?
What can they say?
How does the user learn more of what they can say?
Returning users (Retention)
These users are already familiar with some mechanisms of the app and you want them to become familiar with in-app user education and discovery mechanisms that they can use themselves, like contextual tips. Such tips on what they can say should be as contextual as possible without breaking their immersion or disrupting app mechanics.
Types of user education in fully immersive experiences
There are a number of different resources that can help users learn how to use their voices for in-app experiences.
When a user must navigate a complex sequence of voice commands, using a guided walkthrough can make this easier by allowing you to string contextual tips together into a walkthrough. This flow can help show the user key dynamics or elements for their experience in the app. It can be enhanced with a progress indicator for the one-off tips, such as a progress bar, or text indicating “2 of 4 complete.”
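A guided walkthrough can be modeled as an ordered list of contextual tips plus a progress readout such as “2 of 4 complete.” The sketch below is a plain data structure; the tip text is invented purely for illustration.

```python
# Sketch of a guided walkthrough: ordered contextual tips plus a progress label.

TIPS = [
    "Say 'Open inventory' to see your items.",
    "Look at the cannon and say 'Reload' to arm it.",
    "Say 'Cast fireball' while aiming at a target.",
    "Say 'What can I say?' any time for more commands.",
]

class Walkthrough:
    def __init__(self, tips: list[str]):
        self.tips = tips
        self.index = 0

    def current_tip(self) -> str:
        return self.tips[self.index]

    def progress_label(self) -> str:
        return f"{self.index} of {len(self.tips)} complete"

    def advance(self) -> bool:
        """Move to the next tip; returns False when the walkthrough is done."""
        self.index += 1
        return self.index < len(self.tips)

w = Walkthrough(TIPS)
print(w.current_tip())       # Say 'Open inventory' to see your items.
w.advance()
print(w.progress_label())    # 1 of 4 complete
```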
App landing
When a user starts an app, they are taken to a specific location within it where they can begin. Here, they can pick a level to play, customize their loadout, or select other options. While they’re here, you can also show users a list of things they can say in the app, such as by using a menu tab, or perhaps as an entity floating in space. You can think of this as showing them the “controller settings” or “button layout,” as many apps do. These instructions should focus on broader categories, such as “Move Objects” or “Cast Spells.” It can also be helpful to show some examples of the specific commands users can utter.
Embedded tips
Shown to the user as a seamless aspect or extension of the natural environment, these tips can be the most immersive way to teach users what they can do or say, but care must be taken to prevent the tip from breaking the immersive experience. For example, an NPC could be teaching the user a new spell and ask them to say the conjuring words. Other options could be an item indicating somehow that it can be triggered by voice, or a character in the game beckoning the user over to indicate that they can talk with it by voice.
Non-embedded tips
Unlike embedded tips, non-embedded tips won’t show as part of the environment, even though they still exist within the context of the game. Rather, they appear to the user as a layer on top of the game, much like an exit sign above a door. Users don’t need to interact with these tips, and disregarding them shouldn’t impact the player’s progress in the game. However, users can use them to get more information that’s pertinent to the game. For example, an unobtrusive icon that can give the user a hint if they select it, or perhaps an arrow pointing toward one of the user’s goals.
Voice search
In a user’s journey within a game, they may ask simple questions, such as “How do I pause?” or “How do I save?” or other game-level and how-to queries. The easiest and most natural way for users to do this is by talking normally to the system. The Voice Search bar enables users to do this. It also enables you, as the developer, to create some generic help intents that can be useful.
Voice interaction guidelines
This section offers guidance on voice-based interaction techniques used in voice experience design. Discover input primitives, understand design principles, consider ergonomic factors, and learn the essential dos and don’ts.
Interaction Primitives
Discover various input capabilities and interaction methods that use voice as an input modality:
Speech Recognition Behavior
The basic unit of speech recognition interaction is an utterance, a single spoken phrase or sentence.
Identifying the intent behind a user's utterance is crucial in determining the appropriate response. Intents can be categorized into different types, such as informational, transactional, or navigational.
Entities are the specific details that provide context to an utterance. Examples include names, dates, locations, and quantities. Accurate entity recognition is vital for effective speech recognition interactions.
Context plays a significant role in shaping the user's expectations and behavior. Designing interactions that take into account the user's context can lead to more accurate and relevant responses.
Confirmations are used to ensure that the system has correctly understood the user's intent or to request additional information. This primitive helps prevent errors and improves overall accuracy.
Disambiguation techniques, such as asking follow-up questions or providing options, help clarify user intent when it's unclear or ambiguous.
Effective error handling strategies, such as explaining and offering alternatives, can mitigate the impact of errors and maintain user satisfaction.
Natural Language Processing (NLP) Design Application
Tokenization is the process of splitting text into smaller units, such as words or subwords, to facilitate analysis and understanding.
Part of Speech (POS) tagging helps determine the part of speech (e.g., noun, verb, adjective) for each word in a sentence, enabling more accurate interpretation.
Named Entity Recognition (NER) involves identifying named entities, such as people, places, organizations, and dates, to provide context and meaning.
Sentiment analysis evaluates the emotional tone or attitude conveyed in text, such as positive, negative, or neutral.
Intent identification involves determining the purpose or objective behind a piece of text, such as making a request or providing information.
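Production systems use trained models for these steps, but a toy pipeline helps make the primitives concrete. The sketch below uses only the Python standard library: naive tokenization, regex-based entity spotting, keyword sentiment, and keyword intent matching, all stand-ins for real NLP components (POS tagging is omitted because it needs a trained tagger).

```python
import re

# Toy NLP pipeline illustrating the primitives above. Every step is a crude
# stand-in: real systems use trained tokenizers, NER, sentiment, and intent models.

def tokenize(text: str) -> list[str]:
    return re.findall(r"\w+|[^\w\s]", text.lower())

def extract_entities(text: str) -> dict:
    """Spot a few entity types with regular expressions."""
    return {
        "quantities": re.findall(r"\b\d+\b", text),
        "dates": re.findall(r"\b(?:monday|tuesday|wednesday|thursday|friday|"
                            r"saturday|sunday|today|tomorrow)\b", text.lower()),
    }

def sentiment(tokens: list[str]) -> str:
    positive, negative = {"great", "love", "happy"}, {"terrible", "hate", "angry"}
    score = sum(t in positive for t in tokens) - sum(t in negative for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

def intent(tokens: list[str]) -> str:
    if {"order", "buy", "purchase"} & set(tokens):
        return "transactional"
    if {"where", "go", "navigate"} & set(tokens):
        return "navigational"
    return "informational"

text = "Order 2 health potions for tomorrow, that would be great"
tokens = tokenize(text)
print(tokens[:4], extract_entities(text), sentiment(tokens), intent(tokens))
```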
Voice-Based Navigation
The system should clearly announce the available menu options.
Users should be able to select a specific option using voice commands.
The system should allow users to navigate through submenus using voice commands, such as "What's next?" or "Go back."
The system should confirm user input and verify its understanding of the selected option to ensure accuracy.
The system should be able to handle errors, such as incorrect user input or unavailable options, and provide alternative solutions or suggestions.
The system should group related options together.
The system should provide shortcut options for frequently used actions.
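A minimal menu handler illustrates several of these behaviors at once: announcing options, matching a spoken selection, confirming it, handling “go back,” and recovering from unknown input. Everything here is illustrative; in practice the spoken selection would come from your intent layer rather than exact string matching.

```python
# Sketch of voice-based menu navigation: announce options, match a spoken
# selection, confirm it, support "go back", and recover from unknown input.

MENUS = {
    "main": ["new game", "load game", "settings"],
    "settings": ["audio", "graphics", "controls"],
}

def announce(menu: str) -> str:
    return f"You can say: {', '.join(MENUS[menu])}."

def handle_utterance(menu: str, utterance: str) -> tuple[str, str]:
    """Returns (next_menu, spoken_response)."""
    said = utterance.lower().strip()
    if said == "go back":
        return "main", "Going back to the main menu."
    if said in MENUS[menu]:
        next_menu = said if said in MENUS else menu   # enter a submenu if one exists
        return next_menu, f"Okay, {said}. Is that right?"
    return menu, f"Sorry, I didn't catch that. {announce(menu)}"

print(announce("main"))
print(handle_utterance("main", "settings"))   # confirm and enter the submenu
print(handle_utterance("settings", "jump"))   # error handling with suggestions
```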
Emotion Detection
The system should be able to identify and categorize emotions into different types, such as happiness, sadness, anger, or fear.
The system should be able to detect the intensity of an emotion, such as mild, moderate, or extreme.
The system should be able to recognize how emotions are expressed through voice, such as tone, pitch, volume, and rate.
Continuously test and refine the system to ensure that it is effective and respectful.
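Emotion detection itself requires trained models, but the vocal cues mentioned above (volume, rate, and so on) map onto measurable signal features. The numpy sketch below computes two crude proxies, RMS energy for volume and zero-crossing rate as a rough rate/brightness cue, and buckets intensity; it is a feature-extraction illustration only, not an emotion classifier, and the thresholds are arbitrary.

```python
import numpy as np

# Crude vocal-cue features: RMS energy as a proxy for volume and zero-crossing
# rate as a rough rate/brightness cue. A real emotion detector would feed
# features like these (plus pitch, tempo, spectral features) into a trained model.

def rms_energy(samples: np.ndarray) -> float:
    return float(np.sqrt(np.mean(samples ** 2)))

def zero_crossing_rate(samples: np.ndarray) -> float:
    signs = np.sign(samples)
    return float(np.mean(signs[:-1] != signs[1:]))

def intensity_bucket(energy: float) -> str:
    if energy < 0.05:
        return "mild"
    if energy < 0.2:
        return "moderate"
    return "extreme"

# Synthetic 1-second "voice" signal at 16 kHz, just for the example.
t = np.linspace(0, 1, 16000, endpoint=False)
samples = 0.3 * np.sin(2 * np.pi * 220 * t)

energy = rms_energy(samples)
print(round(energy, 3), round(zero_crossing_rate(samples), 3), intensity_bucket(energy))
```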
Biometric Authentication
The system should capture a user's unique voiceprint, which is a digital representation of their voice.
The system should verify the speaker's identity by comparing their voiceprint to a stored template or profile.
The system should request authentication from the user, such as asking them to say a specific phrase or provide a voice sample.
Use machine learning algorithms to improve the accuracy of speaker verification over time.
Provide users with control over their biometric data and how it is used.
Ensure that the system is accessible and usable for all users, including those with disabilities.
Continuously test and refine the system to ensure that it is effective and secure.
Consider integrating with other systems, such as identity management or access control systems, to provide additional functionality.
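Speaker verification usually reduces to comparing a fresh voice embedding against an enrolled template. The sketch below shows that comparison with cosine similarity and a threshold; `extract_voiceprint` is a hypothetical stand-in for a real speaker-embedding model, and the threshold value is arbitrary.

```python
import numpy as np

# Sketch of speaker verification: compare a new voiceprint embedding against an
# enrolled template using cosine similarity.

SIMILARITY_THRESHOLD = 0.85

def extract_voiceprint(audio: np.ndarray) -> np.ndarray:
    """Placeholder embedding: in practice this is a trained speaker model."""
    rng = np.random.default_rng(int(audio.sum() * 1000) % (2**32))
    return rng.normal(size=128)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(new_audio: np.ndarray, enrolled_template: np.ndarray) -> bool:
    similarity = cosine_similarity(extract_voiceprint(new_audio), enrolled_template)
    return similarity >= SIMILARITY_THRESHOLD

enrolled_audio = np.ones(16000) * 0.1
template = extract_voiceprint(enrolled_audio)
print(verify(enrolled_audio, template))        # True: same (toy) voiceprint
print(verify(np.ones(16000) * 0.2, template))  # likely False: different input
```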
Multimodal Interaction
The system should define the input modalities used in the interaction, such as voice, text, gesture, or gaze.
The system should define the output modalities used in the interaction, such as speech, text, images, or video.
Different systems can allow users to switch between different input modalities during an interaction, such as switching from voice to text.
The system should handle errors and exceptions that occur during multimodal interactions, such as providing alternative input methods.
The system should establish metrics to evaluate the effectiveness of multimodal interactions, such as measuring user satisfaction and task completion rates.
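A simple arbitration layer can express the switching and fallback behavior described above: try the preferred modality, fall back to an alternative on error, and record which modality completed the task for your metrics. The modality names and handlers below are illustrative placeholders, not real APIs.

```python
# Sketch of multimodal input arbitration: try the preferred modality first,
# fall back to alternatives on failure, and record which one succeeded.

class InputError(Exception):
    pass

def voice_input() -> str:
    raise InputError("no match")   # simulate a failed voice attempt

def text_input() -> str:
    return "open settings"         # fallback modality succeeds

MODALITIES = [("voice", voice_input), ("text", text_input)]

def get_command() -> tuple[str, str]:
    """Returns (modality_used, command); tries modalities in priority order."""
    for name, handler in MODALITIES:
        try:
            return name, handler()
        except InputError:
            print(f"{name} input failed, offering an alternative...")
    raise InputError("all modalities failed")

modality, command = get_command()
print(modality, "->", command)   # text -> open settings
```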
Accessibility Features
The four guiding principles of accessible design are known by the acronym POUR: perceivable, operable, understandable, and robust.
Perceivable: Provide content in ways that users can perceive, regardless of their abilities. Examples:
Provide alternative text for images (e.g., alt tags).
Use high contrast colors to improve readability.
Offer closed captions or transcripts for audio and video content.
Operable: Ensure that users can interact with your product using a variety of methods. Examples:
Make navigation and controls keyboard-accessible.
Provide clear and consistent navigation menus.
Allow users to adjust font sizes and line heights.
Understandable: Present content in a way that is easy to understand. Examples:
Use clear and concise language.
Avoid jargon and technical terms unless necessary.
Provide definitions for complex concepts.
Robust: Build products that work across different devices, browsers, and assistive technologies. Examples:
Test your product on various devices and browsers.
Use semantic HTML to ensure proper structure and meaning.
Follow web standards and best practices for coding and development.
These input capabilities and interaction methods can be used in various applications, including virtual assistants, smart home devices, and accessibility tools. By utilizing voice as the primary input modality, these systems can provide a more natural and intuitive way for users to interact with technology.
For a comprehensive overview of all input modalities and their corresponding Input Primitives, please see Input Primitives.
Design Principles
Here are the fundamental concepts that shape user-friendly interactions for our input modality.
Principle
Description
Ensure Transparent Activation
Provide clear mic activation awareness through earcons or conversational cues.
Provide Clear Privacy Controls
Offer users the option to opt-in to voice experiences where their voice will be captured, respecting their privacy.
Manage Cognitive Load
Limit the amount of information users need to carry in their heads. Use breadcrumbing techniques to guide users through complex interactions. Breadcrumbing refers to the process of providing users with a clear and consistent navigation path through a voice-based conversation. This provides acknowledgement of their current position and prevents confusion and anxiety.
Example:
User: “Hey Alexa, what’s my schedule like today?”
Alexa: “You have a meeting at 2 PM. You’re currently in the ‘Calendar’ menu. To go back to the main menu, say ‘Go back’.”
Empower Users with Data Ownership
Provide users with a clear understanding of how their audio data is being used and protected, including the ability to revoke consent or delete their data.
Design Contextually Aware Interactions
Design voice interactions that adapt to the user’s context, considering both visual and audio-only experiences. Use passthrough or immersive experiences and embodied interactions to create a seamless experience.
Implement Effective Error Handling and Repair
Implement a system for handling and repairing errors that acknowledges the disconnect and provides options for solving the error. Structure prompts to guide users towards resolving errors and provide clear instructions for repair. Prioritize user understanding and resolution in the design of the error handling system.
Educate Users and Facilitate Onboarding
Educate users about new interaction modes and mental models, especially for NPC (Non-Player Character) interactions, through a robust NUX (New User Experience). Teach users how to prompt effectively and empower creators to take responsibility for prompting in Horizon. Frame education in the context of evolving mental models to help users adapt to new technologies.
Limitations and mitigations
When integrating voice as an input modality, it’s essential to consider the limitations and mitigations of this modality. Please see Voice Best Practices for more information.