Design

Voice

Updated: Mar 13, 2026
Discover how voice can be used to enhance experiences. This page focuses on voice as an input method. It covers the benefits and challenges of this input method, how it works, and design principles. For dos and don’ts, see Voice Best Practices.

Usage

Voice input enables users to interact with devices and applications using spoken language. By leveraging advancements in speech recognition and natural language processing, voice input allows for hands-free, efficient, and accessible user experiences. This modality is especially important for users with disabilities, those in situations where manual input is impractical, or when multitasking.
Voice input enhances inclusivity, reduces cognitive load, and eases friction in user interfaces.

Terminology

These are frequently used terms to be familiar with:
Anaphora Resolution
In NLP and linguistics, anaphora is the use of a word or phrase that refers back to a previous word or phrase, often a pronoun like “he,” “she,” or “it.” The goal of anaphora resolution is to determine what the anaphoric expression refers to, which can be crucial for understanding the meaning of a sentence or text. Example: “John gave his book to Mary. She was very happy with it.” Here, “She” refers back to “Mary,” and “it” refers back to “his book.”
Automatic Speech Recognition (ASR)
ASR is the technology underlying speech-to-text conversion. It uses machine learning algorithms and statistical models to recognize patterns in speech and convert them into written language. ASR systems can be trained on large datasets of speech recordings and their corresponding transcripts to improve their accuracy.
Barge-In
In conversational systems, barge-in refers to the ability of a user to interrupt a system’s response or prompt with their own input, usually by speaking over the system. This feature allows users to quickly correct errors, provide additional information, or change the direction of the conversation.
Beamforming
In audio signal processing, beamforming is a technique used to enhance the quality of audio signals captured by multiple microphones. By combining the signals from multiple microphones, beamforming creates a virtual microphone that focuses on a specific area or direction, reducing background noise and improving the signal-to-noise ratio.
Confirmation
Feedback provided to the user after a voice command has been executed, confirming the action taken.
Contextual Understanding
The ability of a voice input system to understand the context in which a command is given, including previous interactions and environmental factors.
Disambiguation
The process of clarifying ambiguous or unclear voice commands, often through follow-up questions or prompts.
Error Handling
Strategies for dealing with errors or misinterpretations in voice input, such as providing feedback or offering alternatives.
Large Language Models (LLMs)
Large Language Models (LLMs) are artificial intelligence (AI) models trained on vast amounts of text data to learn patterns, relationships, and structures within language. They are designed to process and generate human-like language, enabling them to perform a wide range of natural language processing (NLP) tasks.
Latency
The delay between the user speaking and the system responding to their voice input.
Natural Language Processing (NLP)
Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that deals with the interaction between computers and humans in natural language. It involves the development of algorithms, statistical models, and machine learning techniques to process, understand, and generate natural language data.
Speech Recognition
Speech recognition, also known as voice recognition, refers broadly to the ability of a system to identify and process spoken language. It encompasses ASR (transcription) as well as related capabilities like speaker identification and intent recognition.
Speech to Text
The process of converting spoken language into written text; the transcription capability provided by automatic speech recognition (ASR).
Generative Speech / Text to Speech (TTS)
The ability of a computer system to synthesize spoken words from text, using techniques such as concatenative synthesis, parametric synthesis, and generative AI.
Turn-based Interaction
A type of interaction where the user speaks and then waits for the system to respond before speaking again.
Voice User Interface (VUI)
A Voice User Interface (VUI) is a type of user interface that allows users to interact with a system, device, or application using voice commands. VUIs use speech recognition technology to interpret and process spoken language, enabling users to perform tasks, access information, and control devices without the need for physical input such as typing or clicking.
Wake Word
A specific word or phrase used to activate a voice input system, such as “Hey Meta”.

Voice interaction guidelines

This section offers guidance on voice-based interaction techniques used in voice experience design. Discover input mappings and the design principles that shape voice experiences.

Input mappings

Discover the input capabilities and interaction methods that use voice as an input modality:
Speech Recognition Behavior
  • The basic unit of speech recognition interaction is an utterance, a single spoken phrase or sentence.
  • Identifying the intent behind a user's utterance is crucial in determining the appropriate response. Intents can be categorized into different types, such as informational, transactional, or navigational.
  • Entities are the specific details that provide context to an utterance. Examples include names, dates, locations, and quantities. Accurate entity recognition is vital for effective speech recognition interactions.
  • Context plays a significant role in shaping the user's expectations and behavior. Designing interactions that take into account the user's context can lead to more accurate and relevant responses.
  • Confirmations are used to ensure that the system has correctly understood the user's intent or to request additional information. This primitive helps prevent errors and improves overall accuracy.
  • Disambiguation techniques, such as asking follow-up questions or providing options, help clarify user intent when it's unclear or ambiguous.
  • Effective error handling strategies, such as explaining and offering alternatives, can mitigate the impact of errors and maintain user satisfaction.
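The utterance → intent → entities → response flow above can be sketched as follows. The intent names, regex patterns, and confidence threshold are illustrative assumptions, not a real NLU API; the point is how low confidence routes to disambiguation and unknown intents route to error handling.

```python
# Minimal sketch of the utterance -> intent -> entities -> response flow.
import re

def parse_utterance(utterance: str) -> dict:
    """Classify a toy intent and extract a time entity, with confidence."""
    text = utterance.lower()
    if "timer" in text:
        match = re.search(r"(\d+)\s*minute", text)
        return {
            "intent": "set_timer",
            "entities": {"minutes": int(match.group(1))} if match else {},
            # A missing entity lowers confidence, triggering disambiguation.
            "confidence": 0.9 if match else 0.4,
        }
    return {"intent": "unknown", "entities": {}, "confidence": 0.0}

def respond(parsed: dict) -> str:
    if parsed["confidence"] >= 0.8:       # confident: confirm the action
        return f"Timer set for {parsed['entities']['minutes']} minutes."
    if parsed["intent"] == "set_timer":   # ambiguous: ask a follow-up question
        return "For how many minutes should I set the timer?"
    # Error handling: explain and offer an alternative.
    return "Sorry, I didn't catch that. You can say 'set a timer'."

print(respond(parse_utterance("Set a timer for 10 minutes")))
# -> Timer set for 10 minutes.
```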
Natural Language Processing (NLP) Design Application
  • Tokenization is the process of splitting text into smaller units, such as words or subwords, to facilitate analysis and understanding.
  • Part of Speech (POS) tagging helps determine the part of speech (e.g., noun, verb, adjective) for each word in a sentence, enabling more accurate interpretation.
  • Named Entity Recognition (NER) involves identifying named entities, such as people, places, organizations, and dates, to provide context and meaning.
  • Sentiment analysis evaluates the emotional tone or attitude conveyed in text, such as positive, negative, or neutral.
  • Intent identification involves determining the purpose or objective behind a piece of text, such as making a request or providing information.
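The NLP steps above can be illustrated with a toy pipeline. Real systems use trained models for each step; the lexicons and rules below are hypothetical stand-ins that just show how the stages connect.

```python
# Toy illustration of the NLP steps: tokenization, POS tagging, NER,
# sentiment analysis, and intent identification.
import re

POS_LEXICON = {"book": "VERB", "a": "DET", "flight": "NOUN", "to": "ADP", "paris": "PROPN"}
KNOWN_PLACES = {"paris", "london", "tokyo"}
POSITIVE_WORDS = {"great", "love", "happy"}

def analyze(text: str) -> dict:
    tokens = re.findall(r"\w+", text.lower())                 # tokenization
    pos = [(t, POS_LEXICON.get(t, "X")) for t in tokens]      # POS tagging
    entities = [t for t in tokens if t in KNOWN_PLACES]       # NER
    sentiment = "positive" if set(tokens) & POSITIVE_WORDS else "neutral"
    intent = "book_flight" if "book" in tokens and "flight" in tokens else "unknown"
    return {"tokens": tokens, "pos": pos, "entities": entities,
            "sentiment": sentiment, "intent": intent}

result = analyze("Book a flight to Paris")
print(result["intent"], result["entities"])  # -> book_flight ['paris']
```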
Voice-Based Navigation
  • The system should clearly announce the available menu options.
  • Users should be able to select a specific option using voice commands.
  • The system should allow users to navigate through submenus using voice commands, such as "What's next?" or "Go back."
  • The system should confirm user input and verify its understanding of the selected option to ensure accuracy.
  • The system should be able to handle errors, such as incorrect user input or unavailable options, and provide alternative solutions or suggestions.
  • The system should group related options together.
  • The system should provide shortcut options for frequently used actions.
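The navigation behaviors above can be sketched as a small state machine: announce options, accept selections, support “go back,” and restate alternatives on unrecognized input. The menu contents are illustrative.

```python
# Sketch of a voice menu state machine following the guidance above.

MENUS = {
    "main": ["calendar", "messages", "settings"],
    "settings": ["volume", "privacy"],
}

class VoiceMenu:
    def __init__(self):
        self.stack = ["main"]  # navigation path, enabling "go back"

    def announce(self) -> str:
        options = ", ".join(MENUS[self.stack[-1]])
        return f"You are in {self.stack[-1]}. Options: {options}."

    def handle(self, command: str) -> str:
        command = command.lower().strip()
        if command == "go back":
            if len(self.stack) > 1:
                self.stack.pop()
            return self.announce()
        if command in MENUS[self.stack[-1]]:
            if command in MENUS:           # submenu: navigate into it
                self.stack.append(command)
                return self.announce()
            return f"Opening {command}."   # leaf: confirm the action
        # Error handling: restate the available options as alternatives.
        return f"I didn't recognize that. {self.announce()}"

menu = VoiceMenu()
print(menu.handle("settings"))  # -> You are in settings. Options: volume, privacy.
print(menu.handle("volume"))    # -> Opening volume.
```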
Emotion Detection
  • The system should be able to identify and categorize emotions into different types, such as happiness, sadness, anger, or fear.
  • The system should be able to detect the intensity of an emotion, such as mild, moderate, or extreme.
  • The system should be able to recognize how emotions are expressed through voice, such as tone, pitch, volume, and rate.
  • Continuously test and refine the system to ensure that it is effective and respectful.
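A rough sketch of the last three requirements: mapping prosodic features (pitch, volume, speaking rate) to an emotion category and intensity level. The thresholds and labels below are hypothetical; production systems use trained classifiers on richer acoustic features.

```python
# Illustrative heuristic mapping prosodic features to emotion labels.

def classify_emotion(pitch_hz: float, volume_db: float, rate_wps: float) -> dict:
    """Very rough heuristic: e.g. high pitch plus high volume reads as 'angry'."""
    if volume_db > 70 and pitch_hz > 220:
        label = "angry"
    elif pitch_hz > 220 and rate_wps > 3.0:
        label = "excited"
    elif pitch_hz < 140 and rate_wps < 2.0:
        label = "sad"
    else:
        label = "neutral"
    # Intensity scales with how far volume departs from a calm baseline.
    deviation = abs(volume_db - 60)
    intensity = "extreme" if deviation > 15 else "moderate" if deviation > 5 else "mild"
    return {"label": label, "intensity": intensity}

print(classify_emotion(pitch_hz=250, volume_db=78, rate_wps=3.5))
# -> {'label': 'angry', 'intensity': 'extreme'}
```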
Biometric Authentication
  • The system should capture a user's unique voiceprint, which is a digital representation of their voice.
  • The system should verify the speaker's identity by comparing their voiceprint to a stored template or profile.
  • The system should request authentication from the user, such as asking them to say a specific phrase or provide a voice sample.
  • Use machine learning algorithms to improve the accuracy of speaker verification over time.
  • Provide users with control over their biometric data and how it is used.
  • Ensure that the system is accessible and usable for all users, including those with disabilities.
  • Continuously test and refine the system to ensure that it is effective and secure.
  • Consider integrating with other systems, such as identity management or access control systems, to provide additional functionality.
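The voiceprint-verification step above can be sketched as comparing a voice sample’s embedding against a stored template with cosine similarity. The embedding vectors and acceptance threshold are illustrative; real systems extract embeddings from audio with a trained speaker-encoder model.

```python
# Sketch of speaker verification via cosine similarity of voiceprints.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def verify_speaker(sample: list[float], template: list[float],
                   threshold: float = 0.85) -> bool:
    """Accept the speaker if the sample is close enough to the enrolled template."""
    return cosine_similarity(sample, template) >= threshold

enrolled = [0.9, 0.1, 0.4]        # stored voiceprint template
attempt = [0.88, 0.12, 0.41]      # embedding of a new voice sample
print(verify_speaker(attempt, enrolled))  # -> True
```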
Multimodal Interaction
  • The system should define the input modalities used in the interaction, such as voice, text, gesture, or gaze.
  • The system should define the output modalities used in the interaction, such as speech, text, images, or video.
  • Different systems can allow users to switch between different input modalities during an interaction, such as switching from voice to text.
  • The system should handle errors and exceptions that occur during multimodal interactions, such as providing alternative input methods.
  • The system should establish metrics to evaluate the effectiveness of multimodal interactions, such as measuring user satisfaction and task completion rates.
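The modality-switching and error-handling points above can be sketched as a simple router: try voice, and on recognition failure offer an alternative input method instead of dead-ending. The event shape and modality names are illustrative.

```python
# Sketch of multimodal fallback: route input events and switch
# modalities when voice recognition fails.

def handle_input(event: dict) -> str:
    """Route an input event, falling back to text entry on voice failure."""
    modality = event.get("modality")
    if modality == "voice":
        if event.get("recognized"):
            return f"voice:{event['text']}"
        # Error handling: offer an alternative modality.
        return "prompt:Sorry, I couldn't hear that. You can also type your request."
    if modality == "text":
        return f"text:{event['text']}"
    return "prompt:Unsupported input. Please speak or type."

print(handle_input({"modality": "voice", "recognized": False}))
```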
Accessibility Features
  • The four guiding principles of accessible design are known by the acronym POUR: perceivable, operable, understandable, and robust.
  • Perceivable: Provide content in ways that users can perceive, regardless of their abilities. Examples:
    • Provide alternative text for images (e.g., alt tags).
    • Use high contrast colors to improve readability.
    • Offer closed captions or transcripts for audio and video content.
  • Operable: Ensure that users can interact with your product using a variety of methods. Examples:
    • Make navigation and controls keyboard-accessible.
    • Provide clear and consistent navigation menus.
    • Allow users to adjust font sizes and line heights.
  • Understandable: Present content in a way that is easy to understand. Examples:
    • Use clear and concise language.
    • Avoid jargon and technical terms unless necessary.
    • Provide definitions for complex concepts.
  • Robust: Build products that work across different devices, browsers, and assistive technologies. Examples:
    • Test your product on various devices and browsers.
    • Use semantic HTML to ensure proper structure and meaning.
    • Follow web standards and best practices for coding and development.
These input capabilities and interaction methods can be used in various applications, including virtual assistants, smart home devices, and accessibility tools. By utilizing voice as the primary input modality, these systems can provide a more natural and intuitive way for users to interact with technology.
For a comprehensive overview of all input modalities and their corresponding input mappings, please see Input mappings.

Design principles

Here are the fundamental concepts that shape user-friendly interactions for this input modality.
Ensure Transparent Activation
Provide clear mic activation awareness through earcons or conversational cues.
Provide Clear Privacy Controls
Offer users the option to opt-in to voice experiences where their voice will be captured, respecting their privacy.
Manage Cognitive Load
Limit the amount of information users must hold in their heads. Use breadcrumbing techniques to guide users through complex interactions. Breadcrumbing refers to providing users with a clear and consistent navigation path through a voice-based conversation, acknowledging their current position and preventing confusion and anxiety.
Example: User: “Hey Alexa, what’s my schedule like today?” Alexa: “You have a meeting at 2 PM. You’re currently in the ‘Calendar’ menu. To go back to the main menu, say ‘Go back’.”
Empower Users with Data Ownership
Provide users with a clear understanding of how their audio data is being used and protected, including the ability to revoke consent or delete their data.
Design Contextually Aware Interactions
Design voice interactions that adapt to the user’s context, considering both visual and audio-only experiences. Use passthrough or immersive experiences and embodied interactions to create a seamless experience.
Implement Effective Error Handling and Repair
Implement a system for handling and repairing errors that acknowledges the disconnect and provides options for solving the error. Structure prompts to guide users towards resolving errors and provide clear instructions for repair. Prioritize user understanding and resolution in the design of the error handling system.
Educate Users and Facilitate Onboarding
Educate users about new interaction modes and mental models, especially for NPC (Non-Player Character) interactions, through a robust NUX (New User Experience). Teach users how to prompt effectively and empower creators to take responsibility for prompting in Horizon. Frame education in the context of evolving mental models to help users adapt to new technologies.

Limitations and mitigations

When integrating voice as an input modality, it’s essential to consider the limitations and mitigations of this modality. Please see Voice Best Practices for more information.

Next steps

More design resources on voice

Designing experiences

Explore more design guidelines and learn how to design great experiences for your app users:
  • Input Modalities: Discover all the various input modalities.
  • Hands: Examine hands-based input methods.
  • Head: Examine head-based input methods.
  • Controllers: Examine controller-based input methods.
  • Peripherals: Learn how to design experiences that leverage peripherals.