Spatial Lingo

Updated: May 11, 2026

Overview

This sample demonstrates how to build an AI-powered mixed reality language-learning experience for Meta Quest using Unity. It combines on-device object detection, Llama-generated vocabulary lessons, and voice-driven interaction to create an immersive learning app where users practice vocabulary by speaking about real objects in their environment.

What you will learn

  • Integrate the Passthrough Camera API, YOLO object detection via Unity Sentis, and the Environment Depth API to identify and track real-world objects in 3D space
  • Generate dynamic lesson content using Llama chat completions and image understanding, with robust retry logic and JSON response parsing
  • Implement multilingual text-to-speech and speech-to-text using the Voice SDK (wit.ai)
  • Build app flow state machines using Unity Visual Scripting with custom units
  • Combine hand tracking and controller input for squeeze-based interaction using the Interaction SDK and OVRPlugin hand bone tracking

Requirements

You need a Meta Quest device and Unity 6000.0.51f1 or later. For detailed build instructions, SDK versions, and setup steps, see the sample’s README.

Get started

Clone the repository from GitHub, open the project in Unity 6, and configure your Llama API key in the SpatialLingoSettings ScriptableObject. Build for Android with the OpenXR backend and deploy to your Quest device. For complete build configuration, see the build instructions.

Explore the sample

The project contains 5 main scenes for the full app experience and 17 standalone sample scenes that demonstrate individual features. Each sample scene isolates a specific API or pattern, making it easy to study the code without navigating the full app.

Main scenes

Scene | What it demonstrates | Key concepts
MainScene | Complete app entry point with all core systems (tracking, voice, lessons, UI flow) | Visual Scripting state machine, system initialization order, Application Variables, world anchor management
GymScene | Development/debugging scene for testing MRUK room scanning and camera object tracking | MRUK room data display, object tracking debug visuals, camera permission flow, Inference Engine preloading
SelectScene | Language selection UI where users choose their target language | Language selection flow, UI localization
LoadingScene | Initial loading screen and system initialization | Loading state, async system setup
ArtScene | Environment art assets for the immersive experience | Visual environment, 3D art integration

Sample scenes

Scene | What it demonstrates | Key concepts
LlamaAPISample | Basic Llama API calls: chat conversation, word cloud generation, translation, example sentences, image understanding | LlamaRestApi.StartNewChat(), ContinueChat(), ImageUnderstanding(), base64 image encoding
AssistantAISample | Higher-level AssistantAI wrapper for word cloud generation, sentence complexity, transcription evaluation | Prompt engineering, JSON response parsing, phonetic similarity evaluation
ObjectRecognitionSample | YOLO object detection on a static image using Unity Sentis, with bounding box overlay | Sentis model loading, YOLO output parsing (8400 detections × 84 values), Non-Maximum Suppression via FunctionalGraph
CameraImageSample | Passthrough Camera API image capture and display on a mesh | WebCamTextureManager, camera resolution configuration, camera pose/orientation
WordCloudSample | 3D word cloud lesson interaction: berry spawning, activation/deactivation, lesson completion flow, voice transcription trigger | Lesson3DInteractor usage, proximity-based activation, CameraTrackedTaxon with sample data
VoiceSynthesizeSample | TTS synthesis in 12 languages via wit.ai, with button-per-language UI | VoiceSynthesizer async API, AudioClip caching, multilingual TTS with romanized fallbacks
TextToSpeech | Text-to-speech functionality with UI controls | TTSSpeaker integration, SSML prosody markup
SpeechToText | Speech-to-text microphone input and transcription display | STT language toggling, microphone input handling
TranscriptionSample | Voice transcription pipeline with language switching | VoiceTranscriber lifecycle, partial vs. full transcription events
CharacterSample | Golly Gosh character animation, emotion, movement, and gaze behavior | Bezier curve movement, MaterialPropertyBlock sprite animation, gaze tracking with lazy follow
ActivitySample | Lesson activity data model and interaction patterns | Activity lifecycle, lesson data structures
AudioSample | Audio system and sound effect playback | Meta XR Audio integration, spatial audio
FindSpawnPositionsSample | MRUK floor placement for spawning objects on the user's floor | MRUK room readiness polling, FindSpawnPositions building block
LanguageSelectSample | Language selection UI flow | Language selection state machine, UI event handling
LessonFlowSample | End-to-end lesson flow with state transitions | FlowController state machine, FlowState lifecycle hooks
PassthroughHighlighting | Passthrough environment highlighting technique | Passthrough API material manipulation
VATSample | Vertex Animation Texture (VAT) shader technique for animations | Shader-based animation, VAT playback

Runtime behavior

When you run the main scene, you see the loading screen while the app initializes systems (Depth API, Passthrough Camera, YOLO model). The app prompts you to grant camera permissions, then scans your room using MRUK to find a floor position. Golly Gosh, the character guide, appears and prompts you to plant a seed that grows into a language tree. After selecting your target language, you look around your room. The app identifies objects in real time (chair, laptop, bottle, etc.) and spawns 3D word clouds near them. When you approach a word cloud, it activates and Golly Gosh speaks the vocabulary in your target language. You repeat the words, the app transcribes your speech and evaluates it using Llama, and on success, a berry flies to your tree, which grows through three tiers as you complete lessons.

Key concepts

Object detection pipeline

The sample combines three APIs to detect and track objects in 3D space. The Passthrough Camera API captures frames at a configurable resolution (1280×960 in the main app), YOLO object detection runs on-device via Unity Sentis to classify objects into 80 COCO classes, and the Environment Depth API raycasts from 2D bounding boxes to map detections into 3D world positions. The CameraTaxonTracker orchestrates this pipeline:
m_taxonTracker = new CameraTaxonTracker(
    m_environmentRaycastManager,
    m_cameraTextureManager,
    m_imageObjectClassifier);
The tracker produces CameraTrackedTaxon objects with 3D position, extent, and classification confidence. For the complete pipeline setup, see SpatialLingoApp.cs.
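The 2D-to-3D step can be illustrated with a short sketch: build a world-space ray through the bounding-box center from the passthrough camera pose and intrinsics, then intersect it with the environment depth. This is a minimal sketch, not code from the sample; it assumes the PassthroughCameraUtils helper and PassthroughCameraEye enum from Meta's Passthrough Camera API samples and the EnvironmentRaycastManager.Raycast overload that takes a Ray, and the namespaces may differ by SDK version.
using Meta.XR;                    // EnvironmentRaycastManager (Depth API raycasting)
using PassthroughCameraSamples;   // PassthroughCameraUtils, PassthroughCameraEye (sample helpers)
using UnityEngine;

public class DetectionTo3D : MonoBehaviour
{
    [SerializeField] private EnvironmentRaycastManager m_raycastManager;

    // Maps the pixel center of a 2D bounding box to a 3D world position
    // by raycasting against the reconstructed environment depth.
    public bool TryGetWorldPosition(Vector2Int boxCenterPixels, out Vector3 worldPosition)
    {
        // World-space ray through the detection's pixel center, using the
        // passthrough camera's pose and intrinsics.
        var ray = PassthroughCameraUtils.ScreenPointToRayInWorld(
            PassthroughCameraEye.Left, boxCenterPixels);

        if (m_raycastManager.Raycast(ray, out var hit))
        {
            worldPosition = hit.point;
            return true;
        }
        worldPosition = default;
        return false;
    }
}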

Llama-powered lesson generation

When the sample detects an object, it sends the classification name and cropped camera image to Llama to generate a bilingual word cloud. The prompt asks for nouns, adjectives, and verbs in both the user’s native language and the target language, formatted as JSON. The sample uses JsonUtility to parse the response after stripping markdown code fences:
var response = await m_llamaAPI.ContinueChat(chat, request.ToString());
responseText = PrepareJsonStringForParsing(response.Message.Text);
cloud = JsonUtility.FromJson<WordCloudData>(responseText);
The sample includes retry logic (up to 5 attempts per subtask) and differentiates between WiFi and server errors. For the full prompt engineering and parsing logic, see AssistantAI.cs.
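The fence stripping, deserialization, and bounded retry can be sketched as follows. The WordCloudData fields and the helper names here are illustrative and do not mirror the sample's exact definitions; only JsonUtility.FromJson is the same call the sample uses.
using System;
using System.Threading.Tasks;
using UnityEngine;

[Serializable]
public class WordCloudData            // illustrative shape; the sample's fields may differ
{
    public string objectName;
    public string[] nativeWords;
    public string[] targetWords;
}

public static class LlamaJson
{
    // Llama often wraps JSON answers in ```json ... ``` fences; strip them before parsing.
    public static string StripCodeFences(string raw)
    {
        var text = raw.Trim();
        if (!text.StartsWith("```")) return text;
        var firstNewline = text.IndexOf('\n');
        var lastFence = text.LastIndexOf("```", StringComparison.Ordinal);
        return firstNewline >= 0 && lastFence > firstNewline
            ? text.Substring(firstNewline + 1, lastFence - firstNewline - 1).Trim()
            : text;
    }

    // Re-requests and re-parses until a valid payload arrives or attempts run out,
    // mirroring the sample's bounded retry per subtask.
    public static async Task<WordCloudData> RequestWithRetry(
        Func<Task<string>> sendRequest, int maxAttempts = 5)
    {
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            var responseText = StripCodeFences(await sendRequest());
            try
            {
                var cloud = JsonUtility.FromJson<WordCloudData>(responseText);
                if (cloud != null) return cloud;
            }
            catch (Exception)
            {
                // Malformed JSON: ask the model again on the next attempt.
            }
        }
        return null;
    }
}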

Voice interaction

The sample uses the Voice SDK for both text-to-speech and speech-to-text. Golly Gosh speaks in the target language using wit.ai TTS, wrapped by VoiceSpeaker with SSML prosody markup for pitch and rate control. For speech recognition, VoiceTranscriber wraps AppDictationExperience with auto-relisten on microphone timeout and separate events for partial and full transcription. The sample supports 12 languages for both TTS and STT. For TTS implementation, see VoiceSpeaker.cs.
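For orientation, a stripped-down version of that wiring might look like the sketch below. It assumes the Voice SDK's TTSSpeaker.Speak(string) and an AppDictationExperience exposing DictationEvents with partial and full transcription events; the prosody values are illustrative, and this is not the sample's VoiceSpeaker or VoiceTranscriber code.
using Meta.WitAi.Dictation;
using Meta.WitAi.TTS.Utilities;
using UnityEngine;

public class MinimalVoiceWiring : MonoBehaviour
{
    [SerializeField] private TTSSpeaker m_speaker;                 // wit.ai TTS
    [SerializeField] private AppDictationExperience m_dictation;   // wit.ai STT

    // Speak a phrase with SSML prosody markup for pitch and rate control.
    public void Speak(string phrase)
    {
        m_speaker.Speak($"<speak><prosody pitch=\"+10%\" rate=\"95%\">{phrase}</prosody></speak>");
    }

    private void OnEnable()
    {
        // Partial results arrive while the user is still talking; the full
        // transcription fires once the utterance is complete.
        m_dictation.DictationEvents.OnPartialTranscription.AddListener(OnPartial);
        m_dictation.DictationEvents.OnFullTranscription.AddListener(OnFull);
        m_dictation.Activate();
    }

    private void OnDisable()
    {
        m_dictation.DictationEvents.OnPartialTranscription.RemoveListener(OnPartial);
        m_dictation.DictationEvents.OnFullTranscription.RemoveListener(OnFull);
        m_dictation.Deactivate();
    }

    private void OnPartial(string text) => Debug.Log($"partial: {text}");
    private void OnFull(string text) => Debug.Log($"final: {text}");
}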

Visual Scripting app flow

The sample uses Unity Visual Scripting to orchestrate the app flow state machine. Custom Visual Scripting units extend SkippableUnit, which provides debug-skip capability and lifecycle hooks. The sample demonstrates how to pass data between nodes using Application Variables and how to wait for external events:
protected override IEnumerator Await(Flow flow) {
    m_isDone = false;
    OnEnter(flow);
    yield return new WaitUntil(() => m_isDone);
    OnExit();
    yield return m_targetControlOutput;
}
Each custom unit reports its state name via static events, enabling debug UI overlays. For the base class pattern, see SkippableUnit.cs.
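A custom unit built on this pattern only needs to flip m_isDone once its external condition is met. The sketch below is hypothetical: it assumes SkippableUnit exposes OnEnter and OnExit as overridable hooks and that the m_isDone flag is accessible to subclasses, and the event source and variable name are invented for illustration.
using System;
using Unity.VisualScripting;

// Hypothetical event source used only for this sketch.
public static class LanguageSelectionEvents
{
    public static event Action<string> LanguageChosen;
    public static void Raise(string language) => LanguageChosen?.Invoke(language);
}

// Waits until the user picks a language, stores it in an Application Variable,
// then lets the flow continue through the Await pattern shown above.
public class WaitForLanguageUnit : SkippableUnit
{
    private Action<string> m_handler;

    protected override void OnEnter(Flow flow)
    {
        m_handler = language =>
        {
            // Application Variables make the value visible to later graph nodes.
            Variables.Application.Set("TargetLanguage", language);
            m_isDone = true;   // releases the WaitUntil in Await()
        };
        LanguageSelectionEvents.LanguageChosen += m_handler;
    }

    protected override void OnExit()
    {
        LanguageSelectionEvents.LanguageChosen -= m_handler;
    }
}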

Squeeze interaction with hand tracking

The sample implements squeeze interaction by combining Interaction SDK’s TouchHandGrabInteractable with OVRPlugin hand bone tracking. When the user grabs an object, the sample reads bone positions for all five fingertips, computes the average distance from the palm center, and uses the ratio to the starting distance as a squeeze factor:
var thumb = positions[(int)OVRPlugin.BoneId.XRHand_ThumbTip];
var index = positions[(int)OVRPlugin.BoneId.XRHand_IndexTip];
var currentDistance = AverageDistanceFingersCenter(m_selectingHand);
var ratio = currentDistance / m_startSelectDistance;
The sample applies this ratio to visual scale, creating deformation feedback. For the complete implementation, see SqueezableHandInteraction.cs.
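The averaging and scaling steps can be sketched as below, assuming the positions array is indexed by OVRPlugin.BoneId as in the excerpt above; the palm reference joint and clamp range are illustrative choices rather than the sample's exact values.
using UnityEngine;

public static class SqueezeMath
{
    private static readonly OVRPlugin.BoneId[] k_fingertips =
    {
        OVRPlugin.BoneId.XRHand_ThumbTip,
        OVRPlugin.BoneId.XRHand_IndexTip,
        OVRPlugin.BoneId.XRHand_MiddleTip,
        OVRPlugin.BoneId.XRHand_RingTip,
        OVRPlugin.BoneId.XRHand_LittleTip,
    };

    // Average distance from the palm joint to the five fingertips.
    public static float AverageFingertipDistance(Vector3[] positions)
    {
        var palm = positions[(int)OVRPlugin.BoneId.XRHand_Palm];
        var sum = 0f;
        foreach (var tip in k_fingertips)
        {
            sum += Vector3.Distance(positions[(int)tip], palm);
        }
        return sum / k_fingertips.Length;
    }

    // Maps the squeeze ratio onto a uniform scale for deformation feedback.
    public static void ApplySqueeze(Transform target, float currentDistance, float startDistance)
    {
        var ratio = currentDistance / Mathf.Max(startDistance, 1e-4f);
        target.localScale = Vector3.one * Mathf.Clamp(ratio, 0.5f, 1f);
    }
}
Clamping the ratio keeps the deformation within a plausible range even if a fingertip briefly loses tracking.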

Extend the sample

  • Add support for additional object classes by replacing the YOLO model with a custom Sentis model trained on domain-specific objects (food items, office supplies, etc.)
  • Implement lesson evaluation using on-device speech recognition instead of Llama API calls to reduce latency and network dependency
  • Add vocabulary quiz modes by extending the Visual Scripting state machine with new states in the LessonFlowSample pattern, such as timed challenges or multi-object sentence construction