Spatial Lingo

Updated: May 11, 2026

Overview

This sample demonstrates how to build an AI-powered mixed reality language-learning experience for Meta Quest using Unity. It combines on-device object detection, Llama-generated vocabulary lessons, and voice-driven interaction to create an immersive learning app where users practice vocabulary by speaking about real objects in their environment.

What you will learn

  • Integrate the Passthrough Camera API, YOLO object detection via Unity Sentis, and the Environment Depth API to identify and track real-world objects in 3D space
  • Generate dynamic lesson content using Llama chat completions and image understanding, with robust retry logic and JSON response parsing
  • Implement multilingual text-to-speech and speech-to-text using the Voice SDK (wit.ai)
  • Build app flow state machines using Unity Visual Scripting with custom units
  • Combine hand tracking and controller input for squeeze-based interaction using the Interaction SDK and OVRPlugin hand bone tracking

Requirements

You need a Meta Quest device and Unity 6000.0.51f1 or later. For detailed build instructions, SDK versions, and setup steps, see the sample’s README.

Get started

Clone the repository from GitHub, open the project in Unity 6, and configure your Llama API key in the SpatialLingoSettings ScriptableObject. Build for Android with the OpenXR backend and deploy to your Quest device. For complete build configuration, see the build instructions.

Explore the sample

The project contains 5 main scenes for the full app experience and 17 standalone sample scenes that demonstrate individual features. Each sample scene isolates a specific API or pattern, making it easy to study the code without navigating the full app.

Main scenes

Scene | What it demonstrates | Key concepts
MainScene | Complete app entry point with all core systems (tracking, voice, lessons, UI flow) | Visual Scripting state machine, system initialization order, Application Variables, world anchor management
GymScene | Development/debugging scene for testing MRUK room scanning and camera object tracking | MRUK room data display, object tracking debug visuals, camera permission flow, Inference Engine preloading
SelectScene | Language selection UI where users choose their target language | Language selection flow, UI localization
LoadingScene | Initial loading screen and system initialization | Loading state, async system setup
ArtScene | Environment art assets for the immersive experience | Visual environment, 3D art integration

Sample scenes

Scene | What it demonstrates | Key concepts
LlamaAPISample | Basic Llama API calls: chat conversation, word cloud generation, translation, example sentences, image understanding | LlamaRestApi.StartNewChat(), ContinueChat(), ImageUnderstanding(), base64 image encoding
AssistantAISample | Higher-level AssistantAI wrapper for word cloud generation, sentence complexity, transcription evaluation | Prompt engineering, JSON response parsing, phonetic similarity evaluation
ObjectRecognitionSample | YOLO object detection on a static image using Unity Sentis, with bounding box overlay | Sentis model loading, YOLO output parsing (8400 detections × 84 values), Non-Maximum Suppression via FunctionalGraph
CameraImageSample | Passthrough Camera API image capture and display on a mesh | WebCamTextureManager, camera resolution configuration, camera pose/orientation
WordCloudSample | 3D word cloud lesson interaction: berry spawning, activation/deactivation, lesson completion flow, voice transcription trigger | Lesson3DInteractor usage, proximity-based activation, CameraTrackedTaxon with sample data
VoiceSynthesizeSample | TTS synthesis in 12 languages via wit.ai, with button-per-language UI | VoiceSynthesizer async API, AudioClip caching, multilingual TTS with romanized fallbacks
TextToSpeech | Text-to-speech functionality with UI controls | TTSSpeaker integration, SSML prosody markup
SpeechToText | Speech-to-text microphone input and transcription display | STT language toggling, microphone input handling
TranscriptionSample | Voice transcription pipeline with language switching | VoiceTranscriber lifecycle, partial vs. full transcription events
CharacterSample | Golly Gosh character animation, emotion, movement, and gaze behavior | Bezier curve movement, MaterialPropertyBlock sprite animation, gaze tracking with lazy follow
ActivitySample | Lesson activity data model and interaction patterns | Activity lifecycle, lesson data structures
AudioSample | Audio system and sound effect playback | Meta XR Audio integration, spatial audio
FindSpawnPositionsSample | MRUK floor placement for spawning objects on the user's floor | MRUK room readiness polling, FindSpawnPositions building block
LanguageSelectSample | Language selection UI flow | Language selection state machine, UI event handling
LessonFlowSample | End-to-end lesson flow with state transitions | FlowController state machine, FlowState lifecycle hooks
PassthroughHighlighting | Passthrough environment highlighting technique | Passthrough API material manipulation
VATSample | Vertex Animation Texture (VAT) shader technique for animations | Shader-based animation, VAT playback

Runtime behavior

When you run the main scene, you see the loading screen while the app initializes systems (Depth API, Passthrough Camera, YOLO model). The app prompts you to grant camera permissions, then scans your room using MRUK to find a floor position. Golly Gosh, the character guide, appears and prompts you to plant a seed that grows into a language tree. After selecting your target language, you look around your room. The app identifies objects in real time (chair, laptop, bottle, etc.) and spawns 3D word clouds near them. When you approach a word cloud, it activates and Golly Gosh speaks the vocabulary in your target language. You repeat the words, the app transcribes your speech and evaluates it using Llama, and on success, a berry flies to your tree, which grows through three tiers as you complete lessons.

Key concepts

Object detection pipeline

The sample combines three APIs to detect and track objects in 3D space. The Passthrough Camera API captures frames at a configurable resolution (1280×960 in the main app), YOLO object detection runs on-device via Unity Sentis to classify objects into 80 COCO classes, and the Environment Depth API raycasts from 2D bounding boxes to map detections into 3D world positions. The CameraTaxonTracker orchestrates this pipeline:
m_taxonTracker = new CameraTaxonTracker(
    m_environmentRaycastManager,
    m_cameraTextureManager,
    m_imageObjectClassifier);
The tracker produces CameraTrackedTaxon objects with 3D position, extent, and classification confidence. For the complete pipeline setup, see SpatialLingoApp.cs.
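The 2D-to-3D step can be illustrated with a short sketch: build a world-space ray through the bounding-box center from the passthrough camera pose and intrinsics, then intersect it with the environment depth. This is a minimal sketch, not code from the sample; it assumes the PassthroughCameraUtils helper and PassthroughCameraEye enum from Meta's Passthrough Camera API samples and the EnvironmentRaycastManager.Raycast overload that takes a Ray, and the namespaces may differ by SDK version.
using Meta.XR;                    // EnvironmentRaycastManager (Depth API raycasting)
using PassthroughCameraSamples;   // PassthroughCameraUtils, PassthroughCameraEye (sample helpers)
using UnityEngine;

public class DetectionTo3D : MonoBehaviour
{
    [SerializeField] private EnvironmentRaycastManager m_raycastManager;

    // Maps the pixel center of a 2D bounding box to a 3D world position
    // by raycasting against the reconstructed environment depth.
    public bool TryGetWorldPosition(Vector2Int boxCenterPixels, out Vector3 worldPosition)
    {
        // World-space ray through the detection's pixel center, using the
        // passthrough camera's pose and intrinsics.
        var ray = PassthroughCameraUtils.ScreenPointToRayInWorld(
            PassthroughCameraEye.Left, boxCenterPixels);

        if (m_raycastManager.Raycast(ray, out var hit))
        {
            worldPosition = hit.point;
            return true;
        }
        worldPosition = default;
        return false;
    }
}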

Llama-powered lesson generation

When the sample detects an object, it sends the classification name and cropped camera image to Llama to generate a bilingual word cloud. The prompt asks for nouns, adjectives, and verbs in both the user’s native language and the target language, formatted as JSON. The sample uses JsonUtility to parse the response after stripping markdown code fences:
var response = await m_llamaAPI.ContinueChat(chat, request.ToString());
responseText = PrepareJsonStringForParsing(response.Message.Text);
cloud = JsonUtility.FromJson<WordCloudData>(responseText);
The sample includes retry logic (up to 5 attempts per subtask) and differentiates between WiFi and server errors. For the full prompt engineering and parsing logic, see AssistantAI.cs.
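The fence stripping, deserialization, and bounded retry can be sketched as follows. The WordCloudData fields and the helper names here are illustrative and do not mirror the sample's exact definitions; only JsonUtility.FromJson is the same call the sample uses.
using System;
using System.Threading.Tasks;
using UnityEngine;

[Serializable]
public class WordCloudData            // illustrative shape; the sample's fields may differ
{
    public string objectName;
    public string[] nativeWords;
    public string[] targetWords;
}

public static class LlamaJson
{
    // Llama often wraps JSON answers in ```json ... ``` fences; strip them before parsing.
    public static string StripCodeFences(string raw)
    {
        var text = raw.Trim();
        if (!text.StartsWith("```")) return text;
        var firstNewline = text.IndexOf('\n');
        var lastFence = text.LastIndexOf("```", StringComparison.Ordinal);
        return firstNewline >= 0 && lastFence > firstNewline
            ? text.Substring(firstNewline + 1, lastFence - firstNewline - 1).Trim()
            : text;
    }

    // Re-requests and re-parses until a valid payload arrives or attempts run out,
    // mirroring the sample's bounded retry per subtask.
    public static async Task<WordCloudData> RequestWithRetry(
        Func<Task<string>> sendRequest, int maxAttempts = 5)
    {
        for (var attempt = 0; attempt < maxAttempts; attempt++)
        {
            var responseText = StripCodeFences(await sendRequest());
            try
            {
                var cloud = JsonUtility.FromJson<WordCloudData>(responseText);
                if (cloud != null) return cloud;
            }
            catch (Exception)
            {
                // Malformed JSON: ask the model again on the next attempt.
            }
        }
        return null;
    }
}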

Voice interaction

The sample uses the Voice SDK for both text-to-speech and speech-to-text. Golly Gosh speaks in the target language using wit.ai TTS, wrapped by VoiceSpeaker with SSML prosody markup for pitch and rate control. For speech recognition, VoiceTranscriber wraps AppDictationExperience with auto-relisten on microphone timeout and separate events for partial and full transcription. The sample supports 12 languages for both TTS and STT. For TTS implementation, see VoiceSpeaker.cs.
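For orientation, a stripped-down version of that wiring might look like the sketch below. It assumes the Voice SDK's TTSSpeaker.Speak(string) and an AppDictationExperience exposing DictationEvents with partial and full transcription events; the prosody values are illustrative, and this is not the sample's VoiceSpeaker or VoiceTranscriber code.
using Meta.WitAi.Dictation;
using Meta.WitAi.TTS.Utilities;
using UnityEngine;

public class MinimalVoiceWiring : MonoBehaviour
{
    [SerializeField] private TTSSpeaker m_speaker;                 // wit.ai TTS
    [SerializeField] private AppDictationExperience m_dictation;   // wit.ai STT

    // Speak a phrase with SSML prosody markup for pitch and rate control.
    public void Speak(string phrase)
    {
        m_speaker.Speak($"<speak><prosody pitch=\"+10%\" rate=\"95%\">{phrase}</prosody></speak>");
    }

    private void OnEnable()
    {
        // Partial results arrive while the user is still talking; the full
        // transcription fires once the utterance is complete.
        m_dictation.DictationEvents.OnPartialTranscription.AddListener(OnPartial);
        m_dictation.DictationEvents.OnFullTranscription.AddListener(OnFull);
        m_dictation.Activate();
    }

    private void OnDisable()
    {
        m_dictation.DictationEvents.OnPartialTranscription.RemoveListener(OnPartial);
        m_dictation.DictationEvents.OnFullTranscription.RemoveListener(OnFull);
        m_dictation.Deactivate();
    }

    private void OnPartial(string text) => Debug.Log($"partial: {text}");
    private void OnFull(string text) => Debug.Log($"final: {text}");
}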

Visual Scripting app flow

The sample uses Unity Visual Scripting to orchestrate the app flow state machine. Custom Visual Scripting units extend SkippableUnit, which provides debug-skip capability and lifecycle hooks. The sample demonstrates how to pass data between nodes using Application Variables and how to wait for external events:
protected override IEnumerator Await(Flow flow) {
    m_isDone = false;
    OnEnter(flow);
    yield return new WaitUntil(() => m_isDone);
    OnExit();
    yield return m_targetControlOutput;
}
Each custom unit reports its state name via static events, enabling debug UI overlays. For the base class pattern, see SkippableUnit.cs.
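A custom unit built on this pattern only needs to flip m_isDone once its external condition is met. The sketch below is hypothetical: it assumes SkippableUnit exposes OnEnter and OnExit as overridable hooks and that the m_isDone flag is accessible to subclasses, and the event source and variable name are invented for illustration.
using System;
using Unity.VisualScripting;

// Hypothetical event source used only for this sketch.
public static class LanguageSelectionEvents
{
    public static event Action<string> LanguageChosen;
    public static void Raise(string language) => LanguageChosen?.Invoke(language);
}

// Waits until the user picks a language, stores it in an Application Variable,
// then lets the flow continue through the Await pattern shown above.
public class WaitForLanguageUnit : SkippableUnit
{
    private Action<string> m_handler;

    protected override void OnEnter(Flow flow)
    {
        m_handler = language =>
        {
            // Application Variables make the value visible to later graph nodes.
            Variables.Application.Set("TargetLanguage", language);
            m_isDone = true;   // releases the WaitUntil in Await()
        };
        LanguageSelectionEvents.LanguageChosen += m_handler;
    }

    protected override void OnExit()
    {
        LanguageSelectionEvents.LanguageChosen -= m_handler;
    }
}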

Squeeze interaction with hand tracking

The sample implements squeeze interaction by combining Interaction SDK’s TouchHandGrabInteractable with OVRPlugin hand bone tracking. When the user grabs an object, the sample reads bone positions for all five fingertips, computes the average distance from the palm center, and uses the ratio to the starting distance as a squeeze factor:
var thumb = positions[(int)OVRPlugin.BoneId.XRHand_ThumbTip];
var index = positions[(int)OVRPlugin.BoneId.XRHand_IndexTip];
var currentDistance = AverageDistanceFingersCenter(m_selectingHand);
var ratio = currentDistance / m_startSelectDistance;
The sample applies this ratio to visual scale, creating deformation feedback. For the complete implementation, see SqueezableHandInteraction.cs.
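The averaging and scaling steps can be sketched as below, assuming the positions array is indexed by OVRPlugin.BoneId as in the excerpt above; the palm reference joint and clamp range are illustrative choices rather than the sample's exact values.
using UnityEngine;

public static class SqueezeMath
{
    private static readonly OVRPlugin.BoneId[] k_fingertips =
    {
        OVRPlugin.BoneId.XRHand_ThumbTip,
        OVRPlugin.BoneId.XRHand_IndexTip,
        OVRPlugin.BoneId.XRHand_MiddleTip,
        OVRPlugin.BoneId.XRHand_RingTip,
        OVRPlugin.BoneId.XRHand_LittleTip,
    };

    // Average distance from the palm joint to the five fingertips.
    public static float AverageFingertipDistance(Vector3[] positions)
    {
        var palm = positions[(int)OVRPlugin.BoneId.XRHand_Palm];
        var sum = 0f;
        foreach (var tip in k_fingertips)
        {
            sum += Vector3.Distance(positions[(int)tip], palm);
        }
        return sum / k_fingertips.Length;
    }

    // Maps the squeeze ratio onto a uniform scale for deformation feedback.
    public static void ApplySqueeze(Transform target, float currentDistance, float startDistance)
    {
        var ratio = currentDistance / Mathf.Max(startDistance, 1e-4f);
        target.localScale = Vector3.one * Mathf.Clamp(ratio, 0.5f, 1f);
    }
}
Clamping the ratio keeps the deformation within a plausible range even if a fingertip briefly loses tracking.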

Extend the sample

  • Add support for additional object classes by replacing the YOLO model with a custom Sentis model trained on domain-specific objects (food items, office supplies, etc.)
  • Implement lesson evaluation using on-device speech recognition instead of Llama API calls to reduce latency and network dependency
  • Add vocabulary quiz modes by extending the Visual Scripting state machine with new states in the LessonFlowSample pattern, such as timed challenges or multi-object sentence construction