This sample demonstrates how to build a complete computer vision pipeline on Meta Quest using the Passthrough Camera API and Unity Inference Engine. It captures the passthrough camera feed, runs YOLOv9 machine learning inference to detect real-world objects, and places persistent 3D markers at detected object locations using depth raycasting and spatial anchors.
Key techniques demonstrated:

- Integrating the Passthrough Camera API with Unity Inference Engine for real-time object detection on Quest
- Converting 2D ML bounding boxes into 3D world-space positions using depth raycasting
- Implementing Non-Maximum Suppression (NMS) in C# to filter overlapping detections
- Using OVRSpatialAnchor to world-lock detected objects and recover from tracking loss
- Building a coroutine-based async inference pipeline that avoids blocking the main thread
Requirements
You need a Meta Quest 3 or Quest 3S running Horizon OS v74 or later. For development environment setup and specific SDK versions, see the Passthrough Camera API getting started guide and sample README.
Get started
Clone or download the Unity-PassthroughCameraApiSamples repository from GitHub and open the project in Unity. Navigate to the MultiObjectDetection scene and build to your Quest device. The sample requires Scene permission (com.oculus.permission.USE_SCENE) for depth raycasting — see the repository README for detailed build instructions.
Explore the sample
The sample is organized into three major subsystems that work together to deliver the detection experience.
| File / Scene | What it demonstrates | Topics covered |
| --- | --- | --- |
| MultiObjectDetection.unity | Main scene with all components integrated | Scene composition, prefab orchestration |
| DetectionManager.cs | Orchestrates the overall detection flow, handles user input, manages marker spawning | State management, input handling, spatial anchor lifecycle |
When you run the scene, the sample first checks for required permissions. If Scene permission is missing, you see a NoPermission UI state. After granting permissions, the Initial state displays a welcome message with typewriter animation. Press the A button on your controller or perform an index finger pinch gesture to enter the Running state.
The inference loop begins, displaying real-time bounding boxes around detected objects with class labels overlaid in world space. Press A again or pinch with your index finger to spawn persistent 3D markers at the detected object locations — these appear with animated scaling and billboard labels. Press B or perform a middle finger pinch to clear all spawned markers. The UI continuously displays the model name, detection count, and total identified objects.
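For reference, here is a minimal sketch of how that controller and hand input can be polled with OVRInput and OVRHand. The state transitions are stubbed, and note that pinch is a held state rather than an edge, so real code should debounce it:

```csharp
using UnityEngine;

public class DetectionInputSketch : MonoBehaviour
{
    [SerializeField] private OVRHand rightHand; // assigned when hand tracking is used

    private void Update()
    {
        // A button or index-finger pinch: advance state / spawn markers.
        bool primaryAction = OVRInput.GetDown(OVRInput.Button.One) ||
            (rightHand != null && rightHand.GetFingerIsPinching(OVRHand.HandFinger.Index));

        // B button or middle-finger pinch: clear all spawned markers.
        bool clearAction = OVRInput.GetDown(OVRInput.Button.Two) ||
            (rightHand != null && rightHand.GetFingerIsPinching(OVRHand.HandFinger.Middle));

        if (primaryAction)
        {
            // ... enter Running state, or spawn markers if already running.
        }

        if (clearAction)
        {
            // ... destroy spawned markers.
        }
    }
}
```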
Key concepts
Camera-to-ML-to-3D pipeline
The sample implements a complete data flow from camera capture through ML inference to 3D world placement. PassthroughCameraAccess provides the camera frame and its pose; the frame is converted to a tensor and fed to the YOLOv9 inference worker. The worker returns three output tensors — bounding boxes, class IDs, and confidence scores — which are then projected into world space. See SentisInferenceRunManager.cs for the complete inference loop.
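The snippet below is a minimal sketch of that handoff, assuming the Sentis 2.x style API (ModelLoader, Worker, TextureConverter.ToTensor; the package is now branded Unity Inference Engine). The output names "boxes", "ids", and "scores" are illustrative, not necessarily the sample's actual tensor names:

```csharp
using Unity.Sentis;
using UnityEngine;

public class InferencePipelineSketch : MonoBehaviour
{
    [SerializeField] private ModelAsset modelAsset; // the converted YOLOv9 model
    private Worker worker;

    private void Start()
    {
        var model = ModelLoader.Load(modelAsset);
        worker = new Worker(model, BackendType.GPUCompute);
    }

    public void RunDetection(Texture cameraFrame)
    {
        // Convert the passthrough camera frame into the model's input tensor.
        using Tensor<float> input = TextureConverter.ToTensor(cameraFrame, 640, 640, 3);
        worker.Schedule(input);

        // The model exposes three outputs; names here are placeholders.
        var boxes  = worker.PeekOutput("boxes")  as Tensor<float>;
        var ids    = worker.PeekOutput("ids")    as Tensor<int>;
        var scores = worker.PeekOutput("scores") as Tensor<float>;
        // ... read the tensors back and project detections into world space.
    }

    private void OnDestroy() => worker?.Dispose();
}
```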
Depth-augmented 2D-to-3D projection
Rather than placing bounding boxes at a fixed distance, the sample uses the Depth API to determine precise 3D positions. For each detection, the sample converts the bounding box center to a world-space ray via PassthroughCameraAccess.ViewportPointToRay() and raycasts against the environment depth map. This produces accurate world-space positions that align with the physical objects the user sees. See SentisInferenceUiManager.cs for the projection math.
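A hedged sketch of that projection step follows. It assumes MRUK's EnvironmentRaycastManager for the depth raycast and treats ViewportPointToRay as a static helper on the camera-access class named above; the shipping sample may structure these calls differently:

```csharp
using Meta.XR.MRUtilityKit; // assumed home of EnvironmentRaycastManager
using UnityEngine;

public class DepthProjectionSketch : MonoBehaviour
{
    [SerializeField] private EnvironmentRaycastManager raycastManager;

    // Convert a detection's bounding-box center (normalized 0..1 viewport
    // coordinates) into a world-space position on the physical environment.
    public bool TryGetWorldPosition(Vector2 viewportCenter, out Vector3 worldPos)
    {
        // Build a world-space ray from the passthrough camera's pose.
        Ray ray = PassthroughCameraAccess.ViewportPointToRay(viewportCenter);

        // Raycast against the environment depth map instead of assuming a
        // fixed distance, so markers land on the real object's surface.
        if (raycastManager.Raycast(ray, out var hit))
        {
            worldPos = hit.point;
            return true;
        }

        worldPos = default;
        return false;
    }
}
```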
Coroutine-based async inference
The sample uses Unity coroutines to avoid blocking the main thread during inference result retrieval. After scheduling inference with Worker.Schedule(), it calls ReadbackAndCloneAsync() and yields until completion, keeping the frame rate smooth. A PreloadModel() method runs a dummy inference pass at startup to avoid first-frame latency. See the RunInference() method in SentisInferenceRunManager.cs for the async pattern.
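The sketch below shows an equivalent coroutine pattern. It polls with ReadbackRequest()/IsReadbackRequestDone(), the coroutine-friendly counterpart to ReadbackAndCloneAsync() in Sentis 2.x, so it can yield inside an IEnumerator:

```csharp
using System.Collections;
using Unity.Sentis;
using UnityEngine;

public class AsyncInferenceSketch : MonoBehaviour
{
    private Worker worker; // created at startup, as in the earlier sketch
    public bool IsRunning { get; private set; }

    public IEnumerator RunInference(Tensor<float> input)
    {
        IsRunning = true;
        worker.Schedule(input);

        var output = worker.PeekOutput() as Tensor<float>;

        // Request an asynchronous GPU -> CPU readback and yield until it
        // completes, keeping the render loop at full frame rate.
        output.ReadbackRequest();
        yield return new WaitUntil(() => output.IsReadbackRequestDone());

        // The clone is a CPU-side copy that is safe to index immediately.
        using var result = output.ReadbackAndClone();
        // ... decode bounding boxes, class IDs, and scores from result.

        IsRunning = false;
    }
}
```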
Spatial anchors for world persistence
The sample parents spawned markers to a GameObject with an OVRSpatialAnchor component, preventing content drift as the user moves. It saves the anchor after localization, continuously monitors tracking state, and automatically attempts recovery if tracking is lost. If recovery fails, the anchor is erased and all markers are cleared. See DetectionManager.cs for the anchor lifecycle logic.
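A minimal sketch of that lifecycle, assuming the SaveAnchorAsync/OVRTask API from recent Meta XR Core SDK releases; the recovery branch is stubbed:

```csharp
using System.Collections;
using UnityEngine;

public class AnchorLifecycleSketch : MonoBehaviour
{
    private OVRSpatialAnchor anchor;

    // Attach an anchor to the marker root so spawned markers stay
    // world-locked, then persist it once the runtime has localized it.
    public IEnumerator CreateAndSaveAnchor(GameObject markerRoot)
    {
        anchor = markerRoot.AddComponent<OVRSpatialAnchor>();

        // Wait for the runtime to create and localize the anchor.
        yield return new WaitUntil(() => anchor.Created && anchor.Localized);

        var saveTask = anchor.SaveAnchorAsync();
        yield return new WaitUntil(() => saveTask.IsCompleted);
        if (!saveTask.GetResult().Success)
            Debug.LogWarning("Failed to persist spatial anchor");
    }

    private void Update()
    {
        // Monitor tracking; if the anchor loses localization, run the
        // recovery path (and clear markers if recovery ultimately fails).
        if (anchor != null && anchor.Created && !anchor.Localized)
        {
            // ... attempt recovery, or erase the anchor and clear markers.
        }
    }
}
```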
Two-stage non-maximum suppression
The sample implements NMS twice: once baked into the model during editor conversion via SentisModelEditorConverter.cs, and again at runtime in C#. The runtime NonMaxSuppression() method filters overlapping bounding boxes by comparing Intersection-over-Union (IoU) scores against a configurable threshold. This two-stage approach balances model efficiency with runtime flexibility. See SentisInferenceRunManager.cs for the C# implementation.
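A self-contained version of the greedy runtime pass might look like the following, operating on Rect boxes for brevity (the sample works on raw tensor values):

```csharp
using System.Collections.Generic;
using UnityEngine;

public static class NmsSketch
{
    // Greedy NMS: keep the highest-scoring box, then drop any remaining
    // box whose IoU with an already-kept box exceeds the threshold.
    public static List<int> NonMaxSuppression(Rect[] boxes, float[] scores, float iouThreshold)
    {
        var order = new List<int>();
        for (int i = 0; i < boxes.Length; i++) order.Add(i);
        order.Sort((a, b) => scores[b].CompareTo(scores[a])); // descending by score

        var keep = new List<int>();
        foreach (int i in order)
        {
            bool overlaps = false;
            foreach (int k in keep)
            {
                if (IoU(boxes[i], boxes[k]) > iouThreshold) { overlaps = true; break; }
            }
            if (!overlaps) keep.Add(i);
        }
        return keep;
    }

    private static float IoU(Rect a, Rect b)
    {
        float x1 = Mathf.Max(a.xMin, b.xMin);
        float y1 = Mathf.Max(a.yMin, b.yMin);
        float x2 = Mathf.Min(a.xMax, b.xMax);
        float y2 = Mathf.Min(a.yMax, b.yMax);
        float inter = Mathf.Max(0, x2 - x1) * Mathf.Max(0, y2 - y1);
        float union = a.width * a.height + b.width * b.height - inter;
        return union > 0 ? inter / union : 0;
    }
}
```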
Duplicate marker prevention
When spawning persistent markers, the sample prevents duplicates by checking both class name and spatial overlap. If an existing marker of the same class falls within the new bounding box, the duplicate is skipped. See the SpawnCurrentDetectedObjects() method in DetectionManager.cs for the deduplication logic.
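A simplified sketch of that check, using a hypothetical list of spawned markers and a world-space Bounds standing in for the new detection's box:

```csharp
using System.Collections.Generic;
using UnityEngine;

public class MarkerDeduplicationSketch : MonoBehaviour
{
    // Hypothetical record of spawned markers: class label plus marker object.
    private readonly List<(string className, GameObject marker)> spawned = new();

    // Skip spawning when a marker of the same class already sits inside
    // the new detection's bounds.
    public bool IsDuplicate(string className, Bounds detectionBounds)
    {
        foreach (var (existingClass, marker) in spawned)
        {
            if (existingClass == className &&
                detectionBounds.Contains(marker.transform.position))
            {
                return true;
            }
        }
        return false;
    }
}
```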
Extend the sample
- Custom object detection: Train YOLOv9 on your own dataset and convert the model using the editor converter tool in SentisModelEditorConverter.cs
- Tune detection sensitivity: Adjust the NMS IoU threshold or confidence score filter in the SentisInferenceRunManager Inspector to balance precision and recall
- Add haptic feedback: Trigger controller vibration when markers are spawned to provide tactile confirmation that an object was recognized (a sketch follows this list)
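For the haptics idea, a minimal sketch using OVRInput.SetControllerVibration; the pulse length and strength here are arbitrary choices:

```csharp
using System.Collections;
using UnityEngine;

public class SpawnHapticsSketch : MonoBehaviour
{
    // Fire a short vibration burst on the right controller when a marker spawns.
    public IEnumerator PulseOnSpawn()
    {
        OVRInput.SetControllerVibration(0.5f, 0.8f, OVRInput.Controller.RTouch);
        yield return new WaitForSeconds(0.1f);
        // Vibration continues until stopped (or times out), so switch it
        // off explicitly after the pulse.
        OVRInput.SetControllerVibration(0f, 0f, OVRInput.Controller.RTouch);
    }
}
```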
For a simpler introduction to the Passthrough Camera API, explore the CameraViewer or CameraToWorld samples first. For more on combining passthrough camera data with machine learning, see the Passthrough Camera API + Unity Inference Engine guide.