This section describes using the Unity Inference Engine framework with the Passthrough Camera API. The Inference Engine provides a framework for loading models from common open source platforms and compiling them on-device. This tutorial explains how to set up the Inference Engine and the Yolov9t model to identify real objects at runtime on Quest 3 devices.
After completing this section, you should be able to:
Recompile and load the YOLO sample with Unity.
Build a new project using Unity and Inference Engine.
Compile and build Inference Engine models to support models other than YOLO.
Integrate this API with the Unity Inference Engine architecture to access ML/CV models. Note that the Unity Inference Engine integration is not a required dependency and many developers will choose to access different or proprietary frameworks. However, it is provided as a convenience for developers to experiment quickly.
You can follow the framework described in this section to load the ML/CV model of your choice. Note that the complexity of the models available online differs widely, and you will need to find a model that meets the performance profile required by your application. Unity also provides samples that you can modify for other tasks, such as digit recognition.
Prerequisites
Hardware requirements
Development machine running one of the following:
Windows 10+ (64-bit)
macOS 10.10+ (x86 or ARM)
Headset requirements
Supported Meta Quest headsets:
Quest 3
Quest 3S and 3S Xbox Edition
Horizon OS v74 or later installed on your headset
Software requirements
Unity Editor 2022.3.15f1 or later (6.1 or later recommended)
Inference Engine package 2.2.1 (com.unity.ai.inference) installed in your project. If you previously used Unity Sentis, see Upgrade to Inference Engine.
Grant Camera and Spatial data permissions to your app. The Spatial data permission is used by the EnvironmentRaycastManager.
Known issues
The model does not detect objects with 100% accuracy because it has been optimized to run on devices like Quest 3.
The model was trained on 80 classes, not individual objects, so some objects share a class. For example, monitors and TVs both belong to the TV_Monitor class. The table in the next section lists the classes the model can identify.
Some classes are hard to identify. For example, cell phones are difficult to detect and in most cases are identified as the TV_Monitor class.
Quest 3 controllers might be unrecognized or be mislabeled as remote controllers.
The bounding box visualization in the MultiObjectDetection sample doesn’t align perfectly with the detected objects. For a better example of camera-to-world projection, refer to the CameraToWorld sample.
YOLO sample
This sample draws 2D boxes around the detected objects and spawns a marker in the approximate 3D position of each object when the user presses the A button.
Open the Unity-PassthroughCameraApiSamples project in Unity Editor.
In Build Profiles switch the Active Platform to Android.
Open the Assets/PassthroughCameraApiSamples/MultiObjectDetection/MultiObjectDetection.unity sample scene.
Navigate to Meta > Tools > Project Setup Tool. If the rule “MR Utility Kit recommends Scene Support to be set to Required” exists, select ... > Ignore.
Click Fix all and Apply all to resolve the other issues and recommendations.
Build the app and test it on your headset.
Preview the scene
Description: This sample shows how to identify multiple objects using the Inference Engine and a pretrained YOLO model.
Controls: This sample uses the Quest 3 controllers:
Menus (Start and Pause):
Button A: start playing
In Game:
Button A: place a marker in the world position for each detected object
At any moment:
Button MENU: back to Samples selection.
How to play:
Start the application and use the device to look around you.
When an object is detected, you will see 2D floating boxes around the detected objects.
If you press the A button, a 3D marker will be placed in the real world position of the detected objects with the class name.
This model can identify the following objects (80):
People and animals: person, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe
Vehicles and transport: bicycle, car, motorbike, aeroplane, bus, train, truck, boat
Outdoor objects: traffic light, fire hydrant, stop sign, parking meter, bench
Sports and accessories: frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket
Personal items: backpack, umbrella, handbag, tie, suitcase
Kitchen and dining: bottle, wine glass, cup, fork, knife, spoon, bowl
This section describes the components used in the MultiObjectDetection.unity scene.
The sample scene contains the following prefabs to run the gameplay:
[BuildingBlock] Camera Rig: added using Meta XR Building Blocks. This Game Object contains additional prefabs:
As child of CenterEyeAnchor:
DetectionUiMenuPrefab: manages and shows the UI of the sample.
[BuildingBlock] Passthrough: Meta XR Building Blocks entity to configure and enable the Passthrough feature.
DetectionManagerPrefab: contains the scanner logic to get the camera data and run the Inference Engine inference to update the UI elements.
SentisInferenceManagerPrefab: contains the multi-object detection inference and UI logic.
EnvironmentRaycastPrefab: contains the logic to use the MRUK Raycast to get the real world 3D point using the Quest Depth Data.
PassthroughCameraAccessPrefab: creates a PassthroughCameraAccess component and manages its settings.
ReturnToStartScene: common prefab used to go back to Samples selection.
Sample detection manager prefab
The main prefab that manages the logic of this sample is DetectionManagerPrefab.
This prefab contains the following components:
DetectionManager.cs: This script contains the sample logic. It gets the camera image from the PassthroughCameraAccessPrefab to send it to Inference Engine inference. Also, it manages the placement action when the user presses the A button.
AI models used in this project
This project uses Unity Inference Engine (2.2.1) to run AI models locally.
The following sections describe each AI model used with the Inference Engine, including the data it takes as input and the data it returns.
Multi-object detection model
To identify real-world objects, the project uses the Yolov9t model prepared for the Inference Engine with an extra layer (non-max suppression) and quantized to Uint8. This model requires just an image as input and returns the 2D coordinates and the class type of the detected objects.
Technical information
Model name: Yolov9t
Model format: Sentis (.sentis), quantized to Uint8
Model size: ~2.3 MB
Classes trained: 80
YOLO control scripts
SentisInferenceRunManager.cs: contains the logic to run inference using the Inference Engine 2.2.1.
This component contains the following parameters:
Backend: the local backend used by the Inference Engine to run the inferences
Sentis Model: the model asset (.sentis) used for these inferences
K_Layers Per Frame: the number of layers executed per frame.
Label Asset: a file containing a list of objects detected by the pretrained Yolov9t model.
UIInference: references to the SentisInferenceUiManager scripts.
SentisInferenceUiManager.cs: contains the logic to draw the UI box around the tracked object.
This component contains the following parameters:
Display Image: the reference to the UI element that will contain the 2D bounding boxes.
Box Texture: the texture used by the 2D box.
Box Color: the color of the bounding box.
Font: the font type used to write the detected object information.
Font Color: the color of the object information.
Font Size: the size of the font used to show the object information.
On Object Detected event: event launched when YOLO detects an object.
Input values: a Tensor<float> with the current camera RGB raw data image.
The Inference Engine framework provides a function to convert a texture to a Tensor<float>.
Output values: the model returns two tensors:
A Tensor<float> of screen coordinates (X,Y) for all detected objects.
A Tensor<int> with the class ID for each object detected.
Once you have the object and its coordinates (X and Y percentage of the input image), use the Depth placement with MRUK Raycast to get the real world position.
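As a rough illustration of this post-processing step, the sketch below pairs each detection with its class name and converts box centers to percentages of the input image. It is written in Python for brevity rather than the sample's C#. The 640x640 input size, the (center_x, center_y) pixel layout, and the one-label-per-line file format are assumptions made for this example, not details confirmed by the sample, and all function names are hypothetical.

```python
from typing import List, Tuple

# Assumed model input resolution; check the input shape of your own model.
INPUT_WIDTH = 640
INPUT_HEIGHT = 640


def parse_labels(label_text: str) -> List[str]:
    """Split a label file (assumed: one class name per line) into a list."""
    return [line.strip() for line in label_text.splitlines() if line.strip()]


def to_percent(center_x: float, center_y: float) -> Tuple[float, float]:
    """Convert a box center in input-image pixels to 0..1 fractions."""
    return center_x / INPUT_WIDTH, center_y / INPUT_HEIGHT


def describe_detections(centers, class_ids, labels):
    """Pair each detection's class name with its normalized center."""
    results = []
    for (cx, cy), class_id in zip(centers, class_ids):
        # Guard against IDs that fall outside the label list.
        name = labels[class_id] if 0 <= class_id < len(labels) else "unknown"
        results.append((name, to_percent(cx, cy)))
    return results


labels = parse_labels("person\nbicycle\ncar\n")
detections = describe_detections([(320.0, 160.0), (64.0, 480.0)], [0, 2], labels)
```

The normalized values are what a caller would then turn into a ray for the depth raycast; keeping them independent of the pixel resolution means the same code works if the model input size changes.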
ONNX to Inference Engine converter
This sample includes editor code to convert the Yolov9 ONNX model to the Inference Engine format. The script class names retain the Sentis* prefix from the prior package name (Sentis was renamed to Inference Engine in Unity 6).
The Inference Engine Inference Run Manager component lets you adjust the Intersection over Union (IoU) and score thresholds for the YOLO model and generate a .sentis file that includes the non-max suppression layer and is quantized to Uint8.
IoU Threshold: YOLO IoU threshold used to identify the objects.
Score Threshold: YOLO score threshold value used to identify the objects.
The Inference Engine functions to quantize a model and save it are located inside the SentisModelEditorConverter.cs script.
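To make the role of the two thresholds concrete, here is a minimal Python sketch of greedy non-max suppression. It is a generic, class-agnostic version of the technique and is not the layer that the converter adds to the model; the box layout (x_min, y_min, x_max, y_max) and the function names are assumptions made for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold, score_threshold) -> List[int]:
    """Return indices of the boxes to keep, highest score first."""
    # The score threshold discards low-confidence candidates up front.
    candidates = [i for i, s in enumerate(scores) if s >= score_threshold]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept = []
    for i in candidates:
        # The IoU threshold drops boxes that overlap an already kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = non_max_suppression(boxes, scores, iou_threshold=0.5, score_threshold=0.5)
```

In this example the second box overlaps the first with an IoU of 81/119 and is suppressed, and the fourth box falls below the score threshold. A lower IoU threshold removes more overlapping boxes, and a higher score threshold removes more low-confidence detections.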
To place a marker in the real-world position using the 2D coordinates for each object detected, the sample uses the MRUK Environment Raycasting feature. This feature performs a raycast from the 2D position to the real world using the device’s depth map data.
The EnvironmentRaycast prefab contains the two components used to perform the depth raycast.
EnvironmentRayCastSampleManager script: contains the logic to perform the depth raycast using the screen position of each detected object.
EnvironmentRaycastManager: provided by MRUK. Contains the logic to perform a raycast against the depth map generated by the Quest 3 depth sensors.
Environment ray cast sample manager class
The EnvironmentRayCastSampleManager.cs script contains a Raycast(Ray ray) method that returns the 3D world position where the ray hits a real-world surface, using the device’s depth sensors via MRUK.
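public Vector3? Raycast(Ray ray)
{
    // Environment raycasting requires depth support on the device.
    if (EnvironmentRaycastManager.IsSupported)
    {
        if (m_raycastManager.Raycast(ray, out var hitInfo))
        {
            // World-space point where the ray hits a real-world surface.
            return hitInfo.point;
        }
        else
        {
            // The ray did not hit any surface.
            return null;
        }
    }
    else
    {
        Debug.LogError("EnvironmentRaycastManager is not supported");
        return null;
    }
}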
The Raycast() method returns null when no surface is hit or when environment raycasting is not supported on the device. Callers construct the Ray from the user’s view direction toward the 2D screen position of each object detected by the Yolov9t model.
This class uses EnvironmentRaycastManager from the com.meta.xr.mrutilitykit package (namespace Meta.XR).
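The step that precedes the raycast, turning a 2D image position into a ray direction, can be sketched as follows. This is a Python illustration under a simple pinhole camera model, not the sample's C# code or a Unity API. The intrinsics (FX, FY, CX, CY), the 1280x960 image size, and the function names are hypothetical values chosen for the example, and it assumes the camera looks along +Z with image and camera axes aligned.

```python
import math
from typing import Tuple

Vector3 = Tuple[float, float, float]


def percent_to_pixel(u: float, v: float, width: int, height: int) -> Tuple[float, float]:
    """Convert 0..1 image fractions to pixel coordinates."""
    return u * width, v * height


def pixel_to_ray_direction(px, py, fx, fy, cx, cy) -> Vector3:
    """Unit direction in camera space for a pixel, under a pinhole model."""
    x = (px - cx) / fx  # horizontal offset from the principal point
    y = (py - cy) / fy  # vertical offset from the principal point
    z = 1.0             # assumed forward axis of the camera
    length = math.sqrt(x * x + y * y + z * z)
    return (x / length, y / length, z / length)


# Hypothetical intrinsics for a 1280x960 camera image.
FX, FY, CX, CY = 870.0, 870.0, 640.0, 480.0

px, py = percent_to_pixel(0.5, 0.5, 1280, 960)
center_dir = pixel_to_ray_direction(px, py, FX, FY, CX, CY)
corner_dir = pixel_to_ray_direction(0.0, 0.0, FX, FY, CX, CY)
```

A caller would rotate this camera-space direction into world space using the camera pose, build the ray from the camera position, and treat a null raycast result as "no marker for this detection".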
How to generate your own YOLO ONNX model
# Install the ultralytics package from PyPI
pip3 install ultralytics
Create a new Python script (yolo.py) and add the following code:
from ultralytics import YOLO
# Load a pretrained YOLO model (recommended for training)
model = YOLO("yolov9t.pt")
# Train the model using the 'coco8.yaml' dataset for 3 epochs
results = model.train(data="coco8.yaml", epochs=3)
# Evaluate the model's performance on the validation set
results = model.val()
# Export the model to ONNX format
success = model.export(format="onnx")
Run the Python script to generate the yolov9t.onnx file:
python3 yolo.py
When the script finishes, it prints the path of the generated ONNX file.
Use the Unity Package Manager UI to install Inference Engine 2.2.1 into your project (com.unity.ai.inference).
Navigate to Meta > Tools > Project Setup Tool and fix any issues that it finds in the configuration of your project.
Create a new empty scene and ensure it doesn’t contain a Camera component.
Add an Event System to your scene.
Navigate to Meta > Tools > Building Blocks and add a Camera Rig.
To run a different AI model, follow a similar procedure to the Inference Engine sample:
Check the content from the Assets/PassthroughCameraApiSamples/MultiObjectDetection folder.
Drag the following prefabs into the project hierarchy:
DetectionManagerPrefab
SentisInferenceManagerPrefab
EnvironmentRaycastPrefab
PassthroughCameraAccessPrefab
Drag the DetectionUiMenuPrefab prefab to the CenterEyeAnchor. Then, drag the CenterEyeAnchor camera to the event camera for the canvas.
Iterate through the prefabs and ensure the fields are populated as in the sample package.
Build and run to verify it works.
The management and processing of the Inference Engine framework are handled by the following scripts:
DetectionManagerPrefab: see Sample detection manager prefab in the previous section. Within DetectionManager.cs, the Update() function gets the CPU texture data from the WebCamTexture object (via PassthroughCameraAccessPrefab) and kicks off the inference model by calling RunInference(captureBuffer) once the WebCamTexture object is ready. This process checks for a camera texture and waits for the current inference to complete before starting a new one. This is a standard approach that you can reuse with ML models.
SentisInferenceManagerPrefab: contains the critical features for running Inference Engine and can be used as a design pattern for your Inference Engine models. See YOLO control scripts in the previous section.
The public variables here allow you to configure the backend for the Inference Engine to run on either the GPU or the CPU. Select the Sentis model that you plan to use (Yolov9t in this case). See How to generate your own YOLO ONNX model to learn how to create these models. ONNX is the standard format accepted by Inference Engine. Finally, it lets you set the layers per frame.
SentisInferenceRunManager.cs controls the Inference Engine model. This includes LoadModel(), which takes the Inference Engine model, compiles the graph for the model, and creates the worker that will execute the model. It also contains a public function, RunInference(Texture2D targetTexture), which converts the input camera image into a tensor that the model can consume and kicks off the inference.
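The "check for a camera texture, then wait for the current inference to finish" pattern described above can be sketched in a language-neutral way. The Python below is an illustration only: InferenceGate and its methods are hypothetical names, and the sample itself implements this logic in C# inside DetectionManager.cs.

```python
class InferenceGate:
    """Starts a new inference only when a frame exists and none is running."""

    def __init__(self, run_inference):
        self._run_inference = run_inference  # callback that starts an inference
        self._running = False

    def on_frame(self, camera_frame) -> bool:
        """Call once per frame; returns True if an inference was started."""
        if camera_frame is None or self._running:
            return False  # camera not ready, or previous inference still busy
        self._running = True
        self._run_inference(camera_frame)
        return True

    def on_inference_done(self) -> None:
        """Call when the results of the current inference have been read."""
        self._running = False


submitted = []
gate = InferenceGate(submitted.append)
results = [gate.on_frame(None), gate.on_frame("frame-1"), gate.on_frame("frame-2")]
gate.on_inference_done()
results.append(gate.on_frame("frame-3"))
```

Frames that arrive while an inference is in flight are simply skipped, which keeps the amount of queued work bounded when inference is slower than the camera frame rate.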
Modify the Inference Engine inference based on your model requirements:
Input data will be the same (texture to tensor). The RunInference(Texture2D targetTexture) function converts the texture to a tensor.
Output data will vary, so you must write your own post-processing scripts. Refer to the GetInferencesResults() function to get the output of the model using the pull request technique.
Use the layer by layer inference technique and pull request download function to read the output to get better performance results on Quest. See the InferenceUpdate() function to see how layer by layer inference works.
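The idea behind layer-by-layer inference can be sketched as follows. This is a simplified Python model of the technique in which each "layer" is a plain function; it does not use the Inference Engine API, and LayeredInference and its members are hypothetical names. In the sample, the equivalent logic lives in InferenceUpdate().

```python
from typing import Callable, List


class LayeredInference:
    """Runs a fixed number of layers per frame instead of the whole model."""

    def __init__(self, layers: List[Callable], layers_per_frame: int):
        self._layers = layers
        self._layers_per_frame = layers_per_frame
        self._next_layer = 0
        self._value = None
        self.output = None
        self.done = True

    def start(self, model_input) -> None:
        """Begin a new inference with the given input."""
        self._value = model_input
        self._next_layer = 0
        self.output = None
        self.done = False

    def update(self) -> None:
        """Call once per frame; executes at most layers_per_frame layers."""
        if self.done:
            return
        end = min(self._next_layer + self._layers_per_frame, len(self._layers))
        for index in range(self._next_layer, end):
            self._value = self._layers[index](self._value)
        self._next_layer = end
        if end == len(self._layers):
            # All layers have run, so the result is ready to be read.
            self.output = self._value
            self.done = True


layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3,
          lambda x: x * x, lambda x: x + 10]
runner = LayeredInference(layers, layers_per_frame=2)
runner.start(1)
frames = 0
while not runner.done:
    runner.update()  # in an app, this call would sit in the per-frame update
    frames += 1
```

With five layers and two layers per frame, the inference completes over three frames. Raising the layers-per-frame value lowers latency but increases the time spent on inference in each frame.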
The UI varies depending on the model you chose. The following explains the selections used in the sample:
SentisInferenceUiManager.cs is the script that updates the UI based on the output of the model.
Note: You must write your own script to interpret the output of the model that you choose for your application. In the case of YOLO, use DrawUIBoxes(output labelDs, imageWidth, imageHeight).
DetectionUiMenuPrefab prefab is under the CenterEyeAnchor and manages the display of the initial and start panels that you see as the app launches along with the countdown.
Recommendations for using Inference Engine on Meta Quest devices
Model architecture:
Use models that do not require complex architectures.
Note: Large generative models and LLMs do not perform well on devices like Meta Quest 3.
Inference Engine runs on the main thread of your Unity app, so it affects other main-thread work such as the render pipeline.
Use the layer-by-layer inference technique (split the inference so that only a few layers run per frame) to avoid blocking the main thread.
Model size:
The model is loaded to the CPU or GPU of your device. This can cause multiple issues when you try to use large models:
Long loading times
Lag on the main thread during the first inference
Reduction of the memory budget for other resources
Recommendations:
Use the smallest version of the model that meets your needs. For example, the MultiObjectDetection sample uses the smallest version of YOLO (8 MB) because the medium-size version of YOLO (146 MB) performs worse.
Convert the model to Inference Engine format and quantize it to Uint8 to reduce the loading times.
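To show why Uint8 quantization shrinks a model, the sketch below applies a generic per-tensor affine quantization to a handful of weights. It is an assumption-laden illustration in Python, not necessarily the scheme that Inference Engine applies internally, and the function names are hypothetical.

```python
from typing import List, Tuple


def quantize_uint8(weights: List[float]) -> Tuple[bytes, float, float]:
    """Map floats to 0..255 using a single scale and offset for the tensor."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # avoid a zero scale for constant tensors
    quantized = bytes(
        min(255, max(0, round((w - lo) / scale))) for w in weights
    )
    return quantized, scale, lo


def dequantize_uint8(quantized: bytes, scale: float, lo: float) -> List[float]:
    """Recover approximate float values from the quantized bytes."""
    return [q * scale + lo for q in quantized]


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
quantized, scale, offset = quantize_uint8(weights)
restored = dequantize_uint8(quantized, scale, offset)

float32_bytes = len(weights) * 4  # 4 bytes per 32-bit float weight
uint8_bytes = len(quantized)      # 1 byte per quantized weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight drops from four bytes to one, at the cost of a rounding error of at most half a quantization step per weight. That trade-off is one reason a quantized model can be somewhat less accurate than the original.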
GPU versus CPU:
Choose a suitable strategy to minimize transferring data between GPU and CPU contexts:
If you are using an AI model to perform graphics-related work, keep all the processing on the GPU. Select the GPUCompute backend for Inference Engine and use the procedures that operate on the GPU to send camera data directly to Inference Engine. Finally, if you don’t need to access the output of the model on the CPU, you can send it directly to your shader or material.
If you need to send the output to the CPU, you can:
Run the model on the GPU, then send results asynchronously to the CPU.
Run the model on the CPU.
Recommendations:
Get the output data asynchronously in any backend to prevent blocking the main thread. Inference Engine provides different techniques to accomplish this. To learn more, see Read output from a model asynchronously.
NPU backend on Inference Engine:
Currently, Inference Engine doesn’t use an NPU or other hardware acceleration to run inferences. Inference Engine is part of the Unity engine and is designed to run on multiple platforms. On Quest devices, Inference Engine runs as it would on any regular Android platform, with no device-specific acceleration. Keep this in mind when selecting a model.