This section describes using the Unity Sentis framework with the Passthrough Camera API. Sentis provides a framework for loading models from common open source platforms and compiling them on-device. For more information about Sentis, please visit https://unity.com/products/sentis. In this section, we will describe using Sentis and the Yolov8n model to identify real objects at runtime on Quest 3 devices.
After completing this section, the developer should be able to:
Recompile and load the Yolo Sample with Unity.
Understand how to build a new project using Unity and Sentis.
Understand how to compile and build Sentis models to support models other than Yolo.
Understand how to integrate this API with the Unity Sentis architecture to access ML/CV models. Note that the Unity Sentis integration is not a required dependency and many developers will choose to access different or proprietary frameworks. However, it is provided as a convenience for developers to experiment quickly.
Use Cases
The framework described in this section can be followed to load the ML/CV model of your choice. Note that the complexity of the models available online differs widely, and you will need to be careful to find a model that meets the performance profile required by your application. Unity also provides samples that can be adapted for other tasks, such as digit recognition.
Prerequisites
Horizon OS: v74
Devices: Quest 3 or Quest 3S
Grant Camera and Spatial data permissions to your app. The Spatial data permission is used by the EnvironmentRaycastManager.
Install Unity 2022.3.58f1 with Sentis package 2.1.1 (com.unity.sentis)
Known Issues
The model accuracy is not 100% because the model has been optimized to work on devices like Quest 3.
This model has been trained with 80 classes (not individual objects), which means that some objects are grouped into a single class. For example, monitors and TVs both belong to the TV_Monitor class. The list in the next section identifies the classes the model can detect.
Some classes are hard to identify. For example, cell phones are difficult to detect and in most cases are identified as the TV_Monitor class.
Newer devices, such as Quest 3 controllers, may not be recognized, or may be recognized as remote controls.
The bounding box visualization in the MultiObjectDetection sample doesn't perfectly align with the detected objects. For a better example of the camera-to-world projection, refer to the CameraToWorld sample. We will update this sample in a future version.
YOLO Sample
This sample draws 2D boxes around the detected objects and spawns a marker in the approximate 3D position of each object when the user presses the A button.
Open the Unity-PassthroughCameraApiSamples in Unity Editor
In Build Profiles switch the Active Platform to Android
Open the sample scene: Assets\PassthroughCameraApiSamples\MultiObjectDetection\MultiObjectDetection.unity
In Meta -> Tools -> Project Setup Tool, if the rule “MR Utility Kit recommends Scene Support to be set to Required” appears, choose “...” -> Ignore. Fix and apply all the other issues and recommendations.
Build the app and test it on your headset.
Using the Sample
Description: This sample shows how to identify multiple objects using Sentis and a pretrained Yolo model.
Controls: this sample uses the Quest 3 controllers:
Menus (Start and Pause):
Button A: start playing
In Game:
Button A: place a marker in the world position for each detected object
At any moment:
Button MENU: back to Samples selection.
How to play:
Start the application and look around you.
When an object is detected, you will see 2D floating boxes around the detected objects.
If you press the A button, a 3D marker labeled with the class name will be placed at the real-world position of each detected object.
This model can identify the following 80 classes:
person
fire hydrant
elephant
skis
wine glass
broccoli
dining table
toaster
bicycle
stop sign
bear
snowboard
cup
carrot
toilet
sink
car
parking meter
zebra
sports ball
fork
hot dog
tv monitor
refrigerator
motorbike
bench
giraffe
kite
knife
pizza
laptop
book
aeroplane
bird
backpack
baseball bat
spoon
donut
mouse
clock
bus
cat
umbrella
baseball glove
bowl
cake
remote
vase
train
dog
handbag
skateboard
banana
chair
keyboard
scissors
truck
horse
tie
surfboard
apple
sofa
cell phone
teddy bear
boat
sheep
suitcase
tennis racket
sandwich
potted plant
microwave
hair drier
traffic light
cow
frisbee
bottle
orange
bed
oven
toothbrush
Unity Project Structure
This section describes the components used in the MultiObjectDetection.unity scene.
The sample scene contains all the prefabs needed to run the gameplay. Below we describe them one by one:
[BuildingBlock] Camera Rig: this object was created using the Meta XR Building Blocks tool. This GameObject contains additional prefabs:
As child of CenterEyeAnchor:
DetectionUiMenuPrefab: manages and shows the UI of the sample.
[BuildingBlock] Passthrough: Meta XR Building Blocks entity to configure and enable the Passthrough feature.
DetectionManagerPrefab: contains the scanner logic to get the camera data and run the Sentis inference to update the UI elements.
SentisObjectsDetectedUiPrefab: contains the UI canvas to draw the boxes around the detected objects.
SentisInferenceManagerPrefab: contains the multi-object detection inference and UI logic.
EnvironmentRaycastPrefab: contains the logic to use the MRUK Raycast to get the real world 3D point using the Quest Depth Data.
WebCamTextureManagerPrefab: manages the lifecycle of the WebCamTexture.
ReturnToStartScene: common prefab used to go back to Samples selection.
Sample Detection Manager Prefab
The main prefab that manages the logic of this sample is DetectionManagerPrefab.
This prefab contains the following components:
DetectionManager.cs: This script contains the sample logic. It gets the camera image from the WebCamTextureManagerPrefab and sends it to the Sentis inference. It also manages the placement action when the user presses the A button.
AI models used on this project
This project uses Unity Sentis (2.1.1) to run AI models locally on Quest 3 devices.
Below you can find the details about each AI model used with Sentis, including the type of data we send to and receive from each model.
Multi-object detection model
To identify real-world objects, we use the Yolov9 model prepared for Sentis with an extra layer (Non-Max Suppression) and quantized to Uint8. This model requires just an image as input and returns the 2D coordinates and the class type of the detected objects.
Technical Information
Model name: Yolov9t
Model format: Sentis, quantized to Uint8
Model size: 6,284 KB
Classes trained: 80 different classes
YOLO Control Scripts
SentisInferenceRunManager.cs: contains the logic to run Sentis inference using Sentis 2.1.1.
This component has the following params:
Backend: the local backend used by Sentis to run the inference.
Sentis Model: the model asset (.sentis) used for inference.
K_Layers Per Frame: the number of layers executed per frame.
Label Asset: a file containing the list of classes the pretrained Yolov8n model can detect.
UIInference: reference to the SentisInferenceUiManager script.
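A minimal sketch of how these parameters might be declared as serialized fields is shown below (field names and default values are illustrative, not necessarily the sample's exact declarations):

using Unity.Sentis;
using UnityEngine;

// Hypothetical illustration of the inspector parameters described above.
public class SentisRunManagerFieldsSketch : MonoBehaviour
{
    [SerializeField] private BackendType m_backend = BackendType.GPUCompute; // Backend
    [SerializeField] private ModelAsset m_sentisModel;                       // Sentis Model (.sentis asset)
    [SerializeField] private int m_layersPerFrame = 25;                      // K_Layers Per Frame
    [SerializeField] private TextAsset m_labelsAsset;                        // Label Asset (class names)
    [SerializeField] private MonoBehaviour m_uiInference;                    // UIInference: reference to SentisInferenceUiManager
}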
SentisInferenceUiManager.cs: contains the logic to draw the UI box around the tracked object.
This component has the following params:
Display Image: the reference to the UI element that will contain the 2D bounding boxes.
Box Texture: the texture used by the 2D box.
Box Color: the color of the bounding box.
Font: the font type used to write the detected object information.
Font Color: the color of the object information text.
Font size: the size of the font used to show the object information.
On Object Detected event: event fired when Yolo detects an object.
Input values: TensorFloat with the current camera RGB raw image data.
The Sentis framework provides a function to convert a texture to a float tensor.
Output values: the model returns two tensors:
TensorFloat with the screen coordinates (X, Y) for all detected objects.
TensorInt with the class ID for each detected object.
Once we have the object and its coordinates (X and Y as a percentage of the input image), we use depth placement with the MRUK Raycast to get the “real world” position.
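As a reference for the input and output handling described above, the following is a minimal sketch using the Sentis 2.1 API (TextureConverter.ToTensor, PeekOutput, and ReadbackAndClone are assumed from the Unity.Sentis package; the output names "coords" and "labelIDs" are hypothetical and should be replaced by the names your model actually exposes):

using Unity.Sentis;
using UnityEngine;

// Hypothetical sketch: convert a camera texture to a tensor, run the model once,
// and read back the coordinate and class ID tensors described above.
public class YoloTensorIoSketch
{
    private Worker m_worker; // assumed to be created from the Yolo Sentis model

    public void RunOnce(Texture2D cameraTexture)
    {
        // Convert the RGB camera image to a float tensor (640x640, 3 channels is illustrative).
        using Tensor<float> input = TextureConverter.ToTensor(cameraTexture, 640, 640, 3);

        m_worker.Schedule(input);

        // Read both outputs back to the CPU (blocking readback, for simplicity).
        using var coords = (m_worker.PeekOutput("coords") as Tensor<float>).ReadbackAndClone();
        using var labelIDs = (m_worker.PeekOutput("labelIDs") as Tensor<int>).ReadbackAndClone();

        for (var i = 0; i < labelIDs.shape[0]; i++)
        {
            Debug.Log($"Detected class {labelIDs[i]} at ({coords[i, 0]}, {coords[i, 1]})");
        }
    }
}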
ONNX to SENTIS converter
This sample includes editor code to convert the Yolov9 ONNX model to the Sentis format.
The Sentis Inference Run Manager component lets you adjust the IoU and Score thresholds for the Yolo model and generate a .sentis file with the Non-Max Suppression layer added and the weights quantized to UInt8.
Iou Threshold: the Yolo IoU threshold used to identify objects.
Score Threshold: the Yolo score threshold used to identify objects.
The Sentis functions that quantize the model and save it are located inside the SentisModelEditorConverter.cs script.
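As a rough illustration of the quantize-and-save step (the ModelQuantizer, QuantizationType, and ModelWriter calls are assumed from the Unity.Sentis package; adding the Non-Max Suppression layer is omitted here), such an editor conversion might look like this:

using Unity.Sentis;

// Hypothetical editor-side sketch: load an imported ONNX model asset,
// quantize its weights to Uint8, and save the result as a .sentis file.
public static class SentisConverterSketch
{
    public static void ConvertAndSave(ModelAsset onnxModelAsset, string outputPath)
    {
        // Load the runtime model from the imported asset.
        var model = ModelLoader.Load(onnxModelAsset);

        // Quantize the model weights to Uint8 to reduce file size and loading time.
        ModelQuantizer.QuantizeWeights(QuantizationType.Uint8, ref model);

        // Serialize the quantized model to disk, e.g. "Assets/MyModel.sentis".
        ModelWriter.Save(outputPath, model);
    }
}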
To place a marker at the real-world position using the 2D coordinates of each detected object, we use the MRUK Environment Raycasting (Beta) feature. This feature allows us to perform a raycast from the 2D position into the real world using the Quest 3 depth map data.
The EnvironmentRaycast prefab contains the two components used to perform the depth raycast.
MRUKCastManager class: this component contains the logic to perform the MRUK depth raycast using the UI position of each detected object.
Environment Raycast Manager: this script is provided by MRUK and contains the logic to perform the raycast against the internal depth map generated by the Quest 3 sensors.
MRUK Ray Cast Manager class
MRUKRayCastManager.cs is the class that contains the function to get the Transform data of the real-world point using the 2D coordinates of the objects detected in the camera image.
public Transform PlaceGameObject(Vector3 cameraPosition)
{
    // Move this helper to the camera (center eye anchor) and aim it at the
    // target position, i.e. the world position of the UI box drawn for the detection.
    this.transform.position = Camera.position;
    this.transform.LookAt(cameraPosition);

    // Cast a ray from the camera along that direction against the environment depth data.
    var ray = new Ray(Camera.position, this.transform.forward);
    if (RaycastManager.Raycast(ray, out EnvironmentRaycastHit hitInfo))
    {
        // Snap to the hit point and orient the marker to the surface normal.
        this.transform.SetPositionAndRotation(
            hitInfo.point,
            Quaternion.LookRotation(hitInfo.normal, Vector3.up));
    }
    return this.transform;
}
The PlaceGameObject function performs a raycast from the camera position (center eye anchor) to the world position of each UI box drawn around the objects detected by the Yolov8n model.
This class uses the EnvironmentRaycastManager from the Meta.XR.MRUtilityKit package to perform the raycast.
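A hypothetical call site might look like the following (the raycast manager reference, detection position, and marker prefab names are illustrative):

// Hypothetical usage: spawn a marker at the surface point found for one detection.
var anchor = m_mrukRayCastManager.PlaceGameObject(detectedObjectWorldPosition);
Instantiate(m_markerPrefab, anchor.position, anchor.rotation);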
How to generate your own Yolo ONNX model
The Yolo model used in this sample is the yolov9-t-converted.pt model from here converted to ONNX.
If you want to export any Yolo model, follow the steps below to generate the ONNX file:
# Install the ultralytics package from PyPI
pip install ultralytics
Create a new Python script (yolo.py) and add the following code:
from ultralytics import YOLO
# Load a pretrained YOLO model (recommended for training)
model = YOLO("yolov8n.pt")
# Train the model using the 'coco8.yaml' dataset for 3 epochs
results = model.train(data="coco8.yaml", epochs=3)
# Evaluate the model's performance on the validation set
results = model.val()
# Export the model to ONNX format
success = model.export(format="onnx")
Run the Python script to generate the Yolov8n.onnx file.
python yolo.py
This script will print the path of the generated ONNX file when it finishes.
Within DetectionManager.cs, the Update function gets the CPU texture data from the WebCamTexture object (PassthroughCameraApiPrefab reference) and kicks off the inference by calling RunInference(captureBuffer) once the WebCamTexture object is ready. This process checks for a camera texture and waits for the current inference to complete before starting a new one. This is a standard approach that you can reuse with other ML models.
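A minimal sketch of that gating pattern is shown below (field names such as m_webCamTexture and m_inferenceRunning are illustrative, not the sample's exact API):

using UnityEngine;

// Hypothetical sketch of the Update gating described above: wait for a camera frame,
// make sure no inference is in flight, then copy the pixels and start a new inference.
public class DetectionUpdateSketch : MonoBehaviour
{
    private WebCamTexture m_webCamTexture; // assumed to come from the WebCamTextureManagerPrefab
    private Texture2D m_captureBuffer;     // CPU copy of the camera frame
    private bool m_inferenceRunning;       // true while the Sentis worker is busy

    private void Update()
    {
        // Wait until the passthrough camera texture exists and has produced a new frame.
        if (m_webCamTexture == null || !m_webCamTexture.didUpdateThisFrame)
            return;

        // Wait for the current inference to complete before starting a new one.
        if (m_inferenceRunning)
            return;

        // Copy the camera pixels into the CPU buffer and kick off the inference.
        if (m_captureBuffer == null)
            m_captureBuffer = new Texture2D(m_webCamTexture.width, m_webCamTexture.height, TextureFormat.RGBA32, false);
        m_captureBuffer.SetPixels32(m_webCamTexture.GetPixels32());
        m_captureBuffer.Apply();

        m_inferenceRunning = true;
        // RunInference(m_captureBuffer); // hand off to the Sentis inference run manager
    }
}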
SentisInferenceManagerPrefab contains the critical features for running Sentis and can be used as a design pattern for your Sentis models. See YOLO Control Scripts in the previous section.
The public variables here allow you to configure the backend for Sentis to run on either the GPU or the CPU. Select the Sentis model that you plan to use (Yolov8n in our case); see How to generate your own Yolo ONNX model for background on how to create these models. ONNX is the standard format accepted by Sentis. Finally, it allows you to set the number of layers executed per frame.
SentisInferenceRunManager.cs controls the Sentis model. This includes LoadModel(), which loads the Sentis model, compiles its graph, and spawns the Worker that will execute the model. Most importantly, it contains the public RunInference(Texture2D targetTexture) function, which converts the input camera image into the tensor consumed by the model and kicks off the inference.
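For reference, loading a Sentis model asset and creating a worker generally looks like the following with Sentis 2.1 (a sketch based on the standard Unity.Sentis API, not the sample's exact code):

using Unity.Sentis;

// Hypothetical sketch of a LoadModel-style helper: compile the model graph and
// create the worker that will execute it on the selected backend.
public class LoadModelSketch
{
    private Worker m_worker;

    public void LoadModel(ModelAsset sentisModel, BackendType backend)
    {
        var model = ModelLoader.Load(sentisModel);
        m_worker = new Worker(model, backend);
    }

    public void Dispose()
    {
        // The worker holds native resources and must be disposed explicitly.
        m_worker?.Dispose();
    }
}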
Modify the Sentis inference based on your model requirements:
Input data will be the same (texture to tensor). The RunInference(Texture2D targetTexture) function converts the texture to a tensor.
Output data will be different, so you will need to write your own post-processing scripts. The GetInferencesResults() function shows how to read the output of the model with an asynchronous readback request that is polled until the data is ready.
We recommend the layer-by-layer inference technique together with asynchronous readback of the output to get better performance results on Quest. The InferenceUpdate() function shows how to do the layer-by-layer inference; a sketch of both techniques follows this list.
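The following coroutine-style sketch illustrates both ideas with the Sentis 2.1 API (ScheduleIterable, ReadbackRequest, and IsReadbackRequestDone are assumed from the Unity.Sentis package; the layers-per-frame value is illustrative):

using System.Collections;
using Unity.Sentis;
using UnityEngine;

// Hypothetical sketch: split the inference across frames and read the output back
// asynchronously so the main thread is never blocked.
public class LayerByLayerInferenceSketch : MonoBehaviour
{
    private Worker m_worker;                 // assumed to be created from the Yolo model
    private const int k_layersPerFrame = 25; // illustrative value

    public IEnumerator RunInference(Tensor<float> input)
    {
        // ScheduleIterable yields after every layer, letting us spread the work over frames.
        var schedule = m_worker.ScheduleIterable(input);
        var layersRun = 0;
        while (schedule.MoveNext())
        {
            if (++layersRun % k_layersPerFrame == 0)
                yield return null; // resume on the next frame
        }

        // Request the output asynchronously and poll until the data is available on the CPU.
        var output = m_worker.PeekOutput() as Tensor<float>;
        output.ReadbackRequest();
        while (!output.IsReadbackRequestDone())
            yield return null;

        using var cpuOutput = output.ReadbackAndClone();
        Debug.Log($"Inference finished, output shape: {cpuOutput.shape}");
    }
}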
The UI will likely be very different for the model that you might choose, but the following is an explanation of what we used in our sample.
SentisInferenceUiManager.cs is the script that updates the UI based on the output of the model. You will need to create your own script to interpret the output of the model that you choose for your application. In the case of Yolo, we use DrawUIBoxes(output, labelIDs, imageWidth, imageHeight).
DetectionUIMenuManager prefab is under the CenterEyeAnchor and manages the display of the initial and start panels that you see as the app launches along with the countdown.
Recommendations for Sentis on Meta Quest devices
This section contains our recommendations for using Sentis with Meta Quest 3/3S. For more information on Sentis, refer to the official Sentis site: https://unity.com/products/sentis
Model architecture:
Use models that do not require complex architectures. Large generative models and LLMs will not perform well on devices like Meta Quest.
Sentis runs on the main thread of your Unity app, so it will impact other main-thread work such as the render pipeline.
Use the layer-by-layer inference technique (split the inference into a set number of layers per frame) so you do not block the main thread.
Model size:
The model is loaded into the CPU or GPU memory of your device. This can cause multiple issues when you try to use big models:
Long loading times.
Lag on the main thread during the first inference.
Reduction of the memory budget for other resources.
We recommend:
Use the smallest version of the model that fits your needs. For example, in our MultiObjectDetection sample we use the smallest version of Yolo (8 MB), because the medium-sized Yolo model (146 MB) performs worse.
Convert the model to the Sentis format and quantize it to Uint8 to reduce loading times.
GPU versus CPU:
It’s important to choose the right strategy to minimize transferring data between GPU and CPU contexts:
If you are using an AI model to perform graphics-related work, keep all the processing on the GPU. Select the GPUCompute backend for Sentis and use the procedures that operate on the GPU to send camera data directly to Sentis. Finally, if you don’t need to access the output of the model on the CPU, you can send it directly to your shader or material.
If you need to send the output to the CPU, then you can:
Run the model on the GPU, then send the results asynchronously to the CPU.
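As a sketch of the keep-it-on-the-GPU path described above (TextureConverter.RenderToTexture is assumed from the Unity.Sentis package; the asynchronous CPU readback path is the same pattern sketched earlier with ReadbackRequest):

using Unity.Sentis;
using UnityEngine;

// Hypothetical sketch: keep the model output on the GPU by writing it directly into a
// RenderTexture that a shader or material can sample, avoiding any CPU readback.
public static class GpuOutputSketch
{
    public static void WriteOutputToTexture(Worker worker, RenderTexture target)
    {
        var output = worker.PeekOutput() as Tensor<float>;
        TextureConverter.RenderToTexture(output, target); // data never leaves the GPU
    }
}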
Right now, Sentis does not use any NPU or other hardware acceleration to run inference. Sentis is part of the Unity Engine and is designed to run on multiple platforms. On Quest devices, Sentis runs as it would on a regular Android device, with no device-specific acceleration. Keep this in mind when selecting a model.