Spatial Scanner - ML Kit object tracking

Updated: May 29, 2025
An essential part of Meta Spatial Scanner’s core experience, and a prime use case for access to the device’s camera API, is the use of computer vision (CV) and object detection AI models to give users the opportunity to learn more about their surroundings. This capability has many potential use cases: body pose detection for medical applications, real-time text recognition and translation, face mesh detection for more accurate AR filters or video chat, brand product recognition to track consumer metrics or impressions, and much more.
Showing the CV model inference and on-device object detection.
This page provides a breakdown of how CV model inference was implemented in and used by this app for real-time, on-device object detection and tracking, and ideally serves as a starting point for integrating CV inference into other applications. All of the Kotlin code for object tracking is isolated in the objectdetection package for easier adoption into other projects.
objectdetection/**/*.kt

Object detection options

There are a multitude of options available for image object detection inference, including various datasets, pre-trained models, and model inference software. The vast majority of object detection inference options provide, at a base level, a rectangular bounding box defining the location and extents of the object in the image, and a numeric confidence score. More robust options offer support for additional features like object classification or labeling, object segmentation, and object tracking across sequential inferences. Few options provide some or all of these robust features plus the ability to perform them in real time on a device with considerably less computational power than high-end desktop GPUs and processors.
Choosing the right options should come down to the application’s specific use case. For this app, the following criteria were determined in order to select the best options:
  1. Must be high-performing, and capable of completing 15-20 inferences per second on-device.
  2. Must include out-of-the-box support for real-time object tracking across sequential inferences, with persistent object IDs.
  3. Must be capable of identifying and labeling everyday household and office objects with relative accuracy, especially the three curated objects: the refrigerator, television, and smartphone.
  4. Must require relatively low effort to integrate into an Android project, ideally via a Kotlin SDK.
  5. Ideally capable of tracking multiple objects at once.

Model inference options

Three options for model inference software were explored during the development of this app. All three satisfied most of the criteria defined above, but only one satisfied criteria 1 and 2 in our testing: ML Kit.
A number of other options exist that were not explored, but most did not satisfy criterion 4, as they required using the Android NDK. For this application’s use case, and to facilitate exploring different options, some abstractions for these various object detector technologies were implemented.
import android.graphics.PointF
import android.graphics.Rect
import android.media.Image

// A single detected object: its center point, bounding box, label,
// confidence score, and (when supported) a persistent tracking id.
data class DetectedObject(
    val point: PointF,
    val bounds: Rect,
    val label: String,
    val confidence: Float,
    val id: Int? = null
)

// The full result of one inference pass over a video frame.
data class DetectedObjectsResult(
    val objects: List<DetectedObject>,
    val inferenceTime: Long,
    val inputImageWidth: Int,
    val inputImageHeight: Int
)

interface IObjectsDetectedListener {
    fun onObjectsDetected(result: DetectedObjectsResult, image: Image)
}

interface IObjectDetectorHelper {
    fun setObjectDetectedListener(listener: IObjectsDetectedListener)
    fun detect(image: Image, width: Int, height: Int, finally: () -> Unit)
}
All of these structures are well documented in-code, and are located in the objectdetection/detector package. The different inference options that weren’t ultimately used in this app – MediaPipe and OpenCV – were left in the project to demonstrate their usage, and hopefully serve as a solid starting point for other applications whose criteria for choosing a model inference technology are satisfied by one of these implementations. You can easily switch between them in ObjectDetectionFeature by toggling the commented lines, though it is recommended that you only test these with the spawnCameraViewPanel flag set to true.
// different options for object detection; though only MLKit currently supports persistent ids
//objectDetector = MediaPipeObjectDetector(activity)
objectDetector = MLKitObjectDetector(activity)
//objectDetector = OpenCVObjectDetector(activity)
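
For reference, the core of an ML Kit-backed implementation of IObjectDetectorHelper looks roughly like the sketch below. This is a minimal illustration, not the project’s actual MLKitObjectDetector: it assumes ML Kit’s built-in base model in stream mode (the app actually bundles custom models, covered in the next section), a fixed rotation of 0 degrees, and maps ML Kit results into the DetectedObject structures defined above.

import android.graphics.PointF
import android.media.Image
import com.google.mlkit.vision.common.InputImage
import com.google.mlkit.vision.objects.ObjectDetection
import com.google.mlkit.vision.objects.defaults.ObjectDetectorOptions

// Minimal sketch of an ML Kit-backed IObjectDetectorHelper (illustrative only;
// see objectdetection/detector for the app's documented implementation).
class SimpleMLKitDetector : IObjectDetectorHelper {
    private var listener: IObjectsDetectedListener? = null

    // STREAM_MODE enables low-latency detection with tracking ids across frames.
    private val detector = ObjectDetection.getClient(
        ObjectDetectorOptions.Builder()
            .setDetectorMode(ObjectDetectorOptions.STREAM_MODE)
            .enableMultipleObjects()
            .enableClassification()
            .build()
    )

    override fun setObjectDetectedListener(listener: IObjectsDetectedListener) {
        this.listener = listener
    }

    override fun detect(image: Image, width: Int, height: Int, finally: () -> Unit) {
        val start = System.currentTimeMillis()
        // In a real app, the rotation should match the camera's orientation.
        val input = InputImage.fromMediaImage(image, /* rotationDegrees= */ 0)
        detector.process(input)
            .addOnSuccessListener { detected ->
                val objects = detected.map { obj ->
                    DetectedObject(
                        point = PointF(obj.boundingBox.exactCenterX(), obj.boundingBox.exactCenterY()),
                        bounds = obj.boundingBox,
                        label = obj.labels.firstOrNull()?.text ?: "unknown",
                        confidence = obj.labels.firstOrNull()?.confidence ?: 0f,
                        id = obj.trackingId
                    )
                }
                listener?.onObjectsDetected(
                    DetectedObjectsResult(objects, System.currentTimeMillis() - start, width, height),
                    image
                )
            }
            .addOnCompleteListener { finally() }
    }
}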

Dataset and model options

After settling on ML Kit, various options for object detection models were explored. Kaggle is a large repository of datasets and models, and conveniently hosts a number of lightweight models that are built for on-device inference and are compatible with ML Kit. A dozen different models were tested, and four are still included in the project at app/src/main/assets/models/mlkit/*.tflite. Each offers its own advantages and disadvantages in terms of what types of objects it is capable of detecting, how long an inference takes in milliseconds, and how large the model file is. Ultimately, the EfficientNet-Lite4 uint8 TFLite model was chosen, as it strikes the best balance of criteria satisfaction for this app’s use case.
Many more options exist on Kaggle and other repositories, however, along with thorough documentation on how to train many of those models with a dataset tailored to the needs of other applications.
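
When one of the bundled models (or a newly trained one) is used with ML Kit, it is loaded as a custom TFLite model. The sketch below shows the general ML Kit pattern; the asset file name is a placeholder, and the threshold values are illustrative rather than the app’s actual configuration.

import com.google.mlkit.common.model.LocalModel
import com.google.mlkit.vision.objects.ObjectDetection
import com.google.mlkit.vision.objects.custom.CustomObjectDetectorOptions

// "efficientnet_lite4_uint8.tflite" is a placeholder; substitute the actual
// file name of a model bundled under assets/models/mlkit/.
val localModel = LocalModel.Builder()
    .setAssetFilePath("models/mlkit/efficientnet_lite4_uint8.tflite")
    .build()

// STREAM_MODE + classification roughly mirrors the criteria listed above:
// real-time inference, persistent tracking ids, and object labeling.
val options = CustomObjectDetectorOptions.Builder(localModel)
    .setDetectorMode(CustomObjectDetectorOptions.STREAM_MODE)
    .enableMultipleObjects()
    .enableClassification()
    .setClassificationConfidenceThreshold(0.5f)
    .setMaxPerObjectLabelCount(1)
    .build()

val detector = ObjectDetection.getClient(options)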

Detected object cache

While developing this app, the following needs arose, which resulted in the creation of objectdetection/DetectedObjectCache.kt:
  1. A central location where the current pool of detected objects is stored and searchable.
  2. A means of filtering out detected objects that overlap too much with the bounds of other objects.
  3. A way to reconcile the diff between the set of detected objects from the previous video frame with the new set from the latest video frame.
  4. A way to emit events notifying others when new objects are detected, existing objects are updated, and existing objects are lost.
The following diagram illustrates the code execution path, which includes performing the object detection inference and displaying the results to the user. Note how the DetectedObjectCache is integral in determining which detected objects should be displayed to the user. Also note, in the Llama Vision Integration document, the role DetectedObjectCache plays in fetching a cropped image of a detected object from the latest video frame so that it can be sent to Llama for vision analysis via AWS Bedrock.
Showing the flow of ML Kit Object Tracking.
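
To make those responsibilities concrete, the following sketch shows the general shape such a cache could take. It is illustrative only: the listener interface, class, and method names here are hypothetical, and the real DetectedObjectCache.kt carries additional state and the overlap filtering described below.

// Hypothetical shape of a detected-object cache; names are illustrative,
// not the actual DetectedObjectCache API.
interface IObjectsChangedListener {
    fun onObjectsAdded(objects: List<DetectedObject>)
    fun onObjectsUpdated(objects: List<DetectedObject>)
    fun onObjectsLost(ids: List<Int>)
}

class SimpleDetectedObjectCache {
    private val current = mutableMapOf<Int, DetectedObject>()
    private val listeners = mutableListOf<IObjectsChangedListener>()

    fun addListener(listener: IObjectsChangedListener) {
        listeners += listener
    }

    // Reconcile the previous frame's objects with the latest inference results.
    fun update(result: DetectedObjectsResult) {
        val incoming = result.objects
            .filter { it.id != null } // only track objects with persistent ids
            .associateBy { it.id!! }

        val added = incoming.filterKeys { it !in current }.values.toList()
        val updated = incoming.filterKeys { it in current }.values.toList()
        val lost = current.keys.filter { it !in incoming }.toList()

        current.clear()
        current.putAll(incoming)

        if (added.isNotEmpty()) listeners.forEach { it.onObjectsAdded(added) }
        if (updated.isNotEmpty()) listeners.forEach { it.onObjectsUpdated(updated) }
        if (lost.isNotEmpty()) listeners.forEach { it.onObjectsLost(lost) }
    }
}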

Filtering objects

Filtering out overlapping objects occurs in DetectedObjectCache, and is performed to avoid the situation where multiple objects are detected on top of each other, making it difficult for the user to select any one of them in particular. An example of this occurring would be if the user is looking at a multi-shelf bookcase, and the object detection inference returns results for multiple adjacent shelves – each as “bookshelf” – plus the entirety of the structure as “bookcase”. Two different methods of filtering were applied, listed below.
  1. If 2 objects intersect, where 1 is completely within the other, remove the interior one.
  2. If 2 objects are overlapping, and the intersection area makes up a majority of the smaller of the two, remove it.
These filtering methods eliminated the majority of overlapping issues while still allowing partially overlapping bounds, from which users could still easily select one object or the other. Additional filtering methods could be conceived and implemented in apps with different use cases. For example, an app that only provides information about furniture could implement a whitelist or blacklist filter for furniture-related objects, provided that the chosen model or dataset doesn’t completely rule out non-furniture object labeling.
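
Both rules reduce to simple Rect arithmetic. A minimal sketch of the idea (not the exact code in DetectedObjectCache, and with an illustrative overlap threshold) might look like this:

import android.graphics.Rect

// Fraction of the smaller box's area that must be covered before it is dropped;
// the 0.5f threshold here is illustrative, not the app's actual value.
private const val OVERLAP_THRESHOLD = 0.5f

private fun area(r: Rect): Int = r.width() * r.height()

// Returns true if `candidate` should be filtered out because it overlaps
// `other` too much (rule 1: fully contained; rule 2: majority overlap).
fun shouldFilterOut(candidate: Rect, other: Rect): Boolean {
    if (other.contains(candidate)) return true

    val intersection = Rect()
    if (!intersection.setIntersect(candidate, other)) return false

    val isSmaller = area(candidate) <= area(other)
    val smallerArea = minOf(area(candidate), area(other))
    return isSmaller && area(intersection) > smallerArea * OVERLAP_THRESHOLD
}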

Tracked object system

The TrackedObjectSystem, as illustrated in the diagram above, receives object detection results from the DetectedObjectCache and is responsible for displaying those results to the user. This custom ECS System is also responsible for listening for user input events for selecting a detected object.
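
As a rough illustration, a custom System in the Spatial SDK subclasses SystemBase and does its per-frame work in execute(). The sketch below is a simplified approximation of that structure, not the actual TrackedObjectSystem: the buffering approach and names are assumptions, and the entity/panel update logic is elided.

import com.meta.spatial.core.SystemBase
import java.util.concurrent.ConcurrentLinkedQueue

// Simplified approximation of a tracked-object System: buffer results from the
// detection thread, then apply them on the render thread inside execute().
class SimpleTrackedObjectSystem : SystemBase() {
    private val pendingResults = ConcurrentLinkedQueue<DetectedObjectsResult>()

    // Called when new detection results arrive (e.g. from the cache).
    fun onObjectsDetected(result: DetectedObjectsResult) {
        pendingResults.add(result)
    }

    override fun execute() {
        // Drain the queue and update/spawn/release tracked-object entities here.
        while (true) {
            val result = pendingResults.poll() ?: break
            // ... reconcile entities, outline quads, and label panels with `result`
        }
    }
}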

Object pooling

Because object detection results could change many times per second, a simple object pooling pattern was implemented to prevent potentially numerous Spatial Entity and Component instances from being created and destroyed each frame, which may affect app performance. These files are located in objectdetection/utils, and are well documented in-code.
interface IPoolable {
    fun reset()
}

class ObjectPool<T : IPoolable>(
    private val factory: () -> T,
    initialSize: Int = 0
) {
    private val pool = ArrayDeque<T>()

    init {
        // optionally pre-populate the pool
        repeat(initialSize) { pool.addLast(factory()) }
    }

    val availableObjects: Int get() = pool.size

    fun take(): T {
        // reuse a pooled instance if one is available; otherwise create a new one
        return pool.removeFirstOrNull() ?: factory()
    }

    fun put(obj: T) {
        // reset the instance and return it to the pool for reuse
        obj.reset()
        pool.addLast(obj)
    }
}
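
As a usage illustration, with a hypothetical poolable type that is not part of the app:

// Hypothetical IPoolable implementation, used only to demonstrate the pool API.
private class ReusableLabel : IPoolable {
    var text: String = ""
    override fun reset() {
        text = ""
    }
}

val labelPool = ObjectPool(::ReusableLabel, initialSize = 8)

val label = labelPool.take() // reuses a pooled instance or creates a new one
label.text = "television"
// ... when the corresponding detected object is lost, return it for reuse
labelPool.put(label)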
Note that the private data class TrackedObjectInfo, which implements IPoolable for the TrackedObjectSystem object pool, maintains references to a number of Spatial SDK object instances and simply updates them as new objects are detected and lost.
private data class TrackedObjectInfo(
    val entity: Entity,
    val labelPanelEntity: Entity,
    val outlineMaterial: SceneMaterial,
    val uiVM: ObjectLabelViewModel = ObjectLabelViewModel(),
    var cameraRayToObject: Vector3 = Vector3.Forward,
    var cameraFrameBounds: Rect = Rect(),
    var targetPose: Pose = Pose(),
    var targetScale: Vector3 = Vector3(0f)
) : IPoolable {
    override fun reset() {
        // restores the mutable fields above to their defaults for reuse
        // (body elided; see the documented source in objectdetection)
    }
}

Outline shader

A few options for displaying the detected objects were explored during the development of this app. Ultimately, a method was chosen which clearly displays the bounds of detected objects in the user’s view, and uses a label formatted and styled with a Jetpack Compose panel. A simple quad geometry is used for the bounds, with a custom 9-slice shader to render a visually appealing rounded rectangle. This implementation is well documented in-code, and more can be read about it on the Overview page.

Debug view and overlay

An alternative to the TrackedObjectSystem – primarily used for debugging in this app’s development – is a means of displaying detected objects to the user via a view-locked Spatial Panel with a graphic overlay which draws rectangles on an Android View canvas. This view-locked overlay also renders the camera feed used for video frame processing into a SurfaceView, and is documented in the Passthrough Camera API. This method of displaying detected objects, however, does not provide a straightforward means to allow users to select an object, or a way to use a robust UI library like Jetpack Compose for labeling objects (like the TrackedObjectSystem provides).
Showing the camera preview that can be used for debugging.
This feature can be enabled by passing true for the spawnCameraViewPanel parameter of the ObjectDetectionFeature constructor. Enabling this also disables the TrackedObjectSystem to prevent many overlapping outlines of objects.
Note that the camera feed displayed on the panel has a slight delay (40-60 ms, as mentioned in the documentation), and can be disorienting or cause motion dizziness after prolonged use. To circumvent this issue during debugging and testing, it is recommended that the visibility property be set to gone on the CameraPreview view in the res/layout/ui_camera_view.xml layout file (which is used on the panel), unless the application use case requires a view into the processed video frames.
<!-- View group which displays the camera preview view via the CameraController onto a Surface. -->
<com.meta.pixelandtexel.scanner.objectdetection.views.android.CameraPreview
    android:id="@+id/preview_view"
    android:layout_width="match_parent"
    android:layout_height="match_parent"
    android:layout_gravity="center"
    android:background="@android:color/transparent"
    android:visibility="gone" /> <!-- ADD THIS -->

Adoption

To implement this capability in other Spatial apps, start by adding the custom Spatial Feature ObjectDetectionFeature and all of its dependencies into your app, and registering it in your main activity in the registerFeatures() function. For an example of this, see that function in activities/MainActivity.kt.
Note that you must pass callbacks to the ObjectDetectionFeature constructor parameters to receive object detection results, and maintain a reference to the Spatial Feature instance in order to invoke its control functions (for example, scan() and pause()) during the app’s session.
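
As a rough illustration of this wiring, the sketch below registers the feature from an activity. The callback parameter name shown here is an assumption for illustration; check the ObjectDetectionFeature constructor and activities/MainActivity.kt in the sample for the actual signatures.

import com.meta.spatial.core.SpatialFeature
import com.meta.spatial.toolkit.AppSystemActivity
import com.meta.spatial.vr.VRFeature

class MyActivity : AppSystemActivity() {
    // keep a reference so scan()/pause() can be invoked later in the session
    private lateinit var objectDetectionFeature: ObjectDetectionFeature

    override fun registerFeatures(): List<SpatialFeature> {
        objectDetectionFeature = ObjectDetectionFeature(
            this,
            spawnCameraViewPanel = false, // true shows the debug camera panel instead
            // hypothetical callback parameter name; pass your own result handler
            onObjectsDetected = { result -> /* react to DetectedObjectsResult */ }
        )
        return listOf(VRFeature(this), objectDetectionFeature)
    }
}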