Orchestrator sample overview

Updated: May 12, 2026

Overview

This sample demonstrates a declarative resource management framework for Android systems, designed for system engineers working on performance optimization and resource control. The sample shows how to build an event-driven daemon using BPF programs for kernel-level monitoring, trait-based policy composition, and cgroupv2 enforcement mechanisms.

What you will learn

  • Use BPF programs to monitor cgroup lifecycle events, CPU execution time, and GPU utilization at the kernel level
  • Implement a trait-based policy system that composes per-process and aggregate resource management strategies
  • Build a multi-runtime architecture with tokio for supervised, long-running flows
  • Integrate with Android’s cgroupv2 infrastructure for memory pressure management and budget enforcement
  • Coordinate graceful shutdown across multiple concurrent async tasks using cancellation tokens

Requirements

  • Target device: Meta Quest device with developer mode enabled for ADB access and deployment
  • Development environment: Linux development environment with Android build tools
For complete build prerequisites and toolchain setup, see the sample README.

Get started

The sample runs as an Android init service and requires Meta’s internal buck2 build system. Clone the orchestrator repository and build the orchestrator binary using buck2. Deploy to your device via the Android build system integration. For step-by-step build and deployment instructions, consult the sample README.

Explore the sample

The sample is organized into three resource domain flows (CPU, GPU, memory) with shared infrastructure in a common crate:
Each entry below lists the file or crate, what it demonstrates, and the key concepts it illustrates:

  • service/src/bin.rs — Main service entry point with multi-runtime supervision. Key concepts: runtime creation, flow dispatch, signal handling, coordinated shutdown.
  • service/init/orchestrator.rc — Android init service integration. Key concepts: system property triggers, capability management, lifecycle hooks.
  • bpf/cgroup.bpf.c — BPF cgroup lifecycle monitoring. Key concepts: ring buffer events, tracepoint hooks, cgroupv2 path filtering.
  • crates/cpuflow/bpf/cpu.bpf.c — Per-CPU, per-PID execution tracking. Key concepts: nested BPF maps, scheduler hooks, frequency-weighted credits.
  • crates/gpuflow/bpf/credits.bpf.c — GPU power level and per-PID utilization tracking. Key concepts: vendor tracepoint hooks, pinned maps, time-in-state accounting.
  • crates/common/src/lib.rs — Core trait definitions. Key concepts: Flow trait, async reactor abstraction.
  • crates/common/src/process_monitor.rs — Policy orchestration and lifecycle. Key concepts: Policy, PolicyFactory, and AggregatePolicy traits; task supervision.
  • crates/common/src/cgroup_monitor.rs — BPF event processing with state machine. Key concepts: sequential event processing, async ring buffer polling, initial filesystem scan.
  • crates/common/src/cgroup.rs — cgroupv2 filesystem abstraction. Key concepts: file descriptor caching, reclaim type control, async file I/O.
  • crates/common/src/stats.rs — PSI and memory statistics parsing. Key concepts: pressure triggers, cgroup stat formats, async pressure monitoring.
  • crates/memflow/src/lib.rs — Fully implemented memory management flow. Key concepts: policy composition, experiment configuration, PSI integration.
  • crates/memflow/src/policy/telemetry.rs — Observability for memory usage. Key concepts: per-process telemetry collection, statsd integration.
  • crates/memflow/src/policy/mem_throttle.rs — Working set size measurement via throttle. Key concepts: memory.high manipulation, WSS estimation, cleanup guarantees.
  • crates/memflow/src/policy/system_psi.rs — PI controller for pressure-based memory limits. Key concepts: proportional-integral control, anti-windup, deadband filtering.
  • crates/memflow/src/policy/senpai_psi.rs — Exponential backoff/probe pressure management. Key concepts: pressure integral tracking, grace periods, power-state awareness.
  • crates/memflow/src/policy/static_min.rs — Guaranteed memory protection for system services. Key concepts: aggregate policy, memory.min enforcement, per-UID accumulation.
  • crates/memflow/src/policy/mem_budget.rs — Declarative budget enforcement with remote config. Key concepts: JSON budget schema, enforcement levels (Trace/Throttle/Kill), dry-run mode.
  • crates/cpuflow/src/lib.rs — Stub CPU flow (infrastructure only). Key concepts: BPF data collection without userspace policy.
  • crates/gpuflow/src/lib.rs — Stub GPU flow (infrastructure only). Key concepts: BPF data collection without userspace policy.

Runtime behavior

When the service starts, it creates four separate tokio runtimes: one single-threaded orchestrator runtime and three multi-threaded flow runtimes (CPU, GPU, memory). The orchestrator runtime spawns each flow on its dedicated runtime and enters a supervision loop that monitors Unix signals and flow health.
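A minimal sketch of that runtime setup using tokio's builder follows; the thread counts and names here are illustrative, not the sample's actual configuration:
use tokio::runtime::{Builder, Runtime};

// One multi-threaded runtime per flow, with named worker threads so each flow
// shows up clearly in thread listings and traces.
fn flow_runtime(name: &str) -> std::io::Result<Runtime> {
    Builder::new_multi_thread()
        .worker_threads(2) // illustrative per-flow thread count
        .thread_name(name)
        .enable_all()
        .build()
}

// The orchestrator only supervises, so a single-threaded runtime suffices.
fn orchestrator_runtime() -> std::io::Result<Runtime> {
    Builder::new_current_thread().enable_all().build()
}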
The memflow flow performs most of the work. On startup, it loads the cgroup monitoring BPF program, reads system pressure from /proc/pressure/memory, builds policy instances based on Android system property configuration, and starts the process monitor. The process monitor listens to a stream of cgroup lifecycle events from the BPF program, reads process information from /proc/{pid}/, and spawns individual policy tick tasks for each enrolled process.
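For reference, PSI files expose lines such as some avg10=0.12 avg60=0.08 avg300=0.02 total=12345. A minimal parsing sketch is shown below; the struct layout is illustrative, and the sample's own parser lives in crates/common/src/stats.rs:
#[derive(Debug, Default, Clone, Copy)]
struct PsiLine {
    avg10: f64,
    avg60: f64,
    avg300: f64,
    total_us: u64, // cumulative stall time in microseconds
}

fn parse_psi_line(line: &str) -> Option<PsiLine> {
    let mut psi = PsiLine::default();
    // Skip the leading "some"/"full" keyword, then read key=value fields.
    for field in line.split_whitespace().skip(1) {
        let (key, value) = field.split_once('=')?;
        match key {
            "avg10" => psi.avg10 = value.parse().ok()?,
            "avg60" => psi.avg60 = value.parse().ok()?,
            "avg300" => psi.avg300 = value.parse().ok()?,
            "total" => psi.total_us = value.parse().ok()?,
            _ => {}
        }
    }
    Some(psi)
}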
The BPF programs run continuously in the kernel. The cgroup monitor emits events when processes attach to cgroups, execute, and exit. The CPU BPF program tracks execution time at scheduler context switches and frequency changes. The GPU BPF program tracks time-in-state at each power level when the GPU hardware changes frequency. All three BPF programs use ring buffers or pinned maps to communicate data to userspace.
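Userspace consumes those ring buffer records as fixed-layout structs. A sketch of the Rust side of that handoff follows; the field names and event encoding are illustrative, not the sample's actual ABI:
// Userspace mirror of a kernel-produced cgroup lifecycle record.
#[repr(C)]
#[derive(Clone, Copy, Debug)]
struct CgroupEvent {
    event_type: u32, // e.g. attach / exec / exit, encoded as an integer
    pid: u32,
    cgroup_id: u64,
}

// Reinterpret one ring buffer record as a CgroupEvent after a size check.
fn parse_event(data: &[u8]) -> Option<CgroupEvent> {
    if data.len() < std::mem::size_of::<CgroupEvent>() {
        return None;
    }
    // Safety: CgroupEvent is repr(C) and contains only plain integers.
    Some(unsafe { std::ptr::read_unaligned(data.as_ptr() as *const CgroupEvent) })
}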
When the service receives SIGTERM or SIGINT, the orchestrator cancels all flows and waits up to 190 milliseconds for them to exit gracefully. Each policy’s cleanup method restores cgroup files to default values before the task terminates.

Key concepts

Flow trait and multi-runtime architecture

The sample defines a Flow trait in crates/common/src/lib.rs that abstracts a long-running resource management domain. The main binary creates a separate tokio runtime for each flow, isolating flows from one another's scheduler contention and allowing per-flow thread naming for debugging:
#[async_trait]
pub trait Flow: Send {
    async fn run(&mut self, cancel_token: CancellationToken) -> anyhow::Result<()>;
}
The main supervision loop uses tokio::select! with biased to prioritize shutdown signals over flow completion. See service/src/bin.rs for the complete supervision pattern.
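A compressed sketch of that pattern is shown below; the JoinSet and the signal handling are simplifications of what service/src/bin.rs actually does (the sample spawns each flow on its own runtime):
use tokio::signal::unix::{signal, SignalKind};
use tokio::task::JoinSet;
use tokio_util::sync::CancellationToken;

async fn supervise(
    cancel: CancellationToken,
    mut flows: JoinSet<anyhow::Result<()>>,
) -> anyhow::Result<()> {
    let mut sigterm = signal(SignalKind::terminate())?;
    let mut sigint = signal(SignalKind::interrupt())?;
    tokio::select! {
        biased; // check shutdown signals before flow completion
        _ = sigterm.recv() => {}
        _ = sigint.recv() => {}
        _ = flows.join_next() => {} // a flow exiting early also triggers shutdown
    }
    cancel.cancel();
    Ok(())
}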

Policy system with trait-based composition

The policy system uses three traits to enable runtime composition. PolicyFactory conditionally creates per-process policies based on process metadata. Policy defines a periodic tick interface with cleanup guarantees. AggregatePolicy manages multiple processes collectively for cross-process resource decisions:
async fn create(&self, process: &ProcessInfo) -> anyhow::Result<Option<Box<dyn Policy>>>;
Factories return Option<Box<dyn Policy>>, allowing enrollment decisions at runtime. The ProcessMonitor in crates/common/src/process_monitor.rs iterates all factories for each new process and spawns tick tasks only for enrolled policies.
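A sketch of factory-driven enrollment follows, using simplified stand-ins for the sample's ProcessInfo and Policy definitions; the field names, tick interval, and cgroup-path check are illustrative:
use async_trait::async_trait;
use std::time::Duration;

// Simplified stand-ins for the sample's types in crates/common/src.
pub struct ProcessInfo {
    pub pid: u32,
    pub cgroup_path: String,
}

#[async_trait]
pub trait Policy: Send {
    async fn tick(&mut self) -> anyhow::Result<()>;
    async fn cleanup(&mut self) -> anyhow::Result<()>;
    fn interval(&self) -> Duration;
}

#[async_trait]
pub trait PolicyFactory: Send + Sync {
    async fn create(&self, process: &ProcessInfo) -> anyhow::Result<Option<Box<dyn Policy>>>;
}

// A factory that enrolls only processes whose cgroup path looks like an app.
struct AppTelemetryFactory;

struct AppTelemetryPolicy {
    pid: u32,
}

#[async_trait]
impl Policy for AppTelemetryPolicy {
    async fn tick(&mut self) -> anyhow::Result<()> {
        // Illustrative per-tick work: sample and report stats for self.pid.
        let _ = self.pid;
        Ok(())
    }
    async fn cleanup(&mut self) -> anyhow::Result<()> {
        // Restore any cgroup files this policy touched to their defaults.
        Ok(())
    }
    fn interval(&self) -> Duration {
        Duration::from_secs(5)
    }
}

#[async_trait]
impl PolicyFactory for AppTelemetryFactory {
    async fn create(&self, process: &ProcessInfo) -> anyhow::Result<Option<Box<dyn Policy>>> {
        if !process.cgroup_path.contains("/apps/") {
            return Ok(None); // decline enrollment; no tick task is spawned
        }
        Ok(Some(Box::new(AppTelemetryPolicy { pid: process.pid })))
    }
}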

BPF ring buffer integration with state machine

The cgroup monitor uses BPF ring buffers for low-overhead event delivery from kernel to userspace. The sample wraps the ring buffer file descriptor with tokio::io::unix::AsyncFd for async polling:
let ring_buffer = AsyncFd::new(ring_buffer)?;
ring_buffer.readable().await?;
The state machine in crates/common/src/cgroup_monitor.rs tracks which processes are attached to cgroups, handling the Android process lifecycle where app processes get their identity at attach time but system services require exec events.
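A fuller sketch of the polling loop might look like the following; the generic ring buffer handle and the consume callback are assumptions standing in for the sample's BPF bindings:
use std::os::fd::AsRawFd;
use tokio::io::unix::AsyncFd;
use tokio_util::sync::CancellationToken;

async fn poll_ring_buffer<R: AsRawFd>(
    ring_buffer: R,
    cancel: CancellationToken,
    mut consume: impl FnMut(&R) -> anyhow::Result<()>,
) -> anyhow::Result<()> {
    // Register the ring buffer's epoll-able fd with the tokio reactor.
    let async_fd = AsyncFd::new(ring_buffer)?;
    loop {
        tokio::select! {
            _ = cancel.cancelled() => break,
            guard = async_fd.readable() => {
                let mut guard = guard?;
                // Drain all pending records, then clear readiness so the next
                // readable() waits for fresh kernel data.
                consume(async_fd.get_ref())?;
                guard.clear_ready();
            }
        }
    }
    Ok(())
}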

cgroupv2 abstraction with file descriptor caching

The CgroupEntry trait provides async methods for reading and writing cgroupv2 pseudo-files. The FileSystemCgroup implementation caches file descriptors and seeks to the start for repeated reads, avoiding open/close overhead on hot paths:
let stats = cgroup.memory_stats().await?;
cgroup.write_memory_high(limit_bytes).await?;
The ReclaimType parameter controls whether writes use O_NONBLOCK, determining whether memory reclaim is charged to the orchestrator or deferred to the target cgroup. See crates/common/src/cgroup.rs for the complete abstraction.
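A sketch of the cached-descriptor read pattern follows; the struct and method names are illustrative, not the sample's CgroupEntry API:
use std::io::SeekFrom;
use std::path::PathBuf;
use tokio::fs::File;
use tokio::io::{AsyncReadExt, AsyncSeekExt};

struct CachedCgroupFile {
    path: PathBuf,
    file: Option<File>,
}

impl CachedCgroupFile {
    async fn read(&mut self) -> anyhow::Result<String> {
        // Open once, keep the descriptor for later reads.
        if self.file.is_none() {
            self.file = Some(File::open(&self.path).await?);
        }
        let file = self.file.as_mut().unwrap();
        // cgroupv2 pseudo-files regenerate their contents on each read,
        // so rewinding the cached descriptor avoids open/close on hot paths.
        file.seek(SeekFrom::Start(0)).await?;
        let mut contents = String::new();
        file.read_to_string(&mut contents).await?;
        Ok(contents)
    }
}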

Pressure-based memory management

The sample demonstrates two memory pressure control algorithms. The System PSI policy implements a PI controller that adjusts memory.high to maintain a target pressure percentage. The Senpai PSI policy uses exponential backoff when pressure exceeds the target and exponential probing when pressure is below target.
Both policies read cumulative pressure from memory.pressure and compute the delta since the last tick. See crates/memflow/src/policy/system_psi.rs and crates/memflow/src/policy/senpai_psi.rs for the complete control logic.
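A sketch of the proportional-integral update with anti-windup and a deadband follows, where the measured pressure percentage would be derived from that per-tick delta; the gains and limits are illustrative, and the sample's tuning lives in system_psi.rs:
struct PiController {
    target_pct: f64,     // desired pressure over the tick window, in percent
    kp: f64,             // proportional gain (bytes per percent of error)
    ki: f64,             // integral gain
    integral: f64,       // accumulated error
    integral_limit: f64, // anti-windup clamp on the accumulated error
    deadband_pct: f64,   // ignore errors smaller than this
}

impl PiController {
    /// Returns the adjustment (in bytes) to apply to the current memory.high.
    fn update(&mut self, measured_pct: f64) -> f64 {
        let error = measured_pct - self.target_pct;
        // Deadband: small errors produce no adjustment and no integral growth.
        if error.abs() < self.deadband_pct {
            return 0.0;
        }
        // Anti-windup: keep the integral term bounded.
        self.integral = (self.integral + error).clamp(-self.integral_limit, self.integral_limit);
        // Positive error (pressure above target) yields a positive output,
        // i.e. raise memory.high to loosen the limit and relieve pressure.
        self.kp * error + self.ki * self.integral
    }
}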

Coordinated shutdown with cancellation tokens

The sample uses tokio_util::CancellationToken for hierarchical shutdown coordination. A global token created in main() is cloned and passed to each flow. The ProcessMonitor creates its own global token for all policy tasks:
select! {
    _ = cancel_token.cancelled() => break,
    _ = interval.tick() => { /* policy work */ }
}
Policies check for cancellation in their tick loops and call cleanup() before exiting. The orchestrator enforces a 190ms shutdown timeout, aborting tasks that fail to respond to cancellation.
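A sketch of the shutdown side of that contract follows; the per-task timeout here is a simplification of the sample's single shutdown budget:
use std::time::Duration;
use tokio::task::JoinHandle;
use tokio_util::sync::CancellationToken;

async fn shutdown(global: CancellationToken, flows: Vec<JoinHandle<()>>) {
    // Cancelling the parent token also cancels every child_token() derived
    // from it, so nested policy tick loops observe shutdown as well.
    global.cancel();
    for mut handle in flows {
        // Give each task a bounded window to run cleanup(), then abort it.
        if tokio::time::timeout(Duration::from_millis(190), &mut handle).await.is_err() {
            handle.abort();
        }
    }
}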

Extend the sample

  • Add a new policy: Implement the PolicyFactory and Policy traits (see crates/memflow/src/policy/telemetry.rs for a minimal example), register the factory in crates/memflow/src/lib.rs, and add configuration system properties.
  • Implement userspace logic for CPU or GPU flows: Add policies that consume the BPF-collected data from the CPU and GPU programs.
  • Add support for new BPF tracepoints: Modify the cgroup monitor or create a new BPF program with skeleton generation.