Scheduler Stress Testing
Running and interpreting the booking + accounting stress suite
Overview
The stress suite (rust/crates/scheduler/tests/stress_tests.rs) exercises the
Rust scheduler’s full production dispatch
path at scale — pipeline::run end to end: Redis accounting bootstrap →
cluster feed → pending-job query → host matching → dispatch (proc insert, host
ledger decrement, frame start) — against a deterministic, bulk-seeded farm.
It is both a correctness gate and a benchmark harness:
- Correctness: after each phase an audit cross-checks every Redis
acct:*hash the run touched againstSUM(proc)in Postgres (the canonical record — see the Redis-Backed Accounting Reference), and verifies cap enforcement and ledger invariants. - Benchmark: it reports booking throughput (frames/s over the active booking window), host-matching efficiency (wasted attempt %), host-cache hit ratio, and Redis Lua op counts.
The suite runs two phases in one process:
| Phase | Shape | What it proves |
|---|---|---|
| drain | Farm capacity comfortably exceeds demand (default: 1,200 hosts, 6,000 frames) | ≥90% of frames book; throughput measured; accounting stays exact under concurrency, including the force-rollback compensation path |
| saturation | Demand vastly exceeds tight subscription bursts and per-job core caps (default: 400 hosts, 3,000 frames, 150-core bursts) | The Redis Lua cap check is the binding constraint: bookings stop exactly at burst, caps are never breached, rejections flow through the hot path |
Invariants the audit asserts
- Every
acct:{sub,folder,job,layer,point}hash holds exactlySUM(proc.int_cores_reserved)/100cores andSUM(proc.int_gpus_reserved)GPUs for its grouping — the same 5-dimension grouping and centicore→core conversion the recompute loop uses. The suite pushes the recompute and limit-reseed loops out to a 1-hour interval, so agreement here proves the dispatch hot path alone (Lua book + force-rollback) kept Redis exact — reconciliation never got a chance to paper over drift. - Jobs with no bookings have no leaked Redis counters.
- Per-(show, alloc) booked cores never exceed the subscription burst.
- Per-job booked cores never exceed
job_resource.int_max_cores. - Host ledger:
int_cores - int_cores_idle == SUM(proc)per host, never negative. - One
RUNNINGframe per proc row. - Trigger-maintained
job_stat.int_waiting_countmatches the frame table. - After teardown, zero
stress_%rows remain in any table the suite touches.
Running locally
Prerequisites
-
A migrated Postgres on
localhost:5432(cuebot/cuebot_password). From the repo root:docker compose up -d flyway(brings updband applies migrations). If the Flyway image won’t build in your environment (e.g. SSL-inspecting proxies break its package mirrors), apply the migrations directly — they are plain SQL:cd cuebot/src/main/resources/conf/ddl/postgres/migrations for f in $(ls V*.sql | sort -t V -k2 -n); do docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f" done -
A running Docker daemon. The suite starts its own throwaway Redis container via testcontainers; all accounting state dies with it.
Run
cd rust
cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture
For meaningful benchmark numbers, use a release build:
cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture
Tuning
| Env var | Default | Meaning |
|---|---|---|
STRESS_JOBS |
300 | drain-phase job count |
STRESS_LAYERS |
4 | drain-phase layers per job |
STRESS_FRAMES_PER_LAYER |
5 | drain-phase frames per layer |
STRESS_HOSTS |
1200 | drain-phase host count |
STRESS_TAGS |
8 | drain-phase manual tag count |
STRESS_SAT_JOBS |
150 | saturation-phase job count |
STRESS_SAT_HOSTS |
400 | saturation-phase host count |
STRESS_DRAIN_TARGET |
0.9 | fraction of drain frames that must book |
STRESS_STALL_SECS |
30 | watchdog: pause jobs after this long without a new booking |
STRESS_TIMEOUT_SECS |
600 | watchdog: per-phase hard timeout |
Seeding is deterministic for a given scale (fixed RNG seed), so consecutive runs at the same scale book the same workload — diffs in throughput between runs reflect the code, not the data.
Reading the report
================ phase: drain ================
frames : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12
throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s)
matching : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98%
accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...]
audit : OK
- throughput is measured from the first to the last
proc.ts_booked, so it excludes the post-drain shutdown tail of the feed (thewallfigure includes it). - redis lua ops above the dispatch count means the compensation path ran: each failed dispatch costs one book plus one force-rollback. The audit passing alongside a surplus is a positive signal — rollbacks netted out.
- In the saturation phase, expect large
subscription=rejection counts and every subscription pinned at exactlyburst/burstcores.
Cleanup guarantees
All database rows the suite creates are prefixed stress_. The suite sweeps
that prefix before seeding (so leftovers from a crashed earlier run never
skew results) and after the run, then asserts zero residue. Redis state
needs no cleanup — the container is destroyed with the test. If a run is
killed hard (e.g. SIGKILL mid-phase), the next run’s pre-sweep removes the
leftovers.
CI integration
The suite runs in the
scheduler-stress-pipeline.yml
workflow.
When it runs
| Trigger | Scale | Purpose |
|---|---|---|
Pull request touching rust/crates/scheduler/**, rust/crates/opencue-proto/**, rust/Cargo.toml, the Postgres migrations, or the workflow itself |
defaults | Gate scheduler changes on booking/accounting correctness |
| Nightly (cron, master) | defaults | Catch drift from changes outside the paths filter; daily throughput data point |
Manual (workflow_dispatch) |
custom via inputs | Benchmark a branch at chosen scale |
When it deliberately does not run
- PRs that don’t touch the scheduler or schema (Python, CueGUI, CueWeb, docs, …). The suite needs a migrated Postgres, a Docker daemon, and several minutes of runner time; for those changes it produces zero signal.
- As a performance gate. Shared CI runners have noisy CPU/IO, so the workflow never asserts on frames/s — throughput is published in the job’s step summary (and the full log as an artifact) for humans to eyeball trends. Benchmark conclusions should come from local release-mode runs on quiet hardware.
What fails the job
Only correctness regressions: accounting drift between Redis and Postgres, a cap breach, booking liveness failures (drain below target, or a saturated farm producing no Redis rejections), a phase that never converges (hard-timeout), or test data left behind after cleanup.
Launching a manual benchmark run
GitHub → Actions → OpenCue Scheduler Stress Pipeline → Run workflow, then
optionally override the job/host/frame counts and timeout. Results appear in
the run’s step summary; the complete log is attached as the
scheduler-stress-output artifact (kept 30 days).
Scope and limitations
- RQD is not exercised. The suite runs in
dry_run_mode: the full booking path executes (Redis Lua, proc insert, host ledger, frame start) but no gRPC launch is sent. Frame completion and the Cuebot release path are out of scope — see the Redis-Backed Accounting Reference for how releases are reconciled. - Only scheduler-managed shows (
show.b_scheduler_managed = true) are covered; Cuebot-managed accounting is Cuebot’s test territory. - The recompute / limit-reseed loops are intentionally dormant during the run
(see invariant 1); their CAS semantics are covered separately by
tests/redis_integration.rs(--features redis-tests).
Schema gotchas the suite encodes
These bit during development and are asserted/documented in the test code — keep them in mind when extending the seeding:
alloc.str_tagisVARCHAR(24)andhost.str_nameisVARCHAR(30): generated names must stay short.- The pending-job query
INNER JOINsfolder_resource: a folder without that row makes every job in it silently unbookable. - The
vs_waitingview requiresjob_resource.int_max_cores - int_cores >= 100(centicores):int_max_cores = 0does not mean “unlimited” on the query path — use a large value instead.