Scheduler Stress Testing

Running and interpreting the booking + accounting stress suite


Overview

The stress suite (rust/crates/scheduler/tests/stress_tests.rs) exercises the Rust scheduler’s full production dispatch path at scale — pipeline::run end to end: Redis accounting bootstrap → cluster feed → pending-job query → host matching → dispatch (proc insert, host ledger decrement, frame start) — against a deterministic, bulk-seeded farm.

It is both a correctness gate and a benchmark harness:

  • Correctness: after each phase an audit cross-checks every Redis acct:* hash the run touched against SUM(proc) in Postgres (the canonical record — see the Redis-Backed Accounting Reference), and verifies cap enforcement and ledger invariants.
  • Benchmark: it reports booking throughput (frames/s over the active booking window), host-matching efficiency (wasted attempt %), host-cache hit ratio, and Redis Lua op counts.

The suite runs two phases in one process:

| Phase | Shape | What it proves |
|---|---|---|
| drain | Farm capacity comfortably exceeds demand (default: 1,200 hosts, 6,000 frames) | ≥90% of frames book; throughput measured; accounting stays exact under concurrency, including the force-rollback compensation path |
| saturation | Demand vastly exceeds tight subscription bursts and per-job core caps (default: 400 hosts, 3,000 frames, 150-core bursts) | The Redis Lua cap check is the binding constraint: bookings stop exactly at burst, caps are never breached, rejections flow through the hot path |
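
As a rough mental model of what the saturation phase checks, the sketch below books 1-core requests against a 150-core burst and counts rejections. It is a single-threaded stand-in, not the suite's Lua script: the `Subscription` type, its fields, and `try_book` are hypothetical names, and the real check runs atomically inside Redis.

```rust
/// Hypothetical in-memory stand-in for one (show, alloc) subscription.
struct Subscription {
    burst_cores: u32,
    booked_cores: u32,
    rejections: u32,
}

impl Subscription {
    /// Book `cores` only if the result stays within burst; otherwise count a rejection.
    fn try_book(&mut self, cores: u32) -> bool {
        if self.booked_cores + cores <= self.burst_cores {
            self.booked_cores += cores;
            true
        } else {
            self.rejections += 1;
            false
        }
    }
}

/// Fire `requests` one-core booking attempts at a fresh subscription.
fn saturate(burst_cores: u32, requests: u32) -> Subscription {
    let mut sub = Subscription { burst_cores, booked_cores: 0, rejections: 0 };
    for _ in 0..requests {
        sub.try_book(1);
    }
    sub
}

fn main() {
    // 150-core burst (the documented saturation default), 200 one-core requests.
    let sub = saturate(150, 200);
    println!(
        "booked {}/{} cores, {} rejections",
        sub.booked_cores, sub.burst_cores, sub.rejections
    );
}
```

With demand above burst, the counter stops at the cap and every further attempt is a rejection, which is the shape the saturation report is expected to show.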

Invariants the audit asserts

  1. Every acct:{sub,folder,job,layer,point} hash holds exactly SUM(proc.int_cores_reserved)/100 cores and SUM(proc.int_gpus_reserved) GPUs for its grouping — the same 5-dimension grouping and centicore→core conversion the recompute loop uses. The suite pushes the recompute and limit-reseed loops out to a 1-hour interval, so agreement here proves the dispatch hot path alone (Lua book + force-rollback) kept Redis exact — reconciliation never got a chance to paper over drift.
  2. Jobs with no bookings have no leaked Redis counters.
  3. Per-(show, alloc) booked cores never exceed the subscription burst.
  4. Per-job booked cores never exceed job_resource.int_max_cores.
  5. Host ledger: int_cores - int_cores_idle == SUM(proc) per host, never negative.
  6. One RUNNING frame per proc row.
  7. Trigger-maintained job_stat.int_waiting_count matches the frame table.
  8. After teardown, zero stress_% rows remain in any table the suite touches.
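
Invariants 1 and 2 can be pictured as a small comparison between two maps. The sketch below is an illustration only, assuming one grouping dimension at a time and integer division for the centicore conversion; `ProcRow`, `Booked`, and `find_drift` are hypothetical names, not the suite's types.

```rust
use std::collections::HashMap;

/// One `proc` row reduced to the fields the audit needs (hypothetical shape).
/// `key` stands in for one grouping dimension, e.g. a job id.
struct ProcRow {
    key: String,
    int_cores_reserved: i64, // centicores: 100 = 1 core
    int_gpus_reserved: i64,
}

/// Cores and GPUs as an acct hash would report them, in whole cores.
#[derive(Debug, PartialEq, Clone, Copy, Default)]
struct Booked {
    cores: i64,
    gpus: i64,
}

/// SUM(proc) per key, converting centicores to cores after summing.
fn expected_from_procs(rows: &[ProcRow]) -> HashMap<String, Booked> {
    let mut sums: HashMap<String, (i64, i64)> = HashMap::new();
    for r in rows {
        let e = sums.entry(r.key.clone()).or_insert((0, 0));
        e.0 += r.int_cores_reserved;
        e.1 += r.int_gpus_reserved;
    }
    sums.into_iter()
        .map(|(k, (c, g))| (k, Booked { cores: c / 100, gpus: g }))
        .collect()
}

/// Keys where Redis disagrees with Postgres, plus non-zero counters
/// for keys that have no proc rows at all (a leak, invariant 2).
fn find_drift(redis: &HashMap<String, Booked>, rows: &[ProcRow]) -> Vec<String> {
    let expected = expected_from_procs(rows);
    let mut drift = Vec::new();
    for (k, want) in &expected {
        if redis.get(k).copied().unwrap_or_default() != *want {
            drift.push(k.clone());
        }
    }
    for (k, got) in redis {
        if !expected.contains_key(k) && *got != Booked::default() {
            drift.push(k.clone());
        }
    }
    drift.sort();
    drift
}

fn main() {
    let rows = vec![
        ProcRow { key: "job-a".into(), int_cores_reserved: 200, int_gpus_reserved: 1 },
        ProcRow { key: "job-a".into(), int_cores_reserved: 100, int_gpus_reserved: 0 },
    ];
    let mut redis = HashMap::new();
    redis.insert("job-a".to_string(), Booked { cores: 3, gpus: 1 });
    redis.insert("job-b".to_string(), Booked { cores: 1, gpus: 0 }); // leaked counter
    println!("drifted keys: {:?}", find_drift(&redis, &rows));
}
```

An empty result corresponds to `audit : OK`; any key in the list would be a correctness failure.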

Running locally

Prerequisites

  • A migrated Postgres on localhost:5432 (cuebot / cuebot_password). From the repo root: docker compose up -d flyway (brings up db and applies migrations). If the Flyway image won’t build in your environment (e.g. SSL-inspecting proxies break its package mirrors), apply the migrations directly — they are plain SQL:

    cd cuebot/src/main/resources/conf/ddl/postgres/migrations
    for f in $(ls V*.sql | sort -t V -k2 -n); do
      docker exec -i opencue-db-1 psql -q -v ON_ERROR_STOP=1 -U cuebot -d cuebot < "$f"
    done
    
  • A running Docker daemon. The suite starts its own throwaway Redis container via testcontainers; all accounting state dies with it.

Run

cd rust
cargo test -p scheduler --features stress-tests --test stress_tests -- --nocapture

For meaningful benchmark numbers, use a release build:

cargo test -p scheduler --release --features stress-tests --test stress_tests -- --nocapture

Tuning

| Env var | Default | Meaning |
|---|---|---|
| STRESS_JOBS | 300 | drain-phase job count |
| STRESS_LAYERS | 4 | drain-phase layers per job |
| STRESS_FRAMES_PER_LAYER | 5 | drain-phase frames per layer |
| STRESS_HOSTS | 1200 | drain-phase host count |
| STRESS_TAGS | 8 | drain-phase manual tag count |
| STRESS_SAT_JOBS | 150 | saturation-phase job count |
| STRESS_SAT_HOSTS | 400 | saturation-phase host count |
| STRESS_DRAIN_TARGET | 0.9 | fraction of drain frames that must book |
| STRESS_STALL_SECS | 30 | watchdog: pause jobs after this long without a new booking |
| STRESS_TIMEOUT_SECS | 600 | watchdog: per-phase hard timeout |

Seeding is deterministic for a given scale (fixed RNG seed), so consecutive runs at the same scale book the same workload — diffs in throughput between runs reflect the code, not the data.
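
The sketch below shows one way "env var with a default" and "fixed-seed seeding" can fit together. It is not the suite's code: `env_or`, `SplitMix64`, `seed_host_tags`, and the seed value 42 are assumptions for illustration, and the generator is a small standard-library-only stand-in for whatever RNG the suite actually uses. The drain default of 300 jobs × 4 layers × 5 frames matches the 6,000 frames quoted in the overview.

```rust
use std::env;
use std::str::FromStr;

/// Read an env var, falling back to `default` when unset or unparsable.
fn env_or<T: FromStr>(name: &str, default: T) -> T {
    env::var(name)
        .ok()
        .and_then(|v| v.parse::<T>().ok())
        .unwrap_or(default)
}

/// SplitMix64: a tiny deterministic generator (illustrative stand-in).
struct SplitMix64(u64);

impl SplitMix64 {
    fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E3779B97F4A7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58476D1CE4E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D049BB133111EB);
        z ^ (z >> 31)
    }
}

/// Assign each host one tag index; the same (hosts, tags, seed) always
/// yields the same assignment.
fn seed_host_tags(hosts: usize, tags: u64, seed: u64) -> Vec<u64> {
    let mut rng = SplitMix64(seed);
    let tags = tags.max(1); // guard against a zero tag count
    (0..hosts).map(|_| rng.next_u64() % tags).collect()
}

fn main() {
    let jobs: usize = env_or("STRESS_JOBS", 300);
    let layers: usize = env_or("STRESS_LAYERS", 4);
    let frames_per_layer: usize = env_or("STRESS_FRAMES_PER_LAYER", 5);
    let hosts: usize = env_or("STRESS_HOSTS", 1200);
    let tags: u64 = env_or("STRESS_TAGS", 8);

    let first = seed_host_tags(hosts, tags, 42);
    let second = seed_host_tags(hosts, tags, 42);
    println!("drain frames: {}", jobs * layers * frames_per_layer);
    println!("{} hosts tagged, identical across runs: {}", first.len(), first == second);
}
```

Because the only inputs are the scale parameters and a constant seed, two runs at the same scale see the same workload.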

Reading the report

================ phase: drain ================
frames     : 6000 seeded, 5988 dispatched (99.8%), waiting 6000 -> 12
throughput : 975.1 frames/s over a 6.1s booking window (wall 43.3s)
matching   : 3175 host attempts (41.9% wasted), 39 cluster rounds, host-cache hit 98%
accounting : 7452 redis lua ops, 5988 dispatches (metrics), 24040 booked cores, rejections [...]
audit      : OK
  • throughput is measured from the first to the last proc.ts_booked, so it excludes the post-drain shutdown tail of the feed (the wall figure includes it).
  • redis lua ops above the dispatch count means the compensation path ran: each failed dispatch costs one book plus one force-rollback. The audit passing alongside a surplus is a positive signal — rollbacks netted out.
  • In the saturation phase, expect large subscription= rejection counts and every subscription pinned at exactly burst/burst cores.
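
The two derived figures in the report can be reproduced with simple arithmetic. The sketch below assumes booking timestamps are available as seconds (`f64`) and that throughput is the dispatched count divided by the first-to-last window; `booking_throughput` and `surplus_lua_ops` are hypothetical helper names.

```rust
/// Frames per second over the window from the earliest to the latest
/// booking timestamp. Returns (frames_per_sec, window_secs), or None when
/// there are fewer than two bookings or the window is empty.
fn booking_throughput(ts_booked: &[f64]) -> Option<(f64, f64)> {
    let first = ts_booked.iter().cloned().fold(f64::INFINITY, f64::min);
    let last = ts_booked.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let window = last - first;
    if ts_booked.len() < 2 || window <= 0.0 {
        return None;
    }
    Some((ts_booked.len() as f64 / window, window))
}

/// Lua ops beyond one per dispatch; a non-zero value means extra ops
/// (such as book + force-rollback pairs) ran during the phase.
fn surplus_lua_ops(lua_ops: u64, dispatches: u64) -> u64 {
    lua_ops.saturating_sub(dispatches)
}

fn main() {
    // Four bookings spread over a 4-second window.
    let stamps = [0.0, 1.0, 2.0, 4.0];
    if let Some((fps, window)) = booking_throughput(&stamps) {
        println!("{:.1} frames/s over a {:.1}s booking window", fps, window);
    }
    // Figures from the sample drain report above.
    println!("surplus lua ops: {}", surplus_lua_ops(7452, 5988));
}
```

Measuring over the booking window rather than wall time keeps feed startup and shutdown out of the throughput number.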

Cleanup guarantees

All database rows the suite creates are prefixed stress_. The suite sweeps that prefix before seeding (so leftovers from a crashed earlier run never skew results) and after the run, then asserts zero residue. Redis state needs no cleanup — the container is destroyed with the test. If a run is killed hard (e.g. SIGKILL mid-phase), the next run’s pre-sweep removes the leftovers.

CI integration

The suite runs in the scheduler-stress-pipeline.yml workflow.

When it runs

| Trigger | Scale | Purpose |
|---|---|---|
| Pull request touching rust/crates/scheduler/**, rust/crates/opencue-proto/**, rust/Cargo.toml, the Postgres migrations, or the workflow itself | defaults | Gate scheduler changes on booking/accounting correctness |
| Nightly (cron, master) | defaults | Catch drift from changes outside the paths filter; daily throughput data point |
| Manual (workflow_dispatch) | custom via inputs | Benchmark a branch at chosen scale |

When it deliberately does not run

  • PRs that don’t touch the scheduler or schema (Python, CueGUI, CueWeb, docs, …). The suite needs a migrated Postgres, a Docker daemon, and several minutes of runner time; for those changes it produces zero signal.
  • As a performance gate. Shared CI runners have noisy CPU/IO, so the workflow never asserts on frames/s — throughput is published in the job’s step summary (and the full log as an artifact) for humans to eyeball trends. Benchmark conclusions should come from local release-mode runs on quiet hardware.

What fails the job

Only correctness regressions: accounting drift between Redis and Postgres, a cap breach, booking liveness failures (drain below target, or a saturated farm producing no Redis rejections), a phase that never converges (hard-timeout), or test data left behind after cleanup.

Launching a manual benchmark run

GitHub → Actions → OpenCue Scheduler Stress Pipeline → Run workflow, then optionally override the job/host/frame counts and timeout. Results appear in the run’s step summary; the complete log is attached as the scheduler-stress-output artifact (kept 30 days).

Scope and limitations

  • RQD is not exercised. The suite runs in dry_run_mode: the full booking path executes (Redis Lua, proc insert, host ledger, frame start) but no gRPC launch is sent. Frame completion and the Cuebot release path are out of scope — see the Redis-Backed Accounting Reference for how releases are reconciled.
  • Only scheduler-managed shows (show.b_scheduler_managed = true) are covered; Cuebot-managed accounting is Cuebot’s test territory.
  • The recompute / limit-reseed loops are intentionally dormant during the run (see invariant 1); their CAS semantics are covered separately by tests/redis_integration.rs (--features redis-tests).

Schema gotchas the suite encodes

These bit during development and are asserted/documented in the test code — keep them in mind when extending the seeding:

  • alloc.str_tag is VARCHAR(24) and host.str_name is VARCHAR(30): generated names must stay short.
  • The pending-job query INNER JOINs folder_resource: a folder without that row makes every job in it silently unbookable.
  • The vs_waiting view requires job_resource.int_max_cores - int_cores >= 100 (centicores): int_max_cores = 0 does not mean “unlimited” on the query path — use a large value instead.
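
When extending the seeding, guards like the following can catch the first and third gotchas before any row is inserted. This is a sketch under assumptions: the constants mirror the column widths listed above, while `fit`, `has_headroom`, and the example name are hypothetical and not taken from the test code.

```rust
const ALLOC_TAG_MAX: usize = 24; // alloc.str_tag is VARCHAR(24)
const HOST_NAME_MAX: usize = 30; // host.str_name is VARCHAR(30)

/// Accept a generated name only if it fits the target column.
fn fit(name: String, max_len: usize) -> Result<String, String> {
    let len = name.chars().count();
    if len <= max_len {
        Ok(name)
    } else {
        Err(format!("{:?} is {} chars, column allows {}", name, len, max_len))
    }
}

/// The headroom rule described for vs_waiting: at least one full core
/// (100 centicores) must remain under the job's cap.
fn has_headroom(int_max_cores: i64, int_cores: i64) -> bool {
    int_max_cores - int_cores >= 100
}

fn main() {
    // Illustrative names carrying the stress_ prefix the cleanup sweep relies on.
    println!("{:?}", fit(format!("stress_h{:06}", 1199), HOST_NAME_MAX));
    println!("{:?}", fit("stress_tag_with_a_far_too_long_suffix".to_string(), ALLOC_TAG_MAX));

    // A cap of 0 never passes the check, so it cannot stand in for "unlimited".
    println!("cap 0       -> bookable: {}", has_headroom(0, 0));
    println!("cap 1000000 -> bookable: {}", has_headroom(1_000_000, 0));
}
```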
