Deploying the Distributed Scheduler

Deploy the standalone Rust scheduler for improved scalability

This guide shows you how to deploy the Distributed Scheduler, a standalone Rust module that offloads the frame-dispatching workload from Cuebot so that OpenCue can scale to larger render farms.

What is the Distributed Scheduler?

The Distributed Scheduler is a Rust service that handles frame-to-host matching and dispatching. Cuebot’s traditional booking system is reactive: it runs a booking pass in response to each host report. The scheduler is proactive: an internal loop continuously searches for pending jobs and matches them against host availability cached in memory.

Key Benefits

Reduced Database Load: Host information is cached in memory, reducing the number of complex booking queries sent to the database
Improved Dispatch Latency: Proactive matching reduces time-to-first-frame for new jobs
Horizontal Scalability: Multiple scheduler instances can share the load by processing different clusters
Better Resource Utilization: In-memory matching algorithms optimize host selection

For more technical details, see the Scheduler Technical Reference.

System Requirements

To plan your installation of the Distributed Scheduler, consider the following:

Memory: Minimum 2GB RAM per scheduler instance (scales with number of hosts cached)
CPU: 2-4 cores recommended per instance
Network: Low-latency connection to the OpenCue database (same requirements as Cuebot)
Database: PostgreSQL with the same schema as Cuebot (no additional tables required)

Architecture Overview

The scheduler is organized around clusters. Each cluster is one of the following:

Allocation Clusters: One per facility + show + allocation tag
Manual Tag Clusters: Groups of manual tags (chunk size configurable)
Hostname Tag Clusters: Groups of hostname tags (chunk size configurable)

Each scheduler instance processes one or more clusters in a round-robin fashion. The cluster set is built automatically from every show with b_scheduler_managed = true, optionally scoped to one facility via --facility. In distributed deployments, different instances are scoped to different facilities to share the workload.

Before You Begin

Before deploying the scheduler, ensure you have:

Running OpenCue infrastructure:
- PostgreSQL database (same as used by Cuebot)
- Cuebot instance (version 0.23.0 or later for exclusion list support)
- RQD agents on render hosts
Network access:
- Scheduler needs database access (same credentials as Cuebot)
- Scheduler needs gRPC access to RQD hosts (default port 8444)
Installation method:
- Docker (recommended for production); requires a working Docker installation
- Pre-built binary (for testing/development)
- Build from source (for customization)

Installation Options

Option 1: Run with Docker (Recommended)

The easiest way to deploy the scheduler in production is using the pre-built Docker image.

1. Download the Docker Image

docker pull opencue/scheduler

2. Create a Configuration File

Create /etc/cue-scheduler/scheduler.yaml with your environment-specific settings:

logging:
  level: info,sqlx=warn

database:
  pool_size: 20
  db_host: your-postgres-host
  db_name: cuebot
  db_user: cuebot
  db_pass: your_password
  db_port: 5432

rqd:
  grpc_port: 8444
  dry_run_mode: false  # Set to true for testing without actual dispatch

queue:
  monitor_interval: 5s
  worker_threads: 4
  dispatch_frames_per_layer_limit: 20
  manual_tags_chunk_size: 100
  hostname_tags_chunk_size: 300
  cluster_reload_interval: 120s   # How often to reload the managed-show cluster set from the DB

scheduler:
  # Optional: Filter to a specific facility
  # facility: spi
  
  # Optional: Tags to exclude from all loaded clusters
  # ignore_tags:
  #   - deprecated_tag

The scheduler automatically loads every cluster (allocation, manual, hostname, and hardware host-tags) for all shows where b_scheduler_managed = true. There is no per-show or per-tag selection in the config; ownership is driven solely by the show.b_scheduler_managed DB column (see Configuring Cuebot Exclusion List).
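
Before starting the scheduler, it can help to confirm that the sample placeholder values were replaced. The following is a minimal sketch, not part of the scheduler: check_placeholders is a hypothetical helper name, and the two strings it searches for are simply the placeholders used in the sample configuration above.

```shell
#!/usr/bin/env bash
# check_placeholders CONFIG_FILE
# Hypothetical helper: reports sample placeholder values left in the file.
# Returns 0 if none are found, 1 if any remain, 2 if the file is unreadable.
check_placeholders() {
  local config="$1"
  local found=0
  local placeholder
  if [ ! -r "$config" ]; then
    echo "cannot read $config" >&2
    return 2
  fi
  # Placeholder strings taken from the sample configuration.
  for placeholder in your-postgres-host your_password; do
    # -F matches a fixed string, -q suppresses output.
    if grep -Fq -- "$placeholder" "$config"; then
      echo "placeholder still present: $placeholder"
      found=1
    fi
  done
  return "$found"
}

# Usage: check_placeholders /etc/cue-scheduler/scheduler.yaml
```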

3. Run the Scheduler Container

docker run -d \
  --name opencue-scheduler \
  --restart unless-stopped \
  -v /etc/cue-scheduler/scheduler.yaml:/etc/cue-scheduler/scheduler.yaml:ro \
  -p 9090:9090 \
  opencue/scheduler

The scheduler will:

Read configuration from the mounted YAML file
Expose Prometheus metrics on port 9090
Automatically restart if it crashes
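
After starting the container, a deployment script may want to block until the metrics endpoint responds. The sketch below is a generic retry loop; wait_until is a hypothetical helper, and the 60-second timeout in the usage comment is an arbitrary choice rather than a documented startup time.

```shell
#!/usr/bin/env bash
# wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND about once per second until it
# succeeds (returns 0) or TIMEOUT_SECONDS have elapsed (returns 1).
wait_until() {
  local timeout="$1"
  shift
  local deadline=$(( $(date +%s) + timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 1
  done
  return 0
}

# Usage: wait for the metrics port published by the docker run command above.
#   wait_until 60 curl -sf http://localhost:9090/metrics
```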

Option 2: Build and Run with Docker from Source

If you need to customize the scheduler or are developing locally:

1. Check Out the Source Code

Make sure you’ve checked out the source code and your current directory is the root of the checked out source.

2. Build the Docker Image

docker build -t opencue/scheduler -f rust/Dockerfile.scheduler .

This multi-stage build:

Compiles the Rust scheduler in release mode
Creates a minimal runtime image with just the binary
Includes necessary runtime dependencies

3. Run the Container

Create a configuration file and start the container as described in Option 1, steps 2 and 3 above.

Option 3: Run Pre-built Binary

For testing or development environments:

1. Download the Binary

Download the appropriate pre-built binary from the OpenCue releases page:

Linux (GNU): cue-scheduler-VERSION-x86_64-unknown-linux-gnu
Linux (MUSL): cue-scheduler-VERSION-x86_64-unknown-linux-musl (static, no dependencies)
macOS (Intel): cue-scheduler-VERSION-x86_64-apple-darwin
macOS (Apple Silicon): cue-scheduler-VERSION-aarch64-apple-darwin
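
To pick the matching file name in a script, the uname output can be mapped to the target triples listed above. This is a sketch under stated assumptions: scheduler_target is a hypothetical helper, it selects the GNU build on Linux (substitute the MUSL triple for the static build), and it only covers the four platforms in the list.

```shell
#!/usr/bin/env bash
# scheduler_target OS ARCH
# Hypothetical helper: maps `uname -s` / `uname -m` values to one of the
# target triples listed above. Returns 1 for platforms not in that list.
scheduler_target() {
  local os="$1"
  local arch="$2"
  case "$os/$arch" in
    Linux/x86_64)                echo "x86_64-unknown-linux-gnu" ;;
    Darwin/x86_64)               echo "x86_64-apple-darwin" ;;
    # macOS on Apple Silicon reports the architecture as arm64.
    Darwin/arm64|Darwin/aarch64) echo "aarch64-apple-darwin" ;;
    *)
      echo "no pre-built binary listed for $os/$arch" >&2
      return 1
      ;;
  esac
}

# Usage: scheduler_target "$(uname -s)" "$(uname -m)"
```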

2. Make Executable and Install

chmod +x cue-scheduler-VERSION-PLATFORM
sudo mv cue-scheduler-VERSION-PLATFORM /usr/local/bin/cue-scheduler

3. Create Configuration

Create ~/.config/cue-scheduler/scheduler.yaml (or any path you prefer).

See the sample configuration in Option 1, step 2 above.

4. Run the Scheduler

# With default config location
cue-scheduler

# Or specify config path
OPENCUE_SCHEDULER_CONFIG=/path/to/scheduler.yaml cue-scheduler

# Or use command-line overrides
cue-scheduler \
  --facility spi \
  --ignore_tags=deprecated,old

Option 4: Build from Source

For development or customization:

1. Install Prerequisites

Rust toolchain:

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Protobuf compiler:

# macOS
brew install protobuf

# Ubuntu/Debian
sudo apt-get install protobuf-compiler

# RHEL/CentOS/Rocky
sudo yum install protobuf-compiler

2. Build the Scheduler

cd OpenCue/rust
cargo build --release -p scheduler

The binary will be at target/release/cue-scheduler.

3. Run

target/release/cue-scheduler --facility spi

Configuring Cuebot Exclusion List

To prevent Cuebot and the Scheduler from competing for the same work, you must configure Cuebot to exclude the clusters handled by the scheduler.

Understanding Exclusion Configuration

Cuebot supports two exclusion mechanisms:

Global Booking Disable (in opencue.properties): Turn off all booking in Cuebot
```
dispatcher.turn_off_booking=true
```
Per-show Scheduler Ownership (show.b_scheduler_managed column, toggled by operators): Mark a show as managed by the standalone scheduler. Cuebot then skips that show in dispatch and the standalone scheduler dispatches its frames. Toggle with cueadmin:
```
cueadmin -scheduler-managed myshow on
cueadmin -scheduler-managed myshow off
```
This is the sole onboarding mechanism. Granularity is whole-show: when a show is scheduler-managed, the scheduler automatically loads all of its clusters (every allocation, manual, hostname, and hardware host-tag), so there is nothing else to configure per show or per allocation. The set of managed shows is reloaded from the DB periodically (queue.cluster_reload_interval, default 120s), so flipping the flag takes effect without restarting the scheduler.

Migration Strategy

We recommend a gradual migration approach:

Phase 1: Test with One Show

Deploy scheduler (optionally scoped to a facility):
```
cue-scheduler --facility spi
```
You do not list the show’s allocations or tags anywhere — the scheduler covers all of a managed show’s clusters automatically.
Hand the show to the scheduler:
```
cueadmin -scheduler-managed testshow on
```
The scheduler picks up all of testshow’s clusters (allocation, manual, hostname, and hardware host-tags) within queue.cluster_reload_interval (default 120s), with no restart required.
Monitor both systems:
- Watch scheduler metrics at http://scheduler-host:9090/metrics
- Verify Cuebot no longer dispatches new frames for testshow
- Confirm frames dispatch successfully from the scheduler

Phase 2: Expand Coverage

Mark each additional show as scheduler-managed:

cueadmin -scheduler-managed mainshow on

No scheduler-side configuration change is needed: the scheduler reloads the managed-show set from the DB every cluster_reload_interval and automatically picks up all of the newly managed show’s clusters.
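
When several shows are handed over at once, the cueadmin call can be wrapped in a loop. The following is a minimal sketch: set_scheduler_managed and the DRY_RUN variable are hypothetical, and shortfilm in the usage comment is a made-up show name. Only the cueadmin -scheduler-managed invocation itself comes from this guide.

```shell
#!/usr/bin/env bash
# set_scheduler_managed STATE SHOW [SHOW...]
# Hypothetical wrapper around `cueadmin -scheduler-managed <show> on|off`.
# With DRY_RUN=1 the commands are printed instead of executed.
set_scheduler_managed() {
  local state="$1"
  shift
  case "$state" in
    on|off) ;;
    *)
      echo "state must be 'on' or 'off'" >&2
      return 2
      ;;
  esac
  local show
  for show in "$@"; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "cueadmin -scheduler-managed $show $state"
    else
      # Stop at the first failure so a partial handover is noticed.
      cueadmin -scheduler-managed "$show" "$state" || return 1
    fi
  done
}

# Usage: preview the commands first, then run them for real.
#   DRY_RUN=1 set_scheduler_managed on mainshow shortfilm
#   set_scheduler_managed on mainshow shortfilm
```

Running with DRY_RUN=1 first gives a chance to review the list of shows before ownership changes.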

Phase 3: Full Migration (Optional)

Once confident, disable Cuebot booking entirely:

dispatcher.turn_off_booking=true

At this point, all dispatching is handled by the scheduler.

Notes on Granularity

Marking a show as scheduler-managed applies to the whole show across all of its clusters (allocation, manual, hostname, and hardware host-tags). There is no per-allocation or per-tag selection within a show: a show is wholly scheduler-managed or wholly Cuebot-managed.
The scheduler reloads the managed-show set from the DB every queue.cluster_reload_interval (default 120s), so toggling a show — or changing its host-tags / subscriptions — is picked up without a restart.
To return a show to Cuebot dispatch, run cueadmin -scheduler-managed myshow off.

Configuration Reference

Database Settings

database:
  pool_size: 20          # Connection pool size (default: 20)
  db_host: localhost     # PostgreSQL host
  db_name: cuebot        # Database name
  db_user: cuebot        # Database user
  db_pass: password      # Database password
  db_port: 5432          # PostgreSQL port

Scheduler Cluster Selection

The scheduler automatically loads all clusters for every show where b_scheduler_managed = true (toggled with cueadmin -scheduler-managed <show> on|off). The only scheduler-side knobs are an optional facility scope and a tag-exclusion list:

scheduler:
  # Optional: Filter clusters to a specific facility
  facility: spi

  # Optional: Tags to ignore (exclude from all loaded clusters)
  ignore_tags:
    - deprecated_tag
    - old_allocation

Queue and Performance Tuning

queue:
  monitor_interval: 5s                      # How often to check for work
  worker_threads: 4                         # Concurrent workers
  dispatch_frames_per_layer_limit: 20       # Max frames per layer per cycle
  manual_tags_chunk_size: 100               # Manual tags per cluster
  hostname_tags_chunk_size: 300             # Hostname tags per cluster
  cluster_reload_interval: 120s             # How often to reload the managed-show cluster set from the DB
  host_candidate_attemps_per_layer: 10      # Host matching retries
  job_back_off_duration: 300s               # Backoff after failures
  
  # Optional: Exit after N idle cycles (useful for testing)
  # empty_job_cycles_before_quiting: 10

Host Cache Configuration

host_cache:
  concurrent_groups: 3            # Parallel cache groups
  memory_key_divisor: 2GiB        # Memory bucketing granularity
  checkout_timeout: 12s           # Host checkout timeout
  monitoring_interval: 1s         # Cache monitoring frequency
  group_idle_timeout: 10800s      # Evict idle cache after 3 hours

Command-Line Overrides

CLI arguments override YAML configuration:

cue-scheduler \
  --facility spi \
  --ignore_tags=deprecated,old

Monitoring the Scheduler

Prometheus Metrics

The scheduler exposes metrics on port 9090 at /metrics:

curl http://localhost:9090/metrics

Key Metrics:

scheduler_jobs_queried_total - Total jobs fetched from database
scheduler_jobs_processed_total - Total jobs successfully processed
scheduler_frames_dispatched_total - Total frames dispatched to hosts
scheduler_candidates_per_layer - Distribution of hosts needed per layer
scheduler_time_to_book_seconds - Latency from frame creation to dispatch
scheduler_no_candidate_iterations_total - Failed matching attempts
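
For quick checks without a Prometheus server, a counter can be read directly from the endpoint output. The sketch below sums every sample of a metric from Prometheus text format on standard input. metric_total is a hypothetical helper; it assumes the sample value is the last field on the line (no trailing timestamps), and whether the scheduler attaches labels to these counters is not stated in this guide, so both labeled and unlabeled samples are handled.

```shell
#!/usr/bin/env bash
# metric_total NAME
# Hypothetical helper: reads Prometheus text format on stdin and prints the
# sum of all samples named NAME, with or without labels. Prints 0 if the
# metric is absent. Assumes the value is the last field on each line.
metric_total() {
  awk -v name="$1" '
    /^#/ { next }
    index($0, name " ") == 1 || index($0, name "{") == 1 { sum += $NF }
    END { printf "%.0f\n", sum }
  '
}

# Usage:
#   curl -s http://localhost:9090/metrics | metric_total scheduler_frames_dispatched_total
```

Sampling the same counter twice and subtracting gives the number of frames dispatched in between.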

Log Output

The scheduler uses structured logging. Configure verbosity:

logging:
  level: info              # Options: trace, debug, info, warn, error
  # Or filter specific modules (use instead of the line above):
  # level: info,sqlx=warn  # Reduce sqlx noise

Example output:

2025-12-12T10:00:00.123Z INFO scheduler: Starting scheduler feed
2025-12-12T10:00:01.456Z DEBUG scheduler: Found job: Job(id=abc123, name=render_shot_010)
2025-12-12T10:00:02.789Z DEBUG scheduler: Layer layer_id fully consumed.
2025-12-12T10:00:03.012Z INFO scheduler: Processed 5 layers, dispatched 120 frames

Verifying Your Installation

1. Check Scheduler Startup

If running in Docker:

docker logs opencue-scheduler

You should see:

Starting scheduler feed

2. Verify Database Connectivity

The scheduler will log errors if it can’t connect to PostgreSQL:

ERROR Failed to connect to database: connection refused

3. Monitor Frame Dispatch

Watch the metrics endpoint:

watch -n 1 'curl -s http://localhost:9090/metrics | grep scheduler_frames_dispatched_total'

4. Check Cuebot Exclusion

In Cuebot logs, verify exclusions are working:

docker logs cuebot | grep exclusion

Production Deployment Recommendations

Running as a System Service

For non-Docker deployments, create a systemd service:

/etc/systemd/system/opencue-scheduler.service:

[Unit]
Description=OpenCue Distributed Scheduler
After=network.target postgresql.service

[Service]
Type=simple
User=opencue
Group=opencue
Environment=OPENCUE_SCHEDULER_CONFIG=/etc/cue-scheduler/scheduler.yaml
ExecStart=/usr/local/bin/cue-scheduler
Restart=on-failure
RestartSec=5s

[Install]
WantedBy=multi-user.target

Enable and start:

sudo systemctl enable opencue-scheduler
sudo systemctl start opencue-scheduler
sudo systemctl status opencue-scheduler

Resource Allocation

For production deployments:

Single instance: 2-4 CPU cores, 4-8GB RAM (handles thousands of hosts)
Multi-instance: Divide clusters across instances based on workload
Database: Ensure connection pool size × instances < PostgreSQL max_connections
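
The connection budget above is simple arithmetic, and it is worth including connections held by other clients of the same database, such as Cuebot. The following sketch checks the inequality; pool_budget is a hypothetical helper, and the reserved figure in the usage comment is an illustrative number, not a measured Cuebot value.

```shell
#!/usr/bin/env bash
# pool_budget POOL_SIZE INSTANCES MAX_CONNECTIONS [RESERVED]
# Hypothetical helper: checks POOL_SIZE * INSTANCES + RESERVED < MAX_CONNECTIONS.
# RESERVED is the number of connections kept for other clients (default 0).
# Prints the computed totals and returns 0 if the budget fits, 1 otherwise.
pool_budget() {
  local pool="$1"
  local instances="$2"
  local max="$3"
  local reserved="${4:-0}"
  local needed=$(( pool * instances + reserved ))
  echo "needed=$needed max_connections=$max"
  [ "$needed" -lt "$max" ]
}

# Usage: three instances with pool_size 20, 30 connections reserved for
# other clients, against a max_connections of 100.
#   pool_budget 20 3 100 30
```

With these numbers the scheduler instances need 60 connections, 90 including the reserve, which fits under 100. A fourth instance would need 110 and would not fit.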

High Availability

Currently (v1.0), the scheduler doesn’t have built-in HA. For resilience:

Run with systemd/Docker restart policies
Monitor with health checks on the metrics endpoint
Use process supervisors (systemd, supervisord, Kubernetes)

Future releases will include automatic cluster distribution for true multi-instance HA.

Troubleshooting

Scheduler Not Dispatching Frames

Check:

Database connectivity: psql -h dbhost -U cuebot -d cuebot -c "SELECT 1"
Show ownership: Verify the show has b_scheduler_managed = true (cueadmin -scheduler-managed <show> on) and that its hosts/allocations carry the expected tags in the database
Cuebot exclusion: Verify Cuebot isn’t still booking the same clusters
RQD connectivity: Ensure scheduler can reach RQD gRPC port (8444)
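
The show ownership check can also be done directly against the database. The sketch below only builds the query text; show_ownership_sql is a hypothetical helper, and the str_name column used to look up the show by name is an assumption about the Cuebot schema (only show.b_scheduler_managed is named in this guide).

```shell
#!/usr/bin/env bash
# show_ownership_sql SHOW_NAME
# Hypothetical helper: prints a query that reports whether a show is
# scheduler-managed. Assumes the show table stores its name in str_name.
# Rejects names with characters outside A-Z, a-z, 0-9, '_', '.', '-'
# so the name can be embedded in the SQL text safely.
show_ownership_sql() {
  local show="$1"
  case "$show" in
    ''|*[!A-Za-z0-9_.-]*)
      echo "invalid show name: $show" >&2
      return 2
      ;;
  esac
  echo "SELECT str_name, b_scheduler_managed FROM show WHERE str_name = '$show';"
}

# Usage, reusing the connection settings from the connectivity check above:
#   psql -h dbhost -U cuebot -d cuebot -c "$(show_ownership_sql testshow)"
```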

High Database Load

Solutions:

Reduce database.pool_size
Increase queue.monitor_interval (check less frequently)
Reduce queue.worker_threads

Memory Growth

Solutions:

Lower host_cache.group_idle_timeout to evict cache sooner
Reduce host_cache.concurrent_groups
Monitor with docker stats or system tools

Frames Failing to Dispatch

Check logs for:

AllocationOverBurst: Allocation has exceeded its burst limit
HostLock: Failed to acquire lock (another scheduler instance has a lock on the host)
GrpcFailure: RQD communication failure

Current Limitations

Version 1.0 Constraints

Facility-scoped distribution: Multi-instance deployments are split by facility (--facility); there is no finer-grained per-cluster assignment
No automatic distribution: Cluster workload isn’t automatically balanced across instances within a facility
Single instance recommended: While multi-instance is supported, it requires careful facility scoping to avoid overlap

Future Enhancements

Planned for future releases:

Automatic cluster distribution: Central control module to coordinate multiple schedulers
Dynamic scaling: Automatically add/remove instances based on workload
Self-healing: Redistribute clusters when instances fail
Load balancing: Evenly distribute work across available schedulers

What’s Next?

Scheduler Technical Reference - Deep dive into architecture
Deploying RQD - Set up render hosts
Monitoring Jobs - Track job progress

Getting Help

Slack: Join #opencue on ASWF Slack
GitHub Issues: Report bugs or request features