# Orchestrator
The OrchestratorService manages vLLM endpoint lifecycle and GPU allocation. It decides which models are loaded, on which GPUs, and handles startup, shutdown, and recovery.
## Endpoint Lifecycle

Endpoints transition through these states:

```
PENDING → STARTING → HEALTHY
              ↘ UNHEALTHY → FAILED
HEALTHY → STOPPING → STOPPED
```
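A minimal sketch of the transition table implied by the diagram; `EndpointState`, `ALLOWED`, and `can_transition` are illustrative names, not the orchestrator's actual API.

```python
from enum import Enum


class EndpointState(Enum):
    PENDING = "pending"
    STARTING = "starting"
    HEALTHY = "healthy"
    UNHEALTHY = "unhealthy"
    FAILED = "failed"
    STOPPING = "stopping"
    STOPPED = "stopped"


# Legal moves, read directly off the diagram above.
ALLOWED: dict[EndpointState, set[EndpointState]] = {
    EndpointState.PENDING: {EndpointState.STARTING},
    EndpointState.STARTING: {EndpointState.HEALTHY, EndpointState.UNHEALTHY},
    EndpointState.UNHEALTHY: {EndpointState.FAILED},
    EndpointState.HEALTHY: {EndpointState.STOPPING},
    EndpointState.STOPPING: {EndpointState.STOPPED},
}


def can_transition(src: EndpointState, dst: EndpointState) -> bool:
    """Return True if the state machine permits src → dst."""
    return dst in ALLOWED.get(src, set())
```

Encoding the diagram as data makes invalid transitions (e.g. resurrecting a `STOPPED` endpoint without going through `PENDING`) easy to reject in one place.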
### EndpointStatus

```python
from dataclasses import dataclass


@dataclass
class EndpointStatus:
    name: str            # "reasoning", "coding", etc.
    state: str           # "healthy", "starting", "unhealthy", "stopped"
    gpus: list[int]      # Allocated GPU indices
    pid: int | None      # vLLM process ID
    port: int            # Serving port
    model: str           # HuggingFace model ID
    uptime_seconds: int
```
## Workload Management

The orchestrator follows YuniKorn-style capability-based scheduling:

- **Requests declare capabilities, not endpoints**: a workload asks for the "reasoning" capability, not a specific model
- **Priority-based preemption**: idle endpoints can be evicted for higher-priority work
- **Makespan fulfillment**: the orchestrator ensures the work completes, then restores the baseline endpoint set
### Example: Render Pipeline

When the viz pipeline needs a GPU for LuxCore rendering:

1. The workload requests a GPU with `allow_baseline_eviction=True`
2. The orchestrator evicts the lowest-priority endpoint from the target GPU
3. Rendering completes
4. The orchestrator restores the evicted endpoint
## Clean Start

The `clean_start()` operation handles recovery from corrupted state:
```python
result = await orch.clean_start(endpoints=["reasoning"])
# Kills stale vLLM processes
# Cleans up CUDA memory
# Restarts endpoints fresh
```
## Health Integration

The orchestrator works with the `AgendaTracker` to distinguish intentional state changes from failures. When an endpoint is part of a scheduled makespan operation, the Health Observer skips incident creation:
```python
if tracker.is_endpoint_in_scheduled_transition("reasoning"):
    # Don't create an incident; this is planned
    expected = tracker.get_scheduled_endpoint_state("reasoning")
```
## Checking Status

```bash
uv run gaius-cli --cmd "/gpu status" --format json | jq '.data.endpoints[]'
```