
# GPU Management

Gaius manages six NVIDIA GPUs across vLLM inference, LuxCore rendering, and embedding workloads.

## GPU Allocation

| GPU | Typical Use |
|-----|-------------|
| 0-1 | Reasoning endpoint (`tensor_parallel=2`) |
| 2-3 | Coding endpoint (`tensor_parallel=2`) |
| 4   | Embedding endpoint |
| 5   | Available for rendering/evolution |

Allocation is managed by the Orchestrator. GPUs can be temporarily reassigned for rendering or evolution workloads via makespan scheduling.
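The source does not show the Orchestrator's scheduling code, but a classic makespan heuristic for this kind of temporary reassignment is greedy longest-processing-time-first (LPT): place the longest job on the currently least-loaded GPU. A minimal sketch, with illustrative job names and durations (not the actual Orchestrator API):

```python
import heapq


def assign_jobs(jobs: dict[str, float], num_gpus: int) -> dict[int, list[str]]:
    """Greedy LPT assignment: longest jobs first, each onto the least-loaded GPU."""
    # Min-heap of (current load in seconds, gpu_id).
    heap = [(0.0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    plan: dict[int, list[str]] = {gpu: [] for gpu in range(num_gpus)}
    # Sort jobs by descending duration, then greedily balance load.
    for name, duration in sorted(jobs.items(), key=lambda kv: -kv[1]):
        load, gpu = heapq.heappop(heap)
        plan[gpu].append(name)
        heapq.heappush(heap, (load + duration, gpu))
    return plan


# Example: two free GPUs, four pending rendering/evolution jobs.
plan = assign_jobs(
    {"render_a": 40.0, "render_b": 25.0, "evolve": 30.0, "embed_batch": 10.0},
    num_gpus=2,
)
```

LPT is a well-known approximation for makespan minimization; it keeps the slowest GPU from finishing long after the others.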

## Status Monitoring

```sh
# Endpoint status
uv run gaius-cli --cmd "/gpu status" --format json

# GPU health (memory, temperature, utilization)
uv run gaius-cli --cmd "/gpu health" --format json
```
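Under the hood, a health probe like `/gpu health` typically shells out to `nvidia-smi`. A rough sketch of that pattern, assuming the standard `--query-gpu` CSV interface (`parse_smi_csv` and `gpu_health` are hypothetical helpers, not the actual Gaius code):

```python
import subprocess

# Real nvidia-smi query fields for memory, temperature, and utilization.
QUERY = "index,memory.used,memory.total,temperature.gpu,utilization.gpu"


def parse_smi_csv(text: str) -> list[dict]:
    """Parse `nvidia-smi --query-gpu=... --format=csv,noheader,nounits` output."""
    rows = []
    for line in text.strip().splitlines():
        idx, used, total, temp, util = [field.strip() for field in line.split(",")]
        rows.append({
            "gpu": int(idx),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
            "temp_c": int(temp),
            "util_pct": int(util),
        })
    return rows


def gpu_health() -> list[dict]:
    """Query all GPUs and return one dict per device."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_smi_csv(out)
```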

## Cleanup

When GPU processes hang or memory leaks accumulate:

```sh
# Standard cleanup (kill orphan vLLM processes)
just gpu-cleanup

# Deep cleanup (aggressive memory recovery)
just gpu-deep-cleanup
```

The `gpu-helpers.sh` shared library provides the `gpu_cleanup` function used by both the engine startup script and the justfile recipes.
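The actual `gpu_cleanup` logic lives in shell; the orphan-hunt idea it implements can be sketched as follows. On Linux, a worker whose parent has died is reparented to PID 1, so a vLLM process with `ppid == 1` is a likely orphan holding GPU memory. `find_orphans` is an illustrative helper, not the contents of `gpu-helpers.sh`:

```python
def find_orphans(procs: list[tuple[int, int, str]]) -> list[int]:
    """Return PIDs of likely-orphaned vLLM workers.

    procs: (pid, ppid, cmdline) tuples, e.g. scraped from /proc or `ps`.
    A vLLM process reparented to init (ppid == 1) no longer has a live
    supervisor and is a candidate for termination.
    """
    return [
        pid
        for pid, ppid, cmdline in procs
        if "vllm" in cmdline.lower() and ppid == 1
    ]
```

A cleanup script would pass the matching PIDs to `SIGTERM` first, then `SIGKILL` for stragglers, before the engine restarts.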

## Common Issues

| Issue | Symptom | Fix |
|-------|---------|-----|
| Orphan vLLM process | GPU memory used but no endpoint | `just gpu-cleanup` |
| OOM during model load | Endpoint stuck in STARTING | Free GPU, then `/health fix endpoints` |
| CUDA memory fragmentation | Degraded inference speed | `just gpu-deep-cleanup`, then restart |
| OpenCV conflict | vLLM WorkerProc fails (cv2 error) | Already fixed via `pyproject.toml` override |

## Rendering GPU Eviction

The viz pipeline temporarily evicts a low-priority endpoint to use a GPU for LuxCore rendering:

  1. Orchestrator evicts endpoint from target GPU
  2. LuxCore renders using PATHOCL engine with CUDA
  3. `clear_embeddings()` releases the Nomic model (~3GB)
  4. Orchestrator restores evicted endpoint
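The evict/render/restore sequence above is naturally expressed as a context manager, so the endpoint is restored even if rendering fails partway through. `Orchestrator` here is a stand-in with assumed `evict`/`restore` methods, not the real component's API:

```python
from contextlib import contextmanager


@contextmanager
def borrowed_gpu(orchestrator, gpu: int):
    """Temporarily free a GPU for rendering, restoring its endpoint afterwards."""
    endpoint = orchestrator.evict(gpu)   # step 1: evict endpoint from target GPU
    try:
        yield gpu                        # steps 2-3: caller renders, then frees
                                         # embeddings (clear_embeddings())
    finally:
        orchestrator.restore(endpoint)   # step 4: restore endpoint, even on error
```

A caller would run LuxCore (and `clear_embeddings()`) inside the `with` block; the `finally` guarantees the evicted endpoint comes back regardless of rendering outcome.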