GPU Acceleration
Atelier uses GPU acceleration for sentence-transformer embedding computation and CatBoost training/inference. GPU support is auto-detected at startup with graceful fallback to CPU.
Detection
gpu.preflight_gpu() runs once at config load time and caches the result
for the process lifetime. Three-step detection:
- nvidia-smi probe: subprocess call to detect device count, names, VRAM, and driver CUDA version
- CUDA version extraction: parse nvidia-smi header for driver compatibility
- PyTorch check: torch.cuda.is_available() confirms runtime support
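The three steps above can be sketched as follows. This is a minimal illustration using stdlib subprocess probing; the function names (probe_nvidia_smi, resolve_device) are hypothetical, not Atelier's actual API:

```python
# Illustrative sketch of the three-step detection, not Atelier's source.
import shutil
import subprocess

def probe_nvidia_smi() -> list[str]:
    """Step 1: list GPU name + VRAM via nvidia-smi; empty list if no driver."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True,
        )
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]
    except (subprocess.SubprocessError, OSError):
        return []

def torch_cuda_available() -> bool:
    """Step 3: confirm runtime support; False when torch is not installed."""
    try:
        import torch
        return torch.cuda.is_available()
    except ImportError:
        return False

def resolve_device() -> str:
    """Combine the probes into the device string handed to models."""
    return "cuda" if probe_nvidia_smi() and torch_cuda_available() else "cpu"
```

Both probes fail closed: any subprocess error or missing library resolves to "cpu" rather than raising, which matches the graceful-fallback behavior described above.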
The result is a GpuInfo dataclass with:
- available — whether CUDA is usable
- device_count — number of GPUs
- devices — device names with VRAM (e.g., “NVIDIA RTX 4090 24GB”)
- resolved_device — "cuda" or "cpu" for model initialization
- warnings — non-blocking issues (version mismatches, library path hints)
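The fields above suggest a shape like the following; the field names come from the docs, but the dataclass definition itself is a sketch, not the actual source:

```python
# Illustrative GpuInfo shape (field names from the docs; defaults assumed).
from dataclasses import dataclass, field

@dataclass(frozen=True)
class GpuInfo:
    available: bool                  # whether CUDA is usable
    device_count: int                # number of GPUs
    devices: list[str] = field(default_factory=list)   # e.g. ["NVIDIA RTX 4090 24GB"]
    resolved_device: str = "cpu"     # "cuda" or "cpu" for model init
    warnings: list[str] = field(default_factory=list)  # non-blocking issues

# The CPU-fallback result looks like this:
cpu_fallback = GpuInfo(available=False, device_count=0)
```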
NVIDIA Driver Symlink (nix + CUDA)
In devenv (nix-managed), CUDA libraries are isolated from the host system.
The GPU module handles the nix+CUDA compatibility pattern by detecting
the driver library path and ensuring PyTorch can find it. This avoids
the common nix pitfall where torch.cuda.is_available() returns False
despite GPUs being present.
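A minimal sketch of the workaround, assuming the common NixOS convention of exposing host driver libraries under /run/opengl-driver/lib; the path and function name are illustrative, not Atelier's actual implementation:

```python
# Sketch: make the host's libcuda visible to PyTorch under nix.
# Assumption: drivers live at /run/opengl-driver/lib (NixOS convention).
import os
from pathlib import Path

DRIVER_DIRS = [Path("/run/opengl-driver/lib")]  # assumed host location

def ensure_driver_on_library_path() -> list[str]:
    """Prepend dirs containing libcuda to LD_LIBRARY_PATH; return what was found.

    Note: on glibc, LD_LIBRARY_PATH is read at process start, so this must
    run before torch is imported (or the process must re-exec) to take effect.
    """
    found = [
        str(d) for d in DRIVER_DIRS
        if (d / "libcuda.so.1").exists() or (d / "libcuda.so").exists()
    ]
    if found:
        current = os.environ.get("LD_LIBRARY_PATH", "")
        parts = found + ([current] if current else [])
        os.environ["LD_LIBRARY_PATH"] = ":".join(parts)
    return found
```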
Integration Points
Sentence-Transformer Embedding
embedding.py calls preflight_gpu() before initializing the
SentenceTransformer model, passing device=gpu_info.resolved_device:
```python
gpu_info = preflight_gpu()
model = SentenceTransformer("all-MiniLM-L6-v2", device=gpu_info.resolved_device)
```
GPU batch encoding achieves ~2,768 texts/second on RTX 4090 (vs ~400/s on CPU). This matters at scale: 15M columns takes ~90 minutes on GPU vs ~10 hours on CPU.
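As a sanity check on those figures (pure arithmetic, using the rates quoted above):

```python
# Back-of-the-envelope check of the embedding-time claim.
N_COLUMNS = 15_000_000
GPU_RATE = 2_768   # texts/second on RTX 4090 (figure quoted above)
CPU_RATE = 400     # texts/second on CPU (figure quoted above)

gpu_minutes = N_COLUMNS / GPU_RATE / 60
cpu_hours = N_COLUMNS / CPU_RATE / 3600
print(f"GPU: ~{gpu_minutes:.0f} min, CPU: ~{cpu_hours:.1f} h")
# → GPU: ~90 min, CPU: ~10.4 h
```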
CatBoost Training
CatBoost automatically uses GPU when available via its task_type
parameter. The virtual ensemble posterior sampling that drives uncertainty
quantification benefits from GPU parallelism.
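Wiring the preflight result into CatBoost might look like the sketch below. task_type, devices, and posterior_sampling are real CatBoost parameters (posterior_sampling underlies virtual-ensemble uncertainty), but the helper function and its defaults are illustrative assumptions, not Atelier's code:

```python
# Hypothetical helper mapping the preflight device to CatBoost parameters.
def catboost_params(resolved_device: str) -> dict:
    """Return fit-time parameters for the detected device."""
    params = {
        "posterior_sampling": True,  # sampled-Bayes mode behind virtual ensembles
    }
    if resolved_device == "cuda":
        params.update({"task_type": "GPU", "devices": "0"})  # first GPU
    else:
        params["task_type"] = "CPU"
    return params
```

The dict would then be splatted into the model constructor, e.g. CatBoostRegressor(**catboost_params(gpu_info.resolved_device)).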
Preflight Reporting
GPU status appears in just preflight output and in the /api/status
gateway endpoint, giving operators immediate visibility into whether
GPU acceleration is active.
Configuration
GPU detection is automatic — no configuration needed. The system probes hardware and falls back gracefully:
- GPU available: uses CUDA for all embedding and training operations
- GPU detected but CUDA unavailable: warns about library path issues, falls back to CPU
- No GPU: runs entirely on CPU with no warnings
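The three fallback cases above reduce to a small decision, sketched here with hypothetical names:

```python
# Illustrative mapping of probe outcomes to (device, warnings).
def resolve(nvidia_seen: bool, torch_cuda_ok: bool) -> tuple[str, list[str]]:
    if nvidia_seen and torch_cuda_ok:
        return "cuda", []           # GPU available: use CUDA everywhere
    if nvidia_seen:
        # hardware present but runtime broken: warn, fall back
        return "cpu", ["GPU detected but CUDA unavailable; check library paths"]
    return "cpu", []                # no GPU: silent CPU operation
```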
CAI Considerations
CAI ML workloads can request GPU runtimes. When running on a GPU-enabled CAI session:
- The NVIDIA drivers are provided by the container runtime
- PyTorch CUDA support depends on the Python runtime image
- GPU memory is shared with other processes in the session
- Background SHAP computation can be memory-intensive; monitor with nvidia-smi if running alongside large models
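For that monitoring, nvidia-smi's CSV query mode is convenient. The query flags below are standard nvidia-smi options; the parser function is an illustrative sketch:

```python
# Parse output of:
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
def parse_memory_csv(csv_text: str) -> list[tuple[int, int]]:
    """Return (used_mib, total_mib) per GPU, one tuple per output line."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(x.strip()) for x in line.split(","))
        rows.append((used, total))
    return rows

sample = "1024, 24564\n20480, 24564"   # sample two-GPU output, values in MiB
print(parse_memory_csv(sample))
# → [(1024, 24564), (20480, 24564)]
```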