gRPC & Gateway

Atelier follows the Fine Tuning Studio proto-first pattern: the gRPC service contract defines the API, and a FastAPI gateway bridges REST to gRPC while serving the React frontend.

Proto Definition

The service contract lives in src/atelier/proto/atelier.proto.

RPCs

| RPC | Request → Response | Purpose |
|---|---|---|
| HealthCheck | HealthCheckRequest → HealthCheckResponse | Prove gRPC is alive (status + version) |
| ListAgents | ListAgentsRequest → ListAgentsResponse | List agent metadata (id, name, role, tools) |
| GetAgent | GetAgentRequest → GetAgentResponse | Single agent by ID |
| ListDataSources | ListDataSourcesRequest → ListDataSourcesResponse | List OOTB + Hive sources |
| ListDatasets | ListDatasetsRequest → ListDatasetsResponse | Classification datasets (filterable by source_id) |
| GetFSMStatus | FSMStatusRequest → FSMStatusResponse | Pipeline state + progress JSON |
| StartClassification | StartClassificationRequest → StartClassificationResponse | Trigger a classification run |

Key Messages

  • DataSource — id, source_type (sample/hive), source_uri, display_name, vocabulary_mode
  • ClassificationDataset — id, name, parquet_path, source_id, version_number, is_active, summary
  • FSMStatusResponse — run_id, state, started_at, progress_json, error
  • AgentMetadata — id, name, description, role, tool_ids

Generating Stubs

just proto    # runs bin/generate-proto.sh

This invokes grpc_tools.protoc to produce _pb2.py, _pb2_grpc.py, and .pyi type stubs.
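As a rough illustration of what such a script might pass to the compiler, the sketch below assembles an argument list for grpc_tools.protoc. The output directory and the helper name build_protoc_args are assumptions made for this example; the actual contents of bin/generate-proto.sh are not shown here.

```python
from pathlib import Path


def build_protoc_args(proto_dir: Path, out_dir: Path, proto_file: str) -> list[str]:
    """Assemble the argv-style list that grpc_tools.protoc.main() accepts."""
    return [
        "grpc_tools.protoc",             # argv[0] placeholder
        f"-I{proto_dir}",                # import root for the .proto file
        f"--python_out={out_dir}",       # message classes   -> *_pb2.py
        f"--grpc_python_out={out_dir}",  # stub and servicer -> *_pb2_grpc.py
        f"--pyi_out={out_dir}",          # type stubs        -> *_pb2.pyi
        str(proto_dir / proto_file),
    ]


# Assumption: stubs are written next to the .proto file.
proto_dir = Path("src/atelier/proto")
args = build_protoc_args(proto_dir, proto_dir, "atelier.proto")
print(" ".join(args))

# With grpcio-tools installed, the same list can be handed to the compiler:
#   from grpc_tools import protoc
#   exit_code = protoc.main(args)
```

Keeping the argument list in one function makes it easy to check which of the three outputs (messages, gRPC stubs, type stubs) a given invocation requests.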

Architecture Layers

Proto (atelier.proto)     ← Service contract and message definitions
    ↓
Servicer (service.py)     ← Thin router dispatching to business logic
    ↓
Client (client.py)        ← Wrapper around generated stub with error handling
    ↓
Gateway (gateway.py)      ← FastAPI bridge from REST to gRPC + React SPA
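The layering above can be sketched with plain Python stand-ins. None of this is the actual Atelier code: the class names, the AtelierError type, and the gateway_get_agent handler are hypothetical, and a dict lookup takes the place of the generated stub and of grpc.RpcError handling. The point is only to show where each responsibility sits.

```python
class AtelierError(Exception):
    """Hypothetical error type raised by the client wrapper."""


class Servicer:
    """Stand-in for service.py: routes each RPC to business logic."""

    def __init__(self, agents: dict[str, dict]):
        self._agents = agents

    def GetAgent(self, agent_id: str) -> dict:
        # A real servicer would take GetAgentRequest and return GetAgentResponse.
        return self._agents[agent_id]


class Client:
    """Stand-in for client.py: wraps the stub and normalizes failures."""

    def __init__(self, stub: Servicer):
        self._stub = stub

    def get_agent(self, agent_id: str) -> dict:
        try:
            return self._stub.GetAgent(agent_id)
        except KeyError as exc:  # a real client would catch grpc.RpcError here
            raise AtelierError(f"agent not found: {agent_id}") from exc


def gateway_get_agent(client: Client, agent_id: str) -> tuple[int, dict]:
    """Stand-in for a gateway handler: maps client results to HTTP status + body."""
    try:
        return 200, client.get_agent(agent_id)
    except AtelierError as exc:
        return 404, {"error": str(exc)}


client = Client(Servicer({"planner": {"id": "planner", "role": "planning"}}))
print(gateway_get_agent(client, "planner"))
print(gateway_get_agent(client, "missing"))
```

Because the gateway only sees the client's error type, transport details stay out of the REST layer.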

Gateway REST Endpoints

Infrastructure

| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | gRPC health check |
| /api/status | GET | Aggregated health: gRPC + PostgreSQL + Qdrant + config state |
| /api/agents/validate-credentials | POST | Test all configured LLM providers |
| /api/agents/model-discovery | GET | Check for model upgrades via Anthropic Models API |

Data Sources & Datasets

| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List registered data sources |
| /api/datasets | GET | List datasets (optional source_id filter) |
| /api/datasets/{id}/activate | POST | Set dataset version as active |
| /api/datasets/{id}/data | GET | Serve parquet file |
| /api/data-connections | GET | List CAI data connections |
| /api/data-connections/{name}/test | POST | Test a CAI connection |
| /api/vocabulary/stats | GET | Term count (source-aware routing) |
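For the optional source_id filter on /api/datasets, a caller might build the request URL as in the sketch below. It assumes the filter is passed as a query parameter named source_id, and the base URL is a placeholder. Neither is confirmed by the endpoint list itself.

```python
from typing import Optional
from urllib.parse import urlencode


def datasets_url(base_url: str, source_id: Optional[str] = None) -> str:
    """Build the URL for GET /api/datasets, adding the filter only when given."""
    url = f"{base_url.rstrip('/')}/api/datasets"
    if source_id is not None:
        # Assumption: the filter travels as a query parameter.
        url += "?" + urlencode({"source_id": source_id})
    return url


# Placeholder gateway address, for illustration only.
print(datasets_url("http://localhost:8000"))
print(datasets_url("http://localhost:8000", source_id="ootb-sample"))
```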

Classification Pipeline

| Endpoint | Method | Description |
|---|---|---|
| /api/fsm/status | GET | Current pipeline state + progress |
| /api/fsm/start | POST | Start classification (optional source_id) |
| /api/fsm/runs | GET | List past classification runs |

Agents & Skills

| Endpoint | Method | Description |
|---|---|---|
| /api/agents | GET | List agent metadata |
| /api/skills | GET | Skill definitions from .claude/commands/ |
| /api/skills/{skill_id} | GET | Single skill markdown content |
| /api/agents/smoke-test | POST | Minimal Claude Agent SDK verification |

WebSocket

| Endpoint | Purpose |
|---|---|
| /ws/terminal/{session_id} | Persistent terminal backed by Claude Agent SDK |
| /ws/orchestration | Live agent events (spawned, reasoning, tool_call, completed) |

Persistent Terminal Sessions

Terminal sessions survive page navigation and browser reload. The WebSocket endpoint accepts a client-provided session_id (persisted in localStorage). On disconnect, the session stays alive server-side — SDK queries continue running and output accumulates in a ring buffer (64KB collections.deque). On reconnect, the buffer is replayed so the user sees everything that happened while they were away.

  • Session registry: Module-level _sessions dict in terminal.py
  • Idle cleanup: A background asyncio task removes sessions that have had no client attached for 30 minutes. /api/terminal/sessions lists the active sessions.
  • Dedicated page: /terminal route renders a full-screen Ghostty WASM terminal; the Landing page embeds the same component at preview size

SPA Fallback

/{path} serves ui/dist/index.html for client-side routing.

Aggregated Status Endpoint

GET /api/status returns a comprehensive health report:

{
  "grpc": {"status": "ok", "latency_ms": 12},
  "postgres": {"status": "ok"},
  "qdrant": {"status": "ok"},
  "config": {
    "has_anthropic": true,
    "has_bedrock": false,
    "agent_model": "claude-sonnet-4-5-20250929",
    "db_url": "postgresql://...(masked)"
  },
  "overall_status": "connected"
}

PostgreSQL probes retry up to three times with a 1-second backoff, because PGlite can stall transiently. Overall status is connected when gRPC responds and the other probes succeed, and degraded when gRPC is up but another service is failing.
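A minimal sketch of the retry and aggregation logic follows. The function names are hypothetical, and the "disconnected" label for the case where gRPC itself is down is an assumption, since the text above only names connected and degraded.

```python
import time
from typing import Callable


def probe_with_retry(probe: Callable[[], bool], attempts: int = 3,
                     backoff_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run probe up to `attempts` times, sleeping `backoff_s` between failures."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # an exception counts as a failed attempt
        if attempt < attempts - 1:
            sleep(backoff_s)
    return False


def overall_status(grpc_ok: bool, postgres_ok: bool, qdrant_ok: bool) -> str:
    if not grpc_ok:
        return "disconnected"  # assumed label; not named in the text
    if postgres_ok and qdrant_ok:
        return "connected"
    return "degraded"
```

Passing sleep as a parameter keeps the backoff testable without waiting for real time to elapse.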

Gateway Lifespan

The FastAPI lifespan hook runs three startup tasks:

  1. OOTB seed: Check if ootb-sample source has any dataset versions; if none, create version 1 with metadata.
  2. Hive auto-discovery: discover_hive_sources() probes all configured data connections (ATELIER_DATA_CONNECTIONS), iterates databases, finds annotations tables matching the known schema (legacy or universal format), and auto-registers them via get_or_create_data_source().
  3. Terminal cleanup: Background asyncio task sweeps idle terminal sessions every 60 seconds.

Each of the three tasks is wrapped in try/except: failures are logged as warnings and do not prevent gateway startup.
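The pattern of guarded startup tasks can be sketched with an async context manager, which is the shape FastAPI expects for its lifespan argument. The three task functions here are placeholders (one fails on purpose), and the startup_failures list exists only to make the sketch observable. None of it is the gateway's actual code.

```python
import asyncio
import logging
from contextlib import asynccontextmanager, suppress

log = logging.getLogger("atelier.gateway")
startup_failures: list[str] = []  # recorded only so the sketch is observable


async def seed_ootb() -> None:
    """Placeholder for the OOTB seed step."""


async def discover_hive() -> None:
    """Placeholder that simulates a failing discovery step."""
    raise RuntimeError("no data connections configured")


async def sweep_terminals_forever() -> None:
    """Placeholder for the idle-session sweeper loop."""
    while True:
        await asyncio.sleep(60)


@asynccontextmanager
async def lifespan(app):
    startup_failures.clear()
    for name, step in (("ootb_seed", seed_ootb), ("hive_discovery", discover_hive)):
        try:
            await step()
        except Exception as exc:
            # Log and continue: a failed step must not block startup.
            log.warning("startup task %s failed: %s", name, exc)
            startup_failures.append(name)

    sweeper = None
    try:
        sweeper = asyncio.create_task(sweep_terminals_forever())
    except Exception as exc:
        log.warning("startup task terminal_cleanup failed: %s", exc)
        startup_failures.append("terminal_cleanup")

    try:
        yield  # the application serves requests here
    finally:
        if sweeper is not None:
            sweeper.cancel()
            with suppress(asyncio.CancelledError):
                await sweeper
```

With FastAPI installed, a context manager like this is passed as FastAPI(lifespan=lifespan). Cancelling the sweeper in the finally block keeps shutdown from leaving a dangling task.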

Config Lifecycle

HOCON (config/base.conf) is the single source of truth. No module reads os.environ directly for configuration values.

.env → devenv shell → HOCON ${?VAR} substitution → AtelierConfig dataclass

load_config() reads the HOCON file with live environment variable substitution. External tools that need a flat key=value file use just resolve-config to materialize build/config/atelier.env.
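HOCON's optional substitution `${?VAR}` leaves a previously set value in place when the variable is undefined. The sketch below imitates that behavior in plain Python rather than calling a HOCON parser. The field names are borrowed from the status payload, but the default values, the environment variable names, and both function names are hypothetical.

```python
from dataclasses import asdict, dataclass, replace
from typing import Mapping


@dataclass(frozen=True)
class AtelierConfig:
    # Placeholder defaults standing in for values set in config/base.conf.
    agent_model: str = "claude-sonnet-4-5-20250929"
    db_url: str = "postgresql://localhost/atelier"


# Hypothetical mapping from config field to environment variable.
ENV_OVERRIDES = {"agent_model": "ATELIER_AGENT_MODEL", "db_url": "ATELIER_DB_URL"}


def load_config_sketch(env: Mapping[str, str]) -> AtelierConfig:
    """Imitate `key = default` followed by `key = ${?VAR}`.

    An unset variable leaves the default untouched; a set one overrides it.
    """
    overrides = {name: env[var] for name, var in ENV_OVERRIDES.items() if var in env}
    return replace(AtelierConfig(), **overrides)


def to_env_lines(cfg: AtelierConfig) -> list[str]:
    """Flatten a resolved config into key=value lines for external tools."""
    return [f"{ENV_OVERRIDES[name]}={value}" for name, value in asdict(cfg).items()]


cfg = load_config_sketch({"ATELIER_DB_URL": "postgresql://db.internal/atelier"})
print("\n".join(to_env_lines(cfg)))
```

Taking the environment as a parameter, rather than reading os.environ inside the function, mirrors the rule that modules do not read environment variables directly.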

Preflight Validation

just preflight runs structured deny/warn checks via atelier.preflight.run_preflight():

  • Deny = blocking (the service cannot start). Example: no API keys, with both Anthropic and Bedrock unconfigured.
  • Warn = advisory (degraded functionality). Examples: GPU detected but CUDA unavailable; Qdrant not reachable.

Preflight is called during gateway startup to surface configuration problems early rather than during the first pipeline run.
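The deny/warn distinction can be modeled as a list of findings with a severity each. This is a sketch of the idea only: the Severity and Finding types, the function names, and the input flags are hypothetical and do not reflect the real signature of atelier.preflight.run_preflight().

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    DENY = "deny"  # blocking: the service cannot start
    WARN = "warn"  # advisory: degraded functionality


@dataclass(frozen=True)
class Finding:
    severity: Severity
    message: str


def run_preflight_sketch(has_anthropic: bool, has_bedrock: bool,
                         qdrant_reachable: bool) -> list[Finding]:
    """Collect findings instead of failing on the first problem."""
    findings: list[Finding] = []
    if not (has_anthropic or has_bedrock):
        findings.append(Finding(Severity.DENY,
                                "no LLM provider configured (Anthropic or Bedrock)"))
    if not qdrant_reachable:
        findings.append(Finding(Severity.WARN, "Qdrant not reachable"))
    return findings


def is_blocked(findings: list[Finding]) -> bool:
    """Startup is blocked only when at least one finding is a deny."""
    return any(f.severity is Severity.DENY for f in findings)


report = run_preflight_sketch(has_anthropic=True, has_bedrock=False,
                              qdrant_reachable=False)
for finding in report:
    print(f"[{finding.severity.value}] {finding.message}")
```

Returning every finding at once lets an operator fix all configuration problems in a single pass.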