
Introduction

Atelier is an agentic classification workbench for Cloudera AI. It classifies column metadata using six independent evidence sources fused via Dempster-Shafer Theory (DST), producing belief intervals instead of point estimates. An LLM-in-the-loop convergence agent identifies disagreements between sources and orchestrates targeted reclassification until the corpus stabilizes.

Why Belief Intervals?

Traditional classifiers output a single probability \( P(A) = 0.85 \) — “85% email address.” This conflates two fundamentally different situations: high confidence with abundant evidence vs. moderate confidence with sparse evidence. A Bayesian posterior and a coin flip can both yield 0.5, but they represent very different epistemic states.

Dempster-Shafer theory separates these via the belief function \( \text{Bel}(A) \) and plausibility function \( \text{Pl}(A) \), where:

$$ \text{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \text{Pl}(A) = 1 - \text{Bel}(\bar{A}) $$

The interval \( [\text{Bel}(A),\; \text{Pl}(A)] \) bounds the true probability. Its width \( \text{Pl}(A) - \text{Bel}(A) \) quantifies epistemic uncertainty — how much we don’t know:

| Interval | Interpretation |
|---|---|
| \( [0.82,\; 0.87] \) | Strong evidence, low ambiguity — classify with confidence |
| \( [0.30,\; 0.90] \) | Some support for \( A \), but high ignorance — gather more evidence |
| \( [0.45,\; 0.55] \) | Two sources disagree — wide gap, needs revisit |

This distinction drives the entire pipeline: columns with wide belief gaps (where \( \text{Pl}(A) - \text{Bel}(A) \) is large) are automatically escalated for LLM re-examination with enriched context. Conflict \( K \) is tracked as a diagnostic, but it is the gap width that determines which columns need attention.
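The interval arithmetic above can be sketched directly. In this minimal example (illustrative names, not Atelier's API), focal elements are frozensets of category codes:

```python
# Minimal sketch: Bel/Pl from a Dempster-Shafer mass function.
# Focal elements are frozensets of category codes; names are illustrative.

def belief(m, A):
    """Bel(A) = sum of mass on subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

def plausibility(m, A):
    """Pl(A) = sum of mass on sets intersecting A (= 1 - Bel(not A))."""
    return sum(v for B, v in m.items() if B & A)

# Evidence split between EMAIL, a compound, and the whole frame (ignorance):
theta = frozenset({"EMAIL", "PHONE", "SSN"})
m = {frozenset({"EMAIL"}): 0.6, frozenset({"EMAIL", "PHONE"}): 0.2, theta: 0.2}

A = frozenset({"EMAIL"})
# Bel = 0.6 (only the EMAIL singleton is a subset of A)
# Pl  = 1.0 (every focal element intersects A) -> gap 0.4: high ignorance
```

The gap of 0.4 is exactly the mass sitting on non-singleton sets, i.e., evidence that cannot rule EMAIL out but does not commit to it.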

Architecture

React Frontend (XYFlow Agent Canvas, Embeddings Visualization)
        │ REST /api/*
        ▼
FastAPI Gateway              REST → gRPC bridge; serves the React build
        │ gRPC :50051
        ▼
gRPC Core                    proto-first API; 7 RPCs
        ├── Claude Agent SDK          6-tool convergence loop; conflict-driven revisit
        ├── Classification Pipeline   6 evidence sources; DST fusion; MC sampling at scale
        ├── PostgreSQL                devenv: PG 16 + pgvector; CAI: PGlite
        └── Qdrant                    vector store; embedding search

Six Evidence Sources

Each source independently produces a mass function \( m_i : 2^\Theta \to [0, 1] \) over the frame of discernment \( \Theta \) (the set of all category codes). Sources are grouped by computational cost:

| Source | Feature Space | Cost Tier |
|---|---|---|
| Cosine similarity | Dense 384-dim sentence-transformer embedding (all-MiniLM-L6-v2) | M0 (local) |
| Pattern detection | 16 regex detectors + post-regex validators (email, phone, SSN, IP, UUID, date, datetime, URL, credit card + Luhn, MAC, IBAN, postal code, monetary, hash, semver, currency + ISO 4217); graduated mass scaling by match fraction | M0 |
| Name matching | Column name vs vocabulary labels, codes, and aliases (4-tier: exact > code > alias > overlap) | M0 |
| LLM classification | Frontier model reasoning (Anthropic / Bedrock / Cerebras / OpenAI-compatible) | M1 (API) |
| CatBoost | 12 discrete features + 384-dim embedding; virtual ensemble uncertainty via posterior_sampling | M2 (trained) |
| SVM | Sparse TF-IDF: character n-grams (3–6) ∪ word bigrams; Platt-scaled LinearSVC | M2 (trained) |

The SVM and CatBoost classifiers occupy deliberately orthogonal feature spaces: the SVM operates on sparse lexical features (TF-IDF) while CatBoost uses dense semantic embeddings. This architectural separation ensures genuine evidence independence for Dempster’s rule.

Fusion

Sources are combined via the conjunctive rule of combination:

$$ m_{1 \oplus 2}(C) = \frac{1}{1-K} \sum_{\substack{A \cap B = C \\ A, B \subseteq \Theta}} m_1(A) \cdot m_2(B) $$

where the conflict \( K = \sum_{A \cap B = \varnothing} m_1(A) \cdot m_2(B) \) measures the degree to which sources contradict each other. High \( K \) is the diagnostic signal that drives the convergence loop: columns where independent evidence sources disagree are escalated for targeted LLM revisit with enriched context (ML prediction, belief interval, confusable pair).
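The rule can be sketched compactly over frozenset focal elements (a hypothetical helper, not Atelier's implementation):

```python
# Sketch of the conjunctive rule of combination for two mass functions.
# Focal elements are frozensets; names are illustrative, not Atelier's API.

def combine(m1, m2):
    out, K = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                out[C] = out.get(C, 0.0) + a * b
            else:
                K += a * b  # conflicting mass (empty intersection)
    if K >= 1.0:
        raise ValueError("total conflict: sources fully contradict")
    # Normalize by 1 - K so the fused masses sum to 1
    return {C: v / (1.0 - K) for C, v in out.items()}, K

theta = frozenset({"EMAIL", "PHONE"})
m1 = {frozenset({"EMAIL"}): 0.7, theta: 0.3}
m2 = {frozenset({"PHONE"}): 0.4, theta: 0.6}
fused, K = combine(m1, m2)   # K = 0.7 * 0.4 = 0.28
```

Here the two sources back different singletons, so 28% of the joint mass is conflicting and gets renormalized away, while K itself is retained as the diagnostic signal.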

Hierarchical Classification

The vocabulary forms a rooted code tree (e.g., ICE.SENSITIVE.PID.CONTACT.EMAIL). Belief and plausibility are queryable at any depth — \( \text{Bel}(\texttt{ICE.SENSITIVE}) \) aggregates all descendants. The cautious_code(τ) operator returns the deepest code where \( \text{Bel} > \tau \), enabling principled depth-accuracy tradeoffs: high \( \tau \) yields coarse but reliable labels; low \( \tau \) yields specific but less certain ones.
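The depth-accuracy tradeoff can be illustrated with a toy lookup, assuming a flat dict of aggregated beliefs keyed by dotted code (illustrative only, not the real cautious_code signature):

```python
# Toy sketch of cautious_code(tau) over a dotted code tree.
# Assumes beliefs are already aggregated per prefix (illustrative).

def cautious_code(beliefs, tau):
    """Deepest code whose belief exceeds tau (depth = number of dots)."""
    qualifying = [c for c, bel in beliefs.items() if bel > tau]
    return max(qualifying, key=lambda c: c.count("."), default=None)

# Hypothetical aggregated beliefs along one branch:
beliefs = {
    "ICE": 0.97,
    "ICE.SENSITIVE": 0.92,
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": 0.61,
}
# tau = 0.9 -> coarse but reliable: "ICE.SENSITIVE"
# tau = 0.5 -> specific but less certain: the full EMAIL leaf
```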

Convergence

The bootstrap pipeline iterates three phases until the belief gap (\( \text{Pl}(A) - \text{Bel}(A) \)) stabilizes:

  1. LLM sweep — classify all frontier columns via batch LLM calls
  2. ML validation — run the full 6-source DST pipeline; compute per-column belief, plausibility, and gap
  3. Targeted revisit — re-classify only uncertain columns (high gap or low belief) with enriched context (ML prediction + belief interval + detected patterns + confusable pairs)

The primary convergence measure is mean belief gap — the average width of the \( [\text{Bel}, \text{Pl}] \) interval across all columns. A narrow gap means the evidence sources agree on a confident prediction. Conflict \( K \) is tracked as a diagnostic signal (it indicates source disagreement) but does not gate convergence — a column can have \( K = 0.9 \) but \( \text{Bel} = 0.95 \): the sources fought, but the winner is clear.

An agent-driven variant (via Claude Agent SDK with 6 tools) delegates the revisit strategy to an LLM that reasons about uncertainty patterns, calls retrain_svm to progressively improve the SVM on accumulated frontier labels, and declares convergence when diminishing returns are reached. The programmatic variant uses gap + coverage thresholds for environments where tool-use isn’t available.

Frontier-Label SVM Training

After the first LLM sweep, the SVM is retrained on blended synthetic + frontier labels — high-quality classifications from the frontier model on the stratified importance sample. Synthetic data provides vocabulary breadth (all categories); frontier labels provide corpus-specific depth. The SVM is hot-swapped progressively across convergence iterations, carrying corpus-specific signal into each validation pass. DST independence is preserved: the SVM trains on frontier-model (Opus) labels while the LLM mass function in fusion uses the subagent model (Sonnet/Haiku).

Scale

The pipeline handles corpora from 50 columns (OOTB sample) to 120M+ columns (full GitTables at 10M+ tables). Monte Carlo stratified sampling selects a representative frontier subset for LLM classification and propagates labels to the remaining corpus via embedding similarity.

With max_frontier_columns = 500, classifying a 120M-column corpus requires LLM inference on only 0.0004% of columns — a >99.99% cost reduction while preserving classification quality through DST conflict-driven escalation of uncertain propagations.
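The quoted fraction checks out:

```python
# Checking the quoted sampling fraction: 500 frontier columns out of 120M.
frontier_columns = 500
corpus_columns = 120_000_000
fraction = frontier_columns / corpus_columns
pct = fraction * 100   # ~0.0004% of columns see direct LLM inference
```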

Out-of-the-Box Experience

A fresh deployment auto-seeds on first boot:

  1. 316-leaf BFO-grounded vocabulary (351 categories total) covering the CCO Information Content Entity trichotomy: Designative (names, IDs, codes), Descriptive (measurements, dates, amounts), Prescriptive (software, specs)
  2. 25 sample tables with 316 columns and a committed curated reference
  3. One-click classification via the Status page
  4. Interactive Embeddings visualization (UMAP/t-SNE via embedding-atlas)

Quick Start

Local development (devenv):

devenv shell          # Enter dev environment
just install          # Install Python + Node dependencies
just up               # Start gRPC + gateway + Vite dev server

CAI deployment: Deploy as an AMP from https://github.com/zndx/atelier.

Documentation Map

System Overview

Atelier is a multi-service application with a gRPC core, FastAPI HTTP gateway, and React frontend.

React Frontend (XYFlow Agent Canvas, Embeddings)
        │ REST /api/*
        ▼
FastAPI HTTP Gateway         serves the React build; bridges REST → gRPC
        │ gRPC :50051
        ▼
gRPC Core Service            proto-first API; agent orchestration; data management
        ├── Claude Agent SDK   keystone agents; adaptive workflows
        └── PostgreSQL         devenv: PG 16 + pgvector; CAI: PGlite (Node.js)
Deployment

Cloudera AI (CML)

Atelier deploys as a CAI Application from the Git URL https://github.com/zndx/atelier.

The .project-metadata.yaml defines two tasks:

  1. Install Dependencies — Installs Python (via uv) and Node.js dependencies, builds the React frontend
  2. Start Atelier — Launches the gRPC server and HTTP gateway on CDSW_APP_PORT

Local Development

devenv shell          # Enter dev environment (loads .env automatically)
just install          # Install Python + Node dependencies
just proto            # Generate proto stubs
just resolve-config   # Materialize HOCON → build/config/atelier.env
just up               # Start gRPC + Vite dev server via devenv processes

gRPC & Gateway

Atelier follows the Fine Tuning Studio proto-first pattern: the gRPC service contract defines the API, and a FastAPI gateway bridges REST to gRPC while serving the React frontend.

Proto Definition

The service contract lives in src/atelier/proto/atelier.proto.

RPCs

| RPC | Request → Response | Purpose |
|---|---|---|
| HealthCheck | HealthCheckRequest → HealthCheckResponse | Prove gRPC is alive (status + version) |
| ListAgents | ListAgentsRequest → ListAgentsResponse | List agent metadata (id, name, role, tools) |
| GetAgent | GetAgentRequest → GetAgentResponse | Single agent by ID |
| ListDataSources | ListDataSourcesRequest → ListDataSourcesResponse | List OOTB + Hive sources |
| ListDatasets | ListDatasetsRequest → ListDatasetsResponse | Classification datasets (filterable by source_id) |
| GetFSMStatus | FSMStatusRequest → FSMStatusResponse | Pipeline state + progress JSON |
| StartClassification | StartClassificationRequest → StartClassificationResponse | Trigger a classification run |

Key Messages

  • DataSource — id, source_type (sample/hive), source_uri, display_name, vocabulary_mode
  • ClassificationDataset — id, name, parquet_path, source_id, version_number, is_active, summary
  • FSMStatusResponse — run_id, state, started_at, progress_json, error
  • AgentMetadata — id, name, description, role, tool_ids

Generating Stubs

just proto    # runs bin/generate-proto.sh

This invokes grpc_tools.protoc to produce _pb2.py, _pb2_grpc.py, and .pyi type stubs.

Architecture Layers

Proto (atelier.proto)     ← Service contract and message definitions
    ↓
Servicer (service.py)     ← Thin router dispatching to business logic
    ↓
Client (client.py)        ← Wrapper around generated stub with error handling
    ↓
Gateway (gateway.py)      ← FastAPI bridge from REST to gRPC + React SPA

Gateway REST Endpoints

Infrastructure

| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | gRPC health check |
| /api/status | GET | Aggregated health: gRPC + PostgreSQL + Qdrant + config state |
| /api/agents/validate-credentials | POST | Test all configured LLM providers |
| /api/agents/model-discovery | GET | Check for model upgrades via Anthropic Models API |

Data Sources & Datasets

| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List registered data sources |
| /api/datasets | GET | List datasets (optional source_id filter) |
| /api/datasets/{id}/activate | POST | Set dataset version as active |
| /api/datasets/{id}/data | GET | Serve parquet file |
| /api/data-connections | GET | List CAI data connections |
| /api/data-connections/{name}/test | POST | Test a CAI connection |
| /api/vocabulary/stats | GET | Term count (source-aware routing) |

Classification Pipeline

| Endpoint | Method | Description |
|---|---|---|
| /api/fsm/status | GET | Current pipeline state + progress |
| /api/fsm/start | POST | Start classification (optional source_id) |
| /api/fsm/runs | GET | List past classification runs |

Agents & Skills

| Endpoint | Method | Description |
|---|---|---|
| /api/agents | GET | List agent metadata |
| /api/skills | GET | Skill definitions from .claude/commands/ |
| /api/skills/{skill_id} | GET | Single skill markdown content |
| /api/agents/smoke-test | POST | Minimal Claude Agent SDK verification |

WebSocket

| Endpoint | Purpose |
|---|---|
| /ws/terminal/{session_id} | Persistent terminal backed by Claude Agent SDK |
| /ws/orchestration | Live agent events (spawned, reasoning, tool_call, completed) |

Persistent Terminal Sessions

Terminal sessions survive page navigation and browser reload. The WebSocket endpoint accepts a client-provided session_id (persisted in localStorage). On disconnect, the session stays alive server-side — SDK queries continue running and output accumulates in a ring buffer (64KB collections.deque). On reconnect, the buffer is replayed so the user sees everything that happened while they were away.

  • Session registry: Module-level _sessions dict in terminal.py
  • Idle cleanup: Background asyncio task sweeps sessions with no client for 30 minutes (/api/terminal/sessions lists active sessions)
  • Dedicated page: /terminal route renders a full-screen Ghostty WASM terminal; the Landing page embeds the same component at preview size
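The replay buffer can be sketched with collections.deque as described above; the class and method names here are illustrative, not the actual terminal.py API:

```python
# Sketch of a bounded replay buffer for terminal output (illustrative).
# Oldest chunks are evicted once the byte budget is exceeded; on reconnect,
# replay() returns everything still buffered.
from collections import deque

class SessionBuffer:
    def __init__(self, max_bytes=64 * 1024):
        self.chunks = deque()
        self.max_bytes = max_bytes
        self.size = 0

    def append(self, chunk: bytes):
        self.chunks.append(chunk)
        self.size += len(chunk)
        while self.size > self.max_bytes:      # evict oldest output
            self.size -= len(self.chunks.popleft())

    def replay(self) -> bytes:
        return b"".join(self.chunks)           # sent to a reconnecting client

# Tiny budget to show eviction:
buf = SessionBuffer(max_bytes=8)
buf.append(b"hello")
buf.append(b"world")   # 10 bytes total > 8, so b"hello" is evicted
```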

SPA Fallback

/{path} serves ui/dist/index.html for client-side routing.

Aggregated Status Endpoint

GET /api/status returns a comprehensive health report:

{
  "grpc": {"status": "ok", "latency_ms": 12},
  "postgres": {"status": "ok"},
  "qdrant": {"status": "ok"},
  "config": {
    "has_anthropic": true,
    "has_bedrock": false,
    "agent_model": "claude-sonnet-4-5-20250929",
    "db_url": "postgresql://...(masked)"
  },
  "overall_status": "connected"
}

PostgreSQL probes retry 3x with 1s backoff (PGlite can have transient stalls). Overall status is connected when gRPC responds, degraded when gRPC is up but other services are flaky.
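A minimal sketch of the probe-with-retry behavior (3 attempts, 1 s backoff); the function and probe names are illustrative:

```python
# Sketch: retry a health probe a few times before reporting an error.
# Transient failures (e.g. a PGlite stall) are absorbed by the backoff.
import time

def probe_with_retry(probe, attempts=3, backoff_s=1.0):
    last_err = None
    for i in range(attempts):
        try:
            return {"status": "ok", **probe()}
        except Exception as err:
            last_err = err
            if i < attempts - 1:
                time.sleep(backoff_s)
    return {"status": "error", "detail": str(last_err)}
```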

Gateway Lifespan

The FastAPI lifespan hook runs three startup tasks:

  1. OOTB seed: Check if ootb-sample source has any dataset versions; if none, create version 1 with metadata.
  2. Hive auto-discovery: discover_hive_sources() probes all configured data connections (ATELIER_DATA_CONNECTIONS), iterates databases, finds annotations tables matching the known schema (legacy or universal format), and auto-registers them via get_or_create_data_source().
  3. Terminal cleanup: Background asyncio task sweeps idle terminal sessions every 60 seconds.

All three tasks are wrapped in try/except — failures are logged as warnings but don’t prevent gateway startup.

Config Lifecycle

HOCON (config/base.conf) is the single source of truth. No module reads os.environ directly for configuration values.

.env → devenv shell → HOCON ${?VAR} substitution → AtelierConfig dataclass

load_config() reads the HOCON file with live environment variable substitution. External tools that need a flat key=value file use just resolve-config to materialize build/config/atelier.env.

Preflight Validation

just preflight runs structured deny/warn checks via atelier.preflight.run_preflight():

  • Deny = blocking (service cannot start). Examples: missing API keys when both Anthropic and Bedrock are unconfigured.
  • Warn = advisory (degraded functionality). Examples: GPU detected but CUDA unavailable, Qdrant not reachable.

Preflight is called during gateway startup to surface configuration problems early rather than during the first pipeline run.

Keystone Agents

Atelier uses the Claude Agent SDK to drive classification convergence. Rather than a fixed programmatic loop, an LLM agent reasons about which columns to revisit based on DST conflict metrics, evidence breakdowns, and convergence trends.

Agent Convergence Loop

The agent loop (src/atelier/classify/agent_loop.py) wraps the bootstrap pipeline functions as six Claude tools. Claude receives an initial state summary and iteratively calls tools until it determines the classification has converged.

Flow

1. Initial state → agent sees mean gap, mean belief, coverage, K (diagnostic)
2. Agent calls get_conflict_report → identifies uncertain columns (high gap or low belief)
3. Agent calls get_column_detail → inspects per-source evidence breakdown
4. Agent calls revisit_columns → re-classifies with enriched context
5. Agent calls retrain_svm → SVM learns from accumulated frontier labels
6. Agent calls check_convergence → verifies gap trend + belief floor
7. Repeat 2-6 until satisfied
8. Agent calls declare_converged with reason

The conversation loop runs up to classify_agent_max_turns (default 10) Messages API round-trips. Each tool call returns structured JSON that the agent uses to plan its next action.

Six Tools

| Tool | Input | Returns | Purpose |
|---|---|---|---|
| get_conflict_report | k_threshold (float) | Flagged columns with K, belief, plausibility, gap, settled flag | Identify uncertain or conflicting columns |
| revisit_columns | column_names (list) | Updated labels + new belief intervals | Re-classify with enriched LLM context (ML prediction + belief interval) |
| check_convergence | mean_gap, mean_bel, frac_unclear, coverage, K (diagnostic), iteration history | — | Assess convergence via belief-gap criteria |
| get_column_detail | column_name (string) | Per-source evidence breakdown, sample values, belief interval | Deep-dive into a specific column |
| declare_converged | reason (string) | Confirmation | Exit loop with stated rationale |
| retrain_svm | frontier_samples, classes, model_path | — | Retrain SVM on blended synth + frontier labels |

The retrain_svm tool (M9) lets the agent decide when to retrain the SVM classifier on accumulated frontier LLM labels. The retrained SVM is hot-swapped via ml_inference.reset() + configure_paths() and used in subsequent ML validation passes. The agent calls this when it judges enough new frontier labels have accumulated to improve classification accuracy.

Agent System Prompt

The system prompt guides the agent’s strategy:

  1. Examine the conflict report to understand where sources disagree
  2. Inspect individual columns for uncertain cases (high gap or low belief)
  3. Revisit uncertain columns to resolve ambiguity
  4. Check convergence metrics (mean gap, mean belief, coverage) to decide whether to continue — K is available as a diagnostic but does not gate
  5. Declare convergence when satisfied (or when diminishing returns)

State Tracking

The agent loop tracks:

  • state.agent_reasoning — text blocks from each agent turn
  • state.agent_converged_reason — the reason given at convergence
  • state.agent_turns — number of conversation turns
  • state.tokens_input / state.tokens_output — token consumption

Each revisit_columns call increments state.iteration and triggers full ML revalidation on all columns, not just the revisited ones. This ensures that improved LLM labels propagate through the DST fusion.

LLM Backend Matrix

The agent loop and LLM sweep share the same backend infrastructure. No global provider switch — credentials determine what’s available.

| Backend | Class | Config | Use Case |
|---|---|---|---|
| Anthropic | AnthropicBackend | ANTHROPIC_API_KEY | Agent loop + LLM sweep |
| Bedrock | BedrockBackend | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_REGION | Production default on CAI |
| Cerebras | CerebrasBackend | CEREBRAS_API_KEY | Fast inference via GLM-4.7 |
| OpenAI-compatible | OpenAICompatibleBackend | ATELIER_LLM_BASE_URL + ATELIER_LLM_MODEL | vLLM, any compatible endpoint |

The agent client is built via _build_client(cfg), which prefers Anthropic when ANTHROPIC_API_KEY is set, falling back to Bedrock when AWS credentials are available. The agent model resolves as: classify_agent_model → agent_model → "claude-sonnet-4-5-20250929".

Configuration

All agent and bootstrap settings live in HOCON (config/base.conf):

classify {
    llm {
        backend = "openai_compatible"
        model = "glm-4.7"
        base_url = null
        columns_per_call = 50
        discount = 0.10
    }
    bootstrap {
        max_iterations = 5
        k_threshold = 0.2
        coverage_target = 0.95
        max_total_llm_calls = 5000
        frontier_svm_retrain = true
        frontier_svm_min_labels = 20
    }
}

agent {
    model = "claude-sonnet-4-5-20250929"
    model = ${?ATELIER_AGENT_MODEL}
}

classify {
    agent_model = null
    agent_model = ${?ATELIER_CLASSIFY_AGENT_MODEL}
    agent_max_turns = 10
}

When classify.agent_model is set, it overrides agent.model for the classification convergence loop specifically.

Agent vs Programmatic Loop

The bootstrap pipeline (bootstrap.py) contains the programmatic convergence loop as well: sweep → validate → revisit uncertain → repeat. The agent loop is an alternative that delegates the revisit strategy to Claude. Both paths share the same underlying functions (_llm_sweep, _run_ml_validation, etc.) and produce identical DST evidence.

The agent approach is preferred when:

  • The corpus has complex ambiguity patterns (confusable categories)
  • You want reasoning traces explaining why convergence was declared
  • The LLM backend supports tool_use (Anthropic, Bedrock with Claude)

The programmatic approach is used when:

  • The LLM backend doesn’t support tool_use (vLLM, Cerebras)
  • Deterministic behavior is required
  • Cost must be minimized (fewer API calls)

WebSocket Orchestration

The gateway exposes /ws/orchestration for live agent event streaming. Events include agent_spawned, agent_reasoning, agent_tool_call, and agent_completed. The React frontend’s Agent Canvas page consumes these events to render the agent’s decision process in real time.

Classification Pipeline

Atelier’s core objective: agent-mediated metadata classification using Dempster-Shafer Theory (DST) to produce belief intervals instead of flat confidence scores, exposing epistemic uncertainty and source disagreement.

Terminology — reference-label provenance

Four distinct sources of per-column labels show up in our writeups. Conflating them is a load-bearing error, so we name each explicitly:

| Term | Source | Authority level | Where it appears |
|---|---|---|---|
| Published benchmark | External, human-curated labels (SOTAB, GitTables) | Gold standard — memorization-safe check | SOTAB pilot artifacts; docs/notes/2026-04-19/…phase_gate_2.md |
| Curated reference | Generator-derived (synth pairs an answer-key “reference column” per target) + spot-checked by hand | Definitive for the synthetic corpus; not equivalent to a published benchmark | build/meta-tagging-clean/curated_reference.csv |
| LLM commitment | A single LLM’s pass-1 or pass-2 output | Classifier opinion; not a truth | parquet llm_code, predicted_code |
| CatBoost prior | CatBoost fit to LLM labels, used for revisit enrichment | Not independent evidence — it is a compressed self-consensus of the LLM; valuable specifically for rescuing abstentions | parquet predicted_code via DST fusion |

An ablation (as used in our writeups) is a controlled experiment that holds most of the pipeline fixed and varies exactly one component at a time, so changes in accuracy can be attributed to that component rather than to the combination.

Methodology

Why Dempster-Shafer?

Traditional classifiers output a single confidence score (e.g., “85% email address”). This hides two distinct types of uncertainty:

  • Aleatoric uncertainty: inherent randomness in the data
  • Epistemic uncertainty: ignorance due to insufficient evidence

DST separates these via belief intervals [Bel(A), Pl(A)]:

  • Bel(A) = committed evidence supporting A (lower bound)
  • Pl(A) = evidence that cannot rule out A (upper bound)
  • Pl(A) - Bel(A) = unresolved ambiguity

When Bel(A) = 0.8 and Pl(A) = 0.85, we have high confidence with low ambiguity. When Bel(A) = 0.3 and Pl(A) = 0.9, we know something supports A but much remains uncertain — a signal to gather more evidence.

Evidence Sources

Each source independently produces a mass function (Basic Probability Assignment) that distributes belief across the frame of discernment:

| Source | Type | Discount | Configurable | Status |
|---|---|---|---|---|
| Cosine similarity | Sentence-transformer (all-MiniLM-L6-v2) | 0.30 | classify.discounts.cosine | M0 |
| Pattern detection | 16 regex detectors + post-regex validators | 0.25 | classify.discounts.pattern_theta | M0 |
| Name matching | Column name ↔ label/abbrev/common_names | varies | classify.discounts.name_match_* | M0 |
| LLM | OpenAI-compatible / Anthropic / Bedrock / Cerebras | 0.10 | classify.llm.discount | M1 |
| CatBoost | Gradient boosted trees (virtual ensembles) | adaptive | classify.discounts.catboost_* | M2 |
| SVM | Dual TF-IDF (char+word n-grams) + LinearSVC (Platt scaling) | 0.20 | classify.discounts.svm | M2 |

The discount controls how much mass goes to Θ (total ignorance). Higher discount = more conservative = wider belief intervals.

Pattern mass is graduated: detect_patterns() returns a match fraction (0.0-1.0) per pattern, and pattern_to_mass() scales evidence mass by the average match fraction. A 95% match produces ~3x more mass than a 35% match, eliminating the binary cliff at the 1/3 detection threshold.

Pattern theta (0.25) is deliberately higher than LLM theta (0.10), so the LLM cleanly dominates when pattern and LLM evidence conflict — the LLM considers full context (name, type, values, siblings), while patterns operate on value structure alone.

Evidence Independence

Dempster’s rule of combination requires cognitively independent evidence sources (Shafer 1976) — each mass function must reflect information not derived from the other sources being combined. Atelier achieves this through architectural separation of feature spaces and training signals:

| Source | Feature Space | Training Signal | Independence Basis |
|---|---|---|---|
| Name match | String/lexical | None (deterministic) | Symbolic matching only |
| Pattern | Regex | None (deterministic) | Hand-crafted rules only |
| Cosine | Dense embedding (384-dim) | Pre-trained sentence-transformer | Learned semantic similarity |
| LLM | Semantic (frontier or subagent model) | Pre-trained weights | In-context classification |
| CatBoost | Dense embedding + 12 features | Synthetic data generators | Gradient-boosted ensemble |
| SVM | Sparse TF-IDF (char 3-6 + word 1-2 n-grams) | Synthetic data generators | Lexical surface patterns |

The SVM is architecturally the most important independence guarantee. While cosine similarity and CatBoost both operate on the same dense sentence-transformer embedding (384 dimensions from all-MiniLM-L6-v2), the SVM operates on an entirely orthogonal feature representation: sparse TF-IDF character and word n-grams extracted by sklearn.pipeline.Pipeline + FeatureUnion. This means the SVM captures lexical surface patterns (abbreviations, digit sequences, camelCase fragments) that the dense embedding may collapse — providing genuine corrective signal in DST fusion.

SVM Architecture (adopted from Signals)

The SVM classifier follows the Pipeline + FeatureUnion composition pattern from the Signals project — the version of record presented as an independent fifth DST evidence source:

Column metadata text ("email_addr | user@example.com")
        │
        ▼
    FeatureUnion
    ├── TfidfVectorizer(analyzer="char_wb", ngram_range=(3,6))
    │   → captures subword patterns, abbreviations, digit sequences
    └── TfidfVectorizer(analyzer="word", ngram_range=(1,2))
        → captures multi-word patterns ("email address", "zip code")
        │
        ▼
    Sparse feature matrix (up to 100K dimensions)
        │
        ▼
    CalibratedClassifierCV(LinearSVC, method="sigmoid")
        │
        ▼
    Calibrated probability distribution {code: probability}

Key implementation details:

  • Singleton class filtering — fit() drops categories with < 2 training examples before CalibratedClassifierCV, since StratifiedKFold requires every class to have >= 2 samples. With 316 categories and few tables, some categories inevitably have only one example. Dropped categories are logged and still receive predictions from the other 5 DST evidence sources.
  • _min_class_count() — returns the actual minimum (no longer clamped to 2)
  • feature_importances(top_n) — navigates CalibratedClassifierCV → LinearSVC to extract coef_, averages absolute coefficients across classes, cross-references with FeatureUnion.get_feature_names_out() for named feature importance
  • is_fitted property for safe state checking before prediction
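The composition above maps onto scikit-learn directly. This is a toy sketch with illustrative hyperparameters and data, not Atelier's training code:

```python
# Sketch of the FeatureUnion + calibrated LinearSVC composition.
# Toy data and hyperparameters are illustrative assumptions.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

svm = Pipeline([
    ("features", FeatureUnion([
        # char_wb n-grams catch abbreviations and digit sequences
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 6))),
        # word n-grams catch multi-word patterns ("email address")
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])),
    # Platt scaling (sigmoid) turns margins into calibrated probabilities
    ("clf", CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=2)),
])

# Toy "name | sample value" metadata texts, two categories:
X = [
    "email_addr | user@example.com", "contact_email | a@b.org",
    "email | jane@corp.io", "zip | 94016", "postal_code | 10001",
    "zip_code | 60614",
]
y = ["EMAIL", "EMAIL", "EMAIL", "POSTAL", "POSTAL", "POSTAL"]
svm.fit(X, y)
probs = svm.predict_proba(["user_email | bob@mail.com"])[0]
```

Note that cv=2 here only works because every toy class has at least two examples; this is precisely the constraint that motivates the singleton filtering described above.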

Frontier-Label SVM Training (M9)

The Monte Carlo sampling architecture enables a stronger training signal for the SVM without breaking independence. After the bootstrap LLM sweep, the SVM is retrained on blended synth + frontier labels — high-quality classifications from the Opus-tier model on the stratified importance sample.

_llm_sweep() → frontier columns get Opus labels
     ↓
  RETRAIN #1: Blend synth data + frontier labels
  SVM hot-swapped before first ML validation
     ↓
_run_ml_validation() — uses frontier-trained SVM
     ↓
  Convergence loop:
    Agent path: agent calls retrain_svm tool when it judges
                enough new labels have accumulated
    Programmatic path: retrain after each revisit iteration
                       that adds ≥10 new frontier labels
     ↓
  RETRAIN #3 (final): Only if NOT converged
     ↓
  CLASSIFYING — final pass uses best available SVM

Blending ensures categories not in the frontier sample still have coverage from synth data (broad vocabulary), while corpus-specific patterns dominate via frontier signal (depth).

Independence is preserved because:

  • Training signal: Opus (frontier model, used in LLM sweep)
  • Bulk LLM source in DST fusion: Sonnet/Haiku (subagent model)
  • SVM feature space: sparse TF-IDF (orthogonal to all other sources)

The three independence axes:

  1. Different models at training time (Opus) vs. fusion time (Sonnet/Haiku)
  2. Different feature spaces (sparse TF-IDF vs. semantic LLM reasoning)
  3. Different inductive biases (maximum-margin classifier vs. autoregressive LM)

The SVM becomes the transmission mechanism for frontier-quality signal — MC sampling bounds the Opus cost; the SVM amortizes Opus’s accuracy across the entire table-space.

Configuration

classify.bootstrap {
  frontier_svm_retrain = true    # Enable/disable frontier retraining
  frontier_svm_min_labels = 20   # Minimum frontier labels to trigger retrain
}

Implementation
  • train_svm_on_frontier_labels() in ml_train.py — collects frontier labels (label_source in ("llm", "llm_revisit")), blends with synth data, trains SVMClassifier, saves to results_dir/svm_frontier.pkl
  • _maybe_retrain_svm() in pipeline.py — encapsulates retrain + hot-swap via ml_inference.reset() + configure_paths()
  • Three call sites in pipeline: post-sweep, iterative, final (if not converged)
  • Agent tool retrain_svm for agent-driven convergence path

Dempster’s Rule of Combination

Sources are fused via the conjunctive combination rule:

m₁₂(C) = Σ{m₁(A)·m₂(B) : A∩B=C} / (1 - K)

where K = Σ{m₁(A)·m₂(B) : A∩B=∅} is the conflict between sources.

High K means the sources disagree — a valuable diagnostic signal. Note that K is not the convergence criterion — see Belief-Gap Convergence below.

Confusable Pairs

When DST evidence splits between two known-confusing categories, mass is redistributed from the runner-up singleton to a compound focal element representing the pair. This captures honest ambiguity instead of forcing a singleton prediction that may be wrong.

Four confusable pairs are active (filtered to vocabulary at runtime):

| Pair | Rationale |
|---|---|
| Record Identifier ↔ Device Identifier | Both are opaque identifiers; context determines which |
| Timestamp ↔ Date of Birth | Both are temporal; DOB is a specific semantic subtype |
| Transaction Amount ↔ Bank Account Number | Both are financial numbers |
| IP Address ↔ Device Identifier | IP addresses can identify devices |

Mechanics: When the top-2 singleton masses form a known pair and their ratio is below confusable_ratio_threshold (default 3.0), half of the runner-up’s mass transfers to the pair focal element. The pair’s mass propagates up the hierarchy via belief_at() — Bel at the common ancestor reflects the combined evidence.
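A sketch of the redistribution mechanics, with an illustrative pair set and helper name (not Atelier's implementation):

```python
# Sketch of confusable-pair mass redistribution: when the top-2 singletons
# form a known pair and their mass ratio is below the threshold, half the
# runner-up's mass moves to the compound (pair) focal element.
CONFUSABLE = {frozenset({"RECORD_ID", "DEVICE_ID"})}

def apply_confusable(m, ratio_threshold=3.0):
    singles = sorted(
        ((s, v) for s, v in m.items() if len(s) == 1),
        key=lambda kv: kv[1], reverse=True)
    if len(singles) < 2:
        return m
    (top, tv), (second, sv) = singles[0], singles[1]
    pair = top | second
    if pair in CONFUSABLE and tv / sv < ratio_threshold:
        m = dict(m)
        m[second] = sv / 2.0                    # keep half on the runner-up
        m[pair] = m.get(pair, 0.0) + sv / 2.0   # half to the compound element
    return m

# 0.5 vs 0.3: ratio 1.67 < 3.0, so half of DEVICE_ID's mass moves up
m = {frozenset({"RECORD_ID"}): 0.5, frozenset({"DEVICE_ID"}): 0.3,
     frozenset({"RECORD_ID", "DEVICE_ID"}): 0.2}
```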

Pattern Validation

Pattern detection uses a two-stage architecture: 16 regex patterns for recall, plus a _VALIDATORS registry for precision. A value must pass both the regex AND the validator (if one exists) to count.

Validator           Pattern                                   Checks
_luhn_check         credit_card_pattern                       Luhn checksum (ISO/IEC 7812)
_is_valid_ipv4      ipv4_pattern                              All 4 octets in 0-255 range
_is_plausible_date  date_iso_pattern, datetime_iso_pattern    Month 01-12, day 01-31
_is_iso_currency    iso_currency_pattern                      ISO 4217 whitelist (~40 codes)

The phone_pattern uses a suppression mechanism: when a more specific digit-heavy pattern also fires (SSN, date, credit card, IP, postal code, monetary, IBAN), the phone match is suppressed. This prevents the phone regex from injecting false evidence on columns whose values happen to contain formatted digits.
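As an example of the validator stage, here is a Luhn checksum sketch in the spirit of _luhn_check; the production validator may differ in details such as the minimum length bound:

```python
def luhn_check(number: str) -> bool:
    """Luhn checksum (ISO/IEC 7812) — a value passing credit_card_pattern
    must also pass this to count as evidence (sketch)."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 12:                 # illustrative length floor
        return False
    total = 0
    # Double every second digit from the right; subtract 9 if the result > 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

The standard Visa test number 4111 1111 1111 1111 passes; corrupting any single digit fails the checksum, which is exactly the precision the regex alone cannot provide.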

12 Discrete Features

Each column produces 12 SAGE-ablatable features:

  1. column_name — humanized column name
  2. column_type — SQL type (suppresses uninformative STRING/VARCHAR)
  3. sample_values — first 5 non-null values as text
  4. cardinality — distinct value count
  5. null_ratio — fraction of NULL values
  6. value_entropy — Shannon entropy of value lengths
  7. pattern_signals — matched regex patterns
  8. avg_value_length — mean string length
  9. numeric_ratio — fraction parseable as numbers
  10. sibling_context — other column names in the same table
  11. source_table — table name
  12. value_description — auto-generated natural language description
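Feature 6 (value_entropy) can be illustrated with a short sketch; the exact definition in features.py may differ:

```python
import math
from collections import Counter

def value_entropy(values: list) -> float:
    """Shannon entropy of value lengths (sketch of feature 6).

    Fixed-format columns (e.g. SSNs, all length 11) score 0;
    free-text columns with varied lengths score high."""
    lengths = [len(v) for v in values]
    counts = Counter(lengths)
    n = len(lengths)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

value_entropy(["123-45-6789", "987-65-4321"])   # 0.0 — uniform lengths
value_entropy(["a", "bb", "ccc", "dddd"])       # 2.0 — four distinct lengths
```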

Architecture

AgentFSM

The classification pipeline runs as a background Finite State Machine:

ML-only path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

Bootstrap path (programmatic):
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
                                                    ▲                     │
                                                    └─── (disagreements) ─┘
                                                          (converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

Agent-driven path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING
                                                    ▲           │
                                                    └── Agent convergence loop (5 tools)
                                                          Claude reasons about which columns to revisit
                                                          (converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

MC sampling (when corpus > 200 columns):
SAMPLING includes pre-classify → stratify → select MC sample
LLM_SWEEP classifies frontier columns only → propagate labels to remainder

State transitions are persisted to PostgreSQL. The Status page polls /api/fsm/status for live progress updates.

Module Structure

src/atelier/classify/
├── __init__.py          # Public API: run_pipeline(), run_bootstrap(), get_fsm_status()
├── belief.py            # DST core: BeliefAssignment, FocalElement, dempster_combine()
├── mass_functions.py    # Evidence→mass converters (6 active)
├── features.py          # 12 features + 16 pattern detectors + 5 post-regex validators
├── taxonomy.py          # ReferenceCategory, HierarchicalCategorySet
├── embedding.py         # Sentence-transformer cosine classifier
├── llm_backend.py       # LLM backend factory (Anthropic, OpenAI-compat, Bedrock tool-use, Cerebras)
├── bootstrap.py         # Bootstrap convergence loop (LLM sweep + ML validation)
├── agent_loop.py        # Agent-driven convergence (6 Claude tools)
├── monte_carlo.py       # MC stratified sampling for scale (pre-classify, stratify, select, propagate)
├── gpu.py               # GPU detection + NVIDIA driver symlink (nix+CUDA)
├── sampler.py           # Hive metadata sampling + fixture data loading
├── synth.py             # Synthetic data generation
├── synth_generators.py  # 316+ hand-coded value generators (shared module)
├── synth_registry.py    # Three-layer generator registry (hand-coded > template > inferred)
├── meta_tagging_overlay.py # 130+ META_TO_ICE mappings for meta-tagging alignment
├── svm_classifier.py    # Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals)
├── catboost_classifier.py # CatBoost with virtual ensemble uncertainty
├── ml_train.py          # Training orchestrator (synth → models)
├── ml_inference.py      # Lazy-loading inference wrappers
├── evaluation.py        # Structured evaluation (per-category P/R/F1, confusion matrix)
├── train_eval_cycle.py  # Synth → train → classify → evaluate orchestrator
├── mock_llm.py          # Realistic mock LLM (confusable pairs, seeded mistakes)
├── sage.py              # SAGE feature importance (permutation-based, GPU-aware)
├── shap_explanations.py # Per-item SHAP feature attribution (TreeSHAP + PermutationSHAP)
├── pipeline.py          # Full pipeline orchestration (6 sources + MC + background SHAP)
├── fsm.py               # AgentFSM state machine
├── fixtures/
│   ├── universal_vocabulary.json  # BFO-grounded universal vocabulary (16 leaves)
│   └── fixture_tables.json        # 8 tables, 50 cols — fixture reference for unit tests
│                                    (NOT the UAT-corpus curated reference; see
│                                    build/meta-tagging-clean/curated_reference.csv)
data/sample/
├── ontology.json                  # Expanded vocabulary (300 leaves, 25 internal)
└── ontology/
    ├── atelier-vocab.ttl          # CCO-mediated BFO alignment (59 mapped terms)
    ├── sparql/unmapped-terms.rq   # Totality validation query
    └── README.md                  # Mapping methodology and usage

Build Directory

Artifacts are written to build/ (gitignored) to separate reproducible code from potentially sensitive intermediate data:

build/
├── data/annotations/    # Cached vocabulary from hive
├── data/samples/        # Sampled metadata
├── data/synth/          # Synthetic training data
├── models/              # Trained CatBoost + SVM models, embedding caches
└── results/{run_id}/
    ├── classifications.json           # Per-column DST results (+ SHAP columns when enabled)
    ├── evaluation_report.json         # Per-category P/R/F1, confusion matrix
    └── atelier_embeddings.parquet     # For embedding-atlas (+ shap_top{1,2,3}_{name,value})

Controlled Vocabulary

Loaded from hive default.annotations (11 columns):

Column                                      Maps to           Purpose
id                                          code              Hierarchical dot-notation identifier
ontology                                    label             Human-readable category name
annotation                                  abbrev            Formal code / mnemonic
definition                                  description       Human-readable definition text
common_names                                common_names      Pipe/comma-separated aliases
specifics                                   (embedding text)  Examples and context
non_corp, emp_contractor, individual, corp  sensitivity       Per-role ratings (0-4)
deprecated                                  (filter)          “yes” = exclude

API

REST Endpoints

  • GET /api/fsm/status — Current pipeline state + progress
  • POST /api/fsm/start — Start a single-pass ML classification run
  • POST /api/fsm/start-bootstrap — Start bootstrap convergence loop (LLM + ML)
  • GET /api/fsm/runs — List past runs

gRPC RPCs

  • GetFSMStatus() → FSMStatusResponse
  • StartClassification() → StartClassificationResponse

HierarchicalClassification

The pipeline wraps each column result in a HierarchicalClassification object (ported from signals) that enables post-hoc hierarchy navigation:

  • belief_at(code) — query Bel at any hierarchy level (leaf or internal)
  • plausibility_at(code) — query Pl at any level
  • interval_at(code) — (Bel, Pl) tuple
  • uncertainty_gap — Pl - Bel for the predicted category
  • needs_clarification — True when uncertainty_gap > 0.3 or conflict > 0.2
  • from_combined_evidence() — factory method: filters vacuous sources, combines via the configured fusion strategy, ranks by pignistic probability

Confidence is pignistic probability BetP(singleton), the decision-theoretic transform that distributes multi-element focal set mass equally among members.
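A sketch of the pignistic transform (names illustrative):

```python
def pignistic(masses: dict) -> dict:
    """BetP: split each focal element's mass equally among its members
    (sketch of the decision-theoretic transform described above)."""
    betp: dict = {}
    for focal, m in masses.items():
        share = m / len(focal)
        for member in focal:
            betp[member] = betp.get(member, 0.0) + share
    return betp

betp = pignistic({frozenset({"email"}): 0.6,
                  frozenset({"email", "url"}): 0.4})
# "email" receives 0.6 + 0.4/2 = 0.8; "url" receives 0.4/2 = 0.2
```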

Fusion Strategies

Two DST combination rules are implemented, selectable via classify.fusion_strategy:

  • dempster (default) — Classical Dempster’s rule with (1-K) normalization. Under high conflict, surviving singletons are amplified.
  • yager — Yager’s modified rule. Conflict mass is redirected to Θ (ignorance) instead of being normalized away. Preserves epistemic honesty at the cost of higher ignorance mass and typically lower peak belief values. When K=0, produces identical results to Dempster.

Yager is available as an opt-in alternative for empirical validation. The default (Dempster) remains in place pending A/B comparison on real pipeline runs — Yager’s increased conservatism may or may not improve overall classification quality, and compensatory adjustments to per-source discounting or decision thresholds may be needed.
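The difference between the two rules can be sketched side by side. This standalone illustration redirects the conflict mass K to Θ, as Yager's rule prescribes; the mass functions are invented:

```python
from itertools import product

def yager_combine(m1: dict, m2: dict, theta: frozenset) -> dict:
    """Yager's modified rule (sketch): conflict mass is redirected to Θ
    (total ignorance) instead of being normalized away."""
    combined, conflict = {}, 0.0
    for (a, ma), (b, mb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + ma * mb
        else:
            conflict += ma * mb
    if conflict:
        combined[theta] = combined.get(theta, 0.0) + conflict
    return combined

theta = frozenset({"email", "url"})
m1 = {frozenset({"email"}): 0.6, theta: 0.4}
m2 = {frozenset({"email"}): 0.7, frozenset({"url"}): 0.3}
fused = yager_combine(m1, m2, theta)   # K = 0.18 lands on Θ, not on singletons
```

Where Dempster would scale {email} up to 0.70/0.82 ≈ 0.854, Yager leaves it at 0.70 and records the 0.18 of disagreement as honest ignorance.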

Bootstrap Convergence Loop

The bootstrap pipeline wraps the single-pass ML pipeline in an iterative LLM↔ML convergence loop. It adds LLM evidence and repeats until predictions are settled — measured by belief-gap convergence, not raw conflict K.

Three Phases

  1. LLM Sweep (LLM_SWEEP): Batch-classify all columns via the configured LLM backend (Claude via Bedrock/Anthropic, or any OpenAI-compatible endpoint). Columns are sent in table-aware batches with sibling context. If every batch fails, the sweep raises RuntimeError (fail-fast) instead of silently proceeding with zero labels.

  2. ML Validation (VALIDATING): Run the full 6-source DST pipeline for each column. Compute per-column belief interval [Bel, Pl], conflict K, and uncertainty gap Pl - Bel. Identify uncertain columns where predictions need revisiting.

  3. Targeted Revisit (back to LLM_SWEEP): Re-classify uncertain columns with enriched context — the ML prediction, belief interval, pattern signals, and value descriptions are included in the prompt. This gives the LLM evidence it didn’t have in the first pass.

Belief-Gap Convergence

The primary convergence measure is the uncertainty gap Pl - Bel for each column’s predicted category. This directly answers “how settled is this prediction?” — unlike K, which only measures source disagreement.

A column can have K=0.9 but Bel=0.95 — the sources fought hard during combination, but the normalizing denominator (1-K) concentrated surviving mass on the agreed-upon singleton. That column’s prediction is settled despite high conflict; it doesn’t need revisiting.

Convergence criteria (all must hold):

Criterion   Metric                          Default   Meaning
Primary     mean_gap < gap_threshold        0.15      Predictions are tight
Secondary   frac_unclear < clarity_target   0.10      At most 10% of columns need clarification
Coverage    coverage >= coverage_target     0.95      95% of columns have labels

Revisit targeting: _identify_uncertain_columns() selects columns where gap > 0.3 OR Bel < bel_floor (default 0.50), sorted by gap descending (most uncertain first).
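A sketch of this targeting rule (the real _identify_uncertain_columns() operates on pipeline result objects; column names here are invented):

```python
def identify_uncertain_columns(results: dict,
                               gap_threshold: float = 0.3,
                               bel_floor: float = 0.50) -> list:
    """Select columns with a wide belief gap OR low belief, sorted most
    uncertain first (sketch). `results` maps column -> (bel, pl)."""
    uncertain = [(col, pl - bel) for col, (bel, pl) in results.items()
                 if (pl - bel) > gap_threshold or bel < bel_floor]
    return [col for col, _ in sorted(uncertain, key=lambda t: t[1],
                                     reverse=True)]

cols = identify_uncertain_columns({
    "users.email": (0.85, 0.90),   # narrow gap, high belief — settled
    "t1.col_7":    (0.40, 0.90),   # gap 0.50 > 0.3 — revisit first
    "t2.amount":   (0.45, 0.60),   # bel 0.45 < bel_floor — revisit
})
# cols == ["t1.col_7", "t2.amount"]
```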

Early stopping: The proof-of-progress paradigm monitors the gap trend. When mean gap plateaus for 2 consecutive iterations (no verifiable progress), the loop stops even if the threshold hasn’t been reached.

K as Diagnostic

Conflict K remains in logs, iteration metrics, and agent tools as a diagnostic for source disagreement. It is useful for identifying calibration issues (e.g., a pattern detector producing false positives) but does not gate convergence. The cumulative K formula K = 1 - Π(1 - Kᵢ) tends to be high (~0.5-0.8) with 6 partially correlated sources; this is expected and does not indicate poor quality.

Agent-Driven Convergence

As an alternative to the programmatic loop, the agent convergence loop (agent_loop.py) delegates revisit strategy to Claude. The agent uses 6 tools — get_conflict_report, revisit_columns, check_convergence, get_column_detail, retrain_svm, declare_converged — to reason about which columns need re-examination. The agent sees both gap-based and K-based metrics and can make nuanced decisions. See Keystone Agents.

LLM Backend

llm_backend.py provides a factory-pattern abstraction:

  • OpenAICompatibleBackend: For vLLM, GLM-4.7, and any endpoint implementing the OpenAI chat completions API. Default backend.
  • AnthropicBackend: For Claude via the Anthropic SDK.
  • BedrockBackend: For AWS Bedrock via the Converse API.
  • BedrockStructuredBackend: Production default on CAI. Uses invoke_model with tool-use for structured output (output_config is not supported on Bedrock). When extended thinking is enabled, tool_choice must be "auto" (Anthropic constraint); a text-block fallback parser handles this case. Both backends use region_from_arn() to extract the target region from cross-region inference profile ARNs.
  • CerebrasBackend: OpenAI-compatible with Cerebras-specific defaults (base_url=https://api.cerebras.ai/v1, model=zai-glm-4.7).
  • create_backend_from_cfg(cfg): Factory that reads HOCON config to select and configure the appropriate backend.

Backends fail fast when not configured — no mock fallback in production code.

Configuration

All bootstrap/LLM settings live in HOCON (config/base.conf):

classify {
    llm {
        backend = "openai_compatible"  # or "anthropic", "bedrock_structured"
        model = "glm-4.7"
        base_url = null                # vLLM endpoint URL
        columns_per_call = 50
        discount = 0.10                # DST discount for LLM mass
    }
    bootstrap {
        max_iterations = 5
        k_threshold = 0.2              # diagnostic (not convergence-gating)
        coverage_target = 0.95
        max_total_llm_calls = 5000
        # Belief-gap convergence (primary criteria)
        gap_threshold = 0.15           # mean(Pl - Bel) target
        clarity_target = 0.10          # max fraction of unclear columns
        bel_floor = 0.50               # min belief for "settled"
    }
}

Environment variable overrides follow the standard pattern: ATELIER_LLM_MODEL, ATELIER_LLM_BASE_URL, ATELIER_BOOTSTRAP_K_THRESHOLD, etc.

SHAP Explanations

Per-item feature attribution explaining why each column was classified as it was. Complements the global SAGE importance (which ranks features across the entire dataset) with item-level explanations.

Two Methods

Method                     Algorithm                  Speed                Features                      When Used
CatBoost TreeSHAP          Exact O(TLD) built-in      ~0.1s for 50 items   Grouped: embedding, discrete  Auto when CatBoost model loaded
Embedding PermutationSHAP  shap.PermutationExplainer  ~50s/item on CPU     12 named features             Tier-1, explicit request only

Auto mode (method="auto") only uses TreeSHAP — PermutationSHAP is too slow for default pipeline runs and must be explicitly requested.

Output

Each classification gains 6 extra columns:

  • shap_top1_name, shap_top1_value
  • shap_top2_name, shap_top2_value
  • shap_top3_name, shap_top3_value

These flow through to JSON, parquet, and evaluation output.

Configuration

classify.shap {
    enabled = true        # Enable SHAP in pipeline (auto-selects method)
    top_k = 3             # Number of top features to report per item
}

Configurable Discounts

All DST discount factors are configurable via HOCON. The DiscountConfig dataclass bundles all parameters with DiscountConfig.from_cfg(cfg) factory:

classify.discounts {
    cosine = 0.30                    # Cosine similarity → Theta mass
    svm = 0.20                       # SVM → Theta mass
    pattern_theta = 0.25             # Pattern detection → Theta mass (graduated by match fraction)
    name_match_exact = 0.70          # Exact label match singleton mass
    name_match_code = 0.50           # Formal code/abbrev match mass
    name_match_alias = 0.50          # Common name alias match mass
    name_match_overlap = 0.30        # Word overlap match mass
    catboost_base = 0.10             # Adaptive discount base
    catboost_variance_scale = 1.6    # Variance-to-discount scaling
    catboost_max = 0.50              # Cap on adaptive discount
    catboost_fallback = 0.15         # When no variance available
    confusable_ratio_threshold = 3.0 # CatBoost confusable pair threshold
}

Environment variable overrides: ATELIER_DISCOUNT_COSINE, ATELIER_DISCOUNT_SVM, etc.

Milestones

Milestone  Scope                                                                                        Status
M0         Cosine + pattern + name match, FSM, pipeline E2E                                             Done
M0.5       Schema fix, pignistic probability, HierarchicalClassification                                Done
M1         LLM evidence source, bootstrap convergence loop, LLM↔ML validation                           Done
M2         CatBoost + SVM + synthetic data, 6 evidence sources, Bedrock/Cerebras backends               Done
M3         Evaluation framework, E2E synth-train-eval, realistic mock LLM, SAGE importance              Done
M4         SHAP explanations, configurable discounts, thread-safe model loading                         Done
M5         Data sources + versioning, OOTB onboarding (316-leaf ontology, 25 sample tables)             Done
M6         Agent-driven convergence loop (6 Claude tools), synth framework (316+ generators)            Done
M7         Monte Carlo stratified sampling, label propagation, background SHAP                          Done
M8         GPU acceleration (NVIDIA driver symlink, batch encoding), meta-tagging overlay               Done
M8.5       SVM signals alignment (Pipeline+FeatureUnion adoption, evidence independence documentation)  Done
M9         Frontier-label SVM training (cross-model distillation via MC sampling)                       Done
M10        MLflow experiment tracking, Hive data source integration                                     Proposed

Monte Carlo Sampling

At small corpus sizes (< 200 columns), every column receives direct frontier-LLM classification. As the corpus scales to thousands or millions of columns, this becomes prohibitively expensive. Monte Carlo stratified sampling selects a representative subset for LLM inference and propagates labels cheaply via embedding similarity.

This is a zero-cost optimization: below the threshold, the pipeline behaves identically to before. The MC layer activates transparently at scale.

Three-Phase MC Layer

The MC layer operates between SAMPLING and LLM_SWEEP in the existing pipeline. No new FSM states — it runs as sub-phases.

SAMPLING
  ├─ [existing] Extract features for all columns
  ├─ Pre-classify: cheap M0 evidence (name, pattern, cosine) — no LLM
  ├─ Stratify: group by preliminary category + uncertainty
  └─ Select MC sample: importance-weighted within strata

LLM_SWEEP
  ├─ [existing] Frontier LLM classifies MC sample (not all columns)
  └─ Propagate: extend labels to remaining corpus via embedding similarity

VALIDATING
  └─ [existing] Full 6-source DST on ALL columns
      (propagated labels enter as discounted LLM evidence)
      → High-gap / low-belief propagated columns escalate to revisit

Phase 1: Pre-Classification

Run M0 evidence sources only (no LLM, no ML models). For each column:

  • Name matching → best category + mass
  • Pattern detection → matched categories
  • Cosine similarity → top-K categories + scores

Returns a preliminary category code + confidence for every column. Uses the existing name_match_to_mass(), pattern_to_mass(), classify_cosine() functions from the pipeline.

Phase 2: Stratification

Partition columns by their preliminary category code:

  • Rare strata (< 2 x min_per_stratum members): fully sampled
  • UNRESOLVED stratum (M0 sources disagree or low confidence): fully sampled
  • Normal strata: proportional allocation with importance weighting

Phase 3: Sample Selection

Within each normal stratum, select columns via importance-weighted random sampling without replacement. Importance weight per column:

w = (1 - confidence) × (1 + uncertainty)

where confidence = max cosine similarity, uncertainty = ratio of 2nd-best to 1st-best similarity (ambiguity measure).

Total budget: min(max_frontier_columns, total × sample_fraction)
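A sketch of the weighting formula and the in-stratum draw. This is illustrative: the real logic lives in select_sample() and also honors min_per_stratum and the budget formula above:

```python
import random

def importance_weights(columns: dict) -> dict:
    """w = (1 - confidence) * (1 + uncertainty), where confidence is the top
    cosine similarity and uncertainty the 2nd-best/1st-best ratio (sketch).
    `columns` maps name -> (top_sim, second_sim)."""
    return {name: (1.0 - top) * (1.0 + second / top)
            for name, (top, second) in columns.items()}

def weighted_sample(weights: dict, k: int, rng: random.Random) -> list:
    """Importance-weighted sampling without replacement (sequential draws)."""
    pool = dict(weights)
    chosen = []
    for _ in range(min(k, len(pool))):
        names = list(pool)
        pick = rng.choices(names, weights=[pool[n] for n in names], k=1)[0]
        chosen.append(pick)
        del pool[pick]                   # without replacement
    return chosen

w = importance_weights({"a": (0.9, 0.3),    # confident, unambiguous -> low w
                        "b": (0.5, 0.45)})  # uncertain, ambiguous -> high w
sample = weighted_sample(w, k=1, rng=random.Random(0))
```

Column "b" (low confidence, near-tied top-2 categories) gets roughly 7× the weight of "a", so the frontier budget concentrates on the columns the cheap M0 evidence cannot settle.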

Label Propagation

After the LLM sweep on frontier columns:

  1. For each propagation column, find nearest frontier column by cosine similarity (stratum-local to limit search space)
  2. If similarity >= propagation_threshold: assign same label with discounted confidence
  3. If similarity < threshold: column gets no LLM evidence in DST

Propagated labels enter DST fusion with a higher discount factor (0.30 vs 0.10 for direct LLM) — they carry less evidential mass. If M0 sources disagree with the propagated label, conflict K rises and the existing targeted-revisit loop automatically escalates the column to the frontier model.
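A minimal propagation sketch (pure-Python cosine for clarity; the real propagate_labels() uses batched embeddings and stratum-local search, and the vectors here are invented):

```python
def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def propagate(frontier: list, prop_embs: list,
              threshold: float = 0.85, discount: float = 0.30) -> list:
    """For each propagation column, copy the label of the nearest frontier
    column if similarity clears the threshold; otherwise no LLM evidence.
    `frontier` is a list of (embedding, label) pairs (sketch)."""
    out = []
    for emb in prop_embs:
        label, best = max(((lbl, cosine(emb, femb)) for femb, lbl in frontier),
                          key=lambda t: t[1])
        if best >= threshold:
            out.append((label, discount))   # discounted LLM evidence in DST
        else:
            out.append((None, None))        # no LLM evidence for this column
    return out

frontier = [([1.0, 0.0], "email"), ([0.0, 1.0], "ssn")]
labels = propagate(frontier, [[0.96, 0.28], [0.6, 0.8]])
# first column: sim 0.96 >= 0.85 -> ("email", 0.30); second: 0.8 -> (None, None)
```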

Why This Works with DST

The evidence fusion framework makes MC sampling robust:

  • Propagated evidence carries less mass (more goes to Theta/ignorance)
  • M0 agreement with propagated label → high belief, narrow gap (good)
  • M0 disagreement with propagated label → wide gap → frontier revisit
  • Escalation is automatic — no special MC-aware revisit logic needed

Scaling Projections

GitTables corpus: 1.7M tables today, 10M+ near-term. Average 8-12 columns per table = 15M-120M columns at full scale.

Corpus   MC Mode      Frontier Calls   Propagated   Cost Reduction
50       Passthrough  50 (all)         0            0%
500      Active       ~75 (15%)        ~425         85%
5,000    Active       ~500 (cap)       ~4,500       90%
50K      Active       ~500 (cap)       ~49.5K       99%
500K     Active       ~500 (cap)       ~499.5K      99.9%
15M      Active       ~500 (cap)       ~15M         >99.99%
120M     Active       ~500 (cap)       ~120M        >99.99%

At the max_frontier_columns=500 cap, stratified importance sampling ensures every category stratum gets at least min_per_stratum=3 exemplars. Uniform random sampling at 500/15M would miss rare categories entirely.

Scale-Critical Design Decisions

  • Embedding computation: batch GPU encoding at ~2,768 texts/s (RTX 4090); 15M columns takes ~90 minutes. One-time cost, GPU-parallelizable.
  • Stratum-local propagation: similarity search within each stratum (not across the full corpus) to limit memory and compute.
  • Memory: 15M columns × 200B = ~3GB for metadata; 15M × 1.5KB = ~22GB for embeddings. Requires streaming/chunked processing.
  • Escalation budget: ~50-100 additional frontier calls from revisit. Total frontier budget: ~600 LLM API calls for a 15M-column corpus.

Configuration

classify {
  monte_carlo {
    min_corpus_size = 200              # Below this, classify everything
    min_corpus_size = ${?ATELIER_MC_MIN_CORPUS_SIZE}
    sample_fraction = 0.15             # Fraction for frontier model
    sample_fraction = ${?ATELIER_MC_SAMPLE_FRACTION}
    min_per_stratum = 3                # Minimum samples per category stratum
    max_frontier_columns = 500         # Hard cap on frontier columns
    max_frontier_columns = ${?ATELIER_MC_MAX_FRONTIER}
    propagation_threshold = 0.85       # Cosine sim for propagation
    propagation_threshold = ${?ATELIER_MC_PROPAGATION_THRESHOLD}
    propagation_discount = 0.30        # LLM mass discount for propagated labels
  }
}

Module Structure

src/atelier/classify/monte_carlo.py
├── MCConfig          — Frozen dataclass with from_cfg() factory
├── PreClassification — Per-column M0 result (code + confidence + uncertainty)
├── Stratum           — Column group by preliminary category
├── MCPlan            — Sampling plan (frontier + propagation sets)
├── pre_classify()    — Run M0 evidence for all columns
├── stratify()        — Group by preliminary category + uncertainty
├── select_sample()   — Importance-weighted selection within strata
└── propagate_labels() — Embedding-similarity label extension

GPU Acceleration

Atelier uses GPU acceleration for sentence-transformer embedding computation and CatBoost training/inference. GPU support is auto-detected at startup with graceful fallback to CPU.

Detection

gpu.preflight_gpu() runs once at config load time and caches the result for the process lifetime. Three-step detection:

  1. nvidia-smi probe: subprocess call to detect device count, names, VRAM, and driver CUDA version
  2. CUDA version extraction: parse nvidia-smi header for driver compatibility
  3. PyTorch check: torch.cuda.is_available() confirms runtime support

The result is a GpuInfo dataclass with:

  • available — whether CUDA is usable
  • device_count — number of GPUs
  • devices — device names with VRAM (e.g., “NVIDIA RTX 4090 24GB”)
  • resolved_device — "cuda" or "cpu" for model initialization
  • warnings — non-blocking issues (version mismatches, library path hints)

In devenv (nix-managed), CUDA libraries are isolated from the host system. The GPU module handles the nix+CUDA compatibility pattern by detecting the driver library path and ensuring PyTorch can find it. This avoids the common nix pitfall where torch.cuda.is_available() returns False despite GPUs being present.
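A simplified sketch of the detection sequence (the real preflight_gpu() additionally parses the driver CUDA version from the nvidia-smi header and applies the nix driver-path fix):

```python
import shutil
import subprocess
from dataclasses import dataclass, field

@dataclass
class GpuInfo:
    available: bool = False
    device_count: int = 0
    devices: list = field(default_factory=list)
    resolved_device: str = "cpu"
    warnings: list = field(default_factory=list)

def preflight_gpu_sketch() -> GpuInfo:
    """Three-step probe: nvidia-smi, device query, PyTorch confirmation."""
    info = GpuInfo()
    if shutil.which("nvidia-smi") is None:
        return info                          # no driver tooling -> CPU
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=10, check=True)
        info.devices = [l.strip() for l in out.stdout.splitlines() if l.strip()]
        info.device_count = len(info.devices)
    except (subprocess.SubprocessError, OSError) as exc:
        info.warnings.append(f"nvidia-smi probe failed: {exc}")
        return info
    try:
        import torch                         # runtime-level confirmation
        info.available = torch.cuda.is_available()
    except ImportError:
        info.warnings.append("torch not installed")
    info.resolved_device = "cuda" if info.available else "cpu"
    return info
```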

Integration Points

Sentence-Transformer Embedding

embedding.py calls preflight_gpu() before initializing the SentenceTransformer model, passing device=gpu_info.resolved_device:

gpu_info = preflight_gpu()
model = SentenceTransformer("all-MiniLM-L6-v2", device=gpu_info.resolved_device)

GPU batch encoding achieves ~2,768 texts/second on RTX 4090 (vs ~400/s on CPU). This matters at scale: 15M columns takes ~90 minutes on GPU vs ~10 hours on CPU.

CatBoost Training

CatBoost automatically uses GPU when available via its task_type parameter. The virtual ensemble posterior sampling that drives uncertainty quantification benefits from GPU parallelism.

Preflight Reporting

GPU status appears in just preflight output and in the /api/status gateway endpoint, giving operators immediate visibility into whether GPU acceleration is active.

Configuration

GPU detection is automatic — no configuration needed. The system probes hardware and falls back gracefully:

  • GPU available: uses CUDA for all embedding and training operations
  • GPU detected but CUDA unavailable: warns about library path issues, falls back to CPU
  • No GPU: runs entirely on CPU with no warnings

CAI Considerations

CAI ML workloads can request GPU runtimes. When running on a GPU-enabled CAI session:

  • The NVIDIA drivers are provided by the container runtime
  • PyTorch CUDA support depends on the Python runtime image
  • GPU memory is shared with other processes in the session
  • Background SHAP computation can be memory-intensive; monitor with nvidia-smi if running alongside large models

Synthetic Data & Training

The classification pipeline includes two ML evidence sources — CatBoost and SVM — that require training data. Atelier generates synthetic training data from the controlled vocabulary, trains both classifiers, and uses them as independent evidence sources in DST fusion.

Synth Generators

synth_generators.py is the single source of truth for 316+ hand-coded value generators shared across the synth framework, sample source generation, and the registry.

Each generator is a callable (rng: random.Random) -> str that produces realistic values for a category. Examples:

  • EMAIL → "j.smith@example.com", "alice.chen@corp.net"
  • SSN → "123-45-6789" (formatted US Social Security Number)
  • LATITUDE → "41.8781" (valid geographic coordinate)
  • CURRENCY_CODE → "USD", "EUR", "JPY"
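Two hypothetical generators in this style — the value pools below are invented for illustration, not copied from synth_generators.py:

```python
import random

def gen_email(rng: random.Random) -> str:
    """(rng) -> str generator in the synth_generators.py calling convention."""
    local = rng.choice(["alice", "bob", "j.smith", "m.chen"])
    domain = rng.choice(["example.com", "corp.net", "mail.org"])
    return f"{local}@{domain}"

def gen_ssn(rng: random.Random) -> str:
    """Formatted US SSN: AAA-GG-SSSS."""
    return (f"{rng.randint(100, 899):03d}-"
            f"{rng.randint(1, 99):02d}-"
            f"{rng.randint(1, 9999):04d}")

rng = random.Random(42)                  # seeded for reproducible synth data
email, ssn = gen_email(rng), gen_ssn(rng)
```

Passing an explicit random.Random keeps synthetic datasets reproducible across training runs, which matters when comparing model versions.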

Three-Layer Generator Registry

synth_registry.py builds a complete generator set for any vocabulary through a priority-based registry:

Priority     Source      Description
1 (highest)  Hand-coded  From GENERATORS dict in synth_generators.py
2            Template    Real sample values with mild perturbation (±10% numeric jitter, character substitution)
3 (lowest)   Inferred    Regex pattern matching on category metadata (description, common_names)

registry = GeneratorRegistry.from_vocabulary(category_set)
# registry.coverage_summary() → {"hand-coded": 250, "template": 40, "inferred": 26}

The registry provides coverage_report() and coverage_summary() to identify categories without generators — important for vocabulary expansion.

Column Name Generation

Synthetic training data deliberately uses diverse column names to prevent classifiers from relying on name heuristics:

  • Semantic names: email_address, emailAddress, EMAIL_ADDR (snake_case, camelCase, uppercase variants, synonym-based)
  • Opaque names: field_42, col_abc, v_123 (~25% of columns)

This forces the ML models to learn from value patterns and context, not just column naming conventions.

ML Training Pipeline

ml_train.py orchestrates training for both classifiers:

synth_*.csv + reference_labels.json
        ↓
   _load_synth_data()
        ↓
   ┌────┴────┐
   ↓         ↓
  SVM     CatBoost
   ↓         ↓
 svm.pkl  catboost.cbm

SVM Path (Signals Architecture)

The SVM classifier uses the Pipeline + FeatureUnion composition adopted wholesale from the Signals project:

  1. Build short text from column name + type + sample values via build_svm_text()
  2. FeatureUnion extracts dual TF-IDF features:
    • Character n-grams (3-6, char_wb analyzer) — captures subword patterns
    • Word n-grams (1-2) — captures multi-word patterns
  3. CalibratedClassifierCV(LinearSVC, method="sigmoid") — Platt scaling for calibrated probability estimates
  4. _min_class_count() guard prevents calibration CV crash on small classes
  5. Save to .pkl + .classes.json via joblib

The SVM operates on sparse lexical features — architecturally independent from the dense sentence-transformer embedding used by cosine and CatBoost. See Classification Pipeline for the full independence analysis.
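The steps above can be sketched with scikit-learn. Hyperparameters such as cv and the n-gram ranges are illustrative of the description, not copied from svm_classifier.py:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

def build_svm_pipeline(cv: int = 3) -> Pipeline:
    """Dual TF-IDF + LinearSVC + Platt scaling (sketch of the composition)."""
    features = FeatureUnion([
        # Subword patterns: character n-grams within word boundaries
        ("char", TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 6))),
        # Multi-word patterns: unigrams + bigrams
        ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ])
    # Sigmoid (Platt) calibration turns LinearSVC margins into probabilities
    clf = CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=cv)
    return Pipeline([("tfidf", features), ("svm", clf)])
```

The cv guard matters here: calibration splits the data cv ways, which is why the real code's _min_class_count() check exists — a class with fewer members than cv folds would crash the calibration CV.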

CatBoost Path (GPU-accelerated)

  1. Extract 12 features per column via features.extract_features()
  2. Compute sentence-transformer embeddings (384-dim, GPU batch encoding)
  3. Fit CatBoostColumnClassifier with:
    • loss_function="MultiClass"
    • posterior_sampling=True (virtual ensemble uncertainty)
    • auto_class_weights="Balanced" (handle imbalanced categories)
  4. Save to .cbm + .classes.json

Virtual Ensemble Uncertainty

CatBoost’s posterior_sampling=True enables Bayesian uncertainty quantification via virtual ensembles. The classifier produces not just class probabilities but per-class variance estimates. High variance translates to a higher DST discount factor — uncertain ML predictions carry less evidential weight in the fusion.
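One plausible shape for this variance-to-discount mapping, using the configured catboost_base / catboost_variance_scale / catboost_max / catboost_fallback knobs — the exact formula in the codebase may differ:

```python
def catboost_discount(variance,
                      base: float = 0.10,       # catboost_base
                      scale: float = 1.6,       # catboost_variance_scale
                      cap: float = 0.50,        # catboost_max
                      fallback: float = 0.15):  # catboost_fallback
    """Map virtual-ensemble variance to a DST discount (hypothetical sketch):
    higher predictive variance -> larger discount -> less evidential mass."""
    if variance is None:
        return fallback                  # no variance available
    return min(cap, base + scale * variance)

catboost_discount(None)   # 0.15 — fallback when variance is unavailable
catboost_discount(0.05)   # 0.18 — mild uncertainty, mild discount
catboost_discount(1.0)    # 0.50 — heavy uncertainty, capped discount
```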

Frontier-Label SVM Training (M9)

After the bootstrap LLM sweep, the pipeline has high-quality frontier labels from the Opus-tier model. train_svm_on_frontier_labels() blends these with synthetic data and retrains the SVM progressively:

synth_*.csv + frontier LLM labels
        ↓
  train_svm_on_frontier_labels()
        ↓
  ┌─────────────────────────────────────┐
  │  Synth texts  +  Frontier texts     │
  │  (vocabulary   (corpus-specific     │
  │   coverage)     signal)             │
  └──────────────┬──────────────────────┘
                 ↓
         SVMClassifier.fit()
                 ↓
         svm_frontier.pkl

Three-Phase Progressive Retraining

  1. Post-sweep (always): After the first LLM sweep labels frontier columns, retrain immediately so the SVM carries corpus-specific signal into the first ML validation pass.

  2. Iterative (during convergence): In the programmatic loop, retrain after each revisit iteration that adds ≥10 new frontier labels. In the agent-driven loop, the agent calls retrain_svm when it judges enough new labels have accumulated.

  3. Final (only if not converged): Last-resort retrain with all accumulated labels before the final classification pass. Skipped when already converged (the last iteration’s model is already in use).

Why Blend Synth + Frontier

  • Synth data: Covers all vocabulary categories — ensures the SVM can classify categories not present in the frontier sample
  • Frontier labels: Corpus-specific patterns — column names, value formats, and type distributions that synth generators can’t capture
  • Together: Breadth from synth, depth from frontier

Hot-Swap Mechanism

After retraining, the SVM is hot-swapped via:

  1. ml_inference.reset() — clears cached models and paths
  2. ml_inference.configure_paths(svm_path=..., catboost_path=...) — points the lazy-loader at the frontier-trained model

The model file lives at results_dir/svm_frontier.pkl (run-specific), preserving build/models/svm.pkl as the synth-trained fallback.

Train-Eval Cycle

train_eval_cycle.py orchestrates the full loop:

  1. Generate synthetic data from vocabulary
  2. Train CatBoost + SVM models
  3. Classify using the trained models
  4. Evaluate against the curated reference

This runs as part of the classification pipeline when models don’t exist yet, or can be triggered explicitly for experimentation.

SAGE Feature Importance

sage.py computes global feature importance via permutation-based SAGE values. Each of the 12 discrete features is ablated and the classification accuracy impact measured:

  • High SAGE value = feature is critical for classification
  • Low SAGE value = feature adds little discriminative power

SAGE runs on the frontier sample when MC sampling is active (representative subset), reducing computation at scale.

SHAP Per-Item Attribution

shap_explanations.py provides per-column explanations for why each column was classified as it was:

| Method | Algorithm | Speed | When Used |
|---|---|---|---|
| CatBoost TreeSHAP | Exact O(TLD), built-in | ~0.1s for 50 items | Auto when CatBoost loaded |
| PermutationSHAP | shap.PermutationExplainer | ~50s/item | Explicit request only |

Each classification gains 6 SHAP columns: shap_top1_name, shap_top1_value, shap_top2_name, shap_top2_value, shap_top3_name, shap_top3_value.
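Flattening a per-column attribution vector into those six fields might look like the following sketch — feature names here are illustrative, and this is not the shap_explanations.py implementation:

```python
# Sketch: rank features by absolute attribution and emit the six
# shap_top{1,2,3}_{name,value} fields described above.
def top3_shap_columns(shap_values: dict[str, float]) -> dict[str, object]:
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]), reverse=True)
    out: dict[str, object] = {}
    for i, (name, value) in enumerate(ranked[:3], start=1):
        out[f"shap_top{i}_name"] = name
        out[f"shap_top{i}_value"] = value
    return out

# hypothetical feature names for illustration
row = top3_shap_columns({"name_match": 0.42, "luhn_valid": -0.31,
                         "tfidf_email": 0.05, "len_mean": 0.01})
```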

Background SHAP

For large corpora, SHAP can run in a background thread while the pipeline proceeds to EVALUATING. Controlled by the HOCON flag:

classify {
  background_analysis = true
  background_analysis = ${?ATELIER_BACKGROUND_ANALYSIS}
}

Set to false on CAI if background threads cause runtime issues.

Key Files

| File | Role |
|---|---|
| synth_generators.py | 316+ hand-coded value generators |
| synth_registry.py | Three-layer registry: hand-coded > template > inferred |
| synth.py | Synthetic data generation with diverse column names |
| ml_train.py | Training orchestrator: synth-only + frontier-label blended training |
| catboost_classifier.py | CatBoost with virtual ensemble uncertainty |
| svm_classifier.py | Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals) |
| train_eval_cycle.py | Generate → train → classify → evaluate loop |
| sage.py | Global SAGE feature importance |
| shap_explanations.py | Per-item SHAP attribution |

Embeddings

The Embeddings page provides interactive visualization of classification results. It renders 2D projections of embedding vectors, allowing users to explore clusters, search data points, and cross-filter by metadata columns.

Architecture

Browser ──GET /api/datasets/{id}/data (parquet)──► FastAPI Gateway
  ├─ React App
  ├─ DuckDB WASM — in-browser SQL engine, loads parquet directly
  └─ EmbeddingAtlas component — WebGPU scatter plot, density contours, search + filters
The viewer runs entirely in the browser. DuckDB WASM loads parquet data locally and the EmbeddingAtlas component (from Apple’s embedding-atlas library) renders the visualization using WebGPU with WebGL 2 fallback.

Data Flow

  1. Backend serves the parquet file via /api/datasets/{id}/data
  2. React fetches the parquet and loads it into DuckDB WASM via a Mosaic coordinator
  3. EmbeddingAtlas queries the DuckDB table for rendering: x/y coordinates, categories, text for tooltips
  4. All filtering, search, and aggregation happens client-side — no round-trips to the server

Parquet Schema

The Embeddings page expects parquet files with these columns:

| Column | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique row identifier |
| x | float32 | yes | 2D projection x-coordinate (UMAP) |
| y | float32 | yes | 2D projection y-coordinate (UMAP) |
| text | string | recommended | Tooltip and search text |
| category | string | recommended | Color-coding category |

Additional columns (e.g., source_table, belief, plausibility) are automatically available as cross-filter charts.
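The schema contract above can be expressed as a small validation helper. The column names and types come from the table; the function itself is an illustrative sketch, not part of the Atelier codebase:

```python
# Required/recommended column sets taken from the schema table above.
REQUIRED = {"id": "string", "x": "float32", "y": "float32"}
RECOMMENDED = {"text", "category"}

def check_embedding_schema(columns: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (errors, warnings) for a parquet column mapping name -> type."""
    errors = [f"missing required column: {name}" for name in REQUIRED if name not in columns]
    errors += [
        f"column {name!r} should be {want}, got {columns[name]}"
        for name, want in REQUIRED.items()
        if name in columns and columns[name] != want
    ]
    warnings = [f"recommended column absent: {name}" for name in RECOMMENDED - columns.keys()]
    return errors, warnings

errors, warnings = check_embedding_schema(
    {"id": "string", "x": "float32", "y": "float64", "belief": "float32"}
)
```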

GitTables Dataset

The initial dataset is derived from the GitTables CTA benchmark — 2,517 columns extracted from real tables, annotated with 122 DBpedia property types. These instance labels serve as the controlled vocabulary to be grounded in the SIGDG ontology.

To prepare the visualization parquet:

# From signals evaluation output (recommended)
just prepare-gittables ~/local/src/cldr/signals/build/gittables_eval.parquet

# Then seed the database
just seed

The preparation script computes sentence-transformer embeddings and UMAP 2D projections. The resulting parquet includes DST evidence fusion columns (belief, plausibility, uncertainty gap) when derived from the signals evaluation output.

Naming: Embeddings vs Apache Atlas

The Embeddings page is powered by Apple’s embedding-atlas library. This is unrelated to Apache Atlas, the Cloudera metadata governance catalog used by the signals pipeline.

  • Embeddings (Atelier) — Interactive scatter plot of classification embeddings
  • Apache Atlas (Cloudera/signals) — Metadata governance catalog on port 21000

To avoid confusion, all user-facing surfaces use “Embeddings”. The embedding-atlas library name appears only in developer documentation and package.json.

Data Sources & Versioning

Atelier organizes classification work around data sources — each source contains input tables, and every pipeline run against a source produces a new dataset version. This replaces the earlier flat dataset model and enables the OOTB onboarding experience.

Data Model

DataSource (1)                      Dataset versions (N)
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "ootb-sample"       │───1:N──│ v3 (active) — 2 min ago  │
│ type: "sample"          │        │ v2 — yesterday           │
│ display: "Sample"       │        │ v1 — built-in            │
│ vocab_mode: "universal" │        └──────────────────────────┘
└─────────────────────────┘
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "hive-prod-default" │───1:N──│ v1 (active) — 1 hour ago │
│ type: "hive"            │        └──────────────────────────┘
│ display: "hive:prod/…"  │
│ vocab_mode: "hive"      │
└─────────────────────────┘

Source Types

| Type | Tables loaded from | Vocabulary | Created by |
|---|---|---|---|
| sample | data/sample/tables/*.csv | Expanded ICE ontology (316 leaves) | Auto-seeded on first boot |
| hive | CAI data connection | Domain annotations from vocab_uri | User creates via Status page |
| synth | data/synth/tables/*.csv | Domain annotations from vocab_uri | Generated by scripts/generate_synth_source.py |

Vocabulary routing: For in-situ classification, the customer’s domain vocabulary IS the classification target — the LLM reads labels and descriptions and classifies into the domain’s hierarchical dot-codes. The annotations table location is configured per source via vocab_uri (e.g. meta.vocab, meta.annotations), decoupling data tables from the vocabulary. Multiple sources can share the same annotations table.

Future work: A portable pre-trained model (classify-ICE-then-map) would classify against the built-in ICE vocabulary and translate results to customer terms via VocabMapping. This requires dedicated training hardware and is not yet implemented.

Database Schema

CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,          -- 'sample' | 'hive'
    source_uri TEXT NOT NULL DEFAULT '',
    display_name TEXT NOT NULL,
    vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
    vocab_uri TEXT NOT NULL DEFAULT '',  -- e.g. 'meta.vocab', 'meta.annotations'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT                       -- JSON: table_count, column_count
);

-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;

Vocabulary Routing

When a pipeline run starts, the source_id determines which vocabulary loads:

  • ootb-sample: load_sample_vocabulary() → data/sample/ontology.json (316 BFO-grounded leaves across the CCO ICE trichotomy)
  • hive/synth: Domain annotations loaded directly from the table specified by vocab_uri. The domain vocabulary IS the classification target — no composition with the universal base. Hive sources always require an annotations table.
  • No source: Falls back to universal vocabulary (16 PII leaves)
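The routing rule reduces to a small dispatch. The loader names echo the text above, but the function itself is illustrative rather than the pipeline's actual API:

```python
from __future__ import annotations

# Illustrative dispatch mirroring the vocabulary-routing bullets above.
def resolve_vocabulary_loader(source_type: str | None) -> str:
    if source_type == "sample":
        return "load_sample_vocabulary"           # data/sample/ontology.json, 316 leaves
    if source_type in ("hive", "synth"):
        return "load_annotations_from_vocab_uri"  # domain vocabulary IS the target
    return "load_universal_vocabulary"            # fallback: 16 PII leaves
```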

LLM Robustness

The LLM classification batch uses adaptive sizing to avoid context truncation. With large vocabularies (>200 categories), the system prompt embedding the full category table can consume significant context.

  • Adaptive batch sizing: _estimate_safe_batch_size() reduces columns_per_call for large vocabularies (e.g. 290 categories → 41)
  • Truncation retry: When LLMResponse.truncated is detected, the batch is halved and retried recursively until all columns are classified
  • Metrics: truncation_count and effective_batch_size tracked in BootstrapState and exposed via the agent’s check_convergence tool
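The two mechanisms can be sketched as follows — the real _estimate_safe_batch_size heuristic and the LLM call signature differ; this only illustrates the shrink-and-halve control flow:

```python
# Hedged sketch of adaptive batch sizing plus truncation-halving retry.
def estimate_safe_batch_size(n_categories: int, base: int = 50) -> int:
    # shrink roughly linearly once the vocabulary exceeds ~200 categories
    return base if n_categories <= 200 else max(8, base * 200 // n_categories)

def classify_with_retry(columns, batch_size, call):
    """call(batch) -> (labels, truncated). On truncation, halve the batch and recurse."""
    results = {}
    for i in range(0, len(columns), batch_size):
        batch = columns[i:i + batch_size]
        labels, truncated = call(batch)
        if truncated and len(batch) > 1:
            half = len(batch) // 2
            results.update(classify_with_retry(batch[:half], half, call))
            results.update(classify_with_retry(batch[half:], half, call))
        else:
            # single-column batches are accepted as-is even if flagged truncated
            results.update(zip(batch, labels))
    return results
```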

Sample Source

The built-in “Sample” source (source_id ootb-sample) ships with Atelier so new deployments show meaningful data immediately. When the landing page loads and “Connected” turns green, the stats cards show 316 Terms and 316 Entities. The ootb- prefix in the id is an internal marker distinguishing shipped sources from user-registered connections — it is not shown in the UI.

Expanded Vocabulary (ICE.* Ontology)

The vocabulary follows the CCO ICE (Information Content Entity) trichotomy, grounded in BFO via atelier-vocab.ttl:

ICE (root) ≡ cco:InformationContentEntity
├── ICE.NONSENSITIVE
│   ├── ICE.NONSENSITIVE.DESIGNATIVE   ⊑ cco:DesignativeICE
│   │   ├── .NAME (.PERSON, .ORG, .PRODUCT, .SCIENTIFIC)
│   │   ├── .CODE (.ID, .ABBREV, .POSTAL)
│   │   ├── .GEO  (.COUNTRY, .REGION, .CITY, .LOCATION)
│   │   ├── .REF  (.CITATION, .VERSION, .SOURCE)
│   │   └── .TITLE
│   ├── ICE.NONSENSITIVE.DESCRIPTIVE   ⊑ cco:DescriptiveICE
│   │   ├── .TEXT (.DESCRIPTION, .COMMENT, .ABSTRACT, .DEFINITION)
│   │   ├── .CATEGORICAL (.TYPE, .CATEGORY, .RANK, .LANGUAGE)
│   │   ├── .MEASUREMENT (~20 subtypes)
│   │   └── .TEMPORAL (.DATE, .YEAR, .DURATION, .PERIOD, …)
│   └── ICE.NONSENSITIVE.PRESCRIPTIVE  ⊑ cco:PrescriptiveICE
│       └── .FORMAT, .FORMULA, .ROUTE, .ROLE
├── ICE.SENSITIVE
│   ├── ICE.SENSITIVE.PID (~40 leaves: CONTACT, IDENTITY, FINANCIAL, HEALTH)
│   ├── ICE.SENSITIVE.TECHNICAL (IPADDR, DEVID, URL, HOSTNAME, …)
│   └── ICE.SENSITIVE.BUSINESS (.TRADE_SECRET, .CONTRACT_VALUE, …)
└── ICE.METADATA
    └── .TIMESTAMP, .RECID, .STATUS, .VERSION, .CREATED_BY, …

351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.

Design principle: every category is our own BFO-grounded term. External sources (GitTables, meta-tagging) inform which conceptual space to cover; we never import their raw tags. The mapping goes outward from our vocabulary via atelier-vocab.ttl, not inward.

Sample Tables

25 mixed-domain tables with 316 columns (100 rows each). Tables are deliberately cross-domain — a customers table contains identity, contact, metadata, and categorical columns — so the classification pipeline cannot rely on table name alone.

~25% of columns use opaque names (field_42, var_abc, col_xyz) to exercise the pipeline’s ability to classify from values and context rather than column name heuristics.

Generated by scripts/generate_sample_source.py. The curated reference for the Sample source fixture is committed in data/sample/reference_labels.json (scope: fixture-only, for OOTB demo and unit tests).

For UAT / production evaluation, the curated reference lives at build/meta-tagging-clean/curated_reference.csv (gitignored) — built by scripts/parity/build_curated_reference.py from direct reference-column evidence plus name-index lookup with Ontology > Annotation > Common Names priority. UAT’s own classification outputs are provisional predictions and are scored against this curated reference at build/results/parity/delta_report.md.
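The Ontology > Annotation > Common Names priority amounts to a first-match lookup over evidence sources. Field and function names here are illustrative, not the build_curated_reference.py API:

```python
from __future__ import annotations

# Hypothetical sketch of the evidence-priority merge described above.
PRIORITY = ("ontology", "annotation", "common_names")

def resolve_reference_label(evidence: dict[str, str | None]) -> str | None:
    """Return the highest-priority non-empty label, or None if all are absent."""
    for source in PRIORITY:
        label = evidence.get(source)
        if label:
            return label
    return None
```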

Auto-Import on First Boot

The gateway seeds the Sample source (id ootb-sample) via a FastAPI lifespan context manager:

  1. Check if ootb-sample source has any dataset versions
  2. If none, read sample_source_stats() (table count, column count)
  3. Create dataset version 1 with the stats as metadata
  4. Update source metadata JSON

This runs once at startup. If the database isn’t ready (migrations haven’t run), seeding is silently skipped.
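The idempotent seeding logic can be sketched with a plain dict standing in for the database — the real code goes through the DAO layer and sample_source_stats():

```python
# Illustrative first-boot seeding: a no-op when versions already exist.
def seed_sample_source(db: dict) -> bool:
    versions = db.setdefault("versions", {}).setdefault("ootb-sample", [])
    if versions:
        return False  # already seeded — restart is a no-op
    stats = {"table_count": 25, "column_count": 316}  # stands in for sample_source_stats()
    versions.append({"version_number": 1, "metadata": stats, "is_active": True})
    db.setdefault("sources", {})["ootb-sample"] = {"metadata": stats}
    return True

db: dict = {}
first_boot = seed_sample_source(db)
restart = seed_sample_source(db)
```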

API

REST Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List all data sources |
| /api/data-sources | POST | Create a new data source |
| /api/datasets?source_id=X | GET | List versions for a source |
| /api/datasets/{id}/activate | POST | Set a version as active |
| /api/vocabulary/stats?source_id=X | GET | Term count (source-aware) |
| /api/fsm/start?source_id=X | POST | Start pipeline for a source |

gRPC RPCs

| RPC | Description |
|---|---|
| ListDataSources() | List all sources |
| StartClassification(source_id=…) | Start pipeline for a source |

UI Integration

The Status page has two new cards:

  • Data Source card: dropdown selector for sources + version table showing version number, column count, timestamp, and summary. Click a row to activate that version.
  • Classification Pipeline card: “Start Classification” passes activeSourceId to /api/fsm/start?source_id=…

The Landing page stats cards reflect the active source:

  • Terms: vocabulary size for the active source (316 for the Sample source)
  • Entities: column count from the active dataset version
  • Sources badge: shows count when multiple sources exist

DatasetContext

interface DatasetContextValue {
  sources: DataSourceInfo[];
  activeSourceId: string | null;
  setActiveSourceId: (id: string) => void;
  datasets: DatasetInfo[];           // for activeSourceId
  activeDatasetId: string | null;
  setActiveDatasetId: (id: string) => void;
  refreshSources: () => Promise<void>;
  refreshDatasets: () => Promise<void>;
}

Key Files

| File | Role |
|---|---|
| db/migrations/20260414…_data_sources_and_versions.sql | Schema migration |
| src/atelier/db/model.py | DataSource ORM model |
| src/atelier/db/dao.py | Source + version DAO methods |
| src/atelier/classify/sampler.py | load_sample_source(), sample_source_stats() |
| src/atelier/classify/taxonomy.py | load_sample_vocabulary() |
| src/atelier/classify/pipeline.py | Source-aware routing |
| src/atelier/gateway.py | REST endpoints + auto-import lifespan |
| data/sample/ontology.json | Expanded vocabulary (316 leaves) |
| data/sample/tables/*.csv | 25 sample tables |
| data/sample/reference_labels.json | 316-entry Sample-source fixture reference labels |
| build/meta-tagging-clean/curated_reference.csv (gitignored) | UAT-corpus curated reference |
| scripts/expand_vocabulary.py | Vocabulary expansion script |
| scripts/generate_sample_source.py | Sample table generation script |
| ui/src/contexts/DatasetContext.tsx | Source-aware React context |
| ui/src/pages/Status.tsx | Data source + version UI |

Proposed Integrations

This page documents two planned integration points that extend the data source model: MLflow experiment tracking (Phase 5) and Hive data connections (Phase 6). Both are designed but not yet implemented.


MLflow Integration (Phase 5)

Motivation

On CAI deployments, MLflow is available as a managed service. Logging pipeline runs to MLflow provides:

  • Experiment history: compare accuracy, conflict, and coverage across pipeline versions without the Atelier UI
  • Model registry: when CatBoost/SVM models are trained, register them as versioned artifacts
  • Artifact persistence: classifications.json, evaluation reports, and parquet files survive pod restarts
  • Cross-project visibility: other CAI workloads can discover Atelier’s registered models

Architecture: Write-Then-Reconcile

The MLflow bridge follows the RAG Studio reconciler pattern — the pipeline never blocks on MLflow I/O.

Pipeline thread                    Reconciler (background)
──────────────                     ───────────────────────
write JSON to queue dir ──────►   poll queue dir
  (non-blocking)                   parse JSON envelope
                                   log to MLflow (retries)
                                   move to archive/

This design is resilient to:

  • MLflow downtime (queue accumulates, reconciler catches up)
  • Pipeline latency (no synchronous API calls in the hot path)
  • Pod restarts (queue dir is on persistent storage)

Queue Format

Each pipeline state transition writes a JSON envelope to build/mlflow_queue/:

{
  "event": "run_complete",
  "run_id": "abc123",
  "source_id": "ootb-sample",
  "timestamp": "2026-04-14T12:00:00Z",
  "payload": {
    "params": {
      "source_id": "ootb-sample",
      "vocabulary_mode": "universal",
      "sample_size": 50,
      "llm_model": "glm-4.7",
      "discount_cosine": 0.30
    },
    "metrics": {
      "accuracy": 0.847,
      "micro_f1": 0.832,
      "macro_f1": 0.791,
      "mean_belief": 0.724,
      "mean_conflict": 0.089,
      "coverage": 0.973,
      "llm_calls": 42,
      "bootstrap_iterations": 3
    },
    "artifacts": [
      "build/results/abc123/classifications.json",
      "build/results/abc123/evaluation_report.json",
      "build/results/abc123/atelier_embeddings.parquet"
    ]
  }
}

MLflow Experiment Structure

Each data source maps to an MLflow experiment:

Experiment: atelier/ootb-sample
├── Run: v1 (params, metrics, artifacts)
├── Run: v2 (params, metrics, artifacts)
└── Run: v3 (params, metrics, artifacts)

Experiment: atelier/hive-prod-default
└── Run: v1 (params, metrics, artifacts)

What Gets Logged

| Category | Items | Notes |
|---|---|---|
| Params | source_id, vocabulary_mode, sample_size, llm_model, discount factors | Static per run |
| Metrics | accuracy, micro_f1, macro_f1, mean_belief, mean_conflict, coverage | Numeric scalars |
| Artifacts | classifications.json, evaluation_report.json, parquet | Full result set |
| Models | CatBoost (.cbm), SVM (.pkl) | Registered when newly trained |

Module Design

# src/atelier/classify/mlflow_bridge.py

class MLflowBridge:
    """Async write-then-reconcile bridge to MLflow."""

    def __init__(self, queue_dir: Path, experiment_prefix: str = "atelier"):
        self.queue_dir = queue_dir
        self.experiment_prefix = experiment_prefix

    def enqueue(self, event: str, run_id: str, source_id: str, payload: dict):
        """Write an event envelope to the queue (non-blocking)."""
        ...

    def reconcile(self):
        """Process all pending queue items. Called by background thread."""
        ...

Pipeline integration points:

# In pipeline.py — at key state transitions:
bridge.enqueue("run_started", run_id, source_id, {"params": {...}})
# ... pipeline work ...
bridge.enqueue("run_complete", run_id, source_id, {"metrics": {...}, "artifacts": [...]})

Gating

MLflow is only active on CAI (cfg.is_cml). In devenv, the bridge is a no-op. The mlflow package is an optional dependency — import failure is handled gracefully.
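A sketch of the gating rule — the is_cml flag mirrors cfg.is_cml, and the NoopBridge class is an illustrative stand-in rather than existing code:

```python
# Hedged sketch: the bridge is active only on CAI (cfg.is_cml) and only
# when the optional mlflow package imports cleanly.
try:
    import mlflow  # optional dependency — absent in devenv is fine
except ImportError:
    mlflow = None

class NoopBridge:
    """Used when gating is off: enqueue becomes a silent no-op."""
    def enqueue(self, *args, **kwargs):
        return None

def bridge_enabled(is_cml: bool) -> bool:
    return is_cml and mlflow is not None
```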

Configuration

# config/base.conf (proposed)
mlflow {
    enabled = false
    enabled = ${?ATELIER_MLFLOW_ENABLED}
    tracking_uri = null
    tracking_uri = ${?MLFLOW_TRACKING_URI}
    queue_dir = "build/mlflow_queue"
}

Implementation Notes

  • The reconciler runs as a daemon thread started in the gateway lifespan, similar to the sample source seeding
  • Queue items are atomic files (write to .tmp, rename to .json) to prevent partial reads
  • Failed reconciliation retries with exponential backoff (max 5 min)
  • Archive dir (build/mlflow_queue/archive/) retains processed items for debugging
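The atomic-write and backoff notes above can be sketched as follows. The enqueue helper and backoff schedule are illustrative, not the proposed mlflow_bridge.py implementation:

```python
import json
import os
import tempfile
from pathlib import Path

# Atomic enqueue: write to a .tmp file, then rename, so the reconciler
# never observes a partially written envelope.
def enqueue(queue_dir: Path, name: str, envelope: dict) -> Path:
    queue_dir.mkdir(parents=True, exist_ok=True)
    final = queue_dir / f"{name}.json"
    fd, tmp = tempfile.mkstemp(dir=queue_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(envelope, f)
    os.replace(tmp, final)  # atomic rename on POSIX
    return final

# Exponential backoff capped at 5 minutes (values in seconds).
def backoff_schedule(attempts: int, base: float = 1.0, cap: float = 300.0) -> list[float]:
    return [min(cap, base * 2 ** i) for i in range(attempts)]
```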

Files (Proposed)

| File | Action |
|---|---|
| src/atelier/classify/mlflow_bridge.py | New: bridge + reconciler |
| src/atelier/classify/pipeline.py | Extend: bridge calls at transitions |
| config/base.conf | Extend: mlflow config block |
| src/atelier/config.py | Extend: mlflow fields |
| src/atelier/gateway.py | Extend: reconciler daemon thread |

Hive Data Source (Phase 6)

Motivation

The OOTB sample source demonstrates the pipeline with synthetic data. In production on CAI, the real value comes from classifying columns in the customer’s actual Hive tables via CAI data connections.

How It Works

Hive sources are auto-discovered at gateway startup. The gateway lifespan hook calls discover_hive_sources(cfg) which:

  1. Iterates all connections listed in ATELIER_DATA_CONNECTIONS
  2. For each connection, runs SHOW DATABASES and checks each database for an annotations table
  3. Validates the schema: fetches 1 row and checks for legacy (id, ontology, annotation) or universal (code, label) format
  4. Auto-registers valid sources via get_or_create_data_source() (idempotent — safe to re-run on restart)

Once registered, the pipeline route works automatically:

  1. Pipeline resolves data from the connection: when source_id refers to a hive source, the pipeline calls discover_tables() and sample_table_metadata() using that connection
  2. Vocabulary routing: hive sources use load_annotations_from_hive() which reads default.annotations (domain categories) and composes them on top of the universal base
  3. Results register as versions: each pipeline run creates a new version under the hive source, with the same activation/versioning semantics as the sample source

Data Flow

CAI Data Connection (Hive/Impala)
        │
        ▼
discover_tables(cfg, connection_name, database)
        │                    ┌─────────────────────────┐
        ▼                    │ load_annotations_from_   │
sample_table_metadata()      │ hive(cfg, connection)    │
        │                    │ → default.annotations    │
        ▼                    └──────────┬──────────────┘
                                        │
    ┌───────────────────────────────────┘
    ▼
compose_vocabularies(universal, hive_domain)
    │
    ▼
run_classification_pipeline(cfg, fsm, source_id="hive-prod-default")
    │
    ▼
Dataset version N+1 registered under hive source

Vocabulary Composition

Hive sources use two-layer vocabulary composition:

Layer 1 (always):   Universal vocabulary (16 BFO-grounded PII categories)
                              ╱╲
Layer 2 (hive only): Domain annotations from default.annotations table
                     (290+ customer-specific categories with hierarchical codes)
                              ╱╲
                     Composed CategorySet (300+ terms)

Domain categories attach to the universal tree via parent_code references. Categories without a valid parent are logged as warnings and placed under a catch-all internal node.
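A sketch of that attachment rule — the data structures are illustrative, and the real compose_vocabularies() in taxonomy.py may differ (e.g. multiple passes so domain categories can parent each other):

```python
from __future__ import annotations

# Illustrative two-layer composition: domain categories attach via
# parent_code, orphans fall under a catch-all internal node.
def compose(universal: dict[str, dict], domain: list[dict]) -> dict[str, dict]:
    composed = dict(universal)
    composed.setdefault("UNMAPPED", {"parent": None, "label": "Unattached domain terms"})
    for cat in domain:
        parent = cat.get("parent_code")
        if parent not in composed:
            parent = "UNMAPPED"  # the real pipeline logs a warning here
        composed[cat["code"]] = {"parent": parent, "label": cat["label"]}
    return composed
```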

Source Creation

When a user selects a data connection from the Status page dropdown and clicks “Create Source”, the gateway:

  1. Validates the connection by running SHOW DATABASES

  2. Creates a data_sources record:

    {
      "id": "hive-{connection}-{database}",
      "source_type": "hive",
      "source_uri": "{connection}/{database}",
      "display_name": "hive:{connection}/{database}",
      "vocabulary_mode": "hive"
    }
    
  3. The source appears in the dropdown immediately

Pipeline Routing

# In pipeline.py — source-based auto-resolution
if source.source_type == "hive":
    connection_name, database = source.source_uri.split("/", 1)
    # discover_tables() and sample_table_metadata() use the connection
    # load_annotations_from_hive() uses the connection for vocabulary

Configuration

No new configuration needed. Existing settings control Hive behavior:

classify {
    connection_name = ""                # Default CAI data connection
    connection_name = ${?ATELIER_CLASSIFY_CONNECTION}
    database = "default"
    database = ${?ATELIER_CLASSIFY_DATABASE}
}

cml {
    data_connections = ""               # Comma-separated connection names
    data_connections = ${?ATELIER_DATA_CONNECTIONS}
}

Files (Proposed Changes)

| File | Change |
|---|---|
| src/atelier/gateway.py | Add POST /api/data-sources endpoint with connection validation |
| src/atelier/classify/pipeline.py | Extend source routing to resolve hive connections |
| ui/src/pages/Status.tsx | Add “Create Source” button in data connection card |

Existing Modules Used (No Changes)

| Module | Function | Role |
|---|---|---|
| sampler.py | discover_tables() | List tables via cml.data_v1 |
| sampler.py | sample_table_metadata() | Sample column values |
| taxonomy.py | load_annotations_from_hive() | Load domain vocabulary |
| taxonomy.py | compose_vocabularies() | Merge universal + domain |

Implementation Priority

| Phase | Integration | Depends On | Testable Without Services |
|---|---|---|---|
| 5 | MLflow bridge | Phase 2 (data model) | Partially — queue/reconcile logic is pure Python |
| 6 | Hive source | Phase 2 (data model) | No — requires CAI data connection |

Phase 5 can be developed and unit-tested independently (the queue and reconcile logic is pure Python). The MLflow API calls can be mocked in tier-0 BDD scenarios.

Phase 6 is primarily wiring — the heavy lifting (table discovery, vocabulary loading, pipeline execution) already exists. The main new code is the gateway endpoint for source creation and the UI for triggering it.

Encrypted Deployment Defaults (SOPS + age)

Atelier ships with encrypted deployment defaults so a CAI operator can stand up a working instance by entering only four environment variables — their two AWS Bedrock credentials, a direct Anthropic API key (for overwatch), plus a single age private key that unlocks everything else.

Why

Every CAI deployment needs a dozen-ish environment variables: Bedrock model ARNs, Atlas / Ranger URLs, feature toggles, governance flags, subagent model IDs, and — for UAT runs — a curated-reference CSV for accuracy measurement. Most of those values are identical across every deployment of the same Atelier release; only the AWS credentials and the Anthropic key are operator-specific. Rather than documenting a long checklist for every customer, we encrypt the defaults and the curated-reference fixture into the repository with SOPS and ship one key alongside the deployment.

The operator pastes in the key; everything else is already wired up.

Operator workflow (what to tell your CAI users)

Set four environment variables on the CAI Application, then start it:

| Name | Value | Source |
|---|---|---|
| AWS_ACCESS_KEY_ID | Bedrock access key | your AWS / IAM team |
| AWS_SECRET_ACCESS_KEY | Bedrock secret | your AWS / IAM team |
| ANTHROPIC_API_KEY | direct Anthropic API key | Anthropic Console |
| SOPS_AGE_KEY | full AGE-SECRET-KEY-1… string | provided out-of-band by the Atelier maintainer |

On startup, bin/start-app.sh runs the shared bin/bootstrap-secrets.sh utility, which decrypts both .env.cai.enc (dotenv defaults) and features/fixtures/curated_reference.csv.enc (meta-tagging answer key) with the age key you provided. The dotenv values are sourced into the environment, where HOCON’s ${?VAR} substitution picks them up. The decrypted CSV materializes at build/data/curated_reference.csv, and ATELIER_CLASSIFY_REFERENCE_URI points at it so evaluation_report.json carries real accuracy numbers. No per-customer checklist to maintain.

Overrides still work. Any explicit ATELIER_* env var on the CAI Application wins over the encrypted default — so an operator who wants a different Bedrock ARN just sets ATELIER_AGENT_MODEL directly and that value takes precedence.

Alternative: pointing at a key file

If the operator already has the age key on disk (e.g. mounted from a secret store), they can set SOPS_AGE_KEY_FILE=/path/to/key.txt instead of pasting the key content. bin/start-app.sh supports both.

Maintainer workflow

The age public key is committed in .sops.yaml; the private key is held by the Atelier maintainer and distributed out-of-band to each CAI operator.

First-time setup

Place your age private key at ~/.config/sops/age/keys.txt — the public key must match the age: age1… line in .sops.yaml. The devenv shell provides both sops and age binaries.

Editing defaults

just decrypt-secrets          # .env.cai.enc → .env.cai (plaintext, gitignored)
$EDITOR .env.cai              # add / change values
just encrypt-secrets          # .env.cai → .env.cai.enc
git add .env.cai.enc
git commit -m "chore: update CAI deployment defaults"

The plaintext .env.cai is excluded by .gitignore; only the encrypted .env.cai.enc is tracked. SOPS encrypts each value independently, so diffs show which keys changed even though their values are opaque.

Editing the curated-reference fixture

The meta-tagging answer key (what evaluation_report.json compares predictions against) ships encrypted under the BDD fixtures tree so committed secrets live with the corpus they validate.

# From the maintainer's reviewer xlsx
uv run python -m atelier.overwatch.ingest_reference \
    ~/path/to/Atelier_Results_Default_DB_4-16.xlsx \
    --out build/data/curated_reference.csv

# Encrypt into features/fixtures/ and commit the ciphertext only
just encrypt-reference
git add features/fixtures/curated_reference.csv.enc
git commit -m "chore: update curated-reference answer key"

To inspect the current key without re-running the xlsx ingest:

just decrypt-reference        # decrypts into build/data/curated_reference.csv
$PAGER build/data/curated_reference.csv

Both the plaintext CSV (in build/) and .env.cai are ignored by git; only the .enc ciphertexts are tracked.

Rotating the key

age-keygen -o new-key.txt                                    # generate replacement pair
# update .sops.yaml: replace the age: age1... line with the new public key
sops updatekeys .env.cai.enc                                 # re-encrypt deployment defaults
sops updatekeys features/fixtures/curated_reference.csv.enc  # AND the curated-reference fixture
git commit -am "chore: rotate CAI deployment key"
# distribute the new private key to operators via the same out-of-band channel

sops updatekeys rewrites the encrypted file’s recipient list in place — nothing about the plaintext values changes, so this is a zero-content-drift rotation. Run it against every encrypted artifact so the new key unlocks the whole set.

Adding a second recipient (e.g. ops team shared key)

Add a second age: entry under the matching creation_rules block in .sops.yaml, then run sops updatekeys .env.cai.enc. Either private key will decrypt.

How this fits with HOCON

SOPS only populates environment variables. HOCON (config/base.conf) already treats all configuration as environment-overridable via the ${?VAR} pattern:

agents {
  model = "claude-opus-4-7"
  model = ${?ATELIER_AGENT_MODEL}     # env wins when set
}

SOPS decryption runs before the gRPC server loads HOCON, so from HOCON’s perspective the encrypted values are just ordinary environment variables.

What belongs in .env.cai.enc vs config/base.conf

  • .env.cai.enc — deployment-specific defaults that differ between environments but aren’t operator secrets per se (model ARNs, Knox endpoints, feature toggles, subagent IDs). Values that are derivable from context and you don’t want every operator to rediscover.
  • config/base.conf — true defaults that hold for every deployment; structural knobs that belong in source control in plaintext (pipeline thresholds, port numbers, fusion strategy).
  • Operator-entered env vars — genuine per-deployment secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, the SOPS_AGE_KEY itself). These never live in the repository.

Security notes

  • SOPS_AGE_KEY decrypts only this project’s .env.cai.enc. Losing it costs you these defaults; gaining it grants no AWS, Cloudera, or third-party privilege on its own.
  • Each customer should get the same age private key (defaults are identical across deployments) — per-customer secrets, if any, stay in the CAI Application’s own environment variables.
  • Rotate the key whenever a recipient leaves the operator pool.
  • The age public key in .sops.yaml is intentionally committed; public keys are meant to be public.

Reference

  • .sops.yaml — recipient rules (covers .env.cai.enc + features/fixtures/*.csv.enc)
  • .env.cai.enc — encrypted deployment defaults (committed)
  • features/fixtures/curated_reference.csv.enc — encrypted curated-reference CSV (committed)
  • bin/bootstrap-secrets.sh — shared decrypt utility; runs from bin/start-app.sh, devenv enterShell, and just bootstrap-secrets
  • bin/start-app.sh — CAI startup; invokes bootstrap-secrets then sources .env.cai
  • justfile helpers:
    • bootstrap-secrets — run the shared decrypt utility
    • decrypt-secrets / encrypt-secrets — dotenv editing workflow
    • decrypt-reference / encrypt-reference — curated-reference CSV editing workflow
  • devenv.nix — provides sops + age in the dev shell; runs bootstrap in enterShell
  • SOPS docs · age docs

Scenario Overview

Atelier uses behave (BDD) to capture platform decisions as executable specifications. Every scenario answers a concrete question: Does the config load? Can the runtime start? Does the classification pipeline converge?

These aren’t just tests. They’re the design context that connects architectural choices to the deployment realities of Cloudera AI.

Active Domains

155 scenarios across 35 features, 4 domains.

Infrastructure (infra)

Health checks and configuration lifecycle for the services Atelier depends on.

| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| Config lifecycle | @config | 0 | 3 | HOCON load, CLI override precedence, materialize + validate |
| PostgreSQL health | @postgres | 1 | 2 | Connection with pgvector extension, migration state |
| Qdrant health | @qdrant | 1 | 1 | Vector store HTTP health endpoint |
| PGlite process | @pglite | 0 | 2 | Node.js script existence, npm dependency declarations |
| Preflight | @preflight | 0 | 3 | Structured deny/warn checks, GPU detection |

Deployment

CAI deployment modalities and the runtime profile that catches failures before pushing.

| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| Runtime profile | @runtime-profile | 0 | 6 | Import chain, script executability, config resolution, migration parsing |
| AMP lifecycle | @amp | 0 + cai | 5 | .project-metadata.yaml structure, task patterns, install + start |
| Application modality | @application | 0 + 1 | 3 | HOST binding logic, full local stack startup |
| Studio modality | @studio | 0 | 2 | IS_COMPOSABLE root directory routing |
| Embeddings integration | @embeddings | 0 | 4 | npm dependency, page component, React Router, preparation script |
| Naming conventions | @naming | 0 | 2 | User-facing surfaces say “Embeddings”, no Apache Atlas confusion |

Gateway

HTTP gateway endpoints, gRPC bridge, and live service integration.

| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| API endpoints | @api | 0 + 1 | 8 | REST endpoint contracts, response shapes |
| API testclient | @testclient | 0 | 7 | FastAPI TestClient integration (no running server) |
| Status endpoint | @status | 0 + 1 | 4 | Aggregated health report, config state |
| Pipeline integration | @pipeline | 1 | 2 | Classification pipeline via gateway |
| SPA routes | @spa | 0 | 1 | Client-side routing fallback |

Agent

Classification pipeline, DST evidence fusion, ML classifiers, and agent orchestration.

| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| Classification pipeline | @gpu | 0 | 28 | DST belief, Dempster combination, features, patterns (+ Luhn/IPv4/date/currency validation), name matching, pipeline E2E, Monte Carlo sampling |
| Bootstrap convergence | @bootstrap | 0 | 11 | LLM sweep, ML validation, targeted revisit, convergence criteria, frontier SVM |
| Agent convergence loop | @gpu | 0 | 6 | 6-tool agent loop, conflict reports, convergence, mock client |
| Agent smoke test | @agent | 0 | 6 | Agent metadata, tool definitions, state formatting |
| LLM backends | @backend | 0 | 8 | Backend factory, Anthropic/Bedrock/Cerebras/OpenAI clients |
| ML classifiers | @ml | 0 | 4 | CatBoost + SVM training, inference, virtual ensemble UQ |
| ML E2E | @ml-e2e | 0 | 2 | Full synth → train → classify → evaluate cycle |
| Belief path | @belief-path | 0 | 3 | Hierarchical navigation, cautious classification |
| SAGE importance | @sage | 0 | 1 | Permutation-based feature importance |
| SHAP explanations | @shap | 0 | 2 | TreeSHAP + PermutationSHAP attribution |
| Synth generation | @synth | 0 | 2 | Synthetic data + reference-label generation |
| Synth framework | @synth-framework | 0 | 2 | Generator registry, coverage reporting |
| Meta-tagging | @meta-tagging | 0 | 2 | META_TO_ICE mappings, coverage |
| Experimentation | @experimentation | 0 | 3 | Discount tuning, comparative evaluation |
| Real data | @real-data | 0 | 3 | Production annotation validation (requires build/data/) |

By Tier

| Tier | Requires | Scenarios | Pass locally |
|---|---|---|---|
| 0 | Python only | ~120 | Yes |
| 1 | devenv stack | ~15 | Yes (with devenv up) |
| cai | Live CAI session | ~5 | Skipped (documentation-only) |

Additional tags: @slow (~17 scenarios requiring extended runtime), @gpu (GPU detection/acceleration scenarios — run on CPU too, just slower).

Why BDD for a Deployment Platform?

CAI deployment has four modalities — Project, Application, AMP, and Studio — each with different constraints on networking, filesystem layout, and process lifecycle. Traditional unit tests verify module behavior in isolation. BDD scenarios verify that the system hangs together across these modalities.

Consider the Application modality: when CDSW_APP_PORT is set, the startup script must bind to 127.0.0.1 because CAI’s reverse proxy handles external traffic. Bind to 0.0.0.0 instead and you bypass the proxy’s auth layer. This isn’t a bug in any single module — it’s a deployment contract that only a scenario can express clearly:

Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
  Given CDSW_APP_PORT is set to "8090"
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "127.0.0.1"

The scenario is the spec. A colleague reading this knows exactly what the constraint is, why it matters, and can verify it passes with just behave.

Test Infrastructure

Framework

Atelier uses behave for BDD and pytest for unit tests. The BDD scenarios live in features/ and are organized by domain.

Tier System

Scenarios are tagged by the infrastructure they require. The ATELIER_BDD_TIER environment variable controls which tiers run.

| Tier | Tag | Requires | Purpose |
|---|---|---|---|
| 0 | @tier-0 | Python only | Config, imports, classification pipeline, agent loop, ML classifiers |
| 1 | @tier-1 | devenv stack | PostgreSQL, Qdrant, gRPC, full gateway startup |
| cai | @tier-cai | CAI session | Live deployment validation — always skipped locally |

Additional tags:

  • @slow — scenarios requiring extended runtime (pipeline E2E, ML training)
  • @gpu — GPU acceleration scenarios (run on CPU too, just slower)

Tier 0 runs everywhere: laptops, CI, CAI sessions. No services, no network calls. This is where the runtime profile lives — the scenarios that catch deployment failures before you push.

Tier 1 requires devenv up to be running (PostgreSQL on :5533, Qdrant on :6334). These verify that services are healthy and that the application can actually connect to its data stores.

Tier CAI exists as executable documentation. The step definitions are stubs — they express what should happen in a live CAI session without automating it. When debugging a deployment failure, these scenarios are a checklist.
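As a rough illustration, the tier gate could be implemented as a `before_scenario` hook. This is a hypothetical sketch: only `ATELIER_BDD_TIER` and the `@tier-*` tag names come from this document; the function names and the mapping logic are invented, and the real `environment.py` may differ.

```python
import os

# Hypothetical sketch of tier filtering in features/environment.py.
TIERS = ["tier-0", "tier-1", "tier-cai"]

def allowed_tiers(setting: str) -> set[str]:
    """Map an ATELIER_BDD_TIER value to the set of tier tags allowed to run."""
    if setting == "cai":
        return set(TIERS)
    level = int(setting) if setting.isdigit() else 0
    return set(TIERS[: level + 1])

def before_scenario(context, scenario):
    allowed = allowed_tiers(os.environ.get("ATELIER_BDD_TIER", "0"))
    # A scenario inherits feature-level tags via behave's effective_tags.
    tier_tags = {t for t in scenario.effective_tags if t.startswith("tier-")}
    if tier_tags and not (tier_tags & allowed):
        scenario.skip(reason=f"requires {sorted(tier_tags)}")
```

Untagged scenarios always run; tagged ones run only when their tier is enabled.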

Running Tests

# Full BDD suite including gateway checks (preferred)
just behave

# Tier-0 only (no services needed)
just bdd

# Tier-0 + tier-1 (requires devenv up)
just bdd-full

# Runtime profile specifically
just bdd-runtime

# Single domain
ATELIER_BDD_TIER=0 uv run behave features/agent/

# Single feature file
uv run behave features/agent/classification.feature

# By tag
ATELIER_BDD_TIER=0 uv run behave features/ -t @bootstrap

# Verbose (show all steps, not just failures)
just behave --no-capture

Feature Organization

features/
├── environment.py                          # Tier filtering, stack health, cleanup hooks
├── steps/__init__.py                       # Central re-exports (behave's discovery point)
├── infra/                                  # Domain: infrastructure & services
│   ├── step_defs/
│   │   ├── helpers.py
│   │   ├── config_steps.py
│   │   ├── health_steps.py
│   │   └── preflight_steps.py
│   ├── config_lifecycle.feature            # 3 scenarios
│   ├── health_postgres.feature             # 2 scenarios
│   ├── health_qdrant.feature               # 1 scenario
│   ├── health_pglite.feature               # 2 scenarios
│   └── preflight.feature                   # 3 scenarios
├── deployment/                             # Domain: CAI deployment workflows
│   ├── step_defs/
│   │   ├── helpers.py
│   │   ├── runtime_steps.py
│   │   ├── amp_steps.py
│   │   └── naming_steps.py
│   ├── runtime_profile.feature             # 6 scenarios
│   ├── amp_lifecycle.feature               # 5 scenarios
│   ├── application.feature                 # 3 scenarios
│   ├── studio.feature                      # 2 scenarios
│   ├── embeddings.feature                  # 4 scenarios
│   └── naming_audit.feature                # 2 scenarios
├── gateway/                                # Domain: HTTP/gRPC gateway
│   ├── step_defs/
│   │   ├── status_steps.py
│   │   ├── http_steps.py
│   │   ├── endpoint_steps.py
│   │   ├── pipeline_steps.py
│   │   └── testclient_steps.py
│   ├── api_endpoints.feature               # 8 scenarios
│   ├── api_testclient.feature              # 7 scenarios
│   ├── status_endpoint.feature             # 4 scenarios
│   ├── pipeline_integration.feature        # 2 scenarios
│   └── spa_routes.feature                  # placeholder
└── agent/                                  # Domain: classification & agents
    ├── step_defs/
    │   ├── agent_steps.py
    │   ├── classification_steps.py
    │   ├── bootstrap_steps.py
    │   ├── backend_steps.py
    │   ├── synth_steps.py
    │   ├── ml_steps.py
    │   ├── ml_e2e_steps.py
    │   ├── sage_steps.py
    │   ├── shap_steps.py
    │   ├── real_data_steps.py
    │   ├── belief_path_steps.py
    │   ├── synth_framework_steps.py
    │   ├── meta_tagging_steps.py
    │   ├── experimentation_steps.py
    │   ├── agent_loop_steps.py
    │   └── monte_carlo_steps.py
    ├── classification.feature              # 19 scenarios (DST, pipeline, MC sampling)
    ├── bootstrap.feature                   # 10 scenarios
    ├── agent_loop.feature                  # 6 scenarios
    ├── agent_smoke.feature                 # 6 scenarios
    ├── backend.feature                     # 8 scenarios
    ├── ml_classifiers.feature              # 4 scenarios
    ├── ml_e2e.feature                      # 2 scenarios
    ├── synth.feature                       # 2 scenarios
    ├── synth_framework.feature             # 2 scenarios
    ├── sage.feature                        # 1 scenario
    ├── shap.feature                        # 2 scenarios
    ├── belief_path.feature                 # 3 scenarios
    ├── meta_tagging.feature                # 2 scenarios
    ├── experimentation.feature             # 3 scenarios
    └── real_data.feature                   # 3 scenarios

Step Discovery

Behave only discovers step definitions from features/steps/. Domain step definitions live in <domain>/step_defs/ directories and are re-exported through features/steps/__init__.py:

from features.infra.step_defs.config_steps import *
from features.infra.step_defs.health_steps import *
from features.infra.step_defs.preflight_steps import *
from features.deployment.step_defs.runtime_steps import *
from features.deployment.step_defs.amp_steps import *
from features.deployment.step_defs.naming_steps import *
from features.agent.step_defs.agent_steps import *
from features.agent.step_defs.classification_steps import *
from features.agent.step_defs.bootstrap_steps import *
from features.agent.step_defs.backend_steps import *
from features.agent.step_defs.synth_steps import *
from features.agent.step_defs.ml_steps import *
from features.agent.step_defs.ml_e2e_steps import *
from features.agent.step_defs.sage_steps import *
from features.agent.step_defs.shap_steps import *
from features.agent.step_defs.real_data_steps import *
from features.agent.step_defs.belief_path_steps import *
from features.agent.step_defs.synth_framework_steps import *
from features.agent.step_defs.meta_tagging_steps import *
from features.agent.step_defs.experimentation_steps import *
from features.gateway.step_defs.status_steps import *
from features.gateway.step_defs.http_steps import *
from features.gateway.step_defs.endpoint_steps import *
from features.gateway.step_defs.pipeline_steps import *
from features.agent.step_defs.agent_loop_steps import *
from features.agent.step_defs.monte_carlo_steps import *
from features.gateway.step_defs.testclient_steps import *

Two conventions protect against behave’s automatic discovery behavior:

  1. Use step_defs/, not steps/ — Behave walks the feature tree and exec’s any .py file it finds in a directory named steps/. This bypasses Python’s import system, breaking relative imports and module context. Using step_defs/ avoids this entirely.

  2. Never name a features/ subdirectory after a stdlib module — When behave imports features.platform, Python also registers it as platform in sys.modules, shadowing the stdlib. This breaks anything that lazily imports platform (including pydantic). The infra/ domain was originally named platform/ until this caused a cascade of subtle failures.

Config-Driven BDD

Infrastructure steps load configuration from HOCON via atelier.config.load_config() rather than hardcoding values. This means BDD scenarios validate the same config path used in production:

from atelier.config import load_config
cfg = load_config()
_wait_for("PostgreSQL", lambda: _check_pg(cfg.db_url))

Stack Health Gate

Tier-1 scenarios share a one-time stack health check in environment.py. Before the first tier-1 scenario runs, the framework verifies PostgreSQL and Qdrant are reachable (with a 60-second retry window). If either service is down, all tier-1 scenarios fail fast with a clear message rather than producing confusing connection errors.

Cleanup

after_scenario in environment.py removes temporary files registered via context._temp_files. This handles config materialization artifacts and other test-created files.
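A minimal sketch of that hook (illustrative; the real `environment.py` may do more):

```python
import contextlib
import os

def after_scenario(context, scenario):
    """Remove files that step definitions registered via context._temp_files."""
    for path in getattr(context, "_temp_files", []):
        with contextlib.suppress(FileNotFoundError):
            os.remove(path)
    context._temp_files = []
```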

Unit Tests

Alongside BDD, tests/ contains pytest unit tests for isolated module behavior:

just test                    # Run all pytest tests
uv run pytest tests/ -x     # Stop on first failure

BDD and pytest serve complementary roles: pytest validates that individual functions behave correctly; BDD validates that the system’s deployment contracts hold.

Deployment Modalities

Cloudera AI offers four ways to run code. Each has different constraints on networking, filesystem layout, process lifecycle, and dependency management. Atelier’s BDD scenarios encode these constraints as executable specifications.

  • Project — Git-backed workspace; the base for all modalities
  • AMP — one-click provisioning driven by .project-metadata.yaml (install + start tasks)
  • Application — long-running service; CDSW_APP_PORT binding behind the reverse proxy
  • Studio — pre-built Docker image with IS_COMPOSABLE=true; Atelier runs as an embedded service
Project

Every CAI deployment starts as a Project — a Git-backed workspace cloned into /home/cdsw. The Project modality is implicit: it provides the filesystem layout, environment variables, and Python runtime that all other modalities build on.

No dedicated feature file. Project constraints are tested indirectly through every other deployment scenario.

AMP (Automated Machine Learning Prototype)

An AMP is a one-click provisioning workflow defined in .project-metadata.yaml. It runs a sequence of tasks — typically create_job to install dependencies, then start_application to launch the service.

Why BDD captures this well: AMP metadata is YAML that CAI interprets at deploy time. A malformed task definition doesn’t fail until someone clicks “Deploy” in the CAI UI. Our tier-0 scenarios catch structural problems immediately.

What the scenarios validate

AMP metadata structure (amp_lifecycle.feature):

Scenario: AMP metadata file is valid
  Given the file ".project-metadata.yaml" exists
  When I parse the AMP metadata
  Then it has a "name" field
  And it has a "runtimes" section
  And it has a "tasks" section

Task ordering pattern — CAI requires create_job before run_job for the same entity label. Getting this wrong means the install job never runs:

Scenario: AMP tasks follow create_job/run_job pattern
  Given the AMP metadata is loaded
  Then a "create_job" task with entity_label "install_deps" exists
  And a "run_job" task with entity_label "install_deps" exists
  And a "start_application" task exists

Install script validity — scripts/install_deps.py runs in a bare Python environment without uv or devenv. A syntax error here means the entire deployment fails:

Scenario: Install script is valid Python
  When I compile "scripts/install_deps.py" with py_compile
  Then no SyntaxError is raised

Tier-CAI scenarios document what a successful AMP deploy looks like. These are skipped locally but serve as a regression checklist when debugging deployment failures:

@tier-cai
Scenario: AMP install job completes successfully
  Given I am in a CAI project session
  When I run the install dependencies job
  Then the job exits with code 0
  And "atelier" is importable in system Python
  And "node --version" succeeds
  And the directory "ui/dist" exists

Application

An Application is a long-running web service. CAI assigns a port via CDSW_APP_PORT and routes subdomain traffic through a reverse proxy that handles authentication.

The key constraint: When CDSW_APP_PORT is set, the service must bind to 127.0.0.1, not 0.0.0.0. The reverse proxy connects over localhost; binding to all interfaces bypasses CAI’s auth layer.

For local development (no CDSW_APP_PORT), binding to 0.0.0.0 is correct — it lets you access the service from a browser.

Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
  Given CDSW_APP_PORT is set to "8090"
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "127.0.0.1"

Scenario: start-app.sh binds to 0.0.0.0 for local dev
  Given CDSW_APP_PORT is not set
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "0.0.0.0"
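The binding rule those two scenarios pin down fits in a few lines. The sketch below restates it in Python for clarity; `bin/start-app.sh` itself implements it in bash, and the function name here is invented:

```python
import os

def resolve_host(env=None) -> str:
    """Host selection described by the Application scenarios (illustrative)."""
    env = os.environ if env is None else env
    # CAI's reverse proxy connects over localhost; binding to all
    # interfaces would bypass its auth layer.
    return "127.0.0.1" if env.get("CDSW_APP_PORT") else "0.0.0.0"
```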

The tier-1 scenario verifies the full stack actually starts and serves traffic:

@tier-1
Scenario: Full application stack starts locally
  When I run bin/start-app.sh in the background
  Then the HTTP gateway responds on port 8090 within 30 seconds
  And the gRPC server responds on port 50051

Studio (future)

A Studio is a pre-built Docker image where IS_COMPOSABLE=true. Instead of being the root application, Atelier runs as an embedded service within a larger container.

The key constraint: When IS_COMPOSABLE is set, the install script must use /home/cdsw/atelier as the root directory (the project subdirectory) instead of /home/cdsw (the container root). Getting this wrong means dependencies install into the wrong location and imports fail at startup.

Scenario: install_deps.py handles IS_COMPOSABLE root path
  When I set IS_COMPOSABLE to "true"
  And I parse scripts/install_deps.py for root_dir
  Then root_dir is "/home/cdsw/atelier"

Scenario: install_deps.py uses default root without IS_COMPOSABLE
  When IS_COMPOSABLE is not set
  And I parse scripts/install_deps.py for root_dir
  Then root_dir is "/home/cdsw"
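The branch those scenarios pin down, sketched in Python (the real scripts/install_deps.py may structure it differently; the function name is invented):

```python
import os

def resolve_root_dir(env=None) -> str:
    """Root-directory selection described by the Studio scenarios."""
    env = os.environ if env is None else env
    if env.get("IS_COMPOSABLE", "").lower() == "true":
        return "/home/cdsw/atelier"  # embedded: project subdirectory
    return "/home/cdsw"              # root application: container home
```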

Studio support is currently speculative — these scenarios document the expected behavior so the contract is established before implementation begins.

Runtime Profile

The CAI Runtime Profile is a set of tier-0 scenarios that validate deployment readiness without requiring a live CAI session. Run it before every push to catch the class of errors that only manifest when CAI tries to start the application.

just bdd-runtime

Why This Exists

CAI deployment failures are expensive to debug. The install job runs in a container with a 30-minute timeout. If it fails, the only feedback is a log dump. If it succeeds but the application crashes at startup, the only feedback is an “Application failed to start” banner with a link to logs that may or may not contain the root cause.

The runtime profile catches failures that would otherwise require a deploy-debug-redeploy cycle:

| Check | Failure mode it prevents |
|---|---|
| Core package importable | Missing __init__.py, circular imports, broken package structure |
| Entry points importable | New dependency not declared in pyproject.toml |
| Proto stubs importable | Forgot to run just proto after editing .proto |
| Scripts exist and are executable | Missing chmod +x, file not committed |
| HOCON config resolves | Undefined substitution variable, syntax error in .conf |
| Migrations parseable | Malformed -- migrate:up block, missing SQL terminator |

The Scenarios

Import chain validation

The most common CAI deployment failure is an import error. A module works in devenv because all dev dependencies are installed, but fails in CAI because the install script only installs production dependencies.

Scenario: Core package is importable
  When I import "atelier"
  Then no ImportError is raised
  And atelier.__version__ is defined

Scenario: All entry points are importable
  When I import "atelier.server"
  And I import "atelier.gateway"
  And I import "atelier.config"
  And I import "atelier.db.bootstrap"
  Then no ImportError is raised

Scenario: Proto stubs are generated and importable
  When I import "atelier.proto.atelier_pb2"
  And I import "atelier.proto.atelier_pb2_grpc"
  Then no ImportError is raised

These scenarios exercise the full import graph. If atelier.gateway imports fastapi which imports pydantic which imports annotated_types, and annotated_types isn’t in the dependency chain — this catches it.
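A step backing these scenarios can be as small as the following (a hypothetical step body, not the project's actual code):

```python
import importlib

def assert_importable(*module_names: str) -> None:
    """Import each module, surfacing the first broken link in the chain."""
    for name in module_names:
        # Raises ImportError/ModuleNotFoundError on any break in the graph.
        importlib.import_module(name)
```

Because `import_module` executes the full module body, a missing transitive dependency fails here just as it would at CAI startup.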

Script executability

CAI runs scripts via #!/usr/bin/env python3 or #!/usr/bin/env bash. If the shebang is wrong or the execute bit isn’t set, the deploy fails with a cryptic “Permission denied” error.

Scenario: Required scripts exist and are executable
  Then the file "scripts/install_deps.py" exists
  And the file "scripts/startup_app.py" exists
  And the file "scripts/install_node.sh" is executable
  And the file "scripts/install_qdrant.sh" is executable
  And the file "bin/start-app.sh" is executable

Configuration resolution

HOCON configs use ${?VAR} substitution for environment variables. A typo in a variable name or an unresolvable reference won’t fail until load_config() is called at startup. The runtime profile forces resolution at test time:

Scenario: HOCON config resolves without errors
  When I load the config with no overrides
  Then no exception is raised
  And the config has grpc_port > 0
  And the config has gateway_port > 0

Migration parsing

Atelier uses a dbmate-compatible migration runner (atelier.db.bootstrap) that parses -- migrate:up / -- migrate:down blocks from SQL files. If a migration is missing its UP block, the bootstrap silently skips it — which means the schema diverges from what the code expects.

Scenario: Database migrations are parseable
  Given migration files exist in "db/migrations/"
  When I parse each migration for UP/DOWN blocks
  Then every migration has a valid UP block
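A hypothetical parser for those blocks, shown to make the contract concrete (the real atelier.db.bootstrap may differ in details):

```python
import re

# Marker lines in dbmate-style migrations: "-- migrate:up" / "-- migrate:down".
_MARKER = re.compile(r"^--\s*migrate:(up|down)\s*$", re.MULTILINE)

def parse_migration(sql: str) -> dict[str, str]:
    """Split a migration file into its up/down SQL blocks."""
    parts = _MARKER.split(sql)
    # split() yields [prefix, direction, body, direction, body, ...]
    return {direction: body.strip()
            for direction, body in zip(parts[1::2], parts[2::2])}
```

A migration whose parsed result lacks an `up` key is exactly the silently-skipped case the scenario guards against.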

When to Extend the Profile

Add a new runtime profile scenario whenever you:

  • Add a new Python entry point or importable module
  • Add a new script that CAI executes directly
  • Add a new HOCON config key that downstream code depends on
  • Add a new migration file

The rule of thumb: if it can break a CAI deploy and you can verify it without services running, it belongs in the runtime profile.