Atelier is an agentic classification workbench for Cloudera AI. It classifies
column metadata using six independent evidence sources fused via
Dempster-Shafer Theory (DST), producing belief intervals instead of point
estimates. An LLM-in-the-loop convergence agent identifies disagreements
between sources and orchestrates targeted reclassification until the corpus
stabilizes.
Traditional classifiers output a single probability \( P(A) = 0.85 \) — “85%
email address.” This conflates two fundamentally different situations: high
confidence with abundant evidence vs. moderate confidence with sparse evidence.
A Bayesian posterior and a coin flip can both yield 0.5, but they represent
very different epistemic states.
Dempster-Shafer theory separates these via the belief function
\( \text{Bel}(A) \) and plausibility function \( \text{Pl}(A) \), where:

- \( \text{Bel}(A) \): committed evidence supporting \(A\) (lower bound)
- \( \text{Pl}(A) \): evidence that cannot rule out \(A\) (upper bound)

The interval \( [\text{Bel}(A),\; \text{Pl}(A)] \) bounds the true
probability. Its width \( \text{Pl}(A) - \text{Bel}(A) \) quantifies
epistemic uncertainty — how much we don’t know:
| Interval | Interpretation |
|---|---|
| \( [0.82,\; 0.87] \) | Strong evidence, low ambiguity — classify with confidence |
| \( [0.30,\; 0.90] \) | Some support for \(A\), but high ignorance — gather more evidence |
| \( [0.45,\; 0.55] \) | Two sources disagree — contested prediction, needs revisit |
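A minimal sketch of how Bel, Pl, and the gap fall out of a mass function (illustrative only; the mass values and category codes below are hypothetical, and mass functions are represented as dicts keyed by frozensets of codes):

```python
def bel(m, A):
    # Bel(A): total mass committed to subsets of A
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    # Pl(A): total mass on focal elements that intersect A (cannot rule A out)
    return sum(v for B, v in m.items() if B & A)

theta = frozenset({"EMAIL", "URL", "PHONE"})
m = {frozenset({"EMAIL"}): 0.6,            # committed to EMAIL
     frozenset({"EMAIL", "URL"}): 0.1,     # ambiguous between EMAIL and URL
     theta: 0.3}                           # total ignorance

A = frozenset({"EMAIL"})
print(bel(m, A))              # 0.6
print(pl(m, A))               # 1.0
print(pl(m, A) - bel(m, A))   # 0.4, the epistemic gap
```

The 0.3 on Θ and the 0.1 on the compound set widen the interval without asserting anything false about EMAIL.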
This distinction drives the entire pipeline: columns with wide
belief gaps (where \( \text{Pl}(A) - \text{Bel}(A) \) is large)
are automatically escalated for LLM re-examination with enriched
context. Conflict \( K \) is tracked as a diagnostic but the
gap width determines which columns need attention.
Each source independently produces a mass function
\( m_i : 2^\Theta \to [0, 1] \) over the frame of discernment \( \Theta \)
(the set of all category codes). Sources are grouped by computational cost:
| Source | Description | Tier |
|---|---|---|
| Pattern detection | 16 regex detectors + post-regex validators (email, phone, SSN, IP, UUID, date, datetime, URL, credit card + Luhn, MAC, IBAN, postal code, monetary, hash, semver, currency + ISO 4217); graduated mass scaling by match fraction | M0 |
| Name matching | Column name vs vocabulary labels, codes, and aliases (4-tier: exact > code > alias > overlap) | M0 |
| Cosine similarity | Dense embedding similarity (all-MiniLM-L6-v2, 384-dim) between column text and category labels | M0 |
| LLM classification | Frontier model reasoning (Anthropic / Bedrock / Cerebras / OpenAI-compatible) | M1 (API) |
| CatBoost | 12 discrete features + 384-dim embedding; virtual ensemble uncertainty via posterior_sampling | M2 (trained) |
| SVM | Sparse TF-IDF: character n-grams (3–6) ∪ word bigrams; Platt-scaled LinearSVC | M2 (trained) |
The SVM and CatBoost classifiers occupy deliberately orthogonal feature
spaces: the SVM operates on sparse lexical features (TF-IDF) while CatBoost
uses dense semantic embeddings. This architectural separation ensures genuine
evidence independence for Dempster’s rule.
Sources are combined via the conjunctive rule of combination:
$$
m_{1 \oplus 2}(C) = \frac{1}{1-K} \sum_{\substack{A \cap B = C \\ A,B \subseteq \Theta}} m_1(A) \cdot m_2(B)
$$
where the conflict \( K = \sum_{A \cap B = \varnothing} m_1(A) \cdot m_2(B) \)
measures the degree to which sources contradict each other. High \( K \) is
a diagnostic signal of source disagreement: columns where independent
evidence sources disagree are escalated for targeted LLM revisit
with enriched context (ML prediction, belief interval, confusable pair).
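The rule above can be sketched directly (an illustrative implementation, not Atelier's; mass functions are dicts keyed by frozensets, and the example masses are hypothetical):

```python
def dempster_combine(m1, m2):
    """Conjunctive rule: intersect focal elements, renormalize by 1-K."""
    combined, K = {}, 0.0
    for A, a in m1.items():
        for B, b in m2.items():
            C = A & B
            if C:
                combined[C] = combined.get(C, 0.0) + a * b
            else:
                K += a * b  # mass on empty intersections is the conflict
    return {C: v / (1.0 - K) for C, v in combined.items()}, K

THETA = frozenset({"EMAIL", "URL"})
m1 = {frozenset({"EMAIL"}): 0.7, THETA: 0.3}
m2 = {frozenset({"URL"}): 0.6, THETA: 0.4}
fused, K = dempster_combine(m1, m2)
print(K)                            # 0.42
print(fused[frozenset({"EMAIL"})])  # 0.28 / 0.58, about 0.483
```

Note how the normalization by \(1-K\) redistributes the conflicting 0.42 across the surviving focal elements.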
The vocabulary forms a rooted code tree (e.g.,
ICE.SENSITIVE.PID.CONTACT.EMAIL). Belief and plausibility are queryable at
any depth — \( \text{Bel}(\texttt{ICE.SENSITIVE}) \) aggregates all
descendants. The cautious_code(τ) operator returns the deepest code where
\( \text{Bel} > \tau \), enabling principled depth-accuracy tradeoffs:
high \( \tau \) yields coarse but reliable labels; low \( \tau \) yields
specific but less certain ones.
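A hedged sketch of such an operator: walk from the leaf toward the root and return the deepest dotted code whose belief clears τ. The `belief_at` lookup and the belief values below are hypothetical, not Atelier's internals:

```python
def cautious_code(leaf, belief_at, tau):
    parts = leaf.split(".")
    # Try the deepest code first, then progressively coarser ancestors
    for depth in range(len(parts), 0, -1):
        code = ".".join(parts[:depth])
        if belief_at.get(code, 0.0) > tau:
            return code
    return None  # not even the root clears the threshold

beliefs = {
    "ICE": 0.98,
    "ICE.SENSITIVE": 0.95,
    "ICE.SENSITIVE.PID": 0.80,
    "ICE.SENSITIVE.PID.CONTACT": 0.62,
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": 0.55,
}
leaf = "ICE.SENSITIVE.PID.CONTACT.EMAIL"
print(cautious_code(leaf, beliefs, 0.9))  # ICE.SENSITIVE (coarse, reliable)
print(cautious_code(leaf, beliefs, 0.5))  # full leaf (specific, less certain)
```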
The bootstrap pipeline iterates three phases until the belief gap
(\( \text{Pl}(A) - \text{Bel}(A) \)) stabilizes:
1. LLM sweep — classify all frontier columns via batch LLM calls
2. ML validation — run the full 6-source DST pipeline; compute
   per-column belief, plausibility, and gap
3. Targeted revisit — re-classify only uncertain columns
   (high gap or low belief) with enriched context (ML prediction +
   belief interval + detected patterns + confusable pairs)
The primary convergence measure is mean belief gap — the average
width of the \( [\text{Bel}, \text{Pl}] \) interval across all
columns. A narrow gap means the evidence sources agree on a confident
prediction. Conflict \( K \) is tracked as a diagnostic signal
(it indicates source disagreement) but does not gate convergence — a
column can have \( K = 0.9 \) but \( \text{Bel} = 0.95 \): the
sources fought, but the winner is clear.
An agent-driven variant (via Claude Agent SDK with 6 tools) delegates
the revisit strategy to an LLM that reasons about uncertainty patterns,
calls retrain_svm to progressively improve the SVM on accumulated
frontier labels, and declares convergence when diminishing returns are
reached. The programmatic variant uses gap + coverage thresholds
for environments where tool-use isn’t available.
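The programmatic gap + coverage check can be sketched as follows; the function and its call are illustrative, with defaults mirroring the convergence-criteria table later in this document:

```python
def check_convergence(mean_gap, frac_unclear, coverage,
                      gap_threshold=0.15, clarity_target=0.10,
                      coverage_target=0.95):
    # All three criteria must hold for the loop to declare convergence
    return (mean_gap < gap_threshold
            and frac_unclear < clarity_target
            and coverage >= coverage_target)

print(check_convergence(0.12, 0.08, 0.97))  # True
print(check_convergence(0.12, 0.08, 0.90))  # False: coverage too low
```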
After the first LLM sweep, the SVM is retrained on blended synthetic +
frontier labels — high-quality classifications from the frontier model on
the stratified importance sample. Synthetic data provides vocabulary breadth
(all categories); frontier labels provide corpus-specific depth. The SVM is
hot-swapped progressively across convergence iterations, carrying
corpus-specific signal into each validation pass. DST independence is
preserved: the SVM trains on frontier-model (Opus) labels while the LLM
mass function in fusion uses the subagent model (Sonnet/Haiku).
The pipeline handles corpora from 50 columns (OOTB sample) to 120M+ columns
(full GitTables at 10M+ tables). Monte Carlo stratified sampling selects
a representative frontier subset for LLM classification and propagates labels
to the remaining corpus via embedding similarity.
With max_frontier_columns = 500, classifying a 120M-column corpus requires
LLM inference on only 0.0004% of columns — a >99.99% cost reduction while
preserving classification quality through DST conflict-driven escalation of
uncertain propagations.
```shell
devenv shell          # Enter dev environment (loads .env automatically)
just install          # Install Python + Node dependencies
just proto            # Generate proto stubs
just resolve-config   # Materialize HOCON → build/config/atelier.env
just up               # Start gRPC + Vite dev server via devenv processes
```
Atelier follows the Fine Tuning Studio proto-first pattern: the gRPC
service contract defines the API, and a FastAPI gateway bridges REST
to gRPC while serving the React frontend.
```
Proto (atelier.proto)    ← Service contract and message definitions
        ↓
Servicer (service.py)    ← Thin router dispatching to business logic
        ↓
Client (client.py)       ← Wrapper around generated stub with error handling
        ↓
Gateway (gateway.py)     ← FastAPI bridge from REST to gRPC + React SPA
```
Terminal sessions survive page navigation and browser reload. The WebSocket
endpoint accepts a client-provided session_id (persisted in localStorage).
On disconnect, the session stays alive server-side — SDK queries continue
running and output accumulates in a ring buffer (64KB collections.deque).
On reconnect, the buffer is replayed so the user sees everything that happened
while they were away.
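The replay-on-reconnect pattern can be sketched as a bounded byte buffer per session; the class and method names here are illustrative (the demo uses a tiny 10-byte budget so eviction is visible, whereas the real buffer is 64KB):

```python
from collections import deque

class SessionBuffer:
    def __init__(self, max_bytes=64 * 1024):
        self._chunks = deque()
        self._size = 0
        self._max = max_bytes

    def append(self, data: bytes):
        self._chunks.append(data)
        self._size += len(data)
        # Evict oldest chunks once the byte budget is exceeded
        while self._size > self._max:
            old = self._chunks.popleft()
            self._size -= len(old)

    def replay(self) -> bytes:
        # On reconnect, the client receives everything still buffered
        return b"".join(self._chunks)

buf = SessionBuffer(max_bytes=10)
buf.append(b"hello ")
buf.append(b"world!")
print(buf.replay())  # b'world!' (oldest chunk evicted)
```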
- Session registry: module-level _sessions dict in terminal.py
- Idle cleanup: background asyncio task sweeps sessions with no client
  for 30 minutes (/api/terminal/sessions lists active sessions)
- Dedicated page: the /terminal route renders a full-screen Ghostty WASM
  terminal; the Landing page embeds the same component at preview size
PostgreSQL probes retry 3x with 1s backoff (PGlite can have transient stalls).
Overall status is connected when gRPC responds, degraded when gRPC is up
but other services are flaky.
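The probe-with-retry and status-aggregation logic described above can be sketched as follows; the function names are illustrative, not the gateway's actual code:

```python
import time

def probe_with_retry(probe, attempts=3, backoff_s=1.0):
    """Call probe(); retry up to `attempts` times with a fixed backoff."""
    for i in range(attempts):
        try:
            return probe()
        except Exception:
            if i == attempts - 1:
                raise  # exhausted retries: surface the failure
            time.sleep(backoff_s)

def overall_status(grpc_ok: bool, others_ok: bool) -> str:
    # gRPC is the anchor service; other services only degrade the status
    if not grpc_ok:
        return "down"
    return "connected" if others_ok else "degraded"

print(overall_status(True, False))  # degraded
```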
The FastAPI lifespan hook runs three startup tasks:
- OOTB seed: check if the ootb-sample source has any dataset versions;
  if none, create version 1 with metadata.
- Hive auto-discovery: discover_hive_sources() probes all configured
  data connections (ATELIER_DATA_CONNECTIONS), iterates databases, finds
  annotations tables matching the known schema (legacy or universal format),
  and auto-registers them via get_or_create_data_source().
load_config() reads the HOCON file with live environment variable
substitution. External tools that need a flat key=value file use
just resolve-config to materialize build/config/atelier.env.
Atelier uses the Claude Agent SDK to drive classification convergence.
Rather than a fixed programmatic loop, an LLM agent reasons about which
columns to revisit based on DST conflict metrics, evidence breakdowns,
and convergence trends.
The agent loop (src/atelier/classify/agent_loop.py) wraps the bootstrap
pipeline functions as six Claude tools. Claude receives an initial state
summary and iteratively calls tools until it determines the classification
has converged.
1. Initial state → agent sees mean gap, mean belief, coverage, K (diagnostic)
2. Agent calls get_conflict_report → identifies uncertain columns (high gap or low belief)
3. Agent calls get_column_detail → inspects per-source evidence breakdown
4. Agent calls revisit_columns → re-classifies with enriched context
5. Agent calls retrain_svm → SVM learns from accumulated frontier labels
6. Agent calls check_convergence → verifies gap trend + belief floor
7. Repeat 2-6 until satisfied
8. Agent calls declare_converged with reason
The conversation loop runs up to classify_agent_max_turns (default 10)
Messages API round-trips. Each tool call returns structured JSON that the
agent uses to plan its next action.
The retrain_svm tool (M9) lets the agent decide when to retrain the SVM
classifier on accumulated frontier LLM labels. The retrained SVM is
hot-swapped via ml_inference.reset() + configure_paths() and used in
subsequent ML validation passes. The agent calls this when it judges enough
new frontier labels have accumulated to improve classification accuracy.
Each revisit_columns call increments state.iteration and triggers
full ML revalidation on all columns, not just the revisited ones. This
ensures that improved LLM labels propagate through the DST fusion.
The agent client is built via _build_client(cfg) which prefers Anthropic
when ANTHROPIC_API_KEY is set, falling back to Bedrock when AWS credentials
are available. The agent model resolves as:
classify_agent_model → agent_model → "claude-sonnet-4-5-20250929".
The bootstrap pipeline (bootstrap.py) contains the programmatic
convergence loop as well: sweep → validate → revisit uncertain → repeat.
The agent loop is an alternative that delegates the revisit strategy to
Claude. Both paths share the same underlying functions (_llm_sweep,
_run_ml_validation, etc.) and produce identical DST evidence.
The agent approach is preferred when:
- The corpus has complex ambiguity patterns (confusable categories)
- You want reasoning traces explaining why convergence was declared
- The LLM backend supports tool_use (Anthropic, Bedrock with Claude)

The programmatic approach is used when:
- The LLM backend doesn’t support tool_use (vLLM, Cerebras)
The gateway exposes /ws/orchestration for live agent event streaming.
Events include agent_spawned, agent_reasoning, agent_tool_call,
and agent_completed. The React frontend’s Agent Canvas page consumes
these events to render the agent’s decision process in real time.
Atelier’s core objective: agent-mediated metadata classification using
Dempster-Shafer Theory (DST) to produce belief intervals instead of flat
confidence scores, exposing epistemic uncertainty and source disagreement.
Four distinct sources of per-column labels show up in our writeups.
Conflating them is a load-bearing error, so we name each explicitly:
| Term | Source | Authority level | Where it appears |
|---|---|---|---|
| Published benchmark | External, human-curated labels (SOTAB, GitTables) | Gold standard — memorization-safe check | SOTAB pilot artifacts; docs/notes/2026-04-19/…phase_gate_2.md |
| Curated reference | Generator-derived (synth pairs an answer-key “reference column” per target) + spot-checked by hand | Definitive for the synthetic corpus; not equivalent to a published benchmark | build/meta-tagging-clean/curated_reference.csv |
| LLM commitment | A single LLM’s pass-1 or pass-2 output | Classifier opinion; not a truth | parquet llm_code, predicted_code |
| CatBoost prior | CatBoost fit to LLM labels, used for revisit enrichment | Not independent evidence — a compressed self-consensus of the LLM; valuable specifically for rescuing abstentions | parquet predicted_code via DST fusion |
An ablation (as used in our writeups) is a controlled experiment
that holds most of the pipeline fixed and varies exactly one component
at a time, so changes in accuracy can be attributed to that component
rather than to the combination.
Traditional classifiers output a single confidence score (e.g., “85% email
address”). This hides two distinct types of uncertainty:
Aleatoric uncertainty: inherent randomness in the data
Epistemic uncertainty: ignorance due to insufficient evidence
DST separates these via belief intervals [Bel(A), Pl(A)]:
Bel(A) = committed evidence supporting A (lower bound)
Pl(A) = evidence that cannot rule out A (upper bound)
Pl(A) - Bel(A) = unresolved ambiguity
When Bel(A) = 0.8 and Pl(A) = 0.85, we have high confidence with low
ambiguity. When Bel(A) = 0.3 and Pl(A) = 0.9, we know something
supports A but much remains uncertain — a signal to gather more evidence.
The discount controls how much mass goes to Θ (total ignorance). Higher
discount = more conservative = wider belief intervals.
Pattern mass is graduated: detect_patterns() returns a match fraction
(0.0-1.0) per pattern, and pattern_to_mass() scales evidence mass by the
average match fraction. A 95% match produces ~3x more mass than a 35% match,
eliminating the binary cliff at the 1/3 detection threshold.
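A sketch of graduated pattern mass; the linear scaling form here is an assumption (only the pattern_theta = 0.25 value comes from the config), and the function names are illustrative:

```python
def pattern_mass(match_fraction, pattern_theta=0.25):
    # Scale the non-Theta mass budget by the match fraction instead of
    # applying a binary pass/fail threshold
    evidence = (1.0 - pattern_theta) * match_fraction
    return {"category": evidence, "theta": 1.0 - evidence}

hi = pattern_mass(0.95)["category"]   # ~0.71
lo = pattern_mass(0.35)["category"]   # ~0.26
print(round(hi / lo, 2))              # ~2.71, roughly the "3x" above
```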
Pattern theta (0.25) is deliberately higher than LLM theta (0.10), so the
LLM cleanly dominates when pattern and LLM evidence conflict — the LLM
considers full context (name, type, values, siblings), while patterns
operate on value structure alone.
Dempster’s rule of combination requires cognitively independent evidence
sources (Shafer 1976) — each mass function must reflect information not derived
from the other sources being combined. Atelier achieves this through
architectural separation of feature spaces and training signals:
| Source | Feature Space | Training Signal | Independence Basis |
|---|---|---|---|
| Name match | String/lexical | None (deterministic) | Symbolic matching only |
| Pattern | Regex | None (deterministic) | Hand-crafted rules only |
| Cosine | Dense embedding (384-dim) | Pre-trained sentence-transformer | Learned semantic similarity |
| LLM | Semantic (frontier or subagent model) | Pre-trained weights | In-context classification |
| CatBoost | Dense embedding + 12 features | Synthetic data generators | Gradient-boosted ensemble |
| SVM | Sparse TF-IDF (char 3-6 + word 1-2 n-grams) | Synthetic data generators | Lexical surface patterns |
The SVM is architecturally the most important independence guarantee. While
cosine similarity and CatBoost both operate on the same dense
sentence-transformer embedding (384 dimensions from all-MiniLM-L6-v2), the
SVM operates on an entirely orthogonal feature representation: sparse TF-IDF
character and word n-grams extracted by sklearn.pipeline.Pipeline +
FeatureUnion. This means the SVM captures lexical surface patterns
(abbreviations, digit sequences, camelCase fragments) that the dense embedding
may collapse — providing genuine corrective signal in DST fusion.
The SVM classifier follows the Pipeline + FeatureUnion composition pattern
from the Signals project; this version of record is presented as an
independent fifth DST evidence source:
- Singleton class filtering — fit() drops categories with < 2 training
  examples before CalibratedClassifierCV, since StratifiedKFold requires
  every class to have >= 2 samples. With 316 categories and few tables, some
  categories inevitably have only one example. Dropped categories are logged
  and still receive predictions from the other 5 DST evidence sources.
- _min_class_count() — returns the actual minimum (no longer clamped to 2)
- feature_importances(top_n) — navigates CalibratedClassifierCV →
  LinearSVC to extract coef_, averages absolute coefficients across classes,
  and cross-references with FeatureUnion.get_feature_names_out() for named
  feature importance
- is_fitted property for safe state checking before prediction
The Monte Carlo sampling architecture enables a stronger training signal for
the SVM without breaking independence. After the bootstrap LLM sweep, the
SVM is retrained on blended synth + frontier labels — high-quality
classifications from the Opus-tier model on the stratified importance sample.
```
_llm_sweep() → frontier columns get Opus labels
        ↓
RETRAIN #1: Blend synth data + frontier labels
            SVM hot-swapped before first ML validation
        ↓
_run_ml_validation() — uses frontier-trained SVM
        ↓
Convergence loop:
    Agent path: agent calls retrain_svm tool when it judges
                enough new labels have accumulated
    Programmatic path: retrain after each revisit iteration
                that adds ≥10 new frontier labels
        ↓
RETRAIN #3 (final): Only if NOT converged
        ↓
CLASSIFYING — final pass uses best available SVM
```
Blending ensures categories not in the frontier sample still have
coverage from synth data (broad vocabulary), while corpus-specific patterns
dominate via frontier signal (depth).
Independence is preserved because:
Training signal: Opus (frontier model, used in LLM sweep)
Bulk LLM source in DST fusion: Sonnet/Haiku (subagent model)
SVM feature space: sparse TF-IDF (orthogonal to all other sources)
The three independence axes:
Different models at training time (Opus) vs. fusion time (Sonnet/Haiku)
Different feature spaces (sparse TF-IDF vs. semantic LLM reasoning)
Different inductive biases (maximum-margin classifier vs. autoregressive LM)
The SVM becomes the transmission mechanism for frontier-quality signal —
MC sampling bounds the Opus cost; the SVM amortizes Opus’s accuracy across
the entire table-space.
- train_svm_on_frontier_labels() in ml_train.py — collects frontier
  labels (label_source in ("llm", "llm_revisit")), blends with synth data,
  trains SVMClassifier, saves to results_dir/svm_frontier.pkl
- _maybe_retrain_svm() in pipeline.py — encapsulates retrain + hot-swap
  via ml_inference.reset() + configure_paths()
- Three call sites in pipeline: post-sweep, iterative, final (if not converged)
- Agent tool retrain_svm for the agent-driven convergence path
When DST evidence splits between two known-confusing categories, mass is
redistributed from the runner-up singleton to a compound focal element
representing the pair. This captures honest ambiguity instead of forcing a
singleton prediction that may be wrong.
Four confusable pairs are active (filtered to vocabulary at runtime):
| Pair | Rationale |
|---|---|
| Record Identifier ↔ Device Identifier | Both are opaque identifiers; context determines which |
| Timestamp ↔ Date of Birth | Both are temporal; DOB is a specific semantic subtype |
| Transaction Amount ↔ Bank Account Number | Both are financial numbers |
| IP Address ↔ Device Identifier | IP addresses can identify devices |
Mechanics: When the top-2 singleton masses form a known pair and their
ratio is below confusable_ratio_threshold (default 3.0), half of the
runner-up’s mass transfers to the pair focal element. The pair’s mass
propagates up the hierarchy via belief_at() — Bel at the common ancestor
reflects the combined evidence.
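The transfer mechanics can be sketched as follows; this is an illustrative reconstruction (function name and mass layout are hypothetical), not the fusion code itself:

```python
def apply_confusable(masses, pairs, ratio_threshold=3.0):
    # masses: {code: singleton mass}; pairs: set of frozenset code pairs
    singletons = {k: v for k, v in masses.items() if isinstance(k, str)}
    top2 = sorted(singletons, key=singletons.get, reverse=True)[:2]
    out = dict(masses)
    if len(top2) < 2:
        return out
    a, b = top2
    if frozenset((a, b)) in pairs and masses[a] / masses[b] < ratio_threshold:
        transfer = masses[b] / 2.0  # half the runner-up's mass
        out[b] -= transfer
        pair_key = frozenset((a, b))
        out[pair_key] = out.get(pair_key, 0.0) + transfer
    return out

pairs = {frozenset(("RECORD_ID", "DEVICE_ID"))}
m = {"RECORD_ID": 0.5, "DEVICE_ID": 0.3, "EMAIL": 0.05}
out = apply_confusable(m, pairs)
print(out["DEVICE_ID"])                             # 0.15
print(out[frozenset(("RECORD_ID", "DEVICE_ID"))])   # 0.15
```

Here 0.5/0.3 is below the 3.0 ratio threshold, so half of DEVICE_ID's mass moves to the compound focal element.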
Pattern detection uses a two-stage architecture: 16 regex patterns for
recall, plus a _VALIDATORS registry for precision. A value must
pass both the regex AND the validator (if one exists) to count.
| Validator | Pattern | Checks |
|---|---|---|
| _luhn_check | credit_card_pattern | Luhn checksum (ISO/IEC 7812) |
| _is_valid_ipv4 | ipv4_pattern | All 4 octets in 0-255 range |
| _is_plausible_date | date_iso_pattern, datetime_iso_pattern | Month 01-12, day 01-31 |
| _is_iso_currency | iso_currency_pattern | ISO 4217 whitelist (~40 codes) |
The phone_pattern uses a suppression mechanism: when a more specific
digit-heavy pattern also fires (SSN, date, credit card, IP, postal code,
monetary, IBAN), the phone match is suppressed. This prevents the phone
regex from injecting false evidence on columns whose values happen to
contain formatted digits.
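The Luhn validator role can be illustrated with a standard checksum implementation (this is the well-known algorithm, not necessarily Atelier's exact `_luhn_check` code): the regex finds card-shaped digit strings, and the checksum rejects random ones.

```python
def luhn_check(number: str) -> bool:
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:      # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9      # equivalent to summing the two digits
        total += d
    return total % 10 == 0

print(luhn_check("4539 1488 0343 6467"))  # True  (valid Luhn test number)
print(luhn_check("4539 1488 0343 6468"))  # False (last digit perturbed)
```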
The pipeline wraps each column result in a HierarchicalClassification object
(ported from signals) that enables post-hoc hierarchy navigation:
belief_at(code) — query Bel at any hierarchy level (leaf or internal)
plausibility_at(code) — query Pl at any level
interval_at(code) — (Bel, Pl) tuple
uncertainty_gap — Pl - Bel for the predicted category
needs_clarification — True when uncertainty_gap > 0.3 or conflict > 0.2
from_combined_evidence() — factory method: filters vacuous sources, combines
via the configured fusion strategy, ranks by pignistic probability
Confidence is the pignistic probability BetP(singleton), the
decision-theoretic transform that distributes each multi-element focal
set’s mass equally among its members.
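A minimal sketch of the pignistic transform (mass keys are frozensets of category codes; the example masses are hypothetical):

```python
def betp(m):
    out = {}
    for A, v in m.items():
        share = v / len(A)          # split the focal mass equally
        for x in A:
            out[x] = out.get(x, 0.0) + share
    return out

m = {frozenset({"EMAIL"}): 0.5,
     frozenset({"EMAIL", "URL"}): 0.3,
     frozenset({"EMAIL", "URL", "PHONE"}): 0.2}
p = betp(m)
print(p)  # EMAIL gets 0.5 + 0.15 + 0.2/3, about 0.717
```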
Two DST combination rules are implemented, selectable via classify.fusion_strategy:
dempster (default) — Classical Dempster’s rule with (1-K) normalization.
Under high conflict, surviving singletons are amplified.
yager — Yager’s modified rule. Conflict mass is redirected to Θ
(ignorance) instead of being normalized away. Preserves epistemic honesty
at the cost of higher ignorance mass and typically lower peak belief values.
When K=0, produces identical results to Dempster.
Yager is available as an opt-in alternative for empirical validation.
The default (Dempster) remains in place pending A/B comparison on real
pipeline runs — Yager’s increased conservatism may or may not improve
overall classification quality, and compensatory adjustments to per-source
discounting or decision thresholds may be needed.
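The difference between the two rules shows up most clearly under high conflict; the following illustrative comparison uses hypothetical two-category masses:

```python
THETA = frozenset({"A", "B"})

def combine(m1, m2, rule="dempster"):
    out, K = {}, 0.0
    for X, x in m1.items():
        for Y, y in m2.items():
            Z = X & Y
            if Z:
                out[Z] = out.get(Z, 0.0) + x * y
            else:
                K += x * y
    if rule == "dempster":
        # Classical rule: normalize conflict away
        out = {Z: v / (1.0 - K) for Z, v in out.items()}
    else:
        # Yager: conflict mass becomes ignorance on Theta
        out[THETA] = out.get(THETA, 0.0) + K
    return out, K

m1 = {frozenset({"A"}): 0.9, THETA: 0.1}
m2 = {frozenset({"B"}): 0.9, THETA: 0.1}
d, K = combine(m1, m2, "dempster")
y, _ = combine(m1, m2, "yager")
print(K)                    # 0.81, very high conflict
print(d[frozenset({"A"})])  # ~0.474: normalization amplifies survivors
print(y[frozenset({"A"})])  # 0.09: Yager keeps it low, Theta absorbs K
```

When K = 0 the branches coincide, matching the note above.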
The bootstrap pipeline wraps the single-pass ML pipeline in an iterative
LLM↔ML convergence loop. It adds LLM evidence and repeats until
predictions are settled — measured by belief-gap convergence, not
raw conflict K.
LLM Sweep (LLM_SWEEP): Batch-classify all columns via the configured
LLM backend (Claude via Bedrock/Anthropic, or any OpenAI-compatible endpoint).
Columns are sent in table-aware batches with sibling context. If every batch
fails, the sweep raises RuntimeError (fail-fast) instead of silently
proceeding with zero labels.
ML Validation (VALIDATING): Run the full 6-source DST pipeline for
each column. Compute per-column belief interval [Bel, Pl], conflict K,
and uncertainty gap Pl - Bel. Identify uncertain columns where
predictions need revisiting.
Targeted Revisit (back to LLM_SWEEP): Re-classify uncertain columns
with enriched context — the ML prediction, belief interval, pattern signals,
and value descriptions are included in the prompt. This gives the LLM
evidence it didn’t have in the first pass.
The primary convergence measure is the uncertainty gap Pl - Bel for
each column’s predicted category. This directly answers “how settled is this
prediction?” — unlike K, which only measures source disagreement.
A column can have K=0.9 but Bel=0.95 — the sources fought hard during
combination, but the normalizing denominator (1-K) concentrated surviving
mass on the agreed-upon singleton. That column’s prediction is settled
despite high conflict; it doesn’t need revisiting.
Convergence criteria (all must hold):
| Criterion | Metric | Default | Meaning |
|---|---|---|---|
| Primary | mean_gap < gap_threshold | 0.15 | Predictions are tight |
| Secondary | frac_unclear < clarity_target | 0.10 | At most 10% of columns need clarification |
| Coverage | coverage >= coverage_target | 0.95 | 95% of columns have labels |
Revisit targeting: _identify_uncertain_columns() selects columns
where gap > 0.3 OR Bel < bel_floor (default 0.50), sorted by gap
descending (most uncertain first).
Early stopping: The proof-of-progress paradigm monitors the gap trend.
When mean gap plateaus for 2 consecutive iterations (no verifiable progress),
the loop stops even if the threshold hasn’t been reached.
Conflict K remains in logs, iteration metrics, and agent tools as a
diagnostic for source disagreement. It is useful for identifying
calibration issues (e.g., a pattern detector producing false positives)
but does not gate convergence. The cumulative K formula
K = 1 - Π(1 - Kᵢ) tends to be high (~0.5-0.8) with 6 partially
correlated sources; this is expected and does not indicate poor quality.
As an alternative to the programmatic loop, the agent convergence loop
(agent_loop.py) delegates revisit strategy to Claude. The agent uses
6 tools — get_conflict_report, revisit_columns, check_convergence,
get_column_detail, retrain_svm, declare_converged — to reason about
which columns need re-examination. The agent sees both gap-based and K-based
metrics and can make nuanced decisions. See Keystone Agents.
llm_backend.py provides a factory-pattern abstraction:
OpenAICompatibleBackend: For vLLM, GLM-4.7, and any endpoint
implementing the OpenAI chat completions API. Default backend.
AnthropicBackend: For Claude via the Anthropic SDK.
BedrockBackend: For AWS Bedrock via the Converse API.
BedrockStructuredBackend: Production default on CAI. Uses
invoke_model with tool-use for structured output (output_config
is not supported on Bedrock). When extended thinking is enabled,
tool_choice must be "auto" (Anthropic constraint); a text-block
fallback parser handles this case. Both Bedrock backends use
region_from_arn() to extract the target region from cross-region
inference profile ARNs.
CerebrasBackend: OpenAI-compatible with Cerebras-specific defaults
(base_url=https://api.cerebras.ai/v1, model=zai-glm-4.7).
create_backend_from_cfg(cfg): Factory that reads HOCON config
to select and configure the appropriate backend.
Backends fail fast when not configured — no mock fallback in production code.
Per-item feature attribution explaining why each column was classified as
it was. Complements the global SAGE importance (which ranks features across
the entire dataset) with item-level explanations.
All DST discount factors are configurable via HOCON. The DiscountConfig
dataclass bundles all parameters with DiscountConfig.from_cfg(cfg) factory:
```hocon
classify.discounts {
  cosine = 0.30                     # Cosine similarity → Theta mass
  svm = 0.20                        # SVM → Theta mass
  pattern_theta = 0.25              # Pattern detection → Theta mass (graduated by match fraction)
  name_match_exact = 0.70           # Exact label match singleton mass
  name_match_code = 0.50            # Formal code/abbrev match mass
  name_match_alias = 0.50           # Common name alias match mass
  name_match_overlap = 0.30         # Word overlap match mass
  catboost_base = 0.10              # Adaptive discount base
  catboost_variance_scale = 1.6     # Variance-to-discount scaling
  catboost_max = 0.50               # Cap on adaptive discount
  catboost_fallback = 0.15          # When no variance available
  confusable_ratio_threshold = 3.0  # CatBoost confusable pair threshold
}
```
Environment variable overrides: ATELIER_DISCOUNT_COSINE, ATELIER_DISCOUNT_SVM, etc.
At small corpus sizes (< 200 columns), every column receives direct
frontier-LLM classification. As the corpus scales to thousands or millions
of columns, this becomes prohibitively expensive. Monte Carlo stratified
sampling selects a representative subset for LLM inference and propagates
labels cheaply via embedding similarity.
This is a zero-cost optimization: below the threshold, the pipeline
behaves identically to before. The MC layer activates transparently at scale.
The MC layer operates between SAMPLING and LLM_SWEEP in the existing
pipeline. No new FSM states — it runs as sub-phases.
```
SAMPLING
 ├─ [existing] Extract features for all columns
 ├─ Pre-classify: cheap M0 evidence (name, pattern, cosine) — no LLM
 ├─ Stratify: group by preliminary category + uncertainty
 └─ Select MC sample: importance-weighted within strata
LLM_SWEEP
 ├─ [existing] Frontier LLM classifies MC sample (not all columns)
 └─ Propagate: extend labels to remaining corpus via embedding similarity
VALIDATING
 └─ [existing] Full 6-source DST on ALL columns
    (propagated labels enter as discounted LLM evidence)
    → High-gap / low-belief propagated columns escalate to revisit
```
Run M0 evidence sources only (no LLM, no ML models). For each column:
- Name matching → best category + mass
- Pattern detection → matched categories
- Cosine similarity → top-K categories + scores
Returns a preliminary category code + confidence for every column. Uses the
existing name_match_to_mass(), pattern_to_mass(), classify_cosine()
functions from the pipeline.
- For each propagation column, find the nearest frontier column by cosine
  similarity (stratum-local to limit search space)
- If similarity >= propagation_threshold: assign the same label with
  discounted confidence
- If similarity < threshold: the column gets no LLM evidence in DST
Propagated labels enter DST fusion with a higher discount factor (0.30 vs
0.10 for direct LLM) — they carry less evidential mass. If M0 sources
disagree with the propagated label, conflict K rises and the existing
targeted-revisit loop automatically escalates the column to the frontier model.
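A toy sketch of the propagation step; the real pipeline searches stratum-locally, whereas this illustrative version (hypothetical function name, tiny 2-d "embeddings") searches globally:

```python
import numpy as np

def propagate(frontier_vecs, frontier_labels, corpus_vecs, threshold=0.85):
    # Normalize so a dot product equals cosine similarity
    f = frontier_vecs / np.linalg.norm(frontier_vecs, axis=1, keepdims=True)
    c = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    sims = c @ f.T                  # (n_corpus, n_frontier)
    nearest = sims.argmax(axis=1)
    best = sims[np.arange(len(c)), nearest]
    # Below-threshold columns get no propagated label (None)
    return [(frontier_labels[j], s) if s >= threshold else (None, s)
            for j, s in zip(nearest, best)]

frontier = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = ["EMAIL", "PHONE"]
corpus = np.array([[0.9, 0.1], [0.5, 0.5]])
result = propagate(frontier, labels, corpus)
print(result[0][0])  # EMAIL: close enough to a frontier exemplar
print(result[1][0])  # None: below threshold, no LLM evidence in DST
```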
GitTables corpus: 1.7M tables today, 10M+ near-term. Average 8-12
columns per table = 15M-120M columns at full scale.
| Corpus | MC Mode | Frontier Calls | Propagated | Cost Reduction |
|---|---|---|---|---|
| 50 | Passthrough | 50 (all) | 0 | 0% |
| 500 | Active | ~75 (15%) | ~425 | 85% |
| 5,000 | Active | ~500 (cap) | ~4,500 | 90% |
| 50K | Active | ~500 (cap) | ~49.5K | 99% |
| 500K | Active | ~500 (cap) | ~499.5K | 99.9% |
| 15M | Active | ~500 (cap) | ~15M | >99.99% |
| 120M | Active | ~500 (cap) | ~120M | >99.99% |
At the max_frontier_columns=500 cap, stratified importance sampling ensures
every category stratum gets at least min_per_stratum=3 exemplars. Uniform
random sampling at 500/15M would miss rare categories entirely.
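A sketch of stratified sampling with a per-stratum floor, under the stated cap and min_per_stratum values; the function is illustrative, not Atelier's sampler:

```python
import random

def stratified_sample(strata, cap=500, min_per_stratum=3, seed=0):
    rng = random.Random(seed)
    picked = []
    # Guarantee every stratum its floor first, even tiny ones
    for cols in strata.values():
        k = min(min_per_stratum, len(cols))
        picked.extend(rng.sample(cols, k))
    # Fill the remaining budget from the leftovers
    leftovers = [c for cols in strata.values() for c in cols
                 if c not in picked]
    budget = max(0, cap - len(picked))
    picked.extend(rng.sample(leftovers, min(budget, len(leftovers))))
    return picked

strata = {"email": list(range(1000)), "rare_cat": [1000, 1001]}
sample = stratified_sample(strata, cap=10)
# Both rare-category columns survive; uniform sampling at 10/1002
# would usually miss them entirely
print(len(sample), 1000 in sample, 1001 in sample)
```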
Atelier uses GPU acceleration for sentence-transformer embedding computation
and CatBoost training/inference. GPU support is auto-detected at startup
with graceful fallback to CPU.
In devenv (nix-managed), CUDA libraries are isolated from the host system.
The GPU module handles the nix+CUDA compatibility pattern by detecting
the driver library path and ensuring PyTorch can find it. This avoids
the common nix pitfall where torch.cuda.is_available() returns False
despite GPUs being present.
embedding.py calls preflight_gpu() before initializing the
SentenceTransformer model, passing device=gpu_info.resolved_device:
```python
gpu_info = preflight_gpu()
model = SentenceTransformer("all-MiniLM-L6-v2", device=gpu_info.resolved_device)
```
GPU batch encoding achieves ~2,768 texts/second on an RTX 4090 (vs ~400/s
on CPU). This matters at scale: encoding 15M columns takes ~90 minutes on
GPU vs ~10 hours on CPU.
CatBoost automatically uses GPU when available via its task_type
parameter. The virtual ensemble posterior sampling that drives uncertainty
quantification benefits from GPU parallelism.
GPU status appears in just preflight output and in the /api/status
gateway endpoint, giving operators immediate visibility into whether
GPU acceleration is active.
The classification pipeline includes two ML evidence sources — CatBoost and
SVM — that require training data. Atelier generates synthetic training data
from the controlled vocabulary, trains both classifiers, and uses them as
independent evidence sources in DST fusion.
synth_generators.py is the single source of truth for 316+ hand-coded
value generators shared across the synth framework, sample source generation,
and the registry.
Each generator is a callable (rng: random.Random) -> str that produces
realistic values for its category.
The SVM classifier uses the Pipeline + FeatureUnion composition adopted
wholesale from the Signals project:
1. Build short text from column name + type + sample values via build_svm_text()
2. FeatureUnion extracts dual TF-IDF features:
   - Character n-grams (3-6, char_wb analyzer) — captures subword patterns
   - Word n-grams (1-2) — captures multi-word patterns
3. CalibratedClassifierCV(LinearSVC, method="sigmoid") — Platt scaling
   for calibrated probability estimates
4. _min_class_count() guard prevents calibration CV crash on small classes
5. Save to .pkl + .classes.json via joblib
The SVM operates on sparse lexical features — architecturally independent
from the dense sentence-transformer embedding used by cosine and CatBoost.
See Classification Pipeline for
the full independence analysis.
CatBoost’s posterior_sampling=True enables Bayesian uncertainty
quantification via virtual ensembles. The classifier produces not just
class probabilities but per-class variance estimates. High variance
translates to a higher DST discount factor — uncertain ML predictions
carry less evidential weight in the fusion.
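As an illustration of how that could work, here is a sketch of classical Shafer discounting with an assumed variance-to-reliability mapping (the exponential decay and its scale are inventions for this example, not Atelier's actual formula):

```python
import math

# Classical Shafer discounting: reliability alpha shrinks each focal mass,
# with the remainder transferred to total ignorance (Theta).
def discount(masses: dict, alpha: float) -> dict:
    out = {k: alpha * v for k, v in masses.items() if k != "THETA"}
    out["THETA"] = 1.0 - alpha * (1.0 - masses.get("THETA", 0.0))
    return out

# Assumed mapping: higher posterior variance -> lower reliability.
# (Exponential decay with scale=10 is an invention for this sketch.)
def alpha_from_variance(var: float, scale: float = 10.0) -> float:
    return math.exp(-scale * var)

m = {"email": 0.7, "phone": 0.1, "THETA": 0.2}
a = alpha_from_variance(0.02)   # per-class variance from the virtual ensemble
print(round(a, 3))
print({k: round(v, 3) for k, v in discount(m, a).items()})
```

The key property: discounting preserves total mass while shifting belief from focal elements toward ignorance, so a high-variance prediction widens the belief interval rather than asserting a confident wrong answer.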
After the bootstrap LLM sweep, the pipeline has high-quality frontier labels
from the Opus-tier model. train_svm_on_frontier_labels() blends these
with synthetic data and retrains the SVM progressively:
Post-sweep (always): After the first LLM sweep labels frontier columns,
retrain immediately so the SVM carries corpus-specific signal into the
first ML validation pass.
Iterative (during convergence): In the programmatic loop, retrain
after each revisit iteration that adds ≥10 new frontier labels. In the
agent-driven loop, the agent calls retrain_svm when it judges enough
new labels have accumulated.
Final (only if not converged): Last-resort retrain with all accumulated
labels before the final classification pass. Skipped when already converged
(the last iteration’s model is already in use).
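The three-stage policy can be condensed into a small decision function (a sketch mirroring the prose above; phase names and this function are illustrative, not Atelier's actual API):

```python
# Condensed sketch of the three retrain triggers described above.
def should_retrain(phase: str, new_frontier_labels: int, converged: bool) -> bool:
    if phase == "post_sweep":
        return True                         # always: carry corpus signal forward
    if phase == "iteration":
        return new_frontier_labels >= 10    # programmatic-loop threshold
    if phase == "final":
        return not converged                # skip when already converged
    return False

print([should_retrain("post_sweep", 0, False),
       should_retrain("iteration", 7, False),
       should_retrain("final", 0, True)])   # [True, False, False]
```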
sage.py computes global feature importance via permutation-based
SAGE values. Each of the 12 discrete features is ablated and the
classification accuracy impact measured:
High SAGE value = feature is critical for classification
Low SAGE value = feature adds little discriminative power
SAGE runs on the frontier sample when MC sampling is active
(representative subset), reducing computation at scale.
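A toy permutation-ablation loop shows the principle (the real sage.py computes proper SAGE values; this illustration only demonstrates shuffle-and-measure):

```python
import random

# Illustrative permutation-style ablation in the spirit of SAGE values:
# shuffle one feature at a time and measure the accuracy drop.
def permutation_importance(predict, X, y, n_features, seed=0):
    rng = random.Random(seed)
    base = sum(predict(x) == t for x, t in zip(X, y)) / len(y)
    drops = []
    for j in range(n_features):
        col = [x[j] for x in X]
        rng.shuffle(col)                     # break feature j's association
        Xp = [list(x) for x in X]
        for i, v in enumerate(col):
            Xp[i][j] = v
        acc = sum(predict(x) == t for x, t in zip(Xp, y)) / len(y)
        drops.append(base - acc)             # large drop = critical feature
    return drops

# Toy model: only feature 0 carries signal, feature 1 is noise.
X = [(1, 5), (-1, 2), (2, 9), (-2, 1)]
y = [True, False, True, False]
drops = permutation_importance(lambda x: x[0] > 0, X, y, 2)
print(drops)
```

The noise feature's drop is exactly zero because shuffling it cannot change any prediction, matching the "low SAGE value = little discriminative power" reading above.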
The Embeddings page provides interactive visualization of classification results. It renders 2D projections of embedding vectors, allowing users to explore clusters, search data points, and cross-filter by metadata columns.
The viewer runs entirely in the browser. DuckDB WASM loads parquet data locally and the EmbeddingAtlas component (from Apple’s embedding-atlas library) renders the visualization using WebGPU with WebGL 2 fallback.
The initial dataset is derived from the GitTables CTA benchmark — 2,517 columns extracted from real tables, annotated with 122 DBpedia property types. These instance labels serve as the controlled vocabulary to be grounded in the SIGDG ontology.
To prepare the visualization parquet:
# From signals evaluation output (recommended)
just prepare-gittables ~/local/src/cldr/signals/build/gittables_eval.parquet
# Then seed the database
just seed
The preparation script computes sentence-transformer embeddings and UMAP 2D projections. The resulting parquet includes DST evidence fusion columns (belief, plausibility, uncertainty gap) when derived from the signals evaluation output.
The Embeddings page is powered by Apple’s embedding-atlas library. This is unrelated to Apache Atlas, the Cloudera metadata governance catalog used by the signals pipeline.
Embeddings (Atelier) — Interactive scatter plot of classification embeddings
Apache Atlas (Cloudera/signals) — Metadata governance catalog on port 21000
To avoid confusion, all user-facing surfaces use “Embeddings”. The embedding-atlas library name appears only in developer documentation and package.json.
Atelier organizes classification work around data sources — each
source contains input tables, and every pipeline run against a source
produces a new dataset version. This replaces the earlier flat
dataset model and enables the OOTB onboarding experience.
Vocabulary routing: For in-situ classification, the customer’s domain
vocabulary IS the classification target — the LLM reads labels and
descriptions and classifies into the domain’s hierarchical dot-codes.
The annotations table location is configured per source via vocab_uri
(e.g. meta.vocab, meta.annotations), decoupling data tables from the
vocabulary. Multiple sources can share the same annotations table.
Future work: A portable pre-trained model (classify-ICE-then-map)
would classify against the built-in ICE vocabulary and translate results
to customer terms via VocabMapping. This requires dedicated training
hardware and is not yet implemented.
CREATE TABLE data_sources (
id TEXT PRIMARY KEY,
source_type TEXT NOT NULL, -- 'sample' | 'hive'
source_uri TEXT NOT NULL DEFAULT '',
display_name TEXT NOT NULL,
vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
vocab_uri TEXT NOT NULL DEFAULT '', -- e.g. 'meta.vocab', 'meta.annotations'
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
metadata TEXT -- JSON: table_count, column_count
);
-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;
When a pipeline run starts, the source_id determines which vocabulary
loads:
ootb-sample: load_sample_vocabulary() → data/sample/ontology.json
(316 BFO-grounded leaves across the CCO ICE trichotomy)
hive/synth: Domain annotations loaded directly from the table
specified by vocab_uri. The domain vocabulary IS the classification
target — no composition with the universal base. Hive sources always
require an annotations table.
No source: Falls back to universal vocabulary (16 PII leaves)
The LLM classification batch uses adaptive sizing to avoid context
truncation. With large vocabularies (>200 categories), the system prompt
embedding the full category table can consume significant context.
Adaptive batch sizing: _estimate_safe_batch_size() reduces
columns_per_call for large vocabularies (e.g. 290 categories → 41)
Truncation retry: When LLMResponse.truncated is detected, the
batch is halved and retried recursively until all columns are classified
Metrics: truncation_count and effective_batch_size tracked in
BootstrapState and exposed via the agent’s check_convergence tool
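A sketch of both mechanisms, with invented constants (the real _estimate_safe_batch_size() signature and retry code in Atelier may differ):

```python
# Hypothetical constants throughout; illustrative only.
def estimate_safe_batch_size(n_categories: int, context_tokens: int = 16_000,
                             tokens_per_category: int = 30,
                             tokens_per_column: int = 120,
                             max_batch: int = 100) -> int:
    """Shrink columns_per_call as the vocabulary table consumes context."""
    remaining = context_tokens - n_categories * tokens_per_category
    return max(1, min(max_batch, remaining // tokens_per_column))

def classify_with_retry(columns, call):
    """Halve the batch and recurse when the LLM response was truncated."""
    result = call(columns)
    if not result.truncated or len(columns) == 1:
        return result.labels
    mid = len(columns) // 2
    return (classify_with_retry(columns[:mid], call)
            + classify_with_retry(columns[mid:], call))

print(estimate_safe_batch_size(16), estimate_safe_batch_size(290))
```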
The built-in “Sample” source (source_id ootb-sample) ships with
Atelier so new deployments show meaningful data immediately. When the
landing page loads and “Connected” turns green, the stats cards show
316 Terms and 316 Entities. The ootb- prefix in the id is an
internal marker distinguishing shipped sources from user-registered
connections — it is not shown in the UI.
351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.
Design principle: every category is our own BFO-grounded term. External
sources (GitTables, meta-tagging) inform which conceptual space to cover;
we never import their raw tags. The mapping goes outward from our vocabulary
via atelier-vocab.ttl, not inward.
25 mixed-domain tables with 316 columns (100 rows each). Tables are
deliberately cross-domain — a customers table contains identity,
contact, metadata, and categorical columns — so the classification
pipeline cannot rely on table name alone.
~25% of columns use opaque names (field_42, var_abc, col_xyz)
to exercise the pipeline’s ability to classify from values and context
rather than column name heuristics.
Generated by scripts/generate_sample_source.py. The curated
reference for the Sample source fixture is committed in
data/sample/reference_labels.json (scope: fixture-only, for OOTB
demo and unit tests).
For UAT / production evaluation, the curated reference lives at
build/meta-tagging-clean/curated_reference.csv (gitignored) — built
by scripts/parity/build_curated_reference.py from direct
reference-column evidence plus name-index lookup with
Ontology > Annotation > Common Names priority. UAT’s own
classification outputs are provisional predictions and are scored
against this curated reference at
build/results/parity/delta_report.md.
Data Source card: dropdown selector for sources + version table
showing version number, column count, timestamp, and summary.
Click a row to activate that version.
Classification Pipeline card: “Start Classification” passes
activeSourceId to /api/fsm/start?source_id=…
The Landing page stats cards reflect the active source:
Terms: vocabulary size for the active source (316 for the Sample source)
Entities: column count from the active dataset version
Sources badge: shows count when multiple sources exist
This page documents two planned integration points that extend the
data source model: MLflow experiment tracking (Phase 5) and Hive data
connections (Phase 6). Both are designed but not yet implemented.
MLflow is only active on CAI (cfg.is_cml). In devenv, the bridge
is a no-op. The mlflow package is an optional dependency — import
failure is handled gracefully.
The OOTB sample source demonstrates the pipeline with synthetic data.
In production on CAI, the real value comes from classifying columns in
the customer’s actual Hive tables via CAI data connections.
Hive sources are auto-discovered at gateway startup. The gateway
lifespan hook calls discover_hive_sources(cfg) which:
Iterates all connections listed in ATELIER_DATA_CONNECTIONS
For each connection, runs SHOW DATABASES and checks each database
for an annotations table
Validates the schema: fetches 1 row and checks for legacy
(id, ontology, annotation) or universal (code, label) format
Auto-registers valid sources via get_or_create_data_source()
(idempotent — safe to re-run on restart)
Once registered, the pipeline route works automatically:
Pipeline resolves data from the connection: when source_id
refers to a hive source, the pipeline calls discover_tables() and
sample_table_metadata() using that connection
Vocabulary routing: hive sources use load_annotations_from_hive(),
which reads the annotations table configured via vocab_uri (e.g.
default.annotations) and composes those domain categories on top of
the universal base
Results register as versions: each pipeline run creates a new
version under the hive source, with the same activation/versioning
semantics as the sample source
Domain categories attach to the universal tree via parent_code
references. Categories without a valid parent are logged as warnings
and placed under a catch-all internal node.
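That attachment rule can be sketched as follows (the function name and the catch-all code "domain.misc" are inventions for illustration):

```python
import logging

# Hypothetical sketch of the parent_code attachment rule described above.
def attach(tree_codes: set, categories: list, catchall: str = "domain.misc") -> dict:
    placed = {}
    for cat in categories:
        parent = cat.get("parent_code")
        if parent in tree_codes:
            placed[cat["code"]] = parent
        else:
            # Invalid parents are logged and routed to the catch-all node
            logging.warning("category %s: no valid parent %r, using catch-all",
                            cat["code"], parent)
            placed[cat["code"]] = catchall
    return placed

print(attach({"pii", "pii.contact"},
             [{"code": "pii.contact.email", "parent_code": "pii.contact"},
              {"code": "acct.iban", "parent_code": "finance"}]))
```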
# In pipeline.py — source-based auto-resolution
if source.source_type == "hive":
    connection_name = source.source_uri.split("/")[0]
    database = source.source_uri.split("/")[1]
    # discover_tables() and sample_table_metadata() use the connection
    # load_annotations_from_hive() uses the connection for vocabulary
Phase 5 can be developed and unit-tested independently (the queue
and reconcile logic is pure Python). The MLflow API calls can be
mocked in tier-0 BDD scenarios.
Phase 6 is primarily wiring — the heavy lifting (table discovery,
vocabulary loading, pipeline execution) already exists. The main
new code is the gateway endpoint for source creation and the UI
for triggering it.
Atelier ships with encrypted deployment defaults so a CAI operator
can stand up a working instance by entering only four environment
variables — their two AWS Bedrock credentials, a direct Anthropic API
key (for overwatch), plus a single age private key that unlocks
everything else.
Every CAI deployment needs a dozen-ish environment variables: Bedrock
model ARNs, Atlas / Ranger URLs, feature toggles, governance flags,
subagent model IDs, and — for UAT runs — a curated-reference CSV for
accuracy measurement. Most of those values are identical across
every deployment of the same Atelier release; only the AWS credentials
and the Anthropic key are operator-specific. Rather than documenting
a long checklist for every customer, we encrypt the defaults and
the curated-reference fixture into the repository with
SOPS and ship one key alongside the deployment.
The operator pastes in the key; everything else is already wired up.
Set four environment variables on the CAI Application, then start it:
Name                   | Value                          | Source
AWS_ACCESS_KEY_ID      | Bedrock access key             | your AWS / IAM team
AWS_SECRET_ACCESS_KEY  | Bedrock secret                 | your AWS / IAM team
ANTHROPIC_API_KEY      | direct Anthropic API key       | Anthropic Console
SOPS_AGE_KEY           | full AGE-SECRET-KEY-1… string  | provided out-of-band by the Atelier maintainer
On startup, bin/start-app.sh runs the shared
bin/bootstrap-secrets.sh utility, which decrypts both .env.cai.enc
(dotenv defaults) and features/fixtures/curated_reference.csv.enc
(meta-tagging answer key) with the age key you provided. The dotenv
values source into the environment where HOCON’s ${?VAR} substitution
picks them up; the decrypted CSV materializes at
build/data/curated_reference.csv and ATELIER_CLASSIFY_REFERENCE_URI
points at it so evaluation_report.json carries real accuracy
numbers. No per-customer checklist to maintain.
Overrides still work. Any explicit ATELIER_* env var on the CAI
Application wins over the encrypted default — so an operator who
wants a different Bedrock ARN just sets ATELIER_AGENT_MODEL directly
and that value takes precedence.
If the operator already has the age key on disk (e.g. mounted from a
secret store), they can set SOPS_AGE_KEY_FILE=/path/to/key.txt
instead of pasting the key content. bin/start-app.sh supports both.
Place your age private key at ~/.config/sops/age/keys.txt — the
public key must match the age: age1… line in .sops.yaml. The
devenv shell provides both sops and age binaries.
The plaintext .env.cai is excluded by .gitignore; only the
encrypted .env.cai.enc is tracked. SOPS encrypts each value
independently, so diffs show which keys changed even though their
values are opaque.
The meta-tagging answer key (what evaluation_report.json compares
predictions against) ships encrypted under the BDD fixtures tree so
committed secrets live with the corpus they validate.
# From the maintainer's reviewer xlsx
uv run python -m atelier.overwatch.ingest_reference \
~/path/to/Atelier_Results_Default_DB_4-16.xlsx \
--out build/data/curated_reference.csv
# Encrypt into features/fixtures/ and commit the ciphertext only
just encrypt-reference
git add features/fixtures/curated_reference.csv.enc
git commit -m "chore: update curated-reference answer key"
To inspect the current key without re-running the xlsx ingest:
just decrypt-reference # decrypts into build/data/curated_reference.csv
$PAGER build/data/curated_reference.csv
Both the plaintext CSV (in build/) and .env.cai are ignored by
git; only the .enc ciphertexts are tracked.
age-keygen -o new-key.txt # generate replacement pair
# update .sops.yaml: replace the age: age1... line with the new public key
sops updatekeys .env.cai.enc # re-encrypt deployment defaults
sops updatekeys features/fixtures/curated_reference.csv.enc # AND the curated-reference fixture
git commit -am "chore: rotate CAI deployment key"
# distribute the new private key to operators via the same out-of-band channel
sops updatekeys rewrites the encrypted file’s recipient list in
place — nothing about the plaintext values changes, so this is a
zero-content-drift rotation. Run it against every encrypted
artifact so the new key unlocks the whole set.
SOPS only populates environment variables. HOCON (config/base.conf)
already treats all configuration as environment-overridable via the
${?VAR} pattern:
agents {
  model = "claude-opus-4-7"
  model = ${?ATELIER_AGENT_MODEL}  # env wins when set
}
SOPS decryption runs before the gRPC server loads HOCON, so from
HOCON’s perspective the encrypted values are just ordinary environment
variables.
.env.cai.enc — deployment-specific defaults that differ
between environments but aren’t operator secrets per se (model
ARNs, Knox endpoints, feature toggles, subagent IDs). Values that
are derivable from context and you don’t want every operator
to rediscover.
config/base.conf — true defaults that hold for every
deployment; structural knobs that belong in source control in
plaintext (pipeline thresholds, port numbers, fusion strategy).
Operator-entered env vars — genuine per-deployment secrets
(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, the SOPS_AGE_KEY
itself). These never live in the repository.
SOPS_AGE_KEY decrypts only this project’s .env.cai.enc.
Losing it costs you these defaults; gaining it grants no AWS,
Cloudera, or third-party privilege on its own.
Each customer should get the same age private key (defaults are
identical across deployments) — per-customer secrets, if any, stay
in the CAI Application’s own environment variables.
Rotate the key whenever a recipient leaves the operator pool.
The age public key in .sops.yaml is intentionally committed;
public keys are meant to be public.
Atelier uses behave (BDD) to capture platform decisions as executable specifications. Every scenario answers a concrete question: Does the config load? Can the runtime start? Does the classification pipeline converge?
These aren’t just tests. They’re the design context that connects architectural choices to the deployment realities of Cloudera AI.
CAI deployment has four modalities — Project, Application, AMP, and Studio — each with different constraints on networking, filesystem layout, and process lifecycle. Traditional unit tests verify module behavior in isolation. BDD scenarios verify that the system hangs together across these modalities.
Consider the Application modality: when CDSW_APP_PORT is set, the startup script must bind to 127.0.0.1 because CAI’s reverse proxy handles external traffic. Bind to 0.0.0.0 instead and you bypass the proxy’s auth layer. This isn’t a bug in any single module — it’s a deployment contract that only a scenario can express clearly:
Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
Given CDSW_APP_PORT is set to "8090"
When I parse bin/start-app.sh for the HOST variable
Then HOST is "127.0.0.1"
The scenario is the spec. A colleague reading this knows exactly what the constraint is, why it matters, and can verify it passes with just behave.
Scenarios are tagged by the infrastructure they require. The ATELIER_BDD_TIER environment variable controls which tiers run.
Tier | Tag       | Requires     | Purpose
0    | @tier-0   | Python only  | Config, imports, classification pipeline, agent loop, ML classifiers
1    | @tier-1   | devenv stack | PostgreSQL, Qdrant, gRPC, full gateway startup
cai  | @tier-cai | CAI session  | Live deployment validation — always skipped locally
Additional tags:
@slow — scenarios requiring extended runtime (pipeline E2E, ML training)
@gpu — GPU acceleration scenarios (run on CPU too, just slower)
Tier 0 runs everywhere: laptops, CI, CAI sessions. No services, no network calls. This is where the runtime profile lives — the scenarios that catch deployment failures before you push.
Tier 1 requires devenv up to be running (PostgreSQL on :5533, Qdrant on :6334). These verify that services are healthy and that the application can actually connect to its data stores.
Tier CAI exists as executable documentation. The step definitions are stubs — they express what should happen in a live CAI session without automating it. When debugging a deployment failure, these scenarios are a checklist.
# Full BDD suite including gateway checks (preferred)
just behave
# Tier-0 only (no services needed)
just bdd
# Tier-0 + tier-1 (requires devenv up)
just bdd-full
# Runtime profile specifically
just bdd-runtime
# Single domain
ATELIER_BDD_TIER=0 uv run behave features/agent/
# Single feature file
uv run behave features/agent/classification.feature
# By tag
ATELIER_BDD_TIER=0 uv run behave features/ -t @bootstrap
# Verbose (show all steps, not just failures)
just behave --no-capture
Behave only discovers step definitions from features/steps/. Domain step definitions live in <domain>/step_defs/ directories and are re-exported through features/steps/__init__.py:
from features.infra.step_defs.config_steps import *
from features.infra.step_defs.health_steps import *
from features.infra.step_defs.preflight_steps import *
from features.deployment.step_defs.runtime_steps import *
from features.deployment.step_defs.amp_steps import *
from features.deployment.step_defs.naming_steps import *
from features.agent.step_defs.agent_steps import *
from features.agent.step_defs.classification_steps import *
from features.agent.step_defs.bootstrap_steps import *
from features.agent.step_defs.backend_steps import *
from features.agent.step_defs.synth_steps import *
from features.agent.step_defs.ml_steps import *
from features.agent.step_defs.ml_e2e_steps import *
from features.agent.step_defs.sage_steps import *
from features.agent.step_defs.shap_steps import *
from features.agent.step_defs.real_data_steps import *
from features.agent.step_defs.belief_path_steps import *
from features.agent.step_defs.synth_framework_steps import *
from features.agent.step_defs.meta_tagging_steps import *
from features.agent.step_defs.experimentation_steps import *
from features.gateway.step_defs.status_steps import *
from features.gateway.step_defs.http_steps import *
from features.gateway.step_defs.endpoint_steps import *
from features.gateway.step_defs.pipeline_steps import *
from features.agent.step_defs.agent_loop_steps import *
from features.agent.step_defs.monte_carlo_steps import *
from features.gateway.step_defs.testclient_steps import *
Two conventions protect against behave’s automatic discovery behavior:
Use step_defs/, not steps/ — Behave walks the feature tree and exec’s any .py file it finds in a directory named steps/. This bypasses Python’s import system, breaking relative imports and module context. Using step_defs/ avoids this entirely.
Never name a features/ subdirectory after a stdlib module — When behave imports features.platform, Python also registers it as platform in sys.modules, shadowing the stdlib. This breaks anything that lazily imports platform (including pydantic). The infra/ domain was originally named platform/ until this caused a cascade of subtle failures.
Infrastructure steps load configuration from HOCON via atelier.config.load_config() rather than hardcoding values. This means BDD scenarios validate the same config path used in production:
from atelier.config import load_config
cfg = load_config()
_wait_for("PostgreSQL", lambda: _check_pg(cfg.db_url))
Tier-1 scenarios share a one-time stack health check in environment.py. Before the first tier-1 scenario runs, the framework verifies PostgreSQL and Qdrant are reachable (with a 60-second retry window). If either service is down, all tier-1 scenarios fail fast with a clear message rather than producing confusing connection errors.
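The retry-window helper could be as simple as this sketch (the name and signature here are assumptions; the actual environment.py implementation may differ):

```python
import time

# Sketch of a retry-window health check: poll until healthy or timed out.
def wait_for(name, probe, timeout=60.0, interval=2.0):
    """Retry probe() until it returns truthy or the window expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return
        time.sleep(interval)
    raise RuntimeError(f"{name} not reachable within {timeout:.0f}s; "
                       "is the devenv stack up?")

wait_for("PostgreSQL", lambda: True)   # healthy probe returns immediately
```

Raising one clear error before any scenario runs is what replaces the cascade of confusing per-scenario connection failures.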
after_scenario in environment.py removes temporary files registered via context._temp_files. This handles config materialization artifacts and other test-created files.
Cloudera AI offers four ways to run code. Each has different constraints on networking, filesystem layout, process lifecycle, and dependency management. Atelier’s BDD scenarios encode these constraints as executable specifications.
Project: Git-backed workspace; the base for all modalities
AMP: one-click provisioning via .project-metadata.yaml (install + start tasks)
Application: long-running service bound to the reverse proxy via CDSW_APP_PORT
Studio: pre-built Docker image running Atelier as an embedded service (IS_COMPOSABLE=true)
Every CAI deployment starts as a Project — a Git-backed workspace cloned into /home/cdsw. The Project modality is implicit: it provides the filesystem layout, environment variables, and Python runtime that all other modalities build on.
No dedicated feature file. Project constraints are tested indirectly through every other deployment scenario.
An AMP is a one-click provisioning workflow defined in .project-metadata.yaml. It runs a sequence of tasks — typically create_job to install dependencies, then start_application to launch the service.
Why BDD captures this well: AMP metadata is YAML that CAI interprets at deploy time. A malformed task definition doesn’t fail until someone clicks “Deploy” in the CAI UI. Our tier-0 scenarios catch structural problems immediately.
Scenario: AMP metadata file is valid
Given the file ".project-metadata.yaml" exists
When I parse the AMP metadata
Then it has a "name" field
And it has a "runtimes" section
And it has a "tasks" section
Task ordering pattern — CAI requires create_job before run_job for the same entity label. Getting this wrong means the install job never runs:
Scenario: AMP tasks follow create_job/run_job pattern
Given the AMP metadata is loaded
Then a "create_job" task with entity_label "install_deps" exists
And a "run_job" task with entity_label "install_deps" exists
And a "start_application" task exists
Install script validity — scripts/install_deps.py runs in a bare Python environment without uv or devenv. A syntax error here means the entire deployment fails:
Scenario: Install script is valid Python
When I compile "scripts/install_deps.py" with py_compile
Then no SyntaxError is raised
Tier-CAI scenarios document what a successful AMP deploy looks like. These are skipped locally but serve as a regression checklist when debugging deployment failures:
@tier-cai
Scenario: AMP install job completes successfully
Given I am in a CAI project session
When I run the install dependencies job
Then the job exits with code 0
And "atelier" is importable in system Python
And "node --version" succeeds
And the directory "ui/dist" exists
An Application is a long-running web service. CAI assigns a port via CDSW_APP_PORT and routes subdomain traffic through a reverse proxy that handles authentication.
The key constraint: When CDSW_APP_PORT is set, the service must bind to 127.0.0.1, not 0.0.0.0. The reverse proxy connects over localhost; binding to all interfaces bypasses CAI’s auth layer.
For local development (no CDSW_APP_PORT), binding to 0.0.0.0 is correct — it lets you access the service from a browser.
Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
Given CDSW_APP_PORT is set to "8090"
When I parse bin/start-app.sh for the HOST variable
Then HOST is "127.0.0.1"
Scenario: start-app.sh binds to 0.0.0.0 for local dev
Given CDSW_APP_PORT is not set
When I parse bin/start-app.sh for the HOST variable
Then HOST is "0.0.0.0"
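The binding contract those scenarios pin down amounts to a single conditional; here it is rendered in Python for clarity (bin/start-app.sh implements the same logic in shell):

```python
# The binding rule in Python form (illustrative; the real script is shell).
def resolve_host(env: dict) -> str:
    # Behind CAI's reverse proxy (CDSW_APP_PORT set), bind loopback only;
    # binding 0.0.0.0 there would bypass the proxy's auth layer.
    return "127.0.0.1" if env.get("CDSW_APP_PORT") else "0.0.0.0"

print(resolve_host({"CDSW_APP_PORT": "8090"}))  # 127.0.0.1
print(resolve_host({}))                         # 0.0.0.0
```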
The tier-1 scenario verifies the full stack actually starts and serves traffic:
@tier-1
Scenario: Full application stack starts locally
When I run bin/start-app.sh in the background
Then the HTTP gateway responds on port 8090 within 30 seconds
And the gRPC server responds on port 50051
A Studio is a pre-built Docker image where IS_COMPOSABLE=true. Instead of being the root application, Atelier runs as an embedded service within a larger container.
The key constraint: When IS_COMPOSABLE is set, the install script must use /home/cdsw/atelier as the root directory (the project subdirectory) instead of /home/cdsw (the container root). Getting this wrong means dependencies install into the wrong location and imports fail at startup.
Scenario: install_deps.py handles IS_COMPOSABLE root path
When I set IS_COMPOSABLE to "true"
And I parse scripts/install_deps.py for root_dir
Then root_dir is "/home/cdsw/atelier"
Scenario: install_deps.py uses default root without IS_COMPOSABLE
When IS_COMPOSABLE is not set
And I parse scripts/install_deps.py for root_dir
Then root_dir is "/home/cdsw"
Studio support is currently speculative — these scenarios document the expected behavior so the contract is established before implementation begins.
The CAI Runtime Profile is a set of tier-0 scenarios that validate deployment readiness without requiring a live CAI session. Run it before every push to catch the class of errors that only manifest when CAI tries to start the application.
CAI deployment failures are expensive to debug. The install job runs in a container with a 30-minute timeout. If it fails, the only feedback is a log dump. If it succeeds but the application crashes at startup, the only feedback is a “Application failed to start” banner with a link to logs that may or may not contain the root cause.
The runtime profile catches failures that would otherwise require a deploy-debug-redeploy cycle:
The most common CAI deployment failure is an import error. A module works in devenv because all dev dependencies are installed, but fails in CAI because the install script only installs production dependencies.
Scenario: Core package is importable
When I import "atelier"
Then no ImportError is raised
And atelier.__version__ is defined
Scenario: All entry points are importable
When I import "atelier.server"
And I import "atelier.gateway"
And I import "atelier.config"
And I import "atelier.db.bootstrap"
Then no ImportError is raised
Scenario: Proto stubs are generated and importable
When I import "atelier.proto.atelier_pb2"
And I import "atelier.proto.atelier_pb2_grpc"
Then no ImportError is raised
These scenarios exercise the full import graph. If atelier.gateway imports fastapi which imports pydantic which imports annotated_types, and annotated_types isn’t in the dependency chain — this catches it.
CAI runs scripts via #!/usr/bin/env python3 or #!/usr/bin/env bash. If the shebang is wrong or the execute bit isn’t set, the deploy fails with a cryptic “Permission denied” error.
Scenario: Required scripts exist and are executable
Then the file "scripts/install_deps.py" exists
And the file "scripts/startup_app.py" exists
And the file "scripts/install_node.sh" is executable
And the file "scripts/install_qdrant.sh" is executable
And the file "bin/start-app.sh" is executable
HOCON configs use ${?VAR} substitution for environment variables. A typo in a variable name or an unresolvable reference won’t fail until load_config() is called at startup. The runtime profile forces resolution at test time:
Scenario: HOCON config resolves without errors
When I load the config with no overrides
Then no exception is raised
And the config has grpc_port > 0
And the config has gateway_port > 0
Atelier uses a dbmate-compatible migration runner (atelier.db.bootstrap) that parses -- migrate:up / -- migrate:down blocks from SQL files. If a migration is missing its UP block, the bootstrap silently skips it — which means the schema diverges from what the code expects.
Scenario: Database migrations are parseable
Given migration files exist in "db/migrations/"
When I parse each migration for UP/DOWN blocks
Then every migration has a valid UP block
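A minimal parser for those blocks might look like this sketch (atelier.db.bootstrap's real parser may differ in detail):

```python
import re

# Hypothetical parser for dbmate-style "-- migrate:up" / "-- migrate:down"
# blocks: lines after a marker accumulate into that block.
def parse_migration(sql: str) -> dict:
    blocks, current = {"up": [], "down": []}, None
    for line in sql.splitlines():
        marker = re.match(r"--\s*migrate:(up|down)\b", line.strip())
        if marker:
            current = marker.group(1)
        elif current:
            blocks[current].append(line)
    return {k: "\n".join(v).strip() for k, v in blocks.items()}

sql = """-- migrate:up
CREATE TABLE t (id TEXT PRIMARY KEY);
-- migrate:down
DROP TABLE t;"""
m = parse_migration(sql)
print(m["up"])    # CREATE TABLE t (id TEXT PRIMARY KEY);
print(m["down"])  # DROP TABLE t;
```

A file with no markers yields an empty UP block, which is exactly the silent-skip condition the scenario above guards against.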