Introduction
Atelier is an agentic classification workbench for Cloudera AI. It classifies column metadata using six independent evidence sources fused via Dempster-Shafer Theory (DST), producing belief intervals instead of point estimates. An LLM-in-the-loop convergence agent identifies disagreements between sources and orchestrates targeted reclassification until the corpus stabilizes.
Why Belief Intervals?
Traditional classifiers output a single probability \( P(A) = 0.85 \), read as “85% email address.” This conflates two fundamentally different situations: high confidence with abundant evidence vs. moderate confidence with sparse evidence. A posterior backed by thousands of observations and an uninformed guess can both yield 0.5, but they represent very different epistemic states.
Dempster-Shafer theory separates these via the belief function \( \text{Bel}(A) \) and plausibility function \( \text{Pl}(A) \), where:
$$ \text{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \text{Pl}(A) = 1 - \text{Bel}(\bar{A}) $$
The interval \( [\text{Bel}(A),\; \text{Pl}(A)] \) bounds the true probability. Its width \( \text{Pl}(A) - \text{Bel}(A) \) quantifies epistemic uncertainty, that is, how much we don’t know:
| Interval | Interpretation |
|---|---|
| \( [0.82,\; 0.87] \) | Strong evidence, low ambiguity: classify with confidence |
| \( [0.30,\; 0.90] \) | Some support for \(A\), but high ignorance: gather more evidence |
| \( [0.45,\; 0.55] \) | Sources disagree: the gap is narrow but belief is low, so the column needs a revisit |
This distinction drives the entire pipeline: columns with wide belief gaps (where \( \text{Pl}(A) - \text{Bel}(A) \) is large) are automatically escalated for LLM re-examination with enriched context. Conflict \( K \) is tracked as a diagnostic but the gap width determines which columns need attention.
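The definitions of \( \text{Bel} \) and \( \text{Pl} \) above can be made concrete with a small sketch. It assumes a mass function stored as a dict keyed by frozensets of category codes; the frame, the codes, and the mass values are illustrative and not taken from Atelier.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def belief(m: Mass, a: FrozenSet[str]) -> float:
    """Bel(A): total mass of non-empty focal sets contained in A."""
    return sum(v for b, v in m.items() if b and b <= a)


def plausibility(m: Mass, a: FrozenSet[str]) -> float:
    """Pl(A): total mass of focal sets that intersect A."""
    return sum(v for b, v in m.items() if b & a)


# Illustrative frame and masses: 0.60 of the mass sits on the whole frame (ignorance).
theta = frozenset({"EMAIL", "PHONE", "URL"})
m: Mass = {
    frozenset({"EMAIL"}): 0.30,
    frozenset({"PHONE"}): 0.10,
    theta: 0.60,
}

email = frozenset({"EMAIL"})
bel, pl = belief(m, email), plausibility(m, email)
gap = pl - bel  # interval is approximately [0.30, 0.90], so the gap is about 0.60
```

This reproduces the second row of the table: some support for the category, but most of the mass is uncommitted, so the interval is wide.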
Architecture
Six Evidence Sources
Each source independently produces a mass function \( m_i : 2^\Theta \to [0, 1] \) over the frame of discernment \( \Theta \) (the set of all category codes). Sources are grouped by computational cost:
| Source | Feature Space | Cost Tier |
|---|---|---|
| Cosine similarity | Dense 384-dim sentence-transformer embedding (all-MiniLM-L6-v2) | M0 (local) |
| Pattern detection | 16 regex detectors + post-regex validators (email, phone, SSN, IP, UUID, date, datetime, URL, credit card + Luhn, MAC, IBAN, postal code, monetary, hash, semver, currency + ISO 4217); graduated mass scaling by match fraction | M0 |
| Name matching | Column name vs vocabulary labels, codes, and aliases (4-tier: exact > code > alias > overlap) | M0 |
| LLM classification | Frontier model reasoning (Anthropic / Bedrock / Cerebras / OpenAI-compatible) | M1 (API) |
| CatBoost | 12 discrete features + 384-dim embedding; virtual ensemble uncertainty via posterior_sampling | M2 (trained) |
| SVM | Sparse TF-IDF: character n-grams (3–6) ∪ word n-grams (1–2); Platt-scaled LinearSVC | M2 (trained) |
The SVM and CatBoost classifiers occupy deliberately orthogonal feature spaces: the SVM operates on sparse lexical features (TF-IDF) while CatBoost uses dense semantic embeddings. This architectural separation ensures genuine evidence independence for Dempster’s rule.
Fusion
Sources are combined via the conjunctive rule of combination:
$$ m_{1 \oplus 2}(C) = \frac{1}{1-K} \sum_{\substack{A \cap B = C \\ A,B \subseteq \Theta}} m_1(A) \cdot m_2(B) $$
where the conflict \( K = \sum_{A \cap B = \varnothing} m_1(A) \cdot m_2(B) \) measures the degree to which sources contradict each other. High \( K \) is a diagnostic signal of source disagreement. Columns where independent evidence sources disagree are escalated for targeted LLM revisit with enriched context (ML prediction, belief interval, source disagreement); the escalation itself is gated by the belief gap, as described under Convergence.
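As a minimal sketch of the combination rule, the function below fuses two mass functions and returns the conflict \( K \) alongside the normalized result. The dict-of-frozensets representation and the example masses are assumptions for illustration, not Atelier's internal data structures.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Dempster's rule: return the fused mass function and the conflict K."""
    fused: Mass = {}
    conflict = 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        c = a & b
        if c:
            fused[c] = fused.get(c, 0.0) + x * y
        else:
            conflict += x * y  # mass that falls on the empty set
    if 1.0 - conflict < 1e-12:
        raise ValueError("total conflict: the sources cannot be combined")
    return {c: v / (1.0 - conflict) for c, v in fused.items()}, conflict


theta = frozenset({"EMAIL", "PHONE", "URL"})
email, phone = frozenset({"EMAIL"}), frozenset({"PHONE"})

m1: Mass = {email: 0.7, theta: 0.3}
m2: Mass = {email: 0.6, phone: 0.2, theta: 0.2}

fused, k = combine(m1, m2)
# Only the EMAIL-vs-PHONE product lands on the empty set, so K = 0.7 * 0.2.
```

For more than two sources the same function can be folded over the list, since the rule is associative and commutative.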
Hierarchical Classification
The vocabulary forms a rooted code tree (e.g.,
ICE.SENSITIVE.PID.CONTACT.EMAIL). Belief and plausibility are queryable at
any depth — \( \text{Bel}(\texttt{ICE.SENSITIVE}) \) aggregates all
descendants. The cautious_code(τ) operator returns the deepest code where
\( \text{Bel} > \tau \), enabling principled depth-accuracy tradeoffs:
high \( \tau \) yields coarse but reliable labels; low \( \tau \) yields
specific but less certain ones.
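One way to picture the depth-accuracy tradeoff is the sketch below. It assumes masses are keyed by dotted codes and that any mass not listed is left on the whole frame; the function bodies are hypothetical stand-ins and not the actual cautious_code implementation. The sibling codes and mass values are invented for the example.

```python
from typing import Dict, Optional


def belief_at(masses: Dict[str, float], code: str) -> float:
    """Sum the mass committed to `code` or to any of its descendants."""
    prefix = code + "."
    return sum(v for c, v in masses.items() if c == code or c.startswith(prefix))


def cautious_code(masses: Dict[str, float], tau: float) -> Optional[str]:
    """Return the deepest code whose belief exceeds tau, or None if none does."""
    candidates = set()
    for code in masses:
        parts = code.split(".")
        # Every prefix of a code is an ancestor node in the tree.
        candidates.update(".".join(parts[:i]) for i in range(1, len(parts) + 1))
    passing = [c for c in candidates if belief_at(masses, c) > tau]
    if not passing:
        return None
    # Prefer depth first, then the higher belief among nodes at the same depth.
    return max(passing, key=lambda c: (c.count("."), belief_at(masses, c)))


# Illustrative masses; the remaining 0.05 is uncommitted (ignorance).
masses = {
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": 0.55,
    "ICE.SENSITIVE.PID.CONTACT.PHONE": 0.30,
    "ICE.SENSITIVE.PID.NAME": 0.10,
}

strict = cautious_code(masses, tau=0.9)   # coarse label: ICE.SENSITIVE.PID
medium = cautious_code(masses, tau=0.8)   # ICE.SENSITIVE.PID.CONTACT
lenient = cautious_code(masses, tau=0.5)  # the EMAIL leaf
```

Raising \( \tau \) moves the answer up the tree: the label gets coarser, but more of the evidence supports it.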
Convergence
The bootstrap pipeline iterates three phases until the belief gap (\( \text{Pl}(A) - \text{Bel}(A) \)) stabilizes:
- LLM sweep — classify each directly-targeted column via batch LLM calls
- ML validation — run the full 6-source DST pipeline; compute per-column belief, plausibility, and gap
- Targeted revisit — re-classify only uncertain columns (high gap or low belief) with enriched context (ML prediction + belief interval + detected patterns + disagreement summary)
The primary convergence measure is mean belief gap — the average width of the \( [\text{Bel}, \text{Pl}] \) interval across all columns. A narrow gap means the evidence sources agree on a confident prediction. Conflict \( K \) is tracked as a diagnostic signal (it indicates source disagreement) but does not gate convergence — a column can have \( K = 0.9 \) but \( \text{Bel} = 0.95 \): the sources fought, but the winner is clear.
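The gap-based criterion can be sketched as follows. The record type, the thresholds (0.20 for the gap, 0.50 for the belief floor), and the column values are assumptions chosen for the example; the source does not state Atelier's actual thresholds for these two tests.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class ColumnBelief:
    name: str
    bel: float
    pl: float
    conflict: float  # K: kept as a diagnostic, never used for gating below

    @property
    def gap(self) -> float:
        return self.pl - self.bel


def mean_gap(columns: List[ColumnBelief]) -> float:
    """Average width of the [Bel, Pl] interval across all columns."""
    return mean(c.gap for c in columns)


def needs_revisit(col: ColumnBelief, gap_threshold: float = 0.20,
                  bel_floor: float = 0.50) -> bool:
    """Flag a column when its gap is wide or its belief is low (hypothetical thresholds)."""
    return col.gap > gap_threshold or col.bel < bel_floor


columns = [
    # High conflict, but the winner is clear: not revisited.
    ColumnBelief("email_addr", bel=0.95, pl=0.97, conflict=0.90),
    # Low conflict, but most mass is uncommitted: revisited.
    ColumnBelief("val_09", bel=0.30, pl=0.90, conflict=0.05),
]

to_revisit = [c.name for c in columns if needs_revisit(c)]
corpus_gap = mean_gap(columns)
```

The first column illustrates why \( K \) does not gate: the sources fought, yet the interval is narrow and the belief is high.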
An agent-driven variant (via Claude Agent SDK) delegates the
revisit strategy to an LLM that reasons about uncertainty patterns
and declares convergence when diminishing returns are reached.
(Earlier revisions exposed a retrain_svm tool that progressively
improved the SVM on accumulated LLM labels — excised on 2026-05-04
for source-independence reasons; see
DST Evidence Independence.)
The programmatic variant uses gap + coverage thresholds for
environments where tool-use isn’t available.
SVM with Vocabulary Alignment
The SVM is trained once on the synthetic corpus with TF-IDF
features and labels keyed on the bundled-ontology ICE.* leaves. At
runtime, predictions are translated into the user’s taxonomy via a
cached LLM-mediated alignment (atelier.classify.ontology_alignment)
so the SVM contributes user-taxonomy evidence even when the operator’s
vocabulary is completely disjoint from ICE.*. The alignment is
weakly non-distinct evidence under Denoeux 2008 — vocabulary-level
shared error with the runtime LLM rather than per-column shared
labels — and the discount calibration carries the residual. See
DST Evidence Independence
for the full design rationale and the BM25-reranker future-work plan.
Scale
The pipeline handles corpora from 50 columns (OOTB sample) to 120M+ columns (full GitTables at 10M+ tables). Monte Carlo stratified sampling selects a representative subset for direct LLM classification and propagates labels to the remaining corpus via embedding similarity.
With max_sampled_columns = 500, classifying a 120M-column corpus requires
LLM inference on only 0.0004% of columns — a >99.99% cost reduction while
preserving classification quality through DST conflict-driven escalation of
uncertain propagations.
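The sampling budget arithmetic, together with a simple proportional stratified sampler, can be sketched like this. The stratification key (a name prefix standing in for a column type) and the corpus are invented for the example; the source does not describe how Atelier defines its strata.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List


def stratified_sample(columns: List[str], stratum_of: Callable[[str], str],
                      budget: int, seed: int = 0) -> List[str]:
    """Draw roughly `budget` columns, allocating quota in proportion to stratum size."""
    rng = random.Random(seed)
    strata: Dict[str, List[str]] = defaultdict(list)
    for col in columns:
        strata[stratum_of(col)].append(col)
    sample: List[str] = []
    for members in strata.values():
        # At least one column per stratum, so rounding can shift the total slightly.
        quota = max(1, round(budget * len(members) / len(columns)))
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical corpus: the prefix plays the role of the stratum.
columns = ([f"str_{i}" for i in range(600)]
           + [f"int_{i}" for i in range(300)]
           + [f"date_{i}" for i in range(100)])
sample = stratified_sample(columns, lambda c: c.split("_")[0], budget=50)

# Share of a 120M-column corpus that reaches the LLM with a 500-column budget.
llm_fraction_pct = 100 * 500 / 120_000_000
```

With these numbers the fraction works out to about 0.0004%, matching the figure quoted above.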
Out-of-the-Box Experience
A fresh deployment auto-seeds on first boot:
- 316-leaf BFO-grounded vocabulary (351 categories total) covering the CCO Information Content Entity trichotomy: Designative (names, IDs, codes), Descriptive (measurements, dates, amounts), Prescriptive (software, specs)
- 25 sample tables with 316 columns and a committed curated reference
- One-click classification via the Status page
- Interactive Embeddings visualization (UMAP/t-SNE via embedding-atlas)
Quick Start
Local development (devenv):
devenv shell # Enter dev environment
just install # Install Python + Node dependencies
just up # Start gRPC + gateway + Vite dev server
CAI deployment: Deploy as an AMP from https://github.com/zndx/atelier.
Documentation Map
- System Overview — Component diagram
- Deployment — CAI AMP and local dev setup
- gRPC & Gateway — Proto contract, REST endpoints, config lifecycle
- Keystone Agents — Agent convergence loop with 5 tools
- Classification Pipeline — DST methodology, evidence sources, bootstrap convergence
- Monte Carlo Sampling — Stratified sampling for scale
- GPU Acceleration — CUDA detection and batch encoding
- Synthetic Data & Training — 316+ generators, ontology-aligned SVM, CatBoost fit-to-LLM
- Embeddings — Interactive parquet visualization
- Data Sources — Source-aware versioning, OOTB sample, Hive auto-discovery
- BDD Scenarios — 141 scenarios across 4 domains
System Overview
Atelier is a multi-service application with a gRPC core, FastAPI HTTP gateway, and React frontend.
Deployment
Cloudera AI (CML)
Atelier deploys as a CAI Application from the Git URL https://github.com/zndx/atelier.
The .project-metadata.yaml defines two tasks:
- Install Dependencies — Installs Python (via uv) and Node.js dependencies, builds the React frontend
- Start Atelier — Launches the gRPC server and HTTP gateway on
CDSW_APP_PORT
Local Development
devenv shell # Enter dev environment (loads .env automatically)
just install # Install Python + Node dependencies
just proto # Generate proto stubs
just resolve-config # Materialize HOCON → build/config/atelier.env
just up # Start gRPC + Vite dev server via devenv processes
gRPC & Gateway
Atelier follows the Fine Tuning Studio proto-first pattern: the gRPC service contract defines the API, and a FastAPI gateway bridges REST to gRPC while serving the React frontend.
Proto Definition
The service contract lives in src/atelier/proto/atelier.proto.
RPCs
| RPC | Request → Response | Purpose |
|---|---|---|
| HealthCheck | HealthCheckRequest → HealthCheckResponse | Prove gRPC is alive (status + version) |
| ListAgents | ListAgentsRequest → ListAgentsResponse | List agent metadata (id, name, role, tools) |
| GetAgent | GetAgentRequest → GetAgentResponse | Single agent by ID |
| ListDataSources | ListDataSourcesRequest → ListDataSourcesResponse | List OOTB + Hive sources |
| ListDatasets | ListDatasetsRequest → ListDatasetsResponse | Classification datasets (filterable by source_id) |
| GetFSMStatus | FSMStatusRequest → FSMStatusResponse | Pipeline state + progress JSON |
| StartClassification | StartClassificationRequest → StartClassificationResponse | Trigger a classification run |
Key Messages
- DataSource: id, source_type (sample/hive), source_uri, display_name, vocabulary_mode
- ClassificationDataset: id, name, parquet_path, source_id, version_number, is_active, summary
- FSMStatusResponse: run_id, state, started_at, progress_json, error
- AgentMetadata: id, name, description, role, tool_ids
Generating Stubs
just proto # runs bin/generate-proto.sh
This invokes grpc_tools.protoc to produce _pb2.py, _pb2_grpc.py,
and .pyi type stubs.
Architecture Layers
Proto (atelier.proto) ← Service contract and message definitions
↓
Servicer (service.py) ← Thin router dispatching to business logic
↓
Client (client.py) ← Wrapper around generated stub with error handling
↓
Gateway (gateway.py) ← FastAPI bridge from REST to gRPC + React SPA
Gateway REST Endpoints
Infrastructure
| Endpoint | Method | Description |
|---|---|---|
| /api/health | GET | gRPC health check |
| /api/status | GET | Aggregated health: gRPC + PostgreSQL + Qdrant + config state |
| /api/agents/validate-credentials | POST | Test all configured LLM providers |
| /api/agents/model-discovery | GET | Check for model upgrades via Anthropic Models API |
Data Sources & Datasets
| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List registered data sources |
| /api/datasets | GET | List datasets (optional source_id filter) |
| /api/datasets/{id}/activate | POST | Set dataset version as active |
| /api/datasets/{id}/data | GET | Serve parquet file |
| /api/data-connections | GET | List CAI data connections |
| /api/data-connections/{name}/test | POST | Test a CAI connection |
| /api/vocabulary/stats | GET | Term count (source-aware routing) |
Classification Pipeline
| Endpoint | Method | Description |
|---|---|---|
| /api/fsm/status | GET | Current pipeline state + progress |
| /api/fsm/start | POST | Start classification (optional source_id) |
| /api/fsm/runs | GET | List past classification runs |
Agents & Skills
| Endpoint | Method | Description |
|---|---|---|
| /api/agents | GET | List agent metadata |
| /api/skills | GET | Skill definitions from .claude/commands/ |
| /api/skills/{skill_id} | GET | Single skill markdown content |
| /api/agents/smoke-test | POST | Minimal Claude Agent SDK verification |
WebSocket
| Endpoint | Purpose |
|---|---|
| /ws/terminal/{session_id} | Persistent terminal backed by Claude Agent SDK |
| /ws/orchestration | Live agent events (spawned, reasoning, tool_call, completed) |
Persistent Terminal Sessions
Terminal sessions survive page navigation and browser reload. The WebSocket
endpoint accepts a client-provided session_id (persisted in localStorage).
On disconnect, the session stays alive server-side — SDK queries continue
running and output accumulates in a ring buffer (64KB collections.deque).
On reconnect, the buffer is replayed so the user sees everything that happened
while they were away.
- Session registry: Module-level _sessions dict in terminal.py
- Idle cleanup: Background asyncio task sweeps sessions with no client for 30 minutes (/api/terminal/sessions lists active sessions)
- Dedicated page: /terminal route renders a full-screen Ghostty WASM terminal; the Landing page embeds the same component at preview size
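The buffer-and-replay behavior described above can be sketched with a bounded deque. This is a minimal illustration that stores individual bytes; whether the real buffer stores bytes or whole chunks is not stated in the source, and the class name is hypothetical.

```python
from collections import deque


class OutputRing:
    """Bounded buffer that keeps only the most recent `capacity` bytes."""

    def __init__(self, capacity: int = 64 * 1024) -> None:
        # deque(maxlen=...) silently discards the oldest items once full.
        self._buf: deque = deque(maxlen=capacity)

    def write(self, chunk: bytes) -> None:
        self._buf.extend(chunk)  # iterating bytes yields ints in 0..255

    def replay(self) -> bytes:
        """Return everything still buffered, oldest first, for a reconnecting client."""
        return bytes(self._buf)


# A tiny capacity makes the eviction visible.
ring = OutputRing(capacity=8)
ring.write(b"hello ")
ring.write(b"world")
tail = ring.replay()
```

With an 8-byte capacity only the last eight bytes of the eleven written remain, which is the same trade a 64KB buffer makes at larger scale: bounded memory in exchange for losing the oldest output.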
SPA Fallback
/{path} serves ui/dist/index.html for client-side routing.
Aggregated Status Endpoint
GET /api/status returns a comprehensive health report:
{
"grpc": {"status": "ok", "latency_ms": 12},
"postgres": {"status": "ok"},
"qdrant": {"status": "ok"},
"config": {
"has_anthropic": true,
"has_bedrock": false,
"agent_model": "claude-sonnet-4-5-20250929",
"db_url": "postgresql://...(masked)"
},
"overall_status": "connected"
}
PostgreSQL probes retry 3x with 1s backoff (PGlite can have transient stalls).
Overall status is connected when gRPC and the dependent services respond, and degraded when gRPC is up
but another service fails its probe.
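The retry policy for the PostgreSQL probe (three attempts, one second apart) can be sketched as below. The function name and the injectable sleep parameter are assumptions for illustration; the example passes a stub for sleep so that running it does not actually wait.

```python
import time
from typing import Callable, List


def probe_with_retry(probe: Callable[[], bool], attempts: int = 3,
                     backoff_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Return True as soon as one probe attempt succeeds, False after the last failure."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat any probe error as a failed attempt
        if attempt < attempts - 1:
            sleep(backoff_s)  # fixed backoff between attempts, none after the last
    return False


# Simulated transient stall: the first two attempts raise, the third succeeds.
calls: List[int] = []


def flaky_probe() -> bool:
    calls.append(1)
    if len(calls) < 3:
        raise ConnectionError("transient stall")
    return True


delays: List[float] = []
ok = probe_with_retry(flaky_probe, sleep=delays.append)
```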
Gateway Lifespan
The FastAPI lifespan hook runs three startup tasks:
- OOTB seed: Check if ootb-sample source has any dataset versions; if none, create version 1 with metadata.
- Hive auto-discovery: discover_hive_sources() probes all configured data connections (ATELIER_DATA_CONNECTIONS), iterates databases, finds annotations tables matching the known schema (legacy or universal format), and auto-registers them via get_or_create_data_source().
- Terminal cleanup: Background asyncio task sweeps idle terminal sessions every 60 seconds.
All three tasks are wrapped in try/except — failures are logged as warnings but don’t prevent gateway startup.
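The "log and continue" behavior can be sketched with plain asyncio, leaving FastAPI out so the example stays self-contained. The task names and stub coroutines are hypothetical, and the real lifespan hook presumably schedules the terminal sweep as a long-running background task rather than awaiting it inline.

```python
import asyncio
import logging
from typing import Awaitable, Callable, List, Tuple

logger = logging.getLogger("gateway.startup")

StartupTask = Tuple[str, Callable[[], Awaitable[None]]]


async def run_startup_tasks(tasks: List[StartupTask]) -> List[str]:
    """Run each task in order; log failures as warnings and keep going."""
    failed: List[str] = []
    for name, task in tasks:
        try:
            await task()
        except Exception as exc:
            logger.warning("startup task %s failed: %s", name, exc)
            failed.append(name)
    return failed


async def seed_ootb() -> None:
    return None  # stand-in for the OOTB seed step


async def discover_hive() -> None:
    raise RuntimeError("no data connections configured")  # simulated failure


failed = asyncio.run(run_startup_tasks([
    ("ootb-seed", seed_ootb),
    ("hive-discovery", discover_hive),
]))
```

Here the second task fails, its name is recorded, and startup still completes.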
Config Lifecycle
HOCON (config/base.conf) is the single source of truth. No module reads
os.environ directly for configuration values.
.env → devenv shell → HOCON ${?VAR} substitution → AtelierConfig dataclass
load_config() reads the HOCON file with live environment variable
substitution. External tools that need a flat key=value file use
just resolve-config to materialize build/config/atelier.env.
Preflight Validation
just preflight runs structured deny/warn checks via
atelier.preflight.run_preflight():
- Deny = blocking (service cannot start). Examples: missing API keys when both Anthropic and Bedrock are unconfigured.
- Warn = advisory (degraded functionality). Examples: GPU detected but CUDA unavailable, Qdrant not reachable.
Preflight is called during gateway startup to surface configuration problems early rather than during the first pipeline run.
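The deny/warn split can be sketched as a list of findings with a severity. The types, the function names, and the config keys below are hypothetical; they only mirror the two examples given above and are not the signature of atelier.preflight.run_preflight().

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Mapping


class Severity(Enum):
    DENY = "deny"  # blocking: the service cannot start
    WARN = "warn"  # advisory: degraded functionality


@dataclass(frozen=True)
class Finding:
    severity: Severity
    message: str


def run_checks(cfg: Mapping[str, bool]) -> List[Finding]:
    """Evaluate a few illustrative checks against a flat config mapping."""
    findings: List[Finding] = []
    if not cfg.get("has_anthropic") and not cfg.get("has_bedrock"):
        findings.append(Finding(Severity.DENY, "no LLM credentials configured"))
    if not cfg.get("qdrant_reachable", True):
        findings.append(Finding(Severity.WARN, "Qdrant not reachable"))
    return findings


def can_start(findings: List[Finding]) -> bool:
    """Only deny-level findings block startup."""
    return not any(f.severity is Severity.DENY for f in findings)


broken = run_checks({"has_anthropic": False, "has_bedrock": False,
                     "qdrant_reachable": False})
healthy = run_checks({"has_anthropic": True})
```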
Keystone Agents
Atelier uses the Claude Agent SDK to drive classification convergence. Rather than a fixed programmatic loop, an LLM agent reasons about which columns to revisit based on DST conflict metrics, evidence breakdowns, and convergence trends.
Agent Convergence Loop
The agent loop (src/atelier/classify/agent_loop.py) wraps the bootstrap
pipeline functions as five Claude tools. Claude receives an initial state
summary and iteratively calls tools until it determines the classification
has converged.
Flow
1. Initial state → agent sees mean gap, mean belief, coverage, K (diagnostic)
2. Agent calls get_conflict_report → identifies uncertain columns (high gap or low belief)
3. Agent calls get_column_detail → inspects per-source evidence breakdown
4. Agent calls revisit_columns → re-classifies with enriched context
5. Agent calls check_convergence → verifies gap trend + belief floor
6. Repeat 2-5 until satisfied
7. Agent calls declare_converged with reason
The conversation loop runs up to classify_agent_max_turns (default 10)
Messages API round-trips. Each tool call returns structured JSON that the
agent uses to plan its next action.
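The shape of the turn-limited loop can be sketched without the SDK. In this illustration a scripted callable plays the role of the model, and the tools are stubs that return invented JSON-serializable dicts; none of this is the code in agent_loop.py.

```python
import json
from typing import Callable, Dict, Iterator, List, Optional, Tuple

ToolCall = Tuple[str, dict]


def run_agent_loop(choose_tool: Callable[[List[str]], ToolCall],
                   tools: Dict[str, Callable[..., dict]],
                   max_turns: int = 10) -> Tuple[int, Optional[str]]:
    """Dispatch tool calls until declare_converged is called or turns run out."""
    transcript: List[str] = []
    for turn in range(1, max_turns + 1):
        name, args = choose_tool(transcript)
        result = tools[name](**args)
        # Each result goes back to the agent as structured JSON.
        transcript.append(json.dumps({"tool": name, "result": result}))
        if name == "declare_converged":
            return turn, args["reason"]
    return max_turns, None


# Stub tools with invented return values.
tools: Dict[str, Callable[..., dict]] = {
    "get_conflict_report": lambda k_threshold: {"flagged": ["val_09"]},
    "get_column_detail": lambda column_name: {"column": column_name, "gap": 0.6},
    "revisit_columns": lambda column_names: {"updated": column_names},
    "check_convergence": lambda: {"mean_gap": 0.04},
    "declare_converged": lambda reason: {"confirmed": True},
}

# A fixed script stands in for the model's decisions.
script: Iterator[ToolCall] = iter([
    ("get_conflict_report", {"k_threshold": 0.2}),
    ("get_column_detail", {"column_name": "val_09"}),
    ("revisit_columns", {"column_names": ["val_09"]}),
    ("check_convergence", {}),
    ("declare_converged", {"reason": "mean gap stable"}),
])

turns, reason = run_agent_loop(lambda transcript: next(script), tools)
```

If the agent never declares convergence, the loop stops at max_turns and returns no reason, which is how a turn cap bounds cost.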
Five Tools
| Tool | Input | Returns | Purpose |
|---|---|---|---|
| get_conflict_report | k_threshold (float) | Flagged columns with K, belief, plausibility, gap, settled flag | Identify uncertain or conflicting columns |
| revisit_columns | column_names (list) | Updated labels + new belief intervals | Re-classify with enriched LLM context (ML prediction + belief interval) |
| check_convergence | none | mean_gap, mean_bel, frac_unclear, coverage, K (diagnostic), iteration history | Assess convergence via belief-gap criteria |
| get_column_detail | column_name (string) | Per-source evidence breakdown, sample values, belief interval | Deep-dive into a specific column |
| declare_converged | reason (string) | Confirmation | Exit loop with stated rationale |
Historical note (2026-05-04 refactor). Earlier revisions of the agent loop included a sixth retrain_svm tool that retrained the SVM on accumulated LLM labels and hot-swapped the result. That tool was removed alongside the M9 in-loop SVM-on-LLM-labels retrain machinery (commits 8627c2c, 5199379, cc59d01) for the source-independence reasons documented in ontology_alignment.py. The SVM is now trained once on synth and translated into the user vocabulary at inference time; there is no per-run SVM retraining for the agent to drive.
Agent System Prompt
The system prompt guides the agent’s strategy:
- Examine the conflict report to understand where sources disagree
- Inspect individual columns for uncertain cases (high gap or low belief)
- Revisit uncertain columns to resolve ambiguity
- Check convergence metrics (mean gap, mean belief, coverage) to decide whether to continue — K is available as a diagnostic but does not gate
- Declare convergence when satisfied (or when diminishing returns)
State Tracking
The agent loop tracks:
- state.agent_reasoning: text blocks from each agent turn
- state.agent_converged_reason: the reason given at convergence
- state.agent_turns: number of conversation turns
- state.tokens_input / state.tokens_output: token consumption
Each revisit_columns call increments state.iteration and triggers
full ML revalidation on all columns, not just the revisited ones. This
ensures that improved LLM labels propagate through the DST fusion.
LLM Backend Matrix
The agent loop and LLM sweep share the same backend infrastructure. No global provider switch — credentials determine what’s available.
| Backend | Class | Config | Use Case |
|---|---|---|---|
| Anthropic | AnthropicBackend | ANTHROPIC_API_KEY | Agent loop + LLM sweep |
| Bedrock | BedrockBackend | AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY + AWS_REGION | Production default on CAI |
| Cerebras | CerebrasBackend | CEREBRAS_API_KEY | Fast inference via GLM-4.7 |
| OpenAI-compatible | OpenAICompatibleBackend | ATELIER_LLM_BASE_URL + ATELIER_LLM_MODEL | vLLM, any compatible endpoint |
The agent client is built via _build_client(cfg) which prefers Anthropic
when ANTHROPIC_API_KEY is set, falling back to Bedrock when AWS credentials
are available. The agent model resolves as:
classify_agent_model → agent_model → "claude-sonnet-4-5-20250929".
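The resolution chain amounts to a first-non-empty lookup. The function below is a hypothetical sketch of that precedence, not the code behind _build_client(cfg).

```python
from typing import Optional

DEFAULT_AGENT_MODEL = "claude-sonnet-4-5-20250929"


def resolve_agent_model(classify_agent_model: Optional[str],
                        agent_model: Optional[str]) -> str:
    """Pick the most specific configured model, falling back to the default."""
    return classify_agent_model or agent_model or DEFAULT_AGENT_MODEL


nothing_set = resolve_agent_model(None, None)
global_only = resolve_agent_model(None, "model-a")
override = resolve_agent_model("model-b", "model-a")
```

The model names "model-a" and "model-b" are placeholders, not real model identifiers.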
Configuration
All agent and bootstrap settings live in HOCON (config/base.conf):
classify {
llm {
backend = "openai_compatible"
model = "glm-4.7"
base_url = null
columns_per_call = 50
discount = 0.10
}
bootstrap {
max_iterations = 5
k_threshold = 0.2
coverage_target = 0.95
max_total_llm_calls = 5000
# Historical: these knobs gated the excised M9 in-loop SVM
# retrain. Retained here only as illustration of the legacy
# config surface; the keys are no longer read by the pipeline.
# incremental_svm_retrain = true
# incremental_svm_min_labels = 20
}
}
agent {
model = "claude-sonnet-4-5-20250929"
model = ${?ATELIER_AGENT_MODEL}
}
classify {
agent_model = null
agent_model = ${?ATELIER_CLASSIFY_AGENT_MODEL}
agent_max_turns = 10
}
When classify.agent_model is set, it overrides agent.model for the
classification convergence loop specifically.
Agent vs Programmatic Loop
The bootstrap pipeline (bootstrap.py) contains the programmatic
convergence loop as well: sweep → validate → revisit uncertain → repeat.
The agent loop is an alternative that delegates the revisit strategy to
Claude. Both paths share the same underlying functions (_llm_sweep,
_run_ml_validation, etc.) and produce identical DST evidence.
The agent approach is preferred when:
- The corpus has wide-belief-gap columns where independent evidence sources disagree in non-obvious ways
- You want reasoning traces explaining why convergence was declared
- The LLM backend supports tool_use (Anthropic, Bedrock with Claude)
The programmatic approach is used when:
- The LLM backend doesn’t support tool_use (vLLM, Cerebras)
- Deterministic behavior is required
- Cost must be minimized (fewer API calls)
WebSocket Orchestration
The gateway exposes /ws/orchestration for live agent event streaming.
Events include agent_spawned, agent_reasoning, agent_tool_call,
and agent_completed. The React frontend’s Agent Canvas page consumes
these events to render the agent’s decision process in real time.
Classification Pipeline
Atelier’s core objective: agent-mediated metadata classification using Dempster-Shafer Theory (DST) to produce belief intervals instead of flat confidence scores, exposing epistemic uncertainty and source disagreement.
Terminology — reference-label provenance
Four distinct sources of per-column labels show up in our writeups. Conflating them is a load-bearing error, so we name each explicitly:
| Term | Source | Authority level | Where it appears |
|---|---|---|---|
| Published benchmark | External, human-curated labels (SOTAB, GitTables) | Gold standard — memorization-safe check | SOTAB pilot artifacts; docs/notes/2026-04-19/…phase_gate_2.md |
| Curated reference | Generator-derived (synth pairs an answer-key “reference column” per target) + spot-checked by hand | Definitive for the synthetic corpus; not equivalent to a published benchmark | build/meta-tagging-clean/curated_reference.csv |
| LLM commitment | A single LLM’s pass-1 or pass-2 output | Classifier opinion; not a truth | parquet llm_code, predicted_code |
| CatBoost prior | CatBoost fit to LLM labels, used for revisit enrichment | Not independent evidence — it is a compressed self-consensus of the LLM; valuable specifically for rescuing abstentions | parquet predicted_code via DST fusion |
An ablation (as used in our writeups) is a controlled experiment that holds most of the pipeline fixed and varies exactly one component at a time, so changes in accuracy can be attributed to that component rather than to the combination.
Methodology
Why Dempster-Shafer?
Traditional classifiers output a single confidence score (e.g., “85% email address”). This hides two distinct types of uncertainty:
- Aleatoric uncertainty: inherent randomness in the data
- Epistemic uncertainty: ignorance due to insufficient evidence
DST separates these via belief intervals [Bel(A), Pl(A)]:
- Bel(A) = committed evidence supporting A (lower bound)
- Pl(A) = evidence that cannot rule out A (upper bound)
- Pl(A) - Bel(A) = unresolved ambiguity
When Bel(A) = 0.8 and Pl(A) = 0.85, we have high confidence with low
ambiguity. When Bel(A) = 0.3 and Pl(A) = 0.9, we know something
supports A but much remains uncertain — a signal to gather more evidence.
Evidence Sources
Each source independently produces a mass function (Basic Probability Assignment) that distributes belief across the frame of discernment:
| Source | Type | Discount | Configurable | Status |
|---|---|---|---|---|
| Cosine similarity | Sentence-transformer (all-MiniLM-L6-v2) | 0.30 | classify.discounts.cosine | M0 |
| Pattern detection | 16 regex detectors + post-regex validators | 0.25 | classify.discounts.pattern_theta | M0 |
| Name matching | Column name ↔ label/abbrev/common_names | varies | classify.discounts.name_match_* | M0 |
| LLM | OpenAI-compatible / Anthropic / Bedrock / Cerebras | 0.10 | classify.llm.discount | M1 |
| CatBoost | Gradient boosted trees (virtual ensembles) | adaptive | classify.discounts.catboost_* | M2 |
| SVM | Dual TF-IDF (char+word n-grams) + LinearSVC (Platt scaling) | 0.20 | classify.discounts.svm | M2 |
The discount controls how much mass goes to Θ (total ignorance). Higher discount = more conservative = wider belief intervals.
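A sketch of classical Shafer discounting shows how a discount rate moves mass to Θ. It assumes the table's discount values act as the rate \( \alpha \) in \( m'(A) = (1-\alpha)\,m(A) \) and \( m'(\Theta) = \alpha + (1-\alpha)\,m(\Theta) \); the source does not spell out the exact formula Atelier uses, and the frame and codes are illustrative.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def discount(m: Mass, theta: FrozenSet[str], alpha: float) -> Mass:
    """Shafer discounting: scale committed mass by (1 - alpha), move the rest to theta."""
    out = {a: (1.0 - alpha) * v for a, v in m.items() if a != theta}
    out[theta] = alpha + (1.0 - alpha) * m.get(theta, 0.0)
    return out


theta = frozenset({"EMAIL", "PHONE", "URL"})
email = frozenset({"EMAIL"})

# A source that is fully committed to EMAIL before discounting.
raw: Mass = {email: 1.0}

llm_like = discount(raw, theta, alpha=0.10)      # keeps 0.90 on EMAIL
pattern_like = discount(raw, theta, alpha=0.25)  # keeps 0.75 on EMAIL
```

Under this assumption the same raw opinion yields the interval [0.90, 1.0] at a 0.10 discount and [0.75, 1.0] at 0.25, which is the sense in which a higher discount is more conservative.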
Pattern mass is graduated: detect_patterns() returns a match fraction
(0.0-1.0) per pattern, and pattern_to_mass() scales evidence mass by the
average match fraction. A 95% match produces ~3x more mass than a 35% match,
eliminating the binary cliff at the 1/3 detection threshold.
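The graduated scaling can be illustrated with a linear rule. The function is hypothetical and assumes committed mass equals the non-Θ share multiplied by the match fraction; the actual curve inside pattern_to_mass() may differ.

```python
def scaled_pattern_mass(match_fraction: float, theta_mass: float = 0.25) -> float:
    """Committed mass under a linear assumption: (1 - theta_mass) * match_fraction."""
    if not 0.0 <= match_fraction <= 1.0:
        raise ValueError("match_fraction must lie in [0, 1]")
    return (1.0 - theta_mass) * match_fraction


strong = scaled_pattern_mass(0.95)  # 95% of sampled values match the pattern
weak = scaled_pattern_mass(0.35)    # just above the 1/3 detection threshold
ratio = strong / weak
```

Under this linear assumption the ratio is 0.95 / 0.35, a little under the roughly 3x quoted above, and the mass changes smoothly with the match fraction instead of jumping at the threshold.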
Pattern theta (0.25) is deliberately higher than LLM theta (0.10), so the LLM cleanly dominates when pattern and LLM evidence conflict — the LLM considers full context (name, type, values, siblings), while patterns operate on value structure alone.
Evidence Independence
Dempster’s rule of combination requires cognitively independent evidence sources (Shafer 1976) — each mass function must reflect information not derived from the other sources being combined. Atelier achieves this through architectural separation of feature spaces and training signals:
| Source | Feature Space | Training Signal | Independence Basis |
|---|---|---|---|
| Name match | String/lexical | None (deterministic) | Symbolic matching only |
| Pattern | Regex | None (deterministic) | Hand-crafted rules only |
| Cosine | Dense embedding (384-dim) | Pre-trained sentence-transformer | Learned semantic similarity |
| LLM | Semantic (frontier or subagent model) | Pre-trained weights | In-context classification |
| CatBoost | Dense embedding + 12 features | Synthetic data generators | Gradient-boosted ensemble |
| SVM | Sparse TF-IDF (char 3-6 + word 1-2 n-grams) | Synthetic data generators | Lexical surface patterns |
The SVM is Atelier’s domain-adaptation channel. Cosine and the
frontier LLM both rely on pretrained models, which read a column well
only when its name and values carry meaning that a web-text-trained
model can grip (email_address, transaction_amount, ISO dates). Many columns in
deployed enterprise data are not like that: opaque names (val_09,
col_73, ref_addr), opaque values (hex digests, internal serial
codes, prefix-stripped tokens), or both. Pretrained models have nothing
to grip on for those — the signal lives only in domain-specific shape
(format, length, character-class distribution, prefix vocabulary) that
must be learned from data shaped like the deployed distribution. The
SVM is trained on synthetic corpora produced by procedural generators
in src/atelier/classify/synth_generators.py, so it learns precisely
those patterns. The SVM and cosine therefore draw on disjoint
signal populations (semantic-bearing columns versus inscrutable
ones), which makes their evidence structurally, not merely
statistically, independent under DST.
A subtler point worth naming: the historical “confusable pair” framing attributed to the data what often lived in the featurizer. Char-n-gram TF-IDF treating Brazilian CPF identifiers as date-shaped, or sub-word tokenization splitting similar-looking strings into overlapping tokens, are tokenization artifacts — properties of the model, not the data. Domain-adapted training on synthetic-corpus examples that match the deployed distribution sees past those artifacts; the SVM is not “resolving confusables” but reading columns that pretrained models fundamentally cannot.
Architecturally this also provides the most important independence
guarantee in the DST stack. While cosine similarity and CatBoost both
operate on the same dense sentence-transformer embedding (384 dimensions
from all-MiniLM-L6-v2), the SVM operates on a fully orthogonal feature
representation: sparse TF-IDF character and word n-grams extracted by
sklearn.pipeline.Pipeline + FeatureUnion. The SVM captures lexical
surface patterns (abbreviations, digit sequences, camelCase fragments)
that the dense embedding collapses — providing genuine corrective
signal in DST fusion.
SVM Architecture (adopted from Signals)
The SVM classifier follows the Pipeline + FeatureUnion composition pattern
from the Signals project — the version of
record presented as an independent fifth DST evidence source:
Column metadata text ("email_addr | user@example.com")
│
▼
FeatureUnion
├── TfidfVectorizer(analyzer="char_wb", ngram_range=(3,6))
│ → captures subword patterns, abbreviations, digit sequences
└── TfidfVectorizer(analyzer="word", ngram_range=(1,2))
→ captures multi-word patterns ("email address", "zip code")
│
▼
Sparse feature matrix (up to 100K dimensions)
│
▼
CalibratedClassifierCV(LinearSVC, method="sigmoid")
│
▼
Calibrated probability distribution {code: probability}
Key implementation details:
- Singleton class filtering: fit() drops categories with < 2 training examples before CalibratedClassifierCV, since StratifiedKFold requires every class to have >= 2 samples. With 316 categories and few tables, some categories inevitably have only one example. Dropped categories are logged and still receive predictions from the other 5 DST evidence sources.
- _min_class_count(): returns the actual minimum (no longer clamped to 2)
- feature_importances(top_n): navigates CalibratedClassifierCV → LinearSVC to extract coef_, averages absolute coefficients across classes, cross-references with FeatureUnion.get_feature_names_out() for named feature importance
- is_fitted property for safe state checking before prediction
SVM Training (synth-only) and Vocabulary Alignment
The SVM is trained once on the synthetic corpus (see
synth.md) using TF-IDF char-3-6gram + word-1-2gram
features and labels keyed on bundled-ontology ICE.* leaves from
synth_generators.GENERATORS. At pipeline runtime, the ICE.*
predictions are translated into the user’s taxonomy via the cached
subsumption-prediction alignment in
atelier.classify.subsumption_alignment — sentence-transformer
cosine similarity between ICE concept signatures and enriched
annotation payloads from the Qdrant taxonomy collection. The legacy
LLM-mediated alignment was retired in the P7 intervention (see
DST Evidence Independence).
The alignment targets every user node — leaves AND internal nodes (per the dynamic-annotations principle that every node is a first-class tagging target). An ICE leaf may legitimately align to a user internal node when the user’s vocabulary covers a concept family without a leaf-specific equivalent. Restricting alignment to user leaves only would silently reject the parent-family fallback that is the architecturally-correct behavior.
The translation step is what restored the SVM as useful evidence for
non-OOTB user vocabularies — pre-alignment, the SVM emitted ICE
codes that didn’t appear in the user-taxonomy frame and silently
contributed nothing. See subsumption_alignment.py module docstring
for the full independence argument.
Historical note (2026-05-04 refactor). Earlier revisions of this design ran a mid-loop train_svm_on_frontier_labels (historical function name) that retrained the SVM on live LLM labels and hot-swapped the result into the active model slot, labelled “M9 incremental SVM retraining” in commit history. That path was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for source-independence reasons: the per-column LLM label copying made the SVM strongly non-distinct with the LLM source under Denoeux 2008. The subsequent LLM-mediated alignment introduced a vocabulary-level shared error mode (the alignment-time LLM and the runtime LLM share weights), which the P7 subsumption-prediction intervention eliminates: runtime alignment now uses sentence-transformer embeddings rather than the runtime LLM. The SVM’s TF-IDF independence at the feature and label level is preserved; the remaining weak non-distinctness is the shared enrichment-LLM upstream (offline-generated annotations), structurally identical to the late-interaction cosine source’s coupling.
Implementation
- train_svm() in ml_train.py: synth-only training, persists to build/models/svm.pkl (label space: ICE.* leaves)
- ontology_alignment.build_alignment(): once-per-(vocab, embedding_model) ICE → user-code mapping via subsumption prediction (sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from Qdrant); cached at build/cache/alignment/<sha256>.json
- Discount: classify.discounts.svm = 0.22 (was 0.30 under LLM-mediated alignment, 0.55 in M9 era) reflects the enrichment-mediated subsumption-prediction regime: weakly non-distinct via shared enrichment-LLM upstream only.
Dempster’s Rule of Combination
Sources are fused via the conjunctive combination rule:
m₁₂(C) = Σ{m₁(A)·m₂(B) : A∩B=C} / (1 - K)
where K = Σ{m₁(A)·m₂(B) : A∩B=∅} is the conflict between sources.
High K means the sources disagree — a valuable diagnostic signal. Note that K is not the convergence criterion — see Belief-Gap Convergence below.
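The rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code in belief.py: the function name dempster_combine_sketch, the dict-of-frozensets representation, and the three-category frame are assumptions made for the example.

```python
from itertools import product

Mass = dict[frozenset[str], float]


def dempster_combine_sketch(m1: Mass, m2: Mass) -> tuple[Mass, float]:
    """Return (combined mass function, conflict K) for two sources."""
    joint: Mass = {}
    conflict = 0.0
    for (a, w_a), (b, w_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            joint[intersection] = joint.get(intersection, 0.0) + w_a * w_b
        else:
            conflict += w_a * w_b  # product mass that lands on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    # Normalize the surviving mass by (1 - K).
    combined = {focal: w / (1.0 - conflict) for focal, w in joint.items()}
    return combined, conflict


# Hypothetical three-category frame and two illustrative sources.
THETA = frozenset({"email", "phone", "ssn"})
m_pattern = {frozenset({"email"}): 0.6, THETA: 0.4}
m_cosine = {frozenset({"email"}): 0.5, frozenset({"phone"}): 0.3, THETA: 0.2}

combined, k = dempster_combine_sketch(m_pattern, m_cosine)
print(f"K = {k:.2f}")
for focal, w in sorted(combined.items(), key=lambda item: -item[1]):
    print(sorted(focal), round(w, 3))
```

In this example the only empty intersection is email against phone, so K = 0.6 · 0.3 = 0.18, and the remaining mass is rescaled by 1/0.82.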
Compound Focal Elements (Uncertainty Representation)
When DST evidence splits closely between two singleton categories,
collapsing to a single top-1 prediction misrepresents what the evidence
actually says. DST’s native vocabulary for this is the compound focal
element: a portion of the runner-up’s mass transfers to a focal
element representing the union of the two singletons, honestly
reflecting that the evidence supports the disjunction but does not
discriminate between members. This is the same DST math that supports
queries at any node in the hierarchy via belief_at() — the compound
mass propagates up to the common ancestor, so belief at any level
reflects the combined evidence.
The mechanism is unconditional DST: any two singletons whose masses split closely qualify in principle. In practice the implementation maintains a short registry of category pairs where the transfer is routinely activated — examples below, filtered to vocabulary at runtime. These are illustrations of cases where the mechanism activates, not a definitional list of categories the classifier is expected to “confuse”.
| Example pair | Why mass-splitting is common |
|---|---|
| Record Identifier ↔ Device Identifier | Both are opaque identifiers; context determines which |
| Timestamp ↔ Date of Birth | Both are temporal; DOB is a specific semantic subtype |
| Transaction Amount ↔ Bank Account Number | Both are financial numbers |
| IP Address ↔ Device Identifier | IP addresses can identify devices |
Mechanics: when the top-2 singleton masses match a registered pair
and their ratio is below confusable_ratio_threshold (default 3.0),
half of the runner-up’s mass transfers to the compound focal element.
Belief at the common ancestor then reflects the combined evidence via
belief_at() propagation. (The config knob retains its historical
name for backward compatibility; the mechanism itself is honest
uncertainty representation, not pair-discrimination.)
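A minimal sketch of the transfer described above, assuming the ratio is computed as top mass divided by runner-up mass and that the registry is a set of unordered pairs. The function name, the registry contents, and the category codes are hypothetical.

```python
Mass = dict[frozenset[str], float]

# Hypothetical registry entry; the real registry is filtered to the vocabulary at runtime.
REGISTERED_PAIRS = {frozenset({"timestamp", "date_of_birth"})}


def transfer_to_compound(mass: Mass, ratio_threshold: float = 3.0) -> Mass:
    """Move half of the runner-up's mass onto the {top, runner-up} focal element."""
    singletons = sorted(
        ((focal, w) for focal, w in mass.items() if len(focal) == 1),
        key=lambda item: item[1],
        reverse=True,
    )
    if len(singletons) < 2:
        return dict(mass)
    (top, w_top), (runner, w_runner) = singletons[:2]
    pair = top | runner
    # Only registered pairs whose masses split closely qualify.
    if pair not in REGISTERED_PAIRS or w_runner <= 0.0:
        return dict(mass)
    if w_top / w_runner >= ratio_threshold:
        return dict(mass)
    moved = 0.5 * w_runner
    out = dict(mass)
    out[runner] = w_runner - moved
    out[pair] = out.get(pair, 0.0) + moved
    return out


theta = frozenset({"timestamp", "date_of_birth", "email"})
before = {
    frozenset({"timestamp"}): 0.45,
    frozenset({"date_of_birth"}): 0.30,
    theta: 0.25,
}
after = transfer_to_compound(before)
```

Total mass is unchanged by the transfer; only its allocation between the runner-up singleton and the two-element focal set moves.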
Pattern Validation
Pattern detection uses a two-stage architecture: 16 regex patterns for
recall, plus a _VALIDATORS registry for precision. A value must
pass both the regex AND the validator (if one exists) to count.
| Validator | Pattern | Checks |
|---|---|---|
| _luhn_check | credit_card_pattern | Luhn checksum (ISO/IEC 7812) |
| _is_valid_ipv4 | ipv4_pattern | All 4 octets in 0-255 range |
| _is_plausible_date | date_iso_pattern, datetime_iso_pattern | Month 01-12, day 01-31 |
| _is_iso_currency | iso_currency_pattern | ISO 4217 whitelist (~40 codes) |
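As an illustration of the two-stage check, the sketch below pairs a permissive regex with a stricter validator for two of the patterns in the table. The regexes and the helper names (luhn_check, is_valid_ipv4, matches) are assumptions for the example, not the expressions used in features.py.

```python
import re


def luhn_check(value: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, test mod 10."""
    digits = [int(ch) for ch in value if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_valid_ipv4(value: str) -> bool:
    """All four octets must be decimal numbers in the 0-255 range."""
    parts = value.split(".")
    return len(parts) == 4 and all(
        p.isascii() and p.isdigit() and int(p) <= 255 for p in parts
    )


# Stage 1: recall-oriented regexes (illustrative only).
PATTERNS = {
    "credit_card_pattern": re.compile(r"^\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}$"),
    "ipv4_pattern": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
}
# Stage 2: precision-oriented validators, keyed by pattern name.
VALIDATORS = {"credit_card_pattern": luhn_check, "ipv4_pattern": is_valid_ipv4}


def matches(pattern_name: str, value: str) -> bool:
    """A value counts only if it passes the regex AND the validator (if any)."""
    if not PATTERNS[pattern_name].match(value):
        return False
    validator = VALIDATORS.get(pattern_name)
    return validator(value) if validator else True
```

A value such as 999.1.1.1 passes the IPv4 regex but fails the octet check, which is the kind of false positive the second stage exists to remove.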
The phone_pattern uses a suppression mechanism: when a more specific
digit-heavy pattern also fires (SSN, date, credit card, IP, postal code,
monetary, IBAN), the phone match is suppressed. This prevents the phone
regex from injecting false evidence on columns whose values happen to
contain formatted digits.
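The suppression rule can be sketched as a set operation over the names of the patterns that fired. Only phone_pattern, credit_card_pattern, ipv4_pattern, and date_iso_pattern are names taken from this page; the remaining names and the function itself are hypothetical placeholders.

```python
# Hypothetical names stand in for the SSN, postal code, monetary and IBAN detectors.
PHONE_SUPPRESSORS = frozenset({
    "ssn_pattern",
    "date_iso_pattern",
    "credit_card_pattern",
    "ipv4_pattern",
    "postal_code_pattern",
    "monetary_pattern",
    "iban_pattern",
})


def suppress_phone(fired: set[str]) -> set[str]:
    """Drop the phone match when a more specific digit-heavy pattern also fired."""
    if "phone_pattern" in fired and fired & PHONE_SUPPRESSORS:
        return fired - {"phone_pattern"}
    return set(fired)
```

A phone match that fires on its own is kept; it is removed only when one of the more specific patterns explains the same digits.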
12 Discrete Features
Each column produces 12 SAGE-ablatable features:
- column_name: humanized column name
- column_type: SQL type (suppresses uninformative STRING/VARCHAR)
- sample_values: first 5 non-null values as text
- cardinality: distinct value count
- null_ratio: fraction of NULL values
- value_entropy: Shannon entropy of value lengths
- pattern_signals: matched regex patterns
- avg_value_length: mean string length
- numeric_ratio: fraction parseable as numbers
- sibling_context: other column names in the same table
- source_table: table name
- value_description: auto-generated natural language description
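Three of the numeric features are simple enough to sketch directly. The version below assumes base-2 entropy, that entropy and numeric ratio are computed over non-null values only, and that "parseable as numbers" means Python's float() accepts the string; none of these choices is confirmed by this page.

```python
import math
from collections import Counter
from typing import Optional, Sequence


def null_ratio(values: Sequence[Optional[str]]) -> float:
    """Fraction of values that are NULL (None)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def value_entropy(values: Sequence[Optional[str]]) -> float:
    """Shannon entropy (bits) of the distribution of value lengths."""
    lengths = Counter(len(v) for v in values if v is not None)
    total = sum(lengths.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in lengths.values())


def numeric_ratio(values: Sequence[Optional[str]]) -> float:
    """Fraction of non-null values that parse as a number."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0

    def parses(text: str) -> bool:
        try:
            float(text)
        except ValueError:
            return False
        return True

    return sum(parses(v) for v in non_null) / len(non_null)


sample = ["42", "3.14", "n/a", None]
features = {
    "null_ratio": null_ratio(sample),
    "value_entropy": value_entropy(sample),
    "numeric_ratio": numeric_ratio(sample),
}
```

Fixed-width identifiers give a length entropy near zero, while free-text columns score higher, which is why the length distribution is informative even without looking at the characters themselves.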
Architecture
AgentFSM
The classification pipeline runs as a background Finite State Machine:
ML-only path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE
Bootstrap path (programmatic):
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
▲ │
└─── (disagreements) ─┘
(converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE
Agent-driven path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING
▲ │
└── Agent convergence loop (6 tools)
Claude reasons about which columns to revisit
(converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE
MC sampling (when corpus > 200 columns):
SAMPLING includes pre-classify → stratify → select MC sample
LLM_SWEEP classifies the sampled subset only → propagate labels to remainder
State transitions are persisted to PostgreSQL. The Status page polls
/api/fsm/status for live progress updates.
Module Structure
src/atelier/classify/
├── __init__.py # Public API: run_pipeline(), run_bootstrap(), get_fsm_status()
├── belief.py # DST core: BeliefAssignment, FocalElement, dempster_combine()
├── mass_functions.py # Evidence→mass converters (6 active)
├── features.py # 12 features + 16 pattern detectors + 5 post-regex validators
├── taxonomy.py # ReferenceCategory, HierarchicalCategorySet
├── embedding.py # Sentence-transformer cosine classifier
├── llm_backend.py # LLM backend factory (Anthropic, OpenAI-compat, Bedrock tool-use, Cerebras)
├── bootstrap.py # Bootstrap convergence loop (LLM sweep + ML validation)
├── agent_loop.py # Agent-driven convergence (6 Claude tools)
├── monte_carlo.py # MC stratified sampling for scale (pre-classify, stratify, select, propagate)
├── gpu.py # GPU detection + NVIDIA driver symlink (nix+CUDA)
├── sampler.py # Hive metadata sampling + fixture data loading
├── synth.py # Synthetic data generation
├── synth_generators.py # 316+ hand-coded value generators (shared module)
├── synth_registry.py # Three-layer generator registry (hand-coded > template > inferred)
├── meta_tagging_overlay.py # 130+ META_TO_ICE mappings for meta-tagging alignment
├── svm_classifier.py # Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals)
├── catboost_classifier.py # CatBoost with virtual ensemble uncertainty
├── ml_train.py # Training orchestrator (synth → models)
├── ml_inference.py # Lazy-loading inference wrappers
├── evaluation.py # Structured evaluation (per-category P/R/F1, confusion matrix)
├── train_eval_cycle.py # Synth → train → classify → evaluate orchestrator
├── mock_llm.py # Realistic mock LLM (seeded uncertainty + mass-splitting between close categories)
├── sage.py # SAGE feature importance (permutation-based, GPU-aware)
├── shap_explanations.py # Per-item SHAP feature attribution (TreeSHAP + PermutationSHAP)
├── pipeline.py # Full pipeline orchestration (6 sources + MC + background SHAP)
├── fsm.py # AgentFSM state machine
├── fixtures/
│ ├── universal_vocabulary.json # BFO-grounded universal vocabulary (16 leaves)
│ └── fixture_tables.json # 8 tables, 50 cols — fixture reference for unit tests
│ (NOT the UAT-corpus curated reference; see
│ build/meta-tagging-clean/curated_reference.csv)
data/sample/
└── ontology.json # Expanded vocabulary (300 leaves, 25 internal)
└── ontology/
├── atelier-vocab.ttl # CCO-mediated BFO alignment (59 mapped terms)
├── sparql/unmapped-terms.rq # Totality validation query
└── README.md # Mapping methodology and usage
Build Directory
Artifacts are written to build/ (gitignored) to separate reproducible
code from potentially sensitive intermediate data:
build/
├── data/annotations/ # Cached vocabulary from hive
├── data/samples/ # Sampled metadata
├── data/synth/ # Synthetic training data
├── models/ # Trained CatBoost + SVM models, embedding caches
└── results/{run_id}/
├── classifications.json # Per-column DST results (+ SHAP columns when enabled)
├── evaluation_report.json # Per-category P/R/F1, confusion matrix
└── atelier_embeddings.parquet # For embedding-atlas (+ shap_top{1,2,3}_{name,value})
Controlled Vocabulary
Loaded from hive default.annotations (11 columns):
| Column | Maps to | Purpose |
|---|---|---|
| id | code | Hierarchical dot-notation identifier |
| ontology | label | Human-readable category name |
| annotation | abbrev | Formal code / mnemonic |
| definition | description | Human-readable definition text |
| common_names | common_names | Pipe/comma-separated aliases |
| specifics | (embedding text) | Examples and context |
| non_corp, emp_contractor, individual, corp | sensitivity | Per-role ratings (0-4) |
| deprecated | (filter) | “yes” = exclude |
API
REST Endpoints
- GET /api/fsm/status: Current pipeline state + progress
- POST /api/fsm/start: Start a single-pass ML classification run
- POST /api/fsm/start-bootstrap: Start bootstrap convergence loop (LLM + ML)
- GET /api/fsm/runs: List past runs
gRPC RPCs
- GetFSMStatus() → FSMStatusResponse
- StartClassification() → StartClassificationResponse
HierarchicalClassification
The pipeline wraps each column result in a HierarchicalClassification object
(ported from signals) that enables post-hoc hierarchy navigation:
- belief_at(code): query Bel at any hierarchy level (leaf or internal)
- plausibility_at(code): query Pl at any level
- interval_at(code): (Bel, Pl) tuple
- uncertainty_gap: Pl - Bel for the predicted category
- needs_clarification: True when uncertainty_gap > 0.3 or conflict > 0.2
- from_combined_evidence(): factory method that filters vacuous sources, combines via the configured fusion strategy, and ranks by pignistic probability
Confidence is pignistic probability BetP(singleton), the decision-theoretic
transform that distributes multi-element focal set mass equally among members.
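The three quantities can be written out directly from a mass function. The sketch below is a standalone illustration, not the HierarchicalClassification implementation: it represents a hierarchy node as the set of leaves beneath it, and the function names and category codes are assumptions.

```python
Mass = dict[frozenset[str], float]


def belief(mass: Mass, target: frozenset[str]) -> float:
    """Bel: total mass of focal elements fully contained in the target set."""
    return sum(w for focal, w in mass.items() if focal <= target)


def plausibility(mass: Mass, target: frozenset[str]) -> float:
    """Pl: total mass of focal elements that intersect the target set."""
    return sum(w for focal, w in mass.items() if focal & target)


def pignistic(mass: Mass, leaf: str) -> float:
    """BetP: each focal element shares its mass equally among its members."""
    return sum(w / len(focal) for focal, w in mass.items() if leaf in focal)


def needs_clarification(gap: float, conflict: float) -> bool:
    """Thresholds as stated in the list above."""
    return gap > 0.3 or conflict > 0.2


theta = frozenset({"email", "phone", "ssn"})
mass = {
    frozenset({"email"}): 0.6,
    frozenset({"email", "phone"}): 0.2,  # compound focal element
    theta: 0.2,                           # ignorance
}
email = frozenset({"email"})
bel, pl = belief(mass, email), plausibility(mass, email)
bet_p = pignistic(mass, "email")
```

For this mass function the email interval is [0.6, 1.0]: the compound element and the ignorance mass both remain plausible for email without committing to it, while BetP splits them evenly among their members.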
Fusion Strategies
Two DST combination rules are implemented, selectable via classify.fusion_strategy:
- dempster (default): Classical Dempster’s rule with (1-K) normalization. Under high conflict, surviving singletons are amplified.
- yager: Yager’s modified rule. Conflict mass is redirected to Θ (ignorance) instead of being normalized away. Preserves epistemic honesty at the cost of higher ignorance mass and typically lower peak belief values. When K=0, produces identical results to Dempster.
Yager is available as an opt-in alternative for empirical validation. The default (Dempster) remains in place pending A/B comparison on real pipeline runs — Yager’s increased conservatism may or may not improve overall classification quality, and compensatory adjustments to per-source discounting or decision thresholds may be needed.
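The difference between the two rules is confined to what happens to the conflict mass, as the sketch below shows. The function name combine_sketch, its signature, and the example sources are assumptions for illustration rather than the pipeline's fusion code.

```python
from itertools import product

Mass = dict[frozenset[str], float]


def combine_sketch(
    m1: Mass, m2: Mass, theta: frozenset[str], strategy: str = "dempster"
) -> Mass:
    """Conjunctive combination with either Dempster or Yager conflict handling."""
    joint: Mass = {}
    conflict = 0.0
    for (a, w_a), (b, w_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            joint[intersection] = joint.get(intersection, 0.0) + w_a * w_b
        else:
            conflict += w_a * w_b
    if strategy == "dempster":
        if conflict >= 1.0:
            raise ValueError("total conflict: Dempster's rule is undefined")
        # Conflict is normalized away, amplifying the surviving focal elements.
        return {focal: w / (1.0 - conflict) for focal, w in joint.items()}
    if strategy == "yager":
        # Conflict is reassigned to the full frame (ignorance).
        if conflict > 0.0:
            joint[theta] = joint.get(theta, 0.0) + conflict
        return joint
    raise ValueError(f"unknown fusion strategy: {strategy}")


theta = frozenset({"email", "phone", "ssn"})
m_a = {frozenset({"email"}): 0.8, theta: 0.2}
m_b = {frozenset({"phone"}): 0.7, theta: 0.3}
by_dempster = combine_sketch(m_a, m_b, theta, "dempster")
by_yager = combine_sketch(m_a, m_b, theta, "yager")
```

With these two strongly disagreeing sources (K = 0.56), Dempster lifts the email singleton from 0.24 to about 0.55, whereas Yager leaves it at 0.24 and reports 0.62 of ignorance.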
Bootstrap Convergence Loop
The bootstrap pipeline wraps the single-pass ML pipeline in an iterative LLM↔ML convergence loop. It adds LLM evidence and repeats until predictions are settled — measured by belief-gap convergence, not raw conflict K.
Three Phases
- LLM Sweep (LLM_SWEEP): Batch-classify all columns via the configured LLM backend (Claude via Bedrock/Anthropic, or any OpenAI-compatible endpoint). Columns are sent in table-aware batches with sibling context. If every batch fails, the sweep raises RuntimeError (fail-fast) instead of silently proceeding with zero labels.
- ML Validation (VALIDATING): Run the full 6-source DST pipeline for each column. Compute per-column belief interval [Bel, Pl], conflict K, and uncertainty gap Pl - Bel. Identify uncertain columns where predictions need revisiting.
- Targeted Revisit (back to LLM_SWEEP): Re-classify uncertain columns with enriched context: the ML prediction, belief interval, pattern signals, and value descriptions are included in the prompt. This gives the LLM evidence it didn’t have in the first pass.
Belief-Gap Convergence
The primary convergence measure is the uncertainty gap Pl - Bel for
each column’s predicted category. This directly answers “how settled is this
prediction?” — unlike K, which only measures source disagreement.
A column can have K=0.9 but Bel=0.95 — the sources fought hard during
combination, but the normalizing denominator (1-K) concentrated surviving
mass on the agreed-upon singleton. That column’s prediction is settled
despite high conflict; it doesn’t need revisiting.
Convergence criteria (all must hold):
| Criterion | Metric | Default | Meaning |
|---|---|---|---|
| Primary | mean_gap < gap_threshold | 0.15 | Predictions are tight |
| Secondary | frac_unclear < clarity_target | 0.10 | At most 10% of columns need clarification |
| Coverage | coverage >= coverage_target | 0.95 | 95% of columns have labels |
Revisit targeting: _identify_uncertain_columns() selects columns
where gap > 0.3 OR Bel < bel_floor (default 0.50), sorted by gap
descending (most uncertain first).
Early stopping: The proof-of-progress paradigm monitors the gap trend. When mean gap plateaus for 2 consecutive iterations (no verifiable progress), the loop stops even if the threshold hasn’t been reached.
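The criteria, the revisit targeting, and the plateau check can be sketched together. This is an illustration under stated assumptions, not the code in bootstrap.py: ColumnState and the function names are hypothetical, "unclear" is approximated by gap > 0.3 alone, and a plateau is taken to mean that mean gap improved by less than a small min_delta over each of the last two iterations.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ColumnState:
    name: str
    bel: float
    pl: float
    labeled: bool = True

    @property
    def gap(self) -> float:
        return self.pl - self.bel


def is_converged(columns, gap_threshold=0.15, clarity_target=0.10,
                 coverage_target=0.95, unclear_gap=0.3) -> bool:
    """All three criteria from the table must hold."""
    mean_gap = mean(c.gap for c in columns)
    frac_unclear = sum(c.gap > unclear_gap for c in columns) / len(columns)
    coverage = sum(c.labeled for c in columns) / len(columns)
    return (mean_gap < gap_threshold
            and frac_unclear < clarity_target
            and coverage >= coverage_target)


def identify_uncertain(columns, gap_limit=0.3, bel_floor=0.50):
    """Columns with a wide gap OR low belief, most uncertain first."""
    flagged = [c for c in columns if c.gap > gap_limit or c.bel < bel_floor]
    return sorted(flagged, key=lambda c: c.gap, reverse=True)


def plateaued(mean_gaps, patience=2, min_delta=1e-3) -> bool:
    """True when the last `patience` iterations each improved by < min_delta."""
    if len(mean_gaps) <= patience:
        return False
    recent = mean_gaps[-(patience + 1):]
    return all(prev - cur < min_delta for prev, cur in zip(recent, recent[1:]))


columns = [
    ColumnState("email", 0.90, 0.95),
    ColumnState("ref_id", 0.40, 0.85),
    ColumnState("created_at", 0.45, 0.60),
]
revisit = [c.name for c in identify_uncertain(columns)]
```

Here ref_id is flagged for its wide gap and created_at for falling below the belief floor, so both go back to the LLM while email is left alone.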
K as Diagnostic
Conflict K remains in logs, iteration metrics, and agent tools as a
diagnostic for source disagreement. It is useful for identifying
calibration issues (e.g., a pattern detector producing false positives)
but does not gate convergence. The cumulative K formula
K = 1 - Π(1 - Kᵢ) tends to be high (~0.5-0.8) with 6 partially
correlated sources; this is expected and does not indicate poor quality.
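A short numeric check shows why the cumulative figure climbs so quickly. The sketch assumes each Kᵢ is the conflict observed at one pairwise combination step (five steps for six sources); the per-step values are made up for illustration.

```python
import math


def cumulative_conflict(step_conflicts: list[float]) -> float:
    """K = 1 - product(1 - K_i) over the individual combination steps."""
    return 1.0 - math.prod(1.0 - k for k in step_conflicts)


# Five modest per-step conflicts, as might arise when fusing six sources.
steps = [0.10, 0.15, 0.20, 0.10, 0.15]
total_k = cumulative_conflict(steps)
print(round(total_k, 3))
```

Even though no single step exceeds 0.20, the cumulative K for these values is already above 0.5, which is why a high headline K is not by itself a sign of poor quality.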
Agent-Driven Convergence
As an alternative to the programmatic loop, the agent convergence loop
(agent_loop.py) delegates revisit strategy to Claude. The agent uses
6 tools — get_conflict_report, revisit_columns, check_convergence,
get_column_detail, retrain_svm, declare_converged — to reason about
which columns need re-examination. The agent sees both gap-based and K-based
metrics and can make nuanced decisions. See Keystone Agents.
LLM Backend
llm_backend.py provides a factory-pattern abstraction:
- OpenAICompatibleBackend: For vLLM, GLM-4.7, and any endpoint implementing the OpenAI chat completions API. Default backend.
- AnthropicBackend: For Claude via the Anthropic SDK.
- BedrockBackend: For AWS Bedrock via the Converse API.
- BedrockStructuredBackend: Production default on CAI. Uses invoke_model with tool-use for structured output (output_config is not supported on Bedrock). When extended thinking is enabled, tool_choice must be "auto" (Anthropic constraint); a text-block fallback parser handles this case. Both Bedrock backends use region_from_arn() to extract the target region from cross-region inference profile ARNs.
- CerebrasBackend: OpenAI-compatible with Cerebras-specific defaults (base_url=https://api.cerebras.ai/v1, model=zai-glm-4.7).
- create_backend_from_cfg(cfg): Factory that reads HOCON config to select and configure the appropriate backend.
Backends fail fast when not configured — no mock fallback in production code.
Configuration
All bootstrap/LLM settings live in HOCON (config/base.conf):
classify {
  llm {
    backend = "openai_compatible" # or "anthropic", "bedrock_structured"
    model = "glm-4.7"
    base_url = null # vLLM endpoint URL
    columns_per_call = 50
    discount = 0.10 # DST discount for LLM mass
  }
  bootstrap {
    max_iterations = 5
    k_threshold = 0.2 # diagnostic (not convergence-gating)
    coverage_target = 0.95
    max_total_llm_calls = 5000
    # Belief-gap convergence (primary criteria)
    gap_threshold = 0.15 # mean(Pl - Bel) target
    clarity_target = 0.10 # max fraction of unclear columns
    bel_floor = 0.50 # min belief for "settled"
  }
}
Environment variable overrides follow the standard pattern:
ATELIER_LLM_MODEL, ATELIER_LLM_BASE_URL, ATELIER_BOOTSTRAP_K_THRESHOLD, etc.
SHAP Explanations
Per-item feature attribution explaining why each column was classified as it was. Complements the global SAGE importance (which ranks features across the entire dataset) with item-level explanations.
Two Methods
| Method | Algorithm | Speed | Features | When Used |
|---|---|---|---|---|
| CatBoost TreeSHAP | Exact O(TLD) built-in | ~0.1s for 50 items | Grouped: embedding, discrete | Auto when CatBoost model loaded |
| Embedding PermutationSHAP | shap.PermutationExplainer | ~50s/item on CPU | 12 named features | Tier-1, explicit request only |
Auto mode (method="auto") only uses TreeSHAP — PermutationSHAP is too
slow for default pipeline runs and must be explicitly requested.
Output
Each classification gains 6 extra columns:
- shap_top1_name, shap_top1_value
- shap_top2_name, shap_top2_value
- shap_top3_name, shap_top3_value
These flow through to JSON, parquet, and evaluation output.
Configuration
classify.shap {
  enabled = true # Enable SHAP in pipeline (auto-selects method)
  top_k = 3 # Number of top features to report per item
}
Configurable Discounts
All DST discount factors are configurable via HOCON. The DiscountConfig
dataclass bundles all parameters with DiscountConfig.from_cfg(cfg) factory:
classify.discounts {
  cosine = 0.30 # Cosine similarity → Theta mass
  svm = 0.20 # SVM → Theta mass
  pattern_theta = 0.25 # Pattern detection → Theta mass (graduated by match fraction)
  name_match_exact = 0.70 # Exact label match singleton mass
  name_match_code = 0.50 # Formal code/abbrev match mass
  name_match_alias = 0.50 # Common name alias match mass
  name_match_overlap = 0.30 # Word overlap match mass
  catboost_base = 0.10 # Adaptive discount base
  catboost_variance_scale = 1.6 # Variance-to-discount scaling
  catboost_max = 0.50 # Cap on adaptive discount
  catboost_fallback = 0.15 # When no variance available
  confusable_ratio_threshold = 3.0 # Mass-split ratio that triggers compound focal element transfer
}
Environment variable overrides: ATELIER_DISCOUNT_COSINE, ATELIER_DISCOUNT_SVM, etc.
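To make the role of these factors concrete, the sketch below applies classical Shafer discounting, which scales every focal element by (1 - α) and moves the remainder to Θ. Whether the pipeline applies its discounts in exactly this form is an assumption, as is the linear base-plus-scaled-variance formula used for the adaptive CatBoost discount; the function names are hypothetical.

```python
from typing import Optional

Mass = dict[frozenset[str], float]


def discount(mass: Mass, theta: frozenset[str], alpha: float) -> Mass:
    """Shafer discounting: shrink committed mass by (1 - alpha), add alpha to Theta."""
    out = {focal: (1.0 - alpha) * w for focal, w in mass.items() if focal != theta}
    out[theta] = alpha + (1.0 - alpha) * mass.get(theta, 0.0)
    return out


def catboost_discount(
    variance: Optional[float],
    base: float = 0.10,
    scale: float = 1.6,
    cap: float = 0.50,
    fallback: float = 0.15,
) -> float:
    """Assumed adaptive rule: base + scale * variance, capped; fallback if unknown."""
    if variance is None:
        return fallback
    return min(cap, base + scale * variance)


theta = frozenset({"email", "phone", "ssn"})
raw = {frozenset({"email"}): 0.8, frozenset({"phone"}): 0.2}
discounted = discount(raw, theta, alpha=0.30)  # e.g. the cosine discount
```

After a 0.30 discount the source commits only 0.56 to email and 0.14 to phone, with 0.30 held back as ignorance, so a single overconfident source cannot dominate the fusion.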
Milestones
| Milestone | Scope | Status |
|---|---|---|
| M0 | Cosine + pattern + name match, FSM, pipeline E2E | Done |
| M0.5 | Schema fix, pignistic probability, HierarchicalClassification | Done |
| M1 | LLM evidence source, bootstrap convergence loop, LLM↔ML validation | Done |
| M2 | CatBoost + SVM + synthetic data, 6 evidence sources, Bedrock/Cerebras backends | Done |
| M3 | Evaluation framework, E2E synth-train-eval, realistic mock LLM, SAGE importance | Done |
| M4 | SHAP explanations, configurable discounts, thread-safe model loading | Done |
| M5 | Data sources + versioning, OOTB onboarding (316-leaf ontology, 25 sample tables) | Done |
| M6 | Agent-driven convergence loop (6 Claude tools), synth framework (316+ generators) | Done |
| M7 | Monte Carlo stratified sampling, label propagation, background SHAP | Done |
| M8 | GPU acceleration (NVIDIA driver symlink, batch encoding), meta-tagging overlay | Done |
| M8.5 | SVM signals alignment (Pipeline+FeatureUnion adoption, evidence independence documentation) | Done |
| M9 | Incremental SVM training on LLM-classified labels (cross-model distillation via MC sampling) — subsequently excised, see 2026-05-04 historical note above | Done |
| M10 | Phase Gate #2 — belief-gap convergence pivot, Cautious-Code Review, TreeSHAP per-feature attribution, reasoning-trace citation analyzer (+9 pts iterative gain), 97.8% phase-gate validation on meta-tagging | Done |
| M11 | MLflow experiment tracking, Hive data source integration | Proposed |
Pipeline Phases (FSM Walk-Through)
A run of the classification pipeline advances through a finite state
machine. Each state is a named phase with a single responsibility,
and the legal transitions between phases (defined authoritatively in
src/atelier/classify/fsm.py) form the workflow that operators see live
in the Workflows page and that this document narrates end-to-end.
This page is the operator-facing companion to two deeper references:
- Classification Pipeline — what the pipeline does mathematically (DST, evidence sources, fusion strategies).
- DST Evidence Independence — why each evidence source qualifies for Dempster’s rule of combination.
Read this one first when you need to walk a reviewer through the run
shape: which phase produces which artifact, where the iteration loop
lives, what makes a run land in CONVERGED versus ERROR.
At a glance
┌──── revisit ────┐
│ │
▼ │
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
│
▼
CLASSIFYING → FUSING → EVALUATING → CONVERGED
│
(any phase)
▼
ERROR
The arrow back from VALIDATING to LLM_SWEEP is the iteration loop; it’s the heart of the algorithm and is described in Iteration loop below.
Phases in execution order
| # | State | What it does | Primary output |
|---|---|---|---|
| — | IDLE | No run in flight. Ready to dispatch the next classification. | — |
| 1 | LOADING_VOCAB | Load the user-supplied taxonomy (annotations CSV / Hive table / DB) and validate: label collisions, duplicate codes, orphaned aliases, parent-aware frame structure. | HierarchicalCategorySet, FrameOfDiscernment |
| 2 | DISCOVERING | Probe the data source via cml.data_v1 (Hive), the meta-tagging mount (CSV), or the bundled fixtures to enumerate the tables in scope. | list[str] of table names |
| 3 | SAMPLING | For each discovered table, sample column metadata: bare names, types, ~5 representative values, true COUNT(DISTINCT) bounded by the sample limit, null ratio, sibling list. Reference-key columns (attr_1_2_3_* answer-key shape) are filtered out so they don’t trivially leak into evaluation. | list[ColumnSample] (canonical bare names — see ColumnSample invariant) |
| 4 | LLM_SWEEP | Claude classifies each directly-targeted column into the user vocabulary. Iteration 1 sweeps every column (or the Monte Carlo sampled subset — a stratified slice for large corpora; the remaining columns get label propagation later). Iterations 2…N revisit only the columns flagged for re-look in the previous VALIDATING pass. | state.labels[qualified_name] → category_code, plus per-column LLM confidence |
| 5 | VALIDATING | ML re-validation: CatBoost (fit-to-LLM during the loop) and the synth-trained SVM (translated through the ICE→user-vocab alignment, which since the P7 intervention uses subsumption prediction rather than an LLM) score the same columns independently of the LLM. Per-column DST mass with conflict K is computed under the parent-aware frame. The disagreement set, driven primarily by belief-gap Pl − Bel with K and coverage as secondary signals, feeds the next iteration’s revisit batch. The loop exits when convergence criteria are satisfied; otherwise it re-enters LLM_SWEEP. | state.ml_prediction, state.ml_belief, state.ml_plausibility, state.ml_conflict, the next iteration’s disagreement list |
| 6 | CLASSIFYING | Final per-column DST evidence fusion. Up to six evidence sources combine: name_match, pattern, cosine, llm, catboost, svm. Each produces a mass function over the parent-aware frame; per-column predicted code, belief, plausibility, and conflict are computed here. | classifications: list[dict] (each entry shaped as classifications.json rows) |
| 7 | FUSING | Combine per-column mass functions via the configured fusion strategy. dempster normalizes conflict by (1 − K); yager redirects conflict mass to Θ (ignorance). Cautious-code review (when enabled) runs here — backing off over-specified leaf predictions whose belief sits below the commit threshold to a parent code where it does. | Headline classification per column; cautious_review.json (when enabled) |
| 8 | EVALUATING | Compute corpus-level metrics: accuracy vs reference (when present), per-category precision/recall, K distribution, gap distribution, non-PII residual count. Persist artifacts to disk and emit the parquet for the Embeddings page. Overwatch (when enabled) runs at the tail of this phase. | evaluation_report.json, column_trajectories.json, taxonomy_findings.json, atelier_embeddings.parquet, ML artifacts, overwatch.md (when enabled) |
| — | CONVERGED | Terminal success. Convergence criteria satisfied; results are on disk under build/results/{run_id}/ and registered in the DB via the run-end registration path (or recovered later by atelier.db.sync.sync_filesystem_to_db on restart). | — |
| — | ERROR | Terminal failure. The FSM error field carries the diagnostic; pipeline logs and register_error.json (if any) carry the rest. | — |
The FSM defines two states that the standard inference run does not
visit — GENERATING_SYNTH and TRAINING. These belong to the
offline synth-corpus generation + SVM-training flow that
produces the bundled SVM artifact (legacy filename
svm_frontier.pkl retained on disk for backward compatibility with
older run directories), and are reachable from SAMPLING only on the
explicit synth-generate code path.
Iteration loop: LLM_SWEEP ⇄ VALIDATING is the algorithm
The single most important thing to internalize when reviewing the
pipeline: LLM_SWEEP and VALIDATING are not two separate one-shot
phases — they form an iteration loop, and the loop is the convergence
algorithm.
Each cycle:
- LLM_SWEEP labels (or re-labels) the directly-targeted column set on iteration 1, the disagreement set on iterations 2…N.
- VALIDATING runs ML re-validation, computes per-column belief, plausibility, and conflict under the parent-aware DST frame, and identifies the next disagreement set.
- The loop exits when one of the convergence criteria is satisfied; otherwise it re-enters LLM_SWEEP.
The Workflows page draws this as a purple dashed back-edge from
VALIDATING to LLM_SWEEP precisely because the geometry teaches the
algorithm: this is bootstrapping with active-learning revisit, not a
linear pipeline.
The driver of the loop is configurable:
- Programmatic (default): pipeline._llm_revisit picks revisit candidates from _identify_disagreements and _identify_uncertain_columns.
- Agent-driven (capability flag): the Agent Convergence skill replaces the programmatic driver, so Claude chooses revisit candidates and decides when to declare convergence via tool calls.
Convergence criteria
A run reaches CONVERGED when any of the first three conditions below holds
at the end of an iteration in the LLM_SWEEP ⇄ VALIDATING loop, provided the
minimum-iteration gate in the last row has been honored:
| Criterion | Config key | Default | Notes |
|---|---|---|---|
| Mean belief-gap below threshold | classify.bootstrap.gap_threshold | 0.05 | The primary signal — converging on mean(Pl − Bel), not on K. Locked in by commit bd7de2c after the parent-aware DST frame audit. |
| Coverage met + K acceptable | classify.bootstrap.coverage_floor, classify.bootstrap.k_threshold | 0.95, 0.40 | Backstop — a corpus that LLM-labels everything cleanly on the first sweep doesn’t need additional iterations. |
| Iteration cap reached | classify.bootstrap.max_iterations | 4 | Fallback. The convergence reason is recorded as max_iterations_reached so the UI can show an honest “ran the full budget” rather than claiming gap convergence the run didn’t actually achieve. |
| Min iterations honored | classify.bootstrap.min_iterations | 2 | Forces at least N revisit cycles before any convergence path can fire. Defends against a single-pass LLM that happens to land cleanly without the ML cross-check having run. |
Conflict K (Dempster’s rule’s normalization mass) is diagnostic, not
the gating signal. Earlier iterations of the design framed K as the
convergence headline; that framing was retired in commit bd7de2c and
matters for any review of older docs or telemetry that still leads
with K.
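The table can be read as a small decision function. The sketch below is one plausible reading, not the logic in bootstrap.py: only the reason string max_iterations_reached appears in this document, the other reason strings are hypothetical, "K acceptable" is assumed to mean mean K at or below k_threshold, and the minimum-iteration gate is assumed to block the gap and coverage paths but not the iteration cap.

```python
from typing import Optional


def convergence_reason(
    iteration: int,
    mean_gap: float,
    coverage: float,
    mean_k: float,
    *,
    gap_threshold: float = 0.05,
    coverage_floor: float = 0.95,
    k_threshold: float = 0.40,
    max_iterations: int = 4,
    min_iterations: int = 2,
) -> Optional[str]:
    """Return why the loop should stop, or None to run another iteration."""
    if iteration >= min_iterations:
        if mean_gap < gap_threshold:
            return "gap_converged"  # hypothetical label for the primary signal
        if coverage >= coverage_floor and mean_k <= k_threshold:
            return "coverage_and_k"  # hypothetical label for the backstop
    if iteration >= max_iterations:
        # Recorded honestly so the UI does not claim gap convergence.
        return "max_iterations_reached"
    return None
```

Checking the gap before the iteration cap means a run that genuinely converges on its final allowed iteration is reported as converged, while a run that merely exhausts its budget is labelled as such.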
ColumnSample canonical form
The pipeline’s data-model invariant: ColumnSample.name is always the
bare column identifier (table-relative, free of any f"{table_name}."
prefix). Cross-table identity uses the qualified_name property
(f"{table_name}.{name}") for dict keying.
This invariant is enforced in __post_init__ and validated at every
source boundary:
- Hive sampler (sampler._strip_table_qualifier): strips the f"{table_name}." qualifier that Hive’s JDBC driver returns from SELECT * FROM db.table.
- Meta-tagging CSV (meta_tagging_source): strips the same prefix from CSV headers that encode the table name.
- Synth, OOTB sample, fixtures: produce bare names by construction.
Any new source path that produces qualified names will trip the
__post_init__ invariant at construction time with a clear
diagnostic, rather than letting them silently propagate into the
embedding text — where a repeated table-name prefix in the column
name and the sibling list would drown the actual column signal in
table-theme noise and produce table-wide misclassification.
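The invariant can be illustrated with a reduced dataclass. The sketch keeps only the two fields needed to show the check; the class name ColumnSampleSketch, the error message, and the standalone strip_table_qualifier helper are assumptions, and the real ColumnSample carries more fields.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnSampleSketch:
    table_name: str
    name: str  # must be the bare, table-relative identifier

    def __post_init__(self) -> None:
        prefix = f"{self.table_name}."
        if self.name.startswith(prefix):
            raise ValueError(
                f"column name {self.name!r} carries the table qualifier "
                f"{prefix!r}; strip it at the source boundary"
            )

    @property
    def qualified_name(self) -> str:
        """Cross-table identity, suitable as a dict key."""
        return f"{self.table_name}.{self.name}"


def strip_table_qualifier(table_name: str, raw_name: str) -> str:
    """Source-boundary helper: drop a leading '<table>.' if present."""
    prefix = f"{table_name}."
    return raw_name[len(prefix):] if raw_name.startswith(prefix) else raw_name


raw = "users.email"  # shape a driver might return for SELECT * FROM db.users
sample = ColumnSampleSketch("users", strip_table_qualifier("users", raw))
```

Failing at construction time keeps the error next to the source that produced the qualified name, rather than surfacing later as degraded embeddings.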
Optional capability skills
Three skills attach to specific phases and are gated by the
corresponding capability flag. Each renders only when its flag is
enabled in /api/status config.
| Skill | Attaches to | Behavior | Capability flag |
|---|---|---|---|
| Agent Convergence | VALIDATING | Claude drives the convergence loop directly via the agent_loop tool surface — picks revisit candidates from belief/conflict signals and declares convergence when satisfied — replacing the programmatic loop driver. Bounded by max_turns. | classify_agent_enabled |
| Cautious Review | FUSING | Per-column LLM review that backs off over-specified leaf predictions to a parent code where belief crosses the commit threshold. Defends against false-precision claims on opaque or ambiguous columns. | cautious_review_enabled |
| Overwatch | EVALUATING | Single-turn Opus analysis writes overwatch.md with pipeline-tuning recommendations after the run lands. Requires direct Anthropic API (not Bedrock) — Bedrock lags Opus releases. | overwatch_enabled |
The three skills are visible as orange dashed nodes in the Workflows
page when their flags are enabled, attached to their host phases.
This is the registry-MVP shape — when we hit roughly six surfaces it
graduates to a backend /api/skills endpoint reading from a real
registry rather than a hand-coded list.
Phase ↔ artifact map
For an end-to-end review of any single run, here’s what’s recoverable
from build/results/{run_id}/, indexed by the phase that produced it:
| Phase | Artifact | What’s in it |
|---|---|---|
| Run start | settings_snapshot.json | The config that drove the run — source_id, all overlay values at start, default values, the resolved settings the pipeline actually used. |
| LLM_SWEEP ⇄ VALIDATING | column_trajectories.json | Per-column history across iterations: label changes, ML predictions, belief/plausibility/conflict trajectory, the revisited flag per iteration. |
| LLM_SWEEP ⇄ VALIDATING | catboost_fit_to_llm.cbm + .classes.json | CatBoost fit to the in-loop LLM labels. Persisted for Extend runs. |
| LLM_SWEEP ⇄ VALIDATING | svm_frontier.pkl + .classes.json | Synth-trained SVM with the in-run ICE→user-vocab alignment (subsumption prediction since the P7 intervention, no longer LLM-mediated). Persisted for Extend runs. (Filename retained for backward compatibility; underlying model is the synth-trained SVM, not the excised M9 in-loop retrain.) |
| CLASSIFYING + FUSING | classifications.json | The per-column output: predicted code, belief, plausibility, conflict, full evidence-source mass distributions, belief path, llm/ML/cautious codes. The headline corpus result. |
| FUSING | cautious_review.json | Cautious Review skill audit (only when enabled). |
| EVALUATING | evaluation_report.json | Corpus-level metrics: accuracy, per-category precision/recall, K distribution, gap distribution, non-PII residual count. |
| EVALUATING | taxonomy_findings.json | Notes flagged during taxonomy traversal: orphaned codes, suspicious aliases, near-duplicate labels. |
| EVALUATING | atelier_embeddings.parquet + umap.pkl | Input for the Embeddings page (UMAP projection of the per-column embedding vectors with predicted-code colorings). |
| EVALUATING | sage_importance.json + shap_summary.json | GPU-accelerated global feature importance + per-column SHAP attributions (when enabled). |
| EVALUATING (post) | overwatch.md | Overwatch skill output (only when enabled). |
| Run end | register_error.json (rename to .resolved after sync) | If DB registration failed mid-run, the sync path on restart picks the run up from this sidecar. |
Where to look in code
| Concern | File |
|---|---|
| FSM state enum, transitions, FSMRun dataclass | src/atelier/classify/fsm.py |
| Phase advancement (every fsm.advance(...) call) | src/atelier/classify/pipeline.py |
| Iteration loop driver (programmatic) | src/atelier/classify/bootstrap.py (_llm_sweep, _llm_revisit, _identify_disagreements, _run_ml_validation) |
| Iteration loop driver (agent) | src/atelier/classify/agent_loop.py |
| Per-column DST fusion | src/atelier/classify/pipeline.py (_classify_column) |
| Convergence criteria evaluation | src/atelier/classify/bootstrap.py (_mean_gap, _mean_k, _coverage, should_stop_early) |
| Cautious Review skill | src/atelier/classify/cautious_review.py |
| Overwatch skill | src/atelier/overwatch/agent.py |
| Workflows page topology (UI) | ui/src/lib/fsmPipelineLayout.ts |
State transition reference
Authoritative state-transition table (from
fsm.py:_TRANSITIONS):
| From | Legal next states |
|---|---|
| IDLE | LOADING_VOCAB |
| LOADING_VOCAB | DISCOVERING, ERROR |
| DISCOVERING | SAMPLING, ERROR |
| SAMPLING | CLASSIFYING, GENERATING_SYNTH, LLM_SWEEP, ERROR |
| GENERATING_SYNTH | TRAINING, ERROR |
| TRAINING | CLASSIFYING, ERROR |
| LLM_SWEEP | VALIDATING, CLASSIFYING, ERROR |
| VALIDATING | LLM_SWEEP, CLASSIFYING, ERROR |
| CLASSIFYING | FUSING, ERROR |
| FUSING | EVALUATING, ERROR |
| EVALUATING | CONVERGED, IDLE, ERROR |
| CONVERGED | IDLE |
| ERROR | IDLE |
SAMPLING → CLASSIFYING (skipping LLM_SWEEP) is the path used by
Extend runs, where ML-only inference is desired
because the LLM has already classified an earlier corpus and the
artifacts are being applied to a new dataset. LLM_SWEEP → CLASSIFYING (skipping VALIDATING) is the “first-sweep convergence”
path on small corpora that don’t need iteration.
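The table can be mirrored in a small lookup for quick sanity checks of a proposed state sequence. This is a minimal sketch: the `TRANSITIONS` dict and the `is_legal` / `is_legal_path` helpers are hypothetical names used for illustration, states are plain strings rather than the real enum, and nothing here is taken from `fsm.py` beyond the table itself.

```python
# Hypothetical mirror of the transition table above. The real
# fsm.py:_TRANSITIONS may use an Enum and a different container type.
TRANSITIONS = {
    "IDLE": {"LOADING_VOCAB"},
    "LOADING_VOCAB": {"DISCOVERING", "ERROR"},
    "DISCOVERING": {"SAMPLING", "ERROR"},
    "SAMPLING": {"CLASSIFYING", "GENERATING_SYNTH", "LLM_SWEEP", "ERROR"},
    "GENERATING_SYNTH": {"TRAINING", "ERROR"},
    "TRAINING": {"CLASSIFYING", "ERROR"},
    "LLM_SWEEP": {"VALIDATING", "CLASSIFYING", "ERROR"},
    "VALIDATING": {"LLM_SWEEP", "CLASSIFYING", "ERROR"},
    "CLASSIFYING": {"FUSING", "ERROR"},
    "FUSING": {"EVALUATING", "ERROR"},
    "EVALUATING": {"CONVERGED", "IDLE", "ERROR"},
    "CONVERGED": {"IDLE"},
    "ERROR": {"IDLE"},
}


def is_legal(src: str, dst: str) -> bool:
    """True when dst is a legal next state after src."""
    return dst in TRANSITIONS.get(src, set())


def is_legal_path(path: list) -> bool:
    """True when every consecutive pair in the path is a legal transition."""
    return all(is_legal(a, b) for a, b in zip(path, path[1:]))


# Extend-run path: SAMPLING goes straight to CLASSIFYING, skipping LLM_SWEEP.
extend_run = ["IDLE", "LOADING_VOCAB", "DISCOVERING", "SAMPLING",
              "CLASSIFYING", "FUSING", "EVALUATING", "CONVERGED", "IDLE"]
# First-sweep convergence: LLM_SWEEP goes to CLASSIFYING, skipping VALIDATING.
first_sweep = ["SAMPLING", "LLM_SWEEP", "CLASSIFYING", "FUSING"]
```

Both `extend_run` and `first_sweep` pass `is_legal_path`, while a jump such as CLASSIFYING back to LLM_SWEEP does not.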
DST Evidence Independence
This note documents how Atelier’s classification pipeline handles non-distinct evidence sources under Dempster-Shafer fusion, and why the discount calibration and revisit gate are structured the way they are. It is intended to be cited by code reviewers and academic readers.
The pipeline as iterative refinement
Atelier’s bootstrap loop is iterative refinement on a
belief-assignment vector B over columns: B_{n+1} = T(B_n), where T
composes the LLM sweep, ML validation (CatBoost + SVM), DST fusion,
and targeted revisit on disagreement. Cast in the language of
classical numerical analysis (Banach 1922; Saad 2003 §4.1,
Iterative Methods for Sparse Linear Systems), every component of
the pipeline maps onto a numerical-method primitive:
| Component | Numerical-methods primitive |
|---|---|
| Bootstrap loop | Fixed-point iteration on B |
| LLM sweep | Stochastic operator T_LLM (Robbins-Monro 1951 framing) |
| ML validation | Deterministic linearization T_ML |
| DST fusion | Combiner ⊕ producing fused state |
| Targeted revisit on disagreement | Local smoothing in multigrid (Brandt 1977) |
| Pl − Bel gap | A posteriori error estimate per column |
| Conflict K | Nonlinear residual diagnostic |
| Ontology priors | Preconditioner — conditions first-pass output |
| Reliability discount on derivative sources | Damping / step-size control |
| Hierarchical cosine mass | Coarse-grid correction (multigrid) |
| cautious_promoted_code | Projection onto coarse grid at level where evidence unambiguous (Smets 1993) |
| needs_clarification | Residual-exceeds-tolerance flag |
The diagnostic that ties the framework together is the
residual norm ‖r(B)‖ — a unified scalar measuring distance
from the fixed point — and the contraction factor ρ_n = ‖r_{n+1}‖ / ‖r_n‖, the headline iterative-method indicator
(Saad §4.1):
- ρ < 1: contractive. Successive iterations reduce the residual.
- ρ → 1: stalled. Iterations are not making progress; this warrants a strategy change (different fusion rule, different preconditioner, agent escalation).
- ρ > 1: diverging. Iterations are growing the residual.
bootstrap.residual_norm and bootstrap.contraction_rate
implement the diagnostic. The unified residual is an L2
combination of four normalized components: mean(gap) /
gap_threshold, frac_unclear / clarity_target, mean(K) /
k_threshold, and frac(indep-tier disagreement at meaningful mass).
A residual_norm of 1.0 means “at convergence threshold across the
board”; values <1 are converged. Both are surfaced in
IterationMetrics and the agent loop’s iteration_history.
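The diagnostic can be sketched in a few lines. This is an illustrative reading, not the code in `bootstrap.py`: it assumes the L2 combination is scaled as a root-mean-square (so that every component sitting exactly at its threshold gives 1.0, matching the interpretation above), and it takes the already-normalized components as input. The function names mirror the text, but the signatures are hypothetical.

```python
import math
from typing import Optional, Sequence


def residual_norm(components: Sequence[float]) -> float:
    """Root-mean-square of already-normalized residual components.

    Assumption: RMS scaling, so all components at threshold (1.0) give 1.0.
    """
    if not components:
        raise ValueError("need at least one residual component")
    return math.sqrt(sum(c * c for c in components) / len(components))


def contraction_rate(prev_norm: float, curr_norm: float) -> Optional[float]:
    """rho_n = ||r_{n+1}|| / ||r_n||, or None when the ratio is undefined."""
    if prev_norm <= 0.0:
        return None
    return curr_norm / prev_norm


# Illustrative numbers only: mean(gap)/gap_threshold, frac_unclear/clarity_target,
# mean(K)/k_threshold, and the indep-tier disagreement fraction.
r_prev = residual_norm([1.6, 1.2, 0.9, 0.4])
r_curr = residual_norm([1.1, 0.8, 0.7, 0.2])
rho = contraction_rate(r_prev, r_curr)  # below 1: the iteration is contracting
```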
This framing is what makes the rest of the design (non-distinctness handling, hierarchical aggregation, ontology priors, reliability shaping) operate as a cohesive accuracy-targeting engine rather than a collection of clever heuristics. Each mechanism is a numerical-method primitive in service of driving the residual to zero.
The non-distinctness problem
Dempster’s rule of combination assumes the bodies of evidence being combined are produced by distinct, conditionally independent sources (Shafer 1976, A Mathematical Theory of Evidence, Ch. 3 §3 and Ch. 4). Smets’ Transferable Belief Model (Smets 1990; Smets & Kennes 1994, The Transferable Belief Model) preserves this assumption at the credal level. Denoeux 2008 (Conjunctive and Disjunctive Combination of Belief Functions Induced by Non-Distinct Bodies of Evidence, Artificial Intelligence) characterizes the pathology that arises when the assumption is violated: combining two mass functions that derive from a shared evidential atom via Dempster’s rule effectively raises the contribution of that atom to a power. The conjunctive cautious rule, defined through the conjunctive weight functions (which are computed from the commonality functions) and idempotent on identical evidence (Denoeux 2008 §4), recovers soundness, but it is non-normalising and not a drop-in replacement for Dempster.
The Atelier-specific violation
The classification pipeline in src/atelier/classify/ declares six
evidence sources:
- name_match: lexical column-name matching against the vocabulary.
- pattern: regex/validator detection (email, IBAN, monetary, …).
- cosine: semantic similarity between the curated embedding text and the user-vocabulary embedding.
- llm: Claude Opus first-pass classification.
- catboost: CatBoost classifier.
- svm: synth-trained TF-IDF + LinearSVC classifier with an LLM-mediated ICE → user-taxonomy alignment applied at inference time.
The first three are genuinely independent of the LLM: their evidence arises from the column’s name, value patterns, and semantic embedding comparison against the vocabulary. The remaining sources have a mixed independence profile:
- catboost is trained in fit_to_llm mode (default true) on (embedding_text, llm_code) pairs from the current run’s LLM sweep. See ml_train.fit_catboost_to_llm_labels and pipeline._install_fit_to_llm_catboost. The fitted model is, by construction, an explainability surface over the LLM’s labels, not a competing classifier. Strongly non-distinct with the LLM source under Denoeux 2008 (per-column shared label provenance).
- svm is trained once on the synthetic corpus (scripts/generate_synth_source.py → ml_train.train_svm), with TF-IDF char-3-6gram + word-1-2gram features and labels keyed on the bundled-ontology ICE.* leaves from synth_generators.GENERATORS. At pipeline runtime, predictions are translated into the user taxonomy via subsumption-prediction alignment in classify.subsumption_alignment: sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from the Qdrant taxonomy collection (one alignment computation per (vocab, embedding_model) tuple, results cached on disk). Weakly non-distinct with the cosine source via shared enrichment-LLM upstream: the enriched annotations were generated offline by an LLM, but the alignment computation itself uses a structurally independent model (BERT embeddings), not the runtime autoregressive LLM. The prior LLM-mediated approach (one LLM classify_batch call per alignment, excised in the P7 subsumption-alignment intervention) was weakly non-distinct with the runtime LLM through shared model weights; the new approach eliminates that correlation. See the ontology_alignment.py module docstring for the full independence argument.
Treating LLM and CatBoost(LLM) as fully-independent sources and
combining them via Dempster’s rule double-counts the LLM atom; the
SVM evidence sits between fully independent and fully derivative.
The pre-2026-04-30 discount schedule made the legacy three-way
overlap worse: llm=0.10, catboost=0.10, svm=0.20, vs
cosine=0.30. The genuinely independent semantic source was
more discounted than the two derivative ones, mathematically
suppressing it whenever the LLM was loud.
A failure case observed during pipeline validation illustrated the
pathology in the abstract. A column whose values match the
monetary_pattern regex was classified as a generic catch-all
code rather than a financial-domain code. Cosine top-1 distributed
mass across several financial-leaning codes in the active
vocabulary, but at softmax-spread mass on the order of a few
thousandths per code it could not overcome LLM mass (≈ 0.83) and
CatBoost mass (≈ 0.81), both concentrated on the catch-all. The
fused prediction matched the LLM; the disagreement gate at
bootstrap._identify_disagreements required
llm_code != fused_code and so never fired despite K ≈ 0.81 and
a unanimous independent-source pull toward financial codes.
needs_clarification=True was emitted, but no LLM revisit
followed. Specific customer table names, column names, and codes
are intentionally not reproduced in this document.
Treatment in this codebase
The pipeline uses two complementary, scope-bounded fixes:
1. Reliability discounting on derivative sources (Shafer §11.3)
The discount operator from Shafer 1976 §11.3 multiplies a source’s
mass by reliability α = 1 - discount:
m’(A) = α · m(A); m’(Θ) = α · m(Θ) + (1 - α)
When evidence sources are non-distinct, the reliability of the derivative source with respect to the original is bounded above by 1 minus their information overlap. For sources trained directly on LLM output that overlap is near-total, so a substantial discount is the principled response under classical Dempster fusion.
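A minimal sketch of the discount operator, assuming a mass function is represented as a plain dict keyed by focal-element label with the full frame stored under a `THETA` key. Both the representation and the `discount` helper name are illustrative assumptions, not the pipeline's actual data structures.

```python
THETA = "THETA"  # hypothetical key standing for the whole frame (total ignorance)


def discount(mass: dict, discount_rate: float) -> dict:
    """Shafer discounting: scale every focal element by alpha = 1 - discount_rate
    and move the removed mass onto the frame."""
    if not 0.0 <= discount_rate <= 1.0:
        raise ValueError("discount_rate must lie in [0, 1]")
    alpha = 1.0 - discount_rate
    out = {focal: alpha * m for focal, m in mass.items() if focal != THETA}
    out[THETA] = alpha * mass.get(THETA, 0.0) + (1.0 - alpha)
    return out


# A confident derivative source, discounted at the catboost default of 0.55.
raw = {"catch_all": 0.80, THETA: 0.20}
discounted = discount(raw, 0.55)
# catch_all keeps 0.45 * 0.80 = 0.36; ignorance rises to 0.45 * 0.20 + 0.55 = 0.64
```

The discounted function still sums to 1, but most of the source's commitment has been converted into ignorance, which is what limits double-counting under Dempster's rule.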
The current defaults (config/base.conf:341+) place CatBoost and
SVM above the cosine discount:
| Source | Discount | Rationale |
|---|---|---|
| cosine | 0.20 | independent of LLM; semantic prior |
| pattern | 0.25 | independent; deterministic regex evidence |
| name_match | 0.30–0.70 | independent; lexical match against vocab |
| llm | 0.15 | original; first-pass label |
| catboost | 0.55 | strongly non-distinct (fit_to_llm, per-column LLM labels) |
| svm | 0.22 | weakly non-distinct (enrichment-mediated subsumption alignment; was 0.30 under LLM-mediated, 0.55 under M9) |
| catboost_max | 0.75 | variance ceiling; maintains headroom |
Operators can dial these via the Settings page when retraining
CatBoost on labels independent of the current LLM sweep (e.g.
synth-only training); the metadata in config_overlay.SETTINGS_METADATA
exposes the full range. The SVM discount at 0.22 (slightly above
cosine’s 0.20) reflects the subsumption-prediction alignment:
structurally independent of the runtime LLM (uses BERT embeddings,
not autoregressive inference), with weak non-distinctness only via
the shared enrichment-LLM upstream (same structural dependency the
late-interaction cosine source carries). The 0.02 margin above
cosine accounts for subsumption prediction being a single per-ICE-code
decision (structurally more brittle than per-column cosine evidence).
2. Independent-tier consensus + revisit gate
For revisit decisions, the pipeline computes a parallel, isolated fusion over the LLM-independent subset only:
m_indep = m_cosine ⊕ m_pattern ⊕ m_name_match (Dempster's rule)
indep_top1 = argmax_singleton m_indep
Implemented in pipeline._classify_column via the INDEPENDENT_TIER
constant and combine_multiple(strategy="dempster"). The top-1
singleton and its mass are exposed in the result dict
(independent_top1_code, independent_top1_mass,
independent_top1_conflict) and stored on the BootstrapState.
The revisit gate at bootstrap._identify_disagreements then fires
when:
- indep_top1_code ≠ llm_code, AND
- indep_top1_mass ≥ classify.bootstrap.indep_revisit_mass_threshold (default 0.45)
This restores a real cross-source disagreement test that cannot be
masked by LLM-derivative sources amplifying the LLM’s vote. The
legacy high-K branch (llm_code != fused_code AND K > k_threshold)
is retained as a safety net and runs second in priority.
The revisit prompt context at bootstrap._llm_revisit now includes
the independent-tier consensus code/label/mass so the LLM has the
counter-evidence in front of it during the second pass.
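The independent-tier fusion and the gate condition can be sketched as follows. This is an illustrative implementation of Dempster's rule over frozenset focal elements, not the pipeline's `combine_multiple`; the helper names (`dempster`, `indep_top1`, `should_revisit`), the toy frame, and the mass values are all assumptions made for the example.

```python
from itertools import product


def dempster(m1: dict, m2: dict):
    """Dempster's rule for two mass functions keyed by frozenset focal elements.
    Returns (normalized mass function, conflict K of this combination)."""
    fused, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        common = a & b
        if common:
            fused[common] = fused.get(common, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in fused.items()}, conflict


def indep_top1(mass: dict):
    """Highest-mass singleton as (code, mass); (None, 0.0) if there is none."""
    singles = [(next(iter(f)), w) for f, w in mass.items() if len(f) == 1]
    return max(singles, key=lambda cw: cw[1]) if singles else (None, 0.0)


def should_revisit(mass: dict, llm_code: str, threshold: float = 0.45) -> bool:
    code, top_mass = indep_top1(mass)
    return code is not None and code != llm_code and top_mass >= threshold


# Toy frame with three codes; values are illustrative, not from a real run.
THETA = frozenset({"FIN", "GEN", "EMAIL"})
FIN, GEN = frozenset({"FIN"}), frozenset({"GEN"})
m_cosine = {FIN: 0.6, THETA: 0.4}
m_pattern = {FIN: 0.5, THETA: 0.5}
m_name_match = {GEN: 0.2, THETA: 0.8}

m_indep, _ = dempster(m_cosine, m_pattern)
m_indep, k_last = dempster(m_indep, m_name_match)
revisit = should_revisit(m_indep, llm_code="GEN")
```

Here the independent tier lands on FIN with mass 0.64 / 0.84, above the 0.45 default, so an LLM vote for GEN triggers a revisit even if the fused prediction agrees with the LLM.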
Ontology priors — substrate as semantic anchor
Patterns detect at extraction time. When a pattern fires we resolve
its canonical ICE.* metadata from universal_vocabulary.json (label,
description, common-name aliases, full ontological path root→leaf)
and thread that metadata through three insertion points sourced from
a single lookup (mass_functions.lookup_pattern_ontology):
- Embedding text (features.ColumnFeatures.to_embedding_text; ontology_priors is a discrete FEATURE_NAMES entry, ablatable for SAGE). Cosine similarity then operates over publicly-grounded ontology terms an embedding model recognizes from training rather than the regex name alone. On the failure case that motivated this work, the column embedding gains the literal substring “Transaction Amount; The monetary value of a financial transaction.; aliases: amount, payment, price; ontology: Sensitive Data → Personally Identifiable Data → Financial Data → Payment Data → Transaction Amount”, which is orders of magnitude more semantic surface than patterns: monetary_pattern carried.
- First-pass LLM user prompt (llm_backend.build_batch_user_prompt). In every batch, sweep AND revisit, the prompt now includes per-column “Pattern-detected ontology priors (from Atelier’s universal taxonomy — translate to the closest fit in the candidate vocabulary)” with each fired pattern’s label, description, alias list, and path. The LLM is explicitly instructed that the canonical ICE.* code is never a valid classification target; its job is ontology alignment from the publicly-grounded substrate to the user’s frame (He et al. 2023, Exploring Large Language Models for Ontology Alignment; Hertling & Paulheim 2023, OLaLa: Ontology Matching with LLMs; Ehrig & Sure 2004 for the classical foundation).
- SAGE/SHAP attribution surface (features.FEATURE_NAMES). ontology_priors is now its own ablatable feature distinct from pattern_signals and sample_values. Operators can attribute classification mass to the publicly-grounded ontology prior independently of the raw embedding text, so the explainability story ties each prediction back to the public substrate that motivated it.
Surfaced in the result dict as ontology_priors (list of dicts:
pattern, code, label, description, common_names, path, match_fraction). The codes are universal-substrate IDs; they never
appear in user-facing classifications. The user’s vocabulary
remains the authoritative result space; ICE.* is the bridge.
Architectural significance: this is the substrate→tagging bridge the design has been pointing at. Pattern detection was always publicly-grounded; the resolver turns ICE.* into the user’s codes when it can; when it can’t, ontology priors carry the public semantic anchor straight through to cosine + LLM + SHAP without ever fabricating a code in the user’s frame. Compatible with — and strengthens — the indep-tier consensus + reliability-discount mechanisms above.
Cosine reliability shaping (Haenni-Hartmann 2006)
Static discount=0.30 allocated 0.70 of cosine mass uniformly via
softmax across all candidate singletons. On large vocabularies (300+
leaves) this produced softmax compression — even a sharp top-1 hit
landed at ~0.004 mass per code. Cosine could see the right answer
but couldn’t carry it through fusion, and the indep-tier consensus
sat permanently below the revisit threshold.
mass_functions.cosine_to_mass now applies dynamic source
reliability per Haenni & Hartmann 2006, Modeling Partially
Reliable Information Sources: A General Approach Based on
Dempster-Shafer Theory (Information Fusion 7(4), 361–379, §3):
the source-reliability factor α is an observable function of
quality indicators, with (1 − α) allocated to ignorance.
Two quality indicators:
- α_abs: sigmoid of top-1 absolute similarity around τ_abs = 0.40 with σ_abs = 0.10. Encodes “is cosine matching anything strongly, or just noise?”
- α_marg: tanh((s₁ − s₂) / σ_marg) with σ_marg = 0.05. Encodes “is the top-1 a decisive winner, or ambiguous among similar candidates?”
Weighted blend (w_abs = 0.6, w_marg = 0.4), clamped to
[reliability_floor, reliability_ceiling] = [0.10, 1 − classify_discount_maxsim]. The ceiling preserves the legacy
maximum-mass behavior under sharp signal; the floor keeps cosine
contributing some mass even under noise.
The α-bounded evidence mass is then split via margin-aware allocation:
m(top-1) = α · margin_weight + α · (1 − margin_weight) · softmax_top1
m(top-i, i>1) = α · (1 − margin_weight) · softmax_top_i
m(Θ) = 1 − α
where margin_weight = tanh((s₁ − s₂) / σ_marg). When the margin
is wide, almost all evidence mass concentrates on top-1 directly
rather than diluting through softmax. When the margin is narrow,
the formula reduces to classical softmax allocation across the
full candidate set.
Behavior across regimes (BDD-locked in
features/agent/evidence_independence.feature, “Cosine reliability
shaping concentrates mass on a clear top-1”):
| Top-1 sim | Top-2 sim | α | margin_weight | Top-1 mass | Θ mass |
|---|---|---|---|---|---|
| 0.70 | 0.50 | 0.700 | 1.000 | 0.700 | 0.300 |
| 0.45 | 0.20 | 0.700 | 1.000 | 0.700 | 0.300 |
| 0.45 | 0.44 | 0.452 | 0.197 | 0.091 | 0.548 |
| 0.23 | 0.23 | 0.100 | 0.002 | 0.0005 | 0.900 |
Sharp signal recovers the legacy ceiling allocation but concentrates it on top-1 (~170× the prior compressed mass). Ambiguous and noise regimes correctly route most mass to Θ rather than fabricating false confidence. The indep-tier revisit gate (threshold 0.45) is now reachable on cosine alone whenever cosine has clear semantic signal.
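The table rows can be reproduced with a short calculation. The sketch below is not `mass_functions.cosine_to_mass`; it assumes the sigmoid takes the form 1 / (1 + exp(−(s₁ − τ_abs) / σ_abs)), that the reliability ceiling is 0.70 (inferred from the table), and that the softmax share of top-1 is supplied by the caller. The helper names are hypothetical.

```python
import math


def cosine_reliability(s1: float, s2: float, *, tau_abs: float = 0.40,
                       sigma_abs: float = 0.10, sigma_marg: float = 0.05,
                       w_abs: float = 0.6, w_marg: float = 0.4,
                       floor: float = 0.10, ceiling: float = 0.70):
    """Return (alpha, margin_weight) for top-1 / top-2 similarities s1 >= s2."""
    alpha_abs = 1.0 / (1.0 + math.exp(-(s1 - tau_abs) / sigma_abs))
    margin_weight = math.tanh((s1 - s2) / sigma_marg)  # also serves as alpha_marg
    alpha = w_abs * alpha_abs + w_marg * margin_weight
    return min(max(alpha, floor), ceiling), margin_weight


def shaped_masses(s1: float, s2: float, softmax_top1: float) -> dict:
    """Top-1 and frame masses under margin-aware allocation."""
    alpha, mw = cosine_reliability(s1, s2)
    top1 = alpha * mw + alpha * (1.0 - mw) * softmax_top1
    return {"alpha": alpha, "margin_weight": mw,
            "top1_mass": top1, "theta_mass": 1.0 - alpha}


sharp = shaped_masses(0.70, 0.50, softmax_top1=0.004)      # clear winner
ambiguous = shaped_masses(0.45, 0.44, softmax_top1=0.005)  # near tie
noise = shaped_masses(0.23, 0.23, softmax_top1=0.003)      # nothing matches
```

The `softmax_top1` values are illustrative stand-ins for a 300+ leaf vocabulary. With them, the sharp case clamps to the ceiling, the near tie gives α ≈ 0.452 with margin weight ≈ 0.197, and the noise case falls to the floor.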
Composes cleanly with the indep-tier consensus computation: when cosine carries decisive mass on a code different from the LLM’s vote, that code becomes the indep-tier top-1 and the revisit gate fires, which is the soundness invariant the whole evidence-independence treatment is reaching for.
Hierarchical mass aggregation + cross-subtree visibility
A separate structural gap surfaced after reliability shaping landed: when cosine evidence localizes to a subtree (multiple financial-leaning leaves under a common parent) but the LLM picks a confident leaf in a different subtree, the predicted code falls to the LLM’s leaf and there is no surfaced signal that an honest-but-coarser parent would apply. Three cooperating fixes close that gap:
1. Cosine emits hierarchical mass
mass_functions.cosine_to_mass now walks up from the cosine top-1
leaf, finds the most-specific internal node whose descendants
capture ≥ 50% of the softmax probability mass
(_significant_subtree), and redirects the in-subtree residual
mass to that internal-node focal element rather than diluting it
across leaves. The frame already exposed every parent code as an
internal-node FocalElement (descendant leaf set); we just
weren’t emitting mass there. Hierarchical Dempster-Shafer
treatment per Shafer 1976 §3 and Smets 1990 §6 (refinement /
coarsening): an internal-node focal element represents a
disjunction — “the answer is somewhere in this subtree” — without
committing to a specific leaf.
Walking up from top-1 (rather than requiring every top-K to share an LCA) tolerates outliers cleanly: a small amount of probability leaking outside the subtree doesn’t void the aggregation as long as the bulk of mass remains inside.
Sharp-signal regimes are unaffected — when the margin is wide
the residual mass α · (1 − margin_weight) is small, so the
hierarchical aggregation simply scales proportionally. The
top-1 leaf still wins when one is decisive.
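The walk-up step can be sketched as below. This is an assumption-laden illustration, not `_significant_subtree` itself: the taxonomy is represented as a child → parent dict, softmax probabilities are given per leaf, and the toy codes are invented for the example.

```python
from typing import Optional


def ancestors(code: str, parent: dict) -> list:
    """Ancestor chain from the nearest parent up to the root."""
    chain = []
    while code in parent:
        code = parent[code]
        chain.append(code)
    return chain


def significant_subtree(top1_leaf: str, leaf_probs: dict, parent: dict,
                        min_share: float = 0.5) -> Optional[str]:
    """Most-specific ancestor of top1_leaf whose descendant leaves hold at
    least min_share of the softmax probability mass."""
    total = sum(leaf_probs.values())
    for node in ancestors(top1_leaf, parent):  # nearest ancestor first
        share = sum(p for leaf, p in leaf_probs.items()
                    if node in ancestors(leaf, parent)) / total
        if share >= min_share:
            return node
    return None


# Hypothetical two-subtree taxonomy: ROOT -> FIN -> three leaves, ROOT -> GEN -> MISC.
parent = {"TXNAMT": "FIN", "ACCTNO": "FIN", "SALARY": "FIN",
          "MISC": "GEN", "FIN": "ROOT", "GEN": "ROOT"}
leaf_probs = {"TXNAMT": 0.30, "ACCTNO": 0.25, "SALARY": 0.20, "MISC": 0.25}

node = significant_subtree("TXNAMT", leaf_probs, parent)
```

No single leaf is decisive here, but FIN captures 0.75 of the probability, so the residual mass would be emitted on the FIN internal node. The 0.25 sitting on MISC is the kind of outlier the walk-up tolerates.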
2. cautious_code walks the full hierarchy
HierarchicalClassification.cautious_code previously walked only
the predicted code’s ancestor chain via belief_path —
structurally blind to belief mass in any other subtree. It now
delegates to cross_subtree_belief, which iterates every
singleton AND every internal-node focal element in the frame and
returns those with Bel ≥ threshold. The most-specific code
wins, regardless of subtree.
Concretely: when the LLM votes 0.1 Internal Non-Sensitive but
cosine’s hierarchical aggregation puts Bel(Financial Data) = 0.55 on a different subtree’s parent, cautious_code(0.5) can
now return Financial Data — not just 0 (the predicted
code’s parent).
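A minimal sketch of the cross-subtree walk, using Bel(A) = Σ m(B) over focal elements B ⊆ A. The frame representation (code → set of descendant leaves), the tie-breaking rule, and the standalone function signatures are assumptions for illustration. The real methods live on `HierarchicalClassification`.

```python
from typing import Optional


def belief(mass: dict, target: frozenset) -> float:
    """Bel(target): total mass of focal elements contained in target."""
    return sum(m for focal, m in mass.items() if focal <= target)


def cross_subtree_belief(mass: dict, frame: dict, threshold: float = 0.5) -> dict:
    """Every code (leaf or internal node, any subtree) with Bel >= threshold."""
    beliefs = {code: belief(mass, leaves) for code, leaves in frame.items()}
    return {code: b for code, b in beliefs.items() if b >= threshold}


def cautious_code(mass: dict, frame: dict, threshold: float = 0.5) -> Optional[str]:
    """Most-specific qualifying code: fewest descendant leaves, then highest Bel."""
    hits = cross_subtree_belief(mass, frame, threshold)
    if not hits:
        return None
    return min(hits, key=lambda c: (len(frame[c]), -hits[c]))


# Hypothetical frame: code -> descendant leaf set.
frame = {
    "0.1": frozenset({"0.1"}),
    "0.2": frozenset({"0.2"}),
    "0": frozenset({"0.1", "0.2"}),
    "TXNAMT": frozenset({"TXNAMT"}),
    "ACCTNO": frozenset({"ACCTNO"}),
    "Financial Data": frozenset({"TXNAMT", "ACCTNO"}),
    "ROOT": frozenset({"0.1", "0.2", "TXNAMT", "ACCTNO"}),
}
# Fused mass: the LLM's leaf vote, cosine's internal-node mass, and ignorance.
mass = {frame["0.1"]: 0.40, frame["Financial Data"]: 0.55, frame["ROOT"]: 0.05}

choice = cautious_code(mass, frame, 0.5)
```

With these illustrative masses the LLM's leaf reaches only Bel = 0.40, while Financial Data reaches 0.55 in a different subtree, so the cautious walk returns Financial Data rather than the predicted leaf's parent.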
3. cross_subtree_belief surfaces the conflict
The result dict now carries a cross_subtree_belief field
listing every code (leaf or internal node, any subtree) where
Bel ≥ 0.5. Operators see both the LLM’s leaf vote AND the
cosine-derived alternative subtree as legitimate signals,
instead of the predicted-leaf-only belief_path. When evidence
sources disagree on the subtree, both candidates appear and the
operator can act on the disagreement directly.
This composes cleanly with the prior mechanisms: reliability
shaping ensures cosine top-1 carries enough mass to trigger
hierarchical aggregation when signal is clear; the indep-tier
gate fires when cosine’s hierarchical mass disagrees with LLM at
the leaf level; and cross_subtree_belief makes the
cross-subtree disagreement explicit in the operator-facing result. The
predicted_code field retains its leaf-argmax semantics for
backward compatibility — operators consume the cautious /
cross-subtree fields when needs_clarification = True or when
the cross-subtree summary surfaces a competing internal node.
Operator-facing visibility
The fusion mechanisms above can produce mathematically correct belief structures that are nonetheless invisible to operators when the result-dict surface area is too narrow. Three small changes close that gap:
Evidence string carries per-source codes + competing summary
HierarchicalClassification.from_combined_evidence builds the
evidence field. Previously: dst(cosine=0.65, llm=0.77, catboost=0.42, svm=0.22) → Internal Non-Sensitive [Bel=0.71, ...] — masses only, not the codes each source voted. Now:
dst(cosine→1.4.1.1.1(0.65), llm→0.1(0.77), ...) → Internal Non-Sensitive [Bel=0.67, ...] [competing: Sensitive (1) Bel=0.26] — leaf-level disagreement is visible at a glance,
and a “competing” trailer surfaces non-trivial belief in any
non-predicted top-level subtree.
cross_subtree_belief is always informative
The 0.5 absolute threshold previously suppressed
competing-subtree alternatives whenever Dempster fusion compressed their
mass below the headline bar (the common case when one source
dominates). The default is now lower (0.20) AND an
always_include_top_per_subtree rule guarantees that the
highest-belief leaf and highest-belief internal node from each
top-level subtree appear in the result regardless of
threshold (subject to a small min_bel floor so we don’t
flood the result with noise). Operators always see the
structured “what does each subtree look like?” view.
cautious_promoted_code (Smets least-commitment)
Per Smets 1993 (Belief Functions: The Disjunctive Rule of Combination and the Generalized Bayesian Theorem and related work on least-commitment), when a fine-grained decision is unsupported by evidence the principled response is to commit only at the level of granularity where evidence IS unambiguous. This is exactly the mechanism for “the predicted leaf is not the right answer; the parent code is more honest.”
HierarchicalClassification.cautious_promoted_code returns
either the predicted leaf (no promotion) or the most-specific
code anywhere in the hierarchy whose belief meets the
commit_threshold (default 0.55). Promotion fires only when
needs_clarification = True — operators get the leaf
prediction by default; the cautious promotion is the
epistemically-honest fallback when the system itself flags the
prediction as uncertain.
The predicted_code field retains its leaf-argmax semantics
for backward compatibility with Atlas governance sync and
existing UI rendering. cautious_promoted_code lives
alongside it as a separate field operators consult when
needs_clarification is True.
Per-column residual trajectory
The corpus-wide residual norm + contraction factor establish the
headline iterative-method diagnostic, but they obscure per-column
behaviour. BootstrapState.column_history: dict[str, list[ColumnResidualSnapshot]]
captures the column-major view: one snapshot per labeled column per
iteration, populated in record_iteration_metrics after each
iteration’s ML validation completes. Each snapshot records the
column’s gap, belief, K, indep-tier top-1 code/mass, label, label
source, and a revisited flag indicating whether
_llm_revisit touched the column in that iteration.
bootstrap.column_contraction(state, name) mirrors the corpus-wide
contraction_rate at the column level: ρ_col = current_gap /
prev_gap (falling back to K when gap is zero), or None when the
column has fewer than two snapshots. ρ_col < 1 means the column is
converging; ρ_col → 1 stalled; ρ_col > 1 diverging. Per-column ρ
exposes the empirical contraction distribution that corpus
aggregates obscure — operators see which specific columns are
converging vs stalling.
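The per-column ratio can be sketched as follows. The `Snapshot` dataclass is a reduced, hypothetical stand-in for `ColumnResidualSnapshot`, and the fallback is read here as "use the K ratio when the previous gap is zero", which is one interpretation of the description above rather than a confirmed detail of `bootstrap.column_contraction`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snapshot:
    """Hypothetical subset of ColumnResidualSnapshot: one column, one iteration."""
    gap: float          # Pl - Bel
    conflict_k: float   # Dempster conflict K


def column_contraction(history: List[Snapshot]) -> Optional[float]:
    """rho_col from the last two snapshots, or None when it is undefined."""
    if len(history) < 2:
        return None
    prev, curr = history[-2], history[-1]
    if prev.gap > 0.0:
        return curr.gap / prev.gap
    if prev.conflict_k > 0.0:  # assumed fallback: ratio of K values
        return curr.conflict_k / prev.conflict_k
    return None


converging = [Snapshot(0.40, 0.30), Snapshot(0.20, 0.25)]
stalled = [Snapshot(0.30, 0.50), Snapshot(0.30, 0.50)]
rho_converging = column_contraction(converging)  # 0.5
rho_stalled = column_contraction(stalled)        # 1.0
```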
The full trajectory is written to
build/results/{run_id}/column_trajectories.json at pipeline end
alongside classifications.json, enabling offline analysis,
operator post-mortem, and audit. The agent loop’s
iteration_history carries a summarized view (per-column
gap/bel/K sequences plus ρ_col) so the agent can reason about
which columns are moving.
This trajectory infrastructure is the substrate for any future acceleration scheme. Three plausible Phase B / Phase C extensions all consume it:
- Bandit-style revisit ordering (Phase B): extend _identify_disagreements to mix expected_revisit_gain(name), derived from history, into the sort key. Revisits are ordered by predicted marginal residual reduction. Default-off knob; trajectory data backs it.
- Aitken Δ² early-stop (Phase B): for columns with ≥3 snapshots and a clean linear-convergence pattern, predict the limit and skip further revisits when the predicted gap is below cfg.gap_threshold. Saves LLM cost on the predictable tail. Default-off knob; trajectory data backs it.
- Limited per-column belief-mass Anderson (Phase C, deferred): only on columns that genuinely oscillate (per-column ρ near 1 with sign-changing residual differences). Phase A’s trajectories let us measure whether such a population exists before shipping any Anderson code.
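To make the Aitken Δ² option concrete, here is a minimal sketch of the extrapolation on a per-column gap sequence. It is not shipped code: the helper names are hypothetical, and the "clean linear-convergence" test is assumed to mean that the last two differences have the same sign and are shrinking.

```python
from typing import Optional, Sequence


def aitken_limit(x0: float, x1: float, x2: float) -> Optional[float]:
    """Aitken delta-squared estimate of the limit of a linearly converging
    sequence from three consecutive terms; None when the step is degenerate."""
    d1, d2 = x1 - x0, x2 - x1
    denom = d2 - d1
    if abs(denom) < 1e-12:
        return None
    return x0 - d1 * d1 / denom


def can_skip_revisit(gaps: Sequence[float], gap_threshold: float) -> bool:
    """True when the last three gaps converge cleanly and the predicted
    limit is below gap_threshold."""
    if len(gaps) < 3:
        return False
    x0, x1, x2 = gaps[-3:]
    d1, d2 = x1 - x0, x2 - x1
    if d1 == 0.0 or not 0.0 < d2 / d1 < 1.0:  # assumed "clean" pattern check
        return False
    limit = aitken_limit(x0, x1, x2)
    return limit is not None and limit < gap_threshold


# Gaps shrinking geometrically toward 0.05 (differences -0.20, then -0.10).
gaps = [0.45, 0.25, 0.15]
predicted = aitken_limit(*gaps)
```

For this sequence the predicted limit is 0.05, so with a gap threshold of 0.10 the column could be dropped from further revisits, while a threshold of 0.03 would keep it in play.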
The honest framing: classical Anderson acceleration on the full belief-vector iteration is poorly suited to LLM-driven dynamics (stochastic T, mostly-static state, discrete labels, targeted rather than uniform revisit). What adds value given the problem structure is the per-column trajectory data itself: operators see per-column convergence behaviour, future acceleration schemes have real per-column data to operate on, and we can decide between bandit / Aitken / Anderson empirically rather than rhetorically. Phase A ships that substrate; Phase B and Phase C are gated on what the substrate reveals.
Cost-sensitive classification at the LLM layer (Elkan 2001)
All the prior mechanisms operate at or below the fusion layer —
they shape how per-source evidence is combined into a fused
belief. But on the canonical failure case
(loan_applications.requested_amount), the LLM at confidence 0.88
plus its derivative cluster (CatBoost, SVM) reinforces a vote on
0.1 Internal Non-Sensitive, and Dempster fusion’s normalization
preserves that dominance. Algorithmic mitigations stalled at
Bel ≈ 0.74 — an honest reduction from Bel = 0.955 baseline, but
the headline classification still miscategorized financial PII.
The principled response, per cost-sensitive classification
(Elkan 2001, The Foundations of Cost-Sensitive Learning) is to
adjust the decision threshold under asymmetric cost. In data
governance the asymmetry is severe: failing to flag truly
sensitive data (false negative, Type II) creates regulatory
liability (GDPR Art. 25 data protection by default; HIPAA Safe
Harbor; PCI DSS scope creep guidance), while over-classifying
(false positive, Type I) produces review overhead but is
recoverable. Treating the costs as cost(FN) ≫ cost(FP) is the
canonical privacy-regime convention.
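The threshold shift can be made concrete with a small calculation. Assuming zero cost for correct decisions, flagging a column as sensitive minimizes expected cost whenever (1 − p) · cost(FP) ≤ p · cost(FN), that is, whenever p ≥ cost(FP) / (cost(FP) + cost(FN)). The sketch below is illustrative only: the cost values are made up, and the helper name is hypothetical.

```python
def sensitive_threshold(cost_fn: float, cost_fp: float) -> float:
    """Probability of 'sensitive' above which flagging minimizes expected cost,
    assuming correct decisions cost nothing."""
    if cost_fn <= 0.0 or cost_fp <= 0.0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


symmetric = sensitive_threshold(cost_fn=1.0, cost_fp=1.0)    # 0.5
governance = sensitive_threshold(cost_fn=10.0, cost_fp=1.0)  # 1/11, about 0.09
```

With a false negative costed at ten times a false positive, a column needs only about a 9% chance of being sensitive before flagging it becomes the cheaper decision.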
Atelier applies this at the LLM layer — upstream of fusion —
via a Sensitivity classification perspective section in the
system prompt (llm_backend.build_system_prompt). The framing is
deliberately collaborative rather than prescriptive: modern LLMs
respond better to a colleague’s framing than to a compliance
checklist. Three load-bearing moves:
- Invoke what the LLM already knows. The preamble names BFO, CCO, and the privacy regimes (GDPR, HIPAA, PCI DSS) those ontologies overlap with, all concepts the model has substantial training exposure to. The customer’s taxonomy is framed as “their refinement of those publicly-grounded concepts,” and the model is asked to pick whichever of their codes matches the canonical sensitivity concept it would otherwise assign (PII, Financial Information, Technical Identifier, Biometric, etc.). No re-teaching, no rule list: just invocation.
- State the asymmetry once, casually. Cost-sensitive classification appears as “a practical asymmetry: in governance, calling sensitive data non-sensitive is a larger error than the reverse.” The over-classification guard is embedded conversationally: “When signals are genuinely absent (operational metadata, surrogate keys, timestamps, status enums), non-sensitive is the correct call — don’t reach for sensitive just because of the asymmetry.” One sentence on confidence calibration: “Calibrate confidence to what you actually saw, not to this asymmetry.”
- Vocabulary-aware sensitivity map, ICE conventions only. _sensitive_subtree_summary(category_set) activates on ICE.SENSITIVE.* / ICE.NONSENSITIVE.* paths and emits a Markdown block naming the sensitive root, catch-all, and a few publicly-grounded leaf abbreviations (per src/atelier/classify/fixtures/PROVENANCE.md). Returns "" for every other vocabulary shape so the prompt stays silent where the framework can’t verify the encoding is publicly grounded. For non-ICE vocabularies the LLM still has the full markdown category table, per-column ontology priors for pattern-bearing columns, and the perspective preamble; that is sufficient to navigate any taxonomy without the framework guessing at its sensitivity structure.
The prompt block is default-on for every classification run;
no config knob. Built once per pipeline run at
pipeline.py:577 so the helper computation is amortized and the
new content lives inside the Anthropic prompt-cache prefix —
one-time cache miss on the first batch, normal cache hits
thereafter. Token cost is bounded (~250–300 fixed + ~80 for
the per-vocab summary).
This composes cleanly with everything below it: reliability
discounts on derivative sources still suppress double-counting,
cosine reliability shaping still concentrates mass on clear
top-1 hits, hierarchical aggregation still flows residual mass
to internal-node focal elements, the indep-tier consensus gate
still triggers revisits on cross-source disagreement, and
cautious_promoted_code still applies Smets least-commitment
on uncertain leaves. The Governance Cost Model changes what
the LLM votes — biasing toward sensitive parents under
uncertainty — leaving every downstream mechanism unchanged.
The hypothesis: with a governance prior at the source, the LLM
will either (a) pick a defensible sensitive parent code on
columns like requested_amount, or (b) lower its confidence on
the non-sensitive choice — either of which is an improvement
over the status quo. The exact behavior is non-deterministic
and confirmed against real LLM runs; BDD scenarios assert the
prompt structure (features/agent/governance_cost_model.feature),
not the LLM’s vote.
Pattern-target alias resolver
A second, narrower bug surfaced during investigation: the static
DEFAULT_PATTERN_MAP at mass_functions.py references canonical
ICE.* mnemonic strings (monetary_pattern → ICE.SENSITIVE.PID.FINANCIAL.PAYMENT.TXNAMT)
that are absent from non-ICE vocabularies. The pre-2026-04-30
behavior silently dropped any pattern whose target wasn’t in
frame.singletons, disabling the entire pattern source on numeric
or domain-specific vocabularies — including the run that motivated
this work.
mass_functions.resolve_pattern_map now resolves each ICE.* target
through three fallback layers against the active category_set:
- Direct hit on all_by_code.
- Match on by_abbrev using the leaf mnemonic (suffix after the final .).
- Token-normalized match against common_names aliases.
Misses log a single WARNING enumerating the patterns that were
dropped. The resolver is cached on the category_set instance and
runs once per pipeline. The deeper BFO/Common-Core ontology mapping
this shim approximates remains future work.
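The three fallback layers can be sketched as below. This is an illustrative restatement, not the code in mass_functions.py: the function names resolve_target and resolve_patterns, the argument shapes (a set of codes, an abbreviation-to-code dict, a code-to-aliases dict), and the tokenization rule are all assumptions made for the example.

```python
import logging
import re
from typing import Dict, List, Optional, Set, Tuple

logger = logging.getLogger(__name__)


def _tokens(text: str) -> frozenset:
    """Lowercase and split on non-alphanumerics (hypothetical normalization)."""
    return frozenset(t for t in re.split(r"[^a-z0-9]+", text.lower()) if t)


def resolve_target(
    target: str,
    all_by_code: Set[str],
    by_abbrev: Dict[str, str],
    common_names: Dict[str, List[str]],
) -> Optional[str]:
    """Resolve one ICE.* pattern target against the active vocabulary."""
    # Layer 1: the canonical code exists verbatim in the vocabulary.
    if target in all_by_code:
        return target
    # Layer 2: match the leaf mnemonic (suffix after the final '.').
    leaf = target.rsplit(".", 1)[-1]
    if leaf in by_abbrev:
        return by_abbrev[leaf]
    # Layer 3: token-normalized comparison against each code's aliases.
    wanted = _tokens(leaf)
    for code, aliases in common_names.items():
        if any(_tokens(alias) == wanted for alias in aliases):
            return code
    return None


def resolve_patterns(
    pattern_map: Dict[str, str],
    all_by_code: Set[str],
    by_abbrev: Dict[str, str],
    common_names: Dict[str, List[str]],
) -> Tuple[Dict[str, str], List[str]]:
    """Return (resolved map, dropped patterns); log one WARNING for all misses."""
    resolved: Dict[str, str] = {}
    dropped: List[str] = []
    for pattern, target in pattern_map.items():
        code = resolve_target(target, all_by_code, by_abbrev, common_names)
        if code is None:
            dropped.append(pattern)
        else:
            resolved[pattern] = code
    if dropped:
        logger.warning("pattern targets unresolved, dropped: %s", sorted(dropped))
    return resolved, dropped
```

Collecting the misses and logging once keeps startup output to a single line even when a domain-specific vocabulary resolves almost nothing.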
Deferred work
This treatment preserves Dempster’s rule end-to-end and handles non-distinctness through reliability discounting + per-source reliability shaping. One future refinement remains scoped out:
- Tiered fusion with the cautious rule (Denoeux 2008). Combine the LLM-derivative cluster {llm, catboost, svm} via cautious conjunction (idempotent on identical evidence; commonality formulation q1 ∧̂ q2), the independent cluster {cosine, pattern, name_match} via Dempster, and combine the two cluster-level mass functions across-tier. This dissolves the non-distinctness problem at the math level rather than approximating it via discount. Trade-off: cautious is non-normalising, so derivative-tier-only columns will see narrower belief intervals (which is correct behaviour but a UI shift).
The combine_multiple infrastructure already supports adding a
strategy="cautious" branch alongside the existing dempster /
yager options, so the refinement is surgical when it lands.
References
- Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press. Ch. 3 §3 (independence assumption); Ch. 4 §3 (Dempster’s rule); §11.3 (reliability discount).
- Smets, P. (1990). The Combination of Evidence in the Transferable Belief Model. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 447–458.
- Smets, P. & Kennes, R. (1994). The Transferable Belief Model. Artificial Intelligence 66(2), 191–234.
- Denoeux, T. (2008). Conjunctive and Disjunctive Combination of Belief Functions Induced by Non-Distinct Bodies of Evidence. Artificial Intelligence 172(2-3), 234–264. §1, §3.1, §4.
- Haenni, R. & Hartmann, S. (2006). Modeling Partially Reliable Information Sources: A General Approach Based on Dempster-Shafer Theory. Information Fusion 7(4), 361–379.
Operational impact
Operators upgrading to this calibration should expect:
- More columns marked needs_clarification=True on the first run after upgrade. This is the intended outcome: derivative-source amplification no longer hides genuine cross-source conflict.
- A modest increase in LLM revisit volume (the gate fires on a wider, principled condition). Mitigated by the indep_revisit_mass_threshold floor and the existing budget caps at classify.bootstrap.max_total_llm_calls / max_total_llm_attempts.
- A pattern-source WARNING at startup enumerating any patterns whose ICE.* target failed to resolve to the active vocabulary. Acceptable as long as the leaf mnemonics that do exist in the vocab carry the relevant abbrev or common_names aliases; expected on first run with a domain-specific vocabulary.
MaxSim Channel — ColBERT Late-Interaction via Qdrant
Naming. This DST evidence channel is named maxsim, after the scoring operation Qdrant performs (a sum of per-query-token max cosines over the ColBERT multi-vector field), not the single-vector cosine it replaced. The per-token metric is cosine and the encoder is ColBERT, but the channel’s identity (the key in source_masses, INDEPENDENT_TIER, the classify.maxsim.* config namespace, and the classify.discounts.maxsim discount) is maxsim. The legacy single-vector cosine channel is retired (no fallback). Historical sprint notes may still say “cosine”.
This note specifies the maxsim evidence source: a
multi-vector late-interaction (ColBERT-style) representation per
annotation, stored in Qdrant, with enrichment supplied by an Agent-SDK
curation loop and procedural deterministic verifiers. It composes
with — does not replace — the reliability discounting, indep-tier
consensus gate, hierarchical mass aggregation, and cost-sensitive LLM
prompting documented in dst-evidence-independence.md.
Position in the architecture
The existing DST treatment shapes how per-source masses fuse. This work shapes the cosine source’s input representation. Both are necessary; neither is sufficient on its own.
The motivating gap is structural rather than algorithmic. Current
cosine compresses each annotation into a single embedding from
label + mnemonic + description and compares it to a single
column-side embedding from column_name + concatenated_samples. On
adversarial corpora — anonymized column names (comm_val,
period_val, addr_ref), mixed sample distributions, vocab-token-as-
data columns — the single-vector representation collapses
discriminative signal before it reaches the fusion layer. Reliability
shaping (Haenni-Hartmann 2006) can route mass to ignorance correctly
in this regime, but it cannot recover the discriminative signal that
was lost to the compression.
Late interaction via ColBERT restores the discriminative surface: instead of one dense-vector comparison per (column, tag) pair, the ColBERT encoder produces per-token contextual embeddings (128-d after the linear projection) for both entity and annotation texts. Qdrant’s native MaxSim comparator computes the token-level cross-alignment score directly — no Python-side scoring loop, no per-role weight tuning.
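To make the scoring operation concrete, the sketch below computes the same quantity in plain Python: for each entity-side token vector, take the maximum cosine against any annotation-side token vector, then sum. In production this happens inside Qdrant; the helper names and the tiny 2-d vectors are illustrative only.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Late-interaction score: sum over query tokens of the best cosine
    match among the document tokens."""
    return sum(
        max(cosine(q, d) for d in doc_tokens)
        for q in query_tokens
    )


# Two entity tokens against two annotation tokens (toy 2-d vectors).
entity = [(1.0, 0.0), (0.0, 1.0)]
annotation = [(1.0, 0.0), (1.0, 1.0)]
score = maxsim(entity, annotation)  # 1 + 1/sqrt(2)
```

Because each query token contributes only its own best match, a token with no good counterpart adds close to zero instead of dragging down an averaged vector.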
The entity side feeds ColumnFeatures.to_embedding_text() — the same
text SAGE/SHAP ablate over — through the ColBERT encoder. The
annotation side feeds a composed text from the enrichment payload
(label, description, prototype values, name hints, value patterns,
parent path, mnemonic) through the same encoder. Anti-examples are
excluded from the annotation text (they add noise in the embedding
space without improving MaxSim discrimination).
The motivating failure modes resolve through token-level alignment:
- Anonymized columns: column-name tokens contribute little MaxSim, but sample-value tokens still align to annotation prototype-value tokens. Graceful degradation by token structure: weak tokens contribute near-zero MaxSim without polluting strong token matches.
- Long-tail distinguishing values — a single distinctive sample value’s tokens claim their own MaxSim against annotation prototype tokens, no longer averaged out by a single dense vector.
- Sibling discrimination — token-level alignment discriminates between semantically adjacent annotations (e.g., “credit card number” vs “bank account number”) through fine-grained token matching that dense single-vector cosine collapses.
- Parent-pull: parent-path tokens in the annotation text provide hierarchical context. The hierarchical aggregation in _maxsim_positive_mass continues to flow residual mass to internal-node focal elements when subtree-level signal is what’s available.
This is morphologically close to what the upstream Ægir project provides through a learned hierarchical foundation model (RWKV-7 time-mixing + H-Net dynamic chunking, RLVR-trained against a deterministic four-component verifier on SOTAB / GitTables / WikiTables). The two are complementary, not redundant: Ægir’s representations are learned end-to-end against external corpora; late-interaction here is engineered from the user-selected taxonomy with LLM-augmented annotation profiles. Both can coexist as separate evidence sources, and the late-interaction infrastructure remains useful even after Ægir integration for taxonomies Ægir has not been adapted to.
Architecture overview
┌─ Source taxonomy (default.annotations or any user-selected) ────┐
│ label, mnemonic, description, parent path │
└────────────────────┬─────────────────────────────────────────────┘
│
▼ scripts/enrich_annotations.py
┌──────────────────────────────┐
│ Agent SDK enrichment loop │
│ + deterministic verifiers │
└──────────────┬───────────────┘
│
▼ ColBERT token vectors + payload
┌────────────────────────────────────────────┐
│ Qdrant collection: annotations_<tax>_<ver> │
│ - single "colbert" multi-vector field │
│ (per-token 128-d, MaxSim comparator) │
│ - structured JSON payload │
│ - operator_edits audit log │
└────────────┬───────────────────────────────┘
│
│ registered in PGlite taxonomy_registry
│ (administrative pointer, never primary storage)
│
▼ build/exports/<tax>-enriched-<ver>-<utc>.parquet|tsv
on-demand snapshots for operator inspection
At classify time:
ColumnFeatures.to_embedding_text()
│
▼ ColBERT encoder (colbert-ir/colbertv2.0)
entity token vectors (N × 128)
│
▼ Qdrant query_points (using="colbert", MaxSim)
top-K annotations ranked by MaxSim score
│
▼ maxsim_to_mass
mass function (Haenni-Hartmann reliability shaping)
│
▼ DST fusion (existing pipeline)
belief, plausibility, conflict per tag
Qdrant payload schema
The collection per (taxonomy_id, augmentation_version) is the source of truth. No parallel relational mirror. One point per annotation.
Vector field
Each annotation point carries a single multi-vector field:
| Name | Type | Source |
|---|---|---|
| colbert | multi-vector | ColBERT token-level embeddings of the composed annotation text |
The composed annotation text is produced by
qdrant_writer.compose_annotation_text() from the enrichment
payload: label, description, prototype values (up to 10), name hints
(up to 10), value pattern descriptions (up to 5), parent path
(ontology chain), and mnemonic. Anti-examples are deliberately
excluded — they add noise in the embedding space without improving
MaxSim discrimination.
The ColBERT encoder (colbert-ir/colbertv2.0) produces per-token
128-dimensional vectors via BERT + a learned linear projection
(768 → 128). Special tokens ([CLS], [SEP], [PAD]) are stripped;
only content tokens contribute to MaxSim.
The collection is configured with MultiVectorConfig(comparator=MAX_SIM)
so Qdrant computes token-level late-interaction scoring natively —
no Python-side scoring loop.
Payload (JSON)
{
  // Source taxonomy fields, immutable passthrough
  "code": "ICE.SENSITIVE.PID.CONTACT.EMAIL",  // or user-vocab equivalent
  "label": "Email",
  "mnemonic": "EMAIL",
  "description": "RFC 5322 email addresses, including international forms.",
  "parent_code": "ICE.SENSITIVE.PID.CONTACT",
  "parent_path": ["Sensitive Data", "PII", "Contact", "Email"],

  // Enrichment fields, generated + verified
  "prototype_values": ["jane.doe@example.com", "user@subdomain.example.org", ...],
  "value_patterns": [
    {"kind": "regex", "expr": "[^@\\s]+@[^@\\s]+\\.[^@\\s]+"},
    {"kind": "format", "expr": "local-part @ domain, RFC 5322"}
  ],
  "name_hints": ["email", "e_mail", "email_addr", "contact_email", "msg_val"],
  "anti_examples": [
    {"value": "+1-555-123-4567", "confusable_tag": "A_PHN", "reason": "phone-shaped"},
    {"value": "https://example.com/path", "confusable_tag": "SYSURL", "reason": "URL-shaped"}
  ],

  // Provenance + audit
  "augmentation_version": "v1",  // prompt template + verifier version
  "embedding_model": "colbert-ir/colbertv2.0",
  "embedding_dim": 128,
  "generated_at": "2026-05-16T20:00:00Z",
  "generated_by": "agent-sdk:opus-4.7",  // model + harness identifier
  "verifier_results": {
    "prototype_values_match_patterns": true,
    "patterns_compile": true,
    "anti_example_targets_exist": true,
    "parent_path_consistent": true,
    "checks_passed": 4,
    "checks_total": 4
  },

  // Operator edits log: append-only, every edit recorded
  "operator_edits": [
    {
      "at": "2026-05-17T09:14:00Z",
      "by": "operator@example.com",
      "field": "prototype_values",
      "op": "remove",
      "value": "test@test.test",
      "reason": "weak exemplar"
    }
  ],

  // Cross-reference
  "taxonomy_id": "default",
  "taxonomy_version": "2026-05-01"
}
Cache key (content-addressed)
Rebuilds are idempotent under stable inputs. The cache key for a single annotation point is:
key = sha256(
    taxonomy_id ||
    taxonomy_version_hash ||
    augmentation_version ||
    embedding_model ||
    source_row_hash    // hash of label+mnemonic+description+parent_code
)
Skip-on-cache-hit during rebuilds; force-rebuild via CLI flag. The cache layer is responsible for invalidation on any input change.
Collection naming
annotations_<taxonomy_id>_<augmentation_version>
Example: annotations_default_v1, annotations_hivepoc_synth_v1.
The PGlite registry row tracks which collection is current for a
given taxonomy_id; old collections remain queryable for A/B
comparison and rollback.
Enrichment pipeline (high-level)
Detailed in scripts/enrich_annotations.py (P2) and the
atelier.enrichment package. Vocabulary identity is dynamic:
operators select a (connection, database, annotations_table)
triple at runtime; the pipeline must not encode the count, names,
or structure of the currently-loaded set as intrinsic. The single
universal is that every node — leaf or internal — is a first-class
tagging target, so both leaf and internal nodes receive enrichment.
The shape:
- Read source taxonomy rows from the active annotations table selected by the operator at runtime. No vocabulary identity is hardcoded.
- For each node (leaf or internal), run the enrichment loop:
  - Build a generation prompt with parent-aware framing for internal nodes (children listed, “what does a column tagged at this generality look like without specializing to a child”) or leaf-aware framing for leaves (sibling-discriminative patterns, concrete prototype values).
  - Call the provider-co-located generator (see below) to produce the six-field structured payload.
  - Run the deterministic verifier suite (atelier.enrichment.verifiers). Failed checks become verifier feedback that is fed back into the next generation attempt, up to enrichment.max_attempts.
  - Compute parent_path deterministically from the taxonomy structure (no LLM needed) and confirm the LLM’s reasoning is consistent with it.
- Compute embeddings for each named vector using the configured embedding model.
- Write the multi-vector point + payload to Qdrant, keyed by the content-addressed cache key. Idempotent: same (vocabulary content hash, augmentation_version, embedding_model, source_row hash) quadruple → same point ID → no redundant work on partial rebuilds.
- Update the PGlite taxonomy_registry row to record the build (taxonomy_id, augmentation_version, collection name, built_at, status). The registry is an administrative pointer: it records that a collection exists and where, never the primary content.
This pipeline satisfies the LLM-mediated reference artifact bar (audited via memory): every output is procedurally reproducible from its inputs and falsifiable by the verifier suite.
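The generate-verify-retry core of the loop can be sketched as follows. The signatures are hypothetical: generate stands in for the provider call and receives the previous attempt's verifier failures, and each verifier returns a failure message or None. Only patterns_compile is shown, modelled on the verifier of the same name in the payload's verifier_results.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Payload = Dict[str, object]
Verifier = Callable[[Payload], Optional[str]]  # failure message, or None if OK


def patterns_compile(payload: Payload) -> Optional[str]:
    """Deterministic check: every regex-kind value pattern must compile."""
    for pattern in payload.get("value_patterns", []):
        if pattern.get("kind") != "regex":
            continue
        try:
            re.compile(pattern["expr"])
        except re.error as exc:
            return f"pattern {pattern['expr']!r} does not compile: {exc}"
    return None


def enrich_node(
    node: Dict[str, object],
    generate: Callable[[Dict[str, object], List[str]], Payload],
    verifiers: List[Verifier],
    max_attempts: int,
) -> Tuple[Optional[Payload], List[str]]:
    """Return (payload, []) on success or (None, last failures) on exhaustion."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        payload = generate(node, feedback)
        failures = [
            msg for check in verifiers if (msg := check(payload)) is not None
        ]
        if not failures:
            return payload, []
        feedback = failures  # fed into the next generation attempt
    return None, feedback
```

Because the verifiers are deterministic, a payload that passes once passes on every replay, which is what makes the output falsifiable independently of the generator.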
Provider co-location with classify
The enrichment generator does NOT introduce a separate provider
knob. It reads cfg.classify_llm_backend and uses the same
backend the classification path uses — operators manage one set of
credentials, one cost regime, one billing surface. Within that
backend, the generator selects the strongest reasoning model
available, because per-node generation is single-shot and
benefits from extended deliberation on structural taxonomy
judgments (sibling discrimination, prototype induction, regex
synthesis).
Selection rule (highest priority first), implemented in
atelier.enrichment.model_resolver.resolve_enrichment_model:
- cfg.enrichment_model_override (env: ATELIER_ENRICHMENT_MODEL): explicit operator choice, used verbatim.
- Per-backend apex constant when the platform owns the model identity (currently: anthropic → claude-opus-4-7).
- Fall through to cfg.classify_llm_model for backends where the model identity is endpoint-owned (openai_compatible, cerebras); the operator’s served endpoint is the apex available to that deployment.
- Bedrock without model_override raises EnrichmentModelError with an operator-facing remediation hint. Bedrock model identities are AWS account + region + inference-profile specific; no portable default constant would be correct across deployments, and silently degrading to a weaker model would contradict the strongest-reasoning-model discipline. This is a deployment-readiness gate consistent with the no-silent-DST-degradation principle.
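The priority order reads naturally as a short cascade. The sketch below is a restatement of the rule for illustration, not the body of resolve_enrichment_model: the Config dataclass, the function name pick_enrichment_model, the constant names, and the choice to raise for any backend that reaches the end of the cascade (not only Bedrock) are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class EnrichmentModelError(RuntimeError):
    """Raised when no enrichment model can be resolved for the backend."""


# Backends where the platform owns the model identity (from the rule above).
APEX_MODELS = {"anthropic": "claude-opus-4-7"}
# Backends where the served endpoint defines the model.
ENDPOINT_OWNED = {"openai_compatible", "cerebras"}


@dataclass
class Config:  # hypothetical stand-in for the real cfg object
    classify_llm_backend: str
    classify_llm_model: Optional[str] = None
    enrichment_model_override: Optional[str] = None


def pick_enrichment_model(cfg: Config) -> str:
    # 1. Explicit operator override wins, used verbatim.
    if cfg.enrichment_model_override:
        return cfg.enrichment_model_override
    backend = cfg.classify_llm_backend
    # 2. Platform-owned apex constant.
    if backend in APEX_MODELS:
        return APEX_MODELS[backend]
    # 3. Endpoint-owned identity: reuse the classify model.
    if backend in ENDPOINT_OWNED and cfg.classify_llm_model:
        return cfg.classify_llm_model
    # 4. No portable default (e.g. Bedrock): fail loudly instead of degrading.
    raise EnrichmentModelError(
        f"no enrichment model for backend {backend!r}; "
        "set ATELIER_ENRICHMENT_MODEL to an explicit model identifier"
    )
```

Raising at the end of the cascade, instead of returning a weaker default, is what turns a misconfiguration into a visible deployment-readiness failure.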
The generator records {backend}:{model} in the point’s
generated_by provenance field, so verifier pass-rate per node is
attributable to the exact provider+model combination — the unit of
replayable experiment.
Parent-aware vs leaf-aware prompts
Both prompt variants produce the same six-field JSON schema, so downstream code treats their outputs identically. The framing difference shapes content quality:
- Leaf prompt asks for values, patterns, and name hints describing what a column tagged exactly at this leaf would contain. Patterns are narrow enough to discriminate against sibling leaves under the same parent.
- Parent prompt asks for what a column tagged at this generality level, without further specificity to a child, looks like. Children are listed so the model knows what specializations would NOT route here. Anti-examples are hierarchically aware: the confusable_tag field (a vestigial name retained for schema stability; see the anti_example_targets_exist verifier) may point to a sibling at the same level OR a sibling of an ancestor, because the late-interaction architecture’s anti-example evidence applies regardless of where in the tree the negative exemplar lives.
Late-interaction execution
Column-side multi-vector representation
For each column being classified, build the multi-vector query:
| Query vector | Source |
|---|---|
| col_name_view | embed(column_name + " in " + table_name) |
| col_sample_* | embed(sample_value) per deduped sample (top-N by frequency or distinctness, configurable) |
| col_context_view | embed("table columns: " + concat(other column names in same table)) |
| col_pattern_view | embed(extracted format hints from samples) |
col_pattern_view is computed from sample values via the existing
regex/validator detection in the pattern source — this is where the
original “regex as embedding-text enrichment” intent (referenced in
dst-evidence-independence.md and in
the upstream Ægir documentation) re-enters cleanly: regex outputs
contribute structured features into one of the multi-vector query
slots, not as an independent mass function competing with cosine.
The pattern source’s standalone mass-function status is preserved
for narrow PII detection (email, IBAN, monetary, …) where its hits
are crisp; the col_pattern_view augmentation is additional, not a
replacement.
MaxSim aggregation
For each candidate tag and each query vector, find the best match in the annotation’s multi-vectors of the corresponding role:
positive_score(col, tag) =
sim(col_name_view, label_view of tag) * w_label
+ sim(col_name_view, name_hints of tag) * w_name
+ max(sim(col_sample_i, prototype_values of tag)) * w_proto_per_sample
+ max(sim(col_sample_i, value_patterns of tag)) * w_pattern_per_sample
+ sim(col_context_view, parent_path_view of tag) * w_context
+ ...
Execution happens in-engine via Qdrant’s multi-vector query API with MaxSim comparator. HNSW indexing brings the cost down to logarithmic in the annotation count, which is the dominant cost as vocabularies scale across deployments.
Mass function construction
mass_functions.maxsim_to_mass(scores, frame) produces a
BeliefAssignment over the candidate frame from the Qdrant MaxSim
scores.
The MaxSim score per tag is calibrated to evidence mass via the
same reliability-shaping pattern documented in
dst-evidence-independence.md:
Haenni-Hartmann α-bounded reliability + margin-aware allocation.
- α_abs: sigmoid of top-1 MaxSim score. “Is the best match strong enough to carry mass?”
- α_marg: tanh((s₁ − s₂) / σ). “Is the top-1 decisive?”
Allocation:
m(top-1) = α · margin_weight + α · (1 − margin_weight) · softmax_top1
m(top-i, i > 1) = α · (1 − margin_weight) · softmax_top_i
m(Θ) = 1 − α
Hierarchical subtree aggregation (_significant_subtree) routes
residual mass to internal-node focal elements when subtree-level
signal dominates leaf-level signal.
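The allocation can be sketched numerically. The note does not spell out how α and margin_weight relate to α_abs and α_marg, so this example assumes α = α_abs and margin_weight = α_marg; the sigmoid midpoint and scale, the softmax temperature of 1, and the function name are likewise assumptions. Hierarchical subtree aggregation is omitted.

```python
import math
from typing import Dict

THETA = "Θ"  # key standing for the whole frame (ignorance)


def maxsim_mass_sketch(
    scores: Dict[str, float], midpoint: float, scale: float, sigma: float
) -> Dict[str, float]:
    """Turn top-K MaxSim scores into masses over tags plus ignorance."""
    if not scores:
        return {THETA: 1.0}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    s1 = ranked[0][1]
    s2 = ranked[1][1] if len(ranked) > 1 else s1
    # Assumed: alpha is the absolute-strength term (sigmoid of the top score).
    alpha = 1.0 / (1.0 + math.exp(-(s1 - midpoint) / scale))
    # Assumed: margin_weight is the decisiveness term tanh((s1 - s2) / sigma).
    margin_weight = math.tanh((s1 - s2) / sigma)
    # Softmax over the candidate scores, shifted by s1 for stability.
    exps = [math.exp(s - s1) for _, s in ranked]
    total = sum(exps)
    mass: Dict[str, float] = {}
    for rank, ((tag, _), e) in enumerate(zip(ranked, exps)):
        share = alpha * (1.0 - margin_weight) * (e / total)
        if rank == 0:
            share += alpha * margin_weight  # decisive portion goes to top-1
        mass[tag] = share
    mass[THETA] = 1.0 - alpha  # unreliable portion stays on the frame
    return mass
```

Since the softmax terms sum to one, the tag masses total α and the whole assignment totals one. A decisive top-1 pulls margin_weight toward one and concentrates α on a single tag; a tie sends margin_weight to zero and splits α by softmax share.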
Storage philosophy
Single source of truth per layer, with administrative pointers in PGlite.
| Layer | Primary storage | Role |
|---|---|---|
| Vectors + payload | Qdrant (annotations_<tax>_<ver>) | Truth for enriched annotations; supports late-interaction execution |
| Run artifacts | build/ (existing pattern) | Parquet, classifications, evaluation, sweep manifests, exports |
| Administrative | PGlite (taxonomy_registry, run regs) | Where things live, at which version, in which status |
| Future (planned) | Iceberg in S3 | Intermediates + hx history tables (taxonomy_history, enrichment_history, classification_runs_history, sweep_history); snapshot/time-travel for hx semantics native to Iceberg |
PGlite never holds vectors, payloads, classifications, or intermediates. Its job is to answer “where is the current enriched annotation collection for taxonomy X?” and “which run produced this dataset?” Both registries are small, fast to query, and survive backend migrations untouched.
When Iceberg-HX-in-S3 lands, the migration is a backend swap at the
registry layer — pipeline_run_registry.artifacts_backend flips
from build_local to iceberg_s3, artifacts_path switches to an
S3 URI, and pipeline logic remains unchanged. Current build/
artifacts are forward-compatible with this transition.
PGlite tables (P1.2 migration)
CREATE TABLE taxonomy_registry (
    taxonomy_id           TEXT PRIMARY KEY,
    source_table          TEXT NOT NULL,
    -- UNIQUE is required for the foreign key from fsm_runs below:
    -- PostgreSQL only allows REFERENCES to a unique or primary-key column.
    qdrant_collection     TEXT NOT NULL UNIQUE,
    qdrant_url            TEXT,
    augmentation_version  TEXT NOT NULL,
    embedding_model       TEXT NOT NULL,
    embedding_dim         INTEGER NOT NULL,
    built_at              TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status                TEXT NOT NULL DEFAULT 'building',
    -- 'building' | 'current' | 'stale' | 'archived'
    summary               TEXT
);

CREATE INDEX idx_taxonomy_registry_current
    ON taxonomy_registry(taxonomy_id, status);

-- Extends fsm_runs to record which enriched annotation collection
-- the run consumed. NULL = legacy cosine; non-NULL = late-interaction.
ALTER TABLE fsm_runs ADD COLUMN IF NOT EXISTS
    taxonomy_collection TEXT REFERENCES taxonomy_registry(qdrant_collection);
Operator inspection and edit surface
The active enriched-annotations collection in Qdrant (whatever the operator’s runtime vocabulary selection happens to produce) is operator-facing through two surfaces:
On-demand export (scripts/export_enriched_annotations.py,
P2.4): writes the Qdrant payload for a given (taxonomy_id, version)
to build/exports/<tax>-enriched-<ver>-<utc>.parquet and a
human-readable .tsv. Read-only snapshots, diffable across
versions, droppable when no longer needed. Operators inspect via
their existing tooling (parquet viewers, spreadsheet apps,
mlr/q/duckdb for CLI).
Structured edit CLI (scripts/edit_enriched_annotation.py,
deferred — part of P2 follow-on): operators issue targeted edits
(add/remove prototype value, rewrite anti-example, etc.) which:
- Write back to the Qdrant point’s payload + re-embed affected views
- Append an entry to the operator_edits audit log
- Bump a per-row revision counter (separate from augmentation_version, which is the system-level prompt/verifier version)
Edits are reversible — the audit log carries the prior value for every change. Per-customer overlays (deployment-specific augmentations beyond the base) follow the same shape on a separate edits stack.
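Since the edit CLI is deferred, the following is only a sketch of the payload-side bookkeeping such an edit implies. The audit entry reuses the field names from the payload schema (at, by, field, op, value, reason); the prior key carrying the previous value and the revision counter key are hypothetical names. Re-embedding and the Qdrant write-back are out of scope here.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def apply_edit(
    payload: Dict[str, Any],
    *,
    by: str,
    field: str,
    op: str,
    value: Any,
    reason: str,
    at: Optional[str] = None,
) -> Dict[str, Any]:
    """Return an edited copy of the payload with an appended audit entry."""
    updated = copy.deepcopy(payload)
    prior = list(updated.get(field, []))
    current = list(prior)
    if op == "add":
        if value not in current:
            current.append(value)
    elif op == "remove":
        current.remove(value)  # raises ValueError if the value is absent
    else:
        raise ValueError(f"unsupported op: {op!r}")
    updated[field] = current
    updated.setdefault("operator_edits", []).append({
        "at": at or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "by": by,
        "field": field,
        "op": op,
        "value": value,
        "reason": reason,
        "prior": prior,  # hypothetical key: previous value, enables rollback
    })
    updated["revision"] = updated.get("revision", 0) + 1  # hypothetical key
    return updated
```

Returning a copy instead of mutating in place keeps the pre-edit payload available until the write-back to Qdrant succeeds.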
SHAP / SAGE shift under late interaction
The structured per-segment inputs (column_name, each sample, context, pattern view) provide natural attribution surfaces that the prior single-vector representation flattened.
SHAP becomes per-decision interpretability infrastructure. For a
column predicted EMAIL, SHAP attributes the score across the
structured inputs: “sample_3 contributed 0.42 via match against
EMAIL.prototype_values[7]; column_name contributed 0.08 via
name_hints; everything else < 0.05.” Operator-legible
explanation per prediction, computable in-pipeline at moderate
cost (one late-interaction pass per perturbation). Wired into
features.FEATURE_NAMES as new ablatable feature slots:
late_interaction_positive, late_interaction_negative,
late_interaction_view_<name>.
SAGE moves to offline-first. Late-interaction inputs are richer (more “features” — per-view contributions, per-vector contributions), and SAGE’s permutation-based global compute scales with that dimensionality. Per-pipeline-run SAGE becomes impractical and, more importantly, of low marginal value: SAGE’s value proposition is corpus-level stability rather than per-run signal. The shift:
- SAGE runs as a separate offline pipeline, scheduled or on-demand, against the current enriched annotations + corpus characterization.
- Artifact written to build/sage/<corpus_id>-<taxonomy_version>-<utc>.parquet.
- Downstream consumers (UI, view-prioritization, operator dashboards) reference the cached artifact; the pipeline hot path never recomputes inline.
- Optional integration: SAGE importance scores prioritize which annotation views the late-interaction engine computes first, with early-exit when high-importance views already discriminate confidently — a wall-time win on large taxonomies.
CLAUDE.md already notes SAGE is optional; this makes “optional” precise: optional in the hot path, scheduled-only otherwise.
Integration with existing fusion mechanisms
Every mechanism in dst-evidence-independence.md
composes cleanly with this work. Specifically:
| Existing mechanism | Composes by |
|---|---|
| Reliability discounting (Shafer §11.3) | Late-interaction cosine carries its own discount slot in config/base.conf; default starts at cosine value (0.20) and is sweep-tunable. |
| Indep-tier consensus + revisit gate | Late-interaction cosine remains in the independent tier (its only LLM dependence is the enrichment, which is offline + verified). Indep-tier fusion picks it up unchanged. |
| Cosine reliability shaping (Haenni-Hartmann 2006) | The α-bounded + margin-aware allocation pattern is reused for the positive channel; quality indicators extend to include verifier-pass-rate. |
| Hierarchical mass aggregation + cross-subtree visibility | The positive-channel mass function emits hierarchical mass identically: walk up from top-1 leaf to the most-specific subtree capturing ≥ 50% of softmax probability, redirect residual to internal-node focal element. cautious_promoted_code walks the full hierarchy as before. |
| Cost-sensitive classification at LLM layer (Elkan 2001) | Unchanged — operates upstream of fusion and is orthogonal to the cosine representation. |
| Pattern-target alias resolver | Unchanged for the standalone pattern source. The pattern source’s hits additionally enrich the col_pattern_view query vector. |
| Per-column residual trajectory | Unchanged — operates on the iteration history of fused belief, which still flows through BootstrapState. The late-interaction cosine’s per-view scores can be added to the snapshot for finer-grained trajectory analysis (deferred). |
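The hierarchical walk named in the table (from the top-1 leaf up to the most-specific subtree capturing ≥ 50% of softmax probability) can be sketched as below. The example assumes the hierarchy is recoverable from dotted code prefixes, as in the ICE.* codes; the real _significant_subtree may instead follow parent_code links, and the function name here is hypothetical.

```python
from typing import Dict


def significant_subtree_sketch(
    probs: Dict[str, float], threshold: float = 0.5
) -> str:
    """Return the deepest ancestor-or-self of the top-1 code whose subtree
    holds at least `threshold` of the probability; '' means the whole frame."""
    top1 = max(probs, key=probs.get)
    parts = top1.split(".")
    # Start at the leaf itself and walk toward the root.
    for depth in range(len(parts), 0, -1):
        node = ".".join(parts[:depth])
        prefix = node + "."
        covered = sum(
            p for code, p in probs.items()
            if code == node or code.startswith(prefix)
        )
        if covered >= threshold:
            return node
    return ""


probs = {
    "ICE.S.CONTACT.EMAIL": 0.35,
    "ICE.S.CONTACT.PHONE": 0.30,
    "ICE.S.FIN.IBAN": 0.20,
    "ICE.N.ID": 0.15,
}
node = significant_subtree_sketch(probs)  # "ICE.S.CONTACT"
```

Here neither contact leaf reaches the threshold alone, but their shared parent does, so residual mass would be redirected to that internal node and not forced onto a leaf.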
Configuration
New keys under classify.cosine.late_interaction in config/base.conf:
classify {
  cosine {
    # Late-interaction multi-vector cosine is the production cosine
    # source. Default ON. The legacy single-vector cosine path
    # remains in the code only as a transitional emergency fallback;
    # when the late-interaction flag is on and the path cannot run
    # (no enriched collection, Qdrant unreachable, qdrant-client
    # missing), the pipeline logs WARNING + marks the run degraded
    # via `maxsim_path` in the per-column result.
    late_interaction {
      enabled = true
      enabled = ${?ATELIER_CLASSIFY_COSINE_LATE_INTERACTION}
      model = "colbert-ir/colbertv2.0"
      model = ${?ATELIER_COLBERT_MODEL}
      qdrant_url = "http://127.0.0.1:6333"
      qdrant_url = ${?ATELIER_QDRANT_URL}
    }
  }
}
Existing classify.cosine.* keys are unchanged; the late-interaction
path is the production cosine source under this design. The flag
exists for emergency rollback only — leaving the pipeline in legacy
single-vector cosine is a deployment-degraded state, not a normal
operating mode, and runs in that state are tagged with
maxsim_path: "legacy_degraded:<reason>" in the per-column result
so the degradation is visible in operator-facing artifacts.
Deferred work
- Synthia / copula-aware column-side patterns: when the SVM-on-synthetic work lands (separate track), the column-side multi-vector can include copula-derived inter-column dependency features as additional query vectors. The query-vector slot is already structurally available; only the feature extractor needs to land.
- Ægir CTA + CPA outputs as additional query vectors: when Ægir integration lands, its predictions (and its CPA / cross-table grouping outputs) can enter the column-side multi-vector as supplementary query views. Same structural slot.
- Per-deployment edit overlays with separate version stack from the base augmentation. Schema for the overlay is sketched above; implementation deferred until operator workflow is validated.
- Iceberg-HX-in-S3 backend for the on-demand exports + run artifacts. Designed-for; not yet built.
References
- Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR ’20, 39–48. Introduces the late-interaction MaxSim formulation.
- Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL 2022. Refines the MaxSim scoring + residual compression.
- Qdrant multi-vector named-vectors API: https://qdrant.tech/course/multi-vector-search/module-1/late-interaction-basics/
- Shafer, G. (1976). A Mathematical Theory of Evidence. §11.3 reliability discount. (Reused per the existing DST treatment.)
- Smets, P. (1990). The Combination of Evidence in the Transferable Belief Model. IEEE TPAMI 12(5), 447–458. Negative-channel framing.
- Haenni, R. & Hartmann, S. (2006). Modeling Partially Reliable Information Sources. Information Fusion 7(4), 361–379. α-bounded reliability shaping reused here.
- Companion architecture note: dst-evidence-independence.md (reliability discounting, indep-tier consensus, hierarchical aggregation, cost-sensitive LLM prompting).
- Upstream foundation-model work: https://zndx.github.io/aegir/ (hierarchical byte-level sequence model + RLVR-trained ontology policy for CTA/CPA/cross-table grouping; complementary independent evidence source on a longer timeline).
Nautilus — Mid-Run Pipeline Watcher
Nautilus is the in-process, mid-run watcher for a classification run. A
daemon thread polls the FSM and
BootstrapState.batch_audit, decides when a run is going sideways, and
hands a structured InterventionRecord to a callback. The callback —
not nautilus — decides what to do (record, cancel, escalate).
The thread itself is observation + decision framing. It owns no LLM-calling code, holds no agent context, and never kills a process. That separation keeps nautilus testable without tool-using agents in the loop and lets the same trigger logic serve both the gateway’s auto-cancel hook and the supervisor Overwatch post-mortem.
Why it exists
UAT surfaced a class of failure where the pipeline stopped making
progress without erroring — typically a frozen LLM sweep on a
problem batch with no heartbeat advance for 20+ minutes. The FSM still
read LLM_SWEEP; nothing was wrong from the FSM’s point of view. The
operator either waited or killed the process by hand.
Nautilus closes that gap. It pairs with two other layers of self-remediation in the pipeline:
| Pillar | Where it lives | What it does |
|---|---|---|
| 1 — Halving retry | classify/bootstrap.py | Per-batch: on LLM failure, halve columns_per_call and retry until single-column or success. Preserves 100% column coverage. |
| 2 — Nautilus (this doc) | overwatch/nautilus.py | Per-run: observe FSM + batch_audit, fire an intervention when the run stalls, sweeps too long, or accumulates failures. |
| 3 — Supervisor Overwatch | overwatch/agent.py, apply_and_rerun.py | Post-run: read the latest intervention record, propose a config overlay, optionally rerun. Multi-attempt session in overwatch/session.py. |
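Pillar 1 is the easiest of the three to illustrate in isolation. The sketch below is not the code in classify/bootstrap.py: call_llm is a hypothetical callable that classifies a batch or raises, the broad except stands in for whatever specific errors the real client raises, and recording None for a column that fails even on its own is an assumed way of keeping coverage at 100%.

```python
from typing import Callable, Dict, List, Optional, Sequence


def classify_with_halving(
    columns: Sequence[str],
    call_llm: Callable[[List[str]], Dict[str, str]],
    columns_per_call: int,
) -> Dict[str, Optional[str]]:
    """Classify every column, halving any failing batch until it succeeds
    or shrinks to a single column."""
    results: Dict[str, Optional[str]] = {}
    pending: List[List[str]] = [
        list(columns[i:i + columns_per_call])
        for i in range(0, len(columns), columns_per_call)
    ]
    while pending:
        batch = pending.pop()
        try:
            results.update(call_llm(batch))
        except Exception:  # assumption: real code catches client-specific errors
            if len(batch) == 1:
                # Single-column failure: keep the column, marked unresolved.
                results[batch[0]] = None
            else:
                mid = len(batch) // 2
                pending.extend([batch[:mid], batch[mid:]])
    return results
```

A single problematic column therefore costs a logarithmic number of extra calls relative to the batch size, and the healthy columns that shared its batch are still classified.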
Triggers
Each trigger fires at most once per FSM phase. Phase change resets the
phase-scoped triggers (stall, slow_llm_sweep) so a long run can
record one intervention per phase rather than storming every poll.
| Trigger constant | Fires when | Default threshold |
|---|---|---|
| TRIGGER_STALL ("stall") | No new batch_audit activity for stall_threshold_s while FSM is in a non-terminal state. | 120 s |
| TRIGGER_SLOW_SWEEP ("slow_llm_sweep") | fsm.state == LLM_SWEEP for more than llm_sweep_threshold_s, regardless of batch progress. | 300 s |
| TRIGGER_FAILED_BATCHES ("failed_batches") | Count of batch_audit entries with status failed or fatal exceeds failed_batch_threshold. | 10 |
| TRIGGER_FSM_ERROR ("fsm_error") | Pipeline transitioned to ERROR. Unconditional; bypasses other evaluation. | n/a |
evaluate_triggers() is a pure function of heartbeat + config — tests
exercise it directly with a seeded _Heartbeat and synthetic clock,
no threads required.
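A minimal sketch of what a pure trigger evaluator could look like, using the thresholds from the table above. The Heartbeat and TriggerConfig classes, their field names, and the function signature are assumptions made for illustration; the real evaluate_triggers() in overwatch/nautilus.py may be shaped differently.

```python
from dataclasses import dataclass, field

TRIGGER_STALL = "stall"
TRIGGER_SLOW_SWEEP = "slow_llm_sweep"
TRIGGER_FAILED_BATCHES = "failed_batches"
TRIGGER_FSM_ERROR = "fsm_error"
TERMINAL_STATES = {"IDLE", "CONVERGED", "ERROR"}


@dataclass
class TriggerConfig:  # hypothetical stand-in for the nautilus config
    stall_threshold_s: float = 120.0
    llm_sweep_threshold_s: float = 300.0
    failed_batch_threshold: int = 10


@dataclass
class Heartbeat:  # hypothetical stand-in for _Heartbeat
    last_audit_advance: float  # clock time of the last batch_audit append
    phase_entered: float       # clock time the current FSM phase began
    already_fired: set = field(default_factory=set)


def evaluate_triggers(hb, cfg, state_name, now, failed_count):
    """Return the triggers that fire at time `now`. No threads, no I/O."""
    if state_name == "ERROR":
        return [TRIGGER_FSM_ERROR]  # unconditional, bypasses the rest
    if state_name in TERMINAL_STATES:
        return []
    candidates = []
    if now - hb.last_audit_advance > cfg.stall_threshold_s:
        candidates.append(TRIGGER_STALL)
    if state_name == "LLM_SWEEP" and now - hb.phase_entered > cfg.llm_sweep_threshold_s:
        candidates.append(TRIGGER_SLOW_SWEEP)
    if failed_count > cfg.failed_batch_threshold:
        candidates.append(TRIGGER_FAILED_BATCHES)
    # The caller records fired triggers and clears the phase-scoped ones
    # on a phase change; this function only reads the heartbeat.
    return [t for t in candidates if t not in hb.already_fired]
```

Because the clock is an argument, a test can pass any value for now and never has to sleep.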
How it observes
- State registry (module-level _state_registry): the pipeline calls register_state(run_id, state) early in run_classification_pipeline and unregister_state(run_id) in the finally block. The registry is lock-guarded so nautilus never observes a partially-destructed state during teardown.
- FSM polling (tick): fsm.get_status(run_id) each poll (default every 10 s). Phase change resets phase-scoped triggers and the phase-entry clock.
- batch_audit tail: nautilus counts entries and failed entries. The pipeline appends to state.batch_audit between LLM calls, so the audit length acts as the heartbeat; its non-advance is the stall signal.
Dispatch and cooperative cancel
When a trigger fires, _dispatch builds an InterventionRecord,
appends it to watcher.interventions, and invokes
intervene_callback(rec) if one was supplied. The callback returns a
dict with a decision field ("observed" | "intervened" |
"cancelled") and an optional reason.
If decision == "cancelled" and cfg.can_cancel is true, nautilus
flips state.cancelled = True on the registered BootstrapState. The
pipeline checks this flag between LLM batches in
bootstrap.py
and exits cleanly. There is no SIGKILL path. An in-flight LLM
call finishes before the run terminates. This is what “cooperative
cancel” means.
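The two halves of cooperative cancel can be sketched as follows. RunState, apply_decision, and run_batches are hypothetical names for this illustration and do not come from the codebase; only the decision values and the can_cancel gate are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class RunState:  # hypothetical stand-in for BootstrapState
    cancelled: bool = False


def apply_decision(state, callback_result, can_cancel):
    """Watcher side: flip the flag only if the callback asks AND autonomy allows."""
    decision = (callback_result or {}).get("decision", "observed")
    if decision == "cancelled" and can_cancel:
        state.cancelled = True
    return decision


def run_batches(state, batches, classify):
    """Pipeline side: check the flag between batches, never mid-call."""
    done = []
    for batch in batches:
        if state.cancelled:
            break  # clean exit; the previous call already finished
        done.append(classify(batch))
    return done
```

The batch that is in flight when the flag flips still completes and its result is kept, which matches the "no SIGKILL path" behaviour described above.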
can_cancel is gated by overwatch.autonomy:
| Autonomy mode | can_cancel | What nautilus does on stall |
|---|---|---|
| monitor | false | Record only. |
| propose | false | Record only. The supervisor reads the record post-run. |
| autonomous | true | Flip state.cancelled so the pipeline exits. |
The gateway’s default callback (gateway.py:2154)
always returns {"decision": "cancelled"} — so the autonomy gate is
the only thing keeping propose / monitor runs from auto-cancelling.
Deployment gates — Bedrock-only and direct-Anthropic
Nautilus runs without the direct Anthropic API. It makes no LLM calls
of its own — it observes, decides, and hands a record to a callback.
The Anthropic gate (cfg.has_overwatch) only applies to Pillar 3
— the post-run supervisor agent that consumes nautilus’s records
and proposes config overlays. Pillar 2 (this watcher) is upstream of
that gate.
Three independent gates drive what nautilus actually does:
| Gate | Source | Affects |
|---|---|---|
| overwatch.nautilus.enabled | HOCON / env (default true) | Whether the watcher attaches at all |
| overwatch.autonomy == "autonomous" | HOCON / env (default propose) | Whether nautilus can flip state.cancelled itself; whether kill_run CLI is permitted |
| cfg.has_overwatch (= overwatch.enabled AND has_anthropic) | derived from ANTHROPIC_API_KEY | Whether Pillar 3 supervisor agent runs post-run |
Capability matrix on a Bedrock-only deployment (no
ANTHROPIC_API_KEY — typical for CAI on Bedrock or air-gapped
environments):
| Capability | Bedrock + propose (default) | Bedrock + autonomous |
|---|---|---|
| Watcher thread starts and polls | ✅ | ✅ |
| Trigger detection (stall / slow-sweep / failed-batches / fsm_error) | ✅ | ✅ |
| InterventionRecords queryable via /api/overwatch/nautilus* | ✅ | ✅ |
| Operator UI Stop (POST /api/fsm/cancel) | ✅ (never autonomy-gated) | ✅ |
| Auto-cancel on stall (nautilus → state.cancelled) | ❌ recorded only (can_cancel=False) | ✅ |
| kill_run CLI | ❌ rejected (autonomy gate) | ✅ |
| Post-run supervisor agent (proposes overlay, optional rerun) | ❌ requires direct Anthropic API | ❌ requires direct Anthropic API |
To unlock auto-cancel without adding an Anthropic key, set
ATELIER_OVERWATCH_AUTONOMY=autonomous. Autonomy is independent of
has_overwatch. The trade-off: nautilus will cancel runs based on
threshold rules alone, with no AI judgement layer behind the
decision.
Config
overwatch {
  autonomy = "propose"   # monitor | propose | autonomous
  nautilus {
    enabled = true
    poll_interval_s = 10.0
    stall_threshold_s = 120.0
    llm_sweep_threshold_s = 300.0
    failed_batch_threshold = 10
  }
}
Environment overrides: ATELIER_OVERWATCH_NAUTILUS_ENABLED,
ATELIER_OVERWATCH_NAUTILUS_POLL_INTERVAL_S,
ATELIER_OVERWATCH_NAUTILUS_STALL_THRESHOLD_S,
ATELIER_OVERWATCH_NAUTILUS_LLM_SWEEP_THRESHOLD_S,
ATELIER_OVERWATCH_NAUTILUS_FAILED_BATCH_THRESHOLD.
nautilus_config_from_cfg(cfg) reads these and fills can_cancel
from overwatch.autonomy == "autonomous".
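As a rough illustration of how the overrides and the autonomy gate fit together, the sketch below builds a config from environment variables alone. NautilusConfig and config_from_env are hypothetical; the real nautilus_config_from_cfg(cfg) reads the merged HOCON config, and the boolean parsing shown here is an assumption.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class NautilusConfig:  # hypothetical shape; fields mirror the HOCON keys
    enabled: bool = True
    poll_interval_s: float = 10.0
    stall_threshold_s: float = 120.0
    llm_sweep_threshold_s: float = 300.0
    failed_batch_threshold: int = 10
    can_cancel: bool = False


def config_from_env(env=None):
    """Apply ATELIER_OVERWATCH_* overrides on top of the documented defaults."""
    env = os.environ if env is None else env
    d = NautilusConfig()
    p = "ATELIER_OVERWATCH_NAUTILUS_"
    return NautilusConfig(
        # Accepted truthy spellings are an assumption of this sketch.
        enabled=env.get(p + "ENABLED", "true").strip().lower() in {"1", "true", "yes"},
        poll_interval_s=float(env.get(p + "POLL_INTERVAL_S", d.poll_interval_s)),
        stall_threshold_s=float(env.get(p + "STALL_THRESHOLD_S", d.stall_threshold_s)),
        llm_sweep_threshold_s=float(env.get(p + "LLM_SWEEP_THRESHOLD_S", d.llm_sweep_threshold_s)),
        failed_batch_threshold=int(env.get(p + "FAILED_BATCH_THRESHOLD", d.failed_batch_threshold)),
        # can_cancel is derived from autonomy, never set directly.
        can_cancel=env.get("ATELIER_OVERWATCH_AUTONOMY", "propose") == "autonomous",
    )
```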
HTTP surface
Both routes are read-only. Cancellation goes through the operator
“Stop run” UI control or the kill_run CLI; nautilus does not expose
a cancel endpoint of its own.
| Method | Path | Purpose |
|---|---|---|
| GET | /api/overwatch/nautilus/{run_id} | Watcher snapshot for a specific run: heartbeat, intervention list, cancelled flag. |
| GET | /api/overwatch/nautilus | All active watchers (typically one, since runs are single-flight). |
The watcher object is held in a module-level _active_watchers map
so the gateway can answer status queries without plumbing the
reference through pipeline internals. Intervention history survives
watcher stop and remains queryable until the gateway restarts.
Operator CLI: cooperative kill
uv run python -m atelier.overwatch.kill_run <run_id> \
--reason "stuck on partner-data sweep" \
[--session <supervisor-session-id>]
kill_run looks the run up in the nautilus registry, sets
state.cancelled = True, and stops the watcher. Gated to
autonomous mode — in propose / monitor an operator must use
the UI Stop control, since the supervisor (which calls this CLI in
autonomous mode) isn’t yet authorized to cancel on its own. With
--session, the cancel is appended to the supervisor session’s
intervention log via overwatch.session.record_intervention.
Lifecycle
- Operator hits POST /api/fsm/start. Gateway spawns the pipeline thread.
- Gateway polls fsm.get_status() for up to ~1 s waiting for the pipeline to claim a run_id, then constructs a NautilusWatcher, registers it in _active_watchers, and starts the daemon thread.
- Pipeline calls register_state(run_id, state) early in run_classification_pipeline. From here nautilus can observe.
- On each poll_interval_s tick: read FSM, refresh heartbeat, evaluate triggers, dispatch records, repeat.
- On terminal state (IDLE / CONVERGED / ERROR) the watcher’s tick returns True and the thread exits. Pipeline’s finally calls unregister_state(run_id) and the gateway calls clear_active_watcher(run_id).
Testing
The watcher is split deliberately to keep tests synchronous:
- Trigger logic: drive evaluate_triggers(state_name=..., now=..., failed_count=...) against a watcher with a hand-seeded _Heartbeat and a fake clock. No threads, no FSM, no BootstrapState.
- Dispatch / cancel: instantiate a NautilusWatcher with a fake intervene_callback and a stub FSM; assert that the callback return value drives state.cancelled correctly under each can_cancel value.
- End-to-end: BDD scenarios under features/agent/ cover the registry round-trip and the gateway routes; the slow-path watch is not exercised in CI (it would require a real long-running sweep).
Monte Carlo Sampling
At small corpus sizes (< 200 columns), every column receives direct LLM classification. As the corpus scales to thousands or millions of columns, this becomes prohibitively expensive. Monte Carlo stratified sampling selects a representative subset for direct LLM inference and propagates labels cheaply via embedding similarity to the remainder.
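The optimization costs nothing for small corpora: below the min_corpus_size threshold (default 200 columns), the pipeline classifies every column exactly as it did before the MC layer existed. Above the threshold, the MC layer activates without operator action.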
Three-Phase MC Layer
The MC layer runs as sub-phases of the existing SAMPLING and LLM_SWEEP states, so it adds no new FSM states.
SAMPLING
├─ [existing] Extract features for all columns
├─ Pre-classify: cheap M0 evidence (name, pattern, cosine) — no LLM
├─ Stratify: group by preliminary category + uncertainty
└─ Select MC sample: importance-weighted within strata
LLM_SWEEP
├─ [existing] LLM classifies the MC sample (not all columns)
└─ Propagate: extend labels to remaining corpus via embedding similarity
VALIDATING
└─ [existing] Full 6-source DST on ALL columns
(propagated labels enter as discounted LLM evidence)
→ High-gap / low-belief propagated columns escalate to revisit
Phase 1: Pre-Classification
Run M0 evidence sources only (no LLM, no ML models). For each column:
- Name matching → best category + mass
- Pattern detection → matched categories
- Cosine similarity → top-K categories + scores
Returns a preliminary category code + confidence for every column. Uses the
existing name_match_to_mass(), pattern_to_mass(), classify_cosine()
functions from the pipeline.
Phase 2: Stratification
Partition columns by their preliminary category code:
- Rare strata (< 2 × min_per_stratum members): fully sampled
- UNRESOLVED stratum (M0 sources disagree or low confidence): fully sampled
- Normal strata: proportional allocation with importance weighting
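The partition rules above can be sketched in a few lines. This stratify function and its input shape are illustrative assumptions, not the signature of the real stratify() in monte_carlo.py.

```python
from collections import defaultdict


def stratify(prelim, min_per_stratum=3):
    """Split columns into fully-sampled strata and normal strata.

    prelim maps column id -> preliminary category code, with the literal
    code "UNRESOLVED" for columns where M0 sources disagree or confidence
    is low (an assumed encoding for this sketch).
    """
    strata = defaultdict(list)
    for column_id, code in prelim.items():
        strata[code].append(column_id)

    fully_sampled, normal = {}, {}
    for code, members in strata.items():
        if code == "UNRESOLVED" or len(members) < 2 * min_per_stratum:
            fully_sampled[code] = members  # rare or ambiguous: keep everything
        else:
            normal[code] = members         # large enough to subsample
    return fully_sampled, normal
```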
Phase 3: Sample Selection
Within each normal stratum, select columns via importance-weighted random sampling without replacement. Importance weight per column:
w = (1 - confidence) × (1 + uncertainty)
where confidence = max cosine similarity, uncertainty = ratio of
2nd-best to 1st-best similarity (ambiguity measure).
Total budget: min(max_sampled_columns, total × sample_fraction)
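One way to implement importance-weighted sampling without replacement is the Efraimidis-Spirakis key method, sketched below together with the budget formula. The choice of that method, the function names, and the tuple layout are assumptions for illustration; the source does not say which algorithm select_sample() uses.

```python
import random


def importance_weight(confidence, uncertainty):
    """w = (1 - confidence) * (1 + uncertainty), as defined above."""
    return (1.0 - confidence) * (1.0 + uncertainty)


def budget(total, sample_fraction=0.15, max_sampled_columns=500):
    """Total number of columns sent to the LLM directly."""
    return min(max_sampled_columns, int(total * sample_fraction))


def weighted_sample(columns, k, rng):
    """Pick k distinct columns, favouring high importance weight.

    columns is a list of (column_id, confidence, uncertainty). Each column
    gets the key u ** (1 / w) with u uniform in [0, 1); the k largest keys
    win (Efraimidis-Spirakis weighted sampling without replacement).
    """
    keyed = []
    for column_id, confidence, uncertainty in columns:
        w = max(importance_weight(confidence, uncertainty), 1e-9)  # avoid 1/0
        keyed.append((rng.random() ** (1.0 / w), column_id))
    keyed.sort(reverse=True)
    return [column_id for _, column_id in keyed[:k]]
```

Passing a seeded random.Random keeps a sampling plan reproducible across runs.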
Label Propagation
After the LLM sweep on the sampled subset:
- For each propagation column, find the nearest directly-classified column by cosine similarity (stratum-local to limit search space)
- If similarity >= propagation_threshold: assign same label with discounted confidence
- If similarity < threshold: column gets no LLM evidence in DST
Propagated labels enter DST fusion with a higher discount factor (0.30 vs 0.10 for direct LLM), so they carry less evidential mass. If M0 sources disagree with the propagated label, conflict K rises and the belief gap widens; the existing targeted-revisit loop, which keys on gap width, then escalates the column for direct LLM classification.
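A minimal, dependency-free sketch of the nearest-neighbour step. The function names and the (label, embedding) layout are assumptions; the real propagate_labels() presumably works on batched embedding arrays rather than Python lists.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def propagate_label(target_vec, classified, threshold=0.85):
    """Return (label, similarity) from the nearest classified neighbour.

    classified is a list of (label, embedding) pairs from the same stratum.
    Returns None when no neighbour reaches the threshold, in which case the
    column contributes no LLM evidence to DST.
    """
    best = max(classified, key=lambda item: cosine(target_vec, item[1]), default=None)
    if best is None:
        return None
    similarity = cosine(target_vec, best[1])
    return (best[0], similarity) if similarity >= threshold else None
```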
Why This Works with DST
The evidence fusion framework makes MC sampling robust:
- Propagated evidence carries less mass (more goes to Theta/ignorance)
- M0 agreement with propagated label → high belief, narrow gap (good)
- M0 disagreement with propagated label → wide gap → revisit-via-LLM
- Escalation is automatic — no special MC-aware revisit logic needed
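The effect can be checked on a toy two-category frame. The sketch below assumes the discount factor is Shafer's alpha (a fraction alpha of each mass moves to Theta) and uses plain Dempster combination; the masses 0.9 and 0.6 are made-up inputs, and total conflict (K = 1) is not handled.

```python
THETA = frozenset({"EMAIL", "PHONE"})  # toy frame of discernment
EMAIL = frozenset({"EMAIL"})
PHONE = frozenset({"PHONE"})


def discount(m, alpha):
    """Shafer discounting: scale focal masses by (1 - alpha), rest goes to Theta."""
    out = {a: (1.0 - alpha) * v for a, v in m.items() if a != THETA}
    out[THETA] = alpha + (1.0 - alpha) * m.get(THETA, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule. Returns (fused mass function, conflict K)."""
    joint, conflict = {}, 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            inter = a & b
            if inter:
                joint[inter] = joint.get(inter, 0.0) + va * vb
            else:
                conflict += va * vb  # mass on the empty set
    return {s: v / (1.0 - conflict) for s, v in joint.items()}, conflict


def interval(m, a):
    """Belief interval [Bel(a), Pl(a)] for hypothesis a."""
    bel = sum(v for s, v in m.items() if s <= a)
    pl = sum(v for s, v in m.items() if s & a)
    return bel, pl


llm = {EMAIL: 0.9, THETA: 0.1}
propagated = discount(llm, 0.30)  # propagated label: heavier discount
agree, k_agree = combine(propagated, {EMAIL: 0.6, THETA: 0.4})
disagree, k_disagree = combine(propagated, {PHONE: 0.6, THETA: 0.4})
```

With these inputs, agreement gives K = 0 and an EMAIL gap of about 0.15, while disagreement gives K of about 0.38 and a gap of about 0.24 with lower belief, which is the widening that the revisit loop reacts to.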
Scaling Projections
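GitTables corpus: 1.7M tables today, 10M+ near-term. At an average of 8-12 columns per table, that is roughly 14M-20M columns today and 80M-120M columns at 10M tables, which is where the 15M and 120M rows below come from.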
| Corpus | MC Mode | Direct LLM Calls | Propagated | Cost Reduction |
|---|---|---|---|---|
| 50 | Passthrough | 50 (all) | 0 | 0% |
| 500 | Active | ~75 (15%) | ~425 | 85% |
| 5,000 | Active | ~500 (cap) | ~4,500 | 90% |
| 50K | Active | ~500 (cap) | ~49.5K | 99% |
| 500K | Active | ~500 (cap) | ~499.5K | 99.9% |
| 15M | Active | ~500 (cap) | ~15M | >99.99% |
| 120M | Active | ~500 (cap) | ~120M | >99.99% |
At the max_sampled_columns=500 cap, stratified importance sampling ensures
every category stratum gets at least min_per_stratum=3 exemplars. Uniform
random sampling at 500/15M would miss rare categories entirely.
Scale-Critical Design Decisions
- Embedding computation: batch GPU encoding at ~2,768 texts/s (RTX 4090); 15M columns takes ~90 minutes. One-time cost, GPU-parallelizable.
- Stratum-local propagation: similarity search within each stratum (not across the full corpus) to limit memory and compute.
- Memory: 15M columns × 200B = ~3GB for metadata; 15M × 1.5KB = ~22GB for embeddings. Requires streaming/chunked processing.
- Escalation budget: ~50-100 additional direct-LLM calls from revisit. Total LLM call budget: ~600 calls for a 15M-column corpus.
Configuration
classify {
  monte_carlo {
    min_corpus_size = 200           # Below this, classify everything
    min_corpus_size = ${?ATELIER_MC_MIN_CORPUS_SIZE}
    sample_fraction = 0.15          # Fraction directly classified by LLM
    sample_fraction = ${?ATELIER_MC_SAMPLE_FRACTION}
    min_per_stratum = 3             # Minimum samples per category stratum
    max_sampled_columns = 500       # Hard cap on directly-classified columns
    max_sampled_columns = ${?ATELIER_MC_MAX_SAMPLED}
    propagation_threshold = 0.85    # Cosine sim for propagation
    propagation_threshold = ${?ATELIER_MC_PROPAGATION_THRESHOLD}
    propagation_discount = 0.30     # LLM mass discount for propagated labels
  }
}
Module Structure
src/atelier/classify/monte_carlo.py
├── MCConfig — Frozen dataclass with from_cfg() factory
├── PreClassification — Per-column M0 result (code + confidence + uncertainty)
├── Stratum — Column group by preliminary category
├── MCPlan — Sampling plan (sampled + propagation sets)
├── pre_classify() — Run M0 evidence for all columns
├── stratify() — Group by preliminary category + uncertainty
├── select_sample() — Importance-weighted selection within strata
└── propagate_labels() — Embedding-similarity label extension
GPU Acceleration
Atelier uses GPU acceleration for sentence-transformer embedding computation and CatBoost training/inference. GPU support is auto-detected at startup with graceful fallback to CPU.
Detection
gpu.preflight_gpu() runs once at config load time and caches the result
for the process lifetime. Three-step detection:
- nvidia-smi probe: subprocess call to detect device count, names, VRAM, and driver CUDA version
- CUDA version extraction: parse nvidia-smi header for driver compatibility
- PyTorch check: torch.cuda.is_available() confirms runtime support
The result is a GpuInfo dataclass with:
- available: whether CUDA is usable
- device_count: number of GPUs
- devices: device names with VRAM (e.g., “NVIDIA RTX 4090 24GB”)
- resolved_device: "cuda" or "cpu" for model initialization
- warnings: non-blocking issues (version mismatches, library path hints)
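A simplified sketch of the preflight, assuming the GpuInfo fields listed above. It covers the nvidia-smi probe and the PyTorch check, skips the CUDA-version parsing step, and is not the actual gpu.py implementation; the warning texts are invented for the example.

```python
import shutil
import subprocess
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass
class GpuInfo:
    available: bool = False
    device_count: int = 0
    devices: list = field(default_factory=list)
    resolved_device: str = "cpu"
    warnings: list = field(default_factory=list)


@lru_cache(maxsize=1)  # probe once, reuse for the process lifetime
def preflight_gpu():
    info = GpuInfo()
    # Step 1: ask the driver which devices exist (skipped if the tool is absent).
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout
            info.devices = [line.strip() for line in out.splitlines() if line.strip()]
            info.device_count = len(info.devices)
        except (subprocess.SubprocessError, OSError) as exc:
            info.warnings.append(f"nvidia-smi probe failed: {exc}")
    # Step 2: confirm that the Python runtime can actually use CUDA.
    try:
        import torch
        info.available = bool(torch.cuda.is_available())
    except (ImportError, OSError):
        info.warnings.append("PyTorch unavailable; falling back to CPU")
    if info.device_count and not info.available:
        info.warnings.append("GPU detected but CUDA unavailable; check library paths")
    info.resolved_device = "cuda" if info.available else "cpu"
    return info
```

On a machine with no GPU and no nvidia-smi this returns resolved_device "cpu" with no device entries, matching the fallback behaviour described in this section.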
NVIDIA Driver Symlink (nix + CUDA)
In devenv (nix-managed), CUDA libraries are isolated from the host system.
The GPU module handles the nix+CUDA compatibility pattern by detecting
the driver library path and ensuring PyTorch can find it. This avoids
the common nix pitfall where torch.cuda.is_available() returns False
despite GPUs being present.
Integration Points
Sentence-Transformer Embedding
embedding.py calls preflight_gpu() before initializing the
SentenceTransformer model, passing device=gpu_info.resolved_device:
gpu_info = preflight_gpu()
model = SentenceTransformer("all-MiniLM-L6-v2", device=gpu_info.resolved_device)
GPU batch encoding achieves ~2,768 texts/second on RTX 4090 (vs ~400/s on CPU). This matters at scale: 15M columns takes ~90 minutes on GPU vs ~10 hours on CPU.
CatBoost Training
CatBoost runs on the GPU when its task_type parameter is set to "GPU",
which happens automatically when preflight finds a usable device. The
virtual ensemble posterior sampling that drives uncertainty
quantification benefits from GPU parallelism.
Preflight Reporting
GPU status appears in just preflight output and in the /api/status
gateway endpoint, giving operators immediate visibility into whether
GPU acceleration is active.
Configuration
GPU detection is automatic — no configuration needed. The system probes hardware and falls back gracefully:
- GPU available: uses CUDA for all embedding and training operations
- GPU detected but CUDA unavailable: warns about library path issues, falls back to CPU
- No GPU: runs entirely on CPU with no warnings
CAI Considerations
CAI ML workloads can request GPU runtimes. When running on a GPU-enabled CAI session:
- The NVIDIA drivers are provided by the container runtime
- PyTorch CUDA support depends on the Python runtime image
- GPU memory is shared with other processes in the session
- Background SHAP computation can be memory-intensive; monitor with nvidia-smi if running alongside large models
Synthetic Data & Training
The classification pipeline includes two ML evidence sources — CatBoost and SVM — that require training data. Atelier generates synthetic training data from the controlled vocabulary, trains both classifiers, and uses them as independent evidence sources in DST fusion.
Synth Generators
synth_generators.py is the single source of truth for 316+ hand-coded
value generators shared across the synth framework, sample source generation,
and the registry.
Each generator is a callable (rng: random.Random) -> str that produces
realistic values for a category. Examples:
- EMAIL → "j.smith@example.com", "alice.chen@corp.net"
- SSN → "123-45-6789" (formatted US Social Security Number)
- LATITUDE → "41.8781" (valid geographic coordinate)
- CURRENCY_CODE → "USD", "EUR", "JPY"
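For illustration, three generators with the (rng: random.Random) -> str shape might look like this. The word lists, value ranges, and this small GENERATORS dict are invented for the example and are not the contents of synth_generators.py.

```python
import random


def gen_email(rng: random.Random) -> str:
    first = rng.choice(["j", "alice", "m", "priya"])
    last = rng.choice(["smith", "chen", "okafor", "garcia"])
    domain = rng.choice(["example.com", "corp.net"])
    return f"{first}.{last}@{domain}"


def gen_ssn(rng: random.Random) -> str:
    # Format only; no attempt to follow real SSN allocation rules.
    return f"{rng.randint(100, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def gen_latitude(rng: random.Random) -> str:
    return f"{rng.uniform(-90.0, 90.0):.4f}"


# Illustrative subset keyed by category code.
GENERATORS = {"EMAIL": gen_email, "SSN": gen_ssn, "LATITUDE": gen_latitude}
```

Taking the rng as a parameter, rather than using the module-level random functions, means a fixed seed reproduces the same synthetic corpus.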
Three-Layer Generator Registry
synth_registry.py builds a complete generator set for any vocabulary
through a priority-based registry:
| Priority | Source | Description |
|---|---|---|
| 1 (highest) | Hand-coded | From GENERATORS dict in synth_generators.py |
| 2 | Template | Real sample values with mild perturbation (±10% numeric jitter, character substitution) |
| 3 (lowest) | Inferred | Regex pattern matching on category metadata (description, common_names) |
registry = GeneratorRegistry.from_vocabulary(category_set)
# registry.coverage_summary() → {"hand-coded": 250, "template": 40, "inferred": 26}
The registry provides coverage_report() and coverage_summary() to
identify categories without generators — important for vocabulary expansion.
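The priority lookup itself is simple. The sketch below uses hypothetical resolve and summarize_coverage helpers over plain dicts; it shows the fall-through order only and is not the GeneratorRegistry API.

```python
PRIORITY = ("hand-coded", "template", "inferred")  # highest priority first


def resolve(category, layers):
    """Return (layer name, generator) from the highest-priority layer that has one.

    layers maps a layer name to a dict of category -> generator.
    """
    for layer in PRIORITY:
        generator = layers.get(layer, {}).get(category)
        if generator is not None:
            return layer, generator
    return None, None


def summarize_coverage(categories, layers):
    """Count categories per winning layer and list those with no generator."""
    counts = {layer: 0 for layer in PRIORITY}
    missing = []
    for category in categories:
        layer, _ = resolve(category, layers)
        if layer is None:
            missing.append(category)
        else:
            counts[layer] += 1
    return counts, missing
```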
Column Name Generation
Synthetic training data deliberately uses diverse column names to prevent classifiers from relying on name heuristics:
- Semantic names: email_address, emailAddress, EMAIL_ADDR (snake_case, camelCase, uppercase variants, synonym-based)
- Opaque names: field_42, col_abc, v_123 (~25% of columns)
This forces the ML models to learn from value patterns and context, not just column naming conventions.
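A small sketch of how such names could be drawn. The helper names and the opaque prefixes are assumptions, and synonym-based variants are left out for brevity.

```python
import random


def name_variants(words):
    """snake_case, camelCase, and UPPER_SNAKE variants of a word list."""
    snake = "_".join(words)
    camel = words[0] + "".join(w.capitalize() for w in words[1:])
    return [snake, camel, snake.upper()]


def column_name(words, rng, opaque_rate=0.25):
    """Return an opaque name about opaque_rate of the time, else a semantic variant."""
    if rng.random() < opaque_rate:
        return f"{rng.choice(['field', 'col', 'v'])}_{rng.randint(1, 999)}"
    return rng.choice(name_variants(words))
```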
ML Training Pipeline
ml_train.py orchestrates training for both classifiers:
synth_*.csv + reference_labels.json
↓
_load_synth_data()
↓
┌────┴────┐
↓ ↓
SVM CatBoost
↓ ↓
svm.pkl catboost.cbm
SVM Path (Signals Architecture)
The SVM classifier uses the Pipeline + FeatureUnion composition adopted
wholesale from the Signals project:
- Build short text from column name + type + sample values via build_svm_text()
- FeatureUnion extracts dual TF-IDF features:
  - Character n-grams (3-6, char_wb analyzer): captures subword patterns
  - Word n-grams (1-2): captures multi-word patterns
- CalibratedClassifierCV(LinearSVC, method="sigmoid"): Platt scaling for calibrated probability estimates
- _min_class_count() guard prevents calibration CV crash on small classes
- Save to .pkl + .classes.json via joblib
The SVM operates on sparse lexical features — architecturally independent from the dense sentence-transformer embedding used by cosine and CatBoost. See Classification Pipeline for the full independence analysis.
CatBoost Path (GPU-accelerated)
- Extract 12 features per column via features.extract_features()
- Compute sentence-transformer embeddings (384-dim, GPU batch encoding)
- Fit CatBoostColumnClassifier with:
  - loss_function="MultiClass"
  - posterior_sampling=True (virtual ensemble uncertainty)
  - auto_class_weights="Balanced" (handle imbalanced categories)
- Save to .cbm + .classes.json
Virtual Ensemble Uncertainty
CatBoost’s posterior_sampling=True enables Bayesian uncertainty
quantification via virtual ensembles. The classifier produces not just
class probabilities but per-class variance estimates. High variance
translates to a higher DST discount factor — uncertain ML predictions
carry less evidential weight in the fusion.
SVM Training (synth-only, with vocab alignment at inference)
The SVM is trained once on the synthetic corpus
(scripts/generate_synth_source.py → ml_train.train_svm), with
TF-IDF char-3-6gram + word-1-2gram features and labels keyed on the
bundled-ontology ICE.* leaves from synth_generators.GENERATORS.
At pipeline runtime, the ICE.* predictions are translated into the
user’s taxonomy via the cached LLM-mediated alignment in
atelier.classify.ontology_alignment (one LLM call per
(vocabulary, model) tuple; result cached on disk under
build/cache/alignment/).
data/synth/*.csv + ICE.* reference labels
↓
train_svm() (sklearn LinearSVC + TfidfVectorizer)
↓
build/models/svm.pkl (label space: ICE.* leaves)
──────── pipeline runtime ──────────────────────
svm.predict_proba(text) → {ICE.X: p, ICE.Y: q, ...}
↓
translate_proba(proba, alignment) ← from ontology_alignment
↓
{user_code_A: p+q, user_code_B: r, ...}
↓
svm_to_mass(...) → BeliefAssignment in user-taxonomy frame
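The translation step in the diagram amounts to summing probabilities over leaves that map to the same user code. A minimal sketch, assuming the alignment is a flat dict from ICE.* leaf to user code (the real alignment structure in ontology_alignment may be richer):

```python
def translate_proba(proba, alignment):
    """Sum ICE.* probabilities into the user's taxonomy.

    Several ICE.* leaves may map to one user code, in which case their
    probabilities add. Leaves with no mapping are dropped, which is an
    assumption of this sketch rather than documented behaviour.
    """
    translated = {}
    for ice_code, p in proba.items():
        user_code = alignment.get(ice_code)
        if user_code is not None:
            translated[user_code] = translated.get(user_code, 0.0) + p
    return translated
```

For example, with ICE.X and ICE.Y both aligned to user_code_A, probabilities 0.5 and 0.25 merge into 0.75 for user_code_A.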
Historical note: earlier revisions of this design ran a mid-loop train_svm_on_frontier_labels (historical function name) that retrained the SVM on live LLM labels and hot-swapped the result into the active model slot. That path was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for the source-independence reasons documented in ontology_alignment.py. The current design preserves the SVM’s TF-IDF independence at the feature and label level; the only LLM dependency is the per-vocabulary alignment table, which is vocabulary-level rather than column-level shared error. See the ontology_alignment.py module docstring for the full independence argument and the BM25-reranker future-work plan.
Train-Eval Cycle
train_eval_cycle.py orchestrates the full loop:
- Generate synthetic data from vocabulary
- Train CatBoost + SVM models
- Classify using the trained models
- Evaluate against the curated reference
This runs as part of the classification pipeline when models don’t exist yet, or can be triggered explicitly for experimentation.
SAGE Feature Importance
sage.py computes global feature importance via permutation-based
SAGE values. Each of the 12 discrete features is ablated and the
classification accuracy impact measured:
- High SAGE value = feature is critical for classification
- Low SAGE value = feature adds little discriminative power
SAGE runs on the directly-LLM-classified sampled subset when MC sampling is active (representative by stratification design), reducing computation at scale.
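The ablate-and-measure idea can be illustrated with plain permutation importance, which is simpler than a full SAGE estimator (SAGE averages over feature subsets; this sketch permutes one feature at a time). The function and its arguments are hypothetical and unrelated to the sage.py API.

```python
import random


def permutation_importance(rows, labels, predict, n_features, rng):
    """Accuracy drop when each feature column is shuffled in turn.

    rows is a list of feature tuples; predict maps one tuple to a label.
    """
    def accuracy(data):
        return sum(predict(r) == y for r, y in zip(data, labels)) / len(labels)

    baseline = accuracy(rows)
    drops = []
    for j in range(n_features):
        column = [r[j] for r in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [r[:j] + (v,) + r[j + 1:] for r, v in zip(rows, column)]
        drops.append(baseline - accuracy(permuted))
    return drops
```

A feature the classifier ignores yields a drop of exactly zero, while a feature it depends on yields a positive drop.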
SHAP Per-Item Attribution
shap_explanations.py provides per-column explanations for why each
column was classified as it was:
| Method | Algorithm | Speed | When Used |
|---|---|---|---|
| CatBoost TreeSHAP | Exact O(TLD²) built-in | ~0.1s for 50 items | Auto when CatBoost loaded |
| PermutationSHAP | shap.PermutationExplainer | ~50s/item | Explicit request only |
Each classification gains 6 SHAP columns:
shap_top1_name, shap_top1_value, shap_top2_name, shap_top2_value,
shap_top3_name, shap_top3_value.
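A sketch of how per-feature attributions could be flattened into those six columns. Ranking by absolute SHAP value and padding with None are assumptions of this example; shap_explanations.py may rank or pad differently.

```python
def shap_top3_columns(attributions):
    """Flatten one column's attributions into the six shap_top* fields.

    attributions maps feature name -> SHAP value. Features are ranked by
    absolute value; missing ranks are filled with None.
    """
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    row = {}
    for rank in range(1, 4):
        name, value = ranked[rank - 1] if rank <= len(ranked) else (None, None)
        row[f"shap_top{rank}_name"] = name
        row[f"shap_top{rank}_value"] = value
    return row
```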
Background SHAP
For large corpora, SHAP can run in a background thread while the pipeline proceeds to EVALUATING. Controlled by the HOCON flag:
classify {
  background_analysis = true
  background_analysis = ${?ATELIER_BACKGROUND_ANALYSIS}
}
Set to false on CAI if background threads cause runtime issues.
Key Files
| File | Role |
|---|---|
| synth_generators.py | 316+ hand-coded value generators |
| synth_registry.py | Three-layer registry: hand-coded > template > inferred |
| synth.py | Synthetic data generation with diverse column names |
| ml_train.py | Training orchestrator: synth-only CatBoost + synth-only SVM (ICE.* labels) |
| catboost_classifier.py | CatBoost with virtual ensemble uncertainty |
| svm_classifier.py | Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals) |
| train_eval_cycle.py | Generate → train → classify → evaluate loop |
| sage.py | Global SAGE feature importance |
| shap_explanations.py | Per-item SHAP attribution |
Embeddings
The Embeddings page provides interactive visualization of classification results. It renders 2D projections of embedding vectors, allowing users to explore clusters, search data points, and cross-filter by metadata columns.
Architecture
The viewer runs entirely in the browser. DuckDB WASM loads parquet data locally and the EmbeddingAtlas component (from Apple’s embedding-atlas library) renders the visualization using WebGPU with WebGL 2 fallback.
Data Flow
- Backend serves the parquet file via /api/datasets/{id}/data
- React fetches the parquet and loads it into DuckDB WASM via a Mosaic coordinator
- EmbeddingAtlas queries the DuckDB table for rendering: x/y coordinates, categories, text for tooltips
- All filtering, search, and aggregation happens client-side — no round-trips to the server
Parquet Schema
The Embeddings page expects parquet files with these columns:
| Column | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique row identifier |
| x | float32 | yes | 2D projection x-coordinate (UMAP) |
| y | float32 | yes | 2D projection y-coordinate (UMAP) |
| text | string | recommended | Tooltip and search text |
| category | string | recommended | Color-coding category |
Additional columns (e.g., source_table, belief, plausibility) are automatically available as cross-filter charts.
GitTables Dataset
The initial dataset is derived from the GitTables CTA benchmark — 2,517 columns extracted from real tables, annotated with 122 DBpedia property types. These instance labels serve as the controlled vocabulary to be grounded in the SIGDG ontology.
To prepare the visualization parquet:
# From signals evaluation output (recommended)
just prepare-gittables ~/local/src/cldr/signals/build/gittables_eval.parquet
# Then seed the database
just seed
The preparation script computes sentence-transformer embeddings and UMAP 2D projections. The resulting parquet includes DST evidence fusion columns (belief, plausibility, uncertainty gap) when derived from the signals evaluation output.
Naming: Embeddings vs Apache Atlas
The Embeddings page is powered by Apple’s embedding-atlas library. This is unrelated to Apache Atlas, the Cloudera metadata governance catalog used by the signals pipeline.
- Embeddings (Atelier) — Interactive scatter plot of classification embeddings
- Apache Atlas (Cloudera/signals) — Metadata governance catalog on port 21000
To avoid confusion, all user-facing surfaces use “Embeddings”. The embedding-atlas library name appears only in developer documentation and package.json.
Data Sources & Versioning
Atelier organizes classification work around data sources: each source contains input tables, and every pipeline run against a source produces a new dataset version. This replaces the earlier flat dataset model and enables the out-of-the-box (OOTB) onboarding experience.
Data Model
DataSource (1) Dataset versions (N)
┌─────────────────────────┐ ┌──────────────────────────┐
│ id: "ootb-sample" │───1:N──│ v3 (active) — 2 min ago │
│ type: "sample" │ │ v2 — yesterday │
│ display: "Sample" │ │ v1 — built-in │
│ vocab_mode: "universal" │ └──────────────────────────┘
└─────────────────────────┘
┌─────────────────────────┐ ┌──────────────────────────┐
│ id: "hive-prod-default" │───1:N──│ v1 (active) — 1 hour ago │
│ type: "hive" │ └──────────────────────────┘
│ display: "hive:prod/…" │
│ vocab_mode: "hive" │
└─────────────────────────┘
Source Types
| Type | Tables loaded from | Vocabulary | Created by |
|---|---|---|---|
| sample | data/sample/tables/*.csv | Expanded ICE ontology (316 leaves) | Auto-seeded on first boot |
| hive | CAI data connection | Domain annotations from vocab_uri | User creates via Status page |
| synth | data/synth/tables/*.csv | Domain annotations from vocab_uri | Generated by scripts/generate_synth_source.py |
Vocabulary routing: For in-situ classification, the customer’s domain
vocabulary IS the classification target — the LLM reads labels and
descriptions and classifies into the domain’s hierarchical dot-codes.
The annotations table location is configured per source via vocab_uri
(e.g. meta.vocab, meta.annotations), decoupling data tables from the
vocabulary. Multiple sources can share the same annotations table.
Future work: A portable pre-trained model (classify-ICE-then-map) would classify against the built-in ICE vocabulary and translate results to customer terms via
VocabMapping. This requires dedicated training hardware and is not yet implemented.
Database Schema
CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,                   -- 'sample' | 'hive' | 'synth'
    source_uri TEXT NOT NULL DEFAULT '',
    display_name TEXT NOT NULL,
    vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
    vocab_uri TEXT NOT NULL DEFAULT '',          -- e.g. 'meta.vocab', 'meta.annotations'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT                                -- JSON: table_count, column_count
);
-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;
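The versioning behaviour implied by these columns can be sketched against an in-memory database. SQLite is used here only to make the example runnable; the trimmed-down tables, the helper names, and the rule that exactly one version per source is active are assumptions.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE data_sources (
            id TEXT PRIMARY KEY,
            source_type TEXT NOT NULL,
            display_name TEXT NOT NULL
        );
        CREATE TABLE datasets (
            id INTEGER PRIMARY KEY,
            source_id TEXT REFERENCES data_sources(id),
            version_number INTEGER NOT NULL DEFAULT 1,
            is_active BOOLEAN NOT NULL DEFAULT 1
        );
    """)
    return conn


def add_version(conn, source_id):
    """Insert the next version for a source and make it the active one."""
    (latest,) = conn.execute(
        "SELECT COALESCE(MAX(version_number), 0) FROM datasets WHERE source_id = ?",
        (source_id,),
    ).fetchone()
    conn.execute("UPDATE datasets SET is_active = 0 WHERE source_id = ?", (source_id,))
    cur = conn.execute(
        "INSERT INTO datasets (source_id, version_number, is_active) VALUES (?, ?, 1)",
        (source_id, latest + 1),
    )
    return cur.lastrowid


def activate(conn, dataset_id):
    """Make one version active and deactivate its siblings."""
    (source_id,) = conn.execute(
        "SELECT source_id FROM datasets WHERE id = ?", (dataset_id,)
    ).fetchone()
    conn.execute(
        "UPDATE datasets SET is_active = (id = ?) WHERE source_id = ?",
        (dataset_id, source_id),
    )
```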
Vocabulary Routing
When a pipeline run starts, the source_id determines which vocabulary
loads:
- ootb-sample: load_sample_vocabulary() → data/sample/ontology.json (316 BFO-grounded leaves across the CCO ICE trichotomy)
- hive / synth: Domain annotations loaded directly from the table specified by vocab_uri. The domain vocabulary IS the classification target, with no composition with the universal base. Hive sources always require an annotations table.
- No source: Falls back to universal vocabulary (16 PII leaves)
LLM Robustness
The LLM classification batch uses adaptive sizing to avoid context truncation. With large vocabularies (>200 categories), the system prompt embedding the full category table can consume significant context.
- Adaptive batch sizing: _estimate_safe_batch_size() reduces columns_per_call for large vocabularies (e.g. 290 categories → 41 columns per call)
- Truncation retry: When LLMResponse.truncated is detected, the batch is halved and retried recursively until all columns are classified
- Metrics: truncation_count and effective_batch_size tracked in BootstrapState and exposed via the agent’s check_convergence tool
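The truncation retry is a straightforward divide-and-retry recursion. In the sketch below, classify_with_halving and the (labels, truncated) return shape of call_llm are assumptions; the real code works with LLMResponse objects and also updates the metrics listed above.

```python
def classify_with_halving(columns, call_llm):
    """Classify every column, halving the batch whenever the reply is truncated.

    call_llm(batch) returns (labels, truncated), with one label per column
    when truncated is False.
    """
    labels, truncated = call_llm(columns)
    if not truncated:
        return labels
    if len(columns) == 1:
        # Nothing left to split; surface the failure instead of looping.
        raise RuntimeError(f"single-column call still truncated: {columns[0]!r}")
    mid = len(columns) // 2
    left = classify_with_halving(columns[:mid], call_llm)
    right = classify_with_halving(columns[mid:], call_llm)
    return left + right  # original column order is preserved
```

Because both halves are retried, no column is dropped; the cost of a truncation is extra calls, not lost coverage.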
Sample Source
The built-in “Sample” source (source_id ootb-sample) ships with
Atelier so new deployments show meaningful data immediately. When the
landing page loads and “Connected” turns green, the stats cards show
316 Terms and 316 Entities. The ootb- prefix in the id is an
internal marker distinguishing shipped sources from user-registered
connections — it is not shown in the UI.
Expanded Vocabulary (ICE.* Ontology)
The vocabulary follows the CCO ICE (Information Content Entity) trichotomy,
grounded in BFO via atelier-vocab.ttl:
ICE (root) ≡ cco:InformationContentEntity
├── ICE.NONSENSITIVE
│ ├── ICE.NONSENSITIVE.DESIGNATIVE ⊑ cco:DesignativeICE
│ │ ├── .NAME (.PERSON, .ORG, .PRODUCT, .SCIENTIFIC)
│ │ ├── .CODE (.ID, .ABBREV, .POSTAL)
│ │ ├── .GEO (.COUNTRY, .REGION, .CITY, .LOCATION)
│ │ ├── .REF (.CITATION, .VERSION, .SOURCE)
│ │ └── .TITLE
│ ├── ICE.NONSENSITIVE.DESCRIPTIVE ⊑ cco:DescriptiveICE
│ │ ├── .TEXT (.DESCRIPTION, .COMMENT, .ABSTRACT, .DEFINITION)
│ │ ├── .CATEGORICAL (.TYPE, .CATEGORY, .RANK, .LANGUAGE)
│ │ ├── .MEASUREMENT (~20 subtypes)
│ │ └── .TEMPORAL (.DATE, .YEAR, .DURATION, .PERIOD, …)
│ └── ICE.NONSENSITIVE.PRESCRIPTIVE ⊑ cco:PrescriptiveICE
│ └── .FORMAT, .FORMULA, .ROUTE, .ROLE
├── ICE.SENSITIVE
│ ├── ICE.SENSITIVE.PID (~40 leaves: CONTACT, IDENTITY, FINANCIAL, HEALTH)
│ ├── ICE.SENSITIVE.TECHNICAL (IPADDR, DEVID, URL, HOSTNAME, …)
│ └── ICE.SENSITIVE.BUSINESS (.TRADE_SECRET, .CONTRACT_VALUE, …)
└── ICE.METADATA
└── .TIMESTAMP, .RECID, .STATUS, .VERSION, .CREATED_BY, …
351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.
Design principle: every category is our own BFO-grounded term. External
sources (GitTables, meta-tagging) inform which conceptual space to cover;
we never import their raw tags. The mapping goes outward from our vocabulary
via atelier-vocab.ttl, not inward.
Sample Tables
25 mixed-domain tables with 316 columns (100 rows each). Tables are
deliberately cross-domain — a customers table contains identity,
contact, metadata, and categorical columns — so the classification
pipeline cannot rely on table name alone.
~25% of columns use opaque names (field_42, var_abc, col_xyz)
to exercise the pipeline’s ability to classify from values and context
rather than column name heuristics.
Generated by scripts/generate_sample_source.py. The curated
reference for the Sample source fixture is committed in
data/sample/reference_labels.json (scope: fixture-only, for OOTB
demo and unit tests).
For UAT / production evaluation, the curated reference lives at
build/meta-tagging-clean/curated_reference.csv (gitignored) — built
by scripts/parity/build_curated_reference.py from direct
reference-column evidence plus name-index lookup with
Ontology > Annotation > Common Names priority. UAT’s own
classification outputs are provisional predictions and are scored
against this curated reference at
build/results/parity/delta_report.md.
Auto-Import on First Boot
The gateway seeds the Sample source (id ootb-sample) via a FastAPI
lifespan context manager:
- Check if the ootb-sample source has any dataset versions
- If none, read sample_source_stats() (table count, column count)
- Create dataset version 1 with the stats as metadata
- Update source metadata JSON
This runs once at startup. If the database isn’t ready (migrations haven’t run), seeding is silently skipped.
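The seeding steps above could look roughly like the following. It is a sketch under stated assumptions: InMemoryDao and its method names are hypothetical, the local sample_source_stats() is a stub that returns the counts quoted in the Sample Tables section, and a RuntimeError stands in for whatever error an unmigrated database raises.

```python
import asyncio
from contextlib import asynccontextmanager

SAMPLE_SOURCE_ID = "ootb-sample"


class InMemoryDao:
    """Hypothetical stand-in for the real DAO."""

    def __init__(self, ready=True):
        self.ready = ready
        self.versions = {}         # source_id -> list of version dicts
        self.source_metadata = {}  # source_id -> metadata dict

    def list_dataset_versions(self, source_id):
        if not self.ready:
            raise RuntimeError("schema not migrated")
        return self.versions.get(source_id, [])

    def create_dataset_version(self, source_id, version, metadata):
        self.versions.setdefault(source_id, []).append(
            {"version": version, "metadata": metadata}
        )

    def update_source_metadata(self, source_id, metadata):
        self.source_metadata[source_id] = metadata


def sample_source_stats():
    # Stub: 25 tables and 316 columns, as stated for the Sample source.
    return {"tables": 25, "columns": 316}


def seed_sample_source(dao):
    """Return True when version 1 was created, False when seeding was skipped."""
    try:
        if dao.list_dataset_versions(SAMPLE_SOURCE_ID):
            return False  # already seeded
        stats = sample_source_stats()
        dao.create_dataset_version(SAMPLE_SOURCE_ID, 1, stats)
        dao.update_source_metadata(SAMPLE_SOURCE_ID, stats)
        return True
    except RuntimeError:
        return False  # database not ready: skip silently


def make_lifespan(dao):
    @asynccontextmanager
    async def lifespan(app):
        seed_sample_source(dao)  # runs once, before requests are served
        yield

    return lifespan
```

With FastAPI the returned callable would be passed as FastAPI(lifespan=make_lifespan(dao)); the seeding function itself stays framework-free so it can be unit-tested directly.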
API
REST Endpoints
| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List all data sources |
| /api/data-sources | POST | Create a new data source |
| /api/datasets?source_id=X | GET | List versions for a source |
| /api/datasets/{id}/activate | POST | Set a version as active |
| /api/vocabulary/stats?source_id=X | GET | Term count (source-aware) |
| /api/fsm/start?source_id=X | POST | Start pipeline for a source |
gRPC RPCs
| RPC | Description |
|---|---|
| ListDataSources() | List all sources |
| StartClassification(source_id=…) | Start pipeline for a source |
UI Integration
The Status page has two new cards:
- Data Source card: dropdown selector for sources + version table showing version number, column count, timestamp, and summary. Click a row to activate that version.
- Classification Pipeline card: “Start Classification” passes activeSourceId to /api/fsm/start?source_id=…
The Landing page stats cards reflect the active source:
- Terms: vocabulary size for the active source (316 for the Sample source)
- Entities: column count from the active dataset version
- Sources badge: shows count when multiple sources exist
DatasetContext
interface DatasetContextValue {
sources: DataSourceInfo[];
activeSourceId: string | null;
setActiveSourceId: (id: string) => void;
datasets: DatasetInfo[]; // for activeSourceId
activeDatasetId: string | null;
setActiveDatasetId: (id: string) => void;
refreshSources: () => Promise<void>;
refreshDatasets: () => Promise<void>;
}
Key Files
| File | Role |
|---|---|
| db/migrations/20260414…_data_sources_and_versions.sql | Schema migration |
| src/atelier/db/model.py | DataSource ORM model |
| src/atelier/db/dao.py | Source + version DAO methods |
| src/atelier/classify/sampler.py | load_sample_source(), sample_source_stats() |
| src/atelier/classify/taxonomy.py | load_sample_vocabulary() |
| src/atelier/classify/pipeline.py | Source-aware routing |
| src/atelier/gateway.py | REST endpoints + auto-import lifespan |
| data/sample/ontology.json | Expanded vocabulary (316 leaves) |
| data/sample/tables/*.csv | 25 sample tables |
| data/sample/reference_labels.json | 316-entry Sample-source fixture reference labels |
| build/meta-tagging-clean/curated_reference.csv (gitignored) | UAT-corpus curated reference |
| scripts/expand_vocabulary.py | Vocabulary expansion script |
| scripts/generate_sample_source.py | Sample table generation script |
| ui/src/contexts/DatasetContext.tsx | Source-aware React context |
| ui/src/pages/Status.tsx | Data source + version UI |
ML Artifact Management + Extend Classification
Each Atelier classify run trains a CatBoost classifier, optionally an
SVM classifier (synth-trained, with runtime LLM-mediated alignment to
the user vocabulary — see ontology_alignment.py), and (when
umap-learn handles the projection) a fitted UMAP reducer. The ML Artifact Set feature makes those
trained models first-class entities — registered in PG, listed in the
UI, and replayable on new data through a streamlined Extend
Classification pipeline that skips the LLM sweep, DST iteration, and
agent loop.
Why
The classify pipeline costs tens of minutes and (on Bedrock / Anthropic direct) tens of dollars per run. When the governance team adds new tables to an existing Hive database, or stands up a new Hive / Impala database with the same taxonomy, re-running the full pipeline is the wrong tool — there’s no new agent-mediated reference to learn from, and the LLM sweep adds nothing the trained CatBoost can’t reproduce at >100x speed. Extend Classification is the right shape: load the trained artifacts, predict on the new columns, write a parquet, register a new dataset. Done.
The data model deliberately tracks lineage in OpenLineage terminology
(Run → Job → Dataset → Facet) so Marquez or a similar lineage backend
can be wired in later without remodeling. The pathspec scheme (run
id-keyed artifact directories) borrows from Metaflow’s DataStore
addressing — every artifact resolves to
build/results/{run_id}/{filename}.
Concepts
| Term | What it is | Where it lives |
|---|---|---|
| Data Source | A configured source (Hive DB, Impala DB, OOTB Sample, filesystem mount). | data_sources table |
| Dataset | One classify or extend run’s output parquet, versioned per source. | datasets table |
| FSM Run | One pipeline invocation (classify or extend). | fsm_runs table |
| ML Artifact Set | The bundle a classify run produced: CatBoost (.cbm + .classes.json), optional SVM (.pkl + .classes.json), optional UMAP (.pkl), plus vocab signature and embedding-model identity. | ml_artifact_sets table |
| Active Artifact Set | The single ArtifactSet a future Extend run will use. | ml_artifact_sets.is_active (partial unique index enforces only-one-active) |
| Classify Run | The full LLM + DST + agent pipeline. Produces a Dataset AND an ArtifactSet. | run_kind = 'classify' on the dataset row |
| Extend Run | The streamlined ML-only pipeline. Consumes an ArtifactSet, produces a Dataset only. | run_kind = 'extend' |
Database schema
The migration 20260427000000_ml_artifact_sets.sql adds
ml_artifact_sets and three lineage columns on datasets:
ml_artifact_sets:
id, source_id (→ data_sources.id), fsm_run_id (→ fsm_runs.id),
parent_artifact_set_id (self-FK),
catboost_path, catboost_classes_path,
svm_path?, svm_classes_path?, umap_path?,
classes (JSON), feature_groups (JSON),
vocab_signature (sha256(sorted(classes))),
embedding_model, embedding_dim,
display_name, summary, is_active, is_archived,
facets (JSON, OpenLineage projection),
created_at
datasets (added):
artifact_set_id (→ ml_artifact_sets.id),
parent_dataset_id (→ datasets.id), -- extend lineage
run_kind ('classify' | 'extend')
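The vocab_signature column is described only as sha256(sorted(classes)). One way to realise that is sketched below; joining the sorted class codes with a newline before hashing is an assumption, since the schema does not say how the list is serialised.

```python
import hashlib


def vocab_signature(classes):
    """Order-independent fingerprint of a class list."""
    canonical = "\n".join(sorted(classes))  # assumed serialisation
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting first is what makes the signature stable under reordering, so two artifact sets trained on the same classes compare equal regardless of label order.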
The partial unique index
idx_ml_artifact_sets_one_active ON (is_active) WHERE is_active = TRUE
is the Postgres-side guarantee that only one row may be active globally
at any time. The DAO’s
set_active_artifact_set
runs the demote + promote in a single transaction so the index
constraint never sees two TRUE rows.
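The only-one-active guarantee can be tried out locally with SQLite, which also supports partial unique indexes. The sketch below is illustrative only: SQLite stands in for Postgres, the table is cut down to two columns, and the function body is an assumption about how the demote + promote could be written, not the DAO's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE ml_artifact_sets (
        id TEXT PRIMARY KEY,
        is_active INTEGER NOT NULL DEFAULT 0
    );
    CREATE UNIQUE INDEX idx_ml_artifact_sets_one_active
        ON ml_artifact_sets (is_active) WHERE is_active = 1;
    INSERT INTO ml_artifact_sets (id, is_active) VALUES ('run-a', 1), ('run-b', 0);
    """
)


def set_active_artifact_set(conn, artifact_set_id):
    """Demote the current active row and promote the target in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE ml_artifact_sets SET is_active = 0 WHERE is_active = 1")
        cur = conn.execute(
            "UPDATE ml_artifact_sets SET is_active = 1 WHERE id = ?",
            (artifact_set_id,),
        )
        if cur.rowcount != 1:
            raise LookupError(artifact_set_id)  # rollback restores the old active row


def active_ids(conn):
    rows = conn.execute("SELECT id FROM ml_artifact_sets WHERE is_active = 1")
    return [row[0] for row in rows]
```

Promoting a second row without demoting the first raises an integrity error, which is the database-side backstop the index provides.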
On-disk layout
Each classify run writes to build/results/{run_id}/:
build/results/{run_id}/
catboost_fit_to_llm.cbm # required
catboost_fit_to_llm.classes.json # required (classes + feature_groups sidecar)
svm_frontier.pkl # optional (skipped if fit-to-LLM didn't fire)
svm_frontier.classes.json # optional
umap.pkl # optional (only when CPU umap-learn was used)
atelier_embeddings.parquet # the dataset
classifications.json # full per-column results
evaluation_report.json # accuracy stats
settings_snapshot.json # config-at-start
taxonomy_findings.json # vocab QA
...
atelier.classify.artifact_set is the single point of knowledge about
this layout — it builds the artifact-set record from a run dir and
loads the bundle for an Extend run.
Pipeline writes (classify side)
At the end of EVALUATING, after the dataset row is upserted:
# pipeline.py (paraphrased)
parquet_path = _write_parquet(...) # also persists umap.pkl
# alongside via joblib
dao.upsert_dataset(
..., artifact_set_id=run_id, run_kind='classify',
parent_dataset_id=None,
)
spec = build_artifact_set_record(
run_id=run_id, results_dir=results_dir, cfg=cfg,
n_columns=len(classifications),
source_id=source_id, fsm_run_id=run_id,
)
if spec is not None:
dao.register_artifact_set(**spec)
The first registered artifact set on a fresh deploy auto-activates (idempotent — subsequent registrations don’t steal active). Registration failures are non-fatal: the dataset still ships.
Extend pipeline
atelier.classify.extend_pipeline.run_extend_classification orchestrates
the streamlined runner. Phase walk:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING
→ CLASSIFYING → FUSING → EVALUATING → CONVERGED
No new FSM states — SAMPLING → CLASSIFYING is already a legal
transition (the full pipeline uses the same edge when synthesis is
disabled). Production guards run BEFORE the FSM run is created:
- Artifact-set existence: DAO lookup must return non-NULL, non-archived row.
- File-existence preflight: every non-NULL path on the row must exist on disk (catboost + sidecar required; SVM / UMAP optional but when set must be present). Stale DB pointers fail fast.
- Embedding-model identity: the artifact’s embedding_model field must equal the runtime cfg.embedding_model. Catches the BGE-large vs MiniLM swap that would silently produce nonsense predictions.
- Vocab compatibility: surfaces in progress.vocab_compatibility as one of ok | superset | partial | disjoint. Warns but does NOT block (per the project decision); the artifact’s training classes drive the runtime taxonomy of the extend run.
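The first three guards are blocking checks and can be sketched as one preflight function. The row is modelled as a plain dict keyed by the ml_artifact_sets column names; the function name and the failure strings are hypothetical, and the non-blocking vocab-compatibility check is left out because its exact semantics are not spelled out here.

```python
import os

REQUIRED_PATHS = ("catboost_path", "catboost_classes_path")
OPTIONAL_PATHS = ("svm_path", "svm_classes_path", "umap_path")


def preflight(row, runtime_embedding_model):
    """Return a list of failures; an empty list means all guards pass."""
    if row is None or row.get("is_archived"):
        return ["artifact set missing or archived"]
    failures = []
    for key in REQUIRED_PATHS:
        if not row.get(key):
            failures.append(f"{key} is required but not set")
    for key in REQUIRED_PATHS + OPTIONAL_PATHS:
        path = row.get(key)
        if path and not os.path.exists(path):
            # Optional paths are only checked when the column is non-NULL.
            failures.append(f"{key} points at a missing file: {path}")
    if row.get("embedding_model") != runtime_embedding_model:
        failures.append(
            f"embedding model mismatch: artifact={row.get('embedding_model')!r}, "
            f"runtime={runtime_embedding_model!r}"
        )
    return failures
```

Collecting every failure instead of stopping at the first one gives the operator the full picture in a single response.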
Inference is intentionally simple — no DST iteration:
- CatBoost predict_proba per column → top-1 = primary prediction.
- (Optional) SVM predict_proba → second look; soft confidence haircut on disagreement.
- belief = top1_p, plausibility = top1_p + (1 − sum_top3), conflict = 0.0 (clear “ML-only inference” marker for the UI).
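The interval formula in the last bullet can be written out directly. The sketch below takes one row of class probabilities as a plain list; the function name is hypothetical, the clamp to 1.0 is an added safeguard against floating-point overshoot, and the optional SVM haircut is omitted because its size is not specified.

```python
def ml_only_interval(probabilities):
    """Belief interval for one column from a single probability vector."""
    ranked = sorted(probabilities, reverse=True)
    top1_p = ranked[0]
    sum_top3 = sum(ranked[:3])  # fewer than three classes is fine
    belief = top1_p
    # Mass outside the top three is treated as uncommitted and widens the interval.
    plausibility = min(1.0, top1_p + (1.0 - sum_top3))
    return {"belief": belief, "plausibility": plausibility, "conflict": 0.0}
```

For example, probabilities of 0.70, 0.20, 0.05 and 0.05 leave 0.05 outside the top three, so the interval is roughly [0.70, 0.75].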
UMAP transforms via bundle.umap_model.transform() when the bundle
includes a fitted reducer (lands in the parent run’s coordinate
space). Falls back to a fresh fit_transform when no UMAP was bundled
— Extend coordinates differ from the parent’s; the divergence is
recorded in settings_snapshot.json.
Gateway endpoints
GET /api/artifact-sets[?source_id=&include_archived=]
GET /api/artifact-sets/{id}
POST /api/artifact-sets/{id}/activate
POST /api/artifact-sets/{id}/archive
POST /api/artifact-sets/{id}/unarchive
GET /api/artifact-sets/{id}/compatibility?source_id=
POST /api/fsm/extend body: {source_id,
artifact_set_id,
parent_dataset_id?}
The /api/fsm/extend endpoint mirrors /api/fsm/start’s background-
thread plumbing so the existing /api/fsm/status polling carries the
run through to the UI without any new client-side wiring. Returns 404
synchronously when artifact_set_id is missing from the DB; 409 when
another FSM run is in flight.
UI
The Status page renders a new ML Artifacts panel between the
Classification Pipeline panel and the Data Source panel. Composition
mirrors DataSourceCard for visual continuity:
- Header (extra slot): active source / dataset indicator (read-only), Refresh button, Extend Classification primary button.
- Table columns: Active (radio) / Run ID (linked to overwatch when fsm_run_id is set) / Created / Summary / Models (CB / SVM / UMAP chips with informative tooltips) / Archive (trash icon).
The Data Source panel was reworked to match: it now has a leftmost
Active column with Radio cells, and the Version column lost its
inline [active] chip. Click row OR click radio → activate.
OpenLineage projection
atelier.classify.oplineage_emit.build_run_event projects an FSM run
into an OpenLineage event dict. The Job is atelier.classify or
atelier.extend_classify; the Run is fsm_runs.id. Outputs include
the parquet plus one Dataset entry per artifact file (CatBoost / SVM /
UMAP), each carrying a zndx_ml_artifact custom facet with framework,
vocab_signature, embedding_model, classes_count.
Extend runs additionally emit a ParentRunFacet linking back to the
classify run that produced the consumed ArtifactSet — the OpenLineage-
canonical way to express “this run is a descendant of run X”.
On day one we don’t wire the HTTP transport: the projection is pure, and
operators who configure Marquez later only need to add the POST
plumbing. The custom zndx_ml_artifact and zndx_extend_lineage
facets follow the OpenLineage custom-facet convention with _producer
and _schemaURL attributes pointing at our schemas.
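To make the shape of the projection concrete, here is a rough sketch of an event dict with a custom artifact facet and a parent-run facet. It is not the code in oplineage_emit: the function name, the "atelier" namespace, and both URLs are placeholders, and only the general OpenLineage layout (run, job, inputs, outputs, facets carrying _producer and _schemaURL) is relied on.

```python
from datetime import datetime, timezone

# Placeholder identifiers; the real producer and schema URLs are not given here.
PRODUCER = "https://example.invalid/atelier"
SCHEMA_URL = "https://example.invalid/schemas/{name}.json"


def facet(name, **fields):
    """Wrap custom fields with the two attributes every facet carries."""
    return {"_producer": PRODUCER, "_schemaURL": SCHEMA_URL.format(name=name), **fields}


def sketch_run_event(run_id, job_name, parquet_path, artifacts, parent_run_id=None):
    """Project one FSM run into an OpenLineage-shaped dict (pure, no I/O)."""
    outputs = [{"namespace": "atelier", "name": parquet_path, "facets": {}}]
    for path, meta in artifacts.items():
        outputs.append(
            {
                "namespace": "atelier",
                "name": path,
                "facets": {"zndx_ml_artifact": facet("zndx_ml_artifact", **meta)},
            }
        )
    run_facets = {}
    if parent_run_id is not None:
        # Extend runs point back at the classify run that produced the artifacts.
        run_facets["parent"] = facet(
            "ParentRunFacet",
            run={"runId": parent_run_id},
            job={"namespace": "atelier", "name": "atelier.classify"},
        )
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "producer": PRODUCER,
        "schemaURL": SCHEMA_URL.format(name="RunEvent"),
        "run": {"runId": run_id, "facets": run_facets},
        "job": {"namespace": "atelier", "name": job_name},
        "inputs": [],
        "outputs": outputs,
    }
```

Because the function only builds a dict, adding a transport later amounts to serialising the result and posting it to the lineage backend.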
BDD coverage
- features/agent/artifact_set.feature (tier-0): vocab signature determinism, signature stability under reordering, all four compatibility statuses (ok / superset / partial / disjoint).
- features/agent/extend_pipeline.feature (tier-1, @gpu): Extend produces a Dataset with run_kind='extend', the dataset references the consumed artifact set, the run NEVER invokes an LLM (structural proof: run_extend_classification doesn’t accept an llm_backend parameter), vocab compatibility surfaces, atlas-compatible files appear in the run dir.
- features/gateway/artifact_sets.feature (tier-1, @gpu): seven scenarios covering list / get / activate / compatibility / extend body validation / 404 paths.
Out of scope (deferred)
- Auto-prune retention policy for artifact sets (manual archive only).
- Cross-source vocab translation (mapping artifact’s classes onto a source with a different taxonomy).
- Full per-table input dataset expansion in OpenLineage events (currently emits one aggregate input dataset per source).
- HTTP transport for OpenLineage emission — pure projection only.
Deployment: Unseen Ontology, Known Schema
Operating principle: out here we iterate on public benchmarks; in CAI we execute to a customer-specified objective. The customer brings an unseen ontology shaped like a known annotations schema; the system has to produce classifications + calibrated belief intervals against that ontology without prior calibration.
This document captures the deployment-time invariants that the classification pipeline must honor, names the assumptions baked into today’s code that would fail against a sufficiently weird customer ontology, and proposes a roadmap milestone (M11 — Bring Your Own Vocabulary) that closes the remaining gaps and makes public-data iteration a test surface rather than a target for the same execution path.
Mode split — iteration vs. execution
| Dimension | Iteration mode (local / public data) | Execution mode (CAI deployment) |
|---|---|---|
| Vocabulary source | atelier-vocab.ttl (300 ICE leaves) + curated mappings + benchmark-specific class lists (SOTAB 82 Schema.org types, GitTables 122 DBpedia types, SemTab DBpedia hierarchy) | Customer’s default.annotations Hive table — opaque to us until run-time |
| Hierarchy depth | Known (5 levels for ICE, ~3 for DBpedia subset, varying for Schema.org) | Unknown — could be 1 (flat) or 8+ (deep regulatory taxonomy) |
| Hierarchy shape | Tree, single root | Tree assumed; multi-root forest, cycles, unbalanced subtrees all plausible |
| Validation labels | Curated reference (synth, meta-tagging UAT, GitTables CTA gold) | Often absent. Sometimes a small spot-check set; sometimes none. |
| Accuracy bar | Track records over time on published benchmarks | Customer-stated objective; calibration + sample review when no agent-mediated reference exists |
| BFO / CCO grounding | Available — we mapped 360 terms ourselves | Opportunistic — only if the customer’s ontology happens to carry a bfo_anchor / cco_anchor / schema_org_class / dbpedia_class column |
| Iteration latency | Tight (re-run with overlay tweaks; soak on devenv) | Wide (CAI session lifecycle; nautilus + overwatch loops the only mid-run feedback) |
The bridge between the two modes is structural: every iteration
target gets transformed into annotations-schema shape before the
pipeline sees it. SOTAB v2’s class list, GitTables’ 122 DBpedia
types, the OOTB sample’s 316 ICE leaves, the customer’s Hive table —
all four end up as a HierarchicalCategorySet built from a
list[ReferenceCategory] with parent_code edges, fed into the same
DST + agent + nautilus + overwatch stack. The execution path doesn’t
know or care which mode it’s running under.
The annotations schema contract (what’s stable)
load_annotations_from_hive reads SELECT * FROM default.annotations
and returns list[dict]. _normalize_annotations_row and
_build_category_set_from_records (both in
src/atelier/classify/taxonomy.py) translate that into a
HierarchicalCategorySet. The fields we already accept, in order of
preference per row:
| Field | Required | Purpose | Fallback when missing |
|---|---|---|---|
| code (or id / path / dot-path) | yes | Identity for tree navigation, DST focal element, Atlas type name | row dropped (we cannot classify into an unnamed term) |
| label (or display_name / name) | strongly preferred | Human surface in UI + LLM prompt | falls back to last component of code |
| parent_code | no | Explicit parent edge | derived from dot-path (A.B.C → parent = A.B) |
| description | no | LLM context, embedding text | empty |
| common_names (synonyms / aliases) | no | LLM expansion + embedding text | empty |
| notation | no | SKOS-style dot code (numeric or otherwise) | empty |
| abbrev | no | Mnemonic shortcode for leaves | empty |
| taxonomy | no | Namespace discriminator | "annotations" |
| sensitivity | no | Domain-specific classification metadata | absent |
The contract is structural, not semantic — we do not require any
particular set of root codes, depth, or ICE-trichotomy alignment. A
customer ontology rooted at LEGAL.PRIVILEGE.ATTORNEY_CLIENT is
structurally indistinguishable from one rooted at
ICE.SENSITIVE.PID.CONTACT.EMAIL from the algorithms’ point of view.
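The fallback rules in the table can be illustrated with a small normaliser. This is a sketch of the contract, not the body of _normalize_annotations_row: the function name and the exact key lists are assumptions, and only a few of the optional fields are shown.

```python
CODE_KEYS = ("code", "id", "path")              # assumed order of preference
LABEL_KEYS = ("label", "display_name", "name")  # assumed order of preference


def normalize_row(row):
    """Map one raw annotations row to a canonical dict, or None to drop it."""
    code = next((str(row[k]).strip() for k in CODE_KEYS if row.get(k)), None)
    if not code:
        return None  # an unnamed term cannot be classified into
    label = next((str(row[k]).strip() for k in LABEL_KEYS if row.get(k)), None)
    parent = row.get("parent_code")
    if not parent and "." in code:
        parent = code.rsplit(".", 1)[0]  # A.B.C -> A.B
    return {
        "code": code,
        "label": label or code.rsplit(".", 1)[-1],  # last dot-path component
        "parent_code": parent or None,
        "description": row.get("description") or "",
        "taxonomy": row.get("taxonomy") or "annotations",
    }
```

Nothing in the function inspects what the codes mean, which is the structural-not-semantic property described above.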
What’s already ontology-agnostic
Most of the algorithmic surface from v0.4.0-rc1 operates on the hierarchy as a graph, not on ICE-specific anchors:
- Parent-aware DST frame (mass_functions.py): votes at any node; fold-up uses HierarchicalCategorySet.descendants(code).
- Hierarchical cosine mass (Shafer §3): distributes embedding similarity through the graph regardless of root identity.
- Cross-subtree cautious_code (Smets §6): least-commitment promotion finds the deepest common ancestor on whatever tree is loaded.
- Belief-path tracing (belief.py::belief_path): walks parent_code chains; doesn’t care about labels.
- Indep-tier revisit gate: fires on consensus disagreement, not on a code-pattern match.
- Atlas type graph export (HierarchicalCategorySet.atlas_type_graph): turns any tree into Atlas Classification typedefs with superTypes chains.
- Validation (validate_taxonomy): collision + duplicate detector catches structural problems before classify starts (cycles emerge as parent_code self-reference; multi-root surfaces as multiple parent-less entries).
- Cautious-Code Review (cautious_review.py): agent-mediated backoff is structural; the agent reasons about depth-vs-confidence on whatever tree it sees.
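As an illustration of why these pieces are root-agnostic, the deepest-common-ancestor step behind cautious_code needs nothing but parent edges. The sketch below uses a plain dict from code to parent_code; the function names are hypothetical, and it assumes the tree has already passed validation (no cycles).

```python
def ancestors(code, parent_of):
    """Chain from the node itself up to its root, node first."""
    chain = [code]
    while parent_of.get(chain[-1]) is not None:
        chain.append(parent_of[chain[-1]])
    return chain


def deepest_common_ancestor(a, b, parent_of):
    """Most specific node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a, parent_of))
    for node in ancestors(b, parent_of):  # walks upward, so the first hit is deepest
        if node in seen:
            return node
    return None  # different roots: nothing in common
```

The same two functions work unchanged whether the tree is rooted at ICE or at LEGAL.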
The DST math doesn’t know it’s classifying PII. That’s a feature — it means the work we did on v0.4.0-rc1 transfers to deployment with zero algorithm changes.
What anticipates badly today
Five gaps that an unseen customer ontology will surface on first encounter:
1. Schema flexibility — column-name variants
_normalize_annotations_row matches a fixed set of column-name
candidates. Customers regularly bring annotation tables with names
like Class_Name, parent_class, category_definition,
sensitivity_tier, pii_category — none of which exactly match our
preferred field names. Today this falls through to silent drops or
empty fields.
Fix: extend the column-name normalization to be configurable and
fuzzy. Add a vocab_schema_map overlay setting that lets the
operator declare { "code": "Class_Name", "parent_code": "Parent" }
at run-time. Default behavior stays automatic via fuzzy matching on
common synonyms.
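A possible shape for this fix is sketched below: an explicit vocab_schema_map wins, and otherwise each canonical field is matched against the table's column names through a synonym list with fuzzy matching from the standard library. The synonym table, the function name, and the 0.8 cutoff are all illustrative assumptions.

```python
import difflib

# Hypothetical synonym lists, checked in order for each canonical field.
CANONICAL_SYNONYMS = {
    "code": ["code", "id", "path", "class_name"],
    "parent_code": ["parent_code", "parent", "parent_class"],
    "label": ["label", "display_name", "name"],
    "description": ["description", "definition", "category_definition"],
}


def resolve_columns(columns, vocab_schema_map=None, cutoff=0.8):
    """Map canonical field names to the customer's actual column names."""
    lowered = {c.lower(): c for c in columns}
    resolved = {}
    for field_name, synonyms in CANONICAL_SYNONYMS.items():
        if vocab_schema_map and field_name in vocab_schema_map:
            resolved[field_name] = vocab_schema_map[field_name]  # operator override
            continue
        for synonym in synonyms:
            match = difflib.get_close_matches(synonym, list(lowered), n=1, cutoff=cutoff)
            if match:
                resolved[field_name] = lowered[match[0]]
                break
    return resolved
```

Fields that resolve to nothing are simply absent from the result, so the caller can report them instead of silently dropping rows.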
2. Hierarchy-shape resilience
Today’s calibration assumes our 5-level ICE depth. Discount defaults
(cosine 0.20, llm 0.15, SVM 0.55), gap_threshold 0.15, and
cautious_review bel_threshold 0.85 are all tuned against that. A
customer’s flat 50-class taxonomy doesn’t need cautious-code review
(no parents to back off to), and an 8-deep regulatory taxonomy
demands tighter cautious thresholds (more depth × more places to be
wrong).
Fix: depth-aware defaults. Compute hierarchy statistics at LOADING_VOCAB time (max depth, mean branching factor, leaf/internal ratio); apply scaled defaults if the operator hasn’t overridden them. Surface the stats in the Status page so the operator sees what they got.
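The three statistics named in the fix are cheap to compute from parent edges alone. The sketch below assumes a dict from code to parent_code and a cycle-free hierarchy; the function name is hypothetical, and how the resulting numbers should scale the thresholds is left open, as it is in the proposal.

```python
def hierarchy_stats(parent_of):
    """Max depth, mean branching factor and leaf/internal ratio of a forest."""
    children = {}
    for code, parent in parent_of.items():
        children.setdefault(code, [])
        if parent is not None:
            children.setdefault(parent, []).append(code)

    def depth(code):
        levels = 1  # a root sits at depth 1
        while parent_of.get(code) is not None:
            code = parent_of[code]
            levels += 1
        return levels

    leaves = [c for c, kids in children.items() if not kids]
    internal = [c for c, kids in children.items() if kids]
    return {
        "max_depth": max(depth(c) for c in parent_of),
        "mean_branching": (
            sum(len(children[c]) for c in internal) / len(internal) if internal else 0.0
        ),
        "leaf_internal_ratio": (
            len(leaves) / len(internal) if internal else float("inf")
        ),
    }
```

A flat taxonomy shows up as max_depth 1 with no internal nodes, which is the signal that cautious-code review has nothing to back off to.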
3. Multi-root and cycle handling
Single-root tree is assumed in descendants / ancestors traversal.
Customer dumps can have multiple top-level concepts (a forest), or —
rarely but consequentially — a cycle introduced by data entry error.
Today: cycles cause infinite recursion in descendants; multiple
roots silently work because the traversal is parent-anchored, but
pre-classification tooling (Atlas export, vocabulary stats UI) breaks.
Fix: explicit multi-root support in
HierarchicalCategorySet. Cycle detection + clear error in
validate_taxonomy with the offending edge identified. Both behind a
feature flag so pathological customer data fails fast rather than
hangs.
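A cycle check that names the offending edge can be done iteratively, which also avoids the recursion problem described above. This is a minimal sketch over a code-to-parent_code dict; the function name and return shape are assumptions.

```python
def find_roots_and_cycle(parent_of):
    """Return (sorted roots, offending (child, parent) edge or None)."""
    roots = sorted(code for code, parent in parent_of.items() if parent is None)
    for start in parent_of:
        seen = {start}
        node = start
        while parent_of.get(node) is not None:
            parent = parent_of[node]
            if parent in seen:
                return roots, (node, parent)  # this edge closes the loop
            seen.add(parent)
            node = parent
    return roots, None
```

More than one entry in the returned roots means the input is a forest, which callers can accept or reject depending on the feature flag.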
4. Opportunistic CCO/BFO grounding
Customer ontologies that overlap with Schema.org / DBpedia / BFO /
CCO carry that overlap as metadata columns (e.g.,
schema_org_class, bfo_anchor, cco_class). Today we ignore
these. Wiring them lets us:
- Auto-validate the customer’s hierarchy against a known reference (warn on inconsistent BFO anchoring; e.g., a node mapped to cco:DesignativeICE whose children include a cco:Agent).
- Reuse our 360-term mapping for embedding-text enrichment (a customer term mapped to schema:Person borrows the full description from the Schema.org corpus).
- Bridge to Atlas BFO classifications when the customer’s governance team is ahead of ours (Cloudera Atlas now ships BFO alignment as of mid-2025).
Fix: optional bfo_anchor / cco_class / schema_org_class /
dbpedia_class columns in the annotations contract; when present,
the loader populates them on ReferenceCategory and the embedding +
LLM-prompt builders consume them. When absent, no behavior change.
5. Accuracy reporting without an agent-mediated reference
The customer often has no per-column gold-standard labels. Our
v0.4.0-rc1 evaluation pipeline assumes a curated_reference table
(or per-row reference_code field). When the customer doesn’t
provide one:
What we have: belief-gap distribution, mean K, cautious-code depth distribution, cross-source agreement counts, reasoning-trace attribution analyzer. These are calibration metrics, not accuracy.
What we need: a deployment-mode evaluation report that’s honest about the absence of an agent-mediated reference. Three-tier report:
- Internal consistency — DST K stats, belief-gap distribution, contraction rate. Always available. Tells the operator the pipeline converged.
- Sample review workflow — eject N highest-uncertainty columns and N highest-confidence columns to the UI for human spot-check. The operator’s accept/reject decisions feed an ad-hoc curated reference that grows over time. This is essentially what UAT reviewers were doing manually; we can formalize it.
- Public-benchmark proxy — when the customer’s ontology overlaps with SOTAB / GitTables / SemTab through opportunistic CCO grounding (gap #4), accuracy on the public benchmark serves as a conditional-confidence floor.
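The selection step of the sample review workflow might look like the following. It assumes each column has a (belief, plausibility) pair, measures uncertainty by the gap width and confidence by belief, and uses a hypothetical function name; the UI hand-off and the accept/reject bookkeeping are out of scope for the sketch.

```python
def select_for_review(intervals, n):
    """Pick the n widest-gap columns and the n highest-belief columns."""
    by_gap = sorted(
        intervals.items(), key=lambda item: item[1][1] - item[1][0], reverse=True
    )
    uncertain = [name for name, _ in by_gap[:n]]
    remaining = [(name, iv) for name, iv in intervals.items() if name not in uncertain]
    by_belief = sorted(remaining, key=lambda item: item[1][0], reverse=True)
    confident = [name for name, _ in by_belief[:n]]
    return uncertain, confident
```

Reviewing both ends checks two different failure modes: wide gaps show where evidence is thin, while high-belief picks catch confident mistakes.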
Public-data iteration as test surface
The principle: every public benchmark we adopt becomes a deployment-shape simulator, not a one-off integration. Concretely:
- SOTAB v2: wire as a classify source by transforming the 82 Schema.org type list into a HierarchicalCategorySet-shaped annotations table. The Schema.org type tree provides parent_code edges; our existing atelier-vocab.ttl mappings (schema:Person → cco:Agent, etc.) opportunistically populate the BFO/CCO anchors on the resulting ReferenceCategory rows. Pipeline runs against SOTAB tables exactly as it would against a customer Hive corpus.
- GitTables: same treatment. The 122 DBpedia types become a flat (or DBpedia-hierarchy-enriched) annotations table. Our 15 already-mapped DBpedia → CCO bridges populate anchors where they exist; the other 107 stay un-anchored (correct behavior for opportunistic grounding).
- SemTab annual: register the system, produce the annotations table from each year’s vocabulary release, evaluate against the cscore metric (which natively rewards our cautious_code).
- Customer schema simulators: synthetic annotation tables that test specific deployment shapes: flat 50-class taxonomy (legal exemption codes), 8-deep regulatory tree (HIPAA subcategories), forest with 3 roots (multi-domain governance). These exercise the M11 shape-resilience work without needing real customer data.
Each iteration target ships as a data_sources row + a loader (one
function each) + an annotations table built from the benchmark’s
class list. None of them need pipeline-side knowledge.
M11 — Bring Your Own Vocabulary (proposed)
A milestone that delivers ontology-agnostic execution with the five gaps above closed:
- Configurable vocab schema mapping: overlay setting + fuzzy default; surfaces in Status when applied.
- Depth-aware default calibration: compute hierarchy stats at load; scale gap_threshold + cautious bel_threshold.
- Multi-root + cycle support: explicit; behind feature flags that fail loudly when violated.
- Opportunistic anchor columns: bfo_anchor / cco_class / schema_org_class / dbpedia_class consumed when present.
- Three-tier deployment evaluation report: internal consistency / sample review workflow / public-benchmark proxy.
- SOTAB v2 + GitTables wired as test sources: proof that the same execution path handles two published benchmarks plus the customer’s Hive table without code changes per target.
The work is concrete and bounded — roughly two focused sessions (taxonomy.py + pipeline.py extensions; one loader + one fixture test per benchmark). Stronger leverage than a feature-by-feature roadmap because every fix lands on the existing structural abstraction rather than introducing new mechanisms.
Out of scope (deferred)
- Cross-customer ontology learning. Two customers with similar regulatory domains might benefit from shared inferences; we explicitly do not transfer learning across deployments. Each customer’s session is a closed world.
- Customer-driven hierarchy editing. The annotations table is a contract the customer controls upstream of Atelier. We don’t ship UI for editing it.
- Ontology auto-discovery. Inferring a hierarchy from unannotated data tables (clustering plus LLM proposes a tree) is a research direction in its own right; out of scope for M11.
Open questions
- What if the customer brings two annotations tables? A primary domain vocabulary (hipaa.annotations) and a generic PII overlay (atelier.annotations). Today’s pipeline takes one. M11 should consider compose_vocabularies in the loader path, but it changes the meaning of “the customer’s ontology”, so it needs a design conversation.
- Embedding-model robustness across languages. A German / Japanese / Mandarin annotations table will produce shorter embedding-text and weaker cosine signal at MiniLM-L6 scale. Bigger embedding models (BGE-large, E5-mistral) help but inflate per-run cost. Defer to a separate i18n milestone.
- Atlas BFO sync. Cloudera’s Atlas team is shipping BFO alignment. Once stable, our atelier-vocab.ttl ↔ Atlas classification typedef mapping should round-trip without loss (we ship to Atlas; Atlas hands back BFO-anchored entities; we read them as opportunistic anchors per gap #4). Wait for Atlas BFO general availability before wiring.
Cross-references
- Classification Pipeline: the execution path being made ontology-agnostic.
- DST Evidence Independence: the numerical-methods framing that already operates on arbitrary hierarchies.
- Pareto Capability Evolution: the longer-horizon search-space that builds on M11.
- src/atelier/classify/ontology/README.md: the BFO/CCO substrate that opportunistic anchoring lifts into.
- src/atelier/classify/taxonomy.py: _normalize_annotations_row, _build_category_set_from_records (the adapter layer).
- src/atelier/classify/sampler.py: load_annotations_from_hive, load_annotations_from_json, load_annotations_from_filesystem (the source-shape variants).
SOTAB v2 Coverage Strategy
Ownership note (2026-05-09). Going forward, all ontology / vocabulary / synthetic-data work moves to Ægir. The label space conditions model pre-training directly, so it lives where the model lives. Atelier becomes the consumer of trained artifacts (H-Net/RWKV checkpoints + SVMs trained on Ægir-curated datasets). This document stays in Atelier’s docs as the specification of what we want covered; the actual TTL extensions, generators, and SOTAB integration implementation belong in
~/local/src/zndx/aegir/.
This document specifies how the BFO/CCO-grounded vocabulary should cover the SOTAB v2 Schema.org CTA label space (82 labels), so the hierarchical RWKV-7 model in Ægir can ladder predictions up from raw benchmark labels to BFO/CCO concepts.
Background
- Atelier (current): ICE trichotomy (Designative / Descriptive / Prescriptive) grounded through Common Core Ontologies into BFO 2020. 20 Schema.org subjects mapped today (11 classes, 9 properties) in src/atelier/classify/ontology/atelier-vocab.ttl. This snapshot remains operational for the existing classification pipeline during the migration window.
- Atelier (future): consumer of pre-trained model artifacts. Loads H-Net/RWKV checkpoints and SVMs trained in Ægir against ontology-grounded label spaces; uses them as evidence sources in DST fusion. No longer owns vocabulary.
- Ægir (current → future ontology home): hierarchical RWKV-7 model targeting CTA + CPA on wide tables, trained against gt-signals-dbpedia and SOTAB v2. SOTAB infrastructure already wired: scripts/download_sotab.py fetches the four canonical bundles, src/aegir/data/table_dataset.py loads sotab_v2_cta_*_set.csv reference labels. Inheriting atelier-vocab.ttl + synth generators is part of the M2 roadmap.
- Synthetic data pipeline: currently in Atelier (synth_generators.py, 316+ generators). Migration target: Ægir, since the generators feed pre-training corpora directly. Atelier’s classification pipeline can consume generator output via a thin client during transition.
Authoritative SOTAB v2 label space
Verified against /raid/datasets/sotab/sotab_v2_cta_*_set.csv (union of
training, validation, test, and the three robustness test sets:
corner_cases, missing_values, format_heterogeneity):
82 distinct CTA labels covering 17 root entity types
Root entity types (from webdatacommons.org/structureddata/sotab/v2/,
Table 2): Book, CreativeWork, Event, Hotel, JobPosting, LocalBusiness,
Movie, Museum, MusicAlbum, MusicRecording, Person, Place, Product,
Recipe, Restaurant, SportsEvent, TVEpisode.
The 82 labels are a mix of:
- Class names: Country, MonetaryAmount, Organization, etc.
- Entity-property pairs: Book/name, Hotel/description, JobPosting/description (the slash separates entity type from the property whose value the column carries).
- Measurement units: Distance, Duration, Energy, Mass.
- Enumeration types: BookFormatType, EventStatusType, GenderType, RestrictedDiet.
- Coded attribute types: CoordinateAT, IdentifierAT, MusicArtistAT (the AT suffix denotes “atomic type”, a SOTAB convention, not Schema.org).
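Two of these label shapes are recognisable from syntax alone, which a loader can exploit before any ontology lookup. The sketch below is a hypothetical helper: it splits entity-property pairs on the slash and strips the AT suffix, and it deliberately lumps class names, measurement units and enumeration types together because telling them apart needs the vocabulary, not the string.

```python
def parse_sotab_label(label):
    """Classify a SOTAB CTA label by its surface form only."""
    if "/" in label:
        entity, prop = label.split("/", 1)
        return {"kind": "entity_property", "entity": entity, "property": prop}
    if label.endswith("AT") and len(label) > 2:
        # Assumption: a trailing "AT" always marks a coded attribute type.
        return {"kind": "coded_attribute", "base": label[:-2]}
    return {"kind": "type", "name": label}
```

For example, Book/name yields the entity Book with property name, and IdentifierAT yields the base Identifier.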
Ægir’s stale _LABEL_DIMS["sotab"] = 91 comment in
src/aegir/data/table_dataset.py should be reduced to 82; the extra 9
appear to be carry-over from an earlier label set draft.
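The label count above can be re-derived mechanically. The following is a minimal sketch, assuming each reference CSV has a header row with a column named label (the column name and the helper distinct_cta_labels are assumptions, not taken from the Ægir loader):

```python
import csv
from pathlib import Path
from typing import Iterable


def distinct_cta_labels(csv_paths: Iterable[Path], label_column: str = "label") -> set[str]:
    """Return the union of non-empty label values across the given CSV files.

    `label_column` is an assumed header name; adjust it to the real schema.
    """
    labels: set[str] = set()
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                value = (row.get(label_column) or "").strip()
                if value:
                    labels.add(value)
    return labels
```

Pointing it at Path("/raid/datasets/sotab").glob("sotab_v2_cta_*_set.csv") and taking len() of the result should reproduce the figure quoted above if the column assumption holds.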
Coverage analysis
Direct hits (14 of 82)
Already grounded in atelier-vocab.ttl:
| SOTAB label | Atelier mapping |
|---|---|
| Country | schema:Country ⊑ BFO:Site |
| CreativeWork | schema:CreativeWork ⊑ cco:ICE |
| CreativeWork/name | schema:name (rdf property; we have it) |
| Event/description, Event/name | schema:Event + schema:description/schema:name |
| MonetaryAmount | schema:MonetaryAmount ⊑ cco:DescriptiveICE |
| Organization | schema:Organization ⊑ cco:Organization |
| Person/name | schema:Person ⊑ cco:Person + schema:name |
| Place/name | schema:Place ⊑ BFO:Site + schema:name |
| PostalAddress | schema:PostalAddress ⊑ cco:DesignativeICE |
| QuantitativeValue | schema:QuantitativeValue ⊑ cco:DescriptiveICE |
| email | schema:email ⊑ cco:DesignativeICE |
| telephone | schema:telephone ⊑ cco:DesignativeICE |
| URL | schema:url (we use lowercase; SOTAB uses URL) |
Subsumption-reachable (~20 of 82)
Subclasses of types already grounded — adding them is a single
rdfs:subClassOf edge under the existing CCO branch:
schema:CreativeWork descendants (8): Book/description, Book/name,
BookFormatType, Movie/description, Movie/name, Recipe/description,
Recipe/name, MusicAlbum/name, MusicRecording/name, TVEpisode/name,
CreativeWorkSeries, Photograph, Review.
schema:Organization descendants (5): Hotel/description, Hotel/name,
LocalBusiness/name, Museum/name, Restaurant/name, SportsTeam.
schema:Event descendants (1): SportsEvent/name.
schema:PostalAddress sub-properties (4): addressLocality,
addressRegion, postalCode, streetAddress.
schema:QuantitativeValue measurement subtypes (4 + 1):
Distance, Duration, Energy, Mass, weight (column-as-property).
Missing — requires new vocab work (~48 of 82)
Grouped by extension target:
| Group | SOTAB labels | Target CCO/BFO grounding |
|---|---|---|
| Product family (new branch) | Product/description, Product/name, ProductModel, Brand | cco:Artifact + cco:ArtifactModel (Prescriptive territory) |
| Job posting family | JobPosting/description, JobPosting/name, OccupationalExperienceRequirements, EducationalOccupationalCredential, workHours, paymentAccepted | cco:DescriptiveICE (descriptive content about employment) |
| Product economics | price, priceRange, currency, DeliveryMethod, ItemAvailability, OfferItemCondition | Mix: cco:DescriptiveICE (price/range), enumerations under cco:DesignativeICE (DeliveryMethod, ItemAvailability) |
| Temporal granularities | Date, DateTime, Time, DayOfWeek | cco:DescriptiveICE temporal subtree (refinements of existing TIMESTAMP) |
| Generic data types | Number, Boolean, Language | Mix: cco:DescriptiveICE (Number, Boolean), cco:DesignativeICE (Language code) |
| Enumerations | CategoryCode, EventStatusType, EventAttendanceModeEnumeration, GenderType, RestrictedDiet, BookFormatType | cco:DesignativeICE (coded value identifiers) |
| Person/identity attributes | GenderType, MusicArtistAT | cco:DescriptiveICE (gender), cco:Person (artist) |
| Place attributes | CoordinateAT, LocationFeatureSpecification, openingHours | cco:DescriptiveICE (coordinates, hours), cco:DescriptiveICE (features) |
| Annotations / commentary | category, label, Rating, Review, ItemList | cco:DescriptiveICE |
| Measurement helpers | unitCode, unitText | Properties of schema:QuantitativeValue (we have); add as named properties |
| Communication channel | faxNumber | cco:DesignativeICE (sibling of telephone) |
| Attribute-typed identifiers | IdentifierAT | cco:DesignativeICE (sibling of schema:identifier) |
Three-tier extension strategy
Tier-A — measurement zoo (lowest effort, highest leverage)
Add ~10 schema:QuantitativeValue subclasses under cco:DescriptiveICE:
schema:Distance rdfs:subClassOf cco:ont00000853 . # Descriptive ICE
schema:Duration rdfs:subClassOf cco:ont00000853 .
schema:Energy rdfs:subClassOf cco:ont00000853 .
schema:Mass rdfs:subClassOf cco:ont00000853 .
schema:Speed rdfs:subClassOf cco:ont00000853 .
schema:Temperature rdfs:subClassOf cco:ont00000853 .
Plus property-level: unitCode, unitText, weight.
Implementation: ~10 lines in atelier-vocab.ttl, ~5 generators in
synth_generators.py (already have NUMERIC.* generators that can be
re-keyed to schema URIs).
SOTAB labels covered: 10 (Distance, Duration, Energy, Mass, weight, unitCode, unitText, Number, Boolean, plus one ancillary).
Tier-B — subclass plumbing (CreativeWork + Organization + Event subtrees)
Single-edge additions for entity types already grounded at parent level:
schema:Book rdfs:subClassOf schema:CreativeWork .
schema:Movie rdfs:subClassOf schema:CreativeWork .
schema:Recipe rdfs:subClassOf schema:CreativeWork .
schema:MusicAlbum rdfs:subClassOf schema:CreativeWork .
schema:MusicRecording rdfs:subClassOf schema:CreativeWork .
schema:TVEpisode rdfs:subClassOf schema:CreativeWork .
schema:Photograph rdfs:subClassOf schema:CreativeWork .
schema:Review rdfs:subClassOf schema:CreativeWork .
schema:Hotel rdfs:subClassOf schema:Organization .
schema:LocalBusiness rdfs:subClassOf schema:Organization .
schema:Museum rdfs:subClassOf schema:Organization .
schema:Restaurant rdfs:subClassOf schema:Organization .
schema:SportsTeam rdfs:subClassOf schema:Organization .
schema:SportsEvent rdfs:subClassOf schema:Event .
Implementation: ~14 lines in atelier-vocab.ttl, ~14 SSSOM annotation
blocks, ~14 generators in synth_generators.py.
SOTAB labels covered: ~20 (entity-property pairs cascade through
parent’s name/description mappings).
Tier-C — Product branch + JobPosting + economics
Largest single addition; introduces cco:Artifact lineage:
schema:Product rdfs:subClassOf cco:Artifact . # NEW branch
schema:ProductModel rdfs:subClassOf cco:ArtifactModel . # NEW
schema:Brand rdfs:subClassOf cco:DesignativeICE . # NEW
schema:JobPosting rdfs:subClassOf cco:DescriptiveICE . # NEW
Plus property-level mappings: price, priceRange, currency,
paymentAccepted, DeliveryMethod, ItemAvailability,
OfferItemCondition, workHours, OccupationalExperienceRequirements,
EducationalOccupationalCredential.
Plus temporal refinements: Date, DateTime, Time, DayOfWeek as
subproperties of existing TIMESTAMP lineage.
Plus enumerations: CategoryCode, EventStatusType,
EventAttendanceModeEnumeration, GenderType, RestrictedDiet,
BookFormatType (~6 enumeration classes).
Implementation: ~30 lines in atelier-vocab.ttl, comparable SSSOM
annotation overhead, ~25 new synth generators (Product family is its own
generator pack: SKU, brand, model, GTIN, etc.).
SOTAB labels covered: remaining ~48.
Cumulative coverage after all three tiers
100% of the 82 SOTAB v2 Schema.org CTA labels mapped to BFO/CCO grounding,
with provenance trails (SSSOM sssom:object_label axioms) for every
mapping.
Ownership flow (post-migration)
| Concern | Owner | Notes |
|---|---|---|
| Vocabulary IRIs + CCO/BFO grounding | Ægir (target) | atelier-vocab.ttl migrates to aegir/src/aegir/ontology/ |
| Synth value generators | Ægir (target) | Generator output feeds pre-training corpora directly |
| SOTAB-label → vocab-IRI lookup | Ægir | aegir-vocab exposes label↔IRI map; consumed by training + inference |
| SOTAB v2 download + extraction | Ægir | scripts/download_sotab.py (already wired) |
| CTA/CPA dataset loaders | Ægir | src/aegir/data/table_dataset.py (already wired) |
| Model training + evaluation | Ægir | train.py, src/aegir/models/heads.py::AegirForColumnAnnotation |
| Per-class F1 + Pareto evaluation | Ægir | M2 roadmap entry (per-class F1 bars in leaderboard UI) |
| BFO-grounded prediction emission | Ægir | Leaderboard predicts SOTAB label AND emits its CCO/BFO ancestry |
| Trained checkpoint consumption | Atelier | New: load H-Net/RWKV + SVM artifacts as DST evidence sources |
| DST evidence fusion + classification pipeline | Atelier | Unchanged — trichotomy + belief/plausibility logic stays |
| Gateway + UI + governance integration | Atelier | Unchanged |
During the migration window (until ontology fully relocates), atelier
keeps its operational atelier-vocab.ttl snapshot. The concrete
contract Ægir publishes to atelier becomes a vocab_label_map.json
(IRI + BFO ancestry per SOTAB label) plus the trained model checkpoints
themselves.
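As an illustration of what consuming that contract could look like on the Atelier side, here is a minimal sketch. The JSON shape (one object per SOTAB label with iri and bfo_ancestry keys) and the names LabelGrounding and parse_label_map are assumptions for this example; the actual file layout is not specified above.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelGrounding:
    """Grounding of one SOTAB label (hypothetical record shape)."""
    iri: str
    bfo_ancestry: tuple[str, ...]


def parse_label_map(text: str) -> dict[str, LabelGrounding]:
    """Parse the label map JSON into typed records.

    Assumed shape: {"<label>": {"iri": "...", "bfo_ancestry": ["...", ...]}}.
    Raises KeyError if an entry is missing either field.
    """
    raw = json.loads(text)
    return {
        label: LabelGrounding(iri=entry["iri"], bfo_ancestry=tuple(entry["bfo_ancestry"]))
        for label, entry in raw.items()
    }
```

Failing loudly on a missing field is a deliberate choice in this sketch: a label without ancestry would silently weaken the BFO-grounded emission described in the ownership table.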
Aegir touchpoints (informative, not prescriptive)
The work in Ægir, in roadmap terms, lands inside its M2 milestone (“external-baseline harness, ontology editor with Postgres write paths, per-class F1 bars”):
- src/aegir/data/table_dataset.py: fix the stale _LABEL_DIMS["sotab"] = 91 to 82; add a label_to_iri resolver that consumes the shared sotab_label_map.json.
- scripts/sotab_diagnostic.py: extend representation-collapse diagnostics to surface per-tier (A/B/C) coverage of predictions, so we can see whether collapses correlate with vocab gaps.
- Leaderboard gateway (src/aegir/gateway/app.py): /api/ontology endpoint already exists; extend its response to include the BFO ancestry of each predicted label.
- src/aegir/models/heads.py::AegirForColumnAnnotation: no model change needed for tier work; the head already operates on a (num_labels,) output, and 82 vs 91 is just a config delta.
The pretraining work documented in
aegir/docs/notes/2026-04-19/234700_sotab_diagnostic_representation_collapse.md
(model collapses to single embedding point on SOTAB-small) is orthogonal
to this strategy — it’s a model issue, not a vocabulary issue. Vocab
extension proceeds independently and should improve the post-collapse
ceiling once representations are healthy.
Synthetic data pipeline implications
The synth framework (synth_generators.py, 316+ hand-coded generators
plus the three-layer registry) migrates to Ægir with the rest of the
ontology work. Ægir-resident synth gives pre-training direct access to
generator output without crossing repo boundaries. After Tier-A/B/C
extensions:
- Tier-A adds measurement generators: DURATION (ISO-8601 strings), MASS (with unit suffix), DISTANCE, ENERGY. These are mostly numeric with unit annotations. Existing NUMERIC.* generators can be re-keyed.
- Tier-B adds entity-name generators: BOOK_TITLE, MOVIE_TITLE, RECIPE_NAME, HOTEL_NAME, etc. These cascade through the registry’s template priority (priority 2): once Ægir has ~50 real Book/name samples from SOTAB itself, the registry generates plausible book titles via perturbation.
- Tier-C adds product attribute generators: SKU, Brand, GTIN, ProductModel. These are domain-specific and benefit from hand-coded generators (priority 1) seeded with realistic patterns.
Atelier’s classification pipeline, post-migration, can either (a) call into Ægir’s synth via a thin client during local dev, or (b) bundle a generator snapshot at release time. The decision depends on whether Atelier’s BDD/pytest scenarios remain self-contained or are content to require Ægir as a sibling repo.
Verification
Coverage is mechanically verifiable via SPARQL totality:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX cco: <https://www.commoncoreontologies.org/>
PREFIX schema: <https://schema.org/>

# Every SOTAB label must have a path to cco:InformationContentEntity (or descendant).
SELECT ?label WHERE {
  VALUES ?label { schema:Distance schema:Duration schema:Mass ... }  # all 82
  FILTER NOT EXISTS {
    ?label rdfs:subClassOf+ cco:ont00000958 .  # cco:InformationContentEntity
  }
}
# Empty result == 100% coverage.
This goes in src/atelier/classify/ontology/sparql/sotab_totality.rq
once Tier-A lands.
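For quick checks without a triple store, the same totality test can be approximated in plain Python. This is a sketch under stated assumptions: the subclass edges are passed in as a dictionary (in practice they would be extracted from atelier-vocab.ttl), and ungrounded_labels is a hypothetical helper, not part of either repo.

```python
from typing import Iterable, Mapping, Sequence


def ungrounded_labels(
    labels: Iterable[str],
    subclass_of: Mapping[str, Sequence[str]],
    root: str,
) -> list[str]:
    """Return labels with no rdfs:subClassOf+ path (one or more edges) to `root`."""

    def reaches_root(start: str) -> bool:
        seen: set[str] = set()
        stack = list(subclass_of.get(start, ()))
        while stack:
            node = stack.pop()
            if node == root:
                return True
            if node in seen:
                continue  # guards against cycles in the edge set
            seen.add(node)
            stack.extend(subclass_of.get(node, ()))
        return False

    return [label for label in labels if not reaches_root(label)]
```

An empty return value corresponds to the empty SPARQL result, i.e. full coverage for the labels supplied.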
Status
- Strategy doc: this file (2026-05-09).
- Ownership migration: Ægir takes over ontology / vocab / synth.
- Tier-A implementation: Ægir M2 — vocab edits + SSSOM annotations + SPARQL totality query + measurement generators.
- Tier-B implementation: Ægir M2.
- Tier-C implementation: Ægir M3.
Atelier’s contribution post-migration is consumption-side: load Ægir’s trained checkpoints as DST evidence sources, surface BFO ancestry via the gateway/UI, integrate predictions into the existing belief/plausibility fusion machinery. The vocabulary itself, and the work to extend it, lives next to the model that uses it.
Pareto Capability Evolution (Roadmap)
Status: research-shaped capstone milestone. No incremental rollout — we ship it whole when the pieces converge.
This document proposes a long-horizon evolution of the Atelier classification pipeline from a single-config bootstrap loop into a multi-objective, population-based search over the policy space (LLM prompts, classifier hyperparameters, fusion strategy). The framing is rooted in three bodies of work — Active Learning, Automatic Prompt Optimization (APO), and GEPA — each of which already maps cleanly onto a piece of what we ship today.
Why this is a capstone, not a feature
The current bootstrap loop is already an active-learning system, just informally named. We sweep with an Opus oracle, fuse with Dempster-Shafer, revisit disagreements, and retrain incrementally, all under a single configuration. Operators have started asking the next question: could we have run with a tighter belief gap, fewer LLM tokens, deeper cautious predictions? Each answer requires re-running with different settings. We need a search procedure that can carry this load without forcing operators to hand-tune one knob at a time.
We ship this when the prerequisite pieces converge:
- The reasoning model in overwatch/agent.py stabilizes as a reliable proposer of structured configuration edits (prompt diffs and JSON patches over the config tree, not free-form advice).
- We have enough corpus diversity in data_sources to evaluate candidates against generalization, not point estimates on one source.
- A persistent population store (the config leaderboard) is in place so evolution survives gateway restarts and CAI session boundaries.
Until those land, individual ideas in this doc may be borrowed in isolation (e.g. an APO-only loop that evolves a single sweep prompt against accuracy). The capstone is the integrated whole — the borrowed pieces alone do not constitute “Pareto Capability Evolution”.
Foundations
Active Learning — the paradigm we already implement
Active learning minimizes label cost by querying an oracle on examples the model is most uncertain about (Settles 2009). Mapped onto the Atelier pipeline:
| Active Learning concept | Atelier component |
|---|---|
| Oracle | Opus during sweep + revisit (pipeline.py::_llm_sweep, _llm_revisit) |
| Labeled pool (T_K) | Synth corpus + curated reference + accumulated LLM labels |
| Unlabeled pool (T_U) | Discovered source columns awaiting classification |
| Query strategy | Belief-gap-driven revisit selection (largest Pl − Bel) |
| Query-by-committee | Disagreement between CatBoost-fit-to-LLM and the synth-trained SVM (via the ICE→user alignment) |
| Pool vs. stream | Pool-based — Monte Carlo stratification picks each batch |
| Stopping criterion | mean_gap < gap_threshold OR max_iterations reached |
| Cold-start mitigation | Synth pre-training + pattern evidence on first sweep |
The active-learning incorporation of new oracle labels is
concentrated in the catboost source (fit_to_llm mode trains
on the live LLM labels mid-run). The SVM was previously also part
of this active-learning loop via the M9 frontier_svm retrain,
but that path was excised on 2026-05-04 (commits 8627c2c, 5199379,
cc59d01) for the independence reasons documented in
ontology_alignment.py. The SVM now contributes a label-stable
TF-IDF view that complements the live-LLM-aligned CatBoost view.
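The query strategy and stopping criterion from the table can be sketched in a few lines. This is an illustrative sketch only: BeliefInterval, select_for_revisit, and should_stop are hypothetical names, and the real selection in pipeline.py also involves Monte Carlo stratification, which is omitted here.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class BeliefInterval:
    """Belief interval [bel, pl] for one column's leading hypothesis."""
    column: str
    bel: float
    pl: float

    @property
    def gap(self) -> float:
        return self.pl - self.bel


def select_for_revisit(intervals: Sequence[BeliefInterval], batch_size: int) -> list[BeliefInterval]:
    """Pick the columns with the widest Pl - Bel gap (largest first)."""
    return sorted(intervals, key=lambda iv: iv.gap, reverse=True)[:batch_size]


def should_stop(
    intervals: Sequence[BeliefInterval],
    gap_threshold: float,
    iteration: int,
    max_iterations: int,
) -> bool:
    """Stop when mean_gap < gap_threshold OR the iteration budget is spent."""
    mean_gap = sum(iv.gap for iv in intervals) / len(intervals) if intervals else 0.0
    return mean_gap < gap_threshold or iteration >= max_iterations
```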
Automatic Prompt Optimization — APO and GEPA
Both APO (Microsoft Agent Lightning) and GEPA (Agrawal et al., ICLR 2026) optimize LLM prompts via reflection-driven mutation: the LLM diagnoses its own failures in natural language and proposes prompt edits, which are then evaluated against held-out tasks. They differ on search shape:
| Dimension | APO | GEPA |
|---|---|---|
| Search structure | Beam (default width 4) | Pareto frontier (open-ended population) |
| Objective | Single scalar reward | Multi-objective, non-dominated sorting |
| Mutation | Textual gradient → LLM-edit | LLM reflection + cross-candidate recombination |
| Targets | One prompt template at a time | One or more prompts; full system policy |
| Scope | “Pick the best system prompt” | “Discover diverse strategies and combine them” |
| Sample efficiency | Not benchmarked vs. RL | 35× fewer rollouts than GRPO; +6–20% over MIPROv2 |
For Atelier, APO is the right shape for narrow optimizations (tune one sweep prompt against accuracy on a known corpus). GEPA is the right shape for the capstone: we have multiple operator-relevant objectives (accuracy, calibration, cost, coverage, latency), and we benefit from preserving complementary policies rather than collapsing to a single configuration.
We treat APO and GEPA as peer techniques. APO is invoked when one objective clearly dominates and beam search is sufficient; GEPA is invoked when objectives trade off and the frontier’s diversity is itself the asset. Both share the same reflection-engine plumbing.
The synthesis — Pareto Capability Evolution
The capstone integrates AL, APO/GEPA, and population-based search into one loop:
- Active learning drives label acquisition within each candidate run (the existing bootstrap loop, unchanged).
- Reflection-driven mutation drives proposal of new pipeline configurations: prompt edits, classifier knobs, fusion swaps.
- Pareto sorting decides which configurations survive into the next generation.
The reflection model is the same Opus instance already wired for overwatch — it reads the convergence report of a finished run and proposes targeted edits to the configuration that produced it.
Pipeline policy space
Mutation targets the configuration tuple, not just the prompt:
- LLM prompts: sweep template, revisit template, classification subagent system prompt.
- Classifier hyperparameters: CatBoost depth, learning rate, class weights; SVM C and kernel; SVM-vs-LLM blend ratio in DST mass construction.
- Fusion strategy: Dempster vs. Yager; gap threshold; bel-floor; pignistic vs. cautious decision rule; cautious depth threshold.
- Search budget: sweep batch size, max bootstrap iterations, Monte Carlo stratification fraction, revisit triggers.
- Pattern evidence weights: per-pattern mass discount, evidence layering order.
- Embedding choice: MiniLM-L6 (today’s default) vs. BGE-large vs. E5-mistral — bounded by the embedding-model identity check we already enforce on Extend runs.
Hard invariants encoded elsewhere (e.g.
classify.bootstrap.max_iterations >= 2,
classify.catboost.fit_to_llm = true) remain non-negotiable —
mutations that violate them are rejected before evaluation, never
committed to the population.
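A pre-evaluation gate of this kind might look like the sketch below. It assumes the candidate configuration has been flattened to dotted keys, and that a key absent from the candidate inherits a compliant default; both are assumptions of this example, as are the names INVARIANTS and violated_invariants.

```python
from typing import Any, Callable, Mapping

# Dotted key -> predicate that must hold. The two entries mirror the
# invariants named in the text; the table itself is a hypothetical structure.
INVARIANTS: dict[str, Callable[[Any], bool]] = {
    "classify.bootstrap.max_iterations": lambda v: isinstance(v, int) and v >= 2,
    "classify.catboost.fit_to_llm": lambda v: v is True,
}


def violated_invariants(flat_config: Mapping[str, Any]) -> list[str]:
    """Return the invariant keys that the candidate config breaks.

    Keys missing from `flat_config` are skipped (assumed to fall back to
    compliant defaults).
    """
    return [
        key
        for key, holds in INVARIANTS.items()
        if key in flat_config and not holds(flat_config[key])
    ]
```

A mutation would be admitted to evaluation only when the returned list is empty.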
Objectives (Pareto axes)
| Objective | Source | Direction | Why operators care |
|---|---|---|---|
| Mean Bel of correct prediction | curated reference | maximize | core accuracy |
| Mean Pl − Bel | EVALUATING report | minimize | calibration tightness |
| LLM tokens / converged column | sweep accounting | minimize | governance budget |
| Cautious accuracy @ depth-N | epistemic_evaluation | maximize | hierarchy faithfulness |
| Vocab coverage @ τ | classifications.json | maximize | “did we touch every leaf?” |
| Pipeline duration | fsm_runs.{started_at, updated_at} | minimize | iteration speed |
A configuration enters the frontier if it is non-dominated: no other configuration is at least as good on every axis and strictly better on at least one (non-dominated sorting). The frontier is open-ended in size; crowding-distance pruning bounds it under operator-defined caps.
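To make the admission rule concrete, here is a minimal sketch of Pareto dominance over a subset of the axes. The axis keys in DIRECTIONS are illustrative shorthand for the table rows, not real report fields, and crowding-distance pruning is left out.

```python
from typing import Mapping, Sequence

# +1 means maximize, -1 means minimize. Keys are hypothetical shorthand.
DIRECTIONS: dict[str, int] = {
    "mean_bel": +1,
    "mean_gap": -1,
    "tokens_per_column": -1,
}


def dominates(a: Mapping[str, float], b: Mapping[str, float]) -> bool:
    """True if `a` is at least as good as `b` everywhere and strictly better somewhere."""
    signed = [(DIRECTIONS[k] * a[k], DIRECTIONS[k] * b[k]) for k in DIRECTIONS]
    return all(x >= y for x, y in signed) and any(x > y for x, y in signed)


def pareto_frontier(candidates: Sequence[Mapping[str, float]]) -> list[Mapping[str, float]]:
    """Return the candidates that no other candidate dominates."""
    return [
        c
        for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]
```

Note that two configurations with identical scores do not dominate each other under this definition, so both would stay on the frontier until pruning.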
Population store (“config leaderboard”)
A persistent backing store records:
- Each evaluated configuration as a row, keyed by hash of the config tuple (bit-stable across host reboots).
- Every objective score per evaluation, with provenance back to the fsm_runs.id that produced it.
- Lineage edges: which configuration mutated to which, via what proposer (APO-style critic vs. GEPA-style recombiner) and what diff.
- Frontier membership over time, so an operator can see which configurations entered, dominated others, or were pruned.
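One way to obtain a key that is stable across processes and hosts is to hash a canonical serialization rather than rely on Python's built-in hash(), which is salted per process for strings. The sketch below assumes the configuration is JSON-serializable; config_key is a hypothetical name.

```python
import hashlib
import json
from typing import Any, Mapping


def config_key(config: Mapping[str, Any]) -> str:
    """Return a SHA-256 hex digest of the canonical JSON form of `config`.

    Sorting keys and fixing separators makes the digest independent of
    dictionary insertion order and whitespace.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Prompt templates would need to be included in the hashed structure (or hashed separately and referenced) for the key to cover the full configuration tuple.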
This is conceptually a leaderboard — operators sort and filter by any axis or weighted combination — and structurally a write-once registry that supports re-evaluation as new corpora arrive. A frontier that holds against corpus A may not hold against corpus B; the registry preserves both views without conflating them.
The store interfaces with existing tables: it points at
ml_artifact_sets rows (the bundle a winning config produced still
ships through Extend Classification) and fsm_runs rows (each
evaluation is one FSM run). It does not duplicate them — there is
one source of truth for artifacts, and the leaderboard layers
search-state on top.
Reflection loop — concrete shape
Per generation:
- Sample a parent from the current frontier, weighted by either crowding distance (favor diversity) or recency (favor live operator priorities). A small fraction of generations sample a dominated ancestor instead, to escape local frontier traps.
- Diagnose by feeding the parent’s run report to the reflection model. The report includes the final classifications, per-axis objective scores, the convergence trace, and any cautious-review findings.
- Propose edits as a structured patch (JSON) against the configuration tuple, e.g. {"classify.bootstrap.gap_threshold": 0.05, "classify.svm.blend_ratio": 0.6}, or a textual prompt diff when the target is a prompt template.
- Evaluate by instantiating the patched config, running it as an FSM run, and recording scores into the leaderboard.
- Update the frontier via non-dominated sort; admit the new configuration if it is non-dominated; prune incumbents whose crowding distance falls below a threshold.
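The "propose" and "evaluate" steps hinge on applying a dotted-key patch to a nested configuration. A minimal sketch follows, assuming the configuration is held as nested dictionaries; apply_patch is a hypothetical helper and the real config tree may be a HOCON object with its own update API.

```python
import copy
from typing import Any, Mapping


def apply_patch(config: Mapping[str, Any], patch: Mapping[str, Any]) -> dict:
    """Return a patched deep copy of `config`; the parent config is left untouched.

    Each patch key is a dotted path such as "classify.svm.blend_ratio".
    Missing intermediate sections are created as empty dicts.
    """
    patched = copy.deepcopy(dict(config))
    for dotted_key, value in patch.items():
        *parents, leaf = dotted_key.split(".")
        node = patched
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return patched
```

Working on a copy keeps the parent row in the leaderboard immutable, which matches the write-once registry described earlier.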
Mutation diversity is encouraged via dual proposers: one focused on accuracy/calibration (the reflection model with a “be conservative” system prompt), one focused on cost/latency (the same model with an “aggressively shrink the budget” system prompt). The frontier preserves both styles rather than collapsing to whichever proposer happened to find an early local optimum.
What this retires
- “Frontier SVM” terminology and the M9 retrain it described. The mid-loop train_svm_on_frontier_labels retrain that gave the “frontier SVM” its name was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for the source-independence reasons documented in ontology_alignment.py. The SVM is now trained once on synth with ICE.* labels and translated into the user vocabulary at inference time via the LLM-mediated alignment. “Frontier” the word is freed for the Pareto sense used elsewhere in this doc.
- Single-config tuning by hand. Today operators tweak base.conf or the runtime overlay and re-run. The capstone replaces that loop with population-based search; the overlay UI surfaces frontier picks and lets operators promote one to active rather than asking them to choose individual values.
- The “single best” mental model. Operators learn to think in trade-offs (“the accuracy-leader spends 4× tokens; the budget-leader loses 6 points of cautious accuracy at depth-3”), and the system surfaces both rather than averaging them away.
Non-goals (explicitly deferred)
- Multi-tenant scheduling under CAI quotas. The search loop assumes single-tenant compute on the host’s GPU. Quota-aware scheduling is a separate concern.
- Cross-corpus warm-start. A frontier from corpus A is not automatically transplanted to corpus B. The leaderboard preserves both, but transfer learning across taxonomies is research in its own right.
- Re-training the embedding model in-loop. Embedding-model identity is locked per artifact set (already enforced for Extend runs); evolution can swap the embedding model only by spinning up a fresh population, not via mutation within an existing one.
- Online / streaming evaluation. Pool-based AL is the operating mode. Streaming evaluation as columns arrive continuously is a candidate for v2 — the leaderboard would persist while the pool grows.
Open research questions
- Cold start for the proposer. The reflection model needs at least one finished run before it can propose edits. Bootstrap with N random perturbations of the default config? Use APO-style beam search for the first generation, then expand into Pareto?
- Noisy oracle problem. AL assumes the oracle is roughly correct. Opus is excellent but not infallible. The cautious-review pass catches some errors, but whether the leaderboard should down-weight configurations whose convergence relied on later-overturned LLM labels is open.
- Convergence detection for the meta-loop. When does evolution stop? Frontier-stability heuristics (no admissions in K generations) versus operator-driven termination versus budget-exhausted.
- Reflection-model agreement. APO’s textual-gradient critic and GEPA’s recombination critic are both LLM-driven. Do they propose meaningfully different edits, or do they collapse to the same suggestion? Worth empirical study before committing the architecture.
- Reproducibility under stochastic LLM outputs. Two evaluations of the same config can disagree on objective scores. How much smoothing (multi-seed averaging) is required before non-dominated sorting becomes stable?
Cross-references
- Classification Pipeline — the AL loop being generalized.
- Synthetic Data & Training — synth provides the labeled-pool floor.
- ML Artifacts & Extend Classification — winning configurations produce artifact sets that flow through the existing Extend pipeline.
- GPU Acceleration — population-based search amplifies the payoff of fast per-evaluation rollouts.
- Proposed Integrations — neighboring roadmap items that may interact with the leaderboard surface.
References
- Settles, B. (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
- Agrawal, L. A. et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457. ICLR 2026 (Oral).
- Pryzant, R. et al. (2023). Automatic Prompt Optimization with “Gradient Descent” and Beam Search. arXiv:2305.03495.
- Microsoft Agent Lightning, APO Algorithm Documentation, https://microsoft.github.io/agent-lightning/latest/algorithm-zoo/apo/
Proposed Integrations
This page documents two planned integration points that extend the data source model: MLflow experiment tracking (Phase 5) and Hive data connections (Phase 6). Both are designed but not yet implemented.
MLflow Integration (Phase 5)
Motivation
On CAI deployments, MLflow is available as a managed service. Logging pipeline runs to MLflow provides:
- Experiment history: compare accuracy, conflict, and coverage across pipeline versions without the Atelier UI
- Model registry: when CatBoost/SVM models are trained, register them as versioned artifacts
- Artifact persistence: classifications.json, evaluation reports, and parquet files survive pod restarts
- Cross-project visibility: other CAI workloads can discover Atelier’s registered models
Architecture: Write-Then-Reconcile
The MLflow bridge follows the RAG Studio reconciler pattern — the pipeline never blocks on MLflow I/O.
Pipeline thread                    Reconciler (background)
──────────────                     ───────────────────────
write JSON to queue dir ──────►    poll queue dir
(non-blocking)                     parse JSON envelope
                                   log to MLflow (retries)
                                   move to archive/
This design is resilient to:
- MLflow downtime (queue accumulates, reconciler catches up)
- Pipeline latency (no synchronous API calls in the hot path)
- Pod restarts (queue dir is on persistent storage)
Queue Format
Each pipeline state transition writes a JSON envelope to
build/mlflow_queue/:
{
  "event": "run_complete",
  "run_id": "abc123",
  "source_id": "ootb-sample",
  "timestamp": "2026-04-14T12:00:00Z",
  "payload": {
    "params": {
      "source_id": "ootb-sample",
      "vocabulary_mode": "universal",
      "sample_size": 50,
      "llm_model": "glm-4.7",
      "discount_cosine": 0.30
    },
    "metrics": {
      "accuracy": 0.847,
      "micro_f1": 0.832,
      "macro_f1": 0.791,
      "mean_belief": 0.724,
      "mean_conflict": 0.089,
      "coverage": 0.973,
      "llm_calls": 42,
      "bootstrap_iterations": 3
    },
    "artifacts": [
      "build/results/abc123/classifications.json",
      "build/results/abc123/evaluation_report.json",
      "build/results/abc123/atelier_embeddings.parquet"
    ]
  }
}
MLflow Experiment Structure
Each data source maps to an MLflow experiment:
Experiment: atelier/ootb-sample
├── Run: v1 (params, metrics, artifacts)
├── Run: v2 (params, metrics, artifacts)
└── Run: v3 (params, metrics, artifacts)
Experiment: atelier/hive-prod-default
└── Run: v1 (params, metrics, artifacts)
What Gets Logged
| Category | Items | Notes |
|---|---|---|
| Params | source_id, vocabulary_mode, sample_size, llm_model, discount factors | Static per run |
| Metrics | accuracy, micro_f1, macro_f1, mean_belief, mean_conflict, coverage | Numeric scalars |
| Artifacts | classifications.json, evaluation_report.json, parquet | Full result set |
| Models | CatBoost (.cbm), SVM (.pkl) | Registered when newly trained |
Module Design
# src/atelier/classify/mlflow_bridge.py
from pathlib import Path


class MLflowBridge:
    """Async write-then-reconcile bridge to MLflow."""

    def __init__(self, queue_dir: Path, experiment_prefix: str = "atelier"):
        self.queue_dir = queue_dir
        self.experiment_prefix = experiment_prefix

    def enqueue(self, event: str, run_id: str, source_id: str, payload: dict):
        """Write an event envelope to the queue (non-blocking)."""
        ...

    def reconcile(self):
        """Process all pending queue items. Called by background thread."""
        ...
Pipeline integration points:
# In pipeline.py — at key state transitions:
bridge.enqueue("run_started", run_id, source_id, {"params": {...}})
# ... pipeline work ...
bridge.enqueue("run_complete", run_id, source_id, {"metrics": {...}, "artifacts": [...]})
Gating
MLflow is only active on CAI (cfg.is_cml). In devenv, the bridge
is a no-op. The mlflow package is an optional dependency — import
failure is handled gracefully.
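The gating rule could be expressed roughly as follows. This is a sketch, not the proposed implementation: load_optional and bridge_enabled are hypothetical helpers, and is_cml stands in for the cfg.is_cml flag mentioned above.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Import a module if it is installed; return None instead of raising."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None


def bridge_enabled(is_cml: bool, module_name: str = "mlflow") -> bool:
    """The bridge is live only on CAI and only when the optional package imports."""
    return is_cml and load_optional(module_name) is not None
```

When bridge_enabled returns False, callers can substitute a bridge whose enqueue() does nothing, so pipeline code never needs to branch on the environment.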
Configuration
# config/base.conf (proposed)
mlflow {
  enabled = false
  enabled = ${?ATELIER_MLFLOW_ENABLED}
  tracking_uri = null
  tracking_uri = ${?MLFLOW_TRACKING_URI}
  queue_dir = "build/mlflow_queue"
}
Implementation Notes
- The reconciler runs as a daemon thread started in the gateway lifespan, similar to the sample source seeding
- Queue items are atomic files (write to .tmp, rename to .json) to prevent partial reads
- Failed reconciliation retries with exponential backoff (max 5 min)
- Archive dir (build/mlflow_queue/archive/) retains processed items for debugging
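The atomic-write and backoff notes above can be sketched as below. The file naming scheme (run id, event, random suffix) and the one-second base delay are assumptions of this example; only the .tmp-then-rename pattern and the five-minute cap come from the notes.

```python
import json
import os
import uuid
from pathlib import Path


def enqueue_envelope(queue_dir: Path, envelope: dict) -> Path:
    """Write an envelope so that readers never observe a partially written file."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    stem = f"{envelope['run_id']}-{envelope['event']}-{uuid.uuid4().hex}"
    tmp_path = queue_dir / f"{stem}.tmp"
    final_path = queue_dir / f"{stem}.json"
    tmp_path.write_text(json.dumps(envelope), encoding="utf-8")
    # os.replace is atomic when source and target are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff: base * 2**attempt, capped at `cap` seconds (5 min)."""
    return min(cap, base * (2 ** attempt))
```

A reconciler polling the directory would then glob only for *.json, which by construction are complete files.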
Files (Proposed)
| File | Action |
|---|---|
| src/atelier/classify/mlflow_bridge.py | New: bridge + reconciler |
| src/atelier/classify/pipeline.py | Extend: bridge calls at transitions |
| config/base.conf | Extend: mlflow config block |
| src/atelier/config.py | Extend: mlflow fields |
| src/atelier/gateway.py | Extend: reconciler daemon thread |
Hive Data Source (Phase 6)
Motivation
The OOTB sample source demonstrates the pipeline with synthetic data. In production on CAI, the real value comes from classifying columns in the customer’s actual Hive tables via CAI data connections.
How It Works
Hive sources are auto-discovered at gateway startup. The gateway
lifespan hook calls discover_hive_sources(cfg) which:
- Iterates all connections listed in ATELIER_DATA_CONNECTIONS
- For each connection, runs SHOW DATABASES and checks each database for an annotations table
- Validates the schema: fetches 1 row and checks for legacy (id, ontology, annotation) or universal (code, label) format
- Auto-registers valid sources via get_or_create_data_source() (idempotent, so safe to re-run on restart)
Once registered, the pipeline route works automatically:
- Pipeline resolves data from the connection: when source_id refers to a hive source, the pipeline calls discover_tables() and sample_table_metadata() using that connection
- Vocabulary routing: hive sources use load_annotations_from_hive() which reads default.annotations (domain categories) and composes them on top of the universal base
- Results register as versions: each pipeline run creates a new version under the hive source, with the same activation/versioning semantics as the sample source
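The schema-validation step in the discovery flow amounts to a column-set check. The sketch below assumes the column names of the fetched row are available as plain strings; detect_annotations_format is a hypothetical helper, and checking the legacy format first is an arbitrary choice for this example.

```python
from typing import Iterable, Optional

LEGACY_COLUMNS = frozenset({"id", "ontology", "annotation"})
UNIVERSAL_COLUMNS = frozenset({"code", "label"})


def detect_annotations_format(column_names: Iterable[str]) -> Optional[str]:
    """Classify an annotations table as "legacy", "universal", or None (invalid)."""
    names = {name.lower() for name in column_names}
    if LEGACY_COLUMNS <= names:
        return "legacy"
    if UNIVERSAL_COLUMNS <= names:
        return "universal"
    return None
```

A None result would mean the database is skipped rather than registered as a source.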
Data Flow
CAI Data Connection (Hive/Impala)
    │
    ▼
discover_tables(cfg, connection_name, database)
    │                           ┌─────────────────────────┐
    ▼                           │ load_annotations_from_  │
sample_table_metadata()         │ hive(cfg, connection)   │
    │                           │ → default.annotations   │
    ▼                           └──────────┬──────────────┘
                                           │
    ┌──────────────────────────────────────┘
    ▼
compose_vocabularies(universal, hive_domain)
    │
    ▼
run_classification_pipeline(cfg, fsm, source_id="hive-prod-default")
    │
    ▼
Dataset version N+1 registered under hive source
Vocabulary Composition
Hive sources use two-layer vocabulary composition:
Layer 1 (always): Universal vocabulary (16 BFO-grounded PII categories)
╱╲
Layer 2 (hive only): Domain annotations from default.annotations table
(290+ customer-specific categories with hierarchical codes)
╱╲
Composed CategorySet (300+ terms)
Domain categories attach to the universal tree via parent_code
references. Categories without a valid parent are logged as warnings
and placed under a catch-all internal node.
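The attachment rule can be sketched in a few lines. This is an illustration under stated assumptions, not the behavior of compose_vocabularies() itself: the Category shape, the "unmapped" catch-all code, and the function name are all hypothetical.

```python
import logging
from dataclasses import dataclass
from typing import Dict, List, Optional

logger = logging.getLogger("vocab_sketch")
CATCH_ALL = "unmapped"  # hypothetical code for the catch-all internal node


@dataclass(frozen=True)
class Category:
    code: str
    label: str
    parent_code: Optional[str] = None


def compose_sketch(universal: List[Category], domain: List[Category]) -> Dict[str, Category]:
    """Layer domain categories on the universal tree; re-parent orphans to the catch-all."""
    composed = {c.code: c for c in universal}
    composed.setdefault(CATCH_ALL, Category(CATCH_ALL, "Unmapped domain categories"))
    # A parent is valid if it exists in either layer (domain codes are hierarchical).
    known = set(composed) | {c.code for c in domain}
    for cat in domain:
        if cat.parent_code not in known:
            logger.warning("category %s has no valid parent (%r)", cat.code, cat.parent_code)
            cat = Category(cat.code, cat.label, CATCH_ALL)
        composed[cat.code] = cat
    return composed
```

Keeping orphans in the composed set, rather than dropping them, means a bad parent_code degrades placement in the tree but does not remove the category from the frame.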
Source Creation
When a user selects a data connection from the Status page dropdown and clicks “Create Source”, the gateway:
- Validates the connection by running SHOW DATABASES
- Creates a data_sources record: { "id": "hive-{connection}-{database}", "source_type": "hive", "source_uri": "{connection}/{database}", "display_name": "hive:{connection}/{database}", "vocabulary_mode": "hive" }
- The source appears in the dropdown immediately
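The record template above can be filled in mechanically. A minimal sketch, assuming connection and database names never contain a slash (an assumption; build_hive_source_record and split_source_uri are hypothetical helpers):

```python
from typing import Dict, Tuple


def build_hive_source_record(connection: str, database: str) -> Dict[str, str]:
    """Fill in the data_sources template for one connection/database pair."""
    if "/" in connection or "/" in database:
        raise ValueError("names must not contain '/', which separates the URI parts")
    return {
        "id": f"hive-{connection}-{database}",
        "source_type": "hive",
        "source_uri": f"{connection}/{database}",
        "display_name": f"hive:{connection}/{database}",
        "vocabulary_mode": "hive",
    }


def split_source_uri(source_uri: str) -> Tuple[str, str]:
    """Recover (connection, database) from a "{connection}/{database}" URI."""
    connection, database = source_uri.split("/", 1)
    return connection, database
```

For a connection named prod and the default database, this yields the hive-prod-default id that appears in the data-flow diagram.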
Pipeline Routing
# In pipeline.py — source-based auto-resolution
if source.source_type == "hive":
    # source_uri has the form "{connection}/{database}"
    connection_name, database = source.source_uri.split("/", 1)
    # discover_tables() and sample_table_metadata() use the connection
    # load_annotations_from_hive() uses the connection for vocabulary
Configuration
No new configuration needed. Existing settings control Hive behavior:
classify {
connection_name = "" # Default CAI data connection
connection_name = ${?ATELIER_CLASSIFY_CONNECTION}
database = "default"
database = ${?ATELIER_CLASSIFY_DATABASE}
}
cml {
data_connections = "" # Comma-separated connection names
data_connections = ${?ATELIER_DATA_CONNECTIONS}
}
Files (Proposed Changes)
| File | Change |
|---|---|
| src/atelier/gateway.py | Add POST /api/data-sources endpoint with connection validation |
| src/atelier/classify/pipeline.py | Extend source routing to resolve hive connections |
| ui/src/pages/Status.tsx | Add “Create Source” button in data connection card |
Existing Modules Used (No Changes)
| Module | Function | Role |
|---|---|---|
| sampler.py | discover_tables() | List tables via cml.data_v1 |
| sampler.py | sample_table_metadata() | Sample column values |
| taxonomy.py | load_annotations_from_hive() | Load domain vocabulary |
| taxonomy.py | compose_vocabularies() | Merge universal + domain |
Implementation Priority
| Phase | Integration | Depends On | Testable Without Services |
|---|---|---|---|
| 5 | MLflow bridge | Phase 2 (data model) | Partially — queue/reconcile logic is pure Python |
| 6 | Hive source | Phase 2 (data model) | No — requires CAI data connection |
Phase 5 can be developed and unit-tested independently (the queue and reconcile logic is pure Python). The MLflow API calls can be mocked in tier-0 BDD scenarios.
Phase 6 is primarily wiring — the heavy lifting (table discovery, vocabulary loading, pipeline execution) already exists. The main new code is the gateway endpoint for source creation and the UI for triggering it.
Encrypted Deployment Defaults (SOPS + age)
Atelier ships with encrypted deployment defaults so a CAI operator can stand up a working instance by entering only four environment variables — their two AWS Bedrock credentials, a direct Anthropic API key (for overwatch), plus a single age private key that unlocks everything else.
Why
Every CAI deployment needs a dozen-ish environment variables: Bedrock model ARNs, Atlas / Ranger URLs, feature toggles, governance flags, subagent model IDs, and — for UAT runs — a curated-reference CSV for accuracy measurement. Most of those values are identical across every deployment of the same Atelier release; only the AWS credentials and the Anthropic key are operator-specific. Rather than documenting a long checklist for every customer, we encrypt the defaults and the curated-reference fixture into the repository with SOPS and ship one key alongside the deployment.
The operator pastes in the key; everything else is already wired up.
Operator workflow (what to tell your CAI users)
Set four environment variables on the CAI Application, then start it:
| Name | Value | Source |
|---|---|---|
| AWS_ACCESS_KEY_ID | Bedrock access key | your AWS / IAM team |
| AWS_SECRET_ACCESS_KEY | Bedrock secret | your AWS / IAM team |
| ANTHROPIC_API_KEY | direct Anthropic API key | Anthropic Console |
| SOPS_AGE_KEY | full AGE-SECRET-KEY-1… string | provided out-of-band by the Atelier maintainer |
On startup, bin/start-app.sh runs the shared
bin/bootstrap-secrets.sh utility, which decrypts both .env.cai.enc
(dotenv defaults) and features/fixtures/curated_reference.csv.enc
(meta-tagging answer key) with the age key you provided. The dotenv
values source into the environment where HOCON’s ${?VAR} substitution
picks them up; the decrypted CSV materializes at
build/data/curated_reference.csv and ATELIER_CLASSIFY_REFERENCE_URI
points at it so evaluation_report.json carries real accuracy
numbers. No per-customer checklist to maintain.
Overrides still work. Any explicit ATELIER_* env var on the CAI
Application wins over the encrypted default — so an operator who
wants a different Bedrock ARN just sets ATELIER_AGENT_MODEL directly
and that value takes precedence.
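The precedence rule can be illustrated with a small merge. This is a sketch of the semantics only, not of bin/bootstrap-secrets.sh (which is a shell script); the helper names and the ATELIER_FEATURE_X variable are hypothetical.

```python
from typing import Dict


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def merge_env(defaults: Dict[str, str], explicit: Dict[str, str]) -> Dict[str, str]:
    """Start from the decrypted defaults; explicitly set variables win."""
    merged = dict(defaults)
    merged.update(explicit)
    return merged
```

With a decrypted default for ATELIER_AGENT_MODEL and an operator-set value of the same name, merge_env keeps the operator's value and leaves every other default in place.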
Alternative: pointing at a key file
If the operator already has the age key on disk (e.g. mounted from a
secret store), they can set SOPS_AGE_KEY_FILE=/path/to/key.txt
instead of pasting the key content. bin/start-app.sh supports both.
Maintainer workflow
The age public key is committed in .sops.yaml; the private
key is held by the Atelier maintainer and distributed out-of-band to
each CAI operator.
First-time setup
Place your age private key at ~/.config/sops/age/keys.txt — the
public key must match the age: age1… line in .sops.yaml. The
devenv shell provides both sops and age binaries.
Editing defaults
just decrypt-secrets # .env.cai.enc → .env.cai (plaintext, gitignored)
$EDITOR .env.cai # add / change values
just encrypt-secrets # .env.cai → .env.cai.enc
git add .env.cai.enc
git commit -m "chore: update CAI deployment defaults"
The plaintext .env.cai is excluded by .gitignore; only the
encrypted .env.cai.enc is tracked. SOPS encrypts each value
independently, so diffs show which keys changed even though their
values are opaque.
Editing the curated-reference fixture
The meta-tagging answer key (what evaluation_report.json compares
predictions against) ships encrypted under the BDD fixtures tree so
committed secrets live with the corpus they validate.
# From the maintainer's reviewer xlsx
uv run python -m atelier.overwatch.ingest_reference \
~/path/to/Atelier_Results_Default_DB_4-16.xlsx \
--out build/data/curated_reference.csv
# Encrypt into features/fixtures/ and commit the ciphertext only
just encrypt-reference
git add features/fixtures/curated_reference.csv.enc
git commit -m "chore: update curated-reference answer key"
To inspect the current key without re-running the xlsx ingest:
just decrypt-reference # decrypts into build/data/curated_reference.csv
$PAGER build/data/curated_reference.csv
Both the plaintext CSV (in build/) and .env.cai are ignored by
git; only the .enc ciphertexts are tracked.
Rotating the key
age-keygen -o new-key.txt # generate replacement pair
# update .sops.yaml: replace the age: age1... line with the new public key
sops updatekeys .env.cai.enc # re-encrypt deployment defaults
sops updatekeys features/fixtures/curated_reference.csv.enc # AND the curated-reference fixture
git commit -am "chore: rotate CAI deployment key"
# distribute the new private key to operators via the same out-of-band channel
sops updatekeys rewrites the encrypted file’s recipient list in
place — nothing about the plaintext values changes, so this is a
zero-content-drift rotation. Run it against every encrypted
artifact so the new key unlocks the whole set.
Adding a second recipient (e.g. ops team shared key)
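Append the second public key to the age: value (recipients are
comma-separated) under the matching creation_rules block in .sops.yaml,
then run sops updatekeys against each encrypted artifact
(.env.cai.enc and features/fixtures/curated_reference.csv.enc). Either
private key will decrypt.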
How this fits with HOCON
SOPS only populates environment variables. HOCON (config/base.conf)
already treats all configuration as environment-overridable via the
${?VAR} pattern:
agents {
model = "claude-opus-4-7"
model = ${?ATELIER_AGENT_MODEL} # env wins when set
}
SOPS decryption runs before the gRPC server loads HOCON, so from HOCON’s perspective the encrypted values are just ordinary environment variables.
What belongs in .env.cai.enc vs config/base.conf
- .env.cai.enc — deployment-specific defaults that differ between environments but aren’t operator secrets per se (model ARNs, Knox endpoints, feature toggles, subagent IDs). These are values that are derivable from context and that you don’t want every operator to rediscover.
- config/base.conf — true defaults that hold for every deployment; structural knobs that belong in source control in plaintext (pipeline thresholds, port numbers, fusion strategy).
- Operator-entered env vars — genuine per-deployment secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ANTHROPIC_API_KEY, the SOPS_AGE_KEY itself). These never live in the repository.
Security notes
- SOPS_AGE_KEY decrypts only this project’s encrypted artifacts (.env.cai.enc and the curated-reference fixture). Losing it costs you these defaults; gaining it grants no AWS, Cloudera, or third-party privilege on its own.
- Each customer should get the same age private key (defaults are identical across deployments); per-customer secrets, if any, stay in the CAI Application’s own environment variables.
- Rotate the key whenever a recipient leaves the operator pool.
- The age public key in .sops.yaml is intentionally committed; public keys are meant to be public.
Reference
- .sops.yaml — recipient rules (covers .env.cai.enc + features/fixtures/*.csv.enc)
- .env.cai.enc — encrypted deployment defaults (committed)
- features/fixtures/curated_reference.csv.enc — encrypted curated-reference CSV (committed)
- bin/bootstrap-secrets.sh — shared decrypt utility; runs from bin/start-app.sh, devenv enterShell, and just bootstrap-secrets
- bin/start-app.sh — CAI startup; invokes bootstrap-secrets then sources .env.cai
- justfile helpers:
  - bootstrap-secrets — run the shared decrypt utility
  - decrypt-secrets / encrypt-secrets — dotenv editing workflow
  - decrypt-reference / encrypt-reference — curated-reference CSV editing workflow
- devenv.nix — provides sops + age in the dev shell; runs bootstrap in enterShell
- SOPS docs · age docs
Reviewer’s Guide to the Embeddings Canvas
This guide is for operators auditing classification runs and proposing
algorithm-tuning remediations. It explains the Dempster–Shafer (DST)
measures the canvas exposes, the rationale behind the curated SQL
Predicate panel, and a concrete walk-through of using the canvas to
diagnose the four root causes called out in
audit_2026-05-06_a.md (runs
40f07630, 8d67b1ed, e5b0ac26).
The guide assumes you have an Embeddings page open for one of those runs and a copy of the audit alongside.
1. The DST measures, in plain English
Atelier fuses up to six independent evidence sources (name match, pattern, cosine, LLM, CatBoost, SVM) via Dempster’s rule of combination. The fused result is a mass function over the taxonomy’s frame of discernment. From that mass function we report five scalars per column:
| Field | Formula | Meaning |
|---|---|---|
| belief | Bel(A) = Σ m(B), B ⊆ A | Lower bound on the probability the prediction is correct. Mass committed only to A or its subsets. |
| plausibility | Pl(A) = Σ m(B), B ∩ A ≠ ∅ | Upper bound. Mass consistent with A — what hasn’t been ruled out. |
| uncertainty | Pl(A) − Bel(A) | The width of the [Bel, Pl] interval. Epistemic uncertainty, smaller is better. |
| confidence | BetP(x) = Σ_{x∈A} m(A) / \|A\| | Pignistic probability. The mass on each focal element is split evenly among the singletons it contains, so ignorance is redistributed rather than withheld. |
| conflict | K, the pre-normalization mass on ∅ | Source-disagreement diagnostic. Under Dempster’s rule, K is normalized out of [Bel, Pl] but logged separately; under Yager, K is redirected to ignorance (Θ). |
The invariant: Bel(A) ≤ BetP(A) ≤ Pl(A) for every column.
By duality, Pl(A) + Bel(Ā) = 1, where Ā is the complement of A.
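The three interval measures follow directly from the formulas in the table. A minimal sketch, assuming a mass function is represented as a dict from frozenset focal elements to mass (a representation chosen here for illustration, not necessarily Atelier's):

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def belief(m: Mass, a: FrozenSet[str]) -> float:
    """Bel(A): mass on non-empty focal elements contained in A."""
    return sum(v for b, v in m.items() if b and b <= a)


def plausibility(m: Mass, a: FrozenSet[str]) -> float:
    """Pl(A): mass on focal elements that intersect A."""
    return sum(v for b, v in m.items() if b & a)


def pignistic(m: Mass, x: str) -> float:
    """BetP(x): each focal element containing x contributes m(B) / |B|."""
    return sum(v / len(b) for b, v in m.items() if x in b)


# Vacuous mass function over a 16-singleton frame: all mass on the whole frame.
frame = frozenset(f"c{i}" for i in range(16))
vacuous: Mass = {frame: 1.0}
a = frozenset({"c0"})
# Bel = 0, BetP = 1/16 = 0.0625, Pl = 1.0
print(belief(vacuous, a), pignistic(vacuous, "c0"), plausibility(vacuous, a))
```

The vacuous case is the extreme of the invariant: Bel sits at 0, Pl at 1, and BetP lands at 1/16 purely from evenly spread ignorance.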
Which one is the “rigor” signal?
For a single positive scalar, prefer belief. It is the
honest floor — mass that cannot be redirected by additional evidence
even in principle. The cautious-review gate (bel_threshold = 0.80)
operates on Bel; bootstrap convergence is on the gap (Pl − Bel);
needs_clarification fires when Bel < 0.80 OR gap > 0.20. The
project’s algorithms already treat Bel as the truth proxy; the
reviewer should too.
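The needs_clarification rule as stated is a two-clause predicate. A minimal sketch (the function signature is an assumption; the 0.80 and 0.20 defaults are the thresholds quoted above):

```python
def needs_clarification(
    bel: float,
    pl: float,
    bel_threshold: float = 0.80,
    gap_threshold: float = 0.20,
) -> bool:
    """True when belief is below the gate OR the [Bel, Pl] interval is too wide."""
    return bel < bel_threshold or (pl - bel) > gap_threshold


print(needs_clarification(0.834, 0.933))  # committed prediction, narrow gap
print(needs_clarification(0.30, 0.85))    # weak belief, wide gap
```

One consequence worth noting: because Pl ≤ 1, a row with Bel ≥ 0.80 can have a gap of at most 0.20, so with these default thresholds the gap clause can only fire on rows that already fail the Bel clause. The two clauses become independent only if the thresholds are changed so that bel_threshold + gap_threshold < 1.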
confidence is not redundant — it serves a different purpose.
BetP redistributes ignorance optimistically, so a vacuous mass
function over a 16-singleton frame still produces BetP ≈ 0.06 per
singleton. Comparing belief to confidence on the same row is
how reviewers build intuition for “how much of this column’s
prediction is committed evidence vs. evenly-spread ignorance.” Big
gap between Bel and BetP = the prediction looks confident only
because the rest of the frame is empty.
A worked example from 8d67b1ed’s row 1 (fitness_members.row_id):
Bel = 0.834 BetP = 0.834 Pl = 0.933 K = 0.358
BetP and Bel align tightly because the mass is concentrated on singletons (no large compound focal elements to spread). A healthy, committed prediction.
Compare with a hypothetical weak prediction:
Bel = 0.30 BetP = 0.55 Pl = 0.85 K = 0.10
Here the headline confidence = 0.55 masks a Bel of 0.30: the 0.55 of
mass between Bel and Pl sits on compound focal elements that contain
the prediction, and BetP is spreading it across singletons.
Reviewer’s read: this is not a 0.55 prediction; it’s a 0.30
prediction wearing a 0.55 hat.
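One mass assignment that reproduces the hypothetical weak prediction's numbers is shown below. The four-element frame \( \Theta = \{a, b, c, d\} \) and the individual masses are assumptions chosen for illustration; many other assignments give the same three scalars.

$$ m(\{a\}) = 0.30, \quad m(\{a, b\}) = 0.45, \quad m(\Theta) = 0.10, \quad m(\{b\}) = 0.15 $$

$$ \text{Bel}(\{a\}) = m(\{a\}) = 0.30, \qquad \text{Pl}(\{a\}) = 0.30 + 0.45 + 0.10 = 0.85 $$

$$ \text{BetP}(a) = 0.30 + \frac{0.45}{2} + \frac{0.10}{4} = 0.30 + 0.225 + 0.025 = 0.55 $$

Only 0.30 of the 0.55 is committed evidence; the remaining 0.25 is \(a\)'s even share of mass that the sources left on compound sets.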
Why conflict is no longer the canvas color default
Under Dempster’s rule (the default fusion strategy), K is normalized
out of [Bel, Pl] — every fused mass function is renormalized by
(1 − K). K still gets reported as a diagnostic, but it does not
correlate with prediction quality. Run 8d67b1ed averaged
K = 0.27 across all 287 columns; rows with very different beliefs
(Bel = 0.30 vs Bel = 0.85) commonly share the same K. Coloring by
K painted the canvas a nearly-uniform fog.
belief paints the canvas with information. Low-Bel rows (the
cautious-review candidate pool) cluster in warm colors; committed
predictions cool out. The 0.80 cliff that drives the cautious
review and needs_clarification is visible — it’s the threshold
between a calm canvas and a hot-spot region that demands human
attention.
If you switch the run to Yager fusion, K is no longer normalized out — it shows up as ignorance mass, which depresses Bel and widens the gap. Reviewers comparing fusion strategies side-by-side should look at Bel + gap on both, not at K — K means different things under the two rules.
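The difference between the two rules can be seen on a two-source toy case. A minimal sketch, assuming mass functions are dicts keyed by frozenset focal elements; the source masses and the email/phone frame are made up for illustration and do not come from any run.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Mass = Dict[FrozenSet[str], float]


def _conjunctive(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Unnormalized conjunctive combination; returns (masses, conflict K)."""
    combined: Mass = {}
    conflict = 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + va * vb
        else:
            conflict += va * vb  # mass that would land on the empty set
    return combined, conflict


def dempster(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Dempster's rule: discard K and renormalize the rest by 1 - K."""
    combined, k = _conjunctive(m1, m2)
    if k >= 1.0:
        raise ValueError("sources are in total conflict")
    return {s: v / (1.0 - k) for s, v in combined.items()}, k


def yager(m1: Mass, m2: Mass, frame: FrozenSet[str]) -> Tuple[Mass, float]:
    """Yager's rule: move K onto the whole frame (ignorance) instead."""
    combined, k = _conjunctive(m1, m2)
    combined[frame] = combined.get(frame, 0.0) + k
    return combined, k


theta = frozenset({"email", "phone"})
m1 = {frozenset({"email"}): 0.8, theta: 0.2}
m2 = {frozenset({"phone"}): 0.6, theta: 0.4}
fused_d, k = dempster(m1, m2)
fused_y, _ = yager(m1, m2, theta)
```

Here K = 0.8 × 0.6 = 0.48 under both rules. Dempster leaves about 0.62 of mass on the email singleton after renormalization, while Yager leaves 0.32 and parks 0.56 on the frame, which is the lower Bel and wider gap described above.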
2. The curated SQL Predicate panel
Embedding-Atlas’s default behavior is to auto-generate one chart per data column. With 35 fields in the parquet, that’s noise: tooltips overlap with the canvas, projection coordinates render as histograms, JSON blobs render as illegible text fields.
The curated panel exposes only fields that map to an algo-tuning decision. Order is intentional — top to bottom, the panel walks the reviewer from “is this run healthy” → “where is the pain concentrated” → “which feature is driving it.”
| # | Field | Chart shape | Why it’s there |
|---|---|---|---|
| 1 | belief | Histogram | Primary quality signal. The 0.80 cliff is the cautious-review threshold; rows below are the candidate pool. Brushing this filters the canvas to “weak” predictions. |
| 2 | confidence | Histogram | BetP. Side-by-side with belief builds intuition for the Bel-vs-BetP gap. Wide gap on a row = mass concentrated on compound focal elements. |
| 3 | review_decision | Count plot | Categorical: keep / backoff / reroute / "" (untouched). This is the audit’s central concern — Finding 1 names reroute as the instability amplifier. |
| 4 | predicted_annotation | Count plot | Compact mnemonic (e.g. NAMEFULL, EMAIL, PHONE) — the dot-codes are unreadable in a small chart, but the annotation tells the same story. The full label appears in the embedding tooltip. |
| 5 | needs_clarification | 2-bar count | Boolean union of Bel < 0.80 OR gap > 0.20. The “demands attention” set, expressed as a single flag. |
| 6 | llm_confidence | Histogram | LLM’s self-reported confidence. Low-tail rows are the population at risk for reroute amplification — a weakly-asserted LLM code that DST then has to defend. |
| 7 | uncertainty | Histogram | Pl − Bel — gap-driven revisit set. Bootstrap convergence is on mean(uncertainty); canvas histogram lets reviewers see whether the run actually converged or just hit max-iterations. |
| 8 | conflict | Histogram | K, demoted from default but kept as a source-disagreement diagnostic. Useful when comparing Dempster vs Yager runs (K means different things under each). |
| 9–14 | shap_top1/2/3_name, shap_top1/2/3_value | Count + histogram pairs | Surfaces which feature is driving each prediction. Top-1 is usually sample_values; top-2/3 reveal sibling-context vs column-name dominance. The intentional inclusion of all three reflects the steep dropoff in SHAP utility between top-1 and top-3 — the dropoff is itself the situational signal. When top-1 dominates by 5×, single-feature explanations work; when top-1/2/3 are flat, the prediction is broadly diffuse and remediation needs to address feature-engineering, not source weights. |
| 15 | table_name | Count plot | Hotspot navigation. Audit calls out legal_cases and loan_applications as hallucination concentration zones; this chart lets reviewers brush-filter to one. |
| 16 | column_type | Count plot | Numeric vs object. Pattern-signal source is type-conditioned; reviewing remediations to pattern detectors benefits from typed slicing. |
What’s not in the default panel. Reference fields
(reference_code, reference_label, matches_reference) are usually
empty for production Hive data. When a run does have a curated
reference set (UAT meta-tagging mounts), reviewers can add
reference_code and matches_reference via the SQL Predicates
control on the panel header — type into the predicate input directly,
or click “Add” and pick the column. The panel re-renders instantly.
Same mechanism applies to any field the reviewer wants ad-hoc — e.g.,
predicted_label for a long-form taxonomy view, or predicted_code
when a numeric dot-code is needed for filtering.
3. Walk-through against audit_2026-05-06_a.md
The audit identifies four root causes. Below: how to reach each one on the canvas, what the right brushing pattern is, and what the algo-tuning lens reveals.
Finding 1 — Three-way reroute as instability amplifier
“20.6% of columns flip between runs with identical configuration. The reroute mechanism turns a minor LLM fluctuation into a major classification change.”
Brush: review_decision = "reroute" on chart 3.
What you see: the canvas highlights all rerouted rows. Look at their distribution — are they clustered in one taxonomy region (a single subtree’s entropy bleeding into neighbors), or are they scattered (the LLM is fluctuating uniformly)?
Cross-brush with belief chart 1, brushing the 0.40–0.70 band:
this is the cohort that fails the 0.80 threshold but isn’t trivially
weak. Reroute decisions are most consequential here — an LLM fluke
on a 0.85-Bel row gets rejected by the threshold; on a 0.65-Bel row
it gets handed to the reviewer. The audit’s recommended P1 guard
(“reject reroutes where pre-review code matches LLM code with
conf > 0.80”) would visibly clip the right edge of this brush.
Algo-tuning read: if rerouted rows cluster around belief ≈ 0.5
and have llm_confidence > 0.80, the audit’s P1 guard is the right
remediation. If they cluster at belief < 0.4, the upstream issue
is fusion strength, not the reviewer.
Finding 2 — LLM annotation-code hallucination (27 columns)
“The LLM returned annotation mnemonics (SSN, DOB, FNAME) instead of numeric taxonomy codes for 27 columns in 40f07630. When llm_code is an annotation string, the code-resolution layer discards it — evidence_sources.llm = {}.”
Brush: llm_confidence chart 6, isolate the low-confidence
tail below 0.20. These are columns whose LLM evidence was
discarded (the evidence layer assigns 0 confidence when the code
fails to resolve).
Cross-brush with belief (chart 1) — affected columns will pile up
at low Bel because they fall back to cosine alone.
SHAP signal: chart 9 (shap_top1_name) on the same brush should
show column_name or sibling_context dominating instead of
sample_values. When SHAP’s top-1 is not sample_values, the
classifier wasn’t given enough evidence from the values themselves —
the LLM-evidence loss is showing up as an upstream feature-importance
shift.
Algo-tuning read: the audit’s P0 (“map annotation mnemonics to
numeric codes in _resolve_llm_code()”) would eliminate this brush
entirely — its impact is visible as the disappearance of a low-tail
cluster on llm_confidence. Reviewer can size the impact:
“≈ 27 columns × mean(belief gain) = X total mass committed.”
Finding 3 — col_04 and sibling-context poisoning
“When sibling opaque columns (col_02, col_32) are all misclassified as Shipping Address (because the table name biases the embedding), the reviewer uses those wrong sibling labels as evidence to perpetuate the error.”
Brush: shap_top1_name = "sibling_context" on chart 9, then
cross-brush column_name LIKE 'col\\_%' via the SQL Predicate
input. This is the at-risk population.
What you see: the rerouted opaque columns cluster on the canvas
near their (incorrectly inferred) neighbors. When the reviewer-bias
poisoning is at work, these clusters will show consistent
predicted_annotation across the cluster — the error has
propagated.
Cross-brush with review_decision = "reroute": rerouted opaque
columns where SHAP shows sibling-context dominance are the precise
target of the audit’s P2 remediation (“exclude sibling columns with
opaque names from reviewer context”).
Algo-tuning read: the size of this brush is the population the P2 remediation removes. If SHAP top-2 and top-3 (charts 11, 13) are also dominated by sibling-context for these rows, the value-side evidence is systematically under-represented and the remediation needs to extend beyond “exclude opaque siblings” to “rebalance feature weights when sample-values entropy is high.”
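The P2 remediation ("exclude sibling columns with opaque names from reviewer context") can be sketched as a name filter. This is an illustration, not the project's implementation: the col_/var_/dim_ prefixes are the ones the remediation notes name explicitly, any further prefixes are not reproduced here, and the sibling-context shape (column name to current label) is an assumption.

```python
import re
from typing import Dict

# Opaque machine-generated names such as col_04, var_12, dim_3.
OPAQUE_NAME = re.compile(r"^(col|var|dim)_\d+$", re.IGNORECASE)


def filter_sibling_context(siblings: Dict[str, str]) -> Dict[str, str]:
    """Drop siblings whose names carry no semantic signal of their own."""
    return {
        name: label
        for name, label in siblings.items()
        if not OPAQUE_NAME.match(name)
    }
```

With siblings col_02 and col_32 both labeled Shipping Address plus a named column such as order_date, only the named column survives, so the reviewer no longer sees the two mislabeled opaque columns as corroboration.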
Finding 4 — Baseline 20% non-determinism
“Between e5b0ac26 and 8d67b1ed — identical configuration, same dataset, 5 hours apart — 59 of 287 columns (20.6%) changed their final predicted_code. This establishes the non-determinism floor.”
This finding is cross-run; one canvas can’t render it directly. But
it manifests on the canvas as confidence vs belief gap dispersion.
A run with high non-determinism has many columns where confidence
diverges from belief — these are the rows whose mass is spread
across compound focal elements rather than committed to singletons,
making them sensitive to small evidence perturbations.
Brush: in the SQL Predicate input, type confidence - belief > 0.15.
The canvas highlights the diffuse-mass cohort. These are the rows
most likely to flip on the next run.
Algo-tuning read: the audit’s P2 (raise bel_threshold from
0.80 to 0.85–0.90) tightens the cautious-review entry criterion —
fewer borderline rows enter review, fewer reroutes amplify. Brushing
belief ∈ (0.80, 0.85) shows the population the threshold raise
removes from review, which is also the high-flip-rate population.
A 5-point threshold raise on the 8d67b1ed canvas removes ≈ 30 columns
from the review pool — pre-computable from the histogram.
4. Algo-tuning playbook
When you arrive at a fresh canvas, walk top to bottom:
- Healthy run check: belief chart, look at the mass below 0.80. If < 10% of the corpus is below the cliff, the run converged comfortably. If > 25%, something upstream (LLM, alignment, vocab) is weak.
- Reroute pressure check: review_decision, count the reroute bar. Reroutes > 20% of total = the reviewer is doing too much work; consider raising bel_threshold (audit P2) or constraining the shortlist.
- LLM-evidence integrity check: llm_confidence, look for the < 0.20 tail. Population of that tail = approximately the annotation-hallucination cohort (audit P0).
- Feature-driver check: shap_top1_name, see whether sample_values dominates. When it doesn’t, the prediction is leaning on schema/sibling context — fragile.
- Hotspot triage: table_name, see whether failures concentrate in a small number of tables. Per-table failure patterns often point to vocabulary alignment gaps that affect only certain domains (legal, financial, medical).
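The same checks can be run off-canvas against the parquet rows. A minimal sketch, assuming each row is a dict with the panel's field names; the function name, the "borderline" label for the 10–25% band, and the top-3 hotspot cut are assumptions.

```python
from collections import Counter
from typing import Any, Dict, List


def playbook_summary(rows: List[Dict[str, Any]], bel_threshold: float = 0.80) -> Dict[str, Any]:
    """Compute the playbook's headline ratios for one run."""
    if not rows:
        raise ValueError("no rows to summarize")
    n = len(rows)
    weak = [r for r in rows if r["belief"] < bel_threshold]
    below = len(weak) / n
    reroutes = sum(1 for r in rows if r["review_decision"] == "reroute") / n
    llm_tail = sum(1 for r in rows if r["llm_confidence"] < 0.20) / n
    top_driver = Counter(r["shap_top1_name"] for r in rows).most_common(1)[0][0]
    if below < 0.10:
        health = "converged"
    elif below > 0.25:
        health = "upstream weakness"
    else:
        health = "borderline"  # band between the two stated cut-offs
    return {
        "below_cliff": below,
        "health": health,
        "reroute_pressure": reroutes > 0.20,
        "llm_low_tail": llm_tail,
        "top_driver": top_driver,
        "hotspots": Counter(r["table_name"] for r in weak).most_common(3),
    }
```

Running this before opening the canvas gives a prediction of what each chart should show, which makes it easier to spot a chart that disagrees with the numbers.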
The remediations the audit recommends should each have a visible signature on the canvas. When you propose a fix, predict where on the canvas the fix will land — and verify against the next run.
5. Configuration reference
The curated panel is configured in
ui/src/pages/Embeddings.tsx
via the defaultChartsConfig prop on the EmbeddingAtlas component.
The category field on the embedding spec sets the canvas color;
the include array sets the predicate panel contents and order.
Reviewers needing different fields for a one-off audit can use the
SQL Predicate control at the top of the predicate panel — type a
SQL expression directly (e.g.
predicted_code = '1.1.1.9.1' AND review_decision = 'reroute') and
brush the result. The expression composes with all other brushes on
the canvas.
For permanent additions to the default panel, edit the include
array; the order in the array is the order in the panel. Avoid
adding high-cardinality fields (column_name, evidence,
embedding_text) — they render as illegible count plots.
Further reading
- Classification Pipeline — how the six evidence sources are produced.
- DST Evidence Independence — why source independence matters for the rigor of these measures.
- Embeddings — the data flow that produces the parquet feeding this canvas.
- audit_2026-05-06_a.md — the worked example this guide is structured around.
Addendum — Remediation Paper-Trade Observations (2026-05-06)
This addendum captures observations from the static validation of the
algo-tuning playbook against 8d67b1ed’s parquet, and the paper-trade
of each audit_2026-05-06_a.md remediation against the same run.
It is intended both as honest documentation of what worked vs. what
needed adjustment, and as the calibration baseline against which the
post-remediation validation run will be evaluated.
A.1 Playbook validation findings
Walking the playbook brushes against 8d67b1ed/atelier_embeddings.parquet
surfaced three corrections to the original guide:
Correction 1 — BetP − Bel brush is empty in practice
The playbook section on Finding 4 (baseline non-determinism) prescribes
brushing confidence - belief > 0.15 to find the diffuse-mass cohort.
On 8d67b1ed:
mean(BetP - Bel) = 0.0007
rows with gap > 0.15 = 0
rows with gap > 0.05 = 0
Why: mass concentrates on singleton focal elements in this corpus, so the pignistic transform has nothing to redistribute. BetP ≈ Bel everywhere. The theoretical intuition (BetP optimistically spreads ignorance) is sound but only manifests when significant mass lives on compound focal elements — rare in production runs.
Replacement brush for the non-determinism cohort: uncertainty > 0.20
(Pl − Bel above the cautious-review gap threshold). This does
populate; in 8d67b1ed, 199/287 rows (69%) carry uncertainty > 0.20,
so for narrowing purposes pair it with belief < 0.6 to focus on
the genuinely weak predictions.
Correction 2 — bel_threshold direction is the opposite of the audit’s claim
The audit’s P2 recommendation says:
Raise bel_threshold from 0.80 to 0.85-0.90 → reduces candidate pool
This is mechanically false. The threshold gates entry to cautious
review (cautious_review.py:454: if bel < bel_threshold); raising it
strictly enlarges the candidate pool. Measured on 8d67b1ed:
| bel_threshold | Candidates |
|---|---|
| 0.80 | 199 / 287 (69.3%) |
| 0.85 | 239 / 287 (83.3%) |
| 0.90 | 255 / 287 (88.9%) |
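The direction follows from the gate's form alone: with an entry test of bel < bel_threshold, the candidate count cannot decrease as the threshold rises. A minimal sketch on synthetic belief values (the numbers below are made up, not taken from 8d67b1ed):

```python
from typing import Iterable


def candidate_count(beliefs: Iterable[float], bel_threshold: float) -> int:
    """Number of rows that pass the cautious-review entry gate."""
    return sum(1 for bel in beliefs if bel < bel_threshold)


beliefs = [0.30, 0.62, 0.79, 0.82, 0.87, 0.95]  # synthetic
counts = [candidate_count(beliefs, t) for t in (0.80, 0.85, 0.90)]
print(counts)  # [3, 4, 5]
```

Every row admitted at 0.80 is still admitted at 0.85, and rows with belief in [0.80, 0.85) are added on top, which is the pattern in the measured table.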
Decision: R4 is deferred. R1 (annotation-mnemonic recovery) materially lifts Bel for the 33%-of-corpus cohort with previously-empty LLM evidence; the candidate-pool size after R1 may make a threshold adjustment unnecessary. Re-evaluate after the validation run.
Correction 3 — Audit’s “16 hallucination cases” undercount
Audit Finding 2 cites “~16 cols hallucinate annotation in 8d67b1ed.”
The actual count of rows whose LLM evidence is absent from the fused
mass is 95 / 287 (33%) — six times the audit’s number. The
discrepancy is partly because the audit conflated two distinct cases:
true mnemonic emission (which R1 recovers) and LLM voting at a parent
focal element (which _mass_summary filters out of the singleton-only
evidence_sources.llm field even though the mass is fully present in
the fused result). The latter is not a hallucination — it’s an
observability artifact in _mass_summary.
Implication for the canvas: rows with evidence_sources.llm = {}
should not be read as “LLM contributed nothing.” When the row’s
llm_code is non-empty and falls inside the runtime taxonomy, the
LLM voted at an internal node and contributed mass through that
parent FE. The brush is more honest as a resolution-failure
indicator: filter to rows where llm_code is non-numeric AND evidence_sources.llm is empty to find the genuine mnemonic cohort.
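That filter can be sketched as a row predicate. The row shape below (a dict with llm_code and a nested evidence_sources mapping) is an assumption made for illustration; in the parquet these fields may be stored differently, for example as JSON strings.

```python
import re
from typing import Any, Dict

# Numeric dot-codes such as 1.1.1.9.4; anything else is treated as a mnemonic.
NUMERIC_CODE = re.compile(r"^\d+(\.\d+)*$")


def is_genuine_mnemonic(row: Dict[str, Any]) -> bool:
    """True when the LLM emitted a non-numeric code AND no LLM evidence was recorded."""
    code = (row.get("llm_code") or "").strip()
    llm_evidence = (row.get("evidence_sources") or {}).get("llm") or {}
    return bool(code) and not NUMERIC_CODE.match(code) and not llm_evidence
```

A row with llm_code = "EMPDET" and empty LLM evidence matches; a row with a numeric llm_code and empty LLM evidence does not, since that is the parent-focal-element case described above rather than a resolution failure.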
2026-05-07 update — R10: _mass_summary now surfaces internal-node FEs alongside singletons. Internal-node entries carry a trailing * (e.g. "1.1.1.9.4*": 0.65) to distinguish “parent FE, mass spread across descendants” from a singleton-leaf vote. The singleton-only filter is gone — evidence_sources.llm = {} now means the LLM produced no code we could map at all, which is the intended semantics.
A.2 Per-remediation paper-trade results
Each remediation was paper-traded against 8d67b1ed after
implementation. The predicted impact (third column) comes from
the audit’s recommendation; the paper-traded impact (fourth column)
is what the paper-trade measured.
| ID | Remediation | Audit’s predicted impact | Paper-traded impact | Note |
|---|---|---|---|---|
| R1 | Annotation-mnemonic fallback in _resolve_to_focal_element | “Recovers LLM evidence for 27 cols, ~20% reroute candidate reduction” | 38 / 287 columns (13.2% of corpus) recover full LLM evidence; mean llm_confidence on recovered cohort = 0.89; 5 hr_compensation columns concentrate on EMPDET → 1.1.1.2.5.3 (Employment Related) from scattered low-Bel predictions | Significantly higher than audit’s 27. Recovery is concentrated in tables with rich user-vocab mnemonics (EMPDET, PANEXP, SHIPADDR, BIN). |
| R6 | Skip Hive/Hue temp tables (__tmp_*) at discovery | (not in audit) | 1 table dropped (hue__tmp_ecommerce_orders); 16 cols removed from classification, 9 of which were R1-recovery candidates → net R1 impact after R6: 29 cols | R6 supersedes R1 for those 9 cols (correct: temp tables shouldn’t classify at all). Net cohort R1 actually recovers in next run = 29. |
| R2b | Markdown-fence + extra-data extraction in _parse_decision | “Eliminates 3-5 hard errors per run” | 3 / 11 errored decisions in 8d67b1ed had the markdown-fence-with-trailing-prose shape; new _extract_json_object parses them cleanly (verified against captured response) | Audit estimate accurate. |
| R2c | Shortlist-permissive parsing (NEW — audit conflated with R2b) | (not separately specified) | 8 / 11 errored decisions rejected codes that were valid in the runtime taxonomy but outside the 5-entry shortlist. R2c accepts these as shortlist_extended reroutes | Audit’s “11 errors” summary should split into two classes; R2b alone would only catch 3/11. |
| R3 | Exclude opaque siblings (col_NN, var_NN, dim_NN, …) from reviewer context | “Prevents sibling-context poisoning” | Cohort visible in 8d67b1ed is small (1 rerouted, 2 candidates) because filter was ON; in 40f07630 (filter off) the cohort is 13+ | Paper-trade limited by which run is on hand. Validation run will need filter OFF or include opaque-name tables to size the impact. |
| R2a | Stability guard on cross-subtree reroutes | “Prevents the gaming_profiles.handle failure class” | Three iterations: v1 (naive: pre==llm ∧ conf>0.80) blocked 20 / 64 reroutes, including legitimate depth corrections. v2 (top-level-root differs) blocked 5 / 64 but missed sideways moves within the 1.x namespace. v3 (neither-is-ancestor — current implementation) blocks 12 / 64 — all visibly cross-subtree, with depth corrections preserved | Audit framing assumed all “LLM+fusion agreed” reroutes are noise; in practice 15 such reroutes were within-subtree backoffs (e.g., 1.1.1.8.2 → 1.1.1.8). The neither-is-ancestor rule cleanly separates these. |
| R5 | Split llm_agreement into pre/post-review metrics | “Makes overwatch signal useful” | Purely additive; new llm_agreement_pre_review field reports DST-vs-LLM alignment without review reassignment confounding | Diagnostic only; no impact on classification outcomes. |
| R4 | Raise bel_threshold 0.80 → 0.85-0.90 | “Reduces candidate pool” | Deferred — see Correction 2. Audit direction is mechanically wrong. Re-evaluate after R1 lifts Bel | Expected outcome: post-R1, the threshold may not need adjustment; if it does, the right direction is down (0.65-0.70). |
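The difference between the v2 and v3 guard rules in the R2a row can be illustrated on dotted taxonomy codes. This is a minimal sketch, assuming codes are dot-separated paths such as 1.1.1.8.2; the function names are hypothetical and are not taken from the Atelier source.

```python
def is_ancestor(a: str, b: str) -> bool:
    """True when dotted code `a` is a strict ancestor of dotted code `b`."""
    a_parts, b_parts = a.split("."), b.split(".")
    # Compare path segments, not raw strings, so "1.1" is not an ancestor of "1.10".
    return len(a_parts) < len(b_parts) and b_parts[: len(a_parts)] == a_parts


def roots_differ(old: str, new: str) -> bool:
    """v2-style rule: only the top-level root segment is compared."""
    return old.split(".")[0] != new.split(".")[0]


def is_cross_subtree(old: str, new: str) -> bool:
    """v3-style rule: the codes differ and neither is an ancestor of the other."""
    if old == new:
        return False
    return not is_ancestor(old, new) and not is_ancestor(new, old)


# Depth correction from the table: a backoff to the parent is not cross-subtree.
print(is_cross_subtree("1.1.1.8.2", "1.1.1.8"))   # False
# Sideways move inside the 1.x namespace (illustrative codes): v2 misses it, v3 flags it.
print(roots_differ("1.1.1.8", "1.2.1.3"))         # False
print(is_cross_subtree("1.1.1.8", "1.2.1.3"))     # True
```

Under this reading, a move between two children of the same parent also counts as cross-subtree, because neither code is an ancestor of the other.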
A.3 Predicted canvas signatures for the validation run
What to look for on the post-remediation canvas to verify each remediation landed:
| Remediation | Predicted canvas signature |
|---|---|
| R1 | belief histogram shifts right — the mode of the < 0.5 cluster moves toward 0.7-0.8 (LLM evidence now contributing). The hr_compensation table (5 cols, all currently scattered) collapses onto a single predicted_annotation value (EMPDET). |
| R6 | Total column count drops by ~16 (the hue__tmp_ecommerce_orders columns). table_name count plot loses one bar. |
| R2b | cautious_review.json’s errored count drops by ~3. Bedrock-deployed runs benefit most. |
| R2c | cautious_review.json’s errored count drops by ~8 (combined with R2b: total errored drops to 0-1). New shortlist_extended counter in summary > 0. |
| R3 | cautious_review.json row records show siblings_after_filter < siblings_unfiltered for tables containing col_NN columns. Reroutes whose rationale referenced sibling labels (e.g., the col_04 → Shipping Address case) lose that justification. |
| R2a | cautious_review.json summary shows stability_guard_fired > 0; the guard’s blocked reroutes show up as decision = "keep" with rationales prefixed [R2a stability guard fired: ...]. Brush by review_decision = "keep" AND review_rationale LIKE '[R2a%' in the SQL Predicate panel to count. |
| R5 | Overwatch report’s Health Signals table gains a row; llm_agreement_pre_review > llm_agreement when reviewer reassigned LLM-aligned predictions. |
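The R2a predicate in the table can also be applied outside the SQL Predicate panel. The sketch below is illustrative only: it assumes each review record is a dictionary carrying the `review_decision` and `review_rationale` fields named in the predicate, and the sample rows are made up for the example.

```python
from typing import Iterable, Mapping

GUARD_PREFIX = "[R2a"  # rationale prefix described in the R2a row


def count_guard_blocks(rows: Iterable[Mapping[str, object]]) -> int:
    """Count rows that were kept and whose rationale starts with the guard prefix."""
    return sum(
        1
        for row in rows
        if row.get("review_decision") == "keep"
        and str(row.get("review_rationale") or "").startswith(GUARD_PREFIX)
    )


# Made-up rows, shaped after the fields used in the SQL predicate.
sample_rows = [
    {"review_decision": "keep", "review_rationale": "[R2a stability guard fired: ...] original text"},
    {"review_decision": "keep", "review_rationale": "Evidence agrees with the current code"},
    {"review_decision": "reroute", "review_rationale": "[R2a stability guard fired: ...]"},
]
print(count_guard_blocks(sample_rows))  # 1
```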
A.4 What the paper-trade cannot validate
- Cumulative interaction effects. R1 raises Bel for 38 cols, which changes which cols enter cautious review, which changes the shortlist composition for those cols, which changes whether R2c’s permissive path fires. Static paper-trade can’t model this cascade.
- Real LLM behavior in cautious review. R2c assumes the LLM occasionally picks valid-but-out-of-shortlist codes; the true rate may differ once the run uses the post-R1 frame (more LLM evidence → fewer cautious-review entries → smaller cohort exposed to R2c).
- Bedrock vs Anthropic-direct response shapes. R2b was smoke-tested against one captured Bedrock fence-with-prose case; other Bedrock formatting variants (mid-stream JSON, Latin-1 whitespace, multi-block responses) are unobserved in the dataset.
- R3 sibling-context poisoning size on this corpus. With classify_exclude_reference_columns = true (8d67b1ed’s setting), the col_04-class cohort is suppressed at discovery; the validation run should toggle this off (or include opaque-name tables) if the goal is to measure R3’s true impact.
A.5 Expected delta on overwatch’s Health Signals table
Pre-remediation (8d67b1ed):
| Signal | Configured | Actual | In Contract? |
|---|---|---|---|
| llm_agreement | ≥ 0.9895 | 0.6794 | ❌ No |
| state.failed_columns | ≤ 2 | 11 | ❌ No |
Post-remediation (validation run prediction):
| Signal | Configured | Predicted | In Contract? |
|---|---|---|---|
| llm_agreement (post-review) | ≥ 0.9895 | ~0.85 (R1+R2c+R3+R2a all push it up) | ❌ Still under, but materially closer |
| llm_agreement_pre_review (R5, NEW) | (no contract) | ~0.92 | — |
| state.failed_columns | ≤ 2 | 0-1 (R2b + R2c eliminate parser/shortlist failures) | ✅ Yes |
| total_columns | — | 271 (was 287; R6 drops 16) | — |
| Cohort with empty evidence_sources.llm | — | ~57 (was 95; R1 recovers 38) | — |
| stability_guard_fired (R2a, NEW) | — | ~12 | — |
| shortlist_extended (R2c, NEW) | — | ~8 | — |
If the post-validation overwatch report shows llm_agreement still
sub-0.80, the residual gap is in the 55-column “numeric-unresolved”
cohort that R1 doesn’t touch. That points to a follow-up
remediation — likely a frame-coverage gap where the LLM emits codes
the runtime taxonomy doesn’t carry.
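The "In Contract?" column of the Health Signals tables is a threshold comparison. The following sketch shows that check for the pre-remediation values; the `in_contract` helper and the bound-string format it parses are assumptions for illustration, not overwatch's actual implementation.

```python
import operator

# Comparison symbols as they appear in the "Configured" column.
OPS = {"≥": operator.ge, "≤": operator.le}


def in_contract(configured: str, actual: float) -> bool:
    """Evaluate a bound such as '≥ 0.9895' or '≤ 2' against an observed value."""
    symbol, bound = configured.split()
    return OPS[symbol](actual, float(bound))


# Pre-remediation values from run 8d67b1ed.
signals = {
    "llm_agreement": ("≥ 0.9895", 0.6794),
    "state.failed_columns": ("≤ 2", 11),
}
for name, (configured, actual) in signals.items():
    print(f"{name}: {'in contract' if in_contract(configured, actual) else 'out of contract'}")
```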
A.6 Configuration
Each remediation is gated by an independent flag, so a follow-up A/B run (if any signature is missing or wrong) can isolate per-remediation contribution by toggling one flag at a time.
| Flag | Default | Disable for ablation |
|---|---|---|
| classify.resolve_llm_annotation_mnemonic | true | R1 off |
| classify.exclude_temp_tables | true | R6 off |
| classify.cautious_review.shortlist_permissive | true | R2c off |
| classify.cautious_review.exclude_opaque_siblings | true | R3 off |
| classify.cautious_review.stability_guard_enabled | true | R2a off |
| classify.cautious_review.stability_guard_llm_conf | 0.80 | R2a threshold |
The R2b parser improvement is not flag-gated — it’s strictly more correct than the prior greedy regex on every input.
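One way to see why a decoder-based parser handles the fence-with-trailing-prose shape is to use `json.JSONDecoder.raw_decode`, which parses a single JSON value and ignores whatever follows it. The sketch below is not the project's `_extract_json_object`; the function name, the sample reply, and the `decision`/`code` keys are hypothetical.

```python
import json
from typing import Any, Optional


def extract_json_object(text: str) -> Optional[dict[str, Any]]:
    """Return the first decodable JSON object in `text`, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and ignores
            # anything after it (closing fence, trailing prose, extra data).
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)
    return None


fence = "`" * 3  # built at runtime so this example contains no literal fence
payload = '{"decision": "keep", "code": "1.1.1.8"}'  # hypothetical keys
reply = f"Here is my decision:\n{fence}json\n{payload}\n{fence}\nHope that helps {{see notes}}."
print(extract_json_object(reply))  # {'decision': 'keep', 'code': '1.1.1.8'}
```

A greedy brace-to-brace match on the same reply would run from the opening brace of the payload to the closing brace in the trailing prose and fail to parse, which is the failure shape described for R2b.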
A.7 Test surface
Unit tests in tests/classify/test_audit_remediations.py cover R1,
R2b, R2c, and R6. R2a and R3 are paper-traded against
build/results/8d67b1ed/cautious_review.json rather than unit-tested
because their value lives in cohort behavior (cross-subtree
distribution, sibling filtering effects), not single-decision
transforms. R5 is a metric addition with no decision logic to test.
PYTHONPATH=src python3 -m pytest tests/classify/test_audit_remediations.py -v
# 19 tests, all passing as of 2026-05-06.
Extend Classification Workflow
End-to-end procedure for classifying a Hive corpus that grows over time: train CatBoost on the stable subset, then extend the trained model to newly-added tables without re-running the full LLM-driven classification pipeline.
This report documents the procedure and the empirical results from a
session on 2026-05-13 against the
hive-poc/reference_corpus source (reference data-governance POC, 40
tables, ~620 columns), running with the Phase-3 DST frame and the
LLM-emission validation + retry mechanism enabled.
Why two-phase classification
A full classify run uses LLM sweeps, multi-source DST fusion, and cautious-review on top of CatBoost training — minutes to tens of minutes per 300-column batch with non-trivial LLM cost. An extend run reuses a previous run’s CatBoost (and optionally UMAP / SVM) and applies them directly to new columns — seconds to a couple of minutes regardless of corpus size, no LLM cost.
The pattern lets data-governance teams:
- Establish a stable baseline classification on the tables they already know
- Onboard new tables incrementally without re-running expensive LLM sweeps
- Compare new-table predictions against a known model artifact for audit and consistency
Empirically, on the corpus we measured, the extend output actually scored higher on the operator-flagged ground-truth proxy than the parent classify run (71.9% strict vs 68.1%) — the cautious-review backoff in the full pipeline turned out to be over-conservative on this corpus. The workflow below establishes both runs so you can compare them and pick the artifact that best matches your governance team’s expectations.
Prerequisites
| Requirement | Details |
|---|---|
| Atelier deployment | CAI Application or local devenv with cml.data_v1 access |
| Hive source | A data_sources row registered for the corpus (e.g. hive-poc/reference_corpus) |
| Annotations table | Deployed at <connection>.<cfg.classify_database>.annotations (typically <connection>.default.annotations) — not colocated with the data tables |
| Config | config/base.conf editable, or env-var overrides for the toggles below |
| LLM backend | Configured via ANTHROPIC_API_KEY / Bedrock credentials so the classify-phase sweep can run |
The classify and extend runs are triggered from the UI’s pipeline
panel or via POST /api/fsm/start and POST /api/fsm/extend
respectively.
The config knobs that drive the workflow
Two HOCON settings under classify { … } in config/base.conf:
classify.table_exclude_patterns
Comma-separated regex patterns matched against Hive table names
(re.search semantics, case-sensitive). Tables whose name matches
any pattern are dropped at discover_tables time and never sampled
— same mechanism applies uniformly to classify and extend pipelines.
Empty (default) = no filtering. Operator edits this between runs.
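The filtering semantics described here (comma-separated patterns, `re.search`, case-sensitive, empty means no filtering) can be sketched as follows. The helper names are hypothetical, and whitespace stripping around each pattern is an assumption about the loader.

```python
import re


def parse_patterns(raw: str) -> list[re.Pattern[str]]:
    """Split the comma-separated setting into compiled patterns; '' yields none."""
    return [re.compile(part.strip()) for part in raw.split(",") if part.strip()]


def filter_tables(tables: list[str], raw_patterns: str) -> list[str]:
    """Drop every table whose name matches any pattern under re.search semantics."""
    patterns = parse_patterns(raw_patterns)
    return [t for t in tables if not any(p.search(t) for p in patterns)]


tables = ["member_registry", "member_registry_v2", "Orders", "orders"]
print(filter_tables(tables, "^member_registry$, ^orders$"))  # ['member_registry_v2', 'Orders']
print(filter_tables(tables, ""))  # unchanged: empty setting means no filtering
```

A naive comma split like this cannot carry a pattern that itself contains a comma (for example a `{1,3}` quantifier), so patterns are best kept to simple anchored names.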
classify.svm.enabled
When false (current default), the per-vocabulary SVM evidence
source is skipped — the alignment LLM call doesn’t fire, no SVM is
trained, and the pipeline runs with 5 evidence sources instead of
6. Toggle back to true after the recipe-driven synth training
described in docs/src/architecture/... (separate workstream)
replaces the LLM-mediated alignment.
Both also have env-var overrides
(ATELIER_CLASSIFY_TABLE_EXCLUDE_PATTERNS,
ATELIER_CLASSIFY_SVM_ENABLED) that take precedence over the HOCON
defaults at load time.
Procedure
Step 1 — Identify the “stable” subset of the corpus
Decide which tables you want CatBoost to train on. The pattern is typically: “tables that have been in production long enough to have operator-validated classifications.” Newly-added tables go in the excluded set.
For the documented session, the stable subset was the 20 tables that
existed in the previous classify baseline (5450b626), with 20 new
tables added to Hive after that.
Identify the newly-added tables by diffing the current Hive table list against a previous run’s classifications:
python3 << 'EOF'
import json
from pathlib import Path

parent_run = '5450b626'  # or whichever prior run defines your baseline
new_run = 'f931f469'     # a fresh run that classified the post-addition full source

# Distinct table names seen in each run's classifications.json
parent_tables = sorted({
    c['table_name']
    for c in json.loads(Path(f'/home/cdsw/build/results/{parent_run}/classifications.json').read_text())
    if c.get('table_name')
})
new_tables = sorted({
    c['table_name']
    for c in json.loads(Path(f'/home/cdsw/build/results/{new_run}/classifications.json').read_text())
    if c.get('table_name')
})

# Tables present in the new run but absent from the baseline
added = sorted(set(new_tables) - set(parent_tables))
print(f'Added tables: {len(added)}')
for t in added:
    print(f'  + {t}')
EOF
For each new table, build a fully-anchored regex pattern
(^name$) so a future table named e.g. member_registry_v2 doesn’t
accidentally get caught by a pattern targeting member_registry.
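A short sketch of why full anchoring matters, and of how the pattern list could be generated from the table names found in Step 1. The `anchored_patterns` helper is hypothetical; `re.escape` guards against names that contain regex metacharacters.

```python
import re


def anchored_patterns(table_names: list[str]) -> str:
    """Build a comma-separated list of fully anchored patterns, one per table."""
    return ", ".join(f"^{re.escape(name)}$" for name in sorted(table_names))


setting = anchored_patterns(["order_shipments", "member_registry"])
print(setting)  # ^member_registry$, ^order_shipments$

# Unanchored: re.search also matches the future table.
print(bool(re.search("member_registry", "member_registry_v2")))    # True
# Anchored: only the exact table name matches.
print(bool(re.search("^member_registry$", "member_registry_v2")))  # False
```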
Step 2 — Filter the new tables before the classify run
Edit config/base.conf to populate classify.table_exclude_patterns
with the comma-separated regex list:
classify {
…
table_exclude_patterns = "^app_developer_records$, ^compliance_documents$, ^component_catalog$, ^contact_supplemental$, ^content_profiles$, ^credential_vault$, ^device_identity_log$, ^engagement_signals$, ^headcount_ledger$, ^health_location_profiles$, ^member_registry$, ^order_shipments$, ^payment_events$, ^program_index$, ^return_billing$, ^screening_records$, ^security_research_assets$, ^staff_registry$, ^system_audit_records$, ^workforce_data$"
table_exclude_patterns = ${?ATELIER_CLASSIFY_TABLE_EXCLUDE_PATTERNS}
…
}
Or as a single-line env override in .env.cai.enc:
ATELIER_CLASSIFY_TABLE_EXCLUDE_PATTERNS="^app_developer_records$, …, ^workforce_data$"
Verify the config loads correctly:
python3 -c "
import sys; sys.path.insert(0, 'src')
from atelier.config import load_config
cfg = load_config()
print(f'{len(cfg.classify_table_exclude_pattern_list)} patterns:')
for p in cfg.classify_table_exclude_pattern_list:
    print(f'  {p}')
"
Step 3 — Restart the Application to pick up the new config
In the CAI Workspace UI, Application → Restart. The pipeline
loads HOCON values fresh on each load_config() call, but the
in-memory Python module cache for _HOCON_MAP is initialized once;
a restart guarantees both layers see the new config.
Step 4 — Run the parent classify against the stable subset
Trigger from the UI’s pipeline panel, or:
curl -s -X POST "$ATELIER_BASE_URL/api/fsm/start" \
-H 'content-type: application/json' \
-d '{"source_id": "hive-poc/reference_corpus"}'
Expected:
- discover_tables enumerates all tables in Hive, drops the excluded set, returns the stable subset
- The pipeline runs end-to-end on the filtered set: LLM sweep, DST fusion, fit-to-LLM CatBoost training, cautious review, SHAP/SAGE if enabled
- Run dir lands at build/results/<run_id>/ with the full artifact set (CatBoost CBM, classes JSON, UMAP, parquet, classifications, evaluation_report, etc.)
- Run kind: classify. Artifact set: same id as run_id.
Note the run_id of this baseline — it becomes the
artifact_set_id for the extend run.
What you should see in validation_retries.json
{
  "total_retries": 1-5, // small number is healthy
  "events": [
    {
      "column_names": ["..."],
      "invalid_codes": ["A_FD", "1.2.1.3.3", ...],
      "retry_idx": 0
    },
    …
  ]
}
Each entry is a column where the LLM emitted a code that’s not in
the deployed default.annotations taxonomy. The retry mechanism
re-prompted the LLM with the specific invalid code named, and the
LLM (almost always at retry_idx: 0) emitted a valid code on the
second attempt. After-exhaustion blanking (residual invalid
emissions getting category_code = None) is rare; if it happens,
those columns are simply dropped from CatBoost training data.
Empty events: [] means the LLM emitted only in-taxonomy codes
throughout the sweep — the goal state.
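A quick way to read the audit file is to summarize it. The sketch below assumes only the fields shown in the sample above (`total_retries`, `events`, `column_names`, `invalid_codes`, `retry_idx`); the helper names and the `amount` column in the sample document are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def load_retries(run_dir: str) -> dict:
    """Read validation_retries.json from a run directory."""
    return json.loads((Path(run_dir) / "validation_retries.json").read_text())


def summarize_retries(doc: dict) -> dict:
    """Condense the retry audit into counts that are easy to scan."""
    events = doc.get("events", [])
    return {
        "total_retries": doc.get("total_retries", 0),
        "clean_sweep": not events,  # empty events list is the goal state
        "invalid_code_counts": dict(
            Counter(code for e in events for code in e.get("invalid_codes", []))
        ),
        "columns": sorted({c for e in events for c in e.get("column_names", [])}),
        "max_retry_idx": max((e.get("retry_idx", 0) for e in events), default=None),
    }


# Illustrative document; real files come from build/results/<run_id>/.
sample = {
    "total_retries": 2,
    "events": [
        {"column_names": ["amount"], "invalid_codes": ["A_FD"], "retry_idx": 0},
        {"column_names": ["case_ref"], "invalid_codes": ["1.2.1.3.3"], "retry_idx": 0},
    ],
}
print(summarize_retries(sample))
```

A `max_retry_idx` above 0 would indicate that at least one column needed more than one re-prompt.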
Step 5 — Clear the filter before the extend run
Edit config/base.conf:
classify {
…
table_exclude_patterns = ""
…
}
Or unset the env var. Restart the Application again.
Step 6 — Run extend against the artifact from Step 4
curl -s -X POST "$ATELIER_BASE_URL/api/fsm/extend" \
-H 'content-type: application/json' \
-d '{
"source_id": "hive-poc/reference_corpus",
"artifact_set_id": "<parent_run_id>",
"parent_dataset_id": "<parent_run_id>"
}'
Or trigger from the UI’s Extend panel against the artifact set
matching the parent’s run_id.
Expected:
- discover_tables enumerates all 40 tables (no filtering)
- sample_table_metadata samples each
- The parent run’s CatBoost model runs predict_proba on every column
- No LLM sweep, no DST fusion, no cautious review: straight CatBoost top-1
- Run dir at build/results/<extend_run_id>/ with parquet, classifications, evaluation_report
- Run kind: extend. References the parent via artifact_set_id and parent_dataset_id
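"Straight CatBoost top-1" means each column takes the class with the highest predicted probability and nothing else adjusts it. The sketch below shows that selection step on plain lists; it assumes a probability matrix shaped like a `predict_proba` result (one row per column, one entry per class) and a class list like the one stored in the classes JSON. It does not call CatBoost, and the probabilities are made up.

```python
def top1(probabilities: list[list[float]], classes: list[str]) -> list[tuple[str, float]]:
    """Pick the highest-probability class for each row."""
    results = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the largest entry
        results.append((classes[best], row[best]))
    return results


classes = ["IPADDR", "TRANSID", "SALARY"]   # labels borrowed from this report
probabilities = [
    [0.10, 0.70, 0.20],                     # made-up values for illustration
    [0.60, 0.30, 0.10],
]
print(top1(probabilities, classes))  # [('TRANSID', 0.7), ('IPADDR', 0.6)]
```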
A real cost in elapsed time
For a 40-table / ~620-column corpus, the extend run completes in roughly 2–3 minutes (dominated by Hive metadata sampling). Compare to the parent classify which takes 10–30 minutes depending on LLM batch latency.
Caveats observed during the session
The annotations database is NOT colocated with the data tables
The deployment has data tables at hive-poc.reference_corpus but
the canonical taxonomy at hive-poc.default.annotations. The full
classify pipeline handles this via cfg.classify_database
(defaults to "default") and an optional vocab_uri on the
data_sources row. The extend pipeline must do the same — early
in the session a regression was found where extend was querying
<data_db>.annotations (which doesn’t exist), silently catching
the exception, and producing output with predicted_annotation
empty and predicted_label echoing predicted_code. The fix at
src/atelier/classify/extend_pipeline.py reads from
cfg.classify_database for annotations, independent of the
data-tables database resolved from source_id.
validation_retries.json is the audit trail
Any LLM emission outside the deployed taxonomy is captured in
build/results/<run_id>/validation_retries.json with the column
name and the invalid code. Empty events list = clean sweep. The
audit lives alongside the run artifacts so post-mortem doesn’t
require pod-log access.
Cautious-review backoff can be over-conservative
On the documented corpus, the parent classify’s cautious-review
mechanism backed off 15 columns from terminal predictions to
parent codes that the extend run subsequently recovered as
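correct terminals. The threshold knob
(classify.cautious_review.bel_threshold, default 0.80) is the
lever. Because columns enter cautious review when their Bel falls
below the threshold, lowering it (for example to 0.65-0.70) should
reduce the rate of backoffs; raising it to 0.85 or 0.90 would
widen the candidate pool instead.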
Re-running classify with the filter restored is cheap regression-protection
If the extend output looks worse than expected on the OLD tables,
the parent’s artifacts are unchanged and re-deploying is one config
edit + restart. Both runs land in build/results/ and are
independently auditable.
Results from the 2026-05-13 session
Five classify+extend runs were measured against the same
operator-curated review spreadsheet
(Atelier-Results-vs-Prompt-solution-522d89ae.xlsx), which
encodes one operator’s expected classifications for the 20 OLD
tables. Four metrics matter:
- Strict (canonical-validated): predicted_annotation matches the spreadsheet’s expected tag, validated against default.annotations so spreadsheet hallucinations don’t count as Atelier misses
- Stem-collapsed: same as strict but ignoring A_/C_/S_ prefix differences within a code’s annotation family
- Binary sensitive-vs-public: predicted sensitive vs non-sensitive matches the spreadsheet’s Data Sensitivity field
- Operator-curated recall: 15 columns the operator explicitly flagged as “Atelier got this wrong”; recall counts how many now resolve correctly
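The canonical-validation step of the strict metric can be sketched as below. This is one plausible reading, not the scoring script used for the table: it assumes that rows whose expected tag is absent from the canonical taxonomy are excluded from the denominator. The function name and the sample rows are illustrative.

```python
from typing import Iterable, Optional


def strict_accuracy(
    rows: Iterable[tuple[str, str]], canonical_tags: set[str]
) -> Optional[float]:
    """Share of (predicted, expected) pairs that match exactly, counting only
    rows whose expected tag exists in the canonical taxonomy."""
    scored = [(pred, exp) for pred, exp in rows if exp in canonical_tags]
    if not scored:
        return None  # nothing scorable
    hits = sum(1 for pred, exp in scored if pred == exp)
    return hits / len(scored)


canonical = {"SALARY", "DEVMACADDR", "IPADDR", "NAMEFULL"}
rows = [
    ("SALARY", "SALARY"),        # match
    ("IPADDR", "DEVMACADDR"),    # miss
    ("NAMEFULL", "MADE_UP_TAG"), # expected tag not canonical: skipped
]
print(strict_accuracy(rows, canonical))  # 0.5
```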
| Run | Notes | Strict | Stem | Binary | Op-curated |
|---|---|---|---|---|---|
| 522d89ae | Original baseline (pre-Phase-3, pre-validation) | 69.1% | 44.6% | 84.2% | 0/15 |
| 5450b626 | Pre-Phase-3 retrain (filtered to 20 OLD tables) | 66.7% | 42.8% | 83.2% | 3/15 |
| 1d6e3fae | Phase 3 only (full DST frame, no validation+retry) | 67.4% | 42.1% | 83.9% | 3/15 |
| 2ac4d0a6 | Phase 3 + validation+retry classify | 68.1% | 43.2% | 84.6% | 4/15 |
| 0146134f | Phase 3 + validation+retry extend (from 2ac4d0a6) | 71.9% | 47.0% | 84.6% | 7/15 |
Three distinct improvements
- Validation+retry catches the parent classify up. 2ac4d0a6 over 1d6e3fae: +0.7pp strict, +1 op-curated. Driven by the 3 LLM hallucinations the new mechanism caught and corrected in real-time (A_FD on monetary columns, 1.2.1.3.3 on case_ref).
- Extend’s CatBoost-only path materially outperforms the parent’s full pipeline. 0146134f over 2ac4d0a6: +3.8pp strict, +3 op-curated. Surprise: extend lacks DST fusion and cautious review, yet scores higher, because the parent’s cautious-review backoff was over-conservative on this corpus.
- Op-curated recall climbs across the whole arc. 0/15 → 7/15 over the session’s work, without ground-truth supervision or model changes, just architectural correctness improvements (Phase 3, validation+retry, correct annotations database in extend).
Column-level diff (0146134f vs 2ac4d0a6 on the OLD 20 tables)
Of 300 shared OLD-table predictions:
unchanged: 263 (88%)
leaf → parent (regression): 3 (1%)
parent → leaf (refinement): 15 (5%)
sibling-within-subtree: 14 (5%)
cross-subtree: 5 (2%)
Net specificity move: +12 columns more specific in extend than parent
Confidence delta on unchanged: median +0.177, mean +0.196
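The percentages and the net specificity figure follow directly from the bucket counts; a few lines of Python reproduce them from the numbers above.

```python
# Bucket counts copied from the column-level diff.
buckets = {
    "unchanged": 263,
    "leaf -> parent (regression)": 3,
    "parent -> leaf (refinement)": 15,
    "sibling-within-subtree": 14,
    "cross-subtree": 5,
}
total = sum(buckets.values())
for name, count in buckets.items():
    print(f"{name:30s} {count:4d} ({count / total:.0%})")

# Net specificity: refinements minus regressions.
net = buckets["parent -> leaf (refinement)"] - buckets["leaf -> parent (regression)"]
print(f"total={total} net_specificity={net:+d}")  # total=300 net_specificity=+12
```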
Specific Phase-3+validation refinements
The 15 parent-to-leaf flips include exactly the failure modes documented in earlier xlsx reviews:
- shipping_manifests/tracking_id: A_TRID parent → TRANSID leaf
- legal_cases/party_ref: C_PID parent → NAMEFULL leaf
- gaming_profiles/linked_account: ACCOUNT_ID → SOCIAL_ID
- insurance_claims/alt_contact: A_PHN → OTHPHNUM
- hr_compensation/comp_value: INCOME → SALARY
- shipping_manifests/col_32: COUNTRY → SHIPCNTY
Three column-classes that still miss
Of the 8 operator-curated columns 0146134f still misses, all fall into pre-documented failure modes:
- TRANSID over-application on permit columns: permit_ref, rec_33 wanting TRAVPERM/WORKPERM, still getting TRANSID
- System-vs-Person URL: page_ref, media_ref wanting PRSNURL/INPPHOTO, still getting SYSURL
- Network identifier domain-adaptation gap: network_addr wanting DEVMACADDR, still getting IPADDR; the SVM has not been trained on synthetic examples that separate MAC-shape from IPv4-shape
These are the targets for the recipe-driven dense-synth SVM retraining workstream (parked pending implementation) — the generators need to teach the SVM patterns the pretrained models cannot read.
Reproducibility checklist
For others to reproduce this work end-to-end:
- Clone the Atelier repo at the commit landed during the 2026-05-13 session (Phase 3 + validation+retry merged).
- Configure a Hive connection pointing at a corpus that matches the shape (data tables in one database, annotations table in default.annotations, ~10-50 tables).
- Identify a stable subset and an “added” subset of the corpus.
- Follow Steps 1–6 above.
- Compare:
  - build/results/<parent_run>/evaluation_report.json vs build/results/<extend_run>/evaluation_report.json for headline metrics
  - build/results/<parent_run>/classifications.json vs build/results/<extend_run>/classifications.json for column-level diffs on the overlap
  - build/results/<parent_run>/validation_retries.json for the LLM-hallucination audit trail
- If you have an operator-curated review spreadsheet (per docs/src/operations/embeddings-reviewer-guide.md), apply the scoring methodology in this report.
The session’s artifacts live at:
build/results/5450b626/ # pre-Phase-3 baseline
build/results/1d6e3fae/ # Phase 3 only
build/results/2ac4d0a6/ # Phase 3 + validation+retry classify
build/results/0146134f/ # Phase 3 + validation+retry extend
Spreadsheet:
Atelier-Results-vs-Prompt-solution-522d89ae.xlsx
Backfill script (used to populate predicted_annotation on
extend runs produced before the colocation fix landed):
scripts/backfill_extend_annotations.py
What’s not in scope for this report
- Recipe-driven SVM retraining to address the 8 remaining operator-curated misses (parked; needs synth-generator densification around the documented domain-adaptation gaps)
- Cautious-review threshold tuning to align parent classify predictions more closely with extend (A/B candidate)
- Multi-reviewer ground truth to replace the single-operator spreadsheet as the evaluation substrate (Tier 0 of the broader accuracy-improvement roadmap)
- Subjective Logic / conformal prediction for the no-ground-truth deployment scenario (architectural discussion captured in separate design notes)
Each is tracked separately; the workflow documented here is the current operationally-ready path.
Scenario Overview
Atelier uses behave (BDD) to capture platform decisions as executable specifications. Every scenario answers a concrete question: Does the config load? Can the runtime start? Does the classification pipeline converge?
These aren’t just tests. They’re the design context that connects architectural choices to the deployment realities of Cloudera AI.
Active Domains
155 scenarios across 35 features, 4 domains.
Infrastructure (infra)
Health checks and configuration lifecycle for the services Atelier depends on.
| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| Config lifecycle | @config | 0 | 3 | HOCON load, CLI override precedence, materialize + validate |
| PostgreSQL health | @postgres | 1 | 2 | Connection with pgvector extension, migration state |
| Qdrant health | @qdrant | 1 | 1 | Vector store HTTP health endpoint |
| PGlite process | @pglite | 0 | 2 | Node.js script existence, npm dependency declarations |
| Preflight | @preflight | 0 | 3 | Structured deny/warn checks, GPU detection |
Deployment
CAI deployment modalities and the runtime profile that catches failures before pushing.
| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| Runtime profile | @runtime-profile | 0 | 6 | Import chain, script executability, config resolution, migration parsing |
| AMP lifecycle | @amp | 0 + cai | 5 | .project-metadata.yaml structure, task patterns, install + start |
| Application modality | @application | 0 + 1 | 3 | HOST binding logic, full local stack startup |
| Studio modality | @studio | 0 | 2 | IS_COMPOSABLE root directory routing |
| Embeddings integration | @embeddings | 0 | 4 | npm dependency, page component, React Router, preparation script |
| Naming conventions | @naming | 0 | 2 | User-facing surfaces say “Embeddings”, no Apache Atlas confusion |
Gateway
HTTP gateway endpoints, gRPC bridge, and live service integration.
| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| API endpoints | @api | 0 + 1 | 8 | REST endpoint contracts, response shapes |
| API testclient | @testclient | 0 | 7 | FastAPI TestClient integration (no running server) |
| Status endpoint | @status | 0 + 1 | 4 | Aggregated health report, config state |
| Pipeline integration | @pipeline | 1 | 2 | Classification pipeline via gateway |
| SPA routes | @spa | 0 | 1 | Client-side routing fallback |
Agent
Classification pipeline, DST evidence fusion, ML classifiers, and agent orchestration.
| Feature | Tag | Tier | Scenarios | What it validates |
|---|---|---|---|---|
| Classification pipeline | @gpu | 0 | 28 | DST belief, Dempster combination, features, patterns (+ Luhn/IPv4/date/currency validation), name matching, pipeline E2E, Monte Carlo sampling |
| Bootstrap convergence | @bootstrap | 0 | 11 | LLM sweep, ML validation, targeted revisit, convergence criteria, ontology-aligned SVM |
| Agent convergence loop | @gpu | 0 | 6 | 6-tool agent loop, conflict reports, convergence, mock client |
| Agent smoke test | @agent | 0 | 6 | Agent metadata, tool definitions, state formatting |
| LLM backends | @backend | 0 | 8 | Backend factory, Anthropic/Bedrock/Cerebras/OpenAI clients |
| ML classifiers | @ml | 0 | 4 | CatBoost + SVM training, inference, virtual ensemble UQ |
| ML E2E | @ml-e2e | 0 | 2 | Full synth → train → classify → evaluate cycle |
| Belief path | @belief-path | 0 | 3 | Hierarchical navigation, cautious classification |
| SAGE importance | @sage | 0 | 1 | Permutation-based feature importance |
| SHAP explanations | @shap | 0 | 2 | TreeSHAP + PermutationSHAP attribution |
| Synth generation | @synth | 0 | 2 | Synthetic data + reference-label generation |
| Synth framework | @synth-framework | 0 | 2 | Generator registry, coverage reporting |
| Meta-tagging | @meta-tagging | 0 | 2 | META_TO_ICE mappings, coverage |
| Experimentation | @experimentation | 0 | 3 | Discount tuning, comparative evaluation |
| Real data | @real-data | 0 | 3 | Production annotation validation (requires build/data/) |
By Tier
| Tier | Requires | Scenarios | Pass locally |
|---|---|---|---|
| 0 | Python only | ~120 | Yes |
| 1 | devenv stack | ~15 | Yes (with devenv up) |
| cai | Live CAI session | ~5 | Skipped (documentation-only) |
Additional tags: @slow (~17 scenarios requiring extended runtime),
@gpu (GPU detection/acceleration scenarios — run on CPU too, just slower).
Why BDD for a Deployment Platform?
CAI deployment has four modalities — Project, Application, AMP, and Studio — each with different constraints on networking, filesystem layout, and process lifecycle. Traditional unit tests verify module behavior in isolation. BDD scenarios verify that the system hangs together across these modalities.
Consider the Application modality: when CDSW_APP_PORT is set, the startup script must bind to 127.0.0.1 because CAI’s reverse proxy handles external traffic. Bind to 0.0.0.0 instead and you bypass the proxy’s auth layer. This isn’t a bug in any single module — it’s a deployment contract that only a scenario can express clearly:
Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
  Given CDSW_APP_PORT is set to "8090"
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "127.0.0.1"
The scenario is the spec. A colleague reading this knows exactly what the constraint is, why it matters, and can verify it passes with just behave.
Test Infrastructure
Framework
Atelier uses behave for BDD and pytest for unit tests. The BDD scenarios live in features/ and are organized by domain.
Tier System
Scenarios are tagged by the infrastructure they require. The ATELIER_BDD_TIER environment variable controls which tiers run.
| Tier | Tag | Requires | Purpose |
|---|---|---|---|
| 0 | @tier-0 | Python only | Config, imports, classification pipeline, agent loop, ML classifiers |
| 1 | @tier-1 | devenv stack | PostgreSQL, Qdrant, gRPC, full gateway startup |
| cai | @tier-cai | CAI session | Live deployment validation — always skipped locally |
Additional tags:
- @slow: scenarios requiring extended runtime (pipeline E2E, ML training)
- @gpu: GPU acceleration scenarios (run on CPU too, just slower)
Tier 0 runs everywhere: laptops, CI, CAI sessions. No services, no network calls. This is where the runtime profile lives — the scenarios that catch deployment failures before you push.
Tier 1 requires devenv up to be running (PostgreSQL on :5533, Qdrant on :6334). These verify that services are healthy and that the application can actually connect to its data stores.
Tier CAI exists as executable documentation. The step definitions are stubs — they express what should happen in a live CAI session without automating it. When debugging a deployment failure, these scenarios are a checklist.
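The tier gating described above amounts to a small decision over a scenario's tags. The sketch below is hypothetical and is not the contents of features/environment.py: it assumes that ATELIER_BDD_TIER=0 enables tier-0 only, that 1 enables tier-0 and tier-1, that untiered scenarios always run, and that the variable defaults to 0. Behave exposes tags without the leading @.

```python
import os


def allowed_tiers(setting: str) -> set[str]:
    """Tier tags enabled by an ATELIER_BDD_TIER value (assumed mapping)."""
    return {"tier-0", "tier-1"} if setting == "1" else {"tier-0"}


def should_run(tags: set[str], setting: str) -> bool:
    """Decide whether a scenario with these tags runs under the given setting."""
    if "tier-cai" in tags:
        return False  # CAI scenarios are always skipped locally
    tier_tags = {t for t in tags if t.startswith("tier-")}
    # Run only when every tier the scenario needs is enabled.
    return tier_tags <= allowed_tiers(setting)


setting = os.environ.get("ATELIER_BDD_TIER", "0")  # default is an assumption
print(should_run({"tier-0", "bootstrap"}, setting))
```

In a real hook, a function like this would typically be consulted from `before_scenario`, which can call `scenario.skip()` for scenarios that should not run.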
Running Tests
# Full BDD suite including gateway checks (preferred)
just behave
# Tier-0 only (no services needed)
just bdd
# Tier-0 + tier-1 (requires devenv up)
just bdd-full
# Runtime profile specifically
just bdd-runtime
# Single domain
ATELIER_BDD_TIER=0 uv run behave features/agent/
# Single feature file
uv run behave features/agent/classification.feature
# By tag
ATELIER_BDD_TIER=0 uv run behave features/ -t @bootstrap
# Verbose (show all steps, not just failures)
just behave --no-capture
Feature Organization
features/
├── environment.py # Tier filtering, stack health, cleanup hooks
├── steps/__init__.py # Central re-exports (behave's discovery point)
├── infra/ # Domain: infrastructure & services
│ ├── step_defs/
│ │ ├── helpers.py
│ │ ├── config_steps.py
│ │ ├── health_steps.py
│ │ └── preflight_steps.py
│ ├── config_lifecycle.feature # 3 scenarios
│ ├── health_postgres.feature # 2 scenarios
│ ├── health_qdrant.feature # 1 scenario
│ ├── health_pglite.feature # 2 scenarios
│ └── preflight.feature # 3 scenarios
├── deployment/ # Domain: CAI deployment workflows
│ ├── step_defs/
│ │ ├── helpers.py
│ │ ├── runtime_steps.py
│ │ ├── amp_steps.py
│ │ └── naming_steps.py
│ ├── runtime_profile.feature # 6 scenarios
│ ├── amp_lifecycle.feature # 5 scenarios
│ ├── application.feature # 3 scenarios
│ ├── studio.feature # 2 scenarios
│ ├── embeddings.feature # 4 scenarios
│ └── naming_audit.feature # 2 scenarios
├── gateway/ # Domain: HTTP/gRPC gateway
│ ├── step_defs/
│ │ ├── status_steps.py
│ │ ├── http_steps.py
│ │ ├── endpoint_steps.py
│ │ ├── pipeline_steps.py
│ │ └── testclient_steps.py
│ ├── api_endpoints.feature # 8 scenarios
│ ├── api_testclient.feature # 7 scenarios
│ ├── status_endpoint.feature # 4 scenarios
│ ├── pipeline_integration.feature # 2 scenarios
│ └── spa_routes.feature # placeholder
└── agent/ # Domain: classification & agents
├── step_defs/
│ ├── agent_steps.py
│ ├── classification_steps.py
│ ├── bootstrap_steps.py
│ ├── backend_steps.py
│ ├── synth_steps.py
│ ├── ml_steps.py
│ ├── ml_e2e_steps.py
│ ├── sage_steps.py
│ ├── shap_steps.py
│ ├── real_data_steps.py
│ ├── belief_path_steps.py
│ ├── synth_framework_steps.py
│ ├── meta_tagging_steps.py
│ ├── experimentation_steps.py
│ ├── agent_loop_steps.py
│ └── monte_carlo_steps.py
├── classification.feature # 19 scenarios (DST, pipeline, MC sampling)
├── bootstrap.feature # 10 scenarios
├── agent_loop.feature # 6 scenarios
├── agent_smoke.feature # 6 scenarios
├── backend.feature # 8 scenarios
├── ml_classifiers.feature # 4 scenarios
├── ml_e2e.feature # 2 scenarios
├── synth.feature # 2 scenarios
├── synth_framework.feature # 2 scenarios
├── sage.feature # 1 scenario
├── shap.feature # 2 scenarios
├── belief_path.feature # 3 scenarios
├── meta_tagging.feature # 2 scenarios
├── experimentation.feature # 3 scenarios
└── real_data.feature # 3 scenarios
Step Discovery
Behave only discovers step definitions from features/steps/. Domain step definitions live in <domain>/step_defs/ directories and are re-exported through features/steps/__init__.py:
from features.infra.step_defs.config_steps import *
from features.infra.step_defs.health_steps import *
from features.infra.step_defs.preflight_steps import *
from features.deployment.step_defs.runtime_steps import *
from features.deployment.step_defs.amp_steps import *
from features.deployment.step_defs.naming_steps import *
from features.agent.step_defs.agent_steps import *
from features.agent.step_defs.classification_steps import *
from features.agent.step_defs.bootstrap_steps import *
from features.agent.step_defs.backend_steps import *
from features.agent.step_defs.synth_steps import *
from features.agent.step_defs.ml_steps import *
from features.agent.step_defs.ml_e2e_steps import *
from features.agent.step_defs.sage_steps import *
from features.agent.step_defs.shap_steps import *
from features.agent.step_defs.real_data_steps import *
from features.agent.step_defs.belief_path_steps import *
from features.agent.step_defs.synth_framework_steps import *
from features.agent.step_defs.meta_tagging_steps import *
from features.agent.step_defs.experimentation_steps import *
from features.gateway.step_defs.status_steps import *
from features.gateway.step_defs.http_steps import *
from features.gateway.step_defs.endpoint_steps import *
from features.gateway.step_defs.pipeline_steps import *
from features.agent.step_defs.agent_loop_steps import *
from features.agent.step_defs.monte_carlo_steps import *
from features.gateway.step_defs.testclient_steps import *
Two conventions protect against behave’s automatic discovery behavior:
- Use step_defs/, not steps/: Behave walks the feature tree and exec’s any .py file it finds in a directory named steps/. This bypasses Python’s import system, breaking relative imports and module context. Using step_defs/ avoids this entirely.
- Never name a features/ subdirectory after a stdlib module: When behave imports features.platform, Python also registers it as platform in sys.modules, shadowing the stdlib. This breaks anything that lazily imports platform (including pydantic). The infra/ domain was originally named platform/ until this caused a cascade of subtle failures.
Config-Driven BDD
Infrastructure steps load configuration from HOCON via atelier.config.load_config() rather than hardcoding values. This means BDD scenarios validate the same config path used in production:
from atelier.config import load_config
cfg = load_config()
_wait_for("PostgreSQL", lambda: _check_pg(cfg.db_url))
Stack Health Gate
Tier-1 scenarios share a one-time stack health check in environment.py. Before the first tier-1 scenario runs, the framework verifies PostgreSQL and Qdrant are reachable (with a 60-second retry window). If either service is down, all tier-1 scenarios fail fast with a clear message rather than producing confusing connection errors.
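The retry window can be sketched as a deadline-bounded polling loop. This is an illustration only: the real _wait_for helper is not shown in this document, so the signature and defaults below are assumptions.

```python
import time


def wait_for(name, check, timeout=60.0, interval=1.0):
    """Poll check() until it returns truthy or the timeout elapses."""
    deadline = time.monotonic() + timeout
    last_error = None
    while True:
        try:
            if check():
                return
        except Exception as exc:  # connection errors are expected while a service boots
            last_error = exc
        if time.monotonic() >= deadline:
            raise RuntimeError(
                f"{name} not reachable after {timeout:.0f}s: {last_error}"
            )
        time.sleep(interval)
```

Raising one named error per service is what turns a wall of connection tracebacks into a single readable failure.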
Cleanup
after_scenario in environment.py removes temporary files registered via context._temp_files. This handles config materialization artifacts and other test-created files.
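A minimal sketch of such a hook, assuming context._temp_files is a plain list of paths. The body is illustrative, not Atelier's actual environment.py.

```python
import os


def after_scenario(context, scenario):
    """Behave hook: delete files a scenario registered on context._temp_files."""
    for path in getattr(context, "_temp_files", []):
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # the scenario may already have removed the file
    context._temp_files = []
```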
Unit Tests
Alongside BDD, tests/ contains pytest unit tests for isolated module behavior:
just test # Run all pytest tests
uv run pytest tests/ -x # Stop on first failure
BDD and pytest serve complementary roles: pytest validates that individual functions behave correctly; BDD validates that the system’s deployment contracts hold.
Deployment Modalities
Cloudera AI offers four ways to run code. Each has different constraints on networking, filesystem layout, process lifecycle, and dependency management. Atelier’s BDD scenarios encode these constraints as executable specifications.
Project
Every CAI deployment starts as a Project — a Git-backed workspace cloned into /home/cdsw. The Project modality is implicit: it provides the filesystem layout, environment variables, and Python runtime that all other modalities build on.
No dedicated feature file. Project constraints are tested indirectly through every other deployment scenario.
AMP (Applied Machine Learning Prototype)
An AMP is a one-click provisioning workflow defined in .project-metadata.yaml. It runs a sequence of tasks — typically create_job to install dependencies, then start_application to launch the service.
Why BDD captures this well: AMP metadata is YAML that CAI interprets at deploy time. A malformed task definition doesn’t fail until someone clicks “Deploy” in the CAI UI. Our tier-0 scenarios catch structural problems immediately.
What the scenarios validate
AMP metadata structure (amp_lifecycle.feature):
Scenario: AMP metadata file is valid
Given the file ".project-metadata.yaml" exists
When I parse the AMP metadata
Then it has a "name" field
And it has a "runtimes" section
And it has a "tasks" section
Task ordering pattern — CAI requires create_job before run_job for the same entity label. Getting this wrong means the install job never runs:
Scenario: AMP tasks follow create_job/run_job pattern
Given the AMP metadata is loaded
Then a "create_job" task with entity_label "install_deps" exists
And a "run_job" task with entity_label "install_deps" exists
And a "start_application" task exists
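The ordering rule itself is simple to express once the metadata is parsed. The sketch below assumes the tasks section has already been loaded into a list of dicts with type and entity_label keys; the function name is hypothetical.

```python
def check_job_order(tasks):
    """Return entity labels whose run_job appears without an earlier create_job."""
    created, problems = set(), []
    for task in tasks:
        label = task.get("entity_label")
        if task.get("type") == "create_job":
            created.add(label)
        elif task.get("type") == "run_job" and label not in created:
            problems.append(label)
    return problems
```

An empty result means every run_job has a matching create_job before it in the task list.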
Install script validity — scripts/install_deps.py runs in a bare Python environment without uv or devenv. A syntax error here means the entire deployment fails:
Scenario: Install script is valid Python
When I compile "scripts/install_deps.py" with py_compile
Then no SyntaxError is raised
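The step behind this scenario can be approximated with the standard library's py_compile module. A minimal sketch follows; the wrapper name is hypothetical.

```python
import py_compile


def compiles_cleanly(path):
    """Return (True, None) if path byte-compiles, else (False, error message)."""
    try:
        # doraise=True turns a syntax error into PyCompileError instead of
        # printing it to stderr.
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError as exc:
        return False, exc.msg
    return True, None
```

This only proves the file parses; it does not import or run it, so missing dependencies are still caught by the import-chain scenarios rather than here.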
Tier-CAI scenarios document what a successful AMP deploy looks like. These are skipped locally but serve as a regression checklist when debugging deployment failures:
@tier-cai
Scenario: AMP install job completes successfully
Given I am in a CAI project session
When I run the install dependencies job
Then the job exits with code 0
And "atelier" is importable in system Python
And "node --version" succeeds
And the directory "ui/dist" exists
Application
An Application is a long-running web service. CAI assigns a port via CDSW_APP_PORT and routes subdomain traffic through a reverse proxy that handles authentication.
The key constraint: When CDSW_APP_PORT is set, the service must bind to 127.0.0.1, not 0.0.0.0. The reverse proxy connects over localhost; binding to all interfaces bypasses CAI’s auth layer.
For local development (no CDSW_APP_PORT), binding to 0.0.0.0 is correct — it lets you access the service from a browser.
Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
Given CDSW_APP_PORT is set to "8090"
When I parse bin/start-app.sh for the HOST variable
Then HOST is "127.0.0.1"
Scenario: start-app.sh binds to 0.0.0.0 for local dev
Given CDSW_APP_PORT is not set
When I parse bin/start-app.sh for the HOST variable
Then HOST is "0.0.0.0"
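The rule these two scenarios encode fits in a few lines. bin/start-app.sh is a shell script, so the Python below is only an illustration of the decision, with a hypothetical function name.

```python
def resolve_host(env):
    """Pick the bind address from an environment mapping."""
    # Behind CAI's reverse proxy: loopback only, so all traffic passes the auth layer.
    if env.get("CDSW_APP_PORT"):
        return "127.0.0.1"
    # Local development: listen on all interfaces.
    return "0.0.0.0"
```

Passing the environment as a mapping (for example os.environ) keeps the rule testable without mutating the process environment.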
The tier-1 scenario verifies the full stack actually starts and serves traffic:
@tier-1
Scenario: Full application stack starts locally
When I run bin/start-app.sh in the background
Then the HTTP gateway responds on port 8090 within 30 seconds
And the gRPC server responds on port 50051
Studio (future)
A Studio is a pre-built Docker image where IS_COMPOSABLE=true. Instead of being the root application, Atelier runs as an embedded service within a larger container.
The key constraint: When IS_COMPOSABLE is set, the install script must use /home/cdsw/atelier as the root directory (the project subdirectory) instead of /home/cdsw (the container root). Getting this wrong means dependencies install into the wrong location and imports fail at startup.
Scenario: install_deps.py handles IS_COMPOSABLE root path
When I set IS_COMPOSABLE to "true"
And I parse scripts/install_deps.py for root_dir
Then root_dir is "/home/cdsw/atelier"
Scenario: install_deps.py uses default root without IS_COMPOSABLE
When IS_COMPOSABLE is not set
And I parse scripts/install_deps.py for root_dir
Then root_dir is "/home/cdsw"
Studio support is currently speculative — these scenarios document the expected behavior so the contract is established before implementation begins.
Runtime Profile
The CAI Runtime Profile is a set of tier-0 scenarios that validate deployment readiness without requiring a live CAI session. Run it before every push to catch the class of errors that only manifest when CAI tries to start the application.
just bdd-runtime
Why This Exists
CAI deployment failures are expensive to debug. The install job runs in a container with a 30-minute timeout. If it fails, the only feedback is a log dump. If it succeeds but the application crashes at startup, the only feedback is an “Application failed to start” banner with a link to logs that may or may not contain the root cause.
The runtime profile catches failures that would otherwise require a deploy-debug-redeploy cycle:
| Check | Failure mode it prevents |
|---|---|
| Core package importable | Missing __init__.py, circular imports, broken package structure |
| Entry points importable | New dependency not declared in pyproject.toml |
| Proto stubs importable | Forgot to run just proto after editing .proto |
| Scripts exist and are executable | Missing chmod +x, file not committed |
| HOCON config resolves | Undefined substitution variable, syntax error in .conf |
| Migrations parseable | Malformed -- migrate:up block, missing SQL terminator |
The Scenarios
Import chain validation
The most common CAI deployment failure is an import error. A module works in devenv because all dev dependencies are installed, but fails in CAI because the install script only installs production dependencies.
Scenario: Core package is importable
When I import "atelier"
Then no ImportError is raised
And atelier.__version__ is defined
Scenario: All entry points are importable
When I import "atelier.server"
And I import "atelier.gateway"
And I import "atelier.config"
And I import "atelier.db.bootstrap"
Then no ImportError is raised
Scenario: Proto stubs are generated and importable
When I import "atelier.proto.atelier_pb2"
And I import "atelier.proto.atelier_pb2_grpc"
Then no ImportError is raised
These scenarios exercise the full import graph. If atelier.gateway imports fastapi which imports pydantic which imports annotated_types, and annotated_types isn’t in the dependency chain — this catches it.
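A step implementation for these scenarios could look roughly like the sketch below, which uses importlib and reports every failure rather than stopping at the first. The function name is hypothetical.

```python
import importlib


def failed_imports(module_names):
    """Import each module and return {name: error message} for the ones that fail."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # includes ModuleNotFoundError
            failures[name] = str(exc)
    return failures
```

Because importing a module executes its top-level imports, a missing transitive dependency surfaces as a failure on the entry point that pulls it in.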
Script executability
CAI runs scripts via #!/usr/bin/env python3 or #!/usr/bin/env bash. If the shebang is wrong or the execute bit isn’t set, the deploy fails with a cryptic “Permission denied” error.
Scenario: Required scripts exist and are executable
Then the file "scripts/install_deps.py" exists
And the file "scripts/startup_app.py" exists
And the file "scripts/install_node.sh" is executable
And the file "scripts/install_qdrant.sh" is executable
And the file "bin/start-app.sh" is executable
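The executable check maps directly onto os.access. A minimal sketch with a hypothetical helper name:

```python
import os


def non_executable(paths):
    """Return the paths that are missing or lack the execute bit for the current user."""
    return [p for p in paths if not (os.path.isfile(p) and os.access(p, os.X_OK))]
```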
Configuration resolution
HOCON configs use ${?VAR} substitution for environment variables. A typo in a variable name or an unresolvable reference won’t fail until load_config() is called at startup. The runtime profile forces resolution at test time:
Scenario: HOCON config resolves without errors
When I load the config with no overrides
Then no exception is raised
And the config has grpc_port > 0
And the config has gateway_port > 0
Migration parsing
Atelier uses a dbmate-compatible migration runner (atelier.db.bootstrap) that parses -- migrate:up / -- migrate:down blocks from SQL files. If a migration is missing its UP block, the bootstrap silently skips it — which means the schema diverges from what the code expects.
Scenario: Database migrations are parseable
Given migration files exist in "db/migrations/"
When I parse each migration for UP/DOWN blocks
Then every migration has a valid UP block
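The parsing step can be illustrated with a small splitter keyed on the two marker comments. This is a sketch of the idea, not the atelier.db.bootstrap implementation; both function names are hypothetical.

```python
def split_migration(sql_text):
    """Split dbmate-style SQL into {'up': ..., 'down': ...} by marker comments."""
    blocks = {"up": [], "down": []}
    current = None
    for line in sql_text.splitlines():
        marker = line.strip().lower()
        if marker.startswith("-- migrate:up"):
            current = "up"
        elif marker.startswith("-- migrate:down"):
            current = "down"
        elif current is not None:
            blocks[current].append(line)
    return {k: "\n".join(v).strip() for k, v in blocks.items()}


def has_valid_up(sql_text):
    """A migration is usable only if its UP block contains some SQL."""
    return bool(split_migration(sql_text)["up"])
```

Failing the scenario when has_valid_up is false turns the silent skip described above into a visible test failure.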
When to Extend the Profile
Add a new runtime profile scenario whenever you:
- Add a new Python entry point or importable module
- Add a new script that CAI executes directly
- Add a new HOCON config key that downstream code depends on
- Add a new migration file
The rule of thumb: if it can break a CAI deploy and you can verify it without services running, it belongs in the runtime profile.
Sprint Summary: 2026-05-06 to 2026-05-20
This appendix records the engineering work completed during the
two-week sprint ending 2026-05-20. The sprint covered 27 commits
on feat/dst-late-interaction-cosine across three major work
streams: (1) training-time Normalized Hierarchical SVM with the
Structured Shared Frobenius Norm, (2) ColBERT late-interaction cosine
integration with Qdrant, and (3) CatBoost/SVM calibration under the
Dempster-Shafer evidence-independence framework. A DST numeric
sensitivity study and BDD scenario expansion provided the empirical
grounding.
1. Training-Time NHSVM (Choi et al. 2015)
Motivation
The prior NHSVM implementation was a post-hoc approximation: a flat
SVM trained with standard Frobenius norm regularization (no hierarchy
awareness), then nhsvm_reweight() nudged the probability
distribution at inference time using tree-distance penalties. This
cannot recover what was never learned. The SVM’s decision boundaries
are flat; the reweighting is a band-aid. On an asymmetric taxonomy
(deep sensitive subtree vs. shallow operational subtrees), the flat
SVM systematically under-penalizes cross-subtree probability flow,
allowing shallow catch-all nodes to absorb classifications that
belong in the deep subtree.
The Structured Shared Frobenius Norm
Choi et al. (2015, arXiv:1508.02479) show that for single-label hierarchical classification, proper NHSVM reduces to a standard multi-class SVM with a modified feature map. The key insight is the Structured Shared Frobenius Norm:
||W||^2_G = sum_n ||u_n||^2 / alpha_n
where u_n is the per-node weight component and alpha_n is the path-normalized budget for node n. This regularizer explicitly incorporates the label structure G: it promotes models to utilize shared information along tree paths, penalizing complexity proportionally to each node’s position in the hierarchy.
The Kronecker product feature expansion (Eq. 5) implements this norm without a custom solver. For sample x with label y, the expanded feature map is:
phi(x, y) = Lambda(y) tensor-product x
where Lambda(y)_n = sqrt(alpha_n) for nodes n on the root-to-y path, and zero elsewhere. Standard L2 regularization on the expanded space equals the Structured Shared Frobenius Norm on the original space. The geometry is exact, not approximate.
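The equivalence can be checked numerically on a toy tree. The sketch below uses pure Python; the four-node hierarchy, the alpha values, and the weights are illustrative assumptions, not project data. It relies on the substitution u_n = sqrt(alpha_n) * v_n, where v_n is the weight block learned in the expanded space.

```python
import math


def expand(x, path_nodes, alpha, nodes):
    """phi(x, y): one block per node; sqrt(alpha_n) * x on the path, zeros elsewhere."""
    phi = []
    for n in nodes:
        scale = math.sqrt(alpha[n]) if n in path_nodes else 0.0
        phi.extend(scale * xi for xi in x)
    return phi


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


# Toy hierarchy (assumed): root -> a -> leaf, and root -> b.
nodes = ["root", "a", "leaf", "b"]
alpha = {"root": 0.2, "a": 0.3, "leaf": 0.5, "b": 0.8}
path = {"root", "a", "leaf"}          # root-to-label path for y = leaf
x = [1.0, -2.0]

# v: weights in the expanded space, one block per node.
v = {"root": [0.5, 0.1], "a": [-0.3, 0.4], "leaf": [0.2, 0.2], "b": [0.7, -0.1]}
v_flat = [w for n in nodes for w in v[n]]

# Equivalent per-node weights in the original space.
u = {n: [math.sqrt(alpha[n]) * w for w in v[n]] for n in nodes}

score_expanded = dot(v_flat, expand(x, path, alpha, nodes))
score_original = sum(dot(u[n], x) for n in path)

l2_expanded = dot(v_flat, v_flat)
structured_norm = sum(dot(u[n], u[n]) / alpha[n] for n in nodes)
```

Both pairs agree to floating-point precision: the score for the label is the sum of u_n · x along its path, and the plain L2 norm of the expanded weights equals sum_n ||u_n||^2 / alpha_n.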
Directional Constraint (Eq. 7)
The alpha budget is computed via a linear program with the directional constraint: alpha_child >= alpha_parent for every parent-child pair. This forces more of the information budget toward leaves, preventing degenerate solutions on unbalanced trees where shallow internal nodes would otherwise absorb the entire alpha budget.
The LP formulation:
maximize min_n alpha_n
subject to sum(alpha_n for n in path(root, l)) = 1 for every leaf l
alpha_child >= alpha_parent for every parent-child
alpha_n >= 0
Solver: scipy.optimize.linprog(method='highs'). On the project
taxonomy: 296 variables, 220 equalities, 582 directional
inequalities. Solves in under one second with zero violations and
path sums exact to machine precision (deviation < 1e-15). The
unconstrained closed-form (Lemma 2: alpha_n = 1/D_n - 1/D_parent)
is preserved as a private fallback.
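The "zero violations" and path-sum claims correspond to two checks on any candidate alpha. The following sketch verifies a solution rather than solving the LP; the function name and the parent-map representation are assumptions.

```python
def check_alpha(parent, alpha):
    """Return (max path-sum deviation over leaves, list of directional violations).

    parent maps each node to its parent; the root maps to None.
    """
    children = {n: [] for n in parent}
    for n, p in parent.items():
        if p is not None:
            children[p].append(n)
    leaves = [n for n, kids in children.items() if not kids]

    def path_sum(node):
        total = 0.0
        while node is not None:
            total += alpha[node]
            node = parent[node]
        return total

    max_dev = max(abs(path_sum(leaf) - 1.0) for leaf in leaves)
    # Directional constraint: alpha_child >= alpha_parent.
    violations = [(p, n) for n, p in parent.items()
                  if p is not None and alpha[n] < alpha[p]]
    return max_dev, violations
```

Running this after the solver gives a cheap regression guard that is independent of how alpha was produced, whether by the LP or the closed-form fallback.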
Implementation
The training pipeline proceeds in four steps:
- TF-IDF (char 3-6 + word 1-2 n-grams, 50K max features).
- TruncatedSVD to 200 components (configurable via classify.svm.nhsvm_svd_components). This is necessary because full TF-IDF times Kronecker expansion would produce 14.75M features and a 34.8 GB coefficient matrix. At 200 dimensions the expanded space is 59K features and the model fits in approximately 250 MB.
- Kronecker expansion via HierarchicalFeatureExpander: training-time expansion populates only the label’s path blocks (sparse, ~path_len x d non-zeros per row); inference-time expansion populates all blocks (dense across nodes, sparse across features).
- LinearSVC with CalibratedClassifierCV(ensemble=False) for Platt-scaled probabilities.
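The sizing figures can be reproduced with back-of-envelope arithmetic. The node count of 295 is an assumption inferred from the quoted numbers (it is not stated in the text), as are float64 storage and one coefficient row per node.

```python
TFIDF_FEATURES = 50_000   # from the text: 50K max features
SVD_COMPONENTS = 200      # from the text
N_NODES = 295             # assumption: inferred from the quoted figures
BYTES_PER_FLOAT64 = 8     # assumption: dense float64 coefficients

# Without SVD: every node gets its own copy of the TF-IDF feature block.
full_expanded = TFIDF_FEATURES * N_NODES
full_coef_gb = full_expanded * N_NODES * BYTES_PER_FLOAT64 / 1e9

# With SVD: the per-node block shrinks from 50K to 200 dimensions.
svd_expanded = SVD_COMPONENTS * N_NODES
svd_coef_mb = svd_expanded * N_NODES * BYTES_PER_FLOAT64 / 1e6
```

Under these assumptions full_expanded is 14.75M and full_coef_gb is about 34.8, matching the text. The reduced coefficient matrix alone comes to roughly 139 MB; the remainder of the quoted 250 MB would then be the other serialized components, though that breakdown is not given in the source.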
The model serializes as a dict bundle ({feature_union, svd, expander, classifier, classes}) with automatic detection on load.
Legacy flat .pkl files load unchanged, preserving backward
compatibility. A _nhsvm suffix on the per-vocabulary cache key
prevents serving a flat model as hierarchical or vice versa.
When the pipeline detects a training-time NHSVM model (via the
_hierarchical attribute), it skips all post-hoc reweighting
infrastructure (distance matrix precomputation, nhsvm_to_mass
routing) and sends SVM probabilities directly through svm_to_mass.
The hierarchy is already in the probabilities.
SVM Training Consolidation
In the same sprint, SVM training was consolidated from two paths
(Path A: ICE alignment-based, Path B: enrichment-based) into a
single enrichment-required path. Qdrant is the source of truth for
enrichment payloads; a JSON export under build/ serves as the
offline/CI fallback.
The synthetic corpus generator (synth_registry.py) now covers all
taxonomy nodes (leaves and internal) via a three-layer generator
architecture:
- ICE-matched hand-coded (highest priority): enrichment metadata is matched against 31 inference patterns to select the best ICE generator. A mnemonic-to-dot-code bridge maps category abbreviations (e.g., EMAIL) to enrichment payload keys (e.g., 1.1.1.9.3.1).
- Template generators (medium): prototype values from enrichment payloads with mild perturbation (numeric jitter, character substitution).
- Inferred generators (lowest): fallback via pattern matching on category description and common names.
Coverage: 100% of all taxonomy nodes receive a generator. The leaf-only assumption was corrected at six sites across three files; every node is a first-class tagging target.
2. ColBERT Late-Interaction Cosine via Qdrant
Architecture
The sprint delivered the full P1-P3 stack for multi-vector cosine evidence:
P1 (storage foundation): Qdrant collection schema with named multi-vector fields. Each annotation point stores ColBERT token-level embeddings (128-d after the linear projection) alongside the structured enrichment payload (prototype values, value patterns, name hints, anti-examples, parent path).
P2 (enrichment pipeline): LLM-mediated annotation enrichment generates a six-field structured payload per taxonomy node. Each payload is verified by a deterministic suite of six checks before being written to Qdrant:
- patterns_compile – every regex pattern must be valid Python re syntax.
- prototype_values_match_patterns – at least 50% of prototypes must match a declared regex (relaxed from 100% this sprint to handle diverse free-text categories like marketplace names).
- anti_example_targets_exist – every value in the confusable_tag field (the anti-example pointer) must exist in the taxonomy.
- parent_path_consistent – the generated parent path must match the taxonomy hierarchy exactly.
- name_hints_non_empty – at least one usable name hint.
- no_contradiction_with_anti_examples – no prototype value may appear in anti-examples (self-contradiction rejection).
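The first two checks are easy to sketch with the standard re module. The function names mirror the check names but the bodies are illustrative; in particular, using fullmatch rather than search is an assumption about what "match" means in the verifier.

```python
import re


def patterns_compile(patterns):
    """True if every pattern is valid Python `re` syntax."""
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


def prototype_match_ratio(prototypes, patterns):
    """Fraction of prototype values fully matched by at least one pattern."""
    if not prototypes:
        return 0.0
    compiled = [re.compile(p) for p in patterns]
    hits = sum(any(c.fullmatch(v) for c in compiled) for v in prototypes)
    return hits / len(prototypes)


def prototypes_ok(prototypes, patterns, threshold=0.5):
    """Pass when at least `threshold` of the prototypes match a declared pattern."""
    return prototype_match_ratio(prototypes, patterns) >= threshold
```

With the 50% threshold, a payload where two of three prototypes match still passes, which is the behavior the relaxation was meant to allow for free-text categories.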
Prompts come in two variants (leaf and parent framing) because the principle that drives the architecture – every node is a first-class tagging target – means parent and leaf nodes describe different kinds of column. A leaf prompt asks for maximum-specificity signals; a parent prompt asks for family-level signals with the children listed so the model knows what specializations would not route to the parent.
P3 (late-interaction integration): The bridge
(maxsim_bridge.py) encodes entity text through the same
ColBERT encoder, queries Qdrant with native MaxSim, normalizes
scores by query token count to recover mean per-token similarity,
and converts scores to DST mass functions via
maxsim_to_mass.
Token-Level Discrimination
The motivating failure modes of single-vector cosine resolve through token-level alignment:
- Anonymized columns (comm_val, period_val, addr_ref) – column-name tokens contribute little MaxSim, but sample-value tokens still align to annotation prototype-value tokens. Weak tokens contribute near-zero MaxSim without polluting strong matches.
- Long-tail distinguishing values – a single distinctive sample value’s tokens claim their own MaxSim against annotation prototypes, no longer averaged out by a single dense vector.
- Sibling discrimination – token-level alignment discriminates between semantically adjacent annotations (e.g., “credit card number” vs. “bank account number”) through fine-grained matching that single-vector cosine collapses.
- Parent-pull – parent-path tokens in the annotation text provide hierarchical context for the mass aggregation layer.
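The scoring rule behind these behaviors is the standard late-interaction MaxSim. Below is a pure-Python sketch over toy token vectors; in the pipeline the computation is done natively by Qdrant, so this is only an illustration, and the function names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best cosine match among document tokens."""
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


def mean_maxsim(query_tokens, doc_tokens):
    """MaxSim divided by query token count: mean per-token similarity."""
    return maxsim(query_tokens, doc_tokens) / len(query_tokens)
```

Each query token is scored against its own best match, so a token with no counterpart adds roughly zero while the strong matches keep their full score.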
Channel-Decomposed Dempster Combination (P3.6)
The mass function produced by late-interaction cosine separates into two channels:
- Positive channel: MaxSim scores on annotation points allocate mass to focal elements (leaf singletons and internal-node descendant sets). Haenni-Hartmann reliability shaping (alpha-bounded allocation) ensures the source never over-commits. Margin-aware allocation places top-1 mass proportional to the gap between first and second candidates; residual mass splits softmax across remaining candidates.
- Negative channel: Anti-example evidence on a code c allocates mass to Theta \ D(c), where D(c) is the descendant leaf set. This is structurally correct for hierarchical exclusion: negating an internal node removes its entire subtree, not just the node itself.
The two channels combine via channel-decomposed Dempster’s rule. When channels conflict on the same node (high positive and high negative simultaneously), conflict K materializes as a diagnostic signal rather than being silently cancelled. The hierarchical aggregation layer walks from the top-1 leaf up to the most-specific ancestor with at least 50% descendant-mass concentration, promoting mass to internal-node focal elements when subtree-level signal is what the evidence supports.
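How conflict K arises can be seen by applying plain Dempster's rule to one positive and one negative mass function. This sketch assumes a three-leaf toy frame and illustrative masses; it does not reproduce the production channel weighting or the hierarchical aggregation step.

```python
def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Returns (combined mass function, conflict K).
    """
    combined, conflict = {}, 0.0
    for a, wa in m1.items():
        for b, wb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + wa * wb
            else:
                conflict += wa * wb  # mass that would land on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are fully contradictory")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}, conflict


# Toy frame (assumed): internal node "contact" has leaves email and phone.
THETA = frozenset({"email", "phone", "ssn"})
D_CONTACT = frozenset({"email", "phone"})

# Positive channel: 0.6 on the leaf, the rest left on Theta (ignorance).
positive = {frozenset({"email"}): 0.6, THETA: 0.4}
# Negative channel: an anti-example on "contact" puts mass on Theta \ D(contact).
negative = {THETA - D_CONTACT: 0.3, THETA: 0.7}

fused, K = dempster_combine(positive, negative)
```

Here the positive mass on email and the negative mass on the complement of its subtree have an empty intersection, so K = 0.6 x 0.3 = 0.18 and the remaining mass is renormalized by 1 - K.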
3. CatBoost, SVM, and Dempster-Shafer Calibration
The Non-Distinctness Problem
Dempster’s rule assumes the evidence sources being combined are distinct and conditionally independent (Shafer 1976, Ch. 3-4; Smets & Kennes 1994). When sources share provenance – one source’s labels are deterministically derived from another’s – Dempster’s rule double-counts their agreement, inflating confidence beyond what the evidence warrants.
Atelier’s pipeline has six evidence sources with varying degrees of independence:
| Source | Discount | Independence status |
|---|---|---|
| MaxSim (ColBERT late-interaction) | 0.20 | Weakly non-distinct (ColBERT encoder is deterministic; per-user-code reference vectors share enrichment-LLM upstream) |
| NHSVM | 0.22 | Weakly non-distinct (sentence-transformer subsumption alignment shares enrichment-LLM upstream — same provenance as ColBERT plus an additional alignment step, hence the slightly higher discount) |
| Pattern | 0.25 | Independent (deterministic regex matching) |
| Name match | 0.30–0.70 | Independent (deterministic string matching) |
| CatBoost | 0.55 | Strongly non-distinct (fit-to-LLM: per-column shared label provenance with LLM) |
| LLM | 0.15 | Primary source |
The discount schedule follows Shafer’s reliability discount (alpha = 1 - discount applied to source mass) with adjustments per Denoeux (2008): when a source rides on labels deterministically derived from another source, an undiscounted derivative source mathematically swallows the only genuinely independent signal.
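Shafer's discounting operation scales every focal element by alpha = 1 - discount and moves the removed mass onto Theta. A minimal sketch, with a hypothetical function name and mass functions represented as dicts keyed by frozenset:

```python
def discount_mass(mass, theta, discount):
    """Scale each focal element by (1 - discount); move the removed mass to theta."""
    alpha = 1.0 - discount
    out = {focal: alpha * w for focal, w in mass.items() if focal != theta}
    out[theta] = discount + alpha * mass.get(theta, 0.0)
    return out
```

For example, a source that puts 0.9 on a single code keeps only 0.405 of it at CatBoost's 0.55 discount, with 0.595 moved to ignorance.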
Pending work — manually curated annotation specifications in Ægir are the path to fully eliminating the shared LLM-upstream provenance on ColBERT and NHSVM. When per-user-code annotation payloads are author-curated rather than LLM-generated during enrichment, ColBERT’s reference vectors and the subsumption alignment both become structurally independent of any runtime LLM, and their discounts can drop toward the calibrations a truly distinct source carries. Until that curation is in place, the 0.20 / 0.22 calibration above is the right under-confidence price to pay.
CatBoost: Fit-to-LLM and Adaptive Discount
CatBoost operates in fit_to_llm mode (default): it trains on
(embedding_text, llm_code) pairs from the current run’s LLM sweep.
The model is the explainability surface over LLM labels – SHAP and
SAGE attribute to a model that actually agrees with the LLM, which is
the transparent “why this code” story presented to the operator. But
this makes CatBoost strongly non-distinct with the LLM under
Denoeux’s framework: per-column shared label provenance.
The adaptive discount addresses this through virtual ensemble variance. CatBoost’s virtual ensemble provides uncertainty quantification per code; the discount formula is:
discount = min(max_discount, base_discount + avg_var x variance_scale)
Defaults: base 0.55, variance_scale 1.6, max 0.75, fallback 0.55. High variance (uncertain predictions) produces a larger discount, routing more mass to Theta (ignorance) rather than inflating a weakly-supported prediction. This is the step-size control in the iterative-methods framing: the derivative source’s contribution is damped proportionally to its own uncertainty.
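The formula and defaults translate directly into a small function. Treating the fallback as the value used when no variance estimate is available is an assumption; the text lists the fallback value but not when it applies.

```python
def adaptive_discount(avg_var, base_discount=0.55, variance_scale=1.6,
                      max_discount=0.75, fallback=0.55):
    """discount = min(max_discount, base_discount + avg_var * variance_scale)."""
    if avg_var is None:  # assumption: fallback applies when variance is unavailable
        return fallback
    return min(max_discount, base_discount + avg_var * variance_scale)
```

With these defaults the discount rises linearly from 0.55 and reaches the 0.75 cap at an average variance of 0.125.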
SVM: Subsumption Alignment and Weak Non-Distinctness
The SVM’s discount (0.22) reflects a qualitatively different non-distinctness regime. The SVM trains on a synthetic corpus generated from the bundled ICE ontology, then translates predictions into the user taxonomy via sentence-transformer cosine subsumption alignment. The alignment is a per-vocabulary mapping table computed via BERT cosine similarity between ICE concept signatures and enriched annotation payloads – structurally independent of the runtime LLM.
The weak non-distinctness comes from the shared enrichment-LLM upstream: the enrichment payloads that anchor the subsumption alignment were themselves generated by an LLM (though a different call, different prompt, different temperature than the runtime classification LLM). This is the same structural dependency shared by the late-interaction cosine source, which justifies SVM’s discount (0.22) sitting near cosine’s (0.20) rather than near CatBoost’s (0.55).
With training-time NHSVM, the SVM’s probabilities already incorporate
hierarchy, so the pipeline routes them through svm_to_mass directly.
The post-hoc nhsvm_to_mass reweighting is preserved as a legacy
fallback for flat-trained SVMs loaded in hierarchical mode.
The Pipeline as Iterative Refinement
The DST evidence-independence architecture frames the bootstrap loop as fixed-point iteration on a belief-assignment vector B over columns: B_{n+1} = T(B_n). Each component maps onto a numerical-method primitive (Banach 1922; Saad 2003):
| Component | Primitive |
|---|---|
| Bootstrap loop | Fixed-point iteration on B |
| LLM sweep | Stochastic operator (Robbins-Monro framing) |
| ML validation (CatBoost + SVM) | Deterministic linearization |
| DST fusion | Combiner producing fused state |
| Targeted revisit | Local smoothing (multigrid) |
| Pl - Bel gap | A posteriori error estimate per column |
| Conflict K | Nonlinear residual diagnostic |
| Reliability discount | Damping / step-size control |
| Hierarchical cosine mass | Coarse-grid correction (multigrid) |
The unified residual norm combines four components: mean(gap) / gap_threshold, frac_unclear / clarity_target, mean(K) / k_threshold, and independent-tier disagreement fraction. Residual below 1.0 means converged. The contraction factor rho = ||r_{n+1}|| / ||r_n|| is the headline diagnostic: rho < 1 is contractive, rho -> 1 is stalled, rho > 1 is diverging.
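The diagnostics can be sketched as follows. The text does not say how the four components are combined, so taking their maximum is an assumption, as are the function names and the stall tolerance.

```python
def residual_norm(mean_gap, frac_unclear, mean_k, disagreement,
                  gap_threshold, clarity_target, k_threshold):
    """Assumed combiner: the largest of the four normalized components."""
    return max(mean_gap / gap_threshold,
               frac_unclear / clarity_target,
               mean_k / k_threshold,
               disagreement)


def contraction_factors(residuals):
    """rho_n = r_{n+1} / r_n for consecutive residual norms."""
    return [b / a for a, b in zip(residuals, residuals[1:]) if a > 0]


def classify_rho(rho, stall_band=0.05):
    """Label one iteration; stall_band is an illustrative tolerance below 1."""
    if rho > 1.0:
        return "diverging"
    if rho > 1.0 - stall_band:
        return "stalled"
    return "contractive"
```

Tracking rho per iteration gives the loop a stopping rule beyond the residual threshold: several consecutive "stalled" labels suggest further sweeps are unlikely to help.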
4. DST Sensitivity Study
A numeric sensitivity study (P3.12-P3.13) swept 2,549 synthetic cells across 10 invariants on an 11-node taxonomy (7 leaves, 4 internal). Zero mathematical violations were found. Key findings:
Channel conflict K is bounded. At the production negative-channel weight beta = 0.30, conflict K caps at approximately 0.24. The K threshold logs never fire under normal operating conditions; the Yager fallback path is effectively dead code under Dempster fusion.
The _significant_subtree concentration threshold is a structural
cliff. The hard 0.50 threshold for promoting mass to an
internal-node focal element produces a discontinuity of Delta = 0.203
in parent mass as sibling probability moves from 0.65 to 0.70.
This is a plausible driver of the parent-instead-of-leaf error
cluster (22-25% of error budget in evaluation).
Internal-node top-1 switch is the largest discontinuity. The transition from leaf-dominant to internal-node-dominant top-1 prediction produces Delta_mass = 0.57 at the crossover point. This is a high-volatility regime where late-interaction positive-weight calibration is critical.
Anti-example negative channel is a tie-breaker, not a primary driver. At beta = 0.30, the negative channel’s effect on parent mass is approximately Delta = 0.0015 under full negative evidence. The channel requires positive-channel support to produce meaningful rank changes.
Leaf dominance is preserved. Across all swept parameter ranges, the top-1 leaf’s mass never falls below the parent’s mass in realistic operating regimes. Parent focal-element mass is a disjunctive signal (contributing to plausibility, not belief) rather than a competing prediction.
5. BDD Scenario Expansion
The sprint added hierarchical anti-subtree BDD scenarios (P3.9-P3.11) testing the channel-decomposed Dempster combination on an abstract taxonomy fixture:
- Anti-example on internal node allocates to descendant complement Theta \ D(n), correctly removing the entire subtree rather than just the node.
- Anti-example on leaf preserves singleton complement semantics (regression guard).
- Channel conflict K surfaces contradiction when both channels fire strongly on the same node, materializing K as a diagnostic rather than silently cancelling.
- Internal-node tag is a first-class prediction target with mass landing directly on the node’s descendant-set focal element.
Additional DST boundary-condition scenarios (P3.10) test operator-observable failure modes: uniform evidence, vacuous sources, and single-source dominance. The generic-vs-specific-same-depth scenario (P3.11) validates that sibling discrimination at equal depth is structurally sound.
6. Operational Improvements
Cautious review disabled (empirically validated as harmful).
Run ce4f3777 against 920 reference columns showed a reroute miss
rate of 76.1%, a backoff miss rate of 78.8%, and a net accuracy loss
of 13.6 percentage points vs. LLM-only. The cascade: a second LLM
call on already high-conflict evidence returns degraded evidence,
which raises K and lowers belief, which triggers review at scale and
compounds the damage. The feature is now disabled by default, with
bel_threshold = 0.0 (unreachable) as a belt-and-suspenders guard.
Enrichment verifier relaxation. The prototype_values_match_patterns check was relaxed from a 100% to a 50% match threshold.
Categories with diverse free-text values (marketplace names,
descriptive labels) legitimately produce prototypes that do not fit
a single regex family. The prior strict threshold caused false
rejections and forced manual bypass.
Enrichment prompt feedback key fix. The retry prompt read
verifier_feedback.get("failed_checks") but the verifier report
writes "details". Retry prompts had empty diagnostic information;
the LLM was asked to fix failures it could not see.
Bootstrap-environment and curate-agent-mediated skills. Two
Claude Agent SDK skills were added to .claude/commands/: a unified
enrichment + curation + SVM skill (6-phase back-pressure rubric,
resume-safe persistence) and a targeted per-table curation skill.
Late-interaction bridge self-supplies embedder. The ColBERT encoder is now initialized by the bridge itself rather than requiring the caller to pass one, fixing a CAI venv import ordering issue.
Commit Log
| Hash | Summary |
|---|---|
| a505953 | R7-R10 audit remediations + bundled R1-R6 + UI / config |
| 6010e94 | Cite canonical CCO IRIs alongside shorthand labels |
| baafa5f | SOTAB v2 coverage strategy + Aegir handoff |
| 70ec5b5 | P1 storage foundation for late-interaction cosine |
| 6716935 | P2 LLM-mediated annotation enrichment pipeline |
| b5e97ea | P3 late-interaction cosine integration (default off) |
| 8a9e771 | P3.5 default-on + loud-fallback for late-interaction cosine |
| 142b91e | P3.6 channel-decomposed positive/negative Dempster combination |
| 8faf242 | Academic-grade DST Reborn brief |
| c324fbe | P3.7 SHAP per-decision attribution surface for late-interaction cosine |
| ed57fd1 | P3.8 hierarchical integrity – internal-node tags as first-class |
| 28e7273 | P3.9 hierarchical anti-subtree carve-out – abstract taxonomy fixture |
| 519a1c9 | P3.10 DST boundary-condition scenarios |
| a5652db | P3.11 generic-vs-specific-same-depth scenario |
| 77b41d8 | P3.12 DST numeric sensitivity study + findings |
| 2fb7377 | P3.13 hierarchical-aggregation interaction battery |
| f155e89 | P7 subsumption alignment + P5 frontier cleanup + P4 enrichment infra |
| 3d6696f | Stage A DST sensitivity visibility instrumentation + Stage B script |
| 1fee0be | top1_margin disjoint-FE traversal – Stage A regression |
| 929e29e | Late-interaction bridge self-supplies embedder + CAI venv fix |
| 7a1e4e7 | Bootstrap-environment + curate-agent-mediated skills |
| 1df1383 | Training-time NHSVM via Structured Shared Frobenius Norm |
References
- Choi, Heejin, Yutaka Sasaki, and Nathan Srebro. 2015. “Normalized Hierarchical SVM.” arXiv:1508.02479.
- Denoeux, Thierry. 2008. “Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence.” Artificial Intelligence 172(2-3): 234-264.
- Haenni, Rolf and Stephan Hartmann. 2006. “Modeling partially reliable information sources.” Studia Logica 82(1): 103-133.
- Khoo, Omar, and Steedman. 2006. “An Information Retrieval approach to short text classification.” EMNLP 2006.
- Saad, Yousef. 2003. Iterative Methods for Sparse Linear Systems. 2nd ed. SIAM.
- Shafer, Glenn. 1976. A Mathematical Theory of Evidence. Princeton University Press.
- Smets, Philippe and Robert Kennes. 1994. “The Transferable Belief Model.” Artificial Intelligence 66(2): 191-234.