Introduction

Atelier is an agentic classification workbench for Cloudera AI. It classifies column metadata using six independent evidence sources fused via Dempster-Shafer Theory (DST), producing belief intervals instead of point estimates. An LLM-in-the-loop convergence agent identifies disagreements between sources and orchestrates targeted reclassification until the corpus stabilizes.

Why Belief Intervals?

Traditional classifiers output a single probability $ P(A) = 0.85 $ — “85% email address.” This conflates two fundamentally different situations: high confidence with abundant evidence vs. moderate confidence with sparse evidence. A Bayesian posterior and a coin flip can both yield 0.5, but they represent very different epistemic states.

Dempster-Shafer theory separates these via the belief function $ \text{Bel}(A) $ and plausibility function $ \text{Pl}(A) $, where:

$$ \text{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \text{Pl}(A) = 1 - \text{Bel}(\bar{A}) $$

The interval $ [\text{Bel}(A),\; \text{Pl}(A)] $ bounds the true probability. Its width $ \text{Pl}(A) - \text{Bel}(A) $ quantifies epistemic uncertainty, that is, how much we don’t know:

Interval	Interpretation
$ [0.82,\; 0.87] $	Strong evidence, low ambiguity: classify with confidence
$ [0.30,\; 0.90] $	Some support for $A$, but high ignorance: gather more evidence
$ [0.45,\; 0.55] $	Narrow gap but belief near 0.5: the evidence is committed yet split between competing labels, so the low belief triggers a revisit

This distinction drives the entire pipeline: columns with wide belief gaps (where $ \text{Pl}(A) - \text{Bel}(A) $ is large) are automatically escalated for LLM re-examination with enriched context. Conflict $ K $ is tracked as a diagnostic but the gap width determines which columns need attention.
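The definitions of $ \text{Bel} $ and $ \text{Pl} $ can be made concrete with a minimal sketch. It assumes a mass function stored as a dictionary from focal elements (frozensets of category codes) to mass; the three-category frame and the numbers are illustrative only and are not taken from Atelier's vocabulary.

```python
from typing import Dict, FrozenSet

# A mass function maps focal elements (subsets of the frame) to mass; masses sum to 1.
Mass = Dict[FrozenSet[str], float]


def bel(m: Mass, a: FrozenSet[str]) -> float:
    """Bel(A): total mass of non-empty focal elements contained in A."""
    return sum(w for b, w in m.items() if b and b <= a)


def pl(m: Mass, a: FrozenSet[str]) -> float:
    """Pl(A): total mass of focal elements that overlap A."""
    return sum(w for b, w in m.items() if b & a)


# Hypothetical frame; 0.60 of the mass sits on the whole frame (ignorance).
THETA = frozenset({"EMAIL", "PHONE", "URL"})
m: Mass = {frozenset({"EMAIL"}): 0.30, frozenset({"PHONE"}): 0.10, THETA: 0.60}

A = frozenset({"EMAIL"})
belief, plausibility = bel(m, A), pl(m, A)
gap = plausibility - belief
print(f"[{belief:.2f}, {plausibility:.2f}] gap={gap:.2f}")  # prints: [0.30, 0.90] gap=0.60
```

The result reproduces the "some support, high ignorance" row of the table: most of the mass is uncommitted, so the interval is wide even though no source contradicts the label.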

Architecture

Six Evidence Sources

Each source independently produces a mass function $ m_i : 2^\Theta \to [0, 1] $ over the frame of discernment $ \Theta $ (the set of all category codes). Sources are grouped by computational cost:

Source	Feature Space	Cost Tier
Cosine similarity	Dense 384-dim sentence-transformer embedding (all-MiniLM-L6-v2)	M0 (local)
Pattern detection	16 regex detectors + post-regex validators (email, phone, SSN, IP, UUID, date, datetime, URL, credit card + Luhn, MAC, IBAN, postal code, monetary, hash, semver, currency + ISO 4217); graduated mass scaling by match fraction	M0
Name matching	Column name vs vocabulary labels, codes, and aliases (4-tier: exact > code > alias > overlap)	M0
LLM classification	Frontier model reasoning (Anthropic / Bedrock / Cerebras / OpenAI-compatible)	M1 (API)
CatBoost	12 discrete features + 384-dim embedding; virtual ensemble uncertainty via `posterior_sampling`	M2 (trained)
SVM	Sparse TF-IDF: character n-grams (3–6) ∪ word n-grams (1–2); Platt-scaled LinearSVC	M2 (trained)

The SVM and CatBoost classifiers occupy deliberately orthogonal feature spaces: the SVM operates on sparse lexical features (TF-IDF) while CatBoost uses dense semantic embeddings. This architectural separation ensures genuine evidence independence for Dempster’s rule.

Fusion

Sources are combined via the conjunctive rule of combination:

$$ m_{1 \oplus 2}(C) = \frac{1}{1-K} \sum_{\substack{A \cap B = C \\ A,B \subseteq \Theta}} m_1(A) \cdot m_2(B) $$

where the conflict $ K = \sum_{A \cap B = \varnothing} m_1(A) \cdot m_2(B) $ measures the degree to which sources contradict each other. $ K $ is tracked as a diagnostic signal of source disagreement; the convergence loop itself is driven by the fused belief interval: columns whose gap stays wide or whose belief stays low are escalated for targeted LLM revisit with enriched context (ML prediction, belief interval, source disagreement).
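As a rough illustration of the combination rule, the sketch below fuses two mass functions and returns the conflict $ K $ alongside the normalized result. The dictionary representation and the example masses are assumptions made for this example; Atelier's own fusion code may be organized differently.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Mass = Dict[FrozenSet[str], float]


def combine(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Dempster's rule: multiply masses, intersect focal elements, renormalize by 1 - K."""
    fused: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        c = a & b
        if c:
            fused[c] = fused.get(c, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass landing on the empty set is the conflict K
    if 1.0 - conflict < 1e-12:
        raise ValueError("total conflict: the sources cannot be combined")
    return {c: w / (1.0 - conflict) for c, w in fused.items()}, conflict


# Hypothetical two-category frame and illustrative masses.
THETA = frozenset({"EMAIL", "PHONE"})
EMAIL, PHONE = frozenset({"EMAIL"}), frozenset({"PHONE"})
m_pattern: Mass = {EMAIL: 0.7, THETA: 0.3}
m_llm: Mass = {EMAIL: 0.8, PHONE: 0.1, THETA: 0.1}

fused, k = combine(m_pattern, m_llm)
```

Here only the (EMAIL, PHONE) pairing has an empty intersection, so $ K = 0.7 \cdot 0.1 = 0.07 $, and the fused mass on EMAIL is $ 0.87 / 0.93 $.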

Hierarchical Classification

The vocabulary forms a rooted code tree (e.g., ICE.SENSITIVE.PID.CONTACT.EMAIL). Belief and plausibility are queryable at any depth — $ \text{Bel}(\texttt{ICE.SENSITIVE}) $ aggregates all descendants. The cautious_code(τ) operator returns the deepest code where $ \text{Bel} > \tau $, enabling principled depth-accuracy tradeoffs: high $ \tau $ yields coarse but reliable labels; low $ \tau $ yields specific but less certain ones.
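The depth-accuracy tradeoff can be sketched as follows. This is a simplified illustration, not the actual cautious_code implementation: it assumes mass sits only on leaf codes (with any remainder on $ \Theta $), and every code other than ICE.SENSITIVE.PID.CONTACT.EMAIL is hypothetical.

```python
from typing import Dict, List, Optional


def ancestors_and_self(code: str) -> List[str]:
    """'A.B.C' -> ['A', 'A.B', 'A.B.C']."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def bel_at(leaf_mass: Dict[str, float], node: str) -> float:
    """Belief at a tree node: mass of every leaf at or below that node."""
    return sum(w for leaf, w in leaf_mass.items()
               if leaf == node or leaf.startswith(node + "."))


def cautious_code(leaf_mass: Dict[str, float], tau: float) -> Optional[str]:
    """Deepest code whose belief exceeds tau; None if even the root fails."""
    candidates = {n for leaf in leaf_mass for n in ancestors_and_self(leaf)}
    passing = [n for n in candidates if bel_at(leaf_mass, n) > tau]
    if not passing:
        return None
    # Prefer depth first, then the larger belief among equally deep nodes.
    return max(passing, key=lambda n: (n.count("."), bel_at(leaf_mass, n)))


# Illustrative masses; the remaining 0.10 is assumed to sit on the whole frame.
leaf_mass = {
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": 0.55,
    "ICE.SENSITIVE.PID.CONTACT.PHONE": 0.30,  # hypothetical code
    "ICE.SENSITIVE.PID.NAME": 0.05,           # hypothetical code
}
specific = cautious_code(leaf_mass, tau=0.5)
coarse = cautious_code(leaf_mass, tau=0.8)
```

With a low threshold the operator commits to the EMAIL leaf; raising $ \tau $ to 0.8 backs off to the CONTACT family, whose belief aggregates both contact leaves.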

Convergence

The bootstrap pipeline iterates three phases until the belief gap ($ \text{Pl}(A) - \text{Bel}(A) $) stabilizes:

LLM sweep — classify each directly-targeted column via batch LLM calls
ML validation — run the full 6-source DST pipeline; compute per-column belief, plausibility, and gap
Targeted revisit — re-classify only uncertain columns (high gap or low belief) with enriched context (ML prediction + belief interval + detected patterns + disagreement summary)

The primary convergence measure is mean belief gap — the average width of the $ [\text{Bel}, \text{Pl}] $ interval across all columns. A narrow gap means the evidence sources agree on a confident prediction. Conflict $ K $ is tracked as a diagnostic signal (it indicates source disagreement) but does not gate convergence — a column can have $ K = 0.9 $ but $ \text{Bel} = 0.95 $: the sources fought, but the winner is clear.
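A minimal sketch of a gap-based stopping test is shown below. The function names, the tolerance, the belief floor, and the reading of "coverage" as the fraction of columns above that floor are all assumptions for illustration; only the idea of combining a stabilized mean gap with a coverage target comes from the description above.

```python
from statistics import mean
from typing import List, Tuple

Interval = Tuple[float, float]  # (Bel, Pl) of each column's predicted code


def mean_gap(intervals: List[Interval]) -> float:
    """Average width of the belief interval across columns."""
    return mean(p - b for b, p in intervals)


def coverage(intervals: List[Interval], bel_floor: float) -> float:
    """Fraction of columns whose belief reaches the floor (assumed definition)."""
    return sum(1 for b, _ in intervals if b >= bel_floor) / len(intervals)


def has_converged(gap_history: List[float], intervals: List[Interval],
                  gap_tol: float = 0.01, bel_floor: float = 0.5,
                  coverage_target: float = 0.95) -> bool:
    """Converged when the mean gap stops moving and enough columns are settled."""
    stabilized = (len(gap_history) >= 2
                  and abs(gap_history[-1] - gap_history[-2]) < gap_tol)
    return stabilized and coverage(intervals, bel_floor) >= coverage_target


intervals = [(0.82, 0.87), (0.90, 0.95), (0.95, 0.97)]
current = mean_gap(intervals)
done = has_converged([0.42, 0.045, current], intervals)
```

Note that $ K $ does not appear anywhere in the test, which mirrors the point that conflict is reported but does not gate convergence.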

An agent-driven variant (via Claude Agent SDK) delegates the revisit strategy to an LLM that reasons about uncertainty patterns and declares convergence when diminishing returns are reached. (Earlier revisions exposed a retrain_svm tool that progressively improved the SVM on accumulated LLM labels — excised on 2026-05-04 for source-independence reasons; see DST Evidence Independence.) The programmatic variant uses gap + coverage thresholds for environments where tool-use isn’t available.

SVM with Vocabulary Alignment

The SVM is trained once on the synthetic corpus with TF-IDF features and labels keyed on the bundled-ontology ICE.* leaves. At runtime, predictions are translated into the user’s taxonomy via a cached alignment (atelier.classify.ontology_alignment) so the SVM contributes user-taxonomy evidence even when the operator’s vocabulary is completely disjoint from ICE.*. The alignment is built by subsumption prediction over sentence-transformer embeddings; the earlier LLM-mediated alignment was retired in the P7 intervention. The result is weakly non-distinct evidence under Denoeux 2008 (a shared enrichment-LLM upstream rather than per-column shared labels), and the discount calibration carries the residual. See DST Evidence Independence for the full design rationale and the BM25-reranker future-work plan.

Scale

The pipeline handles corpora from 50 columns (OOTB sample) to 120M+ columns (full GitTables at 10M+ tables). Monte Carlo stratified sampling selects a representative subset for direct LLM classification and propagates labels to the remaining corpus via embedding similarity.

With max_sampled_columns = 500, classifying a 120M-column corpus requires LLM inference on only 0.0004% of columns — a >99.99% cost reduction while preserving classification quality through DST conflict-driven escalation of uncertain propagations.

Out-of-the-Box Experience

A fresh deployment auto-seeds on first boot:

316-leaf BFO-grounded vocabulary (351 categories total) covering the CCO Information Content Entity trichotomy: Designative (names, IDs, codes), Descriptive (measurements, dates, amounts), Prescriptive (software, specs)
25 sample tables with 316 columns and a committed curated reference
One-click classification via the Status page
Interactive Embeddings visualization (UMAP/t-SNE via embedding-atlas)

Quick Start

Local development (devenv):

devenv shell          # Enter dev environment
just install          # Install Python + Node dependencies
just up               # Start gRPC + gateway + Vite dev server

CAI deployment: Deploy as an AMP from https://github.com/zndx/atelier.

Documentation Map

System Overview — Component diagram
Deployment — CAI AMP and local dev setup
gRPC & Gateway — Proto contract, REST endpoints, config lifecycle
Keystone Agents — Agent convergence loop with 5 tools
Classification Pipeline — DST methodology, evidence sources, bootstrap convergence
Monte Carlo Sampling — Stratified sampling for scale
GPU Acceleration — CUDA detection and batch encoding
Synthetic Data & Training — 316+ generators, ontology-aligned SVM, CatBoost fit-to-LLM
Embeddings — Interactive parquet visualization
Data Sources — Source-aware versioning, OOTB sample, Hive auto-discovery
BDD Scenarios — 141 scenarios across 4 domains

System Overview

Atelier is a multi-service application with a gRPC core, FastAPI HTTP gateway, and React frontend.

Deployment

Cloudera AI (CML)

Atelier deploys as a CAI Application from the Git URL https://github.com/zndx/atelier.

The .project-metadata.yaml defines two tasks:

Install Dependencies — Installs Python (via uv) and Node.js dependencies, builds the React frontend
Start Atelier — Launches the gRPC server and HTTP gateway on CDSW_APP_PORT

Local Development

devenv shell          # Enter dev environment (loads .env automatically)
just install          # Install Python + Node dependencies
just proto            # Generate proto stubs
just resolve-config   # Materialize HOCON → build/config/atelier.env
just up               # Start gRPC + Vite dev server via devenv processes

gRPC & Gateway

Atelier follows the Fine Tuning Studio proto-first pattern: the gRPC service contract defines the API, and a FastAPI gateway bridges REST to gRPC while serving the React frontend.

Proto Definition

The service contract lives in src/atelier/proto/atelier.proto.

RPCs

RPC	Request → Response	Purpose
`HealthCheck`	`HealthCheckRequest` → `HealthCheckResponse`	Prove gRPC is alive (status + version)
`ListAgents`	`ListAgentsRequest` → `ListAgentsResponse`	List agent metadata (id, name, role, tools)
`GetAgent`	`GetAgentRequest` → `GetAgentResponse`	Single agent by ID
`ListDataSources`	`ListDataSourcesRequest` → `ListDataSourcesResponse`	List OOTB + Hive sources
`ListDatasets`	`ListDatasetsRequest` → `ListDatasetsResponse`	Classification datasets (filterable by source_id)
`GetFSMStatus`	`FSMStatusRequest` → `FSMStatusResponse`	Pipeline state + progress JSON
`StartClassification`	`StartClassificationRequest` → `StartClassificationResponse`	Trigger a classification run

Key Messages

DataSource — id, source_type (sample/hive), source_uri, display_name, vocabulary_mode
ClassificationDataset — id, name, parquet_path, source_id, version_number, is_active, summary
FSMStatusResponse — run_id, state, started_at, progress_json, error
AgentMetadata — id, name, description, role, tool_ids

Generating Stubs

just proto    # runs bin/generate-proto.sh

This invokes grpc_tools.protoc to produce _pb2.py, _pb2_grpc.py, and .pyi type stubs.

Architecture Layers

Proto (atelier.proto)     ← Service contract and message definitions
    ↓
Servicer (service.py)     ← Thin router dispatching to business logic
    ↓
Client (client.py)        ← Wrapper around generated stub with error handling
    ↓
Gateway (gateway.py)      ← FastAPI bridge from REST to gRPC + React SPA

Gateway REST Endpoints

Infrastructure

Endpoint	Method	Description
`/api/health`	GET	gRPC health check
`/api/status`	GET	Aggregated health: gRPC + PostgreSQL + Qdrant + config state
`/api/agents/validate-credentials`	POST	Test all configured LLM providers
`/api/agents/model-discovery`	GET	Check for model upgrades via Anthropic Models API

Data Sources & Datasets

Endpoint	Method	Description
`/api/data-sources`	GET	List registered data sources
`/api/datasets`	GET	List datasets (optional `source_id` filter)
`/api/datasets/{id}/activate`	POST	Set dataset version as active
`/api/datasets/{id}/data`	GET	Serve parquet file
`/api/data-connections`	GET	List CAI data connections
`/api/data-connections/{name}/test`	POST	Test a CAI connection
`/api/vocabulary/stats`	GET	Term count (source-aware routing)

Classification Pipeline

Endpoint	Method	Description
`/api/fsm/status`	GET	Current pipeline state + progress
`/api/fsm/start`	POST	Start classification (optional `source_id`)
`/api/fsm/runs`	GET	List past classification runs

Agents & Skills

Endpoint	Method	Description
`/api/agents`	GET	List agent metadata
`/api/skills`	GET	Skill definitions from `.claude/commands/`
`/api/skills/{skill_id}`	GET	Single skill markdown content
`/api/agents/smoke-test`	POST	Minimal Claude Agent SDK verification

WebSocket

Endpoint	Purpose
`/ws/terminal/{session_id}`	Persistent terminal backed by Claude Agent SDK
`/ws/orchestration`	Live agent events (spawned, reasoning, tool_call, completed)

Persistent Terminal Sessions

Terminal sessions survive page navigation and browser reload. The WebSocket endpoint accepts a client-provided session_id (persisted in localStorage). On disconnect, the session stays alive server-side — SDK queries continue running and output accumulates in a ring buffer (64KB collections.deque). On reconnect, the buffer is replayed so the user sees everything that happened while they were away.

Session registry: Module-level _sessions dict in terminal.py
Idle cleanup: Background asyncio task sweeps sessions with no client for 30 minutes (/api/terminal/sessions lists active sessions)
Dedicated page: /terminal route renders a full-screen Ghostty WASM terminal; the Landing page embeds the same component at preview size
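The replay behaviour can be sketched with a bounded deque. This is an illustrative byte-level ring buffer, not the code in terminal.py; the class and method names are hypothetical.

```python
from collections import deque


class ReplayBuffer:
    """Keeps only the most recent `capacity` bytes of terminal output."""

    def __init__(self, capacity: int = 64 * 1024) -> None:
        # A deque with maxlen silently evicts the oldest items when full.
        self._buf = deque(maxlen=capacity)

    def write(self, data: bytes) -> None:
        """Append output produced while the client may be disconnected."""
        self._buf.extend(data)  # iterating bytes yields ints in 0..255

    def replay(self) -> bytes:
        """Return everything still buffered, oldest byte first."""
        return bytes(self._buf)


session = ReplayBuffer(capacity=8)  # tiny capacity to make eviction visible
session.write(b"hello ")
session.write(b"world")
replayed = session.replay()  # b"lo world": the oldest three bytes were evicted
```

Storing one byte per deque slot keeps the sketch short; a production buffer would more likely store chunks and track their total size.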

SPA Fallback

/{path} serves ui/dist/index.html for client-side routing.

Aggregated Status Endpoint

GET /api/status returns a comprehensive health report:

{
  "grpc": {"status": "ok", "latency_ms": 12},
  "postgres": {"status": "ok"},
  "qdrant": {"status": "ok"},
  "config": {
    "has_anthropic": true,
    "has_bedrock": false,
    "agent_model": "claude-sonnet-4-5-20250929",
    "db_url": "postgresql://...(masked)"
  },
  "overall_status": "connected"
}

PostgreSQL probes retry 3x with 1s backoff (PGlite can have transient stalls). Overall status is connected when gRPC and the supporting services respond, and degraded when gRPC is up but other services are flaky.
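The retry policy might look like the sketch below. The helper name and its parameters are hypothetical, and the sleep function is injectable purely so the example runs without waiting.

```python
import time
from typing import Callable, List


def probe_with_retry(probe: Callable[[], bool], attempts: int = 3,
                     backoff_s: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Run `probe` up to `attempts` times, pausing `backoff_s` between tries."""
    for attempt in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat any probe error as a failed attempt
        if attempt < attempts - 1:
            sleep(backoff_s)
    return False


# Simulated database probe that stalls twice before answering.
calls = {"n": 0}


def flaky_probe() -> bool:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient stall")
    return True


delays: List[float] = []
ok = probe_with_retry(flaky_probe, sleep=delays.append)
```

With three attempts there are at most two pauses, so a healthy-but-slow database adds roughly two seconds to the status call in the worst case.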

Gateway Lifespan

The FastAPI lifespan hook runs three startup tasks:

OOTB seed: Check if ootb-sample source has any dataset versions; if none, create version 1 with metadata.
Hive auto-discovery: discover_hive_sources() probes all configured data connections (ATELIER_DATA_CONNECTIONS), iterates databases, finds annotations tables matching the known schema (legacy or universal format), and auto-registers them via get_or_create_data_source().
Terminal cleanup: Background asyncio task sweeps idle terminal sessions every 60 seconds.

All three tasks are wrapped in try/except — failures are logged as warnings but don’t prevent gateway startup.

Config Lifecycle

HOCON (config/base.conf) is the single source of truth. No module reads os.environ directly for configuration values.

.env → devenv shell → HOCON ${?VAR} substitution → AtelierConfig dataclass

load_config() reads the HOCON file with live environment variable substitution. External tools that need a flat key=value file use just resolve-config to materialize build/config/atelier.env.

Preflight Validation

just preflight runs structured deny/warn checks via atelier.preflight.run_preflight():

Deny = blocking (service cannot start). Examples: missing API keys when both Anthropic and Bedrock are unconfigured.
Warn = advisory (degraded functionality). Examples: GPU detected but CUDA unavailable, Qdrant not reachable.

Preflight is called during gateway startup to surface configuration problems early rather than during the first pipeline run.
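The deny/warn split can be illustrated with a small sketch. It does not reproduce atelier.preflight.run_preflight(); the Finding type, the function names, and the parameters are hypothetical, and the checks are limited to the examples listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Finding:
    level: str    # "deny" (blocking) or "warn" (advisory)
    message: str


def run_checks(has_anthropic: bool, has_bedrock: bool, gpu_detected: bool,
               cuda_available: bool, qdrant_reachable: bool) -> List[Finding]:
    """Collect findings instead of failing on the first problem."""
    findings: List[Finding] = []
    if not (has_anthropic or has_bedrock):
        findings.append(Finding("deny", "no LLM credentials: configure Anthropic or Bedrock"))
    if gpu_detected and not cuda_available:
        findings.append(Finding("warn", "GPU detected but CUDA unavailable"))
    if not qdrant_reachable:
        findings.append(Finding("warn", "Qdrant not reachable"))
    return findings


def can_start(findings: List[Finding]) -> bool:
    """Only deny-level findings block startup; warnings are reported and ignored."""
    return all(f.level != "deny" for f in findings)


degraded = run_checks(True, False, True, False, False)
blocked = run_checks(False, False, False, False, True)
```

Collecting every finding before deciding lets a single preflight run report all configuration problems at once.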

Keystone Agents

Atelier uses the Claude Agent SDK to drive classification convergence. Rather than a fixed programmatic loop, an LLM agent reasons about which columns to revisit based on DST conflict metrics, evidence breakdowns, and convergence trends.

Agent Convergence Loop

The agent loop (src/atelier/classify/agent_loop.py) wraps the bootstrap pipeline functions as five Claude tools. Claude receives an initial state summary and iteratively calls tools until it determines the classification has converged.

Flow

1. Initial state → agent sees mean gap, mean belief, coverage, K (diagnostic)
2. Agent calls get_conflict_report → identifies uncertain columns (high gap or low belief)
3. Agent calls get_column_detail → inspects per-source evidence breakdown
4. Agent calls revisit_columns → re-classifies with enriched context
5. Agent calls check_convergence → verifies gap trend + belief floor
6. Repeat 2-5 until satisfied
7. Agent calls declare_converged with reason

The conversation loop runs up to classify_agent_max_turns (default 10) Messages API round-trips. Each tool call returns structured JSON that the agent uses to plan its next action.
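The shape of that loop can be sketched without the SDK. In the example below the model is replaced by a scripted policy and the tools by stubs, so everything except the tool names and the turn limit is an assumption made for illustration.

```python
import json
from typing import Callable, Dict, List, Optional, Tuple

ToolCall = Tuple[str, dict]  # (tool name, keyword arguments)


def run_agent_loop(next_action: Callable[[List[dict]], ToolCall],
                   tools: Dict[str, Callable[..., dict]],
                   max_turns: int = 10) -> Optional[str]:
    """Dispatch tool calls until the agent declares convergence or turns run out."""
    transcript: List[dict] = []
    for _ in range(max_turns):
        name, args = next_action(transcript)
        result = tools[name](**args)
        # Each result is fed back as structured JSON for the next decision.
        transcript.append({"tool": name, "args": args, "result": json.dumps(result)})
        if name == "declare_converged":
            return args["reason"]
    return None  # turn budget exhausted without a convergence declaration


# Stub tools returning made-up payloads; the real tools wrap pipeline functions.
tools = {
    "get_conflict_report": lambda k_threshold: {"flagged": ["ref_addr"]},
    "revisit_columns": lambda column_names: {"updated": column_names},
    "check_convergence": lambda: {"mean_gap": 0.04},
    "declare_converged": lambda reason: {"confirmed": True},
}

script: List[ToolCall] = [
    ("get_conflict_report", {"k_threshold": 0.2}),
    ("revisit_columns", {"column_names": ["ref_addr"]}),
    ("check_convergence", {}),
    ("declare_converged", {"reason": "mean gap stable"}),
]


def scripted_policy(transcript: List[dict]) -> ToolCall:
    """Stand-in for the LLM: replay a fixed sequence of tool calls."""
    return script[len(transcript)]


reason = run_agent_loop(scripted_policy, tools)
```

The turn cap is what bounds cost: if the policy never calls declare_converged, the loop simply returns without a reason once the budget is spent.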

Five Tools

Tool	Input	Returns	Purpose
`get_conflict_report`	`k_threshold` (float)	Flagged columns with K, belief, plausibility, gap, settled flag	Identify uncertain or conflicting columns
`revisit_columns`	`column_names` (list)	Updated labels + new belief intervals	Re-classify with enriched LLM context (ML prediction + belief interval)
`check_convergence`	—	mean_gap, mean_bel, frac_unclear, coverage, K (diagnostic), iteration history	Assess convergence via belief-gap criteria
`get_column_detail`	`column_name` (string)	Per-source evidence breakdown, sample values, belief interval	Deep-dive into a specific column
`declare_converged`	`reason` (string)	Confirmation	Exit loop with stated rationale

Historical note (2026-05-04 refactor). Earlier revisions of the agent loop included a sixth retrain_svm tool that retrained the SVM on accumulated LLM labels and hot-swapped the result. That tool was removed alongside the M9 in-loop SVM-on-LLM-labels retrain machinery (commits 8627c2c, 5199379, cc59d01) for the source-independence reasons documented in ontology_alignment.py. The SVM is now trained once on synth and translated into the user vocabulary at inference time; there is no per-run SVM retraining for the agent to drive.

Agent System Prompt

The system prompt guides the agent’s strategy:

Examine the conflict report to understand where sources disagree
Inspect individual columns for uncertain cases (high gap or low belief)
Revisit uncertain columns to resolve ambiguity
Check convergence metrics (mean gap, mean belief, coverage) to decide whether to continue — K is available as a diagnostic but does not gate
Declare convergence when satisfied (or when diminishing returns)

State Tracking

The agent loop tracks:

state.agent_reasoning — text blocks from each agent turn
state.agent_converged_reason — the reason given at convergence
state.agent_turns — number of conversation turns
state.tokens_input / state.tokens_output — token consumption

Each revisit_columns call increments state.iteration and triggers full ML revalidation on all columns, not just the revisited ones. This ensures that improved LLM labels propagate through the DST fusion.

LLM Backend Matrix

The agent loop and LLM sweep share the same backend infrastructure. No global provider switch — credentials determine what’s available.

Backend	Class	Config	Use Case
Anthropic	`AnthropicBackend`	`ANTHROPIC_API_KEY`	Agent loop + LLM sweep
Bedrock	`BedrockBackend`	`AWS_ACCESS_KEY_ID` + `AWS_SECRET_ACCESS_KEY` + `AWS_REGION`	Production default on CAI
Cerebras	`CerebrasBackend`	`CEREBRAS_API_KEY`	Fast inference via GLM-4.7
OpenAI-compatible	`OpenAICompatibleBackend`	`ATELIER_LLM_BASE_URL` + `ATELIER_LLM_MODEL`	vLLM, any compatible endpoint

The agent client is built via _build_client(cfg) which prefers Anthropic when ANTHROPIC_API_KEY is set, falling back to Bedrock when AWS credentials are available. The agent model resolves as: classify_agent_model → agent_model → "claude-sonnet-4-5-20250929".
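The two fallback chains can be written down in a few lines. This is a sketch of the precedence described above, not the body of _build_client(cfg); the function names and the use of a plain mapping for configuration are assumptions.

```python
from typing import Mapping, Optional

DEFAULT_AGENT_MODEL = "claude-sonnet-4-5-20250929"


def resolve_agent_model(classify_agent_model: Optional[str],
                        agent_model: Optional[str]) -> str:
    """classify_agent_model wins, then agent_model, then the built-in default."""
    return classify_agent_model or agent_model or DEFAULT_AGENT_MODEL


def pick_agent_backend(cfg: Mapping[str, Optional[str]]) -> Optional[str]:
    """Prefer Anthropic when its key is set; otherwise fall back to Bedrock."""
    if cfg.get("ANTHROPIC_API_KEY"):
        return "anthropic"
    if cfg.get("AWS_ACCESS_KEY_ID") and cfg.get("AWS_SECRET_ACCESS_KEY"):
        return "bedrock"
    return None  # no usable credentials for the agent loop


model = resolve_agent_model(None, None)
backend = pick_agent_backend({"AWS_ACCESS_KEY_ID": "id", "AWS_SECRET_ACCESS_KEY": "secret"})
```

Passing configuration in as a mapping, rather than reading the environment inside the function, keeps the sketch consistent with the rule that modules do not read os.environ directly.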

Configuration

All agent and bootstrap settings live in HOCON (config/base.conf):

classify {
    llm {
        backend = "openai_compatible"
        model = "glm-4.7"
        base_url = null
        columns_per_call = 50
        discount = 0.10
    }
    bootstrap {
        max_iterations = 5
        k_threshold = 0.2
        coverage_target = 0.95
        max_total_llm_calls = 5000
        # Historical: these knobs gated the excised M9 in-loop SVM
        # retrain.  Retained here only as illustration of the legacy
        # config surface; the keys are no longer read by the pipeline.
        # incremental_svm_retrain = true
        # incremental_svm_min_labels = 20
    }
}

agent {
    model = "claude-sonnet-4-5-20250929"
    model = ${?ATELIER_AGENT_MODEL}
}

classify {
    agent_model = null
    agent_model = ${?ATELIER_CLASSIFY_AGENT_MODEL}
    agent_max_turns = 10
}

When classify.agent_model is set, it overrides agent.model for the classification convergence loop specifically.

Agent vs Programmatic Loop

The bootstrap pipeline (bootstrap.py) contains the programmatic convergence loop as well: sweep → validate → revisit uncertain → repeat. The agent loop is an alternative that delegates the revisit strategy to Claude. Both paths share the same underlying functions (_llm_sweep, _run_ml_validation, etc.) and produce identical DST evidence.

The agent approach is preferred when:

The corpus has wide-belief-gap columns where independent evidence sources disagree in non-obvious ways
You want reasoning traces explaining why convergence was declared
The LLM backend supports tool_use (Anthropic, Bedrock with Claude)

The programmatic approach is used when:

The LLM backend doesn’t support tool_use (vLLM, Cerebras)
Deterministic behavior is required
Cost must be minimized (fewer API calls)

WebSocket Orchestration

The gateway exposes /ws/orchestration for live agent event streaming. Events include agent_spawned, agent_reasoning, agent_tool_call, and agent_completed. The React frontend’s Agent Canvas page consumes these events to render the agent’s decision process in real time.

Classification Pipeline

Atelier’s core objective: agent-mediated metadata classification using Dempster-Shafer Theory (DST) to produce belief intervals instead of flat confidence scores, exposing epistemic uncertainty and source disagreement.

Terminology — reference-label provenance

Four distinct sources of per-column labels show up in our writeups. Conflating them is a load-bearing error, so we name each explicitly:

Term	Source	Authority level	Where it appears
Published benchmark	External, human-curated labels (SOTAB, GitTables)	Gold standard — memorization-safe check	SOTAB pilot artifacts; `docs/notes/2026-04-19/…phase_gate_2.md`
Curated reference	Generator-derived (synth pairs an answer-key “reference column” per target) + spot-checked by hand	Definitive for the synthetic corpus; not equivalent to a published benchmark	`build/meta-tagging-clean/curated_reference.csv`
LLM commitment	A single LLM’s pass-1 or pass-2 output	Classifier opinion; not a truth	parquet `llm_code`, `predicted_code`
CatBoost prior	CatBoost fit to LLM labels, used for revisit enrichment	Not independent evidence — it is a compressed self-consensus of the LLM; valuable specifically for rescuing abstentions	parquet `predicted_code` via DST fusion

An ablation (as used in our writeups) is a controlled experiment that holds most of the pipeline fixed and varies exactly one component at a time, so changes in accuracy can be attributed to that component rather than to the combination.

Methodology

Why Dempster-Shafer?

Traditional classifiers output a single confidence score (e.g., “85% email address”). This hides two distinct types of uncertainty:

Aleatoric uncertainty: inherent randomness in the data
Epistemic uncertainty: ignorance due to insufficient evidence

DST separates these via belief intervals [Bel(A), Pl(A)]:

Bel(A) = committed evidence supporting A (lower bound)
Pl(A) = evidence that cannot rule out A (upper bound)
Pl(A) - Bel(A) = unresolved ambiguity

When Bel(A) = 0.8 and Pl(A) = 0.85, we have high confidence with low ambiguity. When Bel(A) = 0.3 and Pl(A) = 0.9, we know something supports A but much remains uncertain — a signal to gather more evidence.

Evidence Sources

Each source independently produces a mass function (Basic Probability Assignment) that distributes belief across the frame of discernment:

Source	Type	Discount	Configurable	Status
Cosine similarity	Sentence-transformer (all-MiniLM-L6-v2)	0.30	`classify.discounts.cosine`	M0
Pattern detection	16 regex detectors + post-regex validators	0.25	`classify.discounts.pattern_theta`	M0
Name matching	Column name ↔ label/abbrev/common_names	varies	`classify.discounts.name_match_*`	M0
LLM	OpenAI-compatible / Anthropic / Bedrock / Cerebras	0.10	`classify.llm.discount`	M1
CatBoost	Gradient boosted trees (virtual ensembles)	adaptive	`classify.discounts.catboost_*`	M2
SVM	Dual TF-IDF (char+word n-grams) + LinearSVC (Platt scaling)	0.20	`classify.discounts.svm`	M2

The discount controls how much mass goes to Θ (total ignorance). Higher discount = more conservative = wider belief intervals.
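The effect of a discount on the belief interval can be seen with classical Shafer discounting, sketched below. Whether Atelier applies its discounts in exactly this form is an assumption; the example only shows why a larger discount widens the interval.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def discount(m: Mass, alpha: float, theta: FrozenSet[str]) -> Mass:
    """Shafer discounting: scale committed mass by (1 - alpha), move the rest to Theta."""
    out = {a: (1.0 - alpha) * w for a, w in m.items() if a != theta}
    out[theta] = alpha + (1.0 - alpha) * m.get(theta, 0.0)
    return out


THETA = frozenset({"EMAIL", "PHONE"})
EMAIL = frozenset({"EMAIL"})
raw: Mass = {EMAIL: 1.0}  # a source that is fully committed to EMAIL

llm_like = discount(raw, alpha=0.10, theta=THETA)     # Bel(EMAIL) = 0.90, Pl = 1.0
cosine_like = discount(raw, alpha=0.30, theta=THETA)  # Bel(EMAIL) = 0.70, Pl = 1.0
```

Plausibility stays at 1.0 in both cases because nothing contradicts EMAIL, while belief drops by exactly the discounted amount, so the interval width equals the discount.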

Pattern mass is graduated: detect_patterns() returns a match fraction (0.0-1.0) per pattern, and pattern_to_mass() scales evidence mass by the average match fraction. A 95% match therefore produces roughly 2.7x the mass of a 35% match (0.95 / 0.35), eliminating the binary cliff at the 1/3 detection threshold.
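A minimal sketch of graduated scaling follows. It assumes the committed mass is linear in the match fraction and capped by the pattern discount; graduated_mass is a hypothetical helper and is not the actual pattern_to_mass().

```python
def graduated_mass(match_fraction: float, theta: float = 0.25,
                   detection_threshold: float = 1.0 / 3.0) -> float:
    """Mass committed to the pattern's category; the remainder goes to Theta."""
    if match_fraction < detection_threshold:
        return 0.0  # below the detection threshold the pattern is not reported
    return (1.0 - theta) * match_fraction


strong = graduated_mass(0.95)  # (1 - 0.25) * 0.95
weak = graduated_mass(0.35)    # (1 - 0.25) * 0.35
ratio = strong / weak
```

Under this linear assumption the ratio is 0.95 / 0.35, about 2.7, so a column that barely clears the threshold contributes far less evidence than one that matches almost everywhere.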

Pattern theta (0.25) is deliberately higher than LLM theta (0.10), so the LLM cleanly dominates when pattern and LLM evidence conflict — the LLM considers full context (name, type, values, siblings), while patterns operate on value structure alone.

Evidence Independence

Dempster’s rule of combination requires cognitively independent evidence sources (Shafer 1976) — each mass function must reflect information not derived from the other sources being combined. Atelier achieves this through architectural separation of feature spaces and training signals:

Source	Feature Space	Training Signal	Independence Basis
Name match	String/lexical	None (deterministic)	Symbolic matching only
Pattern	Regex	None (deterministic)	Hand-crafted rules only
Cosine	Dense embedding (384-dim)	Pre-trained sentence-transformer	Learned semantic similarity
LLM	Semantic (frontier or subagent model)	Pre-trained weights	In-context classification
CatBoost	Dense embedding + 12 features	Synthetic data generators	Gradient-boosted ensemble
SVM	Sparse TF-IDF (char 3-6 + word 1-2 n-grams)	Synthetic data generators	Lexical surface patterns

The SVM is Atelier’s domain-adaptation channel. Cosine and the frontier LLM both rely on pretrained models, which work best on columns whose names and values carry meaning that a web-text-trained model can latch onto (email_address, transaction_amount, ISO dates). Many columns in deployed enterprise data are not like that: opaque names (val_09, col_73, ref_addr), opaque values (hex digests, internal serial codes, prefix-stripped tokens), or both. Pretrained models have nothing to latch onto for those: the signal lives only in domain-specific shape (format, length, character-class distribution, prefix vocabulary) that must be learned from data shaped like the deployed distribution. The SVM is trained on synthetic corpora produced by procedural generators in src/atelier/classify/synth_generators.py, so it learns precisely those patterns. The SVM and cosine therefore operate on disjoint signal populations (semantic-bearing columns versus inscrutable ones), which makes them structurally, not merely statistically, independent evidence sources under DST.

A subtler point worth naming: the historical “confusable pair” framing attributed to the data what often lived in the featurizer. Char-n-gram TF-IDF treating Brazilian CPF identifiers as date-shaped, or sub-word tokenization splitting similar-looking strings into overlapping tokens, are tokenization artifacts — properties of the model, not the data. Domain-adapted training on synthetic-corpus examples that match the deployed distribution sees past those artifacts; the SVM is not “resolving confusables” but reading columns that pretrained models fundamentally cannot.

Architecturally this also provides the most important independence guarantee in the DST stack. While cosine similarity and CatBoost both operate on the same dense sentence-transformer embedding (384 dimensions from all-MiniLM-L6-v2), the SVM operates on a fully orthogonal feature representation: sparse TF-IDF character and word n-grams extracted by sklearn.pipeline.Pipeline + FeatureUnion. The SVM captures lexical surface patterns (abbreviations, digit sequences, camelCase fragments) that the dense embedding collapses — providing genuine corrective signal in DST fusion.

SVM Architecture (adopted from Signals)

The SVM classifier follows the Pipeline + FeatureUnion composition pattern from the Signals project — the version of record presented as an independent fifth DST evidence source:

Column metadata text ("email_addr | user@example.com")
        │
        ▼
    FeatureUnion
    ├── TfidfVectorizer(analyzer="char_wb", ngram_range=(3,6))
    │   → captures subword patterns, abbreviations, digit sequences
    └── TfidfVectorizer(analyzer="word", ngram_range=(1,2))
        → captures multi-word patterns ("email address", "zip code")
        │
        ▼
    Sparse feature matrix (up to 100K dimensions)
        │
        ▼
    CalibratedClassifierCV(LinearSVC, method="sigmoid")
        │
        ▼
    Calibrated probability distribution {code: probability}

Key implementation details:

Singleton class filtering — fit() drops categories with < 2 training examples before CalibratedClassifierCV, since StratifiedKFold requires every class to have >= 2 samples. With 316 categories and few tables, some categories inevitably have only one example. Dropped categories are logged and still receive predictions from the other 5 DST evidence sources.
_min_class_count() — returns the actual minimum (no longer clamped to 2)
feature_importances(top_n) — navigates CalibratedClassifierCV → LinearSVC to extract coef_, averages absolute coefficients across classes, cross-references with FeatureUnion.get_feature_names_out() for named feature importance
is_fitted property for safe state checking before prediction
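The singleton filter can be sketched in plain Python. The function name and return shape are hypothetical; the sketch only shows the filtering step that precedes calibration, not the fit() method itself.

```python
from collections import Counter
from typing import List, Tuple


def drop_singleton_classes(texts: List[str], labels: List[str],
                           min_count: int = 2) -> Tuple[List[str], List[str], List[str]]:
    """Remove classes with fewer than `min_count` examples; report what was dropped."""
    counts = Counter(labels)
    dropped = sorted(c for c, n in counts.items() if n < min_count)
    kept = [(t, y) for t, y in zip(texts, labels) if counts[y] >= min_count]
    kept_texts = [t for t, _ in kept]
    kept_labels = [y for _, y in kept]
    return kept_texts, kept_labels, dropped


# Illustrative training rows in the "name | sample value" text form.
texts = ["email_addr | a@b.com", "mail | c@d.org", "zip | 94107"]
labels = ["EMAIL", "EMAIL", "POSTAL"]
kept_texts, kept_labels, dropped = drop_singleton_classes(texts, labels)
```

Returning the dropped labels makes it easy to log them, which matters because those categories then depend entirely on the other evidence sources.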

SVM Training (synth-only) and Vocabulary Alignment

The SVM is trained once on the synthetic corpus (see synth.md) using TF-IDF char-3-6gram + word-1-2gram features and labels keyed on bundled-ontology ICE.* leaves from synth_generators.GENERATORS. At pipeline runtime, the ICE.* predictions are translated into the user’s taxonomy via the cached subsumption-prediction alignment in atelier.classify.subsumption_alignment — sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from the Qdrant taxonomy collection. The legacy LLM-mediated alignment was retired in the P7 intervention (see DST Evidence Independence).

The alignment targets every user node — leaves AND internal nodes (per the dynamic-annotations principle that every node is a first-class tagging target). An ICE leaf may legitimately align to a user internal node when the user’s vocabulary covers a concept family without a leaf-specific equivalent. Restricting alignment to user leaves only would silently reject the parent-family fallback that is the architecturally-correct behavior.

The translation step is what restored the SVM as useful evidence for non-OOTB user vocabularies — pre-alignment, the SVM emitted ICE codes that didn’t appear in the user-taxonomy frame and silently contributed nothing. See subsumption_alignment.py module docstring for the full independence argument.
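The translation step can be illustrated with toy vectors. The sketch assumes each ICE leaf and each user node already has an embedding; the two-dimensional vectors, the user node names, and every ICE code other than ICE.SENSITIVE.PID.CONTACT.EMAIL are hypothetical, and the helper names are not the module's API.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; 0.0 when either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_by_cosine(ice_vecs: Dict[str, List[float]],
                    user_vecs: Dict[str, List[float]]) -> Dict[str, str]:
    """Map each ICE leaf to its most similar user node (leaf or internal)."""
    return {ice: max(user_vecs, key=lambda u: cosine(vec, user_vecs[u]))
            for ice, vec in ice_vecs.items()}


def translate(ice_probs: Dict[str, float], alignment: Dict[str, str]) -> Dict[str, float]:
    """Re-key SVM probabilities into the user taxonomy, summing merged codes."""
    out: Dict[str, float] = {}
    for ice, p in ice_probs.items():
        user = alignment.get(ice)
        if user is not None:
            out[user] = out.get(user, 0.0) + p
    return out


ice_vecs = {
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": [1.0, 0.1],
    "ICE.SENSITIVE.PID.CONTACT.PHONE": [0.9, 0.3],  # hypothetical code
    "ICE.TEMPORAL.DATE": [0.0, 1.0],                # hypothetical code
}
user_vecs = {"Party.Contact": [1.0, 0.2], "Temporal.Date": [0.1, 1.0]}

alignment = align_by_cosine(ice_vecs, user_vecs)
user_probs = translate(
    {"ICE.SENSITIVE.PID.CONTACT.EMAIL": 0.5,
     "ICE.SENSITIVE.PID.CONTACT.PHONE": 0.3,
     "ICE.TEMPORAL.DATE": 0.2},
    alignment,
)
```

Both contact leaves land on the same user node, so their probabilities add up there. This is the parent-family fallback in miniature: a user vocabulary without separate email and phone leaves still receives the combined evidence.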

Historical note (2026-05-04 refactor). Earlier revisions of this design ran a mid-loop train_svm_on_frontier_labels (historical function name) that retrained the SVM on live LLM labels and hot-swapped the result into the active model slot, labelled “M9 incremental SVM retraining” in commit history. That path was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for source-independence reasons: the per-column LLM label copying made the SVM strongly non-distinct with the LLM source under Denoeux 2008. The subsequent LLM-mediated alignment introduced a vocabulary-level shared error mode (the alignment-time LLM and the runtime LLM share weights), which the P7 subsumption-prediction intervention eliminates: runtime alignment now uses sentence-transformer embeddings rather than the runtime LLM. The SVM’s TF-IDF independence at the feature and label level is preserved; the remaining weak non-distinctness is the shared enrichment-LLM upstream (offline-generated annotations), structurally identical to the late-interaction cosine source’s coupling.

Implementation

train_svm() in ml_train.py — synth-only training, persists to build/models/svm.pkl (label space: ICE.* leaves)
ontology_alignment.build_alignment() — once-per-(vocab, embedding_model) ICE → user-code mapping via subsumption prediction (sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from Qdrant); cached at build/cache/alignment/<sha256>.json
Discount: classify.discounts.svm = 0.22 (was 0.30 under LLM-mediated alignment, 0.55 in M9 era) reflects the enrichment-mediated subsumption-prediction regime — weakly non-distinct via shared enrichment-LLM upstream only.

Dempster’s Rule of Combination

Sources are fused via the conjunctive combination rule:

$$ m_{12}(C) = \frac{\sum_{A \cap B = C} m_1(A)\, m_2(B)}{1 - K}, \qquad C \neq \emptyset $$

where $ K = \sum_{A \cap B = \emptyset} m_1(A)\, m_2(B) $ is the conflict between sources.

High K means the sources disagree — a valuable diagnostic signal. Note that K is not the convergence criterion — see Belief-Gap Convergence below.
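
A minimal sketch of the combination rule above, assuming mass functions are plain dicts keyed by frozenset focal elements. The real belief.py works with BeliefAssignment and FocalElement, whose interfaces are not shown here, so the signature and the category names below are illustrative only.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions; return (combined_mass, K).

    Each mass function maps frozenset focal elements to mass.
    Raises ValueError on total conflict, where the rule is undefined.
    """
    combined = {}
    conflict = 0.0
    for (a, mass_a), (b, mass_b) in product(m1.items(), m2.items()):
        intersection = a & b
        if intersection:
            combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
        else:
            # Mass on the empty set is the conflict K.
            conflict += mass_a * mass_b
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    return {c: v / norm for c, v in combined.items()}, conflict


# Illustrative three-category frame (names are placeholders).
THETA = frozenset({"EMAIL", "PHONE", "SSN"})
m1 = {frozenset({"EMAIL"}): 0.6, THETA: 0.4}
m2 = {frozenset({"PHONE"}): 0.5, THETA: 0.5}
fused, k = dempster_combine(m1, m2)
```

Here the only empty intersection is EMAIL with PHONE, so K = 0.6 × 0.5 = 0.3, and the surviving masses are rescaled by 1 / 0.7.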

Compound Focal Elements (Uncertainty Representation)

When DST evidence splits closely between two singleton categories, collapsing to a single top-1 prediction misrepresents what the evidence actually says. DST’s native vocabulary for this is the compound focal element: a portion of the runner-up’s mass transfers to a focal element representing the union of the two singletons, honestly reflecting that the evidence supports the disjunction but does not discriminate between members. This is the same DST math that supports queries at any node in the hierarchy via belief_at() — the compound mass propagates up to the common ancestor, so belief at any level reflects the combined evidence.

The mechanism is unconditional DST: any two singletons whose masses split closely qualify in principle. In practice the implementation maintains a short registry of category pairs where the transfer is routinely activated — examples below, filtered to vocabulary at runtime. These are illustrations of cases where the mechanism activates, not a definitional list of categories the classifier is expected to “confuse”.

Example pair	Why mass-splitting is common
Record Identifier ↔ Device Identifier	Both are opaque identifiers; context determines which
Timestamp ↔ Date of Birth	Both are temporal; DOB is a specific semantic subtype
Transaction Amount ↔ Bank Account Number	Both are financial numbers
IP Address ↔ Device Identifier	IP addresses can identify devices

Mechanics: when the top-2 singleton masses match a registered pair and their ratio is below confusable_ratio_threshold (default 3.0), half of the runner-up’s mass transfers to the compound focal element. Belief at the common ancestor then reflects the combined evidence via belief_at() propagation. (The config knob retains its historical name for backward compatibility; the mechanism itself is honest uncertainty representation, not pair-discrimination.)
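
The transfer described above can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the function name is hypothetical, masses are plain dicts keyed by frozensets, and "ratio" is taken to mean top mass divided by runner-up mass.

```python
def apply_compound_transfer(masses, registered_pairs, ratio_threshold=3.0):
    """Move half of the runner-up's mass onto the union of the top-2 singletons.

    masses: dict mapping frozenset focal elements to mass.
    registered_pairs: set of two-element frozensets eligible for transfer.
    Returns a new dict; the input is left untouched.
    """
    singletons = sorted(
        ((fe, m) for fe, m in masses.items() if len(fe) == 1),
        key=lambda item: item[1],
        reverse=True,
    )
    if len(singletons) < 2:
        return dict(masses)
    (top, top_mass), (runner, runner_mass) = singletons[:2]
    pair = top | runner
    if pair not in registered_pairs or runner_mass <= 0.0:
        return dict(masses)
    # Assumption: the ratio is top / runner-up; a clear winner is left alone.
    if top_mass / runner_mass >= ratio_threshold:
        return dict(masses)
    moved = runner_mass / 2.0
    out = dict(masses)
    out[runner] = runner_mass - moved
    out[pair] = out.get(pair, 0.0) + moved
    return out


TS, DOB = frozenset({"TIMESTAMP"}), frozenset({"DATE_OF_BIRTH"})
THETA = frozenset({"TIMESTAMP", "DATE_OF_BIRTH", "AMOUNT"})
pairs = {TS | DOB}
split = apply_compound_transfer({TS: 0.5, DOB: 0.3, THETA: 0.2}, pairs)
clear = apply_compound_transfer({TS: 0.8, DOB: 0.1, THETA: 0.1}, pairs)
```

In the first call the ratio is 0.5 / 0.3, below 3.0, so 0.15 moves to the compound element. In the second the ratio is 8 and nothing changes. Total mass is preserved in both cases.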

Pattern Validation

Pattern detection uses a two-stage architecture: 16 regex patterns for recall, plus a _VALIDATORS registry for precision. A value must pass both the regex AND the validator (if one exists) to count.

Validator	Pattern	Checks
`_luhn_check`	`credit_card_pattern`	Luhn checksum (ISO/IEC 7812)
`_is_valid_ipv4`	`ipv4_pattern`	All 4 octets in 0-255 range
`_is_plausible_date`	`date_iso_pattern`, `datetime_iso_pattern`	Month 01-12, day 01-31
`_is_iso_currency`	`iso_currency_pattern`	ISO 4217 whitelist (~40 codes)

The phone_pattern uses a suppression mechanism: when a more specific digit-heavy pattern also fires (SSN, date, credit card, IP, postal code, monetary, IBAN), the phone match is suppressed. This prevents the phone regex from injecting false evidence on columns whose values happen to contain formatted digits.
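
A minimal sketch of the two-stage check and the phone suppression described above. The regexes, the helper names without a leading underscore, and the partial MORE_SPECIFIC set are assumptions for illustration; the actual patterns in features.py are not shown in this document.

```python
import re

# Stage 1: permissive regexes (recall). These expressions are illustrative.
PATTERNS = {
    "credit_card_pattern": re.compile(r"^\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}$"),
    "ipv4_pattern": re.compile(r"^\d{1,3}(\.\d{1,3}){3}$"),
}


def luhn_check(value):
    """Luhn checksum over the digits of value."""
    digits = [int(ch) for ch in value if ch.isdigit()]
    if not digits:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def is_valid_ipv4(value):
    """All four octets must fall in 0-255 (called only after the regex matched)."""
    return all(0 <= int(part) <= 255 for part in value.split("."))


# Stage 2: validators (precision), keyed by pattern name.
VALIDATORS = {"credit_card_pattern": luhn_check, "ipv4_pattern": is_valid_ipv4}


def matches(pattern_name, value):
    """A value counts only if the regex matches AND the validator, if any, passes."""
    if not PATTERNS[pattern_name].match(value):
        return False
    validator = VALIDATORS.get(pattern_name)
    return validator(value) if validator else True


# Partial, illustrative list of patterns that outrank the phone regex.
MORE_SPECIFIC = {"credit_card_pattern", "ipv4_pattern", "date_iso_pattern"}


def suppress_phone(fired):
    """Drop phone_pattern when a more specific digit-heavy pattern also fired."""
    if "phone_pattern" in fired and fired & MORE_SPECIFIC:
        return fired - {"phone_pattern"}
    return fired
```

The validator lookup uses `.get()` so that patterns without a registered validator pass on the regex alone, matching the "if one exists" wording.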

12 Discrete Features

Each column produces 12 SAGE-ablatable features:

column_name — humanized column name
column_type — SQL type (suppresses uninformative STRING/VARCHAR)
sample_values — first 5 non-null values as text
cardinality — distinct value count
null_ratio — fraction of NULL values
value_entropy — Shannon entropy of value lengths
pattern_signals — matched regex patterns
avg_value_length — mean string length
numeric_ratio — fraction parseable as numbers
sibling_context — other column names in the same table
source_table — table name
value_description — auto-generated natural language description
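
As a rough illustration of three of the numeric features listed above, the sketch below computes null_ratio, value_entropy, and numeric_ratio from a list of sampled values. The function signatures and the choice of log base 2 are assumptions; the extraction code in features.py may differ.

```python
import math
from collections import Counter


def null_ratio(values):
    """Fraction of values that are None (standing in for SQL NULL)."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def value_entropy(values):
    """Shannon entropy, in bits, of the distribution of value lengths."""
    lengths = [len(str(v)) for v in values if v is not None]
    if not lengths:
        return 0.0
    n = len(lengths)
    return -sum((c / n) * math.log2(c / n) for c in Counter(lengths).values())


def numeric_ratio(values):
    """Fraction of non-null values that parse as a float."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0

    def is_number(v):
        try:
            float(v)
            return True
        except (TypeError, ValueError):
            return False

    return sum(is_number(v) for v in non_null) / len(non_null)
```

A column whose values all have the same length has entropy 0, which is typical of fixed-width codes. Free-text columns tend to score higher.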

Architecture

AgentFSM

The classification pipeline runs as a background Finite State Machine:

ML-only path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

Bootstrap path (programmatic):
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
                                                    ▲                     │
                                                    └─── (disagreements) ─┘
                                                          (converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

Agent-driven path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING
                                                    ▲           │
                                                    └── Agent convergence loop (6 tools)
                                                          Claude reasons about which columns to revisit
                                                          (converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

MC sampling (when corpus > 200 columns):
SAMPLING includes pre-classify → stratify → select MC sample
LLM_SWEEP classifies the sampled subset only → propagate labels to remainder

State transitions are persisted to PostgreSQL. The Status page polls /api/fsm/status for live progress updates.
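
The paths above can be read as a transition table. The sketch below is reconstructed from the diagrams in this document only: the authoritative enum and transitions live in fsm.py, the offline GENERATING_SYNTH and TRAINING states are omitted, and the names State, TRANSITIONS, advance, and walk are hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LOADING_VOCAB = auto()
    DISCOVERING = auto()
    SAMPLING = auto()
    LLM_SWEEP = auto()
    VALIDATING = auto()
    CLASSIFYING = auto()
    FUSING = auto()
    EVALUATING = auto()
    CONVERGED = auto()
    ERROR = auto()


# Legal transitions as drawn in the diagrams (ERROR is reachable from any state).
TRANSITIONS = {
    State.IDLE: {State.LOADING_VOCAB},
    State.LOADING_VOCAB: {State.DISCOVERING},
    State.DISCOVERING: {State.SAMPLING},
    State.SAMPLING: {State.CLASSIFYING, State.LLM_SWEEP},  # ML-only vs. LLM paths
    State.LLM_SWEEP: {State.VALIDATING},
    State.VALIDATING: {State.LLM_SWEEP, State.CLASSIFYING},  # revisit or exit loop
    State.CLASSIFYING: {State.FUSING},
    State.FUSING: {State.EVALUATING},
    State.EVALUATING: {State.CONVERGED},
    State.CONVERGED: {State.IDLE},
}


def advance(current, target):
    """Return target if the transition is legal, else raise ValueError."""
    if target is State.ERROR or target in TRANSITIONS.get(current, set()):
        return target
    raise ValueError(f"illegal transition {current.name} -> {target.name}")


def walk(path):
    """Validate a whole path and return its final state."""
    state = path[0]
    for nxt in path[1:]:
        state = advance(state, nxt)
    return state
```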

Module Structure

src/atelier/classify/
├── __init__.py          # Public API: run_pipeline(), run_bootstrap(), get_fsm_status()
├── belief.py            # DST core: BeliefAssignment, FocalElement, dempster_combine()
├── mass_functions.py    # Evidence→mass converters (6 active)
├── features.py          # 12 features + 16 pattern detectors + 5 post-regex validators
├── taxonomy.py          # ReferenceCategory, HierarchicalCategorySet
├── embedding.py         # Sentence-transformer cosine classifier
├── llm_backend.py       # LLM backend factory (Anthropic, OpenAI-compat, Bedrock tool-use, Cerebras)
├── bootstrap.py         # Bootstrap convergence loop (LLM sweep + ML validation)
├── agent_loop.py        # Agent-driven convergence (6 Claude tools)
├── monte_carlo.py       # MC stratified sampling for scale (pre-classify, stratify, select, propagate)
├── gpu.py               # GPU detection + NVIDIA driver symlink (nix+CUDA)
├── sampler.py           # Hive metadata sampling + fixture data loading
├── synth.py             # Synthetic data generation
├── synth_generators.py  # 316+ hand-coded value generators (shared module)
├── synth_registry.py    # Three-layer generator registry (hand-coded > template > inferred)
├── meta_tagging_overlay.py # 130+ META_TO_ICE mappings for meta-tagging alignment
├── svm_classifier.py    # Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals)
├── catboost_classifier.py # CatBoost with virtual ensemble uncertainty
├── ml_train.py          # Training orchestrator (synth → models)
├── ml_inference.py      # Lazy-loading inference wrappers
├── evaluation.py        # Structured evaluation (per-category P/R/F1, confusion matrix)
├── train_eval_cycle.py  # Synth → train → classify → evaluate orchestrator
├── mock_llm.py          # Realistic mock LLM (seeded uncertainty + mass-splitting between close categories)
├── sage.py              # SAGE feature importance (permutation-based, GPU-aware)
├── shap_explanations.py # Per-item SHAP feature attribution (TreeSHAP + PermutationSHAP)
├── pipeline.py          # Full pipeline orchestration (6 sources + MC + background SHAP)
├── fsm.py               # AgentFSM state machine
├── fixtures/
│   ├── universal_vocabulary.json  # BFO-grounded universal vocabulary (16 leaves)
│   └── fixture_tables.json        # 8 tables, 50 cols — fixture reference for unit tests
│                                    (NOT the UAT-corpus curated reference; see
│                                    build/meta-tagging-clean/curated_reference.csv)
data/sample/
├── ontology.json                  # Expanded vocabulary (300 leaves, 25 internal)
└── ontology/
    ├── atelier-vocab.ttl          # CCO-mediated BFO alignment (59 mapped terms)
    ├── sparql/unmapped-terms.rq   # Totality validation query
    └── README.md                  # Mapping methodology and usage

Build Directory

Artifacts are written to build/ (gitignored) to separate reproducible code from potentially sensitive intermediate data:

build/
├── data/annotations/    # Cached vocabulary from hive
├── data/samples/        # Sampled metadata
├── data/synth/          # Synthetic training data
├── models/              # Trained CatBoost + SVM models, embedding caches
└── results/{run_id}/
    ├── classifications.json           # Per-column DST results (+ SHAP columns when enabled)
    ├── evaluation_report.json         # Per-category P/R/F1, confusion matrix
    └── atelier_embeddings.parquet     # For embedding-atlas (+ shap_top{1,2,3}_{name,value})

Controlled Vocabulary

Loaded from hive default.annotations (11 columns):

Column	Maps to	Purpose
`id`	`code`	Hierarchical dot-notation identifier
`ontology`	`label`	Human-readable category name
`annotation`	`abbrev`	Formal code / mnemonic
`definition`	`description`	Human-readable definition text
`common_names`	`common_names`	Pipe/comma-separated aliases
`specifics`	(embedding text)	Examples and context
`non_corp`, `emp_contractor`, `individual`, `corp`	`sensitivity`	Per-role ratings (0-4)
`deprecated`	(filter)	“yes” = exclude

API

REST Endpoints

GET /api/fsm/status — Current pipeline state + progress
POST /api/fsm/start — Start a single-pass ML classification run
POST /api/fsm/start-bootstrap — Start bootstrap convergence loop (LLM + ML)
GET /api/fsm/runs — List past runs

gRPC RPCs

GetFSMStatus() → FSMStatusResponse
StartClassification() → StartClassificationResponse

HierarchicalClassification

The pipeline wraps each column result in a HierarchicalClassification object (ported from signals) that enables post-hoc hierarchy navigation:

belief_at(code) — query Bel at any hierarchy level (leaf or internal)
plausibility_at(code) — query Pl at any level
interval_at(code) — (Bel, Pl) tuple
uncertainty_gap — Pl - Bel for the predicted category
needs_clarification — True when uncertainty_gap > 0.3 or conflict > 0.2
from_combined_evidence() — factory method: filters vacuous sources, combines via the configured fusion strategy, ranks by pignistic probability

Confidence is pignistic probability BetP(singleton), the decision-theoretic transform that distributes multi-element focal set mass equally among members.
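
For orientation, here is a small sketch of the quantities this object exposes, written over a flat dict of frozenset focal elements. It assumes no mass on the empty set and ignores the hierarchy lookup that belief_at(code) performs. The function names are illustrative rather than the class's actual methods.

```python
def belief(masses, target):
    """Bel(A): total mass of focal elements contained in A."""
    return sum(m for fe, m in masses.items() if fe <= target)


def plausibility(masses, target):
    """Pl(A): total mass of focal elements that intersect A."""
    return sum(m for fe, m in masses.items() if fe & target)


def pignistic(masses):
    """BetP: split each focal element's mass equally among its members."""
    betp = {}
    for fe, m in masses.items():
        share = m / len(fe)
        for category in fe:
            betp[category] = betp.get(category, 0.0) + share
    return betp


def needs_clarification(masses, predicted, conflict, gap_limit=0.3, k_limit=0.2):
    """True when Pl - Bel for the predicted category, or K, exceeds its limit."""
    target = frozenset({predicted})
    gap = plausibility(masses, target) - belief(masses, target)
    return gap > gap_limit or conflict > k_limit


THETA = frozenset({"EMAIL", "PHONE", "SSN"})
masses = {
    frozenset({"EMAIL"}): 0.6,
    frozenset({"EMAIL", "PHONE"}): 0.2,
    THETA: 0.2,
}
betp = pignistic(masses)
```

For this example Bel(EMAIL) = 0.6 and Pl(EMAIL) = 1.0, so the interval is wide even though BetP(EMAIL) is about 0.77. The column would be flagged for clarification despite a fairly confident headline number.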

Fusion Strategies

Two DST combination rules are implemented, selectable via classify.fusion_strategy:

dempster (default) — Classical Dempster’s rule with (1-K) normalization. Under high conflict, surviving singletons are amplified.
yager — Yager’s modified rule. Conflict mass is redirected to Θ (ignorance) instead of being normalized away. Preserves epistemic honesty at the cost of higher ignorance mass and typically lower peak belief values. When K=0, produces identical results to Dempster.

Yager is available as an opt-in alternative for empirical validation. The default (Dempster) remains in place pending A/B comparison on real pipeline runs — Yager’s increased conservatism may or may not improve overall classification quality, and compensatory adjustments to per-source discounting or decision thresholds may be needed.
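
To make the difference concrete, the following sketch implements Yager's rule over dict-based mass functions. The function name and signature are assumptions for illustration; the selection logic behind classify.fusion_strategy is not reproduced.

```python
def yager_combine(m1, m2, theta):
    """Yager's rule: conflict mass is added to theta instead of normalized away.

    m1, m2: dicts mapping frozenset focal elements to mass.
    theta: the full frame of discernment as a frozenset.
    Returns (combined_mass, K).
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            intersection = a & b
            if intersection:
                combined[intersection] = combined.get(intersection, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b
    # No division by (1 - K): the conflicting mass becomes ignorance.
    combined[theta] = combined.get(theta, 0.0) + conflict
    return combined, conflict


THETA = frozenset({"EMAIL", "PHONE", "SSN"})
m1 = {frozenset({"EMAIL"}): 0.6, THETA: 0.4}
m2 = {frozenset({"PHONE"}): 0.5, THETA: 0.5}
fused, k = yager_combine(m1, m2, THETA)
```

With K = 0.3, Yager leaves EMAIL at 0.3 and raises the mass on Θ to 0.5. Dempster's normalization would instead lift EMAIL to 0.3 / 0.7 ≈ 0.43, which is the amplification effect noted above.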

Bootstrap Convergence Loop

The bootstrap pipeline wraps the single-pass ML pipeline in an iterative LLM↔ML convergence loop. It adds LLM evidence and repeats until predictions are settled — measured by belief-gap convergence, not raw conflict K.

Three Phases

LLM Sweep (LLM_SWEEP): Batch-classify all columns via the configured LLM backend (Claude via Bedrock/Anthropic, or any OpenAI-compatible endpoint). Columns are sent in table-aware batches with sibling context. If every batch fails, the sweep raises RuntimeError (fail-fast) instead of silently proceeding with zero labels.
ML Validation (VALIDATING): Run the full 6-source DST pipeline for each column. Compute per-column belief interval [Bel, Pl], conflict K, and uncertainty gap Pl - Bel. Identify uncertain columns where predictions need revisiting.
Targeted Revisit (back to LLM_SWEEP): Re-classify uncertain columns with enriched context — the ML prediction, belief interval, pattern signals, and value descriptions are included in the prompt. This gives the LLM evidence it didn’t have in the first pass.

Belief-Gap Convergence

The primary convergence measure is the uncertainty gap Pl - Bel for each column’s predicted category. This directly answers “how settled is this prediction?” — unlike K, which only measures source disagreement.

A column can have K=0.9 but Bel=0.95 — the sources fought hard during combination, but the normalizing denominator (1-K) concentrated surviving mass on the agreed-upon singleton. That column’s prediction is settled despite high conflict; it doesn’t need revisiting.

Convergence criteria (all must hold):

Criterion	Metric	Default	Meaning
Primary	`mean_gap < gap_threshold`	0.15	Predictions are tight
Secondary	`frac_unclear < clarity_target`	0.10	At most 10% of columns need clarification
Coverage	`coverage >= coverage_target`	0.95	95% of columns have labels

Revisit targeting: _identify_uncertain_columns() selects columns where gap > 0.3 OR Bel < bel_floor (default 0.50), sorted by gap descending (most uncertain first).

Early stopping: The proof-of-progress paradigm monitors the gap trend. When mean gap plateaus for 2 consecutive iterations (no verifiable progress), the loop stops even if the threshold hasn’t been reached.
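
The criteria, revisit targeting, and plateau check in this section can be sketched together. Everything below is illustrative: ColumnState and the function names are hypothetical, "unclear" is assumed to mean gap > 0.3, and the plateau tolerance min_delta is an invented parameter.

```python
from dataclasses import dataclass


@dataclass
class ColumnState:
    name: str
    bel: float
    pl: float
    labeled: bool = True

    @property
    def gap(self):
        return self.pl - self.bel


def is_converged(cols, gap_threshold=0.15, clarity_target=0.10,
                 coverage_target=0.95, unclear_gap=0.3):
    """All three criteria from the table must hold."""
    if not cols:
        return False
    n = len(cols)
    mean_gap = sum(c.gap for c in cols) / n
    frac_unclear = sum(c.gap > unclear_gap for c in cols) / n
    coverage = sum(c.labeled for c in cols) / n
    return (mean_gap < gap_threshold
            and frac_unclear < clarity_target
            and coverage >= coverage_target)


def identify_uncertain(cols, gap_limit=0.3, bel_floor=0.50):
    """Columns with a wide gap OR low belief, most uncertain first."""
    picked = [c for c in cols if c.gap > gap_limit or c.bel < bel_floor]
    return sorted(picked, key=lambda c: c.gap, reverse=True)


def plateaued(mean_gaps, patience=2, min_delta=1e-3):
    """True when mean gap failed to drop by min_delta for `patience` iterations."""
    if len(mean_gaps) <= patience:
        return False
    recent = mean_gaps[-(patience + 1):]
    return all(prev - cur < min_delta for prev, cur in zip(recent, recent[1:]))


cols = [
    ColumnState("email", bel=0.90, pl=0.95),
    ColumnState("acct_ref", bel=0.40, pl=0.90),
    ColumnState("created_at", bel=0.70, pl=0.75),
    ColumnState("misc_code", bel=0.45, pl=0.50),
]
revisit = [c.name for c in identify_uncertain(cols)]
```

In the sample, acct_ref is selected for its wide gap and misc_code for its low belief despite a narrow gap. The gap-descending sort puts acct_ref first.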

K as Diagnostic

Conflict K remains in logs, iteration metrics, and agent tools as a diagnostic for source disagreement. It is useful for identifying calibration issues (e.g., a pattern detector producing false positives) but does not gate convergence. The cumulative K formula K = 1 - Π(1 - Kᵢ) tends to be high (~0.5-0.8) with 6 partially correlated sources; this is expected and does not indicate poor quality.
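
A quick worked example of the cumulative formula, assuming six sources are fused in five pairwise steps and each step contributes a modest conflict of 0.15 (these per-step values are invented for illustration):

```python
import math


def cumulative_conflict(step_conflicts):
    """K = 1 - prod(1 - K_i) over the per-step conflicts K_i."""
    return 1.0 - math.prod(1.0 - k for k in step_conflicts)


# Five combination steps, each with K_i = 0.15.
k_total = cumulative_conflict([0.15] * 5)
```

Since 0.85 raised to the fifth power is about 0.444, the cumulative K comes out near 0.56. Small per-step disagreements are enough to land in the range quoted above.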

Agent-Driven Convergence

As an alternative to the programmatic loop, the agent convergence loop (agent_loop.py) delegates revisit strategy to Claude. The agent uses 6 tools — get_conflict_report, revisit_columns, check_convergence, get_column_detail, retrain_svm, declare_converged — to reason about which columns need re-examination. The agent sees both gap-based and K-based metrics and can make nuanced decisions. See Keystone Agents.

LLM Backend

llm_backend.py provides a factory-pattern abstraction:

OpenAICompatibleBackend: For vLLM, GLM-4.7, and any endpoint implementing the OpenAI chat completions API. Default backend.
AnthropicBackend: For Claude via the Anthropic SDK.
BedrockBackend: For AWS Bedrock via the Converse API.
BedrockStructuredBackend: Production default on CAI. Uses invoke_model with tool-use for structured output (output_config is not supported on Bedrock). When extended thinking is enabled, tool_choice must be "auto" (Anthropic constraint); a text-block fallback parser handles this case. Both Bedrock backends (BedrockBackend and BedrockStructuredBackend) use region_from_arn() to extract the target region from cross-region inference profile ARNs.
CerebrasBackend: OpenAI-compatible with Cerebras-specific defaults (base_url=https://api.cerebras.ai/v1, model=zai-glm-4.7).
create_backend_from_cfg(cfg): Factory that reads HOCON config to select and configure the appropriate backend.

Backends fail fast when not configured — no mock fallback in production code.
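
The body of region_from_arn() is not shown in this document. Since ARNs follow the colon-separated layout arn:partition:service:region:account-id:resource, one plausible sketch is the following; the error handling is an assumption and the ARN in the example is a placeholder.

```python
def region_from_arn(arn):
    """Return the region field of an ARN, failing fast on malformed input."""
    # Split into at most six fields; the resource part may itself contain colons.
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    region = parts[3]
    if not region:
        raise ValueError(f"ARN has no region component: {arn!r}")
    return region


# Placeholder inference-profile ARN, for illustration only.
example = "arn:aws:bedrock:us-west-2:123456789012:inference-profile/example-profile"
region = region_from_arn(example)
```

Raising instead of returning a default region is in keeping with the fail-fast policy stated above.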

Configuration

All bootstrap/LLM settings live in HOCON (config/base.conf):

classify {
    llm {
        backend = "openai_compatible"  # or "anthropic", "bedrock_structured"
        model = "glm-4.7"
        base_url = null                # vLLM endpoint URL
        columns_per_call = 50
        discount = 0.10                # DST discount for LLM mass
    }
    bootstrap {
        max_iterations = 5
        k_threshold = 0.2              # diagnostic (not convergence-gating)
        coverage_target = 0.95
        max_total_llm_calls = 5000
        # Belief-gap convergence (primary criteria)
        gap_threshold = 0.15           # mean(Pl - Bel) target
        clarity_target = 0.10          # max fraction of unclear columns
        bel_floor = 0.50               # min belief for "settled"
    }
}

Environment variable overrides follow the standard pattern: ATELIER_LLM_MODEL, ATELIER_LLM_BASE_URL, ATELIER_BOOTSTRAP_K_THRESHOLD, etc.

SHAP Explanations

Per-item feature attribution explaining why each column was classified as it was. Complements the global SAGE importance (which ranks features across the entire dataset) with item-level explanations.

Two Methods

Method	Algorithm	Speed	Features	When Used
CatBoost TreeSHAP	Exact O(TLD²) built-in	~0.1s for 50 items	Grouped: embedding, discrete	Auto when CatBoost model loaded
Embedding PermutationSHAP	`shap.PermutationExplainer`	~50s/item on CPU	12 named features	Tier-1, explicit request only

Auto mode (method="auto") only uses TreeSHAP — PermutationSHAP is too slow for default pipeline runs and must be explicitly requested.

Output

Each classification gains 6 extra columns:

shap_top1_name, shap_top1_value
shap_top2_name, shap_top2_value
shap_top3_name, shap_top3_value

These flow through to JSON, parquet, and evaluation output.
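
As an illustration of how per-feature attributions might be flattened into these columns, the sketch below ranks features by absolute SHAP value and emits the shap_top{i}_name / shap_top{i}_value pairs. Ranking by magnitude and the function name are assumptions; the attribution values are invented.

```python
def shap_topk_columns(attributions, top_k=3):
    """Flatten a {feature: shap_value} dict into shap_top{i}_name/value columns."""
    ranked = sorted(attributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    row = {}
    for i, (name, value) in enumerate(ranked[:top_k], start=1):
        row[f"shap_top{i}_name"] = name
        row[f"shap_top{i}_value"] = value  # sign is kept: negative pushes away
    return row


row = shap_topk_columns({
    "column_name": 0.42,
    "pattern_signals": -0.31,
    "cardinality": 0.05,
    "null_ratio": 0.01,
})
```

With the default top_k = 3 this yields exactly the six columns listed above.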

Configuration

classify.shap {
    enabled = true        # Enable SHAP in pipeline (auto-selects method)
    top_k = 3             # Number of top features to report per item
}

Configurable Discounts

All DST discount factors are configurable via HOCON. The DiscountConfig dataclass bundles all parameters with DiscountConfig.from_cfg(cfg) factory:

classify.discounts {
    cosine = 0.30                    # Cosine similarity → Theta mass
    svm = 0.20                       # SVM → Theta mass
    pattern_theta = 0.25             # Pattern detection → Theta mass (graduated by match fraction)
    name_match_exact = 0.70          # Exact label match singleton mass
    name_match_code = 0.50           # Formal code/abbrev match mass
    name_match_alias = 0.50          # Common name alias match mass
    name_match_overlap = 0.30        # Word overlap match mass
    catboost_base = 0.10             # Adaptive discount base
    catboost_variance_scale = 1.6    # Variance-to-discount scaling
    catboost_max = 0.50              # Cap on adaptive discount
    catboost_fallback = 0.15         # When no variance available
    confusable_ratio_threshold = 3.0 # Mass-split ratio that triggers compound focal element transfer
}

Environment variable overrides: ATELIER_DISCOUNT_COSINE, ATELIER_DISCOUNT_SVM, etc.
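
The "→ Theta mass" comments suggest classical Shafer discounting, where a fraction α of each source's mass is moved to Θ. The sketch below assumes that reading. The adaptive CatBoost formula (base plus scaled variance, capped) is likewise a guess assembled from the parameter names, not the documented implementation.

```python
def discount(masses, alpha, theta):
    """Shafer discounting: scale every focal element by (1 - alpha), give alpha to theta."""
    out = {fe: (1.0 - alpha) * m for fe, m in masses.items() if fe != theta}
    out[theta] = alpha + (1.0 - alpha) * masses.get(theta, 0.0)
    return out


def catboost_discount(variance, base=0.10, scale=1.6, cap=0.50, fallback=0.15):
    """Hypothetical adaptive discount: grows with ensemble variance, up to a cap."""
    if variance is None:
        return fallback  # no variance estimate available
    return min(cap, base + scale * variance)


THETA = frozenset({"EMAIL", "PHONE", "SSN"})
raw = {frozenset({"EMAIL"}): 0.8, THETA: 0.2}
discounted = discount(raw, alpha=0.30, theta=THETA)
```

With α = 0.30 the EMAIL mass falls from 0.8 to 0.56 and Θ rises to 0.44. A heavily discounted source can therefore never contribute more than 1 − α to any singleton.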

Milestones

Milestone	Scope	Status
M0	Cosine + pattern + name match, FSM, pipeline E2E	Done
M0.5	Schema fix, pignistic probability, HierarchicalClassification	Done
M1	LLM evidence source, bootstrap convergence loop, LLM↔ML validation	Done
M2	CatBoost + SVM + synthetic data, 6 evidence sources, Bedrock/Cerebras backends	Done
M3	Evaluation framework, E2E synth-train-eval, realistic mock LLM, SAGE importance	Done
M4	SHAP explanations, configurable discounts, thread-safe model loading	Done
M5	Data sources + versioning, OOTB onboarding (316-leaf ontology, 25 sample tables)	Done
M6	Agent-driven convergence loop (6 Claude tools), synth framework (316+ generators)	Done
M7	Monte Carlo stratified sampling, label propagation, background SHAP	Done
M8	GPU acceleration (NVIDIA driver symlink, batch encoding), meta-tagging overlay	Done
M8.5	SVM signals alignment (Pipeline+FeatureUnion adoption, evidence independence documentation)	Done
M9	Incremental SVM training on LLM-classified labels (cross-model distillation via MC sampling) — subsequently excised, see 2026-05-04 historical note above	Done
M10	Phase Gate #2 — belief-gap convergence pivot, Cautious-Code Review, TreeSHAP per-feature attribution, reasoning-trace citation analyzer (+9 pts iterative gain), 97.8% phase-gate validation on meta-tagging	Done
M11	MLflow experiment tracking, Hive data source integration	Proposed

Pipeline Phases (FSM Walk-Through)

A run of the classification pipeline advances through a finite state machine. Each state is a named phase with a single responsibility, and the legal transitions between phases — defined authoritatively in src/atelier/classify/fsm.py — form the workflow that operators see live in the Workflows page and that this document narrates end-to-end.

This page is the operator-facing companion to two deeper references:

Classification Pipeline — what the pipeline does mathematically (DST, evidence sources, fusion strategies).
DST Evidence Independence — why each evidence source qualifies for Dempster’s rule of combination.

Read this one first when you need to walk a reviewer through the run shape: which phase produces which artifact, where the iteration loop lives, what makes a run land in CONVERGED versus ERROR.

At a glance

                                     ┌──── revisit ────┐
                                     │                 │
                                     ▼                 │
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
                                                                         │
                                                                         ▼
                                                       CLASSIFYING → FUSING → EVALUATING → CONVERGED
                                                                                                │
                                                                                          (any phase)
                                                                                                ▼
                                                                                              ERROR

The arrow back from VALIDATING to LLM_SWEEP is the iteration loop; it’s the heart of the algorithm and is described in Iteration loop below.

Phases in execution order

#	State	What it does	Primary output
—	`IDLE`	No run in flight. Ready to dispatch the next classification.	—
1	`LOADING_VOCAB`	Load the user-supplied taxonomy (annotations CSV / Hive table / DB) and validate: label collisions, duplicate codes, orphaned aliases, parent-aware frame structure.	`HierarchicalCategorySet`, `FrameOfDiscernment`
2	`DISCOVERING`	Probe the data source via `cml.data_v1` (Hive), the meta-tagging mount (CSV), or the bundled fixtures to enumerate the tables in scope.	`list[str]` of table names
3	`SAMPLING`	For each discovered table, sample column metadata: bare names, types, ~5 representative values, true `COUNT(DISTINCT)` bounded by the sample limit, null ratio, sibling list. Reference-key columns (`attr_1_2_3_*` answer-key shape) are filtered out so they don’t trivially leak into evaluation.	`list[ColumnSample]` (canonical bare names — see `ColumnSample` invariant)
4	`LLM_SWEEP`	Claude classifies each directly-targeted column into the user vocabulary. Iteration 1 sweeps every column (or the Monte Carlo sampled subset — a stratified slice for large corpora; the remaining columns get label propagation later). Iterations 2…N revisit only the columns flagged for re-look in the previous `VALIDATING` pass.	`state.labels[qualified_name] → category_code`, plus per-column LLM confidence
5	`VALIDATING`	ML re-validation: CatBoost (fit-to-LLM during the loop) and the synth-trained SVM (translated through the cached subsumption-prediction ICE→user-vocab alignment) score the same columns independently of the LLM. Per-column DST mass with conflict K is computed under the parent-aware frame. The disagreement set (driven primarily by belief-gap `Pl − Bel`, with K and coverage as secondary signals) feeds the next iteration’s revisit batch. The loop exits when convergence criteria are satisfied; otherwise it re-enters `LLM_SWEEP`.	`state.ml_prediction`, `state.ml_belief`, `state.ml_plausibility`, `state.ml_conflict`, the next iteration’s disagreement list
6	`CLASSIFYING`	Final per-column DST evidence fusion. Up to six evidence sources combine: `name_match`, `pattern`, `cosine`, `llm`, `catboost`, `svm`. Each produces a mass function over the parent-aware frame; per-column predicted code, belief, plausibility, and conflict are computed here.	`classifications: list[dict]` (each entry shaped as `classifications.json` rows)
7	`FUSING`	Combine per-column mass functions via the configured fusion strategy. `dempster` normalizes conflict by `(1 − K)`; `yager` redirects conflict mass to Θ (ignorance). Cautious-code review (when enabled) runs here — backing off over-specified leaf predictions whose belief sits below the commit threshold to a parent code where it does.	Headline classification per column; `cautious_review.json` (when enabled)
8	`EVALUATING`	Compute corpus-level metrics: accuracy vs reference (when present), per-category precision/recall, K distribution, gap distribution, non-PII residual count. Persist artifacts to disk and emit the parquet for the Embeddings page. Overwatch (when enabled) runs at the tail of this phase.	`evaluation_report.json`, `column_trajectories.json`, `taxonomy_findings.json`, `atelier_embeddings.parquet`, ML artifacts, `overwatch.md` (when enabled)
—	`CONVERGED`	Terminal success. Convergence criteria satisfied; results are on disk under `build/results/{run_id}/` and registered in the DB via the run-end registration path (or recovered later by `atelier.db.sync.sync_filesystem_to_db` on restart).	—
—	`ERROR`	Terminal failure. The FSM `error` field carries the diagnostic; pipeline logs and `register_error.json` (if any) carry the rest.	—

The FSM defines two states that the standard inference run does not visit — GENERATING_SYNTH and TRAINING. These belong to the offline synth-corpus generation + SVM-training flow that produces the bundled SVM artifact (legacy filename svm_frontier.pkl retained on disk for backward compatibility with older run directories), and are reachable from SAMPLING only on the explicit synth-generate code path.

Iteration loop: `LLM_SWEEP ⇄ VALIDATING` is the algorithm

The single most important thing to internalize when reviewing the pipeline: LLM_SWEEP and VALIDATING are not two separate one-shot phases — they form an iteration loop, and the loop is the convergence algorithm.

Each cycle:

LLM_SWEEP labels (or re-labels) the directly-targeted column set on iteration 1, the disagreement set on iterations 2…N.
VALIDATING runs ML re-validation, computes per-column belief, plausibility, and conflict under the parent-aware DST frame, and identifies the next disagreement set.
The loop exits when one of the convergence criteria is satisfied; otherwise it re-enters LLM_SWEEP.

The Workflows page draws this as a purple dashed back-edge from VALIDATING to LLM_SWEEP precisely because the geometry teaches the algorithm: this is bootstrapping with active-learning revisit, not a linear pipeline.

The driver of the loop is configurable:

Programmatic (default): bootstrap._llm_revisit picks revisit candidates from _identify_disagreements and _identify_uncertain_columns.
Agent-driven (capability flag): the Agent Convergence skill replaces the programmatic driver — Claude chooses revisit candidates and decides when to declare convergence via tool calls.

Convergence criteria

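A run reaches CONVERGED when any of the first three conditions below holds at the end of an iteration in the LLM_SWEEP ⇄ VALIDATING loop. The fourth row, min_iterations, is a guard that must be satisfied before any of them can fire: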

Criterion	Config key	Default	Notes
Mean belief-gap below threshold	`classify.bootstrap.gap_threshold`	0.05	The primary signal — converging on `mean(Pl − Bel)`, not on K. Locked in by commit `bd7de2c` after the parent-aware DST frame audit.
Coverage met + K acceptable	`classify.bootstrap.coverage_floor`, `classify.bootstrap.k_threshold`	0.95, 0.40	Backstop — a corpus that LLM-labels everything cleanly on the first sweep doesn’t need additional iterations.
Iteration cap reached	`classify.bootstrap.max_iterations`	4	Fallback. The convergence reason is recorded as `max_iterations_reached` so the UI can show an honest “ran the full budget” rather than claiming gap convergence the run didn’t actually achieve.
Min iterations honored	`classify.bootstrap.min_iterations`	2	Forces at least N revisit cycles before any convergence path can fire. Defends against a single-pass LLM that happens to land cleanly without the ML cross-check having run.

Conflict K (Dempster’s rule’s normalization mass) is diagnostic, not the gating signal. Earlier iterations of the design framed K as the convergence headline; that framing was retired in commit bd7de2c and matters for any review of older docs or telemetry that still leads with K.
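
One way to read the table is as an ordered decision, sketched below. The function name and the reason strings other than max_iterations_reached are hypothetical, and "K acceptable" is assumed to mean mean K at or below k_threshold.

```python
def convergence_reason(iteration, mean_gap, coverage, mean_k, *,
                       gap_threshold=0.05, coverage_floor=0.95,
                       k_threshold=0.40, max_iterations=4, min_iterations=2):
    """Return a reason string if the loop should exit, else None."""
    if iteration < min_iterations:
        return None  # guard: no convergence path may fire yet
    if mean_gap < gap_threshold:
        return "gap_converged"  # hypothetical label
    if coverage >= coverage_floor and mean_k <= k_threshold:
        return "coverage_and_k"  # hypothetical label
    if iteration >= max_iterations:
        return "max_iterations_reached"
    return None
```

Checking the gap first means a run that satisfies several conditions at once reports the primary signal. The iteration cap is reported only when nothing else fired.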

`ColumnSample` canonical form

The pipeline’s data-model invariant: ColumnSample.name is always the bare column identifier — table-relative, free of any f"{table_name}." prefix. Cross-table identity uses the qualified_name property (f"{table_name}.{name}") for dict keying.

This invariant is enforced in __post_init__ and validated at every source boundary:

Hive sampler (sampler._strip_table_qualifier) — strips the f"{table_name}." qualifier that Hive’s JDBC driver returns from SELECT * FROM db.table.
Meta-tagging CSV (meta_tagging_source) — strips the same prefix from CSV headers that encode the table name.
Synth, OOTB sample, fixtures — produce bare names by construction.

Any new source path that produces qualified names will trip the __post_init__ invariant at construction time with a clear diagnostic, rather than letting them silently propagate into the embedding text — where a repeated table-name prefix in the column name and the sibling list would drown the actual column signal in table-theme noise and produce table-wide misclassification.
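
A reduced sketch of the invariant, assuming a two-field dataclass (the real ColumnSample carries more fields, such as types and sample values) and a free-function version of the qualifier-stripping helper:

```python
from dataclasses import dataclass


@dataclass
class ColumnSample:
    table_name: str
    name: str  # must be the bare, table-relative identifier

    def __post_init__(self):
        prefix = f"{self.table_name}."
        if self.name.startswith(prefix):
            raise ValueError(
                f"ColumnSample.name must be bare, got qualified name {self.name!r}"
            )

    @property
    def qualified_name(self):
        """Cross-table identity, used for dict keying."""
        return f"{self.table_name}.{self.name}"


def strip_table_qualifier(table_name, raw_name):
    """Remove a leading '<table_name>.' if present (source-boundary cleanup)."""
    prefix = f"{table_name}."
    return raw_name[len(prefix):] if raw_name.startswith(prefix) else raw_name


bare = strip_table_qualifier("customers", "customers.email")
sample = ColumnSample("customers", bare)
```

Stripping at the source boundary and checking again in __post_init__ is deliberately redundant: the check exists to catch a new source that forgets the strip.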

Optional capability skills

Three skills attach to specific phases and are gated by the corresponding capability flag. Each renders only when its flag is enabled in /api/status config.

Skill	Attaches to	Behavior	Capability flag
Agent Convergence	`VALIDATING`	Claude drives the convergence loop directly via the `agent_loop` tool surface — picks revisit candidates from belief/conflict signals and declares convergence when satisfied — replacing the programmatic loop driver. Bounded by `max_turns`.	`classify_agent_enabled`
Cautious Review	`FUSING`	Per-column LLM review that backs off over-specified leaf predictions to a parent code where belief crosses the commit threshold. Defends against false-precision claims on opaque or ambiguous columns.	`cautious_review_enabled`
Overwatch	`EVALUATING`	Single-turn Opus analysis writes `overwatch.md` with pipeline-tuning recommendations after the run lands. Requires direct Anthropic API (not Bedrock) — Bedrock lags Opus releases.	`overwatch_enabled`

The three skills are visible as orange dashed nodes in the Workflows page when their flags are enabled, attached to their host phases. This is the registry-MVP shape — when we hit roughly six surfaces it graduates to a backend /api/skills endpoint reading from a real registry rather than a hand-coded list.

Phase ↔ artifact map

For an end-to-end review of any single run, here’s what’s recoverable from build/results/{run_id}/, indexed by the phase that produced it:

Phase	Artifact	What’s in it
Run start	`settings_snapshot.json`	The config that drove the run — `source_id`, all overlay values at start, default values, the resolved settings the pipeline actually used.
`LLM_SWEEP` ⇄ `VALIDATING`	`column_trajectories.json`	Per-column history across iterations: label changes, ML predictions, belief/plausibility/conflict trajectory, the `revisited` flag per iteration.
`LLM_SWEEP` ⇄ `VALIDATING`	`catboost_fit_to_llm.cbm` + `.classes.json`	CatBoost fit to the in-loop LLM labels. Persisted for Extend runs.
`LLM_SWEEP` ⇄ `VALIDATING`	`svm_frontier.pkl` + `.classes.json`	Synth-trained SVM with the in-run LLM-mediated alignment. Persisted for Extend runs. (Filename retained for backward compatibility; underlying model is the synth-trained SVM, not the excised M9 in-loop retrain.)
`CLASSIFYING` + `FUSING`	`classifications.json`	The per-column output: predicted code, belief, plausibility, conflict, full evidence-source mass distributions, belief path, llm/ML/cautious codes. The headline corpus result.
`FUSING`	`cautious_review.json`	Cautious Review skill audit (only when enabled).
`EVALUATING`	`evaluation_report.json`	Corpus-level metrics: accuracy, per-category precision/recall, K distribution, gap distribution, non-PII residual count.
`EVALUATING`	`taxonomy_findings.json`	Notes flagged during taxonomy traversal — orphaned codes, suspicious aliases, near-duplicate labels.
`EVALUATING`	`atelier_embeddings.parquet` + `umap.pkl`	Input for the Embeddings page (UMAP projection of the per-column embedding vectors with predicted-code colorings).
`EVALUATING`	`sage_importance.json` + `shap_summary.json`	GPU-accelerated global feature importance + per-column SHAP attributions (when enabled).
`EVALUATING` (post)	`overwatch.md`	Overwatch skill output (only when enabled).
Run end	`register_error.json` (renamed to `.resolved` after sync)	If DB registration failed mid-run, the sync path on restart picks the run up from this sidecar.

Where to look in code

Concern	File
FSM state enum, transitions, `FSMRun` dataclass	`src/atelier/classify/fsm.py`
Phase advancement (every `fsm.advance(...)` call)	`src/atelier/classify/pipeline.py`
Iteration loop driver (programmatic)	`src/atelier/classify/bootstrap.py` (`_llm_sweep`, `_llm_revisit`, `_identify_disagreements`, `_run_ml_validation`)
Iteration loop driver (agent)	`src/atelier/classify/agent_loop.py`
Per-column DST fusion	`src/atelier/classify/pipeline.py` (`_classify_column`)
Convergence criteria evaluation	`src/atelier/classify/bootstrap.py` (`_mean_gap`, `_mean_k`, `_coverage`, `should_stop_early`)
Cautious Review skill	`src/atelier/classify/cautious_review.py`
Overwatch skill	`src/atelier/overwatch/agent.py`
Workflows page topology (UI)	`ui/src/lib/fsmPipelineLayout.ts`

State transition reference

Authoritative state-transition table (from fsm.py:_TRANSITIONS):

From	Legal next states
`IDLE`	`LOADING_VOCAB`
`LOADING_VOCAB`	`DISCOVERING`, `ERROR`
`DISCOVERING`	`SAMPLING`, `ERROR`
`SAMPLING`	`CLASSIFYING`, `GENERATING_SYNTH`, `LLM_SWEEP`, `ERROR`
`GENERATING_SYNTH`	`TRAINING`, `ERROR`
`TRAINING`	`CLASSIFYING`, `ERROR`
`LLM_SWEEP`	`VALIDATING`, `CLASSIFYING`, `ERROR`
`VALIDATING`	`LLM_SWEEP`, `CLASSIFYING`, `ERROR`
`CLASSIFYING`	`FUSING`, `ERROR`
`FUSING`	`EVALUATING`, `ERROR`
`EVALUATING`	`CONVERGED`, `IDLE`, `ERROR`
`CONVERGED`	`IDLE`
`ERROR`	`IDLE`

SAMPLING → CLASSIFYING (skipping LLM_SWEEP) is the path used by Extend runs, where ML-only inference is desired because the LLM has already classified an earlier corpus and the artifacts are being applied to a new dataset. LLM_SWEEP → CLASSIFYING (skipping VALIDATING) is the “first-sweep convergence” path on small corpora that don’t need iteration.
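As a minimal sketch, the table above can be held in a plain dictionary and used to check a proposed phase sequence. The names `TRANSITIONS`, `can_advance`, and `is_legal_path` are hypothetical and are not taken from fsm.py; only the state names and edges come from the table.

```python
# Illustrative copy of the transition table above. The source of truth is
# fsm.py:_TRANSITIONS, whose exact shape is not shown in this document.
TRANSITIONS: dict[str, set[str]] = {
    "IDLE": {"LOADING_VOCAB"},
    "LOADING_VOCAB": {"DISCOVERING", "ERROR"},
    "DISCOVERING": {"SAMPLING", "ERROR"},
    "SAMPLING": {"CLASSIFYING", "GENERATING_SYNTH", "LLM_SWEEP", "ERROR"},
    "GENERATING_SYNTH": {"TRAINING", "ERROR"},
    "TRAINING": {"CLASSIFYING", "ERROR"},
    "LLM_SWEEP": {"VALIDATING", "CLASSIFYING", "ERROR"},
    "VALIDATING": {"LLM_SWEEP", "CLASSIFYING", "ERROR"},
    "CLASSIFYING": {"FUSING", "ERROR"},
    "FUSING": {"EVALUATING", "ERROR"},
    "EVALUATING": {"CONVERGED", "IDLE", "ERROR"},
    "CONVERGED": {"IDLE"},
    "ERROR": {"IDLE"},
}


def can_advance(current: str, target: str) -> bool:
    """True when the table lists `target` as a legal next state of `current`."""
    return target in TRANSITIONS.get(current, set())


def is_legal_path(path: list[str]) -> bool:
    """True when every consecutive pair in `path` is a legal transition."""
    return all(can_advance(a, b) for a, b in zip(path, path[1:]))


# Extend-run shape: SAMPLING goes straight to CLASSIFYING.
extend_run = ["IDLE", "LOADING_VOCAB", "DISCOVERING", "SAMPLING",
              "CLASSIFYING", "FUSING", "EVALUATING", "CONVERGED", "IDLE"]
print(is_legal_path(extend_run))
```

A check like this is mainly useful in tests or log replay, where a recorded phase sequence can be validated against the table without running the pipeline.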

DST Evidence Independence

This note documents how Atelier’s classification pipeline handles non-distinct evidence sources under Dempster-Shafer fusion, and why the discount calibration and revisit gate are structured the way they are. It is intended to be cited by code reviewers and academic readers.

Atelier’s bootstrap loop is iterative refinement on a belief-assignment vector B over columns: B_{n+1} = T(B_n), where T composes the LLM sweep, ML validation (CatBoost + SVM), DST fusion, and targeted revisit on disagreement. Cast in the language of classical numerical analysis (Banach 1922; Saad 2003 §4.1, Iterative Methods for Sparse Linear Systems), every component of the pipeline maps onto a numerical-method primitive:

Component	Numerical-methods primitive
Bootstrap loop	Fixed-point iteration on B
LLM sweep	Stochastic operator T_LLM (Robbins-Monro 1951 framing)
ML validation	Deterministic linearization T_ML
DST fusion	Combiner ⊕ producing fused state
Targeted revisit on disagreement	Local smoothing in multigrid (Brandt 1977)
Pl − Bel gap	A posteriori error estimate per column
Conflict K	Nonlinear residual diagnostic
Ontology priors	Preconditioner — conditions first-pass output
Reliability discount on derivative sources	Damping / step-size control
Hierarchical cosine mass	Coarse-grid correction (multigrid)
`cautious_promoted_code`	Projection onto coarse grid at level where evidence unambiguous (Smets 1993)
`needs_clarification`	Residual-exceeds-tolerance flag

The diagnostic that ties the framework together is the residual norm ‖r(B)‖ — a unified scalar measuring distance from the fixed point — and the contraction factor ρ_n = ‖r_{n+1}‖ / ‖r_n‖, the headline iterative-method indicator (Saad §4.1):

ρ < 1: contractive — successive iterations reduce residual.
ρ → 1: stalled — iterations not making progress; warrants strategy change (different fusion rule, different preconditioner, agent escalation).
ρ > 1: diverging — iterations growing the residual.

bootstrap.residual_norm and bootstrap.contraction_rate implement the diagnostic. The unified residual is an L2 combination of four normalized components: mean(gap) / gap_threshold, frac_unclear / clarity_target, mean(K) / k_threshold, and frac(indep-tier disagreement at meaningful mass). A residual_norm of 1.0 means “at convergence threshold across the board”; values <1 are converged. Both are surfaced in IterationMetrics and the agent loop’s iteration_history.
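The following is a rough sketch of the diagnostic, not the code in bootstrap.py. It assumes the "L2 combination" is a root-mean-square, since that is the reading under which all four components sitting at 1.0 yields a residual of exactly 1.0; the function signatures and the choice to pass the fourth component through unscaled are also assumptions.

```python
import math
from typing import Optional


def residual_norm(mean_gap: float, frac_unclear: float, mean_k: float,
                  frac_indep_disagree: float, gap_threshold: float,
                  clarity_target: float, k_threshold: float) -> float:
    """Root-mean-square of the four normalized residual components (assumed form)."""
    components = [
        mean_gap / gap_threshold,
        frac_unclear / clarity_target,
        mean_k / k_threshold,
        frac_indep_disagree,  # already a fraction; used as-is in this sketch
    ]
    return math.sqrt(sum(c * c for c in components) / len(components))


def contraction_rate(residuals: list[float]) -> Optional[float]:
    """rho_n = ||r_{n+1}|| / ||r_n|| over the last two recorded residuals."""
    if len(residuals) < 2 or residuals[-2] == 0.0:
        return None
    return residuals[-1] / residuals[-2]


history = [0.8, 0.4]
print(contraction_rate(history))  # below 1 means the last iteration was contractive
```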

This framing is what makes the rest of the design (non-distinctness handling, hierarchical aggregation, ontology priors, reliability shaping) operate as a cohesive accuracy-targeting engine rather than a collection of clever heuristics. Each mechanism is a numerical-method primitive in service of driving the residual to zero.

The non-distinctness problem

Dempster’s rule of combination assumes the bodies of evidence being combined are produced by distinct, conditionally independent sources (Shafer 1976, A Mathematical Theory of Evidence, Ch. 3 §3 and Ch. 4). Smets’ Transferable Belief Model (Smets 1990; Smets & Kennes 1994, The Transferable Belief Model) preserves this assumption at the credal level. Denoeux 2008 (Conjunctive and Disjunctive Combination of Belief Functions Induced by Non-Distinct Bodies of Evidence, Artificial Intelligence) characterizes the pathology that arises when the assumption is violated: combining two mass functions that derive from a shared evidential atom via Dempster’s rule effectively raises the contribution of that atom to a power. The conjunctive cautious rule, defined on commonality functions and idempotent on identical evidence (Denoeux 2008 §4), recovers soundness — but is non-normalising and not a drop-in replacement for Dempster.
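A toy illustration of the double-counting pathology, assuming a two-element frame with hypothetical labels: combining a mass function with an exact copy of itself under Dempster's rule inflates the singleton's mass even though no new evidence arrived. The `dempster` helper is a textbook implementation written for this note, not Atelier's `combine_multiple`.

```python
from itertools import product

Mass = dict[frozenset[str], float]


def dempster(m1: Mass, m2: Mass) -> tuple[Mass, float]:
    """Dempster's rule: returns (normalized combined mass, conflict K)."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass landing on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {k: v / (1.0 - conflict) for k, v in combined.items()}, conflict


theta = frozenset({"financial", "catch_all"})
catch_all = frozenset({"catch_all"})

# One body of evidence, counted twice as if it came from two distinct sources.
source = {catch_all: 0.6, theta: 0.4}
fused, k = dempster(source, source)
print(fused[catch_all], fused[theta], k)
```

The singleton rises from 0.6 to 0.84 with K = 0, which is why a source fitted to another source's labels cannot be treated as independent corroboration.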

The Atelier-specific violation

The classification pipeline in src/atelier/classify/ declares six evidence sources:

name_match — lexical column-name matching against the vocabulary.
pattern — regex/validator detection (email, IBAN, monetary, …).
cosine — semantic similarity between the curated embedding text and the user-vocabulary embedding.
llm — Claude Opus first-pass classification.
catboost — CatBoost classifier.
svm — synth-trained TF-IDF + LinearSVC classifier with an LLM-mediated ICE → user-taxonomy alignment applied at inference time.

The first three are genuinely independent of the LLM: their evidence arises from the column’s name, value patterns, and semantic embedding comparison against the vocabulary. The remaining sources have a mixed independence profile:

catboost is trained in fit_to_llm mode (default true) on (embedding_text, llm_code) pairs from the current run’s LLM sweep. See ml_train.fit_catboost_to_llm_labels and pipeline._install_fit_to_llm_catboost. The fitted model is, by construction, an explainability surface over the LLM’s labels — not a competing classifier. Strongly non-distinct with the LLM source under Denoeux 2008 (per-column shared label provenance).
svm is trained once on the synthetic corpus (scripts/generate_synth_source.py → ml_train.train_svm), with TF-IDF char-3-6gram + word-1-2gram features and labels keyed on the bundled-ontology ICE.* leaves from synth_generators.GENERATORS. At pipeline runtime, predictions are translated into the user taxonomy via subsumption-prediction alignment in classify.subsumption_alignment — sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from the Qdrant taxonomy collection (one alignment computation per (vocab, embedding_model) tuple, results cached on disk). Weakly non-distinct with the cosine source via shared enrichment-LLM upstream — the enriched annotations were generated offline by an LLM, but the alignment computation itself uses a structurally independent model (BERT embeddings), not the runtime autoregressive LLM. The prior LLM-mediated approach (one LLM classify_batch call per alignment, excised in the P7 subsumption-alignment intervention) was weakly non-distinct with the runtime LLM through shared model weights — the new approach eliminates that correlation. See the ontology_alignment.py module docstring for the full independence argument.

Treating LLM and CatBoost(LLM) as fully-independent sources and combining them via Dempster’s rule double-counts the LLM atom; the SVM evidence sits between fully independent and fully derivative. The pre-2026-04-30 discount schedule made the legacy three-way overlap worse: llm=0.10, catboost=0.10, svm=0.20, vs cosine=0.30. The genuinely independent semantic source was more discounted than the two derivative ones, mathematically suppressing it whenever the LLM was loud.

A failure case observed during pipeline validation illustrated the pathology in the abstract. A column whose values match the monetary_pattern regex was classified as a generic catch-all code rather than a financial-domain code. Cosine top-1 distributed mass across several financial-leaning codes in the active vocabulary, but at softmax-spread mass on the order of a few thousandths per code it could not overcome LLM mass (≈ 0.83) and CatBoost mass (≈ 0.81), both concentrated on the catch-all. The fused prediction matched the LLM; the disagreement gate at bootstrap._identify_disagreements required llm_code != fused_code and so never fired despite K ≈ 0.81 and a unanimous independent-source pull toward financial codes. needs_clarification=True was emitted, but no LLM revisit followed. Specific customer table names, column names, and codes are intentionally not reproduced in this document.

Treatment in this codebase

The pipeline uses two complementary, scope-bounded fixes:

1. Reliability discounting on derivative sources (Shafer §11.3)

The discount operator from Shafer 1976 §11.3 multiplies a source’s mass by reliability α = 1 - discount:

m’(A) = α · m(A) for every focal element A ≠ Θ; m’(Θ) = α · m(Θ) + (1 - α)

When evidence sources are non-distinct, the reliability of the derivative source with respect to the original is bounded above by 1 minus their information overlap. For sources trained directly on LLM output that overlap is near-total, so a substantial discount is the principled response under classical Dempster fusion.
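A minimal sketch of the discount operator, assuming mass functions are stored as dictionaries keyed by frozensets; the helper name `discount` and the example labels are hypothetical. The discounted share of every proper focal element is moved to Θ, so the result still sums to 1.

```python
Mass = dict[frozenset[str], float]


def discount(mass: Mass, theta: frozenset[str], discount_rate: float) -> Mass:
    """Shafer discounting with reliability alpha = 1 - discount_rate."""
    alpha = 1.0 - discount_rate
    # Scale every focal element other than the full frame.
    out = {a: alpha * w for a, w in mass.items() if a != theta}
    # The frame keeps its scaled mass and absorbs the discounted remainder.
    out[theta] = alpha * mass.get(theta, 0.0) + (1.0 - alpha)
    return out


theta = frozenset({"financial", "catch_all"})
catch_all = frozenset({"catch_all"})

raw = {catch_all: 0.8, theta: 0.2}
damped = discount(raw, theta, discount_rate=0.55)
print(damped[catch_all], damped[theta])
```

With a discount of 0.55, a raw singleton mass of 0.8 drops to 0.36 and ignorance rises to 0.64, so the derivative source can still nudge the fused result but can no longer dominate it.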

The current defaults (config/base.conf:341+) place CatBoost and SVM above the cosine discount:

Source	Discount	Rationale
`cosine`	0.20	independent of LLM; semantic prior
`pattern`	0.25	independent; deterministic regex evidence
`name_match`	0.30–0.70	independent; lexical match against vocab
`llm`	0.15	original; first-pass label
`catboost`	0.55	strongly non-distinct (`fit_to_llm`, per-column LLM labels)
`svm`	0.22	weakly non-distinct (enrichment-mediated subsumption alignment; was 0.30 under LLM-mediated, 0.55 under M9)
`catboost_max`	0.75	variance ceiling; maintains headroom

Operators can dial these via the Settings page when retraining CatBoost on labels independent of the current LLM sweep (e.g. synth-only training); the metadata in config_overlay.SETTINGS_METADATA exposes the full range. The SVM discount at 0.22 (slightly above cosine’s 0.20) reflects the subsumption-prediction alignment: structurally independent of the runtime LLM (uses BERT embeddings, not autoregressive inference), with weak non-distinctness only via the shared enrichment-LLM upstream (same structural dependency the late-interaction cosine source carries). The 0.02 margin above cosine accounts for subsumption prediction being a single per-ICE-code decision (structurally more brittle than per-column cosine evidence).

2. Independent-tier consensus + revisit gate

For revisit decisions, the pipeline computes a parallel, isolated fusion over the LLM-independent subset only:

m_indep = m_cosine ⊕ m_pattern ⊕ m_name_match    (Dempster's rule)
indep_top1 = argmax_singleton m_indep

Implemented in pipeline._classify_column via the INDEPENDENT_TIER constant and combine_multiple(strategy="dempster"). The top-1 singleton and its mass are exposed in the result dict (independent_top1_code, independent_top1_mass, independent_top1_conflict) and stored on the BootstrapState.

The revisit gate at bootstrap._identify_disagreements then fires when:

indep_top1_code ≠ llm_code AND
indep_top1_mass ≥ classify.bootstrap.indep_revisit_mass_threshold (default 0.45)

This restores a real cross-source disagreement test that cannot be masked by LLM-derivative sources amplifying the LLM’s vote. The legacy high-K branch (llm_code != fused_code AND K > k_threshold) is retained as a safety net and runs second in priority.
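The two branches can be sketched as a single predicate. This is an illustration only: the `ColumnResult` dataclass, the reason strings, and the `k_threshold` default of 0.5 are assumptions, while the 0.45 mass threshold and the branch ordering follow the text above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnResult:
    llm_code: str
    fused_code: str
    conflict_k: float
    independent_top1_code: Optional[str]
    independent_top1_mass: float


def revisit_reason(col: ColumnResult,
                   indep_mass_threshold: float = 0.45,
                   k_threshold: float = 0.5) -> Optional[str]:
    """Return why a column should be revisited, or None when no gate fires."""
    # Primary gate: the LLM-independent tier disagrees with the LLM at meaningful mass.
    if (col.independent_top1_code is not None
            and col.independent_top1_code != col.llm_code
            and col.independent_top1_mass >= indep_mass_threshold):
        return "indep_tier_disagreement"
    # Safety net: legacy high-conflict branch, checked second.
    if col.llm_code != col.fused_code and col.conflict_k > k_threshold:
        return "high_conflict"
    return None


# Shape of the failure case: fused matches the LLM, so only the primary gate can fire.
col = ColumnResult("catch_all", "catch_all", 0.81, "financial", 0.50)
print(revisit_reason(col))
```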

The revisit prompt context at bootstrap._llm_revisit now includes the independent-tier consensus code/label/mass so the LLM has the counter-evidence in front of it during the second pass.

Ontology priors — substrate as semantic anchor

Patterns detect at extraction time. When a pattern fires we resolve its canonical ICE.* metadata from universal_vocabulary.json (label, description, common-name aliases, full ontological path root→leaf) and thread that metadata through three insertion points sourced from a single lookup (mass_functions.lookup_pattern_ontology):

Embedding text (features.ColumnFeatures.to_embedding_text — ontology_priors is a discrete FEATURE_NAMES entry, ablatable for SAGE). Cosine similarity then operates over publicly-grounded ontology terms an embedding model recognizes from training rather than the regex name alone. On the failure case that motivated this work, the column embedding gains the literal substring “Transaction Amount; The monetary value of a financial transaction.; aliases: amount, payment, price; ontology: Sensitive Data → Personally Identifiable Data → Financial Data → Payment Data → Transaction Amount” — orders of magnitude more semantic surface than patterns: monetary_pattern carried.
First-pass LLM user prompt (llm_backend.build_batch_user_prompt). Every batch — sweep AND revisit — the prompt now includes per-column “Pattern-detected ontology priors (from Atelier’s universal taxonomy — translate to the closest fit in the candidate vocabulary)” with each fired pattern’s label, description, alias list, and path. The LLM is explicitly instructed that the canonical ICE.* code is never a valid classification target; its job is ontology alignment from the publicly-grounded substrate to the user’s frame (He et al. 2023, Exploring Large Language Models for Ontology Alignment; Hertling & Paulheim 2023, OLaLa: Ontology Matching with LLMs; Ehrig & Sure 2004 for the classical foundation).
SAGE/SHAP attribution surface (features.FEATURE_NAMES). ontology_priors is now its own ablatable feature distinct from pattern_signals and sample_values. Operators can attribute classification mass to the publicly-grounded ontology prior independently of the raw embedding text — the explainability story ties each prediction back to the public substrate that motivated it.

Surfaced in the result dict as ontology_priors (list of dicts: pattern, code, label, description, common_names, path, match_fraction). The codes are universal-substrate IDs; they never appear in user-facing classifications. The user’s vocabulary remains the authoritative result space; ICE.* is the bridge.

Architectural significance: this is the substrate→tagging bridge the design has been pointing at. Pattern detection was always publicly-grounded; the resolver turns ICE.* into the user’s codes when it can; when it can’t, ontology priors carry the public semantic anchor straight through to cosine + LLM + SHAP without ever fabricating a code in the user’s frame. Compatible with — and strengthens — the indep-tier consensus + reliability-discount mechanisms above.

Cosine reliability shaping (Haenni-Hartmann 2006)

Static discount=0.30 allocated 0.70 of cosine mass uniformly via softmax across all candidate singletons. On large vocabularies (300+ leaves) this produced softmax compression — even a sharp top-1 hit landed at ~0.004 mass per code. Cosine could see the right answer but couldn’t carry it through fusion, and the indep-tier consensus sat permanently below the revisit threshold.

mass_functions.cosine_to_mass now applies dynamic source reliability per Haenni & Hartmann 2006, Modeling Partially Reliable Information Sources: A General Approach Based on Dempster-Shafer Theory (Information Fusion 7(4), 361–379, §3): the source-reliability factor α is an observable function of quality indicators, with (1 − α) allocated to ignorance.

Two quality indicators:

α_abs — sigmoid of top-1 absolute similarity around τ_abs = 0.40 with σ_abs = 0.10. Encodes “is cosine matching anything strongly, or just noise?”
α_marg — tanh((s₁ − s₂) / σ_marg) with σ_marg = 0.05. Encodes “is the top-1 a decisive winner, or ambiguous among similar candidates?”

Weighted blend (w_abs = 0.6, w_marg = 0.4), clamped to [reliability_floor, reliability_ceiling] = [0.10, 1 − classify_discount_maxsim]. The ceiling preserves the legacy maximum-mass behavior under sharp signal; the floor keeps cosine contributing some mass even under noise.

The α-bounded evidence mass is then split via margin-aware allocation:

m(top-1) = α · margin_weight + α · (1 − margin_weight) · softmax_top1
m(top-i, i>1) = α · (1 − margin_weight) · softmax_top_i
m(Θ) = 1 − α

where margin_weight = tanh((s₁ − s₂) / σ_marg). When the margin is wide, almost all evidence mass concentrates on top-1 directly rather than diluting through softmax. When the margin is narrow, the formula reduces to classical softmax allocation across the full candidate set.

Behavior across regimes (BDD-locked in features/agent/evidence_independence.feature, “Cosine reliability shaping concentrates mass on a clear top-1”):

Top-1 sim	Top-2 sim	α	margin_weight	Top-1 mass	Θ mass
0.70	0.50	0.700	1.000	0.700	0.300
0.45	0.20	0.700	1.000	0.700	0.300
0.45	0.44	0.452	0.197	0.091	0.548
0.23	0.23	0.100	0.002	0.0005	0.900

Sharp signal recovers the legacy ceiling allocation but concentrates it on top-1 (~170× the prior compressed mass). Ambiguous and noise regimes correctly route most mass to Θ rather than fabricating false confidence. The indep-tier revisit gate (threshold 0.45) is now reachable on cosine alone whenever cosine has clear semantic signal.
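As a hedged sketch of the shaping described in this section: the function names are hypothetical, the sigmoid is assumed to take the form sigmoid((s₁ − τ_abs) / σ_abs), and the ceiling of 0.70 is inferred from the α column of the table rather than read from configuration.

```python
import math


def cosine_reliability(s1: float, s2: float, tau_abs: float = 0.40,
                       sigma_abs: float = 0.10, sigma_marg: float = 0.05,
                       w_abs: float = 0.6, w_marg: float = 0.4,
                       floor: float = 0.10, ceiling: float = 0.70) -> tuple[float, float]:
    """Return (alpha, margin_weight) for top-1 similarity s1 and top-2 similarity s2."""
    alpha_abs = 1.0 / (1.0 + math.exp(-(s1 - tau_abs) / sigma_abs))
    margin_weight = math.tanh((s1 - s2) / sigma_marg)  # also serves as alpha_marg
    alpha = w_abs * alpha_abs + w_marg * margin_weight
    return min(max(alpha, floor), ceiling), margin_weight


def split_mass(alpha: float, margin_weight: float,
               softmax_probs: list[float]) -> tuple[list[float], float]:
    """Margin-aware allocation; softmax_probs is ordered top-1 first and sums to 1."""
    spread = alpha * (1.0 - margin_weight)
    masses = [spread * p for p in softmax_probs]
    masses[0] += alpha * margin_weight  # direct concentration on top-1
    return masses, 1.0 - alpha          # second value is m(Theta)


alpha, mw = cosine_reliability(0.45, 0.44)
masses, theta_mass = split_mass(alpha, mw, [0.4, 0.35, 0.25])
print(round(alpha, 3), round(mw, 3), round(sum(masses) + theta_mass, 6))
```

Under these assumptions the sketch reproduces the α and margin_weight columns of the table up to rounding of the displayed similarities. The Top-1 mass column also depends on the softmax distribution over the full candidate set, which is not specified here, so the three-candidate list above is purely illustrative.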

This composes cleanly with the indep-tier consensus computation: when cosine carries decisive mass on a code different from the LLM’s vote, that code becomes the indep-tier top-1 and the revisit gate fires, which is the soundness invariant the whole evidence-independence treatment is reaching for.

Hierarchical mass aggregation + cross-subtree visibility

A separate structural gap surfaced after reliability shaping landed: when cosine evidence localizes to a subtree (multiple financial-leaning leaves under a common parent) but the LLM picks a confident leaf in a different subtree, the predicted code falls to the LLM’s leaf and there is no surfaced signal that an honest-but-coarser parent would apply. Three cooperating fixes close that gap:

1. Cosine emits hierarchical mass

mass_functions.cosine_to_mass now walks up from the cosine top-1 leaf, finds the most-specific internal node whose descendants capture ≥ 50% of the softmax probability mass (_significant_subtree), and redirects the in-subtree residual mass to that internal-node focal element rather than diluting it across leaves. The frame already exposed every parent code as an internal-node FocalElement (descendant leaf set); we just weren’t emitting mass there. Hierarchical Dempster-Shafer treatment per Shafer 1976 §3 and Smets 1990 §6 (refinement / coarsening): an internal-node focal element represents a disjunction — “the answer is somewhere in this subtree” — without committing to a specific leaf.

Walking up from top-1 (rather than requiring every top-K to share an LCA) tolerates outliers cleanly: a small amount of probability leaking outside the subtree doesn’t void the aggregation as long as the bulk of mass remains inside.

Sharp-signal regimes are unaffected — when the margin is wide the residual mass α · (1 − margin_weight) is small, so the hierarchical aggregation simply scales proportionally. The top-1 leaf still wins when one is decisive.
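The walk-up can be sketched as follows. This is not the `_significant_subtree` implementation: the child-to-parent map, the toy codes, and the function signature are assumptions chosen to keep the example self-contained.

```python
from typing import Optional


def significant_subtree(top1_leaf: str, parent: dict[str, str],
                        leaf_probs: dict[str, float],
                        min_share: float = 0.5) -> Optional[str]:
    """Walk up from the top-1 leaf and return the most-specific internal node
    whose descendant leaves hold at least `min_share` of the softmax mass."""
    def ancestors(code: str) -> list[str]:
        chain = []
        while code in parent:  # top-level codes have no parent entry
            code = parent[code]
            chain.append(code)
        return chain

    leaf_ancestors = {leaf: set(ancestors(leaf)) for leaf in leaf_probs}
    for node in ancestors(top1_leaf):  # nearest ancestor first
        share = sum(p for leaf, p in leaf_probs.items()
                    if node in leaf_ancestors[leaf])
        if share >= min_share:
            return node
    return None


# Hypothetical hierarchy: two financial leaves under "1.4", one leaf under "0".
parent = {"1.4.1": "1.4", "1.4.2": "1.4", "1.4": "1", "0.1": "0"}
probs = {"1.4.1": 0.35, "1.4.2": 0.30, "0.1": 0.35}
print(significant_subtree("1.4.1", parent, probs))
```

Here neither financial leaf is decisive on its own, but their shared parent captures 65% of the probability, so the residual mass would be assigned to that internal node. The 35% sitting outside the subtree does not void the aggregation.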

2. `cautious_code` walks the full hierarchy

HierarchicalClassification.cautious_code previously walked only the predicted code’s ancestor chain via belief_path — structurally blind to belief mass in any other subtree. It now delegates to cross_subtree_belief, which iterates every singleton AND every internal-node focal element in the frame and returns those with Bel ≥ threshold. The most-specific code wins, regardless of subtree.

Concretely: when the LLM votes 0.1 Internal Non-Sensitive but cosine’s hierarchical aggregation puts Bel(Financial Data) = 0.55 on a different subtree’s parent, cautious_code(0.5) can now return Financial Data — not just 0 (the predicted code’s parent).

3. `cross_subtree_belief` surfaces the conflict

The result dict now carries a cross_subtree_belief field listing every code (leaf or internal node, any subtree) where Bel ≥ 0.5. Operators see both the LLM’s leaf vote AND the cosine-derived alternative subtree as legitimate signals, instead of the predicted-leaf-only belief_path. When evidence sources disagree on the subtree, both candidates appear and the operator can act on the disagreement directly.

This composes cleanly with the prior mechanisms: reliability shaping ensures cosine top-1 carries enough mass to trigger hierarchical aggregation when signal is clear; the indep-tier gate fires when cosine’s hierarchical mass disagrees with LLM at the leaf level; and cross_subtree_belief makes the cross-subtree disagreement explicit in the operator-facing result. The predicted_code field retains its leaf-argmax semantics for backward compatibility; operators consume the cautious / cross-subtree fields when needs_clarification = True or when the cross-subtree summary surfaces a competing internal node.
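A small sketch of the cross-subtree scan, assuming the fused mass function is keyed by frozensets of leaf codes and that each code maps to its descendant leaf set. The helper names, data layout, and toy codes are hypothetical; the sketch computes Bel(A) as the sum of m(B) over B ⊆ A and leaves out any per-subtree inclusion rule.

```python
Mass = dict[frozenset[str], float]


def belief(mass: Mass, target: frozenset[str]) -> float:
    """Bel(target): total mass of focal elements contained in target."""
    return sum(w for focal, w in mass.items() if focal <= target)


def cross_subtree_belief(mass: Mass, node_leaves: dict[str, frozenset[str]],
                         threshold: float) -> list[tuple[str, float]]:
    """Every code (leaf or internal node, any subtree) with Bel >= threshold,
    ordered most-specific first (fewest descendant leaves)."""
    hits = [(code, belief(mass, leaves)) for code, leaves in node_leaves.items()]
    hits = [(code, bel) for code, bel in hits if bel >= threshold]
    hits.sort(key=lambda cb: (len(node_leaves[cb[0]]), -cb[1]))
    return hits


# Hypothetical frame: the full frame itself is omitted because Bel(Theta) is always 1.
node_leaves = {
    "0.1": frozenset({"0.1"}),
    "1.4.1": frozenset({"1.4.1"}),
    "1.4.2": frozenset({"1.4.2"}),
    "1.4": frozenset({"1.4.1", "1.4.2"}),
}
theta = frozenset({"0.1", "1.4.1", "1.4.2"})
fused = {
    frozenset({"0.1"}): 0.40,             # LLM-style leaf vote
    frozenset({"1.4.1"}): 0.10,
    frozenset({"1.4.1", "1.4.2"}): 0.45,  # hierarchical mass on the parent
    theta: 0.05,
}
print(cross_subtree_belief(fused, node_leaves, threshold=0.5))
```

In this toy case the argmax leaf is "0.1", yet the only code with belief above 0.5 is the parent "1.4" in a different subtree, which is the situation an ancestor-chain walk from the predicted leaf cannot see.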

Operator-facing visibility

The fusion mechanisms above can produce mathematically correct belief structures that are nonetheless invisible to operators when the result-dict surface area is too narrow. Three small changes close that gap:

Evidence string carries per-source codes + competing summary

HierarchicalClassification.from_combined_evidence builds the evidence field. Previously: dst(cosine=0.65, llm=0.77, catboost=0.42, svm=0.22) → Internal Non-Sensitive [Bel=0.71, ...] — masses only, not the codes each source voted. Now: dst(cosine→1.4.1.1.1(0.65), llm→0.1(0.77), ...) → Internal Non-Sensitive [Bel=0.67, ...] [competing: Sensitive (1) Bel=0.26] — leaf-level disagreement is visible at a glance, and a “competing” trailer surfaces non-trivial belief in any non-predicted top-level subtree.

`cross_subtree_belief` is always informative

The 0.5 absolute threshold previously suppressed competing-subtree alternatives whenever Dempster fusion compressed their mass below the headline bar (the common case when one source dominates). The default is now lower (0.20), and an always_include_top_per_subtree rule guarantees that the highest-belief leaf and highest-belief internal node from each top-level subtree appear in the result regardless of threshold (subject to a small min_bel floor so we don’t flood the result with noise). Operators always see the structured “what does each subtree look like?” view.

`cautious_promoted_code` (Smets least-commitment)

Per Smets 1993 (Belief Functions: The Disjunctive Rule of Combination and the Generalized Bayesian Theorem and related work on least-commitment), when a fine-grained decision is unsupported by evidence the principled response is to commit only at the level of granularity where evidence IS unambiguous. This is exactly the mechanism for “the predicted leaf is not the right answer; the parent code is more honest.”

HierarchicalClassification.cautious_promoted_code returns either the predicted leaf (no promotion) or the most-specific code anywhere in the hierarchy whose belief meets the commit_threshold (default 0.55). Promotion fires only when needs_clarification = True — operators get the leaf prediction by default; the cautious promotion is the epistemically-honest fallback when the system itself flags the prediction as uncertain.
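The promotion rule can be sketched as a plain function. Two points are assumptions rather than documented behavior: specificity is measured by depth in the hierarchy, and the predicted leaf is kept when no code reaches the commit threshold. The argument names and toy codes are hypothetical.

```python
def cautious_promoted_code(predicted_code: str, needs_clarification: bool,
                           beliefs: dict[str, float], depth: dict[str, int],
                           commit_threshold: float = 0.55) -> str:
    """Return the predicted leaf, or the most-specific code whose belief meets
    the commit threshold when the prediction is flagged as uncertain."""
    if not needs_clarification:
        return predicted_code  # no promotion by default
    eligible = [code for code, bel in beliefs.items() if bel >= commit_threshold]
    if not eligible:
        return predicted_code  # assumed fallback: nothing is committable
    # Deepest code wins; ties are broken by higher belief.
    return max(eligible, key=lambda code: (depth[code], beliefs[code]))


beliefs = {"0.1": 0.40, "1.4": 0.58, "1": 0.60}
depth = {"0.1": 2, "1.4": 2, "1": 1}
print(cautious_promoted_code("0.1", True, beliefs, depth))
```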

The predicted_code field retains its leaf-argmax semantics for backward compatibility with Atlas governance sync and existing UI rendering. cautious_promoted_code lives alongside it as a separate field operators consult when needs_clarification is True.

Per-column residual trajectory

The corpus-wide residual norm + contraction factor establish the headline iterative-method diagnostic, but they obscure per-column behaviour. BootstrapState.column_history: dict[str, list[ColumnResidualSnapshot]] captures the column-major view: one snapshot per labeled column per iteration, populated in record_iteration_metrics after each iteration’s ML validation completes. Each snapshot records the column’s gap, belief, K, indep-tier top-1 code/mass, label, label source, and a revisited flag indicating whether _llm_revisit touched the column in that iteration.

bootstrap.column_contraction(state, name) mirrors the corpus-wide contraction_rate at the column level: ρ_col = current_gap / prev_gap (falling back to K when gap is zero), or None when the column has fewer than two snapshots. ρ_col < 1 means the column is converging; ρ_col → 1 stalled; ρ_col > 1 diverging. Per-column ρ exposes the empirical contraction distribution that corpus aggregates obscure — operators see which specific columns are converging vs stalling.
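A sketch of the per-column ratio, assuming a reduced snapshot with only the two fields the ratio needs and taking a plain list instead of the `(state, name)` arguments. It reads "falling back to K when gap is zero" as applying when the previous gap is zero, which is an interpretation rather than a documented detail.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Snapshot:
    gap: float         # Pl - Bel for the column at this iteration
    conflict_k: float  # conflict K for the column at this iteration


def column_contraction(history: list[Snapshot]) -> Optional[float]:
    """rho_col from the last two snapshots, or None when it is undefined."""
    if len(history) < 2:
        return None
    prev, cur = history[-2], history[-1]
    if prev.gap > 0:
        return cur.gap / prev.gap
    if prev.conflict_k > 0:  # assumed fallback when the previous gap is zero
        return cur.conflict_k / prev.conflict_k
    return None


trajectory = [Snapshot(gap=0.4, conflict_k=0.3), Snapshot(gap=0.2, conflict_k=0.3)]
print(column_contraction(trajectory))  # below 1: the column is converging
```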

The full trajectory is written to build/results/{run_id}/column_trajectories.json at pipeline end alongside classifications.json, enabling offline analysis, operator post-mortem, and audit. The agent loop’s iteration_history carries a summarized view (per-column gap/bel/K sequences plus ρ_col) so the agent can reason about which columns are moving.

This trajectory infrastructure is the substrate for any future acceleration scheme. Three plausible Phase B / Phase C extensions all consume it:

Bandit-style revisit ordering (Phase B) — extend _identify_disagreements to mix expected_revisit_gain(name) derived from history into the sort key. Revisits ordered by predicted marginal residual reduction. Default-off knob; trajectory data backs it.
Aitken Δ² early-stop (Phase B) — for columns with ≥3 snapshots and a clean linear-convergence pattern, predict the limit and skip further revisits when the predicted gap is below cfg.gap_threshold. Saves LLM cost on the predictable tail. Default-off knob; trajectory data backs it.
Limited per-column belief-mass Anderson (Phase C, deferred) — only on columns that genuinely oscillate (per-column ρ near 1 with sign-changing residual differences). Phase A’s trajectories let us measure whether such a population exists before shipping any Anderson code.
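To make the Aitken Δ² idea concrete, here is a minimal sketch under stated assumptions: the function names are hypothetical, a "clean linear-convergence pattern" is approximated as a difference ratio strictly between 0 and 1, and only the last three gaps are used. It illustrates the extrapolation itself, not a shipped early-stop.

```python
from typing import Optional


def aitken_limit(x0: float, x1: float, x2: float) -> Optional[float]:
    """Aitken delta-squared estimate of the limit of a linearly converging
    sequence, or None when the three points do not look like one."""
    d1 = x1 - x0
    d2 = x2 - x1
    if d1 == 0.0:
        return None
    ratio = d2 / d1
    if not 0.0 < ratio < 1.0:  # oscillating, diverging, or already flat
        return None
    return x0 - d1 * d1 / (d2 - d1)


def can_skip_revisit(gaps: list[float], gap_threshold: float) -> bool:
    """True when the predicted limiting gap is already below the threshold."""
    if len(gaps) < 3:
        return False
    limit = aitken_limit(*gaps[-3:])
    return limit is not None and limit < gap_threshold


gaps = [0.5, 0.3, 0.2]  # differences halve each step, so the limit is 0.1
print(aitken_limit(*gaps), can_skip_revisit(gaps, gap_threshold=0.15))
```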

The honest framing: classical Anderson acceleration on the full belief-vector iteration is poorly suited to LLM-driven dynamics (stochastic T, mostly-static state, discrete labels, targeted-not-uniform revisit). What adds value given the problem structure is the per-column trajectory data itself: operators see per-column convergence behaviour, future acceleration schemes have real per-column data to operate on, and we can decide between bandit / Aitken / Anderson empirically rather than rhetorically. Phase A ships that substrate; Phase B and Phase C are gated on what the substrate reveals.

Cost-sensitive classification at the LLM layer (Elkan 2001)

All the prior mechanisms operate at or below the fusion layer — they shape how per-source evidence is combined into a fused belief. But on the canonical failure case (loan_applications.requested_amount), the LLM at confidence 0.88 plus its derivative cluster (CatBoost, SVM) reinforces a vote on 0.1 Internal Non-Sensitive, and Dempster fusion’s normalization preserves that dominance. Algorithmic mitigations stalled at Bel ≈ 0.74 — an honest reduction from Bel = 0.955 baseline, but the headline classification still miscategorized financial PII.

The principled response, per cost-sensitive classification (Elkan 2001, The Foundations of Cost-Sensitive Learning), is to adjust the decision threshold under asymmetric cost. In data governance the asymmetry is severe: failing to flag truly sensitive data (false negative, Type II) creates regulatory liability (GDPR Art. 25 data protection by default; HIPAA Safe Harbor; PCI DSS scope creep guidance), while over-classifying (false positive, Type I) produces review overhead but is recoverable. Treating the costs as cost(FN) ≫ cost(FP) is the canonical privacy-regime convention.
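For intuition, the threshold shift can be worked through numerically. With zero cost for correct decisions, the cost-minimizing rule from Elkan 2001 is to predict "sensitive" whenever P(sensitive) is at least cost(FP) / (cost(FP) + cost(FN)). The cost values below are hypothetical and are not Atelier settings.

```python
def optimal_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which flagging 'sensitive' minimizes expected cost,
    assuming correct decisions cost nothing."""
    return cost_fp / (cost_fp + cost_fn)


def should_flag_sensitive(p_sensitive: float, cost_fp: float, cost_fn: float) -> bool:
    return p_sensitive >= optimal_threshold(cost_fp, cost_fn)


# Symmetric costs give the familiar 0.5 cut-off.
print(optimal_threshold(1.0, 1.0))
# A missed sensitive column assumed nine times worse than a spurious flag.
print(optimal_threshold(1.0, 9.0))
print(should_flag_sensitive(0.25, cost_fp=1.0, cost_fn=9.0))
```

Under the assumed 9:1 ratio, a column with only a 25% chance of being sensitive is still worth flagging, which is the quantitative version of cost(FN) ≫ cost(FP).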

Atelier applies this at the LLM layer — upstream of fusion — via a Sensitivity classification perspective section in the system prompt (llm_backend.build_system_prompt). The framing is deliberately collaborative rather than prescriptive: modern LLMs respond better to a colleague’s framing than to a compliance checklist. Three load-bearing moves:

Invoke what the LLM already knows. The preamble names BFO, CCO, and the privacy regimes (GDPR, HIPAA, PCI DSS) those ontologies overlap with — concepts the model has substantial training exposure to. The customer’s taxonomy is framed as “their refinement of those publicly-grounded concepts,” and the model is asked to pick whichever of their codes matches the canonical sensitivity concept it would otherwise assign (PII, Financial Information, Technical Identifier, Biometric, etc.). No re-teaching, no rule list — invocation.
State the asymmetry once, casually. Cost-sensitive classification appears as “a practical asymmetry: in governance, calling sensitive data non-sensitive is a larger error than the reverse.” The over-classification guard is embedded conversationally: “When signals are genuinely absent (operational metadata, surrogate keys, timestamps, status enums), non-sensitive is the correct call — don’t reach for sensitive just because of the asymmetry.” One sentence on confidence calibration: “Calibrate confidence to what you actually saw, not to this asymmetry.”
Vocabulary-aware sensitivity map, ICE conventions only. _sensitive_subtree_summary(category_set) activates on ICE.SENSITIVE.* / ICE.NONSENSITIVE.* paths and emits a Markdown block naming the sensitive root, catch-all, and a few publicly-grounded leaf abbreviations (per src/atelier/classify/fixtures/PROVENANCE.md). Returns "" for every other vocabulary shape so the prompt stays silent where the framework can’t verify the encoding is publicly grounded. For non-ICE vocabularies the LLM still has the full markdown category table, per-column ontology priors for pattern-bearing columns, and the perspective preamble — that is sufficient to navigate any taxonomy without the framework guessing at its sensitivity structure.

The prompt block is default-on for every classification run; no config knob. Built once per pipeline run at pipeline.py:577 so the helper computation is amortized and the new content lives inside the Anthropic prompt-cache prefix — one-time cache miss on the first batch, normal cache hits thereafter. Token cost is bounded (~250–300 fixed + ~80 for the per-vocab summary).

This composes cleanly with everything below it: reliability discounts on derivative sources still suppress double-counting, cosine reliability shaping still concentrates mass on clear top-1 hits, hierarchical aggregation still flows residual mass to internal-node focal elements, the indep-tier consensus gate still triggers revisits on cross-source disagreement, and cautious_promoted_code still applies Smets least-commitment on uncertain leaves. The Governance Cost Model changes what the LLM votes — biasing toward sensitive parents under uncertainty — leaving every downstream mechanism unchanged.

The hypothesis: with a governance prior at the source, the LLM will either (a) pick a defensible sensitive parent code on columns like requested_amount, or (b) lower its confidence on the non-sensitive choice — either of which is an improvement over the status quo. The exact behavior is non-deterministic and confirmed against real LLM runs; BDD scenarios assert the prompt structure (features/agent/governance_cost_model.feature), not the LLM’s vote.

Pattern-target alias resolver

A second, narrower bug surfaced during investigation: the static DEFAULT_PATTERN_MAP at mass_functions.py references canonical ICE.* mnemonic strings (monetary_pattern → ICE.SENSITIVE.PID.FINANCIAL.PAYMENT.TXNAMT) that are absent from non-ICE vocabularies. The pre-2026-04-30 behavior silently dropped any pattern whose target wasn’t in frame.singletons, disabling the entire pattern source on numeric or domain-specific vocabularies — including the run that motivated this work.

mass_functions.resolve_pattern_map now resolves each ICE.* target through three fallback layers against the active category_set:

Direct hit on all_by_code.
Match on by_abbrev using the leaf mnemonic (suffix after the final .).
Token-normalized match against common_names aliases.

Misses log a single WARNING enumerating the patterns that were dropped. The resolver is cached on the category_set instance and runs once per pipeline. The deeper BFO/Common-Core ontology mapping this shim approximates remains future work.
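
The three fallback layers can be sketched as below. This is a minimal illustration, not the shipped resolver: `CategorySetSketch`, `resolve_target`, and `resolve_pattern_map_sketch` are hypothetical names, the attribute shapes (`all_by_code`, `by_abbrev`, `common_names` as plain dicts) are assumed, and the normalization used for layer 3 (lowercase, alphanumerics only) is a guess at what "token-normalized" means.

```python
import logging
from dataclasses import dataclass, field
from typing import Dict, List, Optional

logger = logging.getLogger("atelier.pattern_alias_sketch")


@dataclass
class CategorySetSketch:
    """Hypothetical stand-in for the active category_set."""
    all_by_code: Dict[str, str]                                  # code -> label
    by_abbrev: Dict[str, str]                                    # leaf mnemonic -> code
    common_names: Dict[str, str] = field(default_factory=dict)   # alias -> code


def _normalize(token: str) -> str:
    # Assumed normalization: lowercase, keep only alphanumerics.
    return "".join(ch for ch in token.lower() if ch.isalnum())


def resolve_target(target: str, cs: CategorySetSketch) -> Optional[str]:
    # Layer 1: direct hit on the full code.
    if target in cs.all_by_code:
        return target
    # Layer 2: leaf mnemonic (suffix after the final ".") against by_abbrev.
    leaf = target.rsplit(".", 1)[-1]
    if leaf in cs.by_abbrev:
        return cs.by_abbrev[leaf]
    # Layer 3: token-normalized match against common_names aliases.
    wanted = _normalize(leaf)
    for alias, code in cs.common_names.items():
        if _normalize(alias) == wanted:
            return code
    return None


def resolve_pattern_map_sketch(pattern_map: Dict[str, str],
                               cs: CategorySetSketch) -> Dict[str, str]:
    resolved: Dict[str, str] = {}
    dropped: List[str] = []
    for pattern, target in pattern_map.items():
        code = resolve_target(target, cs)
        if code is None:
            dropped.append(pattern)
        else:
            resolved[pattern] = code
    if dropped:
        # One WARNING that enumerates every dropped pattern.
        logger.warning("Pattern targets not in active vocabulary: %s",
                       ", ".join(sorted(dropped)))
    return resolved
```

With a numeric vocabulary that carries `txn_amt` as an alias, `monetary_pattern` resolves through layer 3 rather than being dropped silently.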

Deferred work

This treatment preserves Dempster’s rule end-to-end and handles non-distinctness through reliability discounting + per-source reliability shaping. One future refinement remains scoped out:

Tiered fusion with the cautious rule (Denoeux 2008). Combine the LLM-derivative cluster {llm, catboost, svm} via cautious conjunction (idempotent on identical evidence; commonality formulation q1 ∧̂ q2), the independent cluster {cosine, pattern, name_match} via Dempster, and combine the two cluster-level mass functions across-tier. This dissolves the non-distinctness problem at the math level rather than approximating it via discount. Trade-off: cautious is non-normalising, so derivative-tier-only columns will see narrower belief intervals (which is correct behaviour but a UI shift).

The combine_multiple infrastructure already supports adding a strategy="cautious" branch alongside the existing dempster / yager options, so the refinement is surgical when it lands.

References

Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press. Ch. 3 §3 (independence assumption); Ch. 4 §3 (Dempster’s rule); §11.3 (reliability discount).
Smets, P. (1990). The Combination of Evidence in the Transferable Belief Model. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 447–458.
Smets, P. & Kennes, R. (1994). The Transferable Belief Model. Artificial Intelligence 66(2), 191–234.
Denoeux, T. (2008). Conjunctive and Disjunctive Combination of Belief Functions Induced by Non-Distinct Bodies of Evidence. Artificial Intelligence 172(2-3), 234–264. §1, §3.1, §4.
Haenni, R. & Hartmann, S. (2006). Modeling Partially Reliable Information Sources: A General Approach Based on Dempster-Shafer Theory. Information Fusion 7(4), 361–379.

Operational impact

Operators upgrading to this calibration should expect:

More columns marked needs_clarification=True on the first run after upgrade. This is the intended outcome: derivative-source amplification no longer hides genuine cross-source conflict.
A modest increase in LLM revisit volume (the gate fires on a wider, principled condition). Mitigated by the indep_revisit_mass_threshold floor and the existing budget caps at classify.bootstrap.max_total_llm_calls / max_total_llm_attempts.
A pattern-source WARNING at startup enumerating any patterns whose ICE.* target failed to resolve to the active vocabulary. Acceptable as long as the leaf mnemonics that do exist in the vocab carry the relevant abbrev or common_names aliases — expected on first run with a domain-specific vocabulary.

MaxSim Channel — ColBERT Late-Interaction via Qdrant

Naming. This DST evidence channel is named maxsim — after the scoring operation Qdrant performs (a sum of per-query-token max cosines over the ColBERT multi-vector field), not the single-vector cosine it replaced. The per-token metric is cosine and the encoder is ColBERT, but the channel’s identity — the key in source_masses, INDEPENDENT_TIER, the classify.maxsim.* config namespace, and the classify.discounts.maxsim discount — is maxsim. The legacy single-vector cosine channel is retired (no fallback). Historical sprint notes may still say “cosine”.

This note specifies the maxsim evidence source: a multi-vector late-interaction (ColBERT-style) representation per annotation, stored in Qdrant, with enrichment supplied by an Agent-SDK curation loop and procedural deterministic verifiers. It composes with — does not replace — the reliability discounting, indep-tier consensus gate, hierarchical mass aggregation, and cost-sensitive LLM prompting documented in dst-evidence-independence.md.

Position in the architecture

The existing DST treatment shapes how per-source masses fuse. This work shapes the cosine source’s input representation. Both are necessary; neither is sufficient on its own.

The motivating gap is structural rather than algorithmic. Current cosine compresses each annotation into a single embedding from label + mnemonic + description and compares it to a single column-side embedding from column_name + concatenated_samples. On adversarial corpora (anonymized column names such as comm_val, period_val, addr_ref; mixed sample distributions; vocab-token-as-data columns) the single-vector representation collapses discriminative signal before it reaches the fusion layer. Reliability shaping (Haenni-Hartmann 2006) can route mass to ignorance correctly in this regime, but it cannot recover the discriminative signal that was lost to the compression.

Late interaction via ColBERT restores the discriminative surface: instead of one dense-vector comparison per (column, tag) pair, the ColBERT encoder produces per-token contextual embeddings (128-d after the linear projection) for both entity and annotation texts. Qdrant’s native MaxSim comparator computes the token-level cross-alignment score directly — no Python-side scoring loop, no per-role weight tuning.
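
For reference, the quantity the MaxSim comparator computes can be written in a few lines of pure Python. This is an illustrative sketch only (in the pipeline the scoring runs inside Qdrant, not in Python); the function names are hypothetical and the toy 2-d vectors stand in for 128-d ColBERT token embeddings.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Treat a zero vector as having no similarity to anything.
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def maxsim(query_tokens: Sequence[Vector], doc_tokens: Sequence[Vector]) -> float:
    """Sum, over query tokens, of the best cosine against any document token."""
    if not doc_tokens:
        return 0.0
    return sum(max(cosine(q, d) for d in doc_tokens) for q in query_tokens)


# Entity (column) tokens vs. annotation tokens, toy values.
entity = [[1.0, 0.0], [0.0, 1.0]]
annotation = [[1.0, 0.0], [1.0, 1.0]]
score = maxsim(entity, annotation)
```

Each query token contributes only its single best alignment, which is why a weak token adds little to the score without diluting a strong match elsewhere.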

The entity side feeds ColumnFeatures.to_embedding_text() — the same text SAGE/SHAP ablate over — through the ColBERT encoder. The annotation side feeds a composed text from the enrichment payload (label, description, prototype values, name hints, value patterns, parent path, mnemonic) through the same encoder. Anti-examples are excluded from the annotation text (they add noise in the embedding space without improving MaxSim discrimination).

The motivating failure modes resolve through token-level alignment:

Anonymized columns: column-name tokens contribute little MaxSim, but sample-value tokens still align to annotation prototype-value tokens. Graceful degradation by token structure: weak tokens contribute near-zero MaxSim without polluting strong token matches.
Long-tail distinguishing values: a single distinctive sample value’s tokens claim their own MaxSim against annotation prototype tokens, no longer averaged out by a single dense vector.
Sibling discrimination: token-level alignment discriminates between semantically adjacent annotations (e.g., “credit card number” vs “bank account number”) through fine-grained token matching that dense single-vector cosine collapses.
Parent-pull: parent-path tokens in the annotation text provide hierarchical context. The hierarchical aggregation in _maxsim_positive_mass continues to flow residual mass to internal-node focal elements when subtree-level signal is what’s available.

This is morphologically close to what the upstream Ægir project provides through a learned hierarchical foundation model (RWKV-7 time-mixing + H-Net dynamic chunking, RLVR-trained against a deterministic four-component verifier on SOTAB / GitTables / WikiTables). The two are complementary, not redundant: Ægir’s representations are learned end-to-end against external corpora; late-interaction here is engineered from the user-selected taxonomy with LLM-augmented annotation profiles. Both can coexist as separate evidence sources, and the late-interaction infrastructure remains useful even after Ægir integration for taxonomies Ægir has not been adapted to.

Architecture overview

┌─ Source taxonomy (default.annotations or any user-selected) ────┐
│  label, mnemonic, description, parent path                       │
└────────────────────┬─────────────────────────────────────────────┘
                     │
                     ▼  scripts/enrich_annotations.py
              ┌──────────────────────────────┐
              │ Agent SDK enrichment loop    │
              │  + deterministic verifiers   │
              └──────────────┬───────────────┘
                             │
                             ▼  ColBERT token vectors + payload
         ┌────────────────────────────────────────────┐
         │ Qdrant collection: annotations_<tax>_<ver> │
         │   - single "colbert" multi-vector field     │
         │     (per-token 128-d, MaxSim comparator)    │
         │   - structured JSON payload                 │
         │   - operator_edits audit log                │
         └────────────┬───────────────────────────────┘
                      │
                      │  registered in PGlite taxonomy_registry
                      │  (administrative pointer, never primary storage)
                      │
                      ▼  build/exports/<tax>-enriched-<ver>-<utc>.parquet|tsv
                  on-demand snapshots for operator inspection

  At classify time:
       ColumnFeatures.to_embedding_text()
                 │
                 ▼  ColBERT encoder (colbert-ir/colbertv2.0)
          entity token vectors (N × 128)
                 │
                 ▼  Qdrant query_points (using="colbert", MaxSim)
          top-K annotations ranked by MaxSim score
                 │
                 ▼  maxsim_to_mass
          mass function (Haenni-Hartmann reliability shaping)
                 │
                 ▼  DST fusion (existing pipeline)
          belief, plausibility, conflict per tag

Qdrant payload schema

The collection per (taxonomy_id, augmentation_version) is the source of truth. No parallel relational mirror. One point per annotation.

Vector field

Each annotation point carries a single multi-vector field:

Name	Type	Source
`colbert`	multi-vector	ColBERT token-level embeddings of the composed annotation text

The composed annotation text is produced by qdrant_writer.compose_annotation_text() from the enrichment payload: label, description, prototype values (up to 10), name hints (up to 10), value pattern descriptions (up to 5), parent path (ontology chain), and mnemonic. Anti-examples are deliberately excluded — they add noise in the embedding space without improving MaxSim discrimination.

The ColBERT encoder (colbert-ir/colbertv2.0) produces per-token 128-dimensional vectors via BERT + a learned linear projection (768 → 128). Special tokens ([CLS], [SEP], [PAD]) are stripped; only content tokens contribute to MaxSim.

The collection is configured with MultiVectorConfig(comparator=MAX_SIM) so Qdrant computes token-level late-interaction scoring natively — no Python-side scoring loop.

Payload (JSON)

{
  // Source taxonomy fields, immutable passthrough
  "code":           "ICE.SENSITIVE.PID.CONTACT.EMAIL",  // or user-vocab equivalent
  "label":          "Email",
  "mnemonic":       "EMAIL",
  "description":    "RFC 5322 email addresses, including international forms.",
  "parent_code":    "ICE.SENSITIVE.PID.CONTACT",
  "parent_path":    ["Sensitive Data", "PII", "Contact", "Email"],

  // Enrichment fields, generated + verified
  "prototype_values":     ["jane.doe@example.com", "user@subdomain.example.org", ...],
  "value_patterns":       [
      {"kind": "regex",  "expr": "[^@\\s]+@[^@\\s]+\\.[^@\\s]+"},
      {"kind": "format", "expr": "local-part @ domain, RFC 5322"}
  ],
  "name_hints":           ["email", "e_mail", "email_addr", "contact_email", "msg_val"],
  "anti_examples":        [
      {"value": "+1-555-123-4567", "confusable_tag": "A_PHN", "reason": "phone-shaped"},
      {"value": "https://example.com/path", "confusable_tag": "SYSURL", "reason": "URL-shaped"}
  ],

  // Provenance + audit
  "augmentation_version":  "v1",                       // prompt template + verifier version
  "embedding_model":       "colbert-ir/colbertv2.0",
  "embedding_dim":         128,
  "generated_at":          "2026-05-16T20:00:00Z",
  "generated_by":          "agent-sdk:opus-4.7",       // model + harness identifier
  "verifier_results": {
      "prototype_values_match_patterns": true,
      "patterns_compile":                 true,
      "anti_example_targets_exist":       true,
      "parent_path_consistent":           true,
      "checks_passed":                    4,
      "checks_total":                     4
  },

  // Operator edits log — append-only, every edit recorded
  "operator_edits": [
      {
          "at":     "2026-05-17T09:14:00Z",
          "by":     "operator@example.com",
          "field":  "prototype_values",
          "op":     "remove",
          "value":  "test@test.test",
          "reason": "weak exemplar"
      }
  ],

  // Cross-reference
  "taxonomy_id":       "default",
  "taxonomy_version":  "2026-05-01"
}
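
Two of the checks recorded under verifier_results lend themselves to a short sketch. The version below is an assumption about how such verifiers could work, not the contents of atelier.enrichment.verifiers: it assumes only `kind == "regex"` patterns are compiled, that a prototype value passes when it fully matches at least one regex, and it reports only the two checks it implements.

```python
import re
from typing import Any, Dict, List


def patterns_compile(value_patterns: List[Dict[str, str]]) -> bool:
    # Only regex-kind entries are compiled; "format" entries are descriptive text.
    for pattern in value_patterns:
        if pattern.get("kind") != "regex":
            continue
        try:
            re.compile(pattern["expr"])
        except re.error:
            return False
    return True


def prototype_values_match_patterns(prototype_values: List[str],
                                    value_patterns: List[Dict[str, str]]) -> bool:
    # Assumes patterns_compile() already passed for these patterns.
    regexes = [re.compile(p["expr"]) for p in value_patterns if p.get("kind") == "regex"]
    if not regexes:
        return True  # nothing to check against
    # Assumed rule: every prototype must fully match at least one regex.
    return all(any(rx.fullmatch(v) for rx in regexes) for v in prototype_values)


def run_checks(payload: Dict[str, Any]) -> Dict[str, Any]:
    compiles = patterns_compile(payload["value_patterns"])
    results: Dict[str, Any] = {
        "patterns_compile": compiles,
        # Short-circuit so a broken regex is never compiled a second time.
        "prototype_values_match_patterns": compiles and prototype_values_match_patterns(
            payload["prototype_values"], payload["value_patterns"]
        ),
    }
    passed = sum(1 for ok in results.values() if ok)
    results["checks_passed"] = passed
    results["checks_total"] = 2
    return results
```

Because both checks are pure functions of the payload, rerunning them on a stored point reproduces the recorded result.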

Cache key (content-addressed)

Rebuilds are idempotent under stable inputs. The cache key for a single annotation point is:

key = sha256(
    taxonomy_id ||
    taxonomy_version_hash ||
    augmentation_version ||
    embedding_model ||
    source_row_hash       // hash of label+mnemonic+description+parent_code
)

Skip-on-cache-hit during rebuilds; force-rebuild via CLI flag. The cache layer is responsible for invalidation on any input change.

Collection naming

annotations_<taxonomy_id>_<augmentation_version>

Example: annotations_default_v1, annotations_hivepoc_synth_v1. The PGlite registry row tracks which collection is current for a given taxonomy_id; old collections remain queryable for A/B comparison and rollback.

Enrichment pipeline (high-level)

Detailed in scripts/enrich_annotations.py (P2) and the atelier.enrichment package. Vocabulary identity is dynamic: operators select a (connection, database, annotations_table) triple at runtime; the pipeline must not encode the count, names, or structure of the currently-loaded set as intrinsic. The single universal is that every node — leaf or internal — is a first-class tagging target, so both leaf and internal nodes receive enrichment.

The shape:

Read source taxonomy rows from the active annotations table selected by the operator at runtime. No vocabulary identity is hardcoded.
For each node (leaf or internal), run the enrichment loop:
- Build a generation prompt with parent-aware framing for internal nodes (children listed, “what does a column tagged at this generality look like without specializing to a child”) or leaf-aware framing for leaves (sibling-discriminative patterns, concrete prototype values).
- Call the provider-co-located generator (see below) to produce the six-field structured payload.
- Run the deterministic verifier suite (atelier.enrichment.verifiers). Failed checks become verifier feedback that is fed back into the next generation attempt up to enrichment.max_attempts.
- Compute parent_path deterministically from the taxonomy structure (no LLM needed) and confirm the LLM’s reasoning is consistent with it.
Compute the ColBERT token embeddings of the composed annotation text using the configured embedding model.
Write the multi-vector point + payload to Qdrant, keyed by the content-addressed cache key. Idempotent: same (vocabulary content hash, augmentation_version, embedding_model, source_row hash) quadruple → same point ID → no redundant work on partial rebuilds.
Update the PGlite taxonomy_registry row to record the build (taxonomy_id, augmentation_version, collection name, built_at, status). The registry is an administrative pointer — it records that a collection exists and where, never the primary content.
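
The generate-verify-retry step inside the loop can be sketched as follows. This is an illustration under assumptions: `enrich_node`, the `generate(node, feedback)` callable, and the verifier mapping are hypothetical shapes, and the behavior after the final failed attempt (return the last payload together with its failed checks) is a guess, since the text above does not specify it.

```python
from typing import Any, Callable, Dict, List

Payload = Dict[str, Any]
Generator = Callable[[Dict[str, Any], List[str]], Payload]
Verifier = Callable[[Payload], bool]


def enrich_node(node: Dict[str, Any], generate: Generator,
                verifiers: Dict[str, Verifier], max_attempts: int) -> Dict[str, Any]:
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    feedback: List[str] = []   # names of checks that failed on the previous attempt
    payload: Payload = {}
    for attempt in range(1, max_attempts + 1):
        # The generator sees the previous failures so it can correct them.
        payload = generate(node, feedback)
        feedback = [name for name, check in verifiers.items() if not check(payload)]
        if not feedback:
            return {"payload": payload, "attempts": attempt, "failed_checks": []}
    # Assumed policy: surface the last payload and its failures to the caller.
    return {"payload": payload, "attempts": max_attempts, "failed_checks": feedback}
```

Keeping the generator and verifiers as injected callables means the loop itself can be exercised with stubs, without any LLM in the test path.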

This pipeline satisfies the LLM-mediated reference artifact bar (audited via memory): every output is procedurally reproducible from its inputs and falsifiable by the verifier suite.

Provider co-location with classify

The enrichment generator does NOT introduce a separate provider knob. It reads cfg.classify_llm_backend and uses the same backend the classification path uses — operators manage one set of credentials, one cost regime, one billing surface. Within that backend, the generator selects the strongest reasoning model available, because per-node generation is single-shot and benefits from extended deliberation on structural taxonomy judgments (sibling discrimination, prototype induction, regex synthesis).

Selection rule (highest priority first), implemented in atelier.enrichment.model_resolver.resolve_enrichment_model:

cfg.enrichment_model_override (env: ATELIER_ENRICHMENT_MODEL) — explicit operator choice, used verbatim.
Per-backend apex constant when the platform owns the model identity (currently: anthropic → claude-opus-4-7).
Fall through to cfg.classify_llm_model for backends where the model identity is endpoint-owned (openai_compatible, cerebras) — the operator’s served endpoint is the apex available to that deployment.
Bedrock without model_override raises EnrichmentModelError with an operator-facing remediation hint. Bedrock model identities are AWS account + region + inference-profile specific; no portable default constant would be correct across deployments, and silently degrading to a weaker model would contradict the strongest-reasoning-model discipline. This is a deployment-readiness gate consistent with the no-silent-DST-degradation principle.

The generator records {backend}:{model} in the point’s generated_by provenance field, so verifier pass-rate per node is attributable to the exact provider+model combination — the unit of replayable experiment.
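
The selection rule and the provenance string can be sketched as below. This is a simplified stand-in for resolve_enrichment_model, not its actual code: the `ConfigSketch` dataclass, the backend name "bedrock", the error messages, and the handling of unknown backends or a missing classify_llm_model are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


class EnrichmentModelError(RuntimeError):
    """Raised when no enrichment model can be resolved for the backend."""


# Per-backend apex constants, for backends where the platform owns the model identity.
APEX_MODELS = {"anthropic": "claude-opus-4-7"}
# Backends where the served endpoint determines the model.
ENDPOINT_OWNED = {"openai_compatible", "cerebras"}


@dataclass
class ConfigSketch:
    classify_llm_backend: str
    classify_llm_model: Optional[str] = None
    enrichment_model_override: Optional[str] = None


def resolve_enrichment_model(cfg: ConfigSketch) -> str:
    # 1. Explicit operator override wins, used verbatim.
    if cfg.enrichment_model_override:
        return cfg.enrichment_model_override
    backend = cfg.classify_llm_backend
    # 2. Per-backend apex constant.
    if backend in APEX_MODELS:
        return APEX_MODELS[backend]
    # 3. Endpoint-owned backends reuse the classify model.
    if backend in ENDPOINT_OWNED:
        if not cfg.classify_llm_model:
            raise EnrichmentModelError(f"{backend}: classify_llm_model is not set")
        return cfg.classify_llm_model
    # 4. Bedrock (and, by assumption, any unknown backend) needs an explicit choice.
    raise EnrichmentModelError(
        f"{backend}: no portable default model; set ATELIER_ENRICHMENT_MODEL"
    )


def generated_by(cfg: ConfigSketch) -> str:
    return f"{cfg.classify_llm_backend}:{resolve_enrichment_model(cfg)}"
```

Failing loudly in the last branch, rather than falling back to a weaker model, is what keeps the recorded provenance meaningful.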

Parent-aware vs leaf-aware prompts

Both prompt variants produce the same six-field JSON schema, so downstream code treats their outputs identically. The framing difference shapes content quality:

Leaf prompt asks for values, patterns, and name hints describing what a column tagged exactly at this leaf would contain. Patterns are narrow enough to discriminate against sibling leaves under the same parent.
Parent prompt asks what a column tagged at this generality level, without further specificity to a child, looks like. Children are listed so the model knows which specializations would NOT route here. Anti-examples are hierarchically aware: the confusable_tag field (a vestigial name retained for schema stability; see the anti_example_targets_exist verifier) may point to a sibling at the same level OR a sibling of an ancestor, because the late-interaction architecture’s anti-example evidence applies regardless of where in the tree the negative exemplar lives.

Late-interaction execution

Column-side multi-vector representation

For each column being classified, build the multi-vector query:

Query vector	Source
`col_name_view`	`embed(column_name + " in " + table_name)`
`col_sample_*`	`embed(sample_value)` per deduped sample (top-N by frequency or distinctness, configurable)
`col_context_view`	`embed("table columns: " + concat(other column names in same table))`
`col_pattern_view`	`embed(extracted format hints from samples)`

col_pattern_view is computed from sample values via the existing regex/validator detection in the pattern source — this is where the original “regex as embedding-text enrichment” intent (referenced in dst-evidence-independence.md and in the upstream Ægir documentation) re-enters cleanly: regex outputs contribute structured features into one of the multi-vector query slots, not as an independent mass function competing with cosine. The pattern source’s standalone mass-function status is preserved for narrow PII detection (email, IBAN, monetary, …) where its hits are crisp; the col_pattern_view augmentation is additional, not a replacement.

MaxSim aggregation

For each candidate tag and each query vector, find the best match in the annotation’s multi-vectors of the corresponding role:

positive_score(col, tag) =
    sim(col_name_view,    label_view of tag)         * w_label
  + sim(col_name_view,    name_hints of tag)         * w_name
  + max(sim(col_sample_i, prototype_values of tag))  * w_proto_per_sample
  + max(sim(col_sample_i, value_patterns of tag))    * w_pattern_per_sample
  + sim(col_context_view, parent_path_view of tag)   * w_context
  + ...

Execution happens in-engine via Qdrant’s multi-vector query API with MaxSim comparator. HNSW indexing brings the cost down to logarithmic in the annotation count, which is the dominant cost as vocabularies scale across deployments.

Mass function construction

mass_functions.maxsim_to_mass(scores, frame) produces a BeliefAssignment over the candidate frame from the Qdrant MaxSim scores.

The MaxSim score per tag is calibrated to evidence mass via the same reliability-shaping pattern documented in dst-evidence-independence.md: Haenni-Hartmann α-bounded reliability + margin-aware allocation.

α_abs — sigmoid of top-1 MaxSim score. “Is the best match strong enough to carry mass?”
α_marg — tanh((s₁ − s₂) / σ). “Is the top-1 decisive?”

Allocation:

m(top-1)         = α · margin_weight + α · (1 − margin_weight) · softmax_top1
m(top-i, i > 1)  = α · (1 − margin_weight) · softmax_top_i
m(Θ)             = 1 − α

Hierarchical subtree aggregation (_significant_subtree) routes residual mass to internal-node focal elements when subtree-level signal dominates leaf-level signal.
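
The allocation above can be sketched numerically. This is not the body of maxsim_to_mass; several pieces are assumptions made to get a runnable example: α is taken as `alpha_max * α_abs`, margin_weight is taken to equal α_marg, the softmax runs over all supplied scores with a `temperature` parameter, and `alpha_max`, `midpoint`, `steepness`, `sigma`, and `temperature` are hypothetical knobs. Subtree aggregation is omitted.

```python
import math
from typing import Dict

THETA = "Θ"  # key used here for the full frame (ignorance)


def _sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def maxsim_to_mass_sketch(scores: Dict[str, float], alpha_max: float = 0.9,
                          midpoint: float = 0.0, steepness: float = 1.0,
                          sigma: float = 1.0, temperature: float = 1.0) -> Dict[str, float]:
    if not scores:
        return {THETA: 1.0}  # no evidence: all mass on ignorance
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    s1 = ranked[0][1]
    # alpha_abs: is the best match strong enough to carry mass?
    alpha = alpha_max * _sigmoid(steepness * (s1 - midpoint))
    # alpha_marg: is the top-1 decisive? A lone candidate has no competitor (assumed 1.0).
    margin_weight = math.tanh((s1 - ranked[1][1]) / sigma) if len(ranked) > 1 else 1.0
    # Softmax over candidates, shifted by s1 for numerical stability.
    exps = [math.exp((s - s1) / temperature) for _, s in ranked]
    total = sum(exps)
    mass: Dict[str, float] = {}
    for i, ((tag, _), e) in enumerate(zip(ranked, exps)):
        share = alpha * (1.0 - margin_weight) * (e / total)
        mass[tag] = share + (alpha * margin_weight if i == 0 else 0.0)
    mass[THETA] = 1.0 - alpha
    return mass
```

Under these assumptions the masses always sum to 1: the top-1 bonus α · margin_weight plus the softmax-distributed α · (1 − margin_weight) account for exactly α, and Θ receives the remainder.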

Storage philosophy

Single source of truth per layer, with administrative pointers in PGlite.

Layer	Primary storage	Role
Vectors + payload	Qdrant (`annotations_<tax>_<ver>`)	Truth for enriched annotations; supports late-interaction execution
Run artifacts	`build/` (existing pattern)	Parquet, classifications, evaluation, sweep manifests, exports
Administrative	PGlite (`taxonomy_registry`, run regs)	Where things live, at which version, in which status
Future (planned)	Iceberg in S3	Intermediates + `hx` history tables (taxonomy_history, enrichment_history, classification_runs_history, sweep_history); snapshot/time-travel for `hx` semantics native to Iceberg

PGlite never holds vectors, payloads, classifications, or intermediates. Its job is to answer “where is the current enriched annotation collection for taxonomy X?” and “which run produced this dataset?” Both registries are small, fast to query, and survive backend migrations untouched.

When Iceberg-HX-in-S3 lands, the migration is a backend swap at the registry layer — pipeline_run_registry.artifacts_backend flips from build_local to iceberg_s3, artifacts_path switches to an S3 URI, and pipeline logic remains unchanged. Current build/ artifacts are forward-compatible with this transition.

PGlite tables (P1.2 migration)

CREATE TABLE taxonomy_registry (
    taxonomy_id          TEXT PRIMARY KEY,
    source_table         TEXT NOT NULL,
    qdrant_collection    TEXT NOT NULL UNIQUE,  -- UNIQUE so fsm_runs.taxonomy_collection can reference it
    qdrant_url           TEXT,
    augmentation_version TEXT NOT NULL,
    embedding_model      TEXT NOT NULL,
    embedding_dim        INTEGER NOT NULL,
    built_at             TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    status               TEXT NOT NULL DEFAULT 'building',
        -- 'building' | 'current' | 'stale' | 'archived'
    summary              TEXT
);

CREATE INDEX idx_taxonomy_registry_current
    ON taxonomy_registry(taxonomy_id, status);

-- Extends fsm_runs to record which enriched annotation collection
-- the run consumed.  NULL = legacy cosine; non-NULL = late-interaction.
ALTER TABLE fsm_runs ADD COLUMN IF NOT EXISTS
    taxonomy_collection TEXT REFERENCES taxonomy_registry(qdrant_collection);

Operator inspection and edit surface

The active enriched-annotations collection in Qdrant (whatever the operator’s runtime vocabulary selection happens to produce) is operator-facing through two surfaces:

On-demand export (scripts/export_enriched_annotations.py, P2.4): writes the Qdrant payload for a given (taxonomy_id, version) to build/exports/<tax>-enriched-<ver>-<utc>.parquet and a human-readable .tsv. Read-only snapshots, diffable across versions, droppable when no longer needed. Operators inspect via their existing tooling (parquet viewers, spreadsheet apps, mlr/q/duckdb for CLI).

Structured edit CLI (scripts/edit_enriched_annotation.py, deferred — part of P2 follow-on): operators issue targeted edits (add/remove prototype value, rewrite anti-example, etc.) which:

Write back to the Qdrant point’s payload + re-embed affected views
Append an entry to the operator_edits audit log
Bump a per-row revision counter (separate from augmentation_version, which is the system-level prompt/verifier version)

Edits are reversible — the audit log carries the prior value for every change. Per-customer overlays (deployment-specific augmentations beyond the base) follow the same shape on a separate edits stack.
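
A sketch of one edit on a list-valued field shows how the audit entry can carry enough to reverse the change. Since the edit CLI is deferred, everything here is hypothetical: the function name, the `prior` key holding the pre-edit value, and the `revision` counter key. Re-embedding the affected text and writing back to Qdrant are deliberately left out.

```python
import copy
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def apply_list_edit(payload: Dict[str, Any], *, by: str, field: str, op: str,
                    value: Any, reason: str, at: Optional[str] = None) -> Dict[str, Any]:
    prior = copy.deepcopy(payload.get(field, []))
    if op == "add":
        updated = prior + [value] if value not in prior else list(prior)
    elif op == "remove":
        updated = [v for v in prior if v != value]
    else:
        raise ValueError(f"unsupported op: {op!r}")
    entry = {
        "at": at or datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "by": by, "field": field, "op": op, "value": value, "reason": reason,
        "prior": prior,  # hypothetical key: pre-edit value, enables reversal
    }
    edited = copy.deepcopy(payload)  # leave the caller's payload untouched
    edited[field] = updated
    edited["operator_edits"] = edited.get("operator_edits", []) + [entry]  # append-only
    edited["revision"] = edited.get("revision", 0) + 1  # hypothetical per-row counter
    return edited
```

Reverting is then a matter of restoring `prior` for the field and appending another log entry, so the history itself is never rewritten.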

SHAP / SAGE shift under late interaction

The structured per-segment inputs (column_name, each sample, context, pattern view) provide natural attribution surfaces that the prior single-vector representation flattened.

SHAP becomes per-decision interpretability infrastructure. For a column predicted EMAIL, SHAP attributes the score across the structured inputs: “sample_3 contributed 0.42 via match against EMAIL.prototype_values[7]; column_name contributed 0.08 via name_hints; everything else < 0.05.” Operator-legible explanation per prediction, computable in-pipeline at moderate cost (one late-interaction pass per perturbation). Wired into features.FEATURE_NAMES as new ablatable feature slots: late_interaction_positive, late_interaction_negative, late_interaction_view_<name>.

SAGE moves to offline-first. Late-interaction inputs are richer (more “features” — per-view contributions, per-vector contributions), and SAGE’s permutation-based global compute scales with that dimensionality. Per-pipeline-run SAGE becomes impractical and, more importantly, of low marginal value: SAGE’s value proposition is corpus-level stability rather than per-run signal. The shift:

SAGE runs as a separate offline pipeline, scheduled or on-demand, against the current enriched annotations + corpus characterization.
Artifact written to build/sage/<corpus_id>-<taxonomy_version>-<utc>.parquet.
Downstream consumers (UI, view-prioritization, operator dashboards) reference the cached artifact; the pipeline hot path never recomputes inline.
Optional integration: SAGE importance scores prioritize which annotation views the late-interaction engine computes first, with early-exit when high-importance views already discriminate confidently — a wall-time win on large taxonomies.

CLAUDE.md already notes SAGE is optional; this makes “optional” precise: optional in the hot path, scheduled-only otherwise.

Integration with existing fusion mechanisms

Every mechanism in dst-evidence-independence.md composes cleanly with this work. Specifically:

Existing mechanism	Composes by
Reliability discounting (Shafer §11.3)	Late-interaction cosine carries its own discount slot in `config/base.conf`; default starts at `cosine` value (0.20) and is sweep-tunable.
Indep-tier consensus + revisit gate	Late-interaction cosine remains in the independent tier (its only LLM dependence is the enrichment, which is offline + verified). Indep-tier fusion picks it up unchanged.
Cosine reliability shaping (Haenni-Hartmann 2006)	The α-bounded + margin-aware allocation pattern is reused for the positive channel; quality indicators extend to include verifier-pass-rate.
Hierarchical mass aggregation + cross-subtree visibility	The positive-channel mass function emits hierarchical mass identically: walk up from top-1 leaf to the most-specific subtree capturing ≥ 50% of softmax probability, redirect residual to internal-node focal element. `cautious_promoted_code` walks the full hierarchy as before.
Cost-sensitive classification at LLM layer (Elkan 2001)	Unchanged — operates upstream of fusion and is orthogonal to the cosine representation.
Pattern-target alias resolver	Unchanged for the standalone pattern source. The pattern source’s hits additionally enrich the `col_pattern_view` query vector.
Per-column residual trajectory	Unchanged — operates on the iteration history of fused belief, which still flows through `BootstrapState`. The late-interaction cosine’s per-view scores can be added to the snapshot for finer-grained trajectory analysis (deferred).
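
The walk-up described in the hierarchical aggregation row can be sketched as follows. It is an illustration under assumptions, not the code behind _significant_subtree: the taxonomy is given as a hypothetical child-to-parent dict, probabilities are supplied per leaf, and returning None when no ancestor reaches the threshold is an assumed convention.

```python
from typing import Dict, Optional


def significant_subtree(top1: str, parent_of: Dict[str, Optional[str]],
                        leaf_probs: Dict[str, float],
                        threshold: float = 0.5) -> Optional[str]:
    """Most specific node on the path from top1 to the root whose subtree
    holds at least `threshold` of the softmax probability."""

    def subtree_prob(node: str) -> float:
        total = 0.0
        for leaf, prob in leaf_probs.items():
            cur: Optional[str] = leaf
            # A leaf counts toward `node` if `node` is the leaf or one of its ancestors.
            while cur is not None:
                if cur == node:
                    total += prob
                    break
                cur = parent_of.get(cur)
        return total

    node: Optional[str] = top1
    while node is not None:
        if subtree_prob(node) >= threshold:
            return node
        node = parent_of.get(node)
    return None  # assumed convention: caller leaves the residual on Θ


parent_of = {"EMAIL": "CONTACT", "PHONE": "CONTACT", "CONTACT": "PII",
             "SSN": "PII", "PII": None}
focal = significant_subtree("EMAIL", parent_of,
                            {"EMAIL": 0.40, "PHONE": 0.35, "SSN": 0.25})
```

In this toy tree the top-1 leaf alone holds 0.40, so the walk stops at its parent, whose two leaves together hold 0.75.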

Configuration

New keys under classify.cosine.late_interaction in config/base.conf:

classify {
  cosine {
    # Late-interaction multi-vector cosine is the production cosine
    # source.  Default ON.  The legacy single-vector cosine path
    # remains in the code only as a transitional emergency fallback;
    # when the late-interaction flag is on and the path cannot run
    # (no enriched collection, Qdrant unreachable, qdrant-client
    # missing), the pipeline logs WARNING + marks the run degraded
    # via `maxsim_path` in the per-column result.
    late_interaction {
      enabled = true
      enabled = ${?ATELIER_CLASSIFY_COSINE_LATE_INTERACTION}

      model = "colbert-ir/colbertv2.0"
      model = ${?ATELIER_COLBERT_MODEL}

      qdrant_url = "http://127.0.0.1:6333"
      qdrant_url = ${?ATELIER_QDRANT_URL}
    }
  }
}

Existing classify.cosine.* keys are unchanged; the late-interaction path is the production cosine source under this design. The flag exists for emergency rollback only — leaving the pipeline in legacy single-vector cosine is a deployment-degraded state, not a normal operating mode, and runs in that state are tagged with maxsim_path: "legacy_degraded:<reason>" in the per-column result so the degradation is visible in operator-facing artifacts.

Deferred work

Synthia / copula-aware column-side patterns: when the SVM-on-synthetic work lands (separate track), the column-side multi-vector can include copula-derived inter-column dependency features as additional query vectors. The query-vector slot is already structurally available; only the feature extractor needs to land.
Ægir CTA + CPA outputs as additional query vectors: when Ægir integration lands, its predictions (and its CPA / cross-table grouping outputs) can enter the column-side multi-vector as supplementary query views. Same structural slot.
Per-deployment edit overlays with separate version stack from the base augmentation. Schema for the overlay is sketched above; implementation deferred until operator workflow is validated.
Iceberg-HX-in-S3 backend for the on-demand exports + run artifacts. Designed-for; not yet built.

References

Khattab, O. & Zaharia, M. (2020). ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT. SIGIR ’20, 39–48. Introduces the late-interaction MaxSim formulation.
Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., & Zaharia, M. (2022). ColBERTv2: Effective and Efficient Retrieval via Lightweight Late Interaction. NAACL 2022. Refines the MaxSim scoring + residual compression.
Qdrant multi-vector named-vectors API: https://qdrant.tech/course/multi-vector-search/module-1/late-interaction-basics/
Shafer, G. (1976). A Mathematical Theory of Evidence. §11.3 reliability discount. (Reused per the existing DST treatment.)
Smets, P. (1990). The Combination of Evidence in the Transferable Belief Model. IEEE TPAMI 12(5), 447–458. Negative-channel framing.
Haenni, R. & Hartmann, S. (2006). Modeling Partially Reliable Information Sources. Information Fusion 7(4), 361–379. α-bounded reliability shaping reused here.
Companion architecture note: dst-evidence-independence.md — reliability discounting, indep-tier consensus, hierarchical aggregation, cost-sensitive LLM prompting.
Upstream foundation-model work: https://zndx.github.io/aegir/ (hierarchical byte-level sequence model + RLVR-trained ontology policy for CTA/CPA/cross-table grouping; complementary independent evidence source on a longer timeline).

Nautilus — Mid-Run Pipeline Watcher

Nautilus is the in-process, mid-run watcher for a classification run. A daemon thread polls the FSM and BootstrapState.batch_audit, decides when a run is going sideways, and hands a structured InterventionRecord to a callback. The callback — not nautilus — decides what to do (record, cancel, escalate).

The thread itself is observation + decision framing. It owns no LLM-calling code, holds no agent context, and never kills a process. That separation keeps nautilus testable without tool-using agents in the loop and lets the same trigger logic serve both the gateway’s auto-cancel hook and the supervisor Overwatch post-mortem.

Why it exists

UAT surfaced a class of failure where the pipeline stopped making progress without erroring — typically a frozen LLM sweep on a problem batch with no heartbeat advance for 20+ minutes. The FSM still read LLM_SWEEP; nothing was wrong from the FSM’s point of view. The operator either waited or killed the process by hand.

Nautilus closes that gap. It pairs with two other layers of self-remediation in the pipeline:

Pillar	Where it lives	What it does
1 — Halving retry	`classify/bootstrap.py`	Per-batch: on LLM failure, halve `columns_per_call` and retry until single-column or success. Preserves 100% column coverage.
2 — Nautilus (this doc)	`overwatch/nautilus.py`	Per-run: observe FSM + batch_audit, fire an intervention when the run stalls, sweeps too long, or accumulates failures.
3 — Supervisor Overwatch	`overwatch/agent.py`, `apply_and_rerun.py`	Post-run: read the latest intervention record, propose a config overlay, optionally rerun. Multi-attempt session in `overwatch/session.py`.

Triggers

Each trigger fires at most once per FSM phase. Phase change resets the phase-scoped triggers (stall, slow_llm_sweep) so a long run can record one intervention per phase rather than storming every poll.

Trigger constant	Fires when	Default threshold
`TRIGGER_STALL` (`"stall"`)	No new `batch_audit` activity for `stall_threshold_s` while FSM is in a non-terminal state.	120 s
`TRIGGER_SLOW_SWEEP` (`"slow_llm_sweep"`)	`fsm.state == LLM_SWEEP` for more than `llm_sweep_threshold_s`, regardless of batch progress.	300 s
`TRIGGER_FAILED_BATCHES` (`"failed_batches"`)	Count of `batch_audit` entries with status `failed` or `fatal` exceeds `failed_batch_threshold`.	10
`TRIGGER_FSM_ERROR` (`"fsm_error"`)	Pipeline transitioned to `ERROR`. Unconditional; bypasses other evaluation.	—

evaluate_triggers() is a pure function of heartbeat + config — tests exercise it directly with a seeded _Heartbeat and synthetic clock, no threads required.
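
The trigger table can be read as a small pure function. The sketch below is a hypothetical stand-alone version, not the code in overwatch/nautilus.py: the Heartbeat and Thresholds classes, the free-function signature, and the already_fired argument are assumptions made for illustration. Only the trigger names and default thresholds come from the table above.

```python
from dataclasses import dataclass

TRIGGER_STALL = "stall"
TRIGGER_SLOW_SWEEP = "slow_llm_sweep"
TRIGGER_FAILED_BATCHES = "failed_batches"
TRIGGER_FSM_ERROR = "fsm_error"
TERMINAL_STATES = {"IDLE", "CONVERGED", "ERROR"}


@dataclass
class Heartbeat:
    """Hypothetical stand-in for the watcher's _Heartbeat."""
    last_audit_change: float  # clock time when batch_audit last grew
    phase_entered: float      # clock time when the current FSM phase began


@dataclass(frozen=True)
class Thresholds:
    stall_threshold_s: float = 120.0
    llm_sweep_threshold_s: float = 300.0
    failed_batch_threshold: int = 10


def evaluate_triggers(hb, cfg, *, state_name, now, failed_count,
                      already_fired=frozenset()):
    """Return the trigger names that should fire on this poll."""
    if state_name == "ERROR":
        # Unconditional: bypasses every other check.
        return [TRIGGER_FSM_ERROR]
    fired = []
    stalled = now - hb.last_audit_change >= cfg.stall_threshold_s
    if state_name not in TERMINAL_STATES and stalled:
        fired.append(TRIGGER_STALL)
    if state_name == "LLM_SWEEP" and now - hb.phase_entered > cfg.llm_sweep_threshold_s:
        fired.append(TRIGGER_SLOW_SWEEP)
    if failed_count > cfg.failed_batch_threshold:
        fired.append(TRIGGER_FAILED_BATCHES)
    # At most once: drop triggers that were already recorded.
    return [t for t in fired if t not in already_fired]
```

Because the clock is passed in as now, a test can move time forward by changing one argument instead of sleeping.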

How it observes

State registry (module-level _state_registry): the pipeline calls register_state(run_id, state) early in run_classification_pipeline and unregister_state(run_id) in the finally block. The registry is lock-guarded so nautilus never observes a partially-destructed state during teardown.
FSM polling (tick): fsm.get_status(run_id) each poll (default every 10 s). Phase change resets phase-scoped triggers and the phase-entry clock.
batch_audit tail: nautilus counts entries and failed entries. The pipeline appends to state.batch_audit between LLM calls so the audit length acts as the heartbeat — its non-advance is the stall signal.
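
A lock-guarded registry of this kind might look like the sketch below. The function names register_state and unregister_state come from the text; snapshot_heartbeat, the lock name, and the shape of batch_audit entries (dicts with a "status" key) are assumptions for illustration.

```python
import threading
from types import SimpleNamespace

_state_registry = {}
_registry_lock = threading.Lock()


def register_state(run_id, state):
    with _registry_lock:
        _state_registry[run_id] = state


def unregister_state(run_id):
    with _registry_lock:
        _state_registry.pop(run_id, None)


def snapshot_heartbeat(run_id):
    """Return (audit_len, failed_count), or None once the run is unregistered."""
    with _registry_lock:
        state = _state_registry.get(run_id)
        if state is None:
            return None
        audit = list(state.batch_audit)  # copy while holding the lock
    failed = sum(1 for e in audit if e.get("status") in ("failed", "fatal"))
    return len(audit), failed


# Hypothetical state object standing in for BootstrapState.
demo_state = SimpleNamespace(
    batch_audit=[{"status": "ok"}, {"status": "failed"}, {"status": "fatal"}]
)
```

Copying the audit list under the lock means the count runs on a stable snapshot even if the pipeline unregisters the run a moment later.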

Dispatch and cooperative cancel

When a trigger fires, _dispatch builds an InterventionRecord, appends it to watcher.interventions, and invokes intervene_callback(rec) if one was supplied. The callback returns a dict with a decision field ("observed" | "intervened" | "cancelled") and an optional reason.

If decision == "cancelled" and cfg.can_cancel is true, nautilus flips state.cancelled = True on the registered BootstrapState. The pipeline checks this flag between LLM batches in bootstrap.py and exits cleanly. There is no SIGKILL path. An in-flight LLM call finishes before the run terminates. This is what “cooperative cancel” means.

can_cancel is gated by overwatch.autonomy:

Autonomy mode	`can_cancel`	What nautilus does on stall
`monitor`	false	Record only.
`propose`	false	Record only. The supervisor reads the record post-run.
`autonomous`	true	Flip `state.cancelled` so the pipeline exits.

The gateway’s default callback (gateway.py:2154) always returns {"decision": "cancelled"} — so the autonomy gate is the only thing keeping propose / monitor runs from auto-cancelling.
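
The interaction between the callback decision, the autonomy gate, and the pipeline's between-batch check can be sketched as follows. RunState, dispatch, and run_batches are hypothetical names standing in for BootstrapState, _dispatch, and the batch loop in bootstrap.py; the real code may be organised differently.

```python
from dataclasses import dataclass, field


@dataclass
class RunState:
    """Hypothetical stand-in for BootstrapState."""
    cancelled: bool = False
    batch_audit: list = field(default_factory=list)


def can_cancel_for(autonomy):
    # Only "autonomous" unlocks auto-cancel; monitor and propose record only.
    return autonomy == "autonomous"


def dispatch(state, record, callback, can_cancel):
    """Hand the record to the callback, then apply the cancel gate."""
    result = callback(record) if callback else {"decision": "observed"}
    if result.get("decision") == "cancelled" and can_cancel:
        state.cancelled = True
    return result


def run_batches(state, batches, classify):
    """Pipeline side: check the flag between batches, never mid-call."""
    for batch in batches:
        if state.cancelled:
            break
        state.batch_audit.append({"batch": batch, "status": classify(batch)})
    return len(state.batch_audit)
```

With a callback that always answers "cancelled", a propose run keeps going and an autonomous run stops at the next batch boundary.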

Deployment gates — Bedrock-only and direct-Anthropic

Nautilus runs without the direct Anthropic API. It makes no LLM calls of its own — it observes, decides, and hands a record to a callback. The Anthropic gate (cfg.has_overwatch) only applies to Pillar 3 — the post-run supervisor agent that consumes nautilus’s records and proposes config overlays. Pillar 2 (this watcher) is upstream of that gate.

Three independent gates drive what nautilus actually does:

Gate	Source	Affects
`overwatch.nautilus.enabled`	HOCON / env (default `true`)	Whether the watcher attaches at all
`overwatch.autonomy == "autonomous"`	HOCON / env (default `propose`)	Whether nautilus can flip `state.cancelled` itself; whether `kill_run` CLI is permitted
`cfg.has_overwatch` (= `overwatch.enabled AND has_anthropic`)	derived from `ANTHROPIC_API_KEY`	Whether Pillar 3 supervisor agent runs post-run

Capability matrix on a Bedrock-only deployment (no ANTHROPIC_API_KEY — typical for CAI on Bedrock or air-gapped environments):

Capability	Bedrock + `propose` (default)	Bedrock + `autonomous`
Watcher thread starts and polls	✅	✅
Trigger detection (stall / slow-sweep / failed-batches / fsm_error)	✅	✅
`InterventionRecord`s queryable via `/api/overwatch/nautilus*`	✅	✅
Operator UI Stop (`POST /api/fsm/cancel`)	✅ — never autonomy-gated	✅
Auto-cancel on stall (nautilus → `state.cancelled`)	❌ recorded only — `can_cancel=False`	✅
`kill_run` CLI	❌ rejected (autonomy gate)	✅
Post-run supervisor agent (proposes overlay, optional rerun)	❌ requires direct Anthropic API	❌ requires direct Anthropic API

To unlock auto-cancel without adding an Anthropic key, set ATELIER_OVERWATCH_AUTONOMY=autonomous. Autonomy is independent of has_overwatch. The trade-off: nautilus will cancel runs based on threshold rules alone, with no AI judgement layer behind the decision.

Config

overwatch {
  autonomy = "propose"  # monitor | propose | autonomous

  nautilus {
    enabled = true
    poll_interval_s = 10.0
    stall_threshold_s = 120.0
    llm_sweep_threshold_s = 300.0
    failed_batch_threshold = 10
  }
}

Environment overrides: ATELIER_OVERWATCH_NAUTILUS_ENABLED, ATELIER_OVERWATCH_NAUTILUS_POLL_INTERVAL_S, ATELIER_OVERWATCH_NAUTILUS_STALL_THRESHOLD_S, ATELIER_OVERWATCH_NAUTILUS_LLM_SWEEP_THRESHOLD_S, ATELIER_OVERWATCH_NAUTILUS_FAILED_BATCH_THRESHOLD.

nautilus_config_from_cfg(cfg) reads these and fills can_cancel from overwatch.autonomy == "autonomous".

HTTP surface

Both routes are read-only. Cancellation goes through the operator “Stop run” UI control or the kill_run CLI; nautilus does not expose a cancel endpoint of its own.

Method	Path	Purpose
`GET`	`/api/overwatch/nautilus/{run_id}`	Watcher snapshot for a specific run: heartbeat, intervention list, cancelled flag.
`GET`	`/api/overwatch/nautilus`	All active watchers (typically one — runs are single-flight).

The watcher object is held in a module-level _active_watchers map so the gateway can answer status queries without plumbing the reference through pipeline internals. Intervention history survives watcher stop and remains queryable until the gateway restarts.

Operator CLI: cooperative kill

uv run python -m atelier.overwatch.kill_run <run_id> \
    --reason "stuck on partner-data sweep" \
    [--session <supervisor-session-id>]

kill_run looks the run up in the nautilus registry, sets state.cancelled = True, and stops the watcher. Gated to autonomous mode — in propose / monitor an operator must use the UI Stop control, since the supervisor (which calls this CLI in autonomous mode) isn’t yet authorized to cancel on its own. With --session, the cancel is appended to the supervisor session’s intervention log via overwatch.session.record_intervention.

Lifecycle

Operator hits POST /api/fsm/start. Gateway spawns the pipeline thread.
Gateway polls fsm.get_status() for up to ~1 s waiting for the pipeline to claim a run_id, then constructs a NautilusWatcher, registers it in _active_watchers, and starts the daemon thread.
Pipeline calls register_state(run_id, state) early in run_classification_pipeline. From here nautilus can observe.
On each poll_interval_s tick: read FSM, refresh heartbeat, evaluate triggers, dispatch records, repeat.
On terminal state (IDLE / CONVERGED / ERROR) the watcher’s tick returns True and the thread exits. Pipeline’s finally calls unregister_state(run_id) and the gateway calls clear_active_watcher(run_id).

Testing

The watcher is split deliberately to keep tests synchronous:

Trigger logic — drive evaluate_triggers(state_name=..., now=..., failed_count=...) against a watcher with a hand-seeded _Heartbeat and a fake clock. No threads, no FSM, no BootstrapState.
Dispatch / cancel — instantiate a NautilusWatcher with a fake intervene_callback and a stub FSM; assert that the callback return value drives state.cancelled correctly under each can_cancel value.
End-to-end — BDD scenarios under features/agent/ cover the registry round-trip and the gateway routes; the slow-path watch is not exercised in CI (it would require a real long-running sweep).

Monte Carlo Sampling

At small corpus sizes (< 200 columns), every column receives direct LLM classification. As the corpus scales to thousands or millions of columns, this becomes prohibitively expensive. Monte Carlo stratified sampling selects a representative subset for direct LLM inference and propagates labels cheaply via embedding similarity to the remainder.

The MC layer adds no cost at small scale: below min_corpus_size the pipeline behaves exactly as it does without sampling. Above that threshold the layer activates transparently.

Three-Phase MC Layer

The MC layer runs as sub-phases of the existing SAMPLING and LLM_SWEEP states, and its propagated labels feed VALIDATING. No new FSM states are added.

SAMPLING
  ├─ [existing] Extract features for all columns
  ├─ Pre-classify: cheap M0 evidence (name, pattern, cosine) — no LLM
  ├─ Stratify: group by preliminary category + uncertainty
  └─ Select MC sample: importance-weighted within strata

LLM_SWEEP
  ├─ [existing] LLM classifies the MC sample (not all columns)
  └─ Propagate: extend labels to remaining corpus via embedding similarity

VALIDATING
  └─ [existing] Full 6-source DST on ALL columns
      (propagated labels enter as discounted LLM evidence)
      → High-gap / low-belief propagated columns escalate to revisit

Phase 1: Pre-Classification

Run M0 evidence sources only (no LLM, no trained CatBoost or SVM classifiers). For each column:

Name matching → best category + mass
Pattern detection → matched categories
Cosine similarity → top-K categories + scores

Returns a preliminary category code + confidence for every column. Uses the existing name_match_to_mass(), pattern_to_mass(), classify_cosine() functions from the pipeline.

Phase 2: Stratification

Partition columns by their preliminary category code:

Rare strata (< 2 × min_per_stratum members): fully sampled
UNRESOLVED stratum (M0 sources disagree or low confidence): fully sampled
Normal strata: proportional allocation with importance weighting
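
The three rules above can be sketched as a single grouping pass. This is an illustrative version, not the stratify() in monte_carlo.py: the input shape (column id mapped to a preliminary code) and the "full" / "proportional" mode labels are assumptions.

```python
from collections import defaultdict

UNRESOLVED = "UNRESOLVED"


def stratify(prelim, min_per_stratum=3):
    """prelim maps column id -> preliminary category code (or UNRESOLVED).

    Returns {code: (mode, members)} where mode is "full" or "proportional".
    """
    groups = defaultdict(list)
    for col, code in prelim.items():
        groups[code].append(col)
    plan = {}
    for code, members in groups.items():
        rare = len(members) < 2 * min_per_stratum
        mode = "full" if (rare or code == UNRESOLVED) else "proportional"
        plan[code] = (mode, sorted(members))
    return plan
```

With the default min_per_stratum of 3, any stratum with fewer than six members is sampled in full.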

Phase 3: Sample Selection

Within each normal stratum, select columns via importance-weighted random sampling without replacement. Importance weight per column:

w = (1 - confidence) × (1 + uncertainty)

where confidence = max cosine similarity, uncertainty = ratio of 2nd-best to 1st-best similarity (ambiguity measure).

Total budget: min(max_sampled_columns, total × sample_fraction)
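
A minimal sketch of the weight, the budget, and a weighted draw without replacement. The text does not say which sampling algorithm is used, so the Efraimidis-Spirakis key method below is a swapped-in choice for illustration. Flooring the budget with int() is likewise an assumption.

```python
import random


def importance_weight(confidence, uncertainty):
    # w = (1 - confidence) * (1 + uncertainty), as defined above.
    return (1.0 - confidence) * (1.0 + uncertainty)


def total_budget(total, sample_fraction=0.15, max_sampled_columns=500):
    # Flooring is an assumption; the text only gives the min(...) form.
    return min(max_sampled_columns, int(total * sample_fraction))


def weighted_sample(columns, k, rng):
    """columns: list of (col_id, confidence, uncertainty). Draw k without replacement."""
    keyed = []
    for col_id, conf, unc in columns:
        w = max(importance_weight(conf, unc), 1e-12)  # keep weights positive
        # Efraimidis-Spirakis key: larger weights tend to produce larger keys.
        keyed.append((rng.random() ** (1.0 / w), col_id))
    keyed.sort(reverse=True)
    return [col_id for _, col_id in keyed[:k]]
```

Passing an explicit random.Random instance keeps the draw reproducible for a given seed.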

Label Propagation

After the LLM sweep on the sampled subset:

For each propagation column, find the nearest directly-classified column by cosine similarity (stratum-local to limit search space)
If similarity >= propagation_threshold: assign same label with discounted confidence
If similarity < threshold: column gets no LLM evidence in DST

Propagated labels enter DST fusion with a higher discount factor (0.30 vs 0.10 for direct LLM), so they carry less evidential mass. If M0 sources disagree with the propagated label, conflict K rises, the belief gap widens, and the existing targeted-revisit loop automatically escalates the column for direct LLM classification.
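
The propagation rule can be sketched with a brute-force nearest-neighbour search inside one stratum. This is an illustration only: the data shapes are hypothetical, and a real implementation at scale would use batched matrix operations or an index rather than a Python loop.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def propagate_labels(targets, labelled, threshold=0.85, discount=0.30):
    """targets: {col: vec}; labelled: {col: (vec, label)} from the same stratum.

    Returns {col: (label, discount)}, or {col: None} when no directly
    classified neighbour reaches the threshold.
    """
    out = {}
    for col, vec in targets.items():
        best_label, best_sim = None, -1.0
        for lvec, label in labelled.values():
            sim = cosine(vec, lvec)
            if sim > best_sim:
                best_label, best_sim = label, sim
        out[col] = (best_label, discount) if best_sim >= threshold else None
    return out
```

A None result corresponds to the "no LLM evidence in DST" case: the column is fused from the remaining sources only.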

Why This Works with DST

The evidence fusion framework makes MC sampling robust:

Propagated evidence carries less mass (more goes to Theta/ignorance)
M0 agreement with propagated label → high belief, narrow gap (good)
M0 disagreement with propagated label → wide gap → revisit-via-LLM
Escalation is automatic — no special MC-aware revisit logic needed
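
A worked toy example makes the four points concrete. It uses a two-element frame, Shafer discounting, and Dempster's rule. The masses are invented for illustration and the real pipeline fuses six sources over a much larger frame, so treat this as a sketch of the mechanism rather than the production fusion code.

```python
from itertools import product

THETA = frozenset({"EMAIL", "PHONE"})
E, P = frozenset({"EMAIL"}), frozenset({"PHONE"})


def discount(m, alpha):
    """Shafer discounting: scale focal masses by (1 - alpha), move the rest to Theta."""
    out = {s: (1.0 - alpha) * v for s, v in m.items() if s != THETA}
    out[THETA] = alpha + (1.0 - alpha) * m.get(THETA, 0.0)
    return out


def combine(m1, m2):
    """Dempster's rule; returns (combined masses, conflict K). Assumes K < 1."""
    raw, k = {}, 0.0
    for (a, x), (b, y) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            raw[inter] = raw.get(inter, 0.0) + x * y
        else:
            k += x * y
    return {s: v / (1.0 - k) for s, v in raw.items()}, k


def interval(m, label):
    a = frozenset({label})
    bel = sum(v for s, v in m.items() if s <= a)
    pl = sum(v for s, v in m.items() if s & a)
    return bel, pl


llm = {E: 1.0}                     # LLM label before discounting
m0_agree = {E: 0.6, THETA: 0.4}    # cheap sources agree with the label
m0_clash = {P: 0.6, THETA: 0.4}    # cheap sources disagree

direct = discount(llm, 0.10)       # 0.10 of the mass moves to Theta
propagated = discount(llm, 0.30)   # 0.30 of the mass moves to Theta
```

Combining propagated with m0_agree gives an EMAIL interval of roughly [0.88, 1.00] with K = 0. Combining it with m0_clash gives roughly [0.48, 0.69] with K = 0.42: belief drops and the gap widens, which is the condition that sends a column to revisit.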

Scaling Projections

GitTables corpus: 1.7M tables today, 10M+ near-term. Average 8-12 columns per table = 15M-120M columns at full scale.

Corpus	MC Mode	Direct LLM Calls	Propagated	Cost Reduction
50	Passthrough	50 (all)	0	0%
500	Active	~75 (15%)	~425	85%
5,000	Active	~500 (cap)	~4,500	90%
50K	Active	~500 (cap)	~49.5K	99%
500K	Active	~500 (cap)	~499.5K	99.9%
15M	Active	~500 (cap)	~15M	>99.99%
120M	Active	~500 (cap)	~120M	>99.99%

At the max_sampled_columns=500 cap, stratified importance sampling ensures every category stratum gets at least min_per_stratum=3 exemplars. Uniform random sampling at 500/15M would miss rare categories entirely.

Scale-Critical Design Decisions

Embedding computation: batch GPU encoding at ~2,768 texts/s (RTX 4090); 15M columns takes ~90 minutes. One-time cost, GPU-parallelizable.
Stratum-local propagation: similarity search within each stratum (not across the full corpus) to limit memory and compute.
Memory: 15M columns × 200B = ~3GB for metadata; 15M × 1.5KB = ~22GB for embeddings. Requires streaming/chunked processing.
Escalation budget: ~50-100 additional direct-LLM calls from revisit. Total LLM call budget: ~600 calls for a 15M-column corpus.
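
The figures above can be reproduced with a few lines of arithmetic. The sketch assumes 384-dimensional float32 embeddings (1,536 bytes per vector) and takes the upper end of the escalation estimate; the CPU throughput of ~400 texts/s is the figure quoted in the GPU Acceleration section.

```python
COLUMNS = 15_000_000
GPU_TEXTS_PER_S = 2_768
CPU_TEXTS_PER_S = 400
EMBED_DIM, BYTES_PER_FLOAT32 = 384, 4

gpu_minutes = COLUMNS / GPU_TEXTS_PER_S / 60   # about 90 minutes
cpu_hours = COLUMNS / CPU_TEXTS_PER_S / 3600   # about 10.4 hours
metadata_gb = COLUMNS * 200 / 1e9              # 3.0 GB at 200 B per column
# About 23 GB at 1,536 B per vector; the ~22 GB figure rounds this to 1.5 KB.
embedding_gb = COLUMNS * EMBED_DIM * BYTES_PER_FLOAT32 / 1e9
llm_calls = 500 + 100                          # sampling cap plus upper escalation estimate
```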

Configuration

classify {
  monte_carlo {
    min_corpus_size = 200              # Below this, classify everything
    min_corpus_size = ${?ATELIER_MC_MIN_CORPUS_SIZE}
    sample_fraction = 0.15             # Fraction directly classified by LLM
    sample_fraction = ${?ATELIER_MC_SAMPLE_FRACTION}
    min_per_stratum = 3                # Minimum samples per category stratum
    max_sampled_columns = 500          # Hard cap on directly-classified columns
    max_sampled_columns = ${?ATELIER_MC_MAX_SAMPLED}
    propagation_threshold = 0.85       # Cosine sim for propagation
    propagation_threshold = ${?ATELIER_MC_PROPAGATION_THRESHOLD}
    propagation_discount = 0.30        # LLM mass discount for propagated labels
  }
}

Module Structure

src/atelier/classify/monte_carlo.py
├── MCConfig          — Frozen dataclass with from_cfg() factory
├── PreClassification — Per-column M0 result (code + confidence + uncertainty)
├── Stratum           — Column group by preliminary category
├── MCPlan            — Sampling plan (sampled + propagation sets)
├── pre_classify()    — Run M0 evidence for all columns
├── stratify()        — Group by preliminary category + uncertainty
├── select_sample()   — Importance-weighted selection within strata
└── propagate_labels() — Embedding-similarity label extension

GPU Acceleration

Atelier uses GPU acceleration for sentence-transformer embedding computation and CatBoost training/inference. GPU support is auto-detected at startup with graceful fallback to CPU.

Detection

gpu.preflight_gpu() runs once at config load time and caches the result for the process lifetime. Three-step detection:

nvidia-smi probe: subprocess call to detect device count, names, VRAM, and driver CUDA version
CUDA version extraction: parse nvidia-smi header for driver compatibility
PyTorch check: torch.cuda.is_available() confirms runtime support

The result is a GpuInfo dataclass with:

available — whether CUDA is usable
device_count — number of GPUs
devices — device names with VRAM (e.g., “NVIDIA RTX 4090 24GB”)
resolved_device — "cuda" or "cpu" for model initialization
warnings — non-blocking issues (version mismatches, library path hints)
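
A simplified preflight along these lines is sketched below. It is not the gpu.py implementation: it skips the driver CUDA version parsing step, treats torch as an optional import, and uses lru_cache for the once-per-process caching. The field names follow the list above.

```python
import shutil
import subprocess
from dataclasses import dataclass, field
from functools import lru_cache


@dataclass
class GpuInfo:
    available: bool = False
    device_count: int = 0
    devices: list = field(default_factory=list)
    resolved_device: str = "cpu"
    warnings: list = field(default_factory=list)


@lru_cache(maxsize=1)
def preflight_gpu():
    """Probe once and cache the result for the process lifetime."""
    info = GpuInfo()
    if shutil.which("nvidia-smi"):
        try:
            out = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,memory.total",
                 "--format=csv,noheader"],
                capture_output=True, text=True, timeout=10, check=True,
            ).stdout
            info.devices = [ln.strip() for ln in out.splitlines() if ln.strip()]
            info.device_count = len(info.devices)
        except (subprocess.SubprocessError, OSError) as exc:
            info.warnings.append(f"nvidia-smi probe failed: {exc}")
    try:
        import torch  # optional in this sketch
        info.available = bool(torch.cuda.is_available())
    except Exception as exc:  # any import or runtime failure means CPU
        info.warnings.append(f"torch CUDA check failed: {exc}")
    if info.device_count and not info.available:
        info.warnings.append("GPU detected but CUDA unavailable; falling back to CPU")
    info.resolved_device = "cuda" if info.available else "cpu"
    return info
```

Callers only need resolved_device; the warnings list is there for preflight reporting.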

NVIDIA Driver Symlink (nix + CUDA)

In devenv (nix-managed), CUDA libraries are isolated from the host system. The GPU module handles the nix+CUDA compatibility pattern by detecting the driver library path and ensuring PyTorch can find it. This avoids the common nix pitfall where torch.cuda.is_available() returns False despite GPUs being present.

Integration Points

Sentence-Transformer Embedding

embedding.py calls preflight_gpu() before initializing the SentenceTransformer model, passing device=gpu_info.resolved_device:

gpu_info = preflight_gpu()
model = SentenceTransformer("all-MiniLM-L6-v2", device=gpu_info.resolved_device)

GPU batch encoding achieves ~2,768 texts/second on RTX 4090 (vs ~400/s on CPU). This matters at scale: 15M columns takes ~90 minutes on GPU vs ~10 hours on CPU.

CatBoost Training

CatBoost training runs on the GPU when one is available, selected through CatBoost's task_type parameter (task_type="GPU"; the library default is CPU). The virtual ensemble posterior sampling that drives uncertainty quantification benefits from GPU parallelism.

Preflight Reporting

GPU status appears in just preflight output and in the /api/status gateway endpoint, giving operators immediate visibility into whether GPU acceleration is active.

Configuration

GPU detection is automatic — no configuration needed. The system probes hardware and falls back gracefully:

GPU available: uses CUDA for all embedding and training operations
GPU detected but CUDA unavailable: warns about library path issues, falls back to CPU
No GPU: runs entirely on CPU with no warnings

CAI Considerations

CAI ML workloads can request GPU runtimes. When running on a GPU-enabled CAI session:

The NVIDIA drivers are provided by the container runtime
PyTorch CUDA support depends on the Python runtime image
GPU memory is shared with other processes in the session
Background SHAP computation can be memory-intensive; monitor with nvidia-smi if running alongside large models

Synthetic Data & Training

The classification pipeline includes two ML evidence sources — CatBoost and SVM — that require training data. Atelier generates synthetic training data from the controlled vocabulary, trains both classifiers, and uses them as independent evidence sources in DST fusion.

Synth Generators

synth_generators.py is the single source of truth for 316+ hand-coded value generators shared across the synth framework, sample source generation, and the registry.

Each generator is a callable (rng: random.Random) -> str that produces realistic values for a category. Examples:

EMAIL → "j.smith@example.com", "alice.chen@corp.net"
SSN → "123-45-6789" (formatted US Social Security Number)
LATITUDE → "41.8781" (valid geographic coordinate)
CURRENCY_CODE → "USD", "EUR", "JPY"
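
Generators with this signature might look like the sketch below. These four functions and the DEMO_GENERATORS dict are hypothetical illustrations, not entries copied from synth_generators.py.

```python
import random
import string


def gen_email(rng):
    user = "".join(rng.choice(string.ascii_lowercase)
                   for _ in range(rng.randint(3, 8)))
    domain = rng.choice(["example.com", "corp.net", "mail.org"])
    return f"{user}@{domain}"


def gen_ssn(rng):
    # Formatted like a US SSN; the digits are random, not real numbers.
    return f"{rng.randint(1, 899):03d}-{rng.randint(1, 99):02d}-{rng.randint(1, 9999):04d}"


def gen_latitude(rng):
    return f"{rng.uniform(-90.0, 90.0):.4f}"


def gen_currency_code(rng):
    return rng.choice(["USD", "EUR", "JPY"])


DEMO_GENERATORS = {
    "EMAIL": gen_email,
    "SSN": gen_ssn,
    "LATITUDE": gen_latitude,
    "CURRENCY_CODE": gen_currency_code,
}
```

Taking the random.Random instance as an argument, rather than using the module-level functions, makes a generated corpus reproducible from a single seed.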

Three-Layer Generator Registry

synth_registry.py builds a complete generator set for any vocabulary through a priority-based registry:

Priority	Source	Description
1 (highest)	Hand-coded	From `GENERATORS` dict in synth_generators.py
2	Template	Real sample values with mild perturbation (±10% numeric jitter, character substitution)
3 (lowest)	Inferred	Regex pattern matching on category metadata (description, common_names)

registry = GeneratorRegistry.from_vocabulary(category_set)
# registry.coverage_summary() → {"hand-coded": 250, "template": 40, "inferred": 26}

The registry provides coverage_report() and coverage_summary() to identify categories without generators — important for vocabulary expansion.
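
The priority resolution can be sketched as a first-match lookup across the three layers. build_registry and the "missing" bucket are hypothetical; the real GeneratorRegistry API may differ beyond the method names mentioned above.

```python
def build_registry(categories, hand_coded, templates, inferred):
    """Resolve one generator per category; earlier layers win.

    Returns (generators, coverage), where coverage maps each category to the
    layer that served it, or "missing" if no layer could.
    """
    layers = [("hand-coded", hand_coded), ("template", templates),
              ("inferred", inferred)]
    generators, coverage = {}, {}
    for cat in categories:
        coverage[cat] = "missing"
        for name, layer in layers:
            if cat in layer:
                generators[cat] = layer[cat]
                coverage[cat] = name
                break
    return generators, coverage


def coverage_summary(coverage):
    summary = {}
    for layer in coverage.values():
        summary[layer] = summary.get(layer, 0) + 1
    return summary
```

Any category left in the "missing" bucket is one that needs a new generator before the vocabulary can be fully synthesized.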

Column Name Generation

Synthetic training data deliberately uses diverse column names to prevent classifiers from relying on name heuristics:

Semantic names: email_address, emailAddress, EMAIL_ADDR (snake_case, camelCase, uppercase variants, synonym-based)
Opaque names: field_42, col_abc, v_123 (~25% of columns)

This forces the ML models to learn from value patterns and context, not just column naming conventions.
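
A minimal sketch of the naming scheme, assuming a 25% opaque rate and three casing variants. The function names, the opaque prefixes, and the numeric suffix range are illustrative; synonym expansion is left out.

```python
import random


def name_variants(words):
    """words: lowercase tokens such as ["email", "address"]."""
    snake = "_".join(words)
    camel = words[0] + "".join(w.capitalize() for w in words[1:])
    return [snake, camel, snake.upper()]


def synth_column_name(words, rng, opaque_rate=0.25):
    if rng.random() < opaque_rate:
        # Opaque name: carries no hint about the category.
        return f"{rng.choice(['field', 'col', 'v'])}_{rng.randint(1, 999)}"
    return rng.choice(name_variants(words))
```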

ML Training Pipeline

ml_train.py orchestrates training for both classifiers:

synth_*.csv + reference_labels.json
        ↓
   _load_synth_data()
        ↓
   ┌────┴────┐
   ↓         ↓
  SVM     CatBoost
   ↓         ↓
 svm.pkl  catboost.cbm

SVM Path (Signals Architecture)

The SVM classifier uses the Pipeline + FeatureUnion composition adopted wholesale from the Signals project:

Build short text from column name + type + sample values via build_svm_text()
FeatureUnion extracts dual TF-IDF features:
- Character n-grams (3-6, char_wb analyzer) — captures subword patterns
- Word n-grams (1-2) — captures multi-word patterns
CalibratedClassifierCV(LinearSVC, method="sigmoid") — Platt scaling for calibrated probability estimates
_min_class_count() guard prevents calibration CV crash on small classes
Save to .pkl + .classes.json via joblib

The SVM operates on sparse lexical features — architecturally independent from the dense sentence-transformer embedding used by cosine and CatBoost. See Classification Pipeline for the full independence analysis.

CatBoost Path (GPU-accelerated)

Extract 12 features per column via features.extract_features()
Compute sentence-transformer embeddings (384-dim, GPU batch encoding)
Fit CatBoostColumnClassifier with:
- loss_function="MultiClass"
- posterior_sampling=True (virtual ensemble uncertainty)
- auto_class_weights="Balanced" (handle imbalanced categories)
Save to .cbm + .classes.json

Virtual Ensemble Uncertainty

CatBoost’s posterior_sampling=True enables Bayesian uncertainty quantification via virtual ensembles. The classifier produces not just class probabilities but per-class variance estimates. High variance translates to a higher DST discount factor — uncertain ML predictions carry less evidential weight in the fusion.

SVM Training (synth-only, with vocab alignment at inference)

The SVM is trained once on the synthetic corpus (scripts/generate_synth_source.py → ml_train.train_svm), with TF-IDF char-3-6gram + word-1-2gram features and labels keyed on the bundled-ontology ICE.* leaves from synth_generators.GENERATORS. At pipeline runtime, the ICE.* predictions are translated into the user’s taxonomy via the cached LLM-mediated alignment in atelier.classify.ontology_alignment (one LLM call per (vocabulary, model) tuple; result cached on disk under build/cache/alignment/).

data/synth/*.csv  +  ICE.* reference labels
        ↓
   train_svm()  (sklearn LinearSVC + TfidfVectorizer)
        ↓
   build/models/svm.pkl   (label space: ICE.* leaves)

────────  pipeline runtime  ──────────────────────

   svm.predict_proba(text)  →  {ICE.X: p, ICE.Y: q, ...}
        ↓
   translate_proba(proba, alignment)   ← from ontology_alignment
        ↓
   {user_code_A: p+q, user_code_B: r, ...}
        ↓
   svm_to_mass(...)  →  BeliefAssignment in user-taxonomy frame
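
The translation step in the diagram amounts to summing probabilities that land on the same user code. The sketch below assumes the alignment is a plain dict from ICE leaf to user code and that unmapped leaves are dropped; both are assumptions, and the real translate_proba in ontology_alignment may handle unmapped mass differently.

```python
def translate_proba(proba, alignment):
    """Sum ICE.* probabilities that map to the same user code.

    alignment maps ICE leaf -> user code. Leaves without a mapping are
    dropped here, which is an assumption of this sketch.
    """
    out = {}
    for ice_code, p in proba.items():
        user_code = alignment.get(ice_code)
        if user_code is not None:
            out[user_code] = out.get(user_code, 0.0) + p
    return out
```

If leaves are dropped, the output no longer sums to 1. Whether that residual should be treated as ignorance is a design decision for the downstream svm_to_mass step.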

Historical note — earlier revisions of this design ran a mid-loop train_svm_on_frontier_labels (historical function name) that retrained the SVM on live LLM labels and hot-swapped the result into the active model slot. That path was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for the source-independence reasons documented in ontology_alignment.py. The current design preserves the SVM’s TF-IDF independence at the feature and label level; the only LLM dependency is the per-vocabulary alignment table, which is vocabulary-level rather than column-level shared error. See ontology_alignment.py module docstring for the full independence argument and the BM25-reranker future-work plan.

Train-Eval Cycle

train_eval_cycle.py orchestrates the full loop:

Generate synthetic data from vocabulary
Train CatBoost + SVM models
Classify using the trained models
Evaluate against the curated reference

This runs as part of the classification pipeline when models don’t exist yet, or can be triggered explicitly for experimentation.

SAGE Feature Importance

sage.py computes global feature importance via permutation-based SAGE values. Each of the 12 discrete features is ablated and the classification accuracy impact measured:

High SAGE value = feature is critical for classification
Low SAGE value = feature adds little discriminative power

SAGE runs on the directly-LLM-classified sampled subset when MC sampling is active (representative by stratification design), reducing computation at scale.

SHAP Per-Item Attribution

shap_explanations.py provides per-column explanations for why each column was classified as it was:

Method	Algorithm	Speed	When Used
CatBoost TreeSHAP	Exact O(TLD²) built-in	~0.1s for 50 items	Auto when CatBoost loaded
PermutationSHAP	`shap.PermutationExplainer`	~50s/item	Explicit request only

Each classification gains 6 SHAP columns: shap_top1_name, shap_top1_value, shap_top2_name, shap_top2_value, shap_top3_name, shap_top3_value.
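
Flattening one column's SHAP values into the six output columns can be sketched as below. Ranking by absolute value is an assumption; the text does not say how the top three are chosen.

```python
def shap_top3_columns(shap_values):
    """shap_values: {feature_name: signed SHAP value} for one classified column."""
    ranked = sorted(shap_values.items(), key=lambda kv: abs(kv[1]),
                    reverse=True)[:3]
    row = {}
    for i in range(3):
        # Pad with None when fewer than three features are present.
        name, value = ranked[i] if i < len(ranked) else (None, None)
        row[f"shap_top{i + 1}_name"] = name
        row[f"shap_top{i + 1}_value"] = value
    return row
```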

Background SHAP

For large corpora, SHAP can run in a background thread while the pipeline proceeds to EVALUATING. Controlled by the HOCON flag:

classify {
  background_analysis = true
  background_analysis = ${?ATELIER_BACKGROUND_ANALYSIS}
}

Set to false on CAI if background threads cause runtime issues.

Key Files

File	Role
`synth_generators.py`	316+ hand-coded value generators
`synth_registry.py`	Three-layer registry: hand-coded > template > inferred
`synth.py`	Synthetic data generation with diverse column names
`ml_train.py`	Training orchestrator: synth-only CatBoost + synth-only SVM (ICE.* labels)
`catboost_classifier.py`	CatBoost with virtual ensemble uncertainty
`svm_classifier.py`	Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals)
`train_eval_cycle.py`	Generate → train → classify → evaluate loop
`sage.py`	Global SAGE feature importance
`shap_explanations.py`	Per-item SHAP attribution

Embeddings

The Embeddings page provides interactive visualization of classification results. It renders 2D projections of embedding vectors, allowing users to explore clusters, search data points, and cross-filter by metadata columns.

Architecture

The viewer runs entirely in the browser. DuckDB WASM loads parquet data locally and the EmbeddingAtlas component (from Apple’s embedding-atlas library) renders the visualization using WebGPU with WebGL 2 fallback.

Data Flow

Backend serves the parquet file via /api/datasets/{id}/data
React fetches the parquet and loads it into DuckDB WASM via a Mosaic coordinator
EmbeddingAtlas queries the DuckDB table for rendering: x/y coordinates, categories, text for tooltips
All filtering, search, and aggregation happens client-side — no round-trips to the server

Parquet Schema

The Embeddings page expects parquet files with these columns:

Column	Type	Required	Description
`id`	string	yes	Unique row identifier
`x`	float32	yes	2D projection x-coordinate (UMAP)
`y`	float32	yes	2D projection y-coordinate (UMAP)
`text`	string	recommended	Tooltip and search text
`category`	string	recommended	Color-coding category

Additional columns (e.g., source_table, belief, plausibility) are automatically available as cross-filter charts.
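
A pre-upload check against this schema might look like the sketch below. It works on a plain {column: type} mapping, however that was obtained, so it stays independent of any particular parquet library. The function name and message wording are hypothetical.

```python
REQUIRED = {"id": "string", "x": "float32", "y": "float32"}
RECOMMENDED = {"text": "string", "category": "string"}


def check_schema(schema):
    """schema: {column_name: type_name}. Returns (errors, warnings, extras)."""
    errors = [f"missing required column: {c}" for c in REQUIRED if c not in schema]
    errors += [
        f"{c}: expected {t}, got {schema[c]}"
        for c, t in REQUIRED.items()
        if c in schema and schema[c] != t
    ]
    warnings = [f"missing recommended column: {c}"
                for c in RECOMMENDED if c not in schema]
    # Extras become cross-filter charts in the viewer.
    extras = sorted(set(schema) - set(REQUIRED) - set(RECOMMENDED))
    return errors, warnings, extras
```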

GitTables Dataset

The initial dataset is derived from the GitTables CTA benchmark — 2,517 columns extracted from real tables, annotated with 122 DBpedia property types. These instance labels serve as the controlled vocabulary to be grounded in the SIGDG ontology.

To prepare the visualization parquet:

# From signals evaluation output (recommended)
just prepare-gittables ~/local/src/cldr/signals/build/gittables_eval.parquet

# Then seed the database
just seed

The preparation script computes sentence-transformer embeddings and UMAP 2D projections. The resulting parquet includes DST evidence fusion columns (belief, plausibility, uncertainty gap) when derived from the signals evaluation output.

Naming: Embeddings vs Apache Atlas

The Embeddings page is powered by Apple’s embedding-atlas library. This is unrelated to Apache Atlas, the Cloudera metadata governance catalog used by the signals pipeline.

Embeddings (Atelier) — Interactive scatter plot of classification embeddings
Apache Atlas (Cloudera/signals) — Metadata governance catalog on port 21000

To avoid confusion, all user-facing surfaces use “Embeddings”. The embedding-atlas library name appears only in developer documentation and package.json.

Data Sources & Versioning

Atelier organizes classification work around data sources — each source contains input tables, and every pipeline run against a source produces a new dataset version. This replaces the earlier flat dataset model and enables the OOTB onboarding experience.

Data Model

DataSource (1)                      Dataset versions (N)
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "ootb-sample"       │───1:N──│ v3 (active) — 2 min ago  │
│ type: "sample"          │        │ v2 — yesterday           │
│ display: "Sample"       │        │ v1 — built-in            │
│ vocab_mode: "universal" │        └──────────────────────────┘
└─────────────────────────┘
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "hive-prod-default" │───1:N──│ v1 (active) — 1 hour ago │
│ type: "hive"            │        └──────────────────────────┘
│ display: "hive:prod/…"  │
│ vocab_mode: "hive"      │
└─────────────────────────┘

Source Types

Type	Tables loaded from	Vocabulary	Created by
`sample`	`data/sample/tables/*.csv`	Expanded ICE ontology (316 leaves)	Auto-seeded on first boot
`hive`	CAI data connection	Domain annotations from `vocab_uri`	User creates via Status page
`synth`	`data/synth/tables/*.csv`	Domain annotations from `vocab_uri`	Generated by `scripts/generate_synth_source.py`

Vocabulary routing: For in-situ classification, the customer’s domain vocabulary IS the classification target — the LLM reads labels and descriptions and classifies into the domain’s hierarchical dot-codes. The annotations table location is configured per source via vocab_uri (e.g. meta.vocab, meta.annotations), decoupling data tables from the vocabulary. Multiple sources can share the same annotations table.

Future work: A portable pre-trained model (classify-ICE-then-map) would classify against the built-in ICE vocabulary and translate results to customer terms via VocabMapping. This requires dedicated training hardware and is not yet implemented.

Database Schema

CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,          -- 'sample' | 'hive' | 'synth'
    source_uri TEXT NOT NULL DEFAULT '',
    display_name TEXT NOT NULL,
    vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
    vocab_uri TEXT NOT NULL DEFAULT '',  -- e.g. 'meta.vocab', 'meta.annotations'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT                       -- JSON: table_count, column_count
);

-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;
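
How a new version might be recorded against this schema is sketched below with the standard-library sqlite3 module. The tables are trimmed to the columns the example needs, and the add_version helper, with its "deactivate the old version, insert the next number" logic, is an assumption about how versioning works rather than the gateway's actual code.

```python
import sqlite3

# Trimmed schema: only the columns this sketch touches.
SCHEMA = """
CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,
    display_name TEXT NOT NULL
);
CREATE TABLE datasets (
    id INTEGER PRIMARY KEY,
    source_id TEXT REFERENCES data_sources(id),
    version_number INTEGER NOT NULL DEFAULT 1,
    is_active BOOLEAN NOT NULL DEFAULT 1,
    fsm_run_id TEXT
);
"""


def add_version(conn, source_id, fsm_run_id):
    """Insert the next version for a source and make it the only active one."""
    with conn:  # single transaction: commit on success, roll back on error
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version_number), 0) FROM datasets "
            "WHERE source_id = ?",
            (source_id,),
        ).fetchone()
        conn.execute("UPDATE datasets SET is_active = 0 WHERE source_id = ?",
                     (source_id,))
        conn.execute(
            "INSERT INTO datasets (source_id, version_number, is_active, fsm_run_id) "
            "VALUES (?, ?, 1, ?)",
            (source_id, latest + 1, fsm_run_id),
        )
    return latest + 1


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
conn.execute("INSERT INTO data_sources VALUES ('ootb-sample', 'sample', 'Sample')")
```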

Vocabulary Routing

When a pipeline run starts, the source_id determines which vocabulary loads:

ootb-sample: load_sample_vocabulary() → data/sample/ontology.json (316 BFO-grounded leaves across the CCO ICE trichotomy)
hive/synth: Domain annotations loaded directly from the table specified by vocab_uri. The domain vocabulary IS the classification target — no composition with the universal base. Hive sources always require an annotations table.
No source: Falls back to universal vocabulary (16 PII leaves)
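
The routing rules above can be sketched as a single dispatch function. This is a minimal illustration, not the actual code in pipeline.py: `route_vocabulary`, `load_domain_annotations`, and `load_universal_vocabulary` are hypothetical names, the loaders return placeholder strings, and treating a missing vocab_uri as an error for synth sources (not only hive) is an assumption.

```python
from typing import Optional

# Placeholder loaders. Only load_sample_vocabulary() is named in this
# document; the other two names are hypothetical.
def load_sample_vocabulary() -> str:
    return "sample-ontology"

def load_domain_annotations(vocab_uri: str) -> str:
    return f"annotations:{vocab_uri}"

def load_universal_vocabulary() -> str:
    return "universal-pii"

def route_vocabulary(source: Optional[dict]) -> str:
    """Pick the vocabulary for a run from a data_sources row (or None)."""
    if source is None:
        # No source: fall back to the universal vocabulary.
        return load_universal_vocabulary()
    source_type = source["source_type"]
    if source_type == "sample":
        return load_sample_vocabulary()
    if source_type in ("hive", "synth"):
        vocab_uri = source.get("vocab_uri", "")
        if not vocab_uri:
            # Assumption: both hive and synth fail without an annotations table.
            raise ValueError(f"{source_type} source requires a vocab_uri")
        # The domain vocabulary is the target; no composition with the base.
        return load_domain_annotations(vocab_uri)
    raise ValueError(f"unknown source_type: {source_type!r}")
```

For example, `route_vocabulary({"source_type": "hive", "vocab_uri": "meta.vocab"})` resolves to the annotations loader, while `route_vocabulary(None)` falls back to the universal vocabulary.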

LLM Robustness

The LLM classification batch uses adaptive sizing to avoid context truncation. With large vocabularies (>200 categories), the system prompt embedding the full category table can consume significant context.

Adaptive batch sizing: _estimate_safe_batch_size() reduces columns_per_call for large vocabularies (e.g. with 290 categories, columns_per_call drops to 41)
Truncation retry: When LLMResponse.truncated is detected, the batch is halved and retried recursively until all columns are classified
Metrics: truncation_count and effective_batch_size tracked in BootstrapState and exposed via the agent’s check_convergence tool
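
The halve-and-retry behavior can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's implementation: `BatchResult`, `RetryStats`, and `classify_with_retry` are hypothetical names, the LLM is an injected callable, and raising when a single-column call is still truncated is an assumed policy the document does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class BatchResult:
    labels: Dict[str, str]   # column name -> category code
    truncated: bool          # stands in for LLMResponse.truncated

@dataclass
class RetryStats:
    truncation_count: int = 0

def classify_with_retry(
    columns: List[str],
    call_llm: Callable[[List[str]], BatchResult],
    stats: RetryStats,
) -> Dict[str, str]:
    """Classify columns, halving the batch whenever the response is truncated."""
    if not columns:
        return {}
    result = call_llm(columns)
    if not result.truncated:
        return result.labels
    stats.truncation_count += 1
    if len(columns) == 1:
        # Assumed policy: a single column that still truncates is an error.
        raise RuntimeError(f"response truncated for single column {columns[0]!r}")
    mid = len(columns) // 2
    left = classify_with_retry(columns[:mid], call_llm, stats)
    right = classify_with_retry(columns[mid:], call_llm, stats)
    return {**left, **right}

# Usage with a fake backend that truncates any batch larger than two columns.
def fake_llm(cols: List[str]) -> BatchResult:
    if len(cols) > 2:
        return BatchResult({}, truncated=True)
    return BatchResult({c: "ICE.METADATA.STATUS" for c in cols}, truncated=False)

stats = RetryStats()
labels = classify_with_retry([f"col_{i}" for i in range(5)], fake_llm, stats)
```

With five columns and a backend that only fits two per call, the batch is split twice and every column ends up classified.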

Sample Source

The built-in “Sample” source (source_id ootb-sample) ships with Atelier so new deployments show meaningful data immediately. When the landing page loads and “Connected” turns green, the stats cards show 316 Terms and 316 Entities. The ootb- prefix in the id is an internal marker distinguishing shipped sources from user-registered connections — it is not shown in the UI.

Expanded Vocabulary (ICE.* Ontology)

The vocabulary follows the CCO ICE (Information Content Entity) trichotomy, grounded in BFO via atelier-vocab.ttl:

ICE (root) ≡ cco:InformationContentEntity
├── ICE.NONSENSITIVE
│   ├── ICE.NONSENSITIVE.DESIGNATIVE   ⊑ cco:DesignativeICE
│   │   ├── .NAME (.PERSON, .ORG, .PRODUCT, .SCIENTIFIC)
│   │   ├── .CODE (.ID, .ABBREV, .POSTAL)
│   │   ├── .GEO  (.COUNTRY, .REGION, .CITY, .LOCATION)
│   │   ├── .REF  (.CITATION, .VERSION, .SOURCE)
│   │   └── .TITLE
│   ├── ICE.NONSENSITIVE.DESCRIPTIVE   ⊑ cco:DescriptiveICE
│   │   ├── .TEXT (.DESCRIPTION, .COMMENT, .ABSTRACT, .DEFINITION)
│   │   ├── .CATEGORICAL (.TYPE, .CATEGORY, .RANK, .LANGUAGE)
│   │   ├── .MEASUREMENT (~20 subtypes)
│   │   └── .TEMPORAL (.DATE, .YEAR, .DURATION, .PERIOD, …)
│   └── ICE.NONSENSITIVE.PRESCRIPTIVE  ⊑ cco:PrescriptiveICE
│       └── .FORMAT, .FORMULA, .ROUTE, .ROLE
├── ICE.SENSITIVE
│   ├── ICE.SENSITIVE.PID (~40 leaves: CONTACT, IDENTITY, FINANCIAL, HEALTH)
│   ├── ICE.SENSITIVE.TECHNICAL (IPADDR, DEVID, URL, HOSTNAME, …)
│   └── ICE.SENSITIVE.BUSINESS (.TRADE_SECRET, .CONTRACT_VALUE, …)
└── ICE.METADATA
    └── .TIMESTAMP, .RECID, .STATUS, .VERSION, .CREATED_BY, …

351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.

Design principle: every category is our own BFO-grounded term. External sources (GitTables, meta-tagging) inform which conceptual space to cover; we never import their raw tags. The mapping goes outward from our vocabulary via atelier-vocab.ttl, not inward.

Sample Tables

25 mixed-domain tables with 316 columns (100 rows each). Tables are deliberately cross-domain — a customers table contains identity, contact, metadata, and categorical columns — so the classification pipeline cannot rely on table name alone.

~25% of columns use opaque names (field_42, var_abc, col_xyz) to exercise the pipeline’s ability to classify from values and context rather than column name heuristics.

Generated by scripts/generate_sample_source.py. The curated reference for the Sample source fixture is committed in data/sample/reference_labels.json (scope: fixture-only, for OOTB demo and unit tests).

For UAT / production evaluation, the curated reference lives at build/meta-tagging-clean/curated_reference.csv (gitignored). It is built by scripts/parity/build_curated_reference.py from direct reference-column evidence plus name-index lookup, with Ontology > Annotation > Common Names priority. UAT’s own classification outputs are provisional predictions; they are scored against this curated reference, with the scoring reported in build/results/parity/delta_report.md.

Auto-Import on First Boot

The gateway seeds the Sample source (id ootb-sample) via a FastAPI lifespan context manager:

Check if ootb-sample source has any dataset versions
If none, read sample_source_stats() (table count, column count)
Create dataset version 1 with the stats as metadata
Update source metadata JSON

This runs once at startup. If the database isn’t ready (migrations haven’t run), seeding is silently skipped.
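
The four seeding steps can be sketched with an async context manager of the shape FastAPI accepts through its `lifespan=` argument. This is a dependency-free illustration: `InMemoryDao`, `seed_sample_source`, and `make_lifespan` are hypothetical names, the stats are hard-coded from the Sample Tables section, and modelling "database not ready" as a RuntimeError is an assumption.

```python
import asyncio
from contextlib import asynccontextmanager

class InMemoryDao:
    """Hypothetical stand-in for the real DAO."""
    def __init__(self, ready: bool = True):
        self.ready = ready
        self.versions = {}          # source_id -> list of version dicts
        self.source_metadata = {}   # source_id -> metadata dict

    def list_versions(self, source_id):
        if not self.ready:
            raise RuntimeError("schema not migrated")
        return self.versions.get(source_id, [])

    def create_version(self, source_id, version_number, metadata):
        self.versions.setdefault(source_id, []).append(
            {"version_number": version_number, "metadata": metadata}
        )

    def update_source_metadata(self, source_id, metadata):
        self.source_metadata[source_id] = metadata

def sample_source_stats():
    # Hard-coded here; the real function reads the sample tables.
    return {"table_count": 25, "column_count": 316}

def seed_sample_source(dao, source_id="ootb-sample") -> bool:
    """Return True when a version was created, False when seeding was skipped."""
    try:
        if dao.list_versions(source_id):       # step 1: versions already exist
            return False
        stats = sample_source_stats()          # step 2
        dao.create_version(source_id, 1, stats)        # step 3
        dao.update_source_metadata(source_id, stats)   # step 4
        return True
    except RuntimeError:
        return False                           # database not ready: skip silently

def make_lifespan(dao):
    @asynccontextmanager
    async def lifespan(app):
        seed_sample_source(dao)   # runs once, before the app serves requests
        yield
    return lifespan

async def boot_twice(dao):
    for _ in range(2):
        async with make_lifespan(dao)(None):
            pass

dao = InMemoryDao()
asyncio.run(boot_twice(dao))
```

Booting twice leaves exactly one dataset version, which is the idempotence the startup check is meant to provide.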

API

REST Endpoints

Endpoint	Method	Description
`/api/data-sources`	GET	List all data sources
`/api/data-sources`	POST	Create a new data source
`/api/datasets?source_id=X`	GET	List versions for a source
`/api/datasets/{id}/activate`	POST	Set a version as active
`/api/vocabulary/stats?source_id=X`	GET	Term count (source-aware)
`/api/fsm/start?source_id=X`	POST	Start pipeline for a source

gRPC RPCs

RPC	Description
`ListDataSources()`	List all sources
`StartClassification(source_id=…)`	Start pipeline for a source

UI Integration

The Status page has two new cards:

Data Source card: dropdown selector for sources + version table showing version number, column count, timestamp, and summary. Click a row to activate that version.
Classification Pipeline card: “Start Classification” passes activeSourceId to /api/fsm/start?source_id=…

The Landing page stats cards reflect the active source:

Terms: vocabulary size for the active source (316 for the Sample source)
Entities: column count from the active dataset version
Sources badge: shows count when multiple sources exist

DatasetContext

interface DatasetContextValue {
  sources: DataSourceInfo[];
  activeSourceId: string | null;
  setActiveSourceId: (id: string) => void;
  datasets: DatasetInfo[];           // for activeSourceId
  activeDatasetId: string | null;
  setActiveDatasetId: (id: string) => void;
  refreshSources: () => Promise<void>;
  refreshDatasets: () => Promise<void>;
}

Key Files

File	Role
`db/migrations/20260414…_data_sources_and_versions.sql`	Schema migration
`src/atelier/db/model.py`	`DataSource` ORM model
`src/atelier/db/dao.py`	Source + version DAO methods
`src/atelier/classify/sampler.py`	`load_sample_source()`, `sample_source_stats()`
`src/atelier/classify/taxonomy.py`	`load_sample_vocabulary()`
`src/atelier/classify/pipeline.py`	Source-aware routing
`src/atelier/gateway.py`	REST endpoints + auto-import lifespan
`data/sample/ontology.json`	Expanded vocabulary (316 leaves)
`data/sample/tables/*.csv`	25 sample tables
`data/sample/reference_labels.json`	316-entry Sample-source fixture reference labels
`build/meta-tagging-clean/curated_reference.csv` (gitignored)	UAT-corpus curated reference
`scripts/expand_vocabulary.py`	Vocabulary expansion script
`scripts/generate_sample_source.py`	Sample table generation script
`ui/src/contexts/DatasetContext.tsx`	Source-aware React context
`ui/src/pages/Status.tsx`	Data source + version UI

ML Artifact Management + Extend Classification

Each Atelier classify run trains a CatBoost classifier, optionally an SVM classifier (synth-trained, with runtime LLM-mediated alignment to the user vocabulary — see ontology_alignment.py), and (when umap-learn handles the projection) a fitted UMAP reducer. The ML Artifact Set feature makes those trained models first-class entities — registered in PG, listed in the UI, and replayable on new data through a streamlined Extend Classification pipeline that skips the LLM sweep, DST iteration, and agent loop.

Why

The classify pipeline costs tens of minutes and (on Bedrock / Anthropic direct) tens of dollars per run. When the governance team adds new tables to an existing Hive database, or stands up a new Hive / Impala database with the same taxonomy, re-running the full pipeline is the wrong tool — there’s no new agent-mediated reference to learn from, and the LLM sweep adds nothing the trained CatBoost can’t reproduce at >100x speed. Extend Classification is the right shape: load the trained artifacts, predict on the new columns, write a parquet, register a new dataset. Done.

The data model deliberately tracks lineage in OpenLineage terminology (Run → Job → Dataset → Facet) so Marquez or a similar lineage backend can be wired in later without remodeling. The pathspec scheme (run id-keyed artifact directories) borrows from Metaflow’s DataStore addressing — every artifact resolves to build/results/{run_id}/{filename}.

Concepts

Term	What it is	Where it lives
Data Source	A configured source (Hive DB, Impala DB, OOTB Sample, filesystem mount).	`data_sources` table
Dataset	One classify or extend run’s output parquet, versioned per source.	`datasets` table
FSM Run	One pipeline invocation (classify or extend).	`fsm_runs` table
ML Artifact Set	The bundle a classify run produced: CatBoost (.cbm + .classes.json), optional SVM (.pkl + .classes.json), optional UMAP (.pkl), plus vocab signature and embedding-model identity.	`ml_artifact_sets` table
Active Artifact Set	The single ArtifactSet a future Extend run will use.	`ml_artifact_sets.is_active` (partial unique index enforces only-one-active)
Classify Run	The full LLM + DST + agent pipeline. Produces a Dataset AND an ArtifactSet.	`run_kind = 'classify'` on the dataset row
Extend Run	The streamlined ML-only pipeline. Consumes an ArtifactSet, produces a Dataset only.	`run_kind = 'extend'`

Database schema

The migration 20260427000000_ml_artifact_sets.sql adds ml_artifact_sets and three lineage columns on datasets:

ml_artifact_sets:
  id, source_id (→ data_sources.id), fsm_run_id (→ fsm_runs.id),
  parent_artifact_set_id (self-FK),
  catboost_path, catboost_classes_path,
  svm_path?, svm_classes_path?, umap_path?,
  classes (JSON), feature_groups (JSON),
  vocab_signature (sha256(sorted(classes))),
  embedding_model, embedding_dim,
  display_name, summary, is_active, is_archived,
  facets (JSON, OpenLineage projection),
  created_at

datasets (added):
  artifact_set_id (→ ml_artifact_sets.id),
  parent_dataset_id (→ datasets.id),  -- extend lineage
  run_kind ('classify' | 'extend')

The partial unique index idx_ml_artifact_sets_one_active ON (is_active) WHERE is_active = TRUE is the Postgres-side guarantee that only one row may be active globally at any time. The DAO’s set_active_artifact_set runs the demote + promote in a single transaction so the index constraint never sees two TRUE rows.
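
The interaction between the partial unique index and the single-transaction demote + promote can be tried locally. The sketch below swaps in SQLite (which also supports partial indexes) for Postgres and reduces the table to two columns; the function body is an assumption about what the DAO does, not a copy of it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ml_artifact_sets ("
    " id TEXT PRIMARY KEY,"
    " is_active INTEGER NOT NULL DEFAULT 0)"
)
# Only rows with is_active = 1 are indexed, so at most one such row can exist.
conn.execute(
    "CREATE UNIQUE INDEX idx_ml_artifact_sets_one_active"
    " ON ml_artifact_sets (is_active) WHERE is_active = 1"
)
conn.executemany(
    "INSERT INTO ml_artifact_sets (id) VALUES (?)", [("run-a",), ("run-b",)]
)
conn.commit()

def set_active_artifact_set(conn, artifact_set_id):
    """Demote the current active row and promote the target in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE ml_artifact_sets SET is_active = 0 WHERE is_active = 1")
        cur = conn.execute(
            "UPDATE ml_artifact_sets SET is_active = 1 WHERE id = ?",
            (artifact_set_id,),
        )
        if cur.rowcount != 1:
            # Rolling back also restores the previously active row.
            raise LookupError(f"no artifact set {artifact_set_id!r}")

def active_ids(conn):
    rows = conn.execute("SELECT id FROM ml_artifact_sets WHERE is_active = 1")
    return [r[0] for r in rows]

set_active_artifact_set(conn, "run-a")
set_active_artifact_set(conn, "run-b")
```

Promoting a second row without demoting the first raises an integrity error from the index, which is why the demote must come first inside the same transaction.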

On-disk layout

Each classify run writes to build/results/{run_id}/:

build/results/{run_id}/
  catboost_fit_to_llm.cbm                  # required
  catboost_fit_to_llm.classes.json         # required (classes + feature_groups sidecar)
  svm_frontier.pkl                         # optional (skipped if fit-to-LLM didn't fire)
  svm_frontier.classes.json                # optional
  umap.pkl                                 # optional (only when CPU umap-learn was used)
  atelier_embeddings.parquet               # the dataset
  classifications.json                     # full per-column results
  evaluation_report.json                   # accuracy stats
  settings_snapshot.json                   # config-at-start
  taxonomy_findings.json                   # vocab QA
  ...

atelier.classify.artifact_set is the single point of knowledge about this layout — it builds the artifact-set record from a run dir and loads the bundle for an Extend run.

Pipeline writes (classify side)

At the end of EVALUATING, after the dataset row is upserted:

# pipeline.py (paraphrased)
parquet_path = _write_parquet(...)        # also persists umap.pkl
                                          # alongside via joblib
dao.upsert_dataset(
    ..., artifact_set_id=run_id, run_kind='classify',
    parent_dataset_id=None,
)

spec = build_artifact_set_record(
    run_id=run_id, results_dir=results_dir, cfg=cfg,
    n_columns=len(classifications),
    source_id=source_id, fsm_run_id=run_id,
)
if spec is not None:
    dao.register_artifact_set(**spec)

The first registered artifact set on a fresh deploy auto-activates (idempotent — subsequent registrations don’t steal active). Registration failures are non-fatal: the dataset still ships.

Extend pipeline

atelier.classify.extend_pipeline.run_extend_classification orchestrates the streamlined runner. Phase walk:

IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING
     → CLASSIFYING → FUSING → EVALUATING → CONVERGED

No new FSM states — SAMPLING → CLASSIFYING is already a legal transition (the full pipeline uses the same edge when synthesis is disabled). Production guards run BEFORE the FSM run is created:

Artifact-set existence — DAO lookup must return non-NULL, non-archived row.
File-existence preflight — every non-NULL path on the row must exist on disk (catboost + sidecar required; SVM / UMAP optional but when set must be present). Stale DB pointers fail fast.
Embedding-model identity — the artifact’s embedding_model field must equal the runtime cfg.embedding_model. Catches the BGE-large vs MiniLM swap that would silently produce nonsense predictions.
Vocab compatibility — surfaces in progress.vocab_compatibility as one of ok | superset | partial | disjoint. Warns but does NOT block (per the project decision); the artifact’s training classes drive the runtime taxonomy of the extend run.
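
The first three guards can be sketched as one preflight function. This is a sketch under assumptions: the row is a plain dict keyed by the ml_artifact_sets column names, `PreflightError` and `preflight` are hypothetical names, and "bge-large" is an illustrative model identifier. The vocab-compatibility guard is omitted because it warns rather than blocks.

```python
import tempfile
from pathlib import Path
from typing import Mapping, Optional

REQUIRED = ("catboost_path", "catboost_classes_path")
OPTIONAL = ("svm_path", "svm_classes_path", "umap_path")

class PreflightError(RuntimeError):
    """Raised before any FSM run is created."""

def preflight(row: Optional[Mapping], runtime_embedding_model: str) -> None:
    # Guard 1: the artifact set must exist and must not be archived.
    if row is None or row.get("is_archived"):
        raise PreflightError("artifact set missing or archived")
    # Guard 2: required paths must be set; every non-NULL path must exist.
    for key in REQUIRED:
        if not row.get(key):
            raise PreflightError(f"{key} is required")
    for key in REQUIRED + OPTIONAL:
        path = row.get(key)
        if path and not Path(path).exists():
            raise PreflightError(f"stale pointer: {key} -> {path}")
    # Guard 3: the embedding model must match the runtime configuration.
    if row.get("embedding_model") != runtime_embedding_model:
        raise PreflightError(
            f"embedding model mismatch: artifact={row.get('embedding_model')!r}, "
            f"runtime={runtime_embedding_model!r}"
        )

# Usage with throwaway files standing in for a run directory.
outcomes = []
with tempfile.TemporaryDirectory() as run_dir:
    cbm = Path(run_dir) / "catboost_fit_to_llm.cbm"
    sidecar = Path(run_dir) / "catboost_fit_to_llm.classes.json"
    cbm.touch()
    sidecar.touch()
    row = {
        "catboost_path": str(cbm),
        "catboost_classes_path": str(sidecar),
        "embedding_model": "all-MiniLM-L6-v2",
    }
    preflight(row, "all-MiniLM-L6-v2")  # passes silently
    stale = {**row, "umap_path": str(Path(run_dir) / "umap.pkl")}
    for candidate, model in [(stale, "all-MiniLM-L6-v2"), (row, "bge-large")]:
        try:
            preflight(candidate, model)
        except PreflightError as exc:
            outcomes.append(str(exc))
```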

Inference is intentionally simple — no DST iteration:

CatBoost predict_proba per column → top-1 = primary prediction.
(Optional) SVM predict_proba → second look; soft confidence haircut on disagreement.
belief = top1_p, plausibility = top1_p + (1 − sum_top3), conflict = 0.0 (clear “ML-only inference” marker for the UI).
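
The interval arithmetic in the last step can be written out directly. A minimal sketch, assuming the input is one column's CatBoost probability vector; `ml_only_interval` is a hypothetical name, the clamping to [0, 1] is an added safeguard against rounding, and the optional SVM haircut is left out because its formula is not specified here.

```python
from typing import Dict, Sequence

def ml_only_interval(probs: Sequence[float]) -> Dict[str, float]:
    """Derive a belief interval from one column's class probabilities."""
    top3 = sorted(probs, reverse=True)[:3]
    top1_p = top3[0]
    # Mass outside the top three classes is treated as ignorance.
    residual = max(0.0, 1.0 - sum(top3))
    return {
        "belief": top1_p,
        "plausibility": min(1.0, top1_p + residual),
        "conflict": 0.0,  # marker for ML-only inference
    }

interval = ml_only_interval([0.70, 0.15, 0.10, 0.05])
```

For the vector above, the top three classes cover 0.95 of the mass, so the interval is roughly [0.70, 0.75]: the width reflects only the probability the model spread beyond its top three guesses.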

UMAP transforms via bundle.umap_model.transform() when the bundle includes a fitted reducer (lands in the parent run’s coordinate space). Falls back to a fresh fit_transform when no UMAP was bundled — Extend coordinates differ from the parent’s; the divergence is recorded in settings_snapshot.json.

Gateway endpoints

GET    /api/artifact-sets[?source_id=&include_archived=]
GET    /api/artifact-sets/{id}
POST   /api/artifact-sets/{id}/activate
POST   /api/artifact-sets/{id}/archive
POST   /api/artifact-sets/{id}/unarchive
GET    /api/artifact-sets/{id}/compatibility?source_id=
POST   /api/fsm/extend                  body: {source_id,
                                                artifact_set_id,
                                                parent_dataset_id?}

The /api/fsm/extend endpoint mirrors /api/fsm/start’s background-thread plumbing so the existing /api/fsm/status polling carries the run through to the UI without any new client-side wiring. Returns 404 synchronously when artifact_set_id is missing from the DB; 409 when another FSM run is in flight.

UI

The Status page renders a new ML Artifacts panel between the Classification Pipeline panel and the Data Source panel. Composition mirrors DataSourceCard for visual continuity:

Header (extra slot): active source / dataset indicator (read-only), Refresh button, Extend Classification primary button.
Table columns: Active (radio) / Run ID (linked to overwatch when fsm_run_id is set) / Created / Summary / Models (CB / SVM / UMAP chips with informative tooltips) / Archive (trash icon).

The Data Source panel was reworked to match: it now has a leftmost Active column with Radio cells, and the Version column lost its inline [active] chip. Click row OR click radio → activate.

OpenLineage projection

atelier.classify.oplineage_emit.build_run_event projects an FSM run into an OpenLineage event dict. The Job is atelier.classify or atelier.extend_classify; the Run is fsm_runs.id. Outputs include the parquet plus one Dataset entry per artifact file (CatBoost / SVM / UMAP), each carrying a zndx_ml_artifact custom facet with framework, vocab_signature, embedding_model, classes_count.

Extend runs additionally emit a ParentRunFacet linking back to the classify run that produced the consumed ArtifactSet, which is the OpenLineage-canonical way to express “this run is a descendant of run X”.

On day one we don’t wire the HTTP transport: the projection is pure, and operators who configure Marquez later only need to add the POST plumbing. The custom zndx_ml_artifact and zndx_extend_lineage facets follow the OpenLineage custom-facet convention with _producer and _schemaURL attributes pointing at our schemas.

BDD coverage

features/agent/artifact_set.feature (tier-0): vocab signature determinism, signature stability under reordering, all four compatibility statuses (ok / superset / partial / disjoint).
features/agent/extend_pipeline.feature (tier-1, @gpu): Extend produces a Dataset with run_kind='extend', the dataset references the consumed artifact set, the run NEVER invokes an LLM (structural proof — run_extend_classification doesn’t accept an llm_backend parameter), vocab compatibility surfaces, atlas-compatible files appear in the run dir.
features/gateway/artifact_sets.feature (tier-1, @gpu): seven scenarios covering list / get / activate / compatibility / extend body validation / 404 paths.
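
The signature properties exercised by the tier-0 feature (determinism and stability under reordering) follow from hashing the sorted class list. A minimal sketch of sha256(sorted(classes)): serializing the sorted list as compact JSON before hashing is an assumption, since the schema does not state how the list is encoded.

```python
import hashlib
import json
from typing import Iterable

def vocab_signature(classes: Iterable[str]) -> str:
    """Order-independent fingerprint of a class list."""
    # Assumed encoding: compact JSON of the sorted list, UTF-8 bytes.
    payload = json.dumps(sorted(classes), separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

sig_a = vocab_signature(["ICE.METADATA.STATUS", "ICE.METADATA.RECID"])
sig_b = vocab_signature(["ICE.METADATA.RECID", "ICE.METADATA.STATUS"])
sig_c = vocab_signature(["ICE.METADATA.RECID"])
```

Here `sig_a` and `sig_b` are equal because only the order differs, while `sig_c` differs because a class is missing.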

Out of scope (deferred)

Auto-prune retention policy for artifact sets (manual archive only).
Cross-source vocab translation (mapping artifact’s classes onto a source with a different taxonomy).
Full per-table input dataset expansion in OpenLineage events (currently emits one aggregate input dataset per source).
HTTP transport for OpenLineage emission — pure projection only.

Deployment: Unseen Ontology, Known Schema

Operating principle: out here we iterate on public benchmarks; in CAI we execute to a customer-specified objective. The customer brings an unseen ontology shaped like a known annotations schema; the system has to produce classifications + calibrated belief intervals against that ontology without prior calibration.

This document captures the deployment-time invariants that the classification pipeline must honor, names the assumptions baked into today’s code that would fail against a sufficiently weird customer ontology, and proposes a roadmap milestone (M11: Bring Your Own Vocabulary) that closes the remaining gaps and turns public-data iteration into a test surface for the same execution path rather than a target in its own right.

Mode split — iteration vs. execution

Dimension	Iteration mode (local / public data)	Execution mode (CAI deployment)
Vocabulary source	`atelier-vocab.ttl` (300 ICE leaves) + curated mappings + benchmark-specific class lists (SOTAB 82 Schema.org types, GitTables 122 DBpedia types, SemTab DBpedia hierarchy)	Customer’s `default.annotations` Hive table — opaque to us until run-time
Hierarchy depth	Known (5 levels for ICE, ~3 for DBpedia subset, varying for Schema.org)	Unknown — could be 1 (flat) or 8+ (deep regulatory taxonomy)
Hierarchy shape	Tree, single root	Tree assumed; multi-root forest, cycles, unbalanced subtrees all plausible
Validation labels	Curated reference (synth, meta-tagging UAT, GitTables CTA gold)	Often absent. Sometimes a small spot-check set; sometimes none.
Accuracy bar	Track records over time on published benchmarks	Customer-stated objective; calibration + sample review when no agent-mediated reference exists
BFO / CCO grounding	Available — we mapped 360 terms ourselves	Opportunistic — only if the customer’s ontology happens to carry a `bfo_anchor` / `cco_anchor` / `schema_org_class` / `dbpedia_class` column
Iteration latency	Tight (re-run with overlay tweaks; soak on devenv)	Wide (CAI session lifecycle; nautilus + overwatch loops the only mid-run feedback)

The bridge between the two modes is structural: every iteration target gets transformed into annotations-schema shape before the pipeline sees it. SOTAB v2’s class list, GitTables’ 122 DBpedia types, the OOTB sample’s 316 ICE leaves, the customer’s Hive table — all four end up as a HierarchicalCategorySet built from a list[ReferenceCategory] with parent_code edges, fed into the same DST + agent + nautilus + overwatch stack. The execution path doesn’t know or care which mode it’s running under.

The annotations schema contract (what’s stable)

load_annotations_from_hive reads SELECT * FROM default.annotations and returns list[dict]. _normalize_annotations_row and _build_category_set_from_records (both in src/atelier/classify/taxonomy.py) translate that into a HierarchicalCategorySet. The fields we already accept, in order of preference per row:

Field	Required	Purpose	Fallback when missing
`code` (or `id` / `path` / dot-path)	yes	Identity for tree navigation, DST focal element, Atlas type name	row dropped (we cannot classify into an unnamed term)
`label` (or `display_name` / `name`)	strongly preferred	Human surface in UI + LLM prompt	falls back to last component of `code`
`parent_code`	no	Explicit parent edge	derived from dot-path (`A.B.C` → parent = `A.B`)
`description`	no	LLM context, embedding text	empty
`common_names` (synonyms / aliases)	no	LLM expansion + embedding text	empty
`notation`	no	SKOS-style dot code (numeric or otherwise)	empty
`abbrev`	no	Mnemonic shortcode for leaves	empty
`taxonomy`	no	Namespace discriminator	`"annotations"`
`sensitivity`	no	Domain-specific classification metadata	absent

The contract is structural, not semantic — we do not require any particular set of root codes, depth, or ICE-trichotomy alignment. A customer ontology rooted at LEGAL.PRIVILEGE.ATTORNEY_CLIENT is structurally indistinguishable from one rooted at ICE.SENSITIVE.PID.CONTACT.EMAIL from the algorithms’ point of view.
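
The fallbacks in the table can be sketched as a small row normalizer. This is an illustration of the contract, not the real `_normalize_annotations_row`: `normalize_row` and `first_present` are hypothetical names, and only a subset of the optional fields is carried through.

```python
from typing import Optional, Sequence

CODE_KEYS = ("code", "id", "path")
LABEL_KEYS = ("label", "display_name", "name")

def first_present(row: dict, keys: Sequence[str]) -> Optional[str]:
    """Return the first non-empty value among the candidate keys."""
    for key in keys:
        value = row.get(key)
        if value not in (None, ""):
            return str(value)
    return None

def normalize_row(row: dict) -> Optional[dict]:
    code = first_present(row, CODE_KEYS)
    if code is None:
        return None  # row dropped: no way to classify into an unnamed term
    # Label falls back to the last dot-path component of the code.
    label = first_present(row, LABEL_KEYS) or code.rsplit(".", 1)[-1]
    # Parent falls back to the dot-path prefix (A.B.C -> A.B); roots have none.
    parent = row.get("parent_code") or (
        code.rsplit(".", 1)[0] if "." in code else None
    )
    return {
        "code": code,
        "label": label,
        "parent_code": parent,
        "description": row.get("description") or "",
        "taxonomy": row.get("taxonomy") or "annotations",
    }

legal = normalize_row({"path": "LEGAL.PRIVILEGE.ATTORNEY_CLIENT"})
```

The resulting record has label `ATTORNEY_CLIENT` and parent `LEGAL.PRIVILEGE`, derived purely from the dot-path, with no knowledge of what the codes mean.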

What’s already ontology-agnostic

Most of the algorithmic surface from v0.4.0-rc1 operates on the hierarchy as a graph, not on ICE-specific anchors:

Parent-aware DST frame (mass_functions.py) — votes at any node; fold-up uses HierarchicalCategorySet.descendants(code).
Hierarchical cosine mass (Shafer §3) — distributes embedding similarity through the graph regardless of root identity.
Cross-subtree cautious_code (Smets §6) — least-commitment promotion finds the deepest common ancestor on whatever tree is loaded.
Belief-path tracing (belief.py::belief_path) — walks parent_code chains; doesn’t care about labels.
Indep-tier revisit gate — fires on consensus disagreement, not on a code-pattern match.
Atlas type graph export (HierarchicalCategorySet.atlas_type_graph) — turns any tree into Atlas Classification typedefs with superTypes chains.
Validation (validate_taxonomy) — collision + duplicate detector catches structural problems before classify starts (cycles emerge as parent_code self-reference; multi-root surfaces as multiple parent-less entries).
Cautious-Code Review (cautious_review.py) — agent-mediated backoff is structural; the agent reasons about depth-vs-confidence on whatever tree it sees.

The DST math doesn’t know it’s classifying PII. That’s a feature — it means the work we did on v0.4.0-rc1 transfers to deployment with zero algorithm changes.
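
As an illustration of why root identity does not matter, the deepest-common-ancestor step behind cautious_code can be sketched over a bare parent map. This is a simplified stand-in, not the code in the pipeline: `ancestors` and `deepest_common_ancestor` are hypothetical helpers, the tree is assumed acyclic, and the `LEGAL.*` codes other than `LEGAL.PRIVILEGE.ATTORNEY_CLIENT` are invented for the example.

```python
from typing import Dict, List, Optional

def ancestors(code: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Chain from the node itself up to its root (assumes no cycles)."""
    chain = [code]
    while parent_of.get(code) is not None:
        code = parent_of[code]
        chain.append(code)
    return chain

def deepest_common_ancestor(
    a: str, b: str, parent_of: Dict[str, Optional[str]]
) -> Optional[str]:
    seen = set(ancestors(a, parent_of))
    # Walking b's chain bottom-up, the first shared node is the deepest one.
    for code in ancestors(b, parent_of):
        if code in seen:
            return code
    return None  # different trees in a forest share no ancestor

parent_of = {
    "LEGAL": None,
    "LEGAL.PRIVILEGE": "LEGAL",
    "LEGAL.PRIVILEGE.ATTORNEY_CLIENT": "LEGAL.PRIVILEGE",
    "LEGAL.PRIVILEGE.WORK_PRODUCT": "LEGAL.PRIVILEGE",  # invented sibling
    "LEGAL.EXEMPT": "LEGAL",                            # invented branch
}
near = deepest_common_ancestor(
    "LEGAL.PRIVILEGE.ATTORNEY_CLIENT", "LEGAL.PRIVILEGE.WORK_PRODUCT", parent_of
)
far = deepest_common_ancestor(
    "LEGAL.PRIVILEGE.ATTORNEY_CLIENT", "LEGAL.EXEMPT", parent_of
)
```

Two sources that disagree between sibling leaves back off to `LEGAL.PRIVILEGE`; a disagreement across branches backs off further, to `LEGAL`.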

What today’s code anticipates badly

Five gaps that an unseen customer ontology will surface on first encounter:

1. Schema flexibility — column-name variants

_normalize_annotations_row matches a fixed set of column-name candidates. Customers regularly bring annotation tables with names like Class_Name, parent_class, category_definition, sensitivity_tier, pii_category — none of which exactly match our preferred field names. Today this falls through to silent drops or empty fields.

Fix: extend the column-name normalization to be configurable and fuzzy. Add a vocab_schema_map overlay setting that lets the operator declare { "code": "Class_Name", "parent_code": "Parent" } at run-time. Default behavior stays automatic via fuzzy matching on common synonyms.
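
One way the proposed fix could look: explicit overrides from vocab_schema_map win, and the remaining fields are resolved by fuzzy matching against a synonym list. The sketch uses `difflib` from the standard library; the synonym table, the 0.8 cutoff, and the name `resolve_columns` are assumptions for illustration, not a settled design.

```python
import difflib
import re
from typing import Dict, List, Optional

# Hypothetical synonym table: canonical field -> names customers tend to use.
SYNONYMS = {
    "code": ["code", "id", "path", "class_name"],
    "parent_code": ["parent_code", "parent", "parent_class"],
    "description": ["description", "definition", "category_definition"],
    "sensitivity": ["sensitivity", "sensitivity_tier"],
}

def _canon(name: str) -> str:
    """Lower-case and collapse punctuation so Class_Name matches class_name."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")

def resolve_columns(
    columns: List[str],
    vocab_schema_map: Optional[Dict[str, str]] = None,
    cutoff: float = 0.8,
) -> Dict[str, str]:
    """Map canonical field names to the customer's actual column names."""
    resolved = dict(vocab_schema_map or {})        # operator overrides win
    canon_to_actual = {_canon(c): c for c in columns}
    for field, candidates in SYNONYMS.items():
        if field in resolved:
            continue
        for candidate in candidates:
            hit = difflib.get_close_matches(
                candidate, list(canon_to_actual), n=1, cutoff=cutoff
            )
            if hit:
                resolved[field] = canon_to_actual[hit[0]]
                break
    return resolved

auto = resolve_columns(
    ["Class_Name", "parent_class", "category_definition", "sensitivity_tier"]
)
manual = resolve_columns(
    ["Class_Name", "Parent"], {"code": "Class_Name", "parent_code": "Parent"}
)
```

Fields that resolve to nothing are simply absent from the result, so the caller can surface them in the Status page instead of dropping rows silently.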

2. Hierarchy-shape resilience

Today’s calibration assumes our 5-level ICE depth. Discount defaults (cosine 0.20, llm 0.15, SVM 0.55), gap_threshold 0.15, and cautious_review bel_threshold 0.85 are all tuned against that. A customer’s flat 50-class taxonomy doesn’t need cautious-code review (no parents to back off to), and an 8-deep regulatory taxonomy demands tighter cautious thresholds (more depth × more places to be wrong).

Fix: depth-aware defaults. Compute hierarchy statistics at LOADING_VOCAB time (max depth, mean branching factor, leaf/internal ratio); apply scaled defaults if the operator hasn’t overridden them. Surface the stats in the Status page so the operator sees what they got.
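
The statistics named in the fix can be computed from the parent map alone. A minimal sketch, assuming a non-empty, acyclic vocabulary expressed as code → parent_code; `hierarchy_stats` is a hypothetical name, and the `needs_cautious_review` flag encodes only the observation above that a flat taxonomy has no parents to back off to. How thresholds should scale with depth is left open.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Optional

def hierarchy_stats(parent_of: Dict[str, Optional[str]]) -> dict:
    """Shape statistics for a vocabulary given as code -> parent_code."""
    children = defaultdict(list)
    for code, parent in parent_of.items():
        if parent is not None:
            children[parent].append(code)

    def depth(code: str) -> int:
        d = 1  # a root sits at depth 1
        while parent_of.get(code) is not None:
            code = parent_of[code]
            d += 1
            if d > len(parent_of) + 1:
                raise ValueError("cycle detected while measuring depth")
        return d

    internal = [c for c in parent_of if c in children]
    leaves = [c for c in parent_of if c not in children]
    max_depth = max(depth(c) for c in parent_of)
    return {
        "max_depth": max_depth,
        "mean_branching": (
            mean(len(children[c]) for c in internal) if internal else 0.0
        ),
        "leaf_internal_ratio": (
            len(leaves) / len(internal) if internal else float("inf")
        ),
        # A flat taxonomy has no parents to back off to.
        "needs_cautious_review": max_depth > 1,
    }

tree = hierarchy_stats({"A": None, "A.B": "A", "A.C": "A", "A.B.D": "A.B"})
flat = hierarchy_stats({"X": None, "Y": None})
```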

3. Multi-root and cycle handling

A single-root tree is assumed in descendants / ancestors traversal. Customer dumps can have multiple top-level concepts (a forest) or, rarely but consequentially, a cycle introduced by a data-entry error. Today, cycles cause infinite recursion in descendants; multiple roots silently work because the traversal is parent-anchored, but pre-classification tooling (Atlas export, vocabulary stats UI) breaks.

Fix: explicit multi-root support in HierarchicalCategorySet. Cycle detection + clear error in validate_taxonomy with the offending edge identified. Both behind a feature flag so pathological customer data fails fast rather than hangs.
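
A sketch of the validation this fix calls for: list the roots and report the first parent_code edge that closes a cycle. The function name and return shape are hypothetical; the real validate_taxonomy may report problems differently.

```python
from typing import Dict, List, Optional, Tuple

def find_roots_and_cycle(
    parent_of: Dict[str, Optional[str]]
) -> Tuple[List[str], Optional[Tuple[str, str]]]:
    """Return (roots, offending_edge); the edge is (child, parent) or None."""
    roots = sorted(code for code, parent in parent_of.items() if parent is None)
    state: Dict[str, str] = {}  # code -> "visiting" | "done"
    for start in parent_of:
        path: List[str] = []
        code: Optional[str] = start
        # Walk upward until we hit a root or an already-cleared node.
        while code is not None and state.get(code) != "done":
            if state.get(code) == "visiting":
                # We re-entered the current path: the last edge closes a cycle.
                return roots, (path[-1], code)
            state[code] = "visiting"
            path.append(code)
            code = parent_of.get(code)
        for visited in path:
            state[visited] = "done"
    return roots, None

forest_roots, forest_cycle = find_roots_and_cycle(
    {"A": None, "B": None, "A.X": "A"}
)
loop_roots, loop_cycle = find_roots_and_cycle({"A": "B", "B": "C", "C": "A"})
```

Each node is walked at most once, so the check is linear in the size of the vocabulary and terminates on exactly the inputs that would otherwise hang the traversal.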

4. Opportunistic CCO/BFO grounding

Customer ontologies that overlap with Schema.org / DBpedia / BFO / CCO carry that overlap as metadata columns (e.g., schema_org_class, bfo_anchor, cco_class). Today we ignore these. Wiring them lets us:

Auto-validate the customer’s hierarchy against a known reference (warn on inconsistent BFO anchoring; e.g., a node mapped to cco:DesignativeICE whose children include a cco:Agent).
Reuse our 360-term mapping for embedding-text enrichment (a customer term mapped to schema:Person borrows the full description from the Schema.org corpus).
Bridge to Atlas BFO classifications when the customer’s governance team is ahead of ours (Cloudera Atlas now ships BFO alignment as of mid-2025).

Fix: optional bfo_anchor / cco_class / schema_org_class / dbpedia_class columns in the annotations contract; when present, the loader populates them on ReferenceCategory and the embedding + LLM-prompt builders consume them. When absent, no behavior change.

5. Accuracy reporting without an agent-mediated reference

The customer often has no per-column gold-standard labels. Our v0.4.0-rc1 evaluation pipeline assumes a curated_reference table (or per-row reference_code field). When the customer doesn’t provide one:

What we have: belief-gap distribution, mean K, cautious-code depth distribution, cross-source agreement counts, reasoning-trace attribution analyzer. These are calibration metrics, not accuracy.

What we need: a deployment-mode evaluation report that’s honest about the absence of an agent-mediated reference. Three-tier report:

Internal consistency — DST K stats, belief-gap distribution, contraction rate. Always available. Tells the operator the pipeline converged.
Sample review workflow — eject N highest-uncertainty columns and N highest-confidence columns to the UI for human spot-check. The operator’s accept/reject decisions feed an ad-hoc curated reference that grows over time. This is essentially what UAT reviewers were doing manually; we can formalize it.
Public-benchmark proxy — when the customer’s ontology overlaps with SOTAB / GitTables / SemTab through opportunistic CCO grounding (gap #4), accuracy on the public benchmark serves as a conditional-confidence floor.
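
The selection step of the sample review tier can be sketched as two rankings over the per-column intervals. This assumes "highest uncertainty" means the widest belief gap (Pl − Bel) and "highest confidence" means the highest belief; both readings, and the name `select_for_review`, are assumptions.

```python
from typing import Dict, List, Tuple

def select_for_review(
    columns: List[Dict], n: int
) -> Tuple[List[Dict], List[Dict]]:
    """Pick the n widest-gap and the n highest-belief columns, without overlap."""
    by_gap = sorted(
        columns, key=lambda c: c["plausibility"] - c["belief"], reverse=True
    )
    uncertain = by_gap[:n]
    already_picked = {c["column"] for c in uncertain}
    by_belief = sorted(columns, key=lambda c: c["belief"], reverse=True)
    confident = [c for c in by_belief if c["column"] not in already_picked][:n]
    return uncertain, confident

results = [
    {"column": "a", "belief": 0.90, "plausibility": 0.95},
    {"column": "b", "belief": 0.30, "plausibility": 0.90},
    {"column": "c", "belief": 0.50, "plausibility": 0.70},
    {"column": "d", "belief": 0.85, "plausibility": 0.87},
]
uncertain, confident = select_for_review(results, n=2)
```

Reviewing both ends is deliberate: the uncertain columns show where the pipeline needs help, and the confident ones check that high belief is actually earned.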

Public-data iteration as test surface

The principle: every public benchmark we adopt becomes a deployment-shape simulator, not a one-off integration. Concretely:

SOTAB v2: wire as a classify source by transforming the 82 Schema.org type list into a HierarchicalCategorySet-shaped annotations table. The Schema.org type tree provides parent_code edges; our existing atelier-vocab.ttl mappings (schema:Person → cco:Agent, etc.) opportunistically populate the BFO/CCO anchors on the resulting ReferenceCategory rows. Pipeline runs against SOTAB tables exactly as it would against a customer Hive corpus.
GitTables: same treatment. The 122 DBpedia types become a flat (or DBpedia-hierarchy-enriched) annotations table. Our 15 already-mapped DBpedia → CCO bridges populate anchors where they exist; the other 107 stay un-anchored (correct behavior for opportunistic grounding).
SemTab annual: register the system, produce the annotations table from each year’s vocabulary release, evaluate against the cscore metric (which natively rewards our cautious_code).
Customer schema simulators: synthetic annotation tables that test specific deployment shapes: flat 50-class taxonomy (legal exemption codes), 8-deep regulatory tree (HIPAA subcategories), forest with 3 roots (multi-domain governance). These exercise the M11 shape-resilience work without needing real customer data.

Each iteration target ships as a data_sources row + a loader (one function each) + an annotations table built from the benchmark’s class list. None of them need pipeline-side knowledge.

M11 — Bring Your Own Vocabulary (proposed)

A milestone that delivers ontology-agnostic execution with the five gaps above closed:

Configurable vocab schema mapping — overlay setting + fuzzy default; surfaces in Status when applied.
Depth-aware default calibration — compute hierarchy stats at load; scale gap_threshold + cautious bel_threshold.
Multi-root + cycle support — explicit; behind feature flags that fail loudly when violated.
Opportunistic anchor columns — bfo_anchor / cco_class / schema_org_class / dbpedia_class consumed when present.
Three-tier deployment evaluation report — internal consistency / sample review workflow / public-benchmark proxy.
SOTAB v2 + GitTables wired as test sources — proof that the same execution path handles three published benchmarks plus the customer’s Hive table without code changes per target.

The work is concrete and bounded — roughly two focused sessions (taxonomy.py + pipeline.py extensions; one loader + one fixture test per benchmark). Stronger leverage than a feature-by-feature roadmap because every fix lands on the existing structural abstraction rather than introducing new mechanisms.

Out of scope (deferred)

Cross-customer ontology learning. Two customers with similar regulatory domains might benefit from shared inferences; we explicitly do not transfer learning across deployments. Each customer’s session is a closed world.
Customer-driven hierarchy editing. The annotations table is a contract the customer controls upstream of Atelier. We don’t ship UI for editing it.
Ontology auto-discovery. Inferring a hierarchy from unannotated data tables (clustering plus LLM proposes a tree) is a research direction in its own right; out of scope for M11.

Open questions

What if the customer brings two annotations tables? A primary domain vocabulary (hipaa.annotations) and a generic PII overlay (atelier.annotations). Today’s pipeline takes one. M11 should consider compose_vocabularies in the loader path, but it changes the meaning of “the customer’s ontology” — needs a design conversation.
Embedding-model robustness across languages. A German / Japanese / Mandarin annotations table will produce shorter embedding-text and weaker cosine signal at MiniLM-L6 scale. Bigger embedding models (BGE-large, E5-mistral) help but inflate per-run cost. Defer to a separate i18n milestone.
Atlas BFO sync. Cloudera’s Atlas team is shipping BFO alignment. Once stable, our atelier-vocab.ttl ↔ Atlas classification typedef mapping should round-trip without loss (we ship to Atlas; Atlas hands back BFO-anchored entities; we read them as opportunistic anchors per gap #4). Wait for Atlas BFO general availability before wiring.

Cross-references

Classification Pipeline — the execution path being made ontology-agnostic.
DST Evidence Independence — the numerical-methods framing that already operates on arbitrary hierarchies.
Pareto Capability Evolution — the longer-horizon search-space that builds on M11.
src/atelier/classify/ontology/README.md — the BFO/CCO substrate that opportunistic anchoring lifts into.
src/atelier/classify/taxonomy.py — _normalize_annotations_row, _build_category_set_from_records (the adapter layer).
src/atelier/classify/sampler.py — load_annotations_from_hive, load_annotations_from_json, load_annotations_from_filesystem (the source-shape variants).

SOTAB v2 Coverage Strategy

Ownership note (2026-05-09). Going forward, all ontology / vocabulary / synthetic-data work moves to Ægir. The label space conditions model pre-training directly, so it lives where the model lives. Atelier becomes the consumer of trained artifacts (H-Net/RWKV checkpoints + SVMs trained on Ægir-curated datasets). This document stays in Atelier’s docs as the specification of what we want covered; the actual TTL extensions, generators, and SOTAB integration implementation belong in ~/local/src/zndx/aegir/.

This document specifies how the BFO/CCO-grounded vocabulary should cover the SOTAB v2 Schema.org CTA label space (82 labels), so the hierarchical RWKV-7 model in Ægir can ladder predictions up from raw benchmark labels to BFO/CCO concepts.

Background

Atelier (current): ICE trichotomy (Designative / Descriptive / Prescriptive) grounded through Common Core Ontologies into BFO 2020. 20 Schema.org subjects mapped today (11 classes, 9 properties) in src/atelier/classify/ontology/atelier-vocab.ttl. This snapshot remains operational for the existing classification pipeline during the migration window.
Atelier (future): consumer of pre-trained model artifacts. Loads H-Net/RWKV checkpoints and SVMs trained in Ægir against ontology-grounded label spaces; uses them as evidence sources in DST fusion. No longer owns vocabulary.
Ægir (current → future ontology home): hierarchical RWKV-7 model targeting CTA + CPA on wide tables, trained against gt-signals-dbpedia and SOTAB v2. SOTAB infrastructure already wired: scripts/download_sotab.py fetches the four canonical bundles, src/aegir/data/table_dataset.py loads sotab_v2_cta_*_set.csv reference labels. Inheriting atelier-vocab.ttl + synth generators is part of the M2 roadmap.
Synthetic data pipeline: currently in Atelier (synth_generators.py, 316+ generators). Migration target: Ægir, since the generators feed pre-training corpora directly. Atelier’s classification pipeline can consume generator output via a thin client during transition.

Authoritative SOTAB v2 label space

Verified against /raid/datasets/sotab/sotab_v2_cta_*_set.csv (union of training, validation, test, and the three robustness test sets: corner_cases, missing_values, format_heterogeneity):

82 distinct CTA labels covering 17 root entity types

Root entity types (from webdatacommons.org/structureddata/sotab/v2/, Table 2): Book, CreativeWork, Event, Hotel, JobPosting, LocalBusiness, Movie, Museum, MusicAlbum, MusicRecording, Person, Place, Product, Recipe, Restaurant, SportsEvent, TVEpisode.

The 82 labels are a mix of:

Class names — Country, MonetaryAmount, Organization, etc.
Entity-property pairs — Book/name, Hotel/description, JobPosting/description (the slash separates entity type from the property whose value the column carries).
Measurement units — Distance, Duration, Energy, Mass.
Enumeration types — BookFormatType, EventStatusType, GenderType, RestrictedDiet.
Coded attribute types — CoordinateAT, IdentifierAT, MusicArtistAT (the AT suffix denotes “atomic type”, a SOTAB convention, not Schema.org).

The stale _LABEL_DIMS["sotab"] = 91 in Ægir’s src/aegir/data/table_dataset.py should be corrected to 82; the extra 9 appear to be carry-over from an earlier draft of the label set.
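
The label count can be re-derived with a few lines of Python. This is a minimal sketch, assuming each sotab_v2_cta_*_set.csv has a header row with a `label` column (an assumption about the file layout; adjust `column` if the bundles differ). `distinct_labels` is a hypothetical helper, not an existing Ægir function.

```python
import csv
import glob


def distinct_labels(pattern: str, column: str = "label") -> set:
    """Union of the values in `column` across every CSV matching `pattern`."""
    labels = set()
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="", encoding="utf-8") as handle:
            # DictReader uses the header row as keys (assumed layout).
            for row in csv.DictReader(handle):
                labels.add(row[column])
    return labels
```

If the column assumption holds, `len(distinct_labels("/raid/datasets/sotab/sotab_v2_cta_*_set.csv"))` should reproduce the count of 82 stated above, and the returned set can be diffed against whatever list currently drives `_LABEL_DIMS`.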

Coverage analysis

Direct hits (14 of 82)

Already grounded in atelier-vocab.ttl:

SOTAB label	Atelier mapping
`Country`	`schema:Country ⊑ BFO:Site`
`CreativeWork`	`schema:CreativeWork ⊑ cco:ICE`
`CreativeWork/name`	`schema:name` (rdf property; we have it)
`Event/description`, `Event/name`	`schema:Event` + `schema:description`/`schema:name`
`MonetaryAmount`	`schema:MonetaryAmount ⊑ cco:DescriptiveICE`
`Organization`	`schema:Organization ⊑ cco:Organization`
`Person/name`	`schema:Person ⊑ cco:Person` + `schema:name`
`Place/name`	`schema:Place ⊑ BFO:Site` + `schema:name`
`PostalAddress`	`schema:PostalAddress ⊑ cco:DesignativeICE`
`QuantitativeValue`	`schema:QuantitativeValue ⊑ cco:DescriptiveICE`
`email`	`schema:email ⊑ cco:DesignativeICE`
`telephone`	`schema:telephone ⊑ cco:DesignativeICE`
`URL`	`schema:url` (we use lowercase; SOTAB uses `URL`)

Subsumption-reachable (~20 of 82)

Subclasses of types already grounded — adding them is a single rdfs:subClassOf edge under the existing CCO branch:

schema:CreativeWork descendants (8): Book/description, Book/name, BookFormatType, Movie/description, Movie/name, Recipe/description, Recipe/name, MusicAlbum/name, MusicRecording/name, TVEpisode/name, CreativeWorkSeries, Photograph, Review.

schema:Organization descendants (5): Hotel/description, Hotel/name, LocalBusiness/name, Museum/name, Restaurant/name, SportsTeam.

schema:Event descendants (1): SportsEvent/name.

schema:PostalAddress sub-properties (4): addressLocality, addressRegion, postalCode, streetAddress.

schema:QuantitativeValue measurement subtypes (4 + 1): Distance, Duration, Energy, Mass, weight (column-as-property).

Missing — requires new vocab work (~48 of 82)

Grouped by extension target:

Group	SOTAB labels	Target CCO/BFO grounding
Product family (new branch)	`Product/description`, `Product/name`, `ProductModel`, `Brand`	`cco:Artifact` + `cco:ArtifactModel` (Prescriptive territory)
Job posting family	`JobPosting/description`, `JobPosting/name`, `OccupationalExperienceRequirements`, `EducationalOccupationalCredential`, `workHours`, `paymentAccepted`	`cco:DescriptiveICE` (descriptive content about employment)
Product economics	`price`, `priceRange`, `currency`, `DeliveryMethod`, `ItemAvailability`, `OfferItemCondition`	Mix: `cco:DescriptiveICE` (price/range), enumerations under `cco:DesignativeICE` (DeliveryMethod, ItemAvailability)
Temporal granularities	`Date`, `DateTime`, `Time`, `DayOfWeek`	`cco:DescriptiveICE` temporal subtree (refinements of existing `TIMESTAMP`)
Generic data types	`Number`, `Boolean`, `Language`	Mix: `cco:DescriptiveICE` (Number, Boolean), `cco:DesignativeICE` (Language code)
Enumerations	`CategoryCode`, `EventStatusType`, `EventAttendanceModeEnumeration`, `GenderType`, `RestrictedDiet`, `BookFormatType`	`cco:DesignativeICE` (coded value identifiers)
Person/identity attributes	`GenderType`, `MusicArtistAT`	`cco:DescriptiveICE` (gender), `cco:Person` (artist)
Place attributes	`CoordinateAT`, `LocationFeatureSpecification`, `openingHours`	`cco:DescriptiveICE` (coordinates, hours, features)
Annotations / commentary	`category`, `label`, `Rating`, `Review`, `ItemList`	`cco:DescriptiveICE`
Measurement helpers	`unitCode`, `unitText`	Properties of `schema:QuantitativeValue` (we have); add as named properties
Communication channel	`faxNumber`	`cco:DesignativeICE` (sibling of `telephone`)
Attribute-typed identifiers	`IdentifierAT`	`cco:DesignativeICE` (sibling of `schema:identifier`)

Three-tier extension strategy

Tier-A — measurement zoo (lowest effort, highest leverage)

Add ~10 schema:QuantitativeValue subclasses under cco:DescriptiveICE:

schema:Distance       rdfs:subClassOf cco:ont00000853 .  # Descriptive ICE
schema:Duration       rdfs:subClassOf cco:ont00000853 .
schema:Energy         rdfs:subClassOf cco:ont00000853 .
schema:Mass           rdfs:subClassOf cco:ont00000853 .
schema:Speed          rdfs:subClassOf cco:ont00000853 .
schema:Temperature    rdfs:subClassOf cco:ont00000853 .

Plus property-level: unitCode, unitText, weight.

Implementation: ~10 lines in atelier-vocab.ttl, ~5 generators in synth_generators.py (already have NUMERIC.* generators that can be re-keyed to schema URIs).

SOTAB labels covered: 10 (Distance, Duration, Energy, Mass, weight, unitCode, unitText, Number, Boolean, plus one ancillary).

Tier-B — subclass plumbing (CreativeWork + Organization + Event subtrees)

Single-edge additions for entity types already grounded at parent level:

schema:Book           rdfs:subClassOf schema:CreativeWork .
schema:Movie          rdfs:subClassOf schema:CreativeWork .
schema:Recipe         rdfs:subClassOf schema:CreativeWork .
schema:MusicAlbum     rdfs:subClassOf schema:CreativeWork .
schema:MusicRecording rdfs:subClassOf schema:CreativeWork .
schema:TVEpisode      rdfs:subClassOf schema:CreativeWork .
schema:Photograph     rdfs:subClassOf schema:CreativeWork .
schema:Review         rdfs:subClassOf schema:CreativeWork .
schema:Hotel          rdfs:subClassOf schema:Organization .
schema:LocalBusiness  rdfs:subClassOf schema:Organization .
schema:Museum         rdfs:subClassOf schema:Organization .
schema:Restaurant     rdfs:subClassOf schema:Organization .
schema:SportsTeam     rdfs:subClassOf schema:Organization .
schema:SportsEvent    rdfs:subClassOf schema:Event .

Implementation: ~14 lines in atelier-vocab.ttl, ~14 SSSOM annotation blocks, ~14 generators in synth_generators.py.

SOTAB labels covered: ~20 (entity-property pairs cascade through parent’s name/description mappings).
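
The cascade for entity-property pairs can be illustrated in a few lines. This is a sketch under stated assumptions: `SUBCLASS_OF` is a hand-copied subset of the edges above and `GROUNDED` a subset of the parents already grounded in atelier-vocab.ttl; `split_label` and `grounded_ancestor` are hypothetical names, not existing functions.

```python
from typing import Optional, Tuple

# Hypothetical subset of the Tier-B edges; the real edges live in the TTL file.
SUBCLASS_OF = {
    "schema:Book": "schema:CreativeWork",
    "schema:Hotel": "schema:Organization",
    "schema:SportsEvent": "schema:Event",
}
# Parents that are already grounded (illustrative subset).
GROUNDED = {"schema:CreativeWork", "schema:Organization", "schema:Event"}


def split_label(label: str) -> Tuple[str, Optional[str]]:
    """Split 'Book/name' into ('schema:Book', 'schema:name'); bare labels have no property."""
    entity, _, prop = label.partition("/")
    return "schema:" + entity, ("schema:" + prop if prop else None)


def grounded_ancestor(iri: str) -> Optional[str]:
    """Walk rdfs:subClassOf edges upward until a grounded class is reached."""
    seen = set()
    while iri not in GROUNDED:
        if iri in seen or iri not in SUBCLASS_OF:
            return None  # cycle, or no path to a grounded parent
        seen.add(iri)
        iri = SUBCLASS_OF[iri]
    return iri
```

With these edges, `Book/name` splits into `schema:Book` and `schema:name`, and `schema:Book` resolves to `schema:CreativeWork`, whose existing `schema:name` mapping then applies. A label whose entity has no edge yet (for example `Product/name` before Tier-C) resolves to `None`.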

Tier-C — Product branch + JobPosting + economics

Largest single addition; introduces cco:Artifact lineage:

schema:Product        rdfs:subClassOf cco:Artifact .       # NEW branch
schema:ProductModel   rdfs:subClassOf cco:ArtifactModel .  # NEW
schema:Brand          rdfs:subClassOf cco:DesignativeICE . # NEW
schema:JobPosting     rdfs:subClassOf cco:DescriptiveICE . # NEW

Plus property-level mappings: price, priceRange, currency, paymentAccepted, DeliveryMethod, ItemAvailability, OfferItemCondition, workHours, OccupationalExperienceRequirements, EducationalOccupationalCredential.

Plus temporal refinements: Date, DateTime, Time, DayOfWeek as subproperties of existing TIMESTAMP lineage.

Plus enumerations: CategoryCode, EventStatusType, EventAttendanceModeEnumeration, GenderType, RestrictedDiet, BookFormatType (~6 enumeration classes).

Implementation: ~30 lines in atelier-vocab.ttl, comparable SSSOM annotation overhead, ~25 new synth generators (Product family is its own generator pack: SKU, brand, model, GTIN, etc.).

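SOTAB labels covered: the remainder of the ~48 missing labels (Tier-A already picks up unitCode, unitText, Number, and Boolean from that group).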

Cumulative coverage after all three tiers

100% of the 82 SOTAB v2 Schema.org CTA labels mapped to BFO/CCO grounding, with provenance trails (SSSOM sssom:object_label axioms) for every mapping.

Ownership flow (post-migration)

Concern	Owner	Notes
Vocabulary IRIs + CCO/BFO grounding	Ægir (target)	`atelier-vocab.ttl` migrates to `aegir/src/aegir/ontology/`
Synth value generators	Ægir (target)	Generator output feeds pre-training corpora directly
SOTAB-label → vocab-IRI lookup	Ægir	`aegir-vocab` exposes label↔IRI map; consumed by training + inference
SOTAB v2 download + extraction	Ægir	`scripts/download_sotab.py` (already wired)
CTA/CPA dataset loaders	Ægir	`src/aegir/data/table_dataset.py` (already wired)
Model training + evaluation	Ægir	`train.py`, `src/aegir/models/heads.py::AegirForColumnAnnotation`
Per-class F1 + Pareto evaluation	Ægir	M2 roadmap entry (per-class F1 bars in leaderboard UI)
BFO-grounded prediction emission	Ægir	Leaderboard predicts SOTAB label AND emits its CCO/BFO ancestry
Trained checkpoint consumption	Atelier	New: load H-Net/RWKV + SVM artifacts as DST evidence sources
DST evidence fusion + classification pipeline	Atelier	Unchanged — trichotomy + belief/plausibility logic stays
Gateway + UI + governance integration	Atelier	Unchanged

During the migration window (until the ontology fully relocates), Atelier keeps its operational atelier-vocab.ttl snapshot. The concrete contract Ægir publishes to Atelier becomes a vocab_label_map.json (IRI + BFO ancestry per SOTAB label) plus the trained model checkpoints themselves.

Ægir touchpoints (informative, not prescriptive)

The work in Ægir, in roadmap terms, lands inside its M2 milestone (“external-baseline harness, ontology editor with Postgres write paths, per-class F1 bars”):

src/aegir/data/table_dataset.py — fix the stale _LABEL_DIMS["sotab"] = 91 to 82; add a label_to_iri resolver that consumes the shared sotab_label_map.json.
scripts/sotab_diagnostic.py — extend representation-collapse diagnostics to surface per-tier (A/B/C) coverage of predictions, so we can see whether collapses correlate with vocab gaps.
Leaderboard gateway (src/aegir/gateway/app.py) — /api/ontology endpoint already exists; extend its response to include the BFO ancestry of each predicted label.
src/aegir/models/heads.py::AegirForColumnAnnotation — no model change needed for tier work; the head already operates on a (num_labels,) output, and 82 vs 91 is just a config delta.

The pretraining work documented in aegir/docs/notes/2026-04-19/234700_sotab_diagnostic_representation_collapse.md (model collapses to single embedding point on SOTAB-small) is orthogonal to this strategy — it’s a model issue, not a vocabulary issue. Vocab extension proceeds independently and should improve the post-collapse ceiling once representations are healthy.

Synthetic data pipeline implications

The synth framework (synth_generators.py, 316+ hand-coded generators plus the three-layer registry) migrates to Ægir with the rest of the ontology work. Ægir-resident synth gives pre-training direct access to generator output without crossing repo boundaries. After Tier-A/B/C extensions:

Tier-A adds measurement generators — DURATION (ISO-8601 strings), MASS (with unit suffix), DISTANCE, ENERGY — these are mostly numeric with unit annotations. Existing NUMERIC.* generators can be re-keyed.
Tier-B adds entity-name generators — BOOK_TITLE, MOVIE_TITLE, RECIPE_NAME, HOTEL_NAME, etc. Cascade through the registry’s template priority (priority 2): once Ægir has ~50 real Book/name samples from SOTAB itself, the registry generates plausible book titles via perturbation.
Tier-C adds product attribute generators — SKU, Brand, GTIN, ProductModel. Domain-specific; benefit from hand-coded generators (priority 1) seeded with realistic patterns.

Atelier’s classification pipeline, post-migration, can either (a) call into Ægir’s synth via a thin client during local dev, or (b) bundle a generator snapshot at release time. The decision depends on whether Atelier’s BDD/pytest scenarios must remain self-contained or may require Ægir as a sibling repo.

Verification

Coverage is mechanically verifiable via SPARQL totality:

PREFIX cco: <https://www.commoncoreontologies.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX schema: <https://schema.org/>

# Every SOTAB label must have a path to cco:InformationContentEntity (or descendant).
SELECT ?label WHERE {
  VALUES ?label { schema:Distance schema:Duration schema:Mass ... }  # all 82
  FILTER NOT EXISTS {
    ?label rdfs:subClassOf+ cco:ont00000958 .  # cco:InformationContentEntity
  }
}
# Empty result == 100% coverage.

This goes in src/atelier/classify/ontology/sparql/sotab_totality.rq once Tier-A lands.
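
For quick unit tests without a SPARQL engine, the same totality condition can be approximated in plain Python over a list of (child, parent) subclass edges. This is a sketch only: `uncovered` is a hypothetical helper, and it takes a set of accepted roots rather than a single one, because the mappings in this document also ground some labels under BFO:Site, cco:Person, cco:Organization, and cco:Artifact. Whether those roots count toward totality is a design decision this sketch does not settle.

```python
from collections import defaultdict


def uncovered(labels, edges, roots):
    """Return labels with no subclass path (one or more hops) to any accepted root."""
    parents = defaultdict(set)
    for child, parent in edges:
        parents[child].add(parent)

    def reaches_root(start):
        # Depth-first walk over ancestors; `seen` guards against cycles.
        stack, seen = list(parents[start]), set()
        while stack:
            node = stack.pop()
            if node in roots:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents[node])
        return False

    return sorted(label for label in labels if not reaches_root(label))
```

An empty return value corresponds to the empty SPARQL result. Like the `+` property path, the walk requires at least one hop, so a root does not cover itself.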

Status

Strategy doc: this file (2026-05-09).
Ownership migration: Ægir takes over ontology / vocab / synth.
Tier-A implementation: Ægir M2 — vocab edits + SSSOM annotations + SPARQL totality query + measurement generators.
Tier-B implementation: Ægir M2.
Tier-C implementation: Ægir M3.

Atelier’s contribution post-migration is consumption-side: load Ægir’s trained checkpoints as DST evidence sources, surface BFO ancestry via the gateway/UI, integrate predictions into the existing belief/plausibility fusion machinery. The vocabulary itself, and the work to extend it, lives next to the model that uses it.

Pareto Capability Evolution (Roadmap)

Status: research-shaped capstone milestone. No incremental rollout — we ship it whole when the pieces converge.

This document proposes a long-horizon evolution of the Atelier classification pipeline from a single-config bootstrap loop into a multi-objective, population-based search over the policy space (LLM prompts, classifier hyperparameters, fusion strategy). The framing is rooted in three bodies of work — Active Learning, Automatic Prompt Optimization (APO), and GEPA — each of which already maps cleanly onto a piece of what we ship today.

Why this is a capstone, not a feature

The current bootstrap loop is already an active-learning system, just informally named. We sweep with an Opus oracle, fuse with Dempster-Shafer, revisit disagreements, and retrain incrementally, all under a single configuration. Operators have started asking the next question: could we have run with a tighter belief gap, fewer LLM tokens, deeper cautious predictions? Each answer requires re-running with different settings. We need a search procedure that can carry this load without forcing operators to hand-tune one knob at a time.

We ship this when the prerequisite pieces converge:

The reasoning model in overwatch/agent.py stabilizes as a reliable proposer of structured configuration edits (prompt diffs and JSON patches over the config tree, not free-form advice).
We have enough corpus diversity in data_sources to evaluate candidates against generalization, not point estimates on one source.
A persistent population store (the config leaderboard) is in place so evolution survives gateway restarts and CAI session boundaries.

Until those land, individual ideas in this doc may be borrowed in isolation (e.g. an APO-only loop that evolves a single sweep prompt against accuracy). The capstone is the integrated whole — the borrowed pieces alone do not constitute “Pareto Capability Evolution”.

Foundations

Active Learning — the paradigm we already implement

Active learning minimizes label cost by querying an oracle on examples the model is most uncertain about (Settles 2009). Mapped onto the Atelier pipeline:

Active Learning concept	Atelier component
Oracle	Opus during sweep + revisit (`pipeline.py::_llm_sweep`, `_llm_revisit`)
Labeled pool (T_K)	Synth corpus + curated reference + accumulated LLM labels
Unlabeled pool (T_U)	Discovered source columns awaiting classification
Query strategy	Belief-gap-driven revisit selection (largest `Pl − Bel`)
Query-by-committee	Disagreement between CatBoost-fit-to-LLM and the synth-trained SVM (via the ICE→user alignment)
Pool vs. stream	Pool-based — Monte Carlo stratification picks each batch
Stopping criterion	`mean_gap < gap_threshold` OR `max_iterations` reached
Cold-start mitigation	Synth pre-training + pattern evidence on first sweep
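
The query-strategy and stopping-criterion rows can be sketched as two small functions. This is an illustration, not the code in pipeline.py: `select_for_revisit` and `should_stop` are hypothetical names, intervals are assumed to be `(Bel, Pl)` pairs keyed by column, and the Monte Carlo stratification step is left out.

```python
def select_for_revisit(intervals, batch_size):
    """Pick the columns with the widest Pl - Bel gap, widest first."""
    by_gap = sorted(
        intervals,
        key=lambda col: intervals[col][1] - intervals[col][0],
        reverse=True,
    )
    return by_gap[:batch_size]


def should_stop(intervals, gap_threshold, iteration, max_iterations):
    """Stop when the mean gap drops below the threshold or the iteration cap is hit."""
    if not intervals:
        return True  # nothing left to classify
    mean_gap = sum(pl - bel for bel, pl in intervals.values()) / len(intervals)
    return mean_gap < gap_threshold or iteration >= max_iterations
```

For example, with intervals `{"a": (0.82, 0.87), "b": (0.30, 0.90), "c": (0.45, 0.55)}`, column `b` has the widest gap (0.60) and is selected first; the mean gap is 0.25, so the loop continues under a threshold of 0.1.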

The active-learning incorporation of new oracle labels is concentrated in the catboost source (fit_to_llm mode trains on the live LLM labels mid-run). The SVM was previously also part of this active-learning loop via the M9 frontier_svm retrain, but that path was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for the independence reasons documented in ontology_alignment.py. The SVM now contributes a label-stable TF-IDF view that complements the live-LLM-aligned CatBoost view.

Automatic Prompt Optimization — APO and GEPA

Both APO (Microsoft Agent Lightning) and GEPA (Agrawal et al., ICLR 2026) optimize LLM prompts via reflection-driven mutation: the LLM diagnoses its own failures in natural language and proposes prompt edits, which are evaluated against held-out tasks. They differ on search shape:

Dimension	APO	GEPA
Search structure	Beam (default width 4)	Pareto frontier (open-ended population)
Objective	Single scalar reward	Multi-objective, non-dominated sorting
Mutation	Textual gradient → LLM-edit	LLM reflection + cross-candidate recombination
Targets	One prompt template at a time	One or more prompts; full system policy
Scope	“Pick the best system prompt”	“Discover diverse strategies and combine them”
Sample efficiency	Not benchmarked vs. RL	Up to 35× fewer rollouts than GRPO, with gains of up to 20%; over 10% versus MIPROv2 (as reported by the authors)

For Atelier, APO is the right shape for narrow optimizations (tune one sweep prompt against accuracy on a known corpus). GEPA is the right shape for the capstone: we have multiple operator-relevant objectives (accuracy, calibration, cost, coverage, latency), and we benefit from preserving complementary policies rather than collapsing to a single configuration.

We treat APO and GEPA as peer techniques. APO is invoked when one objective clearly dominates and beam search is sufficient; GEPA is invoked when objectives trade off and the frontier’s diversity is itself the asset. Both share the same reflection-engine plumbing.

The synthesis — Pareto Capability Evolution

The capstone integrates AL, APO/GEPA, and population-based search into one loop:

Active learning drives label acquisition within each candidate run (the existing bootstrap loop, unchanged).
Reflection-driven mutation drives proposal of new pipeline configurations: prompt edits, classifier knobs, fusion swaps.
Pareto sorting decides which configurations survive into the next generation.

The reflection model is the same Opus instance already wired for overwatch — it reads the convergence report of a finished run and proposes targeted edits to the configuration that produced it.

Pipeline policy space

Mutation targets the configuration tuple, not just the prompt:

LLM prompts: sweep template, revisit template, classification subagent system prompt.
Classifier hyperparameters: CatBoost depth, learning rate, class weights; SVM C and kernel; SVM-vs-LLM blend ratio in DST mass construction.
Fusion strategy: Dempster vs. Yager; gap threshold; bel-floor; pignistic vs. cautious decision rule; cautious depth threshold.
Search budget: sweep batch size, max bootstrap iterations, Monte Carlo stratification fraction, revisit triggers.
Pattern evidence weights: per-pattern mass discount, evidence layering order.
Embedding choice: MiniLM-L6 (today’s default) vs. BGE-large vs. E5-mistral — bounded by the embedding-model identity check we already enforce on Extend runs.
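
To make the Dempster vs. Yager knob concrete, the sketch below combines two mass functions under either rule. It is a textbook illustration rather than Atelier's fusion code; `combine` is a hypothetical name, and mass functions are assumed to be dicts keyed by `frozenset` focal elements. Both rules multiply masses over intersecting focal elements. They differ only in what happens to the conflict mass K: Dempster renormalizes by 1 − K, while Yager assigns K to the whole frame as ignorance.

```python
from itertools import product


def combine(m1, m2, frame, rule="dempster"):
    """Combine two mass functions; returns (fused masses, conflict K)."""
    fused, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            fused[inter] = fused.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # K: mass landing on the empty set
    if rule == "dempster":
        if conflict >= 1.0:
            raise ValueError("total conflict: Dempster's rule is undefined")
        return {k: v / (1.0 - conflict) for k, v in fused.items()}, conflict
    if rule == "yager":
        fused[frame] = fused.get(frame, 0.0) + conflict  # conflict becomes ignorance
        return fused, conflict
    raise ValueError("unknown rule: %s" % rule)
```

With one source giving 0.6 to `{EMAIL}` and another giving 0.5 to `{PHONE}` (the rest on the frame), K = 0.3. Dempster yields about 0.43 for EMAIL, while Yager keeps EMAIL at 0.3 and raises the mass on the full frame to 0.5, which widens the belief interval instead of sharpening it.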

Hard invariants encoded elsewhere (e.g. classify.bootstrap.max_iterations >= 2, classify.catboost.fit_to_llm = true) remain non-negotiable — mutations that violate them are rejected before evaluation, never committed to the population.
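
A pre-evaluation gate for the two invariants named above might look like the following. This is a sketch that assumes the candidate config has been flattened to dotted keys; `violated_invariants` is a hypothetical helper, and the real invariants are encoded elsewhere in the codebase.

```python
def violated_invariants(config):
    """Return the invariants a flat, dotted-key config breaks (empty list means admissible)."""
    checks = {
        "classify.bootstrap.max_iterations >= 2":
            lambda c: c.get("classify.bootstrap.max_iterations", 0) >= 2,
        "classify.catboost.fit_to_llm = true":
            lambda c: c.get("classify.catboost.fit_to_llm") is True,
    }
    return [name for name, ok in checks.items() if not ok(config)]
```

A mutation is evaluated only when the returned list is empty; otherwise the list doubles as the rejection reason fed back to the proposer.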

Objectives (Pareto axes)

Objective	Source	Direction	Why operators care
Mean Bel of correct prediction	curated reference	maximize	core accuracy
Mean Pl − Bel	EVALUATING report	minimize	calibration tightness
LLM tokens / converged column	sweep accounting	minimize	governance budget
Cautious accuracy @ depth-N	`epistemic_evaluation`	maximize	hierarchy faithfulness
Vocab coverage @ τ	`classifications.json`	maximize	“did we touch every leaf?”
Pipeline duration	`fsm_runs.{started_at, updated_at}`	minimize	iteration speed

A configuration enters the frontier if no other configuration dominates it, that is, if no other configuration is at least as good on every axis and strictly better on at least one (non-dominated sorting). The frontier is open-ended in size; crowding-distance pruning bounds it under operator-defined caps.
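
The admission rule can be sketched with the standard Pareto-dominance test (no worse on every axis, strictly better on at least one). The axis names in `DIRECTIONS` are shortened stand-ins for three rows of the table above, and `dominates` and `frontier` are hypothetical names; crowding-distance pruning is omitted.

```python
# Objective directions; axis names are illustrative stand-ins for the table rows.
DIRECTIONS = {"bel_correct": "max", "mean_gap": "min", "tokens_per_column": "min"}


def dominates(a, b):
    """True if `a` is no worse than `b` on every axis and strictly better on at least one."""
    no_worse, strictly_better = True, False
    for axis, direction in DIRECTIONS.items():
        # Negate minimized axes so that "larger is better" holds uniformly.
        x, y = (a[axis], b[axis]) if direction == "max" else (-a[axis], -b[axis])
        if x < y:
            no_worse = False
        elif x > y:
            strictly_better = True
    return no_worse and strictly_better


def frontier(population):
    """Names of configurations that no other configuration dominates."""
    return {
        name
        for name, scores in population.items()
        if not any(dominates(other, scores) for o, other in population.items() if o != name)
    }
```

An accuracy-leader that spends more tokens and a budget-leader that gives up some accuracy both stay on the frontier, while a configuration that is worse than the budget-leader on all three axes is dropped.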

Population store (“config leaderboard”)

A persistent backing store records:

Each evaluated configuration as a row, keyed by hash of the config tuple (bit-stable across host reboots).
Every objective score per evaluation, with provenance back to the fsm_runs.id that produced it.
Lineage edges: which configuration mutated to which, via what proposer (APO-style critic vs. GEPA-style recombiner) and what diff.
Frontier membership over time, so an operator can see which configurations entered, dominated others, or were pruned.
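
One way to obtain a key that is stable across reboots is to hash a canonical serialization of the config. The sketch below assumes the config tuple is JSON-serializable; `config_key` is a hypothetical name. Python's built-in `hash()` is unsuitable here because string hashes are salted per process.

```python
import hashlib
import json


def config_key(config):
    """SHA-256 over a canonical JSON encoding: sorted keys, fixed separators."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two configs that differ only in key order map to the same row, so re-proposing an already evaluated configuration can be detected before spending an FSM run on it.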

This is conceptually a leaderboard — operators sort and filter by any axis or weighted combination — and structurally a write-once registry that supports re-evaluation as new corpora arrive. A frontier that holds against corpus A may not hold against corpus B; the registry preserves both views without conflating them.

The store interfaces with existing tables: it points at ml_artifact_sets rows (the bundle a winning config produced still ships through Extend Classification) and fsm_runs rows (each evaluation is one FSM run). It does not duplicate them — there is one source of truth for artifacts, and the leaderboard layers search-state on top.

Reflection loop — concrete shape

Per generation:

Sample a parent from the current frontier, weighted by either crowding distance (favor diversity) or recency (favor live operator priorities). A small fraction of generations sample a dominated ancestor instead, to escape local frontier traps.
Diagnose by feeding the parent’s run report to the reflection model. The report includes the final classifications, per-axis objective scores, the convergence trace, and any cautious-review findings.
Propose edits as a structured patch (JSON) against the configuration tuple — e.g. {"classify.bootstrap.gap_threshold": 0.05, "classify.svm.blend_ratio": 0.6} — or a textual prompt diff when the target is a prompt template.
Evaluate by instantiating the patched config, running it as an FSM run, and recording scores into the leaderboard.
Update the frontier via non-dominated sort; admit the new configuration if it is non-dominated; prune incumbents whose crowding distance falls below a threshold.
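
The Propose and Evaluate steps hinge on applying a dotted-key patch to the configuration tree. This is a minimal sketch, assuming the config is a nested dict and every intermediate key maps to a dict; `apply_patch` is a hypothetical helper.

```python
import copy


def apply_patch(config, patch):
    """Return a copy of a nested config with dotted-key overrides applied."""
    patched = copy.deepcopy(config)  # the parent config stays untouched for lineage
    for dotted, value in patch.items():
        node = patched
        *parents, leaf = dotted.split(".")
        for key in parents:
            node = node.setdefault(key, {})  # create missing branches on the way down
        node[leaf] = value
    return patched
```

Keeping the parent config unmodified matters for the lineage edges in the leaderboard: the parent, the patch, and the child can each be recorded as-is.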

Mutation diversity is encouraged via dual proposers: one focused on accuracy/calibration (the reflection model with a “be conservative” system prompt), one focused on cost/latency (the same model with an “aggressively shrink the budget” system prompt). The frontier preserves both styles rather than collapsing to whichever proposer happened to find an early local optimum.

What this retires

“Frontier SVM” terminology and the M9 retrain it described. The mid-loop train_svm_on_frontier_labels retrain that gave the “frontier SVM” its name was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for the source-independence reasons documented in ontology_alignment.py. The SVM is now trained once on synth with ICE.* labels and translated into the user vocabulary at inference time via the LLM-mediated alignment. “Frontier” the word is freed for the Pareto sense used elsewhere in this doc.
Single-config tuning by hand. Today operators tweak base.conf or the runtime overlay and re-run. The capstone replaces that loop with population-based search; the overlay UI surfaces frontier picks and lets operators promote one to active rather than asking them to choose individual values.
The “single best” mental model. Operators learn to think in trade-offs: “the accuracy-leader spends 4× tokens; the budget-leader loses 6 points of cautious accuracy at depth-3” — and the system surfaces both rather than averaging them away.

Non-goals (explicitly deferred)

Multi-tenant scheduling under CAI quotas. The search loop assumes single-tenant compute on the host’s GPU. Quota-aware scheduling is a separate concern.
Cross-corpus warm-start. A frontier from corpus A is not automatically transplanted to corpus B. The leaderboard preserves both, but transfer learning across taxonomies is research in its own right.
Re-training the embedding model in-loop. Embedding-model identity is locked per artifact set (already enforced for Extend runs); evolution can swap the embedding model only by spinning up a fresh population, not via mutation within an existing one.
Online / streaming evaluation. Pool-based AL is the operating mode. Streaming evaluation as columns arrive continuously is a candidate for v2 — the leaderboard would persist while the pool grows.

Open research questions

Cold start for the proposer. The reflection model needs at least one finished run before it can propose edits. Bootstrap with N random perturbations of the default config? Use APO-style beam search for the first generation, then expand into Pareto?
Noisy oracle problem. AL assumes the oracle is roughly correct. Opus is excellent but not infallible. The cautious-review pass catches some errors, but whether the leaderboard should down-weight configurations whose convergence relied on later-overturned LLM labels is open.
Convergence detection for the meta-loop. When does evolution stop? Frontier-stability heuristics (no admissions in K generations) versus operator-driven termination versus budget-exhausted.
Reflection-model agreement. APO’s textual-gradient critic and GEPA’s recombination critic are both LLM-driven. Do they propose meaningfully different edits, or do they collapse to the same suggestion? Worth empirical study before committing the architecture.
Reproducibility under stochastic LLM outputs. Two evaluations of the same config can disagree on objective scores. How much smoothing (multi-seed averaging) is required before non-dominated sorting becomes stable?

Cross-references

Classification Pipeline — the AL loop being generalized.
Synthetic Data & Training — synth provides the labeled-pool floor.
ML Artifacts & Extend Classification — winning configurations produce artifact sets that flow through the existing Extend pipeline.
GPU Acceleration — population-based search amplifies the payoff of fast per-evaluation rollouts.
Proposed Integrations — neighboring roadmap items that may interact with the leaderboard surface.

References

Settles, B. (2009). Active Learning Literature Survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
Agrawal, L. A. et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457. ICLR 2026 (Oral).
Pryzant, R. et al. (2023). Automatic Prompt Optimization with “Gradient Descent” and Beam Search. arXiv:2305.03495.
Microsoft Agent Lightning, APO Algorithm Documentation, https://microsoft.github.io/agent-lightning/latest/algorithm-zoo/apo/

Proposed Integrations

This page documents two planned integration points that extend the data source model: MLflow experiment tracking (Phase 5) and Hive data connections (Phase 6). Both are designed but not yet implemented.

MLflow Integration (Phase 5)

Motivation

On CAI deployments, MLflow is available as a managed service. Logging pipeline runs to MLflow provides:

Experiment history: compare accuracy, conflict, and coverage across pipeline versions without the Atelier UI
Model registry: when CatBoost/SVM models are trained, register them as versioned artifacts
Artifact persistence: classifications.json, evaluation reports, and parquet files survive pod restarts
Cross-project visibility: other CAI workloads can discover Atelier’s registered models

Architecture: Write-Then-Reconcile

The MLflow bridge follows the RAG Studio reconciler pattern — the pipeline never blocks on MLflow I/O.

Pipeline thread                    Reconciler (background)
──────────────                     ───────────────────────
write JSON to queue dir ──────►   poll queue dir
  (non-blocking)                   parse JSON envelope
                                   log to MLflow (retries)
                                   move to archive/

This design is resilient to:

MLflow downtime (queue accumulates, reconciler catches up)
Pipeline latency (no synchronous API calls in the hot path)
Pod restarts (queue dir is on persistent storage)

Queue Format

Each pipeline state transition writes a JSON envelope to build/mlflow_queue/:

{
  "event": "run_complete",
  "run_id": "abc123",
  "source_id": "ootb-sample",
  "timestamp": "2026-04-14T12:00:00Z",
  "payload": {
    "params": {
      "source_id": "ootb-sample",
      "vocabulary_mode": "universal",
      "sample_size": 50,
      "llm_model": "glm-4.7",
      "discount_cosine": 0.30
    },
    "metrics": {
      "accuracy": 0.847,
      "micro_f1": 0.832,
      "macro_f1": 0.791,
      "mean_belief": 0.724,
      "mean_conflict": 0.089,
      "coverage": 0.973,
      "llm_calls": 42,
      "bootstrap_iterations": 3
    },
    "artifacts": [
      "build/results/abc123/classifications.json",
      "build/results/abc123/evaluation_report.json",
      "build/results/abc123/atelier_embeddings.parquet"
    ]
  }
}

MLflow Experiment Structure

Each data source maps to an MLflow experiment:

Experiment: atelier/ootb-sample
├── Run: v1 (params, metrics, artifacts)
├── Run: v2 (params, metrics, artifacts)
└── Run: v3 (params, metrics, artifacts)

Experiment: atelier/hive-prod-default
└── Run: v1 (params, metrics, artifacts)

What Gets Logged

Category	Items	Notes
Params	source_id, vocabulary_mode, sample_size, llm_model, discount factors	Static per run
Metrics	accuracy, micro_f1, macro_f1, mean_belief, mean_conflict, coverage	Numeric scalars
Artifacts	classifications.json, evaluation_report.json, parquet	Full result set
Models	CatBoost (.cbm), SVM (.pkl)	Registered when newly trained

Module Design

# src/atelier/classify/mlflow_bridge.py

from pathlib import Path

class MLflowBridge:
    """Async write-then-reconcile bridge to MLflow."""

    def __init__(self, queue_dir: Path, experiment_prefix: str = "atelier"):
        self.queue_dir = queue_dir
        self.experiment_prefix = experiment_prefix

    def enqueue(self, event: str, run_id: str, source_id: str, payload: dict):
        """Write an event envelope to the queue (non-blocking)."""
        ...

    def reconcile(self):
        """Process all pending queue items. Called by background thread."""
        ...

Pipeline integration points:

# In pipeline.py — at key state transitions:
bridge.enqueue("run_started", run_id, source_id, {"params": {...}})
# ... pipeline work ...
bridge.enqueue("run_complete", run_id, source_id, {"metrics": {...}, "artifacts": [...]})

Gating

MLflow is only active on CAI (cfg.is_cml). In devenv, the bridge is a no-op. The mlflow package is an optional dependency — import failure is handled gracefully.
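A minimal sketch of the gating described above, assuming the bridge decides once at startup whether it has anything to do. The helper names `load_optional` and `bridge_is_active` are hypothetical and are not part of the proposed module.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError too; the caller degrades to a no-op.
        return None


def bridge_is_active(is_cml: bool, mlflow_module: Optional[ModuleType]) -> bool:
    """Hypothetical gate: do real work only on CAI with mlflow importable."""
    return is_cml and mlflow_module is not None


# Usage: in devenv (is_cml=False) the bridge stays inert even if mlflow exists.
mlflow_module = load_optional("mlflow")
active_in_devenv = bridge_is_active(False, mlflow_module)
```

With this shape, `enqueue()` and `reconcile()` can return early when the gate is false, so pipeline code never needs its own MLflow checks.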

Configuration

# config/base.conf (proposed)
mlflow {
    enabled = false
    enabled = ${?ATELIER_MLFLOW_ENABLED}
    tracking_uri = null
    tracking_uri = ${?MLFLOW_TRACKING_URI}
    queue_dir = "build/mlflow_queue"
}

Implementation Notes

The reconciler runs as a daemon thread started in the gateway lifespan, similar to the sample source seeding
Queue items are atomic files (write to .tmp, rename to .json) to prevent partial reads
Failed reconciliation retries with exponential backoff (max 5 min)
Archive dir (build/mlflow_queue/archive/) retains processed items for debugging
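
The atomic-write and backoff notes above could be sketched as follows. This is an illustration under stated assumptions, not the proposed implementation: the file naming scheme `<run_id>-<event>.json`, the function names, and the 1-second backoff base are all hypothetical. Only the `.tmp` then rename pattern and the 5-minute cap come from the notes.

```python
import json
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def enqueue_atomic(queue_dir: Path, event: str, run_id: str,
                   source_id: str, payload: dict) -> Path:
    """Write one envelope so a reader never sees a half-written file."""
    queue_dir.mkdir(parents=True, exist_ok=True)
    envelope = {
        "event": event,
        "run_id": run_id,
        "source_id": source_id,
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "payload": payload,
    }
    # Hypothetical naming scheme; the real queue may key files differently.
    final_path = queue_dir / f"{run_id}-{event}.json"
    tmp_path = final_path.with_suffix(".tmp")
    tmp_path.write_text(json.dumps(envelope, indent=2), encoding="utf-8")
    # os.replace is atomic when both paths are on the same filesystem.
    os.replace(tmp_path, final_path)
    return final_path


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff in seconds, capped at 5 minutes (300 s)."""
    return min(cap, base * (2 ** attempt))


# Usage: write one envelope into a throwaway directory and read it back.
with tempfile.TemporaryDirectory() as tmp:
    written = enqueue_atomic(Path(tmp), "run_complete", "abc123",
                             "ootb-sample", {"metrics": {"accuracy": 0.847}})
    written_name = written.name
    roundtrip = json.loads(written.read_text(encoding="utf-8"))
    leftovers = list(Path(tmp).glob("*.tmp"))
```

Because the reconciler only globs for `.json`, a crash between the write and the rename leaves a stray `.tmp` file that is simply ignored.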

Files (Proposed)

File	Action
`src/atelier/classify/mlflow_bridge.py`	New: bridge + reconciler
`src/atelier/classify/pipeline.py`	Extend: bridge calls at transitions
`config/base.conf`	Extend: mlflow config block
`src/atelier/config.py`	Extend: mlflow fields
`src/atelier/gateway.py`	Extend: reconciler daemon thread

Hive Data Source (Phase 6)

Motivation

The OOTB sample source demonstrates the pipeline with synthetic data. In production on CAI, the real value comes from classifying columns in the customer’s actual Hive tables via CAI data connections.

How It Works

Hive sources are auto-discovered at gateway startup. The gateway lifespan hook calls discover_hive_sources(cfg) which:

Iterates all connections listed in ATELIER_DATA_CONNECTIONS
For each connection, runs SHOW DATABASES and checks each database for an annotations table
Validates the schema: fetches 1 row and checks for legacy (id, ontology, annotation) or universal (code, label) format
Auto-registers valid sources via get_or_create_data_source() (idempotent — safe to re-run on restart)
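
The schema check in the third step can be sketched as a pure function over the column names of the fetched row. The function name and the case-insensitive, extra-columns-allowed matching are assumptions made for illustration; the two column sets come from the step description.

```python
from typing import Iterable, Optional

LEGACY_COLUMNS = {"id", "ontology", "annotation"}
UNIVERSAL_COLUMNS = {"code", "label"}


def detect_annotation_format(columns: Iterable[str]) -> Optional[str]:
    """Classify an annotations table as 'legacy', 'universal', or None."""
    names = {c.lower() for c in columns}
    if LEGACY_COLUMNS <= names:
        return "legacy"
    if UNIVERSAL_COLUMNS <= names:
        return "universal"
    return None  # not a valid annotations table; skip registration


# Usage: column names as they might come back from the 1-row fetch.
legacy = detect_annotation_format(["id", "ontology", "annotation"])
universal = detect_annotation_format(["code", "label", "parent_code"])
invalid = detect_annotation_format(["name", "value"])
```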

Once registered, the pipeline route works automatically:

Pipeline resolves data from the connection: when source_id refers to a hive source, the pipeline calls discover_tables() and sample_table_metadata() using that connection
Vocabulary routing: hive sources use load_annotations_from_hive() which reads default.annotations (domain categories) and composes them on top of the universal base
Results register as versions: each pipeline run creates a new version under the hive source, with the same activation/versioning semantics as the sample source

Data Flow

CAI Data Connection (Hive/Impala)
        │
        ▼
discover_tables(cfg, connection_name, database)
        │                    ┌─────────────────────────┐
        ▼                    │ load_annotations_from_   │
sample_table_metadata()      │ hive(cfg, connection)    │
        │                    │ → default.annotations    │
        ▼                    └──────────┬──────────────┘
                                        │
    ┌───────────────────────────────────┘
    ▼
compose_vocabularies(universal, hive_domain)
    │
    ▼
run_classification_pipeline(cfg, fsm, source_id="hive-prod-default")
    │
    ▼
Dataset version N+1 registered under hive source

Vocabulary Composition

Hive sources use two-layer vocabulary composition:

Layer 1 (always):   Universal vocabulary (16 BFO-grounded PII categories)
                              ╱╲
Layer 2 (hive only): Domain annotations from default.annotations table
                     (290+ customer-specific categories with hierarchical codes)
                              ╱╲
                     Composed CategorySet (300+ terms)

Domain categories attach to the universal tree via parent_code references. Categories without a valid parent are logged as warnings and placed under a catch-all internal node.
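
A minimal sketch of the attachment rule, assuming categories are simple records with `code`, `label`, and `parent_code`. The `Category` class, the `compose` function, and the catch-all code `unmapped` are hypothetical stand-ins; the real `compose_vocabularies()` signature is not shown in this document.

```python
import logging
from dataclasses import dataclass
from typing import Optional

log = logging.getLogger("atelier.sketch")
CATCH_ALL = "unmapped"  # hypothetical code for the catch-all internal node


@dataclass(frozen=True)
class Category:
    code: str
    label: str
    parent_code: Optional[str] = None


def compose(universal: list[Category], domain: list[Category]) -> dict[str, Category]:
    """Layer domain categories on the universal tree via parent_code."""
    composed = {c.code: c for c in universal}
    composed.setdefault(CATCH_ALL, Category(CATCH_ALL, "Unmapped domain categories"))
    # A valid parent is any universal node or any other domain node.
    known = set(composed) | {c.code for c in domain}
    for cat in domain:
        if cat.parent_code not in known:
            log.warning("category %s has no valid parent (%r)", cat.code, cat.parent_code)
            cat = Category(cat.code, cat.label, CATCH_ALL)
        composed[cat.code] = cat
    return composed


# Usage with synthetic categories (not the shipped vocabulary).
universal = [Category("1", "PII"), Category("1.1", "Contact", "1")]
domain = [Category("1.1.9", "Loyalty ID", "1.1"), Category("9.9", "Orphan", "7")]
composed = compose(universal, domain)
```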

Source Creation

When a user selects a data connection from the Status page dropdown and clicks “Create Source”, the gateway:

Validates the connection by running SHOW DATABASES

Creates a data_sources record:

{
  "id": "hive-{connection}-{database}",
  "source_type": "hive",
  "source_uri": "{connection}/{database}",
  "display_name": "hive:{connection}/{database}",
  "vocabulary_mode": "hive"
}

The source appears in the dropdown immediately
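
The record template above can be produced and parsed back with two small helpers. Both function names are hypothetical; the field values follow the template. The parser assumes connection names never contain a slash, which this document does not state.

```python
def make_hive_source(connection: str, database: str) -> dict:
    """Build the data_sources record for one connection/database pair."""
    return {
        "id": f"hive-{connection}-{database}",
        "source_type": "hive",
        "source_uri": f"{connection}/{database}",
        "display_name": f"hive:{connection}/{database}",
        "vocabulary_mode": "hive",
    }


def parse_source_uri(source_uri: str) -> tuple[str, str]:
    """Split '{connection}/{database}' on the first slash, validating both parts."""
    connection, sep, database = source_uri.partition("/")
    if not sep or not connection or not database:
        raise ValueError(f"expected 'connection/database', got {source_uri!r}")
    return connection, database


# Usage: the record round-trips through its own source_uri.
record = make_hive_source("prod", "default")
connection, database = parse_source_uri(record["source_uri"])
```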

Pipeline Routing

# In pipeline.py — source-based auto-resolution
if source.source_type == "hive":
    # source_uri is "{connection}/{database}"; split once on the first slash
    connection_name, database = source.source_uri.split("/", 1)
    # discover_tables() and sample_table_metadata() use the connection
    # load_annotations_from_hive() uses the connection for vocabulary

Configuration

No new configuration needed. Existing settings control Hive behavior:

classify {
    connection_name = ""                # Default CAI data connection
    connection_name = ${?ATELIER_CLASSIFY_CONNECTION}
    database = "default"
    database = ${?ATELIER_CLASSIFY_DATABASE}
}

cml {
    data_connections = ""               # Comma-separated connection names
    data_connections = ${?ATELIER_DATA_CONNECTIONS}
}

Files (Proposed Changes)

File	Change
`src/atelier/gateway.py`	Add `POST /api/data-sources` endpoint with connection validation
`src/atelier/classify/pipeline.py`	Extend source routing to resolve hive connections
`ui/src/pages/Status.tsx`	Add “Create Source” button in data connection card

Existing Modules Used (No Changes)

Module	Function	Role
`sampler.py`	`discover_tables()`	List tables via `cml.data_v1`
`sampler.py`	`sample_table_metadata()`	Sample column values
`taxonomy.py`	`load_annotations_from_hive()`	Load domain vocabulary
`taxonomy.py`	`compose_vocabularies()`	Merge universal + domain

Implementation Priority

Phase	Integration	Depends On	Testable Without Services
5	MLflow bridge	Phase 2 (data model)	Partially — queue/reconcile logic is pure Python
6	Hive source	Phase 2 (data model)	No — requires CAI data connection

Phase 5 can be developed and unit-tested independently (the queue and reconcile logic is pure Python). The MLflow API calls can be mocked in tier-0 BDD scenarios.

Phase 6 is primarily wiring — the heavy lifting (table discovery, vocabulary loading, pipeline execution) already exists. The main new code is the gateway endpoint for source creation and the UI for triggering it.

Encrypted Deployment Defaults (SOPS + age)

Atelier ships with encrypted deployment defaults so a CAI operator can stand up a working instance by entering only four environment variables — their two AWS Bedrock credentials, a direct Anthropic API key (for overwatch), plus a single age private key that unlocks everything else.

Why

Every CAI deployment needs a dozen-ish environment variables: Bedrock model ARNs, Atlas / Ranger URLs, feature toggles, governance flags, subagent model IDs, and — for UAT runs — a curated-reference CSV for accuracy measurement. Most of those values are identical across every deployment of the same Atelier release; only the AWS credentials and the Anthropic key are operator-specific. Rather than documenting a long checklist for every customer, we encrypt the defaults and the curated-reference fixture into the repository with SOPS and ship one key alongside the deployment.

The operator pastes in the key; everything else is already wired up.

Operator workflow (what to tell your CAI users)

Set four environment variables on the CAI Application, then start it:

Name	Value	Source
`AWS_ACCESS_KEY_ID`	Bedrock access key	your AWS / IAM team
`AWS_SECRET_ACCESS_KEY`	Bedrock secret	your AWS / IAM team
`ANTHROPIC_API_KEY`	direct Anthropic API key	Anthropic Console
`SOPS_AGE_KEY`	full `AGE-SECRET-KEY-1…` string	provided out-of-band by the Atelier maintainer

On startup, bin/start-app.sh runs the shared bin/bootstrap-secrets.sh utility, which decrypts both .env.cai.enc (dotenv defaults) and features/fixtures/curated_reference.csv.enc (meta-tagging answer key) with the age key you provided. The dotenv values source into the environment where HOCON’s ${?VAR} substitution picks them up; the decrypted CSV materializes at build/data/curated_reference.csv and ATELIER_CLASSIFY_REFERENCE_URI points at it so evaluation_report.json carries real accuracy numbers. No per-customer checklist to maintain.

Overrides still work. Any explicit ATELIER_* env var on the CAI Application wins over the encrypted default — so an operator who wants a different Bedrock ARN just sets ATELIER_AGENT_MODEL directly and that value takes precedence.
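
The precedence rule can be modelled in a few lines. This is only a sketch of the stated semantics (explicit value wins over decrypted default), not a description of how bin/start-app.sh implements it; the helper names and the placeholder values are hypothetical.

```python
def parse_dotenv(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"')
    return values


def effective_env(decrypted_dotenv: str, explicit_env: dict[str, str]) -> dict[str, str]:
    """Start from decrypted defaults, then let explicit operator values win."""
    merged = parse_dotenv(decrypted_dotenv)
    merged.update(explicit_env)
    return merged


# Usage with placeholder values (not real ARNs).
defaults = 'ATELIER_AGENT_MODEL="default-model-arn"\nATELIER_MLFLOW_ENABLED=true\n'
env = effective_env(defaults, {"ATELIER_AGENT_MODEL": "custom-model-arn"})
```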

Alternative: pointing at a key file

If the operator already has the age key on disk (e.g. mounted from a secret store), they can set SOPS_AGE_KEY_FILE=/path/to/key.txt instead of pasting the key content. bin/start-app.sh supports both.

Maintainer workflow

The age public key is committed in .sops.yaml; the private key is held by the Atelier maintainer and distributed out-of-band to each CAI operator.

First-time setup

Place your age private key at ~/.config/sops/age/keys.txt — the public key must match the age: age1… line in .sops.yaml. The devenv shell provides both sops and age binaries.

Editing defaults

just decrypt-secrets          # .env.cai.enc → .env.cai (plaintext, gitignored)
$EDITOR .env.cai              # add / change values
just encrypt-secrets          # .env.cai → .env.cai.enc
git add .env.cai.enc
git commit -m "chore: update CAI deployment defaults"

The plaintext .env.cai is excluded by .gitignore; only the encrypted .env.cai.enc is tracked. SOPS encrypts each value independently, so diffs show which keys changed even though their values are opaque.

Editing the curated-reference fixture

The meta-tagging answer key (what evaluation_report.json compares predictions against) ships encrypted under the BDD fixtures tree so committed secrets live with the corpus they validate.

# From the maintainer's reviewer xlsx
uv run python -m atelier.overwatch.ingest_reference \
    ~/path/to/Atelier_Results_Default_DB_4-16.xlsx \
    --out build/data/curated_reference.csv

# Encrypt into features/fixtures/ and commit the ciphertext only
just encrypt-reference
git add features/fixtures/curated_reference.csv.enc
git commit -m "chore: update curated-reference answer key"

To inspect the current key without re-running the xlsx ingest:

just decrypt-reference        # decrypts into build/data/curated_reference.csv
$PAGER build/data/curated_reference.csv

Both the plaintext CSV (in build/) and .env.cai are ignored by git; only the .enc ciphertexts are tracked.

Rotating the key

age-keygen -o new-key.txt                                    # generate replacement pair
# update .sops.yaml: replace the age: age1... line with the new public key
sops updatekeys .env.cai.enc                                 # re-encrypt deployment defaults
sops updatekeys features/fixtures/curated_reference.csv.enc  # AND the curated-reference fixture
git commit -am "chore: rotate CAI deployment key"
# distribute the new private key to operators via the same out-of-band channel

sops updatekeys rewrites the encrypted file’s recipient list in place — nothing about the plaintext values changes, so this is a zero-content-drift rotation. Run it against every encrypted artifact so the new key unlocks the whole set.

Adding a second recipient (e.g. ops team shared key)

Add a second age: entry under the matching creation_rules block in .sops.yaml, then run sops updatekeys against every encrypted artifact (.env.cai.enc and features/fixtures/curated_reference.csv.enc). Either private key will then decrypt.

How this fits with HOCON

SOPS only populates environment variables. HOCON (config/base.conf) already treats all configuration as environment-overridable via the ${?VAR} pattern:

agents {
  model = "claude-opus-4-7"
  model = ${?ATELIER_AGENT_MODEL}     # env wins when set
}

SOPS decryption runs before the gRPC server loads HOCON, so from HOCON’s perspective the encrypted values are just ordinary environment variables.

What belongs in `.env.cai.enc` vs `config/base.conf`

.env.cai.enc: deployment-specific defaults that differ between environments but aren’t operator secrets per se (model ARNs, Knox endpoints, feature toggles, subagent IDs). These are values an operator could derive from context but should not have to rediscover.
config/base.conf: true defaults that hold for every deployment; structural knobs that belong in source control in plaintext (pipeline thresholds, port numbers, fusion strategy).
Operator-entered env vars: genuine per-deployment secrets (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, ANTHROPIC_API_KEY, and the SOPS_AGE_KEY itself). These never live in the repository.

Security notes

SOPS_AGE_KEY decrypts only this project’s encrypted artifacts: .env.cai.enc and features/fixtures/curated_reference.csv.enc. Losing it costs you these defaults and the answer key; gaining it grants no AWS, Cloudera, or third-party privilege on its own.
Each customer should get the same age private key (defaults are identical across deployments) — per-customer secrets, if any, stay in the CAI Application’s own environment variables.
Rotate the key whenever a recipient leaves the operator pool.
The age public key in .sops.yaml is intentionally committed; public keys are meant to be public.

Reference

.sops.yaml — recipient rules (covers .env.cai.enc + features/fixtures/*.csv.enc)
.env.cai.enc — encrypted deployment defaults (committed)
features/fixtures/curated_reference.csv.enc — encrypted curated-reference CSV (committed)
bin/bootstrap-secrets.sh — shared decrypt utility; runs from bin/start-app.sh, devenv enterShell, and just bootstrap-secrets
bin/start-app.sh — CAI startup; invokes bootstrap-secrets then sources .env.cai
justfile helpers:
- bootstrap-secrets — run the shared decrypt utility
- decrypt-secrets / encrypt-secrets — dotenv editing workflow
- decrypt-reference / encrypt-reference — curated-reference CSV editing workflow
devenv.nix — provides sops + age in the dev shell; runs bootstrap in enterShell
SOPS docs · age docs

Reviewer’s Guide to the Embeddings Canvas

This guide is for operators auditing classification runs and proposing algorithm-tuning remediations. It explains the Dempster–Shafer (DST) measures the canvas exposes, the rationale behind the curated SQL Predicate panel, and a concrete walk-through of using the canvas to diagnose the four root causes called out in audit_2026-05-06_a.md (runs 40f07630, 8d67b1ed, e5b0ac26).

The guide assumes you have an Embeddings page open for one of those runs and a copy of the audit alongside.

1. The DST measures, in plain English

Atelier fuses up to six independent evidence sources (name match, pattern, cosine, LLM, CatBoost, SVM) via Dempster’s rule of combination. The fused result is a mass function over the taxonomy’s frame of discernment. From that mass function we report five scalars per column:

Field	Formula	Meaning
`belief`	`Bel(A) = Σ m(B), B ⊆ A`	Lower bound on the probability the prediction is correct. Mass committed only to A or its subsets.
`plausibility`	`Pl(A) = Σ m(B), B ∩ A ≠ ∅`	Upper bound. Mass consistent with A — what hasn’t been ruled out.
`uncertainty`	`Pl(A) − Bel(A)`	The width of the [Bel, Pl] interval. Epistemic uncertainty, smaller is better.
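`confidence`	`BetP(x) = Σ_{x∈A} m(A) / |A|`	Pignistic probability. Each focal element’s mass is split evenly across its singleton members, so ignorance is redistributed rather than withheld.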
`conflict`	K, the pre-normalization mass on ∅	Source-disagreement diagnostic. Under Dempster’s rule, K is normalized out of [Bel, Pl] but logged separately; under Yager, K is redirected to ignorance (Θ).

The invariant: Bel(A) ≤ BetP(A) ≤ Pl(A) for every column. Duality also holds: Pl(A) + Bel(Ā) = 1, where Ā is the complement of A.
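
The measures in the table can be computed directly from a mass function. The sketch below represents focal elements as frozensets and uses a synthetic three-category frame; the function names and the example masses are illustrative and are not taken from the Atelier code or from any run.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def bel(m: Mass, a: FrozenSet[str]) -> float:
    """Belief: mass on non-empty focal elements contained in A."""
    return sum(v for b, v in m.items() if b and b <= a)


def pl(m: Mass, a: FrozenSet[str]) -> float:
    """Plausibility: mass on focal elements that intersect A."""
    return sum(v for b, v in m.items() if b & a)


def betp(m: Mass, x: str) -> float:
    """Pignistic probability: each focal element's mass split evenly."""
    return sum(v / len(b) for b, v in m.items() if x in b)


theta = frozenset({"EMAIL", "PHONE", "NAME"})
m: Mass = {
    frozenset({"EMAIL"}): 0.6,
    frozenset({"EMAIL", "PHONE"}): 0.2,
    frozenset({"PHONE"}): 0.1,
    theta: 0.1,  # ignorance
}
a = frozenset({"EMAIL"})

belief = bel(m, a)                  # 0.6
plausibility = pl(m, a)             # 0.6 + 0.2 + 0.1
uncertainty = plausibility - belief
confidence = betp(m, "EMAIL")       # 0.6 + 0.2/2 + 0.1/3
```

On this example the ordering Bel ≤ BetP ≤ Pl holds, and `pl(m, a) + bel(m, theta - a)` sums to 1, matching the duality stated above.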

Which one is the “rigor” signal?

For a single positive scalar, prefer belief. It is the honest floor: mass already committed to A, which stays with A however the remaining ignorance is later resolved. The cautious-review gate (bel_threshold = 0.80) operates on Bel; bootstrap convergence is on the gap (Pl − Bel); needs_clarification fires when Bel < 0.80 OR gap > 0.20. The project’s algorithms already treat Bel as the truth proxy; the reviewer should too.
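
The needs_clarification rule is small enough to state as code. The function below is a sketch with hypothetical parameter names; only the two default thresholds come from the text.

```python
def needs_clarification(bel: float, pl: float,
                        bel_threshold: float = 0.80,
                        gap_threshold: float = 0.20) -> bool:
    """Flag a column when belief is low OR the [Bel, Pl] interval is wide."""
    return bel < bel_threshold or (pl - bel) > gap_threshold


# Usage on illustrative (Bel, Pl) pairs.
committed = needs_clarification(0.834, 0.933)   # False
weak = needs_clarification(0.30, 0.85)          # True
# With a lower, hypothetical bel_threshold the gap clause acts on its own.
wide_gap = needs_clarification(0.60, 0.90, bel_threshold=0.50)  # True
```

One consequence worth knowing when reading the flag: because Pl ≤ 1, a gap above 0.20 already implies Bel < 0.80, so at the default thresholds every row flagged by the gap clause is also flagged by the Bel clause. The gap clause only adds rows if bel_threshold is set below 1 − gap_threshold.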

confidence is not redundant — it serves a different purpose. BetP redistributes ignorance optimistically, so a vacuous mass function over a 16-singleton frame still produces BetP ≈ 0.06 per singleton. Comparing belief to confidence on the same row is how reviewers build intuition for “how much of this column’s prediction is committed evidence vs. evenly-spread ignorance.” Big gap between Bel and BetP = the prediction looks confident only because the rest of the frame is empty.

A worked example from 8d67b1ed’s row 1 (fitness_members.row_id):

Bel = 0.834   BetP = 0.834   Pl = 0.933   K = 0.358

BetP and Bel align tightly because the mass is concentrated on singletons (no large compound focal elements to spread). A healthy, committed prediction.

Compare with a hypothetical weak prediction:

Bel = 0.30    BetP = 0.55    Pl = 0.85    K = 0.10

A headline confidence of 0.55 masks a Bel of 0.30. Only 30% of the mass is committed to the prediction; another 55% (Pl − Bel) sits on compound focal elements that contain it, and BetP credits a share of that mass to the singleton. Reviewer’s read: this is not a 0.55 prediction; it’s a 0.30 prediction wearing a 0.55 hat.

Why `conflict` is no longer the canvas color default

Under Dempster’s rule (the default fusion strategy), K is normalized out of [Bel, Pl] — every fused mass function is renormalized by (1 − K). K still gets reported as a diagnostic, but it does not correlate with prediction quality. Run 8d67b1ed averaged K = 0.27 across all 287 columns; rows with very different beliefs (Bel = 0.30 vs Bel = 0.85) commonly share the same K. Coloring by K painted the canvas a nearly-uniform fog.

belief paints the canvas with information. Low-Bel rows (the cautious-review candidate pool) cluster in warm colors; committed predictions cool out. The 0.80 cliff that drives the cautious review and needs_clarification is visible — it’s the threshold between a calm canvas and a hot-spot region that demands human attention.

If you switch the run to Yager fusion, K is no longer normalized out — it shows up as ignorance mass, which depresses Bel and widens the gap. Reviewers comparing fusion strategies side-by-side should look at Bel + gap on both, not at K — K means different things under the two rules.
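
To make the difference concrete, here is a two-source toy fusion under both rules. It is a sketch on a synthetic two-category frame with made-up masses, written from the textbook definitions of Dempster's and Yager's rules; it is not Atelier's fusion code.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Mass = Dict[FrozenSet[str], float]


def conjunctive(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Unnormalized conjunctive combination; returns (masses, conflict K)."""
    joint: Mass = {}
    k = 0.0
    for (b, v1), (c, v2) in product(m1.items(), m2.items()):
        inter = b & c
        if inter:
            joint[inter] = joint.get(inter, 0.0) + v1 * v2
        else:
            k += v1 * v2  # mass that would land on the empty set
    return joint, k


def dempster(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Dempster's rule: drop K and renormalize the rest by (1 - K)."""
    joint, k = conjunctive(m1, m2)
    if k >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {a: v / (1.0 - k) for a, v in joint.items()}, k


def yager(m1: Mass, m2: Mass, theta: FrozenSet[str]) -> Tuple[Mass, float]:
    """Yager's rule: move K onto the whole frame (ignorance)."""
    joint, k = conjunctive(m1, m2)
    joint[theta] = joint.get(theta, 0.0) + k
    return joint, k


theta = frozenset({"EMAIL", "PHONE"})
email, phone = frozenset({"EMAIL"}), frozenset({"PHONE"})
source_a: Mass = {email: 0.8, theta: 0.2}
source_b: Mass = {phone: 0.6, theta: 0.4}

d_mass, k = dempster(source_a, source_b)     # K = 0.8 * 0.6 = 0.48
y_mass, _ = yager(source_a, source_b, theta)
d_bel, d_gap = d_mass[email], d_mass[theta]  # Bel(EMAIL) and Pl - Bel
y_bel, y_gap = y_mass[email], y_mass[theta]
```

With the same K = 0.48, Dempster reports Bel(EMAIL) ≈ 0.62 with a gap of ≈ 0.15, while Yager reports Bel(EMAIL) = 0.32 with a gap of 0.56. On this two-singleton frame the gap equals the mass on Θ, which is why the last two lines read it from there.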

2. The curated SQL Predicate panel

Embedding-Atlas’s default behavior is to auto-generate one chart per data column. With 35 fields in the parquet, that’s noise: tooltips overlap with the canvas, projection coordinates render as histograms, JSON blobs render as illegible text fields.

The curated panel exposes only fields that map to an algo-tuning decision. Order is intentional — top to bottom, the panel walks the reviewer from “is this run healthy” → “where is the pain concentrated” → “which feature is driving it.”

#	Field	Chart shape	Why it’s there
1	`belief`	Histogram	Primary quality signal. The 0.80 cliff is the cautious-review threshold; rows below are the candidate pool. Brushing this filters the canvas to “weak” predictions.
2	`confidence`	Histogram	BetP. Side-by-side with `belief` builds intuition for the Bel-vs-BetP gap. Wide gap on a row = mass concentrated on compound focal elements.
3	`review_decision`	Count plot	Categorical: `keep` / `backoff` / `reroute` / `""` (untouched). This is the audit’s central concern — Finding 1 names reroute as the instability amplifier.
4	`predicted_annotation`	Count plot	Compact mnemonic (e.g. `NAMEFULL`, `EMAIL`, `PHONE`) — the dot-codes are unreadable in a small chart, but the annotation tells the same story. The full label appears in the embedding tooltip.
5	`needs_clarification`	2-bar count	Boolean union of `Bel < 0.80 OR gap > 0.20`. The “demands attention” set, expressed as a single flag.
6	`llm_confidence`	Histogram	LLM’s self-reported confidence. Low-tail rows are the population at risk for reroute amplification — a weakly-asserted LLM code that DST then has to defend.
7	`uncertainty`	Histogram	Pl − Bel — gap-driven revisit set. Bootstrap convergence is on `mean(uncertainty)`; canvas histogram lets reviewers see whether the run actually converged or just hit max-iterations.
8	`conflict`	Histogram	K, demoted from default but kept as a source-disagreement diagnostic. Useful when comparing Dempster vs Yager runs (K means different things under each).
9–14	`shap_top1/2/3_name`, `shap_top1/2/3_value`	Count + histogram pairs	Surfaces which feature is driving each prediction. Top-1 is usually `sample_values`; top-2/3 reveal sibling-context vs column-name dominance. The intentional inclusion of all three reflects the steep dropoff in SHAP utility between top-1 and top-3 — the dropoff is itself the situational signal. When top-1 dominates by 5×, single-feature explanations work; when top-1/2/3 are flat, the prediction is broadly diffuse and remediation needs to address feature-engineering, not source weights.
15	`table_name`	Count plot	Hotspot navigation. Audit calls out `legal_cases` and `loan_applications` as hallucination concentration zones; this chart lets reviewers brush-filter to one.
16	`column_type`	Count plot	Numeric vs object. Pattern-signal source is type-conditioned; reviewing remediations to pattern detectors benefits from typed slicing.

What’s not in the default panel. Reference fields (reference_code, reference_label, matches_reference) are usually empty for production Hive data. When a run does have a curated reference set (UAT meta-tagging mounts), reviewers can add reference_code and matches_reference via the SQL Predicates control on the panel header — type into the predicate input directly, or click “Add” and pick the column. The panel re-renders instantly. Same mechanism applies to any field the reviewer wants ad-hoc — e.g., predicted_label for a long-form taxonomy view, or predicted_code when a numeric dot-code is needed for filtering.

3. Walk-through against `audit_2026-05-06_a.md`

The audit identifies four root causes. Below: how to reach each one on the canvas, what the right brushing pattern is, and what the algo-tuning lens reveals.

Finding 1 — Three-way reroute as instability amplifier

“20.6% of columns flip between runs with identical configuration. The reroute mechanism turns a minor LLM fluctuation into a major classification change.”

Brush: review_decision = "reroute" on chart 3.

What you see: the canvas highlights all rerouted rows. Look at their distribution — are they clustered in one taxonomy region (a single subtree’s entropy bleeding into neighbors), or are they scattered (the LLM is fluctuating uniformly)?

Cross-brush with belief chart 1, brushing the 0.40–0.70 band: this is the cohort that fails the 0.80 threshold but isn’t trivially weak. Reroute decisions are most consequential here — an LLM fluke on a 0.85-Bel row gets rejected by the threshold; on a 0.65-Bel row it gets handed to the reviewer. The audit’s recommended P1 guard (“reject reroutes where pre-review code matches LLM code with conf > 0.80”) would visibly clip the right edge of this brush.

Algo-tuning read: if rerouted rows cluster around belief ≈ 0.5 and have llm_confidence > 0.80, the audit’s P1 guard is the right remediation. If they cluster at belief < 0.4, the upstream issue is fusion strength, not the reviewer.

Finding 2 — LLM annotation-code hallucination (27 columns)

“The LLM returned annotation mnemonics (SSN, DOB, FNAME) instead of numeric taxonomy codes for 27 columns in 40f07630. When llm_code is an annotation string, the code-resolution layer discards it — evidence_sources.llm = {}.”

Brush: llm_confidence chart 6, isolate the low-confidence tail below 0.20. These are columns whose LLM evidence was discarded (the evidence layer assigns 0 confidence when the code fails to resolve).

Cross-brush with belief (chart 1) — affected columns will pile up at low Bel because they fall back to cosine alone.

SHAP signal: chart 9 (shap_top1_name) on the same brush should show column_name or sibling_context dominating instead of sample_values. When SHAP’s top-1 is not sample_values, the classifier wasn’t given enough evidence from the values themselves — the LLM-evidence loss is showing up as an upstream feature-importance shift.

Algo-tuning read: the audit’s P0 (“map annotation mnemonics to numeric codes in _resolve_llm_code()”) would eliminate this brush entirely — its impact is visible as the disappearance of a low-tail cluster on llm_confidence. Reviewer can size the impact: “≈ 27 columns × mean(belief gain) = X total mass committed.”

Finding 3 — `col_04` and sibling-context poisoning

“When sibling opaque columns (col_02, col_32) are all misclassified as Shipping Address (because the table name biases the embedding), the reviewer uses those wrong sibling labels as evidence to perpetuate the error.”

Brush: shap_top1_name = "sibling_context" on chart 9, then cross-brush column_name LIKE 'col\\_%' via the SQL Predicate input. This is the at-risk population.

What you see: the rerouted opaque columns cluster on the canvas near their (incorrectly inferred) neighbors. When the reviewer-bias poisoning is at work, these clusters will show consistent predicted_annotation across the cluster — the error has propagated.

Cross-brush with review_decision = "reroute": rerouted opaque columns where SHAP shows sibling-context dominance are the precise target of the audit’s P2 remediation (“exclude sibling columns with opaque names from reviewer context”).

Algo-tuning read: the size of this brush is the population the P2 remediation removes. If SHAP top-2 and top-3 (charts 11, 13) are also dominated by sibling-context for these rows, the value-side evidence is systematically under-represented and the remediation needs to extend beyond “exclude opaque siblings” to “rebalance feature weights when sample-values entropy is high.”

Finding 4 — Baseline 20% non-determinism

“Between e5b0ac26 and 8d67b1ed — identical configuration, same dataset, 5 hours apart — 59 of 287 columns (20.6%) changed their final predicted_code. This establishes the non-determinism floor.”

This finding is cross-run; one canvas can’t render it directly. But it manifests on the canvas as confidence vs belief gap dispersion. A run with high non-determinism has many columns where confidence diverges from belief — these are the rows whose mass is spread across compound focal elements rather than committed to singletons, making them sensitive to small evidence perturbations.

Brush: in the SQL Predicate input, type confidence - belief > 0.15. The canvas highlights the diffuse-mass cohort. These are the rows most likely to flip on the next run.

Algo-tuning read: the audit’s P2 (raise bel_threshold from 0.80 to 0.85–0.90) tightens the cautious-review entry criterion — fewer borderline rows enter review, fewer reroutes amplify. Brushing belief ∈ (0.80, 0.85) shows the population the threshold raise removes from review, which is also the high-flip-rate population. A 5-point threshold raise on the 8d67b1ed canvas removes ≈ 30 columns from the review pool — pre-computable from the histogram.

4. Algo-tuning playbook

When you arrive at a fresh canvas, walk top to bottom:

Healthy run check: belief chart, look at the mass below 0.80. If < 10% of the corpus is below the cliff, the run converged comfortably. If > 25%, something upstream (LLM, alignment, vocab) is weak.
Reroute pressure check: review_decision, count the reroute bar. Reroutes > 20% of total = the reviewer is doing too much work; consider raising bel_threshold (audit P2) or constraining the shortlist.
LLM-evidence integrity check: llm_confidence, look for the < 0.20 tail. Population of that tail = approximately the annotation-hallucination cohort (audit P0).
Feature-driver check: shap_top1_name, see whether sample_values dominates. When it doesn’t, the prediction is leaning on schema/sibling context — fragile.
Hotspot triage: table_name, see whether failures concentrate in a small number of tables. Per-table failure patterns often point to vocabulary alignment gaps that affect only certain domains (legal, financial, medical).
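
The same five checks can be pre-computed from the run's rows before opening the canvas. The sketch below assumes the rows are already loaded as dictionaries with the field names used in the panel; the function name, the `borderline` label for the 10–25% band, and the example rows are all invented for illustration.

```python
from collections import Counter


def playbook_summary(rows: list[dict], bel_cliff: float = 0.80) -> dict:
    """Compute the five playbook checks for one run."""
    if not rows:
        raise ValueError("no rows to summarize")
    n = len(rows)
    weak = [r for r in rows if r["belief"] < bel_cliff]
    below_cliff = len(weak) / n
    if below_cliff < 0.10:
        health = "converged"
    elif below_cliff > 0.25:
        health = "weak upstream"
    else:
        health = "borderline"  # label invented for this sketch
    reroute_share = sum(r["review_decision"] == "reroute" for r in rows) / n
    return {
        "below_cliff": below_cliff,
        "health": health,
        "reroute_share": reroute_share,
        "reroute_pressure": reroute_share > 0.20,
        "llm_low_tail": sum(r["llm_confidence"] < 0.20 for r in rows),
        "sample_values_top1_share":
            sum(r["shap_top1_name"] == "sample_values" for r in rows) / n,
        "weak_by_table": Counter(r["table_name"] for r in weak).most_common(),
    }


# Synthetic rows, not taken from any audited run.
rows = [
    {"belief": 0.91, "review_decision": "keep", "llm_confidence": 0.95,
     "shap_top1_name": "sample_values", "table_name": "orders"},
    {"belief": 0.55, "review_decision": "reroute", "llm_confidence": 0.10,
     "shap_top1_name": "sibling_context", "table_name": "legal_cases"},
    {"belief": 0.62, "review_decision": "reroute", "llm_confidence": 0.85,
     "shap_top1_name": "column_name", "table_name": "legal_cases"},
    {"belief": 0.88, "review_decision": "", "llm_confidence": 0.90,
     "shap_top1_name": "sample_values", "table_name": "orders"},
]
summary = playbook_summary(rows)
```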

The remediations the audit recommends should each have a visible signature on the canvas. When you propose a fix, predict where on the canvas the fix will land — and verify against the next run.

5. Configuration reference

The curated panel is configured in ui/src/pages/Embeddings.tsx via the defaultChartsConfig prop on the EmbeddingAtlas component. The category field on the embedding spec sets the canvas color; the include array sets the predicate panel contents and order.

Reviewers needing different fields for a one-off audit can use the SQL Predicate control at the top of the predicate panel — type a SQL expression directly (e.g. predicted_code = '1.1.1.9.1' AND review_decision = 'reroute') and brush the result. The expression composes with all other brushes on the canvas.

For permanent additions to the default panel, edit the include array; the order in the array is the order in the panel. Avoid adding high-cardinality fields (column_name, evidence, embedding_text) — they render as illegible count plots.

Addendum — Remediation Paper-Trade Observations (2026-05-06)

This addendum captures observations from the static validation of the algo-tuning playbook against 8d67b1ed’s parquet, and the paper-trade of each audit_2026-05-06_a.md remediation against the same run. It is intended both as honest documentation of what worked vs. what needed adjustment, and as the calibration baseline against which the post-remediation validation run will be evaluated.

A.1 Playbook validation findings

Walking the playbook brushes against 8d67b1ed/atelier_embeddings.parquet surfaced three corrections to the original guide:

Correction 1 — `BetP − Bel` brush is empty in practice

The playbook section on Finding 4 (baseline non-determinism) prescribes brushing confidence - belief > 0.15 to find the diffuse-mass cohort. On 8d67b1ed:

mean(BetP - Bel) = 0.0007
rows with gap > 0.15 = 0
rows with gap > 0.05 = 0

Why: mass concentrates on singleton focal elements in this corpus, so the pignistic transform has nothing to redistribute. BetP ≈ Bel everywhere. The theoretical intuition (BetP optimistically spreads ignorance) is sound but only manifests when significant mass lives on compound focal elements — rare in production runs.

Replacement brush for the non-determinism cohort: uncertainty > 0.20 (Pl − Bel above the cautious-review gap threshold). This does populate; in 8d67b1ed, 199/287 rows (69%) carry uncertainty > 0.20, so for narrowing purposes pair it with belief < 0.6 to focus on the genuinely weak predictions.

Correction 2 — `bel_threshold` direction is the opposite of the audit’s claim

The audit’s P2 recommendation says:

Raise bel_threshold from 0.80 to 0.85-0.90 → reduces candidate pool

This is mechanically false. The threshold gates entry to cautious review (cautious_review.py:454: if bel < bel_threshold); raising it strictly enlarges the candidate pool. Measured on 8d67b1ed:

`bel_threshold`	Candidates
0.80	199 / 287 (69.3%)
0.85	239 / 287 (83.3%)
0.90	255 / 287 (88.9%)
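
The direction of the effect follows from the gate's comparison alone, which a few lines make explicit. The belief values below are synthetic and chosen only to show monotonicity; the measured counts for 8d67b1ed are the ones in the table above.

```python
def candidate_pool(beliefs: list[float], bel_threshold: float) -> int:
    """Count rows that enter cautious review under `bel < bel_threshold`."""
    return sum(1 for b in beliefs if b < bel_threshold)


# Synthetic beliefs, for illustration only.
beliefs = [0.30, 0.62, 0.79, 0.82, 0.84, 0.88, 0.93]
pools = {t: candidate_pool(beliefs, t) for t in (0.80, 0.85, 0.90)}
```

Every row admitted at a lower threshold is still admitted at a higher one, so the pool can only stay the same or grow as bel_threshold rises.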

Decision: R4 (the bel_threshold raise) is deferred. R1 (annotation-mnemonic recovery) materially lifts Bel for the 33%-of-corpus cohort with previously-empty LLM evidence; the candidate-pool size after R1 may make a threshold adjustment unnecessary. Re-evaluate after the validation run.

Correction 3 — Audit’s “16 hallucination cases” undercount

Audit Finding 2 cites “~16 cols hallucinate annotation in 8d67b1ed.” The actual count of rows whose LLM evidence is absent from the fused mass is 95 / 287 (33%) — six times the audit’s number. The discrepancy is partly because the audit conflated two distinct cases: true mnemonic emission (which R1 recovers) and LLM voting at a parent focal element (which _mass_summary filters out of the singleton-only evidence_sources.llm field even though the mass is fully present in the fused result). The latter is not a hallucination — it’s an observability artifact in _mass_summary.

Implication for the canvas: rows with evidence_sources.llm = {} should not be read as “LLM contributed nothing.” When the row’s llm_code is non-empty and falls inside the runtime taxonomy, the LLM voted at an internal node and contributed mass through that parent FE. The brush is more honest as a resolution-failure indicator: filter to rows where llm_code is non-numeric AND evidence_sources.llm is empty to find the genuine mnemonic cohort.
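
The resolution-failure filter can be expressed as a row predicate. This sketch assumes `evidence_sources` has already been parsed into a dictionary and that numeric taxonomy codes are dot-separated digits (as in `1.1.1.9.1`); the function name and the regular expression are assumptions.

```python
import re

# Dot-separated digits, e.g. "1.1.1.9.1"; anything else counts as non-numeric.
DOT_CODE = re.compile(r"^\d+(?:\.\d+)*$")


def is_mnemonic_resolution_failure(row: dict) -> bool:
    """True when the LLM emitted a non-numeric code AND no LLM mass was recorded."""
    llm_code = (row.get("llm_code") or "").strip()
    llm_evidence = (row.get("evidence_sources") or {}).get("llm") or {}
    return bool(llm_code) and DOT_CODE.match(llm_code) is None and not llm_evidence


# Usage on illustrative rows.
mnemonic = is_mnemonic_resolution_failure(
    {"llm_code": "SSN", "evidence_sources": {"llm": {}}})
parent_vote = is_mnemonic_resolution_failure(
    {"llm_code": "1.1.1.9", "evidence_sources": {"llm": {}}})
resolved = is_mnemonic_resolution_failure(
    {"llm_code": "1.1.1.9.1", "evidence_sources": {"llm": {"1.1.1.9.1": 0.7}}})
```

The second row is the internal-node case: the code is numeric, so it is excluded from the mnemonic cohort even though the singleton-only summary is empty.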

2026-05-07 update (R10): _mass_summary now surfaces internal-node FEs alongside singletons. Internal-node entries carry a trailing * (e.g. "1.1.1.9.4*": 0.65) to distinguish “parent FE, mass spread across descendants” from a singleton-leaf vote. The singleton-only filter is gone, so evidence_sources.llm = {} now means the LLM produced no code we could map at all, which is the intended semantics.
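
A consumer of the new format may want to separate the two kinds of entry. The following is a minimal sketch, assuming the llm field is a flat dict of code to mass; `split_llm_mass` is a hypothetical helper and the masses are illustrative.

```python
def split_llm_mass(llm_mass):
    """Separate singleton-leaf votes from internal-node entries marked with a trailing '*'."""
    singletons, internal = {}, {}
    for code, mass in llm_mass.items():
        if code.endswith("*"):
            internal[code[:-1]] = mass  # drop the marker, keep the parent code
        else:
            singletons[code] = mass
    return singletons, internal

singletons, internal = split_llm_mass({"1.1.1.9.4*": 0.65, "1.1.1.2.5.3": 0.20})
print(singletons)  # {'1.1.1.2.5.3': 0.2}
print(internal)    # {'1.1.1.9.4': 0.65}
```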

A.2 Per-remediation paper-trade results

Each remediation was paper-traded against 8d67b1ed after implementation. The predicted impact (third column) comes from the audit’s recommendation; the paper-traded impact (fourth column) is what the paper-trade measured.

ID	Remediation	Audit’s predicted impact	Paper-traded impact	Note
R1	Annotation-mnemonic fallback in `_resolve_to_focal_element`	“Recovers LLM evidence for 27 cols, ~20% reroute candidate reduction”	38 / 287 columns (13.2% of corpus) recover full LLM evidence; mean `llm_confidence` on recovered cohort = 0.89; 5 hr_compensation columns concentrate on `EMPDET → 1.1.1.2.5.3 (Employment Related)` from scattered low-Bel predictions	Significantly higher than audit’s 27. Recovery is concentrated in tables with rich user-vocab mnemonics (`EMPDET`, `PANEXP`, `SHIPADDR`, `BIN`).
R6	Skip Hive/Hue temp tables (`__tmp_*`) at discovery	(not in audit)	1 table dropped (`hue__tmp_ecommerce_orders`); 16 cols removed from classification, 9 of which were R1-recovery candidates → net R1 impact after R6: 29 cols	R6 supersedes R1 for those 9 cols (correct: temp tables shouldn’t classify at all). Net cohort R1 actually recovers in next run = 29.
R2b	Markdown-fence + extra-data extraction in `_parse_decision`	“Eliminates 3-5 hard errors per run”	3 / 11 errored decisions in `8d67b1ed` had the markdown-fence-with-trailing-prose shape; new `_extract_json_object` parses them cleanly (verified against captured response)	Audit estimate accurate.
R2c	Shortlist-permissive parsing (NEW — audit conflated with R2b)	(not separately specified)	8 / 11 errored decisions rejected codes that were valid in the runtime taxonomy but outside the 5-entry shortlist. R2c accepts these as `shortlist_extended` reroutes	Audit’s “11 errors” summary should split into two classes; R2b alone would only catch 3/11.
R3	Exclude opaque siblings (`col_NN`, `var_NN`, `dim_NN`, …) from reviewer context	“Prevents sibling-context poisoning”	Cohort visible in `8d67b1ed` is small (1 rerouted, 2 candidates) because filter was ON; in 40f07630 (filter off) the cohort is 13+	Paper-trade limited by which run is on hand. Validation run will need filter OFF or include opaque-name tables to size the impact.
R2a	Stability guard on cross-subtree reroutes	“Prevents the gaming_profiles.handle failure class”	Three iterations: v1 (naive: pre==llm ∧ conf>0.80) blocked 20 / 64 reroutes, including legitimate depth corrections. v2 (top-level-root differs) blocked 5 / 64 but missed sideways moves within the `1.x` namespace. v3 (neither-is-ancestor — current implementation) blocks 12 / 64 — all visibly cross-subtree, with depth corrections preserved	Audit framing assumed all “LLM+fusion agreed” reroutes are noise; in practice 15 such reroutes were within-subtree backoffs (e.g., `1.1.1.8.2 → 1.1.1.8`). The neither-is-ancestor rule cleanly separates these.
R5	Split `llm_agreement` into pre/post-review metrics	“Makes overwatch signal useful”	Purely additive; new `llm_agreement_pre_review` field reports DST-vs-LLM alignment without review reassignment confounding	Diagnostic only; no impact on classification outcomes.
R4	Raise `bel_threshold` 0.80 → 0.85-0.90	“Reduces candidate pool”	Deferred — see Correction 2. Audit direction is mechanically wrong. Re-evaluate after R1 lifts Bel	Expected outcome: post-R1, the threshold may not need adjustment; if it does, the right direction is down (0.65-0.70).
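
The neither-is-ancestor rule from the R2a row can be sketched as a structural test on dotted codes. This assumes a code's ancestors are its leading dot-separated segments; the function names are hypothetical, and any confidence condition the real guard applies is left out.

```python
def is_ancestor_or_same(a, b):
    """True when dotted code `a` equals `b` or is an ancestor of `b`."""
    a_parts, b_parts = a.split("."), b.split(".")
    return len(a_parts) <= len(b_parts) and b_parts[:len(a_parts)] == a_parts

def is_cross_subtree(old_code, new_code):
    """Neither-is-ancestor test: the reroute leaves the original branch entirely."""
    return not (is_ancestor_or_same(old_code, new_code)
                or is_ancestor_or_same(new_code, old_code))

print(is_cross_subtree("1.1.1.8.2", "1.1.1.8"))    # False: depth correction
print(is_cross_subtree("1.1.1.8.2", "1.1.1.9.4"))  # True: sideways move
print(is_cross_subtree("1.1.1.8", "1.1.18"))       # True: segments differ
```

Comparing segments rather than raw string prefixes matters for the last case, where `1.1.1` is a string prefix of `1.1.18` but not an ancestor of it.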

A.3 Predicted canvas signatures for the validation run

What to look for on the post-remediation canvas to verify each remediation landed:

Remediation	Predicted canvas signature
R1	`belief` histogram shifts right — the mode of the < 0.5 cluster moves toward 0.7-0.8 (LLM evidence now contributing). The hr_compensation table (5 cols, all currently scattered) collapses onto a single `predicted_annotation` value (`EMPDET`).
R6	Total column count drops by ~16 (the `hue__tmp_ecommerce_orders` columns). `table_name` count plot loses one bar.
R2b	`cautious_review.json`’s `errored` count drops by ~3. Bedrock-deployed runs benefit most.
R2c	`cautious_review.json`’s `errored` count drops by ~8 (combined with R2b: total errored drops to 0-1). New `shortlist_extended` counter in summary > 0.
R3	`cautious_review.json` row records show `siblings_after_filter < siblings_unfiltered` for tables containing `col_NN` columns. Reroutes whose rationale referenced sibling labels (e.g., the `col_04 → Shipping Address` case) lose that justification.
R2a	`cautious_review.json` summary shows `stability_guard_fired > 0`; the guard’s blocked reroutes show up as `decision = "keep"` with rationales prefixed `[R2a stability guard fired: ...]`. Brush by `review_decision = "keep"` AND `review_rationale LIKE '[R2a%'` in the SQL Predicate panel to count.
R5	Overwatch report’s Health Signals table gains a row; `llm_agreement_pre_review > llm_agreement` when reviewer reassigned LLM-aligned predictions.
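
The R2a brush in the table above can also be applied offline. The sketch below mirrors the SQL predicate in Python, assuming the rows are available as dicts carrying review_decision and review_rationale; the helper name and the sample rows are illustrative only.

```python
def count_guard_fires(rows):
    """Count rows kept by the R2a guard: decision is 'keep' and rationale starts with '[R2a'."""
    return sum(
        1 for row in rows
        if row.get("review_decision") == "keep"
        and (row.get("review_rationale") or "").startswith("[R2a")
    )

# Illustrative rows; the rationale text and the "reroute" value are placeholders.
rows = [
    {"review_decision": "keep", "review_rationale": "[R2a stability guard fired: ...]"},
    {"review_decision": "keep", "review_rationale": "evidence consistent"},
    {"review_decision": "reroute", "review_rationale": "[R2a ..."},
    {"review_decision": "keep", "review_rationale": None},
]
print(count_guard_fires(rows))  # 1
```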

A.4 What the paper-trade cannot validate

Cumulative interaction effects. R1 raises Bel for 38 cols, which changes which cols enter cautious review, which changes the shortlist composition for those cols, which changes whether R2c’s permissive path fires. Static paper-trade can’t model this cascade.
Real LLM behavior in cautious review. R2c assumes the LLM occasionally picks valid-but-out-of-shortlist codes; the true rate may differ once the run uses the post-R1 frame (more LLM evidence → fewer cautious-review entries → smaller cohort exposed to R2c).
Bedrock vs Anthropic-direct response shapes. R2b was smoke-tested against one captured Bedrock fence-with-prose case; other Bedrock formatting variants (mid-stream JSON, Latin-1 whitespace, multi-block responses) are unobserved in the dataset.
R3 sibling-context poisoning size on this corpus. With classify_exclude_reference_columns = true (8d67b1ed’s setting), the col_04-class cohort is suppressed at discovery; the validation run should toggle this off (or include opaque-name tables) if the goal is to measure R3’s true impact.

A.5 Expected delta on overwatch’s Health Signals table

Pre-remediation (8d67b1ed):

Signal	Configured	Actual	In Contract?
`llm_agreement`	≥ 0.9895	0.6794	❌ No
`state.failed_columns`	≤ 2	11	❌ No

Post-remediation (validation run prediction):

Signal	Configured	Predicted	In Contract?
`llm_agreement` (post-review)	≥ 0.9895	~0.85 (R1+R2c+R3+R2a all push it up)	❌ Still under, but materially closer
`llm_agreement_pre_review` (R5, NEW)	(no contract)	~0.92	—
`state.failed_columns`	≤ 2	0-1 (R2b + R2c eliminate parser/shortlist failures)	✅ Yes
`total_columns`	—	271 (was 287; R6 drops 16)	—
Cohort with empty `evidence_sources.llm`	—	~57 (was 95; R1 recovers 38)	—
`stability_guard_fired` (R2a, NEW)	—	~12	—
`shortlist_extended` (R2c, NEW)	—	~8	—

If the post-validation overwatch report shows llm_agreement still sub-0.80, the residual gap is in the 55-column “numeric-unresolved” cohort that R1 doesn’t touch. That points to a follow-up remediation — likely a frame-coverage gap where the LLM emits codes the runtime taxonomy doesn’t carry.

A.6 Configuration

Each remediation is gated by an independent flag, so a follow-up A/B run (if any signature is missing or wrong) can isolate per-remediation contribution by toggling one flag at a time.

Flag	Default	Disable for ablation
`classify.resolve_llm_annotation_mnemonic`	`true`	R1 off
`classify.exclude_temp_tables`	`true`	R6 off
`classify.cautious_review.shortlist_permissive`	`true`	R2c off
`classify.cautious_review.exclude_opaque_siblings`	`true`	R3 off
`classify.cautious_review.stability_guard_enabled`	`true`	R2a off
`classify.cautious_review.stability_guard_llm_conf`	`0.80`	R2a threshold

The R2b parser improvement is not flag-gated — it’s strictly more correct than the prior greedy regex on every input.
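
To illustrate the fence-with-trailing-prose shape, here is a minimal sketch of one way to pull the first JSON object out of such a response. It is not the actual `_extract_json_object`; the function name and sample response are hypothetical. It relies on the standard library's `json.JSONDecoder.raw_decode`, which parses a single value starting at a given index and ignores whatever follows.

```python
import json

def extract_first_json_object(text):
    """Return the first decodable JSON object in `text`, ignoring fences and trailing prose."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            obj, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = text.find("{", start + 1)  # try the next opening brace
    raise ValueError("no JSON object found")

fence = "`" * 3  # built at runtime to keep this example readable
payload = '{"decision": "keep", "note": "a } brace"}'
response = ("Here is my decision:\n" + fence + "json\n" + payload + "\n"
            + fence + "\nLet me know if {more detail} is needed.")
print(extract_first_json_object(response))
# {'decision': 'keep', 'note': 'a } brace'}
```

A greedy brace-to-brace regex would run through to the last `}` in the trailing prose here, which is the failure shape a decoder-based approach avoids.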

A.7 Test surface

Unit tests in tests/classify/test_audit_remediations.py cover R1, R2b, R2c, and R6. R2a and R3 are paper-traded against build/results/8d67b1ed/cautious_review.json rather than unit-tested because their value lives in cohort behavior (cross-subtree distribution, sibling filtering effects), not single-decision transforms. R5 is a metric addition with no decision logic to test.

PYTHONPATH=src python3 -m pytest tests/classify/test_audit_remediations.py -v
# 19 tests, all passing as of 2026-05-06.

Extend Classification Workflow

End-to-end procedure for classifying a Hive corpus that grows over time: train CatBoost on the stable subset, then extend the trained model to newly-added tables without re-running the full LLM-driven classification pipeline.

This report documents the procedure and the empirical results from a session on 2026-05-13 against the hive-poc/reference_corpus source (reference data-governance POC, 40 tables, ~620 columns), running with the Phase-3 DST frame and the LLM-emission validation + retry mechanism enabled.

Why two-phase classification

A full classify run uses LLM sweeps, multi-source DST fusion, and cautious-review on top of CatBoost training — minutes to tens of minutes per 300-column batch with non-trivial LLM cost. An extend run reuses a previous run’s CatBoost (and optionally UMAP / SVM) and applies them directly to new columns — seconds to a couple of minutes regardless of corpus size, no LLM cost.

The pattern lets data-governance teams:

Establish a stable baseline classification on the tables they already know
Onboard new tables incrementally without re-running expensive LLM sweeps
Compare new-table predictions against a known model artifact for audit and consistency

Empirically, on the corpus we measured, the extend output actually scored higher on the operator-flagged ground-truth proxy than the parent classify run (71.9% strict vs 68.1%) — the cautious-review backoff in the full pipeline turned out to be over-conservative on this corpus. The workflow below establishes both runs so you can compare them and pick the artifact that best matches your governance team’s expectations.

Prerequisites


Atelier deployment	CAI Application or local devenv with `cml.data_v1` access
Hive source	A `data_sources` row registered for the corpus (e.g. `hive-poc/reference_corpus`)
Annotations table	Deployed at `<connection>.<cfg.classify_database>.annotations` (typically `<connection>.default.annotations`) — not colocated with the data tables
Config	`config/base.conf` editable, or env-var overrides for the toggles below
LLM backend	Configured via `ANTHROPIC_API_KEY` / Bedrock credentials so the classify-phase sweep can run

The classify and extend runs are triggered from the UI’s pipeline panel or via POST /api/fsm/start and POST /api/fsm/extend respectively.

The config knobs that drive the workflow

Two HOCON settings under classify { … } in config/base.conf:

`classify.table_exclude_patterns`

Comma-separated regex patterns matched against Hive table names (re.search semantics, case-sensitive). Tables whose name matches any pattern are dropped at discover_tables time and never sampled. The same mechanism applies uniformly to the classify and extend pipelines.

Empty (default) = no filtering. Operator edits this between runs.
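
The matching semantics can be sketched as follows. This is not Atelier's discover_tables code: the helper names are hypothetical, and stripping whitespace around each pattern and skipping empty entries are assumptions about how the comma-separated string is parsed.

```python
import re

def parse_exclude_patterns(raw):
    """Split a comma-separated pattern string; whitespace stripping is an assumption."""
    return [re.compile(p.strip()) for p in raw.split(",") if p.strip()]

def filter_tables(table_names, raw_patterns):
    """Drop tables whose name matches any pattern under re.search semantics."""
    patterns = parse_exclude_patterns(raw_patterns)
    return [t for t in table_names if not any(p.search(t) for p in patterns)]

tables = ["member_registry", "member_registry_v2", "Member_Registry", "orders"]
print(filter_tables(tables, "^member_registry$, ^orders"))
# ['member_registry_v2', 'Member_Registry']
print(filter_tables(tables, ""))
# ['member_registry', 'member_registry_v2', 'Member_Registry', 'orders']
```

Because re.search matches anywhere in the name, an unanchored pattern such as `member_registry` would also drop `member_registry_v2`.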

`classify.svm.enabled`

When false (current default), the per-vocabulary SVM evidence source is skipped — the alignment LLM call doesn’t fire, no SVM is trained, and the pipeline runs with 5 evidence sources instead of 6. Toggle back to true after the recipe-driven synth training described in docs/src/architecture/... (separate workstream) replaces the LLM-mediated alignment.

Both also have env-var overrides (ATELIER_CLASSIFY_TABLE_EXCLUDE_PATTERNS, ATELIER_CLASSIFY_SVM_ENABLED) that take precedence over the HOCON defaults at load time.

Procedure

Step 1 — Identify the “stable” subset of the corpus

Decide which tables you want CatBoost to train on. The pattern is typically: “tables that have been in production long enough to have operator-validated classifications.” Newly-added tables go in the excluded set.

For the documented session, the stable subset was the 20 tables that existed in the previous classify baseline (5450b626), with 20 new tables added to Hive after that.

Identify the newly-added tables by diffing the current Hive table list against a previous run’s classifications:

python3 << 'EOF'
import json
from pathlib import Path

parent_run = '5450b626'  # or whichever prior run defines your baseline
new_run = 'f931f469'     # a fresh run that classified the post-addition full source

parent_tables = sorted({c['table_name'] for c in
    json.loads(Path(f'/home/cdsw/build/results/{parent_run}/classifications.json').read_text())
    if c.get('table_name')})
new_tables = sorted({c['table_name'] for c in
    json.loads(Path(f'/home/cdsw/build/results/{new_run}/classifications.json').read_text())
    if c.get('table_name')})

added = sorted(set(new_tables) - set(parent_tables))
print(f'Added tables: {len(added)}')
for t in added:
    print(f'  + {t}')
EOF

For each new table, build a fully-anchored regex pattern (^name$) so a future table named e.g. member_registry_v2 doesn’t accidentally get caught by a pattern targeting member_registry.
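
A small helper can generate the anchored list from the table names. This is a sketch: `anchored_patterns` is a hypothetical name, and passing each name through `re.escape` is an added precaution for names containing regex metacharacters, not something the procedure requires.

```python
import re

def anchored_patterns(table_names):
    """Build one fully anchored, escaped pattern per table, joined as a config value."""
    return ", ".join(f"^{re.escape(name)}$" for name in sorted(table_names))

value = anchored_patterns(["order_shipments", "member_registry"])
print(value)  # ^member_registry$, ^order_shipments$

# The anchored pattern matches the exact name but not a longer variant.
pattern = re.compile(value.split(", ")[0])
print(bool(pattern.search("member_registry")))     # True
print(bool(pattern.search("member_registry_v2")))  # False
```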

Step 2 — Filter the new tables before the classify run

Edit config/base.conf to populate classify.table_exclude_patterns with the comma-separated regex list:

classify {
  …
  table_exclude_patterns = "^app_developer_records$, ^compliance_documents$, ^component_catalog$, ^contact_supplemental$, ^content_profiles$, ^credential_vault$, ^device_identity_log$, ^engagement_signals$, ^headcount_ledger$, ^health_location_profiles$, ^member_registry$, ^order_shipments$, ^payment_events$, ^program_index$, ^return_billing$, ^screening_records$, ^security_research_assets$, ^staff_registry$, ^system_audit_records$, ^workforce_data$"
  table_exclude_patterns = ${?ATELIER_CLASSIFY_TABLE_EXCLUDE_PATTERNS}
  …
}

Or as a single-line env override in .env.cai.enc:

ATELIER_CLASSIFY_TABLE_EXCLUDE_PATTERNS="^app_developer_records$, …, ^workforce_data$"

Verify the config loads correctly:

python3 -c "
import sys; sys.path.insert(0, 'src')
from atelier.config import load_config
cfg = load_config()
print(f'{len(cfg.classify_table_exclude_pattern_list)} patterns:')
for p in cfg.classify_table_exclude_pattern_list:
    print(f'  {p}')
"

Step 3 — Restart the Application to pick up the new config

In the CAI Workspace UI, Application → Restart. The pipeline loads HOCON values fresh on each load_config() call, but the in-memory Python module cache for _HOCON_MAP is initialized once; a restart guarantees both layers see the new config.

Step 4 — Run the parent classify against the stable subset

Trigger from the UI’s pipeline panel, or:

curl -s -X POST "$ATELIER_BASE_URL/api/fsm/start" \
  -H 'content-type: application/json' \
  -d '{"source_id": "hive-poc/reference_corpus"}'

Expected:

discover_tables enumerates all tables in Hive, drops the excluded set, returns the stable subset
The pipeline runs end-to-end on the filtered set: LLM sweep, DST fusion, fit-to-LLM CatBoost training, cautious review, SHAP/SAGE if enabled
Run dir lands at build/results/<run_id>/ with the full artifact set (CatBoost CBM, classes JSON, UMAP, parquet, classifications, evaluation_report, etc.)
Run kind: classify. Artifact set: same id as run_id.

Note the run_id of this baseline — it becomes the artifact_set_id for the extend run.

What you should see in `validation_retries.json`

{
  "total_retries": 1-5,  // small number is healthy
  "events": [
    {
      "column_names": ["..."],
      "invalid_codes": ["A_FD", "1.2.1.3.3", ...],
      "retry_idx": 0
    },
    …
  ]
}

Each event records the column or columns for which the LLM emitted a code that is not in the deployed default.annotations taxonomy. The retry mechanism re-prompts the LLM, naming the specific invalid code, and the LLM (almost always at retry_idx: 0) emits a valid code on the second attempt. After-exhaustion blanking (residual invalid emissions getting category_code = None) is rare; if it happens, those columns are simply dropped from the CatBoost training data.

Empty events: [] means the LLM emitted only in-taxonomy codes throughout the sweep — the goal state.
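
For a quick health check, the file can be summarized with a short script. The sketch below assumes the shape shown above; the sample dict stands in for the parsed file, its column names are illustrative, and `summarize_retries` is a hypothetical helper.

```python
from collections import Counter

def summarize_retries(report):
    """Tally invalid codes and the highest retry index seen across events."""
    events = report.get("events", [])
    invalid = Counter(code for e in events for code in e.get("invalid_codes", []))
    max_retry_idx = max((e.get("retry_idx", 0) for e in events), default=None)
    return {"events": len(events), "invalid_codes": dict(invalid),
            "max_retry_idx": max_retry_idx}

# Stand-in for the parsed contents of validation_retries.json.
sample = {
    "total_retries": 2,
    "events": [
        {"column_names": ["amount"], "invalid_codes": ["A_FD"], "retry_idx": 0},
        {"column_names": ["case_ref"], "invalid_codes": ["1.2.1.3.3"], "retry_idx": 0},
    ],
}
print(summarize_retries(sample))
# {'events': 2, 'invalid_codes': {'A_FD': 1, '1.2.1.3.3': 1}, 'max_retry_idx': 0}
```

A max_retry_idx above 0, or an invalid code that recurs across many events, would be the first things to look into.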

Step 5 — Clear the filter before the extend run

Edit config/base.conf:

classify {
  …
  table_exclude_patterns = ""
  …
}

Or unset the env var. Restart the Application again.

Step 6 — Run extend against the artifact from Step 4

curl -s -X POST "$ATELIER_BASE_URL/api/fsm/extend" \
  -H 'content-type: application/json' \
  -d '{
        "source_id": "hive-poc/reference_corpus",
        "artifact_set_id": "<parent_run_id>",
        "parent_dataset_id": "<parent_run_id>"
      }'

Or trigger from the UI’s Extend panel against the artifact set matching the parent’s run_id.

Expected:

discover_tables enumerates all 40 tables (no filtering)
sample_table_metadata samples each
The parent run’s CatBoost model runs predict_proba on every column
No LLM sweep, no DST fusion, no cautious review — straight CatBoost top-1
Run dir at build/results/<extend_run_id>/ with parquet, classifications, evaluation_report
Run kind: extend. References the parent via artifact_set_id and parent_dataset_id

A real cost in elapsed time

For a 40-table / ~620-column corpus, the extend run completes in roughly 2–3 minutes (dominated by Hive metadata sampling). Compare to the parent classify which takes 10–30 minutes depending on LLM batch latency.

Caveats observed during the session

The annotations database is NOT colocated with the data tables

The deployment has data tables at hive-poc.reference_corpus but the canonical taxonomy at hive-poc.default.annotations. The full classify pipeline handles this via cfg.classify_database (defaults to "default") and an optional vocab_uri on the data_sources row. The extend pipeline must do the same — early in the session a regression was found where extend was querying <data_db>.annotations (which doesn’t exist), silently catching the exception, and producing output with predicted_annotation empty and predicted_label echoing predicted_code. The fix at src/atelier/classify/extend_pipeline.py reads from cfg.classify_database for annotations, independent of the data-tables database resolved from source_id.

`validation_retries.json` is the audit trail

Any LLM emission outside the deployed taxonomy is captured in build/results/<run_id>/validation_retries.json with the column name and the invalid code. Empty events list = clean sweep. The audit lives alongside the run artifacts so post-mortem doesn’t require pod-log access.

Cautious-review backoff can be over-conservative

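On the documented corpus, the parent classify’s cautious-review mechanism backed off 15 columns from terminal predictions to parent codes that the extend run subsequently recovered as correct terminals. The threshold knob (classify.cautious_review.bel_threshold, default 0.80) is the lever. Because the threshold gates entry to cautious review (a column is reviewed when bel < bel_threshold), lowering it (for example toward 0.65-0.70) shrinks the candidate pool and should reduce the rate of backoffs; raising it to 0.85 or 0.90 enlarges the pool instead.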

Re-running classify with the filter restored is cheap regression-protection

If the extend output looks worse than expected on the OLD tables, the parent’s artifacts are unchanged and re-deploying is one config edit + restart. Both runs land in build/results/ and are independently auditable.

Results from the 2026-05-13 session

Five classify+extend runs were measured against the same operator-curated review spreadsheet (Atelier-Results-vs-Prompt-solution-522d89ae.xlsx), which encodes one operator’s expected classifications for the 20 OLD tables. Four metrics matter:

Strict (canonical-validated) — predicted_annotation matches the spreadsheet’s expected tag, validated against default.annotations so spreadsheet hallucinations don’t count as Atelier misses
Stem-collapsed — same as strict but ignoring A_/C_/S_ prefix differences within a code’s annotation family
Binary sensitive-vs-public — predicted sensitive vs non-sensitive matches spreadsheet’s Data Sensitivity field
Operator-curated recall — 15 columns the operator explicitly flagged as “Atelier got this wrong”; recall counts how many now resolve correctly

Run	Notes	Strict	Stem	Binary	Op-curated
522d89ae	Original baseline (pre-Phase-3, pre-validation)	69.1%	44.6%	84.2%	0/15
5450b626	Pre-Phase-3 retrain (filtered to 20 OLD tables)	66.7%	42.8%	83.2%	3/15
1d6e3fae	Phase 3 only (full DST frame, no validation+retry)	67.4%	42.1%	83.9%	3/15
2ac4d0a6	Phase 3 + validation+retry classify	68.1%	43.2%	84.6%	4/15
0146134f	Phase 3 + validation+retry extend (from 2ac4d0a6)	71.9%	47.0%	84.6%	7/15

Three distinct improvements

Validation+retry catches the parent classify up. 2ac4d0a6 over 1d6e3fae: +0.7pp strict, +1 op-curated. Driven by the 3 LLM hallucinations the new mechanism caught and corrected in real-time (A_FD on monetary columns, 1.2.1.3.3 on case_ref).
Extend’s CatBoost-only path materially outperforms the parent’s full pipeline. 0146134f over 2ac4d0a6: +3.8pp strict, +3 op-curated. Surprise: extend lacks DST fusion and cautious review, yet scores higher — the parent’s cautious-review backoff was over-conservative on this corpus.
Op-curated recall climbs across the whole arc. 0/15 → 7/15 over the session’s work, without ground-truth supervision or model changes — just architectural correctness improvements (Phase 3, validation+retry, correct annotations database in extend).

Column-level diff (0146134f vs 2ac4d0a6 on the OLD 20 tables)

Of 300 shared OLD-table predictions:
  unchanged:                  263 (88%)
  leaf → parent (regression):   3 (1%)
  parent → leaf (refinement):  15 (5%)
  sibling-within-subtree:      14 (5%)
  cross-subtree:                5 (2%)

Net specificity move: +12 columns more specific in extend than parent
Confidence delta on unchanged: median +0.177, mean +0.196
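
The tallies above are internally consistent, which a few lines of arithmetic confirm. The counts are copied from the diff; only the dict keys are reworded.

```python
counts = {
    "unchanged": 263,
    "leaf -> parent (regression)": 3,
    "parent -> leaf (refinement)": 15,
    "sibling-within-subtree": 14,
    "cross-subtree": 5,
}
total = sum(counts.values())
shares = {k: round(100 * v / total) for k, v in counts.items()}
net_specificity = (counts["parent -> leaf (refinement)"]
                   - counts["leaf -> parent (regression)"])

print(total, net_specificity)  # 300 12
print(list(shares.values()))   # [88, 1, 5, 5, 2]
```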

The 15 parent-to-leaf flips include exactly the failure modes documented in earlier xlsx reviews:

shipping_manifests/tracking_id: A_TRID parent → TRANSID leaf
legal_cases/party_ref: C_PID parent → NAMEFULL leaf
gaming_profiles/linked_account: ACCOUNT_ID → SOCIAL_ID
insurance_claims/alt_contact: A_PHN → OTHPHNUM
hr_compensation/comp_value: INCOME → SALARY
shipping_manifests/col_32: COUNTRY → SHIPCNTY

Three column-classes that still miss

Of the 8 operator-curated columns 0146134f still misses, all fall into pre-documented failure modes:

TRANSID over-application on permit columns — permit_ref, rec_33 wanting TRAVPERM/WORKPERM, still getting TRANSID
System-vs-Person URL — page_ref, media_ref wanting PRSNURL/INPPHOTO, still getting SYSURL
Network identifier domain-adaptation gap — network_addr wanting DEVMACADDR, still getting IPADDR — the SVM has not been trained on synthetic examples that separate MAC-shape from IPv4-shape

These are the targets for the recipe-driven dense-synth SVM retraining workstream (parked pending implementation) — the generators need to teach the SVM patterns the pretrained models cannot read.

Reproducibility checklist

For others to reproduce this work end-to-end:

Clone the Atelier repo at the commit landed during the 2026-05-13 session (Phase 3 + validation+retry merged).
Configure a Hive connection pointing at a corpus that matches the shape (data tables in one database, annotations table in default.annotations, ~10-50 tables).
Identify a stable subset and an “added” subset of the corpus.
Follow Steps 1–6 above.
Compare:
- build/results/<parent_run>/evaluation_report.json vs build/results/<extend_run>/evaluation_report.json for headline metrics
- build/results/<parent_run>/classifications.json vs build/results/<extend_run>/classifications.json for column-level diffs on the overlap
- build/results/<parent_run>/validation_retries.json for the LLM-hallucination audit trail
If you have an operator-curated review spreadsheet (per docs/src/operations/embeddings-reviewer-guide.md), apply the scoring methodology in this report.

The session’s artifacts live at:

build/results/5450b626/   # pre-Phase-3 baseline
build/results/1d6e3fae/   # Phase 3 only
build/results/2ac4d0a6/   # Phase 3 + validation+retry classify
build/results/0146134f/   # Phase 3 + validation+retry extend

Spreadsheet: Atelier-Results-vs-Prompt-solution-522d89ae.xlsx

Backfill script (used to populate predicted_annotation on extend runs produced before the colocation fix landed): scripts/backfill_extend_annotations.py

What’s not in scope for this report

Recipe-driven SVM retraining to address the 8 remaining operator-curated misses (parked; needs synth-generator densification around the documented domain-adaptation gaps)
Cautious-review threshold tuning to align parent classify predictions more closely with extend (A/B candidate)
Multi-reviewer ground truth to replace the single-operator spreadsheet as the evaluation substrate (Tier 0 of the broader accuracy-improvement roadmap)
Subjective Logic / conformal prediction for the no-ground-truth deployment scenario (architectural discussion captured in separate design notes)

Each is tracked separately; the workflow documented here is the current operationally-ready path.

Scenario Overview

Atelier uses behave (BDD) to capture platform decisions as executable specifications. Every scenario answers a concrete question: Does the config load? Can the runtime start? Does the classification pipeline converge?

These aren’t just tests. They’re the design context that connects architectural choices to the deployment realities of Cloudera AI.

Active Domains

155 scenarios across 35 features, 4 domains.

Infrastructure (infra)

Health checks and configuration lifecycle for the services Atelier depends on.

Feature	Tag	Tier	Scenarios	What it validates
Config lifecycle	`@config`	0	3	HOCON load, CLI override precedence, materialize + validate
PostgreSQL health	`@postgres`	1	2	Connection with pgvector extension, migration state
Qdrant health	`@qdrant`	1	1	Vector store HTTP health endpoint
PGlite process	`@pglite`	0	2	Node.js script existence, npm dependency declarations
Preflight	`@preflight`	0	3	Structured deny/warn checks, GPU detection

Deployment

CAI deployment modalities and the runtime profile that catches failures before pushing.

Feature	Tag	Tier	Scenarios	What it validates
Runtime profile	`@runtime-profile`	0	6	Import chain, script executability, config resolution, migration parsing
AMP lifecycle	`@amp`	0 + cai	5	`.project-metadata.yaml` structure, task patterns, install + start
Application modality	`@application`	0 + 1	3	HOST binding logic, full local stack startup
Studio modality	`@studio`	0	2	`IS_COMPOSABLE` root directory routing
Embeddings integration	`@embeddings`	0	4	npm dependency, page component, React Router, preparation script
Naming conventions	`@naming`	0	2	User-facing surfaces say “Embeddings”, no Apache Atlas confusion

Gateway

HTTP gateway endpoints, gRPC bridge, and live service integration.

Feature	Tag	Tier	Scenarios	What it validates
API endpoints	`@api`	0 + 1	8	REST endpoint contracts, response shapes
API testclient	`@testclient`	0	7	FastAPI TestClient integration (no running server)
Status endpoint	`@status`	0 + 1	4	Aggregated health report, config state
Pipeline integration	`@pipeline`	1	2	Classification pipeline via gateway
SPA routes	`@spa`	0	1	Client-side routing fallback

Agent

Classification pipeline, DST evidence fusion, ML classifiers, and agent orchestration.

Feature	Tag	Scenarios	What it validates
Classification pipeline	`@gpu`	28	DST belief, Dempster combination, features, patterns (+ Luhn/IPv4/date/currency validation), name matching, pipeline E2E, Monte Carlo sampling
Bootstrap convergence	`@bootstrap`	11	LLM sweep, ML validation, targeted revisit, convergence criteria, ontology-aligned SVM
Agent convergence loop	`@gpu`	6	6-tool agent loop, conflict reports, convergence, mock client
Agent smoke test	`@agent`	6	Agent metadata, tool definitions, state formatting
LLM backends	`@backend`	8	Backend factory, Anthropic/Bedrock/Cerebras/OpenAI clients
ML classifiers	`@ml`	4	CatBoost + SVM training, inference, virtual ensemble UQ
ML E2E	`@ml-e2e`	2	Full synth → train → classify → evaluate cycle
Belief path	`@belief-path`	3	Hierarchical navigation, cautious classification
SAGE importance	`@sage`	1	Permutation-based feature importance
SHAP explanations	`@shap`	2	TreeSHAP + PermutationSHAP attribution
Synth generation	`@synth`	2	Synthetic data + reference-label generation
Synth framework	`@synth-framework`	2	Generator registry, coverage reporting
Meta-tagging	`@meta-tagging`	2	META_TO_ICE mappings, coverage
Experimentation	`@experimentation`	3	Discount tuning, comparative evaluation
Real data	`@real-data`	3	Production annotation validation (requires build/data/)

By Tier

Tier	Requires	Scenarios	Pass locally
0	Python only	~120	Yes
1	devenv stack	~15	Yes (with `devenv up`)
cai	Live CAI session	~5	Skipped (documentation-only)

Additional tags: @slow (~17 scenarios requiring extended runtime), @gpu (GPU detection/acceleration scenarios — run on CPU too, just slower).

Why BDD for a Deployment Platform?

CAI deployment has four modalities — Project, Application, AMP, and Studio — each with different constraints on networking, filesystem layout, and process lifecycle. Traditional unit tests verify module behavior in isolation. BDD scenarios verify that the system hangs together across these modalities.

Consider the Application modality: when CDSW_APP_PORT is set, the startup script must bind to 127.0.0.1 because CAI’s reverse proxy handles external traffic. Bind to 0.0.0.0 instead and you bypass the proxy’s auth layer. This isn’t a bug in any single module — it’s a deployment contract that only a scenario can express clearly:

Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
  Given CDSW_APP_PORT is set to "8090"
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "127.0.0.1"

The scenario is the spec. A colleague reading it knows exactly what the constraint is and why it matters, and can verify that it passes by running just behave.

Test Infrastructure

Framework

Atelier uses behave for BDD and pytest for unit tests. The BDD scenarios live in features/ and are organized by domain.

Tier System

Scenarios are tagged by the infrastructure they require. The ATELIER_BDD_TIER environment variable controls which tiers run.

Tier	Tag	Requires	Purpose
0	`@tier-0`	Python only	Config, imports, classification pipeline, agent loop, ML classifiers
1	`@tier-1`	devenv stack	PostgreSQL, Qdrant, gRPC, full gateway startup
cai	`@tier-cai`	CAI session	Live deployment validation — always skipped locally

Additional tags:

@slow — scenarios requiring extended runtime (pipeline E2E, ML training)
@gpu — GPU acceleration scenarios (run on CPU too, just slower)

Tier 0 runs everywhere: laptops, CI, CAI sessions. No services, no network calls. This is where the runtime profile lives — the scenarios that catch deployment failures before you push.

Tier 1 requires devenv up to be running (PostgreSQL on :5533, Qdrant on :6334). These verify that services are healthy and that the application can actually connect to its data stores.

Tier CAI exists as executable documentation. The step definitions are stubs — they express what should happen in a live CAI session without automating it. When debugging a deployment failure, these scenarios are a checklist.
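
The tier rules can be summarized as a small decision function. This is a sketch under stated assumptions, not the logic in features/environment.py: it treats the ATELIER_BDD_TIER value as the highest tier to run, treats untagged scenarios as tier 0, and uses a hypothetical function name.

```python
def should_run(tags, max_tier):
    """Decide whether a scenario runs, given its tags and the highest enabled tier."""
    if "tier-cai" in tags:
        return False  # documented as always skipped locally
    if "tier-1" in tags:
        return max_tier >= 1  # needs the devenv stack
    return True  # tier-0, or untagged by assumption

print(should_run({"tier-0", "config"}, 0))    # True
print(should_run({"tier-1", "postgres"}, 0))  # False
print(should_run({"tier-1", "postgres"}, 1))  # True
print(should_run({"tier-cai", "amp"}, 1))     # False
```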

Running Tests

# Full BDD suite including gateway checks (preferred)
just behave

# Tier-0 only (no services needed)
just bdd

# Tier-0 + tier-1 (requires devenv up)
just bdd-full

# Runtime profile specifically
just bdd-runtime

# Single domain
ATELIER_BDD_TIER=0 uv run behave features/agent/

# Single feature file
uv run behave features/agent/classification.feature

# By tag
ATELIER_BDD_TIER=0 uv run behave features/ -t @bootstrap

# Verbose (show all steps, not just failures)
just behave --no-capture

Feature Organization

features/
├── environment.py                          # Tier filtering, stack health, cleanup hooks
├── steps/__init__.py                       # Central re-exports (behave's discovery point)
├── infra/                                  # Domain: infrastructure & services
│   ├── step_defs/
│   │   ├── helpers.py
│   │   ├── config_steps.py
│   │   ├── health_steps.py
│   │   └── preflight_steps.py
│   ├── config_lifecycle.feature            # 3 scenarios
│   ├── health_postgres.feature             # 2 scenarios
│   ├── health_qdrant.feature               # 1 scenario
│   ├── health_pglite.feature               # 2 scenarios
│   └── preflight.feature                   # 3 scenarios
├── deployment/                             # Domain: CAI deployment workflows
│   ├── step_defs/
│   │   ├── helpers.py
│   │   ├── runtime_steps.py
│   │   ├── amp_steps.py
│   │   └── naming_steps.py
│   ├── runtime_profile.feature             # 6 scenarios
│   ├── amp_lifecycle.feature               # 5 scenarios
│   ├── application.feature                 # 3 scenarios
│   ├── studio.feature                      # 2 scenarios
│   ├── embeddings.feature                  # 4 scenarios
│   └── naming_audit.feature                # 2 scenarios
├── gateway/                                # Domain: HTTP/gRPC gateway
│   ├── step_defs/
│   │   ├── status_steps.py
│   │   ├── http_steps.py
│   │   ├── endpoint_steps.py
│   │   ├── pipeline_steps.py
│   │   └── testclient_steps.py
│   ├── api_endpoints.feature               # 8 scenarios
│   ├── api_testclient.feature              # 7 scenarios
│   ├── status_endpoint.feature             # 4 scenarios
│   ├── pipeline_integration.feature        # 2 scenarios
│   └── spa_routes.feature                  # placeholder
└── agent/                                  # Domain: classification & agents
    ├── step_defs/
    │   ├── agent_steps.py
    │   ├── classification_steps.py
    │   ├── bootstrap_steps.py
    │   ├── backend_steps.py
    │   ├── synth_steps.py
    │   ├── ml_steps.py
    │   ├── ml_e2e_steps.py
    │   ├── sage_steps.py
    │   ├── shap_steps.py
    │   ├── real_data_steps.py
    │   ├── belief_path_steps.py
    │   ├── synth_framework_steps.py
    │   ├── meta_tagging_steps.py
    │   ├── experimentation_steps.py
    │   ├── agent_loop_steps.py
    │   └── monte_carlo_steps.py
    ├── classification.feature              # 19 scenarios (DST, pipeline, MC sampling)
    ├── bootstrap.feature                   # 10 scenarios
    ├── agent_loop.feature                  # 6 scenarios
    ├── agent_smoke.feature                 # 6 scenarios
    ├── backend.feature                     # 8 scenarios
    ├── ml_classifiers.feature              # 4 scenarios
    ├── ml_e2e.feature                      # 2 scenarios
    ├── synth.feature                       # 2 scenarios
    ├── synth_framework.feature             # 2 scenarios
    ├── sage.feature                        # 1 scenario
    ├── shap.feature                        # 2 scenarios
    ├── belief_path.feature                 # 3 scenarios
    ├── meta_tagging.feature                # 2 scenarios
    ├── experimentation.feature             # 3 scenarios
    └── real_data.feature                   # 3 scenarios

Step Discovery

Behave only discovers step definitions from features/steps/. Domain step definitions live in <domain>/step_defs/ directories and are re-exported through features/steps/__init__.py:

from features.infra.step_defs.config_steps import *
from features.infra.step_defs.health_steps import *
from features.infra.step_defs.preflight_steps import *
from features.deployment.step_defs.runtime_steps import *
from features.deployment.step_defs.amp_steps import *
from features.deployment.step_defs.naming_steps import *
from features.agent.step_defs.agent_steps import *
from features.agent.step_defs.classification_steps import *
from features.agent.step_defs.bootstrap_steps import *
from features.agent.step_defs.backend_steps import *
from features.agent.step_defs.synth_steps import *
from features.agent.step_defs.ml_steps import *
from features.agent.step_defs.ml_e2e_steps import *
from features.agent.step_defs.sage_steps import *
from features.agent.step_defs.shap_steps import *
from features.agent.step_defs.real_data_steps import *
from features.agent.step_defs.belief_path_steps import *
from features.agent.step_defs.synth_framework_steps import *
from features.agent.step_defs.meta_tagging_steps import *
from features.agent.step_defs.experimentation_steps import *
from features.gateway.step_defs.status_steps import *
from features.gateway.step_defs.http_steps import *
from features.gateway.step_defs.endpoint_steps import *
from features.gateway.step_defs.pipeline_steps import *
from features.agent.step_defs.agent_loop_steps import *
from features.agent.step_defs.monte_carlo_steps import *
from features.gateway.step_defs.testclient_steps import *

Two conventions protect against behave’s automatic discovery behavior:

Use step_defs/, not steps/ — Behave walks the feature tree and exec’s any .py file it finds in a directory named steps/. This bypasses Python’s import system, breaking relative imports and module context. Using step_defs/ avoids this entirely.
Never name a features/ subdirectory after a stdlib module — When behave imports features.platform, Python also registers it as platform in sys.modules, shadowing the stdlib. This breaks anything that lazily imports platform (including pydantic). The infra/ domain was originally named platform/ until this caused a cascade of subtle failures.

Config-Driven BDD

Infrastructure steps load configuration from HOCON via atelier.config.load_config() rather than hardcoding values. This means BDD scenarios validate the same config path used in production:

from atelier.config import load_config
cfg = load_config()
_wait_for("PostgreSQL", lambda: _check_pg(cfg.db_url))

Stack Health Gate

Tier-1 scenarios share a one-time stack health check in environment.py. Before the first tier-1 scenario runs, the framework verifies PostgreSQL and Qdrant are reachable (with a 60-second retry window). If either service is down, all tier-1 scenarios fail fast with a clear message rather than producing confusing connection errors.
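A retry window of this kind is usually a polling loop with a deadline. The sketch below shows one plausible shape for such a helper; it is not the project's _wait_for, and the parameter names (interval, sleep, clock) are assumptions added so the loop can be exercised without real waiting.

```python
import time


def wait_for(name, check, timeout=60.0, interval=1.0,
             sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or the timeout elapses."""
    deadline = clock() + timeout
    last_error = None
    while True:
        try:
            if check():
                return
        except Exception as exc:
            # Connection errors are expected while a service is still starting.
            last_error = exc
        if clock() >= deadline:
            raise RuntimeError(
                f"{name} not reachable after {timeout:.0f}s: {last_error}"
            )
        sleep(interval)
```

Raising a single RuntimeError that names the service is what lets dependent scenarios fail fast with a readable message instead of a stack of connection errors.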

Cleanup

after_scenario in environment.py removes temporary files registered via context._temp_files. This handles config materialization artifacts and other test-created files.
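For illustration, a cleanup hook along these lines might look as follows. The hook signature after_scenario(context, scenario) is behave's; tolerating already-deleted files and resetting the list afterwards are assumptions of this sketch, not confirmed behavior of Atelier's environment.py.

```python
import os


def after_scenario(context, scenario):
    """Delete files that steps registered on context._temp_files."""
    for path in getattr(context, "_temp_files", []):
        try:
            os.remove(path)
        except FileNotFoundError:
            pass  # the scenario may already have removed the file itself
    # Assumption: start each scenario with an empty registry.
    context._temp_files = []
```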

Unit Tests

Alongside BDD, tests/ contains pytest unit tests for isolated module behavior:

just test                    # Run all pytest tests
uv run pytest tests/ -x     # Stop on first failure

BDD and pytest serve complementary roles: pytest validates that individual functions behave correctly; BDD validates that the system’s deployment contracts hold.

Deployment Modalities

Cloudera AI offers four ways to run code. Each has different constraints on networking, filesystem layout, process lifecycle, and dependency management. Atelier’s BDD scenarios encode these constraints as executable specifications.

Project

Every CAI deployment starts as a Project — a Git-backed workspace cloned into /home/cdsw. The Project modality is implicit: it provides the filesystem layout, environment variables, and Python runtime that all other modalities build on.

No dedicated feature file. Project constraints are tested indirectly through every other deployment scenario.

AMP (Applied ML Prototype)

An AMP is a one-click provisioning workflow defined in .project-metadata.yaml. It runs a sequence of tasks: typically create_job and run_job to install dependencies, then start_application to launch the service.

Why BDD captures this well: AMP metadata is YAML that CAI interprets at deploy time. A malformed task definition doesn’t fail until someone clicks “Deploy” in the CAI UI. Our tier-0 scenarios catch structural problems immediately.

What the scenarios validate

AMP metadata structure (amp_lifecycle.feature):

Scenario: AMP metadata file is valid
  Given the file ".project-metadata.yaml" exists
  When I parse the AMP metadata
  Then it has a "name" field
  And it has a "runtimes" section
  And it has a "tasks" section

Task ordering pattern — CAI requires create_job before run_job for the same entity label. Getting this wrong means the install job never runs:

Scenario: AMP tasks follow create_job/run_job pattern
  Given the AMP metadata is loaded
  Then a "create_job" task with entity_label "install_deps" exists
  And a "run_job" task with entity_label "install_deps" exists
  And a "start_application" task exists
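The ordering rule can also be checked over the parsed task list itself. This is a sketch under assumptions: it takes the "tasks" section as a list of dicts with "type" and "entity_label" keys, and the function name check_task_order is hypothetical rather than one of the project's step helpers.

```python
def check_task_order(tasks):
    """Return a list of problems found in an AMP task list."""
    problems = []
    created = set()
    for index, task in enumerate(tasks):
        kind = task.get("type")
        label = task.get("entity_label")
        if kind == "create_job":
            created.add(label)
        elif kind == "run_job" and label not in created:
            # run_job must come after the create_job with the same label.
            problems.append(
                f"task {index}: run_job '{label}' precedes its create_job"
            )
    if not any(task.get("type") == "start_application" for task in tasks):
        problems.append("no start_application task")
    return problems
```

An empty list means the create_job/run_job/start_application pattern holds for the given metadata.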

Install script validity — scripts/install_deps.py runs in a bare Python environment without uv or devenv. A syntax error here means the entire deployment fails:

Scenario: Install script is valid Python
  When I compile "scripts/install_deps.py" with py_compile
  Then no SyntaxError is raised

Tier-CAI scenarios document what a successful AMP deploy looks like. These are skipped locally but serve as a regression checklist when debugging deployment failures:

@tier-cai
Scenario: AMP install job completes successfully
  Given I am in a CAI project session
  When I run the install dependencies job
  Then the job exits with code 0
  And "atelier" is importable in system Python
  And "node --version" succeeds
  And the directory "ui/dist" exists

Application

An Application is a long-running web service. CAI assigns a port via CDSW_APP_PORT and routes subdomain traffic through a reverse proxy that handles authentication.

The key constraint: When CDSW_APP_PORT is set, the service must bind to 127.0.0.1, not 0.0.0.0. The reverse proxy connects over localhost; binding to all interfaces bypasses CAI’s auth layer.

For local development (no CDSW_APP_PORT), binding to 0.0.0.0 is correct — it lets you access the service from a browser.

Scenario: start-app.sh binds to 127.0.0.1 when CDSW_APP_PORT is set
  Given CDSW_APP_PORT is set to "8090"
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "127.0.0.1"

Scenario: start-app.sh binds to 0.0.0.0 for local dev
  Given CDSW_APP_PORT is not set
  When I parse bin/start-app.sh for the HOST variable
  Then HOST is "0.0.0.0"
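The rule the two scenarios encode is a single conditional. The Python sketch below restates it for clarity only; the real logic lives in the shell script bin/start-app.sh, and both the function name resolve_bind and the default port of 8090 for local development are assumptions.

```python
def resolve_bind(env, default_port=8090):
    """Return (host, port) following the CDSW_APP_PORT rule."""
    cai_port = env.get("CDSW_APP_PORT")
    if cai_port:
        # Behind CAI's reverse proxy: listen on loopback only.
        return "127.0.0.1", int(cai_port)
    # Local development: listen on all interfaces.
    return "0.0.0.0", default_port
```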

The tier-1 scenario verifies the full stack actually starts and serves traffic:

@tier-1
Scenario: Full application stack starts locally
  When I run bin/start-app.sh in the background
  Then the HTTP gateway responds on port 8090 within 30 seconds
  And the gRPC server responds on port 50051

Studio (future)

A Studio is a pre-built Docker image where IS_COMPOSABLE=true. Instead of being the root application, Atelier runs as an embedded service within a larger container.

The key constraint: When IS_COMPOSABLE is set, the install script must use /home/cdsw/atelier as the root directory (the project subdirectory) instead of /home/cdsw (the container root). Getting this wrong means dependencies install into the wrong location and imports fail at startup.

Scenario: install_deps.py handles IS_COMPOSABLE root path
  When I set IS_COMPOSABLE to "true"
  And I parse scripts/install_deps.py for root_dir
  Then root_dir is "/home/cdsw/atelier"

Scenario: install_deps.py uses default root without IS_COMPOSABLE
  When IS_COMPOSABLE is not set
  And I parse scripts/install_deps.py for root_dir
  Then root_dir is "/home/cdsw"

Studio support is currently speculative — these scenarios document the expected behavior so the contract is established before implementation begins.

Runtime Profile

The CAI Runtime Profile is a set of tier-0 scenarios that validate deployment readiness without requiring a live CAI session. Run it before every push to catch the class of errors that only manifest when CAI tries to start the application.

just bdd-runtime

Why This Exists

CAI deployment failures are expensive to debug. The install job runs in a container with a 30-minute timeout. If it fails, the only feedback is a log dump. If it succeeds but the application crashes at startup, the only feedback is an “Application failed to start” banner with a link to logs that may or may not contain the root cause.

The runtime profile catches failures that would otherwise require a deploy-debug-redeploy cycle:

Check	Failure mode it prevents
Core package importable	Missing `__init__.py`, circular imports, broken package structure
Entry points importable	New dependency not declared in `pyproject.toml`
Proto stubs importable	Forgot to run `just proto` after editing `.proto`
Scripts exist and are executable	Missing `chmod +x`, file not committed
HOCON config resolves	Undefined substitution variable, syntax error in `.conf`
Migrations parseable	Malformed `-- migrate:up` block, missing SQL terminator

The Scenarios

Import chain validation

The most common CAI deployment failure is an import error. A module works in devenv because all dev dependencies are installed, but fails in CAI because the install script only installs production dependencies.

Scenario: Core package is importable
  When I import "atelier"
  Then no ImportError is raised
  And atelier.__version__ is defined

Scenario: All entry points are importable
  When I import "atelier.server"
  And I import "atelier.gateway"
  And I import "atelier.config"
  And I import "atelier.db.bootstrap"
  Then no ImportError is raised

Scenario: Proto stubs are generated and importable
  When I import "atelier.proto.atelier_pb2"
  And I import "atelier.proto.atelier_pb2_grpc"
  Then no ImportError is raised

These scenarios exercise the full import graph. If atelier.gateway imports fastapi which imports pydantic which imports annotated_types, and annotated_types isn’t in the dependency chain — this catches it.
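A step behind these scenarios can be as small as a loop over importlib.import_module. The following is a minimal sketch with a hypothetical helper name (import_failures); it reports failures instead of raising so that one broken module does not hide the others.

```python
import importlib


def import_failures(module_names):
    """Try to import each module; return {name: error message} for failures."""
    failures = {}
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError as exc:  # also covers ModuleNotFoundError
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures
```

Called with the entry points listed in the scenarios (for example "atelier.server" and "atelier.gateway"), an empty dict means the whole import graph resolved in the current environment.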

Script executability

CAI runs scripts via #!/usr/bin/env python3 or #!/usr/bin/env bash. If the shebang is wrong or the execute bit isn’t set, the deploy fails with a cryptic “Permission denied” error.

Scenario: Required scripts exist and are executable
  Then the file "scripts/install_deps.py" exists
  And the file "scripts/startup_app.py" exists
  And the file "scripts/install_node.sh" is executable
  And the file "scripts/install_qdrant.sh" is executable
  And the file "bin/start-app.sh" is executable
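Existence and the execute bit can both be checked with the standard library. This sketch assumes a POSIX filesystem; the function name script_problems and the split into two path lists are illustrative choices, not the project's step implementation.

```python
import os


def script_problems(must_exist, must_be_executable):
    """Return human-readable problems for missing or non-executable scripts."""
    problems = []
    for path in must_exist:
        if not os.path.isfile(path):
            problems.append(f"{path}: missing")
    for path in must_be_executable:
        if not os.path.isfile(path):
            problems.append(f"{path}: missing")
        elif not os.access(path, os.X_OK):
            # The file exists but the execute bit is not set for this user.
            problems.append(f"{path}: not executable")
    return problems
```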

Configuration resolution

HOCON configs use ${?VAR} substitution for environment variables. A typo in a variable name or an unresolvable reference won’t fail until load_config() is called at startup. The runtime profile forces resolution at test time:

Scenario: HOCON config resolves without errors
  When I load the config with no overrides
  Then no exception is raised
  And the config has grpc_port > 0
  And the config has gateway_port > 0

Migration parsing

Atelier uses a dbmate-compatible migration runner (atelier.db.bootstrap) that parses -- migrate:up / -- migrate:down blocks from SQL files. If a migration is missing its UP block, the bootstrap silently skips it — which means the schema diverges from what the code expects.

Scenario: Database migrations are parseable
  Given migration files exist in "db/migrations/"
  When I parse each migration for UP/DOWN blocks
  Then every migration has a valid UP block
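To make the check concrete, here is a small parser for the -- migrate:up / -- migrate:down markers. It is a sketch of the idea, not the code in atelier.db.bootstrap; in particular, treating an UP block as valid when it contains any non-whitespace SQL is an assumption.

```python
import re

# A marker line, optionally followed by options on the same line.
_MARKER = re.compile(r"^--\s*migrate:(up|down)\b.*$", re.MULTILINE)


def parse_migration(sql):
    """Split a dbmate-style migration into {'up': ..., 'down': ...} blocks."""
    blocks = {}
    matches = list(_MARKER.finditer(sql))
    for i, match in enumerate(matches):
        # A block runs from the end of its marker line to the next marker.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(sql)
        blocks[match.group(1)] = sql[match.end():end].strip()
    return blocks


def has_valid_up(sql):
    """Assumed validity rule: the UP block exists and is not empty."""
    return bool(parse_migration(sql).get("up"))
```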

When to Extend the Profile

Add a new runtime profile scenario whenever you:

Add a new Python entry point or importable module
Add a new script that CAI executes directly
Add a new HOCON config key that downstream code depends on
Add a new migration file

The rule of thumb: if it can break a CAI deploy and you can verify it without services running, it belongs in the runtime profile.

Sprint Summary: 2026-05-06 to 2026-05-20

This appendix records the engineering work completed during the two-week sprint ending 2026-05-20. The sprint covered 27 commits on feat/dst-late-interaction-cosine across three major work streams: (1) training-time Normalized Hierarchical SVM with the Structured Shared Frobenius Norm, (2) ColBERT late-interaction cosine integration with Qdrant, and (3) CatBoost/SVM calibration under the Dempster-Shafer evidence-independence framework. A DST numeric sensitivity study and BDD scenario expansion provided the empirical grounding.

1. Training-Time NHSVM (Choi et al. 2015)

Motivation

The prior NHSVM implementation was a post-hoc approximation: a flat SVM trained with standard Frobenius norm regularization (no hierarchy awareness), then nhsvm_reweight() nudged the probability distribution at inference time using tree-distance penalties. This cannot recover what was never learned. The SVM’s decision boundaries are flat; the reweighting is a band-aid. On an asymmetric taxonomy (deep sensitive subtree vs. shallow operational subtrees), the flat SVM systematically under-penalizes cross-subtree probability flow, allowing shallow catch-all nodes to absorb classifications that belong in the deep subtree.

The Structured Shared Frobenius Norm

Choi et al. (2015, arXiv:1508.02479) show that for single-label hierarchical classification, proper NHSVM reduces to a standard multi-class SVM with a modified feature map. The key insight is the Structured Shared Frobenius Norm:

$$ \|W\|_G^2 = \sum_n \frac{\|u_n\|^2}{\alpha_n} $$

where u_n is the per-node weight component and alpha_n is the path-normalized budget for node n. This regularizer explicitly incorporates the label structure G: it promotes models to utilize shared information along tree paths, penalizing complexity proportionally to each node’s position in the hierarchy.

The Kronecker product feature expansion (Eq. 5) implements this norm without a custom solver. For sample x with label y, the expanded feature map is:

$$ \phi(x, y) = \Lambda(y) \otimes x $$

where Lambda(y)_n = sqrt(alpha_n) for nodes n on the root-to-y path, and zero elsewhere. Standard L2 regularization on the expanded space equals the Structured Shared Frobenius Norm on the original space. The geometry is exact, not approximate.
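A toy instance makes the block structure of the expanded feature map visible. The sketch below uses plain Python lists; the four-node tree, the alpha values, and the function name expand are made up for illustration and are not taken from HierarchicalFeatureExpander.

```python
import math


def expand(x, path, alpha, node_order):
    """phi(x, y): Kronecker product of Lambda(y) with x.

    Lambda(y)_n is sqrt(alpha_n) for nodes on the root-to-y path, else 0.
    """
    lam = [math.sqrt(alpha[n]) if n in path else 0.0 for n in node_order]
    # One block of len(x) features per node, in node_order.
    return [weight * xi for weight in lam for xi in x]


# Hypothetical tree: root -> pii -> email, root -> ops.
node_order = ["root", "pii", "email", "ops"]
# Illustrative budgets: each root-to-leaf path sums to 1, child >= parent.
alpha = {"root": 0.2, "pii": 0.3, "email": 0.5, "ops": 0.8}
x = [1.0, 2.0]
phi = expand(x, {"root", "pii", "email"}, alpha, node_order)
```

Only the blocks for root, pii, and email are non-zero; the ops block stays zero. Because alpha sums to 1 along the path, the squared norm of phi equals the squared norm of x in this example.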

Directional Constraint (Eq. 7)

The alpha budget is computed via a linear program with the directional constraint: alpha_child >= alpha_parent for every parent-child pair. This forces more of the information budget toward leaves, preventing degenerate solutions on unbalanced trees where shallow internal nodes would otherwise absorb the entire alpha budget.

The LP formulation:

maximize   min_n alpha_n
subject to sum(alpha_n for n in path(root, l)) = 1   for every leaf l
           alpha_child >= alpha_parent                for every parent-child
           alpha_n >= 0

Solver: scipy.optimize.linprog(method='highs'). On the project taxonomy: 296 variables, 220 equalities, 582 directional inequalities. Solves in under one second with zero violations and path sums exact to machine precision (deviation < 1e-15). The unconstrained closed-form (Lemma 2: alpha_n = 1/D_n - 1/D_parent) is preserved as a private fallback.
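The LP can be reproduced on a toy tree with scipy.optimize.linprog. This is a sketch under assumptions: the five-node tree is invented, and the max-min objective is linearized with an auxiliary variable t (maximize t subject to t <= alpha_n), which is a standard reformulation and may differ from how the project builds its matrices.

```python
import numpy as np
from scipy.optimize import linprog

# Toy tree (hypothetical): root -> a (leaf); root -> b; b -> c, d (leaves).
parent = {"a": "root", "b": "root", "c": "b", "d": "b"}
nodes = ["root", "a", "b", "c", "d"]
idx = {n: i for i, n in enumerate(nodes)}
leaves = [n for n in nodes if n not in parent.values()]
n_vars = len(nodes) + 1
t = len(nodes)  # index of the auxiliary variable t = min_n alpha_n


def path_to_root(node):
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


# Objective: maximize t, written as minimize -t.
c = np.zeros(n_vars)
c[t] = -1.0

# Equalities: alpha sums to 1 along every root-to-leaf path.
A_eq = np.zeros((len(leaves), n_vars))
for row, leaf in enumerate(leaves):
    for n in path_to_root(leaf):
        A_eq[row, idx[n]] = 1.0
b_eq = np.ones(len(leaves))

# Inequalities, all in the form A_ub @ z <= 0.
rows = []
for n in nodes:  # t - alpha_n <= 0
    r = np.zeros(n_vars)
    r[t], r[idx[n]] = 1.0, -1.0
    rows.append(r)
for child, par in parent.items():  # alpha_parent - alpha_child <= 0
    r = np.zeros(n_vars)
    r[idx[par]], r[idx[child]] = 1.0, -1.0
    rows.append(r)
A_ub = np.vstack(rows)
b_ub = np.zeros(len(rows))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n_vars, method="highs")
alpha = dict(zip(nodes, res.x[:len(nodes)]))
```

On this tree the deepest path (root, b, c) has three nodes that must share a budget of 1, so the smallest alpha cannot exceed 1/3; the solver reaches that bound.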

Implementation

The training pipeline proceeds:

TF-IDF (char 3-6 + word 1-2 n-grams, 50K max features)
TruncatedSVD to 200 components (configurable via classify.svm.nhsvm_svd_components). Necessary because full TF-IDF times Kronecker expansion would produce 14.75M features and a 34.8 GB coefficient matrix. At 200 dimensions the expanded space is 59K features and the model fits in approximately 250 MB.
Kronecker expansion via HierarchicalFeatureExpander: training-time expansion populates only the label’s path blocks (sparse, ~path_len x d non-zeros per row); inference-time expansion populates all blocks (dense across nodes, sparse across features).
LinearSVC with CalibratedClassifierCV(ensemble=False) for Platt-scaled probabilities.

The model serializes as a dict bundle ({feature_union, svd, expander, classifier, classes}) with automatic detection on load. Legacy flat .pkl files load unchanged, preserving backward compatibility. A _nhsvm suffix on the per-vocabulary cache key prevents serving a flat model as hierarchical or vice versa.

When the pipeline detects a training-time NHSVM model (via the _hierarchical attribute), it skips all post-hoc reweighting infrastructure (distance matrix precomputation, nhsvm_to_mass routing) and sends SVM probabilities directly through svm_to_mass. The hierarchy is already in the probabilities.

SVM Training Consolidation

In the same sprint, SVM training was consolidated from two paths (Path A: ICE alignment-based, Path B: enrichment-based) into a single enrichment-required path. Qdrant is the source of truth for enrichment payloads; a JSON export under build/ serves as the offline/CI fallback.

The synthetic corpus generator (synth_registry.py) now covers all taxonomy nodes (leaves and internal) via a three-layer generator architecture:

ICE-matched hand-coded (highest priority): enrichment metadata is matched against 31 inference patterns to select the best ICE generator. A mnemonic-to-dot-code bridge maps category abbreviations (e.g., EMAIL) to enrichment payload keys (e.g., 1.1.1.9.3.1).
Template generators (medium): prototype values from enrichment payloads with mild perturbation (numeric jitter, character substitution).
Inferred generators (lowest): fallback via pattern matching on category description and common names.

Coverage: every taxonomy node receives a generator. The leaf-only assumption was corrected at six sites across three files; every node is a first-class tagging target.

2. ColBERT Late-Interaction Cosine via Qdrant

Architecture

The sprint delivered the full P1-P3 stack for multi-vector cosine evidence:

P1 (storage foundation): Qdrant collection schema with named multi-vector fields. Each annotation point stores ColBERT token-level embeddings (128-d after the linear projection) alongside the structured enrichment payload (prototype values, value patterns, name hints, anti-examples, parent path).

P2 (enrichment pipeline): LLM-mediated annotation enrichment generates a six-field structured payload per taxonomy node. Each payload is verified by a deterministic suite of six checks before being written to Qdrant:

patterns_compile – every regex pattern must be valid Python re syntax.
prototype_values_match_patterns – at least 50% of prototypes must match a declared regex (relaxed from 100% this sprint to handle diverse free-text categories like marketplace names).
anti_example_targets_exist – every value in the confusable_tag field (the anti-example pointer) must exist in the taxonomy.
parent_path_consistent – the generated parent path must match the taxonomy hierarchy exactly.
name_hints_non_empty – at least one usable name hint.
no_contradiction_with_anti_examples – no prototype value may appear in anti-examples (self-contradiction rejection).
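Three of the six checks are simple enough to sketch directly. The functions below borrow the check names from the list above, but their signatures and details are assumptions: in particular, matching with re.search (rather than a full match) and failing on an empty prototype list are choices made for this example.

```python
import re


def patterns_compile(patterns):
    """Every pattern must be valid Python re syntax."""
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


def prototype_values_match_patterns(prototypes, patterns, min_fraction=0.5):
    """At least min_fraction of prototypes must match some declared pattern."""
    if not prototypes:
        return False
    compiled = [re.compile(p) for p in patterns]
    hits = sum(
        1 for value in prototypes if any(rx.search(value) for rx in compiled)
    )
    return hits / len(prototypes) >= min_fraction


def no_contradiction_with_anti_examples(prototypes, anti_examples):
    """No prototype value may also be listed as an anti-example."""
    return not set(prototypes) & set(anti_examples)
```

With the threshold at 0.5, a payload whose prototypes are half structured values and half free text still passes, which is the relaxation described for categories such as marketplace names.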

Prompts come in two variants (leaf and parent framing) because the principle that drives the architecture – every node is a first-class tagging target – means parent and leaf nodes describe different kinds of column. A leaf prompt asks for maximum-specificity signals; a parent prompt asks for family-level signals with the children listed so the model knows what specializations would not route to the parent.

P3 (late-interaction integration): The bridge (maxsim_bridge.py) encodes entity text through the same ColBERT encoder, queries Qdrant with native MaxSim, normalizes scores by query token count to recover mean per-token similarity, and converts scores to DST mass functions via maxsim_to_mass.
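The score normalization in P3 is easy to see on a tiny example. This sketch computes MaxSim in pure Python over already-normalized token vectors; in the pipeline the scoring is done natively by Qdrant, so the functions here (maxsim, mean_per_token_similarity) are illustrative stand-ins, not maxsim_bridge.py.

```python
def maxsim(query_tokens, doc_tokens):
    """Sum over query tokens of the best dot product against any doc token."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def mean_per_token_similarity(query_tokens, doc_tokens):
    """Divide the raw MaxSim score by the number of query tokens."""
    return maxsim(query_tokens, doc_tokens) / len(query_tokens)


# Two query tokens and two annotation tokens, all unit-length 2-d vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
annotation = [[1.0, 0.0], [0.6, 0.8]]
score = mean_per_token_similarity(query, annotation)
```

The first query token finds an exact match (1.0) and the second finds a partial one (0.8), so the raw MaxSim is 1.8 and the normalized score is 0.9. Without the division, longer queries would score higher simply because they have more tokens.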

Token-Level Discrimination

The motivating failure modes of single-vector cosine resolve through token-level alignment:

Anonymized columns (comm_val, period_val, addr_ref) – column-name tokens contribute little MaxSim, but sample-value tokens still align to annotation prototype-value tokens. Weak tokens contribute near-zero MaxSim without polluting strong matches.
Long-tail distinguishing values – a single distinctive sample value’s tokens claim their own MaxSim against annotation prototypes, no longer averaged out by a single dense vector.
Sibling discrimination – token-level alignment discriminates between semantically adjacent annotations (e.g., “credit card number” vs. “bank account number”) through fine-grained matching that single-vector cosine collapses.
Parent-pull – parent-path tokens in the annotation text provide hierarchical context for the mass aggregation layer.

Channel-Decomposed Dempster Combination (P3.6)

The mass function produced by late-interaction cosine separates into two channels:

Positive channel: MaxSim scores on annotation points allocate mass to focal elements (leaf singletons and internal-node descendant sets). Haenni-Hartmann reliability shaping (alpha-bounded allocation) ensures the source never over-commits. Margin-aware allocation places top-1 mass proportional to the gap between the first and second candidates; the residual mass is split across the remaining candidates by softmax.
Negative channel: Anti-example evidence on a code c allocates mass to Theta \ D(c), where D(c) is the descendant leaf set. This is structurally correct for hierarchical exclusion: negating an internal node removes its entire subtree, not just the node itself.

The two channels combine via channel-decomposed Dempster’s rule. When channels conflict on the same node (high positive and high negative simultaneously), conflict K materializes as a diagnostic signal rather than being silently cancelled. The hierarchical aggregation layer walks from the top-1 leaf up to the most-specific ancestor with at least 50% descendant-mass concentration, promoting mass to internal-node focal elements when subtree-level signal is what the evidence supports.
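The interaction of the two channels can be traced with a generic implementation of Dempster's rule. The sketch below is not the project's channel-decomposed combiner: the three-leaf frame, the mass values, and the function name dempster are assumptions, and only the negative-channel weight of 0.30 echoes a figure from this document.

```python
from itertools import product


def dempster(m1, m2):
    """Combine two mass functions keyed by frozenset focal elements.

    Returns (combined mass function, conflict K).
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that would land on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    normalized = {s: w / (1.0 - conflict) for s, w in combined.items()}
    return normalized, conflict


theta = frozenset({"email", "phone", "order_id"})
contact_leaves = frozenset({"email", "phone"})  # D(contact), hypothetical

# Positive channel: evidence for the leaf "email".
positive = {frozenset({"email"}): 0.6, theta: 0.4}
# Negative channel: anti-example on "contact" puts mass on theta \ D(contact).
negative = {theta - contact_leaves: 0.3, theta: 0.7}

fused, K = dempster(positive, negative)
```

Here the positive channel supports a leaf inside the subtree that the negative channel excludes, so the product 0.6 x 0.3 = 0.18 surfaces as conflict K instead of disappearing.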

3. CatBoost, SVM, and Dempster-Shafer Calibration

The Non-Distinctness Problem

Dempster’s rule assumes the evidence sources being combined are distinct and conditionally independent (Shafer 1976, Ch. 3-4; Smets & Kennes 1994). When sources share provenance – one source’s labels are deterministically derived from another’s – Dempster’s rule double-counts their agreement, inflating confidence beyond what the evidence warrants.

Atelier’s pipeline has six evidence sources with varying degrees of independence:

Source	Discount	Independence status
MaxSim (ColBERT late-interaction)	0.20	Weakly non-distinct (ColBERT encoder is deterministic; per-user-code reference vectors share enrichment-LLM upstream)
NHSVM	0.22	Weakly non-distinct (sentence-transformer subsumption alignment shares enrichment-LLM upstream — same provenance as ColBERT plus an additional alignment step, hence the slightly higher discount)
Pattern	0.25	Independent (deterministic regex matching)
Name match	0.30–0.70	Independent (deterministic string matching)
CatBoost	0.55	Strongly non-distinct (fit-to-LLM: per-column shared label provenance with LLM)
LLM	0.15	Primary source

The discount schedule follows Shafer’s reliability discount (alpha = 1 - discount applied to source mass) with adjustments per Denoeux (2008): when a source rides on labels deterministically derived from another source, an undiscounted derivative source mathematically swallows the only genuinely independent signal.
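Shafer's discount is a one-line transformation of a mass function. The sketch below applies it to a toy mass function; the function name discount_mass and the example masses are illustrative, while 0.55 is the CatBoost discount from the table above.

```python
def discount_mass(mass, theta, discount):
    """Shafer discounting with alpha = 1 - discount.

    Every focal element is scaled by alpha, and the removed mass is moved
    onto the full frame theta (ignorance).
    """
    alpha = 1.0 - discount
    out = {s: alpha * w for s, w in mass.items() if s != theta}
    out[theta] = alpha * mass.get(theta, 0.0) + discount
    return out


theta = frozenset({"email", "phone", "order_id"})
raw = {frozenset({"email"}): 0.8, theta: 0.2}
discounted = discount_mass(raw, theta, discount=0.55)
```

With a discount of 0.55, the 0.8 committed to "email" shrinks to 0.36 and the mass on theta grows from 0.2 to 0.64, so a heavily discounted source can shift the fused result only modestly.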

Pending work — manually curated annotation specifications in Ægir are the path to fully eliminating the shared LLM-upstream provenance on ColBERT and NHSVM. When per-user-code annotation payloads are author-curated rather than LLM-generated during enrichment, ColBERT’s reference vectors and the subsumption alignment both become structurally independent of any runtime LLM, and their discounts can drop toward the calibrations a truly distinct source carries. Until that curation is in place, the 0.20 / 0.22 calibration above is the right under-confidence price to pay.

CatBoost: Fit-to-LLM and Adaptive Discount

CatBoost operates in fit_to_llm mode (default): it trains on (embedding_text, llm_code) pairs from the current run’s LLM sweep. The model is the explainability surface over LLM labels – SHAP and SAGE attribute to a model that actually agrees with the LLM, which is the transparent “why this code” story presented to the operator. But this makes CatBoost strongly non-distinct with the LLM under Denoeux’s framework: per-column shared label provenance.

The adaptive discount addresses this through virtual ensemble variance. CatBoost’s virtual ensemble provides uncertainty quantification per code; the discount formula is:

discount = min(max_discount, base_discount + avg_var * variance_scale)

Defaults: base 0.55, variance_scale 1.6, max 0.75, fallback 0.55. High variance (uncertain predictions) produces a larger discount, routing more mass to Theta (ignorance) rather than inflating a weakly-supported prediction. This is the step-size control in the iterative-methods framing: the derivative source’s contribution is damped proportionally to its own uncertainty.
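As a worked instance of the formula with the stated defaults, consider the sketch below. The function name adaptive_discount is hypothetical, and using the fallback when no variance estimate is available (avg_var is None) is an assumption about what "fallback" means here.

```python
def adaptive_discount(avg_var, base_discount=0.55, variance_scale=1.6,
                      max_discount=0.75, fallback=0.55):
    """Discount grows with virtual-ensemble variance, capped at max_discount."""
    if avg_var is None:
        # Assumption: no variance estimate available, use the fallback value.
        return fallback
    return min(max_discount, base_discount + avg_var * variance_scale)
```

An average variance of 0.05 gives 0.55 + 0.05 * 1.6 = 0.63. An average variance of 0.5 would give 1.35 before the cap, so the result is clamped to 0.75.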

SVM: Subsumption Alignment and Weak Non-Distinctness

The SVM’s discount (0.22) reflects a qualitatively different non-distinctness regime. The SVM trains on a synthetic corpus generated from the bundled ICE ontology, then translates predictions into the user taxonomy via sentence-transformer cosine subsumption alignment. The alignment is a per-vocabulary mapping table computed via BERT cosine similarity between ICE concept signatures and enriched annotation payloads – structurally independent of the runtime LLM.

The weak non-distinctness comes from the shared enrichment-LLM upstream: the enrichment payloads that anchor the subsumption alignment were themselves generated by an LLM (though a different call, different prompt, different temperature than the runtime classification LLM). This is the same structural dependency shared by the late-interaction cosine source, which justifies SVM’s discount (0.22) sitting near cosine’s (0.20) rather than near CatBoost’s (0.55).

With training-time NHSVM, the SVM’s probabilities already incorporate hierarchy, so the pipeline routes them through svm_to_mass directly. The post-hoc nhsvm_to_mass reweighting is preserved as a legacy fallback for flat-trained SVMs loaded in hierarchical mode.

The DST evidence-independence architecture frames the bootstrap loop as fixed-point iteration on a belief-assignment vector B over columns: B_{n+1} = T(B_n). Each component maps onto a numerical-method primitive (Banach 1922; Saad 2003):

Component	Primitive
Bootstrap loop	Fixed-point iteration on B
LLM sweep	Stochastic operator (Robbins-Monro framing)
ML validation (CatBoost + SVM)	Deterministic linearization
DST fusion	Combiner producing fused state
Targeted revisit	Local smoothing (multigrid)
Pl - Bel gap	A posteriori error estimate per column
Conflict K	Nonlinear residual diagnostic
Reliability discount	Damping / step-size control
Hierarchical cosine mass	Coarse-grid correction (multigrid)

The unified residual norm combines four components: mean(gap) / gap_threshold, frac_unclear / clarity_target, mean(K) / k_threshold, and independent-tier disagreement fraction. Residual below 1.0 means converged. The contraction factor rho = ||r_{n+1}|| / ||r_n|| is the headline diagnostic: rho < 1 is contractive, rho -> 1 is stalled, rho > 1 is diverging.
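The contraction diagnostic can be computed from a sequence of residual norms. The sketch below takes the norms as given, since how the four components are aggregated into one norm is not specified here; the function names and the stall_band of 0.05 used to decide when rho counts as close to 1 are assumptions.

```python
def contraction_factors(residual_norms):
    """rho_n = ||r_{n+1}|| / ||r_n|| for consecutive residual norms."""
    pairs = zip(residual_norms, residual_norms[1:])
    return [nxt / cur for cur, nxt in pairs if cur > 0]


def describe(rho, stall_band=0.05):
    """Classify a contraction factor; stall_band is an illustrative tolerance."""
    if rho > 1.0:
        return "diverging"
    if rho >= 1.0 - stall_band:
        return "stalled"
    return "contractive"
```

For residual norms 2.0, 1.0, 0.5 across three bootstrap iterations, both factors are 0.5 (contractive), and the last residual is already below the convergence level of 1.0.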

4. DST Sensitivity Study

A numeric sensitivity study (P3.12-P3.13) swept 2,549 synthetic cells across 10 invariants on an 11-node taxonomy (7 leaves, 4 internal). Zero mathematical violations were found. Key findings:

Channel conflict K is bounded. At the production negative-channel weight beta = 0.30, conflict K caps at approximately 0.24. The K threshold logs never fire under normal operating conditions; the Yager fallback path is effectively dead code under Dempster fusion.

The _significant_subtree concentration threshold is a structural cliff. The hard 0.50 threshold for promoting mass to an internal-node focal element produces a discontinuity of Delta = 0.203 in parent mass as sibling probability moves from 0.65 to 0.70. This is a plausible driver of the parent-instead-of-leaf error cluster (22-25% of error budget in evaluation).

Internal-node top-1 switch is the largest discontinuity. The transition from leaf-dominant to internal-node-dominant top-1 prediction produces Delta_mass = 0.57 at the crossover point. This is a high-volatility regime where late-interaction positive-weight calibration is critical.

Anti-example negative channel is a tie-breaker, not a primary driver. At beta = 0.30, the negative channel’s effect on parent mass is approximately Delta = 0.0015 under full negative evidence. The channel requires positive-channel support to produce meaningful rank changes.

Leaf dominance is preserved. Across all swept parameter ranges, the top-1 leaf’s mass never falls below the parent’s mass in realistic operating regimes. Parent focal-element mass is a disjunctive signal (contributing to plausibility, not belief) rather than a competing prediction.
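The disjunctive role of parent mass follows directly from the definitions of belief and plausibility. A minimal sketch, with an assumed three-leaf frame and made-up masses, where the parent focal element is the set of its leaf descendants:

```python
def belief(m, a):
    """Bel(A): total mass of focal elements contained in A."""
    return sum(w for fe, w in m.items() if fe <= a)


def plausibility(m, a):
    """Pl(A): total mass of focal elements that intersect A."""
    return sum(w for fe, w in m.items() if fe & a)


theta = frozenset({"email", "phone", "name"})
contact = frozenset({"email", "phone"})  # hypothetical parent node

m = {
    frozenset({"email"}): 0.50,  # leaf singleton
    frozenset({"phone"}): 0.10,  # sibling leaf
    contact: 0.25,               # parent focal element
    theta: 0.15,                 # ignorance
}

email = frozenset({"email"})
bel_email = belief(m, email)       # only the singleton counts
pl_email = plausibility(m, email)  # singleton + parent + theta
```

The parent's 0.25 widens the interval for each child (here `email` gets roughly [0.50, 0.90]) without adding to either child's belief, which is why it does not compete with the leaf for the top-1 slot.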

5. BDD Scenario Expansion

The sprint added hierarchical anti-subtree BDD scenarios (P3.9-P3.11) testing the channel-decomposed Dempster combination on an abstract taxonomy fixture:

Anti-example on internal node allocates mass to the descendant complement $ \Theta \setminus D(n) $, correctly removing the entire subtree rather than just the node.
Anti-example on leaf preserves singleton complement semantics (regression guard).
Channel conflict K surfaces contradiction when both channels fire strongly on the same node, materializing K as a diagnostic rather than silently cancelling.
Internal-node tag is a first-class prediction target with mass landing directly on the node’s descendant-set focal element.
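The first two scenarios can be illustrated with a small sketch. It assumes the frame consists of the taxonomy's leaves and that D(n) is the set of leaves under node n; the tree, the helper names, and the weight are hypothetical and do not reproduce the BDD fixture.

```python
def descendant_leaves(tree, node):
    """Leaves under `node`; a leaf is its own descendant set."""
    children = tree.get(node, [])
    if not children:
        return frozenset({node})
    leaves = frozenset()
    for child in children:
        leaves |= descendant_leaves(tree, child)
    return leaves


def anti_example_mass(tree, root, node, weight):
    """Mass function for 'not node': weight on Theta \\ D(node), rest on Theta."""
    theta = descendant_leaves(tree, root)
    complement = theta - descendant_leaves(tree, node)
    if not complement:  # rejecting the root carries no information
        return {theta: 1.0}
    return {complement: weight, theta: 1.0 - weight}


# Hypothetical abstract taxonomy: parent -> children.
tree = {
    "root": ["contact", "identity"],
    "contact": ["email", "phone"],
    "identity": ["name"],
}

on_internal = anti_example_mass(tree, "root", "contact", 0.3)
on_leaf = anti_example_mass(tree, "root", "email", 0.3)
```

An anti-example on `contact` puts its mass on `{name}`, excluding both `email` and `phone`, while an anti-example on the leaf `email` keeps the ordinary singleton complement `{phone, name}`.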

Additional DST boundary-condition scenarios (P3.10) test operator-observable failure modes: uniform evidence, vacuous sources, and single-source dominance. The generic-vs-specific-same-depth scenario (P3.11) validates that sibling discrimination at equal depth is structurally sound.

6. Operational Improvements

Cautious review disabled (empirically validated as harmful). Run ce4f3777 against 920 reference columns showed a reroute miss rate of 76.1%, a backoff miss rate of 78.8%, and a net accuracy loss of 13.6 percentage points vs. LLM-only. The cascade runs as follows: a second LLM call on high-conflict evidence returns degraded evidence, which produces high K and low belief, which triggers mass review, which in turn causes mass damage. Cautious review is now disabled by default, with bel_threshold = 0.0 (unreachable) as a belt-and-suspenders guard.
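Why a threshold of 0.0 is unreachable can be seen in a small sketch. It assumes the review gate uses a strict comparison `bel < bel_threshold` alongside a separate enable flag; both the function and the flag name are hypothetical.

```python
def should_review(bel, *, enabled=False, bel_threshold=0.0):
    """Gate for cautious review (hypothetical shape).

    Belief is never negative, so with bel_threshold = 0.0 the strict
    comparison cannot be true even if `enabled` is switched on by mistake.
    """
    return enabled and bel < bel_threshold


off_by_default = should_review(0.0)
guarded = should_review(0.0, enabled=True)
legacy = should_review(0.2, enabled=True, bel_threshold=0.5)
```

The two conditions are redundant on purpose: re-enabling the feature requires changing both the flag and the threshold.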

Enrichment verifier relaxation. The prototype_values_match_patterns check was relaxed from a 100% to a 50% match threshold. Categories with diverse free-text values (marketplace names, descriptive labels) legitimately produce prototypes that do not fit a single regex family. The prior strict threshold caused false rejections and forced manual bypass.
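The effect of the relaxed threshold can be sketched with a fraction-based check. The function name, the use of full-string regex matching, and the sample values are assumptions for illustration, not the verifier's actual code.

```python
import re


def match_fraction_ok(values, patterns, min_fraction=0.5):
    """True when at least `min_fraction` of values fully match any pattern."""
    if not values:
        return False
    compiled = [re.compile(p) for p in patterns]
    matched = sum(
        1 for v in values if any(rx.fullmatch(v) for rx in compiled)
    )
    return matched / len(values) >= min_fraction


# Free-text marketplace names: only half fit a single-word pattern.
prototypes = ["Amazon", "eBay", "Etsy Marketplace", "Walmart.com"]
patterns = [r"[A-Za-z]+"]

relaxed = match_fraction_ok(prototypes, patterns, min_fraction=0.5)
strict = match_fraction_ok(prototypes, patterns, min_fraction=1.0)
```

Under the strict setting this category would be rejected even though every prototype is a legitimate value.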

Enrichment prompt feedback key fix. The retry prompt read verifier_feedback.get("failed_checks"), but the verifier report writes "details". As a result, retry prompts carried no diagnostic information; the LLM was asked to fix failures it could not see.
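The mismatch is easy to reproduce in isolation. In the sketch below, only the two key names come from the text; the report layout (a list of strings under "details"), the function name, and the prompt wording are hypothetical.

```python
def build_retry_prompt(base_prompt, verifier_feedback):
    """Append verifier failures to the retry prompt (hypothetical helper)."""
    # Read the key the verifier actually writes.
    details = verifier_feedback.get("details") or []
    if not details:
        return base_prompt
    bullet_lines = "\n".join(f"- {d}" for d in details)
    return (
        f"{base_prompt}\n\n"
        f"The previous attempt failed these checks:\n{bullet_lines}"
    )


report = {
    "passed": False,
    "details": ["prototype values do not match the declared patterns"],
}

# The old lookup silently returns None because the key does not exist.
old_lookup = report.get("failed_checks")
prompt = build_retry_prompt("Enrich the annotation for this category.", report)
```

Because `dict.get` returns `None` for a missing key rather than raising, the original mismatch produced no error, only an uninformative retry prompt.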

Bootstrap-environment and curate-agent-mediated skills. Two Claude Agent SDK skills were added to .claude/commands/: a unified enrichment + curation + SVM skill (6-phase back-pressure rubric, resume-safe persistence) and a targeted per-table curation skill.

Late-interaction bridge self-supplies embedder. The ColBERT encoder is now initialized by the bridge itself rather than requiring the caller to pass one, fixing a CAI venv import ordering issue.

Commit Log

Hash	Summary
`a505953`	R7-R10 audit remediations + bundled R1-R6 + UI / config
`6010e94`	Cite canonical CCO IRIs alongside shorthand labels
`baafa5f`	SOTAB v2 coverage strategy + Aegir handoff
`70ec5b5`	P1 storage foundation for late-interaction cosine
`6716935`	P2 LLM-mediated annotation enrichment pipeline
`b5e97ea`	P3 late-interaction cosine integration (default off)
`8a9e771`	P3.5 default-on + loud-fallback for late-interaction cosine
`142b91e`	P3.6 channel-decomposed positive/negative Dempster combination
`8faf242`	Academic-grade DST Reborn brief
`c324fbe`	P3.7 SHAP per-decision attribution surface for late-interaction cosine
`ed57fd1`	P3.8 hierarchical integrity – internal-node tags as first-class
`28e7273`	P3.9 hierarchical anti-subtree carve-out – abstract taxonomy fixture
`519a1c9`	P3.10 DST boundary-condition scenarios
`a5652db`	P3.11 generic-vs-specific-same-depth scenario
`77b41d8`	P3.12 DST numeric sensitivity study + findings
`2fb7377`	P3.13 hierarchical-aggregation interaction battery
`f155e89`	P7 subsumption alignment + P5 frontier cleanup + P4 enrichment infra
`3d6696f`	Stage A DST sensitivity visibility instrumentation + Stage B script
`1fee0be`	top1_margin disjoint-FE traversal – Stage A regression
`929e29e`	Late-interaction bridge self-supplies embedder + CAI venv fix
`7a1e4e7`	Bootstrap-environment + curate-agent-mediated skills
`1df1383`	Training-time NHSVM via Structured Shared Frobenius Norm

References

Choi, Chung, and Hewitt. 2015. “Normalized Hierarchical Multi-label SVM.” arXiv:1508.02479.
Denoeux, Thierry. 2008. “Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence.” Artificial Intelligence 172(2-3): 234-264.
Haenni, Rolf and Stephan Hartmann. 2006. “Modeling partially reliable information sources.” Studia Logica 82(1): 103-133.
Khoo, Omar, and Steedman. 2006. “An Information Retrieval approach to short text classification.” EMNLP 2006.
Saad, Yousef. 2003. Iterative Methods for Sparse Linear Systems. 2nd ed. SIAM.
Shafer, Glenn. 1976. A Mathematical Theory of Evidence. Princeton University Press.
Smets, Philippe and Robert Kennes. 1994. “The Transferable Belief Model.” Artificial Intelligence 66(2): 191-234.
