
Introduction

Atelier is an agentic classification workbench for Cloudera AI. It classifies column metadata using six independent evidence sources fused via Dempster-Shafer Theory (DST), producing belief intervals instead of point estimates. An LLM-in-the-loop convergence agent identifies disagreements between sources and orchestrates targeted reclassification until the corpus stabilizes.

Why Belief Intervals?

Traditional classifiers output a single probability \( P(A) = 0.85 \) — “85% email address.” This conflates two fundamentally different situations: high confidence with abundant evidence vs. moderate confidence with sparse evidence. A Bayesian posterior and a coin flip can both yield 0.5, but they represent very different epistemic states.

Dempster-Shafer theory separates these via the belief function \( \text{Bel}(A) \) and plausibility function \( \text{Pl}(A) \), where:

$$ \text{Bel}(A) = \sum_{B \subseteq A} m(B), \qquad \text{Pl}(A) = 1 - \text{Bel}(\bar{A}) $$

The interval \( [\text{Bel}(A),\; \text{Pl}(A)] \) bounds the true probability. Its width \( \text{Pl}(A) - \text{Bel}(A) \) quantifies epistemic uncertainty — how much we don’t know:

| Interval | Interpretation |
|---|---|
| \( [0.82,\; 0.87] \) | Strong evidence, low ambiguity — classify with confidence |
| \( [0.30,\; 0.90] \) | Some support for \( A \), but high ignorance — gather more evidence |
| \( [0.45,\; 0.55] \) | Sources disagree — support is evenly split, needs revisit |

This distinction drives the entire pipeline: columns with wide belief gaps (where \( \text{Pl}(A) - \text{Bel}(A) \) is large) are automatically escalated for LLM re-examination with enriched context. Conflict \( K \) is tracked as a diagnostic, but it is the gap width that determines which columns need attention.
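Belief and plausibility fall directly out of a mass function. The following stdlib-only sketch uses a toy two-category frame with illustrative masses (not Atelier’s actual values); mass on the full frame \( \Theta \) represents ignorance:

```python
# Toy mass function over Θ = {EMAIL, PHONE}; focal sets are frozensets.
m = {
    frozenset({"EMAIL"}): 0.6,
    frozenset({"PHONE"}): 0.1,
    frozenset({"EMAIL", "PHONE"}): 0.3,  # unassigned mass = ignorance
}

def bel(m, A):
    """Bel(A): total mass committed to subsets of A."""
    return sum(v for B, v in m.items() if B <= A)

def pl(m, A):
    """Pl(A): total mass that does not contradict A (intersects A)."""
    return sum(v for B, v in m.items() if B & A)

A = frozenset({"EMAIL"})
print(bel(m, A), pl(m, A))  # belief interval ≈ [0.6, 0.9], gap 0.3
```

The 0.3 gap is exactly the mass left on \( \Theta \): the evidence neither confirms nor refutes `EMAIL` to that extent.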

Architecture

Architecture components (rendered diagram not reproduced here):

- React Frontend — XYFlow agent canvas and Embeddings visualization; calls the gateway over REST `/api/*`
- FastAPI Gateway — REST → gRPC bridge; serves the React build
- gRPC Core (`:50051`) — proto-first API, 7 RPCs
- Claude Agent SDK — 6-tool convergence loop, conflict-driven revisit
- Classification Pipeline — 6 evidence sources, DST fusion, MC sampling at scale
- PostgreSQL — devenv: PG 16 + pgvector; CAI: PGlite
- Qdrant — vector store, embedding search
Six Evidence Sources

Each source independently produces a mass function \( m_i : 2^\Theta \to [0, 1] \) over the frame of discernment \( \Theta \) (the set of all category codes). Sources are grouped by computational cost:

| Source | Feature Space | Cost Tier |
|---|---|---|
| Cosine similarity | Dense 384-dim sentence-transformer embedding (all-MiniLM-L6-v2) | M0 (local) |
| Pattern detection | 16 regex detectors + post-regex validators (email, phone, SSN, IP, UUID, date, datetime, URL, credit card + Luhn, MAC, IBAN, postal code, monetary, hash, semver, currency + ISO 4217); graduated mass scaling by match fraction | M0 |
| Name matching | Column name vs. vocabulary labels, codes, and aliases (4-tier: exact > code > alias > overlap) | M0 |
| LLM classification | Frontier model reasoning (Anthropic / Bedrock / Cerebras / OpenAI-compatible) | M1 (API) |
| CatBoost | 12 discrete features + 384-dim embedding; virtual ensemble uncertainty via `posterior_sampling` | M2 (trained) |
| SVM | Sparse TF-IDF: character n-grams (3–6) ∪ word bigrams; Platt-scaled LinearSVC | M2 (trained) |
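To make the 4-tier name-matching idea concrete, here is an illustrative stdlib sketch. The tier masses, vocabulary entries, and the overlap scaling are assumptions for illustration, not Atelier’s actual values:

```python
def name_match_mass(column_name, vocab):
    """vocab maps code -> {"label": str, "aliases": set of str}.
    Returns (code, mass) for the strongest tier:
    exact label > leaf code > alias > token overlap."""
    name = column_name.lower().replace("_", " ").strip()
    tokens = set(name.split())
    best = (None, 0.0)
    for code, entry in vocab.items():
        label = entry["label"].lower()
        if name == label:
            tier = 0.9                                 # tier 1: exact label
        elif name == code.rsplit(".", 1)[-1].lower():
            tier = 0.8                                 # tier 2: leaf code
        elif name in {a.lower() for a in entry["aliases"]}:
            tier = 0.7                                 # tier 3: alias
        else:
            overlap = tokens & set(label.split())
            tier = 0.5 * len(overlap) / max(len(tokens), 1)  # tier 4
        if tier > best[1]:
            best = (code, tier)
    return best

vocab = {
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": {
        "label": "email address", "aliases": {"e-mail", "mail"}},
    "ICE.SENSITIVE.PID.CONTACT.PHONE": {
        "label": "phone number", "aliases": {"telephone", "mobile"}},
}
print(name_match_mass("email_address", vocab))  # exact-label tier, mass 0.9
print(name_match_mass("user_email", vocab))     # token-overlap tier, mass 0.25
```

The mass returned here would feed the DST fusion as one of the six independent sources; anything below the exact tiers leaves most mass on \( \Theta \).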

The SVM and CatBoost classifiers occupy deliberately orthogonal feature spaces: the SVM operates on sparse lexical features (TF-IDF) while CatBoost uses dense semantic embeddings. This architectural separation ensures genuine evidence independence for Dempster’s rule.

Fusion

Sources are combined via the conjunctive rule of combination:

$$ m_{1 \oplus 2}(C) = \frac{1}{1-K} \sum_{\substack{A \cap B = C \ A,B \subseteq \Theta}} m_1(A) \cdot m_2(B) $$

where the conflict \( K = \sum_{A \cap B = \varnothing} m_1(A) \cdot m_2(B) \) measures the degree to which sources contradict each other. High \( K \) is the diagnostic signal that drives the convergence loop: columns where independent evidence sources disagree are escalated for targeted LLM revisit with enriched context (ML prediction, belief interval, confusable pair).
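Dempster’s rule is mechanical enough to sketch in a few lines of stdlib Python. Mass functions are keyed by frozensets over the frame; the example masses are illustrative:

```python
from itertools import product

def combine(m1, m2):
    """Conjunctive combination of two mass functions.
    Returns the fused mass function and the conflict K."""
    fused, K = {}, 0.0
    for (A, a), (B, b) in product(m1.items(), m2.items()):
        C = A & B
        if C:
            fused[C] = fused.get(C, 0.0) + a * b
        else:
            K += a * b                    # mass on contradictory pairs
    if K >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {C: v / (1.0 - K) for C, v in fused.items()}, K

m1 = {frozenset({"EMAIL"}): 0.8, frozenset({"EMAIL", "PHONE"}): 0.2}
m2 = {frozenset({"PHONE"}): 0.5, frozenset({"EMAIL", "PHONE"}): 0.5}
fused, K = combine(m1, m2)                # K = 0.4: sources partly contradict
```

Note the \( 1/(1-K) \) renormalization: after discarding contradictory mass, the surviving masses are rescaled to sum to one, which is why a high \( K \) can still yield a confident fused belief.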

Hierarchical Classification

The vocabulary forms a rooted code tree (e.g., ICE.SENSITIVE.PID.CONTACT.EMAIL). Belief and plausibility are queryable at any depth — \( \text{Bel}(\texttt{ICE.SENSITIVE}) \) aggregates all descendants. The cautious_code(τ) operator returns the deepest code where \( \text{Bel} > \tau \), enabling principled depth-accuracy tradeoffs: high \( \tau \) yields coarse but reliable labels; low \( \tau \) yields specific but less certain ones.
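A minimal sketch of the `cautious_code(τ)` idea over a dotted code tree (the belief values are illustrative, and the real operator’s interface may differ):

```python
def cautious_code(bel_of, codes, tau):
    """Return the deepest code whose belief exceeds tau, or None.
    bel_of: code -> Bel(code); codes: all codes in the vocabulary tree."""
    best = None
    for code in codes:
        if bel_of(code) > tau:
            if best is None or code.count(".") > best.count("."):
                best = code
    return best

# Toy beliefs: deeper codes aggregate less mass than their ancestors.
beliefs = {
    "ICE": 0.98,
    "ICE.SENSITIVE": 0.95,
    "ICE.SENSITIVE.PID": 0.80,
    "ICE.SENSITIVE.PID.CONTACT.EMAIL": 0.55,
}
print(cautious_code(beliefs.get, beliefs, 0.9))  # → ICE.SENSITIVE
print(cautious_code(beliefs.get, beliefs, 0.5))  # → ICE.SENSITIVE.PID.CONTACT.EMAIL
```

Raising \( \tau \) from 0.5 to 0.9 retreats two levels up the tree: the label gets coarser but the stated belief in it gets stronger.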

Convergence

The bootstrap pipeline iterates three phases until the belief gap (\( \text{Pl}(A) - \text{Bel}(A) \)) stabilizes:

  1. LLM sweep — classify all frontier columns via batch LLM calls
  2. ML validation — run the full 6-source DST pipeline; compute per-column belief, plausibility, and gap
  3. Targeted revisit — re-classify only uncertain columns (high gap or low belief) with enriched context (ML prediction + belief interval + detected patterns + confusable pairs)
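The three phases can be sketched as a programmatic loop. The function names, result shape, and gap threshold below are illustrative assumptions, not the pipeline’s actual interfaces:

```python
def bootstrap(columns, llm_classify, run_dst_pipeline,
              gap_threshold=0.15, max_iters=10):
    """Iterate LLM sweep -> DST validation -> targeted revisit until the
    mean belief gap (Pl - Bel) falls below gap_threshold."""
    labels = llm_classify(columns)                         # 1. LLM sweep
    for _ in range(max_iters):
        results = run_dst_pipeline(columns, labels)        # 2. ML validation
        gaps = {c: r.pl - r.bel for c, r in results.items()}
        if sum(gaps.values()) / len(gaps) < gap_threshold:
            break                                          # converged
        uncertain = [c for c, g in gaps.items() if g >= gap_threshold]
        labels.update(llm_classify(uncertain,              # 3. targeted
                                   context=results))       #    revisit
    return labels
```

Only the columns that stay uncertain pay for a second LLM call, which is what keeps the revisit phase cheap relative to the initial sweep.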

The primary convergence measure is mean belief gap — the average width of the \( [\text{Bel}, \text{Pl}] \) interval across all columns. A narrow gap means the evidence sources agree on a confident prediction. Conflict \( K \) is tracked as a diagnostic signal (it indicates source disagreement) but does not gate convergence — a column can have \( K = 0.9 \) but \( \text{Bel} = 0.95 \): the sources fought, but the winner is clear.

An agent-driven variant (via Claude Agent SDK with 6 tools) delegates the revisit strategy to an LLM that reasons about uncertainty patterns, calls retrain_svm to progressively improve the SVM on accumulated frontier labels, and declares convergence when diminishing returns are reached. The programmatic variant uses gap + coverage thresholds for environments where tool-use isn’t available.

Frontier-Label SVM Training

After the first LLM sweep, the SVM is retrained on blended synthetic + frontier labels — high-quality classifications from the frontier model on the stratified importance sample. Synthetic data provides vocabulary breadth (all categories); frontier labels provide corpus-specific depth. The SVM is hot-swapped progressively across convergence iterations, carrying corpus-specific signal into each validation pass. DST independence is preserved: the SVM trains on frontier-model (Opus) labels while the LLM mass function in fusion uses the subagent model (Sonnet/Haiku).

Scale

The pipeline handles corpora from 50 columns (OOTB sample) to 120M+ columns (full GitTables at 10M+ tables). Monte Carlo stratified sampling selects a representative frontier subset for LLM classification and propagates labels to the remaining corpus via embedding similarity.

With max_frontier_columns = 500, classifying a 120M-column corpus requires LLM inference on only 0.0004% of columns — a >99.99% cost reduction while preserving classification quality through DST conflict-driven escalation of uncertain propagations.
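The quoted fraction is simple arithmetic to verify:

```python
corpus_columns = 120_000_000
max_frontier_columns = 500

fraction = max_frontier_columns / corpus_columns
print(f"{fraction:.4%} of columns see LLM inference")   # 0.0004% of columns
print(f"{1 - fraction:.4%} avoided")                    # >99.99% cost reduction
```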

Out-of-the-Box Experience

A fresh deployment auto-seeds on first boot:

  1. 316-leaf BFO-grounded vocabulary (351 categories total) covering the CCO Information Content Entity trichotomy: Designative (names, IDs, codes), Descriptive (measurements, dates, amounts), Prescriptive (software, specs)
  2. 25 sample tables with 316 columns and a committed curated reference
  3. One-click classification via the Status page
  4. Interactive Embeddings visualization (UMAP/t-SNE via embedding-atlas)

Quick Start

Local development (devenv):

```shell
devenv shell          # Enter dev environment
just install          # Install Python + Node dependencies
just up               # Start gRPC + gateway + Vite dev server
```

CAI deployment: Deploy as an AMP from https://github.com/zndx/atelier.

Documentation Map