Classification Pipeline

Atelier’s core objective: agent-mediated metadata classification using Dempster-Shafer Theory (DST) to produce belief intervals instead of flat confidence scores, exposing epistemic uncertainty and source disagreement.

Terminology — reference-label provenance

Four distinct sources of per-column labels show up in our writeups. Conflating them is a load-bearing error, so we name each explicitly:

| Term | Source | Authority level | Where it appears |
|---|---|---|---|
| Published benchmark | External, human-curated labels (SOTAB, GitTables) | Gold standard; memorization-safe check | SOTAB pilot artifacts; docs/notes/2026-04-19/…phase_gate_2.md |
| Curated reference | Generator-derived (synth pairs an answer-key “reference column” per target) + spot-checked by hand | Definitive for the synthetic corpus; not equivalent to a published benchmark | build/meta-tagging-clean/curated_reference.csv |
| LLM commitment | A single LLM’s pass-1 or pass-2 output | Classifier opinion; not a truth | parquet llm_code, predicted_code |
| CatBoost prior | CatBoost fit to LLM labels, used for revisit enrichment | Not independent evidence: it is a compressed self-consensus of the LLM; valuable specifically for rescuing abstentions | parquet predicted_code via DST fusion |

An ablation (as used in our writeups) is a controlled experiment that holds most of the pipeline fixed and varies exactly one component at a time, so changes in accuracy can be attributed to that component rather than to the combination.

Methodology

Why Dempster-Shafer?

Traditional classifiers output a single confidence score (e.g., “85% email address”). This hides two distinct types of uncertainty:

  • Aleatoric uncertainty: inherent randomness in the data
  • Epistemic uncertainty: ignorance due to insufficient evidence

DST separates these via belief intervals [Bel(A), Pl(A)]:

  • Bel(A) = committed evidence supporting A (lower bound)
  • Pl(A) = evidence that cannot rule out A (upper bound)
  • Pl(A) - Bel(A) = unresolved ambiguity

When Bel(A) = 0.8 and Pl(A) = 0.85, we have high confidence with low ambiguity. When Bel(A) = 0.3 and Pl(A) = 0.9, we know something supports A but much remains uncertain — a signal to gather more evidence.
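As a minimal sketch of how such an interval falls out of a mass function, the snippet below computes Bel and Pl over a hypothetical three-label frame. The labels, the dict-of-frozensets representation, and the function names are illustrative assumptions, not Atelier's BeliefAssignment API.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def belief(m: Mass, a: FrozenSet[str]) -> float:
    """Bel(A): total mass committed to non-empty subsets of A."""
    return sum(v for s, v in m.items() if s and s <= a)


def plausibility(m: Mass, a: FrozenSet[str]) -> float:
    """Pl(A): total mass on focal elements that intersect A."""
    return sum(v for s, v in m.items() if s & a)


# Hypothetical frame of discernment and mass function (illustrative only).
THETA = frozenset({"email", "phone", "name"})
m: Mass = {
    frozenset({"email"}): 0.3,  # evidence committed to "email"
    frozenset({"phone"}): 0.1,  # evidence committed to "phone"
    THETA: 0.6,                 # uncommitted mass (ignorance)
}

A = frozenset({"email"})
bel, pl = belief(m, A), plausibility(m, A)
gap = pl - bel  # unresolved ambiguity for "email"
```

Here the interval for "email" is [0.3, 0.9]: some evidence supports it, but most of the mass is still uncommitted, which is the "gather more evidence" case described above.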

Evidence Sources

Each source independently produces a mass function (Basic Probability Assignment) that distributes belief across the frame of discernment:

| Source | Type | Discount | Config key | Status |
|---|---|---|---|---|
| Cosine similarity | Sentence-transformer (all-MiniLM-L6-v2) | 0.30 | classify.discounts.cosine | M0 |
| Pattern detection | 16 regex detectors + post-regex validators | 0.25 | classify.discounts.pattern_theta | M0 |
| Name matching | Column name ↔ label/abbrev/common_names | varies | classify.discounts.name_match_* | M0 |
| LLM | OpenAI-compatible / Anthropic / Bedrock / Cerebras | 0.10 | classify.llm.discount | M1 |
| CatBoost | Gradient boosted trees (virtual ensembles) | adaptive | classify.discounts.catboost_* | M2 |
| SVM | Dual TF-IDF (char+word n-grams) + LinearSVC (Platt scaling) | 0.20 | classify.discounts.svm | M2 |

The discount controls how much mass goes to Θ (total ignorance). Higher discount = more conservative = wider belief intervals.
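One common way to realize this is classical Shafer discounting: scale every singleton by (1 - discount) and assign the remainder to Θ. The sketch below assumes the source emits a probability distribution that sums to 1; the function name and labels are hypothetical, and the actual converters in mass_functions.py may differ.

```python
from typing import Dict, FrozenSet, Iterable

Mass = Dict[FrozenSet[str], float]


def discount_probabilities(
    probs: Dict[str, float], alpha: float, frame: Iterable[str]
) -> Mass:
    """Turn a probability distribution into a discounted mass function.

    Assumes probs sums to 1. Each singleton keeps (1 - alpha) of its
    probability; alpha goes to Theta (total ignorance).
    """
    theta = frozenset(frame)
    m: Mass = {
        frozenset({label}): (1.0 - alpha) * p
        for label, p in probs.items()
        if p > 0.0
    }
    m[theta] = m.get(theta, 0.0) + alpha
    return m


FRAME = ("email", "phone", "name")  # hypothetical labels
probs = {"email": 0.8, "phone": 0.2}

cosine_like = discount_probabilities(probs, alpha=0.30, frame=FRAME)
llm_like = discount_probabilities(probs, alpha=0.10, frame=FRAME)
```

With the same input distribution, the 0.30 discount leaves 0.56 on "email" while the 0.10 discount leaves 0.72, so the less discounted source carries more weight in fusion.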

Pattern mass is graduated: detect_patterns() returns a match fraction (0.0-1.0) per pattern, and pattern_to_mass() scales evidence mass by the average match fraction. A 95% match produces ~3x more mass than a 35% match, eliminating the binary cliff at the 1/3 detection threshold.
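A rough sketch of graduated pattern mass, assuming a simple linear scaling of the committed mass by the average match fraction. The function name, signature, and the linear form are assumptions for illustration; the real pattern_to_mass() may scale differently.

```python
from typing import Dict, Sequence


def graduated_pattern_mass(
    label: str, match_fractions: Sequence[float], theta_floor: float = 0.25
) -> Dict[str, float]:
    """Scale committed mass by the average match fraction (assumed linear).

    theta_floor mirrors classify.discounts.pattern_theta: even a perfect
    match leaves that much mass on Theta.
    """
    if not match_fractions:
        return {"THETA": 1.0}  # no pattern fired: vacuous evidence
    avg = sum(match_fractions) / len(match_fractions)
    committed = (1.0 - theta_floor) * avg
    return {label: committed, "THETA": 1.0 - committed}


strong = graduated_pattern_mass("email", [0.95])
weak = graduated_pattern_mass("email", [0.35])
ratio = strong["email"] / weak["email"]  # 0.95 / 0.35, roughly 2.7
```

Under this assumed linear form the 95% match carries about 2.7 times the mass of the 35% match, in line with the "~3x" figure above, and there is no step change near the detection threshold.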

Pattern theta (0.25) is deliberately higher than LLM theta (0.10), so the LLM cleanly dominates when pattern and LLM evidence conflict — the LLM considers full context (name, type, values, siblings), while patterns operate on value structure alone.

Evidence Independence

Dempster’s rule of combination requires cognitively independent evidence sources (Shafer 1976) — each mass function must reflect information not derived from the other sources being combined. Atelier achieves this through architectural separation of feature spaces and training signals:

| Source | Feature Space | Training Signal | Independence Basis |
|---|---|---|---|
| Name match | String/lexical | None (deterministic) | Symbolic matching only |
| Pattern | Regex | None (deterministic) | Hand-crafted rules only |
| Cosine | Dense embedding (384-dim) | Pre-trained sentence-transformer | Learned semantic similarity |
| LLM | Semantic (frontier or subagent model) | Pre-trained weights | In-context classification |
| CatBoost | Dense embedding + 12 features | Synthetic data generators | Gradient-boosted ensemble |
| SVM | Sparse TF-IDF (char 3-6 + word 1-2 n-grams) | Synthetic data generators | Lexical surface patterns |

The SVM is Atelier’s domain-adaptation channel. Cosine and the frontier LLM both rely on pretrained models, which work well on columns whose names and values carry meaning that a web-text-trained model can grip (email_address, transaction_amount, ISO dates). Many columns in deployed enterprise data are not like that: opaque names (val_09, col_73, ref_addr), opaque values (hex digests, internal serial codes, prefix-stripped tokens), or both. Pretrained models have nothing to grip for those: the signal lives only in domain-specific shape (format, length, character-class distribution, prefix vocabulary) that must be learned from data shaped like the deployed distribution. The SVM is trained on synthetic corpora produced by procedural generators in src/atelier/classify/synth_generators.py, so it learns precisely those patterns. The SVM and cosine therefore operate on disjoint signal populations (semantic-bearing columns versus inscrutable ones), which makes their evidence sources structurally, not merely statistically, independent under DST.

A subtler point worth naming: the historical “confusable pair” framing attributed to the data what often lived in the featurizer. Char-n-gram TF-IDF treating Brazilian CPF identifiers as date-shaped, or sub-word tokenization splitting similar-looking strings into overlapping tokens, are tokenization artifacts — properties of the model, not the data. Domain-adapted training on synthetic-corpus examples that match the deployed distribution sees past those artifacts; the SVM is not “resolving confusables” but reading columns that pretrained models fundamentally cannot.

Architecturally this also provides the most important independence guarantee in the DST stack. While cosine similarity and CatBoost both operate on the same dense sentence-transformer embedding (384 dimensions from all-MiniLM-L6-v2), the SVM operates on a fully orthogonal feature representation: sparse TF-IDF character and word n-grams extracted by sklearn.pipeline.Pipeline + FeatureUnion. The SVM captures lexical surface patterns (abbreviations, digit sequences, camelCase fragments) that the dense embedding collapses — providing genuine corrective signal in DST fusion.

SVM Architecture (adopted from Signals)

The SVM classifier follows the Pipeline + FeatureUnion composition pattern from the Signals project — the version of record presented as an independent fifth DST evidence source:

Column metadata text ("email_addr | user@example.com")
        │
        ▼
    FeatureUnion
    ├── TfidfVectorizer(analyzer="char_wb", ngram_range=(3,6))
    │   → captures subword patterns, abbreviations, digit sequences
    └── TfidfVectorizer(analyzer="word", ngram_range=(1,2))
        → captures multi-word patterns ("email address", "zip code")
        │
        ▼
    Sparse feature matrix (up to 100K dimensions)
        │
        ▼
    CalibratedClassifierCV(LinearSVC, method="sigmoid")
        │
        ▼
    Calibrated probability distribution {code: probability}

Key implementation details:

  • Singleton class filtering: fit() drops categories with < 2 training examples before CalibratedClassifierCV, since StratifiedKFold requires every class to have >= 2 samples. With 316 categories and few tables, some categories inevitably have only one example. Dropped categories are logged and still receive predictions from the other 5 DST evidence sources.
  • _min_class_count(): returns the actual minimum (no longer clamped to 2)
  • feature_importances(top_n): navigates CalibratedClassifierCV → LinearSVC to extract coef_, averages absolute coefficients across classes, and cross-references with FeatureUnion.get_feature_names_out() for named feature importance
  • is_fitted: property for safe state checking before prediction
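The singleton class filtering step can be sketched in plain Python as below. The helper name and return shape are hypothetical; the real fit() presumably performs the equivalent filtering before handing data to the calibrated classifier.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def drop_singleton_classes(
    texts: Sequence[str], labels: Sequence[str], min_count: int = 2
) -> Tuple[List[str], List[str], List[str]]:
    """Remove examples whose class has fewer than min_count samples.

    Returns (kept_texts, kept_labels, dropped_classes) so the caller can
    log which categories were excluded from SVM training.
    """
    counts = Counter(labels)
    dropped = sorted(c for c, n in counts.items() if n < min_count)
    kept = [(t, y) for t, y in zip(texts, labels) if counts[y] >= min_count]
    return [t for t, _ in kept], [y for _, y in kept], dropped


# Hypothetical training rows: "text | sample value" with a category label.
texts = ["email_addr | a@x.io", "mail | b@y.org", "ssn | 123-45-6789"]
labels = ["EMAIL", "EMAIL", "SSN"]

kept_texts, kept_labels, dropped = drop_singleton_classes(texts, labels)
```

In this toy input "SSN" has a single example, so it is dropped from SVM training and reported, while both "EMAIL" rows are kept.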

SVM Training (synth-only) and Vocabulary Alignment

The SVM is trained once on the synthetic corpus (see synth.md) using TF-IDF char-3-6gram + word-1-2gram features and labels keyed on bundled-ontology ICE.* leaves from synth_generators.GENERATORS. At pipeline runtime, the ICE.* predictions are translated into the user’s taxonomy via the cached subsumption-prediction alignment in atelier.classify.subsumption_alignment — sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from the Qdrant taxonomy collection. The legacy LLM-mediated alignment was retired in the P7 intervention (see DST Evidence Independence).

The alignment targets every user node — leaves AND internal nodes (per the dynamic-annotations principle that every node is a first-class tagging target). An ICE leaf may legitimately align to a user internal node when the user’s vocabulary covers a concept family without a leaf-specific equivalent. Restricting alignment to user leaves only would silently reject the parent-family fallback that is the architecturally-correct behavior.

The translation step is what restored the SVM as useful evidence for non-OOTB user vocabularies — pre-alignment, the SVM emitted ICE codes that didn’t appear in the user-taxonomy frame and silently contributed nothing. See subsumption_alignment.py module docstring for the full independence argument.

Historical note (2026-05-04 refactor). Earlier revisions of this design ran a mid-loop train_svm_on_frontier_labels (historical function name) that retrained the SVM on live LLM labels and hot-swapped the result into the active model slot, labelled “M9 incremental SVM retraining” in commit history. That path was excised on 2026-05-04 (commits 8627c2c, 5199379, cc59d01) for source-independence reasons: the per-column LLM label copying made the SVM strongly non-distinct from the LLM source under Denoeux 2008. The subsequent LLM-mediated alignment introduced a vocabulary-level shared error mode (the alignment-time LLM and the runtime LLM share weights), which the P7 subsumption-prediction intervention eliminates: runtime alignment now uses sentence-transformer embeddings rather than the runtime LLM. The SVM’s TF-IDF independence at the feature and label level is preserved; the remaining weak non-distinctness is the shared enrichment-LLM upstream (offline-generated annotations), structurally identical to the late-interaction cosine source’s coupling.

Implementation
  • train_svm() in ml_train.py — synth-only training, persists to build/models/svm.pkl (label space: ICE.* leaves)
  • ontology_alignment.build_alignment() — once-per-(vocab, embedding_model) ICE → user-code mapping via subsumption prediction (sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from Qdrant); cached at build/cache/alignment/<sha256>.json
  • Discount: classify.discounts.svm = 0.22 (was 0.30 under LLM-mediated alignment, 0.55 in M9 era) reflects the enrichment-mediated subsumption-prediction regime — weakly non-distinct via shared enrichment-LLM upstream only.

Dempster’s Rule of Combination

Sources are fused via the conjunctive combination rule:

m₁₂(C) = Σ{m₁(A)·m₂(B) : A∩B=C} / (1 - K)

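where K = Σ{m₁(A)·m₂(B) : A∩B=∅} is the conflict between sources. The rule applies to non-empty C only, with m₁₂(∅) = 0, and is undefined when K = 1 (total conflict).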

High K means the sources disagree — a valuable diagnostic signal. Note that K is not the convergence criterion — see Belief-Gap Convergence below.
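The rule above can be sketched for two sources as follows. This is an illustrative implementation over dicts of frozensets; it is not the dempster_combine() in belief.py, whose signature is not shown here, and the labels are hypothetical.

```python
from itertools import product
from typing import Dict, FrozenSet, Tuple

Mass = Dict[FrozenSet[str], float]


def dempster_sketch(m1: Mass, m2: Mass) -> Tuple[Mass, float]:
    """Combine two mass functions; return (fused mass, conflict K)."""
    combined: Mass = {}
    conflict = 0.0
    for (a, va), (b, vb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + va * vb
        else:
            conflict += va * vb  # product mass landing on the empty set
    norm = 1.0 - conflict
    if norm <= 0.0:
        raise ValueError("total conflict (K = 1): rule is undefined")
    return {s: v / norm for s, v in combined.items()}, conflict


THETA = frozenset({"email", "phone"})
m1: Mass = {frozenset({"email"}): 0.6, THETA: 0.4}
m2: Mass = {frozenset({"phone"}): 0.5, THETA: 0.5}

fused, k = dempster_sketch(m1, m2)
```

In this example K = 0.6 · 0.5 = 0.3, and the surviving masses (0.3 on email, 0.2 on phone, 0.2 on Θ) are each divided by 0.7.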

Compound Focal Elements (Uncertainty Representation)

When DST evidence splits closely between two singleton categories, collapsing to a single top-1 prediction misrepresents what the evidence actually says. DST’s native vocabulary for this is the compound focal element: a portion of the runner-up’s mass transfers to a focal element representing the union of the two singletons, honestly reflecting that the evidence supports the disjunction but does not discriminate between members. This is the same DST math that supports queries at any node in the hierarchy via belief_at() — the compound mass propagates up to the common ancestor, so belief at any level reflects the combined evidence.

The mechanism is unconditional DST: any two singletons whose masses split closely qualify in principle. In practice the implementation maintains a short registry of category pairs where the transfer is routinely activated — examples below, filtered to vocabulary at runtime. These are illustrations of cases where the mechanism activates, not a definitional list of categories the classifier is expected to “confuse”.

| Example pair | Why mass-splitting is common |
|---|---|
| Record Identifier ↔ Device Identifier | Both are opaque identifiers; context determines which |
| Timestamp ↔ Date of Birth | Both are temporal; DOB is a specific semantic subtype |
| Transaction Amount ↔ Bank Account Number | Both are financial numbers |
| IP Address ↔ Device Identifier | IP addresses can identify devices |

Mechanics: when the top-2 singleton masses match a registered pair and their ratio is below confusable_ratio_threshold (default 3.0), half of the runner-up’s mass transfers to the compound focal element. Belief at the common ancestor then reflects the combined evidence via belief_at() propagation. (The config knob retains its historical name for backward compatibility; the mechanism itself is honest uncertainty representation, not pair-discrimination.)
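A minimal sketch of the transfer described above, assuming the registry is a set of two-element frozensets and that the ratio is top mass divided by runner-up mass. The function name and data shapes are hypothetical.

```python
from typing import Dict, FrozenSet, Set

Mass = Dict[FrozenSet[str], float]


def transfer_to_compound(
    m: Mass,
    registered_pairs: Set[FrozenSet[str]],
    ratio_threshold: float = 3.0,
) -> Mass:
    """Move half of the runner-up's mass onto {top, runner-up}."""
    singletons = sorted(
        ((v, s) for s, v in m.items() if len(s) == 1),
        key=lambda t: t[0],
        reverse=True,
    )
    if len(singletons) < 2:
        return dict(m)
    (top_v, top), (run_v, run) = singletons[0], singletons[1]
    pair = top | run
    if pair not in registered_pairs or run_v <= 0.0:
        return dict(m)
    if top_v / run_v >= ratio_threshold:
        return dict(m)  # top clearly dominates: keep singletons as-is
    out = dict(m)
    moved = run_v / 2.0
    out[run] = run_v - moved
    out[pair] = out.get(pair, 0.0) + moved
    return out


TS, DOB = frozenset({"timestamp"}), frozenset({"date_of_birth"})
THETA = frozenset({"timestamp", "date_of_birth", "email"})
m: Mass = {TS: 0.5, DOB: 0.3, THETA: 0.2}

adjusted = transfer_to_compound(m, registered_pairs={TS | DOB})
untouched = transfer_to_compound(m, registered_pairs=set())
```

The ratio here is 0.5 / 0.3 ≈ 1.67, below the 3.0 default, so 0.15 moves from Date of Birth to the compound element. Total mass is unchanged, and belief in the disjunction is unchanged too; only the commitment to the runner-up singleton is reduced.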

Pattern Validation

Pattern detection uses a two-stage architecture: 16 regex patterns for recall, plus a _VALIDATORS registry for precision. A value must pass both the regex AND the validator (if one exists) to count.

| Validator | Pattern | Checks |
|---|---|---|
| _luhn_check | credit_card_pattern | Luhn checksum (ISO/IEC 7812) |
| _is_valid_ipv4 | ipv4_pattern | All 4 octets in 0-255 range |
| _is_plausible_date | date_iso_pattern, datetime_iso_pattern | Month 01-12, day 01-31 |
| _is_iso_currency | iso_currency_pattern | ISO 4217 whitelist (~40 codes) |

The phone_pattern uses a suppression mechanism: when a more specific digit-heavy pattern also fires (SSN, date, credit card, IP, postal code, monetary, IBAN), the phone match is suppressed. This prevents the phone regex from injecting false evidence on columns whose values happen to contain formatted digits.
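The two-stage idea (regex for recall, validator for precision) can be illustrated with two of the checks from the table. The regex and function names below are simplified stand-ins written for this example, not the project's actual detectors.

```python
import re
from typing import Callable, Optional, Sequence


def luhn_check(number: str) -> bool:
    """Luhn checksum over the digits of a card-like string."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


IPV4_RE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}")  # shape only


def is_valid_ipv4(value: str) -> bool:
    """Shape check plus range check on each octet."""
    if not IPV4_RE.fullmatch(value):
        return False
    return all(0 <= int(part) <= 255 for part in value.split("."))


def match_fraction(
    values: Sequence[str],
    regex: "re.Pattern[str]",
    validator: Optional[Callable[[str], bool]] = None,
) -> float:
    """Fraction of values passing the regex AND the validator, if any."""
    if not values:
        return 0.0
    hits = sum(
        1
        for v in values
        if regex.fullmatch(v) and (validator is None or validator(v))
    )
    return hits / len(values)


samples = ["10.0.0.1", "300.1.1.1"]
regex_only = match_fraction(samples, IPV4_RE)
validated = match_fraction(samples, IPV4_RE, is_valid_ipv4)
```

Both sample values have an IPv4 shape, but only one survives the octet range check, so the validated match fraction is half the regex-only fraction.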

12 Discrete Features

Each column produces 12 SAGE-ablatable features:

  1. column_name — humanized column name
  2. column_type — SQL type (suppresses uninformative STRING/VARCHAR)
  3. sample_values — first 5 non-null values as text
  4. cardinality — distinct value count
  5. null_ratio — fraction of NULL values
  6. value_entropy — Shannon entropy of value lengths
  7. pattern_signals — matched regex patterns
  8. avg_value_length — mean string length
  9. numeric_ratio — fraction parseable as numbers
  10. sibling_context — other column names in the same table
  11. source_table — table name
  12. value_description — auto-generated natural language description
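A few of the numeric features are simple enough to sketch directly. The functions below are illustrative assumptions about how they might be computed (for instance, entropy in bits over the distribution of value lengths); features.py may define them differently.

```python
import math
from collections import Counter
from typing import Optional, Sequence


def null_ratio(values: Sequence[Optional[str]]) -> float:
    """Fraction of NULL (None) values."""
    return sum(v is None for v in values) / len(values) if values else 0.0


def value_entropy(values: Sequence[Optional[str]]) -> float:
    """Shannon entropy (bits) of the distribution of value lengths."""
    lengths = [len(str(v)) for v in values if v is not None]
    if not lengths:
        return 0.0
    n = len(lengths)
    return -sum((c / n) * math.log2(c / n) for c in Counter(lengths).values())


def _is_number(v: str) -> bool:
    try:
        float(v)
        return True
    except (TypeError, ValueError):
        return False


def numeric_ratio(values: Sequence[Optional[str]]) -> float:
    """Fraction of non-null values parseable as numbers."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return sum(_is_number(v) for v in non_null) / len(non_null)


column = ["a@x.io", "bb@x.io", None, "42"]  # hypothetical sample values
features = {
    "null_ratio": null_ratio(column),
    "value_entropy": value_entropy(column),
    "numeric_ratio": numeric_ratio(column),
}
```

The three non-null values have three distinct lengths, so the length entropy is log2(3) bits, the maximum for three values.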

Architecture

AgentFSM

The classification pipeline runs as a background Finite State Machine:

ML-only path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

Bootstrap path (programmatic):
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
                                                    ▲                     │
                                                    └─── (disagreements) ─┘
                                                          (converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

Agent-driven path:
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING
                                                    ▲           │
                                                    └── Agent convergence loop (6 tools)
                                                          Claude reasons about which columns to revisit
                                                          (converged) ────► CLASSIFYING → FUSING → EVALUATING → CONVERGED → IDLE

MC sampling (when corpus > 200 columns):
SAMPLING includes pre-classify → stratify → select MC sample
LLM_SWEEP classifies the sampled subset only → propagate labels to remainder

State transitions are persisted to PostgreSQL. The Status page polls /api/fsm/status for live progress updates.
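The paths above can be encoded as an explicit transition table. The sketch below uses the state names from the diagrams, but the class, the table layout, and the in-memory history (standing in for PostgreSQL persistence) are assumptions rather than the contents of fsm.py.

```python
from enum import Enum, auto
from typing import Dict, List, Set


class State(Enum):
    IDLE = auto()
    LOADING_VOCAB = auto()
    DISCOVERING = auto()
    SAMPLING = auto()
    LLM_SWEEP = auto()
    VALIDATING = auto()
    CLASSIFYING = auto()
    FUSING = auto()
    EVALUATING = auto()
    CONVERGED = auto()


# Allowed transitions, covering the ML-only and bootstrap paths.
TRANSITIONS: Dict[State, Set[State]] = {
    State.IDLE: {State.LOADING_VOCAB},
    State.LOADING_VOCAB: {State.DISCOVERING},
    State.DISCOVERING: {State.SAMPLING},
    State.SAMPLING: {State.CLASSIFYING, State.LLM_SWEEP},
    State.LLM_SWEEP: {State.VALIDATING},
    State.VALIDATING: {State.LLM_SWEEP, State.CLASSIFYING},
    State.CLASSIFYING: {State.FUSING},
    State.FUSING: {State.EVALUATING},
    State.EVALUATING: {State.CONVERGED},
    State.CONVERGED: {State.IDLE},
}


class PipelineFSM:
    """Toy FSM that rejects transitions not present in the table."""

    def __init__(self) -> None:
        self.state = State.IDLE
        self.history: List[State] = [State.IDLE]

    def advance(self, nxt: State) -> None:
        if nxt not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {nxt}")
        self.state = nxt
        self.history.append(nxt)  # real code would persist this


# Bootstrap path with one revisit round (VALIDATING -> LLM_SWEEP).
fsm = PipelineFSM()
for step in [
    State.LOADING_VOCAB, State.DISCOVERING, State.SAMPLING,
    State.LLM_SWEEP, State.VALIDATING, State.LLM_SWEEP, State.VALIDATING,
    State.CLASSIFYING, State.FUSING, State.EVALUATING,
    State.CONVERGED, State.IDLE,
]:
    fsm.advance(step)
```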

Module Structure

src/atelier/classify/
├── __init__.py          # Public API: run_pipeline(), run_bootstrap(), get_fsm_status()
├── belief.py            # DST core: BeliefAssignment, FocalElement, dempster_combine()
├── mass_functions.py    # Evidence→mass converters (6 active)
├── features.py          # 12 features + 16 pattern detectors + 5 post-regex validators
├── taxonomy.py          # ReferenceCategory, HierarchicalCategorySet
├── embedding.py         # Sentence-transformer cosine classifier
├── llm_backend.py       # LLM backend factory (Anthropic, OpenAI-compat, Bedrock tool-use, Cerebras)
├── bootstrap.py         # Bootstrap convergence loop (LLM sweep + ML validation)
├── agent_loop.py        # Agent-driven convergence (6 Claude tools)
├── monte_carlo.py       # MC stratified sampling for scale (pre-classify, stratify, select, propagate)
├── gpu.py               # GPU detection + NVIDIA driver symlink (nix+CUDA)
├── sampler.py           # Hive metadata sampling + fixture data loading
├── synth.py             # Synthetic data generation
├── synth_generators.py  # 316+ hand-coded value generators (shared module)
├── synth_registry.py    # Three-layer generator registry (hand-coded > template > inferred)
├── meta_tagging_overlay.py # 130+ META_TO_ICE mappings for meta-tagging alignment
├── svm_classifier.py    # Pipeline+FeatureUnion: dual TF-IDF + LinearSVC + Platt scaling (signals)
├── catboost_classifier.py # CatBoost with virtual ensemble uncertainty
├── ml_train.py          # Training orchestrator (synth → models)
├── ml_inference.py      # Lazy-loading inference wrappers
├── evaluation.py        # Structured evaluation (per-category P/R/F1, confusion matrix)
├── train_eval_cycle.py  # Synth → train → classify → evaluate orchestrator
├── mock_llm.py          # Realistic mock LLM (seeded uncertainty + mass-splitting between close categories)
├── sage.py              # SAGE feature importance (permutation-based, GPU-aware)
├── shap_explanations.py # Per-item SHAP feature attribution (TreeSHAP + PermutationSHAP)
├── pipeline.py          # Full pipeline orchestration (6 sources + MC + background SHAP)
├── fsm.py               # AgentFSM state machine
├── fixtures/
│   ├── universal_vocabulary.json  # BFO-grounded universal vocabulary (16 leaves)
│   └── fixture_tables.json        # 8 tables, 50 cols — fixture reference for unit tests
│                                    (NOT the UAT-corpus curated reference; see
│                                    build/meta-tagging-clean/curated_reference.csv)
data/sample/
├── ontology.json                  # Expanded vocabulary (300 leaves, 25 internal)
└── ontology/
    ├── atelier-vocab.ttl          # CCO-mediated BFO alignment (59 mapped terms)
    ├── sparql/unmapped-terms.rq   # Totality validation query
    └── README.md                  # Mapping methodology and usage

Build Directory

Artifacts are written to build/ (gitignored) to separate reproducible code from potentially sensitive intermediate data:

build/
├── data/annotations/    # Cached vocabulary from hive
├── data/samples/        # Sampled metadata
├── data/synth/          # Synthetic training data
├── models/              # Trained CatBoost + SVM models, embedding caches
└── results/{run_id}/
    ├── classifications.json           # Per-column DST results (+ SHAP columns when enabled)
    ├── evaluation_report.json         # Per-category P/R/F1, confusion matrix
    └── atelier_embeddings.parquet     # For embedding-atlas (+ shap_top{1,2,3}_{name,value})

Controlled Vocabulary

Loaded from hive default.annotations (11 columns):

| Column | Maps to | Purpose |
|---|---|---|
| id | code | Hierarchical dot-notation identifier |
| ontology | label | Human-readable category name |
| annotation | abbrev | Formal code / mnemonic |
| definition | description | Human-readable definition text |
| common_names | common_names | Pipe/comma-separated aliases |
| specifics | (embedding text) | Examples and context |
| non_corp, emp_contractor, individual, corp | sensitivity | Per-role ratings (0-4) |
| deprecated | (filter) | “yes” = exclude |

API

REST Endpoints

  • GET /api/fsm/status — Current pipeline state + progress
  • POST /api/fsm/start — Start a single-pass ML classification run
  • POST /api/fsm/start-bootstrap — Start bootstrap convergence loop (LLM + ML)
  • GET /api/fsm/runs — List past runs

gRPC RPCs

  • GetFSMStatus() → FSMStatusResponse
  • StartClassification() → StartClassificationResponse

HierarchicalClassification

The pipeline wraps each column result in a HierarchicalClassification object (ported from signals) that enables post-hoc hierarchy navigation:

  • belief_at(code): query Bel at any hierarchy level (leaf or internal)
  • plausibility_at(code): query Pl at any level
  • interval_at(code): (Bel, Pl) tuple
  • uncertainty_gap: Pl - Bel for the predicted category
  • needs_clarification: True when uncertainty_gap > 0.3 or conflict > 0.2
  • from_combined_evidence(): factory method that filters vacuous sources, combines via the configured fusion strategy, and ranks by pignistic probability

Confidence is the pignistic probability BetP(singleton), the decision-theoretic transform that distributes the mass of each multi-element focal set equally among its members.
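A short sketch of the pignistic transform, assuming a normalized mass function (no mass on the empty set) stored as a dict of frozensets. The labels are hypothetical.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def pignistic(m: Mass) -> Dict[str, float]:
    """BetP(x): each focal set shares its mass equally among members."""
    betp: Dict[str, float] = {}
    for focal, mass in m.items():
        if not focal:
            continue  # assumes m(empty set) = 0 after normalization
        share = mass / len(focal)
        for label in focal:
            betp[label] = betp.get(label, 0.0) + share
    return betp


THETA = frozenset({"email", "phone", "name"})
m: Mass = {
    frozenset({"email"}): 0.5,
    frozenset({"email", "phone"}): 0.2,  # compound focal element
    THETA: 0.3,
}

betp = pignistic(m)
ranked = sorted(betp, key=betp.get, reverse=True)
```

Here "email" receives 0.5 + 0.1 + 0.1 = 0.7, "phone" 0.2, and "name" 0.1, so the ranking puts "email" first while the values still sum to 1.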

Fusion Strategies

Two DST combination rules are implemented, selectable via classify.fusion_strategy:

  • dempster (default) — Classical Dempster’s rule with (1-K) normalization. Under high conflict, surviving singletons are amplified.
  • yager — Yager’s modified rule. Conflict mass is redirected to Θ (ignorance) instead of being normalized away. Preserves epistemic honesty at the cost of higher ignorance mass and typically lower peak belief values. When K=0, produces identical results to Dempster.

Yager is available as an opt-in alternative for empirical validation. The default (Dempster) remains in place pending A/B comparison on real pipeline runs — Yager’s increased conservatism may or may not improve overall classification quality, and compensatory adjustments to per-source discounting or decision thresholds may be needed.
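For comparison, Yager's rule can be sketched for two sources as below. It accumulates the same intersections as Dempster's rule but adds the conflict mass to Θ instead of renormalizing. The function name and labels are illustrative, not the project's implementation.

```python
from typing import Dict, FrozenSet, Tuple

Mass = Dict[FrozenSet[str], float]


def yager_sketch(m1: Mass, m2: Mass, theta: FrozenSet[str]) -> Tuple[Mass, float]:
    """Combine two mass functions; conflict K is reassigned to Theta."""
    combined: Mass = {}
    conflict = 0.0
    for a, va in m1.items():
        for b, vb in m2.items():
            inter = a & b
            if inter:
                combined[inter] = combined.get(inter, 0.0) + va * vb
            else:
                conflict += va * vb
    # No (1 - K) normalization: the conflicting mass becomes ignorance.
    combined[theta] = combined.get(theta, 0.0) + conflict
    return combined, conflict


THETA = frozenset({"email", "phone"})
m1: Mass = {frozenset({"email"}): 0.6, THETA: 0.4}
m2: Mass = {frozenset({"phone"}): 0.5, THETA: 0.5}

fused, k = yager_sketch(m1, m2, THETA)
```

With K = 0.3, Yager leaves 0.3 on email and raises Θ to 0.5, whereas Dempster's normalization would lift email to 0.3 / 0.7 ≈ 0.43. That difference is the "lower peak belief, higher ignorance" trade-off described above.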

Bootstrap Convergence Loop

The bootstrap pipeline wraps the single-pass ML pipeline in an iterative LLM↔ML convergence loop. It adds LLM evidence and repeats until predictions are settled — measured by belief-gap convergence, not raw conflict K.

Three Phases

  1. LLM Sweep (LLM_SWEEP): Batch-classify all columns via the configured LLM backend (Claude via Bedrock/Anthropic, or any OpenAI-compatible endpoint). Columns are sent in table-aware batches with sibling context. If every batch fails, the sweep raises RuntimeError (fail-fast) instead of silently proceeding with zero labels.

  2. ML Validation (VALIDATING): Run the full 6-source DST pipeline for each column. Compute per-column belief interval [Bel, Pl], conflict K, and uncertainty gap Pl - Bel. Identify uncertain columns where predictions need revisiting.

  3. Targeted Revisit (back to LLM_SWEEP): Re-classify uncertain columns with enriched context — the ML prediction, belief interval, pattern signals, and value descriptions are included in the prompt. This gives the LLM evidence it didn’t have in the first pass.

Belief-Gap Convergence

The primary convergence measure is the uncertainty gap Pl - Bel for each column’s predicted category. This directly answers “how settled is this prediction?” — unlike K, which only measures source disagreement.

A column can have K=0.9 but Bel=0.95 — the sources fought hard during combination, but the normalizing denominator (1-K) concentrated surviving mass on the agreed-upon singleton. That column’s prediction is settled despite high conflict; it doesn’t need revisiting.

Convergence criteria (all must hold):

| Criterion | Metric | Default | Meaning |
|---|---|---|---|
| Primary | mean_gap < gap_threshold | 0.15 | Predictions are tight |
| Secondary | frac_unclear < clarity_target | 0.10 | At most 10% of columns need clarification |
| Coverage | coverage >= coverage_target | 0.95 | 95% of columns have labels |

Revisit targeting: _identify_uncertain_columns() selects columns where gap > 0.3 OR Bel < bel_floor (default 0.50), sorted by gap descending (most uncertain first).

Early stopping: The proof-of-progress paradigm monitors the gap trend. When mean gap plateaus for 2 consecutive iterations (no verifiable progress), the loop stops even if the threshold hasn’t been reached.
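The criteria, revisit targeting, and plateau check can be sketched together as follows. The ColumnState type, the use of the needs_clarification rule (gap > 0.3 or conflict > 0.2) for frac_unclear, and the min_delta tolerance are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ColumnState:  # hypothetical per-column summary
    bel: float
    pl: float
    conflict: float = 0.0
    labeled: bool = True

    @property
    def gap(self) -> float:
        return self.pl - self.bel


def is_converged(
    cols: Sequence[ColumnState],
    gap_threshold: float = 0.15,
    clarity_target: float = 0.10,
    coverage_target: float = 0.95,
) -> bool:
    """All three criteria from the table must hold."""
    n = len(cols)
    if n == 0:
        return False
    mean_gap = sum(c.gap for c in cols) / n
    frac_unclear = sum(c.gap > 0.3 or c.conflict > 0.2 for c in cols) / n
    coverage = sum(c.labeled for c in cols) / n
    return (
        mean_gap < gap_threshold
        and frac_unclear < clarity_target
        and coverage >= coverage_target
    )


def uncertain_columns(cols: Sequence[ColumnState], bel_floor: float = 0.50) -> List[int]:
    """Indices to revisit, most uncertain (largest gap) first."""
    picked = [i for i, c in enumerate(cols) if c.gap > 0.3 or c.bel < bel_floor]
    return sorted(picked, key=lambda i: cols[i].gap, reverse=True)


def plateaued(mean_gaps: Sequence[float], patience: int = 2, min_delta: float = 1e-3) -> bool:
    """True when the last `patience` iterations each improved by < min_delta."""
    if len(mean_gaps) < patience + 1:
        return False
    recent = mean_gaps[-(patience + 1):]
    return all(prev - cur < min_delta for prev, cur in zip(recent, recent[1:]))


cols = [ColumnState(0.90, 0.95, 0.1) for _ in range(19)] + [ColumnState(0.40, 0.90, 0.1)]
converged = is_converged(cols)
revisit = uncertain_columns(cols)
```

In this toy run one column out of twenty has a wide interval: the mean gap (0.0725) and unclear fraction (0.05) are both under their thresholds, so the loop would report convergence while still flagging that one column for revisit.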

K as Diagnostic

Conflict K remains in logs, iteration metrics, and agent tools as a diagnostic for source disagreement. It is useful for identifying calibration issues (e.g., a pattern detector producing false positives) but does not gate convergence. The cumulative K formula K = 1 - Π(1 - Kᵢ) tends to be high (~0.5-0.8) with 6 partially correlated sources; this is expected and does not indicate poor quality.
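A quick numeric check of the cumulative formula shows why high values are unsurprising. The per-step conflicts below are made-up inputs; only the formula comes from the text.

```python
import math
from typing import Sequence


def cumulative_conflict(step_conflicts: Sequence[float]) -> float:
    """K = 1 - prod(1 - K_i) over the pairwise combination steps."""
    return 1.0 - math.prod(1.0 - k for k in step_conflicts)


# Fusing 6 sources sequentially takes 5 combination steps.
moderate = cumulative_conflict([0.2] * 5)   # 1 - 0.8**5
mild = cumulative_conflict([0.13] * 5)      # 1 - 0.87**5
```

Five steps with a modest conflict of 0.2 each already accumulate to about 0.67, which sits inside the 0.5-0.8 range quoted above.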

Agent-Driven Convergence

As an alternative to the programmatic loop, the agent convergence loop (agent_loop.py) delegates revisit strategy to Claude. The agent uses 6 tools — get_conflict_report, revisit_columns, check_convergence, get_column_detail, retrain_svm, declare_converged — to reason about which columns need re-examination. The agent sees both gap-based and K-based metrics and can make nuanced decisions. See Keystone Agents.

LLM Backend

llm_backend.py provides a factory-pattern abstraction:

  • OpenAICompatibleBackend: For vLLM, GLM-4.7, and any endpoint implementing the OpenAI chat completions API. Default backend.
  • AnthropicBackend: For Claude via the Anthropic SDK.
  • BedrockBackend: For AWS Bedrock via the Converse API.
  • BedrockStructuredBackend: Production default on CAI. Uses invoke_model with tool-use for structured output (output_config is not supported on Bedrock). When extended thinking is enabled, tool_choice must be "auto" (Anthropic constraint); a text-block fallback parser handles this case. Both Bedrock backends use region_from_arn() to extract the target region from cross-region inference profile ARNs.
  • CerebrasBackend: OpenAI-compatible with Cerebras-specific defaults (base_url=https://api.cerebras.ai/v1, model=zai-glm-4.7).
  • create_backend_from_cfg(cfg): Factory that reads HOCON config to select and configure the appropriate backend.

Backends fail fast when not configured — no mock fallback in production code.

Configuration

All bootstrap/LLM settings live in HOCON (config/base.conf):

classify {
    llm {
        backend = "openai_compatible"  # or "anthropic", "bedrock_structured"
        model = "glm-4.7"
        base_url = null                # vLLM endpoint URL
        columns_per_call = 50
        discount = 0.10                # DST discount for LLM mass
    }
    bootstrap {
        max_iterations = 5
        k_threshold = 0.2              # diagnostic (not convergence-gating)
        coverage_target = 0.95
        max_total_llm_calls = 5000
        # Belief-gap convergence (primary criteria)
        gap_threshold = 0.15           # mean(Pl - Bel) target
        clarity_target = 0.10          # max fraction of unclear columns
        bel_floor = 0.50               # min belief for "settled"
    }
}

Environment variable overrides follow the standard pattern: ATELIER_LLM_MODEL, ATELIER_LLM_BASE_URL, ATELIER_BOOTSTRAP_K_THRESHOLD, etc.

SHAP Explanations

Per-item feature attribution explaining why each column was classified as it was. Complements the global SAGE importance (which ranks features across the entire dataset) with item-level explanations.

Two Methods

| Method | Algorithm | Speed | Features | When Used |
|---|---|---|---|---|
| CatBoost TreeSHAP | Exact O(TLD²) built-in | ~0.1s for 50 items | Grouped: embedding, discrete | Auto when CatBoost model loaded |
| Embedding PermutationSHAP | shap.PermutationExplainer | ~50s/item on CPU | 12 named features | Tier-1, explicit request only |

Auto mode (method="auto") only uses TreeSHAP — PermutationSHAP is too slow for default pipeline runs and must be explicitly requested.

Output

Each classification gains 6 extra columns:

  • shap_top1_name, shap_top1_value
  • shap_top2_name, shap_top2_value
  • shap_top3_name, shap_top3_value

These flow through to JSON, parquet, and evaluation output.

Configuration

classify.shap {
    enabled = true        # Enable SHAP in pipeline (auto-selects method)
    top_k = 3             # Number of top features to report per item
}

Configurable Discounts

All DST discount factors are configurable via HOCON. The DiscountConfig dataclass bundles all parameters and provides a DiscountConfig.from_cfg(cfg) factory:

classify.discounts {
    cosine = 0.30                    # Cosine similarity → Theta mass
    svm = 0.20                       # SVM → Theta mass
    pattern_theta = 0.25             # Pattern detection → Theta mass (graduated by match fraction)
    name_match_exact = 0.70          # Exact label match singleton mass
    name_match_code = 0.50           # Formal code/abbrev match mass
    name_match_alias = 0.50          # Common name alias match mass
    name_match_overlap = 0.30        # Word overlap match mass
    catboost_base = 0.10             # Adaptive discount base
    catboost_variance_scale = 1.6    # Variance-to-discount scaling
    catboost_max = 0.50              # Cap on adaptive discount
    catboost_fallback = 0.15         # When no variance available
    confusable_ratio_threshold = 3.0 # Mass-split ratio that triggers compound focal element transfer
}

Environment variable overrides: ATELIER_DISCOUNT_COSINE, ATELIER_DISCOUNT_SVM, etc.
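The four catboost_* knobs suggest an adaptive discount of roughly the following shape: a base value raised in proportion to the virtual-ensemble variance, capped at a maximum, with a fixed fallback when no variance is available. The formula below is an assumption inferred from the parameter comments, not a confirmed description of the implementation.

```python
from typing import Optional


def adaptive_discount(
    variance: Optional[float],
    base: float = 0.10,            # catboost_base
    variance_scale: float = 1.6,   # catboost_variance_scale
    max_discount: float = 0.50,    # catboost_max
    fallback: float = 0.15,        # catboost_fallback
) -> float:
    """Assumed form: min(base + scale * variance, max), else fallback."""
    if variance is None:
        return fallback
    return min(base + variance_scale * variance, max_discount)


confident = adaptive_discount(0.0)   # no ensemble disagreement
uncertain = adaptive_discount(0.1)   # 0.10 + 1.6 * 0.1
capped = adaptive_discount(0.5)      # would be 0.90, capped at 0.50
missing = adaptive_discount(None)    # no variance available
```

Under this assumed form, a prediction on which the virtual ensemble agrees is discounted at the base rate, while high-variance predictions push more mass to Θ up to the cap.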

Milestones

| Milestone | Scope | Status |
|---|---|---|
| M0 | Cosine + pattern + name match, FSM, pipeline E2E | Done |
| M0.5 | Schema fix, pignistic probability, HierarchicalClassification | Done |
| M1 | LLM evidence source, bootstrap convergence loop, LLM↔ML validation | Done |
| M2 | CatBoost + SVM + synthetic data, 6 evidence sources, Bedrock/Cerebras backends | Done |
| M3 | Evaluation framework, E2E synth-train-eval, realistic mock LLM, SAGE importance | Done |
| M4 | SHAP explanations, configurable discounts, thread-safe model loading | Done |
| M5 | Data sources + versioning, OOTB onboarding (316-leaf ontology, 25 sample tables) | Done |
| M6 | Agent-driven convergence loop (6 Claude tools), synth framework (316+ generators) | Done |
| M7 | Monte Carlo stratified sampling, label propagation, background SHAP | Done |
| M8 | GPU acceleration (NVIDIA driver symlink, batch encoding), meta-tagging overlay | Done |
| M8.5 | SVM signals alignment (Pipeline+FeatureUnion adoption, evidence independence documentation) | Done |
| M9 | Incremental SVM training on LLM-classified labels (cross-model distillation via MC sampling); subsequently excised, see 2026-05-04 historical note above | Done |
| M10 | Phase Gate #2: belief-gap convergence pivot, Cautious-Code Review, TreeSHAP per-feature attribution, reasoning-trace citation analyzer (+9 pts iterative gain), 97.8% phase-gate validation on meta-tagging | Done |
| M11 | MLflow experiment tracking, Hive data source integration | Proposed |