Pipeline Phases (FSM Walk-Through)
A run of the classification pipeline advances through a finite state
machine. Each state is a named phase with a single responsibility,
and the legal transitions between phases (defined authoritatively in
src/atelier/classify/fsm.py) form the workflow that operators see live
in the Workflows page and that this document narrates end-to-end.
This page is the operator-facing companion to two deeper references:
- Classification Pipeline — what the pipeline does mathematically (DST, evidence sources, fusion strategies).
- DST Evidence Independence — why each evidence source qualifies for Dempster’s rule of combination.
Read this one first when you need to walk a reviewer through the run
shape: which phase produces which artifact, where the iteration loop
lives, what makes a run land in CONVERGED versus ERROR.
At a glance
                                                        ┌─ revisit ─┐
                                                        ▼           │
    IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING
                                                                    │
         ┌──────────────────────────────────────────────────────────┘
         ▼
    CLASSIFYING → FUSING → EVALUATING → CONVERGED

    (any phase) → ERROR
The arrow back from VALIDATING to LLM_SWEEP is the iteration loop; it’s the heart of the algorithm and is described in Iteration loop below.
Phases in execution order
| # | State | What it does | Primary output |
|---|---|---|---|
| — | IDLE | No run in flight. Ready to dispatch the next classification. | — |
| 1 | LOADING_VOCAB | Load the user-supplied taxonomy (annotations CSV / Hive table / DB) and validate: label collisions, duplicate codes, orphaned aliases, parent-aware frame structure. | HierarchicalCategorySet, FrameOfDiscernment |
| 2 | DISCOVERING | Probe the data source via cml.data_v1 (Hive), the meta-tagging mount (CSV), or the bundled fixtures to enumerate the tables in scope. | list[str] of table names |
| 3 | SAMPLING | For each discovered table, sample column metadata: bare names, types, ~5 representative values, true COUNT(DISTINCT) bounded by the sample limit, null ratio, sibling list. Reference-key columns (attr_1_2_3_* answer-key shape) are filtered out so they don’t trivially leak into evaluation. | list[ColumnSample] (canonical bare names — see ColumnSample invariant) |
| 4 | LLM_SWEEP | Claude classifies each directly-targeted column into the user vocabulary. Iteration 1 sweeps every column (or the Monte Carlo sampled subset — a stratified slice for large corpora; the remaining columns get label propagation later). Iterations 2…N revisit only the columns flagged for re-look in the previous VALIDATING pass. | state.labels[qualified_name] → category_code, plus per-column LLM confidence |
| 5 | VALIDATING | ML re-validation: CatBoost (fit-to-LLM during the loop) and the synth-trained SVM (translated through the LLM-mediated ICE→user-vocab alignment) score the same columns independently of the LLM. Per-column DST mass with conflict K is computed under the parent-aware frame. The disagreement set — driven primarily by belief-gap Pl − Bel, with K and coverage as secondary signals — feeds the next iteration’s revisit batch. The loop exits when convergence criteria are satisfied; otherwise it re-enters LLM_SWEEP. | state.ml_prediction, state.ml_belief, state.ml_plausibility, state.ml_conflict, the next iteration’s disagreement list |
| 6 | CLASSIFYING | Final per-column DST evidence fusion. Up to six evidence sources combine: name_match, pattern, cosine, llm, catboost, svm. Each produces a mass function over the parent-aware frame; per-column predicted code, belief, plausibility, and conflict are computed here. | classifications: list[dict] (each entry shaped as classifications.json rows) |
| 7 | FUSING | Combine per-column mass functions via the configured fusion strategy. dempster normalizes conflict by (1 − K); yager redirects conflict mass to Θ (ignorance). Cautious-code review (when enabled) runs here — backing off over-specified leaf predictions whose belief sits below the commit threshold to a parent code where it does. | Headline classification per column; cautious_review.json (when enabled) |
| 8 | EVALUATING | Compute corpus-level metrics: accuracy vs reference (when present), per-category precision/recall, K distribution, gap distribution, non-PII residual count. Persist artifacts to disk and emit the parquet for the Embeddings page. Overwatch (when enabled) runs at the tail of this phase. | evaluation_report.json, column_trajectories.json, taxonomy_findings.json, atelier_embeddings.parquet, ML artifacts, overwatch.md (when enabled) |
| — | CONVERGED | Terminal success. Convergence criteria satisfied; results are on disk under build/results/{run_id}/ and registered in the DB via the run-end registration path (or recovered later by atelier.db.sync.sync_filesystem_to_db on restart). | — |
| — | ERROR | Terminal failure. The FSM error field carries the diagnostic; pipeline logs and register_error.json (if any) carry the rest. | — |
The FSM defines two states that the standard inference run does not
visit — GENERATING_SYNTH and TRAINING. These belong to the
offline synth-corpus generation + SVM-training flow that
produces the bundled SVM artifact (legacy filename
svm_frontier.pkl retained on disk for backward compatibility with
older run directories), and are reachable from SAMPLING only on the
explicit synth-generate code path.
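The FUSING row above distinguishes the dempster and yager strategies by where the conflict mass K goes. The following is a minimal sketch of that difference on a flat three-label toy frame, not the parent-aware frame the pipeline uses. The function names (combine, belief, plausibility), the labels, and the mass values are all illustrative assumptions rather than the pipeline's own API.

```python
from itertools import product

def combine(m1, m2, theta, strategy="dempster"):
    """Combine two mass functions whose keys are frozenset focal elements."""
    joint = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            joint[inter] = joint.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # K: mass assigned to contradictory pairs
    if strategy == "dempster":
        if conflict >= 1.0:
            raise ValueError("total conflict: Dempster's rule is undefined")
        # Renormalize the surviving mass by (1 - K).
        fused = {s: w / (1.0 - conflict) for s, w in joint.items()}
    elif strategy == "yager":
        # Keep the surviving mass as-is and move K onto the full frame.
        fused = dict(joint)
        fused[theta] = fused.get(theta, 0.0) + conflict
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return fused, conflict

def belief(m, hypothesis):
    """Bel(A): total mass of focal elements contained in A."""
    return sum(w for s, w in m.items() if s <= hypothesis)

def plausibility(m, hypothesis):
    """Pl(A): total mass of focal elements that intersect A."""
    return sum(w for s, w in m.items() if s & hypothesis)

# Hypothetical toy frame and two disagreeing evidence sources.
EMAIL = frozenset({"EMAIL"})
PHONE = frozenset({"PHONE"})
THETA = frozenset({"EMAIL", "PHONE", "NAME"})
llm = {EMAIL: 0.7, THETA: 0.3}
svm = {PHONE: 0.6, THETA: 0.4}

dempster, k = combine(llm, svm, THETA, "dempster")
yager, _ = combine(llm, svm, THETA, "yager")
gap_dempster = plausibility(dempster, EMAIL) - belief(dempster, EMAIL)
gap_yager = plausibility(yager, EMAIL) - belief(yager, EMAIL)
print(f"K={k:.2f} gap(dempster)={gap_dempster:.3f} gap(yager)={gap_yager:.3f}")
```

With these inputs K is 0.42. Dempster rescales the remaining 0.58 of mass to sum to 1, while Yager leaves it untouched and raises the mass on Θ from 0.12 to 0.54, which widens the belief-gap Pl − Bel for the same evidence.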
Iteration loop: LLM_SWEEP ⇄ VALIDATING is the algorithm
The single most important thing to internalize when reviewing the
pipeline: LLM_SWEEP and VALIDATING are not two separate one-shot
phases — they form an iteration loop, and the loop is the convergence
algorithm.
Each cycle:
- LLM_SWEEP labels (or re-labels) the directly-targeted column set on iteration 1, the disagreement set on iterations 2…N.
- VALIDATING runs ML re-validation, computes per-column belief, plausibility, and conflict under the parent-aware DST frame, and identifies the next disagreement set.
- The loop exits when one of the convergence criteria is satisfied; otherwise it re-enters LLM_SWEEP.
The Workflows page draws this as a purple dashed back-edge from
VALIDATING to LLM_SWEEP precisely because the geometry teaches the
algorithm: this is bootstrapping with active-learning revisit, not a
linear pipeline.
The driver of the loop is configurable:
- Programmatic (default): pipeline._llm_revisit picks revisit candidates from _identify_disagreements and _identify_uncertain_columns.
- Agent-driven (capability flag): the Agent Convergence skill replaces the programmatic driver. Claude chooses revisit candidates and decides when to declare convergence via tool calls.
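The shape of the loop can be sketched with the three responsibilities injected as callables, so either driver could in principle plug in. This is an illustrative skeleton only: run_bootstrap_loop and the toy callables are hypothetical names, and the toy stopping rule stands in for the real convergence criteria.

```python
from typing import Callable

def run_bootstrap_loop(
    columns: list[str],
    sweep: Callable[[list[str], int], None],
    validate: Callable[[int], list[str]],
    should_stop: Callable[[int], bool],
) -> int:
    """Alternate sweep and validate until should_stop fires; return iterations run."""
    targets = list(columns)  # iteration 1 targets every column
    iteration = 0
    while True:
        iteration += 1
        sweep(targets, iteration)            # LLM_SWEEP: label or re-label targets
        disagreements = validate(iteration)  # VALIDATING: ML cross-check
        if should_stop(iteration):
            return iteration                 # leave the loop for CLASSIFYING
        targets = disagreements              # next pass revisits only these

# Toy stand-ins: the disagreement set shrinks on every pass.
sweeps: list[tuple[int, list[str]]] = []
pending = {1: ["b", "c"], 2: ["c"], 3: []}

def toy_sweep(targets: list[str], iteration: int) -> None:
    sweeps.append((iteration, list(targets)))

def toy_validate(iteration: int) -> list[str]:
    return pending[iteration]

def toy_should_stop(iteration: int) -> bool:
    # Placeholder rule: stop when nothing is left to revisit.
    return not pending[iteration]

iterations_run = run_bootstrap_loop(["a", "b", "c"], toy_sweep, toy_validate, toy_should_stop)
print(iterations_run, sweeps)
```

In the toy run the first sweep covers all three columns, and each later sweep covers only what the previous validation pass flagged.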
Convergence criteria
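A run reaches CONVERGED when any of the first three criteria below holds
at the end of an iteration in the LLM_SWEEP ⇄ VALIDATING loop. The fourth
row, the minimum-iteration floor, is a precondition rather than an exit
path: no criterion can fire until it has been met.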
| Criterion | Config key | Default | Notes |
|---|---|---|---|
| Mean belief-gap below threshold | classify.bootstrap.gap_threshold | 0.05 | The primary signal — converging on mean(Pl − Bel), not on K. Locked in by commit bd7de2c after the parent-aware DST frame audit. |
| Coverage met + K acceptable | classify.bootstrap.coverage_floor, classify.bootstrap.k_threshold | 0.95, 0.40 | Backstop — a corpus that LLM-labels everything cleanly on the first sweep doesn’t need additional iterations. |
| Iteration cap reached | classify.bootstrap.max_iterations | 4 | Fallback. The convergence reason is recorded as max_iterations_reached so the UI can show an honest “ran the full budget” rather than claiming gap convergence the run didn’t actually achieve. |
| Min iterations honored | classify.bootstrap.min_iterations | 2 | Forces at least N revisit cycles before any convergence path can fire. Defends against a single-pass LLM that happens to land cleanly without the ML cross-check having run. |
Conflict K (Dempster’s rule’s normalization mass) is diagnostic, not
the gating signal. Earlier iterations of the design framed K as the
convergence headline; that framing was retired in commit bd7de2c and
matters for any review of older docs or telemetry that still leads
with K.
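One way to read the criteria table as code is sketched below, using the defaults listed above. The comparison directions (strictly below the gap threshold, at or above the coverage floor, at or below the K threshold) and the order of checks are assumptions, as are the names BootstrapThresholds and convergence_reason and the reason labels other than max_iterations_reached.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class BootstrapThresholds:
    """Defaults copied from the criteria table; the class itself is hypothetical."""
    gap_threshold: float = 0.05
    coverage_floor: float = 0.95
    k_threshold: float = 0.40
    max_iterations: int = 4
    min_iterations: int = 2

def convergence_reason(
    iteration: int,
    mean_gap: float,
    mean_k: float,
    coverage: float,
    cfg: BootstrapThresholds = BootstrapThresholds(),
) -> Optional[str]:
    """Return a reason string if the loop should exit, else None."""
    if iteration < cfg.min_iterations:
        return None  # the floor blocks every exit path
    if mean_gap < cfg.gap_threshold:
        return "gap_converged"  # hypothetical label for the primary signal
    if coverage >= cfg.coverage_floor and mean_k <= cfg.k_threshold:
        return "coverage_met"  # hypothetical label for the backstop
    if iteration >= cfg.max_iterations:
        return "max_iterations_reached"  # label named in the table
    return None

print(convergence_reason(iteration=1, mean_gap=0.01, mean_k=0.10, coverage=1.00))
print(convergence_reason(iteration=2, mean_gap=0.01, mean_k=0.10, coverage=1.00))
print(convergence_reason(iteration=4, mean_gap=0.20, mean_k=0.50, coverage=0.50))
```

Note that K appears only inside the backstop check, never on its own, which matches its role as a diagnostic rather than the gating signal.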
ColumnSample canonical form
The pipeline’s data-model invariant: ColumnSample.name is always the
bare column identifier, table-relative and free of any
f"{table_name}." prefix. Cross-table identity uses the
qualified_name property (f"{table_name}.{name}") for dict keying.
This invariant is enforced in __post_init__ and validated at every
source boundary:
- Hive sampler (sampler._strip_table_qualifier): strips the f"{table_name}." qualifier that Hive’s JDBC driver returns from SELECT * FROM db.table.
- Meta-tagging CSV (meta_tagging_source): strips the same prefix from CSV headers that encode the table name.
- Synth, OOTB sample, fixtures: produce bare names by construction.
Any new source path that produces qualified names will trip the
__post_init__ invariant at construction time with a clear
diagnostic, rather than letting them silently propagate into the
embedding text — where a repeated table-name prefix in the column
name and the sibling list would drown the actual column signal in
table-theme noise and produce table-wide misclassification.
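A minimal sketch of how such an invariant can be enforced at construction time follows. ColumnSampleSketch is a simplified, hypothetical stand-in with only two fields, and strip_table_qualifier is an illustrative helper, not the sampler's actual function; the real dataclass carries more fields and may check more than the prefix.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnSampleSketch:
    """Two-field stand-in for the real ColumnSample (assumed shape)."""
    table_name: str
    name: str

    def __post_init__(self) -> None:
        prefix = f"{self.table_name}."
        if self.name.startswith(prefix):
            raise ValueError(
                f"name {self.name!r} must be bare; strip the {prefix!r} prefix "
                "at the source boundary"
            )

    @property
    def qualified_name(self) -> str:
        # Cross-table identity, suitable as a dict key.
        return f"{self.table_name}.{self.name}"

def strip_table_qualifier(table_name: str, raw_name: str) -> str:
    """Drop a leading 'table.' qualifier if present (Python 3.9+)."""
    return raw_name.removeprefix(f"{table_name}.")

col = ColumnSampleSketch("customers", strip_table_qualifier("customers", "customers.email"))
print(col.name, col.qualified_name)

try:
    ColumnSampleSketch("customers", "customers.email")
    rejected = False
except ValueError:
    rejected = True
print("qualified name rejected:", rejected)
```

Failing in __post_init__ means a new source that forgets to strip the qualifier breaks loudly at the first constructed sample instead of degrading classification quality later.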
Optional capability skills
Three skills attach to specific phases and are gated by the
corresponding capability flag. Each renders only when its flag is
enabled in /api/status config.
| Skill | Attaches to | Behavior | Capability flag |
|---|---|---|---|
| Agent Convergence | VALIDATING | Claude drives the convergence loop directly via the agent_loop tool surface — picks revisit candidates from belief/conflict signals and declares convergence when satisfied — replacing the programmatic loop driver. Bounded by max_turns. | classify_agent_enabled |
| Cautious Review | FUSING | Per-column LLM review that backs off over-specified leaf predictions to a parent code where belief crosses the commit threshold. Defends against false-precision claims on opaque or ambiguous columns. | cautious_review_enabled |
| Overwatch | EVALUATING | Single-turn Opus analysis writes overwatch.md with pipeline-tuning recommendations after the run lands. Requires direct Anthropic API (not Bedrock) — Bedrock lags Opus releases. | overwatch_enabled |
The three skills are visible as orange dashed nodes in the Workflows
page when their flags are enabled, attached to their host phases.
This is the registry-MVP shape — when we hit roughly six surfaces it
graduates to a backend /api/skills endpoint reading from a real
registry rather than a hand-coded list.
Phase ↔ artifact map
For an end-to-end review of any single run, here’s what’s recoverable
from build/results/{run_id}/, indexed by the phase that produced it:
| Phase | Artifact | What’s in it |
|---|---|---|
| Run start | settings_snapshot.json | The config that drove the run — source_id, all overlay values at start, default values, the resolved settings the pipeline actually used. |
| LLM_SWEEP ⇄ VALIDATING | column_trajectories.json | Per-column history across iterations: label changes, ML predictions, belief/plausibility/conflict trajectory, the revisited flag per iteration. |
| LLM_SWEEP ⇄ VALIDATING | catboost_fit_to_llm.cbm + .classes.json | CatBoost fit to the in-loop LLM labels. Persisted for Extend runs. |
| LLM_SWEEP ⇄ VALIDATING | svm_frontier.pkl + .classes.json | Synth-trained SVM with the in-run LLM-mediated alignment. Persisted for Extend runs. (Filename retained for backward compatibility; underlying model is the synth-trained SVM, not the excised M9 in-loop retrain.) |
| CLASSIFYING + FUSING | classifications.json | The per-column output: predicted code, belief, plausibility, conflict, full evidence-source mass distributions, belief path, llm/ML/cautious codes. The headline corpus result. |
| FUSING | cautious_review.json | Cautious Review skill audit (only when enabled). |
| EVALUATING | evaluation_report.json | Corpus-level metrics: accuracy, per-category precision/recall, K distribution, gap distribution, non-PII residual count. |
| EVALUATING | taxonomy_findings.json | Notes flagged during taxonomy traversal: orphaned codes, suspicious aliases, near-duplicate labels. |
| EVALUATING | atelier_embeddings.parquet + umap.pkl | Input for the Embeddings page (UMAP projection of the per-column embedding vectors with predicted-code colorings). |
| EVALUATING | sage_importance.json + shap_summary.json | GPU-accelerated global feature importance + per-column SHAP attributions (when enabled). |
| EVALUATING (post) | overwatch.md | Overwatch skill output (only when enabled). |
| Run end | register_error.json (rename to .resolved after sync) | If DB registration failed mid-run, the sync path on restart picks the run up from this sidecar. |
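When reviewing a run directory it can help to see which of these artifacts are actually present and which phase produced each. The sketch below is one way to do that; the filenames and phase labels are copied from the table, while artifacts_by_phase and the temporary demo directory are illustrative assumptions. The .classes.json and umap.pkl companions are left out because the table does not spell out their full names.

```python
import tempfile
from pathlib import Path

# Artifact filename -> producing phase, as listed in the table above.
ARTIFACT_PHASE = {
    "settings_snapshot.json": "Run start",
    "column_trajectories.json": "LLM_SWEEP ⇄ VALIDATING",
    "catboost_fit_to_llm.cbm": "LLM_SWEEP ⇄ VALIDATING",
    "svm_frontier.pkl": "LLM_SWEEP ⇄ VALIDATING",
    "classifications.json": "CLASSIFYING + FUSING",
    "cautious_review.json": "FUSING",
    "evaluation_report.json": "EVALUATING",
    "taxonomy_findings.json": "EVALUATING",
    "atelier_embeddings.parquet": "EVALUATING",
    "sage_importance.json": "EVALUATING",
    "shap_summary.json": "EVALUATING",
    "overwatch.md": "EVALUATING (post)",
    "register_error.json": "Run end",
}

def artifacts_by_phase(run_dir: Path) -> dict[str, list[str]]:
    """Group the known artifacts that exist in run_dir by producing phase."""
    found: dict[str, list[str]] = {}
    for filename, phase in ARTIFACT_PHASE.items():
        if (run_dir / filename).is_file():
            found.setdefault(phase, []).append(filename)
    return found

# Demo against a throwaway directory standing in for build/results/{run_id}/.
with tempfile.TemporaryDirectory() as tmp:
    demo_dir = Path(tmp)
    (demo_dir / "classifications.json").write_text("[]")
    (demo_dir / "overwatch.md").write_text("# notes")
    demo = artifacts_by_phase(demo_dir)
print(demo)
```

A run that stopped in ERROR would typically show artifacts only for the phases it completed, which narrows down where to start reading logs.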
Where to look in code
| Concern | File |
|---|---|
| FSM state enum, transitions, FSMRun dataclass | src/atelier/classify/fsm.py |
| Phase advancement (every fsm.advance(...) call) | src/atelier/classify/pipeline.py |
| Iteration loop driver (programmatic) | src/atelier/classify/bootstrap.py (_llm_sweep, _llm_revisit, _identify_disagreements, _run_ml_validation) |
| Iteration loop driver (agent) | src/atelier/classify/agent_loop.py |
| Per-column DST fusion | src/atelier/classify/pipeline.py (_classify_column) |
| Convergence criteria evaluation | src/atelier/classify/bootstrap.py (_mean_gap, _mean_k, _coverage, should_stop_early) |
| Cautious Review skill | src/atelier/classify/cautious_review.py |
| Overwatch skill | src/atelier/overwatch/agent.py |
| Workflows page topology (UI) | ui/src/lib/fsmPipelineLayout.ts |
State transition reference
Authoritative state-transition table (from
fsm.py:_TRANSITIONS):
| From | Legal next states |
|---|---|
| IDLE | LOADING_VOCAB |
| LOADING_VOCAB | DISCOVERING, ERROR |
| DISCOVERING | SAMPLING, ERROR |
| SAMPLING | CLASSIFYING, GENERATING_SYNTH, LLM_SWEEP, ERROR |
| GENERATING_SYNTH | TRAINING, ERROR |
| TRAINING | CLASSIFYING, ERROR |
| LLM_SWEEP | VALIDATING, CLASSIFYING, ERROR |
| VALIDATING | LLM_SWEEP, CLASSIFYING, ERROR |
| CLASSIFYING | FUSING, ERROR |
| FUSING | EVALUATING, ERROR |
| EVALUATING | CONVERGED, IDLE, ERROR |
| CONVERGED | IDLE |
| ERROR | IDLE |
SAMPLING → CLASSIFYING (skipping LLM_SWEEP) is the path used by
Extend runs, where ML-only inference is desired because the LLM has
already classified an earlier corpus and the artifacts are being
applied to a new dataset.
LLM_SWEEP → CLASSIFYING (skipping VALIDATING) is the “first-sweep
convergence” path on small corpora that don’t need iteration.
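The transition table can be mirrored as a lookup and used to check a proposed walk through the phases. This is a sketch for illustration: the Phase enum, the TRANSITIONS mapping, and validate_path are hypothetical names that copy the table above, and the real _TRANSITIONS structure in fsm.py may be shaped differently.

```python
from enum import Enum, auto

class Phase(Enum):
    IDLE = auto()
    LOADING_VOCAB = auto()
    DISCOVERING = auto()
    SAMPLING = auto()
    GENERATING_SYNTH = auto()
    TRAINING = auto()
    LLM_SWEEP = auto()
    VALIDATING = auto()
    CLASSIFYING = auto()
    FUSING = auto()
    EVALUATING = auto()
    CONVERGED = auto()
    ERROR = auto()

P = Phase
# Legal next states per phase, copied from the table above.
TRANSITIONS: dict[Phase, set[Phase]] = {
    P.IDLE: {P.LOADING_VOCAB},
    P.LOADING_VOCAB: {P.DISCOVERING, P.ERROR},
    P.DISCOVERING: {P.SAMPLING, P.ERROR},
    P.SAMPLING: {P.CLASSIFYING, P.GENERATING_SYNTH, P.LLM_SWEEP, P.ERROR},
    P.GENERATING_SYNTH: {P.TRAINING, P.ERROR},
    P.TRAINING: {P.CLASSIFYING, P.ERROR},
    P.LLM_SWEEP: {P.VALIDATING, P.CLASSIFYING, P.ERROR},
    P.VALIDATING: {P.LLM_SWEEP, P.CLASSIFYING, P.ERROR},
    P.CLASSIFYING: {P.FUSING, P.ERROR},
    P.FUSING: {P.EVALUATING, P.ERROR},
    P.EVALUATING: {P.CONVERGED, P.IDLE, P.ERROR},
    P.CONVERGED: {P.IDLE},
    P.ERROR: {P.IDLE},
}

def validate_path(path: list[Phase]) -> None:
    """Raise ValueError at the first illegal step in a sequence of phases."""
    for src, dst in zip(path, path[1:]):
        if dst not in TRANSITIONS[src]:
            raise ValueError(f"illegal transition {src.name} -> {dst.name}")

# A standard run with one revisit, and an Extend run that skips the LLM.
standard = [P.IDLE, P.LOADING_VOCAB, P.DISCOVERING, P.SAMPLING, P.LLM_SWEEP,
            P.VALIDATING, P.LLM_SWEEP, P.VALIDATING, P.CLASSIFYING, P.FUSING,
            P.EVALUATING, P.CONVERGED]
extend = [P.IDLE, P.LOADING_VOCAB, P.DISCOVERING, P.SAMPLING, P.CLASSIFYING,
          P.FUSING, P.EVALUATING, P.CONVERGED]
validate_path(standard)
validate_path(extend)
print("both paths are legal")
```

Keeping the legal edges in one mapping makes both shortcut paths described above visible as ordinary entries rather than special cases.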