Pipeline Phases (FSM Walk-Through)

A run of the classification pipeline advances through a finite state machine. Each state is a named phase with a single responsibility, and the legal transitions between phases — defined authoritatively in src/atelier/classify/fsm.py — form the workflow that operators see live in the Workflows page and that this document narrates end-to-end.

This page is the operator-facing companion to two deeper references:

Read this one first when you need to walk a reviewer through the run shape: which phase produces which artifact, where the iteration loop lives, what makes a run land in CONVERGED versus ERROR.

At a glance

                                                ┌──── revisit ────┐
                                                │                 │
                                                ▼                 │
IDLE → LOADING_VOCAB → DISCOVERING → SAMPLING → LLM_SWEEP → VALIDATING ──┐
                                                                         │
                                                                         ▼
                                                                    CLASSIFYING → FUSING → EVALUATING → CONVERGED

(any phase) → ERROR

The arrow back from VALIDATING to LLM_SWEEP is the iteration loop; it’s the heart of the algorithm and is described in Iteration loop below.

Phases in execution order

| # | State | What it does | Primary output |
|---|---|---|---|
|  | IDLE | No run in flight. Ready to dispatch the next classification. |  |
| 1 | LOADING_VOCAB | Load the user-supplied taxonomy (annotations CSV / Hive table / DB) and validate: label collisions, duplicate codes, orphaned aliases, parent-aware frame structure. | HierarchicalCategorySet, FrameOfDiscernment |
| 2 | DISCOVERING | Probe the data source via cml.data_v1 (Hive), the meta-tagging mount (CSV), or the bundled fixtures to enumerate the tables in scope. | list[str] of table names |
| 3 | SAMPLING | For each discovered table, sample column metadata: bare names, types, ~5 representative values, true COUNT(DISTINCT) bounded by the sample limit, null ratio, sibling list. Reference-key columns (attr_1_2_3_* answer-key shape) are filtered out so they don’t trivially leak into evaluation. | list[ColumnSample] (canonical bare names; see ColumnSample invariant) |
| 4 | LLM_SWEEP | Claude classifies each directly-targeted column into the user vocabulary. Iteration 1 sweeps every column (or the Monte Carlo sampled subset, a stratified slice for large corpora; the remaining columns get label propagation later). Iterations 2…N revisit only the columns flagged for re-look in the previous VALIDATING pass. | state.labels[qualified_name] → category_code, plus per-column LLM confidence |
| 5 | VALIDATING | ML re-validation: CatBoost (fit-to-LLM during the loop) and the synth-trained SVM (translated through the LLM-mediated ICE→user-vocab alignment) score the same columns independently of the LLM. Per-column DST mass with conflict K is computed under the parent-aware frame. The disagreement set (driven primarily by the belief gap Pl − Bel, with K and coverage as secondary signals) feeds the next iteration’s revisit batch. The loop exits when convergence criteria are satisfied; otherwise it re-enters LLM_SWEEP. | state.ml_prediction, state.ml_belief, state.ml_plausibility, state.ml_conflict, the next iteration’s disagreement list |
| 6 | CLASSIFYING | Final per-column DST evidence fusion. Up to six evidence sources combine: name_match, pattern, cosine, llm, catboost, svm. Each produces a mass function over the parent-aware frame; per-column predicted code, belief, plausibility, and conflict are computed here. | classifications: list[dict] (each entry shaped as classifications.json rows) |
| 7 | FUSING | Combine per-column mass functions via the configured fusion strategy. dempster normalizes conflict by (1 − K); yager redirects conflict mass to Θ (ignorance). Cautious-code review (when enabled) runs here, backing off over-specified leaf predictions whose belief sits below the commit threshold to a parent code whose belief clears it. | Headline classification per column; cautious_review.json (when enabled) |
| 8 | EVALUATING | Compute corpus-level metrics: accuracy vs reference (when present), per-category precision/recall, K distribution, gap distribution, non-PII residual count. Persist artifacts to disk and emit the parquet for the Embeddings page. Overwatch (when enabled) runs at the tail of this phase. | evaluation_report.json, column_trajectories.json, taxonomy_findings.json, atelier_embeddings.parquet, ML artifacts, overwatch.md (when enabled) |
|  | CONVERGED | Terminal success. Convergence criteria satisfied; results are on disk under build/results/{run_id}/ and registered in the DB via the run-end registration path (or recovered later by atelier.db.sync.sync_filesystem_to_db on restart). |  |
|  | ERROR | Terminal failure. The FSM error field carries the diagnostic; pipeline logs and register_error.json (if any) carry the rest. |  |

The FSM defines two states that the standard inference run does not visit — GENERATING_SYNTH and TRAINING. These belong to the offline synth-corpus generation + SVM-training flow that produces the bundled SVM artifact (legacy filename svm_frontier.pkl retained on disk for backward compatibility with older run directories), and are reachable from SAMPLING only on the explicit synth-generate code path.
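The FUSING row above names two combination rules. The following is a minimal sketch of how they differ, not the pipeline's own code: it assumes a flat three-code frame (the real frame is parent-aware), combines only two evidence sources, and uses hypothetical names throughout (`combine`, `belief`, `plausibility`, and the `EMAIL`/`PHONE`/`NAME` codes).

```python
from itertools import product


def combine(m1, m2, theta, rule="dempster"):
    """Combine two mass functions (dict: frozenset of codes -> mass).

    Returns (combined mass function, conflict K).
    """
    fused = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        overlap = a & b
        if overlap:
            fused[overlap] = fused.get(overlap, 0.0) + wa * wb
        else:
            # Disjoint focal sets: their product mass is the conflict K.
            conflict += wa * wb
    if rule == "dempster":
        if conflict >= 1.0:
            raise ValueError("total conflict: Dempster's rule is undefined")
        # Dempster: renormalize the surviving mass by (1 - K).
        fused = {s: w / (1.0 - conflict) for s, w in fused.items()}
    elif rule == "yager":
        # Yager: move the conflict mass onto the whole frame (ignorance).
        fused[theta] = fused.get(theta, 0.0) + conflict
    else:
        raise ValueError(f"unknown rule: {rule!r}")
    return fused, conflict


def belief(mass, hypothesis):
    """Bel(A): total mass of focal sets contained in A."""
    return sum(w for s, w in mass.items() if s <= hypothesis)


def plausibility(mass, hypothesis):
    """Pl(A): total mass of focal sets that intersect A."""
    return sum(w for s, w in mass.items() if s & hypothesis)


# Toy evidence (hypothetical numbers): the LLM favors EMAIL, the SVM is split.
theta = frozenset({"EMAIL", "PHONE", "NAME"})
email, phone = frozenset({"EMAIL"}), frozenset({"PHONE"})
m_llm = {email: 0.7, theta: 0.3}
m_svm = {email: 0.4, phone: 0.4, theta: 0.2}

dempster, k = combine(m_llm, m_svm, theta, rule="dempster")
yager, _ = combine(m_llm, m_svm, theta, rule="yager")
```

In this toy case K is 0.28. Under Dempster's rule Bel(EMAIL) is 0.75; under Yager's rule it stays at 0.54 and the 0.28 of conflict lands on Θ, which widens the belief gap Pl − Bel for every code.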

Iteration loop: LLM_SWEEP ⇄ VALIDATING is the algorithm

The single most important thing to internalize when reviewing the pipeline: LLM_SWEEP and VALIDATING are not two separate one-shot phases — they form an iteration loop, and the loop is the convergence algorithm.

Each cycle:

  1. LLM_SWEEP labels (or re-labels) the directly-targeted column set on iteration 1, the disagreement set on iterations 2…N.
  2. VALIDATING runs ML re-validation, computes per-column belief, plausibility, and conflict under the parent-aware DST frame, and identifies the next disagreement set.
  3. The loop exits when one of the convergence criteria is satisfied; otherwise it re-enters LLM_SWEEP.

The Workflows page draws this as a purple dashed back-edge from VALIDATING to LLM_SWEEP precisely because the geometry teaches the algorithm: this is bootstrapping with active-learning revisit, not a linear pipeline.

The driver of the loop is configurable:

  • Programmatic (default): pipeline._llm_revisit picks revisit candidates from _identify_disagreements and _identify_uncertain_columns.
  • Agent-driven (capability flag): the Agent Convergence skill replaces the programmatic driver — Claude chooses revisit candidates and decides when to declare convergence via tool calls.
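The cycle described above can be sketched as a plain loop. This is an illustration under stated assumptions, not the code in bootstrap.py: `run_bootstrap_loop`, the three callables it takes, and the `revisit_gap` cut-off are all hypothetical stand-ins for the real driver and its helpers.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Score = Tuple[float, float, float]  # (belief, plausibility, conflict K)


def run_bootstrap_loop(
    columns: Iterable[str],
    llm_label: Callable[[str], str],
    ml_validate: Callable[[Dict[str, str]], Dict[str, Score]],
    stop_reason: Callable[[int, Dict[str, Score]], Optional[str]],
    max_iterations: int = 4,
    revisit_gap: float = 0.05,  # hypothetical per-column cut-off
) -> Tuple[Dict[str, str], int, str]:
    """Alternate LLM labeling and ML validation until a stop reason fires."""
    labels: Dict[str, str] = {}
    targets: List[str] = list(columns)  # iteration 1 sweeps everything
    for iteration in range(1, max_iterations + 1):
        # LLM_SWEEP: label (or re-label) only the current target set.
        for column in targets:
            labels[column] = llm_label(column)
        # VALIDATING: independent ML scores for every labeled column.
        scores = ml_validate(labels)
        reason = stop_reason(iteration, scores)
        if reason is not None:
            return labels, iteration, reason
        # Next revisit batch: columns whose belief gap Pl - Bel is still wide.
        targets = [
            col for col, (bel, pl, _k) in scores.items()
            if pl - bel > revisit_gap
        ]
    return labels, max_iterations, "max_iterations_reached"
```

The point of the shape is that only the first pass touches every column; later passes spend LLM calls solely on the disagreement set.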

Convergence criteria

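A run leaves the LLM_SWEEP ⇄ VALIDATING loop, and goes on toward CONVERGED, when any of the first three criteria below holds at the end of an iteration. The fourth row, the minimum-iteration floor, is not a stopping condition in its own right: it gates the other three.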

| Criterion | Config key | Default | Notes |
|---|---|---|---|
| Mean belief-gap below threshold | classify.bootstrap.gap_threshold | 0.05 | The primary signal: convergence on mean(Pl − Bel), not on K. Locked in by commit bd7de2c after the parent-aware DST frame audit. |
| Coverage met + K acceptable | classify.bootstrap.coverage_floor, classify.bootstrap.k_threshold | 0.95, 0.40 | Backstop: a corpus that LLM-labels everything cleanly on the first sweep doesn’t need additional iterations. |
| Iteration cap reached | classify.bootstrap.max_iterations | 4 | Fallback. The convergence reason is recorded as max_iterations_reached so the UI can show an honest “ran the full budget” rather than claiming gap convergence the run didn’t actually achieve. |
| Min iterations honored | classify.bootstrap.min_iterations | 2 | Forces at least N revisit cycles before any convergence path can fire. Defends against a single-pass LLM that happens to land cleanly without the ML cross-check having run. |

Conflict K (Dempster’s rule’s normalization mass) is diagnostic, not the gating signal. Earlier iterations of the design framed K as the convergence headline; that framing was retired in commit bd7de2c and matters for any review of older docs or telemetry that still leads with K.
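One way to read the criteria table is as a single ordered check. The sketch below uses the documented defaults, but it is not the real `should_stop_early`: the function name, the exact comparison operators, the use of a mean for K, and every reason string except `max_iterations_reached` are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BootstrapConfig:
    """Defaults as listed under classify.bootstrap.* in the table above."""
    gap_threshold: float = 0.05
    coverage_floor: float = 0.95
    k_threshold: float = 0.40
    max_iterations: int = 4
    min_iterations: int = 2


def convergence_reason(
    iteration: int,
    mean_gap: float,
    mean_k: float,
    coverage: float,
    cfg: BootstrapConfig = BootstrapConfig(),
) -> Optional[str]:
    """Return why the loop should stop, or None to keep iterating."""
    # Floor first: nothing may fire before min_iterations cycles have run.
    if iteration < cfg.min_iterations:
        return None
    # Primary signal: mean(Pl - Bel) below the gap threshold.
    if mean_gap < cfg.gap_threshold:
        return "gap_converged"  # hypothetical reason string
    # Backstop: enough columns labeled and conflict K within bounds.
    if coverage >= cfg.coverage_floor and mean_k <= cfg.k_threshold:
        return "coverage_met"  # hypothetical reason string
    # Fallback: budget exhausted; report it honestly.
    if iteration >= cfg.max_iterations:
        return "max_iterations_reached"
    return None
```

Checking the floor before anything else is what keeps a clean first sweep from short-circuiting the ML cross-check, and checking the cap last means a run that genuinely converges on its final iteration is not mislabeled as having run out of budget.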

ColumnSample canonical form

The pipeline’s data-model invariant: ColumnSample.name is always the bare column identifier — table-relative, free of any f"{table_name}." prefix. Cross-table identity uses the qualified_name property (f"{table_name}.{name}") for dict keying.

This invariant is enforced in __post_init__ and validated at every source boundary:

  • Hive sampler (sampler._strip_table_qualifier) — strips the f"{table_name}." qualifier that Hive’s JDBC driver returns from SELECT * FROM db.table.
  • Meta-tagging CSV (meta_tagging_source) — strips the same prefix from CSV headers that encode the table name.
  • Synth, OOTB sample, fixtures — produce bare names by construction.

Any new source path that produces qualified names will trip the __post_init__ invariant at construction time with a clear diagnostic, rather than letting them silently propagate into the embedding text — where a repeated table-name prefix in the column name and the sibling list would drown the actual column signal in table-theme noise and produce table-wide misclassification.
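A minimal sketch of the invariant, assuming a cut-down `ColumnSample` with only the two fields that matter here (the real dataclass carries types, sample values, and more). `strip_table_qualifier` is a hypothetical stand-in for the source-boundary helpers named above.

```python
from dataclasses import dataclass


def strip_table_qualifier(table_name: str, column: str) -> str:
    """Drop a leading "<table_name>." prefix if the source attached one."""
    prefix = f"{table_name}."
    return column[len(prefix):] if column.startswith(prefix) else column


@dataclass
class ColumnSample:
    table_name: str
    name: str  # must be the bare, table-relative column identifier

    def __post_init__(self) -> None:
        # Fail at construction time rather than letting a qualified name
        # leak into embedding text downstream.
        if self.name.startswith(f"{self.table_name}."):
            raise ValueError(
                f"ColumnSample.name must be bare, got qualified name "
                f"{self.name!r} for table {self.table_name!r}"
            )

    @property
    def qualified_name(self) -> str:
        """Cross-table identity, suitable for dict keys."""
        return f"{self.table_name}.{self.name}"
```

A source adapter would call `strip_table_qualifier("users", "users.email")` before constructing the sample, so the object holds `email` and exposes `users.email` only through `qualified_name`.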

Optional capability skills

Three skills attach to specific phases and are gated by the corresponding capability flag. Each renders only when its flag is enabled in /api/status config.

| Skill | Attaches to | Behavior | Capability flag |
|---|---|---|---|
| Agent Convergence | VALIDATING | Claude drives the convergence loop directly via the agent_loop tool surface, picking revisit candidates from belief/conflict signals and declaring convergence when satisfied, in place of the programmatic loop driver. Bounded by max_turns. | classify_agent_enabled |
| Cautious Review | FUSING | Per-column LLM review that backs off over-specified leaf predictions to a parent code where belief crosses the commit threshold. Defends against false-precision claims on opaque or ambiguous columns. | cautious_review_enabled |
| Overwatch | EVALUATING | Single-turn Opus analysis writes overwatch.md with pipeline-tuning recommendations after the run lands. Requires the direct Anthropic API (not Bedrock), because Bedrock lags Opus releases. | overwatch_enabled |

The three skills are visible as orange dashed nodes in the Workflows page when their flags are enabled, attached to their host phases. This is the registry-MVP shape — when we hit roughly six surfaces it graduates to a backend /api/skills endpoint reading from a real registry rather than a hand-coded list.
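The back-off rule behind Cautious Review can be illustrated deterministically. The actual skill is an LLM review, so the sketch below only shows the "climb to the first ancestor whose belief clears the commit threshold" idea; `back_off`, the example codes, the belief values, and the threshold of 0.6 are all hypothetical.

```python
from typing import Dict, Optional


def back_off(
    code: str,
    belief_of: Dict[str, float],
    parent_of: Dict[str, Optional[str]],
    commit_threshold: float,
) -> str:
    """Walk up the taxonomy until belief clears the commit threshold.

    If no ancestor clears it, the top-most ancestor reached is returned
    (an assumption; the real skill may behave differently).
    """
    current = code
    while belief_of.get(current, 0.0) < commit_threshold:
        parent = parent_of.get(current)
        if parent is None:
            break  # reached the root without clearing the threshold
        current = parent
    return current


# Hypothetical three-level taxonomy and per-code belief for one column.
parent_of = {
    "CONTACT.EMAIL.WORK": "CONTACT.EMAIL",
    "CONTACT.EMAIL": "CONTACT",
    "CONTACT": None,
}
belief_of = {"CONTACT.EMAIL.WORK": 0.40, "CONTACT.EMAIL": 0.72, "CONTACT": 0.95}

committed = back_off("CONTACT.EMAIL.WORK", belief_of, parent_of, 0.6)
```

Here the leaf sits at 0.40, below the threshold, so the prediction is committed one level up at `CONTACT.EMAIL`.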

Phase ↔ artifact map

For an end-to-end review of any single run, here’s what’s recoverable from build/results/{run_id}/, indexed by the phase that produced it:

| Phase | Artifact | What’s in it |
|---|---|---|
| Run start | settings_snapshot.json | The config that drove the run: source_id, all overlay values at start, default values, and the resolved settings the pipeline actually used. |
| LLM_SWEEP ⇄ VALIDATING | column_trajectories.json | Per-column history across iterations: label changes, ML predictions, belief/plausibility/conflict trajectory, the revisited flag per iteration. |
| LLM_SWEEP ⇄ VALIDATING | catboost_fit_to_llm.cbm + .classes.json | CatBoost fit to the in-loop LLM labels. Persisted for Extend runs. |
| LLM_SWEEP ⇄ VALIDATING | svm_frontier.pkl + .classes.json | Synth-trained SVM with the in-run LLM-mediated alignment. Persisted for Extend runs. (Filename retained for backward compatibility; underlying model is the synth-trained SVM, not the excised M9 in-loop retrain.) |
| CLASSIFYING + FUSING | classifications.json | The per-column output: predicted code, belief, plausibility, conflict, full evidence-source mass distributions, belief path, llm/ML/cautious codes. The headline corpus result. |
| FUSING | cautious_review.json | Cautious Review skill audit (only when enabled). |
| EVALUATING | evaluation_report.json | Corpus-level metrics: accuracy, per-category precision/recall, K distribution, gap distribution, non-PII residual count. |
| EVALUATING | taxonomy_findings.json | Notes flagged during taxonomy traversal: orphaned codes, suspicious aliases, near-duplicate labels. |
| EVALUATING | atelier_embeddings.parquet + umap.pkl | Input for the Embeddings page (UMAP projection of the per-column embedding vectors with predicted-code colorings). |
| EVALUATING | sage_importance.json + shap_summary.json | GPU-accelerated global feature importance + per-column SHAP attributions (when enabled). |
| EVALUATING (post) | overwatch.md | Overwatch skill output (only when enabled). |
| Run end | register_error.json (renamed to .resolved after sync) | If DB registration failed mid-run, the sync path on restart picks the run up from this sidecar. |
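When reviewing a run directory, a quick inventory of which artifacts are actually present can save time. The helper below is a hypothetical convenience, not part of the codebase; it uses only the filenames listed in the table (the `.classes.json` sidecars are left out because their exact names are not spelled out here).

```python
from pathlib import Path
from typing import Dict

# Filenames taken from the phase/artifact table; grouping keys are labels only.
ARTIFACTS_BY_PHASE = {
    "Run start": ["settings_snapshot.json"],
    "LLM_SWEEP/VALIDATING": [
        "column_trajectories.json",
        "catboost_fit_to_llm.cbm",
        "svm_frontier.pkl",
    ],
    "CLASSIFYING + FUSING": ["classifications.json"],
    "FUSING": ["cautious_review.json"],  # only when the skill is enabled
    "EVALUATING": [
        "evaluation_report.json",
        "taxonomy_findings.json",
        "atelier_embeddings.parquet",
        "umap.pkl",
        "sage_importance.json",
        "shap_summary.json",
        "overwatch.md",  # only when the skill is enabled
    ],
    "Run end": ["register_error.json"],  # present only if registration failed
}


def inventory(run_dir: Path) -> Dict[str, Dict[str, bool]]:
    """Report, per phase, whether each expected artifact exists in run_dir."""
    return {
        phase: {name: (run_dir / name).is_file() for name in names}
        for phase, names in ARTIFACTS_BY_PHASE.items()
    }
```

Calling `inventory(Path("build/results") / run_id)` gives a nested dict of booleans. A missing optional artifact usually just means its capability flag was off, while a present register_error.json means DB registration still needs the sync path.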

Where to look in code

| Concern | File |
|---|---|
| FSM state enum, transitions, FSMRun dataclass | src/atelier/classify/fsm.py |
| Phase advancement (every fsm.advance(...) call) | src/atelier/classify/pipeline.py |
| Iteration loop driver (programmatic) | src/atelier/classify/bootstrap.py (_llm_sweep, _llm_revisit, _identify_disagreements, _run_ml_validation) |
| Iteration loop driver (agent) | src/atelier/classify/agent_loop.py |
| Per-column DST fusion | src/atelier/classify/pipeline.py (_classify_column) |
| Convergence criteria evaluation | src/atelier/classify/bootstrap.py (_mean_gap, _mean_k, _coverage, should_stop_early) |
| Cautious Review skill | src/atelier/classify/cautious_review.py |
| Overwatch skill | src/atelier/overwatch/agent.py |
| Workflows page topology (UI) | ui/src/lib/fsmPipelineLayout.ts |

State transition reference

Authoritative state-transition table (from fsm.py:_TRANSITIONS):

| From | Legal next states |
|---|---|
| IDLE | LOADING_VOCAB |
| LOADING_VOCAB | DISCOVERING, ERROR |
| DISCOVERING | SAMPLING, ERROR |
| SAMPLING | CLASSIFYING, GENERATING_SYNTH, LLM_SWEEP, ERROR |
| GENERATING_SYNTH | TRAINING, ERROR |
| TRAINING | CLASSIFYING, ERROR |
| LLM_SWEEP | VALIDATING, CLASSIFYING, ERROR |
| VALIDATING | LLM_SWEEP, CLASSIFYING, ERROR |
| CLASSIFYING | FUSING, ERROR |
| FUSING | EVALUATING, ERROR |
| EVALUATING | CONVERGED, IDLE, ERROR |
| CONVERGED | IDLE |
| ERROR | IDLE |

SAMPLING → CLASSIFYING (skipping LLM_SWEEP) is the path used by Extend runs, where ML-only inference is desired because the LLM has already classified an earlier corpus and the artifacts are being applied to a new dataset. LLM_SWEEP → CLASSIFYING (skipping VALIDATING) is the “first-sweep convergence” path on small corpora that don’t need iteration.
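The transition table maps directly onto an enum plus a lookup of legal successors. The sketch below mirrors the table row for row but is not the contents of fsm.py: the `State` and `advance` names and the `ValueError` on an illegal move are assumptions for illustration.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    LOADING_VOCAB = auto()
    DISCOVERING = auto()
    SAMPLING = auto()
    GENERATING_SYNTH = auto()
    TRAINING = auto()
    LLM_SWEEP = auto()
    VALIDATING = auto()
    CLASSIFYING = auto()
    FUSING = auto()
    EVALUATING = auto()
    CONVERGED = auto()
    ERROR = auto()


S = State
# One entry per row of the state-transition table above.
TRANSITIONS = {
    S.IDLE: {S.LOADING_VOCAB},
    S.LOADING_VOCAB: {S.DISCOVERING, S.ERROR},
    S.DISCOVERING: {S.SAMPLING, S.ERROR},
    S.SAMPLING: {S.CLASSIFYING, S.GENERATING_SYNTH, S.LLM_SWEEP, S.ERROR},
    S.GENERATING_SYNTH: {S.TRAINING, S.ERROR},
    S.TRAINING: {S.CLASSIFYING, S.ERROR},
    S.LLM_SWEEP: {S.VALIDATING, S.CLASSIFYING, S.ERROR},
    S.VALIDATING: {S.LLM_SWEEP, S.CLASSIFYING, S.ERROR},
    S.CLASSIFYING: {S.FUSING, S.ERROR},
    S.FUSING: {S.EVALUATING, S.ERROR},
    S.EVALUATING: {S.CONVERGED, S.IDLE, S.ERROR},
    S.CONVERGED: {S.IDLE},
    S.ERROR: {S.IDLE},
}


def advance(current: State, nxt: State) -> State:
    """Return nxt if the move is legal, otherwise raise ValueError."""
    if nxt not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.name} -> {nxt.name}")
    return nxt


# An Extend run: ML-only inference that skips the LLM_SWEEP/VALIDATING loop.
EXTEND_PATH = [
    S.LOADING_VOCAB, S.DISCOVERING, S.SAMPLING,
    S.CLASSIFYING, S.FUSING, S.EVALUATING, S.CONVERGED,
]
state = S.IDLE
for step in EXTEND_PATH:
    state = advance(state, step)
```

Keeping the legal moves in one table means a path such as the Extend shortcut is validated by the same check as the standard run, and an accidental jump (say IDLE straight to SAMPLING) fails loudly instead of leaving the FSM in an undefined state.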