DST Evidence Independence
This note documents how Atelier’s classification pipeline handles non-distinct evidence sources under Dempster-Shafer fusion, and why the discount calibration and revisit gate are structured the way they are. It is intended to be cited by code reviewers and academic readers.
The pipeline as iterative refinement
Atelier’s bootstrap loop is iterative refinement on a belief-assignment
vector B over columns: B_{n+1} = T(B_n), where T
composes the LLM sweep, ML validation (CatBoost + SVM), DST fusion,
and targeted revisit on disagreement. Cast in the language of
classical numerical analysis (Banach 1922; Saad 2003 §4.1,
Iterative Methods for Sparse Linear Systems), every component of
the pipeline maps onto a numerical-method primitive:
| Component | Numerical-methods primitive |
|---|---|
| Bootstrap loop | Fixed-point iteration on B |
| LLM sweep | Stochastic operator T_LLM (Robbins-Monro 1951 framing) |
| ML validation | Deterministic linearization T_ML |
| DST fusion | Combiner ⊕ producing fused state |
| Targeted revisit on disagreement | Local smoothing in multigrid (Brandt 1977) |
| Pl − Bel gap | A posteriori error estimate per column |
| Conflict K | Nonlinear residual diagnostic |
| Ontology priors | Preconditioner — conditions first-pass output |
| Reliability discount on derivative sources | Damping / step-size control |
| Hierarchical cosine mass | Coarse-grid correction (multigrid) |
| cautious_promoted_code | Projection onto coarse grid at the level where evidence is unambiguous (Smets 1993) |
| needs_clarification | Residual-exceeds-tolerance flag |
The diagnostic that ties the framework together is the
residual norm ‖r(B)‖ — a unified scalar measuring distance
from the fixed point — and the contraction factor ρ_n = ‖r_{n+1}‖ / ‖r_n‖, the headline iterative-method indicator
(Saad §4.1):
- ρ < 1: contractive; successive iterations reduce the residual.
- ρ → 1: stalled; iterations are not making progress, which warrants a strategy change (different fusion rule, different preconditioner, agent escalation).
- ρ > 1: diverging; iterations are growing the residual.
bootstrap.residual_norm and bootstrap.contraction_rate
implement the diagnostic. The unified residual is an L2
combination of four normalized components: mean(gap) /
gap_threshold, frac_unclear / clarity_target, mean(K) /
k_threshold, and frac(indep-tier disagreement at meaningful mass).
A residual_norm of 1.0 means “at convergence threshold across the
board”; values <1 are converged. Both are surfaced in
IterationMetrics and the agent loop’s iteration_history.
This framing is what makes the rest of the design (non-distinctness handling, hierarchical aggregation, ontology priors, reliability shaping) operate as a cohesive accuracy-targeting engine rather than a collection of clever heuristics. Each mechanism is a numerical-method primitive in service of driving the residual to zero.
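The residual diagnostic described above can be sketched as follows. This is a minimal illustration, not the bootstrap.residual_norm implementation: it assumes the "L2 combination" is a root-mean-square (so that 1.0 corresponds to every component sitting at its threshold), and it introduces a hypothetical disagree_target normalizer for the fourth component, which this note does not name.

```python
import math
from typing import Optional, Sequence


def residual_norm(
    gaps: Sequence[float],
    conflicts: Sequence[float],
    frac_unclear: float,
    frac_indep_disagree: float,
    gap_threshold: float,
    clarity_target: float,
    k_threshold: float,
    disagree_target: float,  # hypothetical normalizer, not named in this note
) -> float:
    """RMS of four normalized components; 1.0 means each sits at its threshold."""
    components = [
        (sum(gaps) / len(gaps)) / gap_threshold,
        frac_unclear / clarity_target,
        (sum(conflicts) / len(conflicts)) / k_threshold,
        frac_indep_disagree / disagree_target,
    ]
    return math.sqrt(sum(c * c for c in components) / len(components))


def contraction_rate(prev_norm: float, curr_norm: float) -> Optional[float]:
    """rho_n = ||r_{n+1}|| / ||r_n||; undefined when the previous residual is zero."""
    if prev_norm == 0.0:
        return None
    return curr_norm / prev_norm
```

Under the RMS assumption, a run in which all four components halve between iterations yields a contraction rate of 0.5, well inside the contractive regime.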
The non-distinctness problem
Dempster’s rule of combination assumes the bodies of evidence being combined are produced by distinct, conditionally independent sources (Shafer 1976, A Mathematical Theory of Evidence, Ch. 3 §3 and Ch. 4). Smets’ Transferable Belief Model (Smets 1990; Smets & Kennes 1994, The Transferable Belief Model) preserves this assumption at the credal level. Denoeux 2008 (Conjunctive and Disjunctive Combination of Belief Functions Induced by Non-Distinct Bodies of Evidence, Artificial Intelligence) characterizes the pathology that arises when the assumption is violated: combining two mass functions that derive from a shared evidential atom via Dempster’s rule effectively raises the contribution of that atom to a power. The conjunctive cautious rule, defined as the pointwise minimum of the conjunctive weight functions (which are computable from the commonality functions) and idempotent on identical evidence (Denoeux 2008 §4), recovers soundness, but is non-normalising and not a drop-in replacement for Dempster.
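To make the double-counting concrete, the sketch below combines one mass function with itself under Dempster's rule, as if the same evidence had arrived from two distinct sources. The two-element frame and the 0.8 mass are illustrative assumptions, and the dempster helper is a generic textbook implementation, not Atelier's combine_multiple.

```python
from itertools import product
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def dempster(m1: Mass, m2: Mass) -> Mass:
    """Dempster's rule: conjunctive combination followed by normalization."""
    combined: Mass = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass that would land on the empty set (K)
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}


theta = frozenset({"catch_all", "financial"})
m_llm: Mass = {frozenset({"catch_all"}): 0.8, theta: 0.2}

# Same evidence counted twice: ignorance is squared (0.2 -> 0.04),
# so the singleton rises from 0.8 to 0.96 with no new information.
m_twice = dempster(m_llm, m_llm)
```

An idempotent rule would return m_llm unchanged here; Dempster's rule does not, which is the pathology the rest of this note works around.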
The Atelier-specific violation
The classification pipeline in src/atelier/classify/ declares six
evidence sources:
- name_match: lexical column-name matching against the vocabulary.
- pattern: regex/validator detection (email, IBAN, monetary, …).
- cosine: semantic similarity between the curated embedding text and the user-vocabulary embedding.
- llm: Claude Opus first-pass classification.
- catboost: CatBoost classifier.
- svm: synth-trained TF-IDF + LinearSVC classifier with an ICE → user-taxonomy subsumption alignment applied at inference time (the earlier LLM-mediated alignment has been excised; see below).
The first three are genuinely independent of the LLM: their evidence arises from the column’s name, value patterns, and semantic embedding comparison against the vocabulary. The remaining sources have a mixed independence profile:
- catboost is trained in fit_to_llm mode (default true) on (embedding_text, llm_code) pairs from the current run’s LLM sweep. See ml_train.fit_catboost_to_llm_labels and pipeline._install_fit_to_llm_catboost. The fitted model is, by construction, an explainability surface over the LLM’s labels, not a competing classifier. Strongly non-distinct with the LLM source under Denoeux 2008 (per-column shared label provenance).
- svm is trained once on the synthetic corpus (scripts/generate_synth_source.py → ml_train.train_svm), with TF-IDF char-3-6gram + word-1-2gram features and labels keyed on the bundled-ontology ICE.* leaves from synth_generators.GENERATORS. At pipeline runtime, predictions are translated into the user taxonomy via subsumption-prediction alignment in classify.subsumption_alignment: sentence-transformer cosine similarity between ICE concept signatures and enriched annotation payloads from the Qdrant taxonomy collection (one alignment computation per (vocab, embedding_model) tuple, results cached on disk). Weakly non-distinct with the cosine source via shared enrichment-LLM upstream: the enriched annotations were generated offline by an LLM, but the alignment computation itself uses a structurally independent model (BERT embeddings), not the runtime autoregressive LLM. The prior LLM-mediated approach (one LLM classify_batch call per alignment, excised in the P7 subsumption-alignment intervention) was weakly non-distinct with the runtime LLM through shared model weights; the new approach eliminates that correlation. See the ontology_alignment.py module docstring for the full independence argument.
Treating LLM and CatBoost(LLM) as fully-independent sources and
combining them via Dempster’s rule double-counts the LLM atom; the
SVM evidence sits between fully independent and fully derivative.
The pre-2026-04-30 discount schedule made the legacy three-way
overlap worse: llm=0.10, catboost=0.10, svm=0.20, vs
cosine=0.30. The genuinely independent semantic source was
more discounted than the two derivative ones, mathematically
suppressing it whenever the LLM was loud.
A failure case observed during pipeline validation illustrated the
pathology in the abstract. A column whose values match the
monetary_pattern regex was classified as a generic catch-all
code rather than a financial-domain code. Cosine top-1 distributed
mass across several financial-leaning codes in the active
vocabulary, but at softmax-spread mass on the order of a few
thousandths per code it could not overcome LLM mass (≈ 0.83) and
CatBoost mass (≈ 0.81), both concentrated on the catch-all. The
fused prediction matched the LLM; the disagreement gate at
bootstrap._identify_disagreements required
llm_code != fused_code and so never fired despite K ≈ 0.81 and
a unanimous independent-source pull toward financial codes.
needs_clarification=True was emitted, but no LLM revisit
followed. Specific customer table names, column names, and codes
are intentionally not reproduced in this document.
Treatment in this codebase
The pipeline uses two complementary, scope-bounded fixes:
1. Reliability discounting on derivative sources (Shafer §11.3)
The discount operator from Shafer 1976 §11.3 multiplies a source’s
mass by reliability α = 1 - discount:
m’(A) = α · m(A); m’(Θ) = α · m(Θ) + (1 - α)
When evidence sources are non-distinct, the reliability of the derivative source with respect to the original is bounded above by 1 minus their information overlap. For sources trained directly on LLM output that overlap is near-total, so a substantial discount is the principled response under classical Dempster fusion.
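The discount operator is small enough to sketch directly. The function below is an illustrative stand-alone version (the name discount_mass and the dict-of-frozensets representation are assumptions, not the pipeline's API); it scales every focal element by α and moves the removed mass to Θ.

```python
from typing import Dict, FrozenSet

Mass = Dict[FrozenSet[str], float]


def discount_mass(m: Mass, theta: FrozenSet[str], discount: float) -> Mass:
    """Shafer discounting: m'(A) = alpha*m(A), m'(Theta) = alpha*m(Theta) + (1 - alpha)."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must lie in [0, 1]")
    alpha = 1.0 - discount
    out = {s: alpha * w for s, w in m.items() if s != theta}
    out[theta] = alpha * m.get(theta, 0.0) + (1.0 - alpha)
    return out


theta = frozenset({"catch_all", "financial"})
m_catboost: Mass = {frozenset({"catch_all"}): 0.81, theta: 0.19}

# With a 0.55 discount, the 0.81 vote shrinks to about 0.36 and
# ignorance grows to about 0.64, so the derivative source can no
# longer dominate fusion on its own.
m_discounted = discount_mass(m_catboost, theta, discount=0.55)
```

Discounting leaves the total mass at 1, so the discounted function can be passed to Dempster's rule unchanged.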
The current defaults (config/base.conf:341+) place CatBoost and
SVM above the cosine discount:
| Source | Discount | Rationale |
|---|---|---|
| cosine | 0.20 | independent of LLM; semantic prior |
| pattern | 0.25 | independent; deterministic regex evidence |
| name_match | 0.30–0.70 | independent; lexical match against vocab |
| llm | 0.15 | original; first-pass label |
| catboost | 0.55 | strongly non-distinct (fit_to_llm, per-column LLM labels) |
| svm | 0.22 | weakly non-distinct (enrichment-mediated subsumption alignment; was 0.30 under LLM-mediated, 0.55 under M9) |
| catboost_max | 0.75 | variance ceiling; maintains headroom |
Operators can dial these via the Settings page when retraining
CatBoost on labels independent of the current LLM sweep (e.g.
synth-only training); the metadata in config_overlay.SETTINGS_METADATA
exposes the full range. The SVM discount at 0.22 (slightly above
cosine’s 0.20) reflects the subsumption-prediction alignment:
structurally independent of the runtime LLM (uses BERT embeddings,
not autoregressive inference), with weak non-distinctness only via
the shared enrichment-LLM upstream (same structural dependency the
late-interaction cosine source carries). The 0.02 margin above
cosine accounts for subsumption prediction being a single per-ICE-code
decision (structurally more brittle than per-column cosine evidence).
2. Independent-tier consensus + revisit gate
For revisit decisions, the pipeline computes a parallel, isolated fusion over the LLM-independent subset only:
m_indep = m_cosine ⊕ m_pattern ⊕ m_name_match (Dempster's rule)
indep_top1 = argmax_singleton m_indep
Implemented in pipeline._classify_column via the INDEPENDENT_TIER
constant and combine_multiple(strategy="dempster"). The top-1
singleton and its mass are exposed in the result dict
(independent_top1_code, independent_top1_mass,
independent_top1_conflict) and stored on the BootstrapState.
The revisit gate at bootstrap._identify_disagreements then fires
when:
- indep_top1_code ≠ llm_code, AND
- indep_top1_mass ≥ classify.bootstrap.indep_revisit_mass_threshold (default 0.45)
This restores a real cross-source disagreement test that cannot be
masked by LLM-derivative sources amplifying the LLM’s vote. The
legacy high-K branch (llm_code != fused_code AND K > k_threshold)
is retained as a safety net and runs second in priority.
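The two-branch gate can be sketched as a pure predicate. The ColumnEvidence container, the revisit_reason name, and the k_threshold default of 0.5 are hypothetical; only the 0.45 mass threshold and the branch order come from this note.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnEvidence:
    llm_code: str
    fused_code: str
    conflict_k: float
    indep_top1_code: str
    indep_top1_mass: float


def revisit_reason(
    ev: ColumnEvidence,
    indep_mass_threshold: float = 0.45,
    k_threshold: float = 0.5,  # assumed value; the real default lives in config
) -> Optional[str]:
    """Return which branch fires, or None when no revisit is warranted."""
    # Branch 1: the independent tier disagrees with the LLM at meaningful mass.
    if ev.indep_top1_code != ev.llm_code and ev.indep_top1_mass >= indep_mass_threshold:
        return "indep_tier_disagreement"
    # Branch 2 (legacy safety net): fused label differs from the LLM under high conflict.
    if ev.llm_code != ev.fused_code and ev.conflict_k > k_threshold:
        return "high_conflict"
    return None
```

On the failure case described earlier (fused code equal to the LLM code, K ≈ 0.81), the legacy branch alone returns None; the first branch fires once the independent tier carries at least 0.45 on a different code.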
The revisit prompt context at bootstrap._llm_revisit now includes
the independent-tier consensus code/label/mass so the LLM has the
counter-evidence in front of it during the second pass.
Ontology priors — substrate as semantic anchor
Patterns detect at extraction time. When a pattern fires we resolve
its canonical ICE.* metadata from universal_vocabulary.json (label,
description, common-name aliases, full ontological path root→leaf)
and thread that metadata through three insertion points sourced from
a single lookup (mass_functions.lookup_pattern_ontology):
- Embedding text (features.ColumnFeatures.to_embedding_text; ontology_priors is a discrete FEATURE_NAMES entry, ablatable for SAGE). Cosine similarity then operates over publicly-grounded ontology terms an embedding model recognizes from training rather than the regex name alone. On the failure case that motivated this work, the column embedding gains the literal substring “Transaction Amount; The monetary value of a financial transaction.; aliases: amount, payment, price; ontology: Sensitive Data → Personally Identifiable Data → Financial Data → Payment Data → Transaction Amount”, which is orders of magnitude more semantic surface than patterns: monetary_pattern carried.
- First-pass LLM user prompt (llm_backend.build_batch_user_prompt). For every batch, sweep AND revisit, the prompt now includes per-column “Pattern-detected ontology priors (from Atelier’s universal taxonomy — translate to the closest fit in the candidate vocabulary)” with each fired pattern’s label, description, alias list, and path. The LLM is explicitly instructed that the canonical ICE.* code is never a valid classification target; its job is ontology alignment from the publicly-grounded substrate to the user’s frame (He et al. 2023, Exploring Large Language Models for Ontology Alignment; Hertling & Paulheim 2023, OLaLa: Ontology Matching with LLMs; Ehrig & Sure 2004 for the classical foundation).
- SAGE/SHAP attribution surface (features.FEATURE_NAMES). ontology_priors is now its own ablatable feature distinct from pattern_signals and sample_values. Operators can attribute classification mass to the publicly-grounded ontology prior independently of the raw embedding text; the explainability story ties each prediction back to the public substrate that motivated it.
Surfaced in the result dict as ontology_priors (list of dicts:
pattern, code, label, description, common_names, path, match_fraction). The codes are universal-substrate IDs; they never
appear in user-facing classifications. The user’s vocabulary
remains the authoritative result space; ICE.* is the bridge.
Architectural significance: this is the substrate→tagging bridge the design has been pointing at. Pattern detection was always publicly-grounded; the resolver turns ICE.* into the user’s codes when it can; when it can’t, ontology priors carry the public semantic anchor straight through to cosine + LLM + SHAP without ever fabricating a code in the user’s frame. Compatible with — and strengthens — the indep-tier consensus + reliability-discount mechanisms above.
Cosine reliability shaping (Haenni-Hartmann 2006)
Static discount=0.30 allocated 0.70 of cosine mass uniformly via
softmax across all candidate singletons. On large vocabularies (300+
leaves) this produced softmax compression — even a sharp top-1 hit
landed at ~0.004 mass per code. Cosine could see the right answer
but couldn’t carry it through fusion, and the indep-tier consensus
sat permanently below the revisit threshold.
mass_functions.cosine_to_mass now applies dynamic source
reliability per Haenni & Hartmann 2006, Modeling Partially
Reliable Information Sources: A General Approach Based on
Dempster-Shafer Theory (Information Fusion 7(4), 361–379, §3):
the source-reliability factor α is an observable function of
quality indicators, with (1 − α) allocated to ignorance.
Two quality indicators:
- α_abs: sigmoid of top-1 absolute similarity around τ_abs = 0.40 with σ_abs = 0.10. Encodes “is cosine matching anything strongly, or just noise?”
- α_marg: tanh((s₁ − s₂) / σ_marg) with σ_marg = 0.05. Encodes “is the top-1 a decisive winner, or ambiguous among similar candidates?”
Weighted blend (w_abs = 0.6, w_marg = 0.4), clamped to
[reliability_floor, reliability_ceiling] = [0.10, 1 − classify_discount_maxsim]. The ceiling preserves the legacy
maximum-mass behavior under sharp signal; the floor keeps cosine
contributing some mass even under noise.
The α-bounded evidence mass is then split via margin-aware allocation:
m(top-1) = α · margin_weight + α · (1 − margin_weight) · softmax_top1
m(top-i, i>1) = α · (1 − margin_weight) · softmax_top_i
m(Θ) = 1 − α
where margin_weight = tanh((s₁ − s₂) / σ_marg). When the margin
is wide, almost all evidence mass concentrates on top-1 directly
rather than diluting through softmax. When the margin is narrow,
the formula reduces to classical softmax allocation across the
full candidate set.
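The reliability computation and the margin-aware split can be sketched as below. This is an illustration under stated assumptions rather than the body of mass_functions.cosine_to_mass: the sigmoid is taken as 1 / (1 + exp(−(s₁ − τ_abs) / σ_abs)), the ceiling is fixed at 0.70, and softmax probabilities are passed in already computed because the softmax temperature is not given in this note.

```python
import math
from typing import List, Sequence, Tuple


def cosine_reliability(
    s1: float,
    s2: float,
    tau_abs: float = 0.40,
    sigma_abs: float = 0.10,
    sigma_marg: float = 0.05,
    w_abs: float = 0.6,
    w_marg: float = 0.4,
    floor: float = 0.10,
    ceiling: float = 0.70,  # assumed value of 1 - classify_discount_maxsim
) -> float:
    """Blend the absolute and margin indicators, then clamp to [floor, ceiling]."""
    alpha_abs = 1.0 / (1.0 + math.exp(-(s1 - tau_abs) / sigma_abs))
    alpha_marg = math.tanh((s1 - s2) / sigma_marg)
    blended = w_abs * alpha_abs + w_marg * alpha_marg
    return min(max(blended, floor), ceiling)


def allocate_mass(
    s1: float,
    s2: float,
    softmax_probs: Sequence[float],  # ranked, top-1 first, summing to 1
    sigma_marg: float = 0.05,
) -> Tuple[List[float], float]:
    """Return (per-candidate masses in rank order, mass on Theta)."""
    alpha = cosine_reliability(s1, s2, sigma_marg=sigma_marg)
    margin_weight = math.tanh((s1 - s2) / sigma_marg)
    masses = [alpha * (1.0 - margin_weight) * p for p in softmax_probs]
    masses[0] += alpha * margin_weight  # direct top-1 concentration
    return masses, 1.0 - alpha
```

By construction the candidate masses sum to α and Θ receives 1 − α, so the output is a valid mass function for any ranked softmax input.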
Behavior across regimes (BDD-locked in
features/agent/evidence_independence.feature, “Cosine reliability
shaping concentrates mass on a clear top-1”):
| Top-1 sim | Top-2 sim | α | margin_weight | Top-1 mass | Θ mass |
|---|---|---|---|---|---|
| 0.70 | 0.50 | 0.700 | 1.000 | 0.700 | 0.300 |
| 0.45 | 0.20 | 0.700 | 1.000 | 0.700 | 0.300 |
| 0.45 | 0.44 | 0.452 | 0.197 | 0.091 | 0.548 |
| 0.23 | 0.23 | 0.100 | 0.002 | 0.0005 | 0.900 |
Sharp signal recovers the legacy ceiling allocation but concentrates it on top-1 (~170× the prior compressed mass). Ambiguous and noise regimes correctly route most mass to Θ rather than fabricating false confidence. The indep-tier revisit gate (threshold 0.45) is now reachable on cosine alone whenever cosine has clear semantic signal.
Reliability shaping composes cleanly with the indep-tier consensus computation: when cosine carries decisive mass on a code different from the LLM’s vote, that code becomes the indep-tier top-1 and the revisit gate fires, which is the soundness invariant the whole evidence-independence treatment is reaching for.
Hierarchical mass aggregation + cross-subtree visibility
A separate structural gap surfaced after reliability shaping landed: when cosine evidence localizes to a subtree (multiple financial-leaning leaves under a common parent) but the LLM picks a confident leaf in a different subtree, the predicted code falls to the LLM’s leaf and there is no surfaced signal that an honest-but-coarser parent would apply. Three cooperating fixes close that gap:
1. Cosine emits hierarchical mass
mass_functions.cosine_to_mass now walks up from the cosine top-1
leaf, finds the most-specific internal node whose descendants
capture ≥ 50% of the softmax probability mass
(_significant_subtree), and redirects the in-subtree residual
mass to that internal-node focal element rather than diluting it
across leaves. The frame already exposed every parent code as an
internal-node FocalElement (descendant leaf set); we just
weren’t emitting mass there. Hierarchical Dempster-Shafer
treatment per Shafer 1976 §3 and Smets 1990 §6 (refinement /
coarsening): an internal-node focal element represents a
disjunction — “the answer is somewhere in this subtree” — without
committing to a specific leaf.
Walking up from top-1 (rather than requiring every top-K to share an LCA) tolerates outliers cleanly: a small amount of probability leaking outside the subtree doesn’t void the aggregation as long as the bulk of mass remains inside.
Sharp-signal regimes are unaffected — when the margin is wide
the residual mass α · (1 − margin_weight) is small, so the
hierarchical aggregation simply scales proportionally. The
top-1 leaf still wins when one is decisive.
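The walk-up step can be sketched over a plain parent/children map. Everything here is illustrative: the toy hierarchy, the probabilities, and the function names are assumptions, and top-level nodes are treated as having no parent so the walk stops before reaching a universal root.

```python
from typing import Dict, List, Optional, Set


def leaves_under(node: str, children: Dict[str, List[str]]) -> Set[str]:
    """Collect the descendant leaves of a node (a leaf returns itself)."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found: Set[str] = set()
    for kid in kids:
        found |= leaves_under(kid, children)
    return found


def significant_subtree(
    top1_leaf: str,
    parent: Dict[str, str],
    children: Dict[str, List[str]],
    leaf_prob: Dict[str, float],
    min_share: float = 0.5,
) -> Optional[str]:
    """Most-specific ancestor of top-1 whose leaves hold >= min_share of the mass."""
    node = parent.get(top1_leaf)
    while node is not None:
        share = sum(leaf_prob.get(leaf, 0.0) for leaf in leaves_under(node, children))
        if share >= min_share:
            return node
        node = parent.get(node)
    return None


children = {
    "financial": ["payment", "account"],
    "payment": ["txn_amount", "card_number"],
    "account": ["iban"],
    "non_sensitive": ["catch_all"],
}
parent = {kid: node for node, kids in children.items() for kid in kids}
leaf_prob = {"txn_amount": 0.32, "card_number": 0.15, "iban": 0.25, "catch_all": 0.28}

# "payment" holds 0.47 (below 0.5); "financial" holds 0.72, so it is selected.
anchor = significant_subtree("txn_amount", parent, children, leaf_prob)
```

The 0.28 sitting on catch_all is the kind of outlier leakage the walk-up tolerates: it lowers the share of each ancestor but does not void the aggregation.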
2. cautious_code walks the full hierarchy
HierarchicalClassification.cautious_code previously walked only
the predicted code’s ancestor chain via belief_path —
structurally blind to belief mass in any other subtree. It now
delegates to cross_subtree_belief, which iterates every
singleton AND every internal-node focal element in the frame and
returns those with Bel ≥ threshold. The most-specific code
wins, regardless of subtree.
Concretely: when the LLM votes 0.1 Internal Non-Sensitive but
cosine’s hierarchical aggregation puts Bel(Financial Data) = 0.55 on a different subtree’s parent, cautious_code(0.5) can
now return Financial Data — not just 0 (the predicted
code’s parent).
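A minimal sketch of the full-hierarchy walk, assuming each code maps to its descendant leaf set and the fused mass function is keyed by frozensets of leaves. The helper names mirror the ones in this section but the bodies are illustrative, and the tie-break (smaller leaf set first, then higher belief) is an assumption about what "most-specific" means.

```python
from typing import Dict, FrozenSet, Optional

Mass = Dict[FrozenSet[str], float]


def belief(m: Mass, target: FrozenSet[str]) -> float:
    """Bel(A): total mass of non-empty focal elements contained in A."""
    return sum(w for focal, w in m.items() if focal and focal <= target)


def cross_subtree_belief(
    m: Mass, node_leaves: Dict[str, FrozenSet[str]], threshold: float
) -> Dict[str, float]:
    """Every code, in any subtree, whose belief meets the threshold."""
    scored = {code: belief(m, leaves) for code, leaves in node_leaves.items()}
    return {code: bel for code, bel in scored.items() if bel >= threshold}


def cautious_code(
    m: Mass, node_leaves: Dict[str, FrozenSet[str]], threshold: float
) -> Optional[str]:
    """Most-specific qualifying code: fewest leaves, then highest belief."""
    candidates = cross_subtree_belief(m, node_leaves, threshold)
    if not candidates:
        return None
    return min(candidates, key=lambda c: (len(node_leaves[c]), -candidates[c]))


node_leaves = {
    "0": frozenset({"0.1", "0.2"}),
    "0.1": frozenset({"0.1"}),
    "0.2": frozenset({"0.2"}),
    "financial_data": frozenset({"txn_amount", "card_number"}),
    "txn_amount": frozenset({"txn_amount"}),
    "card_number": frozenset({"card_number"}),
}
theta = frozenset().union(*node_leaves.values())
fused: Mass = {
    frozenset({"0.1"}): 0.30,            # LLM-driven leaf vote
    node_leaves["financial_data"]: 0.55,  # cosine hierarchical mass
    theta: 0.15,
}

chosen = cautious_code(fused, node_leaves, threshold=0.5)
```

In this toy frame the only singleton carrying mass is 0.1, so a leaf-argmax prediction would still be 0.1, while the cautious walk returns the internal node in the other subtree.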
3. cross_subtree_belief surfaces the conflict
The result dict now carries a cross_subtree_belief field
listing every code (leaf or internal node, any subtree) where
Bel ≥ 0.5. Operators see both the LLM’s leaf vote AND the
cosine-derived alternative subtree as legitimate signals,
instead of the predicted-leaf-only belief_path. When evidence
sources disagree on the subtree, both candidates appear and the
operator can act on the disagreement directly.
This composes cleanly with the prior mechanisms: reliability
shaping ensures cosine top-1 carries enough mass to trigger
hierarchical aggregation when signal is clear; the indep-tier
gate fires when cosine’s hierarchical mass disagrees with LLM at
the leaf level; and cross_subtree_belief makes the cross-subtree
disagreement explicit in the operator-facing result. The
predicted_code field retains its leaf-argmax semantics for
backward compatibility — operators consume the cautious /
cross-subtree fields when needs_clarification = True or when
the cross-subtree summary surfaces a competing internal node.
Operator-facing visibility
The fusion mechanisms above can produce mathematically correct belief structures that are nonetheless invisible to operators when the result-dict surface area is too narrow. Three small changes close that gap:
Evidence string carries per-source codes + competing summary
HierarchicalClassification.from_combined_evidence builds the
evidence field. Previously: dst(cosine=0.65, llm=0.77, catboost=0.42, svm=0.22) → Internal Non-Sensitive [Bel=0.71, ...] — masses only, not the codes each source voted. Now:
dst(cosine→1.4.1.1.1(0.65), llm→0.1(0.77), ...) → Internal Non-Sensitive [Bel=0.67, ...] [competing: Sensitive (1) Bel=0.26] — leaf-level disagreement is visible at a glance,
and a “competing” trailer surfaces non-trivial belief in any
non-predicted top-level subtree.
cross_subtree_belief is always informative
The 0.5 absolute threshold previously suppressed competing-subtree
alternatives whenever Dempster fusion compressed their
mass below the headline bar (the common case when one source
dominates). The default is now lower (0.20) AND an
always_include_top_per_subtree rule guarantees that the
highest-belief leaf and highest-belief internal node from each
top-level subtree appear in the result regardless of
threshold (subject to a small min_bel floor so we don’t
flood the result with noise). Operators always see the
structured “what does each subtree look like?” view.
cautious_promoted_code (Smets least-commitment)
Per Smets 1993 (Belief Functions: The Disjunctive Rule of Combination and the Generalized Bayesian Theorem and related work on least-commitment), when a fine-grained decision is unsupported by evidence the principled response is to commit only at the level of granularity where evidence IS unambiguous. This is exactly the mechanism for “the predicted leaf is not the right answer; the parent code is more honest.”
HierarchicalClassification.cautious_promoted_code returns
either the predicted leaf (no promotion) or the most-specific
code anywhere in the hierarchy whose belief meets the
commit_threshold (default 0.55). Promotion fires only when
needs_clarification = True — operators get the leaf
prediction by default; the cautious promotion is the
epistemically-honest fallback when the system itself flags the
prediction as uncertain.
The predicted_code field retains its leaf-argmax semantics
for backward compatibility with Atlas governance sync and
existing UI rendering. cautious_promoted_code lives
alongside it as a separate field operators consult when
needs_clarification is True.
Per-column residual trajectory
The corpus-wide residual norm + contraction factor establish the
headline iterative-method diagnostic, but they obscure per-column
behaviour. BootstrapState.column_history: dict[str, list[ColumnResidualSnapshot]]
captures the column-major view: one snapshot per labeled column per
iteration, populated in record_iteration_metrics after each
iteration’s ML validation completes. Each snapshot records the
column’s gap, belief, K, indep-tier top-1 code/mass, label, label
source, and a revisited flag indicating whether
_llm_revisit touched the column in that iteration.
bootstrap.column_contraction(state, name) mirrors the corpus-wide
contraction_rate at the column level: ρ_col = current_gap /
prev_gap (falling back to K when gap is zero), or None when the
column has fewer than two snapshots. ρ_col < 1 means the column is
converging; ρ_col → 1 stalled; ρ_col > 1 diverging. Per-column ρ
exposes the empirical contraction distribution that corpus
aggregates obscure — operators see which specific columns are
converging vs stalling.
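A simplified sketch of the per-column ratio follows. It takes a snapshot list directly instead of (state, name), uses a hypothetical two-field Snapshot in place of ColumnResidualSnapshot, and reads "falling back to K when gap is zero" as: when the previous gap is zero, use the ratio of successive K values instead. That reading is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Snapshot:  # hypothetical stand-in for ColumnResidualSnapshot
    gap: float
    conflict_k: float


def column_contraction(history: Sequence[Snapshot]) -> Optional[float]:
    """rho_col from the last two snapshots; None when it cannot be computed."""
    if len(history) < 2:
        return None
    prev, curr = history[-2], history[-1]
    if prev.gap > 0.0:
        return curr.gap / prev.gap
    if prev.conflict_k > 0.0:
        return curr.conflict_k / prev.conflict_k  # assumed fallback on K
    return None
```

A column whose gap goes 0.40 → 0.20 has ρ_col = 0.5 and is converging; one that repeats the same gap has ρ_col = 1.0 and is stalled.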
The full trajectory is written to
build/results/{run_id}/column_trajectories.json at pipeline end
alongside classifications.json, enabling offline analysis,
operator post-mortem, and audit. The agent loop’s
iteration_history carries a summarized view (per-column
gap/bel/K sequences plus ρ_col) so the agent can reason about
which columns are moving.
This trajectory infrastructure is the substrate for any future acceleration scheme. Three plausible Phase B / Phase C extensions all consume it:
- Bandit-style revisit ordering (Phase B): extend _identify_disagreements to mix expected_revisit_gain(name), derived from history, into the sort key. Revisits are ordered by predicted marginal residual reduction. Default-off knob; trajectory data backs it.
- Aitken Δ² early-stop (Phase B): for columns with ≥3 snapshots and a clean linear-convergence pattern, predict the limit and skip further revisits when the predicted gap is below cfg.gap_threshold. Saves LLM cost on the predictable tail. Default-off knob; trajectory data backs it.
- Limited per-column belief-mass Anderson (Phase C, deferred): only on columns that genuinely oscillate (per-column ρ near 1 with sign-changing residual differences). Phase A’s trajectories let us measure whether such a population exists before shipping any Anderson code.
The honest framing: classical Anderson acceleration on the full belief-vector iteration is poorly suited to LLM-driven dynamics (stochastic T, mostly-static state, discrete labels, targeted-not-uniform revisit). What adds value given the problem structure is the per-column trajectory data itself: operators see per-column convergence behaviour, future acceleration schemes have real per-column data to operate on, and we can decide between bandit / Aitken / Anderson empirically rather than rhetorically. Phase A ships that substrate; Phase B and Phase C are gated on what the substrate reveals.
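For reference, the Aitken Δ² extrapolation behind the proposed early-stop is x̂ = x₂ − (x₂ − x₁)² / ((x₂ − x₁) − (x₁ − x₀)). The sketch below applies it to the last three gaps of one column. The function names and the "clean linear convergence" test (successive differences shrinking with a constant sign) are assumptions about how Phase B might be written, not shipped code.

```python
from typing import Optional, Sequence


def aitken_limit(x0: float, x1: float, x2: float) -> Optional[float]:
    """Aitken delta-squared estimate of the limit of a linearly converging sequence."""
    d1, d2 = x1 - x0, x2 - x1
    denom = d2 - d1
    if abs(denom) < 1e-12:
        return None  # differences are equal: no extrapolation possible
    return x2 - d2 * d2 / denom


def can_skip_revisit(gaps: Sequence[float], gap_threshold: float) -> bool:
    """True when the predicted limiting gap is already below the threshold."""
    if len(gaps) < 3:
        return False
    x0, x1, x2 = gaps[-3:]
    d1, d2 = x1 - x0, x2 - x1
    # Require shrinking, same-sign steps (ratio strictly between 0 and 1).
    if d1 == 0.0 or not 0.0 < d2 / d1 < 1.0:
        return False
    limit = aitken_limit(x0, x1, x2)
    return limit is not None and limit < gap_threshold
```

For gaps 0.45, 0.25, 0.15 (a geometric approach to 0.05 with ratio 0.5) the estimate recovers 0.05, so a gap threshold of 0.10 would let the loop stop revisiting that column.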
Cost-sensitive classification at the LLM layer (Elkan 2001)
All the prior mechanisms operate at or below the fusion layer —
they shape how per-source evidence is combined into a fused
belief. But on the canonical failure case
(loan_applications.requested_amount), the LLM at confidence 0.88
plus its derivative cluster (CatBoost, SVM) reinforces a vote on
0.1 Internal Non-Sensitive, and Dempster fusion’s normalization
preserves that dominance. Algorithmic mitigations stalled at
Bel ≈ 0.74 — an honest reduction from Bel = 0.955 baseline, but
the headline classification still miscategorized financial PII.
The principled response, per cost-sensitive classification
(Elkan 2001, The Foundations of Cost-Sensitive Learning) is to
adjust the decision threshold under asymmetric cost. In data
governance the asymmetry is severe: failing to flag truly
sensitive data (false negative, Type II) creates regulatory
liability (GDPR Art. 25 data protection by default; HIPAA Safe
Harbor; PCI DSS scope creep guidance), while over-classifying
(false positive, Type I) produces review overhead but is
recoverable. Treating the costs as cost(FN) ≫ cost(FP) is the
canonical privacy-regime convention.
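The threshold shift Elkan describes has a closed form when correct decisions cost nothing: flag as sensitive whenever P(sensitive) ≥ cost(FP) / (cost(FP) + cost(FN)). The sketch below only illustrates that principle; the cost values are hypothetical and this is not a decision rule in the pipeline.

```python
def sensitive_threshold(cost_fn: float, cost_fp: float) -> float:
    """Elkan's optimal probability threshold, assuming zero cost for correct calls.

    Flagging is optimal when (1 - p) * cost_fp <= p * cost_fn,
    which rearranges to p >= cost_fp / (cost_fp + cost_fn).
    """
    if cost_fn <= 0.0 or cost_fp <= 0.0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def decide(p_sensitive: float, cost_fn: float, cost_fp: float) -> str:
    """Minimum-expected-cost decision between the two classes."""
    threshold = sensitive_threshold(cost_fn, cost_fp)
    return "sensitive" if p_sensitive >= threshold else "non_sensitive"
```

With symmetric costs the threshold is 0.5; with a hypothetical 9:1 ratio it drops to 0.1, so a column with only 0.26 probability of being sensitive would still be flagged.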
Atelier applies this at the LLM layer — upstream of fusion —
via a Sensitivity classification perspective section in the
system prompt (llm_backend.build_system_prompt). The framing is
deliberately collaborative rather than prescriptive: modern LLMs
respond better to a colleague’s framing than to a compliance
checklist. Three load-bearing moves:
- Invoke what the LLM already knows. The preamble names BFO, CCO, and the privacy regimes (GDPR, HIPAA, PCI DSS) those ontologies overlap with, concepts the model has substantial training exposure to. The customer’s taxonomy is framed as “their refinement of those publicly-grounded concepts,” and the model is asked to pick whichever of their codes matches the canonical sensitivity concept it would otherwise assign (PII, Financial Information, Technical Identifier, Biometric, etc.). No re-teaching, no rule list: invocation.
- State the asymmetry once, casually. Cost-sensitive classification appears as “a practical asymmetry: in governance, calling sensitive data non-sensitive is a larger error than the reverse.” The over-classification guard is embedded conversationally: “When signals are genuinely absent (operational metadata, surrogate keys, timestamps, status enums), non-sensitive is the correct call — don’t reach for sensitive just because of the asymmetry.” One sentence on confidence calibration: “Calibrate confidence to what you actually saw, not to this asymmetry.”
- Vocabulary-aware sensitivity map, ICE conventions only. _sensitive_subtree_summary(category_set) activates on ICE.SENSITIVE.* / ICE.NONSENSITIVE.* paths and emits a Markdown block naming the sensitive root, catch-all, and a few publicly-grounded leaf abbreviations (per src/atelier/classify/fixtures/PROVENANCE.md). Returns "" for every other vocabulary shape so the prompt stays silent where the framework can’t verify the encoding is publicly grounded. For non-ICE vocabularies the LLM still has the full markdown category table, per-column ontology priors for pattern-bearing columns, and the perspective preamble; that is sufficient to navigate any taxonomy without the framework guessing at its sensitivity structure.
The prompt block is default-on for every classification run;
no config knob. Built once per pipeline run at
pipeline.py:577 so the helper computation is amortized and the
new content lives inside the Anthropic prompt-cache prefix —
one-time cache miss on the first batch, normal cache hits
thereafter. Token cost is bounded (~250–300 fixed + ~80 for
the per-vocab summary).
This composes cleanly with everything below it: reliability
discounts on derivative sources still suppress double-counting,
cosine reliability shaping still concentrates mass on clear
top-1 hits, hierarchical aggregation still flows residual mass
to internal-node focal elements, the indep-tier consensus gate
still triggers revisits on cross-source disagreement, and
cautious_promoted_code still applies Smets least-commitment
on uncertain leaves. The Governance Cost Model changes what
the LLM votes — biasing toward sensitive parents under
uncertainty — leaving every downstream mechanism unchanged.
The hypothesis: with a governance prior at the source, the LLM
will either (a) pick a defensible sensitive parent code on
columns like requested_amount, or (b) lower its confidence on
the non-sensitive choice — either of which is an improvement
over the status quo. The exact behavior is non-deterministic
and confirmed against real LLM runs; BDD scenarios assert the
prompt structure (features/agent/governance_cost_model.feature),
not the LLM’s vote.
Pattern-target alias resolver
A second, narrower bug surfaced during investigation: the static
DEFAULT_PATTERN_MAP at mass_functions.py references canonical
ICE.* mnemonic strings (monetary_pattern → ICE.SENSITIVE.PID.FINANCIAL.PAYMENT.TXNAMT)
that are absent from non-ICE vocabularies. The pre-2026-04-30
behavior silently dropped any pattern whose target wasn’t in
frame.singletons, disabling the entire pattern source on numeric
or domain-specific vocabularies — including the run that motivated
this work.
mass_functions.resolve_pattern_map now resolves each ICE.* target
through three fallback layers against the active category_set:
- Direct hit on all_by_code.
- Match on by_abbrev using the leaf mnemonic (suffix after the final .).
- Token-normalized match against common_names aliases.
Misses log a single WARNING enumerating the patterns that were
dropped. The resolver is cached on the category_set instance and
runs once per pipeline. The deeper BFO/Common-Core ontology mapping
this shim approximates remains future work.
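The three fallback layers can be sketched for a single target as below. This is not the body of mass_functions.resolve_pattern_map: the dict shapes given to all_by_code, by_abbrev, and common_names are assumed, and so is the choice to match the ICE concept's own aliases (a hypothetical ice_aliases argument) against the vocabulary's aliases in the third layer.

```python
import re
from typing import Dict, List, Optional


def _normalize(text: str) -> str:
    """Lowercase and collapse to space-separated alphanumeric tokens."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def resolve_target(
    ice_code: str,
    ice_aliases: List[str],              # assumed: aliases of the ICE concept
    all_by_code: Dict[str, str],         # assumed shape: vocab code -> label
    by_abbrev: Dict[str, str],           # assumed shape: abbreviation -> vocab code
    common_names: Dict[str, List[str]],  # assumed shape: vocab code -> aliases
) -> Optional[str]:
    """Resolve one ICE.* pattern target to a code in the active vocabulary."""
    # Layer 1: the ICE code itself exists in the vocabulary.
    if ice_code in all_by_code:
        return ice_code
    # Layer 2: leaf mnemonic, i.e. the suffix after the final ".".
    leaf = ice_code.rsplit(".", 1)[-1]
    if leaf in by_abbrev:
        return by_abbrev[leaf]
    # Layer 3: token-normalized alias overlap.
    wanted = {_normalize(alias) for alias in ice_aliases}
    for code, aliases in common_names.items():
        if wanted & {_normalize(alias) for alias in aliases}:
            return code
    return None  # caller is expected to log the miss
```

Normalization makes "transaction_amount" and "Transaction Amount" compare equal, which is what lets a numeric or domain-specific vocabulary pick up a pattern whose ICE mnemonic it has never seen.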
Deferred work
This treatment preserves Dempster’s rule end-to-end and handles non-distinctness through reliability discounting + per-source reliability shaping. One future refinement remains scoped out:
- Tiered fusion with the cautious rule (Denoeux 2008). Combine the LLM-derivative cluster {llm, catboost, svm} via cautious conjunction (idempotent on identical evidence; weight-function formulation w1 ∧ w2, the pointwise minimum of the conjunctive weights), the independent cluster {cosine, pattern, name_match} via Dempster, and combine the two cluster-level mass functions across-tier. This dissolves the non-distinctness problem at the math level rather than approximating it via discount. Trade-off: cautious is non-normalising, so derivative-tier-only columns will see narrower belief intervals (which is correct behaviour but a UI shift).
The combine_multiple infrastructure already supports adding a
strategy="cautious" branch alongside the existing dempster /
yager options, so the refinement is surgical when it lands.
References
- Shafer, G. (1976). A Mathematical Theory of Evidence. Princeton University Press. Ch. 3 §3 (independence assumption); Ch. 4 §3 (Dempster’s rule); §11.3 (reliability discount).
- Smets, P. (1990). The Combination of Evidence in the Transferable Belief Model. IEEE Transactions on Pattern Analysis and Machine Intelligence 12(5), 447–458.
- Smets, P. & Kennes, R. (1994). The Transferable Belief Model. Artificial Intelligence 66(2), 191–234.
- Denoeux, T. (2008). Conjunctive and Disjunctive Combination of Belief Functions Induced by Non-Distinct Bodies of Evidence. Artificial Intelligence 172(2-3), 234–264. §1, §3.1, §4.
- Haenni, R. & Hartmann, S. (2006). Modeling Partially Reliable Information Sources: A General Approach Based on Dempster-Shafer Theory. Information Fusion 7(4), 361–379.
Operational impact
Operators upgrading to this calibration should expect:
- More columns marked needs_clarification=True on the first run after upgrade. This is the intended outcome: derivative-source amplification no longer hides genuine cross-source conflict.
- A modest increase in LLM revisit volume (the gate fires on a wider, principled condition). Mitigated by the indep_revisit_mass_threshold floor and the existing budget caps at classify.bootstrap.max_total_llm_calls / max_total_llm_attempts.
- A pattern-source WARNING at startup enumerating any patterns whose ICE.* target failed to resolve to the active vocabulary. Acceptable as long as the leaf mnemonics that do exist in the vocab carry the relevant abbrev or common_names aliases; expected on first run with a domain-specific vocabulary.