Sprint Summary: 2026-05-06 to 2026-05-20
This appendix records the engineering work completed during the
two-week sprint ending 2026-05-20. The sprint covered 27 commits
on feat/dst-late-interaction-cosine across three major work
streams: (1) training-time Normalized Hierarchical SVM with the
Structured Shared Frobenius Norm, (2) ColBERT late-interaction cosine
integration with Qdrant, and (3) CatBoost/SVM calibration under the
Dempster-Shafer evidence-independence framework. A DST numeric
sensitivity study and BDD scenario expansion provided the empirical
grounding.
1. Training-Time NHSVM (Choi et al. 2015)
Motivation
The prior NHSVM implementation was a post-hoc approximation: a flat
SVM trained with standard Frobenius norm regularization (no hierarchy
awareness), then nhsvm_reweight() nudged the probability
distribution at inference time using tree-distance penalties. This
cannot recover what was never learned. The SVM’s decision boundaries
are flat; the reweighting is a band-aid. On an asymmetric taxonomy
(deep sensitive subtree vs. shallow operational subtrees), the flat
SVM systematically under-penalizes cross-subtree probability flow,
allowing shallow catch-all nodes to absorb classifications that
belong in the deep subtree.
The Structured Shared Frobenius Norm
Choi et al. (2015, arXiv:1508.02479) show that for single-label hierarchical classification, proper NHSVM reduces to a standard multi-class SVM with a modified feature map. The key insight is the Structured Shared Frobenius Norm:
||W||^2_G = sum_n ||u_n||^2 / alpha_n
where u_n is the per-node weight component and alpha_n is the path-normalized budget for node n. This regularizer explicitly incorporates the label structure G: it encourages models to share information along tree paths, and it penalizes complexity in proportion to each node’s position in the hierarchy.
The Kronecker product feature expansion (Eq. 5) implements this norm without a custom solver. For sample x with label y, the expanded feature map is:
phi(x, y) = Lambda(y) ⊗ x
where Lambda(y)_n = sqrt(alpha_n) for nodes n on the root-to-y path, and zero elsewhere. Standard L2 regularization on the expanded space equals the Structured Shared Frobenius Norm on the original space. The geometry is exact, not approximate.
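The equivalence can be checked numerically on a toy case. The sketch below is illustrative only: the four-node tree, the alpha values, and the helper name expand are assumptions made for the example, not the project's HierarchicalFeatureExpander.

```python
import numpy as np

# Hypothetical toy tree with four non-root nodes (ids 0..3).
# Each leaf maps to the node ids on its path; alpha sums to 1 along every path.
PATHS = {"a": [0, 1], "b": [0, 2], "c": [3]}
ALPHA = np.array([0.5, 0.5, 0.5, 1.0])


def expand(x, label):
    """phi(x, y) = Lambda(y) ⊗ x, with Lambda(y)_n = sqrt(alpha_n) on the path."""
    lam = np.zeros(len(ALPHA))
    lam[PATHS[label]] = np.sqrt(ALPHA[PATHS[label]])
    return np.kron(lam, x)  # one block of size len(x) per node


rng = np.random.default_rng(0)
d = 3
x = rng.normal(size=d)
W = rng.normal(size=(len(ALPHA), d))  # one weight block per node (expanded space)

# Per-node components in the original space: u_n = sqrt(alpha_n) * w_n.
U = np.sqrt(ALPHA)[:, None] * W

# The score in the expanded space equals the sum of u_n . x along the path.
score_expanded = W.ravel() @ expand(x, "a")
score_original = sum(U[n] @ x for n in PATHS["a"])

# Plain squared L2 norm on W equals sum_n ||u_n||^2 / alpha_n.
l2_expanded = np.sum(W ** 2)
structured = np.sum(np.sum(U ** 2, axis=1) / ALPHA)
```

Here score_expanded matches score_original and l2_expanded matches structured up to floating-point error, which is the sense in which ordinary L2 regularization on the expanded features reproduces the structured norm.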
Directional Constraint (Eq. 7)
The alpha budget is computed via a linear program with the directional constraint: alpha_child >= alpha_parent for every parent-child pair. This forces more of the information budget toward leaves, preventing degenerate solutions on unbalanced trees where shallow internal nodes would otherwise absorb the entire alpha budget.
The LP formulation:
maximize min_n alpha_n
subject to sum(alpha_n for n in path(root, l)) = 1 for every leaf l
alpha_child >= alpha_parent for every parent-child
alpha_n >= 0
Solver: scipy.optimize.linprog(method='highs'). On the project
taxonomy: 296 variables, 220 equalities, 582 directional
inequalities. Solves in under one second with zero violations and
path sums exact to machine precision (deviation < 1e-15). The
unconstrained closed-form (Lemma 2: alpha_n = 1/D_n - 1/D_parent)
is preserved as a private fallback.
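A minimal sketch of this LP on a small unbalanced tree follows. The tree, the node names, and the auxiliary variable t (used to turn "maximize min_n alpha_n" into a linear objective) are assumptions for illustration; the project's actual constraint assembly may differ, and the root is left out of the paths here for brevity.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical toy taxonomy: child -> parent (None marks a top-level node).
PARENT = {"S": None, "S1": "S", "S1a": "S1", "S1b": "S1", "S2": "S", "O": None}
NODES = list(PARENT)
IDX = {n: i for i, n in enumerate(NODES)}
LEAVES = [n for n in NODES if n not in PARENT.values()]
T = len(NODES)  # column index of the auxiliary variable t = min_n alpha_n


def path(leaf):
    """Nodes from the leaf up to its top-level ancestor."""
    out, node = [], leaf
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


n_var = len(NODES) + 1
c = np.zeros(n_var)
c[T] = -1.0  # linprog minimizes, so minimize -t to maximize t

# Equalities: alpha sums to 1 along every root-to-leaf path.
A_eq = np.zeros((len(LEAVES), n_var))
for r, leaf in enumerate(LEAVES):
    for n in path(leaf):
        A_eq[r, IDX[n]] = 1.0
b_eq = np.ones(len(LEAVES))

rows = []
# Directional constraint: alpha_parent - alpha_child <= 0.
for child, parent in PARENT.items():
    if parent is not None:
        row = np.zeros(n_var)
        row[IDX[parent]], row[IDX[child]] = 1.0, -1.0
        rows.append(row)
# Epigraph constraint: t - alpha_n <= 0 for every node.
for n in NODES:
    row = np.zeros(n_var)
    row[T], row[IDX[n]] = 1.0, -1.0
    rows.append(row)
A_ub = np.vstack(rows)
b_ub = np.zeros(len(rows))

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n_var, method="highs")
alpha = dict(zip(NODES, res.x[:T]))
```

On this toy tree the deepest path (S, S1, S1a) forces the minimum to 1/3, and the shallow leaf O receives the full budget of 1 on its own path.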
Implementation
The training pipeline proceeds:
- TF-IDF (char 3-6 + word 1-2 n-grams, 50K max features)
- TruncatedSVD to 200 components (configurable via classify.svm.nhsvm_svd_components). Necessary because full TF-IDF times Kronecker expansion would produce 14.75M features and a 34.8 GB coefficient matrix. At 200 dimensions the expanded space is 59K features and the model fits in approximately 250 MB.
- Kronecker expansion via HierarchicalFeatureExpander: training-time expansion populates only the label’s path blocks (sparse, ~path_len x d non-zeros per row); inference-time expansion populates all blocks (dense across nodes, sparse across features).
- LinearSVC with CalibratedClassifierCV(ensemble=False) for Platt-scaled probabilities.
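The flat parts of these steps can be sketched with scikit-learn as follows. This is not the project's training code: the Kronecker expansion step is omitted, the toy corpus is invented, the SVD is cut to 5 components so it fits the tiny example, and applying the 50K cap per vectorizer is an assumption.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus; the real pipeline trains on the synthetic corpus.
texts = [
    "alice@example.com", "bob@mail.org", "carol@site.net",
    "dave@corp.io", "erin@web.com", "frank@host.org",
    "+1 555 0100", "+44 20 7946 0958", "555-0199",
    "(212) 555-0142", "+1 415 555 0188", "020 7946 0123",
]
labels = ["email"] * 6 + ["phone"] * 6

# Character 3-6 grams plus word 1-2 grams, concatenated into one TF-IDF space.
features = FeatureUnion([
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(3, 6), max_features=50_000)),
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2), max_features=50_000)),
])

model = make_pipeline(
    features,
    TruncatedSVD(n_components=5, random_state=0),  # 200 in the setup described above
    # ensemble=False: one final LinearSVC plus one sigmoid (Platt) calibrator
    # fitted on cross-validated decision scores.
    CalibratedClassifierCV(LinearSVC(), method="sigmoid", cv=3, ensemble=False),
)
model.fit(texts, labels)
proba = model.predict_proba(["grace@example.org", "555 0123"])
```

With ensemble=False the serialized classifier holds a single SVM rather than one per fold, which keeps the bundle smaller at the cost of the variance reduction an ensemble would give.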
The model serializes as a dict bundle ({feature_union, svd, expander, classifier, classes}) with automatic detection on load.
Legacy flat .pkl files load unchanged, preserving backward
compatibility. A _nhsvm suffix on the per-vocabulary cache key
prevents serving a flat model as hierarchical or vice versa.
When the pipeline detects a training-time NHSVM model (via the
_hierarchical attribute), it skips all post-hoc reweighting
infrastructure (distance matrix precomputation, nhsvm_to_mass
routing) and sends SVM probabilities directly through svm_to_mass.
The hierarchy is already in the probabilities.
SVM Training Consolidation
In the same sprint, SVM training was consolidated from two paths
(Path A: ICE alignment-based, Path B: enrichment-based) into a
single enrichment-required path. Qdrant is the source of truth for
enrichment payloads; a JSON export under build/ serves as the
offline/CI fallback.
The synthetic corpus generator (synth_registry.py) now covers all
taxonomy nodes (leaves and internal) via a three-layer generator
architecture:
- ICE-matched hand-coded (highest priority): enrichment metadata is matched against 31 inference patterns to select the best ICE generator. A mnemonic-to-dot-code bridge maps category abbreviations (e.g., EMAIL) to enrichment payload keys (e.g., 1.1.1.9.3.1).
- Template generators (medium): prototype values from enrichment payloads with mild perturbation (numeric jitter, character substitution).
- Inferred generators (lowest): fallback via pattern matching on category description and common names.
Coverage: 100% of all taxonomy nodes receive a generator. The leaf-only assumption was corrected at six sites across three files; every node is a first-class tagging target.
2. ColBERT Late-Interaction Cosine via Qdrant
Architecture
The sprint delivered the full P1-P3 stack for multi-vector cosine evidence:
P1 (storage foundation): Qdrant collection schema with named multi-vector fields. Each annotation point stores ColBERT token-level embeddings (128-d after the linear projection) alongside the structured enrichment payload (prototype values, value patterns, name hints, anti-examples, parent path).
P2 (enrichment pipeline): LLM-mediated annotation enrichment generates a six-field structured payload per taxonomy node. Each payload is verified by a deterministic suite of six checks before being written to Qdrant:
- patterns_compile – every regex pattern must be valid Python re syntax.
- prototype_values_match_patterns – at least 50% of prototypes must match a declared regex (relaxed from 100% this sprint to handle diverse free-text categories like marketplace names).
- anti_example_targets_exist – every value in the confusable_tag field (the anti-example pointer) must exist in the taxonomy.
- parent_path_consistent – the generated parent path must match the taxonomy hierarchy exactly.
- name_hints_non_empty – at least one usable name hint.
- no_contradiction_with_anti_examples – no prototype value may appear in anti-examples (self-contradiction rejection).
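The first two checks can be sketched with the standard library alone. The function signatures, the use of fullmatch (rather than search), and the treatment of an empty prototype list are assumptions for illustration; only the check names and the 50% threshold come from the description above.

```python
import re


def patterns_compile(patterns):
    """True if every pattern is valid Python `re` syntax."""
    for pattern in patterns:
        try:
            re.compile(pattern)
        except re.error:
            return False
    return True


def prototype_match_fraction(prototypes, patterns):
    """Fraction of prototype values fully matched by at least one pattern."""
    if not prototypes:
        return 0.0  # assumption: no prototypes counts as no matches
    compiled = [re.compile(p) for p in patterns]
    hits = sum(any(c.fullmatch(v) for c in compiled) for v in prototypes)
    return hits / len(prototypes)


def prototype_values_match_patterns(prototypes, patterns, threshold=0.5):
    """Pass when the matched fraction reaches the threshold."""
    return prototype_match_fraction(prototypes, patterns) >= threshold


# Hypothetical payload: two of three prototypes fit the declared regex.
email_patterns = [r"[^@\s]+@[^@\s]+\.\w+"]
email_prototypes = ["a@b.com", "x@y.org", "not an email"]
passes_relaxed = prototype_values_match_patterns(email_prototypes, email_patterns)
passes_strict = prototype_values_match_patterns(email_prototypes, email_patterns,
                                                threshold=1.0)
```

In this example the payload passes at the relaxed 50% threshold and would have been rejected under the earlier 100% rule.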
Prompts come in two variants (leaf and parent framing) because the principle that drives the architecture – every node is a first-class tagging target – means parent and leaf nodes describe different kinds of column. A leaf prompt asks for maximum-specificity signals; a parent prompt asks for family-level signals with the children listed so the model knows what specializations would not route to the parent.
P3 (late-interaction integration): The bridge
(maxsim_bridge.py) encodes entity text through the same
ColBERT encoder, queries Qdrant with native MaxSim, normalizes
scores by query token count to recover mean per-token similarity,
and converts scores to DST mass functions via
maxsim_to_mass.
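The normalization step can be illustrated locally with NumPy. This sketch recomputes MaxSim in-process purely to show the arithmetic; in the pipeline the score comes from Qdrant, and the function names here are hypothetical.

```python
import numpy as np


def l2_normalize(tokens):
    """Scale each token embedding to unit length so dot products are cosines."""
    return tokens / np.linalg.norm(tokens, axis=1, keepdims=True)


def mean_maxsim(query_tokens, doc_tokens):
    """MaxSim divided by query token count: mean best-match cosine per query token."""
    q = l2_normalize(query_tokens)
    d = l2_normalize(doc_tokens)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    maxsim = sim.max(axis=1).sum()     # best document token per query token, summed
    return maxsim / q.shape[0]


rng = np.random.default_rng(0)
query = rng.normal(size=(4, 128))      # 4 query tokens, 128-d as in the P1 schema
annotation = rng.normal(size=(7, 128))
self_score = mean_maxsim(query, query)
cross_score = mean_maxsim(query, annotation)
```

Dividing by the token count keeps scores comparable between short and long entity texts, since raw MaxSim grows with the number of query tokens.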
Token-Level Discrimination
The motivating failure modes of single-vector cosine resolve through token-level alignment:
- Anonymized columns (comm_val, period_val, addr_ref) – column-name tokens contribute little MaxSim, but sample-value tokens still align to annotation prototype-value tokens. Weak tokens contribute near-zero MaxSim without polluting strong matches.
- Long-tail distinguishing values – a single distinctive sample value’s tokens claim their own MaxSim against annotation prototypes, no longer averaged out by a single dense vector.
- Sibling discrimination – token-level alignment discriminates between semantically adjacent annotations (e.g., “credit card number” vs. “bank account number”) through fine-grained matching that single-vector cosine collapses.
- Parent-pull – parent-path tokens in the annotation text provide hierarchical context for the mass aggregation layer.
Channel-Decomposed Dempster Combination (P3.6)
The mass function produced by late-interaction cosine separates into two channels:
- Positive channel: MaxSim scores on annotation points allocate mass to focal elements (leaf singletons and internal-node descendant sets). Haenni-Hartmann reliability shaping (alpha-bounded allocation) ensures the source never over-commits. Margin-aware allocation places top-1 mass proportional to the gap between first and second candidates; residual mass splits softmax across remaining candidates.
- Negative channel: Anti-example evidence on a code c allocates mass to Theta \ D(c), where D(c) is the descendant leaf set. This is structurally correct for hierarchical exclusion: negating an internal node removes its entire subtree, not just the node itself.
The two channels combine via channel-decomposed Dempster’s rule. When channels conflict on the same node (high positive and high negative simultaneously), conflict K materializes as a diagnostic signal rather than being silently cancelled. The hierarchical aggregation layer walks from the top-1 leaf up to the most-specific ancestor with at least 50% descendant-mass concentration, promoting mass to internal-node focal elements when subtree-level signal is what the evidence supports.
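A small worked example shows how conflict K arises when the two channels disagree. The sketch uses plain Dempster combination over frozenset focal elements; the three-leaf frame, the mass values, and the function name dempster are invented for illustration and do not reproduce the project's reliability shaping or margin-aware allocation.

```python
from itertools import product


def dempster(m1, m2):
    """Dempster's rule: returns (normalized combined mass, conflict K)."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb  # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {fe: w / (1.0 - conflict) for fe, w in combined.items()}, conflict


# Hypothetical frame: internal node A has leaves a1, a2; b1 sits elsewhere.
THETA = frozenset({"a1", "a2", "b1"})
D_A = frozenset({"a1", "a2"})  # descendant leaf set D(A)

# Positive channel: evidence for leaf a1, remainder left on Theta (ignorance).
positive = {frozenset({"a1"}): 0.6, THETA: 0.4}
# Negative channel: anti-example on A puts mass on Theta \ D(A).
negative = {THETA - D_A: 0.3, THETA: 0.7}

fused, K = dempster(positive, negative)
```

Here K = 0.6 x 0.3 = 0.18: the positive channel supports a leaf inside the subtree that the negative channel excludes, and that disagreement is reported rather than discarded.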
3. CatBoost, SVM, and Dempster-Shafer Calibration
The Non-Distinctness Problem
Dempster’s rule assumes the evidence sources being combined are distinct and conditionally independent (Shafer 1976, Ch. 3-4; Smets & Kennes 1994). When sources share provenance – one source’s labels are deterministically derived from another’s – Dempster’s rule double-counts their agreement, inflating confidence beyond what the evidence warrants.
Atelier’s pipeline has six evidence sources with varying degrees of independence:
| Source | Discount | Independence status |
|---|---|---|
| MaxSim (ColBERT late-interaction) | 0.20 | Weakly non-distinct (ColBERT encoder is deterministic; per-user-code reference vectors share enrichment-LLM upstream) |
| NHSVM | 0.22 | Weakly non-distinct (sentence-transformer subsumption alignment shares enrichment-LLM upstream — same provenance as ColBERT plus an additional alignment step, hence the slightly higher discount) |
| Pattern | 0.25 | Independent (deterministic regex matching) |
| Name match | 0.30–0.70 | Independent (deterministic string matching) |
| CatBoost | 0.55 | Strongly non-distinct (fit-to-LLM: per-column shared label provenance with LLM) |
| LLM | 0.15 | Primary source |
The discount schedule follows Shafer’s reliability discount (alpha = 1 - discount applied to source mass) with adjustments per Denoeux (2008): when a source rides on labels deterministically derived from another source, an undiscounted derivative source mathematically swallows the only genuinely independent signal.
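Shafer's reliability discount itself is a one-line transformation. The sketch below assumes mass functions are dicts keyed by frozenset focal elements; the function name and the two-element frame are hypothetical.

```python
def discount_mass(mass, theta, discount):
    """Scale every focal element by (1 - discount) and move the rest to Theta."""
    reliability = 1.0 - discount
    out = {fe: reliability * w for fe, w in mass.items() if fe != theta}
    out[theta] = discount + reliability * mass.get(theta, 0.0)
    return out


THETA = frozenset({"x", "y"})
raw = {frozenset({"x"}): 0.9, THETA: 0.1}

# CatBoost-style discount of 0.55 from the table above.
discounted = discount_mass(raw, THETA, discount=0.55)
```

With a discount of 0.55, a source that was 90% committed to x ends up with 0.405 on x and 0.595 on Theta, so it can no longer dominate the fused result on its own.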
Pending work — manually curated annotation specifications in Ægir are the path to fully eliminating the shared LLM-upstream provenance on ColBERT and NHSVM. When per-user-code annotation payloads are author-curated rather than LLM-generated during enrichment, ColBERT’s reference vectors and the subsumption alignment both become structurally independent of any runtime LLM, and their discounts can drop toward the calibrations a truly distinct source carries. Until that curation is in place, the 0.20 / 0.22 calibration above is the right under-confidence price to pay.
CatBoost: Fit-to-LLM and Adaptive Discount
CatBoost operates in fit_to_llm mode (default): it trains on
(embedding_text, llm_code) pairs from the current run’s LLM sweep.
The model is the explainability surface over LLM labels – SHAP and
SAGE attribute to a model that actually agrees with the LLM, which is
the transparent “why this code” story presented to the operator. But
this makes CatBoost strongly non-distinct with the LLM under
Denoeux’s framework: per-column shared label provenance.
The adaptive discount addresses this through virtual ensemble variance. CatBoost’s virtual ensemble provides uncertainty quantification per code; the discount formula is:
discount = min(max_discount, base_discount + avg_var × variance_scale)
Defaults: base 0.55, variance_scale 1.6, max 0.75, fallback 0.55. High variance (uncertain predictions) produces a larger discount, routing more mass to Theta (ignorance) rather than inflating a weakly-supported prediction. This is the step-size control in the iterative-methods framing: the derivative source’s contribution is damped proportionally to its own uncertainty.
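The formula and its defaults translate directly into a small function. The function name is hypothetical, and treating the fallback as the value used when no variance estimates are available is an assumption.

```python
def adaptive_discount(variances, base_discount=0.55, variance_scale=1.6,
                      max_discount=0.75, fallback=0.55):
    """discount = min(max_discount, base_discount + avg_var * variance_scale)."""
    if not variances:
        # Assumption: fall back when the virtual ensemble yields no variances.
        return fallback
    avg_var = sum(variances) / len(variances)
    return min(max_discount, base_discount + avg_var * variance_scale)


confident = adaptive_discount([0.05, 0.05])   # low variance: 0.55 + 0.08
uncertain = adaptive_discount([0.5])          # high variance: capped at 0.75
missing = adaptive_discount([])               # no estimate: fallback
```

The cap matters: without it, an average variance of 0.5 would push the discount to 1.35, which is not a valid reliability discount.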
SVM: Subsumption Alignment and Weak Non-Distinctness
The SVM’s discount (0.22) reflects a qualitatively different non-distinctness regime. The SVM trains on a synthetic corpus generated from the bundled ICE ontology, then translates predictions into the user taxonomy via sentence-transformer cosine subsumption alignment. The alignment is a per-vocabulary mapping table computed via BERT cosine similarity between ICE concept signatures and enriched annotation payloads – structurally independent of the runtime LLM.
The weak non-distinctness comes from the shared enrichment-LLM upstream: the enrichment payloads that anchor the subsumption alignment were themselves generated by an LLM (though a different call, different prompt, different temperature than the runtime classification LLM). This is the same structural dependency shared by the late-interaction cosine source, which justifies SVM’s discount (0.22) sitting near cosine’s (0.20) rather than near CatBoost’s (0.55).
With training-time NHSVM, the SVM’s probabilities already incorporate
hierarchy, so the pipeline routes them through svm_to_mass directly.
The post-hoc nhsvm_to_mass reweighting is preserved as a legacy
fallback for flat-trained SVMs loaded in hierarchical mode.
The Pipeline as Iterative Refinement
The DST evidence-independence architecture frames the bootstrap loop as fixed-point iteration on a belief-assignment vector B over columns: B_{n+1} = T(B_n). Each component maps onto a numerical-method primitive (Banach 1922; Saad 2003):
| Component | Primitive |
|---|---|
| Bootstrap loop | Fixed-point iteration on B |
| LLM sweep | Stochastic operator (Robbins-Monro framing) |
| ML validation (CatBoost + SVM) | Deterministic linearization |
| DST fusion | Combiner producing fused state |
| Targeted revisit | Local smoothing (multigrid) |
| Pl - Bel gap | A posteriori error estimate per column |
| Conflict K | Nonlinear residual diagnostic |
| Reliability discount | Damping / step-size control |
| Hierarchical cosine mass | Coarse-grid correction (multigrid) |
The unified residual norm combines four components: mean(gap) / gap_threshold, frac_unclear / clarity_target, mean(K) / k_threshold, and independent-tier disagreement fraction. Residual below 1.0 means converged. The contraction factor rho = ||r_{n+1}|| / ||r_n|| is the headline diagnostic: rho < 1 is contractive, rho -> 1 is stalled, rho > 1 is diverging.
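The diagnostics can be sketched as below. The text does not state how the four components are aggregated, so taking their maximum is an assumption (chosen so that a residual below 1.0 means every component is under its threshold); the threshold values, the stall tolerance, and all function names are likewise hypothetical.

```python
def residual_norm(mean_gap, frac_unclear, mean_k, disagreement,
                  gap_threshold, clarity_target, k_threshold):
    """Combine the four normalized components (assumption: take the maximum)."""
    components = (
        mean_gap / gap_threshold,
        frac_unclear / clarity_target,
        mean_k / k_threshold,
        disagreement,  # independent-tier disagreement fraction, used as-is
    )
    return max(components)


def contraction_factors(residuals):
    """rho_n = ||r_{n+1}|| / ||r_n|| for consecutive bootstrap iterations."""
    return [nxt / cur for cur, nxt in zip(residuals, residuals[1:])]


def classify(rho, tol=0.05):
    """Label an iteration; `tol` is a hypothetical band around rho = 1."""
    if rho > 1.0:
        return "diverging"
    if rho > 1.0 - tol:
        return "stalled"
    return "contractive"


r = residual_norm(mean_gap=0.1, frac_unclear=0.05, mean_k=0.02, disagreement=0.3,
                  gap_threshold=0.2, clarity_target=0.1, k_threshold=0.1)
rhos = contraction_factors([2.0, 1.0, 0.5])
labels = [classify(rho) for rho in rhos]
```

A residual sequence of 2.0, 1.0, 0.5 gives rho = 0.5 at each step, which would read as a healthy contractive run under this sketch.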
4. DST Sensitivity Study
A numeric sensitivity study (P3.12-P3.13) swept 2,549 synthetic cells across 10 invariants on an 11-node taxonomy (7 leaves, 4 internal). Zero mathematical violations were found. Key findings:
Channel conflict K is bounded. At the production negative-channel weight beta = 0.30, conflict K caps at approximately 0.24. The K threshold logs never fire under normal operating conditions; the Yager fallback path is effectively dead code under Dempster fusion.
The _significant_subtree concentration threshold is a structural
cliff. The hard 0.50 threshold for promoting mass to an
internal-node focal element produces a discontinuity of Delta = 0.203
in parent mass as sibling probability moves from 0.65 to 0.70.
This is a plausible driver of the parent-instead-of-leaf error
cluster (22-25% of error budget in evaluation).
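Why a hard threshold behaves like a cliff can be seen in a simplified model of the upward walk. The sketch below is an assumption-laden illustration, not the _significant_subtree implementation: it treats concentration as the summed leaf probability under a node, uses an invented three-leaf tree, and is not expected to reproduce the Delta = 0.203 figure.

```python
# Hypothetical toy tree: child -> parent.
PARENT = {"a1": "A", "a2": "A", "b1": "B", "A": "root", "B": "root", "root": None}


def descendant_leaves(node):
    """Leaves under a node (a leaf is its own descendant set)."""
    children = [c for c, p in PARENT.items() if p == node]
    if not children:
        return {node}
    leaves = set()
    for child in children:
        leaves |= descendant_leaves(child)
    return leaves


def most_specific_ancestor(leaf_probs, threshold=0.5):
    """Walk up from the top-1 leaf until the subtree holds >= threshold mass."""
    node = max(leaf_probs, key=leaf_probs.get)
    while node is not None:
        share = sum(leaf_probs.get(leaf, 0.0) for leaf in descendant_leaves(node))
        if share >= threshold:
            return node
        node = PARENT[node]
    return None


just_above = most_specific_ancestor({"a1": 0.51, "a2": 0.39, "b1": 0.10})
just_below = most_specific_ancestor({"a1": 0.49, "a2": 0.41, "b1": 0.10})
```

A two-point shift in the top leaf's probability flips the selected focal element from the leaf a1 to its parent A, which is the kind of discontinuity a smooth promotion rule would avoid.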
Internal-node top-1 switch is the largest discontinuity. The transition from leaf-dominant to internal-node-dominant top-1 prediction produces Delta_mass = 0.57 at the crossover point. This is a high-volatility regime where late-interaction positive-weight calibration is critical.
Anti-example negative channel is a tie-breaker, not a primary driver. At beta = 0.30, the negative channel’s effect on parent mass is approximately Delta = 0.0015 under full negative evidence. The channel requires positive-channel support to produce meaningful rank changes.
Leaf dominance is preserved. Across all swept parameter ranges, the top-1 leaf’s mass never falls below the parent’s mass in realistic operating regimes. Parent focal-element mass is a disjunctive signal (contributing to plausibility, not belief) rather than a competing prediction.
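The distinction between belief and plausibility can be made concrete with the standard definitions. The frame and mass values below are invented; only the Bel and Pl formulas are standard Dempster-Shafer theory.

```python
def belief(mass, hypothesis):
    """Bel(H): total mass of focal elements contained in H."""
    return sum(w for fe, w in mass.items() if fe <= hypothesis)


def plausibility(mass, hypothesis):
    """Pl(H): total mass of focal elements that intersect H."""
    return sum(w for fe, w in mass.items() if fe & hypothesis)


# Hypothetical fused mass: leaf a1, its parent A = {a1, a2}, and Theta.
THETA = frozenset({"a1", "a2", "b1"})
A = frozenset({"a1", "a2"})
a1 = frozenset({"a1"})
mass = {a1: 0.5, A: 0.3, THETA: 0.2}

bel_a1 = belief(mass, a1)          # only the singleton counts
pl_a1 = plausibility(mass, a1)     # parent and Theta mass count too
gap_a1 = pl_a1 - bel_a1            # the Pl - Bel gap for this column
bel_A = belief(mass, A)            # leaf mass supports the parent
```

The 0.3 placed on the parent raises the plausibility of a1 without adding to its belief, so parent mass widens the Pl - Bel gap instead of competing with the leaf as a rival prediction.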
5. BDD Scenario Expansion
The sprint added hierarchical anti-subtree BDD scenarios (P3.9-P3.11) testing the channel-decomposed Dempster combination on an abstract taxonomy fixture:
- Anti-example on internal node allocates to descendant complement Theta \ D(n), correctly removing the entire subtree rather than just the node.
- Anti-example on leaf preserves singleton complement semantics (regression guard).
- Channel conflict K surfaces contradiction when both channels fire strongly on the same node, materializing K as a diagnostic rather than silently cancelling.
- Internal-node tag is a first-class prediction target with mass landing directly on the node’s descendant-set focal element.
Additional DST boundary-condition scenarios (P3.10) test operator-observable failure modes: uniform evidence, vacuous sources, and single-source dominance. The generic-vs-specific-same-depth scenario (P3.11) validates that sibling discrimination at equal depth is structurally sound.
6. Operational Improvements
Cautious review disabled (empirically validated as harmful).
Run ce4f3777 against 920 reference columns showed a reroute miss
rate of 76.1%, a backoff miss rate of 78.8%, and a net accuracy loss
of 13.6 percentage points vs. LLM-only. The cascade: a second LLM
call on high-conflict evidence returns degraded evidence, which
produces high K and low belief, which sends columns to review in
bulk and damages them in bulk. Cautious review is now disabled by
default, with bel_threshold = 0.0 (unreachable) as a
belt-and-suspenders guard.
Enrichment verifier relaxation. The prototype_values_match_patterns check was relaxed from a 100% to a 50% match threshold.
Categories with diverse free-text values (marketplace names,
descriptive labels) legitimately produce prototypes that do not fit
a single regex family. The prior strict threshold caused false
rejections and forced manual bypass.
Enrichment prompt feedback key fix. The retry prompt read
verifier_feedback.get("failed_checks") but the verifier report
writes "details". Retry prompts had empty diagnostic information;
the LLM was asked to fix failures it could not see.
Bootstrap-environment and curate-agent-mediated skills. Two
Claude Agent SDK skills were added to .claude/commands/: a unified
enrichment + curation + SVM skill (6-phase back-pressure rubric,
resume-safe persistence) and a targeted per-table curation skill.
Late-interaction bridge self-supplies embedder. The ColBERT encoder is now initialized by the bridge itself rather than requiring the caller to pass one, fixing a CAI venv import ordering issue.
Commit Log
| Hash | Summary |
|---|---|
| a505953 | R7-R10 audit remediations + bundled R1-R6 + UI / config |
| 6010e94 | Cite canonical CCO IRIs alongside shorthand labels |
| baafa5f | SOTAB v2 coverage strategy + Aegir handoff |
| 70ec5b5 | P1 storage foundation for late-interaction cosine |
| 6716935 | P2 LLM-mediated annotation enrichment pipeline |
| b5e97ea | P3 late-interaction cosine integration (default off) |
| 8a9e771 | P3.5 default-on + loud-fallback for late-interaction cosine |
| 142b91e | P3.6 channel-decomposed positive/negative Dempster combination |
| 8faf242 | Academic-grade DST Reborn brief |
| c324fbe | P3.7 SHAP per-decision attribution surface for late-interaction cosine |
| ed57fd1 | P3.8 hierarchical integrity – internal-node tags as first-class |
| 28e7273 | P3.9 hierarchical anti-subtree carve-out – abstract taxonomy fixture |
| 519a1c9 | P3.10 DST boundary-condition scenarios |
| a5652db | P3.11 generic-vs-specific-same-depth scenario |
| 77b41d8 | P3.12 DST numeric sensitivity study + findings |
| 2fb7377 | P3.13 hierarchical-aggregation interaction battery |
| f155e89 | P7 subsumption alignment + P5 frontier cleanup + P4 enrichment infra |
| 3d6696f | Stage A DST sensitivity visibility instrumentation + Stage B script |
| 1fee0be | top1_margin disjoint-FE traversal – Stage A regression |
| 929e29e | Late-interaction bridge self-supplies embedder + CAI venv fix |
| 7a1e4e7 | Bootstrap-environment + curate-agent-mediated skills |
| 1df1383 | Training-time NHSVM via Structured Shared Frobenius Norm |
References
- Choi, Sasaki, and Srebro. 2015. “Normalized Hierarchical SVM.” arXiv:1508.02479.
- Denoeux, Thierry. 2008. “Conjunctive and disjunctive combination of belief functions induced by nondistinct bodies of evidence.” Artificial Intelligence 172(2-3): 234-264.
- Haenni, Rolf and Stephan Hartmann. 2006. “Modeling partially reliable information sources.” Studia Logica 82(1): 103-133.
- Khoo, Omar, and Steedman. 2006. “An Information Retrieval approach to short text classification.” EMNLP 2006.
- Saad, Yousef. 2003. Iterative Methods for Sparse Linear Systems. 2nd ed. SIAM.
- Shafer, Glenn. 1976. A Mathematical Theory of Evidence. Princeton University Press.
- Smets, Philippe and Robert Kennes. 1994. “The Transferable Belief Model.” Artificial Intelligence 66(2): 191-234.