Deployment: Unseen Ontology, Known Schema
Operating principle: out here we iterate on public benchmarks; in CAI we execute to a customer-specified objective. The customer brings an unseen ontology shaped like a known annotations schema; the system has to produce classifications + calibrated belief intervals against that ontology without prior calibration.
This document captures the deployment-time invariants that the classification pipeline must honor and names the assumptions baked into today’s code that would fail against a sufficiently unusual customer ontology. It also proposes a roadmap milestone (M11: Bring Your Own Vocabulary) that closes the remaining gaps, so that public-data iteration becomes a test surface for the execution path rather than a target in its own right.
Mode split — iteration vs. execution
| Dimension | Iteration mode (local / public data) | Execution mode (CAI deployment) |
|---|---|---|
| Vocabulary source | atelier-vocab.ttl (300 ICE leaves) + curated mappings + benchmark-specific class lists (SOTAB 82 Schema.org types, GitTables 122 DBpedia types, SemTab DBpedia hierarchy) | Customer’s default.annotations Hive table — opaque to us until run-time |
| Hierarchy depth | Known (5 levels for ICE, ~3 for DBpedia subset, varying for Schema.org) | Unknown — could be 1 (flat) or 8+ (deep regulatory taxonomy) |
| Hierarchy shape | Tree, single root | Tree assumed; multi-root forest, cycles, unbalanced subtrees all plausible |
| Validation labels | Curated reference (synth, meta-tagging UAT, GitTables CTA gold) | Often absent. Sometimes a small spot-check set; sometimes none. |
| Accuracy bar | Track records over time on published benchmarks | Customer-stated objective; calibration + sample review when no agent-mediated reference exists |
| BFO / CCO grounding | Available: we mapped 360 terms ourselves | Opportunistic: only if the customer’s ontology happens to carry a bfo_anchor / cco_class / schema_org_class / dbpedia_class column |
| Iteration latency | Tight (re-run with overlay tweaks; soak on devenv) | Wide (CAI session lifecycle; nautilus + overwatch loops the only mid-run feedback) |
The bridge between the two modes is structural: every iteration
target gets transformed into annotations-schema shape before the
pipeline sees it. SOTAB v2’s class list, GitTables’ 122 DBpedia
types, the OOTB sample’s 316 ICE leaves, the customer’s Hive table —
all four end up as a HierarchicalCategorySet built from a
list[ReferenceCategory] with parent_code edges, fed into the same
DST + agent + nautilus + overwatch stack. The execution path doesn’t
know or care which mode it’s running under.
The annotations schema contract (what’s stable)
load_annotations_from_hive reads SELECT * FROM default.annotations
and returns list[dict]. _normalize_annotations_row and
_build_category_set_from_records (both in
src/atelier/classify/taxonomy.py) translate that into a
HierarchicalCategorySet. The fields we already accept, in order of
preference per row:
| Field | Required | Purpose | Fallback when missing |
|---|---|---|---|
| code (or id / path / dot-path) | yes | Identity for tree navigation, DST focal element, Atlas type name | row dropped (we cannot classify into an unnamed term) |
| label (or display_name / name) | strongly preferred | Human surface in UI + LLM prompt | falls back to last component of code |
| parent_code | no | Explicit parent edge | derived from dot-path (A.B.C → parent = A.B) |
| description | no | LLM context, embedding text | empty |
| common_names (synonyms / aliases) | no | LLM expansion + embedding text | empty |
| notation | no | SKOS-style dot code (numeric or otherwise) | empty |
| abbrev | no | Mnemonic shortcode for leaves | empty |
| taxonomy | no | Namespace discriminator | "annotations" |
| sensitivity | no | Domain-specific classification metadata | absent |
The contract is structural, not semantic — we do not require any
particular set of root codes, depth, or ICE-trichotomy alignment. A
customer ontology rooted at LEGAL.PRIVILEGE.ATTORNEY_CLIENT is
structurally indistinguishable from one rooted at
ICE.SENSITIVE.PID.CONTACT.EMAIL from the algorithms’ point of view.
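The fallback rules in the table can be sketched as a small row normalizer. This is a minimal illustration under stated assumptions, not the code in _normalize_annotations_row: the alias tuples and the names first_present and normalize_row are hypothetical, and only a subset of the optional fields is shown.

```python
from typing import Optional

# Hypothetical alias tuples; the real candidate lists live in
# _normalize_annotations_row and may differ.
CODE_KEYS = ("code", "id", "path")
LABEL_KEYS = ("label", "display_name", "name")


def first_present(row: dict, keys: tuple) -> Optional[str]:
    """Return the first non-empty value among the candidate keys."""
    for key in keys:
        value = row.get(key)
        if value is not None and str(value).strip():
            return str(value).strip()
    return None


def normalize_row(row: dict) -> Optional[dict]:
    """Map one raw annotations row to the contract, or None if it has no code."""
    code = first_present(row, CODE_KEYS)
    if code is None:
        return None  # row dropped: we cannot classify into an unnamed term
    parent = first_present(row, ("parent_code",))
    if parent is None and "." in code:
        parent = code.rsplit(".", 1)[0]  # A.B.C -> parent = A.B
    # Label falls back to the last component of the code.
    label = first_present(row, LABEL_KEYS) or code.rsplit(".", 1)[-1]
    return {
        "code": code,
        "label": label,
        "parent_code": parent,
        "description": row.get("description") or "",
        "taxonomy": row.get("taxonomy") or "annotations",
    }


example = normalize_row({"id": "LEGAL.PRIVILEGE.ATTORNEY_CLIENT"})
```

Nothing in the sketch inspects the root or the depth of the code, which is the sense in which the contract is structural rather than semantic.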
What’s already ontology-agnostic
Most of the algorithmic surface from v0.4.0-rc1 operates on the hierarchy as a graph, not on ICE-specific anchors:
- Parent-aware DST frame (mass_functions.py): votes at any node; fold-up uses HierarchicalCategorySet.descendants(code).
- Hierarchical cosine mass (Shafer §3): distributes embedding similarity through the graph regardless of root identity.
- Cross-subtree cautious_code (Smets §6): least-commitment promotion finds the deepest common ancestor on whatever tree is loaded.
- Belief-path tracing (belief.py::belief_path): walks parent_code chains; doesn’t care about labels.
- Indep-tier revisit gate: fires on consensus disagreement, not on a code-pattern match.
- Atlas type graph export (HierarchicalCategorySet.atlas_type_graph): turns any tree into Atlas Classification typedefs with superTypes chains.
- Validation (validate_taxonomy): collision + duplicate detector catches structural problems before classify starts (cycles emerge as parent_code self-reference; multi-root surfaces as multiple parent-less entries).
- Cautious-Code Review (cautious_review.py): agent-mediated backoff is structural; the agent reasons about depth-vs-confidence on whatever tree it sees.
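To make the "graph, not anchors" point concrete, here is a minimal sketch of the deepest-common-ancestor lookup that cross-subtree backoff relies on. It covers only the structural lookup, not the least-commitment promotion rule itself, and the parent_of mapping and both function names are hypothetical stand-ins for whatever HierarchicalCategorySet exposes.

```python
from typing import Dict, List, Optional


def ancestors_inclusive(code: str, parent_of: Dict[str, Optional[str]]) -> List[str]:
    """Return [code, parent, grandparent, ...] by walking parent_code edges."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = code
    # The seen-set guards against a cyclic parent chain looping forever.
    while current is not None and current not in seen:
        chain.append(current)
        seen.add(current)
        current = parent_of.get(current)
    return chain


def deepest_common_ancestor(
    a: str, b: str, parent_of: Dict[str, Optional[str]]
) -> Optional[str]:
    """Deepest node shared by both ancestor chains, or None across roots."""
    chain_b = set(ancestors_inclusive(b, parent_of))
    for node in ancestors_inclusive(a, parent_of):  # deepest first
        if node in chain_b:
            return node
    return None


# A non-ICE tree: the lookup never looks at the root's name.
legal = {
    "LEGAL": None,
    "LEGAL.PRIVILEGE": "LEGAL",
    "LEGAL.PRIVILEGE.ATTORNEY_CLIENT": "LEGAL.PRIVILEGE",
    "LEGAL.PRIVILEGE.WORK_PRODUCT": "LEGAL.PRIVILEGE",
    "LEGAL.EXEMPTION": "LEGAL",
}
shared = deepest_common_ancestor(
    "LEGAL.PRIVILEGE.ATTORNEY_CLIENT", "LEGAL.EXEMPTION", legal
)
```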
The DST math doesn’t know it’s classifying PII. That’s a feature — it means the work we did on v0.4.0-rc1 transfers to deployment with zero algorithm changes.
What anticipates badly today
Five gaps that an unseen customer ontology will surface on first encounter:
1. Schema flexibility — column-name variants
_normalize_annotations_row matches a fixed set of column-name
candidates. Customers regularly bring annotation tables with names
like Class_Name, parent_class, category_definition,
sensitivity_tier, pii_category — none of which exactly match our
preferred field names. Today this falls through to silent drops or
empty fields.
Fix: extend the column-name normalization to be configurable and
fuzzy. Add a vocab_schema_map overlay setting that lets the
operator declare { "code": "Class_Name", "parent_code": "Parent" }
at run-time. Default behavior stays automatic via fuzzy matching on
common synonyms.
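A minimal sketch of how the overlay and the fuzzy default could combine, assuming the overlay maps canonical field name to customer column name as in the example above. The synonym table, the resolve_columns name, and the 0.8 cutoff are illustrative assumptions; only difflib.get_close_matches is a real standard-library call.

```python
import difflib
import re
from typing import Dict, List, Optional

# Hypothetical synonym table; a real one would be curated from customer tables.
CANONICAL_SYNONYMS = {
    "code": ["code", "id", "path", "class_name"],
    "label": ["label", "display_name", "name"],
    "parent_code": ["parent_code", "parent", "parent_class"],
    "description": ["description", "definition", "category_definition"],
}


def _squash(name: str) -> str:
    """Lowercase and strip separators so Class_Name and classname compare equal."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def resolve_columns(
    columns: List[str],
    vocab_schema_map: Optional[Dict[str, str]] = None,
    cutoff: float = 0.8,
) -> Dict[str, str]:
    """Return {canonical_field: customer_column}; operator overrides win."""
    resolved = dict(vocab_schema_map or {})
    claimed = set(resolved.values())
    for field, synonyms in CANONICAL_SYNONYMS.items():
        if field in resolved:
            continue  # operator declared this mapping explicitly
        candidates = {_squash(c): c for c in columns if c not in claimed}
        for synonym in synonyms:
            hits = difflib.get_close_matches(
                _squash(synonym), list(candidates), n=1, cutoff=cutoff
            )
            if hits:
                resolved[field] = candidates[hits[0]]
                claimed.add(candidates[hits[0]])
                break
    return resolved


mapping = resolve_columns(
    ["Class_Name", "Parent", "category_definition", "sensitivity_tier"],
    vocab_schema_map={"parent_code": "Parent"},
)
```

Fields that resolve to nothing are simply absent from the result, which gives the Status page something explicit to report instead of a silent drop.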
2. Hierarchy-shape resilience
Today’s calibration assumes our 5-level ICE depth. Discount defaults
(cosine 0.20, llm 0.15, SVM 0.55), gap_threshold 0.15, and
cautious_review bel_threshold 0.85 are all tuned against that. A
customer’s flat 50-class taxonomy doesn’t need cautious-code review
(no parents to back off to), and an 8-deep regulatory taxonomy
demands tighter cautious thresholds (more depth × more places to be
wrong).
Fix: depth-aware defaults. Compute hierarchy statistics at LOADING_VOCAB time (max depth, mean branching factor, leaf/internal ratio); apply scaled defaults if the operator hasn’t overridden them. Surface the stats in the Status page so the operator sees what they got.
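The statistics named in the fix are cheap to compute from parent edges alone. The sketch below is one possible shape, assuming an acyclic hierarchy whose parents all appear as rows; HierarchyStats and hierarchy_stats are hypothetical names, and the scaling rule for thresholds is deliberately left out because it has not been decided. The only policy encoded is the one stated above: a flat taxonomy has nothing to back off to.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class HierarchyStats:
    max_depth: int              # 1 for a flat taxonomy
    mean_branching: float       # mean child count over internal nodes
    leaf_internal_ratio: float  # inf when there are no internal nodes

    @property
    def needs_cautious_review(self) -> bool:
        # A flat taxonomy has no parents to back off to.
        return self.max_depth > 1


def hierarchy_stats(parent_of: Dict[str, Optional[str]]) -> HierarchyStats:
    """Compute shape statistics; assumes parent_of is acyclic."""
    children = defaultdict(list)
    for code, parent in parent_of.items():
        if parent is not None:
            children[parent].append(code)

    def depth(code: str) -> int:
        levels = 1
        while parent_of.get(code) is not None:
            code = parent_of[code]
            levels += 1
        return levels

    internal = [c for c in parent_of if c in children]
    leaves = [c for c in parent_of if c not in children]
    return HierarchyStats(
        max_depth=max((depth(c) for c in parent_of), default=0),
        mean_branching=(
            sum(len(children[c]) for c in internal) / len(internal)
            if internal else 0.0
        ),
        leaf_internal_ratio=(
            len(leaves) / len(internal) if internal else float("inf")
        ),
    )


stats = hierarchy_stats({"A": None, "A.B": "A", "A.C": "A", "A.B.D": "A.B"})
```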
3. Multi-root and cycle handling
Single-root tree is assumed in descendants / ancestors traversal.
Customer dumps can have multiple top-level concepts (a forest), or —
rarely but consequentially — a cycle introduced by data entry error.
Today: cycles cause infinite recursion in descendants; multiple
roots silently work because the traversal is parent-anchored, but
pre-classification tooling (Atlas export, vocabulary stats UI) breaks.
Fix: explicit multi-root support in
HierarchicalCategorySet. Cycle detection + clear error in
validate_taxonomy with the offending edge identified. Both behind a
feature flag so pathological customer data fails fast rather than
hangs.
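A minimal sketch of the two checks, operating on a plain code-to-parent mapping rather than on HierarchicalCategorySet. The function names are hypothetical. The cycle check walks each parent chain iteratively, so a pathological table returns an offending edge instead of recursing without bound.

```python
from typing import Dict, List, Optional, Tuple


def find_roots(parent_of: Dict[str, Optional[str]]) -> List[str]:
    """All parent-less codes; more than one means the vocabulary is a forest."""
    return sorted(code for code, parent in parent_of.items() if parent is None)


def find_cycle_edge(
    parent_of: Dict[str, Optional[str]]
) -> Optional[Tuple[str, str]]:
    """Return a (child, parent) edge that closes a cycle, or None if acyclic."""
    cleared = set()  # nodes already proven to reach a root
    for start in parent_of:
        path: List[str] = []
        on_path = set()
        node: Optional[str] = start
        while node is not None and node not in cleared:
            if node in on_path:
                # path[-1] names `node` as its parent, closing the loop.
                return (path[-1], node)
            path.append(node)
            on_path.add(node)
            node = parent_of.get(node)
        cleared.update(path)
    return None


forest = {"LEGAL": None, "HIPAA": None, "HIPAA.PHI": "HIPAA"}
broken = {"A": "B", "B": "C", "C": "A"}
roots = find_roots(forest)
bad_edge = find_cycle_edge(broken)
```

Reporting the edge rather than only a boolean is what lets validate_taxonomy name the row the customer needs to fix.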
4. Opportunistic CCO/BFO grounding
Customer ontologies that overlap with Schema.org / DBpedia / BFO /
CCO carry that overlap as metadata columns (e.g.,
schema_org_class, bfo_anchor, cco_class). Today we ignore
these. Wiring them lets us:
- Auto-validate the customer’s hierarchy against a known reference (warn on inconsistent BFO anchoring; e.g., a node mapped to cco:DesignativeICE whose children include a cco:Agent).
- Reuse our 360-term mapping for embedding-text enrichment (a customer term mapped to schema:Person borrows the full description from the Schema.org corpus).
- Bridge to Atlas BFO classifications when the customer’s governance team is ahead of theirs (Cloudera Atlas now ships BFO alignment as of mid-2025).
Fix: optional bfo_anchor / cco_class / schema_org_class /
dbpedia_class columns in the annotations contract; when present,
the loader populates them on ReferenceCategory and the embedding +
LLM-prompt builders consume them. When absent, no behavior change.
5. Accuracy reporting without an agent-mediated reference
The customer often has no per-column gold-standard labels. Our
v0.4.0-rc1 evaluation pipeline assumes a curated_reference table
(or per-row reference_code field). When the customer doesn’t
provide one:
What we have: belief-gap distribution, mean K, cautious-code depth distribution, cross-source agreement counts, reasoning-trace attribution analyzer. These are calibration metrics, not accuracy.
What we need: a deployment-mode evaluation report that’s honest about the absence of an agent-mediated reference. Three-tier report:
- Internal consistency — DST K stats, belief-gap distribution, contraction rate. Always available. Tells the operator the pipeline converged.
- Sample review workflow — eject N highest-uncertainty columns and N highest-confidence columns to the UI for human spot-check. The operator’s accept/reject decisions feed an ad-hoc curated reference that grows over time. This is essentially what UAT reviewers were doing manually; we can formalize it.
- Public-benchmark proxy — when the customer’s ontology overlaps with SOTAB / GitTables / SemTab through opportunistic CCO grounding (gap #4), accuracy on the public benchmark serves as a conditional-confidence floor.
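The selection step of the sample review workflow can be sketched as follows. This assumes each column result carries a scalar belief_gap where a small gap means high uncertainty; that field name, the record shape, and select_review_sample are illustrative assumptions rather than the pipeline’s actual output format.

```python
from typing import Dict, List


def select_review_sample(results: List[dict], n: int) -> Dict[str, List[str]]:
    """Pick the n most uncertain and n most confident columns for spot-check."""
    ranked = sorted(results, key=lambda r: r["belief_gap"])
    uncertain = ranked[:n]
    # Start the confident slice at or after index n so that a column is
    # never placed in both buckets when the result set is small.
    confident = ranked[max(n, len(ranked) - n):]
    return {
        "highest_uncertainty": [r["column"] for r in uncertain],
        "highest_confidence": [r["column"] for r in reversed(confident)],
    }


results = [
    {"column": "a", "belief_gap": 0.9},
    {"column": "b", "belief_gap": 0.1},
    {"column": "c", "belief_gap": 0.5},
    {"column": "d", "belief_gap": 0.3},
    {"column": "e", "belief_gap": 0.7},
]
sample = select_review_sample(results, n=2)
```

Reviewing both ends matters: the uncertain bucket shows where the pipeline struggles, while the confident bucket checks whether high belief is actually earned.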
Public-data iteration as test surface
The principle: every public benchmark we adopt becomes a deployment-shape simulator, not a one-off integration. Concretely:
- SOTAB v2: wire as a classify source by transforming the 82 Schema.org type list into a HierarchicalCategorySet-shaped annotations table. The Schema.org type tree provides parent_code edges; our existing atelier-vocab.ttl mappings (schema:Person → cco:Agent, etc.) opportunistically populate the BFO/CCO anchors on the resulting ReferenceCategory rows. Pipeline runs against SOTAB tables exactly as it would against a customer Hive corpus.
- GitTables: same treatment. The 122 DBpedia types become a flat (or DBpedia-hierarchy-enriched) annotations table. Our 15 already-mapped DBpedia → CCO bridges populate anchors where they exist; the other 107 stay un-anchored (correct behavior for opportunistic grounding).
- SemTab annual: register the system, produce the annotations table from each year’s vocabulary release, evaluate against the cscore metric (which natively rewards our cautious_code).
- Customer schema simulators: synthetic annotation tables that test specific deployment shapes: flat 50-class taxonomy (legal exemption codes), 8-deep regulatory tree (HIPAA subcategories), forest with 3 roots (multi-domain governance). These exercise the M11 shape-resilience work without needing real customer data.
Each iteration target ships as a data_sources row + a loader (one
function each) + an annotations table built from the benchmark’s
class list. None of them need pipeline-side knowledge.
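As a rough illustration of what "an annotations table built from the benchmark’s class list" could look like, the sketch below turns a class-to-parent mapping into annotations-shaped rows and fills the optional cco_class column only where a mapping exists. The function name, the three-class sample, and the taxonomy value are hypothetical; the schema:Person → cco:Agent bridge is the one mentioned above.

```python
from typing import Dict, List, Optional


def class_list_to_annotations(
    parents: Dict[str, Optional[str]],
    taxonomy: str,
    cco_anchors: Optional[Dict[str, str]] = None,
) -> List[dict]:
    """Build annotations-shaped rows from a benchmark's class hierarchy."""
    cco_anchors = cco_anchors or {}
    rows = []
    for code in sorted(parents):
        row = {
            "code": code,
            "label": code.split(":")[-1],  # drop the namespace prefix
            "parent_code": parents[code],
            "taxonomy": taxonomy,
        }
        if code in cco_anchors:
            # Opportunistic: unmapped classes simply carry no anchor.
            row["cco_class"] = cco_anchors[code]
        rows.append(row)
    return rows


rows = class_list_to_annotations(
    parents={
        "schema:Thing": None,
        "schema:Person": "schema:Thing",
        "schema:Place": "schema:Thing",
    },
    taxonomy="sotab",
    cco_anchors={"schema:Person": "cco:Agent"},
)
```

The output is a list[dict] in the same shape a Hive loader would return, so nothing downstream needs to know the rows came from a benchmark.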
M11 — Bring Your Own Vocabulary (proposed)
A milestone that delivers ontology-agnostic execution with the five gaps above closed:
- Configurable vocab schema mapping: overlay setting + fuzzy default; surfaces in Status when applied.
- Depth-aware default calibration: compute hierarchy stats at load; scale gap_threshold + cautious bel_threshold.
- Multi-root + cycle support: explicit; behind feature flags that fail loudly when violated.
- Opportunistic anchor columns: bfo_anchor / cco_class / schema_org_class / dbpedia_class consumed when present.
- Three-tier deployment evaluation report: internal consistency / sample review workflow / public-benchmark proxy.
- SOTAB v2 + GitTables wired as test sources: proof that the same execution path handles three published benchmarks plus the customer’s Hive table without code changes per target.
The work is concrete and bounded — roughly two focused sessions (taxonomy.py + pipeline.py extensions; one loader + one fixture test per benchmark). Stronger leverage than a feature-by-feature roadmap because every fix lands on the existing structural abstraction rather than introducing new mechanisms.
Out of scope (deferred)
- Cross-customer ontology learning. Two customers with similar regulatory domains might benefit from shared inferences; we explicitly do not transfer learning across deployments. Each customer’s session is a closed world.
- Customer-driven hierarchy editing. The annotations table is a contract the customer controls upstream of Atelier. We don’t ship UI for editing it.
- Ontology auto-discovery. Inferring a hierarchy from unannotated data tables (clustering plus LLM proposes a tree) is a research direction in its own right; out of scope for M11.
Open questions
- What if the customer brings two annotations tables? A primary domain vocabulary (hipaa.annotations) and a generic PII overlay (atelier.annotations). Today’s pipeline takes one. M11 should consider compose_vocabularies in the loader path, but it changes the meaning of “the customer’s ontology”; this needs a design conversation.
- Embedding-model robustness across languages. A German / Japanese / Mandarin annotations table will produce shorter embedding-text and weaker cosine signal at MiniLM-L6 scale. Bigger embedding models (BGE-large, E5-mistral) help but inflate per-run cost. Defer to a separate i18n milestone.
- Atlas BFO sync. Cloudera’s Atlas team is shipping BFO alignment. Once stable, our atelier-vocab.ttl ↔ Atlas classification typedef mapping should round-trip without loss (we ship to Atlas; Atlas hands back BFO-anchored entities; we read them as opportunistic anchors per gap #4). Wait for Atlas BFO general availability before wiring.
Cross-references
- Classification Pipeline: the execution path being made ontology-agnostic.
- DST Evidence Independence: the numerical-methods framing that already operates on arbitrary hierarchies.
- Pareto Capability Evolution: the longer-horizon search-space that builds on M11.
- src/atelier/classify/ontology/README.md: the BFO/CCO substrate that opportunistic anchoring lifts into.
- src/atelier/classify/taxonomy.py: _normalize_annotations_row, _build_category_set_from_records (the adapter layer).
- src/atelier/classify/sampler.py: load_annotations_from_hive, load_annotations_from_json, load_annotations_from_filesystem (the source-shape variants).