Ontology

Ægir is the canonical owner of the Signals Data Governance (SDG) ontology, the bespoke BFO 2020 / CCO-grounded ontology used by the metadata-tagging stack, and of the ontology-grounded synthetic-data pipeline that produces in-distribution pretraining bytes against it. The ontology, its rigor program, and the realized OWL artifact published outward all live in this chapter. The chapter covers what the ontology is now, how its classes function as the annotation vocabulary, the quantitative rigor metrics and the formal publish gate every extension must clear, and the disposal membranes that enforce rigor rather than assert it.

The ontology conditions everything downstream because it is the annotation vocabulary for Column Type Annotation (CTA) and Column Property Annotation (CPA) over wide relational tables. Its classes are not leaf terms but intermediate-depth subsumers: the property-bearing classes a heterogeneous-but-coherent column belongs to. Defining those classes well is building the annotation vocabulary, and the gates exist to keep every term a coherent, grounded annotation target. Keeping the ontology in the same project as the model is the only arrangement in which these decisions stay coherent.
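To make "intermediate-depth subsumer" concrete, the sketch below picks the most specific class that covers every cell type observed in a column. It is an illustration only: the hierarchy, the class names, and the helper functions are hypothetical placeholders, not terms or code from the SDG ontology.

```python
# Toy is-a hierarchy (child -> parent). All class names are hypothetical
# placeholders, not terms from the SDG ontology.
PARENT = {
    "Identifier": "InformationContentEntity",
    "PersonIdentifier": "Identifier",
    "NationalIdNumber": "PersonIdentifier",
    "EmployeeNumber": "PersonIdentifier",
    "DeviceSerialNumber": "Identifier",
}


def ancestors(cls):
    """Return [cls, parent, grandparent, ...] up to the root."""
    chain = [cls]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def deepest_common_subsumer(leaf_classes):
    """Most specific class that subsumes every class seen in the column."""
    leaf_classes = list(leaf_classes)
    if not leaf_classes:
        raise ValueError("need at least one class")
    common = set(ancestors(leaf_classes[0]))
    for cls in leaf_classes[1:]:
        common &= set(ancestors(cls))
    if not common:
        raise ValueError("classes share no subsumer")
    # The deepest shared ancestor has the longest chain to the root.
    return max(common, key=lambda c: len(ancestors(c)))


# A column mixing two kinds of person identifiers is heterogeneous at the
# leaf level but coherent one level up.
print(deepest_common_subsumer(["NationalIdNumber", "EmployeeNumber"]))
```

In this toy hierarchy the column is annotated with the intermediate class rather than with either leaf term, which is the role the paragraph above assigns to the ontology's classes.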

What the ontology is now

sdg-ontology is content-derived from FinePDFs (Qdrant/ColBERT MaxSim domain filtering over a SKOS index) and realized to a HermiT-validated OWL artifact at corpora/ontology/sdg-ontology.{omn,owl}, with a consistency certificate at corpora/ontology/HERMIT_CERTIFICATE.md. The seven family catalogs (src/aegir/ontology/catalog/01…07) are a seed and regression baseline. FinePDFs-derived intermediate classes accrete in 08_derived.json, and the live driver is the content-first derivation pipeline (scripts/derive_ontology.py, scripts/define_intermediate_classes.py), not a fixed template count.
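The MaxSim scoring named above can be sketched in a few lines. This is a pure-Python illustration of the general late-interaction formula under assumed inputs, not the code in scripts/derive_ontology.py: the embeddings, the threshold, and the function names are all hypothetical, and nothing here calls Qdrant or a ColBERT model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: for each query token embedding, take its best
    match among the document token embeddings, then sum those maxima."""
    if not query_vecs or not doc_vecs:
        return 0.0
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def filter_in_domain(docs, concept_vecs, threshold):
    """Keep ids of documents whose MaxSim against a concept meets the
    (hypothetical) threshold."""
    return [doc_id for doc_id, vecs in docs.items()
            if maxsim(concept_vecs, vecs) >= threshold]


# Made-up 2-d token embeddings standing in for one SKOS concept label
# and two candidate documents.
concept = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    "doc_a": [[1.0, 0.0], [0.0, 1.0]],  # matches both concept tokens
    "doc_b": [[0.5, 0.5]],              # weak match on each token
}
print(filter_in_domain(docs, concept, threshold=1.5))
```

Because each query token is matched independently, a document scores highly only when it covers every token of the concept, which is what makes the measure usable as a domain filter.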

The architecture is an agent-mediated propose / dispose feedback loop: an engine proposes axioms; a stack of deterministic membranes (parse → HermiT with CCO imported as a reasoning authority → OntoClean) disposes and returns the reason; the agent responds and refines. Rigor is enforced, not asserted. The Authors Guide is the canonical reference for every metric, band, gate, and membrane.
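The control flow of the propose / dispose loop can be sketched as follows. This is a minimal sketch under stated assumptions: the two membranes are string-matching stand-ins for the real parse and HermiT stages (OntoClean is omitted), and every name here (Verdict, dispose, propose_dispose, the stub checks, the toy proposer) is hypothetical rather than taken from the Ægir codebase.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Verdict(NamedTuple):
    accepted: bool
    membrane: Optional[str]  # name of the membrane that rejected, if any
    reason: str


# A membrane takes a candidate axiom and returns (ok, reason).
Membrane = Callable[[str], Tuple[bool, str]]


def dispose(axiom: str, membranes: List[Tuple[str, Membrane]]) -> Verdict:
    """Run membranes in order; stop at the first rejection and report why."""
    for name, check in membranes:
        ok, reason = check(axiom)
        if not ok:
            return Verdict(False, name, reason)
    return Verdict(True, None, "all membranes passed")


def propose_dispose(propose, membranes, max_rounds=3):
    """Feed each rejection reason back to the proposer until acceptance."""
    feedback = None
    history = []
    for _ in range(max_rounds):
        axiom = propose(feedback)
        verdict = dispose(axiom, membranes)
        history.append((axiom, verdict))
        if verdict.accepted:
            break
        feedback = verdict.reason
    return history


# Stand-in membranes: simple string checks, NOT a real parser or reasoner.
def parse_stub(axiom):
    ok = " SubClassOf " in axiom
    return ok, "ok" if ok else "not a SubClassOf axiom"


def reasoner_stub(axiom):
    ok = not axiom.endswith("owl:Nothing")
    return ok, "ok" if ok else "class is unsatisfiable"


def toy_proposer(feedback):
    """Hypothetical engine that refines its axiom from the last reason."""
    if feedback is None:
        return "Widget"
    if feedback == "not a SubClassOf axiom":
        return "Widget SubClassOf owl:Nothing"
    return "Widget SubClassOf Artifact"


stack = [("parse", parse_stub), ("reasoner", reasoner_stub)]
for axiom, verdict in propose_dispose(toy_proposer, stack):
    print(axiom, "->", verdict)
```

The design point the sketch preserves is that every membrane is deterministic and returns a reason, so the proposer always has something specific to respond to.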

Scope summary

Concern	Owner	Notes
SDG ontology IRIs + BFO/CCO grounding	Ægir	`src/aegir/ontology/catalog/*.json` → realized `corpora/ontology/sdg-ontology.{omn,owl}`
Content-first derivation (FinePDFs → classes)	Ægir	`scripts/derive_ontology.py`, `scripts/define_intermediate_classes.py`
Grounding-anchor retrieval (CCO + FHIR + accretive)	Ægir	`scripts/grounding_anchors.py`
Rigor metrology + OQuaRE publish gate	Ægir	`scripts/ontology_metrology.py`, `scripts/ontology_oquare.py`
Disposal membranes (parse / HermiT / OntoClean)	Ægir	`scripts/build_realized_ontology.py`, `src/aegir/ontology/ontoclean.py`
Ontology-grounded synthetic corpus + DDL spine	Ægir	`scripts/generate_chapter.py`, `scripts/verify_chapters.py`, `src/aegir/ontology/ddl.py`, `realize.py`
CTA / CPA dataset loaders	Ægir	`src/aegir/data/table_dataset.py`
Model training + evaluation	Ægir	`train.py`, `train_pretrain.py`, `AegirForColumnAnnotation`
Consumer-side use of the above	downstream projects	Outside Ægir’s design constraints

A separate sibling project (Atelier) consumes Ægir-produced artifacts as an independent pretraining-efficacy gate. Atelier’s own docs describe what it needs from this contract, but those docs are advisory input here, not specification.

Sub-pages

Authors Guide — metrics & quality gates — canonical: the full quantitative metric suite (IOF rigor dimensions, OntoQA/OQuaRE structural metrics, OntoClean proxies), the OQuaRE publish gate with its [1,5] bands and floors, the disposal membranes, and the pre-registered OQ-Rigor / OQ-Structure objectives, with the exact formulas the tooling enforces
Charter — Ægir’s internal direction-setter for the ontology scope: provenance discipline, the committed BFO/CCO branch structure, and external-standard anchors
Migration — authoring history for the initial bespoke vocabulary
Concept brief — RLVR for ontology generation — the design of the long-horizon Signals M4 apparatus: a four-component verifiable reward R(O, I) over OWL artifacts and a GRPO-trained, SAE-instrumented local policy targeting it. That reward is now realized as the deterministic membrane stack (HermiT/CCO, OntoClean, OQuaRE) that the agent-mediated propose/dispose loop — documented in the Authors Guide — is building and proving today
Semantic engine — authoritative reference — the operational-state description of the SDG ontology, the rigor program, and the closed-loop synthetic-data pipeline
RLVR for ontology generation — the externally-readable methodological chapter for the long-horizon M4 apparatus: the verifier R(O, I), now realized as the membrane stack, and the SAE-instrumented-Qwen policy that GRPO trains against it to autonomously generate ontology extensions
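As a rough picture of the publish gate described in the Authors Guide entry above, the sketch below checks banded scores in [1, 5] against per-metric floors. The metric names, floor values, and function name are hypothetical placeholders; the Authors Guide remains the reference for the actual metrics, bands, and formulas.

```python
def check_publish_gate(scores, floors):
    """Return (passed, failures). Every metric named in `floors` must be
    present in `scores`, lie in [1, 5], and be at or above its floor."""
    failures = []
    for metric, floor in floors.items():
        score = scores.get(metric)
        if score is None:
            failures.append(f"{metric}: missing score")
        elif not 1.0 <= score <= 5.0:
            failures.append(f"{metric}: score {score} outside [1, 5]")
        elif score < floor:
            failures.append(f"{metric}: {score} below floor {floor}")
    return (not failures, failures)


# Hypothetical metric names and floors, for illustration only.
floors = {"structural": 3.0, "maintainability": 3.5}
scores = {"structural": 4.2, "maintainability": 3.1}

passed, failures = check_publish_gate(scores, floors)
print(passed, failures)
```

Returning the list of failures, rather than a bare boolean, mirrors the membranes' habit of reporting the reason for a rejection.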

Ægir: Hierarchical Sequence Modeling with Dynamic Chunking
