Stage 1: Ontology Extraction

The first stage transforms curated educational text into formal ontological structure. A large language model reads natural language passages and produces BFO-grounded ontology fragments – typed entity hierarchies with properties, relationships, and axioms that can be mechanically projected into database schemas.

Input: FineWeb PDFs Edu

The source corpus is FineWeb-Edu, a curated subset of Common Crawl filtered for educational content by a quality classifier trained on LLaMA-3-70B-Instruct annotations. Key properties:

1.3 trillion tokens of curated, high-quality educational text
Spans every domain: medicine, law, finance, engineering, social sciences, natural sciences
Already deduplicated and quality-filtered – no need for additional curation
PDF-extracted passages preserve document structure (headings, tables, lists)

Each passage is a self-contained description of some real-world domain – exactly the kind of text that contains implicit ontological structure waiting to be made explicit.

Extraction Process

The extraction uses structured prompting with a three-phase approach:

Domain identification: Classify the passage into one or more information domains (healthcare, finance, logistics, etc.) to select domain-appropriate extraction templates.
Entity extraction: Identify entity types, their properties, and inter-entity relationships. The prompt constrains outputs to BFO-compatible categories.
BFO alignment: Map each extracted entity to the appropriate BFO upper-level category, ensuring the fragment inherits BFO’s formal axioms.
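
The three phases can be sketched as a small driver function. This is a minimal illustration, not the pipeline's actual code: the `llm` callable, the prompt wording, the JSON shapes, and the `BFO_IDS` lookup table are all assumptions made for the example, and the alignment phase is reduced to a table lookup where a real system might issue a third prompt.

```python
import json
from typing import Callable

# Assumed lookup from a coarse category label to a BFO identifier.
BFO_IDS = {
    "object": "BFO:0000030",
    "process": "BFO:0000015",
    "role": "BFO:0000023",
    "quality": "BFO:0000019",
}


def extract_fragment(passage: str, llm: Callable[[str], str]) -> dict:
    """Run the three phases in order; `llm` maps a prompt to a JSON string."""
    # Phase 1: domain identification.
    domains = json.loads(llm(f"PHASE 1: list the domains as a JSON array.\n\n{passage}"))

    # Phase 2: entity extraction, conditioned on the detected domains.
    entities = json.loads(
        llm(f"PHASE 2: domains={domains}. Return entities as JSON.\n\n{passage}")
    )

    # Phase 3: BFO alignment (simplified here to a lookup on the category label).
    classes, unaligned = [], []
    for entity in entities:
        bfo_id = BFO_IDS.get(entity["category"].lower())
        if bfo_id is None:
            unaligned.append(entity["name"])  # left for the validation gate
        else:
            classes.append({**entity, "bfo": bfo_id})
    return {"domains": domains, "classes": classes, "unaligned": unaligned}


def fake_llm(prompt: str) -> str:
    """Canned stand-in for a model call, used only to exercise the driver."""
    if prompt.startswith("PHASE 1"):
        return '["healthcare"]'
    return json.dumps([
        {"name": "Patient", "category": "Object", "properties": ["name"]},
        {"name": "Encounter", "category": "Process", "properties": ["start_time"]},
        {"name": "Mood", "category": "Feeling", "properties": []},
    ])


fragment = extract_fragment("A nurse records each patient encounter.", fake_llm)
```

Passing the model in as a callable keeps the phase ordering testable without network access; entities whose category has no BFO counterpart are reported rather than silently dropped.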

Validation Gate

Not every LLM output is usable. A validation gate checks three properties:

Syntactic: Does the output parse as valid OWL/RDF?
BFO alignment: Is every class properly subsumed by a BFO category?
Coherence: Are there contradictory axioms or dangling references?

Fragments that fail validation are discarded or re-prompted. In practice, structured output modes (JSON schema enforcement) in GLM-4.7/GLM-5 achieve >90% first-pass validation rates.
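
The three checks might look roughly like the following. This is a sketch under stated assumptions: it validates the structured JSON form rather than OWL/RDF, and the field names (`classes`, `relations`, `disjoint`) and the function names are hypothetical, not taken from the pipeline.

```python
import json

# BFO identifiers accepted as alignment targets in this sketch.
BFO_ROOTS = {
    "BFO:0000030", "BFO:0000015", "BFO:0000023",
    "BFO:0000019", "BFO:0000031", "BFO:0000020",
}


def ancestors(name, classes):
    """Follow parent links from `name`; stops at an unknown name or a cycle."""
    chain, current = [], name
    while current is not None and current not in chain:
        chain.append(current)
        current = classes.get(current)
    return chain


def validate(raw: str) -> list:
    """Return a list of problems; an empty list means the fragment passes."""
    # Syntactic: the output must parse and be a JSON object.
    try:
        frag = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"syntactic: {exc.msg}"]
    if not isinstance(frag, dict):
        return ["syntactic: top-level value is not an object"]

    problems = []
    classes = frag.get("classes", {})  # class name -> parent (class name or BFO id)

    # BFO alignment: every class must reach a BFO category via its parents.
    for name in classes:
        if not BFO_ROOTS.intersection(ancestors(name, classes)):
            problems.append(f"alignment: {name} is not subsumed by a BFO category")

    # Coherence: no dangling references in relations.
    for rel in frag.get("relations", []):
        for end in (rel["source"], rel["target"]):
            if end not in classes:
                problems.append(f"coherence: {rel['name']} references unknown class {end}")

    # Coherence: no class may descend from both members of a disjoint pair.
    for left, right in frag.get("disjoint", []):
        for name in classes:
            chain = ancestors(name, classes)
            if left in chain and right in chain:
                problems.append(f"coherence: {name} is under disjoint {left} and {right}")
    return problems


def first_valid(candidates):
    """Return the first output that passes the gate, or None if all are discarded."""
    for raw in candidates:
        if not validate(raw):
            return raw
    return None
```

Here `first_valid` models re-prompting as iterating over successive model outputs; returning `None` corresponds to discarding the passage.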

Why BFO

Basic Formal Ontology (ISO/IEC 21838-2:2021) is the most widely adopted upper ontology in applied information science:

700+ ontology projects built on BFO across government, defense, healthcare, and industry
Common Core Ontologies (CCO) used by the U.S. Department of Defense and Intelligence Community
OBO Foundry biomedical ontologies (Gene Ontology, ChEBI, etc.) all align to BFO
Formal first-order logic axiomatization ensures machine-verifiable consistency

BFO provides the upper-level categories that give our extracted ontologies a shared formal backbone. Without this grounding, extracted ontologies would be ad-hoc entity lists with no guaranteed interoperability or logical structure.

BFO Categories for Information Systems

The BFO categories most relevant to relational data modeling:

BFO Category	Identifier	Maps To	Example
Generically Dependent Continuant	`BFO:0000031`	InformationEntity	A patient record, a diagnosis code
Object	`BFO:0000030`	Concrete entity	A patient, a medical device
Quality	`BFO:0000019`	Data attribute	Acuity level, sensitivity classification
Role	`BFO:0000023`	Functional role	Data subject, provider, auditor
Process	`BFO:0000015`	Temporal event	An encounter, a transaction, a review
Specifically Dependent Continuant	`BFO:0000020`	Inherent property	A patient’s blood type, a device’s serial number

These categories constrain what kinds of entities can participate in what kinds of relationships – a Patient (Object) can bear a DataSubjectRole (Role), an Encounter (Process) has participant a Patient (Object), and so on. These constraints propagate through the pipeline: they determine which foreign key relationships are valid in the generated schemas.
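
As a rough illustration of how category constraints could gate relationships, the sketch below checks a relation against the BFO categories of its endpoints. The `ALLOWED` table is a deliberately reduced, assumed set of signatures covering only the examples in the text, and the names `ALLOWED`, `relation_is_valid`, and `alignment` are hypothetical.

```python
# Assumed relation signatures: relation name -> allowed (source, target) categories.
ALLOWED = {
    "bearer_of": {("Object", "Role"), ("Object", "Quality")},
    "has_participant": {("Process", "Object")},
}


def relation_is_valid(relation: str, source: str, target: str, alignment: dict) -> bool:
    """True if the endpoints' BFO categories match a signature for `relation`."""
    pair = (alignment[source], alignment[target])
    return pair in ALLOWED.get(relation, set())


# Example alignment for the classes mentioned above.
alignment = {
    "Patient": "Object",
    "DataSubjectRole": "Role",
    "Encounter": "Process",
}

ok = relation_is_valid("bearer_of", "Patient", "DataSubjectRole", alignment)
reversed_ok = relation_is_valid("has_participant", "Patient", "Encounter", alignment)
```

A schema generator built on this idea would emit a foreign key only for relations that pass the check, so a reversed `has_participant` from Patient to Encounter would be rejected before any table is created.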

Formal Definition

An ontology fragment is a tuple:

\[ O = (C, R, A, \iota) \]

where:

\(C = \{c_1, \ldots, c_n\}\) is a set of classes (entity types), each with a set of properties \(P(c_i) = \{p_1, \ldots, p_k\}\)
\(R = \{r_1, \ldots, r_m\}\) is a set of relations between classes, each \(r_j: c_a \to c_b\) with cardinality constraints
\(A\) is a set of axioms – subsumption (\(c_i \sqsubseteq c_j\)), disjointness (\(c_i \sqcap c_j \sqsubseteq \bot\)), and property constraints (domain, range, cardinality)
\(\iota: C \to \text{BFO}\) is the BFO alignment mapping that assigns each class to a BFO upper-level category

The alignment mapping \(\iota\) must be consistent with BFO’s axioms: if \(\iota(c_i) = \text{BFO:Process}\), then \(c_i\) inherits the Process axioms (has temporal extent, can have participants, etc.). This is not merely a label – it constrains which relations \(c_i\) can participate in and which properties it can carry.
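
One possible in-memory encoding of the tuple \(O = (C, R, A, \iota)\) is shown below. The class and field names are assumptions chosen for the example, and the well-formedness check covers only two structural conditions implied by the definition: \(\iota\) is defined on every class, and every relation connects classes in \(C\).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Relation:
    """An element r_j of R: a typed edge c_a -> c_b with cardinality bounds."""
    name: str
    source: str
    target: str
    min_card: int = 0
    max_card: Optional[int] = None  # None means unbounded


@dataclass
class OntologyFragment:
    classes: dict      # C with P(c_i): class name -> set of property names
    relations: list    # R: list of Relation
    axioms: list       # A: tuples such as ("subclass_of", "Inpatient", "Patient")
    alignment: dict    # iota: class name -> BFO identifier

    def is_well_formed(self) -> bool:
        """iota is total on C, and every relation endpoint is a class in C."""
        total = set(self.classes) <= set(self.alignment)
        closed = all(
            r.source in self.classes and r.target in self.classes
            for r in self.relations
        )
        return total and closed


fragment = OntologyFragment(
    classes={"Patient": {"name", "birth_date"}, "Encounter": {"start_time"}},
    relations=[Relation("has_participant", "Encounter", "Patient", min_card=1)],
    axioms=[("disjoint_with", "Patient", "Encounter")],
    alignment={"Patient": "BFO:0000030", "Encounter": "BFO:0000015"},
)
```

Keeping \(\iota\) as a separate mapping, rather than a field on each class, mirrors the tuple definition and makes the totality check a single set comparison.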

Output

Each successfully validated ontology fragment becomes input to Stage 2: Schema Projection. A single text passage typically yields 1–5 fragments, depending on the complexity and domain diversity of the passage content.

The ontology fragments are serialized as OWL/RDF for archival and as structured JSON for downstream processing. Both representations preserve the full BFO alignment mapping, enabling validation at every subsequent stage.
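
The dual serialization could be sketched as follows, using only the standard library. The `ex:` namespace, the function names, and the JSON layout are hypothetical; the Turtle output is assembled by hand for illustration, and it simplifies the alignment mapping to a direct `rdfs:subClassOf` link to the BFO class. A production pipeline would more likely use an RDF library.

```python
import json

OBO = "http://purl.obolibrary.org/obo/"
EX = "http://example.org/fragment#"  # hypothetical namespace for extracted classes


def to_json(alignment: dict) -> str:
    """Structured JSON form for downstream processing."""
    records = [{"name": name, "bfo": bfo} for name, bfo in sorted(alignment.items())]
    return json.dumps({"classes": records}, indent=2)


def to_turtle(alignment: dict) -> str:
    """Minimal Turtle form: each class is declared and placed under its BFO category."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix obo: <{OBO}> .",
        f"@prefix ex: <{EX}> .",
        "",
    ]
    for name, bfo in sorted(alignment.items()):
        local = bfo.replace(":", "_")  # BFO:0000015 -> BFO_0000015
        lines.append(f"ex:{name} a owl:Class ; rdfs:subClassOf obo:{local} .")
    return "\n".join(lines) + "\n"


alignment = {"Patient": "BFO:0000030", "Encounter": "BFO:0000015"}
as_json = to_json(alignment)
as_turtle = to_turtle(alignment)
```

Both outputs are derived from the same alignment dictionary, so the BFO identifier attached to each class survives in either representation.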
