Stage 1: Ontology Extraction

The first stage transforms curated educational text into formal ontological structure. A large language model reads natural language passages and produces BFO-grounded ontology fragments – typed entity hierarchies with properties, relationships, and axioms that can be mechanically projected into database schemas.

Input: FineWeb PDFs Edu

The source corpus is FineWeb-Edu, a curated subset of Common Crawl filtered for educational content by a classifier trained on LLaMA-3-70B-Instruct quality annotations. Key properties:

  • 1.3 trillion tokens of curated, high-quality educational text
  • Spans every domain: medicine, law, finance, engineering, social sciences, natural sciences
  • Already deduplicated and quality-filtered – no need for additional curation
  • PDF-extracted passages preserve document structure (headings, tables, lists)

Each passage is a self-contained description of some real-world domain – exactly the kind of text that contains implicit ontological structure waiting to be made explicit.

Extraction Process

The extraction uses structured prompting with a three-phase approach:

  1. Domain identification: Classify the passage into one or more information domains (healthcare, finance, logistics, etc.) to select domain-appropriate extraction templates.

  2. Entity extraction: Identify entity types, their properties, and inter-entity relationships. The prompt constrains outputs to BFO-compatible categories.

  3. BFO alignment: Map each extracted entity to the appropriate BFO upper-level category, ensuring the fragment inherits BFO’s formal axioms.
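The three phases can be read as a chain of prompts in which each phase consumes the previous phase's output. The following is a minimal sketch under stated assumptions: the prompt templates, the `extract_fragment` function, and the `llm` callable are hypothetical placeholders, not the pipeline's actual prompts or client.

```python
from typing import Callable

# Hypothetical prompt templates; the real extraction templates are not shown here.
DOMAIN_PROMPT = "List the information domains covered by this passage:\n{passage}"
ENTITY_PROMPT = (
    "Domains: {domains}\n"
    "Extract entity types, their properties, and inter-entity relationships, "
    "using only BFO-compatible categories:\n{passage}"
)
ALIGN_PROMPT = "Map each entity to a BFO upper-level category:\n{entities}"


def extract_fragment(passage: str, llm: Callable[[str], str]) -> dict:
    """Run the three phases in order, feeding each result into the next."""
    # Phase 1: domain identification
    domains = llm(DOMAIN_PROMPT.format(passage=passage))
    # Phase 2: entity extraction, conditioned on the identified domains
    entities = llm(ENTITY_PROMPT.format(domains=domains, passage=passage))
    # Phase 3: BFO alignment of the extracted entities
    alignment = llm(ALIGN_PROMPT.format(entities=entities))
    return {"domains": domains, "entities": entities, "alignment": alignment}
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client and makes the phase ordering easy to test with a stub.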

Validation Gate

Not every LLM output is usable. A validation gate checks three properties:

  • Syntactic: Does the output parse as valid OWL/RDF?
  • BFO alignment: Is every class properly subsumed by a BFO category?
  • Coherence: Are there contradictory axioms or dangling references?

Fragments that fail validation are discarded or re-prompted. In practice, structured output modes (JSON schema enforcement) in GLM-4.7/GLM-5 achieve >90% first-pass validation rates.
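As an illustration, the gate can be sketched over the structured JSON form of a fragment. This is a simplified sketch, not the pipeline's validator: the field names (`classes`, `relations`, `source`, `target`) are assumed, JSON parsing stands in for OWL/RDF parsing, and the coherence check covers only dangling references, since detecting contradictory axioms generally requires a reasoner.

```python
import json

# BFO categories used in this chapter, keyed by identifier.
BFO_CATEGORIES = {
    "BFO:0000015": "Process",
    "BFO:0000019": "Quality",
    "BFO:0000020": "Specifically Dependent Continuant",
    "BFO:0000023": "Role",
    "BFO:0000030": "Object",
    "BFO:0000031": "Generically Dependent Continuant",
}


def validate_fragment(raw: str) -> list:
    """Return a list of problems; an empty list means the fragment passes."""
    # Syntactic check (JSON parsing stands in for OWL/RDF parsing here).
    try:
        frag = json.loads(raw)
    except json.JSONDecodeError:
        return ["syntactic: output does not parse"]
    if not isinstance(frag, dict):
        return ["syntactic: top-level value is not an object"]
    classes = frag.get("classes")      # assumed shape: {class name: BFO identifier}
    relations = frag.get("relations")  # assumed shape: [{"source": ..., "target": ...}]
    if not isinstance(classes, dict) or not isinstance(relations, list):
        return ["syntactic: missing 'classes' or 'relations'"]

    problems = []
    # BFO alignment check: every class must map to a known BFO category.
    for name, bfo_id in classes.items():
        if bfo_id not in BFO_CATEGORIES:
            problems.append(f"alignment: {name} is not mapped to a BFO category")
    # Coherence check (partial): relations must not refer to undeclared classes.
    for rel in relations:
        for end in (rel["source"], rel["target"]):
            if end not in classes:
                problems.append(f"coherence: relation refers to undeclared class {end}")
    return problems
```

A caller would discard or re-prompt whenever the returned list is non-empty.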

Why BFO

Basic Formal Ontology (ISO/IEC 21838-2:2021) is the most widely adopted upper ontology in applied information science:

  • 700+ ontology projects built on BFO across government, defense, healthcare, and industry
  • Common Core Ontologies (CCO) used by the U.S. Department of Defense and Intelligence Community
  • OBO Foundry biomedical ontologies (Gene Ontology, ChEBI, etc.) all align to BFO
  • Formal first-order logic axiomatization ensures machine-verifiable consistency

BFO provides the upper-level categories that give our extracted ontologies a shared formal backbone. Without this grounding, extracted ontologies would be ad-hoc entity lists with no guaranteed interoperability or logical structure.

BFO Categories for Information Systems

The BFO categories most relevant to relational data modeling:

| BFO Category | IRI | Maps To | Example |
|---|---|---|---|
| Generically Dependent Continuant | BFO:0000031 | InformationEntity | A patient record, a diagnosis code |
| Object | BFO:0000030 | Concrete entity | A patient, a medical device |
| Quality | BFO:0000019 | Data attribute | Acuity level, sensitivity classification |
| Role | BFO:0000023 | Functional role | Data subject, provider, auditor |
| Process | BFO:0000015 | Temporal event | An encounter, a transaction, a review |
| Specifically Dependent Continuant | BFO:0000020 | Inherent property | A patient’s blood type, a device’s serial number |

These categories constrain which kinds of entities can participate in which kinds of relationships: a Patient (Object) can bear a DataSubjectRole (Role), an Encounter (Process) can have a Patient (Object) as a participant, and so on. These constraints propagate through the pipeline, where they determine which foreign key relationships are valid in the generated schemas.
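One way to picture how category constraints filter relationships is a lookup of allowed (source category, relation, target category) triples. The sketch below encodes only the two examples given in the text. The relation names `bearer_of` and `has_participant`, the `ALLOWED` table, and the helper function are illustrative assumptions, and a real rule set would be much larger.

```python
# Allowed (source category, relation, target category) triples.
# Only the two examples from the text are encoded; this is not a full BFO rule set.
ALLOWED = {
    ("Object", "bearer_of", "Role"),
    ("Process", "has_participant", "Object"),
}


def is_valid_relation(alignment: dict, source: str, relation: str, target: str) -> bool:
    """Check a proposed relation against the category-level constraints."""
    return (alignment[source], relation, alignment[target]) in ALLOWED


# Hypothetical alignment of three classes to BFO categories.
alignment = {
    "Patient": "Object",
    "DataSubjectRole": "Role",
    "Encounter": "Process",
}

# A Process may have an Object as participant, so this could become a foreign key.
print(is_valid_relation(alignment, "Encounter", "has_participant", "Patient"))
# A Role is not a Process, so this relation would be rejected.
print(is_valid_relation(alignment, "DataSubjectRole", "has_participant", "Patient"))
```

Only relations that pass such a check would be candidates for foreign keys in the projected schema.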

Formal Definition

An ontology fragment is a tuple:

\[ O = (C, R, A, \iota) \]

where:

  • \(C = \{c_1, \ldots, c_n\}\) is a set of classes (entity types), each with a set of properties \(P(c_i) = \{p_1, \ldots, p_k\}\)
  • \(R = \{r_1, \ldots, r_m\}\) is a set of relations between classes, each \(r_j: c_a \to c_b\) with cardinality constraints
  • \(A\) is a set of axioms – subsumption (\(c_i \sqsubseteq c_j\)), disjointness (\(c_i \sqcap c_j \sqsubseteq \bot\)), and property constraints (domain, range, cardinality)
  • \(\iota: C \to \text{BFO}\) is the BFO alignment mapping that assigns each class to a BFO upper-level category

The alignment mapping \(\iota\) must satisfy BFO’s axioms: if \(\iota(c_i) = \text{BFO:Process}\), then \(c_i\) inherits Process axioms (has temporal extent, can have participants, etc.). This is not merely a label – it constrains the valid relationships and properties that \(c_i\) can participate in.
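The tuple maps naturally onto a small data structure. The following is a sketch of one possible in-memory representation. The class and field names are assumptions made for illustration, and the only check implemented is that \(\iota\) is defined on every class in \(C\), not that BFO's axioms are satisfied.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Relation:
    """One r_j: c_a -> c_b with cardinality constraints."""
    name: str
    source: str                      # c_a
    target: str                      # c_b
    min_card: int = 0
    max_card: Optional[int] = None   # None means unbounded


@dataclass
class OntologyFragment:
    """O = (C, R, A, iota)."""
    classes: Dict[str, Set[str]]        # C, with P(c_i) as the value
    relations: List[Relation]           # R
    axioms: List[Tuple[str, str, str]]  # A, e.g. ("subClassOf", "Nurse", "Person")
    iota: Dict[str, str]                # class name -> BFO category

    def alignment_is_total(self) -> bool:
        """iota is a function on C, so every class needs exactly one category."""
        return set(self.iota) == set(self.classes)


fragment = OntologyFragment(
    classes={"Patient": {"name", "birth_date"}, "Encounter": {"start_time"}},
    relations=[Relation("has_participant", "Encounter", "Patient", min_card=1)],
    axioms=[("disjointWith", "Patient", "Encounter")],
    iota={"Patient": "BFO:0000030", "Encounter": "BFO:0000015"},
)
```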

Output

Each successfully validated ontology fragment becomes input to Stage 2: Schema Projection. A single text passage typically yields 1–5 fragments, depending on the complexity and domain diversity of the passage content.

The ontology fragments are serialized as OWL/RDF for archival and as structured JSON for downstream processing. Both representations preserve the full BFO alignment mapping, enabling validation at every subsequent stage.
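To make the two representations concrete, the sketch below serializes just the alignment mapping in both forms. The JSON key `bfo_alignment`, the `http://example.org/fragment#` namespace, and the function names are hypothetical. The Turtle output assumes class names are valid Turtle local names and uses the OBO PURL form of BFO identifiers.

```python
import json

OBO = "http://purl.obolibrary.org/obo/"


def to_json(iota: dict) -> str:
    """Structured JSON form for downstream processing (key name is assumed)."""
    return json.dumps({"bfo_alignment": iota}, sort_keys=True)


def to_turtle(iota: dict, base: str = "http://example.org/fragment#") -> str:
    """OWL form in Turtle syntax: each class is declared as a subclass of its BFO category."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        f"@prefix obo: <{OBO}> .",
        f"@prefix : <{base}> .",
        "",
    ]
    for name, bfo_id in sorted(iota.items()):
        # BFO:0000030 becomes obo:BFO_0000030 in the OBO namespace.
        local = bfo_id.replace("BFO:", "BFO_")
        lines.append(f":{name} a owl:Class ; rdfs:subClassOf obo:{local} .")
    return "\n".join(lines)


iota = {"Patient": "BFO:0000030", "Encounter": "BFO:0000015"}
print(to_json(iota))
print(to_turtle(iota))
```

Because both outputs are derived from the same mapping, a later stage can re-check alignment from either form.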