End-to-End Example

This walkthrough traces a single educational text passage through the entire pretraining pipeline – from raw text to a validated training example. Every intermediate representation is shown concretely, making the abstract pipeline tangible.

Step 1: Input Text

A passage from a PDF about hospital emergency department workflows:

Emergency departments manage patient flow through a structured triage process. When a patient arrives, a triage nurse assesses their condition and assigns an acuity level using the Emergency Severity Index (ESI), ranging from 1 (resuscitation) to 5 (non-urgent). Each patient encounter records the presenting complaint, vital signs at triage, the assigned provider, and any diagnostic tests ordered.

Diagnoses are coded using ICD-10-CM, with a primary diagnosis and optional secondary diagnoses recorded per encounter. Medications prescribed during the encounter are tracked with the drug name, dosage, route of administration, and the prescribing provider. The encounter concludes with a disposition decision: admission, discharge, transfer, or observation.

This is a typical educational passage: clear, structured, and rich in implicit ontological content.

Step 2: Ontology Extraction

The LLM receives the passage with a structured extraction prompt and produces a BFO-grounded ontology fragment:

Classes (with BFO alignment):

Class	BFO Parent	Properties
`Patient`	`BFO:Object`	patient_id, date_of_birth, gender, address
`Encounter`	`BFO:Process`	encounter_id, encounter_date, presenting_complaint, disposition
`Provider`	`BFO:Role`	provider_id, name, specialty, license_number
`Diagnosis`	`BFO:GDC`	diagnosis_id, icd10_code, description, is_primary
`Medication`	`BFO:GDC`	medication_id, drug_name, dosage, route
`VitalSigns`	`BFO:Quality`	heart_rate, blood_pressure, temperature, respiratory_rate, spo2
`AcuityLevel`	`BFO:Quality`	esi_level (1–5)

Relations:

Relation	Domain	Range	Cardinality
`hasEncounter`	Patient	Encounter	1..*
`hasProvider`	Encounter	Provider	1..1
`hasDiagnosis`	Encounter	Diagnosis	1..*
`hasMedication`	Encounter	Medication	0..*
`hasVitalSigns`	Encounter	VitalSigns	1..1
`hasAcuity`	Encounter	AcuityLevel	1..1
`prescribedBy`	Medication	Provider	1..1

Axioms:

Encounter.disposition ∈ {admission, discharge, transfer, observation}
AcuityLevel.esi_level ∈ {1, 2, 3, 4, 5}
Diagnosis.is_primary is unique per Encounter (exactly one primary diagnosis)
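
The three axioms can also be read as executable predicates. The following is a minimal sketch, assuming encounters and diagnoses are available as plain dictionaries; the function name `check_axioms` and the record layout are illustrative and are not part of the pipeline described here.

```python
from collections import Counter

DISPOSITIONS = {"admission", "discharge", "transfer", "observation"}
ESI_LEVELS = {1, 2, 3, 4, 5}

def check_axioms(encounters, diagnoses):
    """Return a list of human-readable axiom violations (empty if none)."""
    violations = []
    for enc in encounters:
        if enc["disposition"] not in DISPOSITIONS:
            violations.append(f"{enc['encounter_id']}: invalid disposition")
        if enc["esi_level"] not in ESI_LEVELS:
            violations.append(f"{enc['encounter_id']}: invalid esi_level")
    # Count primary diagnoses per encounter; the axiom requires exactly one.
    primaries = Counter(d["encounter_id"] for d in diagnoses if d["is_primary"])
    for enc in encounters:
        if primaries[enc["encounter_id"]] != 1:
            violations.append(f"{enc['encounter_id']}: needs exactly one primary diagnosis")
    return violations

encounters = [{"encounter_id": "e1", "disposition": "admission", "esi_level": 2}]
diagnoses = [
    {"encounter_id": "e1", "icd10_code": "I21.9", "is_primary": True},
    {"encounter_id": "e1", "icd10_code": "I10", "is_primary": False},
]
print(check_axioms(encounters, diagnoses))  # []
```

An encounter with no primary diagnosis, or with two, produces a violation, which is the "exactly one" reading of the third axiom.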

Step 3: SysMLv2 Model

The ontology maps to SysMLv2 block definitions. The SysMLv2 model adds lifecycle semantics (the Encounter state machine: entry → triage → treatment → disposition → closed) and formal constraints that the flat ontology fragment does not capture.
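
To make the lifecycle concrete, the state order can be sketched in Python. This is not SysMLv2 notation, and only the five state names and their order come from the text above; the names `EncounterState`, `NEXT_STATE`, and `advance` are hypothetical, and the sketch assumes a strictly linear lifecycle with no skipped or repeated states.

```python
from enum import Enum

class EncounterState(Enum):
    ENTRY = "entry"
    TRIAGE = "triage"
    TREATMENT = "treatment"
    DISPOSITION = "disposition"
    CLOSED = "closed"

# Assumed linear lifecycle: each non-terminal state has exactly one successor.
NEXT_STATE = {
    EncounterState.ENTRY: EncounterState.TRIAGE,
    EncounterState.TRIAGE: EncounterState.TREATMENT,
    EncounterState.TREATMENT: EncounterState.DISPOSITION,
    EncounterState.DISPOSITION: EncounterState.CLOSED,
}

def advance(state: EncounterState) -> EncounterState:
    """Return the successor state, or raise if the state is terminal."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.value} is a terminal state")
    return NEXT_STATE[state]

# Walk one encounter through its whole lifecycle.
state = EncounterState.ENTRY
path = [state.value]
while state in NEXT_STATE:
    state = advance(state)
    path.append(state.value)
print(" -> ".join(path))  # entry -> triage -> treatment -> disposition -> closed
```

A constraint such as "a disposition may only be recorded once treatment has started" is the kind of rule this ordering expresses and a flat list of classes and relations does not.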

Step 4: Python Data Objects

from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum
from typing import Optional

class Disposition(Enum):
    ADMISSION = "admission"
    DISCHARGE = "discharge"
    TRANSFER = "transfer"
    OBSERVATION = "observation"

class Route(Enum):
    ORAL = "oral"
    IV = "intravenous"
    IM = "intramuscular"
    TOPICAL = "topical"
    INHALED = "inhaled"

@dataclass
class Patient:
    patient_id: str
    date_of_birth: date
    gender: str
    address: str

@dataclass
class Provider:
    provider_id: str
    name: str
    specialty: str
    license_number: str

@dataclass
class Encounter:
    encounter_id: str
    patient_id: str           # FK → Patient
    provider_id: str          # FK → Provider
    encounter_date: datetime
    presenting_complaint: str
    esi_level: int            # 1-5
    disposition: Disposition
    heart_rate: int
    blood_pressure: str
    temperature: float
    respiratory_rate: int
    spo2: int

@dataclass
class Diagnosis:
    diagnosis_id: str
    encounter_id: str         # FK → Encounter
    icd10_code: str
    description: str
    is_primary: bool

@dataclass
class Medication:
    medication_id: str
    encounter_id: str         # FK → Encounter
    prescribed_by: str        # FK → Provider
    drug_name: str
    dosage: str
    route: Route

Note that VitalSigns and AcuityLevel (BFO:Quality entities) have been denormalized into the Encounter class, and therefore into the encounter table in Step 5. This is a deliberate schema variation that the model must learn to handle; in a different schema variant, they would be separate tables.
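
The two schema variants can be compared side by side. The sketch below assumes a normalized variant in which VitalSigns and AcuityLevel are separate objects; the helpers `denormalize` and `split_qualities` are hypothetical names used only to show that both layouts carry the same information.

```python
from dataclasses import dataclass, asdict, fields

@dataclass
class VitalSigns:
    heart_rate: int
    blood_pressure: str
    temperature: float
    respiratory_rate: int
    spo2: int

@dataclass
class AcuityLevel:
    esi_level: int

def denormalize(encounter_core: dict, vitals: VitalSigns, acuity: AcuityLevel) -> dict:
    """Merge the quality entities into one flat encounter row."""
    return {**encounter_core, **asdict(acuity), **asdict(vitals)}

def split_qualities(row: dict):
    """Inverse view: recover the separate quality objects from a flat row."""
    vitals = VitalSigns(**{f.name: row[f.name] for f in fields(VitalSigns)})
    acuity = AcuityLevel(**{f.name: row[f.name] for f in fields(AcuityLevel)})
    return vitals, acuity

core = {"encounter_id": "e1a2b3c4-...", "presenting_complaint": "Acute chest pain"}
vitals = VitalSigns(98, "145/92", 98.6, 20, 97)
acuity = AcuityLevel(2)

row = denormalize(core, vitals, acuity)
print(sorted(row))
```

Because `split_qualities(row)` returns the original objects, the flat row loses nothing; only the column-to-entity boundary becomes implicit, and that boundary is what the model has to recover.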

Step 5: Relational Schema

CREATE TABLE patient (
    patient_id     VARCHAR(36) PRIMARY KEY,
    date_of_birth  DATE NOT NULL,
    gender         VARCHAR(10) NOT NULL,
    address        TEXT
);

CREATE TABLE provider (
    provider_id    VARCHAR(36) PRIMARY KEY,
    name           VARCHAR(100) NOT NULL,
    specialty      VARCHAR(50) NOT NULL,
    license_number VARCHAR(20) NOT NULL UNIQUE
);

CREATE TABLE encounter (
    encounter_id        VARCHAR(36) PRIMARY KEY,
    patient_id          VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    provider_id         VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    encounter_date      TIMESTAMP NOT NULL,
    presenting_complaint TEXT NOT NULL,
    esi_level           INTEGER NOT NULL CHECK (esi_level BETWEEN 1 AND 5),
    disposition         VARCHAR(20) NOT NULL
                        CHECK (disposition IN ('admission','discharge','transfer','observation')),
    heart_rate          INTEGER,
    blood_pressure      VARCHAR(10),
    temperature         NUMERIC(4,1),
    respiratory_rate    INTEGER,
    spo2                INTEGER CHECK (spo2 BETWEEN 0 AND 100)
);

CREATE TABLE diagnosis (
    diagnosis_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    icd10_code   VARCHAR(10) NOT NULL,
    description  TEXT,
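    is_primary   BOOLEAN NOT NULL DEFAULT FALSE
);

-- At most one primary diagnosis per encounter. A table-level
-- UNIQUE (encounter_id, is_primary) would also limit each encounter to a
-- single secondary diagnosis, so a partial unique index is used instead
-- (supported by PostgreSQL and SQLite, among others).
CREATE UNIQUE INDEX diagnosis_one_primary_per_encounter
    ON diagnosis (encounter_id) WHERE is_primary;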

CREATE TABLE medication (
    medication_id VARCHAR(36) PRIMARY KEY,
    encounter_id  VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    prescribed_by VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    drug_name     VARCHAR(100) NOT NULL,
    dosage        VARCHAR(50) NOT NULL,
    route         VARCHAR(20) NOT NULL
);
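
The CHECK and foreign key constraints can be exercised against an in-memory database. The sketch below assumes SQLite purely for convenience (the walkthrough does not name a database engine) and uses a trimmed copy of the encounter table; the helper `try_insert` is a hypothetical name.

```python
import sqlite3

# Trimmed DDL: only the columns needed to exercise the constraints.
DDL = """
CREATE TABLE patient  (patient_id  VARCHAR(36) PRIMARY KEY);
CREATE TABLE provider (provider_id VARCHAR(36) PRIMARY KEY);
CREATE TABLE encounter (
    encounter_id VARCHAR(36) PRIMARY KEY,
    patient_id   VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    provider_id  VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    esi_level    INTEGER NOT NULL CHECK (esi_level BETWEEN 1 AND 5),
    disposition  VARCHAR(20) NOT NULL
                 CHECK (disposition IN ('admission','discharge','transfer','observation'))
);
"""

def try_insert(conn, row):
    """Return True if the row is accepted, False if a constraint rejects it."""
    try:
        conn.execute("INSERT INTO encounter VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript(DDL)
conn.execute("INSERT INTO patient VALUES ('pat-1')")
conn.execute("INSERT INTO provider VALUES ('prov-1')")

results = {
    "valid": try_insert(conn, ("e1", "pat-1", "prov-1", 2, "admission")),
    "esi_out_of_range": try_insert(conn, ("e2", "pat-1", "prov-1", 6, "admission")),
    "unknown_disposition": try_insert(conn, ("e3", "pat-1", "prov-1", 3, "deceased")),
    "missing_patient": try_insert(conn, ("e4", "pat-999", "prov-1", 3, "discharge")),
}
print(results)
```

Only the first row is accepted; the other three are rejected by the ESI range check, the disposition check, and the patient foreign key respectively.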

Step 6: Synthetic Data

Sample rows from the populated tables:

patient (200 rows):

patient_id	date_of_birth	gender	address
`a3f8c1d0-...`	1987-03-15	Female	2847 Oak Ave, Portland, OR 97205
`b7e2a4f1-...`	1952-11-28	Male	156 Pine St, Austin, TX 78701
`c9d0b3e2-...`	2001-07-04	Female	4021 Maple Dr, Denver, CO 80202

encounter (1,400 rows, ~7 per patient):

encounter_id	patient_id	provider_id	encounter_date	presenting_complaint	esi_level	disposition	heart_rate	blood_pressure	temperature
`e1a2b3c4-...`	`a3f8c1d0-...`	`p001-...`	2024-01-15 14:30	Acute chest pain	2	admission	98	145/92	98.6
`e5f6a7b8-...`	`b7e2a4f1-...`	`p003-...`	2024-02-03 09:15	Laceration, left hand	4	discharge	72	128/78	98.2

diagnosis (3,200 rows, ~2.3 per encounter):

diagnosis_id	encounter_id	icd10_code	description	is_primary
`d100-...`	`e1a2b3c4-...`	I21.9	Acute myocardial infarction, unspecified	true
`d101-...`	`e1a2b3c4-...`	I10	Essential hypertension	false
`d200-...`	`e5f6a7b8-...`	S61.412A	Laceration without FB, left hand	true

medication (2,100 rows):

medication_id	encounter_id	prescribed_by	drug_name	dosage	route
`m100-...`	`e1a2b3c4-...`	`p001-...`	Aspirin	325mg	oral
`m101-...`	`e1a2b3c4-...`	`p001-...`	Heparin	5000 units	intravenous
`m200-...`	`e5f6a7b8-...`	`p003-...`	Lidocaine	1% 5mL	topical
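
Rows like these can be produced by a seeded generator that respects the foreign keys and the primary-diagnosis axiom. The following is a sketch under stated assumptions, not the pipeline's generator: the function `generate`, the fixed seven encounters per patient, and the uniform sampling are all illustrative, and most columns are omitted for brevity.

```python
import random
import uuid
from collections import Counter

def new_id(rng):
    """Deterministic UUIDv4-shaped identifier drawn from the seeded RNG."""
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))

def generate(n_patients=200, encounters_per_patient=7, seed=0):
    rng = random.Random(seed)
    icd_codes = ["I21.9", "I10", "S61.412A"]  # codes taken from the sample rows
    dispositions = ["admission", "discharge", "transfer", "observation"]
    patients, encounters, diagnoses = [], [], []
    for _ in range(n_patients):
        pid = new_id(rng)
        patients.append({"patient_id": pid})
        for _ in range(encounters_per_patient):
            eid = new_id(rng)
            encounters.append({
                "encounter_id": eid,
                "patient_id": pid,  # FK always points at an existing patient
                "esi_level": rng.randint(1, 5),
                "disposition": rng.choice(dispositions),
            })
            # Exactly one primary diagnosis, then zero to two secondary ones.
            for i in range(1 + rng.randint(0, 2)):
                diagnoses.append({
                    "diagnosis_id": new_id(rng),
                    "encounter_id": eid,
                    "icd10_code": rng.choice(icd_codes),
                    "is_primary": i == 0,
                })
    return patients, encounters, diagnoses

patients, encounters, diagnoses = generate()
primary_counts = Counter(d["encounter_id"] for d in diagnoses if d["is_primary"])
print(len(patients), len(encounters))  # 200 1400
```

Generating children inside the loop that creates their parent is what guarantees referential integrity by construction, so no repair pass is needed afterwards.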

Step 7: Serialized Input

Aegir receives the tables as byte-serialized CSV data. Here’s what the model actually sees (abbreviated):

patient_id,date_of_birth,gender,address
a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0,1987-03-15,Female,"2847 Oak Ave, Portland, OR 97205"
b7e2a4f1-3c5d-4e6f-8a9b-c0d1e2f3a4b5,1952-11-28,Male,"156 Pine St, Austin, TX 78701"
c9d0b3e2-1a4f-4c7d-9e2b-f3a5b6c7d8e9,2001-07-04,Female,"4021 Maple Dr, Denver, CO 80202"
...
===TABLE_BOUNDARY===
encounter_id,patient_id,provider_id,encounter_date,presenting_complaint,esi_level,disposition,heart_rate,blood_pressure,temperature,respiratory_rate,spo2
e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c,a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0,p001-a2b3-c4d5,2024-01-15 14:30:00,Acute chest pain,2,admission,98,145/92,98.6,20,97
...
===TABLE_BOUNDARY===
diagnosis_id,encounter_id,icd10_code,description,is_primary
d100-e1f2-a3b4-c5d6,e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c,I21.9,"Acute myocardial infarction, unspecified",true
...

The model sees raw bytes. No type annotations, no foreign key declarations, no semantic metadata – just the patterns in the data itself.
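
A serializer that produces this layout can be sketched with the standard `csv` module. The boundary string and the absence of table names come from the listing above; the UTF-8 encoding, the `\n` line terminator, and the function name `serialize_tables` are assumptions.

```python
import csv
import io

TABLE_BOUNDARY = "===TABLE_BOUNDARY==="

def serialize_tables(tables):
    """Serialize a list of (header, rows) pairs into one UTF-8 byte string."""
    chunks = []
    for header, rows in tables:
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")  # quotes fields containing commas
        writer.writerow(header)
        writer.writerows(rows)
        chunks.append(buf.getvalue())
    # Table names are deliberately not written; only the boundary separates tables.
    return (TABLE_BOUNDARY + "\n").join(chunks).encode("utf-8")

patient = (
    ["patient_id", "date_of_birth", "gender", "address"],
    [["a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0", "1987-03-15", "Female",
      "2847 Oak Ave, Portland, OR 97205"]],
)
diagnosis = (
    ["diagnosis_id", "encounter_id", "icd10_code", "description", "is_primary"],
    [["d100-e1f2-a3b4-c5d6", "e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c", "I21.9",
      "Acute myocardial infarction, unspecified", "true"]],
)

payload = serialize_tables([patient, diagnosis])
print(payload.decode("utf-8"))
```

Everything a schema would normally declare (types, keys, table names) is absent from `payload`, so any structure the model predicts has to be inferred from the values alone.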

Step 8: Training Target

The expected predictions for this training example:

Column type annotation (CTA) predictions (per column):

Table	Column	Expected Type	BFO Category
patient	patient_id	PersonIdentifier	GDC
patient	date_of_birth	BirthDate	Quality
patient	gender	BiologicalSex	Quality
encounter	encounter_id	EncounterIdentifier	GDC
encounter	patient_id	PersonIdentifier (FK)	GDC
encounter	esi_level	AcuityLevel	Quality
encounter	disposition	DispositionDecision	Quality
encounter	heart_rate	VitalSign	Quality
diagnosis	icd10_code	DiagnosisCode	GDC
diagnosis	is_primary	PrimaryIndicator	Quality
medication	drug_name	MedicationName	GDC
medication	dosage	Dosage	Quality
medication	route	AdministrationRoute	Quality

Data element predictions (cross-table clusters):

Data Element	Columns	Source Entity
PatientDemographics	patient.patient_id, patient.date_of_birth, patient.gender, patient.address, encounter.patient_id	`Patient`
ClinicalEncounter	encounter.encounter_id, encounter.encounter_date, encounter.presenting_complaint, encounter.esi_level, encounter.disposition, encounter.heart_rate, encounter.blood_pressure, encounter.temperature, encounter.respiratory_rate, encounter.spo2	`Encounter` + `VitalSigns` + `AcuityLevel`
DiagnosisRecord	diagnosis.diagnosis_id, diagnosis.encounter_id, diagnosis.icd10_code, diagnosis.description, diagnosis.is_primary	`Diagnosis`
MedicationOrder	medication.medication_id, medication.encounter_id, medication.drug_name, medication.dosage, medication.route	`Medication`
ClinicalProvider	provider.provider_id, provider.name, provider.specialty, provider.license_number, encounter.provider_id, medication.prescribed_by	`Provider`

Note that the PatientDemographics data element includes both patient.patient_id and encounter.patient_id, so it spans two tables. Similarly, ClinicalProvider spans columns in three tables (provider, encounter, medication). This is exactly the cross-table data element discovery that enterprise data governance requires.
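
Which data elements cross table boundaries can be read directly off the column lists. The sketch below copies three of the five elements from the table above; the helper `tables_spanned` is a hypothetical name, and it assumes every column is written as `table.column`.

```python
DATA_ELEMENTS = {
    "PatientDemographics": [
        "patient.patient_id", "patient.date_of_birth", "patient.gender",
        "patient.address", "encounter.patient_id",
    ],
    "DiagnosisRecord": [
        "diagnosis.diagnosis_id", "diagnosis.encounter_id", "diagnosis.icd10_code",
        "diagnosis.description", "diagnosis.is_primary",
    ],
    "ClinicalProvider": [
        "provider.provider_id", "provider.name", "provider.specialty",
        "provider.license_number", "encounter.provider_id", "medication.prescribed_by",
    ],
}

def tables_spanned(columns):
    """Return the set of table names that a list of 'table.column' entries touches."""
    return {col.split(".", 1)[0] for col in columns}

# Keep only the elements whose columns live in more than one table.
cross_table = {
    name: sorted(tables_spanned(cols))
    for name, cols in DATA_ELEMENTS.items()
    if len(tables_spanned(cols)) > 1
}
print(cross_table)
# {'PatientDemographics': ['encounter', 'patient'], 'ClinicalProvider': ['encounter', 'medication', 'provider']}
```

DiagnosisRecord is filtered out because all of its columns sit in the diagnosis table, while the other two elements each pull in foreign key columns from elsewhere.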

Step 9: Validation

The round-trip check confirms that predicted data elements map back to their source ontological entities.

Every predicted data element maps back to the ontology entities it was generated from. Four elements correspond to exactly one entity each; ClinicalEncounter corresponds to Encounter together with the denormalized VitalSigns and AcuityLevel qualities. Grouping those columns correctly demonstrates that the model learned to see through the denormalization to the underlying ontological structure.

This validation is automatic and exact because the generation pipeline preserves complete provenance. There is no human labeling, no ambiguity, and no annotation disagreement. The ground truth is a mathematical consequence of the generation process.
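
Because provenance is recorded at generation time, the check reduces to a set comparison. The following is a minimal sketch, assuming provenance is stored as a mapping from data element to source entities (the entity sets are copied from the Step 8 table); the function name `round_trip_check` and the storage format are illustrative, since the walkthrough does not specify them.

```python
PROVENANCE = {
    "PatientDemographics": {"Patient"},
    "ClinicalEncounter": {"Encounter", "VitalSigns", "AcuityLevel"},
    "DiagnosisRecord": {"Diagnosis"},
    "MedicationOrder": {"Medication"},
    "ClinicalProvider": {"Provider"},
}

def round_trip_check(predicted, provenance):
    """Return {element: (predicted_entities, expected_entities)} for every mismatch."""
    mismatches = {}
    for element in predicted.keys() | provenance.keys():
        got = predicted.get(element, set())
        want = provenance.get(element, set())
        if got != want:
            mismatches[element] = (got, want)
    return mismatches

# A prediction that fails to see through the denormalization.
predicted = dict(PROVENANCE)
predicted["ClinicalEncounter"] = {"Encounter"}

print(round_trip_check(PROVENANCE, PROVENANCE))  # {}
print(sorted(round_trip_check(predicted, PROVENANCE)))  # ['ClinicalEncounter']
```

An empty result means every element round-trips exactly; any other result names the elements to inspect, with no human judgment involved.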

What This Means in Practice

When this training process is applied at scale – across hundreds of millions of passages spanning every domain in FineWeb-Edu – the model learns:

Column type recognition that generalizes across naming conventions, data formats, and serialization styles
Cross-table relationship discovery that identifies semantically related columns regardless of which tables they appear in
Ontological hierarchy that connects specific types (ICD-10 codes) to general categories (information entities) through BFO’s formal structure
Confusable type resolution by leveraging cross-column context (patient_id vs provider_id look identical in isolation but participate in different relationship patterns)

These capabilities transfer directly to real enterprise data warehouses, where the model encounters the same patterns – just without the luxury of knowing the ontological provenance in advance.

Ægir: Hierarchical Sequence Modeling with Dynamic Chunking