Pretraining: Ontology-Grounded Synthetic Data

How do you train a structured information model at LLM scale when labeled relational data is scarce and expensive?

Conventional approaches to semantic column annotation rely on manually labeled benchmark datasets – SOTAB, GitTables, WikiTables – that are costly to create, domain-limited, and rarely capture the cross-table relationships needed for data element discovery. Self-supervised pretraining on raw tables (as in DODUO and TURL) learns useful representations, but the “ground truth” for column semantics remains noisy or absent.

Aegir takes a fundamentally different approach: we generate the training data from first principles, so the ground truth is always known by construction.

The Core Insight

We invert the usual pipeline. Instead of finding tables and labeling them, we:

1. Start from the highest-quality curated text available
2. Extract formal ontological structure using LLMs
3. Project that structure into relational database schemas
4. Populate schemas with realistic synthetic data
5. Train the model to recover the ontological entities from the raw table data

Because we control every step of the generation process, the mapping from table columns back to ontological entities is always available as ground truth. The diversity of the input text drives the diversity of the synthetic data; the formal ontological backbone guarantees structural correctness.
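The "ground truth by construction" idea can be illustrated with a minimal sketch. It skips the LLM extraction step and starts from a hand-written ontology fragment. The names `Attribute`, `Entity`, `project_to_table`, and `populate`, the value kinds, and the opaque column naming are all hypothetical choices made for this illustration, not Aegir's actual implementation.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Attribute:
    name: str  # ontological attribute, e.g. "birth_date"
    kind: str  # hypothetical value kind used by the toy generator


@dataclass(frozen=True)
class Entity:
    name: str
    attributes: tuple  # tuple of Attribute


def project_to_table(entity, prefix="c"):
    """Project one entity into a table with opaque column names.

    Returns the column names and the column -> (entity, attribute) mapping.
    The mapping is the label set; it exists because we built the table.
    """
    columns = [f"{prefix}_{i}" for i in range(len(entity.attributes))]
    truth = {
        col: (entity.name, attr.name)
        for col, attr in zip(columns, entity.attributes)
    }
    return columns, truth


def populate(entity, columns, n_rows, rng):
    """Fill the table with synthetic values drawn per attribute kind."""
    makers = {
        "int": lambda: rng.randint(1, 99999),
        "date": lambda: f"{rng.randint(1950, 2005)}-{rng.randint(1, 12):02d}-{rng.randint(1, 28):02d}",
        "code": lambda: rng.choice(["A10", "B20", "C30"]),
    }
    return [
        {col: makers[attr.kind]() for col, attr in zip(columns, entity.attributes)}
        for _ in range(n_rows)
    ]


# Hand-written stand-in for an LLM-extracted ontology fragment.
patient = Entity(
    "Patient",
    (
        Attribute("identifier", "int"),
        Attribute("birth_date", "date"),
        Attribute("diagnosis_code", "code"),
    ),
)

columns, truth = project_to_table(patient)
rows = populate(patient, columns, n_rows=5, rng=random.Random(0))
```

A model trained on `rows` sees only the opaque column names and the cell values; `truth` supplies the target for every column without any human labeling.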

Pipeline Overview

What Is Novel

To our knowledge, no prior work combines all five stages into a single pipeline. Each stage has precedent; the composition does not.

Stage	Prior Art	What Exists	What Is New
Text → Ontology	OntoGPT, REBEL, DeepOnto	LLM-based ontology extraction from text	Using curated educational text as seed for training data generation
BFO Grounding	Common Core Ontologies, OBO Foundry	BFO as upper ontology for domain modeling	BFO as the generative backbone for synthetic ML training data
SysMLv2 Intermediate	openCAESAR, Cameo	SysMLv2 for systems engineering	SysMLv2 MBSE as intermediate representation in an ML data pipeline
Synthetic Tables	MOSTLY.ai, SDV, NeurIPS 2024 TRL	Synthetic table generation for augmentation	Tables generated from ontological structure with known entity provenance
Entity Recovery	DODUO, TURL (masked column)	Masked language model pretraining on tables	Ontological entity recovery as the training objective, rather than masked-token prediction

The closest related work is “Enhancing Table Representations with LLM-powered Synthetic Data Generation” (NeurIPS 2024 TRL Workshop), which generates synthetic tables to improve column embedding similarity. That work generates tables for representation learning; Aegir generates tables for ontological entity recovery. The objectives differ in what the labels contain: Aegir’s ground truth includes hierarchical entity structure, cross-table relationships, and BFO-grounded type constraints, which gives a richer training signal.

Why This Scales

The bottleneck in conventional table annotation is human labeling. The bottleneck here is LLM inference for ontology extraction – which is embarrassingly parallel and decreasing in cost.

Each stage multiplies the output of the previous one, so the number of distinct training examples grows as the product of the per-stage counts:

Stage	Count or Multiplier	Source
Curated text	~500M passages	FineWeb-Edu (1.3T tokens)
Ontology fragments	1–5 per passage	Domain-dependent entity density
Database schemas	1–10 per fragment	Varying normalization strategies
Table instances	100–10,000 rows per table	Procedural generation with distribution control
Total training examples	effectively unbounded	Combinatorial product of all stages
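As a rough worked example, the figures in the table above can be multiplied out. This sketch assumes exactly one table instance per schema, which the table does not state; resampling each schema under different data distributions would push the totals higher, which is where "effectively unbounded" comes from. The function name `stage_product` is hypothetical.

```python
# Figures taken from the table; ranges are (low, high) per stage.
PASSAGES = 500_000_000
FRAGMENTS_PER_PASSAGE = (1, 5)
SCHEMAS_PER_FRAGMENT = (1, 10)
ROWS_PER_TABLE = (100, 10_000)


def stage_product(*ranges):
    """Multiply (low, high) ranges stage by stage."""
    low, high = 1, 1
    for lo, hi in ranges:
        low *= lo
        high *= hi
    return low, high


# Distinct schemas: passages x fragments x schemas.
schemas_low, schemas_high = stage_product(
    (PASSAGES, PASSAGES), FRAGMENTS_PER_PASSAGE, SCHEMAS_PER_FRAGMENT
)

# Total rows, ASSUMING one populated table instance per schema.
rows_low, rows_high = stage_product((schemas_low, schemas_high), ROWS_PER_TABLE)

print(f"schemas: {schemas_low:.1e} to {schemas_high:.1e}")
print(f"rows:    {rows_low:.1e} to {rows_high:.1e}")
```

Under these assumptions a single pass yields between 5 × 10^8 and 2.5 × 10^10 schemas, and between 5 × 10^10 and 2.5 × 10^14 rows.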

A single educational passage about hospital billing can produce ontology fragments for patient demographics, encounter management, diagnosis coding, insurance claims, and provider credentialing – each of which generates distinct database schemas, each populated with different synthetic data distributions. The diversity of the training data is bounded only by the diversity of human knowledge captured in the source text.

How This Connects to Aegir

The pretraining objective maps directly to Aegir’s three target tasks:

Column Type Annotation (CTA): The per-column entity type predictions from pretraining transfer directly to CTA on SOTAB, GitTables, and WikiTables benchmarks.
Column Property Annotation (CPA): The cross-column relationship predictions learned during pretraining capture the same inter-column semantics needed for CPA.
Data Element Discovery: The core pretraining objective – grouping related columns into ontological entities across tables – is data element discovery. The model learns this from synthetic data where the answer is known, then applies it to real enterprise data warehouses.
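One way to picture the mapping to the three tasks is that a single provenance record per column can be read three ways. The sketch below is illustrative only: the `provenance` dictionary, the function names, and the coarse `same_entity` / `cross_entity` relation labels are assumptions made for this example, not Aegir's label schema.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical provenance: (table, column) -> (entity, attribute).
provenance = {
    ("patients", "c_0"): ("Patient", "identifier"),
    ("patients", "c_1"): ("Patient", "birth_date"),
    ("encounters", "c_0"): ("Encounter", "identifier"),
    ("encounters", "c_1"): ("Patient", "identifier"),
    ("encounters", "c_2"): ("Encounter", "admit_date"),
}


def cta_labels(prov):
    """CTA view: one semantic type label per column."""
    return {col: f"{entity}.{attr}" for col, (entity, attr) in prov.items()}


def cpa_labels(prov):
    """CPA view (simplified): a relation label per column pair within a table."""
    by_table = defaultdict(list)
    for table, col in sorted(prov):
        by_table[table].append(col)
    labels = {}
    for table, cols in by_table.items():
        for a, b in combinations(cols, 2):
            same = prov[(table, a)][0] == prov[(table, b)][0]
            labels[(table, a, b)] = "same_entity" if same else "cross_entity"
    return labels


def data_elements(prov):
    """Data element discovery view: columns grouped by entity across tables."""
    groups = defaultdict(list)
    for col in sorted(prov):
        groups[prov[col][0]].append(col)
    return dict(groups)


cta = cta_labels(provenance)
cpa = cpa_labels(provenance)
elements = data_elements(provenance)
```

Here `elements["Patient"]` collects columns from both tables, including the foreign-key-like `("encounters", "c_1")`, which is the cross-table grouping that per-table CTA labels alone do not express.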

Furthermore, Aegir’s agent swarm architecture enables cross-table reasoning during both pretraining and inference. Each agent processes a table, and the fused recurrent states capture inter-table relationships that no single-table model can learn.

The following sections detail each stage of the pipeline.