
Pretraining: Ontology-Grounded Synthetic Data

How do you train a structured information model at LLM scale when labeled relational data is scarce and expensive?

Conventional approaches to semantic column annotation rely on manually labeled benchmark datasets – SOTAB, GitTables, WikiTables – that are costly to create, domain-limited, and rarely capture the cross-table relationships needed for data element discovery. Self-supervised pretraining on raw tables (as in DODUO and TURL) learns useful representations, but the “ground truth” for column semantics remains noisy or absent.

Aegir takes a fundamentally different approach: we generate the training data from first principles, so the ground truth is always known by construction.

The Core Insight

We invert the usual pipeline. Instead of finding tables and labeling them, we:

  1. Start from the highest-quality curated text available
  2. Extract formal ontological structure using LLMs
  3. Project that structure into relational database schemas
  4. Populate schemas with realistic synthetic data
  5. Train the model to recover the ontological entities from the raw table data

Because we control every step of the generation process, the mapping from table columns back to ontological entities is always available as ground truth. The diversity of the input text drives the diversity of the synthetic data; the formal ontological backbone guarantees structural correctness.
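The provenance-by-construction property can be made concrete with a small sketch. The class and field names below are illustrative, not Aegir's actual data model; the point is only that every synthetic column carries a pointer back to the ontological entity it was generated from, so the label set falls out for free.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OntologyEntity:
    name: str      # e.g. "Patient"
    bfo_type: str  # e.g. "IndependentContinuant"

@dataclass
class Column:
    name: str                      # column name in the synthetic table
    values: list                   # procedurally generated cell values
    source_entity: OntologyEntity  # ground truth, known by construction

@dataclass
class SyntheticTable:
    columns: list

    def ground_truth(self) -> dict:
        """Recover the column -> entity mapping the model must learn."""
        return {c.name: c.source_entity.name for c in self.columns}

patient = OntologyEntity("Patient", "IndependentContinuant")
encounter = OntologyEntity("Encounter", "Occurrent")

table = SyntheticTable(columns=[
    Column("pat_dob", ["1980-04-12", "1975-09-30"], patient),
    Column("admit_ts", ["2024-01-05", "2024-02-11"], encounter),
])

print(table.ground_truth())
# {'pat_dob': 'Patient', 'admit_ts': 'Encounter'}
```

No human annotator ever touches the table; the labels exist because the generator wrote them down on the way in.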

Pipeline Overview

What Is Novel

No prior work combines all five stages into a single pipeline. Each stage has precedent; the composition does not.

| Stage | Prior Art | What Exists | What Is New |
|---|---|---|---|
| Text → Ontology | OntoGPT, REBEL, DeepOnto | LLM-based ontology extraction from text | Using curated educational text as seed for training data generation |
| BFO Grounding | Common Core Ontologies, OBO Foundry | BFO as upper ontology for domain modeling | BFO as the generative backbone for synthetic ML training data |
| SysMLv2 Intermediate | openCAESAR, Cameo | SysMLv2 for systems engineering | SysMLv2 MBSE as intermediate representation in an ML data pipeline |
| Synthetic Tables | MOSTLY.ai, SDV, NeurIPS 2024 TRL | Synthetic table generation for augmentation | Tables generated from ontological structure with known entity provenance |
| Entity Recovery | DODUO, TURL (masked column) | Masked language model pretraining on tables | Ontological entity recovery as the training objective, not next-token prediction |

The closest related work is “Enhancing Table Representations with LLM-powered Synthetic Data Generation” (NeurIPS 2024 TRL Workshop), which generates synthetic tables to improve column embedding similarity. That work generates tables for representation learning; Aegir generates tables for ontological entity recovery – a fundamentally different objective that produces a richer training signal, because the ground truth includes hierarchical entity structure, cross-table relationships, and BFO-grounded type constraints.

Why This Scales

The bottleneck in conventional table annotation is human labeling. The bottleneck here is LLM inference for ontology extraction – which is embarrassingly parallel and decreasing in cost.
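Because each passage is an independent LLM call, the extraction stage parallelizes trivially. A minimal sketch of that fan-out is below; `extract_ontology` is a stand-in stub (it just picks out capitalized words), not a real extraction API, and in practice it would wrap an LLM client.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_ontology(passage: str) -> list[str]:
    # Stub standing in for an LLM call: treat each capitalized word
    # as an "extracted entity" so the example runs offline.
    return [w for w in passage.split() if w[:1].isupper()]

passages = [
    "Patient demographics drive Encounter records.",
    "Insurance Claims reference Provider credentials.",
]

# Each passage maps to its ontology fragment independently, so the
# work distributes across however many workers inference can support.
with ThreadPoolExecutor(max_workers=8) as pool:
    fragments = list(pool.map(extract_ontology, passages))

print(fragments)
```

Swapping the thread pool for a cluster-scale job queue changes nothing about the structure: there is no coordination between passages.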

The multiplicative structure of the pipeline ensures near-unlimited training data:

| Stage | Multiplier | Source |
|---|---|---|
| Curated text | ~500M passages | FineWeb-Edu (1.3T tokens) |
| Ontology fragments | 1–5 per passage | Domain-dependent entity density |
| Database schemas | 1–10 per fragment | Varying normalization strategies |
| Table instances | 100–10,000 rows | Procedural generation with distribution control |
| Total training examples | effectively unbounded | Combinatorial product of all stages |

A single educational passage about hospital billing can produce ontology fragments for patient demographics, encounter management, diagnosis coding, insurance claims, and provider credentialing – each of which generates distinct database schemas, each populated with different synthetic data distributions. The diversity of the training data is bounded only by the diversity of human knowledge captured in the source text.
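A back-of-envelope calculation makes the multiplicative structure concrete. The per-stage figures below are illustrative midpoints of the ranges quoted above, not measured values:

```python
passages = 500_000_000     # curated passages (FineWeb-Edu scale)
fragments_per_passage = 3  # midpoint of the 1-5 range
schemas_per_fragment = 5   # midpoint of the 1-10 range

# Distinct schemas before any row sampling is even considered.
schemas_total = passages * fragments_per_passage * schemas_per_fragment
print(f"{schemas_total:.1e} distinct schemas before row sampling")
# 7.5e+09 distinct schemas before row sampling
```

Multiplying in 100–10,000 row instantiations per schema pushes the count far past what any model could exhaust, which is why the table above calls the total "effectively unbounded."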

How This Connects to Aegir

The pretraining objective maps directly to Aegir’s three target tasks:

  • Column Type Annotation (CTA): The per-column entity type predictions from pretraining transfer directly to CTA on SOTAB, GitTables, and WikiTables benchmarks.
  • Column Property Annotation (CPA): The cross-column relationship predictions learned during pretraining capture the same inter-column semantics needed for CPA.
  • Data Element Discovery: The core pretraining objective – grouping related columns into ontological entities across tables – is data element discovery. The model learns this from synthetic data where the answer is known, then applies it to real enterprise data warehouses.
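One way to see why a single pretraining corpus serves all three tasks: the same per-column provenance record can be projected into each task's label format. The field names below are illustrative, not Aegir's actual schema.

```python
# Hypothetical provenance for three columns of one synthetic table.
provenance = {
    "pat_dob":   {"entity": "Patient", "type": "birthDate"},
    "pat_name":  {"entity": "Patient", "type": "name"},
    "claim_amt": {"entity": "Claim",   "type": "monetaryAmount"},
}

# CTA: one semantic type per column.
cta_labels = {col: p["type"] for col, p in provenance.items()}

# CPA: a relationship label for a column pair, derived from shared
# or linked entities in the ontology (illustrative relation string).
cpa_labels = {("pat_name", "pat_dob"): "sameEntity:Patient"}

# Data element discovery: columns grouped by their source entity.
groups: dict = {}
for col, p in provenance.items():
    groups.setdefault(p["entity"], []).append(col)

print(groups)
# {'Patient': ['pat_dob', 'pat_name'], 'Claim': ['claim_amt']}
```

The projections are lossy in one direction only: discovery labels subsume CTA and CPA, which is why the grouping objective is described as the core of pretraining.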

Furthermore, Aegir’s agent swarm architecture enables cross-table reasoning during both pretraining and inference. Each agent processes a table, and the fused recurrent states capture inter-table relationships that no single-table model can learn.

The following sections detail each stage of the pipeline.