
Training Tactics

The training objective defines what Aegir learns – ontological entity recovery from serialized relational tables. This page defines how: the specific self-supervised tasks, corruption strategies, and curriculum design that compose the pretraining regimen. Each tactic is adapted from a proven LLM pretraining method but re-targeted at the structural properties of relational data with known ontological provenance.

Tactic Overview

Each tactic is described below with its LLM analogue, formal task specification, and the downstream capability it trains.


Core Objectives

Object Property Masking

LLM analogue: Masked Language Modeling (BERT)

Mask one or more properties from an ontological entity definition. The model receives the serialized tables (which still contain the data for the masked properties) and must predict what properties the source entity had.

Difficulty gradation:

| Level | What's Masked | Challenge |
|---|---|---|
| Easy | A column with structurally distinctive values (dates, emails) | Pattern recognition |
| Medium | A column whose type depends on co-occurring columns | Cross-column reasoning |
| Hard | A column with confusable values (UUID vs advertising ID) | Contextual disambiguation |
| Expert | Multiple properties from the same entity simultaneously | Entity structure reconstruction |

Loss: Cross-entropy over the property type vocabulary, plus a regression loss for predicting the property name embedding.

\[ \mathcal{L}_{\text{OPM}} = -\frac{1}{|M|} \sum_{p \in M} \left[ \log P(y_p \mid \mathbf{h}_p) + \alpha \|\hat{\mathbf{e}}_p - \mathbf{e}_p\|^2 \right] \]

where \(M\) is the set of masked properties, \(y_p\) is the property’s BFO type, \(\mathbf{e}_p\) is the property name embedding, and \(\alpha\) weights the name regression term.

Trains: Column type annotation (CTA). The model learns to identify what semantic role a column plays from its value distribution and surrounding context.
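As a concrete sketch, the OPM loss above can be computed from per-property type logits and name-embedding predictions. This is an illustrative numpy implementation, not Aegir's actual training code; the array shapes and `alpha` default are assumptions.

```python
import numpy as np

def opm_loss(type_logits, type_targets, name_preds, name_targets, alpha=0.1):
    """Object Property Masking loss: softmax cross-entropy over the BFO
    type vocabulary plus an alpha-weighted squared-error term on the
    predicted property-name embeddings, averaged over masked properties.

    type_logits: (|M|, vocab) scores; type_targets: (|M|,) class indices;
    name_preds/name_targets: (|M|, d) name embeddings."""
    # Numerically stable log-softmax over the type vocabulary.
    shifted = type_logits - type_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(type_targets)), type_targets]
    # Regression term on the property-name embeddings.
    reg = ((name_preds - name_targets) ** 2).sum(axis=1)
    return float((ce + alpha * reg).mean())
```

With confident correct logits and exact name embeddings the loss approaches zero; a mismatched name embedding adds exactly the alpha-weighted squared distance.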


Replaced Column Detection

LLM analogue: Replaced Token Detection (ELECTRA)

Swap columns between tables that originated from different ontological entities. A discriminator must identify which columns are imposters — present in a table they don’t ontologically belong to.

The generator learns to make plausible swaps — columns with similar value distributions but different semantic types. This is precisely the confusable-pair problem. A naive generator might swap patient_id (UUID) with encounter_date (timestamp) — trivially detectable. A trained generator learns to swap patient_id with provider_id (both UUIDs, both foreign-keyed) — a much harder discrimination task.

Two-phase training:

  1. Generator: A small model that scores candidate column swaps by value-distribution similarity and selects high-similarity pairs
  2. Discriminator: Aegir itself, trained to detect which columns don’t belong

The ELECTRA insight applies directly: the discriminator receives a training signal on every column (original or replaced), not just the masked positions. This is far more sample-efficient than masking-based objectives.
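A minimal sketch of the generator's swap scoring, assuming a crude character-class signature as the value-distribution summary (the real generator would use a learned similarity; function names here are hypothetical):

```python
from collections import Counter

def value_distribution(values):
    """Crude distributional signature: normalized histogram over
    character classes (digit / alpha / other) across all values."""
    counts, total = Counter(), 0
    for v in values:
        for ch in str(v):
            if ch.isdigit():
                counts["digit"] += 1
            elif ch.isalpha():
                counts["alpha"] += 1
            else:
                counts["other"] += 1
            total += 1
    return {k: c / total for k, c in counts.items()} if total else {}

def swap_score(col_a, col_b):
    """Similarity of two columns' value distributions (1 - total
    variation distance); high scores mark confusable pairs the
    generator should propose as swaps."""
    da, db = value_distribution(col_a), value_distribution(col_b)
    keys = set(da) | set(db)
    return 1.0 - 0.5 * sum(abs(da.get(k, 0) - db.get(k, 0)) for k in keys)
```

Two hex-ID columns score higher against each other than against a date column, so the generator prefers the ID-for-ID swap — exactly the hard discrimination case described above.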

Loss: Binary cross-entropy per column.

\[ \mathcal{L}_{\text{RCD}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log D(\mathbf{h}_i) + (1 - y_i) \log(1 - D(\mathbf{h}_i)) \right] \]

where \(y_i = 1\) if column \(i\) was replaced and \(D\) is the discriminator head.

Trains: Confusable type resolution. Directly addresses the hardest failure mode in production column annotation — columns with identical value patterns but different semantic roles.


Relation Masking

LLM analogue: Next Sentence Prediction (BERT) / Sentence Order Prediction (ALBERT), extended to structural relationships

Drop a foreign key column from the serialized data and ask the model to predict that a relationship between two tables exists, which tables it connects, and what column would mediate it.

Task variants:

| Variant | Input | Target |
|---|---|---|
| Existence | Two tables, FK column removed | Binary: are these tables related? |
| Direction | Two related tables, FK removed | Which table is parent, which is child? |
| Column | Tables with FK removed | Which column in the child table held the FK? |
| Full recovery | Multi-table schema, one FK removed | Predict source table, target table, and mediating column |

Difficulty: Existence is easy (value overlap between tables is a strong signal). Direction requires understanding cardinality from data distributions. Full recovery in a 10-table schema with multiple possible FK targets is genuinely hard.
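The value-overlap signal can be made concrete. A minimal heuristic (illustrative only — the model learns this signal from data rather than running such a rule) scores candidate FK columns by containment of child values in the parent key:

```python
def fk_candidates(child_cols, parent_key_values, threshold=0.95):
    """Score each child-table column by containment of its distinct
    values in the candidate parent key: |child ∩ parent| / |child|.
    High containment is the value-overlap evidence behind the
    Existence and Column variants of relation masking."""
    parent = set(parent_key_values)
    candidates = {}
    for name, values in child_cols.items():
        distinct = set(values)
        score = len(distinct & parent) / len(distinct) if distinct else 0.0
        if score >= threshold:
            candidates[name] = score
    return candidates
```

The threshold below 1.0 tolerates a few orphaned rows, which real warehouses routinely contain.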

Loss: Cross-entropy over table pairs for existence/direction, cross-entropy over columns for the FK column prediction.

\[ \mathcal{L}_{\text{RM}} = \mathcal{L}_{\text{exist}} + \beta_1 \mathcal{L}_{\text{direction}} + \beta_2 \mathcal{L}_{\text{column}} \]

Trains: Cross-table data element discovery. The model learns to identify structural relationships between tables from data patterns alone — exactly what’s needed when foreign key metadata is missing or unreliable in enterprise warehouses.


Span Corruption (Entity-Level)

LLM analogue: Span Corruption (T5)

Mask all columns belonging to one data element across all tables and replace them with a sentinel. The model must predict what kind of entity is absent based on the remaining schema structure.

This is harder than single-property masking because the model must reason about the structural hole in the schema. A schema with patients, encounters, medications, and providers but no diagnostic information has a recognizable gap — clinical workflows always involve diagnosis. The model learns domain-level structural expectations.

Masking strategies:

  • Single entity: Remove all columns from one BFO class (as above)
  • Related pair: Remove two related entities (e.g., Diagnosis and its FK in Encounter)
  • Subtree: Remove an entity and all its dependents in the ontological hierarchy
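The single-entity strategy can be sketched as follows, assuming a hypothetical representation of the schema as table → column list plus an entity-to-columns map:

```python
SENTINEL = "<extra_id_0>"

def corrupt_entity(schema, entity_to_cols, target_entity):
    """Entity-level span corruption: remove every column belonging to
    one ontological entity, across all tables, and leave one sentinel
    token per affected table marking the structural hole."""
    masked = set(entity_to_cols[target_entity])
    corrupted = {}
    for table, cols in schema.items():
        kept = [c for c in cols if c not in masked]
        if len(kept) < len(cols):  # this table lost columns
            kept.append(SENTINEL)
        corrupted[table] = kept
    return corrupted
```

Note the sentinel lands both in the entity's own table and in any table that referenced it, so the model sees the hole from both sides of the relationship.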

Loss: Sequence-to-sequence generation of the masked entity structure, or classification over a vocabulary of entity type templates.

Trains: Entity boundary detection and structural reasoning. When the model encounters a real database missing expected entities, it can predict what should exist — critical for data governance gap analysis.


Augmentation Strategies

Schema Denoising

LLM analogue: Denoising Autoencoder (BART)

Apply multiple corruptions to the serialized schema simultaneously. The model must recover the clean ontological structure from the noisy input.

Corruption menu (applied stochastically per training example):

| Corruption | What Changes | Real-World Analogue |
|---|---|---|
| Column renaming | date_of_birth → col_3 | Generic column names in enterprise DW |
| Column shuffling | Randomize column order within tables | Arbitrary column ordering conventions |
| Table merging | Join two tables into one wide table | Denormalization for query performance |
| Table splitting | Split one table into arbitrary fragments | Vertical partitioning |
| Type coercion | Store dates as strings, integers as floats | Legacy system type mismatches |
| Delimiter variation | CSV → TSV → pipe-delimited → fixed-width | Different export formats |
| Header removal | Drop column headers entirely | Headerless data exports |
| Row sampling | Keep only a random subset of rows | Partial data access |

Multiple corruptions can stack: rename columns and merge tables and switch delimiters. The model trained on this distribution becomes robust to the full range of real-world schema messiness.
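The stochastic stacking can be sketched as independent per-corruption coin flips. Only two corruptions are shown and the probabilities are illustrative, not a tuned schedule:

```python
import random

def rename_columns(schema, rng):
    # Generic names, preserving order: date_of_birth becomes col_0, etc.
    return {t: [f"col_{i}" for i in range(len(cols))]
            for t, cols in schema.items()}

def shuffle_columns(schema, rng):
    out = {}
    for t, cols in schema.items():
        cols = list(cols)
        rng.shuffle(cols)
        out[t] = cols
    return out

# (name, transform, probability); probabilities are illustrative.
CORRUPTIONS = [("rename", rename_columns, 0.3),
               ("shuffle", shuffle_columns, 0.5)]

def corrupt_schema(schema, seed):
    """Apply each corruption independently with its own probability,
    so multiple corruptions can stack on one training example."""
    rng = random.Random(seed)
    applied = []
    for name, fn, p in CORRUPTIONS:
        if rng.random() < p:
            schema = fn(schema, rng)
            applied.append(name)
    return schema, applied
```

Because each flip is independent, a single example can emerge renamed and shuffled, matching the worst-case inputs the model will face in production.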

Loss: Reconstruction loss on the original ontological labels applied to the column embeddings from the corrupted input. The corruptions change what the model sees; the targets remain the clean ontological structure.

Trains: Robustness to real-world data formats. Enterprise databases exhibit every one of these corruptions and often several simultaneously.


Cross-Schema Contrastive Learning

LLM analogue: Contrastive Learning (SimCLR, CLIP)

Generate two different schemas from the same ontology fragment — one normalized with clear names, one denormalized with obfuscated names — and train the model to produce similar representations for both. Schemas from different ontology fragments should produce dissimilar representations.

Positive pairs: Two schema variants from the same ontology fragment. Negative pairs: Schemas from different ontology fragments (even within the same domain — two different healthcare schemas should still be distinguishable).

Loss: InfoNCE contrastive loss over schema-level representations.

\[ \mathcal{L}_{\text{CSC}} = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{\exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_i^b) / \tau)}{\sum_{j \in \mathcal{B}} \exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_j^b) / \tau)} \]

where \(\mathbf{z}_i^a\) and \(\mathbf{z}_i^b\) are schema-level embeddings (pooled from column embeddings) for the two variants of ontology fragment \(i\), and \(\mathcal{B}\) is the batch.
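A numerically stable sketch of this InfoNCE loss in numpy, assuming batches of row-vector embeddings and cosine similarity for \(\text{sim}\):

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE over schema-level embeddings: view a of fragment i
    should be most similar to view b of the same fragment, against
    every other fragment's view b in the batch."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sims = (z_a @ z_b.T) / tau                  # cosine similarities
    sims -= sims.max(axis=1, keepdims=True)     # stability shift
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # diagonal = positives
```

When the two views of each fragment align, the loss is near zero; misaligned pairs drive it up, which is the signal that pushes schema variants of the same fragment together.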

Trains: Schema-invariant representations. The model learns that the same information can appear in radically different structural formats — the core challenge in enterprise data integration.


Domain-Specific Objectives

Axiom Recovery

LLM analogue: No direct analogue — novel to this setting

Given only the populated tables (no schema metadata), predict the constraints from the source ontology.

Target axioms:

| Axiom Type | Example | Evidence in Data |
|---|---|---|
| Enum constraint | disposition ∈ {admission, discharge, transfer, observation} | Closed set of distinct values |
| Uniqueness | license_number is unique per provider | No duplicates in column |
| Cardinality | Exactly one is_primary=true per encounter | Group-by count pattern |
| Range | esi_level ∈ [1, 5] | Min/max of integer column |
| Referential | Every encounter.patient_id appears in patient.patient_id | Value subset relationship |
| Functional dependency | zip_code → state | Deterministic mapping in data |
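Several of these evidence patterns can be checked directly. The heuristic below is illustrative and separate from the learned objective; thresholds like `max_enum` are assumptions:

```python
def infer_axiom_candidates(values, max_enum=10):
    """Propose axiom candidates for one column from its values: a small
    closed value set suggests an enum constraint, absence of duplicates
    suggests uniqueness, and integer min/max suggests a range axiom."""
    candidates = []
    distinct = set(values)
    if len(distinct) <= max_enum:
        candidates.append(("enum", frozenset(distinct)))
    if len(distinct) == len(values):
        candidates.append(("unique", None))
    if values and all(isinstance(v, int) for v in values):
        candidates.append(("range", (min(values), max(values))))
    return candidates
```

These are candidates, not conclusions — a column can look enum-like in a sample and still be open-ended, which is why the learned model conditions on the surrounding schema rather than the column alone.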

Loss: Multi-label classification over axiom templates, parameterized by column references and value sets.

Trains: Constraint discovery. In production, many database constraints are implicit (enforced by application logic, not declared in the schema). A model that can infer constraints from data patterns provides direct value for data quality assessment and governance.


Normalization Prediction

LLM analogue: No direct analogue — novel to this setting

Given a denormalized table, predict the normalized ontological entities — which groups of columns should be separate entities.

In the hospital example, a fully denormalized patient_encounters table contains patient demographics, encounter details, vital signs, diagnoses, and medications all in one wide table. The model must predict that this represents 5+ distinct ontological entities that have been collapsed.

The inverse task is also valuable: given a normalized schema, predict which tables could be meaningfully denormalized (i.e., which tables represent qualities or sub-parts of a parent entity).

Loss: Clustering loss over column embeddings within a single table — columns that should be factored into the same normalized entity should cluster together.

\[ \mathcal{L}_{\text{norm}} = -\frac{1}{|\mathcal{P}_{\text{intra}}|} \sum_{(i,j) \in \mathcal{P}_{\text{intra}}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \in \text{cols}(t)} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]

where \(\mathcal{P}_{\text{intra}}\) is the set of column pairs within a single table that originate from the same ontological entity.

Trains: Entity boundary detection within denormalized tables. Real enterprise data warehouses are heavily denormalized for query performance. Recovering the underlying entity structure from a 200-column fact table is a high-value governance task.


Cardinality Estimation

LLM analogue: No direct analogue — extends relational reasoning

Given populated tables, predict the cardinality constraints from the source ontology: one-to-one, one-to-many, or many-to-many.

The model must infer cardinality from value distributions:

  • 1:1: Every FK value appears exactly once in both tables
  • 1:N: FK values in the child table repeat; each parent PK appears once
  • M:N: Both sides have repeating values (mediated by a junction table)
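The first two cases reduce to a repetition check on the FK column. A sketch (M:N detection additionally requires recognizing the junction table, which this helper does not attempt):

```python
from collections import Counter

def infer_cardinality(child_fk_values):
    """Infer 1:1 vs 1:N from FK value repetition: if every FK value
    in the child table appears exactly once the relationship looks
    1:1; any repetition makes it 1:N."""
    counts = Counter(child_fk_values)
    return "1:1" if all(c == 1 for c in counts.values()) else "1:N"
```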

Loss: Cross-entropy over cardinality categories per table pair.

Trains: Relationship characterization. Understanding cardinality is foundational for schema understanding and directly supports both CPA and data element discovery — a 1:1 relationship suggests entity decomposition, while M:N suggests an independent association.


Difficulty Curriculum

Following UL2’s insight that mixing objectives with explicit difficulty signals outperforms any single objective, training uses a difficulty-tagged curriculum.

Each training example carries a difficulty tag prepended to the input. The model learns to allocate capacity differently depending on the expected difficulty — using fast pattern matching for R-level tasks and deeper structural reasoning for X/Z-level tasks.

Curriculum Schedule

Training proceeds in four phases, progressively increasing difficulty:

| Phase | Epochs | Mix (R/S/X/Z) | Objectives Introduced |
|---|---|---|---|
| 1 | 0–10 | 70/20/10/0 | OPM, RCD (easy variants) |
| 2 | 10–30 | 30/40/20/10 | + Relation Masking, Schema Denoising |
| 3 | 30–60 | 10/30/30/30 | + Span Corruption, Cross-Schema Contrastive |
| 4 | 60+ | 10/20/30/40 | + Axiom Recovery, Normalization, Cardinality |

Domain-specific objectives (axiom recovery, normalization prediction, cardinality estimation) are introduced late because they require the model to already have basic column understanding and cross-table reasoning capabilities.
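The phase mixes can be realized as a weighted draw over difficulty tags. A sketch using the R/S/X/Z weights from the schedule (tag strings and the sampler itself are illustrative):

```python
import random

# Phase -> (R, S, X, Z) mixing weights, matching the curriculum schedule.
PHASE_MIX = {1: (70, 20, 10, 0),
             2: (30, 40, 20, 10),
             3: (10, 30, 30, 30),
             4: (10, 20, 30, 40)}
TAGS = ("R", "S", "X", "Z")

def sample_difficulty_tag(phase, rng):
    """Draw a difficulty tag for one training example according to the
    current phase's R/S/X/Z mix; the tag is prepended to the input."""
    return rng.choices(TAGS, weights=PHASE_MIX[phase], k=1)[0]
```

Because the Z weight is zero in phase 1, the hardest examples never appear until the model has basic column understanding — the curriculum is enforced entirely by the sampler.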

Objective Priority

The objectives are not equally important. Based on downstream task alignment:

| Objective | Priority | Downstream Impact |
|---|---|---|
| Object Property Masking | Core | Directly trains CTA |
| Replaced Column Detection | Core | Resolves confusable pairs — the hardest CTA failures |
| Relation Masking | Core | Directly trains cross-table data element discovery |
| Span Corruption | Core | Trains entity boundary detection |
| Schema Denoising | High | Robustness to real-world data — improves all tasks |
| Cross-Schema Contrastive | High | Schema-invariant representations — critical for transfer |
| Axiom Recovery | Medium | Valuable for governance but not core to CTA/DE |
| Normalization Prediction | Medium | Important for denormalized warehouses |
| Cardinality Estimation | Medium | Supports relationship characterization |

The four core objectives should account for the majority of training compute. Augmentation strategies (denoising, contrastive) are applied as data transformations rather than separate losses. Domain-specific objectives are scheduled in later phases as refinement tasks.