
Training Tactics

The training objective defines what Aegir learns – ontological entity recovery from serialized relational tables. This page defines how: the specific self-supervised tasks, corruption strategies, and curriculum design that compose the pretraining regimen. Each tactic is adapted from a proven LLM pretraining method but re-targeted at the structural properties of relational data with known ontological provenance.

Tactic Overview

Each tactic is described below with its LLM analogue, formal task specification, and the downstream capability it trains.


Core Objectives

Object Property Masking

LLM analogue: Masked Language Modeling (BERT)

Mask one or more properties from an ontological entity definition. The model receives the serialized tables (which still contain the data for the masked properties) and must predict what properties the source entity had.

Difficulty gradation:

| Level | What's Masked | Challenge |
|---|---|---|
| Easy | A column with structurally distinctive values (dates, emails) | Pattern recognition |
| Medium | A column whose type depends on co-occurring columns | Cross-column reasoning |
| Hard | A column with confusable values (UUID vs advertising ID) | Contextual disambiguation |
| Expert | Multiple properties from the same entity simultaneously | Entity structure reconstruction |

Loss: Cross-entropy over the property type vocabulary, plus a regression loss for predicting the property name embedding.

\[ \mathcal{L}_{\text{OPM}} = -\frac{1}{|M|} \sum_{p \in M} \left[ \log P(y_p \mid \mathbf{h}_p) + \alpha \|\hat{\mathbf{e}}_p - \mathbf{e}_p\|^2 \right] \]

where \(M\) is the set of masked properties, \(y_p\) is the property’s BFO type, \(\mathbf{e}_p\) is the property name embedding, and \(\alpha\) weights the name regression term.

Trains: Column type annotation (CTA). The model learns to identify what semantic role a column plays from its value distribution and surrounding context.
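As a concrete sketch, the OPM loss above can be computed from per-property type logits and name-embedding predictions. This is an illustrative numpy implementation, not Aegir's actual training code; the array shapes and `alpha` default are assumptions.

```python
import numpy as np

def opm_loss(type_logits, type_targets, name_preds, name_targets, alpha=0.1):
    """Object Property Masking loss: softmax cross-entropy over the BFO
    type vocabulary plus an alpha-weighted squared-error term on the
    predicted property-name embeddings, averaged over masked properties.

    type_logits: (|M|, vocab) scores; type_targets: (|M|,) class indices;
    name_preds/name_targets: (|M|, d) name embeddings."""
    # Numerically stable log-softmax over the type vocabulary.
    shifted = type_logits - type_logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(type_targets)), type_targets]
    # Regression term on the property-name embeddings.
    reg = ((name_preds - name_targets) ** 2).sum(axis=1)
    return float((ce + alpha * reg).mean())
```

With confident correct logits and exact name embeddings the loss approaches zero; a mismatched name embedding adds exactly the alpha-weighted squared distance.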


Replaced Column Detection

LLM analogue: Replaced Token Detection (ELECTRA)

Swap columns between tables that originated from different ontological entities. A discriminator must identify which columns are imposters — present in a table they don’t ontologically belong to.

The generator learns to make plausible swaps — columns with similar value distributions but different semantic types. This is precisely the confusable-pair problem. A naive generator might swap patient_id (UUID) with encounter_date (timestamp) — trivially detectable. A trained generator learns to swap patient_id with provider_id (both UUIDs, both foreign-keyed) — a much harder discrimination task.

Two-phase training:

  1. Generator: A small model that scores candidate column swaps by value-distribution similarity and selects high-similarity pairs
  2. Discriminator: Aegir itself, trained to detect which columns don’t belong

The ELECTRA insight applies directly: the discriminator receives a training signal on every column (original or replaced), not just the masked positions. This is far more sample-efficient than masking-based objectives.
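A minimal sketch of the generator's swap scoring, assuming a crude character-class signature as the value-distribution summary (the real generator would use a learned similarity; function names here are hypothetical):

```python
from collections import Counter

def value_distribution(values):
    """Crude distributional signature: normalized histogram over
    character classes (digit / alpha / other) across all values."""
    counts, total = Counter(), 0
    for v in values:
        for ch in str(v):
            if ch.isdigit():
                counts["digit"] += 1
            elif ch.isalpha():
                counts["alpha"] += 1
            else:
                counts["other"] += 1
            total += 1
    return {k: c / total for k, c in counts.items()} if total else {}

def swap_score(col_a, col_b):
    """Similarity of two columns' value distributions (1 - total
    variation distance); high scores mark confusable pairs the
    generator should propose as swaps."""
    da, db = value_distribution(col_a), value_distribution(col_b)
    keys = set(da) | set(db)
    return 1.0 - 0.5 * sum(abs(da.get(k, 0) - db.get(k, 0)) for k in keys)
```

Two hex-ID columns score higher against each other than against a date column, so the generator prefers the ID-for-ID swap — exactly the hard discrimination case described above.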

Loss: Binary cross-entropy per column.

\[ \mathcal{L}_{\text{RCD}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log D(\mathbf{h}_i) + (1 - y_i) \log(1 - D(\mathbf{h}_i)) \right] \]

where \(y_i = 1\) if column \(i\) was replaced and \(D\) is the discriminator head.

Trains: Confusable type resolution. Directly addresses the hardest failure mode in production column annotation — columns with identical value patterns but different semantic roles.


Relation Masking

LLM analogue: Next Sentence Prediction (BERT) / Sentence Order Prediction (ALBERT), extended to structural relationships

Drop a foreign key column from the serialized data and ask the model to predict that a relationship between two tables exists, which tables it connects, and what column would mediate it.

Task variants:

| Variant | Input | Target |
|---|---|---|
| Existence | Two tables, FK column removed | Binary: are these tables related? |
| Direction | Two related tables, FK removed | Which table is parent, which is child? |
| Column | Tables with FK removed | Which column in the child table held the FK? |
| Full recovery | Multi-table schema, one FK removed | Predict source table, target table, and mediating column |

Difficulty: Existence is easy (value overlap between tables is a strong signal). Direction requires understanding cardinality from data distributions. Full recovery in a 10-table schema with multiple possible FK targets is genuinely hard.
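The value-overlap signal can be made concrete. A minimal heuristic (illustrative only — the model learns this signal from data rather than running such a rule) scores candidate FK columns by containment of child values in the parent key:

```python
def fk_candidates(child_cols, parent_key_values, threshold=0.95):
    """Score each child-table column by containment of its distinct
    values in the candidate parent key: |child ∩ parent| / |child|.
    High containment is the value-overlap evidence behind the
    Existence and Column variants of relation masking."""
    parent = set(parent_key_values)
    candidates = {}
    for name, values in child_cols.items():
        distinct = set(values)
        score = len(distinct & parent) / len(distinct) if distinct else 0.0
        if score >= threshold:
            candidates[name] = score
    return candidates
```

The threshold below 1.0 tolerates a few orphaned rows, which real warehouses routinely contain.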

Loss: Cross-entropy over table pairs for existence/direction, cross-entropy over columns for the FK column prediction.

\[ \mathcal{L}_{\text{RM}} = \mathcal{L}_{\text{exist}} + \beta_1 \mathcal{L}_{\text{direction}} + \beta_2 \mathcal{L}_{\text{column}} \]

Trains: Cross-table data element discovery. The model learns to identify structural relationships between tables from data patterns alone — exactly what’s needed when foreign key metadata is missing or unreliable in enterprise warehouses.


Span Corruption (Entity-Level)

LLM analogue: Span Corruption (T5)

Mask all columns belonging to one data element across all tables and replace them with a sentinel. The model must predict what kind of entity is absent based on the remaining schema structure.

This is harder than single-property masking because the model must reason about the structural hole in the schema. A schema with patients, encounters, medications, and providers but no diagnostic information has a recognizable gap — clinical workflows always involve diagnosis. The model learns domain-level structural expectations.

Masking strategies:

  • Single entity: Remove all columns from one BFO class (as above)
  • Related pair: Remove two related entities (e.g., Diagnosis and its FK in Encounter)
  • Subtree: Remove an entity and all its dependents in the ontological hierarchy
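The single-entity strategy can be sketched as follows, assuming a hypothetical representation of the schema as table → column list plus an entity-to-columns map:

```python
SENTINEL = "<extra_id_0>"

def corrupt_entity(schema, entity_to_cols, target_entity):
    """Entity-level span corruption: remove every column belonging to
    one ontological entity, across all tables, and leave one sentinel
    token per affected table marking the structural hole."""
    masked = set(entity_to_cols[target_entity])
    corrupted = {}
    for table, cols in schema.items():
        kept = [c for c in cols if c not in masked]
        if len(kept) < len(cols):  # this table lost columns
            kept.append(SENTINEL)
        corrupted[table] = kept
    return corrupted
```

Note the sentinel lands both in the entity's own table and in any table that referenced it, so the model sees the hole from both sides of the relationship.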

Loss: Sequence-to-sequence generation of the masked entity structure, or classification over a vocabulary of entity type templates.

Trains: Entity boundary detection and structural reasoning. When the model encounters a real database missing expected entities, it can predict what should exist — critical for data governance gap analysis.


Augmentation Strategies

Schema Denoising

LLM analogue: Denoising Autoencoder (BART)

Apply multiple corruptions to the serialized schema simultaneously. The model must recover the clean ontological structure from the noisy input.

Corruption menu (applied stochastically per training example):

| Corruption | What Changes | Real-World Analogue |
|---|---|---|
| Column renaming | date_of_birth → col_3 | Generic column names in enterprise DW |
| Column shuffling | Randomize column order within tables | Arbitrary column ordering conventions |
| Table merging | Join two tables into one wide table | Denormalization for query performance |
| Table splitting | Split one table into arbitrary fragments | Vertical partitioning |
| Type coercion | Store dates as strings, integers as floats | Legacy system type mismatches |
| Delimiter variation | CSV → TSV → pipe-delimited → fixed-width | Different export formats |
| Header removal | Drop column headers entirely | Headerless data exports |
| Row sampling | Keep only a random subset of rows | Partial data access |

Multiple corruptions can stack: rename columns and merge tables and switch delimiters. The model trained on this distribution becomes robust to the full range of real-world schema messiness.
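The stochastic stacking can be sketched as independent per-corruption coin flips. Only two corruptions are shown and the probabilities are illustrative, not a tuned schedule:

```python
import random

def rename_columns(schema, rng):
    # Generic names, preserving order: date_of_birth becomes col_0, etc.
    return {t: [f"col_{i}" for i in range(len(cols))]
            for t, cols in schema.items()}

def shuffle_columns(schema, rng):
    out = {}
    for t, cols in schema.items():
        cols = list(cols)
        rng.shuffle(cols)
        out[t] = cols
    return out

# (name, transform, probability); probabilities are illustrative.
CORRUPTIONS = [("rename", rename_columns, 0.3),
               ("shuffle", shuffle_columns, 0.5)]

def corrupt_schema(schema, seed):
    """Apply each corruption independently with its own probability,
    so multiple corruptions can stack on one training example."""
    rng = random.Random(seed)
    applied = []
    for name, fn, p in CORRUPTIONS:
        if rng.random() < p:
            schema = fn(schema, rng)
            applied.append(name)
    return schema, applied
```

Because each flip is independent, a single example can emerge renamed and shuffled, matching the worst-case inputs the model will face in production.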

Loss: Reconstruction loss on the original ontological labels applied to the column embeddings from the corrupted input. The corruptions change what the model sees; the targets remain the clean ontological structure.

Trains: Robustness to real-world data formats. Enterprise databases exhibit every one of these corruptions and often several simultaneously.


Cross-Schema Contrastive Learning

LLM analogue: Contrastive Learning (SimCLR, CLIP)

Generate two different schemas from the same ontology fragment — one normalized with clear names, one denormalized with obfuscated names — and train the model to produce similar representations for both. Schemas from different ontology fragments should produce dissimilar representations.

Positive pairs: Two schema variants from the same ontology fragment. Negative pairs: Schemas from different ontology fragments (even within the same domain — two different healthcare schemas should still be distinguishable).

Loss: InfoNCE contrastive loss over schema-level representations.

\[ \mathcal{L}_{\text{CSC}} = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{\exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_i^b) / \tau)}{\sum_{j \in \mathcal{B}} \exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_j^b) / \tau)} \]

where \(\mathbf{z}_i^a\) and \(\mathbf{z}_i^b\) are schema-level embeddings (pooled from column embeddings) for the two variants of ontology fragment \(i\), and \(\mathcal{B}\) is the batch.
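A numerically stable sketch of this InfoNCE loss in numpy, assuming batches of row-vector embeddings and cosine similarity for \(\text{sim}\):

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.1):
    """InfoNCE over schema-level embeddings: view a of fragment i
    should be most similar to view b of the same fragment, against
    every other fragment's view b in the batch."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    sims = (z_a @ z_b.T) / tau                  # cosine similarities
    sims -= sims.max(axis=1, keepdims=True)     # stability shift
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))  # diagonal = positives
```

When the two views of each fragment align, the loss is near zero; misaligned pairs drive it up, which is the signal that pushes schema variants of the same fragment together.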

Trains: Schema-invariant representations. The model learns that the same information can appear in radically different structural formats — the core challenge in enterprise data integration.


Domain-Specific Objectives

Axiom Recovery

LLM analogue: No direct analogue — novel to this setting

Given only the populated tables (no schema metadata), predict the constraints from the source ontology.

Target axioms:

| Axiom Type | Example | Evidence in Data |
|---|---|---|
| Enum constraint | disposition ∈ {admission, discharge, transfer, observation} | Closed set of distinct values |
| Uniqueness | license_number is unique per provider | No duplicates in column |
| Cardinality | Exactly one is_primary=true per encounter | Group-by count pattern |
| Range | esi_level ∈ [1, 5] | Min/max of integer column |
| Referential | Every encounter.patient_id appears in patient.patient_id | Value subset relationship |
| Functional dependency | zip_code → state | Deterministic mapping in data |
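Several of these evidence patterns can be checked directly. The heuristic below is illustrative and separate from the learned objective; thresholds like `max_enum` are assumptions:

```python
def infer_axiom_candidates(values, max_enum=10):
    """Propose axiom candidates for one column from its values: a small
    closed value set suggests an enum constraint, absence of duplicates
    suggests uniqueness, and integer min/max suggests a range axiom."""
    candidates = []
    distinct = set(values)
    if len(distinct) <= max_enum:
        candidates.append(("enum", frozenset(distinct)))
    if len(distinct) == len(values):
        candidates.append(("unique", None))
    if values and all(isinstance(v, int) for v in values):
        candidates.append(("range", (min(values), max(values))))
    return candidates
```

These are candidates, not conclusions — a column can look enum-like in a sample and still be open-ended, which is why the learned model conditions on the surrounding schema rather than the column alone.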

Loss: Multi-label classification over axiom templates, parameterized by column references and value sets.

Trains: Constraint discovery. In production, many database constraints are implicit (enforced by application logic, not declared in the schema). A model that can infer constraints from data patterns provides direct value for data quality assessment and governance.


Normalization Prediction

LLM analogue: No direct analogue — novel to this setting

Given a denormalized table, predict the normalized ontological entities — which groups of columns should be separate entities.

In the hospital example, a fully denormalized patient_encounters table contains patient demographics, encounter details, vital signs, diagnoses, and medications all in one wide table. The model must predict that this represents 5+ distinct ontological entities that have been collapsed.

The inverse task is also valuable: given a normalized schema, predict which tables could be meaningfully denormalized (i.e., which tables represent qualities or sub-parts of a parent entity).

Loss: Clustering loss over column embeddings within a single table — columns that should be factored into the same normalized entity should cluster together.

\[ \mathcal{L}_{\text{norm}} = -\frac{1}{|\mathcal{P}_{\text{intra}}|} \sum_{(i,j) \in \mathcal{P}_{\text{intra}}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \in \text{cols}(t)} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]

where \(\mathcal{P}_{\text{intra}}\) is the set of column pairs within a single table that originate from the same ontological entity.

Trains: Entity boundary detection within denormalized tables. Real enterprise data warehouses are heavily denormalized for query performance. Recovering the underlying entity structure from a 200-column fact table is a high-value governance task.


Cardinality Estimation

LLM analogue: No direct analogue — extends relational reasoning

Given populated tables, predict the cardinality constraints from the source ontology: one-to-one, one-to-many, or many-to-many.

The model must infer cardinality from value distributions:

  • 1:1: Every FK value appears exactly once in both tables
  • 1:N: FK values in the child table repeat; each parent PK appears once
  • M:N: Both sides have repeating values (mediated by a junction table)
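The first two cases reduce to a repetition check on the FK column. A sketch (M:N detection additionally requires recognizing the junction table, which this helper does not attempt):

```python
from collections import Counter

def infer_cardinality(child_fk_values):
    """Infer 1:1 vs 1:N from FK value repetition: if every FK value
    in the child table appears exactly once the relationship looks
    1:1; any repetition makes it 1:N."""
    counts = Counter(child_fk_values)
    return "1:1" if all(c == 1 for c in counts.values()) else "1:N"
```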

Loss: Cross-entropy over cardinality categories per table pair.

Trains: Relationship characterization. Understanding cardinality is foundational for schema understanding and directly supports both CPA and data element discovery — a 1:1 relationship suggests entity decomposition, while M:N suggests an independent association.


Difficulty Curriculum

Following UL2’s insight that mixing objectives with explicit difficulty signals outperforms any single objective, training uses a difficulty-tagged curriculum.

Each training example carries a difficulty tag prepended to the input. The model learns to allocate capacity differently depending on the expected difficulty — using fast pattern matching for R-level tasks and deeper structural reasoning for X/Z-level tasks.

Curriculum Schedule

Training proceeds in four phases, progressively increasing difficulty:

| Phase | Epochs | Mix (R/S/X/Z) | Objectives Introduced |
|---|---|---|---|
| 1 | 0–10 | 70/20/10/0 | OPM, RCD (easy variants) |
| 2 | 10–30 | 30/40/20/10 | + Relation Masking, Schema Denoising |
| 3 | 30–60 | 10/30/30/30 | + Span Corruption, Cross-Schema Contrastive |
| 4 | 60+ | 10/20/30/40 | + Axiom Recovery, Normalization, Cardinality |

Domain-specific objectives (axiom recovery, normalization prediction, cardinality estimation) are introduced late because they require the model to already have basic column understanding and cross-table reasoning capabilities.
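The phase mixes can be realized as a weighted draw over difficulty tags. A sketch using the R/S/X/Z weights from the schedule (tag strings and the sampler itself are illustrative):

```python
import random

# Phase -> (R, S, X, Z) mixing weights, matching the curriculum schedule.
PHASE_MIX = {1: (70, 20, 10, 0),
             2: (30, 40, 20, 10),
             3: (10, 30, 30, 30),
             4: (10, 20, 30, 40)}
TAGS = ("R", "S", "X", "Z")

def sample_difficulty_tag(phase, rng):
    """Draw a difficulty tag for one training example according to the
    current phase's R/S/X/Z mix; the tag is prepended to the input."""
    return rng.choices(TAGS, weights=PHASE_MIX[phase], k=1)[0]
```

Because the Z weight is zero in phase 1, the hardest examples never appear until the model has basic column understanding — the curriculum is enforced entirely by the sampler.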

Objective Priority

The objectives are not equally important. Based on downstream task alignment:

| Objective | Priority | Downstream Impact |
|---|---|---|
| Object Property Masking | Core | Directly trains CTA |
| Replaced Column Detection | Core | Resolves confusable pairs — the hardest CTA failures |
| Relation Masking | Core | Directly trains cross-table data element discovery |
| Span Corruption | Core | Trains entity boundary detection |
| Schema Denoising | High | Robustness to real-world data — improves all tasks |
| Cross-Schema Contrastive | High | Schema-invariant representations — critical for transfer |
| Axiom Recovery | Medium | Valuable for governance but not core to CTA/DE |
| Normalization Prediction | Medium | Important for denormalized warehouses |
| Cardinality Estimation | Medium | Supports relationship characterization |

The four core objectives should account for the majority of training compute. Augmentation strategies (denoising, contrastive) are applied as data transformations rather than separate losses. Domain-specific objectives are scheduled in later phases as refinement tasks.