Stage 4: Training Objective

The training objective is the key departure from standard pretraining: Aegir does not learn to predict the next token. It learns to recover the ontological entities – data elements – that were used to generate the relational data it observes. This is possible because the generation pipeline (Stages 1–3) preserves a complete mapping from every column back to its source ontological entity.

Task Formulation

What the Model Sees

The model receives byte-serialized relational tables – one or more tables from the same generated schema, serialized as a byte stream. The serialization format mirrors how real data would be encountered:

  • CSV-style serialization with delimiters, quoting, and escape characters
  • Column headers may be descriptive (patient_id), abbreviated (pat_id), or opaque (col_0)
  • Multiple tables are concatenated with table-boundary markers
  • No schema metadata (no types, no foreign key declarations, no table names beyond what appears in headers)

The model must infer semantic structure purely from the byte patterns it observes.
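The serialization described above can be sketched as follows. The delimiter set, header-abbreviation rule, and table-boundary marker format are illustrative assumptions, not the pipeline's actual parameters:

```python
import csv
import io
import random

# Hypothetical serialization options; the real pipeline's choices are assumptions.
DELIMITERS = [",", ";", "\t", "|"]
HEADER_STYLES = ["descriptive", "abbreviated", "opaque"]

def serialize_table(name, headers, rows, rng):
    """Serialize one table to bytes with randomized CSV-style parameters."""
    delim = rng.choice(DELIMITERS)
    style = rng.choice(HEADER_STYLES)
    if style == "abbreviated":
        # e.g. patient_id -> pat_id (a toy abbreviation rule)
        headers = [h[:3] + "_" + h.split("_")[-1] if "_" in h else h[:4]
                   for h in headers]
    elif style == "opaque":
        headers = [f"col_{i}" for i in range(len(headers))]
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter=delim, quoting=csv.QUOTE_MINIMAL)
    writer.writerow(headers)
    writer.writerows(rows)
    # Table-boundary marker separating concatenated tables (format assumed).
    return f"### TABLE {name}\n".encode() + buf.getvalue().encode()

rng = random.Random(0)
blob = serialize_table("patients", ["patient_id", "diagnosis"],
                       [["p1", "C50.9"], ["p2", "E11.9"]], rng)
```

Note that no type or foreign-key metadata survives serialization; only the byte stream reaches the model.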

What the Model Predicts

Three prediction heads operate on the column-level embeddings produced by Aegir’s hierarchical encoder:

  1. Column Type Annotation (CTA): For each column, predict its BFO-grounded semantic type from a taxonomy. This maps directly to the CTA task on benchmarks like SOTAB and GitTables.

  2. Data Element Discovery (DE): Predict which columns – potentially across different tables – belong to the same ontological entity. This is formulated as a clustering task: columns originating from the same BFO class should receive similar embeddings.

  3. Hierarchical Consistency: Predict the BFO hierarchy level for each column. If a column is classified as Diagnosis (a subclass of GDC), it should also be recognized as a GenericallyDependentContinuant. This head enforces ontological coherence.
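The three heads can be pictured as simple linear maps over the shared column embeddings. The dimensions and taxonomy sizes below are placeholder assumptions, not Aegir's actual configuration:

```python
import random

DIM = 8        # column embedding size (assumed)
N_TYPES = 5    # size of the leaf type taxonomy (assumed)
N_LEVELS = 3   # number of BFO hierarchy levels (assumed)

def linear(dim_in, dim_out, rng):
    """A toy linear head: a dim_out x dim_in weight matrix as nested lists."""
    return [[rng.gauss(0.0, 0.1) for _ in range(dim_in)] for _ in range(dim_out)]

def apply_head(head, h):
    """Multiply a column embedding h by the head's weight matrix."""
    return [sum(w * x for w, x in zip(row, h)) for row in head]

rng = random.Random(0)
cta_head = linear(DIM, N_TYPES, rng)    # 1. leaf-type logits (CTA)
de_head = linear(DIM, DIM, rng)         # 2. projection used for clustering (DE)
hier_head = linear(DIM, N_LEVELS, rng)  # 3. hierarchy-level logits

h = [0.1] * DIM  # stand-in for one column embedding from the encoder
cta_logits = apply_head(cta_head, h)
```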

What We Compare Against

The ground truth comes directly from the generation pipeline:

  • CTA labels: The Column → BFO property mapping from Stage 2 gives the exact semantic type of every column
  • DE labels: The Column → BFO class mapping identifies which columns originated from the same ontological entity
  • Hierarchy labels: The BFO subsumption hierarchy defines the expected parent types for every leaf prediction
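Deriving the three label sets from provenance can be sketched as below; the per-column record format (the `bfo_class` and `bfo_type` field names) is invented for illustration:

```python
# Hypothetical provenance records emitted by the generation pipeline (Stage 2).
provenance = {
    "patients.pat_id": {"bfo_class": "Patient",   "bfo_type": "IdentifierOf"},
    "visits.patient":  {"bfo_class": "Patient",   "bfo_type": "IdentifierOf"},
    "visits.dx_code":  {"bfo_class": "Diagnosis", "bfo_type": "Diagnosis"},
}

def make_labels(provenance):
    """Derive CTA and DE training targets from column-level provenance."""
    # CTA: each column's exact semantic type.
    cta = {col: rec["bfo_type"] for col, rec in provenance.items()}
    # DE: columns sharing a BFO class are positives for the clustering head.
    de = {}
    for col, rec in provenance.items():
        de.setdefault(rec["bfo_class"], []).append(col)
    return cta, de

cta, de = make_labels(provenance)
```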

Loss Function

The total loss is a weighted combination of three terms:

\[ \mathcal{L} = \mathcal{L}_{\text{CTA}} + \lambda_1 \mathcal{L}_{\text{DE}} + \lambda_2 \mathcal{L}_{\text{hier}} \]

Column Type Annotation Loss

Standard cross-entropy over the column type taxonomy:

\[ \mathcal{L}_{\text{CTA}} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{h}_i) \]

where \(\mathbf{h}_i\) is the column embedding for column \(i\), \(y_i\) is the ground truth BFO-grounded type, and \(N\) is the total number of columns across all tables in the batch.
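A plain-Python transcription of this loss, with log-sum-exp stabilization for the softmax:

```python
import math

def cta_loss(logits, targets):
    """Mean cross-entropy over columns: -(1/N) * sum_i log p(y_i | h_i)."""
    total = 0.0
    for row, y in zip(logits, targets):
        m = max(row)                                        # stabilize softmax
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[y]                             # -log p(y | h)
    return total / len(targets)

# A confident, correct prediction yields a near-zero loss.
loss = cta_loss([[5.0, 0.0, 0.0]], [0])
```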

Data Element Discovery Loss

A contrastive loss that pulls together columns from the same ontological entity and pushes apart columns from different entities:

\[ \mathcal{L}_{\text{DE}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]

where \(\mathcal{P}\) is the set of positive pairs (columns from the same BFO class), \(\text{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter.
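A direct, toy-scale transcription of this contrastive loss (a real implementation would batch these operations):

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def de_loss(embeddings, labels, tau=0.1):
    """InfoNCE over positive pairs, i.e. columns from the same BFO class."""
    n = len(embeddings)
    pairs = [(i, j) for i in range(n) for j in range(n)
             if i != j and labels[i] == labels[j]]
    total = 0.0
    for i, j in pairs:
        num = math.exp(cosine_sim(embeddings[i], embeddings[j]) / tau)
        den = sum(math.exp(cosine_sim(embeddings[i], embeddings[k]) / tau)
                  for k in range(n) if k != i)
        total += -math.log(num / den)
    return total / len(pairs)

# Two nearly identical columns from one class, one column from another:
loss = de_loss([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]], [0, 0, 1])
```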

This loss is what teaches the model to discover data elements: columns that the model embeds close together are predicted to belong to the same real-world entity, regardless of which table they appear in.

Hierarchical Consistency Loss

A penalty for predictions that violate the BFO subsumption hierarchy:

\[ \mathcal{L}_{\text{hier}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \text{ancestors}(y_i)} \max(0, \delta - p(c \mid \mathbf{h}_i)) \]

where \(\text{ancestors}(y_i)\) returns all BFO ancestors of the ground-truth type \(y_i\), and \(\delta\) is a margin. If a column's true type is Diagnosis, the model should assign high probability to every ancestor type: GDC, Continuant, Entity.
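A sketch of the margin penalty, using an invented toy taxonomy in which type 0 (Diagnosis) has ancestors 1 (GDC) and 2 (Continuant):

```python
def hier_loss(probs, targets, ancestors, delta=0.9):
    """Margin penalty when an ancestor of the true type gets low probability."""
    total = 0.0
    for p, y in zip(probs, targets):
        for c in ancestors[y]:
            total += max(0.0, delta - p[c])
    return total / len(targets)

# Toy taxonomy (assumed): Diagnosis -> GDC -> Continuant.
ancestors = {0: [1, 2]}
# Per-class probabilities for one column: GDC is confident, Continuant is not.
probs = [{0: 0.7, 1: 0.95, 2: 0.5}]
loss = hier_loss(probs, [0], ancestors)
```

Only the under-confident ancestor (Continuant at 0.5, below the 0.9 margin) contributes to the penalty.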

Training Loop

Batch Construction

Each training batch contains serialized tables from multiple generated schemas:

  1. Sample a schema from the pool (with curriculum: simpler schemas early, complex multi-table schemas later)
  2. Serialize one or more tables from the schema to bytes, using randomized serialization parameters (delimiter choice, quoting style, header format)
  3. Attach the ontological provenance labels as training targets
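The three steps can be sketched as follows; the curriculum schedule, schema-record fields, and serialization parameter names are assumptions for illustration:

```python
import random

def build_batch(schema_pool, step, total_steps, rng):
    """One batch item: sample a schema, pick serialization params, attach labels."""
    # 1. Curriculum: cap schema complexity by training progress (toy schedule).
    max_tables = 1 + int(3 * step / total_steps)
    eligible = [s for s in schema_pool if s["n_tables"] <= max_tables]
    schema = rng.choice(eligible)
    # 2. Randomized serialization parameters.
    params = {
        "delimiter": rng.choice([",", ";", "\t"]),
        "quoting": rng.choice(["minimal", "all"]),
        "header": rng.choice(["descriptive", "abbreviated", "opaque"]),
    }
    # 3. Attach ontological provenance as training targets.
    return {"schema": schema["name"], "params": params,
            "labels": schema["provenance"]}

pool = [{"name": "s1", "n_tables": 1, "provenance": {"col_0": "Diagnosis"}},
        {"name": "s2", "n_tables": 4, "provenance": {}}]
batch = build_batch(pool, step=0, total_steps=100, rng=random.Random(0))
```

Early in training (step 0) only the single-table schema is eligible; the four-table schema becomes available as the curriculum advances.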

Multi-Table Batches

For cross-table data element discovery, batches include multiple tables from the same schema. The agent swarm architecture processes each table with a separate agent, and the fused recurrent states are used for the DE prediction head. This directly trains the model’s cross-table reasoning capability.

Connection to Downstream Tasks

The pretraining objective maps precisely to the three real-world tasks described in the Introduction:

Pretraining Task | Downstream Task | Transfer Mechanism
--- | --- | ---
Column type prediction | CTA on SOTAB/GitTables/WikiTables | Fine-tune CTA head on benchmark taxonomy
Cross-column clustering | CPA on benchmark datasets | Column pair relationship classification
Cross-table data element prediction | Enterprise data element discovery | Direct application – same task, real data

The key advantage: by pretraining on synthetic data with known ground truth at massive scale, the model enters fine-tuning with strong representations for column semantics. The confusable types, cross-table relationships, and ontological hierarchies it has learned from synthetic data transfer directly to the noisy, inconsistently named, under-documented columns in real enterprise data warehouses.

Integration with Evidence Pipelines

In production, Aegir’s predictions feed into Dempster-Shafer theory (DST) evidence fusion pipelines as a learned evidence source. The model produces:

  • Column type predictions with calibrated confidence – these become mass functions in the DST framework
  • Column embedding similarities – these provide evidence for same-entity relationships
  • Hierarchical type predictions – these constrain the feasible type space for conjunctive combination
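One simple way such predictions could enter a DST pipeline is to discount singleton masses by a confidence factor and assign the remainder to the full frame of discernment as explicit ignorance. The discounting scheme below is an illustrative assumption, not Aegir's actual calibration method:

```python
def to_mass_function(probs, frame, ignorance=0.1):
    """Turn calibrated type probabilities into a Dempster-Shafer mass function:
    scale singleton masses by (1 - ignorance) and put the rest on the frame."""
    mass = {(t,): p * (1.0 - ignorance) for t, p in probs.items()}
    mass[tuple(sorted(frame))] = ignorance  # mass on the full frame = ignorance
    return mass

m = to_mass_function({"Diagnosis": 0.8, "Procedure": 0.2},
                     frame=["Diagnosis", "Procedure"])
```

A valid mass function sums to 1; the better calibrated the input probabilities, the less distortion this conversion introduces into downstream conjunctive combination.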

The calibration quality of Aegir’s confidence scores matters as much as the accuracy of its top-1 predictions. Training on diverse synthetic data with controlled difficulty (including deliberately confusable types) produces well-calibrated uncertainty estimates, because the model learns from data where the boundary between types is precisely controlled.

The specific self-supervised tasks, corruption strategies, and curriculum design that implement this objective are detailed in Training Tactics.