Stage 4: Training Objective
The training objective is the key departure from standard pretraining: Aegir does not learn to predict the next token. It learns to recover the ontological entities – data elements – that were used to generate the relational data it observes. This is possible because the generation pipeline (Stages 1–3) preserves a complete mapping from every column back to its source ontological entity.
Task Formulation
What the Model Sees
The model receives byte-serialized relational tables – one or more tables from the same generated schema, serialized as a byte stream. The serialization format mirrors how real data would be encountered:
- CSV-style serialization with delimiters, quoting, and escape characters
- Column headers may be descriptive (`patient_id`), abbreviated (`pat_id`), or opaque (`col_0`)
- Multiple tables are concatenated with table-boundary markers
- No schema metadata (no types, no foreign key declarations, no table names beyond what appears in headers)
The model must infer semantic structure purely from the byte patterns it observes.
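To make the input format concrete, here is a minimal sketch of serializing two tables from one schema into a single byte stream. The boundary marker (`\x1d`, the ASCII group separator), the header names, and the CSV parameters are illustrative assumptions, not Aegir's actual serialization format.

```python
import csv
import io

def serialize_tables(tables, delimiter=",", boundary=b"\x1d"):
    """Serialize a list of (headers, rows) tables into one byte stream.

    No schema metadata is emitted: only headers, values, and a
    table-boundary marker between tables.
    """
    parts = []
    for headers, rows in tables:
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter=delimiter, quoting=csv.QUOTE_MINIMAL)
        writer.writerow(headers)
        writer.writerows(rows)
        parts.append(buf.getvalue().encode("utf-8"))
    return boundary.join(parts)

# One table with abbreviated headers, one with opaque headers.
patients = (["pat_id", "dx"], [["P001", "E11.9"], ["P002", "I10"]])
visits = (["col_0", "col_1"], [["P001", "2024-01-03"]])
stream = serialize_tables([patients, visits])
```

The model sees only `stream`: it must recover that `pat_id` and `col_0` refer to the same entity from byte patterns alone.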
What the Model Predicts
Three prediction heads operate on the column-level embeddings produced by Aegir’s hierarchical encoder:
- Column Type Annotation (CTA): For each column, predict its BFO-grounded semantic type from a taxonomy. This maps directly to the CTA task on benchmarks like SOTAB and GitTables.
- Data Element Discovery (DE): Predict which columns – potentially across different tables – belong to the same ontological entity. This is formulated as a clustering task: columns originating from the same BFO class should receive similar embeddings.
- Hierarchical Consistency: Predict the BFO hierarchy level for each column. If a column is classified as `Diagnosis` (a subclass of `GDC`), it should also be recognized as a `GenericallyDependentContinuant`. This head enforces ontological coherence.
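A minimal numpy sketch of how the three heads could sit on top of the column embeddings. The dimensions, the linear-head shapes, and the use of raw cosine similarity for the DE head are all assumptions for illustration, not Aegir's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_cols, n_types, n_levels = 16, 5, 8, 4

# Column-level embeddings produced by the hierarchical encoder.
H = rng.normal(size=(n_cols, d))

# CTA head: logits over the BFO-grounded type taxonomy.
W_cta = rng.normal(size=(d, n_types))
cta_logits = H @ W_cta                      # (n_cols, n_types)

# Hierarchy head: logits over BFO hierarchy levels.
W_hier = rng.normal(size=(d, n_levels))
hier_logits = H @ W_hier                    # (n_cols, n_levels)

# DE "head": no classifier; columns are compared directly in
# embedding space, so clustering falls out of the geometry.
Z = H / np.linalg.norm(H, axis=1, keepdims=True)
similarity = Z @ Z.T                        # cosine similarity, (n_cols, n_cols)
```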
What We Compare Against
The ground truth comes directly from the generation pipeline:
- CTA labels: The `Column → BFO property` mapping from Stage 2 gives the exact semantic type of every column
- DE labels: The `Column → BFO class` mapping identifies which columns originated from the same ontological entity
- Hierarchy labels: The BFO subsumption hierarchy defines the expected parent types for every leaf prediction
Loss Function
The total loss is a weighted combination of three terms:
\[ \mathcal{L} = \mathcal{L}_{\text{CTA}} + \lambda_1 \mathcal{L}_{\text{DE}} + \lambda_2 \mathcal{L}_{\text{hier}} \]
Column Type Annotation Loss
Standard cross-entropy over the column type taxonomy:
\[ \mathcal{L}_{\text{CTA}} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{h}_i) \]
where \(\mathbf{h}_i\) is the column embedding for column \(i\), \(y_i\) is the ground truth BFO-grounded type, and \(N\) is the total number of columns across all tables in the batch.
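This is plain cross-entropy; a numpy sketch under the assumption that the CTA head emits a logit vector per column and labels are integer indices into the taxonomy:

```python
import numpy as np

def cta_loss(logits, y):
    """Mean cross-entropy over columns.

    logits: (N, n_types) array from the CTA head.
    y: (N,) integer ground-truth BFO-grounded types.
    """
    # Log-softmax computed with the max-shift trick for stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Mean negative log-probability of each column's true type.
    return -log_probs[np.arange(len(y)), y].mean()

logits = np.array([[2.0, 0.5, -1.0],
                   [0.1, 3.0, 0.2]])
y = np.array([0, 1])
loss = cta_loss(logits, y)
```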
Data Element Discovery Loss
A contrastive loss that pulls together columns from the same ontological entity and pushes apart columns from different entities:
\[ \mathcal{L}_{\text{DE}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]
where \(\mathcal{P}\) is the set of positive pairs (columns from the same BFO class), \(\text{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter.
This loss is what teaches the model to discover data elements: columns that the model embeds close together are predicted to belong to the same real-world entity, regardless of which table they appear in.
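Under these definitions, the DE loss can be sketched in plain numpy. The pairwise loop is written for clarity rather than efficiency, and the embeddings, labels, and temperature are illustrative assumptions:

```python
import numpy as np

def de_loss(Z, labels, tau=0.1):
    """Contrastive DE loss over L2-normalized column embeddings Z.

    labels[i] is the BFO class that generated column i; columns with
    equal labels form the positive pairs.
    """
    sim = Z @ Z.T / tau                        # scaled cosine similarities
    n = len(labels)
    terms = []
    for i in range(n):
        others = [k for k in range(n) if k != i]
        log_denom = np.log(np.exp(sim[i, others]).sum())
        for j in others:
            if labels[j] == labels[i]:          # positive pair
                terms.append(log_denom - sim[i, j])   # = -log softmax
    return float(np.mean(terms))

rng = np.random.default_rng(1)
Z = rng.normal(size=(4, 8))
Z /= np.linalg.norm(Z, axis=1, keepdims=True)
labels = ["Diagnosis", "Diagnosis", "LabResult", "LabResult"]
loss = de_loss(Z, labels)
```

Note that when same-class embeddings coincide and cross-class embeddings are orthogonal, the loss approaches zero, which is exactly the clustering behavior the head is trained toward.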
Hierarchical Consistency Loss
A penalty for predictions that violate the BFO subsumption hierarchy:
\[ \mathcal{L}_{\text{hier}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \text{ancestors}(y_i)} \max(0, \delta - p(c \mid \mathbf{h}_i)) \]
where \(\text{ancestors}(y_i)\) returns all BFO ancestors of the ground-truth type \(y_i\), and \(\delta\) is a margin. If a column's true type is `Diagnosis`, the model should assign high probability to all ancestor types: `GDC`, `Continuant`, `Entity`.
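The penalty can be sketched directly from the formula. This sketch assumes the model exposes an independent probability per taxonomy type (so ancestor probabilities are separately available); the tiny ancestor table and margin value are illustrative, not the full BFO hierarchy:

```python
# Illustrative fragment of the BFO subsumption hierarchy.
ANCESTORS = {
    "Diagnosis": ["GDC", "Continuant", "Entity"],
    "GDC": ["Continuant", "Entity"],
}

def hier_loss(probs, y, ancestors, delta=0.9):
    """Mean hinge penalty for ancestor probabilities below the margin.

    probs: per-column dict mapping type name -> predicted probability.
    y: per-column ground-truth leaf types.
    """
    total = 0.0
    for p, leaf in zip(probs, y):
        for c in ancestors.get(leaf, []):
            total += max(0.0, delta - p.get(c, 0.0))  # hinge per ancestor
    return total / len(y)

# High ancestor probabilities incur no penalty; low ones are penalized.
good = [{"Diagnosis": 0.9, "GDC": 0.95, "Continuant": 0.95, "Entity": 0.99}]
bad = [{"Diagnosis": 0.9, "GDC": 0.1, "Continuant": 0.1, "Entity": 0.1}]
```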
Training Loop
Batch Construction
Each training batch contains serialized tables from multiple generated schemas:
- Sample a schema from the pool (with curriculum: simpler schemas early, complex multi-table schemas later)
- Serialize one or more tables from the schema to bytes, using randomized serialization parameters (delimiter choice, quoting style, header format)
- Attach the ontological provenance labels as training targets
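The three steps above can be sketched as a sampling loop. The curriculum threshold, the serialization parameter space, and the schema-pool representation are all assumptions for illustration:

```python
import random

DELIMITERS = [",", ";", "\t", "|"]
HEADER_STYLES = ["descriptive", "abbreviated", "opaque"]

def sample_batch(schema_pool, step, batch_size=4, rng=random.Random(0)):
    batch = []
    for _ in range(batch_size):
        # Curriculum: restrict early steps to single-table schemas,
        # then open up complex multi-table schemas.
        max_tables = 1 if step < 10_000 else 8
        schema = rng.choice(
            [s for s in schema_pool if s["n_tables"] <= max_tables])
        # Randomized serialization parameters per example.
        params = {
            "delimiter": rng.choice(DELIMITERS),
            "quote_all": rng.random() < 0.5,
            "header_style": rng.choice(HEADER_STYLES),
        }
        batch.append({"schema": schema, "serialization": params,
                      "labels": schema["provenance"]})  # ontological targets
    return batch

pool = [{"n_tables": 1, "provenance": {"c0": "Diagnosis"}},
        {"n_tables": 5, "provenance": {"c0": "GDC"}}]
early = sample_batch(pool, step=0)
late = sample_batch(pool, step=50_000)
```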
Multi-Table Batches
For cross-table data element discovery, batches include multiple tables from the same schema. The agent swarm architecture processes each table with a separate agent, and the fused recurrent states are used for the DE prediction head. This directly trains the model’s cross-table reasoning capability.
Connection to Downstream Tasks
The pretraining objective maps precisely to the three real-world tasks described in the Introduction:
| Pretraining Task | Downstream Task | Transfer Mechanism |
|---|---|---|
| Column type prediction | CTA on SOTAB/GitTables/WikiTables | Fine-tune CTA head on benchmark taxonomy |
| Cross-column clustering | CPA on benchmark datasets | Column pair relationship classification |
| Cross-table data element prediction | Enterprise data element discovery | Direct application – same task, real data |
The key advantage: by pretraining at massive scale on synthetic data with known ground truth, the model enters fine-tuning with strong representations of column semantics. The confusable types, cross-table relationships, and ontological hierarchies it has learned from synthetic data transfer directly to the noisy, inconsistently named, under-documented columns in real enterprise data warehouses.
Integration with Evidence Pipelines
In production, Aegir’s predictions feed into Dempster-Shafer theory (DST) evidence fusion pipelines as a learned evidence source. The model produces:
- Column type predictions with calibrated confidence – these become mass functions in the DST framework
- Column embedding similarities – these provide evidence for same-entity relationships
- Hierarchical type predictions – these constrain the feasible type space for conjunctive combination
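One standard way a calibrated prediction becomes a DST mass function is via Shafer discounting: scale the singleton masses by the source's reliability and assign the leftover mass to the whole frame of discernment (total ignorance). The reliability value and three-type frame below are illustrative assumptions, not Aegir's production configuration:

```python
def to_mass_function(type_probs, reliability=0.9):
    """Turn calibrated type probabilities into a discounted mass function.

    type_probs: dict mapping type name -> calibrated probability (sums to 1).
    reliability: discount factor for this evidence source.
    """
    # Singleton focal elements, scaled by source reliability.
    mass = {frozenset([t]): reliability * p for t, p in type_probs.items()}
    # Remaining mass goes to the full frame Theta, keeping the
    # evidence "open" rather than overcommitted.
    frame = frozenset(type_probs)
    mass[frame] = mass.get(frame, 0.0) + (1.0 - reliability)
    return mass

m = to_mass_function({"Diagnosis": 0.7, "LabResult": 0.2, "Medication": 0.1})
```

Mass functions built this way can then be fused with other evidence sources by conjunctive combination, with the hierarchical predictions restricting which focal elements are admissible.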
The calibration quality of Aegir’s confidence scores matters as much as the accuracy of its top-1 predictions. Training on diverse synthetic data with controlled difficulty (including deliberately confusable types) produces well-calibrated uncertainty estimates, because the model learns from data where the boundary between types is precisely controlled.
The specific self-supervised tasks, corruption strategies, and curriculum design that implement this objective are detailed in Training Tactics.