Introduction
Aegir is a hierarchical sequence model for semantic column annotation and cross-table data element discovery on relational data. Given one or more tables, Aegir predicts semantic types for individual columns (Column Type Annotation), identifies properties and relationships between columns (Column Property Annotation), and discovers coherent data elements – groups of semantically related columns that span multiple tables in a data warehouse.
Problem Setting
Enterprise data warehouses contain thousands of tables with columns whose meaning is often opaque: generic names (col0, field_42), inconsistent conventions across teams, and no machine-readable metadata. Understanding what each column represents – and which columns across different tables refer to the same real-world concept – is foundational to data governance, privacy compliance, and integration.
Current approaches to this problem fall into two categories:
Pattern and heuristic-based methods identify column types through regex detectors (email, SSN, credit card patterns), name matching, embedding similarity, and gradient-boosted classifiers trained on hand-engineered features. These methods work well for structurally distinct types but struggle with confusable pairs – columns whose value distributions are nearly identical but whose semantic types differ (e.g., advertising IDs vs GUIDs, bank account numbers vs payment card numbers). They also require manual enumeration of data element patterns and cannot generalize to novel relationship types.
Learned sequence models (DODUO, RECA, REVEAL) treat the table as a token sequence and classify columns via fine-tuned transformers. REVEAL’s key insight is that context column selection matters: choosing the right neighboring columns (via MMR diversity sampling) dramatically improves annotation accuracy. However, these models operate on single tables in isolation and use fixed subword tokenizers that fragment tabular data unpredictably.
Aegir bridges these approaches. It is designed to be trained in situ alongside evidence-based classification pipelines – consuming the same serialized table representations, but learning cross-column and cross-table relationships end-to-end rather than relying on manually enumerated patterns. Specifically:
- Column Type Annotation (CTA): Classify individual columns into a semantic taxonomy (e.g., SIGDG ontology categories, Schema.org types, DBpedia classes).
- Column Property Annotation (CPA): Identify properties and relationships between column pairs (e.g., “city is-located-in country”).
- Data Element Discovery: Identify groups of related columns across tables that constitute coherent real-world entities (e.g., a PaymentCard data element spanning `card_number`, `expiry`, and `cardholder` columns across billing, transaction, and customer tables).
The third task – cross-table data element discovery – is where the greatest value lies for enterprise governance. Current pipelines discover data elements through keyword-based schema matching and post-classification co-occurrence analysis. A model that learns these relationships from data can generalize beyond enumerated patterns, handle non-English and abbreviated column names, and resolve confusable pairs by leveraging cross-table structural context that no single-table classifier can access.
Target benchmarks:
- SOTAB – Semantic column annotation on Web tables (Schema.org types)
- GitTables – Large-scale column type detection across 1M+ CSV tables from GitHub (100% generic column names – the hardest regime)
- WikiTables – Column annotation on Wikipedia HTML tables
Key Innovations
Byte-level dynamic chunking as learned tokenization. Rather than using a fixed tokenizer (BPE, SentencePiece), Aegir operates on raw bytes and learns to segment sequences into variable-length chunks via content-dependent boundary prediction. A routing module measures cosine similarity between consecutive hidden states; high dissimilarity triggers a chunk boundary. This makes the “tokenization” fully differentiable and adapted to the data distribution – critical for tabular data where delimiters, numeric formats, and encodings vary wildly across sources.
All-RWKV recurrent architecture. The primary sequence processing blocks use RWKV-7 time mixing with flash-linear-attention Triton kernels. RWKV-7 maintains a constant-size recurrent state matrix of shape (B, H, head_size, head_size) regardless of sequence length. This gives O(1) memory per token during inference and, critically, makes the recurrent state a fixed-size object that can be serialized, transmitted, and algebraically combined across agents.
ROSA suffix automaton for exact pattern retrieval. The ROSA (RWKV Online Suffix Automaton) module provides lossless infinite-range retrieval by constructing an online suffix automaton over binarized hidden representations. While RWKV-7 learns smooth sequence-level patterns, ROSA can retrieve exact substring matches from arbitrarily far in the past – enabling precise pattern detection (email formats, card number structures) that complements the learned recurrent state.
Agent swarm with state fusion for cross-table reasoning. Multiple specialist agents can process different tables or column families in parallel. Because RWKV recurrent states are fixed-size matrices, they can be fused via attention-weighted combination, learned gating, or projection – far more efficiently than merging transformer KV caches, which grow linearly with sequence length. This architecture enables cross-table data element discovery: each agent processes a table, and the fused state captures inter-table relationships that no single-table model can learn.
In-situ training within evidence pipelines. Aegir is designed to integrate with Dempster-Shafer theory (DST) evidence fusion pipelines as a learned evidence source. Its predictions – with calibrated confidence – feed into the same conjunctive combination framework alongside cosine similarity, gradient boosting, pattern detectors, and name matching. The model learns from the pipeline’s own bootstrap labels and SAGE-validated features, creating a self-improving loop where Aegir’s learned representations replace hand-engineered heuristics as they prove their value.
Architecture at a Glance
Aegir uses a recursive hierarchy defined by nested layout strings:
arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
This reads as: 2 RWKV-7 encoder blocks, then a sub-hierarchy (2 encoder blocks, 4 main blocks, 2 decoder blocks), then 2 RWKV-7 decoder blocks. At each non-innermost stage, dynamic chunking downsamples the sequence before passing it to the next level, and an EMA-based dechunking module reconstructs the full resolution on the way back up.
The block types – RWKV-7, ROSA, MHA, Mamba-2 – can be freely mixed within any stage using compact layout strings like "w4T1r2".
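As a toy illustration of how the nesting implies a stage count, the nested layout can be walked recursively (`stage_count` is a hypothetical helper, not part of the codebase):

```python
def stage_count(layout):
    """Count hierarchy stages in a nested arch_layout (illustrative sketch).
    Each list level contributes one stage; at most one nested list per level."""
    inner = [item for item in layout if isinstance(item, list)]
    return 1 + (stage_count(inner[0]) if inner else 0)

arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
depth = stage_count(arch_layout)   # 3 stages (depth 0, 1, 2)
```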
Architecture Overview
Aegir is a recursive hierarchical sequence model. At the top level, it processes raw byte sequences through nested stages of encoding, dynamic chunking, inner processing, dechunking, and decoding. Each stage can use a different hidden dimension and a different mix of block types.
Recursive Hierarchy
The architecture is defined by a nested list called arch_layout. For example:
arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
d_model = [128, 192, 192]
This defines three stages (depth 0, 1, 2):
| Stage | Role | Layout | Dimension |
|---|---|---|---|
| 0 | Outermost encoder/decoder | "w2" / "w2" | 128 |
| 1 | Middle encoder/decoder | "w2" / "w2" | 192 |
| 2 | Innermost (main) | "w4" | 192 |
At each non-innermost stage, the data flow is: encoder → routing → chunk → main network → dechunk → residual → decoder.
At the innermost stage, only the main network runs (no chunking). The recursion bottoms out at a flat Isotropic block stack.
Data Flow in Detail
- Encoder: A flat stack of blocks (e.g., 2 RWKV-7 blocks) processes the full-resolution sequence.
- Routing: `RoutingModule` predicts boundary probabilities via cosine similarity. Tokens at predicted boundaries are selected as chunk representatives.
- Chunk: `ChunkLayer` downsamples by keeping only boundary tokens, producing a shorter sequence.
- Main network: The shorter sequence is processed by the next hierarchy level – which may itself contain encoding, chunking, and another level of recursion.
- Dechunk: `DeChunkLayer` reconstructs the full-length sequence via an EMA scan, blending chunk outputs back into non-boundary positions.
- Residual: A skip connection around the entire chunk/process/dechunk block, gated via straight-through estimation of the routing probabilities.
- Decoder: Another flat stack of blocks processes the reconstructed sequence.
Dimension Padding
When inner stages have a larger hidden dimension than outer stages, Aegir pads the input with a learnable vector (pad_dimension) on entry and slices it off on exit. This avoids linear projection overhead at every stage transition.
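A minimal numpy sketch of this pad-and-slice behavior (helper names are hypothetical; in the real model the pad vector is a learnable parameter):

```python
import numpy as np

d_outer, d_inner = 128, 192
pad_vec = np.ones(d_inner - d_outer)     # learnable in the real model

def pad_dimension(x, pad_vec):
    """Grow the hidden dim by appending a pad vector to every position."""
    B, L, _ = x.shape
    extra = np.broadcast_to(pad_vec, (B, L, pad_vec.shape[-1]))
    return np.concatenate([x, extra], axis=-1)

def slice_dimension(x, d_outer):
    """Drop the padded channels on exit from the inner stage."""
    return x[..., :d_outer]

x = np.zeros((2, 5, d_outer))
y = pad_dimension(x, pad_vec)            # (2, 5, 192) with no projection matmul
```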
Why All-RWKV
The primary design choice is to use RWKV-7 time mixing at all stages rather than transformers or pure SSMs. The motivation is threefold:
1. Uniform O(1) Recurrent State
Every RWKV-7 block maintains a recurrent state of shape (B, H, head_size, head_size). This is constant regardless of sequence length. During autoregressive inference, each token step updates this matrix and reads from it in O(head_size^2) time per head.
2. Agent State Fusion
For the agent swarm architecture, specialist agents process the same input and produce recurrent states. These states must be combined. RWKV states are fixed-size matrices that live in a well-defined linear space, making fusion via weighted sum, gating, or projection algebraically natural. In contrast:
- Transformer KV caches are O(L * d) and grow with sequence length, making fusion costly and poorly defined.
- Mamba-2 states are smaller but have a different algebraic structure (diagonal recurrence).
3. Chunk-Parallel Training
The chunk_rwkv7 kernel from flash-linear-attention enables training with parallel chunk processing while maintaining exact recurrent semantics. This gives near-transformer training throughput with recurrent inference efficiency.
Comparison Table
| Property | RWKV-7 (w/W) | Mamba-2 (m/M) | Transformer (t/T) |
|---|---|---|---|
| Training kernel | chunk_rwkv7 (Triton) | Mamba-2 SSD (CUDA) | Flash Attention 2 |
| Recurrent state | (H, K, K) matrix | (H, d_state) vector | None (KV cache) |
| Inference memory | O(d^2) constant | O(d * d_state) constant | O(L * d) linear |
| State fusibility | Natural (matrix sum) | Possible (vector sum) | Impractical |
| Exact retrieval | Via ROSA blocks | No | Via full attention |
| FFN pairing | CMix (relu^2) or SwiGLU | SwiGLU or none | SwiGLU or none |
In practice, RWKV-7 blocks (w/W) are the default choice at all stages. Mamba-2 (m/M) and MHA (t/T) blocks are available for ablation studies and hybrid configurations. ROSA (r/R) blocks provide exact substring matching as a complement to learned recurrent processing.
Hierarchical Dynamic Chunking
Dynamic chunking is Aegir’s mechanism for content-dependent hierarchical segmentation. Rather than using a fixed tokenizer, the model learns to predict chunk boundaries based on the hidden representations themselves. This module is adapted from H-Net (goombalab/hnet).
Overview
The chunking pipeline has three components that work together at each non-innermost stage of the hierarchy:
- RoutingModule – predicts which tokens are chunk boundaries
- ChunkLayer – downsamples the sequence by selecting boundary tokens
- DeChunkLayer – reconstructs the full-length sequence from chunk outputs via EMA
RoutingModule: Boundary Prediction
The routing module decides where to place chunk boundaries by measuring how different consecutive hidden states are.
Algorithm
For a sequence of hidden states h[0], h[1], ..., h[L-1]:
1. Project consecutive pairs through learnable Q and K matrices (initialized to identity).
2. Compute cosine similarity between adjacent projected states: `cos_sim[t] = cosine(Q @ h[t], K @ h[t+1])`
3. Convert to boundary probability: `p[t] = clamp((1 - cos_sim[t]) / 2, 0, 1)`
4. The first token always gets `p = 1.0` (it is always a boundary).
5. Threshold at 0.5: if `p[t] > 0.5`, token `t` is a boundary.
High dissimilarity between consecutive states means the content is changing – a natural place to start a new chunk. The Q/K projections are initialized to identity so the model starts with raw cosine similarity and can learn to refine the boundary criterion.
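The boundary criterion can be sketched in numpy (a reference sketch with identity Q/K projections, not the production module):

```python
import numpy as np

def boundary_probs(h, Wq=None, Wk=None):
    """Boundary probabilities from cosine dissimilarity of adjacent states.
    h: (L, D). Wq/Wk default to identity, matching the stated initialization."""
    L, D = h.shape
    Wq = np.eye(D) if Wq is None else Wq
    Wk = np.eye(D) if Wk is None else Wk
    q, k = h[:-1] @ Wq.T, h[1:] @ Wk.T
    cos = (q * k).sum(-1) / (
        np.linalg.norm(q, axis=-1) * np.linalg.norm(k, axis=-1) + 1e-8)
    p = np.clip((1.0 - cos) / 2.0, 0.0, 1.0)
    return np.concatenate([[1.0], p])    # first token is always a boundary

h = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
p = boundary_probs(h)                    # ~[1.0, 0.0, 1.0]
boundaries = p > 0.5                     # tokens 0 and 2 start chunks
```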
Handling Variable-Length Sequences
The routing module supports two modes:
- Padded mode (`mask` provided): Standard `(B, L, D)` tensors with a boolean mask. Boundary predictions outside the mask are suppressed.
- Packed mode (`cu_seqlens` provided): Sequences concatenated into a single `(1, total_len, D)` tensor with cumulative sequence lengths. The first token of each sequence in the pack is forced to be a boundary.
ChunkLayer: Downsampling
Once boundaries are predicted, ChunkLayer selects only the boundary tokens to form a shorter sequence.
In padded mode:
- Count how many boundary tokens each batch element has.
- Sort token indices so boundary tokens come first.
- Gather the first `max_boundaries` tokens per batch element.
- Produce a new mask indicating which positions in the shorter sequence are valid.
In packed mode:
- Boolean-index the boundary tokens directly from the flat sequence.
- Recompute `cu_seqlens` for the shorter packed sequence.
The output is a shorter sequence containing only the tokens that were at chunk boundaries.
DeChunkLayer: Reconstruction via EMA
After the inner hierarchy processes the chunked (shorter) sequence, DeChunkLayer reconstructs the full-length sequence. The key insight is that non-boundary tokens should smoothly interpolate from their nearest preceding boundary token’s output.
EMA Scan
The reconstruction uses an exponential moving average (EMA) scan:
y[0] = x[0]
y[t] = decay[t] * y[t-1] + (1 - decay[t]) * x[t]
where decay[t] = 1 - p[t] and p[t] is the boundary probability for token t.
At boundary tokens (p ~ 1), the output snaps to the new chunk value. At non-boundary tokens (p ~ 0), the output carries forward the previous value. The boundary probability controls the blend continuously, allowing gradient flow through the routing decisions.
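The scan can be sketched directly (reference semantics only, without the reorder/gather steps):

```python
import numpy as np

def ema_dechunk(x, p):
    """EMA reconstruction scan: decay[t] = 1 - p[t], so
    y[t] = (1 - p[t]) * y[t-1] + p[t] * x[t], with y[0] = x[0]."""
    y = np.empty_like(x)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = (1.0 - p[t]) * y[t - 1] + p[t] * x[t]
    return y

x = np.array([[1.0], [5.0], [9.0]])
p = np.array([1.0, 0.0, 1.0])    # token 1 is not a boundary
y = ema_dechunk(x, p)            # token 1 carries forward the chunk value 1.0
```

With hard probabilities the output snaps at boundaries and holds between them; soft probabilities blend the two, which is what lets gradients reach the routing decisions.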
Reconstruction Steps
- Reorder the chunk outputs according to the original boundary positions.
- Map each position in the full sequence to its cumulative boundary count (i.e., which chunk it belongs to).
- Run the EMA scan over the reordered chunk outputs with boundary-probability-derived decay factors.
- Gather the EMA outputs back to the original sequence positions.
Residual Connection
The entire chunk/process/dechunk pipeline is wrapped in a residual connection:
output = dechunk_output * STE(selected_probs) + residual_proj(encoder_output)
The residual_proj is a linear layer initialized to zero, so at initialization the chunking pathway contributes nothing and the model starts as a simple encoder-decoder. The Straight-Through Estimator (STE) passes gradients through the discrete routing decisions.
Recursive Nesting
The chunking pattern nests recursively. Consider a 3-stage hierarchy:
arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
- Stage 0: Encode the full byte sequence, predict boundaries, chunk down, pass to Stage 1, dechunk back up, decode.
- Stage 1: Encode the chunked sequence from Stage 0, predict boundaries again on this shorter sequence, chunk down further, pass to Stage 2, dechunk, decode.
- Stage 2: Process the doubly-chunked sequence with a flat stack of blocks (no further chunking).
Each level of chunking reduces the sequence length by a data-dependent factor. For byte-level input, the first level might learn character-like boundaries; the second level might learn word-like or phrase-like boundaries. The model discovers its own hierarchy of tokenization.
Inference: Token-by-Token Stepping
During autoregressive inference, each component has a step method for single-token processing:
- RoutingModule.step: Compares the new token against the previously seen token’s hidden state. If the boundary probability exceeds 0.5, the token starts a new chunk.
- ChunkLayer.step: If the token is a boundary, pass it through to the inner hierarchy. Otherwise, skip the inner hierarchy entirely.
- DeChunkLayer.step: Blend the new chunk output (if any) with the previous EMA value using the boundary probability as the mixing weight.
This means that during inference, the inner hierarchy only runs when a chunk boundary is detected, saving compute on non-boundary tokens.
RWKV-7 Time Mixing
RWKV-7 time mixing is the primary sequence processing mechanism in Aegir. It implements a linear recurrence with a matrix-valued state, combining the training efficiency of chunk-parallel computation with the inference efficiency of constant-memory recurrence. The implementation uses flash-linear-attention’s optimized Triton kernels.
Reference: RWKV-v8 “Heron” (BlinkDL/RWKV-LM), fla RWKV7Attention.
Core Recurrence
The recurrent state S[t] is a matrix of shape (H, head_size, head_size) per batch element, where H is the number of attention heads. The state update at each time step is:
S[t] = diag(w[t]) * S[t-1] + S[t-1] @ ab[t] + v[t] @ k[t]^T
where:
- `diag(w[t])` is the per-element exponential decay applied column-wise
- `ab[t] = (-kk[t])^T @ (kk[t] * a[t])^T` is the attention gate correction
- `v[t] @ k[t]^T` is the new key-value outer product
The output is read from the state via:
o[t] = S[t] @ r[t]
where r[t] is the receptance (query) vector.
Time-Shift Mixing
Before computing projections, RWKV-7 mixes each token with its predecessor via learned interpolation coefficients. Given input x[t]:
delta[t] = x[t-1] - x[t] (delta[0] = -x[0])
xr = x + delta * mu_r
xw = x + delta * mu_w
xk = x + delta * mu_k
xv = x + delta * mu_v
xa = x + delta * mu_a
xg = x + delta * mu_g
Each mu_* is a learnable (1, 1, D) parameter initialized with a position-and-layer-dependent schedule. This provides a simple form of local context mixing before the main recurrence.
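A minimal numpy sketch of the time-shift mix for one of the six projections (reference semantics; `time_shift_mix` is a hypothetical name):

```python
import numpy as np

def time_shift_mix(x, mu):
    """Blend each token with its predecessor. x: (L, D); mu: (D,) learned
    interpolation coefficient per channel (one of mu_r, mu_w, ...)."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # token shifted by 1
    delta = prev - x                                       # delta[0] = -x[0]
    return x + delta * mu

x = np.array([[2.0, 2.0], [4.0, 4.0]])
out_id = time_shift_mix(x, np.zeros(2))   # mu = 0: identity
out_sh = time_shift_mix(x, np.ones(2))    # mu = 1: pure predecessor
```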
Decay LoRA
The decay vector w[t] controls how quickly the recurrent state forgets. It is computed via a low-rank adaptation:
w[t] = -softplus(-(w0 + tanh(W1 @ xw[t]) @ W2)) - 0.5
where:
- `w0` is a `(D,)` bias initialized with a position-dependent schedule
- `W1` is `(D, decay_low_rank_dim)` and `W2` is `(decay_low_rank_dim, D)`
- The result is in log-space (negative values); the `-0.5` ensures a minimum decay
For the chunked training kernel (chunk_rwkv7), w is passed in log-space. For the single-token step, it is converted to the multiplicative factor:
w_step = exp(-0.606531 * sigmoid(w0 + tanh(W1 @ xw) @ W2))
Attention Gate LoRA
The attention gate a[t] modulates the key’s influence on the state update. It controls the ab correction term:
a[t] = sigmoid(a0 + A2(A1(xa[t])))
where a0 is a (D,) bias and A1, A2 form a low-rank bottleneck. The key is then modified as:
k'[t] = k[t] * (1 + (a[t] - 1) * k_a)
where k_a is a learnable per-dimension scale (initialized to 1.0).
Value-First Sharing
RWKV-7 shares value information across layers via a “value-first” mechanism:
- Layer 0: Stores its value projection as `v_first`.
- Layers 1+: Lerp their value toward `v_first`:

v[t] = v[t] + (v_first[t] - v[t]) * sigmoid(v0 + V2(V1(xv[t])))
This provides a residual-like connection specifically for value information, allowing deeper layers to reference the original value representation from layer 0.
L2 Key Normalization
Keys are L2-normalized per head before entering the suffix automaton correction:
kk[t] = L2_normalize(k[t] * k_k) per head
where k_k is a learnable per-dimension scale (initialized to 0.85). The normalized keys kk are used in the ab correction term but not in the main key-value outer product.
Bonus Term
A direct key-query interaction term is added to the output:
bonus[t] = sum(r[t] * k[t] * r_k, dim=-1, keepdim=True) * v[t]
where r_k is a (H, head_size) parameter initialized with small random values. This provides a shortcut path that bypasses the recurrent state entirely.
GroupNorm Output
The recurrent output is passed through GroupNorm (one group per attention head) before the bonus term is added:
o = GroupNorm(S[t] @ r[t]) + bonus[t]
Output Gating
The final output is gated via another LoRA:
g[t] = G2(sigmoid(G1(xg[t])))
output = o * g
output = W_o @ output
The output projection W_o is initialized to zero so that at initialization, RWKV-7 blocks contribute nothing to the residual stream.
Training: Chunk-Parallel Computation
During training, the chunk_rwkv7 kernel from flash-linear-attention processes the sequence in parallel chunks while maintaining exact recurrent semantics. The function signature:
o, final_state = chunk_rwkv7(
r, w, k, v,
-kk, kk * a, # ab decomposed as two rank-1 terms
initial_state=state, # (B, H, K, K) or None
output_final_state=True,
)
Inputs are shaped (B, T, H, head_size) and w is in log-space.
Inference: Token-by-Token Recurrence
During autoregressive inference, the step method implements the exact recurrence manually:
vk = v @ k^T # (B, H, N, N)
ab = (-kk)^T @ (kk * a)^T # (B, H, N, N)
S = S * diag(w) + S @ ab + vk # state update
o = S @ r # read output
The recurrent state S is stored in inference_params.key_value_memory_dict[layer_idx].att_kv as a float32 tensor of shape (B, H, head_size, head_size).
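The step recurrence above can be written out for a single head (a numpy reference sketch, not the fla kernel; `w` here is the already-exponentiated multiplicative decay):

```python
import numpy as np

def rwkv7_step(S, r, w, k, v, kk, a):
    """Single-token RWKV-7 state update for one head.
    S: (N, N) state; r, w, k, v, kk, a: (N,) per-head vectors."""
    ab = np.outer(-kk, kk * a)                  # rank-1 in-context correction
    S = S * w[None, :] + S @ ab + np.outer(v, k)  # column-wise decay + update
    return S, S @ r                             # updated state, read-out

N = 4
S = np.zeros((N, N))
r = w = k = v = np.full(N, 0.5)
kk = a = np.zeros(N)                            # no correction for the demo
S, o = rwkv7_step(S, r, w, k, v, kk, a)         # S becomes outer(v, k)
```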
LoRA Dimension Auto-Calculation
If not explicitly specified in RWKVConfig, LoRA dimensions are computed from d_model following the fla convention:
factor = head_size / 64
sqrt_d = sqrt(d_model)
decay_low_rank_dim = max(32, round(2.5 * sqrt_d * factor / 32) * 32)
gate_low_rank_dim = max(32, round(5.0 * sqrt_d / 32) * 32)
a_low_rank_dim = max(32, round(2.5 * sqrt_d * factor / 32) * 32)
v_low_rank_dim = max(32, round(1.7 * sqrt_d * factor / 32) * 32)
All dimensions are rounded to a multiple of 32 (with a floor of 32) for hardware efficiency.
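The auto-calculation can be sketched as a small helper (hypothetical name `lora_dims`; `r32` rounds to the nearest multiple of 32 with a floor of 32, per the formulas above):

```python
import math

def lora_dims(d_model, head_size=64):
    """LoRA ranks from the stated fla-style formulas."""
    factor = head_size / 64
    sqrt_d = math.sqrt(d_model)
    r32 = lambda v: max(32, round(v / 32) * 32)
    return {
        "decay_low_rank_dim": r32(2.5 * sqrt_d * factor),
        "gate_low_rank_dim":  r32(5.0 * sqrt_d),
        "a_low_rank_dim":     r32(2.5 * sqrt_d * factor),
        "v_low_rank_dim":     r32(1.7 * sqrt_d * factor),
    }

dims = lora_dims(192)   # e.g. gate rank 64, the others 32
```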
Weight Initialization
Initialization follows RWKV-7 conventions with layer-dependent schedules:
- Time-shift coefficients (`mu_*`): Initialized as `1 - d^(c * ratio)` where `d` is a per-dimension ramp `[0, 1)`, `c` is a coefficient specific to each mix type, and `ratio` varies from 1 (first layer) to 0 (last layer).
- Decay bias (`w0`): Initialized as `-7 + 5 * (d / D)^(0.85 + ratio^0.5)`, giving a range from fast decay (early dimensions) to slow decay (late dimensions).
- Key normalization (`k_k`): 0.85 uniformly.
- Key attention scale (`k_a`): 1.0 uniformly.
- Bonus (`r_k`): Small random normal (std=0.1).
- Output projection (`W_o`): Zero initialized.
ROSA Suffix Automaton
ROSA (RWKV Online Suffix Automaton) provides lossless infinite-range exact sequence matching as a complement to RWKV-7’s learned recurrent processing. While RWKV-7 maintains a compressed state that approximates the input history, ROSA can retrieve exact substring matches from arbitrarily far in the past.
Reference: “ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching” (arXiv:2602.02499), ported from RWKV-v8 (BlinkDL/RWKV-LM).
Algorithm Overview
ROSA constructs an online suffix automaton over discretized hidden representations. For each position in the query sequence, it finds the longest suffix of the query that appears as a substring in the key sequence seen so far, then returns the corresponding value from the position immediately after the match.
The core operation is rosa_qkv_ref(qqq, kkk, vvv):
- Maintain an online suffix automaton built incrementally from the key sequence.
- For each new position `i`:
  - Query phase: Walk the automaton to find the longest suffix of `qqq[:i+1]` that matches a substring in `kkk[:i]`.
  - Key phase: Extend the automaton with `kkk[i]`.
- If a match of sufficient length is found, return `vvv[match_end + 1]`. Otherwise return a sentinel value.
The suffix automaton provides O(n) construction and O(n) total query time, making the entire operation linear in sequence length.
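The matching semantics can be illustrated with a brute-force reference (O(n^2) via naive scanning; the real module's automaton is O(n) and may track a different occurrence than `str.find`'s leftmost one):

```python
def longest_suffix_match(query, key):
    """Return (match_len, match_end): the length of the longest suffix of
    `query` that occurs as a substring of `key`, and the index in `key`
    just past that occurrence ((0, -1) if nothing matches)."""
    for length in range(min(len(query), len(key)), 0, -1):
        pos = key.find(query[-length:])
        if pos != -1:
            return length, pos + length
    return 0, -1

# The longest suffix of "abcab" present in "xxabyy" is "ab", ending at
# index 4; a retrieval would then read the value just after the match.
match = longest_suffix_match("abcab", "xxabyy")   # (2, 4)
```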
1-Bit Binarization
To convert continuous hidden states into discrete tokens suitable for suffix automaton matching, ROSA uses 1-bit binarization:
x_binary = (x > 0) ? 1 : 0
This is applied per channel across the hidden dimension. Given a hidden state tensor of shape (B, T, C):
- Binarize: `q_bin[b, t, c] = uint8(q[b, t, c] > 0)` (same for k, v).
- Transpose: Reshape from `(B, T, C)` to `(B*C, T)` – each channel becomes an independent sequence.
- Match: Run `rosa_qkv_batch_ref` over all `B*C` channel sequences in parallel.
- Reconstruct: Reshape indices back to `(B, T, C)`.
- Scale: Output `= (2 * idx_float - 1) * emb`, where `emb` is a learnable `(1, 1, C)` scale parameter.
The matched bit value 1 maps to +emb and 0 maps to -emb, giving the output the same sign structure as the matched hidden representation scaled by a learnable magnitude.
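A toy numpy sketch of the binarize and scale steps on a small tensor (the suffix-matching step between them is omitted):

```python
import numpy as np

x = np.array([[0.3, -1.2],
              [2.0,  0.0]])              # hidden states, shape (T, C)
bits = (x > 0).astype(np.uint8)          # 1-bit code per channel (0.0 -> 0)
emb = np.array([0.5, 2.0])               # learnable per-channel scale
out = (2.0 * bits.astype(np.float32) - 1.0) * emb   # 1 -> +emb, 0 -> -emb
```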
The _RosaQKV1BitOp Autograd Function
ROSA’s suffix automaton is non-differentiable (it involves discrete automaton state transitions). The autograd function handles this:
- Forward: Binarize inputs, run suffix matching on CPU, scale by `emb`.
- Backward: Gradients for `q`, `k`, `v` are `None` (zero). Gradients for `emb` are passed through directly.
This means ROSA layers learn only through:
- The learnable `emb` scale parameter.
- The Q/K/V linear projections preceding ROSA (which receive gradients from other paths in the network through residual connections).
- The surrounding block's residual connection.
The projections learn to produce hidden representations whose binarization yields useful matching patterns, even though the binarization itself has no gradient.
CPU Execution
The suffix automaton runs on CPU. Tensors are moved to CPU before matching and results are moved back to the accelerator. This is a deliberate design choice:
- Suffix automata use pointer-chasing data structures (dictionaries, linked suffix links) that are not amenable to GPU parallelism.
- The per-channel parallelism (`B*C` independent sequences) provides sufficient throughput for moderate batch sizes.
- During inference, ROSA blocks primarily contribute during prefill; the `step` method falls back to zero output since the automaton requires the full sequence context.
When to Use ROSA vs RWKV-7
| Use Case | ROSA (r/R) | RWKV-7 (w/W) |
|---|---|---|
| Exact pattern retrieval | Yes – lossless via suffix matching | No – compressed into finite state |
| Learned sequence processing | Limited – only emb is trained | Full – all parameters are trained |
| Inference (autoregressive) | Degrades (needs full context) | Efficient (O(1) state update) |
| Long-range dependencies | Infinite range, exact | Finite effective range, approximate |
| Training speed | Slower (CPU automaton) | Fast (Triton chunk kernel) |
In practice, ROSA blocks are best used sparingly alongside RWKV-7 blocks. A typical layout might be "w4r1" – four RWKV-7 blocks for general sequence processing, one ROSA block for exact retrieval. The ROSA block acts as a “lookup table” that can surface exact matches from the input, while RWKV-7 handles the bulk of learned representation building.
RWKV_ROSA Module
The RWKV_ROSA module wraps the ROSA matching in a standard time-mixing interface:
- Time-shift mixing: Mix current token with previous token via learned interpolation (same as RWKV-7 but with only q/k/v coefficients).
- Q/K/V projection: Linear projections from the mixed hidden states.
- ROSA matching: `RosaQKV1Bit` on the projected q, k, v.
- Output projection: Linear projection back to `d_model`.
The module is paired with either RWKV_CMix (relu^2 FFN, block code r) or SwiGLU (block code R) as its feedforward component.
Block Types Reference
Aegir’s architecture is built from modular blocks, each consisting of a mixer (the sequence processing module) and an optional MLP (the feedforward network). Blocks are identified by single-character codes and composed into layout strings that define the architecture at each stage.
Block Code Table
| Code | Mixer | MLP | Description |
|---|---|---|---|
| `w` | RWKV-7 TimeMix | CMix (relu^2) | Full RWKV-7 recurrence with RWKV-style channel mixing |
| `W` | RWKV-7 TimeMix | SwiGLU | Full RWKV-7 recurrence with SwiGLU feedforward |
| `r` | ROSA (suffix automaton) | CMix (relu^2) | Exact pattern matching with RWKV-style channel mixing |
| `R` | ROSA (suffix automaton) | SwiGLU | Exact pattern matching with SwiGLU feedforward |
| `t` | Multi-Head Attention | None | Causal MHA with no feedforward |
| `T` | Multi-Head Attention | SwiGLU | Standard transformer block |
| `m` | Mamba-2 (SSM) | None | State-space model with no feedforward |
| `M` | Mamba-2 (SSM) | SwiGLU | State-space model with SwiGLU feedforward |
Convention
- Lowercase codes use RWKV-native FFN (CMix with relu^2) or no FFN at all.
- Uppercase codes use SwiGLU as the feedforward network.
- For `w`/`W` and `r`/`R`, lowercase uses CMix; uppercase uses SwiGLU.
- For `t`/`T` and `m`/`M`, lowercase has no MLP; uppercase adds SwiGLU.
The Block Wrapper
Every block follows the pre-norm residual pattern:
+---> norm1 --> mixer ---+
| |
hidden_states ----->+ +-----> hidden_states
(+ residual) | | (+ residual)
+---> norm2 --> mlp ----+ (if MLP exists)
Concretely, the Block class implements:
# Mixer sub-block
hidden_states, residual = norm1(hidden_states, residual, prenorm=True)
hidden_states = mixer(hidden_states)
# MLP sub-block (if present)
hidden_states, residual = norm2(hidden_states, residual, prenorm=True)
hidden_states = mlp(hidden_states)
The pre-norm pattern accumulates the residual stream separately from the normalized hidden states. The normalization module (RMSNorm from flash-attn, or a LayerNorm fallback) handles residual accumulation internally when prenorm=True.
Residual Height Counting
Each block contributes to the “height” of its parent Isotropic module, which is used for output projection scaling during initialization:
- Lowercase blocks (single residual addition): height += 1
- Uppercase blocks (mixer + MLP, two residual additions): height += 2
MLP Variants
CMix (RWKV Channel Mixing)
Used by lowercase RWKV codes (w, r). A simple feedforward with relu^2 activation:
# Time-shift mixing
xx = time_shift(x) - x
k = x + xx * x_k
# Feedforward
k = relu(W_key @ k) ** 2 # D -> 4D, relu squared
output = W_value @ k # 4D -> D
The expansion factor defaults to rwkv_cfg.dim_ffn_mult (default 4.0). CMix includes its own time-shift mixing, independent of the mixer’s time-shift.
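A runnable numpy sketch of CMix's reference semantics (weight shapes are illustrative):

```python
import numpy as np

def cmix(x, x_k, W_key, W_value):
    """RWKV CMix channel mixing. x: (L, D); x_k: (D,);
    W_key: (4D, D); W_value: (D, 4D)."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # time_shift(x)
    k = x + (prev - x) * x_k                               # time-shift mixing
    h = np.maximum(k @ W_key.T, 0.0) ** 2                  # relu^2, D -> 4D
    return h @ W_value.T                                   # 4D -> D

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
out = cmix(x, np.full(8, 0.5),
           rng.normal(size=(32, 8)), rng.normal(size=(8, 32)))
```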
SwiGLU
Used by uppercase codes (W, R, T, M). The standard SwiGLU feedforward (Shazeer 2020):
y = W_fc1 @ x # D -> 2 * D_intermediate
y, gate = split(y) # Each D_intermediate
y = silu(gate) * y
output = W_fc2 @ y # D_intermediate -> D
The intermediate dimension defaults to 8/3 * d_model, rounded up to the nearest multiple of 128.
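A minimal numpy sketch, including the stated intermediate-dimension default (helper names are hypothetical):

```python
import math
import numpy as np

def swiglu_intermediate(d_model):
    """Default intermediate dim: 8/3 * d_model, rounded up to 128."""
    return math.ceil(8 * d_model / 3 / 128) * 128

def swiglu(x, W_fc1, W_fc2):
    """SwiGLU FFN. W_fc1: (2*Di, D); W_fc2: (D, Di)."""
    y, gate = np.split(x @ W_fc1.T, 2, axis=-1)
    y = gate / (1.0 + np.exp(-gate)) * y       # silu(gate) * y
    return y @ W_fc2.T

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 4))
out = swiglu(x, rng.normal(size=(12, 4)), rng.normal(size=(4, 6)))
```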
Layout String Parsing
Architecture layout strings encode a sequence of block types and their counts. The string is parsed by the Isotropic module using a regex:
re.findall(r"([mMtTrRwW])(\d+)", arch_layout)
Examples:
| Layout String | Parsed Blocks |
|---|---|
"w4" | 4 RWKV-7+CMix blocks |
"w4T1r2" | 4 RWKV-7+CMix, 1 MHA+SwiGLU, 2 ROSA+CMix |
"W8" | 8 RWKV-7+SwiGLU blocks |
"m2w4m2" | 2 Mamba-2, 4 RWKV-7+CMix, 2 Mamba-2 |
Within a layout string, blocks are instantiated in order with sequential layer_idx values. The total layer count across all block types in the string is used for RWKV-7’s position-dependent weight initialization.
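The parsing rule can be demonstrated directly with the stated regex:

```python
import re

def parse_layout(layout):
    """Parse a layout string into (block_code, count) pairs using the
    same regex as the Isotropic module."""
    return [(c, int(n)) for c, n in re.findall(r"([mMtTrRwW])(\d+)", layout)]

blocks = parse_layout("w4T1r2")              # [("w", 4), ("T", 1), ("r", 2)]
total = sum(n for _, n in parse_layout("m2w4m2"))  # total layer count: 8
```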
The create_block Function
create_block() is the factory function that dispatches on the block code character:
block = create_block(
arch="w", # block code
d_model=192, # hidden dimension
d_intermediate=512, # SwiGLU intermediate dim (for uppercase codes)
ssm_cfg={...}, # Mamba-2 config (for m/M)
attn_cfg={...}, # MHA config (for t/T)
rwkv_cfg=RWKVConfig(...), # RWKV config (for w/W/r/R)
layer_idx=0, # layer index for cache keying
num_hidden_layers=12, # total layers for init scheduling
)
The function:
- Selects the mixer class based on the code character.
- Selects the MLP class: CMix for `w`/`r`, SwiGLU for uppercase, `nn.Identity` for `t`/`m`.
- Selects the normalization class: flash-attn’s RMSNorm if available, otherwise a LayerNorm fallback with prenorm support.
- Constructs and returns a `Block` instance with the selected components.
Value-First Sharing Across Blocks
When an Isotropic module contains RWKV-7 blocks (w/W), it maintains a shared v_first = [None] container. This mutable list is passed as a mixer_kwarg to every RWKV-7 block:
- The first RWKV-7 block (layer_idx 0 within the Isotropic) stores its value projection in `v_first[0]`.
- Subsequent RWKV-7 blocks lerp their value toward `v_first[0]` via a learnable gate.
This sharing is local to each Isotropic instance – encoder, decoder, and main network at each stage each have their own v_first container.
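The mutable-container pattern can be sketched as follows. This is an illustrative stand-in, not the real mixer: the class name, fixed gate value, and plain matrix value projection are all assumptions.

```python
import numpy as np

class RWKV7ValueMixSketch:
    """Illustrative value path of an RWKV-7 block with v_first sharing."""

    def __init__(self, d: int, rng: np.random.Generator):
        self.w_v = rng.standard_normal((d, d)) * 0.02
        self.gate = 0.5  # learnable lerp gate in the real block; fixed here

    def forward(self, x: np.ndarray, v_first: list) -> np.ndarray:
        v = x @ self.w_v
        if v_first[0] is None:
            v_first[0] = v                        # first block stores its value projection
        else:
            v = v + self.gate * (v_first[0] - v)  # later blocks lerp toward v_first[0]
        return v

rng = np.random.default_rng(0)
blocks = [RWKV7ValueMixSketch(8, rng) for _ in range(3)]
v_first = [None]  # one shared mutable container per Isotropic instance
x = rng.standard_normal((4, 8))
outs = [b.forward(x, v_first) for b in blocks]
```

Because the list object itself is passed to every block, the first block's write is visible to all later blocks without any global state.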
Pretraining: Ontology-Grounded Synthetic Data
How do you train a structured information model at LLM scale when labeled relational data is scarce and expensive?
Conventional approaches to semantic column annotation rely on manually labeled benchmark datasets – SOTAB, GitTables, WikiTables – that are costly to create, domain-limited, and rarely capture the cross-table relationships needed for data element discovery. Self-supervised pretraining on raw tables (as in DODUO and TURL) learns useful representations, but the “ground truth” for column semantics remains noisy or absent.
Aegir takes a fundamentally different approach: we generate the training data from first principles, so the ground truth is always known by construction.
The Core Insight
We invert the usual pipeline. Instead of finding tables and labeling them, we:
- Start from the highest-quality curated text available
- Extract formal ontological structure using LLMs
- Project that structure into relational database schemas
- Populate schemas with realistic synthetic data
- Train the model to recover the ontological entities from the raw table data
Because we control every step of the generation process, the mapping from table columns back to ontological entities is always available as ground truth. The diversity of the input text drives the diversity of the synthetic data; the formal ontological backbone guarantees structural correctness.
Pipeline Overview
What Is Novel
No prior work combines all five stages into a single pipeline. Each stage has precedent; the composition does not.
| Stage | Prior Art | What Exists | What Is New |
|---|---|---|---|
| Text → Ontology | OntoGPT, REBEL, DeepOnto | LLM-based ontology extraction from text | Using curated educational text as seed for training data generation |
| BFO Grounding | Common Core Ontologies, OBO Foundry | BFO as upper ontology for domain modeling | BFO as the generative backbone for synthetic ML training data |
| SysMLv2 Intermediate | openCAESAR, Cameo | SysMLv2 for systems engineering | SysMLv2 MBSE as intermediate representation in an ML data pipeline |
| Synthetic Tables | MOSTLY.ai, SDV, NeurIPS 2024 TRL | Synthetic table generation for augmentation | Tables generated from ontological structure with known entity provenance |
| Entity Recovery | DODUO, TURL (masked column) | Masked language model pretraining on tables | Ontological entity recovery as the training objective, not next-token prediction |
The closest related work is “Enhancing Table Representations with LLM-powered Synthetic Data Generation” (NeurIPS 2024 TRL Workshop), which generates synthetic tables to improve column embedding similarity. That work generates tables for representation learning; Aegir generates tables for ontological entity recovery – a fundamentally different objective that produces richer training signal because the ground truth includes hierarchical entity structure, cross-table relationships, and BFO-grounded type constraints.
Why This Scales
The bottleneck in conventional table annotation is human labeling. The bottleneck here is LLM inference for ontology extraction – which is embarrassingly parallel and decreasing in cost.
The multiplicative structure of the pipeline ensures near-unlimited training data:
| Stage | Multiplier | Source |
|---|---|---|
| Curated text | ~500M passages | FineWeb-Edu (1.3T tokens) |
| Ontology fragments | 1–5 per passage | Domain-dependent entity density |
| Database schemas | 1–10 per fragment | Varying normalization strategies |
| Table instances | 100–10,000 rows | Procedural generation with distribution control |
| Total training examples | effectively unbounded | Combinatorial product of all stages |
A single educational passage about hospital billing can produce ontology fragments for patient demographics, encounter management, diagnosis coding, insurance claims, and provider credentialing – each of which generates distinct database schemas, each populated with different synthetic data distributions. The diversity of the training data is bounded only by the diversity of human knowledge captured in the source text.
How This Connects to Aegir
The pretraining objective maps directly to Aegir’s three target tasks:
- Column Type Annotation (CTA): The per-column entity type predictions from pretraining transfer directly to CTA on SOTAB, GitTables, and WikiTables benchmarks.
- Column Property Annotation (CPA): The cross-column relationship predictions learned during pretraining capture the same inter-column semantics needed for CPA.
- Data Element Discovery: The core pretraining objective – grouping related columns into ontological entities across tables – is data element discovery. The model learns this from synthetic data where the answer is known, then applies it to real enterprise data warehouses.
Furthermore, Aegir’s agent swarm architecture enables cross-table reasoning during both pretraining and inference. Each agent processes a table, and the fused recurrent states capture inter-table relationships that no single-table model can learn.
The following sections detail each stage of the pipeline.
Stage 1: Ontology Extraction
The first stage transforms curated educational text into formal ontological structure. A large language model reads natural language passages and produces BFO-grounded ontology fragments – typed entity hierarchies with properties, relationships, and axioms that can be mechanically projected into database schemas.
Input: FineWeb PDFs Edu
The source corpus is FineWeb-Edu, a curated subset of Common Crawl filtered for educational content using LLaMA-3-70B-Instruct quality scoring. Key properties:
- 1.3 trillion tokens of curated, high-quality educational text
- Spans every domain: medicine, law, finance, engineering, social sciences, natural sciences
- Already deduplicated and quality-filtered – no need for additional curation
- PDF-extracted passages preserve document structure (headings, tables, lists)
Each passage is a self-contained description of some real-world domain – exactly the kind of text that contains implicit ontological structure waiting to be made explicit.
Extraction Process
The extraction uses structured prompting with a three-phase approach:
1. Domain identification: Classify the passage into one or more information domains (healthcare, finance, logistics, etc.) to select domain-appropriate extraction templates.
2. Entity extraction: Identify entity types, their properties, and inter-entity relationships. The prompt constrains outputs to BFO-compatible categories.
3. BFO alignment: Map each extracted entity to the appropriate BFO upper-level category, ensuring the fragment inherits BFO's formal axioms.
Validation Gate
Not every LLM output is usable. A validation gate checks three properties:
- Syntactic: Does the output parse as valid OWL/RDF?
- BFO alignment: Is every class properly subsumed by a BFO category?
- Coherence: Are there contradictory axioms or dangling references?
Fragments that fail validation are discarded or re-prompted. In practice, structured output modes (JSON schema enforcement) in GLM-4.7/GLM-5 achieve >90% first-pass validation rates.
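A minimal sketch of the gate's BFO-alignment and coherence checks over a JSON-style fragment. The field names (`classes`, `relations`, `bfo`, `domain`, `range`) and the helper itself are assumptions; a real gate would also run an OWL/RDF parser for the syntactic check.

```python
# Hypothetical set of accepted upper-level categories (IRIs from the table below).
BFO_ROOTS = {"BFO:0000030", "BFO:0000015", "BFO:0000019",
             "BFO:0000023", "BFO:0000031", "BFO:0000020"}

def validate_fragment(frag: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the fragment passes."""
    errors = []
    classes = {c["id"] for c in frag.get("classes", [])}
    for c in frag.get("classes", []):
        # BFO alignment: every class must be subsumed by a known upper category
        if c.get("bfo") not in BFO_ROOTS:
            errors.append(f"{c['id']}: missing/unknown BFO alignment")
    for r in frag.get("relations", []):
        # Coherence: no dangling references to undeclared classes
        if r["domain"] not in classes or r["range"] not in classes:
            errors.append(f"{r['id']}: dangling domain/range reference")
    return errors
```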
Why BFO
Basic Formal Ontology (ISO/IEC 21838-2:2021) is the most widely adopted upper ontology in applied information science:
- 700+ ontology projects built on BFO across government, defense, healthcare, and industry
- Common Core Ontologies (CCO) used by the U.S. Department of Defense and Intelligence Community
- OBO Foundry biomedical ontologies (Gene Ontology, ChEBI, etc.) all align to BFO
- Formal first-order logic axiomatization ensures machine-verifiable consistency
BFO provides the upper-level categories that give our extracted ontologies a shared formal backbone. Without this grounding, extracted ontologies would be ad-hoc entity lists with no guaranteed interoperability or logical structure.
BFO Categories for Information Systems
The BFO categories most relevant to relational data modeling:
| BFO Category | IRI | Maps To | Example |
|---|---|---|---|
| Generically Dependent Continuant | BFO:0000031 | InformationEntity | A patient record, a diagnosis code |
| Object | BFO:0000030 | Concrete entity | A patient, a medical device |
| Quality | BFO:0000019 | Data attribute | Acuity level, sensitivity classification |
| Role | BFO:0000023 | Functional role | Data subject, provider, auditor |
| Process | BFO:0000015 | Temporal event | An encounter, a transaction, a review |
| Specifically Dependent Continuant | BFO:0000020 | Inherent property | A patient’s blood type, a device’s serial number |
These categories constrain what kinds of entities can participate in what kinds of relationships – a Patient (Object) can bear a DataSubjectRole (Role), an Encounter (Process) has participant a Patient (Object), and so on. These constraints propagate through the pipeline: they determine which foreign key relationships are valid in the generated schemas.
Formal Definition
An ontology fragment is a tuple:
\[ O = (C, R, A, \iota) \]
where:
- \(C = \{c_1, \ldots, c_n\}\) is a set of classes (entity types), each with a set of properties \(P(c_i) = \{p_1, \ldots, p_k\}\)
- \(R = \{r_1, \ldots, r_m\}\) is a set of relations between classes, each \(r_j: c_a \to c_b\) with cardinality constraints
- \(A\) is a set of axioms – subsumption (\(c_i \sqsubseteq c_j\)), disjointness (\(c_i \sqcap c_j = \bot\)), and property constraints (domain, range, cardinality)
- \(\iota: C \to \text{BFO}\) is the BFO alignment mapping that assigns each class to a BFO upper-level category
The alignment mapping \(\iota\) must satisfy BFO’s axioms: if \(\iota(c_i) = \text{BFO:Process}\), then \(c_i\) inherits Process axioms (has temporal extent, can have participants, etc.). This is not merely a label – it constrains the valid relationships and properties that \(c_i\) can participate in.
Output
Each successfully validated ontology fragment becomes input to Stage 2: Schema Projection. A single text passage typically yields 1–5 fragments, depending on the complexity and domain diversity of the passage content.
The ontology fragments are serialized as OWL/RDF for archival and as structured JSON for downstream processing. Both representations preserve the full BFO alignment mapping, enabling validation at every subsequent stage.
Stage 2: Schema Projection
Schema projection transforms BFO-grounded ontology fragments into relational database schemas through a two-step process: first into SysMLv2 systems engineering models, then into programmatic data objects and SQL schemas. The intermediate SysMLv2 representation captures structural constraints, lifecycle semantics, and system-level relationships that flat entity-relationship modeling would lose.
Why SysMLv2 as Intermediate Representation
Using SysMLv2 (OMG, approved July 2025) as an intermediate representation between ontology and database schema is unconventional – and deliberate. SysMLv2 provides formal constructs that bridge the gap between abstract ontological entities and concrete data structures:
| SysMLv2 Construct | Ontological Concept | Database Primitive |
|---|---|---|
| Block Definition | Entity type | Table |
| Part Property | Composition | One-to-many FK |
| Reference Property | Association | Many-to-many junction table |
| Port | Interface/boundary | Shared column (FK target) |
| Attribute | Data property | Column |
| Constraint | Axiom | CHECK constraint |
| State Machine | Lifecycle | Status enum + temporal columns |
| Requirement | Validation rule | Application-level validation |
The openCAESAR project provides an OWL2-DL ontology for SysMLv2, making the ontology-to-SysMLv2 projection formally well-defined. This means we’re not hand-waving the transformation – there’s a rigorous mapping from BFO-grounded classes and relations to SysMLv2 blocks and connections.
The critical advantage: SysMLv2 models encode systems with internal structure, constraints, and lifecycle semantics. The generated databases aren’t just flat tables with columns – they’re projections of coherent systems where referential integrity, state transitions, and constraint propagation all have formal justification in the source model.
Projection Pipeline
Step 1: Ontology → SysMLv2
Each BFO class maps to a SysMLv2 construct based on its upper-level category:
- BFO:Object → `part def` (a concrete block with owned parts)
- BFO:Process → `action def` with a state machine (lifecycle semantics)
- BFO:Role → `port def` (an interface that objects can fulfill)
- BFO:Quality → `attribute def` (a typed value property)
- BFO:GDC (Generically Dependent Continuant) → `part def` with `subsets informationEntity` (a record or document)
Relations map to SysMLv2 connections:
- Composition (whole-part) → `part` usage within a block
- Association → `ref` usage with multiplicity
- Participation (Object in Process) → `perform` action usage

Axioms map to `constraint def` blocks with OCL-like expressions.
Step 2: SysMLv2 → Programmatic Objects
The SysMLv2 model is projected into Python dataclasses via template-based code generation:
```python
from dataclasses import dataclass, field
from datetime import date, datetime

@dataclass
class Patient:
    patient_id: str        # from attribute def
    date_of_birth: date    # from attribute def
    gender: str            # from attribute def
    encounters: list["Encounter"] = field(default_factory=list)  # from part usage (1..*)

@dataclass
class Encounter:
    encounter_id: str          # generated primary key
    patient_id: str            # from owning block (FK)
    encounter_date: datetime   # from attribute def
    status: str                # from state machine states
    provider_id: str           # from ref usage (FK)
    diagnoses: list["Diagnosis"] = field(default_factory=list)   # from part usage (1..*)

@dataclass
class Diagnosis:
    diagnosis_id: str   # generated primary key
    encounter_id: str   # from owning block (FK)
    code: str           # from attribute def
    description: str    # from attribute def
    coded_by: str       # from ref usage (FK)
```
Step 3: Data Objects → Relational Schema
The dataclasses are mapped to SQLAlchemy models and CREATE TABLE statements:
```sql
CREATE TABLE patient (
    patient_id VARCHAR(36) PRIMARY KEY,
    date_of_birth DATE NOT NULL,
    gender VARCHAR(10) NOT NULL
);

CREATE TABLE encounter (
    encounter_id VARCHAR(36) PRIMARY KEY,
    patient_id VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    encounter_date TIMESTAMP NOT NULL,
    status VARCHAR(20) NOT NULL CHECK (status IN ('active', 'closed')),
    provider_id VARCHAR(36) NOT NULL REFERENCES provider(provider_id)
);

CREATE TABLE diagnosis (
    diagnosis_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    code VARCHAR(10) NOT NULL,
    description TEXT,
    coded_by VARCHAR(36) REFERENCES provider(provider_id)
);
```
Ontological Mapping Rules
The projection preserves ontological structure through systematic rules:
| Ontological Structure | Relational Mapping | Provenance Preserved |
|---|---|---|
| Entity type | Table | Table name ↔ BFO class |
| Data property | Column | Column name ↔ property IRI |
| Object property (1:N) | Foreign key | FK ↔ relation IRI |
| Object property (M:N) | Junction table | Junction ↔ relation IRI |
| Subsumption hierarchy | Table-per-type inheritance | Parent FK ↔ rdfs:subClassOf |
| Disjointness axiom | CHECK constraint | Constraint ↔ axiom |
| Cardinality constraint | NOT NULL / UNIQUE | Column constraint ↔ cardinality |
The critical property is that every schema element traces back to a specific ontological element. This traceability is what makes the training objective possible: when the model predicts that two columns belong to the same data element, we can verify that prediction against the source ontology.
Schema Variation
A single ontology fragment can produce multiple valid database schemas through controlled variation:
- Normalization level: 1NF, 2NF, 3NF, or fully denormalized
- Inheritance strategy: Table-per-type, table-per-hierarchy, or single-table with discriminator
- Naming conventions: `snake_case`, `camelCase`, abbreviated, or obfuscated (`col_1`, `field_a`)
- Type mappings: `DATE` vs `VARCHAR` for dates, `INTEGER` vs `VARCHAR` for codes
This variation is essential for training robustness. Real-world databases use all of these conventions, often mixed within a single schema. By generating diverse schemas from the same ontological source, the model learns to recognize semantic equivalence across surface-level variation.
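The naming-convention axis can be sketched with a small helper; the function name and the exact abbreviation rule (first three characters per word) are assumptions for illustration:

```python
def rename(column: str, convention: str, idx: int = 0) -> str:
    """Apply one naming-convention variation to a canonical snake_case column name."""
    parts = column.split("_")
    if convention == "camelCase":
        return parts[0] + "".join(p.title() for p in parts[1:])
    if convention == "abbreviated":
        return "_".join(p[:3] for p in parts)  # e.g. patient_id -> pat_id
    if convention == "obfuscated":
        return f"col_{idx}"                    # e.g. col_0, col_1, ...
    return column                              # snake_case passthrough
```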
Formal Mapping
The schema projection is a function:
\[ \pi: O \to \mathcal{S} \]
where \(O = (C, R, A, \iota)\) is an ontology fragment and \(\mathcal{S} = \{S_1, \ldots, S_k\}\) is a set of valid relational schemas. Each schema \(S_i = (T, K, F, \Gamma)\) consists of:
- \(T = \{t_1, \ldots, t_n\}\) – tables, each with columns \(\text{cols}(t_j)\)
- \(K\) – primary key constraints
- \(F\) – foreign key constraints
- \(\Gamma\) – CHECK constraints
The projection must satisfy:
\[ \forall\, t \in T\ \exists\, c \in C : \text{name}(t) \xleftarrow{\pi} c \]
\[ \forall\, f \in F\ \exists\, r \in R : f \xleftarrow{\pi} r \]
That is, every table traces to a class and every foreign key traces to a relation. This bidirectional traceability is the formal guarantee that makes ontological entity recovery a well-defined training objective.
Stage 3: Synthetic Data Generation
Given a relational schema with known ontological provenance, the third stage populates tables with realistic synthetic data. The goal is not just to fill rows – it is to produce data distributions that exercise the same patterns and confusable types the model will encounter in real enterprise databases.
Population Pipeline
Value Generation
Each column type maps to a specialized generator that produces realistic values. The generator selection is driven by the ontological provenance of the column – a column traced to BFO:Quality with domain healthcare produces different values than one traced to BFO:Quality with domain finance.
Generator Categories
| Column Semantics | Generator | Example Values |
|---|---|---|
| Person name | Faker (locale-aware) | “Maria Santos”, “James O’Brien” |
| Date/timestamp | Range-bounded random | 2019-03-15, 2024-11-02T14:30:00 |
| Identifier (UUID) | UUIDv4 | f47ac10b-58cc-4372-a567-0e02b2c3d479 |
| Identifier (sequential) | Auto-increment with prefix | PAT-00001, ENC-2024-0042 |
| Medical code (ICD-10) | Sampled from code registry | J18.9, I25.10, E11.65 |
| Financial code (IBAN) | Country-specific format | DE89370400440532013000 |
| Categorical | Weighted sampling from enum | active, closed, pending |
| Free text | Template + Faker | “Patient presents with acute chest pain” |
| Numeric measure | Distribution-sampled | 98.6, 120/80, 72 |
| Boolean flag | Bernoulli(p) | true, false |
| Address | Locale-aware composite | “123 Main St, Springfield, IL 62704” |
| Email | Pattern-based | maria.santos@hospital.org |
| Phone | Country-format | +1-555-0123 |
Referential Integrity
Tables are populated in topological order (parents before children) to guarantee that every foreign key value references an existing parent row. The population engine:
- Sorts tables by foreign key dependencies (detecting and breaking cycles if needed)
- Populates root tables (no FK dependencies) first
- For each child table, samples FK values from the parent table’s primary key column
- Respects cardinality constraints: a `NOT NULL` FK always gets a valid reference; an optional FK gets `NULL` with configurable probability
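The dependency-ordered population step can be sketched as a standard topological sort (Kahn's algorithm); the helper name and input shape are assumptions, and cycle breaking is elided (cyclic tables are simply left out here):

```python
from collections import deque

def population_order(fk_deps: dict[str, set[str]]) -> list[str]:
    """Order tables so every parent is populated before its children.
    fk_deps maps each table to the set of parent tables it references."""
    indegree = {t: len(parents) for t, parents in fk_deps.items()}
    children = {t: set() for t in fk_deps}
    for t, parents in fk_deps.items():
        for p in parents:
            children[p].add(t)
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))  # root tables
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in sorted(children[t]):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

deps = {"patient": set(), "provider": set(),
        "encounter": {"patient", "provider"},
        "diagnosis": {"encounter", "provider"}}
```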
Distribution Control
Real databases are not uniformly distributed. The generation config controls:
- Cardinality: How many child rows per parent (e.g., 1–30 encounters per patient, following a power-law distribution)
- Null ratio: What fraction of nullable columns contain NULL (typically 5–30% in real data)
- Value entropy: How many distinct values appear in categorical columns (a `status` column might have 3 values; a `diagnosis_code` column might have 500)
- Skew: Zipfian distributions for columns where a few values dominate (e.g., 80% of encounters are `status='closed'`)
- Temporal patterns: Dates that follow realistic patterns (weekday-heavy, seasonal, monotonically increasing)
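Two of these knobs, skew and null ratio, can be sketched in NumPy; the function names, exponent, and default ratios are illustrative choices, not the pipeline's actual config:

```python
import numpy as np

rng = np.random.default_rng(42)

def zipf_categorical(values: list[str], n: int, s: float = 1.2) -> np.ndarray:
    """Sample n categorical values with Zipfian skew: rank r gets weight r^-s."""
    ranks = np.arange(1, len(values) + 1)
    p = ranks ** -s
    p /= p.sum()
    return rng.choice(values, size=n, p=p)

def with_nulls(col: list, null_ratio: float = 0.15) -> list:
    """Replace a configurable fraction of entries with None."""
    mask = rng.random(len(col)) < null_ratio
    return [None if m else v for v, m in zip(col, mask)]

status = zipf_categorical(["closed", "active", "pending"], 1000)
```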
Diversity from Source Text
The curated text input drives diversity along two independent axes:
Domain Diversity
Different passages produce different ontological domains, which produce structurally distinct databases:
| Source Domain | Example Tables | Distinctive Patterns |
|---|---|---|
| Healthcare | patient, encounter, diagnosis, medication | ICD-10 codes, temporal encounter sequences |
| Finance | account, transaction, instrument, counterparty | IBAN/SWIFT codes, decimal precision, audit trails |
| Supply Chain | shipment, warehouse, item, carrier | GPS coordinates, weight/volume, tracking IDs |
| Education | student, course, enrollment, grade | GPA calculations, semester cycles |
| HR/Payroll | employee, department, payroll, benefit | SSN patterns, salary ranges, org hierarchies |
Structural Diversity
Even within a single domain, different passages emphasize different relationships, producing varied schema structures:
- A passage about emergency triage produces schemas with acuity levels, wait times, and disposition tracking
- A passage about chronic disease management produces schemas with longitudinal encounters, medication histories, and care plans
- A passage about hospital billing produces schemas with insurance claims, procedure codes, and payment reconciliation
All three are “healthcare databases” but have substantially different table structures, column types, and relationship patterns. This structural diversity is what trains the model to generalize beyond surface patterns.
Confusable Type Injection
A key training challenge is confusable pairs – columns with nearly identical value distributions but different semantic types. The generation pipeline deliberately injects these:
| Confusable Pair | Value Pattern | Distinguishing Context |
|---|---|---|
| Advertising ID vs GUID | Both UUIDv4 format | Table context (ad_events vs generic) |
| Bank account vs payment card | Both numeric strings | Length, check digit algorithm |
| Phone number vs fax number | Both +1-XXX-XXX-XXXX | Column name, co-occurring columns |
| ZIP code vs department code | Both 5-digit numbers | Geographic context vs org context |
| Patient ID vs provider ID | Both XXX-NNNNN format | Foreign key relationships |
By generating schemas where these confusable types coexist – often in the same database – the model learns to resolve ambiguity using cross-column and cross-table context rather than single-column pattern matching.
Scale Arithmetic
Working through concrete numbers:
| Stage | Count | Basis |
|---|---|---|
| FineWeb-Edu passages | ~500M | 1.3T tokens / ~2,600 tokens per passage |
| Ontology fragments | ~1–5 per passage | Domain-dependent entity density |
| Schemas per fragment | ~1–10 | Normalization and naming variation |
| Tables per schema | ~5–50 | Domain complexity |
| Rows per table | ~100–10,000 | Configurable per generation |
| Total table instances | >10 billion | Conservative lower bound |
The bottleneck is LLM inference for ontology extraction (Stage 1), not data generation. Once an ontology fragment exists, schema projection and data population are purely procedural and can run on commodity hardware at millions of tables per hour.
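The lower bound can be checked with back-of-the-envelope arithmetic; the specific multipliers below are illustrative picks at or below the midpoints of the ranges in the table:

```python
# Conservative lower-bound arithmetic for the multiplicative pipeline.
passages = 1_300_000_000_000 // 2_600  # 1.3T tokens / ~2,600 tokens per passage
fragments_per_passage = 2              # low end of the 1-5 range
schemas_per_fragment = 2               # low end of the 1-10 range
tables_per_schema = 5                  # bottom of the 5-50 range

table_instances = (passages * fragments_per_passage
                   * schemas_per_fragment * tables_per_schema)
```

Even with these deliberately modest multipliers the product reaches 10 billion table instances, consistent with the stated lower bound.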
Stage 4: Training Objective
The training objective is the key departure from standard pretraining: Aegir does not learn to predict the next token. It learns to recover the ontological entities – data elements – that were used to generate the relational data it observes. This is possible because the generation pipeline (Stages 1–3) preserves a complete mapping from every column back to its source ontological entity.
Task Formulation
What the Model Sees
The model receives byte-serialized relational tables – one or more tables from the same generated schema, serialized as a byte stream. The serialization format mirrors how real data would be encountered:
- CSV-style serialization with delimiters, quoting, and escape characters
- Column headers may be descriptive (`patient_id`), abbreviated (`pat_id`), or opaque (`col_0`)
- Multiple tables are concatenated with table-boundary markers
- No schema metadata (no types, no foreign key declarations, no table names beyond what appears in headers)
The model must infer semantic structure purely from the byte patterns it observes.
What the Model Predicts
Three prediction heads operate on the column-level embeddings produced by Aegir’s hierarchical encoder:
1. Column Type Annotation (CTA): For each column, predict its BFO-grounded semantic type from a taxonomy. This maps directly to the CTA task on benchmarks like SOTAB and GitTables.
2. Data Element Discovery (DE): Predict which columns – potentially across different tables – belong to the same ontological entity. This is formulated as a clustering task: columns originating from the same BFO class should receive similar embeddings.
3. Hierarchical Consistency: Predict the BFO hierarchy level for each column. If a column is classified as `Diagnosis` (a subclass of `GDC`), it should also be recognized as a `GenericallyDependentContinuant`. This head enforces ontological coherence.
What We Compare Against
The ground truth comes directly from the generation pipeline:
- CTA labels: The `Column → BFO property` mapping from Stage 2 gives the exact semantic type of every column
- DE labels: The `Column → BFO class` mapping identifies which columns originated from the same ontological entity
- Hierarchy labels: The BFO subsumption hierarchy defines the expected parent types for every leaf prediction
Loss Function
The total loss is a weighted combination of three terms:
\[ \mathcal{L} = \mathcal{L}_{\text{CTA}} + \lambda_1 \mathcal{L}_{\text{DE}} + \lambda_2 \mathcal{L}_{\text{hier}} \]
Column Type Annotation Loss
Standard cross-entropy over the column type taxonomy:
\[ \mathcal{L}_{\text{CTA}} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{h}_i) \]
where \(\mathbf{h}_i\) is the column embedding for column \(i\), \(y_i\) is the ground truth BFO-grounded type, and \(N\) is the total number of columns across all tables in the batch.
Data Element Discovery Loss
A contrastive loss that pulls together columns from the same ontological entity and pushes apart columns from different entities:
\[ \mathcal{L}_{\text{DE}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]
where \(\mathcal{P}\) is the set of positive pairs (columns from the same BFO class), \(\text{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter.
This loss is what teaches the model to discover data elements: columns that the model embeds close together are predicted to belong to the same real-world entity, regardless of which table they appear in.
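A NumPy sketch of this contrastive objective, written as an explicit loop for clarity rather than efficiency; labels stand in for the BFO class each column originated from:

```python
import numpy as np

def de_loss(h: np.ndarray, labels: list, tau: float = 0.1) -> float:
    """InfoNCE-style loss: columns with the same label are positive pairs;
    the denominator for anchor i runs over every other column k != i."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)  # cosine sim via unit vectors
    sim = h @ h.T / tau
    n = len(labels)
    losses = []
    for i in range(n):
        for j in range(n):
            if i != j and labels[i] == labels[j]:      # positive pair (i, j)
                logits = np.delete(sim[i], i)          # all k != i
                m = logits.max()                       # stable log-sum-exp
                losses.append(np.log(np.exp(logits - m).sum()) + m - sim[i, j])
    return float(np.mean(losses))
```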
Hierarchical Consistency Loss
A penalty for predictions that violate the BFO subsumption hierarchy:
\[ \mathcal{L}_{\text{hier}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \text{ancestors}(y_i)} \max(0, \delta - p(c \mid \mathbf{h}_i)) \]
where \(\text{ancestors}(y_i)\) returns all BFO ancestors of the predicted type, and \(\delta\) is a margin. If a column is predicted as Diagnosis, the model should assign high probability to all ancestor types: GDC, Continuant, Entity.
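A toy sketch of the margin penalty; the ancestor map mirrors the Diagnosis → GDC → Continuant → Entity chain from the text, and the per-type probability dictionaries are an illustrative stand-in for the model's hierarchy head:

```python
# Assumed toy ancestor map: leaf type -> BFO ancestors.
ANCESTORS = {"Diagnosis": ["GDC", "Continuant", "Entity"]}

def hier_loss(probs: list[dict], types: list[str], delta: float = 0.9) -> float:
    """Penalize each ancestor of a column's type whose probability falls
    below the margin delta (hinge penalty, averaged over columns)."""
    total = 0.0
    for p, y in zip(probs, types):            # p maps type name -> probability
        for c in ANCESTORS.get(y, []):
            total += max(0.0, delta - p.get(c, 0.0))
    return total / len(types)
```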
Training Loop
Batch Construction
Each training batch contains serialized tables from multiple generated schemas:
- Sample a schema from the pool (with curriculum: simpler schemas early, complex multi-table schemas later)
- Serialize one or more tables from the schema to bytes, using randomized serialization parameters (delimiter choice, quoting style, header format)
- Attach the ontological provenance labels as training targets
Multi-Table Batches
For cross-table data element discovery, batches include multiple tables from the same schema. The agent swarm architecture processes each table with a separate agent, and the fused recurrent states are used for the DE prediction head. This directly trains the model’s cross-table reasoning capability.
Connection to Downstream Tasks
The pretraining objective maps precisely to the three real-world tasks described in the Introduction:
| Pretraining Task | Downstream Task | Transfer Mechanism |
|---|---|---|
| Column type prediction | CTA on SOTAB/GitTables/WikiTables | Fine-tune CTA head on benchmark taxonomy |
| Cross-column clustering | CPA on benchmark datasets | Column pair relationship classification |
| Cross-table data element prediction | Enterprise data element discovery | Direct application – same task, real data |
The key advantage: by pretraining on synthetic data with known ground truth at massive scale, the model enters fine-tuning with strong representations for column semantics. The confusable types, cross-table relationships, and ontological hierarchies it has learned from synthetic data transfer directly to the noisy, inconsistently-named, under-documented columns in real enterprise data warehouses.
Integration with Evidence Pipelines
In production, Aegir’s predictions feed into Dempster-Shafer theory (DST) evidence fusion pipelines as a learned evidence source. The model produces:
- Column type predictions with calibrated confidence – these become mass functions in the DST framework
- Column embedding similarities – these provide evidence for same-entity relationships
- Hierarchical type predictions – these constrain the feasible type space for conjunctive combination
The calibration quality of Aegir’s confidence scores matters as much as the accuracy of its top-1 predictions. Training on diverse synthetic data with controlled difficulty (including deliberately confusable types) produces well-calibrated uncertainty estimates, because the model learns from data where the boundary between types is precisely controlled.
The specific self-supervised tasks, corruption strategies, and curriculum design that implement this objective are detailed in Training Tactics.
Training Tactics
The training objective defines what Aegir learns – ontological entity recovery from serialized relational tables. This page defines how: the specific self-supervised tasks, corruption strategies, and curriculum design that compose the pretraining regimen. Each tactic is adapted from a proven LLM pretraining method but re-targeted at the structural properties of relational data with known ontological provenance.
Tactic Overview
Each tactic is described below with its LLM analogue, formal task specification, and the downstream capability it trains.
Core Objectives
Object Property Masking
LLM analogue: Masked Language Modeling (BERT)
Mask one or more properties from an ontological entity definition. The model receives the serialized tables (which still contain the data for the masked properties) and must predict what properties the source entity had.
Difficulty gradation:
| Level | What’s Masked | Challenge |
|---|---|---|
| Easy | A column with structurally distinctive values (dates, emails) | Pattern recognition |
| Medium | A column whose type depends on co-occurring columns | Cross-column reasoning |
| Hard | A column with confusable values (UUID vs advertising ID) | Contextual disambiguation |
| Expert | Multiple properties from the same entity simultaneously | Entity structure reconstruction |
Loss: Cross-entropy over the property type vocabulary, plus a regression loss for predicting the property name embedding.
\[ \mathcal{L}_{\text{OPM}} = -\frac{1}{|M|} \sum_{p \in M} \left[ \log P(y_p \mid \mathbf{h}_p) + \alpha \| \hat{\mathbf{e}}_p - \mathbf{e}_p \|^2 \right] \]
where \(M\) is the set of masked properties, \(y_p\) is the property’s BFO type, \(\mathbf{e}_p\) is the property name embedding, and \(\alpha\) weights the name regression term.
Trains: Column type annotation (CTA). The model learns to identify what semantic role a column plays from its value distribution and surrounding context.
Replaced Column Detection
LLM analogue: Replaced Token Detection (ELECTRA)
Swap columns between tables that originated from different ontological entities. A discriminator must identify which columns are imposters — present in a table they don’t ontologically belong to.
The generator learns to make plausible swaps — columns with similar value distributions but different semantic types. This is precisely the confusable-pair problem. A naive generator might swap patient_id (UUID) with encounter_date (timestamp) — trivially detectable. A trained generator learns to swap patient_id with provider_id (both UUIDs, both foreign-keyed) — a much harder discrimination task.
Two-phase training:
- Generator: A small model that scores candidate column swaps by value-distribution similarity and selects high-similarity pairs
- Discriminator: Aegir itself, trained to detect which columns don’t belong
The ELECTRA insight applies directly: the discriminator receives a training signal on every column (original or replaced), not just the masked positions. This is far more sample-efficient than masking-based objectives.
Loss: Binary cross-entropy per column.
\[ \mathcal{L}_{\text{RCD}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log D(\mathbf{h}_i) + (1 - y_i) \log(1 - D(\mathbf{h}_i)) \right] \]
where \(y_i = 1\) if column \(i\) was replaced and \(D\) is the discriminator head.
Trains: Confusable type resolution. Directly addresses the hardest failure mode in production column annotation — columns with identical value patterns but different semantic roles.
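The generator's swap-selection step can be sketched as follows. This is an illustrative heuristic, not the trained generator: `column_features` and `pick_confusable_swap` are hypothetical names, and the feature set is deliberately minimal (length statistics plus a character-class histogram):

```python
import numpy as np

def column_features(values):
    """Cheap value-distribution features for a column of strings."""
    lens = [len(v) for v in values]
    digits = sum(c.isdigit() for v in values for c in v)
    alphas = sum(c.isalpha() for v in values for c in v)
    dashes = sum(c == "-" for v in values for c in v)
    total = max(digits + alphas + dashes, 1)
    return np.array([np.mean(lens), np.std(lens),
                     digits / total, alphas / total, dashes / total])

def pick_confusable_swap(cols_a, cols_b):
    """Return the cross-table column pair with the most similar features."""
    best, best_sim = None, -1.0
    for na, va in cols_a.items():
        fa = column_features(va)
        for nb, vb in cols_b.items():
            fb = column_features(vb)
            sim = fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-9)
            if sim > best_sim:
                best, best_sim = (na, nb), sim
    return best
```

On the hospital example this heuristic already prefers the ID-for-ID swap over trivially detectable ones.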
Relation Masking
LLM analogue: Next Sentence Prediction (BERT) / Sentence Order Prediction (ALBERT), extended to structural relationships
Drop a foreign key column from the serialized data and ask the model to predict that a relationship between two tables exists, which tables it connects, and what column would mediate it.
Task variants:
| Variant | Input | Target |
|---|---|---|
| Existence | Two tables, FK column removed | Binary: are these tables related? |
| Direction | Two related tables, FK removed | Which table is parent, which is child? |
| Column | Tables with FK removed | Which column in the child table held the FK? |
| Full recovery | Multi-table schema, one FK removed | Predict source table, target table, and mediating column |
Difficulty: Existence is easy (value overlap between tables is a strong signal). Direction requires understanding cardinality from data distributions. Full recovery in a 10-table schema with multiple possible FK targets is genuinely hard.
Loss: Cross-entropy over table pairs for existence/direction, cross-entropy over columns for the FK column prediction.
\[ \mathcal{L}_{\text{RM}} = \mathcal{L}_{\text{exist}} + \beta_1 \mathcal{L}_{\text{direction}} + \beta_2 \mathcal{L}_{\text{column}} \]
Trains: Cross-table data element discovery. The model learns to identify structural relationships between tables from data patterns alone — exactly what’s needed when foreign key metadata is missing or unreliable in enterprise warehouses.
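The existence and direction signals described above can be sketched as simple value-overlap heuristics (illustrative: `fk_signal` is a hypothetical name, and the point of Relation Masking is that the model learns these signals from data rather than computing them explicitly):

```python
def fk_signal(child_values, parent_values):
    """Heuristic Relation Masking signals.

    Existence: fraction of child values found in the parent column.
    Direction: the parent side of a 1:N relationship has unique values
    while the child side repeats.
    """
    parent_set = set(parent_values)
    overlap = sum(v in parent_set for v in child_values) / max(len(child_values), 1)
    parent_unique = len(parent_set) == len(parent_values)
    child_repeats = len(set(child_values)) < len(child_values)
    return {"exists": overlap > 0.95,
            "direction_parent_to_child": parent_unique and child_repeats}
```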
Span Corruption (Entity-Level)
LLM analogue: Span Corruption (T5)
Mask all columns belonging to one data element across all tables and replace them with a sentinel. The model must predict what kind of entity is absent based on the remaining schema structure.
This is harder than single-property masking because the model must reason about the structural hole in the schema. A schema with patients, encounters, medications, and providers but no diagnostic information has a recognizable gap — clinical workflows always involve diagnosis. The model learns domain-level structural expectations.
Masking strategies:
- Single entity: Remove all columns from one BFO class (as above)
- Related pair: Remove two related entities (e.g., Diagnosis and its FK in Encounter)
- Subtree: Remove an entity and all its dependents in the ontological hierarchy
Loss: Sequence-to-sequence generation of the masked entity structure, or classification over a vocabulary of entity type templates.
Trains: Entity boundary detection and structural reasoning. When the model encounters a real database missing expected entities, it can predict what should exist — critical for data governance gap analysis.
Augmentation Strategies
Schema Denoising
LLM analogue: Denoising Autoencoder (BART)
Apply multiple corruptions to the serialized schema simultaneously. The model must recover the clean ontological structure from the noisy input.
Corruption menu (applied stochastically per training example):
| Corruption | What Changes | Real-World Analogue |
|---|---|---|
| Column renaming | date_of_birth → col_3 | Generic column names in enterprise DW |
| Column shuffling | Randomize column order within tables | Arbitrary column ordering conventions |
| Table merging | Join two tables into one wide table | Denormalization for query performance |
| Table splitting | Split one table into arbitrary fragments | Vertical partitioning |
| Type coercion | Store dates as strings, integers as floats | Legacy system type mismatches |
| Delimiter variation | CSV → TSV → pipe-delimited → fixed-width | Different export formats |
| Header removal | Drop column headers entirely | Headerless data exports |
| Row sampling | Keep only a random subset of rows | Partial data access |
Multiple corruptions can stack: rename columns and merge tables and switch delimiters. The model trained on this distribution becomes robust to the full range of real-world schema messiness.
Loss: Reconstruction loss on the original ontological labels applied to the column embeddings from the corrupted input. The corruptions change what the model sees; the targets remain the clean ontological structure.
Trains: Robustness to real-world data formats. Enterprise databases exhibit every one of these corruptions and often several simultaneously.
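A minimal sketch of applying two corruptions from the menu, column renaming and column shuffling (`corrupt_table` is a hypothetical helper; the production pipeline applies the full menu stochastically and keeps the clean labels as targets):

```python
import random

def corrupt_table(header, rows, rng):
    """Apply column shuffling and stochastic column renaming.
    The data values survive; only the presentation changes."""
    n = len(header)
    order = list(range(n))
    rng.shuffle(order)                          # column shuffling
    new_header = []
    for j, idx in enumerate(order):
        if rng.random() < 0.5:                  # column renaming
            new_header.append(f"col_{j}")
        else:
            new_header.append(header[idx])
    new_rows = [[row[idx] for idx in order] for row in rows]
    return new_header, new_rows
```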
Cross-Schema Contrastive Learning
LLM analogue: Contrastive Learning (SimCLR, CLIP)
Generate two different schemas from the same ontology fragment — one normalized with clear names, one denormalized with obfuscated names — and train the model to produce similar representations for both. Schemas from different ontology fragments should produce dissimilar representations.
Positive pairs: Two schema variants from the same ontology fragment. Negative pairs: Schemas from different ontology fragments (even within the same domain — two different healthcare schemas should still be distinguishable).
Loss: InfoNCE contrastive loss over schema-level representations.
\[ \mathcal{L}_{\text{CSC}} = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{\exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_i^b) / \tau)}{\sum_{j \in \mathcal{B}} \exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_j^b) / \tau)} \]
where \(\mathbf{z}_i^a\) and \(\mathbf{z}_i^b\) are schema-level embeddings (pooled from column embeddings) for the two variants of ontology fragment \(i\), and \(\mathcal{B}\) is the batch.
Trains: Schema-invariant representations. The model learns that the same information can appear in radically different structural formats — the core challenge in enterprise data integration.
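The InfoNCE loss can be sketched in numpy (illustrative: `info_nce` is a hypothetical name, and the real loss operates on schema embeddings pooled from the model's column embeddings):

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """InfoNCE over schema-level embeddings: za[i] and zb[i] are the two
    variants of ontology fragment i; other zb[j] in the batch are negatives."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    sims = za @ zb.T / tau                     # (B, B) scaled cosine similarities
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```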
Domain-Specific Objectives
Axiom Recovery
LLM analogue: No direct analogue — novel to this setting
Given only the populated tables (no schema metadata), predict the constraints from the source ontology.
Target axioms:
| Axiom Type | Example | Evidence in Data |
|---|---|---|
| Enum constraint | disposition ∈ {admission, discharge, transfer, observation} | Closed set of distinct values |
| Uniqueness | license_number is unique per provider | No duplicates in column |
| Cardinality | Exactly one is_primary=true per encounter | Group-by count pattern |
| Range | esi_level ∈ [1, 5] | Min/max of integer column |
| Referential | Every encounter.patient_id appears in patient.patient_id | Value subset relationship |
| Functional dependency | zip_code → state | Deterministic mapping in data |
Loss: Multi-label classification over axiom templates, parameterized by column references and value sets.
Trains: Constraint discovery. In production, many database constraints are implicit (enforced by application logic, not declared in the schema). A model that can infer constraints from data patterns provides direct value for data quality assessment and governance.
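Several of the axiom templates above can be checked directly against column values. A sketch (`infer_axioms` is a hypothetical helper; the model is trained to predict axioms from serialized data, not to compute them with rules):

```python
def infer_axioms(column):
    """Infer candidate axioms for one column from its values."""
    axioms = []
    distinct = set(column)
    # Enum constraint: small closed set of string values
    if len(distinct) <= 10 and all(isinstance(v, str) for v in column):
        axioms.append(("enum", frozenset(distinct)))
    # Uniqueness: no duplicates
    if len(distinct) == len(column):
        axioms.append(("unique", None))
    # Range: min/max of an integer column
    if all(isinstance(v, int) for v in column):
        axioms.append(("range", (min(column), max(column))))
    return axioms
```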
Normalization Prediction
LLM analogue: No direct analogue — novel to this setting
Given a denormalized table, predict the normalized ontological entities — which groups of columns should be separate entities.
In the hospital example, a fully denormalized patient_encounters table contains patient demographics, encounter details, vital signs, diagnoses, and medications all in one wide table. The model must predict that this represents 5+ distinct ontological entities that have been collapsed.
The inverse task is also valuable: given a normalized schema, predict which tables could be meaningfully denormalized (i.e., which tables represent qualities or sub-parts of a parent entity).
Loss: Clustering loss over column embeddings within a single table — columns that should be factored into the same normalized entity should cluster together.
\[ \mathcal{L}_{\text{norm}} = -\frac{1}{|\mathcal{P}_{\text{intra}}|} \sum_{(i,j) \in \mathcal{P}_{\text{intra}}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \in \text{cols}(t)} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]
where \(\mathcal{P}_{\text{intra}}\) is the set of column pairs within a single table that originate from the same ontological entity.
Trains: Entity boundary detection within denormalized tables. Real enterprise data warehouses are heavily denormalized for query performance. Recovering the underlying entity structure from a 200-column fact table is a high-value governance task.
Cardinality Estimation
LLM analogue: No direct analogue — extends relational reasoning
Given populated tables, predict the cardinality constraints from the source ontology: one-to-one, one-to-many, or many-to-many.
The model must infer cardinality from value distributions:
- 1:1: Every FK value appears exactly once in both tables
- 1:N: FK values in the child table repeat; each parent PK appears once
- M:N: Both sides have repeating values (mediated by a junction table)
Loss: Cross-entropy over cardinality categories per table pair.
Trains: Relationship characterization. Understanding cardinality is foundational for schema understanding and directly supports both CPA and data element discovery — a 1:1 relationship suggests entity decomposition, while M:N suggests an independent association.
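The three inference rules above can be sketched as (illustrative: `classify_cardinality` is a hypothetical helper operating on paired key columns, where `left_keys[i]` relates to `right_keys[i]`):

```python
from collections import Counter

def classify_cardinality(left_keys, right_keys):
    """Classify a relationship from repetition patterns on each side."""
    left_repeats = max(Counter(left_keys).values()) > 1
    right_repeats = max(Counter(right_keys).values()) > 1
    if not left_repeats and not right_repeats:
        return "1:1"   # every key appears exactly once on both sides
    if left_repeats != right_repeats:
        return "1:N"   # one side repeats, the other is unique
    return "M:N"       # both sides repeat (junction-table pattern)
```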
Difficulty Curriculum
Following UL2’s insight that mixing objectives with explicit difficulty signals outperforms any single objective, training uses a difficulty-tagged curriculum.
Each training example carries a difficulty tag (R, S, X, or Z, in increasing order of difficulty) prepended to the input. The model learns to allocate capacity differently depending on the expected difficulty – using fast pattern matching for R-level tasks and deeper structural reasoning for X- and Z-level tasks.
Curriculum Schedule
Training proceeds in four phases, progressively increasing difficulty:
| Phase | Epochs | Mix (R/S/X/Z) | Objectives Introduced |
|---|---|---|---|
| 1 | 0–10 | 70/20/10/0 | OPM, RCD (easy variants) |
| 2 | 10–30 | 30/40/20/10 | + Relation Masking, Schema Denoising |
| 3 | 30–60 | 10/30/30/30 | + Span Corruption, Cross-Schema Contrastive |
| 4 | 60+ | 10/20/30/40 | + Axiom Recovery, Normalization, Cardinality |
Domain-specific objectives (axiom recovery, normalization prediction, cardinality estimation) are introduced late because they require the model to already have basic column understanding and cross-table reasoning capabilities.
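Sampling a difficulty tag per training example according to the phase mix can be sketched as (the mix values are taken from the schedule above; `sample_difficulty` is a hypothetical helper):

```python
import random

# Phase mixes over difficulty tags (R/S/X/Z) from the curriculum schedule
PHASE_MIX = {
    1: {"R": 0.70, "S": 0.20, "X": 0.10, "Z": 0.00},
    2: {"R": 0.30, "S": 0.40, "X": 0.20, "Z": 0.10},
    3: {"R": 0.10, "S": 0.30, "X": 0.30, "Z": 0.30},
    4: {"R": 0.10, "S": 0.20, "X": 0.30, "Z": 0.40},
}

def sample_difficulty(phase, rng):
    """Draw a difficulty tag for one training example from the phase mix."""
    tags, weights = zip(*PHASE_MIX[phase].items())
    return rng.choices(tags, weights=weights, k=1)[0]
```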
Objective Priority
The objectives are not equally important. Based on downstream task alignment:
| Objective | Priority | Downstream Impact |
|---|---|---|
| Object Property Masking | Core | Directly trains CTA |
| Replaced Column Detection | Core | Resolves confusable pairs — the hardest CTA failures |
| Relation Masking | Core | Directly trains cross-table data element discovery |
| Span Corruption | Core | Trains entity boundary detection |
| Schema Denoising | High | Robustness to real-world data — improves all tasks |
| Cross-Schema Contrastive | High | Schema-invariant representations — critical for transfer |
| Axiom Recovery | Medium | Valuable for governance but not core to CTA/DE |
| Normalization Prediction | Medium | Important for denormalized warehouses |
| Cardinality Estimation | Medium | Supports relationship characterization |
The four core objectives should compose the majority of training compute. Augmentation strategies (denoising, contrastive) are applied as data transformations rather than separate losses. Domain-specific objectives are scheduled in later phases as refinement tasks.
End-to-End Example
This walkthrough traces a single educational text passage through the entire pretraining pipeline – from raw text to a validated training example. Every intermediate representation is shown concretely, making the abstract pipeline tangible.
Step 1: Input Text
A passage from a PDF about hospital emergency department workflows:
Emergency departments manage patient flow through a structured triage process. When a patient arrives, a triage nurse assesses their condition and assigns an acuity level using the Emergency Severity Index (ESI), ranging from 1 (resuscitation) to 5 (non-urgent). Each patient encounter records the presenting complaint, vital signs at triage, the assigned provider, and any diagnostic tests ordered.
Diagnoses are coded using ICD-10-CM, with a primary diagnosis and optional secondary diagnoses recorded per encounter. Medications prescribed during the encounter are tracked with the drug name, dosage, route of administration, and the prescribing provider. The encounter concludes with a disposition decision: admission, discharge, transfer, or observation.
This is a typical educational passage: clear, structured, and rich in implicit ontological content.
Step 2: Ontology Extraction
The LLM receives the passage with a structured extraction prompt and produces a BFO-grounded ontology fragment:
Classes (with BFO alignment):
| Class | BFO Parent | Properties |
|---|---|---|
| Patient | BFO:Object | patient_id, date_of_birth, gender, address |
| Encounter | BFO:Process | encounter_id, encounter_date, presenting_complaint, disposition |
| Provider | BFO:Role | provider_id, name, specialty, license_number |
| Diagnosis | BFO:GDC | diagnosis_id, icd10_code, description, is_primary |
| Medication | BFO:GDC | medication_id, drug_name, dosage, route |
| VitalSigns | BFO:Quality | heart_rate, blood_pressure, temperature, respiratory_rate, spo2 |
| AcuityLevel | BFO:Quality | esi_level (1–5) |
Relations:
| Relation | Domain | Range | Cardinality |
|---|---|---|---|
| hasEncounter | Patient | Encounter | 1..* |
| hasProvider | Encounter | Provider | 1..1 |
| hasDiagnosis | Encounter | Diagnosis | 1..* |
| hasMedication | Encounter | Medication | 0..* |
| hasVitalSigns | Encounter | VitalSigns | 1..1 |
| hasAcuity | Encounter | AcuityLevel | 1..1 |
| prescribedBy | Medication | Provider | 1..1 |
Axioms:
- `Encounter.disposition ∈ {admission, discharge, transfer, observation}`
- `AcuityLevel.esi_level ∈ {1, 2, 3, 4, 5}`
- `Diagnosis.is_primary` is unique per Encounter (exactly one primary diagnosis)
Step 3: SysMLv2 Model
The ontology maps to SysMLv2 block definitions. The SysMLv2 model adds lifecycle semantics (the Encounter state machine: entry → triage → treatment → disposition → closed) and formal constraints that the flat ontology fragment does not capture.
Step 4: Python Data Objects
```python
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum


class Disposition(Enum):
    ADMISSION = "admission"
    DISCHARGE = "discharge"
    TRANSFER = "transfer"
    OBSERVATION = "observation"


class Route(Enum):
    ORAL = "oral"
    IV = "intravenous"
    IM = "intramuscular"
    TOPICAL = "topical"
    INHALED = "inhaled"


@dataclass
class Patient:
    patient_id: str
    date_of_birth: date
    gender: str
    address: str


@dataclass
class Provider:
    provider_id: str
    name: str
    specialty: str
    license_number: str


@dataclass
class Encounter:
    encounter_id: str
    patient_id: str  # FK → Patient
    provider_id: str  # FK → Provider
    encounter_date: datetime
    presenting_complaint: str
    esi_level: int  # 1-5
    disposition: Disposition
    heart_rate: int
    blood_pressure: str
    temperature: float
    respiratory_rate: int
    spo2: int


@dataclass
class Diagnosis:
    diagnosis_id: str
    encounter_id: str  # FK → Encounter
    icd10_code: str
    description: str
    is_primary: bool


@dataclass
class Medication:
    medication_id: str
    encounter_id: str  # FK → Encounter
    prescribed_by: str  # FK → Provider
    drug_name: str
    dosage: str
    route: Route
```
Note that VitalSigns and AcuityLevel (BFO:Quality entities) have been denormalized into the Encounter table – a deliberate schema variation that the model must learn to handle. In a different schema variant, these would be separate tables.
Step 5: Relational Schema
```sql
CREATE TABLE patient (
    patient_id VARCHAR(36) PRIMARY KEY,
    date_of_birth DATE NOT NULL,
    gender VARCHAR(10) NOT NULL,
    address TEXT
);

CREATE TABLE provider (
    provider_id VARCHAR(36) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    specialty VARCHAR(50) NOT NULL,
    license_number VARCHAR(20) NOT NULL UNIQUE
);

CREATE TABLE encounter (
    encounter_id VARCHAR(36) PRIMARY KEY,
    patient_id VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    provider_id VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    encounter_date TIMESTAMP NOT NULL,
    presenting_complaint TEXT NOT NULL,
    esi_level INTEGER NOT NULL CHECK (esi_level BETWEEN 1 AND 5),
    disposition VARCHAR(20) NOT NULL
        CHECK (disposition IN ('admission','discharge','transfer','observation')),
    heart_rate INTEGER,
    blood_pressure VARCHAR(10),
    temperature NUMERIC(4,1),
    respiratory_rate INTEGER,
    spo2 INTEGER CHECK (spo2 BETWEEN 0 AND 100)
);

CREATE TABLE diagnosis (
    diagnosis_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    icd10_code VARCHAR(10) NOT NULL,
    description TEXT,
    is_primary BOOLEAN NOT NULL DEFAULT FALSE
);

-- At most one primary diagnosis per encounter. A plain UNIQUE (encounter_id, is_primary)
-- would also forbid a second secondary diagnosis, so a partial unique index is used.
CREATE UNIQUE INDEX diagnosis_one_primary_per_encounter
    ON diagnosis (encounter_id) WHERE is_primary;

CREATE TABLE medication (
    medication_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    prescribed_by VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    drug_name VARCHAR(100) NOT NULL,
    dosage VARCHAR(50) NOT NULL,
    route VARCHAR(20) NOT NULL
);
```
Step 6: Synthetic Data
Sample rows from the populated tables:
patient (200 rows):
| patient_id | date_of_birth | gender | address |
|---|---|---|---|
| a3f8c1d0-... | 1987-03-15 | Female | 2847 Oak Ave, Portland, OR 97205 |
| b7e2a4f1-... | 1952-11-28 | Male | 156 Pine St, Austin, TX 78701 |
| c9d0b3e2-... | 2001-07-04 | Female | 4021 Maple Dr, Denver, CO 80202 |
encounter (1,400 rows, ~7 per patient):
| encounter_id | patient_id | provider_id | encounter_date | presenting_complaint | esi_level | disposition | heart_rate | blood_pressure | temperature |
|---|---|---|---|---|---|---|---|---|---|
| e1a2b3c4-... | a3f8c1d0-... | p001-... | 2024-01-15 14:30 | Acute chest pain | 2 | admission | 98 | 145/92 | 98.6 |
| e5f6a7b8-... | b7e2a4f1-... | p003-... | 2024-02-03 09:15 | Laceration, left hand | 4 | discharge | 72 | 128/78 | 98.2 |
diagnosis (3,200 rows, ~2.3 per encounter):
| diagnosis_id | encounter_id | icd10_code | description | is_primary |
|---|---|---|---|---|
| d100-... | e1a2b3c4-... | I21.9 | Acute myocardial infarction, unspecified | true |
| d101-... | e1a2b3c4-... | I10 | Essential hypertension | false |
| d200-... | e5f6a7b8-... | S61.412A | Laceration without FB, left hand | true |
medication (2,100 rows):
| medication_id | encounter_id | prescribed_by | drug_name | dosage | route |
|---|---|---|---|---|---|
| m100-... | e1a2b3c4-... | p001-... | Aspirin | 325mg | oral |
| m101-... | e1a2b3c4-... | p001-... | Heparin | 5000 units | intravenous |
| m200-... | e5f6a7b8-... | p003-... | Lidocaine | 1% 5mL | topical |
Step 7: Serialized Input
Aegir receives the tables as byte-serialized CSV data. Here’s what the model actually sees (abbreviated):
```
patient_id,date_of_birth,gender,address
a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0,1987-03-15,Female,"2847 Oak Ave, Portland, OR 97205"
b7e2a4f1-3c5d-4e6f-8a9b-c0d1e2f3a4b5,1952-11-28,Male,"156 Pine St, Austin, TX 78701"
c9d0b3e2-1a4f-4c7d-9e2b-f3a5b6c7d8e9,2001-07-04,Female,"4021 Maple Dr, Denver, CO 80202"
...
===TABLE_BOUNDARY===
encounter_id,patient_id,provider_id,encounter_date,presenting_complaint,esi_level,disposition,heart_rate,blood_pressure,temperature,respiratory_rate,spo2
e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c,a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0,p001-a2b3-c4d5,2024-01-15 14:30:00,Acute chest pain,2,admission,98,145/92,98.6,20,97
...
===TABLE_BOUNDARY===
diagnosis_id,encounter_id,icd10_code,description,is_primary
d100-e1f2-a3b4-c5d6,e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c,I21.9,"Acute myocardial infarction, unspecified",true
...
```
The model sees raw bytes. No type annotations, no foreign key declarations, no semantic metadata – just the patterns in the data itself.
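A minimal sketch of this serialization format (`serialize_tables` is a hypothetical helper; the production serializer handles quoting, escaping, and encoding more carefully):

```python
def serialize_tables(tables):
    """Serialize a list of (header, rows) tables into the byte stream the
    model consumes, separated by the table-boundary sentinel."""
    def quote(v):
        # minimal CSV quoting: wrap values containing commas
        return f'"{v}"' if "," in v else v
    chunks = []
    for header, rows in tables:
        lines = [",".join(header)]
        lines += [",".join(quote(v) for v in row) for row in rows]
        chunks.append("\n".join(lines))
    return "\n===TABLE_BOUNDARY===\n".join(chunks).encode("utf-8")
```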
Step 8: Training Target
The expected predictions for this training example:
CTA predictions (per column):
| Table | Column | Expected Type | BFO Category |
|---|---|---|---|
| patient | patient_id | PersonIdentifier | GDC |
| patient | date_of_birth | BirthDate | Quality |
| patient | gender | BiologicalSex | Quality |
| encounter | encounter_id | EncounterIdentifier | GDC |
| encounter | patient_id | PersonIdentifier (FK) | GDC |
| encounter | esi_level | AcuityLevel | Quality |
| encounter | disposition | DispositionDecision | Quality |
| encounter | heart_rate | VitalSign | Quality |
| diagnosis | icd10_code | DiagnosisCode | GDC |
| diagnosis | is_primary | PrimaryIndicator | Quality |
| medication | drug_name | MedicationName | GDC |
| medication | dosage | Dosage | Quality |
| medication | route | AdministrationRoute | Quality |
Data element predictions (cross-table clusters):
| Data Element | Columns | Source Entity |
|---|---|---|
| PatientDemographics | patient.patient_id, patient.date_of_birth, patient.gender, patient.address, encounter.patient_id | Patient |
| ClinicalEncounter | encounter.encounter_id, encounter.encounter_date, encounter.presenting_complaint, encounter.esi_level, encounter.disposition, encounter.heart_rate, encounter.blood_pressure, encounter.temperature | Encounter + VitalSigns + AcuityLevel |
| DiagnosisRecord | diagnosis.diagnosis_id, diagnosis.encounter_id, diagnosis.icd10_code, diagnosis.description, diagnosis.is_primary | Diagnosis |
| MedicationOrder | medication.medication_id, medication.encounter_id, medication.drug_name, medication.dosage, medication.route | Medication |
| ClinicalProvider | provider.provider_id, provider.name, provider.specialty, provider.license_number, encounter.provider_id, medication.prescribed_by | Provider |
Note that the PatientDemographics data element spans patient.patient_id and encounter.patient_id – cross-table discovery. Similarly, ClinicalProvider spans columns in three tables (provider, encounter, medication). This is exactly the cross-table data element discovery that enterprise data governance requires.
Step 9: Validation
The round-trip check confirms that predicted data elements map back to source ontological entities:
Every predicted data element corresponds to exactly one source ontology entity. The ClinicalEncounter element correctly groups encounter properties with the denormalized VitalSigns and AcuityLevel qualities – demonstrating that the model learned to see through the denormalization to the underlying ontological structure.
This validation is automatic and exact because the generation pipeline preserves complete provenance. There is no human labeling, no ambiguity, and no annotation disagreement. The ground truth is a mathematical consequence of the generation process.
What This Means in Practice
When this training process is applied at scale – across hundreds of millions of passages spanning every domain in FineWeb-Edu – the model learns:
- Column type recognition that generalizes across naming conventions, data formats, and serialization styles
- Cross-table relationship discovery that identifies semantically related columns regardless of which tables they appear in
- Ontological hierarchy that connects specific types (ICD-10 codes) to general categories (information entities) through BFO’s formal structure
- Confusable type resolution by leveraging cross-column context (patient_id vs provider_id look identical in isolation but participate in different relationship patterns)
These capabilities transfer directly to real enterprise data warehouses, where the model encounters the same patterns – just without the luxury of knowing the ontological provenance in advance.
Agent Swarm Architecture
Aegir’s agent swarm infrastructure enables multi-agent collaboration through RWKV recurrent state fusion. Rather than exchanging text messages or attention KV caches between agents, the swarm shares compact recurrent state tensors – a fundamentally more efficient communication medium for recurrent architectures.
Why RWKV State Sharing
The central insight is that RWKV’s recurrent state is constant in sequence length. Each layer’s state is a matrix of shape (H, K, V) where H is the number of heads and K = V = head_size. The total state size per layer is:
O(H * head_size^2) = O(d_model * head_size)
This is independent of how many tokens the agent has processed.
For a swarm of N agents, the cost of sharing all recurrent states is:
RWKV: O(N * d_model * head_size) -- constant in sequence length
Transformer: O(N * n * d_model) -- linear in sequence length n
At context lengths of 4k-128k tokens with typical d = 512-4096, RWKV state sharing is orders of magnitude cheaper. The LatentMAS paper (arXiv:2511.20639) quantifies this as 235-471x more information-dense than text-based inter-agent communication, since the recurrent state encodes a compressed summary of the entire processing history.
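A quick numeric comparison, counting state elements per layer (illustrative; the helper names are hypothetical, and the transformer figure counts a standard per-layer KV cache):

```python
def rwkv_state_size(n_layers, d_model, head_size):
    """Per-agent recurrent state: one (H, K, V) matrix per layer,
    with H = d_model // head_size and K = V = head_size."""
    H = d_model // head_size
    return n_layers * H * head_size * head_size  # = n_layers * d_model * head_size

def transformer_kv_size(n_layers, d_model, n_tokens):
    """KV cache: keys and values for every token at every layer."""
    return n_layers * 2 * n_tokens * d_model
```

For a 24-layer model with `d_model=2048`, `head_size=64`, and a 32k-token context, the KV cache is three orders of magnitude larger than the recurrent state.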
Swarm Components
The swarm consists of four modules:
| Module | File | Purpose |
|---|---|---|
| `RWKVStateFusion` | `src/aegir/swarm/state_fusion.py` | Combine N agent states into one |
| `AlignmentProjection` | `src/aegir/swarm/alignment.py` | Map states between different-sized agents |
| `FrozenSpecialist` | `src/aegir/swarm/specialist.py` | Wrap pre-trained models as frozen agents |
| `SwarmOrchestrator` | `src/aegir/swarm/orchestrator.py` | K2.5 PARL routing and reward |
State Fusion Modes
RWKVStateFusion supports three strategies for combining agent states:
- `weighted_sum` – Attention-weighted combination using learnable query/key projections. The orchestrator learns which agents to trust per head.
- `gated` – Per-agent softmax gates. Simpler than attention but still differentiable. Good baseline for initial experiments.
- `concat_project` – Concatenate all agent states and project back to single-agent dimensions. Most expressive but `O(N)` in parameter count.
See RWKV State Fusion for mathematical details.
Information Density Advantage
LatentMAS demonstrates that recurrent state communication dramatically outperforms text-based multi-agent protocols. The recurrent state is a lossy but highly compressed representation of the agent’s entire context window. Sharing it is equivalent to sharing a continuous-valued “summary” that preserves the information most relevant to the model’s computation, rather than forcing that information through a text bottleneck.
For Aegir’s column annotation task, this means a specialist trained on (say) geographic column types can share its accumulated understanding of a table’s structure through a single (H, K, V) tensor per layer, rather than generating and parsing natural language explanations.
RWKV State Fusion
The RWKVStateFusion module combines recurrent states from multiple specialist agents into a single fused state for the primary agent. Implementation is in src/aegir/swarm/state_fusion.py.
Input Format
Each agent produces a per-layer recurrent state tensor of shape:
(B, H, K, V)
where B is batch size, H = num_heads, and K = V = head_size. Given N agents, the fusion module receives a list of N such tensors and outputs a single tensor of the same shape.
Internally, the input list is stacked into a single tensor of shape (B, N, H, K, V).
Fusion Modes
weighted_sum – Attention Over Agent States
Uses a learnable query vector per head and a key projection to compute attention weights over agents.
Parameters:
- `query`: `(H, K)` – learnable query per attention head
- `key_proj`: linear mapping `K*V -> K` (no bias)
Computation:
```
flat    = reshape(stacked, [B, N, H, K*V])
keys    = key_proj(flat)                     # (B, N, H, K)
scores  = einsum("bnhk, hk -> bnh", keys, query)
weights = softmax(scores, dim=1)             # (B, N, H)
fused   = einsum("bnh, bnhkv -> bhkv", weights, stacked)
```
Each head independently learns which agents to attend to. This is the default mode and generally the most effective, since it allows fine-grained per-head routing without excessive parameters.
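The weighted_sum computation can be made executable in numpy (a sketch of the forward pass only; the real `RWKVStateFusion` is a learnable torch module whose `query` and `key_proj` parameters are trained):

```python
import numpy as np

def weighted_sum_fusion(stacked, query, key_w):
    """Numpy sketch of the weighted_sum fusion mode.

    stacked: (B, N, H, K, V) stacked agent states
    query:   (H, K) per-head query
    key_w:   (K, K*V) key projection weights (no bias)
    """
    B, N, H, K, V = stacked.shape
    flat = stacked.reshape(B, N, H, K * V)
    keys = flat @ key_w.T                              # (B, N, H, K)
    scores = np.einsum("bnhk,hk->bnh", keys, query)
    scores -= scores.max(axis=1, keepdims=True)        # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over agents
    return np.einsum("bnh,bnhkv->bhkv", weights, stacked)
```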
gated – Learnable Per-Agent Gates
A simpler approach with a single learnable gate vector.
Parameters:
- `gates`: `(N,)` – initialized to `1/N` (uniform)
Computation:
```
weights = softmax(gates, dim=0)              # (N,)
fused   = einsum("n, bnhkv -> bhkv", weights, stacked)
```
All heads share the same agent weighting. This is cheaper than weighted_sum but less expressive – it cannot learn head-specific preferences for different specialists.
concat_project – Concatenate and Project
The most expressive mode. Concatenates all agent states along the agent dimension and projects back.
Parameters:
- `proj`: linear mapping `N*K*V -> K*V` (no bias)
Computation:
```
flat      = reshape(permute(stacked, [0, 2, 1, 3, 4]), [B, H, N*K*V])
projected = proj(flat)                       # (B, H, K*V)
fused     = reshape(projected, [B, H, K, V])
```
This allows arbitrary mixing of information across agents within each head but scales linearly in parameters with the number of agents.
Usage Example
```python
from aegir.swarm.state_fusion import RWKVStateFusion

fusion = RWKVStateFusion(
    num_heads=8,
    head_size=64,
    num_agents=3,
    mode="weighted_sum",
)

# agent_states: list of 3 tensors, each (B, 8, 64, 64)
fused_state = fusion(agent_states)  # (B, 8, 64, 64)
```
Mode Selection Guidelines
| Mode | Parameters | Per-head routing | Best for |
|---|---|---|---|
| `weighted_sum` | O(H*K + K*V*K) | Yes | General use, default |
| `gated` | O(N) | No | Quick experiments, few agents |
| `concat_project` | O(N*K*V*K*V) | Yes | Maximum expressiveness, small N |
LatentMAS Alignment Projection
The AlignmentProjection module maps recurrent states between agents that may have different architectures (different d_model, num_heads, or head_size). Implementation is in src/aegir/swarm/alignment.py.
Problem
When fusing states from multiple agents, all states must share the same (H, K, V) dimensions. But specialists may have been trained with different model sizes. A CTA specialist with d_model=256 and a CPA specialist with d_model=512 produce incompatible recurrent states. The alignment projection resolves this mismatch.
State Types
RWKV recurrent states consist of two kinds of tensors:
Matrix States (att_kv)
The core recurrent state from time mixing. Shape: (B, H, K, V) where K = V = head_size.
Projection: When source and target have different num_heads or head_size, the matrix state is flattened and linearly projected:
```
S_flat   = reshape(S_source, [B, H_s * K_s * V_s])
S_target = W_matrix @ S_flat
S_out    = reshape(S_target, [B, H_t, K_t, V_t])
```
where W_matrix has shape (H_t * K_t * V_t, H_s * K_s * V_s).
The LatentMAS paper (arXiv:2511.20639) proposes using a bilinear projection S' = W_l @ S @ W_r^T and computing the projection matrices via ridge regression on paired agent activations. Aegir instead trains the projection end-to-end as part of the swarm’s gradient flow, which avoids the need for a separate alignment data collection phase and allows the projection to co-adapt with the fusion module.
Vector States (att_x_prev, ffn_x_prev)
The previous-timestep hidden state cache used by RWKV’s time-shift mechanism. Shape: (B, D) where D = d_model.
Projection: Simple linear mapping when d_model differs:
x_target = W_vector @ x_source
where W_vector has shape (D_target, D_source).
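To make the two projections concrete, here is a toy PyTorch sketch of the flatten + linear mapping described above (TinyAlignment and its small dimensions are hypothetical; the real AlignmentProjection lives in src/aegir/swarm/alignment.py and uses the full head sizes):

```python
import torch
import torch.nn as nn

class TinyAlignment(nn.Module):
    """Toy version of the flatten + linear state projections. Small head_size
    keeps the example light; real states use head_size 64."""
    def __init__(self, h_s, k_s, v_s, h_t, k_t, v_t, d_s, d_t):
        super().__init__()
        # W_matrix has shape (H_t*K_t*V_t, H_s*K_s*V_s)
        self.w_matrix = nn.Linear(h_s * k_s * v_s, h_t * k_t * v_t, bias=False)
        # W_vector has shape (D_target, D_source)
        self.w_vector = nn.Linear(d_s, d_t, bias=False)
        self.target_shape = (h_t, k_t, v_t)

    def forward_matrix(self, s):              # s: (B, H_s, K_s, V_s)
        s_flat = s.flatten(start_dim=1)       # (B, H_s*K_s*V_s)
        return self.w_matrix(s_flat).reshape(s.shape[0], *self.target_shape)

    def forward_vector(self, x):              # x: (B, D_s)
        return self.w_vector(x)

align = TinyAlignment(h_s=4, k_s=8, v_s=8, h_t=8, k_t=8, v_t=8, d_s=32, d_t=64)
s_out = align.forward_matrix(torch.randn(2, 4, 8, 8))   # -> (2, 8, 8, 8)
x_out = align.forward_vector(torch.randn(2, 32))        # -> (2, 64)
```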
When Projections Are Needed
The module detects whether projection is needed at initialization:
# Matrix projection: needed when head geometry differs
needs_matrix_proj = (
source_num_heads != target_num_heads
or source_head_size != target_head_size
)
# Vector projection: needed when d_model differs
needs_vector_proj = (source_d_model != target_d_model)
When source and target share the same architecture, both projections are identity operations (no parameters allocated).
Usage
from aegir.swarm.alignment import AlignmentProjection
align = AlignmentProjection(
source_num_heads=4, source_head_size=64,
target_num_heads=8, target_head_size=64,
source_d_model=256,
target_d_model=512,
)
# Project matrix state
att_kv_target = align.forward_matrix(att_kv_source) # (B,4,64,64) -> (B,8,64,64)
# Project vector state
x_prev_target = align.forward_vector(x_prev_source) # (B,256) -> (B,512)
LatentMAS vs Aegir Approach
| Aspect | LatentMAS | Aegir |
|---|---|---|
| Alignment method | Ridge regression on collected pairs | End-to-end gradient training |
| Training data | Requires parallel agent runs | Learned during swarm training |
| Adaptability | Fixed after alignment phase | Continuously adapts |
| Projection type | Bilinear W_l @ S @ W_r^T | Flatten + linear (at least as expressive) |
The end-to-end approach is viable because Aegir’s swarm training already has gradient flow through the fusion module. The alignment projection sits in that gradient path and receives signal from the downstream task loss.
K2.5 PARL Orchestrator
The SwarmOrchestrator coordinates a trainable primary Aegir model with multiple frozen specialist agents, following the Parallel Agent Reinforcement Learning (PARL) pattern from Kimi K2.5 (arXiv:2602.02276). Implementation is in src/aegir/swarm/orchestrator.py.
Architecture
+-------------------+
| SwarmOrchestrator |
+-------------------+
|
+--------------+--------------+
| | |
SpecialistRouter Primary FrozenSpecialists
(sigmoid gates) (trainable) (frozen params)
| | |
| | +---------+---------+
| | | | |
| | Spec_0 Spec_1 Spec_N
| | | | |
+--> activation --> state fusion <----+
mask (RWKVStateFusion)
The primary model is the only component whose parameters are updated during PARL training. Specialists are frozen checkpoints that contribute their recurrent states when activated by the router.
SpecialistRouter
The router decides which specialists to activate for a given input. It maps the primary agent’s hidden representation to per-specialist activation scores:
scores = sigmoid(W_router @ hidden_states) # (B, num_specialists)
activation_mask = scores > threshold # default threshold = 0.5
Sigmoid gating (rather than softmax) allows zero, one, or multiple specialists to be activated simultaneously. This is critical for the column annotation task where a table may require expertise from several domain specialists, or none at all.
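As a concrete illustration, sigmoid gating can be sketched in a few lines (TinyRouter is a hypothetical simplification of SpecialistRouter in src/aegir/swarm/orchestrator.py; the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class TinyRouter(nn.Module):
    """Toy sketch of sigmoid-gated specialist routing."""
    def __init__(self, d_model, num_specialists, threshold=0.5):
        super().__init__()
        self.proj = nn.Linear(d_model, num_specialists)
        self.threshold = threshold

    def forward(self, hidden):                       # hidden: (B, d_model)
        scores = torch.sigmoid(self.proj(hidden))    # independent gates in (0, 1)
        mask = scores > self.threshold               # zero, one, or many True per row
        return scores, mask

router = TinyRouter(d_model=16, num_specialists=3)
scores, mask = router(torch.randn(2, 16))            # (2, 3) scores and boolean mask
```

Because each gate is an independent sigmoid rather than one slot in a softmax, any number of entries in a row of the mask can be True at once.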
PARL Reward Structure
The combined reward follows K2.5’s formulation:
r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf
Reward Components
r_perf – Performance reward. Macro F1 on the annotation task (CTA or CPA). This is the primary signal that drives annotation quality.
r_parallel – Parallelism and load balancing reward. Encourages efficient specialist utilization: activate specialists when they help, avoid activating them when they don’t. Adapted from H-Net’s lb_loss which penalizes unbalanced routing across experts.
r_finish – Completion quality reward. All columns in a table must be annotated, and the router must not degenerate into always-on or always-off patterns. Penalizes incomplete annotations and trivial routing strategies.
Lambda Annealing Schedule
Following K2.5, the lambda weights anneal over training:
| Phase | lambda_1 (parallel) | lambda_2 (finish) | Rationale |
|---|---|---|---|
| Early | 0.3 | 0.1 | Encourage exploration of specialist activation |
| Mid | 0.1 | 0.3 | Shift focus to completion quality |
| Late | 0.05 | 0.05 | Let r_perf dominate for final accuracy |
The initial values (lambda_parallel=0.3, lambda_finish=0.1) are set in the orchestrator constructor. Annealing is managed by the training loop.
Token-Level Clipping RL
K2.5 uses a variant of PPO with token-level clipping rather than trajectory-level. This provides finer-grained credit assignment:
- Each token’s routing decision gets its own clipped surrogate objective
- Critical tokens (column boundaries, type-indicative values) receive higher weight
- The clipping range narrows over training to stabilize converged policies
Critical-Steps Optimization
Rather than minimizing total computation, the orchestrator minimizes the critical path – the longest chain of sequential dependencies. Specialist activations that can run in parallel do not increase the critical path even if they increase total FLOPs. This encourages the router to prefer parallel specialist activation over sequential reasoning in the primary model when both achieve similar accuracy.
Forward Pass
orchestrator = SwarmOrchestrator(
primary_model=primary,
specialists=[spec_cta, spec_cpa, spec_geo],
fusion=RWKVStateFusion(num_heads=8, head_size=64, num_agents=3),
d_model=512,
activation_threshold=0.5,
)
result = orchestrator(
input_ids=tokens,
mask=mask,
routing_hidden=pooled_hidden, # from primary's first layer
)
# result["output"] -- primary model output
# result["specialist_outputs"] -- list of activated specialist results
# result["activation_mask"] -- (B, num_specialists) boolean mask
When routing_hidden is None, specialist activation is skipped entirely and only the primary model runs. This allows the same orchestrator to be used in both supervised pre-training (no specialists) and PARL training (with specialists).
Roadmap: K2.5 RL Post-Training
This section outlines the four-phase plan for training Aegir from a supervised baseline through full multi-agent reinforcement learning with PARL orchestration.
Overview
The training follows a progressive complexity increase, where each phase builds on the previous one’s checkpoints and infrastructure:
Phase 1 Phase 2 Phase 3 Phase 4
Supervised --> Reward --> PARL --> Agent
Bootstrapping Modeling Training Swarm RL
Train base Design reward Train orchestrator Scale to
Aegir on CTA/CPA components and with frozen multi-specialist
benchmarks validate signals specialists swarms
Phases
Phase 1: Supervised Bootstrapping
Train the base Aegir model on column annotation benchmarks (CTA, CPA) with byte-level input and dynamic chunking. Establish baseline F1 scores and validate the hierarchical architecture on real table data.
Phase 2: Reward Modeling
Design and validate the three reward components (r_perf, r_parallel, r_finish) that will drive PARL training. Calibrate lambda weights and verify that the reward signal produces meaningful gradients.
Phase 3: PARL Training
Freeze the best Phase 1 checkpoint as a specialist and train a new primary model with the PARL orchestrator. Use token-level clipping RL with critical-steps optimization.
Phase 4: Agent Swarm RL
Scale from a single specialist to a full swarm with dynamic specialist spawning. Implement wide search (parallel column analysis) and deep search (hierarchical type reasoning) patterns.
Design Principles
- Each phase produces a usable checkpoint. Even Phase 1 yields a competitive standalone column annotation model.
- Frozen specialists are never modified. PARL training only updates the primary model and the routing/fusion modules. This prevents catastrophic forgetting in specialists and simplifies the training loop.
- Reward components are validated independently. Phase 2 exists specifically to ensure that r_parallel and r_finish produce meaningful gradients before combining them with r_perf in Phase 3.
- Complexity is additive, not multiplicative. Each phase adds exactly one new dimension of complexity (multi-task --> reward signals --> RL policy --> multi-agent), making failures easy to diagnose.
Phase 1: Supervised Bootstrapping
Train the base Aegir model on Column Type Annotation (CTA) and Column Property Annotation (CPA) benchmarks with byte-level input. This phase establishes baseline performance and validates the hierarchical architecture on real tabular data.
Objective
Produce a single Aegir checkpoint that achieves competitive F1 scores on standard CTA/CPA benchmarks, operating directly on raw byte sequences (no external tokenizer).
Target Datasets
| Dataset | Task | Tables | Columns | Label Classes |
|---|---|---|---|---|
| SOTAB-CTA | Column Type Annotation | ~50k | ~500k | 91 semantic types |
| GitTables | CTA (large-scale) | ~1.5M | ~15M | Schema.org types |
| WikiTables | CTA/CPA | ~1.7M | ~6M | DBpedia ontology |
Baseline F1 Targets
These targets are based on published results from SOTAB and Retrieve-and-Verify:
| Benchmark | Metric | Target F1 |
|---|---|---|
| SOTAB-CTA (easy) | Macro F1 | > 0.85 |
| SOTAB-CTA (hard) | Macro F1 | > 0.65 |
| SOTAB-CPA | Macro F1 | > 0.75 |
Byte-Level Input
Aegir operates on raw byte sequences (vocab_size=65536 to cover byte values plus special tokens). Tables are serialized into a linear byte stream with role markers distinguishing the target column from context columns.
Dynamic chunking learns tokenization from raw bytes. The RoutingModule in the hierarchical backbone predicts chunk boundaries based on cosine similarity between adjacent hidden states. This means the model discovers its own sub-word units during training, adapting segmentation to the statistics of tabular data rather than relying on a fixed tokenizer trained on natural language.
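The boundary-prediction idea can be sketched in a few lines (a hypothetical simplification: the real RoutingModule in src/aegir/modules/dc.py applies learned projections before the similarity, and this toy version uses raw hidden states):

```python
import torch
import torch.nn.functional as F

def boundary_probs(hidden):
    """Toy boundary predictor: dissimilar neighbouring hidden states suggest
    a chunk boundary. hidden: (B, L, D); returns (B, L-1) probabilities,
    where entry t is the boundary probability between positions t and t+1."""
    cos = F.cosine_similarity(hidden[:, :-1], hidden[:, 1:], dim=-1)  # (B, L-1)
    return ((1.0 - cos) / 2.0).clamp(0.0, 1.0)  # map similarity [-1, 1] -> [0, 1]

h = torch.randn(2, 10, 32)
p = boundary_probs(h)   # (2, 9): one boundary probability per adjacent pair
```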
Serialization Format
Tables are serialized using the format in src/aegir/data/serialization.py:
[CLS] col_name: val1 | val2 | val3 [SEP] ctx_col1: v1 | v2 [SEP] ctx_col2: ...
The target column comes first, followed by context columns selected via MMR (Maximal Marginal Relevance) to maximize diversity while staying within the byte budget.
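A minimal serializer following this format might look as follows (a hypothetical helper for illustration only; the real implementation is src/aegir/data/serialization.py, and the context list is assumed to be pre-ordered by MMR):

```python
def serialize_table(target_name, target_values, context, max_bytes=512):
    """Toy serializer for the documented format.

    context: list of (col_name, values) pairs, already ordered by MMR."""
    parts = [f"[CLS] {target_name}: " + " | ".join(target_values)]
    for name, values in context:
        parts.append(f"[SEP] {name}: " + " | ".join(values))
    return " ".join(parts).encode("utf-8")[:max_bytes]  # enforce the byte budget

seq = serialize_table(
    "city", ["Berlin", "Oslo"],
    [("country", ["DE", "NO"]), ("population", ["3.6M", "0.7M"])],
)
# b'[CLS] city: Berlin | Oslo [SEP] country: DE | NO [SEP] population: 3.6M | 0.7M'
```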
Training Configuration
Single-GPU (Development)
uv run --no-sync python train.py \
--model-size tiny \
--epochs 30 \
--batch-size 32 \
--lr 3e-4
Multi-GPU with DDP
uv run --no-sync torchrun --nproc_per_node=6 train.py \
--model-size small \
--epochs 100 \
--batch-size 64 \
--lr 1e-4
Training uses:
- DDP (DistributedDataParallel) across GPUs
- AMP (Automatic Mixed Precision) with bf16
- Cosine LR schedule with linear warmup
- Load balancing loss adapted from H-Net to regularize dynamic chunking
Model Sizes
| Size | d_model | Layers | Parameters | Use Case |
|---|---|---|---|---|
| tiny | [128, 192, 192] | ~10 | ~2M | Smoke tests, CI |
| small | [256, 384, 384] | ~20 | ~15M | Development, ablations |
| base | [512, 768, 768] | ~40 | ~120M | Benchmark evaluation |
Success Criteria
Phase 1 is complete when:
- The base model meets or exceeds F1 targets on SOTAB-CTA/CPA
- Dynamic chunking converges to stable boundary predictions (no degenerate all-boundary or no-boundary patterns)
- The trained checkpoint can be frozen and used as a specialist in Phase 3
Phase 2: Reward Modeling
Design and validate the three reward components that will drive PARL training in Phase 3. The goal is to ensure each reward signal produces meaningful, non-degenerate gradients before combining them into the full PARL objective.
Reward Components
r_perf – Performance Reward
The primary quality signal. Computed as macro F1 on held-out annotation tasks:
r_perf = macro_F1(predicted_labels, ground_truth)
For CTA, this is the macro-averaged F1 over all 91 semantic type classes. For CPA, it is the macro F1 over property classes.
This reward is straightforward to compute and directly measures what we care about. The challenge is that F1 is non-differentiable, so it must be used as an RL reward signal rather than a supervised loss (which uses cross-entropy as a differentiable proxy).
r_parallel – Load Balancing and Specialist Utilization
Adapted from H-Net’s lb_loss, this reward encourages efficient use of specialists:
r_parallel = -alpha * CV(activation_counts) + beta * utilization_rate
where:
- CV(activation_counts) is the coefficient of variation of specialist activation counts across a batch. Penalizes routing that always sends to the same specialist.
- utilization_rate is the fraction of specialists activated at least once in a batch. Rewards using the full specialist pool.
- alpha, beta are tunable coefficients.
A degenerate router that always activates all specialists or never activates any will score poorly on this component. The reward is maximized when specialists are activated selectively and roughly equally.
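For illustration, the CV and utilization terms can be computed directly from a batch of routing decisions (a plain-Python sketch; the alpha/beta values and the 0/1 mask encoding are illustrative, not the actual implementation):

```python
from statistics import mean, pstdev

def r_parallel(activation_mask, alpha=0.5, beta=0.5):
    """Sketch of the load-balancing reward.

    activation_mask: list of batch rows, each a list of 0/1 specialist
    activations. Balanced, selective routing maximizes the reward."""
    num_specialists = len(activation_mask[0])
    counts = [sum(row[j] for row in activation_mask) for j in range(num_specialists)]
    cv = pstdev(counts) / max(mean(counts), 1e-8)       # coefficient of variation
    utilization = sum(c > 0 for c in counts) / num_specialists
    return -alpha * cv + beta * utilization

# Balanced routing scores higher than sending everything to one specialist:
balanced = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
skewed = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]
```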
r_finish – Completion Quality
Ensures that the swarm produces complete, non-degenerate outputs:
r_finish = coverage_score - degenerate_penalty
where:
- coverage_score measures the fraction of columns in a table that receive an annotation. A table with 10 columns where only 7 are annotated scores 0.7.
- degenerate_penalty fires when the router exhibits trivial strategies: always-on (activating all specialists for every input), always-off (never activating specialists), or constant routing (same activation pattern regardless of input).
Combined Reward
The three components are combined with annealing weights:
r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf
Note that r_perf has no lambda coefficient – it always contributes at full strength. The auxiliary rewards are scaled to be comparable in magnitude to r_perf and then weighted down.
Lambda Annealing Schedule
Following K2.5’s approach, the auxiliary reward weights change over training:
| Training Progress | lambda_1 (parallel) | lambda_2 (finish) | Rationale |
|---|---|---|---|
| 0-30% | 0.3 | 0.1 | Encourage specialist exploration early |
| 30-70% | 0.1 | 0.3 | Shift focus to complete annotations |
| 70-100% | 0.05 | 0.05 | Let accuracy dominate for fine-tuning |
The annealing ensures that early training explores the specialist activation space (high lambda_1), then stabilizes routing toward complete outputs (high lambda_2), and finally optimizes purely for annotation accuracy.
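The stepped schedule above can be written as a small helper (a sketch only; the actual training loop owns the schedule and could interpolate smoothly between phases rather than using hard steps):

```python
def anneal_lambdas(progress):
    """Piecewise lambda schedule from the table above.

    progress: training progress in [0, 1].
    Returns (lambda_parallel, lambda_finish)."""
    if progress < 0.3:
        return 0.3, 0.1    # explore specialist activation
    elif progress < 0.7:
        return 0.1, 0.3    # push toward complete annotations
    return 0.05, 0.05      # let r_perf dominate
```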
Validation Protocol
Before proceeding to Phase 3, each reward component must pass these checks:
- Non-zero gradient flow. The reward signal must produce non-trivial policy gradients through the router. Verified by checking that grad(router.weight) is non-zero after a reward update.
- Correct polarity. Higher quality outputs must produce higher rewards. Verified by comparing rewards on hand-crafted good vs. bad annotation examples.
- Independence. Each component must capture a distinct failure mode. Verified by constructing examples where one component fires but others do not:
  - High r_perf, low r_parallel: accurate but always uses the same specialist
  - High r_parallel, low r_finish: well-balanced routing but incomplete annotations
  - High r_finish, low r_perf: complete annotations but wrong types
- Scale compatibility. All three components should produce values in a comparable range (roughly [0, 1]) to avoid one signal dominating before lambda annealing can take effect.
Phase 3: PARL Training
Train the SwarmOrchestrator using Parallel Agent Reinforcement Learning, following the K2.5 framework (arXiv:2602.02276). The primary model learns to route inputs to frozen specialists and fuse their recurrent states, optimized via token-level clipping RL.
Setup
Primary Model
A fresh Aegir model initialized from the Phase 1 checkpoint. All parameters are trainable. The primary model learns to:
- Process the input table and produce annotations
- Decide which specialists to activate via the SpecialistRouter
- Integrate specialist states through RWKVStateFusion
Frozen Specialists
One or more Phase 1 checkpoints frozen with requires_grad_(False). Each specialist is wrapped in a FrozenSpecialist that:
- Runs forward passes with torch.no_grad()
- Extracts recurrent states from its RWKV layers
- Optionally applies AlignmentProjection if its architecture differs from the primary
Initially, Phase 3 uses a single specialist (the best Phase 1 checkpoint). Additional specialists with different training data or hyperparameters are added incrementally.
Token-Level Clipping RL
K2.5 uses a variant of PPO where the clipping objective is applied at the token level rather than the trajectory level. For each token position t:
L_t = min(
rho_t * A_t,
clip(rho_t, 1-eps, 1+eps) * A_t
)
where:
- rho_t = pi_new(a_t | s_t) / pi_old(a_t | s_t) is the per-token importance ratio
- A_t is the advantage estimate at position t
- eps is the clipping range (starts at 0.2, narrows to 0.1 over training)
Token-level clipping provides finer-grained credit assignment than trajectory-level clipping. For column annotation, this means the router receives distinct gradient signal for each column boundary token, each type-indicative value, and each structural separator.
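The per-token objective above can be sketched directly in PyTorch (a generic PPO-style clipped surrogate for illustration; K2.5's exact variant and its advantage estimator may differ):

```python
import torch

def token_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Per-token clipped surrogate. Inputs are (B, T) tensors of log-probs
    and advantages; returns a scalar loss to minimize."""
    rho = torch.exp(logp_new - logp_old)                   # per-token importance ratio
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the per-token minimum; we minimize its negation.
    return -torch.min(unclipped, clipped).mean()

logp_old = torch.zeros(2, 5)
logp_new = torch.zeros(2, 5, requires_grad=True)
advantages = torch.randn(2, 5)
loss = token_clip_loss(logp_new, logp_old, advantages)
loss.backward()  # every token position receives its own gradient signal
```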
Routing as Action Space
The “action” at each routing decision point is the specialist activation vector:
a = sigmoid(W_router @ h) # continuous in [0, 1]^num_specialists
The policy pi(a | s) is parameterized by the router weights. The RL objective encourages the router to activate specialists when they improve annotation quality and deactivate them when they don’t.
Critical-Steps Optimization
Rather than minimizing total FLOPs or wall-clock time, PARL optimizes the critical path – the longest sequential dependency chain in the computation.
critical_path = max(
primary_forward_time,
max(specialist_forward_times for activated specialists)
)
Specialist forward passes run in parallel (they are independent). The critical path is therefore the maximum of the primary and any single specialist, not the sum. This means:
- Activating additional specialists that run in parallel is free in critical-path terms
- The optimizer penalizes only sequential dependencies (e.g., if the primary must wait for specialist state before proceeding)
- This naturally encourages parallel specialist activation over sequential reasoning in the primary
Training Loop
for each batch:
1. Run primary model through first layer to get routing_hidden
2. Compute specialist activation scores
3. Run activated specialists (parallel, no_grad)
4. Fuse specialist states into primary's recurrent state
5. Complete primary forward pass
6. Compute r_perf from annotation accuracy
7. Compute r_parallel from activation statistics
8. Compute r_finish from annotation completeness
9. Combine rewards with annealed lambdas
10. Compute token-level PPO loss and update primary + router + fusion
Budget-Limited vs Standard Scaling
PARL training alternates between two modes:
Budget-limited phase: The router has a hard cap on the number of specialists it can activate per batch. This encourages selective, high-value routing decisions. The cap starts low (1 specialist) and gradually increases.
Standard scaling phase: No activation cap. The router is free to activate as many specialists as it wants, paying only the r_parallel penalty for inefficient routing. This phase tests whether the router has learned meaningful selectivity.
The alternation prevents the router from converging to a trivial “activate everything” strategy during standard scaling while still allowing it to learn from unrestricted experimentation.
Success Criteria
Phase 3 is complete when:
- The primary model with specialist fusion exceeds the standalone Phase 1 baseline by a meaningful margin (target: +2-5 F1 points on SOTAB-CTA hard split)
- The router activates specialists selectively (not all-on or all-off) and the activation pattern varies with input content
- The lambda annealing schedule produces smooth training curves without reward collapse
Phase 4: Agent Swarm RL
Scale from a single specialist to a full multi-specialist swarm with dynamic spawning, wide/deep search patterns, and adaptive specialist allocation based on table complexity.
Search Patterns
Wide Search – Parallel Column Analysis
Process multiple columns simultaneously by routing them to different specialists:
Table: [col_A, col_B, col_C, col_D, col_E]
Specialist 0 (geographic): col_A, col_C
Specialist 1 (temporal): col_B
Specialist 2 (numeric): col_D, col_E
Primary: all columns (final fusion)
Each specialist processes its assigned columns in parallel. The primary model receives fused states from all specialists and makes the final annotation decision. Wide search scales annotation throughput linearly with the number of specialists, bounded by the critical path of the slowest specialist.
Deep Search – Hierarchical Type Reasoning
For ambiguous columns, chain multiple specialists in sequence to progressively refine the type prediction:
Column: "Springfield" (city? state? person name?)
Step 1: Specialist 0 (general) --> geographic entity (0.6) | person name (0.3)
Step 2: Specialist 3 (geographic) --> city (0.7) | administrative region (0.2)
Step 3: Primary --> city (final, high confidence)
Deep search trades latency for accuracy on hard cases. The orchestrator learns when to invoke additional reasoning steps by monitoring the confidence of intermediate predictions.
Combined Wide-Deep
For complex tables, the orchestrator can combine both patterns: wide search across easy columns (one specialist pass each) and deep search on ambiguous columns (multiple specialist passes). The PARL reward structure naturally encourages this: r_parallel rewards wide parallelism, r_perf rewards deep accuracy, and critical-steps optimization keeps the overall latency bounded.
Dynamic Specialist Spawning
Rather than a fixed specialist pool, Phase 4 introduces dynamic spawning based on table complexity signals:
complexity = f(num_columns, label_entropy, column_diversity)
if complexity < threshold_low:
activate 0-1 specialists (primary handles it alone)
elif complexity < threshold_high:
activate 2-3 specialists (wide search)
else:
activate N specialists + enable deep search
The complexity estimator is a lightweight head on the primary model’s first-layer output. It learns to predict how much specialist assistance a given table requires.
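One plausible shape for such a head is a single linear layer on mean-pooled hidden states (ComplexityHead, specialists_for, and all thresholds here are hypothetical illustrations, not the planned implementation):

```python
import torch
import torch.nn as nn

class ComplexityHead(nn.Module):
    """Sketch of a lightweight complexity estimator: a linear head on
    mean-pooled first-layer hidden states, squashed to a score in (0, 1)."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden):                               # hidden: (B, L, D)
        pooled = hidden.mean(dim=1)                          # (B, D)
        return torch.sigmoid(self.proj(pooled)).squeeze(-1)  # (B,)

def specialists_for(complexity, lo=0.3, hi=0.7, n_max=8):
    """Map a complexity score to a specialist budget (thresholds illustrative)."""
    if complexity < lo:
        return 1        # primary handles it (almost) alone
    elif complexity < hi:
        return 3        # wide search
    return n_max        # wide search + deep search

head = ComplexityHead(d_model=32)
scores = head(torch.randn(4, 10, 32))   # (4,): one complexity score per table
```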
Specialist Pool Management
- Warm pool: Pre-loaded specialists kept on GPU memory, ready for immediate activation.
- Cold pool: Specialists on CPU/disk, loaded on demand for rare table types.
- Spawn budget: Maximum number of active specialists at any time, set by available GPU memory.
Expected Scaling Characteristics
State Fusion Cost
With N specialists, each contributing L layers of state with shape (B, H, K, V):
Fusion FLOPs per layer:
weighted_sum: O(N * H * K * V) -- linear in N
gated: O(N * H * K * V) -- linear in N
concat_project: O(N^2 * H * K^2 * V^2) -- quadratic in N (due to projection weight size)
For the weighted_sum mode (recommended for swarms), fusion cost grows linearly with specialist count and is negligible compared to the specialist forward passes themselves.
Throughput Scaling
| Specialists | Expected Throughput | Expected Accuracy | Notes |
|---|---|---|---|
| 0 (primary only) | 1.0x baseline | Phase 1 F1 | No overhead |
| 1 | ~0.9x (routing overhead) | +2-5 F1 | Phase 3 result |
| 3 | ~0.85x | +5-10 F1 | Wide search on typical tables |
| 8+ | ~0.7x | +8-15 F1 | Wide+deep on complex tables |
Throughput decreases reflect routing overhead and state fusion cost. The critical-path optimization means that parallel specialists do not compound latency, so throughput degradation is sublinear in the number of specialists.
Memory Scaling
Each frozen specialist consumes GPU memory for its parameters but no optimizer state (frozen). The primary model requires both parameters and optimizer state.
Memory per specialist: ~model_params * sizeof(dtype)
Memory for primary: ~3x model_params * sizeof(dtype) (params + grad + optimizer)
Memory for fusion: negligible (O(H * K * V) parameters)
With bf16 and a 120M-parameter base model, each specialist costs ~240MB. A 6x RTX 4090 setup (144GB total) can support approximately 8-10 specialists alongside the primary model and optimizer state.
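The per-specialist figure is a quick arithmetic check (this sketch counts parameter memory only; the 8-10 specialist budget additionally has to absorb activations, fused states, and CUDA overhead, which dominate during training):

```python
def param_memory_gb(num_params, bytes_per_param=2):
    """Parameter memory only (bf16 = 2 bytes per parameter). Activations and
    framework overhead are deliberately not counted here."""
    return num_params * bytes_per_param / 1e9

specialist_gb = param_memory_gb(120e6)   # ~0.24 GB per frozen 120M specialist
primary_gb = 3 * param_memory_gb(120e6)  # ~0.72 GB: params + grads + optimizer
```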
Success Criteria
Phase 4 is complete when:
- The swarm demonstrates measurable accuracy gains from adding specialists beyond the Phase 3 single-specialist result
- Wide search provides throughput-proportional accuracy gains on easy tables
- Deep search provides accuracy gains on the hardest SOTAB-CTA classes (those below 0.5 F1 in Phase 1)
- Dynamic spawning correctly allocates more specialists to complex tables and fewer to simple ones
Development Guide
Building and Running
Critical: Always Use --no-sync
uv run --no-sync python main.py
The --no-sync flag prevents uv from re-resolving and reinstalling dependencies before running. This is required because flash-attn, flash-linear-attention (fla), mamba-ssm, and causal-conv1d are patched CUDA extensions that were built manually with corrected CXX11 ABI flags. Running uv run without --no-sync will clobber these patched builds with incompatible PyPI wheels.
Smoke Tests
# Model instantiation and forward pass shapes
uv run --no-sync python main.py
# Training loop validation (tiny model, synthetic data)
uv run --no-sync python train.py --smoke-test --model-size tiny --epochs 3
Multi-GPU Training
# 6x RTX 4090 training
uv run --no-sync torchrun --nproc_per_node=6 train.py \
--model-size small \
--epochs 100 \
--batch-size 64 \
--lr 1e-4
Training uses DDP (DistributedDataParallel), AMP with bf16, cosine LR schedule with linear warmup, and load balancing loss for dynamic chunking regularization.
CUDA Extension Build Notes
The devenv/Nix environment provides GCC 15, which sets _GLIBCXX_USE_CXX11_ABI=1. However, PyTorch’s cu124 wheels are built with _GLIBCXX_USE_CXX11_ABI=0. This ABI mismatch causes segfaults when CUDA extensions link against the wrong ABI.
Patching Procedure
Both mamba-ssm and flash-attn have a CachedWheelsCommand in their setup.py that downloads prebuilt wheels from GitHub releases, bypassing local compilation. To force a local build with the correct ABI:
1. Set environment variables to force a local build:
   export MAMBA_FORCE_BUILD=TRUE
   export FLASH_ATTENTION_FORCE_BUILD=TRUE
2. Use env -i with system GCC-11 to get the correct ABI:
   env -i PATH=/usr/bin:$PATH HOME=$HOME \
     pip install --no-build-isolation /tmp/mamba_src/mamba_ssm-2.3.1/
3. Patch setup.py in each extension to add an explicit _abi_flag matching torch's ABI.
Patched source trees are kept in /tmp/mamba_src/ and /tmp/flash_src/. See docs/notes/2026-03-28/010808_deps_smoke_train.md for the full step-by-step procedure.
Verifying the Build
After patching, verify that the extensions load correctly:
uv run --no-sync python -c "import mamba_ssm; print('mamba-ssm OK')"
uv run --no-sync python -c "import flash_attn; print('flash-attn OK')"
uv run --no-sync python -c "from fla.ops.rwkv7 import chunk_rwkv7; print('fla OK')"
Adding New Block Types
The architecture supports mixed block types (Mamba2, MHA, RWKV-7, RWKV-8 ROSA) within a single model. To add a new block type:
1. Implement the Mixer Class
Create a new module that implements three methods:
class MyNewMixer(nn.Module):
def forward(self, hidden_states, inference_params=None, **kwargs):
"""Full-sequence forward pass. Input: (B, L, D). Output: (B, L, D)."""
...
def step(self, hidden_states, inference_params):
"""Single-token autoregressive step. Input: (B, 1, D). Output: (B, 1, D)."""
...
def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
"""Allocate KV cache or recurrent state for inference."""
...
2. Register in create_block()
Add the new type to src/aegir/modules/block.py:
def create_block(arch, d_model, ...):
if arch in ("x", "X"): # new block type code
from my_module import MyNewMixer
mixer_cls = partial(MyNewMixer, **factory_kwargs, layer_idx=layer_idx)
...
Convention: lowercase letter = mixer only (no MLP), uppercase = mixer + SwiGLU MLP.
3. Add to Isotropic Forward Loop
In src/aegir/modules/isotropic.py, add the new block type to:
- The regex pattern that parses layout strings:
  layout_parse = re.findall(r"([mMtTrRwWxX])(\d+)", arch_layout)
- The forward loop's block-type dispatch:
  elif arch in ("x", "X"):
      layer_mixer_kwargs = {}  # or whatever kwargs your mixer needs
      if hidden_states.dim() == 2:
          hidden_states = hidden_states.unsqueeze(0)
          residual = None if residual is None else residual.unsqueeze(0)
4. Test
# Verify the new block type instantiates and runs
uv run --no-sync python main.py
Project Structure
aegir/
main.py -- Smoke tests
train.py -- Training script (DDP, AMP, cosine LR)
src/aegir/
models/
config.py -- AegirConfig, SSMConfig, AttnConfig, RWKVConfig
aegir.py -- Recursive hierarchical backbone
heads.py -- AegirForCausalLM, AegirForColumnAnnotation
modules/
block.py -- Block factory (create_block)
isotropic.py -- Flat block stack with mixed types
dc.py -- Dynamic chunking (RoutingModule, ChunkLayer, DeChunkLayer)
rwkv7_tmix.py -- RWKV-7 full TimeMix (fla kernels)
rwkv.py -- RWKV-8 ROSA time mixing + relu^2 channel mixing
rosa.py -- ROSA suffix automaton (CPU-based)
mlp.py -- SwiGLU MLP
swarm/
state_fusion.py -- RWKVStateFusion (3 modes)
alignment.py -- AlignmentProjection (cross-agent state mapping)
specialist.py -- FrozenSpecialist wrapper
orchestrator.py -- SwarmOrchestrator (K2.5 PARL)
data/
serialization.py -- Table-to-byte-sequence serialization
context_select.py -- MMR context column selection
table_dataset.py -- PyTorch dataset for table benchmarks
utils/
train.py -- Load balancing loss, F1 metrics, param grouping
docs/ -- mdbook documentation (this book)
ref/ -- Reference papers
Documentation
Build and serve the documentation locally:
mdbook build docs/
mdbook serve docs/ # serves at http://localhost:3000
The documentation uses mdbook with katex (math), mermaid (diagrams), and d2 (architecture diagrams) plugins, all provisioned by devenv.