Introduction
Aegir is a hierarchical sequence model for semantic column annotation and cross-table data element discovery on relational data. Given one or more tables, Aegir predicts semantic types for individual columns (Column Type Annotation), identifies properties and relationships between columns (Column Property Annotation), and discovers coherent data elements – groups of semantically related columns that span multiple tables in a data warehouse.
Problem Setting
Enterprise data warehouses contain thousands of tables with columns whose meaning is often opaque: generic names (col0, field_42), inconsistent conventions across teams, and no machine-readable metadata. Understanding what each column represents – and which columns across different tables refer to the same real-world concept – is foundational to data governance, privacy compliance, and integration.
Current approaches to this problem fall into two categories:
Pattern and heuristic-based methods identify column types through regex detectors (email, SSN, credit card patterns), name matching, embedding similarity, and gradient-boosted classifiers trained on hand-engineered features. These methods work well for structurally distinct types but struggle with confusable pairs – columns whose value distributions are nearly identical but whose semantic types differ (e.g., advertising IDs vs GUIDs, bank account numbers vs payment card numbers). They also require manual enumeration of data element patterns and cannot generalize to novel relationship types.
Learned sequence models (DODUO, RECA, REVEAL) treat the table as a token sequence and classify columns via fine-tuned transformers. REVEAL’s key insight is that context column selection matters: choosing the right neighboring columns (via MMR diversity sampling) dramatically improves annotation accuracy. However, these models operate on single tables in isolation and use fixed subword tokenizers that fragment tabular data unpredictably.
Aegir bridges these approaches. It is designed to be trained in situ alongside evidence-based classification pipelines – consuming the same serialized table representations, but learning cross-column and cross-table relationships end-to-end rather than relying on manually enumerated patterns. Specifically:
- Column Type Annotation (CTA): Classify individual columns into a semantic taxonomy (e.g., SIGDG ontology categories, Schema.org types, DBpedia classes).
- Column Property Annotation (CPA): Identify properties and relationships between column pairs (e.g., “city is-located-in country”).
- Data Element Discovery: Identify groups of related columns across tables that constitute coherent real-world entities (e.g., a PaymentCard data element spanning `card_number`, `expiry`, and `cardholder` columns across billing, transaction, and customer tables).
The third task – cross-table data element discovery – is where the greatest value lies for enterprise governance. Current pipelines discover data elements through keyword-based schema matching and post-classification co-occurrence analysis. A model that learns these relationships from data can generalize beyond enumerated patterns, handle non-English and abbreviated column names, and resolve confusable pairs by leveraging cross-table structural context that no single-table classifier can access.
Target benchmarks:
- SOTAB – Semantic column annotation on Web tables (Schema.org types)
- GitTables – Large-scale column type detection across 1M+ CSV tables from GitHub (100% generic column names – the hardest regime)
- WikiTables – Column annotation on Wikipedia HTML tables
Key Innovations
Byte-level dynamic chunking as learned tokenization. Rather than using a fixed tokenizer (BPE, SentencePiece), Aegir operates on raw bytes and learns to segment sequences into variable-length chunks via content-dependent boundary prediction. A routing module measures cosine similarity between consecutive hidden states; high dissimilarity triggers a chunk boundary. This makes the “tokenization” fully differentiable and adapted to the data distribution – critical for tabular data where delimiters, numeric formats, and encodings vary wildly across sources.
All-RWKV recurrent architecture. The primary sequence processing blocks use RWKV-7 time mixing with flash-linear-attention Triton kernels. RWKV-7 maintains a constant-size recurrent state matrix of shape (B, H, head_size, head_size) regardless of sequence length. This gives O(1) memory per token during inference and, critically, makes the recurrent state a fixed-size object that can be serialized, transmitted, and algebraically combined across agents.
ROSA suffix automaton for exact pattern retrieval. The ROSA (RWKV Online Suffix Automaton) module provides lossless infinite-range retrieval by constructing an online suffix automaton over binarized hidden representations. While RWKV-7 learns smooth sequence-level patterns, ROSA can retrieve exact substring matches from arbitrarily far in the past – enabling precise pattern detection (email formats, card number structures) that complements the learned recurrent state.
Agent swarm with state fusion for cross-table reasoning. Multiple specialist agents can process different tables or column families in parallel. Because RWKV recurrent states are fixed-size matrices, they can be fused via attention-weighted combination, learned gating, or projection – far more efficiently than merging transformer KV caches, which grow linearly with sequence length. This architecture enables cross-table data element discovery: each agent processes a table, and the fused state captures inter-table relationships that no single-table model can learn.
In-situ training within evidence pipelines. Aegir is designed to integrate with Dempster-Shafer theory (DST) evidence fusion pipelines as a learned evidence source. Its predictions – with calibrated confidence – feed into the same conjunctive combination framework alongside cosine similarity, gradient boosting, pattern detectors, and name matching. The model learns from the pipeline’s own bootstrap labels and SAGE-validated features, creating a self-improving loop where Aegir’s learned representations replace hand-engineered heuristics as they prove their value.
Architecture at a Glance
Aegir uses a recursive hierarchy defined by nested layout strings:
arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
This reads as: 2 RWKV-7 encoder blocks, then a sub-hierarchy (2 encoder blocks, 4 main blocks, 2 decoder blocks), then 2 RWKV-7 decoder blocks. At each non-innermost stage, dynamic chunking downsamples the sequence before passing it to the next level, and an EMA-based dechunking module reconstructs the full resolution on the way back up.
The block types – RWKV-7, ROSA, MHA, Mamba-2 – can be freely mixed within any stage using compact layout strings like "w4T1r2".
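As a toy illustration of how the nesting implies a stage count, the nested layout can be walked recursively (`stage_count` is a hypothetical helper, not part of the codebase):

```python
def stage_count(layout):
    """Count hierarchy stages in a nested arch_layout (illustrative sketch).
    Each list level contributes one stage; at most one nested list per level."""
    inner = [item for item in layout if isinstance(item, list)]
    return 1 + (stage_count(inner[0]) if inner else 0)

arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
depth = stage_count(arch_layout)   # 3 stages (depth 0, 1, 2)
```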
Architecture Overview
Aegir is a recursive hierarchical sequence model. At the top level, it processes raw byte sequences through nested stages of encoding, dynamic chunking, inner processing, dechunking, and decoding. Each stage can use a different hidden dimension and a different mix of block types.
Recursive Hierarchy
The architecture is defined by a nested list called arch_layout. For example:
arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
d_model = [128, 192, 192]
This defines three stages (depth 0, 1, 2):
| Stage | Role | Layout | Dimension |
|---|---|---|---|
| 0 | Outermost encoder/decoder | "w2" / "w2" | 128 |
| 1 | Middle encoder/decoder | "w2" / "w2" | 192 |
| 2 | Innermost (main) | "w4" | 192 |
At each non-innermost stage, the data flow is: encoder → routing → chunk → main network → dechunk → residual → decoder.
At the innermost stage, only the main network runs (no chunking). The recursion bottoms out at a flat Isotropic block stack.
Data Flow in Detail
- Encoder: A flat stack of blocks (e.g., 2 RWKV-7 blocks) processes the full-resolution sequence.
- Routing: `RoutingModule` predicts boundary probabilities via cosine similarity. Tokens at predicted boundaries are selected as chunk representatives.
- Chunk: `ChunkLayer` downsamples by keeping only boundary tokens, producing a shorter sequence.
- Main network: The shorter sequence is processed by the next hierarchy level – which may itself contain encoding, chunking, and another level of recursion.
- Dechunk: `DeChunkLayer` reconstructs the full-length sequence via an EMA scan, blending chunk outputs back into non-boundary positions.
- Residual: A skip connection around the entire chunk/process/dechunk block, gated via straight-through estimation of the routing probabilities.
- Decoder: Another flat stack of blocks processes the reconstructed sequence.
Dimension Padding
When inner stages have a larger hidden dimension than outer stages, Aegir pads the input with a learnable vector (pad_dimension) on entry and slices it off on exit. This avoids linear projection overhead at every stage transition.
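A minimal numpy sketch of this pad-and-slice behavior (helper names are hypothetical; in the real model the pad vector is a learnable parameter):

```python
import numpy as np

d_outer, d_inner = 128, 192
pad_vec = np.ones(d_inner - d_outer)     # learnable in the real model

def pad_dimension(x, pad_vec):
    """Grow the hidden dim by appending a pad vector to every position."""
    B, L, _ = x.shape
    extra = np.broadcast_to(pad_vec, (B, L, pad_vec.shape[-1]))
    return np.concatenate([x, extra], axis=-1)

def slice_dimension(x, d_outer):
    """Drop the padded channels on exit from the inner stage."""
    return x[..., :d_outer]

x = np.zeros((2, 5, d_outer))
y = pad_dimension(x, pad_vec)            # (2, 5, 192) with no projection matmul
```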
Why All-RWKV
The primary design choice is to use RWKV-7 time mixing at all stages rather than transformers or pure SSMs. The motivation is threefold:
1. Uniform O(1) Recurrent State
Every RWKV-7 block maintains a recurrent state of shape (B, H, head_size, head_size). This is constant regardless of sequence length. During autoregressive inference, each token step updates this matrix and reads from it in O(head_size^2) time per head.
2. Agent State Fusion
For the agent swarm architecture, specialist agents process the same input and produce recurrent states. These states must be combined. RWKV states are fixed-size matrices that live in a well-defined linear space, making fusion via weighted sum, gating, or projection algebraically natural. In contrast:
- Transformer KV caches are O(L * d) and grow with sequence length, making fusion costly and poorly defined.
- Mamba-2 states are smaller but have a different algebraic structure (diagonal recurrence).
3. Chunk-Parallel Training
The chunk_rwkv7 kernel from flash-linear-attention enables training with parallel chunk processing while maintaining exact recurrent semantics. This gives near-transformer training throughput with recurrent inference efficiency.
Comparison Table
| Property | RWKV-7 (w/W) | Mamba-2 (m/M) | Transformer (t/T) |
|---|---|---|---|
| Training kernel | chunk_rwkv7 (Triton) | Mamba-2 SSD (CUDA) | Flash Attention 2 |
| Recurrent state | (H, K, K) matrix | (H, d_state) vector | None (KV cache) |
| Inference memory | O(d^2) constant | O(d * d_state) constant | O(L * d) linear |
| State fusibility | Natural (matrix sum) | Possible (vector sum) | Impractical |
| Exact retrieval | Via ROSA blocks | No | Via full attention |
| FFN pairing | CMix (relu^2) or SwiGLU | SwiGLU or none | SwiGLU or none |
In practice, RWKV-7 blocks (w/W) are the default choice at all stages. Mamba-2 (m/M) and MHA (t/T) blocks are available for ablation studies and hybrid configurations. ROSA (r/R) blocks provide exact substring matching as a complement to learned recurrent processing.
Hierarchical Dynamic Chunking
Dynamic chunking is Aegir’s mechanism for content-dependent hierarchical segmentation. Rather than using a fixed tokenizer, the model learns to predict chunk boundaries based on the hidden representations themselves. This module is adapted from H-Net (goombalab/hnet).
Overview
The chunking pipeline has three components that work together at each non-innermost stage of the hierarchy:
- RoutingModule – predicts which tokens are chunk boundaries
- ChunkLayer – downsamples the sequence by selecting boundary tokens
- DeChunkLayer – reconstructs the full-length sequence from chunk outputs via EMA
RoutingModule: Boundary Prediction
The routing module decides where to place chunk boundaries by measuring how different consecutive hidden states are.
Algorithm
For a sequence of hidden states h[0], h[1], ..., h[L-1]:
1. Project consecutive pairs through learnable Q and K matrices (initialized to identity).
2. Compute cosine similarity between adjacent projected states: `cos_sim[t] = cosine(Q @ h[t], K @ h[t+1])`
3. Convert to boundary probability: `p[t] = clamp((1 - cos_sim[t]) / 2, 0, 1)`
4. The first token always gets `p = 1.0` (it is always a boundary).
5. Threshold at 0.5: if `p[t] > 0.5`, token `t` is a boundary.
High dissimilarity between consecutive states means the content is changing – a natural place to start a new chunk. The Q/K projections are initialized to identity so the model starts with raw cosine similarity and can learn to refine the boundary criterion.
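The boundary criterion can be sketched in numpy (a reference sketch with identity Q/K projections, not the production module):

```python
import numpy as np

def boundary_probs(h, Wq=None, Wk=None):
    """Boundary probabilities from cosine dissimilarity of adjacent states.
    h: (L, D). Wq/Wk default to identity, matching the stated initialization."""
    L, D = h.shape
    Wq = np.eye(D) if Wq is None else Wq
    Wk = np.eye(D) if Wk is None else Wk
    q, k = h[:-1] @ Wq.T, h[1:] @ Wk.T
    cos = (q * k).sum(-1) / (
        np.linalg.norm(q, axis=-1) * np.linalg.norm(k, axis=-1) + 1e-8)
    p = np.clip((1.0 - cos) / 2.0, 0.0, 1.0)
    return np.concatenate([[1.0], p])    # first token is always a boundary

h = np.array([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
p = boundary_probs(h)                    # ~[1.0, 0.0, 1.0]
boundaries = p > 0.5                     # tokens 0 and 2 start chunks
```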
Handling Variable-Length Sequences
The routing module supports two modes:
- Padded mode (`mask` provided): Standard `(B, L, D)` tensors with a boolean mask. Boundary predictions outside the mask are suppressed.
- Packed mode (`cu_seqlens` provided): Sequences concatenated into a single `(1, total_len, D)` tensor with cumulative sequence lengths. The first token of each sequence in the pack is forced to be a boundary.
ChunkLayer: Downsampling
Once boundaries are predicted, ChunkLayer selects only the boundary tokens to form a shorter sequence.
In padded mode:
- Count how many boundary tokens each batch element has.
- Sort token indices so boundary tokens come first.
- Gather the first `max_boundaries` tokens per batch element.
- Produce a new mask indicating which positions in the shorter sequence are valid.
In packed mode:
- Boolean-index the boundary tokens directly from the flat sequence.
- Recompute `cu_seqlens` for the shorter packed sequence.
The output is a shorter sequence containing only the tokens that were at chunk boundaries.
DeChunkLayer: Reconstruction via EMA
After the inner hierarchy processes the chunked (shorter) sequence, DeChunkLayer reconstructs the full-length sequence. The key insight is that non-boundary tokens should smoothly interpolate from their nearest preceding boundary token’s output.
EMA Scan
The reconstruction uses an exponential moving average (EMA) scan:
y[0] = x[0]
y[t] = decay[t] * y[t-1] + (1 - decay[t]) * x[t]
where decay[t] = 1 - p[t] and p[t] is the boundary probability for token t.
At boundary tokens (p ~ 1), the output snaps to the new chunk value. At non-boundary tokens (p ~ 0), the output carries forward the previous value. The boundary probability controls the blend continuously, allowing gradient flow through the routing decisions.
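The scan can be sketched directly (reference semantics only, without the reorder/gather steps):

```python
import numpy as np

def ema_dechunk(x, p):
    """EMA reconstruction scan: decay[t] = 1 - p[t], so
    y[t] = (1 - p[t]) * y[t-1] + p[t] * x[t], with y[0] = x[0]."""
    y = np.empty_like(x)
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = (1.0 - p[t]) * y[t - 1] + p[t] * x[t]
    return y

x = np.array([[1.0], [5.0], [9.0]])
p = np.array([1.0, 0.0, 1.0])    # token 1 is not a boundary
y = ema_dechunk(x, p)            # token 1 carries forward the chunk value 1.0
```

With hard probabilities the output snaps at boundaries and holds between them; soft probabilities blend the two, which is what lets gradients reach the routing decisions.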
Reconstruction Steps
- Reorder the chunk outputs according to the original boundary positions.
- Map each position in the full sequence to its cumulative boundary count (i.e., which chunk it belongs to).
- Run the EMA scan over the reordered chunk outputs with boundary-probability-derived decay factors.
- Gather the EMA outputs back to the original sequence positions.
Residual Connection
The entire chunk/process/dechunk pipeline is wrapped in a residual connection:
output = dechunk_output * STE(selected_probs) + residual_proj(encoder_output)
The residual_proj is a linear layer initialized to zero, so at initialization the chunking pathway contributes nothing and the model starts as a simple encoder-decoder. The Straight-Through Estimator (STE) passes gradients through the discrete routing decisions.
Recursive Nesting
The chunking pattern nests recursively. Consider a 3-stage hierarchy:
arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
- Stage 0: Encode the full byte sequence, predict boundaries, chunk down, pass to Stage 1, dechunk back up, decode.
- Stage 1: Encode the chunked sequence from Stage 0, predict boundaries again on this shorter sequence, chunk down further, pass to Stage 2, dechunk, decode.
- Stage 2: Process the doubly-chunked sequence with a flat stack of blocks (no further chunking).
Each level of chunking reduces the sequence length by a data-dependent factor. For byte-level input, the first level might learn character-like boundaries; the second level might learn word-like or phrase-like boundaries. The model discovers its own hierarchy of tokenization.
Inference: Token-by-Token Stepping
During autoregressive inference, each component has a step method for single-token processing:
- RoutingModule.step: Compares the new token against the previously seen token’s hidden state. If the boundary probability exceeds 0.5, the token starts a new chunk.
- ChunkLayer.step: If the token is a boundary, pass it through to the inner hierarchy. Otherwise, skip the inner hierarchy entirely.
- DeChunkLayer.step: Blend the new chunk output (if any) with the previous EMA value using the boundary probability as the mixing weight.
This means that during inference, the inner hierarchy only runs when a chunk boundary is detected, saving compute on non-boundary tokens.
RWKV-7 Time Mixing
RWKV-7 time mixing is the primary sequence processing mechanism in Aegir. It implements a linear recurrence with a matrix-valued state, combining the training efficiency of chunk-parallel computation with the inference efficiency of constant-memory recurrence. The implementation uses flash-linear-attention’s optimized Triton kernels.
Reference: RWKV-v8 “Heron” (BlinkDL/RWKV-LM), fla RWKV7Attention.
Core Recurrence
The recurrent state S[t] is a matrix of shape (H, head_size, head_size) per batch element, where H is the number of attention heads. The state update at each time step is:
S[t] = diag(w[t]) * S[t-1] + S[t-1] @ ab[t] + v[t] @ k[t]^T
where:
- `diag(w[t])` is the per-element exponential decay applied column-wise
- `ab[t] = (-kk[t])^T @ (kk[t] * a[t])^T` is the attention gate correction
- `v[t] @ k[t]^T` is the new key-value outer product
The output is read from the state via:
o[t] = S[t] @ r[t]
where r[t] is the receptance (query) vector.
Time-Shift Mixing
Before computing projections, RWKV-7 mixes each token with its predecessor via learned interpolation coefficients. Given input x[t]:
delta[t] = x[t-1] - x[t] (delta[0] = -x[0])
xr = x + delta * mu_r
xw = x + delta * mu_w
xk = x + delta * mu_k
xv = x + delta * mu_v
xa = x + delta * mu_a
xg = x + delta * mu_g
Each mu_* is a learnable (1, 1, D) parameter initialized with a position-and-layer-dependent schedule. This provides a simple form of local context mixing before the main recurrence.
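A minimal numpy sketch of the time-shift mix for one of the six projections (reference semantics; `time_shift_mix` is a hypothetical name):

```python
import numpy as np

def time_shift_mix(x, mu):
    """Blend each token with its predecessor. x: (L, D); mu: (D,) learned
    interpolation coefficient per channel (one of mu_r, mu_w, ...)."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # token shifted by 1
    delta = prev - x                                       # delta[0] = -x[0]
    return x + delta * mu

x = np.array([[2.0, 2.0], [4.0, 4.0]])
out_id = time_shift_mix(x, np.zeros(2))   # mu = 0: identity
out_sh = time_shift_mix(x, np.ones(2))    # mu = 1: pure predecessor
```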
Decay LoRA
The decay vector w[t] controls how quickly the recurrent state forgets. It is computed via a low-rank adaptation:
w[t] = -softplus(-(w0 + tanh(W1 @ xw[t]) @ W2)) - 0.5
where:
- `w0` is a `(D,)` bias initialized with a position-dependent schedule
- `W1` is `(D, decay_low_rank_dim)` and `W2` is `(decay_low_rank_dim, D)`
- The result is in log-space (negative values); the `-0.5` ensures a minimum decay
For the chunked training kernel (chunk_rwkv7), w is passed in log-space. For the single-token step, it is converted to the multiplicative factor:
w_step = exp(-0.606531 * sigmoid(w0 + tanh(W1 @ xw) @ W2))
Attention Gate LoRA
The attention gate a[t] modulates the key’s influence on the state update. It controls the ab correction term:
a[t] = sigmoid(a0 + A2(A1(xa[t])))
where a0 is a (D,) bias and A1, A2 form a low-rank bottleneck. The key is then modified as:
k'[t] = k[t] * (1 + (a[t] - 1) * k_a)
where k_a is a learnable per-dimension scale (initialized to 1.0).
Value-First Sharing
RWKV-7 shares value information across layers via a “value-first” mechanism:
- Layer 0: Stores its value projection as `v_first`.
- Layers 1+: Lerp their value toward `v_first`:

v[t] = v[t] + (v_first[t] - v[t]) * sigmoid(v0 + V2(V1(xv[t])))
This provides a residual-like connection specifically for value information, allowing deeper layers to reference the original value representation from layer 0.
L2 Key Normalization
Keys are L2-normalized per head before entering the suffix automaton correction:
kk[t] = L2_normalize(k[t] * k_k) per head
where k_k is a learnable per-dimension scale (initialized to 0.85). The normalized keys kk are used in the ab correction term but not in the main key-value outer product.
Bonus Term
A direct key-query interaction term is added to the output:
bonus[t] = sum(r[t] * k[t] * r_k, dim=-1, keepdim=True) * v[t]
where r_k is a (H, head_size) parameter initialized with small random values. This provides a shortcut path that bypasses the recurrent state entirely.
GroupNorm Output
The recurrent output is passed through GroupNorm (one group per attention head) before the bonus term is added:
o = GroupNorm(S[t] @ r[t]) + bonus[t]
Output Gating
The final output is gated via another LoRA:
g[t] = G2(sigmoid(G1(xg[t])))
output = o * g
output = W_o @ output
The output projection W_o is initialized to zero so that at initialization, RWKV-7 blocks contribute nothing to the residual stream.
Training: Chunk-Parallel Computation
During training, the chunk_rwkv7 kernel from flash-linear-attention processes the sequence in parallel chunks while maintaining exact recurrent semantics. The function signature:
o, final_state = chunk_rwkv7(
r, w, k, v,
-kk, kk * a, # ab decomposed as two rank-1 terms
initial_state=state, # (B, H, K, K) or None
output_final_state=True,
)
Inputs are shaped (B, T, H, head_size) and w is in log-space.
Inference: Token-by-Token Recurrence
During autoregressive inference, the step method implements the exact recurrence manually:
vk = v @ k^T # (B, H, N, N)
ab = (-kk)^T @ (kk * a)^T # (B, H, N, N)
S = S * diag(w) + S @ ab + vk # state update
o = S @ r # read output
The recurrent state S is stored in inference_params.key_value_memory_dict[layer_idx].att_kv as a float32 tensor of shape (B, H, head_size, head_size).
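The step recurrence above can be written out for a single head (a numpy reference sketch, not the fla kernel; `w` here is the already-exponentiated multiplicative decay):

```python
import numpy as np

def rwkv7_step(S, r, w, k, v, kk, a):
    """Single-token RWKV-7 state update for one head.
    S: (N, N) state; r, w, k, v, kk, a: (N,) per-head vectors."""
    ab = np.outer(-kk, kk * a)                  # rank-1 in-context correction
    S = S * w[None, :] + S @ ab + np.outer(v, k)  # column-wise decay + update
    return S, S @ r                             # updated state, read-out

N = 4
S = np.zeros((N, N))
r = w = k = v = np.full(N, 0.5)
kk = a = np.zeros(N)                            # no correction for the demo
S, o = rwkv7_step(S, r, w, k, v, kk, a)         # S becomes outer(v, k)
```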
LoRA Dimension Auto-Calculation
If not explicitly specified in RWKVConfig, LoRA dimensions are computed from d_model following the fla convention:
factor = head_size / 64
sqrt_d = sqrt(d_model)
decay_low_rank_dim = max(32, round(2.5 * sqrt_d * factor / 32) * 32)
gate_low_rank_dim = max(32, round(5.0 * sqrt_d / 32) * 32)
a_low_rank_dim = max(32, round(2.5 * sqrt_d * factor / 32) * 32)
v_low_rank_dim = max(32, round(1.7 * sqrt_d * factor / 32) * 32)
All dimensions are rounded to a multiple of 32 (with a floor of 32) for hardware efficiency.
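The auto-calculation can be sketched as a small helper (hypothetical name `lora_dims`; `r32` rounds to the nearest multiple of 32 with a floor of 32, per the formulas above):

```python
import math

def lora_dims(d_model, head_size=64):
    """LoRA ranks from the stated fla-style formulas."""
    factor = head_size / 64
    sqrt_d = math.sqrt(d_model)
    r32 = lambda v: max(32, round(v / 32) * 32)
    return {
        "decay_low_rank_dim": r32(2.5 * sqrt_d * factor),
        "gate_low_rank_dim":  r32(5.0 * sqrt_d),
        "a_low_rank_dim":     r32(2.5 * sqrt_d * factor),
        "v_low_rank_dim":     r32(1.7 * sqrt_d * factor),
    }

dims = lora_dims(192)   # e.g. gate rank 64, the others 32
```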
Weight Initialization
Initialization follows RWKV-7 conventions with layer-dependent schedules:
- Time-shift coefficients (`mu_*`): Initialized as `1 - d^(c * ratio)` where `d` is a per-dimension ramp `[0, 1)`, `c` is a coefficient specific to each mix type, and `ratio` varies from 1 (first layer) to 0 (last layer).
- Decay bias (`w0`): Initialized as `-7 + 5 * (d / D)^(0.85 + ratio^0.5)`, giving a range from fast decay (early dimensions) to slow decay (late dimensions).
- Key normalization (`k_k`): 0.85 uniformly.
- Key attention scale (`k_a`): 1.0 uniformly.
- Bonus (`r_k`): Small random normal (std=0.1).
- Output projection (`W_o`): Zero initialized.
ROSA Suffix Automaton
ROSA (RWKV Online Suffix Automaton) provides lossless infinite-range exact sequence matching as a complement to RWKV-7’s learned recurrent processing. While RWKV-7 maintains a compressed state that approximates the input history, ROSA can retrieve exact substring matches from arbitrarily far in the past.
Reference: “ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching” (arXiv:2602.02499), ported from RWKV-v8 (BlinkDL/RWKV-LM).
Algorithm Overview
ROSA constructs an online suffix automaton over discretized hidden representations. For each position in the query sequence, it finds the longest suffix of the query that appears as a substring in the key sequence seen so far, then returns the corresponding value from the position immediately after the match.
The core operation is rosa_qkv_ref(qqq, kkk, vvv):
- Maintain an online suffix automaton built incrementally from the key sequence.
- For each new position `i`:
  - Query phase: Walk the automaton to find the longest suffix of `qqq[:i+1]` that matches a substring in `kkk[:i]`.
  - Key phase: Extend the automaton with `kkk[i]`.
- If a match of sufficient length is found, return `vvv[match_end + 1]`. Otherwise return a sentinel value.
The suffix automaton provides O(n) construction and O(n) total query time, making the entire operation linear in sequence length.
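The matching semantics can be illustrated with a brute-force reference (O(n^2) via naive scanning; the real module's automaton is O(n) and may track a different occurrence than `str.find`'s leftmost one):

```python
def longest_suffix_match(query, key):
    """Return (match_len, match_end): the length of the longest suffix of
    `query` that occurs as a substring of `key`, and the index in `key`
    just past that occurrence ((0, -1) if nothing matches)."""
    for length in range(min(len(query), len(key)), 0, -1):
        pos = key.find(query[-length:])
        if pos != -1:
            return length, pos + length
    return 0, -1

# The longest suffix of "abcab" present in "xxabyy" is "ab", ending at
# index 4; a retrieval would then read the value just after the match.
match = longest_suffix_match("abcab", "xxabyy")   # (2, 4)
```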
1-Bit Binarization
To convert continuous hidden states into discrete tokens suitable for suffix automaton matching, ROSA uses 1-bit binarization:
x_binary = (x > 0) ? 1 : 0
This is applied per channel across the hidden dimension. Given a hidden state tensor of shape (B, T, C):
- Binarize: `q_bin[b, t, c] = uint8(q[b, t, c] > 0)` (same for k, v).
- Transpose: Reshape from `(B, T, C)` to `(B*C, T)` – each channel becomes an independent sequence.
- Match: Run `rosa_qkv_batch_ref` over all `B*C` channel sequences in parallel.
- Reconstruct: Reshape indices back to `(B, T, C)`.
- Scale: Output `= (2 * idx_float - 1) * emb`, where `emb` is a learnable `(1, 1, C)` scale parameter.
The matched bit value 1 maps to +emb and 0 maps to -emb, giving the output the same sign structure as the matched hidden representation scaled by a learnable magnitude.
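A toy numpy sketch of the binarize and scale steps on a small tensor (the suffix-matching step between them is omitted):

```python
import numpy as np

x = np.array([[0.3, -1.2],
              [2.0,  0.0]])              # hidden states, shape (T, C)
bits = (x > 0).astype(np.uint8)          # 1-bit code per channel (0.0 -> 0)
emb = np.array([0.5, 2.0])               # learnable per-channel scale
out = (2.0 * bits.astype(np.float32) - 1.0) * emb   # 1 -> +emb, 0 -> -emb
```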
The _RosaQKV1BitOp Autograd Function
ROSA’s suffix automaton is non-differentiable (it involves discrete automaton state transitions). The autograd function handles this:
- Forward: Binarize inputs, run suffix matching on CPU, scale by `emb`.
- Backward: Gradients for `q`, `k`, `v` are `None` (zero). Gradients for `emb` are passed through directly.
This means ROSA layers learn only through:
- The learnable `emb` scale parameter.
- The Q/K/V linear projections preceding ROSA (which receive gradients from other paths in the network through residual connections).
- The surrounding block's residual connection.
The projections learn to produce hidden representations whose binarization yields useful matching patterns, even though the binarization itself has no gradient.
CPU Execution
The suffix automaton runs on CPU. Tensors are moved to CPU before matching and results are moved back to the accelerator. This is a deliberate design choice:
- Suffix automata use pointer-chasing data structures (dictionaries, linked suffix links) that are not amenable to GPU parallelism.
- The per-channel parallelism (`B*C` independent sequences) provides sufficient throughput for moderate batch sizes.
- During inference, ROSA blocks primarily contribute during prefill; the `step` method falls back to zero output since the automaton requires the full sequence context.
When to Use ROSA vs RWKV-7
| Use Case | ROSA (r/R) | RWKV-7 (w/W) |
|---|---|---|
| Exact pattern retrieval | Yes – lossless via suffix matching | No – compressed into finite state |
| Learned sequence processing | Limited – only emb is trained | Full – all parameters are trained |
| Inference (autoregressive) | Degrades (needs full context) | Efficient (O(1) state update) |
| Long-range dependencies | Infinite range, exact | Finite effective range, approximate |
| Training speed | Slower (CPU automaton) | Fast (Triton chunk kernel) |
In practice, ROSA blocks are best used sparingly alongside RWKV-7 blocks. A typical layout might be "w4r1" – four RWKV-7 blocks for general sequence processing, one ROSA block for exact retrieval. The ROSA block acts as a “lookup table” that can surface exact matches from the input, while RWKV-7 handles the bulk of learned representation building.
RWKV_ROSA Module
The RWKV_ROSA module wraps the ROSA matching in a standard time-mixing interface:
- Time-shift mixing: Mix current token with previous token via learned interpolation (same as RWKV-7 but with only q/k/v coefficients).
- Q/K/V projection: Linear projections from the mixed hidden states.
- ROSA matching: `RosaQKV1Bit` on the projected q, k, v.
- Output projection: Linear projection back to `d_model`.
The module is paired with either RWKV_CMix (relu^2 FFN, block code r) or SwiGLU (block code R) as its feedforward component.
Block Types Reference
Aegir’s architecture is built from modular blocks, each consisting of a mixer (the sequence processing module) and an optional MLP (the feedforward network). Blocks are identified by single-character codes and composed into layout strings that define the architecture at each stage.
Block Code Table
| Code | Mixer | MLP | Description |
|---|---|---|---|
| `w` | RWKV-7 TimeMix | CMix (relu^2) | Full RWKV-7 recurrence with RWKV-style channel mixing |
| `W` | RWKV-7 TimeMix | SwiGLU | Full RWKV-7 recurrence with SwiGLU feedforward |
| `r` | ROSA (suffix automaton) | CMix (relu^2) | Exact pattern matching with RWKV-style channel mixing |
| `R` | ROSA (suffix automaton) | SwiGLU | Exact pattern matching with SwiGLU feedforward |
| `t` | Multi-Head Attention | None | Causal MHA with no feedforward |
| `T` | Multi-Head Attention | SwiGLU | Standard transformer block |
| `m` | Mamba-2 (SSM) | None | State-space model with no feedforward |
| `M` | Mamba-2 (SSM) | SwiGLU | State-space model with SwiGLU feedforward |
Convention
- Lowercase codes use RWKV-native FFN (CMix with relu^2) or no FFN at all.
- Uppercase codes use SwiGLU as the feedforward network.
- For `w`/`W` and `r`/`R`, lowercase uses CMix; uppercase uses SwiGLU.
- For `t`/`T` and `m`/`M`, lowercase has no MLP; uppercase adds SwiGLU.
The Block Wrapper
Every block follows the pre-norm residual pattern:
+---> norm1 --> mixer ---+
| |
hidden_states ----->+ +-----> hidden_states
(+ residual) | | (+ residual)
+---> norm2 --> mlp ----+ (if MLP exists)
Concretely, the Block class implements:
# Mixer sub-block
hidden_states, residual = norm1(hidden_states, residual, prenorm=True)
hidden_states = mixer(hidden_states)
# MLP sub-block (if present)
hidden_states, residual = norm2(hidden_states, residual, prenorm=True)
hidden_states = mlp(hidden_states)
The pre-norm pattern accumulates the residual stream separately from the normalized hidden states. The normalization module (RMSNorm from flash-attn, or a LayerNorm fallback) handles residual accumulation internally when prenorm=True.
Residual Height Counting
Each block contributes to the “height” of its parent Isotropic module, which is used for output projection scaling during initialization:
- Lowercase blocks (single residual addition): height += 1
- Uppercase blocks (mixer + MLP, two residual additions): height += 2
MLP Variants
CMix (RWKV Channel Mixing)
Used by lowercase RWKV codes (w, r). A simple feedforward with relu^2 activation:
# Time-shift mixing
xx = time_shift(x) - x
k = x + xx * x_k
# Feedforward
k = relu(W_key @ k) ** 2 # D -> 4D, relu squared
output = W_value @ k # 4D -> D
The expansion factor defaults to rwkv_cfg.dim_ffn_mult (default 4.0). CMix includes its own time-shift mixing, independent of the mixer’s time-shift.
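A runnable numpy sketch of CMix's reference semantics (weight shapes are illustrative):

```python
import numpy as np

def cmix(x, x_k, W_key, W_value):
    """RWKV CMix channel mixing. x: (L, D); x_k: (D,);
    W_key: (4D, D); W_value: (D, 4D)."""
    prev = np.vstack([np.zeros((1, x.shape[1])), x[:-1]])  # time_shift(x)
    k = x + (prev - x) * x_k                               # time-shift mixing
    h = np.maximum(k @ W_key.T, 0.0) ** 2                  # relu^2, D -> 4D
    return h @ W_value.T                                   # 4D -> D

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 8))
out = cmix(x, np.full(8, 0.5),
           rng.normal(size=(32, 8)), rng.normal(size=(8, 32)))
```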
SwiGLU
Used by uppercase codes (W, R, T, M). The standard SwiGLU feedforward (Shazeer 2020):
y = W_fc1 @ x # D -> 2 * D_intermediate
y, gate = split(y) # Each D_intermediate
y = silu(gate) * y
output = W_fc2 @ y # D_intermediate -> D
The intermediate dimension defaults to 8/3 * d_model, rounded up to the nearest multiple of 128.
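A minimal numpy sketch, including the stated intermediate-dimension default (helper names are hypothetical):

```python
import math
import numpy as np

def swiglu_intermediate(d_model):
    """Default intermediate dim: 8/3 * d_model, rounded up to 128."""
    return math.ceil(8 * d_model / 3 / 128) * 128

def swiglu(x, W_fc1, W_fc2):
    """SwiGLU FFN. W_fc1: (2*Di, D); W_fc2: (D, Di)."""
    y, gate = np.split(x @ W_fc1.T, 2, axis=-1)
    y = gate / (1.0 + np.exp(-gate)) * y       # silu(gate) * y
    return y @ W_fc2.T

rng = np.random.default_rng(1)
x = rng.normal(size=(2, 4))
out = swiglu(x, rng.normal(size=(12, 4)), rng.normal(size=(4, 6)))
```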
Layout String Parsing
Architecture layout strings encode a sequence of block types and their counts. The string is parsed by the Isotropic module using a regex:
re.findall(r"([mMtTrRwW])(\d+)", arch_layout)
Examples:
| Layout String | Parsed Blocks |
|---|---|
"w4" | 4 RWKV-7+CMix blocks |
"w4T1r2" | 4 RWKV-7+CMix, 1 MHA+SwiGLU, 2 ROSA+CMix |
"W8" | 8 RWKV-7+SwiGLU blocks |
"m2w4m2" | 2 Mamba-2, 4 RWKV-7+CMix, 2 Mamba-2 |
Within a layout string, blocks are instantiated in order with sequential layer_idx values. The total layer count across all block types in the string is used for RWKV-7’s position-dependent weight initialization.
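The parsing rule can be demonstrated directly with the stated regex:

```python
import re

def parse_layout(layout):
    """Parse a layout string into (block_code, count) pairs using the
    same regex as the Isotropic module."""
    return [(c, int(n)) for c, n in re.findall(r"([mMtTrRwW])(\d+)", layout)]

blocks = parse_layout("w4T1r2")              # [("w", 4), ("T", 1), ("r", 2)]
total = sum(n for _, n in parse_layout("m2w4m2"))  # total layer count: 8
```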
The create_block Function
create_block() is the factory function that dispatches on the block code character:
block = create_block(
arch="w", # block code
d_model=192, # hidden dimension
d_intermediate=512, # SwiGLU intermediate dim (for uppercase codes)
ssm_cfg={...}, # Mamba-2 config (for m/M)
attn_cfg={...}, # MHA config (for t/T)
rwkv_cfg=RWKVConfig(...), # RWKV config (for w/W/r/R)
layer_idx=0, # layer index for cache keying
num_hidden_layers=12, # total layers for init scheduling
)
The function:
- Selects the mixer class based on the code character.
- Selects the MLP class: CMix for `w`/`r`, SwiGLU for uppercase, `nn.Identity` for `t`/`m`.
- Selects the normalization class: flash-attn’s RMSNorm if available, otherwise a LayerNorm fallback with prenorm support.
- Constructs and returns a `Block` instance with the selected components.
Value-First Sharing Across Blocks
When an Isotropic module contains RWKV-7 blocks (w/W), it maintains a shared v_first = [None] container. This mutable list is passed as a mixer_kwarg to every RWKV-7 block:
- The first RWKV-7 block (layer_idx 0 within the Isotropic) stores its value projection in `v_first[0]`.
- Subsequent RWKV-7 blocks lerp their value toward `v_first[0]` via a learnable gate.
This sharing is local to each Isotropic instance – encoder, decoder, and main network at each stage each have their own v_first container.
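The mutable-container pattern can be sketched as follows. This is an illustrative stand-in, not the real mixer: the class name, fixed gate value, and plain matrix value projection are all assumptions.

```python
import numpy as np

class RWKV7ValueMixSketch:
    """Illustrative value path of an RWKV-7 block with v_first sharing."""

    def __init__(self, d: int, rng: np.random.Generator):
        self.w_v = rng.standard_normal((d, d)) * 0.02
        self.gate = 0.5  # learnable lerp gate in the real block; fixed here

    def forward(self, x: np.ndarray, v_first: list) -> np.ndarray:
        v = x @ self.w_v
        if v_first[0] is None:
            v_first[0] = v                        # first block stores its value projection
        else:
            v = v + self.gate * (v_first[0] - v)  # later blocks lerp toward v_first[0]
        return v

rng = np.random.default_rng(0)
blocks = [RWKV7ValueMixSketch(8, rng) for _ in range(3)]
v_first = [None]  # one shared mutable container per Isotropic instance
x = rng.standard_normal((4, 8))
outs = [b.forward(x, v_first) for b in blocks]
```

Because the list object itself is passed to every block, the first block's write is visible to all later blocks without any global state.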
Pretraining: Ontology-Grounded Synthetic Data
How do you train a structured information model at LLM scale when labeled relational data is scarce and expensive?
Conventional approaches to semantic column annotation rely on manually labeled benchmark datasets – SOTAB, GitTables, WikiTables – that are costly to create, domain-limited, and rarely capture the cross-table relationships needed for data element discovery. Self-supervised pretraining on raw tables (as in DODUO and TURL) learns useful representations, but the “ground truth” for column semantics remains noisy or absent.
Aegir takes a fundamentally different approach: we generate the training data from first principles, so the ground truth is always known by construction.
The Core Insight
We invert the usual pipeline. Instead of finding tables and labeling them, we:
- Start from the highest-quality curated text available
- Extract formal ontological structure using LLMs
- Project that structure into relational database schemas
- Populate schemas with realistic synthetic data
- Train the model to recover the ontological entities from the raw table data
Because we control every step of the generation process, the mapping from table columns back to ontological entities is always available as ground truth. The diversity of the input text drives the diversity of the synthetic data; the formal ontological backbone guarantees structural correctness.
Pipeline Overview
What Is Novel
No prior work combines all five stages into a single pipeline. Each stage has precedent; the composition does not.
| Stage | Prior Art | What Exists | What Is New |
|---|---|---|---|
| Text → Ontology | OntoGPT, REBEL, DeepOnto | LLM-based ontology extraction from text | Using curated educational text as seed for training data generation |
| BFO Grounding | Common Core Ontologies, OBO Foundry | BFO as upper ontology for domain modeling | BFO as the generative backbone for synthetic ML training data |
| SysMLv2 Intermediate | openCAESAR, Cameo | SysMLv2 for systems engineering | SysMLv2 MBSE as intermediate representation in an ML data pipeline |
| Synthetic Tables | MOSTLY.ai, SDV, NeurIPS 2024 TRL | Synthetic table generation for augmentation | Tables generated from ontological structure with known entity provenance |
| Entity Recovery | DODUO, TURL (masked column) | Masked language model pretraining on tables | Ontological entity recovery as the training objective, not next-token prediction |
The closest related work is “Enhancing Table Representations with LLM-powered Synthetic Data Generation” (NeurIPS 2024 TRL Workshop), which generates synthetic tables to improve column embedding similarity. That work generates tables for representation learning; Aegir generates tables for ontological entity recovery – a fundamentally different objective that produces richer training signal because the ground truth includes hierarchical entity structure, cross-table relationships, and BFO-grounded type constraints.
Why This Scales
The bottleneck in conventional table annotation is human labeling. The bottleneck here is LLM inference for ontology extraction – which is embarrassingly parallel and decreasing in cost.
The multiplicative structure of the pipeline ensures near-unlimited training data:
| Stage | Multiplier | Source |
|---|---|---|
| Curated text | ~500M passages | FineWeb-Edu (1.3T tokens) |
| Ontology fragments | 1–5 per passage | Domain-dependent entity density |
| Database schemas | 1–10 per fragment | Varying normalization strategies |
| Table instances | 100–10,000 rows | Procedural generation with distribution control |
| Total training examples | effectively unbounded | Combinatorial product of all stages |
A single educational passage about hospital billing can produce ontology fragments for patient demographics, encounter management, diagnosis coding, insurance claims, and provider credentialing – each of which generates distinct database schemas, each populated with different synthetic data distributions. The diversity of the training data is bounded only by the diversity of human knowledge captured in the source text.
How This Connects to Aegir
The pretraining objective maps directly to Aegir’s three target tasks:
- Column Type Annotation (CTA): The per-column entity type predictions from pretraining transfer directly to CTA on SOTAB, GitTables, and WikiTables benchmarks.
- Column Property Annotation (CPA): The cross-column relationship predictions learned during pretraining capture the same inter-column semantics needed for CPA.
- Data Element Discovery: The core pretraining objective – grouping related columns into ontological entities across tables – is data element discovery. The model learns this from synthetic data where the answer is known, then applies it to real enterprise data warehouses.
Furthermore, Aegir’s agent swarm architecture enables cross-table reasoning during both pretraining and inference. Each agent processes a table, and the fused recurrent states capture inter-table relationships that no single-table model can learn.
The following sections detail each stage of the pipeline.
Stage 1: Ontology Extraction
The first stage transforms curated educational text into formal ontological structure. A large language model reads natural language passages and produces BFO-grounded ontology fragments – typed entity hierarchies with properties, relationships, and axioms that can be mechanically projected into database schemas.
Input: FineWeb PDFs Edu
The source corpus is FineWeb-Edu, a curated subset of Common Crawl filtered for educational content using LLaMA-3-70B-Instruct quality scoring. Key properties:
- 1.3 trillion tokens of curated, high-quality educational text
- Spans every domain: medicine, law, finance, engineering, social sciences, natural sciences
- Already deduplicated and quality-filtered – no need for additional curation
- PDF-extracted passages preserve document structure (headings, tables, lists)
Each passage is a self-contained description of some real-world domain – exactly the kind of text that contains implicit ontological structure waiting to be made explicit.
Extraction Process
The extraction uses structured prompting with a three-phase approach:
1. Domain identification: Classify the passage into one or more information domains (healthcare, finance, logistics, etc.) to select domain-appropriate extraction templates.
2. Entity extraction: Identify entity types, their properties, and inter-entity relationships. The prompt constrains outputs to BFO-compatible categories.
3. BFO alignment: Map each extracted entity to the appropriate BFO upper-level category, ensuring the fragment inherits BFO's formal axioms.
Validation Gate
Not every LLM output is usable. A validation gate checks three properties:
- Syntactic: Does the output parse as valid OWL/RDF?
- BFO alignment: Is every class properly subsumed by a BFO category?
- Coherence: Are there contradictory axioms or dangling references?
Fragments that fail validation are discarded or re-prompted. In practice, structured output modes (JSON schema enforcement) in GLM-4.7/GLM-5 achieve >90% first-pass validation rates.
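A minimal sketch of the gate's BFO-alignment and coherence checks over a JSON-style fragment. The field names (`classes`, `relations`, `bfo`, `domain`, `range`) and the helper itself are assumptions; a real gate would also run an OWL/RDF parser for the syntactic check.

```python
# Hypothetical set of accepted upper-level categories (IRIs from the table below).
BFO_ROOTS = {"BFO:0000030", "BFO:0000015", "BFO:0000019",
             "BFO:0000023", "BFO:0000031", "BFO:0000020"}

def validate_fragment(frag: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the fragment passes."""
    errors = []
    classes = {c["id"] for c in frag.get("classes", [])}
    for c in frag.get("classes", []):
        # BFO alignment: every class must be subsumed by a known upper category
        if c.get("bfo") not in BFO_ROOTS:
            errors.append(f"{c['id']}: missing/unknown BFO alignment")
    for r in frag.get("relations", []):
        # Coherence: no dangling references to undeclared classes
        if r["domain"] not in classes or r["range"] not in classes:
            errors.append(f"{r['id']}: dangling domain/range reference")
    return errors
```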
Why BFO
Basic Formal Ontology (ISO/IEC 21838-2:2021) is the most widely adopted upper ontology in applied information science:
- 700+ ontology projects built on BFO across government, defense, healthcare, and industry
- Common Core Ontologies (CCO) used by the U.S. Department of Defense and Intelligence Community
- OBO Foundry biomedical ontologies (Gene Ontology, ChEBI, etc.) all align to BFO
- Formal first-order logic axiomatization ensures machine-verifiable consistency
BFO provides the upper-level categories that give our extracted ontologies a shared formal backbone. Without this grounding, extracted ontologies would be ad-hoc entity lists with no guaranteed interoperability or logical structure.
BFO Categories for Information Systems
The BFO categories most relevant to relational data modeling:
| BFO Category | IRI | Maps To | Example |
|---|---|---|---|
| Generically Dependent Continuant | BFO:0000031 | InformationEntity | A patient record, a diagnosis code |
| Object | BFO:0000030 | Concrete entity | A patient, a medical device |
| Quality | BFO:0000019 | Data attribute | Acuity level, sensitivity classification |
| Role | BFO:0000023 | Functional role | Data subject, provider, auditor |
| Process | BFO:0000015 | Temporal event | An encounter, a transaction, a review |
| Specifically Dependent Continuant | BFO:0000020 | Inherent property | A patient’s blood type, a device’s serial number |
These categories constrain what kinds of entities can participate in what kinds of relationships – a Patient (Object) can bear a DataSubjectRole (Role), an Encounter (Process) has participant a Patient (Object), and so on. These constraints propagate through the pipeline: they determine which foreign key relationships are valid in the generated schemas.
Formal Definition
An ontology fragment is a tuple:
\[ O = (C, R, A, \iota) \]
where:
- \(C = \{c_1, \ldots, c_n\}\) is a set of classes (entity types), each with a set of properties \(P(c_i) = \{p_1, \ldots, p_k\}\)
- \(R = \{r_1, \ldots, r_m\}\) is a set of relations between classes, each \(r_j: c_a \to c_b\) with cardinality constraints
- \(A\) is a set of axioms – subsumption (\(c_i \sqsubseteq c_j\)), disjointness (\(c_i \sqcap c_j = \bot\)), and property constraints (domain, range, cardinality)
- \(\iota: C \to \text{BFO}\) is the BFO alignment mapping that assigns each class to a BFO upper-level category
The alignment mapping \(\iota\) must satisfy BFO’s axioms: if \(\iota(c_i) = \text{BFO:Process}\), then \(c_i\) inherits Process axioms (has temporal extent, can have participants, etc.). This is not merely a label – it constrains the valid relationships and properties that \(c_i\) can participate in.
Output
Each successfully validated ontology fragment becomes input to Stage 2: Schema Projection. A single text passage typically yields 1–5 fragments, depending on the complexity and domain diversity of the passage content.
The ontology fragments are serialized as OWL/RDF for archival and as structured JSON for downstream processing. Both representations preserve the full BFO alignment mapping, enabling validation at every subsequent stage.
Stage 2: Schema Projection
Schema projection transforms BFO-grounded ontology fragments into relational database schemas through a two-step process: first into SysMLv2 systems engineering models, then into programmatic data objects and SQL schemas. The intermediate SysMLv2 representation captures structural constraints, lifecycle semantics, and system-level relationships that flat entity-relationship modeling would lose.
Why SysMLv2 as Intermediate Representation
Using SysMLv2 (OMG, approved July 2025) as an intermediate representation between ontology and database schema is unconventional – and deliberate. SysMLv2 provides formal constructs that bridge the gap between abstract ontological entities and concrete data structures:
| SysMLv2 Construct | Ontological Concept | Database Primitive |
|---|---|---|
| Block Definition | Entity type | Table |
| Part Property | Composition | One-to-many FK |
| Reference Property | Association | Many-to-many junction table |
| Port | Interface/boundary | Shared column (FK target) |
| Attribute | Data property | Column |
| Constraint | Axiom | CHECK constraint |
| State Machine | Lifecycle | Status enum + temporal columns |
| Requirement | Validation rule | Application-level validation |
The openCAESAR project provides an OWL2-DL ontology for SysMLv2, making the ontology-to-SysMLv2 projection formally well-defined. This means we’re not hand-waving the transformation – there’s a rigorous mapping from BFO-grounded classes and relations to SysMLv2 blocks and connections.
The critical advantage: SysMLv2 models encode systems with internal structure, constraints, and lifecycle semantics. The generated databases aren’t just flat tables with columns – they’re projections of coherent systems where referential integrity, state transitions, and constraint propagation all have formal justification in the source model.
Projection Pipeline
Step 1: Ontology → SysMLv2
Each BFO class maps to a SysMLv2 construct based on its upper-level category:
- BFO:Object → `part def` (a concrete block with owned parts)
- BFO:Process → `action def` with a state machine (lifecycle semantics)
- BFO:Role → `port def` (an interface that objects can fulfill)
- BFO:Quality → `attribute def` (a typed value property)
- BFO:GDC (Generically Dependent Continuant) → `part def` with `subsets informationEntity` (a record or document)
Relations map to SysMLv2 connections:
- Composition (whole-part) → `part` usage within a block
- Association → `ref` usage with multiplicity
- Participation (Object in Process) → `perform` action usage

Axioms map to `constraint def` blocks with OCL-like expressions.
Step 2: SysMLv2 → Programmatic Objects
The SysMLv2 model is projected into Python dataclasses via template-based code generation:
```python
from dataclasses import dataclass, field
from datetime import date, datetime

@dataclass
class Patient:
    patient_id: str        # from attribute def
    date_of_birth: date    # from attribute def
    gender: str            # from attribute def
    encounters: list["Encounter"] = field(default_factory=list)  # from part usage (1..*)

@dataclass
class Encounter:
    encounter_id: str          # generated primary key
    patient_id: str            # from owning block (FK)
    encounter_date: datetime   # from attribute def
    status: str                # from state machine states
    provider_id: str           # from ref usage (FK)
    diagnoses: list["Diagnosis"] = field(default_factory=list)   # from part usage (1..*)

@dataclass
class Diagnosis:
    diagnosis_id: str   # generated primary key
    encounter_id: str   # from owning block (FK)
    code: str           # from attribute def
    description: str    # from attribute def
    coded_by: str       # from ref usage (FK)
```
Step 3: Data Objects → Relational Schema
The dataclasses are mapped to SQLAlchemy models and CREATE TABLE statements:
```sql
CREATE TABLE patient (
    patient_id VARCHAR(36) PRIMARY KEY,
    date_of_birth DATE NOT NULL,
    gender VARCHAR(10) NOT NULL
);

CREATE TABLE encounter (
    encounter_id VARCHAR(36) PRIMARY KEY,
    patient_id VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    encounter_date TIMESTAMP NOT NULL,
    status VARCHAR(20) NOT NULL CHECK (status IN ('active', 'closed')),
    provider_id VARCHAR(36) NOT NULL REFERENCES provider(provider_id)
);

CREATE TABLE diagnosis (
    diagnosis_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    code VARCHAR(10) NOT NULL,
    description TEXT,
    coded_by VARCHAR(36) REFERENCES provider(provider_id)
);
```
Ontological Mapping Rules
The projection preserves ontological structure through systematic rules:
| Ontological Structure | Relational Mapping | Provenance Preserved |
|---|---|---|
| Entity type | Table | Table name ↔ BFO class |
| Data property | Column | Column name ↔ property IRI |
| Object property (1:N) | Foreign key | FK ↔ relation IRI |
| Object property (M:N) | Junction table | Junction ↔ relation IRI |
| Subsumption hierarchy | Table-per-type inheritance | Parent FK ↔ rdfs:subClassOf |
| Disjointness axiom | CHECK constraint | Constraint ↔ axiom |
| Cardinality constraint | NOT NULL / UNIQUE | Column constraint ↔ cardinality |
The critical property is that every schema element traces back to a specific ontological element. This traceability is what makes the training objective possible: when the model predicts that two columns belong to the same data element, we can verify that prediction against the source ontology.
Schema Variation
A single ontology fragment can produce multiple valid database schemas through controlled variation:
- Normalization level: 1NF, 2NF, 3NF, or fully denormalized
- Inheritance strategy: Table-per-type, table-per-hierarchy, or single-table with discriminator
- Naming conventions: `snake_case`, `camelCase`, abbreviated, or obfuscated (`col_1`, `field_a`)
- Type mappings: `DATE` vs `VARCHAR` for dates, `INTEGER` vs `VARCHAR` for codes
This variation is essential for training robustness. Real-world databases use all of these conventions, often mixed within a single schema. By generating diverse schemas from the same ontological source, the model learns to recognize semantic equivalence across surface-level variation.
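The naming-convention axis can be sketched with a small helper; the function name and the exact abbreviation rule (first three characters per word) are assumptions for illustration:

```python
def rename(column: str, convention: str, idx: int = 0) -> str:
    """Apply one naming-convention variation to a canonical snake_case column name."""
    parts = column.split("_")
    if convention == "camelCase":
        return parts[0] + "".join(p.title() for p in parts[1:])
    if convention == "abbreviated":
        return "_".join(p[:3] for p in parts)  # e.g. patient_id -> pat_id
    if convention == "obfuscated":
        return f"col_{idx}"                    # e.g. col_0, col_1, ...
    return column                              # snake_case passthrough
```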
Formal Mapping
The schema projection is a function:
\[ \pi: O \to \mathcal{S} \]
where \(O = (C, R, A, \iota)\) is an ontology fragment and \(\mathcal{S} = \{S_1, \ldots, S_k\}\) is a set of valid relational schemas. Each schema \(S_i = (T, K, F, \Gamma)\) consists of:
- \(T = \{t_1, \ldots, t_n\}\) – tables, each with columns \(\text{cols}(t_j)\)
- \(K\) – primary key constraints
- \(F\) – foreign key constraints
- \(\Gamma\) – CHECK constraints
The projection must satisfy:
\[ \forall\, t \in T\ \exists\, c \in C : \text{name}(t) \xleftarrow{\pi} c \]
\[ \forall\, f \in F\ \exists\, r \in R : f \xleftarrow{\pi} r \]
That is, every table traces to a class and every foreign key traces to a relation. This bidirectional traceability is the formal guarantee that makes ontological entity recovery a well-defined training objective.
Stage 3: Synthetic Data Generation
Given a relational schema with known ontological provenance, the third stage populates tables with realistic synthetic data. The goal is not just to fill rows – it is to produce data distributions that exercise the same patterns and confusable types the model will encounter in real enterprise databases.
Population Pipeline
Value Generation
Each column type maps to a specialized generator that produces realistic values. The generator selection is driven by the ontological provenance of the column – a column traced to BFO:Quality with domain healthcare produces different values than one traced to BFO:Quality with domain finance.
Generator Categories
| Column Semantics | Generator | Example Values |
|---|---|---|
| Person name | Faker (locale-aware) | “Maria Santos”, “James O’Brien” |
| Date/timestamp | Range-bounded random | 2019-03-15, 2024-11-02T14:30:00 |
| Identifier (UUID) | UUIDv4 | f47ac10b-58cc-4372-a567-0e02b2c3d479 |
| Identifier (sequential) | Auto-increment with prefix | PAT-00001, ENC-2024-0042 |
| Medical code (ICD-10) | Sampled from code registry | J18.9, I25.10, E11.65 |
| Financial code (IBAN) | Country-specific format | DE89370400440532013000 |
| Categorical | Weighted sampling from enum | active, closed, pending |
| Free text | Template + Faker | “Patient presents with acute chest pain” |
| Numeric measure | Distribution-sampled | 98.6, 120/80, 72 |
| Boolean flag | Bernoulli(p) | true, false |
| Address | Locale-aware composite | “123 Main St, Springfield, IL 62704” |
| Email | Pattern-based | maria.santos@hospital.org |
| Phone | Country-format | +1-555-0123 |
Referential Integrity
Tables are populated in topological order (parents before children) to guarantee that every foreign key value references an existing parent row. The population engine:
- Sorts tables by foreign key dependencies (detecting and breaking cycles if needed)
- Populates root tables (no FK dependencies) first
- For each child table, samples FK values from the parent table’s primary key column
- Respects cardinality constraints: a `NOT NULL` FK always gets a valid reference; an optional FK gets `NULL` with configurable probability
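The dependency-ordered population step can be sketched as a standard topological sort (Kahn's algorithm); the helper name and input shape are assumptions, and cycle breaking is elided (cyclic tables are simply left out here):

```python
from collections import deque

def population_order(fk_deps: dict[str, set[str]]) -> list[str]:
    """Order tables so every parent is populated before its children.
    fk_deps maps each table to the set of parent tables it references."""
    indegree = {t: len(parents) for t, parents in fk_deps.items()}
    children = {t: set() for t in fk_deps}
    for t, parents in fk_deps.items():
        for p in parents:
            children[p].add(t)
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))  # root tables
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for c in sorted(children[t]):
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

deps = {"patient": set(), "provider": set(),
        "encounter": {"patient", "provider"},
        "diagnosis": {"encounter", "provider"}}
```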
Distribution Control
Real databases are not uniformly distributed. The generation config controls:
- Cardinality: How many child rows per parent (e.g., 1–30 encounters per patient, following a power-law distribution)
- Null ratio: What fraction of nullable columns contain NULL (typically 5–30% in real data)
- Value entropy: How many distinct values appear in categorical columns (a `status` column might have 3 values; a `diagnosis_code` column might have 500)
- Skew: Zipfian distributions for columns where a few values dominate (e.g., 80% of encounters are `status='closed'`)
- Temporal patterns: Dates that follow realistic patterns (weekday-heavy, seasonal, monotonically increasing)
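Two of these knobs, skew and null ratio, can be sketched in NumPy; the function names, exponent, and default ratios are illustrative choices, not the pipeline's actual config:

```python
import numpy as np

rng = np.random.default_rng(42)

def zipf_categorical(values: list[str], n: int, s: float = 1.2) -> np.ndarray:
    """Sample n categorical values with Zipfian skew: rank r gets weight r^-s."""
    ranks = np.arange(1, len(values) + 1)
    p = ranks ** -s
    p /= p.sum()
    return rng.choice(values, size=n, p=p)

def with_nulls(col: list, null_ratio: float = 0.15) -> list:
    """Replace a configurable fraction of entries with None."""
    mask = rng.random(len(col)) < null_ratio
    return [None if m else v for v, m in zip(col, mask)]

status = zipf_categorical(["closed", "active", "pending"], 1000)
```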
Diversity from Source Text
The curated text input drives diversity along two independent axes:
Domain Diversity
Different passages produce different ontological domains, which produce structurally distinct databases:
| Source Domain | Example Tables | Distinctive Patterns |
|---|---|---|
| Healthcare | patient, encounter, diagnosis, medication | ICD-10 codes, temporal encounter sequences |
| Finance | account, transaction, instrument, counterparty | IBAN/SWIFT codes, decimal precision, audit trails |
| Supply Chain | shipment, warehouse, item, carrier | GPS coordinates, weight/volume, tracking IDs |
| Education | student, course, enrollment, grade | GPA calculations, semester cycles |
| HR/Payroll | employee, department, payroll, benefit | SSN patterns, salary ranges, org hierarchies |
Structural Diversity
Even within a single domain, different passages emphasize different relationships, producing varied schema structures:
- A passage about emergency triage produces schemas with acuity levels, wait times, and disposition tracking
- A passage about chronic disease management produces schemas with longitudinal encounters, medication histories, and care plans
- A passage about hospital billing produces schemas with insurance claims, procedure codes, and payment reconciliation
All three are “healthcare databases” but have substantially different table structures, column types, and relationship patterns. This structural diversity is what trains the model to generalize beyond surface patterns.
Confusable Type Injection
A key training challenge is confusable pairs – columns with nearly identical value distributions but different semantic types. The generation pipeline deliberately injects these:
| Confusable Pair | Value Pattern | Distinguishing Context |
|---|---|---|
| Advertising ID vs GUID | Both UUIDv4 format | Table context (ad_events vs generic) |
| Bank account vs payment card | Both numeric strings | Length, check digit algorithm |
| Phone number vs fax number | Both +1-XXX-XXX-XXXX | Column name, co-occurring columns |
| ZIP code vs department code | Both 5-digit numbers | Geographic context vs org context |
| Patient ID vs provider ID | Both XXX-NNNNN format | Foreign key relationships |
By generating schemas where these confusable types coexist – often in the same database – the model learns to resolve ambiguity using cross-column and cross-table context rather than single-column pattern matching.
Scale Arithmetic
Working through concrete numbers:
| Stage | Count | Basis |
|---|---|---|
| FineWeb-Edu passages | ~500M | 1.3T tokens / ~2,600 tokens per passage |
| Ontology fragments | ~1–5 per passage | Domain-dependent entity density |
| Schemas per fragment | ~1–10 | Normalization and naming variation |
| Tables per schema | ~5–50 | Domain complexity |
| Rows per table | ~100–10,000 | Configurable per generation |
| Total table instances | >10 billion | Conservative lower bound |
The bottleneck is LLM inference for ontology extraction (Stage 1), not data generation. Once an ontology fragment exists, schema projection and data population are purely procedural and can run on commodity hardware at millions of tables per hour.
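The lower bound can be checked with back-of-the-envelope arithmetic; the specific multipliers below are illustrative picks at or below the midpoints of the ranges in the table:

```python
# Conservative lower-bound arithmetic for the multiplicative pipeline.
passages = 1_300_000_000_000 // 2_600  # 1.3T tokens / ~2,600 tokens per passage
fragments_per_passage = 2              # low end of the 1-5 range
schemas_per_fragment = 2               # low end of the 1-10 range
tables_per_schema = 5                  # bottom of the 5-50 range

table_instances = (passages * fragments_per_passage
                   * schemas_per_fragment * tables_per_schema)
```

Even with these deliberately modest multipliers the product reaches 10 billion table instances, consistent with the stated lower bound.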
Stage 4: Training Objective
The training objective is the key departure from standard pretraining: Aegir does not learn to predict the next token. It learns to recover the ontological entities – data elements – that were used to generate the relational data it observes. This is possible because the generation pipeline (Stages 1–3) preserves a complete mapping from every column back to its source ontological entity.
Task Formulation
What the Model Sees
The model receives byte-serialized relational tables – one or more tables from the same generated schema, serialized as a byte stream. The serialization format mirrors how real data would be encountered:
- CSV-style serialization with delimiters, quoting, and escape characters
- Column headers may be descriptive (`patient_id`), abbreviated (`pat_id`), or opaque (`col_0`)
- Multiple tables are concatenated with table-boundary markers
- No schema metadata (no types, no foreign key declarations, no table names beyond what appears in headers)
The model must infer semantic structure purely from the byte patterns it observes.
What the Model Predicts
Three prediction heads operate on the column-level embeddings produced by Aegir’s hierarchical encoder:
1. Column Type Annotation (CTA): For each column, predict its BFO-grounded semantic type from a taxonomy. This maps directly to the CTA task on benchmarks like SOTAB and GitTables.
2. Data Element Discovery (DE): Predict which columns – potentially across different tables – belong to the same ontological entity. This is formulated as a clustering task: columns originating from the same BFO class should receive similar embeddings.
3. Hierarchical Consistency: Predict the BFO hierarchy level for each column. If a column is classified as `Diagnosis` (a subclass of `GDC`), it should also be recognized as a `GenericallyDependentContinuant`. This head enforces ontological coherence.
What We Compare Against
The ground truth comes directly from the generation pipeline:
- CTA labels: The `Column → BFO property` mapping from Stage 2 gives the exact semantic type of every column
- DE labels: The `Column → BFO class` mapping identifies which columns originated from the same ontological entity
- Hierarchy labels: The BFO subsumption hierarchy defines the expected parent types for every leaf prediction
Loss Function
The total loss is a weighted combination of three terms:
\[ \mathcal{L} = \mathcal{L}_{\text{CTA}} + \lambda_1 \mathcal{L}_{\text{DE}} + \lambda_2 \mathcal{L}_{\text{hier}} \]
Column Type Annotation Loss
Standard cross-entropy over the column type taxonomy:
\[ \mathcal{L}_{\text{CTA}} = -\frac{1}{N} \sum_{i=1}^{N} \log p(y_i \mid \mathbf{h}_i) \]
where \(\mathbf{h}_i\) is the column embedding for column \(i\), \(y_i\) is the ground truth BFO-grounded type, and \(N\) is the total number of columns across all tables in the batch.
Data Element Discovery Loss
A contrastive loss that pulls together columns from the same ontological entity and pushes apart columns from different entities:
\[ \mathcal{L}_{\text{DE}} = -\frac{1}{|\mathcal{P}|} \sum_{(i,j) \in \mathcal{P}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \neq i} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]
where \(\mathcal{P}\) is the set of positive pairs (columns from the same BFO class), \(\text{sim}\) is cosine similarity, and \(\tau\) is a temperature parameter.
This loss is what teaches the model to discover data elements: columns that the model embeds close together are predicted to belong to the same real-world entity, regardless of which table they appear in.
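A NumPy sketch of this contrastive objective, written as an explicit loop for clarity rather than efficiency; labels stand in for the BFO class each column originated from:

```python
import numpy as np

def de_loss(h: np.ndarray, labels: list, tau: float = 0.1) -> float:
    """InfoNCE-style loss: columns with the same label are positive pairs;
    the denominator for anchor i runs over every other column k != i."""
    h = h / np.linalg.norm(h, axis=1, keepdims=True)  # cosine sim via unit vectors
    sim = h @ h.T / tau
    n = len(labels)
    losses = []
    for i in range(n):
        for j in range(n):
            if i != j and labels[i] == labels[j]:      # positive pair (i, j)
                logits = np.delete(sim[i], i)          # all k != i
                m = logits.max()                       # stable log-sum-exp
                losses.append(np.log(np.exp(logits - m).sum()) + m - sim[i, j])
    return float(np.mean(losses))
```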
Hierarchical Consistency Loss
A penalty for predictions that violate the BFO subsumption hierarchy:
\[ \mathcal{L}_{\text{hier}} = \frac{1}{N} \sum_{i=1}^{N} \sum_{c \in \text{ancestors}(y_i)} \max(0, \delta - p(c \mid \mathbf{h}_i)) \]
where \(\text{ancestors}(y_i)\) returns all BFO ancestors of the predicted type, and \(\delta\) is a margin. If a column is predicted as Diagnosis, the model should assign high probability to all ancestor types: GDC, Continuant, Entity.
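A toy sketch of the margin penalty; the ancestor map mirrors the Diagnosis → GDC → Continuant → Entity chain from the text, and the per-type probability dictionaries are an illustrative stand-in for the model's hierarchy head:

```python
# Assumed toy ancestor map: leaf type -> BFO ancestors.
ANCESTORS = {"Diagnosis": ["GDC", "Continuant", "Entity"]}

def hier_loss(probs: list[dict], types: list[str], delta: float = 0.9) -> float:
    """Penalize each ancestor of a column's type whose probability falls
    below the margin delta (hinge penalty, averaged over columns)."""
    total = 0.0
    for p, y in zip(probs, types):            # p maps type name -> probability
        for c in ANCESTORS.get(y, []):
            total += max(0.0, delta - p.get(c, 0.0))
    return total / len(types)
```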
Training Loop
Batch Construction
Each training batch contains serialized tables from multiple generated schemas:
- Sample a schema from the pool (with curriculum: simpler schemas early, complex multi-table schemas later)
- Serialize one or more tables from the schema to bytes, using randomized serialization parameters (delimiter choice, quoting style, header format)
- Attach the ontological provenance labels as training targets
Multi-Table Batches
For cross-table data element discovery, batches include multiple tables from the same schema. The agent swarm architecture processes each table with a separate agent, and the fused recurrent states are used for the DE prediction head. This directly trains the model’s cross-table reasoning capability.
Connection to Downstream Tasks
The pretraining objective maps precisely to the three real-world tasks described in the Introduction:
| Pretraining Task | Downstream Task | Transfer Mechanism |
|---|---|---|
| Column type prediction | CTA on SOTAB/GitTables/WikiTables | Fine-tune CTA head on benchmark taxonomy |
| Cross-column clustering | CPA on benchmark datasets | Column pair relationship classification |
| Cross-table data element prediction | Enterprise data element discovery | Direct application – same task, real data |
The key advantage: by pretraining on synthetic data with known ground truth at massive scale, the model enters fine-tuning with strong representations for column semantics. The confusable types, cross-table relationships, and ontological hierarchies it has learned from synthetic data transfer directly to the noisy, inconsistently-named, under-documented columns in real enterprise data warehouses.
Integration with Evidence Pipelines
In production, Aegir’s predictions feed into Dempster-Shafer theory (DST) evidence fusion pipelines as a learned evidence source. The model produces:
- Column type predictions with calibrated confidence – these become mass functions in the DST framework
- Column embedding similarities – these provide evidence for same-entity relationships
- Hierarchical type predictions – these constrain the feasible type space for conjunctive combination
The calibration quality of Aegir’s confidence scores matters as much as the accuracy of its top-1 predictions. Training on diverse synthetic data with controlled difficulty (including deliberately confusable types) produces well-calibrated uncertainty estimates, because the model learns from data where the boundary between types is precisely controlled.
The specific self-supervised tasks, corruption strategies, and curriculum design that implement this objective are detailed in Training Tactics.
Training Tactics
The training objective defines what Aegir learns – ontological entity recovery from serialized relational tables. This page defines how: the specific self-supervised tasks, corruption strategies, and curriculum design that compose the pretraining regimen. Each tactic is adapted from a proven LLM pretraining method but re-targeted at the structural properties of relational data with known ontological provenance.
Tactic Overview
Each tactic is described below with its LLM analogue, formal task specification, and the downstream capability it trains.
Core Objectives
Object Property Masking
LLM analogue: Masked Language Modeling (BERT)
Mask one or more properties from an ontological entity definition. The model receives the serialized tables (which still contain the data for the masked properties) and must predict what properties the source entity had.
Difficulty gradation:
| Level | What’s Masked | Challenge |
|---|---|---|
| Easy | A column with structurally distinctive values (dates, emails) | Pattern recognition |
| Medium | A column whose type depends on co-occurring columns | Cross-column reasoning |
| Hard | A column with confusable values (UUID vs advertising ID) | Contextual disambiguation |
| Expert | Multiple properties from the same entity simultaneously | Entity structure reconstruction |
Loss: Cross-entropy over the property type vocabulary, plus a regression loss for predicting the property name embedding.
\[ \mathcal{L}_{\text{OPM}} = -\frac{1}{|M|} \sum_{p \in M} \left[ \log P(y_p \mid \mathbf{h}_p) + \alpha \| \hat{\mathbf{e}}_p - \mathbf{e}_p \|^2 \right] \]
where \(M\) is the set of masked properties, \(y_p\) is the property’s BFO type, \(\mathbf{e}_p\) is the property name embedding, and \(\alpha\) weights the name regression term.
Trains: Column type annotation (CTA). The model learns to identify what semantic role a column plays from its value distribution and surrounding context.
Replaced Column Detection
LLM analogue: Replaced Token Detection (ELECTRA)
Swap columns between tables that originated from different ontological entities. A discriminator must identify which columns are imposters — present in a table they don’t ontologically belong to.
The generator learns to make plausible swaps — columns with similar value distributions but different semantic types. This is precisely the confusable-pair problem. A naive generator might swap patient_id (UUID) with encounter_date (timestamp) — trivially detectable. A trained generator learns to swap patient_id with provider_id (both UUIDs, both foreign-keyed) — a much harder discrimination task.
Two-phase training:
- Generator: A small model that scores candidate column swaps by value-distribution similarity and selects high-similarity pairs
- Discriminator: Aegir itself, trained to detect which columns don’t belong
The ELECTRA insight applies directly: the discriminator receives a training signal on every column (original or replaced), not just the masked positions. This is far more sample-efficient than masking-based objectives.
Loss: Binary cross-entropy per column.
\[ \mathcal{L}_{\text{RCD}} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log D(\mathbf{h}_i) + (1 - y_i) \log(1 - D(\mathbf{h}_i)) \right] \]
where \(y_i = 1\) if column \(i\) was replaced and \(D\) is the discriminator head.
Trains: Confusable type resolution. Directly addresses the hardest failure mode in production column annotation — columns with identical value patterns but different semantic roles.
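The generator's swap-selection step can be sketched as follows. This is an illustrative heuristic, not the trained generator: `column_features` and `pick_confusable_swap` are hypothetical names, and the feature set is deliberately minimal (length statistics plus a character-class histogram):

```python
import numpy as np

def column_features(values):
    """Cheap value-distribution features for a column of strings."""
    lens = [len(v) for v in values]
    digits = sum(c.isdigit() for v in values for c in v)
    alphas = sum(c.isalpha() for v in values for c in v)
    dashes = sum(c == "-" for v in values for c in v)
    total = max(digits + alphas + dashes, 1)
    return np.array([np.mean(lens), np.std(lens),
                     digits / total, alphas / total, dashes / total])

def pick_confusable_swap(cols_a, cols_b):
    """Return the cross-table column pair with the most similar features."""
    best, best_sim = None, -1.0
    for na, va in cols_a.items():
        fa = column_features(va)
        for nb, vb in cols_b.items():
            fb = column_features(vb)
            sim = fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-9)
            if sim > best_sim:
                best, best_sim = (na, nb), sim
    return best
```

On the hospital example this heuristic already prefers the ID-for-ID swap over trivially detectable ones.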
Relation Masking
LLM analogue: Next Sentence Prediction (BERT) / Sentence Order Prediction (ALBERT), extended to structural relationships
Drop a foreign key column from the serialized data and ask the model to predict that a relationship between two tables exists, which tables it connects, and what column would mediate it.
Task variants:
| Variant | Input | Target |
|---|---|---|
| Existence | Two tables, FK column removed | Binary: are these tables related? |
| Direction | Two related tables, FK removed | Which table is parent, which is child? |
| Column | Tables with FK removed | Which column in the child table held the FK? |
| Full recovery | Multi-table schema, one FK removed | Predict source table, target table, and mediating column |
Difficulty: Existence is easy (value overlap between tables is a strong signal). Direction requires understanding cardinality from data distributions. Full recovery in a 10-table schema with multiple possible FK targets is genuinely hard.
Loss: Cross-entropy over table pairs for existence/direction, cross-entropy over columns for the FK column prediction.
\[ \mathcal{L}_{\text{RM}} = \mathcal{L}_{\text{exist}} + \beta_1 \mathcal{L}_{\text{direction}} + \beta_2 \mathcal{L}_{\text{column}} \]
Trains: Cross-table data element discovery. The model learns to identify structural relationships between tables from data patterns alone — exactly what’s needed when foreign key metadata is missing or unreliable in enterprise warehouses.
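The existence and direction signals described above can be sketched as simple value-overlap heuristics (illustrative: `fk_signal` is a hypothetical name, and the point of Relation Masking is that the model learns these signals from data rather than computing them explicitly):

```python
def fk_signal(child_values, parent_values):
    """Heuristic Relation Masking signals.

    Existence: fraction of child values found in the parent column.
    Direction: the parent side of a 1:N relationship has unique values
    while the child side repeats.
    """
    parent_set = set(parent_values)
    overlap = sum(v in parent_set for v in child_values) / max(len(child_values), 1)
    parent_unique = len(parent_set) == len(parent_values)
    child_repeats = len(set(child_values)) < len(child_values)
    return {"exists": overlap > 0.95,
            "direction_parent_to_child": parent_unique and child_repeats}
```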
Span Corruption (Entity-Level)
LLM analogue: Span Corruption (T5)
Mask all columns belonging to one data element across all tables and replace them with a sentinel. The model must predict what kind of entity is absent based on the remaining schema structure.
This is harder than single-property masking because the model must reason about the structural hole in the schema. A schema with patients, encounters, medications, and providers but no diagnostic information has a recognizable gap — clinical workflows always involve diagnosis. The model learns domain-level structural expectations.
Masking strategies:
- Single entity: Remove all columns from one BFO class (as above)
- Related pair: Remove two related entities (e.g., Diagnosis and its FK in Encounter)
- Subtree: Remove an entity and all its dependents in the ontological hierarchy
Loss: Sequence-to-sequence generation of the masked entity structure, or classification over a vocabulary of entity type templates.
Trains: Entity boundary detection and structural reasoning. When the model encounters a real database missing expected entities, it can predict what should exist — critical for data governance gap analysis.
Augmentation Strategies
Schema Denoising
LLM analogue: Denoising Autoencoder (BART)
Apply multiple corruptions to the serialized schema simultaneously. The model must recover the clean ontological structure from the noisy input.
Corruption menu (applied stochastically per training example):
| Corruption | What Changes | Real-World Analogue |
|---|---|---|
| Column renaming | date_of_birth → col_3 | Generic column names in enterprise DW |
| Column shuffling | Randomize column order within tables | Arbitrary column ordering conventions |
| Table merging | Join two tables into one wide table | Denormalization for query performance |
| Table splitting | Split one table into arbitrary fragments | Vertical partitioning |
| Type coercion | Store dates as strings, integers as floats | Legacy system type mismatches |
| Delimiter variation | CSV → TSV → pipe-delimited → fixed-width | Different export formats |
| Header removal | Drop column headers entirely | Headerless data exports |
| Row sampling | Keep only a random subset of rows | Partial data access |
Multiple corruptions can stack: rename columns and merge tables and switch delimiters. The model trained on this distribution becomes robust to the full range of real-world schema messiness.
Loss: Reconstruction loss on the original ontological labels applied to the column embeddings from the corrupted input. The corruptions change what the model sees; the targets remain the clean ontological structure.
Trains: Robustness to real-world data formats. Enterprise databases exhibit every one of these corruptions and often several simultaneously.
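A minimal sketch of applying two corruptions from the menu, column renaming and column shuffling (`corrupt_table` is a hypothetical helper; the production pipeline applies the full menu stochastically and keeps the clean labels as targets):

```python
import random

def corrupt_table(header, rows, rng):
    """Apply column shuffling and stochastic column renaming.
    The data values survive; only the presentation changes."""
    n = len(header)
    order = list(range(n))
    rng.shuffle(order)                          # column shuffling
    new_header = []
    for j, idx in enumerate(order):
        if rng.random() < 0.5:                  # column renaming
            new_header.append(f"col_{j}")
        else:
            new_header.append(header[idx])
    new_rows = [[row[idx] for idx in order] for row in rows]
    return new_header, new_rows
```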
Cross-Schema Contrastive Learning
LLM analogue: Contrastive Learning (SimCLR, CLIP)
Generate two different schemas from the same ontology fragment — one normalized with clear names, one denormalized with obfuscated names — and train the model to produce similar representations for both. Schemas from different ontology fragments should produce dissimilar representations.
Positive pairs: Two schema variants from the same ontology fragment. Negative pairs: Schemas from different ontology fragments (even within the same domain — two different healthcare schemas should still be distinguishable).
Loss: InfoNCE contrastive loss over schema-level representations.
\[ \mathcal{L}_{\text{CSC}} = -\frac{1}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \log \frac{\exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_i^b) / \tau)}{\sum_{j \in \mathcal{B}} \exp(\text{sim}(\mathbf{z}_i^a, \mathbf{z}_j^b) / \tau)} \]
where \(\mathbf{z}_i^a\) and \(\mathbf{z}_i^b\) are schema-level embeddings (pooled from column embeddings) for the two variants of ontology fragment \(i\), and \(\mathcal{B}\) is the batch.
Trains: Schema-invariant representations. The model learns that the same information can appear in radically different structural formats — the core challenge in enterprise data integration.
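The InfoNCE loss can be sketched in numpy (illustrative: `info_nce` is a hypothetical name, and the real loss operates on schema embeddings pooled from the model's column embeddings):

```python
import numpy as np

def info_nce(za, zb, tau=0.1):
    """InfoNCE over schema-level embeddings: za[i] and zb[i] are the two
    variants of ontology fragment i; other zb[j] in the batch are negatives."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    sims = za @ zb.T / tau                     # (B, B) scaled cosine similarities
    sims -= sims.max(axis=1, keepdims=True)    # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```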
Domain-Specific Objectives
Axiom Recovery
LLM analogue: No direct analogue — novel to this setting
Given only the populated tables (no schema metadata), predict the constraints from the source ontology.
Target axioms:
| Axiom Type | Example | Evidence in Data |
|---|---|---|
| Enum constraint | disposition ∈ {admission, discharge, transfer, observation} | Closed set of distinct values |
| Uniqueness | license_number is unique per provider | No duplicates in column |
| Cardinality | Exactly one is_primary=true per encounter | Group-by count pattern |
| Range | esi_level ∈ [1, 5] | Min/max of integer column |
| Referential | Every encounter.patient_id appears in patient.patient_id | Value subset relationship |
| Functional dependency | zip_code → state | Deterministic mapping in data |
Loss: Multi-label classification over axiom templates, parameterized by column references and value sets.
Trains: Constraint discovery. In production, many database constraints are implicit (enforced by application logic, not declared in the schema). A model that can infer constraints from data patterns provides direct value for data quality assessment and governance.
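Several of the axiom templates above can be checked directly against column values. A sketch (`infer_axioms` is a hypothetical helper; the model is trained to predict axioms from serialized data, not to compute them with rules):

```python
def infer_axioms(column):
    """Infer candidate axioms for one column from its values."""
    axioms = []
    distinct = set(column)
    # Enum constraint: small closed set of string values
    if len(distinct) <= 10 and all(isinstance(v, str) for v in column):
        axioms.append(("enum", frozenset(distinct)))
    # Uniqueness: no duplicates
    if len(distinct) == len(column):
        axioms.append(("unique", None))
    # Range: min/max of an integer column
    if all(isinstance(v, int) for v in column):
        axioms.append(("range", (min(column), max(column))))
    return axioms
```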
Normalization Prediction
LLM analogue: No direct analogue — novel to this setting
Given a denormalized table, predict the normalized ontological entities — which groups of columns should be separate entities.
In the hospital example, a fully denormalized patient_encounters table contains patient demographics, encounter details, vital signs, diagnoses, and medications all in one wide table. The model must predict that this represents 5+ distinct ontological entities that have been collapsed.
The inverse task is also valuable: given a normalized schema, predict which tables could be meaningfully denormalized (i.e., which tables represent qualities or sub-parts of a parent entity).
Loss: Clustering loss over column embeddings within a single table — columns that should be factored into the same normalized entity should cluster together.
\[ \mathcal{L}_{\text{norm}} = -\frac{1}{|\mathcal{P}_{\text{intra}}|} \sum_{(i,j) \in \mathcal{P}_{\text{intra}}} \log \frac{\exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_j) / \tau)}{\sum_{k \in \text{cols}(t)} \exp(\text{sim}(\mathbf{h}_i, \mathbf{h}_k) / \tau)} \]
where \(\mathcal{P}_{\text{intra}}\) is the set of column pairs within a single table that originate from the same ontological entity.
Trains: Entity boundary detection within denormalized tables. Real enterprise data warehouses are heavily denormalized for query performance. Recovering the underlying entity structure from a 200-column fact table is a high-value governance task.
Cardinality Estimation
LLM analogue: No direct analogue — extends relational reasoning
Given populated tables, predict the cardinality constraints from the source ontology: one-to-one, one-to-many, or many-to-many.
The model must infer cardinality from value distributions:
- 1:1: Every FK value appears exactly once in both tables
- 1:N: FK values in the child table repeat; each parent PK appears once
- M:N: Both sides have repeating values (mediated by a junction table)
Loss: Cross-entropy over cardinality categories per table pair.
Trains: Relationship characterization. Understanding cardinality is foundational for schema understanding and directly supports both CPA and data element discovery — a 1:1 relationship suggests entity decomposition, while M:N suggests an independent association.
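The three inference rules above can be sketched as (illustrative: `classify_cardinality` is a hypothetical helper operating on paired key columns, where `left_keys[i]` relates to `right_keys[i]`):

```python
from collections import Counter

def classify_cardinality(left_keys, right_keys):
    """Classify a relationship from repetition patterns on each side."""
    left_repeats = max(Counter(left_keys).values()) > 1
    right_repeats = max(Counter(right_keys).values()) > 1
    if not left_repeats and not right_repeats:
        return "1:1"   # every key appears exactly once on both sides
    if left_repeats != right_repeats:
        return "1:N"   # one side repeats, the other is unique
    return "M:N"       # both sides repeat (junction-table pattern)
```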
Difficulty Curriculum
Following UL2’s insight that mixing objectives with explicit difficulty signals outperforms any single objective, training uses a difficulty-tagged curriculum.
Each training example carries a difficulty tag (R, S, X, or Z, in increasing order of difficulty) prepended to the input. The model learns to allocate capacity differently depending on the expected difficulty – using fast pattern matching for R-level tasks and deeper structural reasoning for X- and Z-level tasks.
Curriculum Schedule
Training proceeds in four phases, progressively increasing difficulty:
| Phase | Epochs | Mix (R/S/X/Z) | Objectives Introduced |
|---|---|---|---|
| 1 | 0–10 | 70/20/10/0 | OPM, RCD (easy variants) |
| 2 | 10–30 | 30/40/20/10 | + Relation Masking, Schema Denoising |
| 3 | 30–60 | 10/30/30/30 | + Span Corruption, Cross-Schema Contrastive |
| 4 | 60+ | 10/20/30/40 | + Axiom Recovery, Normalization, Cardinality |
Domain-specific objectives (axiom recovery, normalization prediction, cardinality estimation) are introduced late because they require the model to already have basic column understanding and cross-table reasoning capabilities.
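Sampling a difficulty tag per training example according to the phase mix can be sketched as (the mix values are taken from the schedule above; `sample_difficulty` is a hypothetical helper):

```python
import random

# Phase mixes over difficulty tags (R/S/X/Z) from the curriculum schedule
PHASE_MIX = {
    1: {"R": 0.70, "S": 0.20, "X": 0.10, "Z": 0.00},
    2: {"R": 0.30, "S": 0.40, "X": 0.20, "Z": 0.10},
    3: {"R": 0.10, "S": 0.30, "X": 0.30, "Z": 0.30},
    4: {"R": 0.10, "S": 0.20, "X": 0.30, "Z": 0.40},
}

def sample_difficulty(phase, rng):
    """Draw a difficulty tag for one training example from the phase mix."""
    tags, weights = zip(*PHASE_MIX[phase].items())
    return rng.choices(tags, weights=weights, k=1)[0]
```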
Objective Priority
The objectives are not equally important. Based on downstream task alignment:
| Objective | Priority | Downstream Impact |
|---|---|---|
| Object Property Masking | Core | Directly trains CTA |
| Replaced Column Detection | Core | Resolves confusable pairs — the hardest CTA failures |
| Relation Masking | Core | Directly trains cross-table data element discovery |
| Span Corruption | Core | Trains entity boundary detection |
| Schema Denoising | High | Robustness to real-world data — improves all tasks |
| Cross-Schema Contrastive | High | Schema-invariant representations — critical for transfer |
| Axiom Recovery | Medium | Valuable for governance but not core to CTA/DE |
| Normalization Prediction | Medium | Important for denormalized warehouses |
| Cardinality Estimation | Medium | Supports relationship characterization |
The four core objectives should compose the majority of training compute. Augmentation strategies (denoising, contrastive) are applied as data transformations rather than separate losses. Domain-specific objectives are scheduled in later phases as refinement tasks.
End-to-End Example
This walkthrough traces a single educational text passage through the entire pretraining pipeline – from raw text to a validated training example. Every intermediate representation is shown concretely, making the abstract pipeline tangible.
Step 1: Input Text
A passage from a PDF about hospital emergency department workflows:
Emergency departments manage patient flow through a structured triage process. When a patient arrives, a triage nurse assesses their condition and assigns an acuity level using the Emergency Severity Index (ESI), ranging from 1 (resuscitation) to 5 (non-urgent). Each patient encounter records the presenting complaint, vital signs at triage, the assigned provider, and any diagnostic tests ordered.
Diagnoses are coded using ICD-10-CM, with a primary diagnosis and optional secondary diagnoses recorded per encounter. Medications prescribed during the encounter are tracked with the drug name, dosage, route of administration, and the prescribing provider. The encounter concludes with a disposition decision: admission, discharge, transfer, or observation.
This is a typical educational passage: clear, structured, and rich in implicit ontological content.
Step 2: Ontology Extraction
The LLM receives the passage with a structured extraction prompt and produces a BFO-grounded ontology fragment:
Classes (with BFO alignment):
| Class | BFO Parent | Properties |
|---|---|---|
| Patient | BFO:Object | patient_id, date_of_birth, gender, address |
| Encounter | BFO:Process | encounter_id, encounter_date, presenting_complaint, disposition |
| Provider | BFO:Role | provider_id, name, specialty, license_number |
| Diagnosis | BFO:GDC | diagnosis_id, icd10_code, description, is_primary |
| Medication | BFO:GDC | medication_id, drug_name, dosage, route |
| VitalSigns | BFO:Quality | heart_rate, blood_pressure, temperature, respiratory_rate, spo2 |
| AcuityLevel | BFO:Quality | esi_level (1–5) |
Relations:
| Relation | Domain | Range | Cardinality |
|---|---|---|---|
| hasEncounter | Patient | Encounter | 1..* |
| hasProvider | Encounter | Provider | 1..1 |
| hasDiagnosis | Encounter | Diagnosis | 1..* |
| hasMedication | Encounter | Medication | 0..* |
| hasVitalSigns | Encounter | VitalSigns | 1..1 |
| hasAcuity | Encounter | AcuityLevel | 1..1 |
| prescribedBy | Medication | Provider | 1..1 |
Axioms:
- `Encounter.disposition ∈ {admission, discharge, transfer, observation}`
- `AcuityLevel.esi_level ∈ {1, 2, 3, 4, 5}`
- `Diagnosis.is_primary` is unique per Encounter (exactly one primary diagnosis)
Step 3: SysMLv2 Model
The ontology maps to SysMLv2 block definitions. The SysMLv2 model adds lifecycle semantics (the Encounter state machine: entry → triage → treatment → disposition → closed) and formal constraints that the flat ontology fragment does not capture.
Step 4: Python Data Objects
```python
from dataclasses import dataclass
from datetime import date, datetime
from enum import Enum


class Disposition(Enum):
    ADMISSION = "admission"
    DISCHARGE = "discharge"
    TRANSFER = "transfer"
    OBSERVATION = "observation"


class Route(Enum):
    ORAL = "oral"
    IV = "intravenous"
    IM = "intramuscular"
    TOPICAL = "topical"
    INHALED = "inhaled"


@dataclass
class Patient:
    patient_id: str
    date_of_birth: date
    gender: str
    address: str


@dataclass
class Provider:
    provider_id: str
    name: str
    specialty: str
    license_number: str


@dataclass
class Encounter:
    encounter_id: str
    patient_id: str  # FK → Patient
    provider_id: str  # FK → Provider
    encounter_date: datetime
    presenting_complaint: str
    esi_level: int  # 1-5
    disposition: Disposition
    heart_rate: int
    blood_pressure: str
    temperature: float
    respiratory_rate: int
    spo2: int


@dataclass
class Diagnosis:
    diagnosis_id: str
    encounter_id: str  # FK → Encounter
    icd10_code: str
    description: str
    is_primary: bool


@dataclass
class Medication:
    medication_id: str
    encounter_id: str  # FK → Encounter
    prescribed_by: str  # FK → Provider
    drug_name: str
    dosage: str
    route: Route
```
Note that VitalSigns and AcuityLevel (BFO:Quality entities) have been denormalized into the Encounter table – a deliberate schema variation that the model must learn to handle. In a different schema variant, these would be separate tables.
Step 5: Relational Schema
```sql
CREATE TABLE patient (
    patient_id VARCHAR(36) PRIMARY KEY,
    date_of_birth DATE NOT NULL,
    gender VARCHAR(10) NOT NULL,
    address TEXT
);

CREATE TABLE provider (
    provider_id VARCHAR(36) PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    specialty VARCHAR(50) NOT NULL,
    license_number VARCHAR(20) NOT NULL UNIQUE
);

CREATE TABLE encounter (
    encounter_id VARCHAR(36) PRIMARY KEY,
    patient_id VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    provider_id VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    encounter_date TIMESTAMP NOT NULL,
    presenting_complaint TEXT NOT NULL,
    esi_level INTEGER NOT NULL CHECK (esi_level BETWEEN 1 AND 5),
    disposition VARCHAR(20) NOT NULL
        CHECK (disposition IN ('admission','discharge','transfer','observation')),
    heart_rate INTEGER,
    blood_pressure VARCHAR(10),
    temperature NUMERIC(4,1),
    respiratory_rate INTEGER,
    spo2 INTEGER CHECK (spo2 BETWEEN 0 AND 100)
);

CREATE TABLE diagnosis (
    diagnosis_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    icd10_code VARCHAR(10) NOT NULL,
    description TEXT,
    is_primary BOOLEAN NOT NULL DEFAULT FALSE
);

-- At most one primary diagnosis per encounter. A plain UNIQUE (encounter_id, is_primary)
-- would also forbid a second secondary diagnosis, so a partial unique index is used.
CREATE UNIQUE INDEX diagnosis_one_primary_per_encounter
    ON diagnosis (encounter_id) WHERE is_primary;

CREATE TABLE medication (
    medication_id VARCHAR(36) PRIMARY KEY,
    encounter_id VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    prescribed_by VARCHAR(36) NOT NULL REFERENCES provider(provider_id),
    drug_name VARCHAR(100) NOT NULL,
    dosage VARCHAR(50) NOT NULL,
    route VARCHAR(20) NOT NULL
);
```
Step 6: Synthetic Data
Sample rows from the populated tables:
patient (200 rows):
| patient_id | date_of_birth | gender | address |
|---|---|---|---|
| a3f8c1d0-... | 1987-03-15 | Female | 2847 Oak Ave, Portland, OR 97205 |
| b7e2a4f1-... | 1952-11-28 | Male | 156 Pine St, Austin, TX 78701 |
| c9d0b3e2-... | 2001-07-04 | Female | 4021 Maple Dr, Denver, CO 80202 |
encounter (1,400 rows, ~7 per patient):
| encounter_id | patient_id | provider_id | encounter_date | presenting_complaint | esi_level | disposition | heart_rate | blood_pressure | temperature |
|---|---|---|---|---|---|---|---|---|---|
| e1a2b3c4-... | a3f8c1d0-... | p001-... | 2024-01-15 14:30 | Acute chest pain | 2 | admission | 98 | 145/92 | 98.6 |
| e5f6a7b8-... | b7e2a4f1-... | p003-... | 2024-02-03 09:15 | Laceration, left hand | 4 | discharge | 72 | 128/78 | 98.2 |
diagnosis (3,200 rows, ~2.3 per encounter):
| diagnosis_id | encounter_id | icd10_code | description | is_primary |
|---|---|---|---|---|
| d100-... | e1a2b3c4-... | I21.9 | Acute myocardial infarction, unspecified | true |
| d101-... | e1a2b3c4-... | I10 | Essential hypertension | false |
| d200-... | e5f6a7b8-... | S61.412A | Laceration without FB, left hand | true |
medication (2,100 rows):
| medication_id | encounter_id | prescribed_by | drug_name | dosage | route |
|---|---|---|---|---|---|
| m100-... | e1a2b3c4-... | p001-... | Aspirin | 325mg | oral |
| m101-... | e1a2b3c4-... | p001-... | Heparin | 5000 units | intravenous |
| m200-... | e5f6a7b8-... | p003-... | Lidocaine | 1% 5mL | topical |
Step 7: Serialized Input
Aegir receives the tables as byte-serialized CSV data. Here’s what the model actually sees (abbreviated):
```
patient_id,date_of_birth,gender,address
a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0,1987-03-15,Female,"2847 Oak Ave, Portland, OR 97205"
b7e2a4f1-3c5d-4e6f-8a9b-c0d1e2f3a4b5,1952-11-28,Male,"156 Pine St, Austin, TX 78701"
c9d0b3e2-1a4f-4c7d-9e2b-f3a5b6c7d8e9,2001-07-04,Female,"4021 Maple Dr, Denver, CO 80202"
...
===TABLE_BOUNDARY===
encounter_id,patient_id,provider_id,encounter_date,presenting_complaint,esi_level,disposition,heart_rate,blood_pressure,temperature,respiratory_rate,spo2
e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c,a3f8c1d0-7b2e-4a1f-9c3d-e5f6a7b8c9d0,p001-a2b3-c4d5,2024-01-15 14:30:00,Acute chest pain,2,admission,98,145/92,98.6,20,97
...
===TABLE_BOUNDARY===
diagnosis_id,encounter_id,icd10_code,description,is_primary
d100-e1f2-a3b4-c5d6,e1a2b3c4-5d6e-4f7a-8b9c-0d1e2f3a4b5c,I21.9,"Acute myocardial infarction, unspecified",true
...
```
The model sees raw bytes. No type annotations, no foreign key declarations, no semantic metadata – just the patterns in the data itself.
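A minimal sketch of this serialization format (`serialize_tables` is a hypothetical helper; the production serializer handles quoting, escaping, and encoding more carefully):

```python
def serialize_tables(tables):
    """Serialize a list of (header, rows) tables into the byte stream the
    model consumes, separated by the table-boundary sentinel."""
    def quote(v):
        # minimal CSV quoting: wrap values containing commas
        return f'"{v}"' if "," in v else v
    chunks = []
    for header, rows in tables:
        lines = [",".join(header)]
        lines += [",".join(quote(v) for v in row) for row in rows]
        chunks.append("\n".join(lines))
    return "\n===TABLE_BOUNDARY===\n".join(chunks).encode("utf-8")
```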
Step 8: Training Target
The expected predictions for this training example:
CTA predictions (per column):
| Table | Column | Expected Type | BFO Category |
|---|---|---|---|
| patient | patient_id | PersonIdentifier | GDC |
| patient | date_of_birth | BirthDate | Quality |
| patient | gender | BiologicalSex | Quality |
| encounter | encounter_id | EncounterIdentifier | GDC |
| encounter | patient_id | PersonIdentifier (FK) | GDC |
| encounter | esi_level | AcuityLevel | Quality |
| encounter | disposition | DispositionDecision | Quality |
| encounter | heart_rate | VitalSign | Quality |
| diagnosis | icd10_code | DiagnosisCode | GDC |
| diagnosis | is_primary | PrimaryIndicator | Quality |
| medication | drug_name | MedicationName | GDC |
| medication | dosage | Dosage | Quality |
| medication | route | AdministrationRoute | Quality |
Data element predictions (cross-table clusters):
| Data Element | Columns | Source Entity |
|---|---|---|
| PatientDemographics | patient.patient_id, patient.date_of_birth, patient.gender, patient.address, encounter.patient_id | Patient |
| ClinicalEncounter | encounter.encounter_id, encounter.encounter_date, encounter.presenting_complaint, encounter.esi_level, encounter.disposition, encounter.heart_rate, encounter.blood_pressure, encounter.temperature | Encounter + VitalSigns + AcuityLevel |
| DiagnosisRecord | diagnosis.diagnosis_id, diagnosis.encounter_id, diagnosis.icd10_code, diagnosis.description, diagnosis.is_primary | Diagnosis |
| MedicationOrder | medication.medication_id, medication.encounter_id, medication.drug_name, medication.dosage, medication.route | Medication |
| ClinicalProvider | provider.provider_id, provider.name, provider.specialty, provider.license_number, encounter.provider_id, medication.prescribed_by | Provider |
Note that the PatientDemographics data element spans patient.patient_id and encounter.patient_id – cross-table discovery. Similarly, ClinicalProvider spans columns in three tables (provider, encounter, medication). This is exactly the cross-table data element discovery that enterprise data governance requires.
Step 9: Validation
The round-trip check confirms that predicted data elements map back to source ontological entities:
Every predicted data element corresponds to exactly one source ontology entity. The ClinicalEncounter element correctly groups encounter properties with the denormalized VitalSigns and AcuityLevel qualities – demonstrating that the model learned to see through the denormalization to the underlying ontological structure.
This validation is automatic and exact because the generation pipeline preserves complete provenance. There is no human labeling, no ambiguity, and no annotation disagreement. The ground truth is a mathematical consequence of the generation process.
What This Means in Practice
When this training process is applied at scale – across hundreds of millions of passages spanning every domain in FineWeb-Edu – the model learns:
- Column type recognition that generalizes across naming conventions, data formats, and serialization styles
- Cross-table relationship discovery that identifies semantically related columns regardless of which tables they appear in
- Ontological hierarchy that connects specific types (ICD-10 codes) to general categories (information entities) through BFO’s formal structure
- Confusable type resolution by leveraging cross-column context (patient_id vs provider_id look identical in isolation but participate in different relationship patterns)
These capabilities transfer directly to real enterprise data warehouses, where the model encounters the same patterns – just without the luxury of knowing the ontological provenance in advance.
Agent Swarm Architecture
Aegir’s agent swarm infrastructure enables multi-agent collaboration through RWKV recurrent state fusion. Rather than exchanging text messages or attention KV caches between agents, the swarm shares compact recurrent state tensors – a fundamentally more efficient communication medium for recurrent architectures.
Why RWKV State Sharing
The central insight is that RWKV’s recurrent state is constant in sequence length. Each layer’s state is a matrix of shape (H, K, V) where H is the number of heads and K = V = head_size. The total state size per layer is:
O(H * head_size^2) = O(d_model * head_size)
This is independent of how many tokens the agent has processed.
For a swarm of N agents, the cost of sharing all recurrent states is:
RWKV: O(N * d_model * head_size) -- constant in sequence length
Transformer: O(N * n * d_model) -- linear in sequence length n
At context lengths of 4k-128k tokens with typical d = 512-4096, RWKV state sharing is orders of magnitude cheaper. The LatentMAS paper (arXiv:2511.20639) quantifies this as 235-471x more information-dense than text-based inter-agent communication, since the recurrent state encodes a compressed summary of the entire processing history.
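A quick numeric comparison, counting state elements per layer (illustrative; the helper names are hypothetical, and the transformer figure counts a standard per-layer KV cache):

```python
def rwkv_state_size(n_layers, d_model, head_size):
    """Per-agent recurrent state: one (H, K, V) matrix per layer,
    with H = d_model // head_size and K = V = head_size."""
    H = d_model // head_size
    return n_layers * H * head_size * head_size  # = n_layers * d_model * head_size

def transformer_kv_size(n_layers, d_model, n_tokens):
    """KV cache: keys and values for every token at every layer."""
    return n_layers * 2 * n_tokens * d_model
```

For a 24-layer model with `d_model=2048`, `head_size=64`, and a 32k-token context, the KV cache is three orders of magnitude larger than the recurrent state.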
Swarm Components
The swarm consists of four modules:
| Module | File | Purpose |
|---|---|---|
| `RWKVStateFusion` | `src/aegir/swarm/state_fusion.py` | Combine N agent states into one |
| `AlignmentProjection` | `src/aegir/swarm/alignment.py` | Map states between different-sized agents |
| `FrozenSpecialist` | `src/aegir/swarm/specialist.py` | Wrap pre-trained models as frozen agents |
| `SwarmOrchestrator` | `src/aegir/swarm/orchestrator.py` | K2.5 PARL routing and reward |
State Fusion Modes
RWKVStateFusion supports three strategies for combining agent states:
- `weighted_sum` – Attention-weighted combination using learnable query/key projections. The orchestrator learns which agents to trust per head.
- `gated` – Per-agent softmax gates. Simpler than attention but still differentiable. Good baseline for initial experiments.
- `concat_project` – Concatenate all agent states and project back to single-agent dimensions. Most expressive but `O(N)` in parameter count.
See RWKV State Fusion for mathematical details.
Information Density Advantage
LatentMAS demonstrates that recurrent state communication dramatically outperforms text-based multi-agent protocols. The recurrent state is a lossy but highly compressed representation of the agent’s entire context window. Sharing it is equivalent to sharing a continuous-valued “summary” that preserves the information most relevant to the model’s computation, rather than forcing that information through a text bottleneck.
For Aegir’s column annotation task, this means a specialist trained on (say) geographic column types can share its accumulated understanding of a table’s structure through a single (H, K, V) tensor per layer, rather than generating and parsing natural language explanations.
RWKV State Fusion
The RWKVStateFusion module combines recurrent states from multiple specialist agents into a single fused state for the primary agent. Implementation is in src/aegir/swarm/state_fusion.py.
Input Format
Each agent produces a per-layer recurrent state tensor of shape:
(B, H, K, V)
where B is batch size, H = num_heads, and K = V = head_size. Given N agents, the fusion module receives a list of N such tensors and outputs a single tensor of the same shape.
Internally, the input list is stacked into a single tensor of shape (B, N, H, K, V).
Fusion Modes
weighted_sum – Attention Over Agent States
Uses a learnable query vector per head and a key projection to compute attention weights over agents.
Parameters:
- `query`: `(H, K)` – learnable query per attention head
- `key_proj`: linear mapping `K*V -> K` (no bias)
Computation:
```
flat    = reshape(stacked, [B, N, H, K*V])
keys    = key_proj(flat)                     # (B, N, H, K)
scores  = einsum("bnhk, hk -> bnh", keys, query)
weights = softmax(scores, dim=1)             # (B, N, H)
fused   = einsum("bnh, bnhkv -> bhkv", weights, stacked)
```
Each head independently learns which agents to attend to. This is the default mode and generally the most effective, since it allows fine-grained per-head routing without excessive parameters.
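The weighted_sum computation can be made executable in numpy (a sketch of the forward pass only; the real `RWKVStateFusion` is a learnable torch module whose `query` and `key_proj` parameters are trained):

```python
import numpy as np

def weighted_sum_fusion(stacked, query, key_w):
    """Numpy sketch of the weighted_sum fusion mode.

    stacked: (B, N, H, K, V) stacked agent states
    query:   (H, K) per-head query
    key_w:   (K, K*V) key projection weights (no bias)
    """
    B, N, H, K, V = stacked.shape
    flat = stacked.reshape(B, N, H, K * V)
    keys = flat @ key_w.T                              # (B, N, H, K)
    scores = np.einsum("bnhk,hk->bnh", keys, query)
    scores -= scores.max(axis=1, keepdims=True)        # stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over agents
    return np.einsum("bnh,bnhkv->bhkv", weights, stacked)
```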
gated – Learnable Per-Agent Gates
A simpler approach with a single learnable gate vector.
Parameters:
- `gates`: `(N,)` – initialized to `1/N` (uniform)
Computation:
```
weights = softmax(gates, dim=0)              # (N,)
fused   = einsum("n, bnhkv -> bhkv", weights, stacked)
```
All heads share the same agent weighting. This is cheaper than weighted_sum but less expressive – it cannot learn head-specific preferences for different specialists.
concat_project – Concatenate and Project
The most expressive mode. Concatenates all agent states along the agent dimension and projects back.
Parameters:
- `proj`: linear mapping `N*K*V -> K*V` (no bias)
Computation:
```
flat      = reshape(permute(stacked, [0, 2, 1, 3, 4]), [B, H, N*K*V])
projected = proj(flat)                       # (B, H, K*V)
fused     = reshape(projected, [B, H, K, V])
```
This allows arbitrary mixing of information across agents within each head but scales linearly in parameters with the number of agents.
Usage Example
```python
from aegir.swarm.state_fusion import RWKVStateFusion

fusion = RWKVStateFusion(
    num_heads=8,
    head_size=64,
    num_agents=3,
    mode="weighted_sum",
)

# agent_states: list of 3 tensors, each (B, 8, 64, 64)
fused_state = fusion(agent_states)  # (B, 8, 64, 64)
```
Mode Selection Guidelines
| Mode | Parameters | Per-head routing | Best for |
|---|---|---|---|
| `weighted_sum` | O(H*K + K*V*K) | Yes | General use, default |
| `gated` | O(N) | No | Quick experiments, few agents |
| `concat_project` | O(N*K*V*K*V) | Yes | Maximum expressiveness, small N |
LatentMAS Alignment Projection
The AlignmentProjection module maps recurrent states between agents that may have different architectures (different d_model, num_heads, or head_size). Implementation is in src/aegir/swarm/alignment.py.
Problem
When fusing states from multiple agents, all states must share the same (H, K, V) dimensions. But specialists may have been trained with different model sizes. A CTA specialist with d_model=256 and a CPA specialist with d_model=512 produce incompatible recurrent states. The alignment projection resolves this mismatch.
State Types
RWKV recurrent states consist of two kinds of tensors:
Matrix States (att_kv)
The core recurrent state from time mixing. Shape: (B, H, K, V) where K = V = head_size.
Projection: When source and target have different num_heads or head_size, the matrix state is flattened and linearly projected:
```
S_flat   = reshape(S_source, [B, H_s * K_s * V_s])
S_target = W_matrix @ S_flat
S_out    = reshape(S_target, [B, H_t, K_t, V_t])
```
where W_matrix has shape (H_t * K_t * V_t, H_s * K_s * V_s).
The LatentMAS paper (arXiv:2511.20639) proposes using a bilinear projection S' = W_l @ S @ W_r^T and computing the projection matrices via ridge regression on paired agent activations. Aegir instead trains the projection end-to-end as part of the swarm’s gradient flow, which avoids the need for a separate alignment data collection phase and allows the projection to co-adapt with the fusion module.
Vector States (att_x_prev, ffn_x_prev)
The previous-timestep hidden state cache used by RWKV’s time-shift mechanism. Shape: (B, D) where D = d_model.
Projection: Simple linear mapping when d_model differs:
x_target = W_vector @ x_source
where W_vector has shape (D_target, D_source).
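To make the two projections concrete, here is a toy PyTorch sketch of the flatten + linear mapping described above (TinyAlignment and its small dimensions are hypothetical; the real AlignmentProjection lives in src/aegir/swarm/alignment.py and uses the full head sizes):

```python
import torch
import torch.nn as nn

class TinyAlignment(nn.Module):
    """Toy version of the flatten + linear state projections. Small head_size
    keeps the example light; real states use head_size 64."""
    def __init__(self, h_s, k_s, v_s, h_t, k_t, v_t, d_s, d_t):
        super().__init__()
        # W_matrix has shape (H_t*K_t*V_t, H_s*K_s*V_s)
        self.w_matrix = nn.Linear(h_s * k_s * v_s, h_t * k_t * v_t, bias=False)
        # W_vector has shape (D_target, D_source)
        self.w_vector = nn.Linear(d_s, d_t, bias=False)
        self.target_shape = (h_t, k_t, v_t)

    def forward_matrix(self, s):              # s: (B, H_s, K_s, V_s)
        s_flat = s.flatten(start_dim=1)       # (B, H_s*K_s*V_s)
        return self.w_matrix(s_flat).reshape(s.shape[0], *self.target_shape)

    def forward_vector(self, x):              # x: (B, D_s)
        return self.w_vector(x)

align = TinyAlignment(h_s=4, k_s=8, v_s=8, h_t=8, k_t=8, v_t=8, d_s=32, d_t=64)
s_out = align.forward_matrix(torch.randn(2, 4, 8, 8))   # -> (2, 8, 8, 8)
x_out = align.forward_vector(torch.randn(2, 32))        # -> (2, 64)
```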
When Projections Are Needed
The module detects whether projection is needed at initialization:
# Matrix projection: needed when head geometry differs
needs_matrix_proj = (
source_num_heads != target_num_heads
or source_head_size != target_head_size
)
# Vector projection: needed when d_model differs
needs_vector_proj = (source_d_model != target_d_model)
When source and target share the same architecture, both projections are identity operations (no parameters allocated).
Usage
from aegir.swarm.alignment import AlignmentProjection
align = AlignmentProjection(
source_num_heads=4, source_head_size=64,
target_num_heads=8, target_head_size=64,
source_d_model=256,
target_d_model=512,
)
# Project matrix state
att_kv_target = align.forward_matrix(att_kv_source) # (B,4,64,64) -> (B,8,64,64)
# Project vector state
x_prev_target = align.forward_vector(x_prev_source) # (B,256) -> (B,512)
LatentMAS vs Aegir Approach
| Aspect | LatentMAS | Aegir |
|---|---|---|
| Alignment method | Ridge regression on collected pairs | End-to-end gradient training |
| Training data | Requires parallel agent runs | Learned during swarm training |
| Adaptability | Fixed after alignment phase | Continuously adapts |
| Projection type | Bilinear W_l @ S @ W_r^T | Flatten + linear (at least as expressive) |
The end-to-end approach is viable because Aegir’s swarm training already has gradient flow through the fusion module. The alignment projection sits in that gradient path and receives signal from the downstream task loss.
K2.5 PARL Orchestrator
The SwarmOrchestrator coordinates a trainable primary Aegir model with multiple frozen specialist agents, following the Parallel Agent Reinforcement Learning (PARL) pattern from Kimi K2.5 (arXiv:2602.02276). Implementation is in src/aegir/swarm/orchestrator.py.
Architecture
+-------------------+
| SwarmOrchestrator |
+-------------------+
|
+--------------+--------------+
| | |
SpecialistRouter Primary FrozenSpecialists
(sigmoid gates) (trainable) (frozen params)
| | |
| | +---------+---------+
| | | | |
| | Spec_0 Spec_1 Spec_N
| | | | |
+--> activation --> state fusion <----+
mask (RWKVStateFusion)
The primary model is the only component whose parameters are updated during PARL training. Specialists are frozen checkpoints that contribute their recurrent states when activated by the router.
SpecialistRouter
The router decides which specialists to activate for a given input. It maps the primary agent’s hidden representation to per-specialist activation scores:
scores = sigmoid(W_router @ hidden_states) # (B, num_specialists)
activation_mask = scores > threshold # default threshold = 0.5
Sigmoid gating (rather than softmax) allows zero, one, or multiple specialists to be activated simultaneously. This is critical for the column annotation task where a table may require expertise from several domain specialists, or none at all.
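As a concrete illustration, sigmoid gating can be sketched in a few lines (TinyRouter is a hypothetical simplification of SpecialistRouter in src/aegir/swarm/orchestrator.py; the dimensions are arbitrary):

```python
import torch
import torch.nn as nn

class TinyRouter(nn.Module):
    """Toy sketch of sigmoid-gated specialist routing."""
    def __init__(self, d_model, num_specialists, threshold=0.5):
        super().__init__()
        self.proj = nn.Linear(d_model, num_specialists)
        self.threshold = threshold

    def forward(self, hidden):                       # hidden: (B, d_model)
        scores = torch.sigmoid(self.proj(hidden))    # independent gates in (0, 1)
        mask = scores > self.threshold               # zero, one, or many True per row
        return scores, mask

router = TinyRouter(d_model=16, num_specialists=3)
scores, mask = router(torch.randn(2, 16))            # (2, 3) scores and boolean mask
```

Because each gate is an independent sigmoid rather than one slot in a softmax, any number of entries in a row of the mask can be True at once.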
PARL Reward Structure
The combined reward follows K2.5’s formulation:
r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf
Reward Components
r_perf – Performance reward. Macro F1 on the annotation task (CTA or CPA). This is the primary signal that drives annotation quality.
r_parallel – Parallelism and load balancing reward. Encourages efficient specialist utilization: activate specialists when they help, avoid activating them when they don’t. Adapted from H-Net’s lb_loss which penalizes unbalanced routing across experts.
r_finish – Completion quality reward. All columns in a table must be annotated, and the router must not degenerate into always-on or always-off patterns. Penalizes incomplete annotations and trivial routing strategies.
Lambda Annealing Schedule
Following K2.5, the lambda weights anneal over training:
| Phase | lambda_1 (parallel) | lambda_2 (finish) | Rationale |
|---|---|---|---|
| Early | 0.3 | 0.1 | Encourage exploration of specialist activation |
| Mid | 0.1 | 0.3 | Shift focus to completion quality |
| Late | 0.05 | 0.05 | Let r_perf dominate for final accuracy |
The initial values (lambda_parallel=0.3, lambda_finish=0.1) are set in the orchestrator constructor. Annealing is managed by the training loop.
Token-Level Clipping RL
K2.5 uses a variant of PPO with token-level clipping rather than trajectory-level. This provides finer-grained credit assignment:
- Each token’s routing decision gets its own clipped surrogate objective
- Critical tokens (column boundaries, type-indicative values) receive higher weight
- The clipping range narrows over training to stabilize converged policies
Critical-Steps Optimization
Rather than minimizing total computation, the orchestrator minimizes the critical path – the longest chain of sequential dependencies. Specialist activations that can run in parallel do not increase the critical path even if they increase total FLOPs. This encourages the router to prefer parallel specialist activation over sequential reasoning in the primary model when both achieve similar accuracy.
Forward Pass
orchestrator = SwarmOrchestrator(
primary_model=primary,
specialists=[spec_cta, spec_cpa, spec_geo],
fusion=RWKVStateFusion(num_heads=8, head_size=64, num_agents=3),
d_model=512,
activation_threshold=0.5,
)
result = orchestrator(
input_ids=tokens,
mask=mask,
routing_hidden=pooled_hidden, # from primary's first layer
)
# result["output"] -- primary model output
# result["specialist_outputs"] -- list of activated specialist results
# result["activation_mask"] -- (B, num_specialists) boolean mask
When routing_hidden is None, specialist activation is skipped entirely and only the primary model runs. This allows the same orchestrator to be used in both supervised pre-training (no specialists) and PARL training (with specialists).
Roadmap: K2.5 RL Post-Training
This section outlines the four-phase plan for training Aegir from a supervised baseline through full multi-agent reinforcement learning with PARL orchestration.
Overview
The training follows a progressive complexity increase, where each phase builds on the previous one’s checkpoints and infrastructure:
Phase 1 Phase 2 Phase 3 Phase 4
Supervised --> Reward --> PARL --> Agent
Bootstrapping Modeling Training Swarm RL
Train base Design reward Train orchestrator Scale to
Aegir on CTA/CPA components and with frozen multi-specialist
benchmarks validate signals specialists swarms
Phases
Phase 1: Supervised Bootstrapping
Train the base Aegir model on column annotation benchmarks (CTA, CPA) with byte-level input and dynamic chunking. Establish baseline F1 scores and validate the hierarchical architecture on real table data.
Phase 2: Reward Modeling
Design and validate the three reward components (r_perf, r_parallel, r_finish) that will drive PARL training. Calibrate lambda weights and verify that the reward signal produces meaningful gradients.
Phase 3: PARL Training
Freeze the best Phase 1 checkpoint as a specialist and train a new primary model with the PARL orchestrator. Use token-level clipping RL with critical-steps optimization.
Phase 4: Agent Swarm RL
Scale from a single specialist to a full swarm with dynamic specialist spawning. Implement wide search (parallel column analysis) and deep search (hierarchical type reasoning) patterns.
Design Principles
- Each phase produces a usable checkpoint. Even Phase 1 yields a competitive standalone column annotation model.
- Frozen specialists are never modified. PARL training only updates the primary model and the routing/fusion modules. This prevents catastrophic forgetting in specialists and simplifies the training loop.
- Reward components are validated independently. Phase 2 exists specifically to ensure that r_parallel and r_finish produce meaningful gradients before combining them with r_perf in Phase 3.
- Complexity is additive, not multiplicative. Each phase adds exactly one new dimension of complexity (multi-task --> reward signals --> RL policy --> multi-agent), making failures easy to diagnose.
Phase 1: Supervised Bootstrapping
Train the base Aegir model on Column Type Annotation (CTA) and Column Property Annotation (CPA) benchmarks with byte-level input. This phase establishes baseline performance and validates the hierarchical architecture on real tabular data.
Objective
Produce a single Aegir checkpoint that achieves competitive F1 scores on standard CTA/CPA benchmarks, operating directly on raw byte sequences (no external tokenizer).
Target Datasets
| Dataset | Task | Tables | Columns | Label Classes |
|---|---|---|---|---|
| SOTAB-CTA | Column Type Annotation | ~50k | ~500k | 91 semantic types |
| GitTables | CTA (large-scale) | ~1.5M | ~15M | Schema.org types |
| WikiTables | CTA/CPA | ~1.7M | ~6M | DBpedia ontology |
Baseline F1 Targets
These targets are based on published results from SOTAB and Retrieve-and-Verify:
| Benchmark | Metric | Target F1 |
|---|---|---|
| SOTAB-CTA (easy) | Macro F1 | > 0.85 |
| SOTAB-CTA (hard) | Macro F1 | > 0.65 |
| SOTAB-CPA | Macro F1 | > 0.75 |
Byte-Level Input
Aegir operates on raw byte sequences (vocab_size=65536 to cover byte values plus special tokens). Tables are serialized into a linear byte stream with role markers distinguishing the target column from context columns.
Dynamic chunking learns tokenization from raw bytes. The RoutingModule in the hierarchical backbone predicts chunk boundaries based on cosine similarity between adjacent hidden states. This means the model discovers its own sub-word units during training, adapting segmentation to the statistics of tabular data rather than relying on a fixed tokenizer trained on natural language.
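The boundary-prediction idea can be sketched in a few lines (a hypothetical simplification: the real RoutingModule in src/aegir/modules/dc.py applies learned projections before the similarity, and this toy version uses raw hidden states):

```python
import torch
import torch.nn.functional as F

def boundary_probs(hidden):
    """Toy boundary predictor: dissimilar neighbouring hidden states suggest
    a chunk boundary. hidden: (B, L, D); returns (B, L-1) probabilities,
    where entry t is the boundary probability between positions t and t+1."""
    cos = F.cosine_similarity(hidden[:, :-1], hidden[:, 1:], dim=-1)  # (B, L-1)
    return ((1.0 - cos) / 2.0).clamp(0.0, 1.0)  # map similarity [-1, 1] -> [0, 1]

h = torch.randn(2, 10, 32)
p = boundary_probs(h)   # (2, 9): one boundary probability per adjacent pair
```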
Serialization Format
Tables are serialized using the format in src/aegir/data/serialization.py:
[CLS] col_name: val1 | val2 | val3 [SEP] ctx_col1: v1 | v2 [SEP] ctx_col2: ...
The target column comes first, followed by context columns selected via MMR (Maximal Marginal Relevance) to maximize diversity while staying within the byte budget.
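A minimal serializer following this format might look as follows (a hypothetical helper for illustration only; the real implementation is src/aegir/data/serialization.py, and the context list is assumed to be pre-ordered by MMR):

```python
def serialize_table(target_name, target_values, context, max_bytes=512):
    """Toy serializer for the documented format.

    context: list of (col_name, values) pairs, already ordered by MMR."""
    parts = [f"[CLS] {target_name}: " + " | ".join(target_values)]
    for name, values in context:
        parts.append(f"[SEP] {name}: " + " | ".join(values))
    return " ".join(parts).encode("utf-8")[:max_bytes]  # enforce the byte budget

seq = serialize_table(
    "city", ["Berlin", "Oslo"],
    [("country", ["DE", "NO"]), ("population", ["3.6M", "0.7M"])],
)
# b'[CLS] city: Berlin | Oslo [SEP] country: DE | NO [SEP] population: 3.6M | 0.7M'
```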
Training Configuration
Single-GPU (Development)
uv run --no-sync python train.py \
--model-size tiny \
--epochs 30 \
--batch-size 32 \
--lr 3e-4
Multi-GPU with DDP
uv run --no-sync torchrun --nproc_per_node=6 train.py \
--model-size small \
--epochs 100 \
--batch-size 64 \
--lr 1e-4
Training uses:
- DDP (DistributedDataParallel) across GPUs
- AMP (Automatic Mixed Precision) with bf16
- Cosine LR schedule with linear warmup
- Load balancing loss adapted from H-Net to regularize dynamic chunking
Model Sizes
| Size | d_model | Layers | Parameters | Use Case |
|---|---|---|---|---|
| tiny | [128, 192, 192] | ~10 | ~2M | Smoke tests, CI |
| small | [256, 384, 384] | ~20 | ~15M | Development, ablations |
| base | [512, 768, 768] | ~40 | ~120M | Benchmark evaluation |
Success Criteria
Phase 1 is complete when:
- The base model meets or exceeds F1 targets on SOTAB-CTA/CPA
- Dynamic chunking converges to stable boundary predictions (no degenerate all-boundary or no-boundary patterns)
- The trained checkpoint can be frozen and used as a specialist in Phase 3
Phase 2: Reward Modeling
Design and validate the three reward components that will drive PARL training in Phase 3. The goal is to ensure each reward signal produces meaningful, non-degenerate gradients before combining them into the full PARL objective.
Reward Components
r_perf – Performance Reward
The primary quality signal. Computed as macro F1 on held-out annotation tasks:
r_perf = macro_F1(predicted_labels, ground_truth)
For CTA, this is the macro-averaged F1 over all 91 semantic type classes. For CPA, it is the macro F1 over property classes.
This reward is straightforward to compute and directly measures what we care about. The challenge is that F1 is non-differentiable, so it must be used as an RL reward signal rather than a supervised loss (which uses cross-entropy as a differentiable proxy).
r_parallel – Load Balancing and Specialist Utilization
Adapted from H-Net’s lb_loss, this reward encourages efficient use of specialists:
r_parallel = -alpha * CV(activation_counts) + beta * utilization_rate
where:
- CV(activation_counts) is the coefficient of variation of specialist activation counts across a batch. Penalizes routing that always sends to the same specialist.
- utilization_rate is the fraction of specialists activated at least once in a batch. Rewards using the full specialist pool.
- alpha, beta are tunable coefficients.
A degenerate router that always activates all specialists or never activates any will score poorly on this component. The reward is maximized when specialists are activated selectively and roughly equally.
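For illustration, the CV and utilization terms can be computed directly from a batch of routing decisions (a plain-Python sketch; the alpha/beta values and the 0/1 mask encoding are illustrative, not the actual implementation):

```python
from statistics import mean, pstdev

def r_parallel(activation_mask, alpha=0.5, beta=0.5):
    """Sketch of the load-balancing reward.

    activation_mask: list of batch rows, each a list of 0/1 specialist
    activations. Balanced, selective routing maximizes the reward."""
    num_specialists = len(activation_mask[0])
    counts = [sum(row[j] for row in activation_mask) for j in range(num_specialists)]
    cv = pstdev(counts) / max(mean(counts), 1e-8)       # coefficient of variation
    utilization = sum(c > 0 for c in counts) / num_specialists
    return -alpha * cv + beta * utilization

# Balanced routing scores higher than sending everything to one specialist:
balanced = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
skewed = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [1, 0, 0]]
```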
r_finish – Completion Quality
Ensures that the swarm produces complete, non-degenerate outputs:
r_finish = coverage_score - degenerate_penalty
where:
- coverage_score measures the fraction of columns in a table that receive an annotation. A table with 10 columns where only 7 are annotated scores 0.7.
- degenerate_penalty fires when the router exhibits trivial strategies: always-on (activating all specialists for every input), always-off (never activating specialists), or constant routing (same activation pattern regardless of input).
Combined Reward
The three components are combined with annealing weights:
r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf
Note that r_perf has no lambda coefficient – it always contributes at full strength. The auxiliary rewards are scaled to be comparable in magnitude to r_perf and then weighted down.
Lambda Annealing Schedule
Following K2.5’s approach, the auxiliary reward weights change over training:
| Training Progress | lambda_1 (parallel) | lambda_2 (finish) | Rationale |
|---|---|---|---|
| 0-30% | 0.3 | 0.1 | Encourage specialist exploration early |
| 30-70% | 0.1 | 0.3 | Shift focus to complete annotations |
| 70-100% | 0.05 | 0.05 | Let accuracy dominate for fine-tuning |
The annealing ensures that early training explores the specialist activation space (high lambda_1), then stabilizes routing toward complete outputs (high lambda_2), and finally optimizes purely for annotation accuracy.
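The stepped schedule above can be written as a small helper (a sketch only; the actual training loop owns the schedule and could interpolate smoothly between phases rather than using hard steps):

```python
def anneal_lambdas(progress):
    """Piecewise lambda schedule from the table above.

    progress: training progress in [0, 1].
    Returns (lambda_parallel, lambda_finish)."""
    if progress < 0.3:
        return 0.3, 0.1    # explore specialist activation
    elif progress < 0.7:
        return 0.1, 0.3    # push toward complete annotations
    return 0.05, 0.05      # let r_perf dominate
```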
Validation Protocol
Before proceeding to Phase 3, each reward component must pass these checks:
- Non-zero gradient flow. The reward signal must produce non-trivial policy gradients through the router. Verified by checking that grad(router.weight) is non-zero after a reward update.
- Correct polarity. Higher quality outputs must produce higher rewards. Verified by comparing rewards on hand-crafted good vs. bad annotation examples.
- Independence. Each component must capture a distinct failure mode. Verified by constructing examples where one component fires but others do not:
  - High r_perf, low r_parallel: accurate but always uses the same specialist
  - High r_parallel, low r_finish: well-balanced routing but incomplete annotations
  - High r_finish, low r_perf: complete annotations but wrong types
- Scale compatibility. All three components should produce values in a comparable range (roughly [0, 1]) to avoid one signal dominating before lambda annealing can take effect.
Phase 3: PARL Training
Train the SwarmOrchestrator using Parallel Agent Reinforcement Learning, following the K2.5 framework (arXiv:2602.02276). The primary model learns to route inputs to frozen specialists and fuse their recurrent states, optimized via token-level clipping RL.
Setup
Primary Model
A fresh Aegir model initialized from the Phase 1 checkpoint. All parameters are trainable. The primary model learns to:
- Process the input table and produce annotations
- Decide which specialists to activate via the SpecialistRouter
- Integrate specialist states through RWKVStateFusion
Frozen Specialists
One or more Phase 1 checkpoints frozen with requires_grad_(False). Each specialist is wrapped in a FrozenSpecialist that:
- Runs forward passes with torch.no_grad()
- Extracts recurrent states from its RWKV layers
- Optionally applies AlignmentProjection if its architecture differs from the primary
Initially, Phase 3 uses a single specialist (the best Phase 1 checkpoint). Additional specialists with different training data or hyperparameters are added incrementally.
Token-Level Clipping RL
K2.5 uses a variant of PPO where the clipping objective is applied at the token level rather than the trajectory level. For each token position t:
L_t = min(
rho_t * A_t,
clip(rho_t, 1-eps, 1+eps) * A_t
)
where:
- rho_t = pi_new(a_t | s_t) / pi_old(a_t | s_t) is the per-token importance ratio
- A_t is the advantage estimate at position t
- eps is the clipping range (starts at 0.2, narrows to 0.1 over training)
Token-level clipping provides finer-grained credit assignment than trajectory-level clipping. For column annotation, this means the router receives distinct gradient signal for each column boundary token, each type-indicative value, and each structural separator.
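The per-token objective above can be sketched directly in PyTorch (a generic PPO-style clipped surrogate for illustration; K2.5's exact variant and its advantage estimator may differ):

```python
import torch

def token_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """Per-token clipped surrogate. Inputs are (B, T) tensors of log-probs
    and advantages; returns a scalar loss to minimize."""
    rho = torch.exp(logp_new - logp_old)                   # per-token importance ratio
    unclipped = rho * advantages
    clipped = torch.clamp(rho, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes the per-token minimum; we minimize its negation.
    return -torch.min(unclipped, clipped).mean()

logp_old = torch.zeros(2, 5)
logp_new = torch.zeros(2, 5, requires_grad=True)
advantages = torch.randn(2, 5)
loss = token_clip_loss(logp_new, logp_old, advantages)
loss.backward()  # every token position receives its own gradient signal
```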
Routing as Action Space
The “action” at each routing decision point is the specialist activation vector:
a = sigmoid(W_router @ h) # continuous in [0, 1]^num_specialists
The policy pi(a | s) is parameterized by the router weights. The RL objective encourages the router to activate specialists when they improve annotation quality and deactivate them when they don’t.
Critical-Steps Optimization
Rather than minimizing total FLOPs or wall-clock time, PARL optimizes the critical path – the longest sequential dependency chain in the computation.
critical_path = max(
primary_forward_time,
max(specialist_forward_times for activated specialists)
)
Specialist forward passes run in parallel (they are independent). The critical path is therefore the maximum of the primary and any single specialist, not the sum. This means:
- Activating additional specialists that run in parallel is free in critical-path terms
- The optimizer penalizes only sequential dependencies (e.g., if the primary must wait for specialist state before proceeding)
- This naturally encourages parallel specialist activation over sequential reasoning in the primary
Training Loop
for each batch:
1. Run primary model through first layer to get routing_hidden
2. Compute specialist activation scores
3. Run activated specialists (parallel, no_grad)
4. Fuse specialist states into primary's recurrent state
5. Complete primary forward pass
6. Compute r_perf from annotation accuracy
7. Compute r_parallel from activation statistics
8. Compute r_finish from annotation completeness
9. Combine rewards with annealed lambdas
10. Compute token-level PPO loss and update primary + router + fusion
Budget-Limited vs Standard Scaling
PARL training alternates between two modes:
Budget-limited phase: The router has a hard cap on the number of specialists it can activate per batch. This encourages selective, high-value routing decisions. The cap starts low (1 specialist) and gradually increases.
Standard scaling phase: No activation cap. The router is free to activate as many specialists as it wants, paying only the r_parallel penalty for inefficient routing. This phase tests whether the router has learned meaningful selectivity.
The alternation prevents the router from converging to a trivial “activate everything” strategy during standard scaling while still allowing it to learn from unrestricted experimentation.
Success Criteria
Phase 3 is complete when:
- The primary model with specialist fusion exceeds the standalone Phase 1 baseline by a meaningful margin (target: +2-5 F1 points on SOTAB-CTA hard split)
- The router activates specialists selectively (not all-on or all-off) and the activation pattern varies with input content
- The lambda annealing schedule produces smooth training curves without reward collapse
Phase 4: Agent Swarm RL
Scale from a single specialist to a full multi-specialist swarm with dynamic spawning, wide/deep search patterns, and adaptive specialist allocation based on table complexity.
Search Patterns
Wide Search – Parallel Column Analysis
Process multiple columns simultaneously by routing them to different specialists:
Table: [col_A, col_B, col_C, col_D, col_E]
Specialist 0 (geographic): col_A, col_C
Specialist 1 (temporal): col_B
Specialist 2 (numeric): col_D, col_E
Primary: all columns (final fusion)
Each specialist processes its assigned columns in parallel. The primary model receives fused states from all specialists and makes the final annotation decision. Wide search scales annotation throughput linearly with the number of specialists, bounded by the critical path of the slowest specialist.
Deep Search – Hierarchical Type Reasoning
For ambiguous columns, chain multiple specialists in sequence to progressively refine the type prediction:
Column: "Springfield" (city? state? person name?)
Step 1: Specialist 0 (general) --> geographic entity (0.6) | person name (0.3)
Step 2: Specialist 3 (geographic) --> city (0.7) | administrative region (0.2)
Step 3: Primary --> city (final, high confidence)
Deep search trades latency for accuracy on hard cases. The orchestrator learns when to invoke additional reasoning steps by monitoring the confidence of intermediate predictions.
Combined Wide-Deep
For complex tables, the orchestrator can combine both patterns: wide search across easy columns (one specialist pass each) and deep search on ambiguous columns (multiple specialist passes). The PARL reward structure naturally encourages this: r_parallel rewards wide parallelism, r_perf rewards deep accuracy, and critical-steps optimization keeps the overall latency bounded.
Dynamic Specialist Spawning
Rather than a fixed specialist pool, Phase 4 introduces dynamic spawning based on table complexity signals:
complexity = f(num_columns, label_entropy, column_diversity)
if complexity < threshold_low:
activate 0-1 specialists (primary handles it alone)
elif complexity < threshold_high:
activate 2-3 specialists (wide search)
else:
activate N specialists + enable deep search
The complexity estimator is a lightweight head on the primary model’s first-layer output. It learns to predict how much specialist assistance a given table requires.
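One plausible shape for such a head is a single linear layer on mean-pooled hidden states (ComplexityHead, specialists_for, and all thresholds here are hypothetical illustrations, not the planned implementation):

```python
import torch
import torch.nn as nn

class ComplexityHead(nn.Module):
    """Sketch of a lightweight complexity estimator: a linear head on
    mean-pooled first-layer hidden states, squashed to a score in (0, 1)."""
    def __init__(self, d_model):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)

    def forward(self, hidden):                               # hidden: (B, L, D)
        pooled = hidden.mean(dim=1)                          # (B, D)
        return torch.sigmoid(self.proj(pooled)).squeeze(-1)  # (B,)

def specialists_for(complexity, lo=0.3, hi=0.7, n_max=8):
    """Map a complexity score to a specialist budget (thresholds illustrative)."""
    if complexity < lo:
        return 1        # primary handles it (almost) alone
    elif complexity < hi:
        return 3        # wide search
    return n_max        # wide search + deep search

head = ComplexityHead(d_model=32)
scores = head(torch.randn(4, 10, 32))   # (4,): one complexity score per table
```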
Specialist Pool Management
- Warm pool: Pre-loaded specialists kept on GPU memory, ready for immediate activation.
- Cold pool: Specialists on CPU/disk, loaded on demand for rare table types.
- Spawn budget: Maximum number of active specialists at any time, set by available GPU memory.
Expected Scaling Characteristics
State Fusion Cost
With N specialists, each contributing L layers of state with shape (B, H, K, V):
Fusion FLOPs per layer:
weighted_sum: O(N * H * K * V) -- linear in N
gated: O(N * H * K * V) -- linear in N
concat_project: O(N^2 * H * K^2 * V^2) -- quadratic in N (due to projection weight size)
For the weighted_sum mode (recommended for swarms), fusion cost grows linearly with specialist count and is negligible compared to the specialist forward passes themselves.
Throughput Scaling
| Specialists | Expected Throughput | Expected Accuracy | Notes |
|---|---|---|---|
| 0 (primary only) | 1.0x baseline | Phase 1 F1 | No overhead |
| 1 | ~0.9x (routing overhead) | +2-5 F1 | Phase 3 result |
| 3 | ~0.85x | +5-10 F1 | Wide search on typical tables |
| 8+ | ~0.7x | +8-15 F1 | Wide+deep on complex tables |
Throughput decreases reflect routing overhead and state fusion cost. The critical-path optimization means that parallel specialists do not compound latency, so throughput degradation is sublinear in the number of specialists.
Memory Scaling
Each frozen specialist consumes GPU memory for its parameters but no optimizer state (frozen). The primary model requires both parameters and optimizer state.
Memory per specialist: ~model_params * sizeof(dtype)
Memory for primary: ~3x model_params * sizeof(dtype) (params + grad + optimizer)
Memory for fusion: negligible (O(H * K * V) parameters)
With bf16 and a 120M-parameter base model, each specialist costs ~240MB. A 6x RTX 4090 setup (144GB total) can support approximately 8-10 specialists alongside the primary model and optimizer state.
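The per-specialist figure is a quick arithmetic check (this sketch counts parameter memory only; the 8-10 specialist budget additionally has to absorb activations, fused states, and CUDA overhead, which dominate during training):

```python
def param_memory_gb(num_params, bytes_per_param=2):
    """Parameter memory only (bf16 = 2 bytes per parameter). Activations and
    framework overhead are deliberately not counted here."""
    return num_params * bytes_per_param / 1e9

specialist_gb = param_memory_gb(120e6)   # ~0.24 GB per frozen 120M specialist
primary_gb = 3 * param_memory_gb(120e6)  # ~0.72 GB: params + grads + optimizer
```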
Success Criteria
Phase 4 is complete when:
- The swarm demonstrates measurable accuracy gains from adding specialists beyond the Phase 3 single-specialist result
- Wide search provides throughput-proportional accuracy gains on easy tables
- Deep search provides accuracy gains on the hardest SOTAB-CTA classes (those below 0.5 F1 in Phase 1)
- Dynamic spawning correctly allocates more specialists to complex tables and fewer to simple ones
Development Guide
Building and Running
Critical: Always Use --no-sync
uv run --no-sync python main.py
The --no-sync flag prevents uv from re-resolving and reinstalling dependencies before running. This is required because flash-attn, flash-linear-attention (fla), mamba-ssm, and causal-conv1d are patched CUDA extensions that were built manually with corrected CXX11 ABI flags. Running uv run without --no-sync will clobber these patched builds with incompatible PyPI wheels.
Smoke Tests
# Model instantiation and forward pass shapes
uv run --no-sync python main.py
# Training loop validation (tiny model, synthetic data)
uv run --no-sync python train.py --smoke-test --model-size tiny --epochs 3
Multi-GPU Training
# 6x RTX 4090 training
uv run --no-sync torchrun --nproc_per_node=6 train.py \
--model-size small \
--epochs 100 \
--batch-size 64 \
--lr 1e-4
Training uses DDP (DistributedDataParallel), AMP with bf16, cosine LR schedule with linear warmup, and load balancing loss for dynamic chunking regularization.
CUDA Extension Build Notes
The devenv/Nix environment provides GCC 15, which sets _GLIBCXX_USE_CXX11_ABI=1. However, PyTorch’s cu124 wheels are built with _GLIBCXX_USE_CXX11_ABI=0. This ABI mismatch causes segfaults when CUDA extensions link against the wrong ABI.
Patching Procedure
Both mamba-ssm and flash-attn have a CachedWheelsCommand in their setup.py that downloads prebuilt wheels from GitHub releases, bypassing local compilation. To force a local build with the correct ABI:
1. Set environment variables to force a local build:
   export MAMBA_FORCE_BUILD=TRUE
   export FLASH_ATTENTION_FORCE_BUILD=TRUE
2. Use env -i with system GCC-11 to get the correct ABI:
   env -i PATH=/usr/bin:$PATH HOME=$HOME \
     pip install --no-build-isolation /tmp/mamba_src/mamba_ssm-2.3.1/
3. Patch setup.py in each extension to add an explicit _abi_flag matching torch's ABI.
Patched source trees are kept in /tmp/mamba_src/ and /tmp/flash_src/. See docs/notes/2026-03-28/010808_deps_smoke_train.md for the full step-by-step procedure.
Verifying the Build
After patching, verify that the extensions load correctly:
uv run --no-sync python -c "import mamba_ssm; print('mamba-ssm OK')"
uv run --no-sync python -c "import flash_attn; print('flash-attn OK')"
uv run --no-sync python -c "from fla.ops.rwkv7 import chunk_rwkv7; print('fla OK')"
Adding New Block Types
The architecture supports mixed block types (Mamba2, MHA, RWKV-7, RWKV-8 ROSA) within a single model. To add a new block type:
1. Implement the Mixer Class
Create a new module that implements three methods:
class MyNewMixer(nn.Module):
def forward(self, hidden_states, inference_params=None, **kwargs):
"""Full-sequence forward pass. Input: (B, L, D). Output: (B, L, D)."""
...
def step(self, hidden_states, inference_params):
"""Single-token autoregressive step. Input: (B, 1, D). Output: (B, 1, D)."""
...
def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
"""Allocate KV cache or recurrent state for inference."""
...
2. Register in create_block()
Add the new type to src/aegir/modules/block.py:
def create_block(arch, d_model, ...):
if arch in ("x", "X"): # new block type code
from my_module import MyNewMixer
mixer_cls = partial(MyNewMixer, **factory_kwargs, layer_idx=layer_idx)
...
Convention: lowercase letter = mixer only (no MLP), uppercase = mixer + SwiGLU MLP.
3. Add to Isotropic Forward Loop
In src/aegir/modules/isotropic.py, add the new block type to:
- The regex pattern that parses layout strings:
  layout_parse = re.findall(r"([mMtTrRwWxX])(\d+)", arch_layout)
- The forward loop's block-type dispatch:
  elif arch in ("x", "X"):
      layer_mixer_kwargs = {}  # or whatever kwargs your mixer needs
      if hidden_states.dim() == 2:
          hidden_states = hidden_states.unsqueeze(0)
          residual = None if residual is None else residual.unsqueeze(0)
4. Test
# Verify the new block type instantiates and runs
uv run --no-sync python main.py
Project Structure
aegir/
main.py -- Smoke tests
train.py -- Training script (DDP, AMP, cosine LR)
src/aegir/
models/
config.py -- AegirConfig, SSMConfig, AttnConfig, RWKVConfig
aegir.py -- Recursive hierarchical backbone
heads.py -- AegirForCausalLM, AegirForColumnAnnotation
modules/
block.py -- Block factory (create_block)
isotropic.py -- Flat block stack with mixed types
dc.py -- Dynamic chunking (RoutingModule, ChunkLayer, DeChunkLayer)
rwkv7_tmix.py -- RWKV-7 full TimeMix (fla kernels)
rwkv.py -- RWKV-8 ROSA time mixing + relu^2 channel mixing
rosa.py -- ROSA suffix automaton (CPU-based)
mlp.py -- SwiGLU MLP
swarm/
state_fusion.py -- RWKVStateFusion (3 modes)
alignment.py -- AlignmentProjection (cross-agent state mapping)
specialist.py -- FrozenSpecialist wrapper
orchestrator.py -- SwarmOrchestrator (K2.5 PARL)
data/
serialization.py -- Table-to-byte-sequence serialization
context_select.py -- MMR context column selection
table_dataset.py -- PyTorch dataset for table benchmarks
utils/
train.py -- Load balancing loss, F1 metrics, param grouping
docs/ -- mdbook documentation (this book)
ref/ -- Reference papers
Documentation
Build and serve the documentation locally:
mdbook build docs/
mdbook serve docs/ # serves at http://localhost:3000
The documentation uses mdbook with katex (math), mermaid (diagrams), and d2 (architecture diagrams) plugins, all provisioned by devenv.