Introduction

Aegir is a hierarchical byte-level sequence model for semantic annotation of relational data. Given one or more tables, it predicts the semantic type of each column (Column Type Annotation, CTA), the relationships between columns (Column Property Annotation, CPA), and the cross-table groupings that constitute coherent real-world data elements — for example, a PaymentCard data element spanning card_number, expiry, and cardholder columns across billing, transaction, and customer tables. The model is paired with the Signals Data Governance (SDG) ontology — a BFO 2020 / CCO-grounded, HermiT-validated OWL artifact whose classes are the CTA/CPA annotation vocabulary; together they constitute a closed loop between the model and the structured-knowledge representation it learns from.

Two coupled research outputs

The project produces two outputs that are cited together:

  1. A hierarchical byte-level sequence model. All-RWKV-7 time-mixing with H-Net dynamic chunking, trained byte-level on a mixed corpus and fine-tuned for the column-annotation tasks above. The architecture is described in Architecture; the operational pretraining work is described in Pretraining and Training Regime.

  2. An ontology-grounded synthetic corpus and the SDG ontology that generates it. A BFO/CCO-grounded domain ontology, content-derived from FinePDFs and realized to a HermiT-validated OWL artifact, drives LLM generation of deterministically-verifiable, attribution-clean textbook chapters and a relational DDL spine. The corpus is byte-level pretraining data and an independent publishable deliverable (the corpora/ submodule). The ontology, its quality gates, and the disposal membranes that enforce them are documented in the SDG ontology chapter and the Ontology Authors Guide.

The two outputs share substrate — the SDG ontology, the family catalog of Manchester-syntax templates, the verbalization pipeline — and are coupled downstream: the ontology produces verbalized, verified chapters that feed back into the byte-level pretraining corpus. The ontology is treated as a primary research output in its own right, not as plumbing; its rigor program is described below.

The ontology rigor program

The SDG ontology is governed by propose / dispose: an agent proposes axioms and a stack of deterministic membranes disposes — admitting only what is well-formed (a Manchester/OWLAPI parse membrane), logically consistent under the reasoner (HermiT, with CCO imported as a reasoning authority so grounding is validated against CCO’s disjointness axioms), and ontologically clean (an OntoClean meta-property membrane). The two strongest membranes are un-fakeable: you cannot talk past a contradiction or an anti-rigidity violation. Quality is measured by a metric suite — IOF-derived rigor dimensions (definitional_completeness, bfo_grounded, realizable_machinery, def_annotation_coverage), field-standard OntoQA/OQuaRE structural metrics, and OntoClean taxonomic-correctness proxies — and a formal OQuaRE publish gate (six SQuaRE characteristics; floors of oquare_aggregate ≥ 3.5 and FunctionalAdequacy ≥ 3.0 plus HermiT consistency), wired hard into the publish path so a regression cannot ship. The Authors Guide is the canonical reference for every metric, band, and threshold.
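The publish gate described above can be pictured as a pure function over the measured values. The following is a minimal sketch, not the project's gate code: the two floors and the HermiT consistency requirement come from the text, while `GateInput`, `publish_gate`, and the reason strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Floors quoted in the text above; every name below is illustrative.
OQUARE_AGGREGATE_FLOOR = 3.5
FUNCTIONAL_ADEQUACY_FLOOR = 3.0


@dataclass(frozen=True)
class GateInput:
    oquare_aggregate: float
    functional_adequacy: float
    hermit_consistent: bool


def publish_gate(m):
    """Return (passed, reasons). Every failed check is reported, not only the first."""
    reasons = []
    if not m.hermit_consistent:
        reasons.append("inconsistent under HermiT")
    if m.oquare_aggregate < OQUARE_AGGREGATE_FLOOR:
        reasons.append("oquare_aggregate below floor")
    if m.functional_adequacy < FUNCTIONAL_ADEQUACY_FLOOR:
        reasons.append("FunctionalAdequacy below floor")
    return (not reasons, reasons)
```

Returning the full list of reasons rather than a bare boolean fits the propose / dispose framing: a rejected artifact comes back with the specific floors it missed.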

There is also an in-flight RLVR research sub-track — a language-model policy trained with Group Relative Policy Optimization (GRPO) against a deterministic four-component verifier R(O, I) over OWL compositions — documented in the concept brief and the RLVR chapter. It shares the catalog and the verbalization pipeline with the rest of the ontology track but is distinct from the production realization-and-gate path above; its status is tracked in EVIDENCE.md.

Problem setting

Enterprise data warehouses contain thousands of tables with columns whose meaning is often opaque: generic names (col0, field_42), inconsistent conventions across teams, no machine-readable metadata. Understanding what each column represents — and which columns across different tables refer to the same real-world concept — is foundational to data governance, privacy compliance, and integration.

Two families of prior approaches exist. Pattern and heuristic methods identify column types through regex detectors, name matching, embedding similarity, and gradient-boosted classifiers on hand-engineered features. They work well for structurally distinct types but struggle with confusable pairs — columns whose value distributions are nearly identical but whose semantic types differ (advertising IDs versus GUIDs, bank account numbers versus payment card numbers). They also require manual enumeration of data-element patterns and do not generalize to novel relationship types. Learned sequence models — DODUO, RECA, REVEAL — treat the table as a token sequence and classify columns via fine-tuned transformers. REVEAL’s central insight is that context-column selection matters: choosing the right neighboring columns (via MMR diversity sampling) materially improves annotation accuracy. These models operate on single tables in isolation and use fixed subword tokenizers that fragment tabular data unpredictably.

Aegir bridges the two families. It is trained byte-level — no fixed tokenizer — and is designed to be deployed in situ alongside evidence-based classification pipelines: consuming the same serialized table representations as the surrounding stack, but learning cross-column and cross-table relationships end-to-end rather than relying on enumerated patterns. Standard column-annotation benchmarks remain reference points — SOTAB (Schema.org types over web tables), GitTables (large-scale column type detection across 1M+ CSV tables from GitHub), and WikiTables (Wikipedia HTML tables) — but the project’s own thesis is that the ontology is load-bearing for property-prediction (CPA) while it trades raw web-table distribution alignment, so the headline evaluation targets relational / data-element skill rather than assuming a single web-CTA benchmark is the goal (see Roadmap and EVIDENCE.md).

Methodological contributions

Algorithms. Byte-level dynamic chunking as differentiable tokenization: a routing module predicts boundary probabilities from consecutive hidden-state cosine similarity, and chunk representatives propagate to the next hierarchical stage. The H-Net primitive treats tokenization as a learned property of the architecture rather than a fixed preprocessing decision. Chunked-mode RWKV-7 time-mixing through flash-linear-attention Triton kernels provides constant-state recurrent computation with parallel training throughput. The ontology side contributes a propose / dispose discipline in which rigor is enforced by deterministic membranes (parse → HermiT/CCO → OntoClean) rather than asserted, and measured by an IOF-anchored OQuaRE gate — an intrinsic, reasoner-validated quality signal rather than a judge-mediated one.

Architecture. A recursive hierarchy in which each stage selects its own block-type mix from a block factory. Every current default arch_layout uses RWKV-7 time-mix at every stage; ROSA (an RWKV-8 suffix automaton for exact substring retrieval) and Mamba-2 SSD are additional block codes that the factory supports for hybrid configurations and ablations. The recursion alternates encoding, dynamic chunking, recursive inner processing, EMA dechunking, and decoding; the recurrent state at every RWKV-7 stage is constant in sequence length. This makes the recurrent state a fixed-size object that can be serialized, transmitted, and algebraically combined across agents, which is the substrate for the multi-agent state-fusion infrastructure described in Agent Swarm.

Objectives. The system pursues two complementary objectives. The first is competitive accuracy on column-annotation tasks, with an emphasis on relational / data-element skill — the application of the byte-level model. The second is a measurably rigorous SDG ontology and the verifiable, attribution-clean corpus it generates — evaluated by intrinsic, reasoner-validated gates rather than by a judge. The two objectives share the SDG ontology and the family catalog as substrate; the second feeds the first downstream as pretraining data.

Reading this document

  • A reader interested in the model architecture should read Architecture and its sub-pages on RWKV-7 time mixing, dynamic chunking, ROSA, and the block factory.
  • A reader interested in the ontology, its rigor program, and how to extend it should read the SDG ontology chapter and the Ontology Authors Guide; the in-flight RLVR sub-track is in the concept brief and the RLVR chapter.
  • A reader interested in byte-level pretraining and downstream fine-tuning should read Pretraining and Training Regime.
  • A reader interested in who the system is built for and what workflows it commits to should read Personas.
  • A reader interested in the operational milestones and what has been delivered should read Roadmap.

The development guide and the worktree-aware development chapter cover operational concerns and the CUDA-extension build path that Aegir’s runtime depends on.

Personas

This document names the people the Aegir system is built for and the workflows each one needs to complete. It is the gate on the project’s BDD scenarios — every feature file in features/ should identify the persona(s) it serves, and no feature file is considered done until its scenarios describe a workflow recognizable to that persona.

Why persona-first

A scenario that exercises a system component (does the WAL line have the right fields? does the API return 200?) is a wire test. A scenario that exercises a user doing something is a workflow test. The two are not interchangeable. Wire tests catch regressions in plumbing; workflow tests describe the surface the system commits to producing. The Aegir feature suite as of the last project audit consisted almost entirely of wire tests — useful as a regression net, silent on whether anyone could actually use the system to accomplish anything.

This document defines four personas. Each one has a concrete role, a specific surface they interact with, a workflow they need to be able to complete, and a definition of “success.” Feature files derive from personas: when a scenario is written, it should be possible to point at a persona section below and say “this is the workflow that scenario covers.”

Three of the personas are practitioners who iterate on the Aegir-grounded data-governance loop in production. The fourth is an external reviewer whose own published work introduced primitives Aegir uses or extends. The first three want to get their answer; the fourth wants to know whether the answer is defensible. The two framings are complementary; the system needs to serve both.

The personas are intentionally archetypal — not modeled on specific individuals at any organization. Where individuals exist that resemble these archetypes (and they do, in this project’s actual reviewer pool), they remain unnamed by design.

Persona 1 — Data-governance director

Role. Owns the data-governance program at an enterprise. Reports to the CDO (or equivalent). Final approval on what tags get applied to production tables, what compliance constraints get enforced, and what vocabulary the organization formally commits to in its Business Glossary. Accountable upward for governance posture; accountable downward for the catalog.

Surface. The Aegir UI’s leaderboard, run-detail, and coverage views; the published Atlas Business Glossary; the SDG catalog (rendered as templates with cross-context cousining annotations); compliance and provenance reports lifted from lineage.

Daily workflow. Reviews the coverage report from the latest training run (what fraction of the in-scope warehouse can be tagged with the current catalog, where are the systematic gaps). Examines failure clusters that the steward (Persona 2) escalates from day-to-day operations. Approves or revises proposed catalog deltas — new templates, term additions, term revisions. Verifies that approved terms land in the Atlas Business Glossary with their SKOS hierarchies and isA edges to the SDG classification types. Signs off on production tag deployments.

Definition of success. She can articulate to her CDO, in language the CDO understands, what the system tags well, what it tags badly, and what is being done about the latter. She has a defensible audit trail for every term that appears in the published Business Glossary: who proposed it, what evidence motivated the proposal, what catalog templates use it, and which production tags depend on it.

What must never happen. A tag of unknown provenance appearing on a production table. A template addition that bypasses her review. A coverage gap that nobody investigates within an iteration. Drift between the published Business Glossary and the operational SDG catalog.

Persona 2 — Data steward

Role. Day-to-day operator of the data-governance pipeline. Works for the director. Lives in the system in the way production-database administrators live in their databases: this is where the real work gets done.

Surface. The Aegir UI’s run-detail pages (CTA results, failure cases, per-table breakdowns); the lineage trace view (cta_failure_trace, sample_provenance workflows); the sampling tools (sampling_strategy); the corpus-gap proposal tool (corpus_gap_proposal); the SDG catalog with template-minimality guidance.

Daily workflow. Receives an escalation from a downstream consumer: column X on table Y is tagged wrong, here’s why we think so. Opens the run-detail page, locates the failure case in the held-out evaluation or production run. Initiates a CTA failure trace: which compositions did the model see that involved the concepts at issue? Which catalog templates produced those compositions? Which input-pool documents shaped R_D (topic alignment) for those compositions? Identifies a probable cause — a missing template that should distinguish the two concepts, an undersampled corpus region that left the model without a clear contrastive signal, or a slot-fill pattern the catalog doesn’t yet express. Proposes a fix: either a corpus addition (sample more documents of type Z under a stated regex / filter rule) or a template revision (under template-minimality discipline — propose composition of existing templates before authoring new ones). Submits the proposal to the director for approval. Once approved, runs the curation iteration: sample, fine-tune the warm-start, evaluate against the held-out set, compare to the prior run’s R distribution. Reports back: did the fix close the failure?

Definition of success. They can take any failure case and produce a defensible explanation plus a defensible fix proposal, with the artifacts (lineage trace, sampling rules, proposed catalog delta) attached and the proposal traceable through the director’s review back to a production tag change. They can defend the template-minimality discipline: every new template they propose is accompanied by an argument for why the existing catalog could not express the distinction.

What must never happen. A failure case with no traceable provenance (the lineage substrate exists to prevent this). A retrain that doesn’t move the needle and the steward can’t tell why. Catalog or corpus changes that drift outside the locked-artifacts discipline. A fix that closes the originating failure but introduces a worse one elsewhere with no early warning.

Persona 3 — Embedded ML engineer

Role. Works at a downstream consumer of Aegir. Owns the production pipeline that integrates Aegir’s output — CTA/CPA tags, the SAE feature dictionary, the trained checkpoint — into the consumer’s broader system (DST evidence fusion, enterprise classification pipelines, governance tooling). Their stability is downstream of Aegir’s stability; their reproducibility is downstream of Aegir’s reproducibility.

Surface. The Aegir gateway API (/api/leaderboard, run-detail endpoints, plot endpoints); the locked-artifacts table (catalog version, weights hash, null-statistics hash); the SAE feature dictionary when it is stable enough to cite; run-metadata sidecars (what catalog + weights produced this checkpoint); lineage events emitted to Atlas (canonical) and Marquez (OL push for compatibility).

Iteration workflow. Pulls a new Aegir checkpoint into their integration test harness. Checks the locked-artifacts hashes against their pinned baseline. If hashes have changed, runs their downstream tests to see if behavior changed. If new SAE features have been surfaced, evaluates whether they are stable enough across runs to cite from the consumer’s own code. Reports back to the Aegir team when a regression is attributable to an Aegir change: your run X shifted SAE feature 4087’s mean activation on banking-PII columns by 35%, which broke our downstream classifier’s confidence calibration; here is the diff.

Definition of success. They can pin to a known Aegir state identified by the four-hash quadruple (catalog_version, locked_weights_hash, null_stats_hash, run_id), detect when that state has changed, and reproduce any cited result from the hashes alone — without coordinating with the Aegir team. The gateway responses they depend on are versioned; the SAE features they cite are stable across checkpoint refresh.
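A consumer-side pin check along these lines is one way to picture the workflow. It is a sketch under assumptions: the four field names are the ones quoted above, but `AegirState`, `changed_fields`, and the placeholder values are hypothetical and not part of any Aegir API.

```python
from typing import NamedTuple


class AegirState(NamedTuple):
    """The four-part state identifier named in the text; the class itself is illustrative."""
    catalog_version: str
    locked_weights_hash: str
    null_stats_hash: str
    run_id: str


def changed_fields(pinned, current):
    """List the fields that differ between the pinned baseline and a new checkpoint."""
    return [name for name in AegirState._fields
            if getattr(pinned, name) != getattr(current, name)]


# Placeholder values for illustration only.
pinned = AegirState("catalog-A", "weights-A", "nullstats-A", "run-1")
current = AegirState("catalog-A", "weights-B", "nullstats-A", "run-2")
drift = changed_fields(pinned, current)
```

An empty result means the pinned state is unchanged; a non-empty one names exactly which artifact moved, which is the trigger for re-running the downstream tests.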

What must never happen. Aegir making a “small change” that silently shifts SAE feature semantics. Locked hashes that fail to capture some relevant aspect of state (catalog drift behind a stable hash). Gateway responses that change shape without either a major-version bump or a backwards-compatibility shim.

Persona 4 — Peer methods reviewer

Role. External researcher whose own published work introduced primitives that Aegir uses or extends: sequence modeling architectures (RWKV variants, H-Net dynamic chunking, xLSTM, related families), reinforcement-learning methodology (GRPO and successors, the broader RLVR program), sparse-autoencoder interpretability methods, or adjacent methodology. Reviews the project as a collaborator-critic, not as a gatekeeper. The label “peer” is deliberate: it positions the project at the same methodological table as the reviewer, with the implicit invitation to give the feedback that would be given to a peer’s work rather than the deference owed to a senior authority’s work.

Surface. The Ontology Authors Guide (the canonical reference for every ontology metric, gate, and membrane); the authoritative reference in production_state.md and the concept brief in concept_brief.md for the in-flight RLVR sub-track; the realized ontology artifact (corpora/ontology/sdg-ontology.{omn,owl}) and its HERMIT_CERTIFICATE.md; EVIDENCE.md (the pre-registered claims ledger); the locked C1 test set, held-out 50, and null-statistics snapshot; the verification gates (catalog schema check, the OQuaRE publish gate, the parse / HermiT / OntoClean disposal membranes, C1 AUC regeneration, verifier determinism, end-to-end scaffold); the repository itself (clone and re-run).

Review workflow. Reads the Authors Guide and the authoritative reference. Identifies the load-bearing claims. On the ontology rigor program (the primary, shipped track): that the realized SDG ontology clears the pre-registered objectives in EVIDENCE.md, namely OQ-Rigor (definitional_completeness ≥ 0.45, realizable_machinery > 0) and OQ-Structure (bfo_grounded ≥ 0.95, def_annotation_coverage ≥ 0.90, ar > 0, oquare_aggregate ≥ 3.5), with zero unsatisfiable classes under HermiT; and that the disposal membranes (parse → HermiT/CCO → OntoClean) are un-fakeable. On the in-flight RLVR sub-track: verifier discrimination on C1 (AUC 0.9956, mean R-separation 0.336), held-out 50 separation (0.5129), and the policy claim that GRPO can produce compositions whose R-distribution exceeds prompt-evolved and human-authored baselines.

Picks one claim and tries to reproduce it from the repository alone, with no email to the authors. If reproduction works, asks the second-order questions.

For the rigor program:

  • Do the metrics regenerate: does scripts/ontology_metrology.py on the realized .owl produce the reported numbers, and does scripts/ontology_oquare.py return GREEN against the certificate?
  • Are the gates actually un-fakeable: does an injected contradiction or anti-rigid-over-rigid subsumption get rejected with a reason?

For the RLVR sub-track:

  • Are baselines visible: what does the R-distribution look like under a prompt-evolved policy? Under a random-sampling policy? Under no constraint at all?
  • Is each verifier component validated independently: could R_B be removed without losing meaningful discrimination? What does R_D alone discriminate?
  • Is the experimental design honestly bounded: are the claims stated at the confidence level the data supports?
  • Does the warm-start choice survive comparison: is Option A’s rejection-sampling + SFT compared head-to-head against Option B’s Instruct-paired baseline, or against Option C’s on-policy SDFT alternative, under matched conditions?

Definition of success. They can reproduce any claim from the repository alone. They can swap one verifier component out and re-run the C1 sweep without rebuilding the catalog. The project’s claims are bounded honestly — no overreach beyond what the experiments support — and any limitation they raise is either addressed in a follow-up or named in § 8 (Limitations) of the authoritative reference. They leave a collaborative critique that the team can act on, not a binary thumbs-up/thumbs-down.

What must never happen. A claim with no traceable experiment behind it. An ablation that “would” work but hasn’t been run, presented as if it had. Baselines that aren’t visible in the doc or the repo. Reproducibility that requires emailing the authors. The team treating the review as a gatekeeping audit rather than collaborative critique (which is on the project to invite, not on the reviewer to volunteer).

Workflow features → personas

The table below maps the seven user-facing workflow features and the three reviewer-facing methodology features to the personas they primarily serve. Where a feature serves multiple personas, the primary owner is marked primary; secondary owners cite the feature when their workflow touches it.

| Feature | Director | Steward | ML eng. | Reviewer |
|---|---|---|---|---|
| cta_failure_trace | reads clusters | primary | regression check | provenance audit |
| vocabulary_coverage | primary | informs proposals | pins consumer scope | claim audit |
| vocabulary_subsumption | primary | proposes terms | | structural-claim audit |
| sample_provenance | | primary | citation pins | reproducibility audit |
| sampling_strategy | | primary | | reproducibility audit |
| sae_vocabulary_alignment | term-promotion | | primary | interpretability audit |
| corpus_gap_proposal | approves | primary | | proposal-logic audit |
| reproducibility | | | regression baseline | primary |
| ablation_surface | | | | primary |
| baseline_comparison | reads summary | | | primary |

A feature with no entry in the column of the persona it claims to serve is misplaced: either the persona is wrong or the feature is actually a wire test in disguise.

Conventions for feature files

When writing a .feature file under features/governance/ (or any sibling cluster derived from these personas), the Background section should reference the relevant persona by role rather than by the formal label:

Feature: CTA failure trace
  As a data steward investigating a misclassification on a
  production table, I need to follow the lineage chain from
  the failing column back through compositions, catalog
  templates, and input-pool documents — so that I can
  produce a defensible explanation and a fix proposal in
  the same iteration.

This reads more plainly than “As the Data Steward (Persona 2)”. The formal label belongs in this document; the feature files reference the workflow.

For reviewer-facing features, the same convention applies:

Feature: C1 sweep reproducibility
  As an external methods reviewer auditing the verifier's
  discrimination claim, I need to regenerate the C1 sweep
  AUC from the committed test set and locked weights, on a
  fresh clone of the repository, with no coordination with
  the project team.

Cross-references


This document gates the feature files in features/governance/ and any subsequent feature cluster that claims to test a user workflow. Adding a persona, retiring one, or revising a persona’s definition of success is a load-bearing change and should be reviewed alongside the affected feature files.

Architecture Overview

Aegir is a recursive hierarchical sequence model. At the top level, it processes raw byte sequences through nested stages of encoding, dynamic chunking, inner processing, dechunking, and decoding. Each stage can use a different hidden dimension and a different mix of block types.

Recursive Hierarchy

The architecture is defined by a nested list called arch_layout. For example:

arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
d_model     = [128,   192,   192]

This defines three stages (depth 0, 1, 2):

| Stage | Role | Layout | Dimension |
|---|---|---|---|
| 0 | Outermost encoder/decoder | "w2" / "w2" | 128 |
| 1 | Middle encoder/decoder | "w2" / "w2" | 192 |
| 2 | Innermost (main) | "w4" | 192 |

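At each non-innermost stage, the data flow is: encoder → routing → chunk → main network (the next stage, recursively) → dechunk → residual → decoder.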

At the innermost stage, only the main network runs (no chunking). The recursion bottoms out at a flat Isotropic block stack.
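The stage table can be derived mechanically from the nested list. The sketch below assumes the nesting convention visible in the example (a non-innermost level is `[encoder, inner_layout, decoder]` and the innermost level is a one-element list); `describe_stages` is a hypothetical helper, not part of Aegir.

```python
def describe_stages(arch_layout, d_model):
    """Flatten a nested arch_layout into one record per stage.

    Assumed convention: [encoder, inner, decoder] for outer stages,
    [main] for the innermost stage.
    """
    stages = []
    layout, depth = arch_layout, 0
    while len(layout) == 3:
        encoder, inner, decoder = layout
        stages.append({"stage": depth, "layout": (encoder, decoder), "dim": d_model[depth]})
        layout, depth = inner, depth + 1
    (main,) = layout  # innermost stage: a single flat block stack
    stages.append({"stage": depth, "layout": (main,), "dim": d_model[depth]})
    return stages


arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]
d_model = [128, 192, 192]
stages = describe_stages(arch_layout, d_model)
```

For the example layout this yields three records, matching the table: two encoder/decoder stages and one innermost main stage.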

Data Flow in Detail

  1. Encoder: A flat stack of blocks (e.g., 2 RWKV-7 blocks) processes the full-resolution sequence.
  2. Routing: RoutingModule predicts boundary probabilities via cosine similarity. Tokens at predicted boundaries are selected as chunk representatives.
  3. Chunk: ChunkLayer downsamples by keeping only boundary tokens, producing a shorter sequence.
  4. Main network: The shorter sequence is processed by the next hierarchy level – which may itself contain encoding, chunking, and another level of recursion.
  5. Dechunk: DeChunkLayer reconstructs the full-length sequence via an EMA scan, blending chunk outputs back into non-boundary positions.
  6. Residual: A skip connection around the entire chunk/process/dechunk block, gated via straight-through estimation of the routing probabilities.
  7. Decoder: Another flat stack of blocks processes the reconstructed sequence.
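The order of operations across the recursion can be traced without any tensors. This is a structural sketch only: `trace_flow` is a hypothetical helper, it assumes the same `[encoder, inner, decoder]` nesting as the `arch_layout` example, and it records step names rather than running any block.

```python
def trace_flow(layout, depth=0):
    """Return the sequence of steps the hierarchy performs, outermost stage first."""
    if len(layout) == 1:
        # Innermost stage: only the main network runs, with no chunking.
        return [f"stage {depth}: main {layout[0]}"]
    encoder, inner, decoder = layout
    before = [f"stage {depth}: encoder {encoder}",
              f"stage {depth}: routing",
              f"stage {depth}: chunk"]
    after = [f"stage {depth}: dechunk",
             f"stage {depth}: residual",
             f"stage {depth}: decoder {decoder}"]
    # The "main network" of this stage is the whole next level of the hierarchy.
    return before + trace_flow(inner, depth + 1) + after


steps = trace_flow(["w2", ["w2", ["w4"], "w2"], "w2"])
```

The trace makes the symmetry visible: every stage that chunks on the way in also dechunks on the way out, and the innermost main network sits exactly in the middle.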

Dimension Padding

When inner stages have a larger hidden dimension than outer stages, Aegir pads the input with a learnable vector (pad_dimension) on entry and slices it off on exit. This avoids linear projection overhead at every stage transition.
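A list-based sketch of the pad-and-slice idea follows. It assumes the learnable vector has length `d_inner - d_outer` and is appended to every token vector; the function names are hypothetical, and a real implementation would operate on tensors with a trainable parameter.

```python
def pad_to_inner(x, pad_dimension):
    """Append the (learnable) pad vector to each token vector on entry to the inner stage."""
    return [row + list(pad_dimension) for row in x]


def slice_to_outer(y, d_outer):
    """Drop the padded features again on exit from the inner stage."""
    return [row[:d_outer] for row in y]


x = [[1.0, 2.0], [3.0, 4.0]]      # two tokens, d_outer = 2
pad_dimension = [0.0]             # d_inner - d_outer = 1 (assumed size, assumed zero init)
inner = pad_to_inner(x, pad_dimension)
restored = slice_to_outer(inner, d_outer=2)
```

Compared with a linear projection at each transition, this adds only `d_inner - d_outer` parameters per stage boundary and no matrix multiply.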

Why All-RWKV

The primary design choice is to use RWKV-7 time mixing at all stages rather than transformers or pure SSMs. The motivation is threefold:

1. Uniform O(1) Recurrent State

Every RWKV-7 block maintains a recurrent state of shape (B, H, head_size, head_size). This is constant regardless of sequence length. During autoregressive inference, each token step updates this matrix and reads from it in O(head_size^2) time per head.
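A quick arithmetic check of what constant means here, using the state shape given above. The head size of 64 is an assumed value for illustration, and the key/value cache formula is the standard per-layer count for attention, included only as a contrast.

```python
def rwkv7_state_elements(n_heads, head_size):
    """Elements in one block's recurrent state per batch item: H * K * K."""
    return n_heads * head_size * head_size


def kv_cache_elements(seq_len, d_model):
    """Elements in one attention layer's key/value cache per batch item: 2 * L * d."""
    return 2 * seq_len * d_model


d_model, head_size = 128, 64          # head_size = 64 is an assumption
n_heads = d_model // head_size
state = rwkv7_state_elements(n_heads, head_size)
cache_at_4k = kv_cache_elements(4096, d_model)
cache_at_8k = kv_cache_elements(8192, d_model)
```

Doubling the sequence length doubles the cache but leaves the recurrent state untouched, which is the property the fusion argument in the next subsection relies on.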

2. Agent State Fusion

For the agent swarm architecture, specialist agents process the same input and produce recurrent states. These states must be combined. RWKV states are fixed-size matrices that live in a well-defined linear space, making fusion via weighted sum, gating, or projection algebraically natural. In contrast:

  • Transformer KV caches are O(L * d) and grow with sequence length, making fusion combinatorially expensive.
  • Mamba-2 states are smaller but have different algebraic structure (diagonal recurrence).
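The simplest of the fusion options named above, a weighted sum, can be sketched on plain nested lists. `fuse_states` is a hypothetical helper; the requirement that weights sum to one is an illustrative choice, not something the text specifies.

```python
def fuse_states(states, weights):
    """Weighted sum of same-shape K x K head states from several agents."""
    if len(states) != len(weights):
        raise ValueError("one weight per state is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights are assumed to sum to 1 in this sketch")
    k = len(states[0])
    return [[sum(w * s[i][j] for s, w in zip(states, weights)) for j in range(k)]
            for i in range(k)]


agent_a = [[1.0, 0.0], [0.0, 1.0]]
agent_b = [[3.0, 2.0], [2.0, 3.0]]
fused = fuse_states([agent_a, agent_b], [0.5, 0.5])
```

Because every agent's state has the same fixed shape, the fused result is itself a valid state of that shape, regardless of how many tokens each agent has consumed.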

3. Chunk-Parallel Training

The chunk_rwkv7 kernel from flash-linear-attention enables training with parallel chunk processing while maintaining exact recurrent semantics. This gives near-transformer training throughput with recurrent inference efficiency.

Comparison Table

| Property | RWKV-7 (w/W) | Mamba-2 (m/M) | Transformer (t/T) |
|---|---|---|---|
| Training kernel | chunk_rwkv7 (Triton) | Mamba-2 SSD (Triton) | Flash Attention 2 |
| Recurrent state | (H, K, K) matrix | (H, d_state) vector | None (KV cache) |
| Inference memory | O(d^2) constant | O(d * d_state) constant | O(L * d) linear |
| State fusibility | Natural (matrix sum) | Possible (vector sum) | Impractical |
| Exact retrieval | Via ROSA blocks | No | Via full attention |
| FFN pairing | CMix (relu^2) or SwiGLU | SwiGLU or none | SwiGLU or none |

In practice, RWKV-7 blocks (w/W) are the default at every stage and are the only block code used in any current arch_layout (main.py, train.py for tiny / small / base). Mamba-2 (m/M) is available as an optional dependency for ablation and hybrid configurations. ROSA (r/R) implements an RWKV-8 suffix automaton for exact substring retrieval; it is functional in the block factory but is not part of any current default arch_layout, and remains available for hybrid configurations and ablations. MHA (t/T) codes are declared in the block factory but the implementing module is not currently shipped.

Hierarchical Dynamic Chunking

Dynamic chunking is Aegir’s mechanism for content-dependent hierarchical segmentation. Rather than using a fixed tokenizer, the model learns to predict chunk boundaries based on the hidden representations themselves. This module is adapted from H-Net (goombalab/hnet).

Overview

The chunking pipeline has three components that work together at each non-innermost stage of the hierarchy:

  1. RoutingModule – predicts which tokens are chunk boundaries
  2. ChunkLayer – downsamples the sequence by selecting boundary tokens
  3. DeChunkLayer – reconstructs the full-length sequence from chunk outputs via EMA

RoutingModule: Boundary Prediction

The routing module decides where to place chunk boundaries by measuring how different consecutive hidden states are.

Algorithm

For a sequence of hidden states h[0], h[1], ..., h[L-1]:

  1. Project consecutive pairs through learnable Q and K matrices (initialized to identity).

  2. Compute cosine similarity between adjacent projected states:

    cos_sim[t] = cosine(Q @ h[t-1], K @ h[t])    for t = 1, ..., L-1
    
  3. Convert to boundary probability:

    p[t] = clamp((1 - cos_sim[t]) / 2, 0, 1)
    
  4. The first token always gets p = 1.0 (always a boundary).

  5. Threshold at 0.5: if p[t] > 0.5, token t is a boundary.

High dissimilarity between consecutive states means the content is changing – a natural place to start a new chunk. The Q/K projections are initialized to identity so the model starts with raw cosine similarity and can learn to refine the boundary criterion.
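The boundary rule can be sketched in a few lines of plain Python. The sketch takes Q and K as identity (their initial value) and indexes so that `p[t]` compares token `t` with its predecessor, which is one way to reconcile the pairwise similarity with the rule that the first token is always a boundary. The function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity with a small floor on the denominator to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / max(norm, 1e-12)


def boundary_probs(h):
    """Boundary probability per token; Q and K are identity in this sketch."""
    p = [1.0]  # the first token is always a boundary
    for t in range(1, len(h)):
        cos_sim = cosine(h[t - 1], h[t])
        p.append(min(1.0, max(0.0, (1.0 - cos_sim) / 2.0)))
    return p


def is_boundary(p, threshold=0.5):
    return [pt > threshold for pt in p]


h = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]]
p = boundary_probs(h)
flags = is_boundary(p)
```

Note the strict inequality: the last token is orthogonal to its predecessor, giving exactly `p = 0.5`, and it is therefore not marked as a boundary.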

Handling Variable-Length Sequences

The routing module supports two modes:

  • Padded mode (mask provided): Standard (B, L, D) tensors with a boolean mask. Boundary predictions outside the mask are suppressed.
  • Packed mode (cu_seqlens provided): Sequences concatenated into a single (1, total_len, D) tensor with cumulative sequence lengths. The first token of each sequence in the pack is forced to be a boundary.

ChunkLayer: Downsampling

Once boundaries are predicted, ChunkLayer selects only the boundary tokens to form a shorter sequence.

In padded mode:

  1. Count how many boundary tokens each batch element has.
  2. Sort token indices so boundary tokens come first.
  3. Gather the first max_boundaries tokens per batch element.
  4. Produce a new mask indicating which positions in the shorter sequence are valid.
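The padded-mode steps above can be sketched with a stable sort standing in for the index sort. This is a list-based illustration with a hypothetical function name; positions past a row's own boundary count are filled with leftover tokens and masked out.

```python
def chunk_padded(batch, boundary_flags):
    """Keep boundary tokens per row, padded to the largest boundary count in the batch."""
    counts = [sum(flags) for flags in boundary_flags]          # step 1
    max_boundaries = max(counts)
    chunks, masks = [], []
    for row, flags, n in zip(batch, boundary_flags, counts):
        # Step 2: stable sort, boundary positions first, original order kept within each group.
        order = sorted(range(len(row)), key=lambda i: not flags[i])
        chunks.append([row[i] for i in order[:max_boundaries]])   # step 3
        masks.append([j < n for j in range(max_boundaries)])      # step 4
    return chunks, masks


batch = [[10, 11, 12, 13], [20, 21, 22, 23]]
flags = [[True, False, True, True], [True, False, False, False]]
chunks, masks = chunk_padded(batch, flags)
```

The second row has only one boundary, so two of its three output slots hold filler tokens that the mask marks as invalid.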

In packed mode:

  1. Boolean-index the boundary tokens directly from the flat sequence.
  2. Recompute cu_seqlens for the shorter packed sequence.

The output is a shorter sequence containing only the tokens that were at chunk boundaries.
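Packed mode is simpler to picture. The sketch below forces the first token of every packed sequence to be a boundary, selects the boundary tokens, and recomputes the cumulative lengths. The function name is hypothetical and every packed sequence is assumed to be non-empty.

```python
def chunk_packed(tokens, boundary_flags, cu_seqlens):
    """Select boundary tokens from a flat packed sequence and recompute cu_seqlens."""
    flags = list(boundary_flags)
    for start in cu_seqlens[:-1]:
        flags[start] = True  # first token of each sequence is always a boundary
    kept = [tok for tok, keep in zip(tokens, flags) if keep]
    new_cu = [0]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        new_cu.append(new_cu[-1] + sum(flags[start:end]))
    return kept, new_cu


tokens = list("abcdefg")                      # two sequences: "abcd" and "efg"
boundary_flags = [True, False, True, False, False, False, True]
kept, new_cu = chunk_packed(tokens, boundary_flags, cu_seqlens=[0, 4, 7])
```

Token `e` was not predicted as a boundary but is kept anyway because it starts the second sequence, so no chunk ever spans two packed sequences.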

DeChunkLayer: Reconstruction via EMA

After the inner hierarchy processes the chunked (shorter) sequence, DeChunkLayer reconstructs the full-length sequence. The key insight is that non-boundary tokens should smoothly interpolate from their nearest preceding boundary token’s output.

EMA Scan

The reconstruction uses an exponential moving average (EMA) scan:

y[0] = x[0]
y[t] = decay[t] * y[t-1] + (1 - decay[t]) * x[t]

where decay[t] = 1 - p[t] and p[t] is the boundary probability for token t.

At boundary tokens (p ~ 1), the output snaps to the new chunk value. At non-boundary tokens (p ~ 0), the output carries forward the previous value. The boundary probability controls the blend continuously, allowing gradient flow through the routing decisions.
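A scalar version of the recurrence makes the snapping and carrying behaviour concrete. The sketch handles one feature channel as a list of floats; `ema_scan` is a hypothetical name, not the project's function.

```python
def ema_scan(x, p):
    """y[0] = x[0]; y[t] = decay[t] * y[t-1] + (1 - decay[t]) * x[t], with decay[t] = 1 - p[t]."""
    y = [x[0]]
    for t in range(1, len(x)):
        decay = 1.0 - p[t]
        y.append(decay * y[-1] + (1.0 - decay) * x[t])
    return y


x = [1.0, 5.0, 3.0, 7.0]
hard = ema_scan(x, p=[1.0, 0.0, 1.0, 0.0])   # hard boundaries at positions 0 and 2
soft = ema_scan(x, p=[1.0, 0.5, 1.0, 0.0])   # position 1 is an even blend
```

With hard probabilities the output is piecewise constant between boundaries; with `p = 0.5` at position 1 the output there is the midpoint of the carried value and the new input.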

Scan Backends

The EMA scan has two interchangeable backends that compute identical results:

  • Sequential (_ema_scan_sequential): the reference O(L)-depth loop above, accumulating outputs in a list and torch.stack-ing them (never in-place assignment, which would break autograd). Used on CPU, on non-CUDA devices, and as the fallback.
  • SSD (_ema_scan_ssd): a parallel scan that maps the EMA onto the Mamba-2 SSD recurrence (A = -1, dt = -log(decay), C = 1) and runs the mamba-ssm mamba_chunk_scan_combined Triton kernel. Because that kernel is tuned for many small heads, the feature dimension D is sliced into heads of size AEGIR_DECHUNK_SSD_HEADDIM (default 64, must divide D) so the backward kernel stays within the shared-memory budget on Ampere/Ada.

The _ema_scan dispatcher selects SSD only on CUDA when mamba-ssm is available and the post-chunk sequence length is at least AEGIR_DECHUNK_SSD_MIN_L (default 256); below that, the sequential scan is faster because the SSD kernel’s fixed setup and chunk_size=64 padding dominate. AEGIR_DECHUNK_SCAN (auto / sequential / ssd) overrides backend selection. Correctness at the edges is guaranteed by the sequential fallback.
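The selection rule can be sketched as a small pure function. The function name `choose_ema_backend` and its arguments are hypothetical; in particular, the behaviour of a forced `ssd` setting when the kernel is unusable is an assumption here (the sketch falls back to sequential), not something the text above specifies.

```python
import os


def choose_ema_backend(device_type, has_mamba_ssm, seq_len, env=os.environ):
    """Return "ssd" or "sequential" following the dispatch rule described above."""
    mode = env.get("AEGIR_DECHUNK_SCAN", "auto")
    min_len = int(env.get("AEGIR_DECHUNK_SSD_MIN_L", "256"))
    ssd_usable = device_type == "cuda" and has_mamba_ssm

    if mode == "sequential":
        return "sequential"
    if mode == "ssd":
        # Assumption: a forced "ssd" still falls back when the kernel is unusable.
        return "ssd" if ssd_usable else "sequential"
    # "auto": SSD only on CUDA, with mamba-ssm, and at or above the length threshold.
    return "ssd" if ssd_usable and seq_len >= min_len else "sequential"


print(choose_ema_backend("cuda", True, 1024, env={}))
print(choose_ema_backend("cuda", True, 128, env={}))
```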

Reconstruction Steps

  1. Reorder the chunk outputs according to the original boundary positions.
  2. Map each position in the full sequence to its cumulative boundary count (i.e., which chunk it belongs to).
  3. Run the EMA scan over the reordered chunk outputs with boundary-probability-derived decay factors.
  4. Gather the EMA outputs back to the original sequence positions.
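The steps above can be sketched on scalar features as follows. This is a simplified illustration, not the DeChunkLayer code: `dechunk` is a hypothetical name, the chunk outputs are assumed to be already in original boundary order (step 1), and the first position is assumed to be a boundary.

```python
from itertools import accumulate


def dechunk(chunk_out, boundary, probs):
    """Expand M chunk outputs back to L positions.

    chunk_out: M values, one per boundary token, in original order
    boundary:  L bools, with boundary[0] True
    probs:     L boundary probabilities
    """
    # Step 2: cumulative boundary count gives the chunk index of each position.
    chunk_idx = [c - 1 for c in accumulate(int(b) for b in boundary)]

    # Step 3: EMA over the chunk outputs, using the probabilities at the boundaries.
    p_sel = [p for p, b in zip(probs, boundary) if b]
    ema = [chunk_out[0]]
    for z, p in zip(chunk_out[1:], p_sel[1:]):
        ema.append((1.0 - p) * ema[-1] + p * z)

    # Step 4: gather the EMA outputs back to the original positions.
    return [ema[i] for i in chunk_idx]


out = dechunk([2.0, 10.0], [True, False, True, False], [1.0, 0.1, 0.75, 0.2])
print(out)
```

The second chunk enters with probability 0.75, so positions 2 and 3 receive 0.25 * 2.0 + 0.75 * 10.0 rather than the raw chunk value.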

Residual Connection

The entire chunk/process/dechunk pipeline is wrapped in a residual connection:

output = dechunk_output * STE(selected_probs) + residual_proj(encoder_output)

The residual_proj is a linear layer initialized to zero, so at initialization the skip term contributes nothing and the output comes entirely from the chunk/process/dechunk pathway. The Straight-Through Estimator (STE) passes gradients through the discrete routing decisions.

Recursive Nesting

The chunking pattern nests recursively. Consider a 3-stage hierarchy:

arch_layout = ["w2", ["w2", ["w4"], "w2"], "w2"]

  • Stage 0: Encode the full byte sequence, predict boundaries, chunk down, pass to Stage 1, dechunk back up, decode.
  • Stage 1: Encode the chunked sequence from Stage 0, predict boundaries again on this shorter sequence, chunk down further, pass to Stage 2, dechunk, decode.
  • Stage 2: Process the doubly-chunked sequence with a flat stack of blocks (no further chunking).

Each level of chunking reduces the sequence length by a data-dependent factor. For byte-level input, the first level might learn character-like boundaries; the second level might learn word-like or phrase-like boundaries. The model discovers its own hierarchy of tokenization.
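One way to read the nested list is as [encoder, inner, decoder] at each level, with a single-element list as the innermost flat stack. The sketch below walks that structure; the function name and the role labels are illustrative, and the reading of the list layout is an assumption based on the stage descriptions above.

```python
def flatten_stages(layout, stage=0):
    """Return (stage, role, layout_string) tuples for a nested arch_layout."""
    if len(layout) == 1:
        # Innermost stage: a flat stack of blocks, no further chunking.
        return [(stage, "main", layout[0])]
    encoder, inner, decoder = layout
    return (
        [(stage, "encoder", encoder)]
        + flatten_stages(inner, stage + 1)
        + [(stage, "decoder", decoder)]
    )


stages = flatten_stages(["w2", ["w2", ["w4"], "w2"], "w2"])
for entry in stages:
    print(entry)
```

The order of the returned tuples follows the data path: encode at stage 0, encode at stage 1, run the stage 2 stack, then decode back out through stages 1 and 0.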

Inference: Token-by-Token Stepping

During autoregressive inference, each component has a step method for single-token processing:

  • RoutingModule.step: Compares the new token against the previously seen token’s hidden state. If the boundary probability exceeds 0.5, the token starts a new chunk.
  • ChunkLayer.step: If the token is a boundary, pass it through to the inner hierarchy. Otherwise, skip the inner hierarchy entirely.
  • DeChunkLayer.step: Blend the new chunk output (if any) with the previous EMA value using the boundary probability as the mixing weight.

This means that during inference, the inner hierarchy only runs when a chunk boundary is detected, saving compute on non-boundary tokens.

RWKV-7 Time Mixing

RWKV-7 time mixing is the primary sequence processing mechanism in Aegir. It implements a linear recurrence with a matrix-valued state, combining the training efficiency of chunk-parallel computation with the inference efficiency of constant-memory recurrence. The implementation uses flash-linear-attention’s optimized Triton kernels.

Reference: RWKV-v8 “Heron” (BlinkDL/RWKV-LM), fla RWKV7Attention.

Core Recurrence

The recurrent state S[t] is a matrix of shape (H, head_size, head_size) per batch element, where H is the number of attention heads. The state update at each time step is:

S[t] = S[t-1] @ diag(w[t]) + S[t-1] @ ab[t] + v[t] @ k[t]^T

where (treating k, v, kk, and a as column vectors of length head_size):

  • diag(w[t]) is the per-channel exponential decay, applied column-wise
  • ab[t] = (-kk[t]) @ (kk[t] * a[t])^T is the rank-1 attention gate correction
  • v[t] @ k[t]^T is the new key-value outer product

The output is read from the state via:

o[t] = S[t] @ r[t]

where r[t] is the receptance (query) vector.

Time-Shift Mixing

Before computing projections, RWKV-7 mixes each token with its predecessor via learned interpolation coefficients. Given input x[t]:

delta[t] = x[t-1] - x[t]       (delta[0] = -x[0])

xr = x + delta * mu_r
xw = x + delta * mu_w
xk = x + delta * mu_k
xv = x + delta * mu_v
xa = x + delta * mu_a
xg = x + delta * mu_g

Each mu_* is a learnable (1, 1, D) parameter initialized with a position-and-layer-dependent schedule. This provides a simple form of local context mixing before the main recurrence.
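A minimal sketch of the mixing on a scalar sequence, assuming a zero vector before the first token (which is what delta[0] = -x[0] implies). The name `time_shift_mix` is illustrative.

```python
def time_shift_mix(x, mu):
    """Return x + delta * mu, where delta[t] = x[t-1] - x[t] and x[-1] is taken as 0."""
    prev = [0.0] + x[:-1]
    return [xt + (pt - xt) * mu for xt, pt in zip(x, prev)]


x = [1.0, 2.0, 4.0]
print(time_shift_mix(x, 0.0))   # mu = 0 keeps the current token
print(time_shift_mix(x, 1.0))   # mu = 1 replaces it with the previous token
print(time_shift_mix(x, 0.5))   # in between: an even blend
```

Each projection (r, w, k, v, a, g) gets its own mu, so each can choose a different blend of the current and previous token.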

Decay LoRA

The decay vector w[t] controls how quickly the recurrent state forgets. It is computed via a low-rank adaptation:

w[t] = -softplus(-(w0 + tanh(W1 @ xw[t]) @ W2)) - 0.5

where:

  • w0 is a (D,) bias initialized with a position-dependent schedule
  • W1 is (D, decay_low_rank_dim) and W2 is (decay_low_rank_dim, D)
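  • The result is in log-space (negative values); the -0.5 offset caps exp(w) at exp(-0.5) ≈ 0.6065, which bounds how quickly any channel can forget

For the chunked training kernel (chunk_rwkv7), w is passed in log-space. For the single-token step, it is converted to the multiplicative factor exp(-exp(w)), which simplifies to the following because exp(-softplus(-z) - 0.5) = exp(-0.5) * sigmoid(z):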

w_step = exp(-0.606531 * sigmoid(w0 + tanh(W1 @ xw) @ W2))
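The two forms can be checked against each other numerically. The sketch below treats z as the pre-activation w0 + tanh(W1 @ xw) @ W2 and assumes the step factor is exp(-exp(w)); the function names are illustrative.

```python
import math


def log_decay(z):
    """w = -softplus(-z) - 0.5, the log-space form."""
    return -math.log1p(math.exp(-z)) - 0.5


def step_decay(z):
    """w_step = exp(-0.606531 * sigmoid(z)), the multiplicative form."""
    return math.exp(-0.606531 / (1.0 + math.exp(-z)))


for z in (-7.0, 0.0, 3.0):
    via_log = math.exp(-math.exp(log_decay(z)))
    print(f"z={z:+.1f}  exp(-exp(w))={via_log:.6f}  w_step={step_decay(z):.6f}")
```

Since sigmoid(z) lies in (0, 1), the step factor always lies between exp(-0.606531) ≈ 0.545 and 1: no channel can lose more than roughly 45% of its state in a single step.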

Attention Gate LoRA

The attention gate a[t] modulates the key’s influence on the state update. It controls the ab correction term:

a[t] = sigmoid(a0 + A2(A1(xa[t])))

where a0 is a (D,) bias and A1, A2 form a low-rank bottleneck. The key is then modified as:

k'[t] = k[t] * (1 + (a[t] - 1) * k_a)

where k_a is a learnable per-dimension scale (initialized to 1.0).

Value-First Sharing

RWKV-7 shares value information across layers via a “value-first” mechanism:

  • Layer 0: Stores its value projection as v_first.
  • Layers 1+: Lerp their value toward v_first:
v[t] = v[t] + (v_first[t] - v[t]) * sigmoid(v0 + V2(V1(xv[t])))

This provides a residual-like connection specifically for value information, allowing deeper layers to reference the original value representation from layer 0.

L2 Key Normalization

Keys are L2-normalized per head before entering the ab correction term:

kk[t] = L2_normalize(k[t] * k_k)   per head

where k_k is a learnable per-dimension scale (initialized to 0.85). The normalized keys kk are used in the ab correction term but not in the main key-value outer product.

Bonus Term

A direct key-query interaction term is added to the output:

bonus[t] = sum(r[t] * k[t] * r_k, dim=-1, keepdim=True) * v[t]

where r_k is a (H, head_size) parameter initialized with small random values. This provides a shortcut path that bypasses the recurrent state entirely.

GroupNorm Output

The recurrent output is passed through GroupNorm (one group per attention head) before the bonus term is added:

o = GroupNorm(S[t] @ r[t])  +  bonus[t]

Output Gating

The final output is gated via another LoRA:

g[t] = G2(sigmoid(G1(xg[t])))
output = o * g
output = W_o @ output

The output projection W_o is initialized to zero so that at initialization, RWKV-7 blocks contribute nothing to the residual stream.

Training: Chunk-Parallel Computation

During training, the chunk_rwkv7 kernel from flash-linear-attention processes the sequence in parallel chunks while maintaining exact recurrent semantics. The function signature:

o, final_state = chunk_rwkv7(
    r, w, k, v,
    -kk, kk * a,            # ab decomposed as two rank-1 terms
    initial_state=state,     # (B, H, K, K) or None
    output_final_state=True,
)

Inputs are shaped (B, T, H, head_size) and w is in log-space.

Inference: Token-by-Token Recurrence

During autoregressive inference, the step method implements the exact recurrence manually:

vk = v @ k^T                    # (B, H, N, N)
ab = (-kk) @ (kk * a)^T         # (B, H, N, N), outer product
S  = S * diag(w) + S @ ab + vk  # state update
o  = S @ r                       # read output

The recurrent state S is stored in inference_params.key_value_memory_dict[layer_idx].att_kv as a float32 tensor of shape (B, H, head_size, head_size).
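For intuition, here is a single-head version of that step in pure Python, with nested lists in place of tensors. It is a sketch of the recurrence as written above (column-wise decay, ab as the outer product of -kk and kk * a), not the module's step method; all names are illustrative.

```python
def outer(u, v):
    """Outer product of two length-N lists as an N x N nested list."""
    return [[ui * vj for vj in v] for ui in u]


def matmul(A, B):
    """Plain N x N matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def rwkv7_step(S, r, w, k, v, kk, a):
    """One recurrent step for one head. S is N x N; the other inputs are length N."""
    n = len(r)
    vk = outer(v, k)                                            # new key-value term
    ab = outer([-x for x in kk], [x * y for x, y in zip(kk, a)])
    S_ab = matmul(S, ab)
    # S * diag(w) scales column j of S by w[j].
    S_new = [[S[i][j] * w[j] + S_ab[i][j] + vk[i][j] for j in range(n)] for i in range(n)]
    o = [sum(S_new[i][j] * r[j] for j in range(n)) for i in range(n)]
    return S_new, o


S0 = [[0.0, 0.0], [0.0, 0.0]]
S1, o1 = rwkv7_step(S0, r=[1.0, 1.0], w=[0.9, 0.9], k=[1.0, 0.0],
                    v=[2.0, 3.0], kk=[1.0, 0.0], a=[0.5, 0.5])
S2, o2 = rwkv7_step(S1, r=[1.0, 0.0], w=[0.5, 0.5], k=[0.0, 1.0],
                    v=[1.0, 1.0], kk=[0.0, 1.0], a=[0.5, 0.5])
print(o1, o2)
```

From an empty state the first step just writes v @ k^T, so reading with a receptance aligned to k returns v. The second step decays the stored column and adds a new association in an orthogonal key direction.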

LoRA Dimension Auto-Calculation

If not explicitly specified in RWKVConfig, LoRA dimensions are computed from d_model following the fla convention:

factor = head_size / 64
sqrt_d = sqrt(d_model)

decay_low_rank_dim = max(32, round(2.5 * sqrt_d * factor / 32) * 32)
gate_low_rank_dim  = max(32, round(5.0 * sqrt_d / 32) * 32)
a_low_rank_dim     = max(32, round(2.5 * sqrt_d * factor / 32) * 32)
v_low_rank_dim     = max(32, round(1.7 * sqrt_d * factor / 32) * 32)

All dimensions are rounded to the nearest multiple of 32, with a floor of 32, for hardware efficiency.
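The formulas can be evaluated directly. The helper below is a sketch that mirrors them; `lora_dims` and `snap` are illustrative names, and d_model = 192 with head_size = 64 is only an example configuration.

```python
import math


def lora_dims(d_model, head_size=64):
    """Evaluate the auto-calculated LoRA ranks for a given model width."""
    factor = head_size / 64
    sqrt_d = math.sqrt(d_model)

    def snap(x):
        # Nearest multiple of 32, never below 32.
        return max(32, round(x / 32) * 32)

    return {
        "decay_low_rank_dim": snap(2.5 * sqrt_d * factor),
        "gate_low_rank_dim": snap(5.0 * sqrt_d),
        "a_low_rank_dim": snap(2.5 * sqrt_d * factor),
        "v_low_rank_dim": snap(1.7 * sqrt_d * factor),
    }


dims = lora_dims(192)
print(dims)
```

At d_model = 192 the raw decay rank is about 34.6 and snaps down to 32, while the raw value rank is about 23.6 and is lifted to the floor of 32.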

Weight Initialization

Initialization follows RWKV-7 conventions with layer-dependent schedules:

  • Time-shift coefficients (mu_*): Initialized as 1 - d^(c * ratio) where d is a per-dimension ramp [0, 1), c is a coefficient specific to each mix type, and ratio varies from 1 (first layer) to 0 (last layer).
  • Decay bias (w0): Initialized as -7 + 5 * (d / D)^(0.85 + ratio^0.5), giving a range from slow decay (early dimensions, w0 near -7, where the step factor stays close to 1) to faster decay (late dimensions, w0 near -2).
  • Key normalization (k_k): 0.85 uniformly.
  • Key attention scale (k_a): 1.0 uniformly.
  • Bonus (r_k): Small random normal (std=0.1).
  • Output projection (W_o): Zero initialized.

ROSA Suffix Automaton

ROSA (RWKV Online Suffix Automaton) provides lossless infinite-range exact sequence matching as a complement to RWKV-7’s learned recurrent processing. While RWKV-7 maintains a compressed state that approximates the input history, ROSA can retrieve exact substring matches from arbitrarily far in the past.

Reference: “ROSA-Tuning: Enhancing Long-Context Modeling via Suffix Matching” (arXiv:2602.02499), ported from RWKV-v8 (BlinkDL/RWKV-LM).

Algorithm Overview

ROSA constructs an online suffix automaton over discretized hidden representations. For each position in the query sequence, it finds the longest suffix of the query that appears as a substring in the key sequence seen so far, then returns the corresponding value from the position immediately after the match.

The core operation is rosa_qkv_ref(qqq, kkk, vvv):

  1. Maintain an online suffix automaton built incrementally from the key sequence.
  2. For each new position i:
    • Query phase: Walk the automaton to find the longest suffix of qqq[:i+1] that matches a substring in kkk[:i].
    • Key phase: Extend the automaton with kkk[i].
  3. If a match of sufficient length is found, return vvv[match_end + 1]. Otherwise return a sentinel value.

The suffix automaton provides O(n) construction and O(n) total query time, making the entire operation linear in sequence length.
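The matching semantics can be illustrated with a brute-force search. This sketch deliberately does not build a suffix automaton and runs in quadratic time or worse; it only shows what is being retrieved. The minimum match length of 1, the preference for the most recent match on ties, and the sentinel value -1 are assumptions made for the illustration.

```python
def rosa_naive(q, k, v, sentinel=-1):
    """For each i, find the longest suffix of q[:i+1] inside k[:i] and return the next v."""
    out = []
    for i in range(len(q)):
        best_len, best_end = 0, -1
        for end in range(i):                       # candidate match ends at k[end]
            length = 0
            while length <= end and q[i - length] == k[end - length]:
                length += 1
            if length > 0 and length >= best_len:  # ties go to the most recent match
                best_len, best_end = length, end
        out.append(v[best_end + 1] if best_len > 0 else sentinel)
    return out


seq = [1, 0, 1, 0]
matched = rosa_naive(seq, seq, seq)
print(matched)
```

On an alternating sequence the first two positions have nothing to match against, and from then on the lookup returns the symbol that followed the previous occurrence of the current suffix.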

1-Bit Binarization

To convert continuous hidden states into discrete tokens suitable for suffix automaton matching, ROSA uses 1-bit binarization:

x_binary = (x > 0) ? 1 : 0

This is applied per channel across the hidden dimension. Given a hidden state tensor of shape (B, T, C):

  1. Binarize: q_bin[b, t, c] = uint8(q[b, t, c] > 0) (same for k, v).
  2. Transpose: Reshape from (B, T, C) to (B*C, T) – each channel becomes an independent sequence.
  3. Match: Run rosa_qkv_batch_ref over all B*C channel sequences in parallel.
  4. Reconstruct: Reshape indices back to (B, T, C).
  5. Scale: Output = (2 * idx_float - 1) * emb, where emb is a learnable (1, 1, C) scale parameter.

The matched bit value 1 maps to +emb and 0 maps to -emb, giving the output the same sign structure as the matched hidden representation scaled by a learnable magnitude.

The _RosaQKV1BitOp Autograd Function

ROSA’s suffix automaton is non-differentiable (it involves discrete automaton state transitions). The autograd function handles this:

  • Forward: Binarize inputs, run suffix matching on CPU, scale by emb.
  • Backward: Gradients for q, k, v are None (zero). Gradients for emb are passed through directly.

This means ROSA layers learn only through:

  1. The learnable emb scale parameter.
  2. The Q/K/V linear projections preceding ROSA (which receive gradients from other paths in the network through residual connections).
  3. The surrounding block’s residual connection.

The projections learn to produce hidden representations whose binarization yields useful matching patterns, even though the binarization itself has no gradient.

CPU Execution

The suffix automaton runs on CPU. Tensors are moved to CPU before matching and results are moved back to the accelerator. This is a deliberate design choice:

  • Suffix automata use pointer-chasing data structures (dictionaries, linked suffix links) that are not amenable to GPU parallelism.
  • The per-channel parallelism (B*C independent sequences) provides sufficient throughput for moderate batch sizes.
  • ROSA is a prefill-only operator. A single-token query has no prior context to match against, so the 1-bit suffix-matching output is structurally ill-defined in step mode. RWKV_ROSA.step therefore raises NotImplementedError rather than returning a placeholder (an earlier zero-output fallback masked a real correctness trap). This path is unreachable in practice — no current arch_layout uses r/R blocks — and the docstring notes a possible future rolling-window decoder (replay the last K tokens through forward and extract the tail).

When to Use ROSA vs RWKV-7

| Use Case | ROSA (r/R) | RWKV-7 (w/W) |
|---|---|---|
| Exact pattern retrieval | Yes – lossless via suffix matching | No – compressed into finite state |
| Learned sequence processing | Limited – only emb is trained | Full – all parameters are trained |
| Inference (autoregressive) | Prefill-only (needs full context; step raises NotImplementedError) | Efficient (O(1) state update) |
| Long-range dependencies | Infinite range, exact | Finite effective range, approximate |
| Training speed | Slower (CPU automaton) | Fast (Triton chunk kernel) |

In practice, ROSA blocks are best used sparingly alongside RWKV-7 blocks. A typical layout might be "w4r1" – four RWKV-7 blocks for general sequence processing, one ROSA block for exact retrieval. The ROSA block acts as a “lookup table” that can surface exact matches from the input, while RWKV-7 handles the bulk of learned representation building.

RWKV_ROSA Module

The RWKV_ROSA module wraps the ROSA matching in a standard time-mixing interface:

  1. Time-shift mixing: Mix current token with previous token via learned interpolation (same as RWKV-7 but with only q/k/v coefficients).
  2. Q/K/V projection: Linear projections from the mixed hidden states.
  3. ROSA matching: RosaQKV1Bit on the projected q, k, v.
  4. Output projection: Linear projection back to d_model.

The module is paired with either RWKV_CMix (relu^2 FFN, block code r) or SwiGLU (block code R) as its feedforward component.

Block Types Reference

Aegir’s architecture is built from modular blocks, each consisting of a mixer (the sequence processing module) and an optional MLP (the feedforward network). Blocks are identified by single-character codes and composed into layout strings that define the architecture at each stage.

Block Code Table

| Code | Mixer | MLP | Description |
|---|---|---|---|
| w | RWKV-7 TimeMix | CMix (relu^2) | Full RWKV-7 recurrence with RWKV-style channel mixing |
| W | RWKV-7 TimeMix | SwiGLU | Full RWKV-7 recurrence with SwiGLU feedforward |
| r | ROSA (suffix automaton) | CMix (relu^2) | Exact pattern matching with RWKV-style channel mixing |
| R | ROSA (suffix automaton) | SwiGLU | Exact pattern matching with SwiGLU feedforward |
| t | Multi-Head Attention | None | Causal MHA with no feedforward |
| T | Multi-Head Attention | SwiGLU | Standard transformer block |
| m | Mamba-2 (SSM) | None | State-space model with no feedforward |
| M | Mamba-2 (SSM) | SwiGLU | State-space model with SwiGLU feedforward |

Convention

  • Lowercase codes use RWKV-native FFN (CMix with relu^2) or no FFN at all.
  • Uppercase codes use SwiGLU as the feedforward network.
  • For w/W and r/R, lowercase uses CMix; uppercase uses SwiGLU.
  • For t/T and m/M, lowercase has no MLP; uppercase adds SwiGLU.

The Block Wrapper

Every block follows the pre-norm residual pattern:

                    +---> norm1 --> mixer ---+
                    |                       |
hidden_states ----->+                       +-----> hidden_states
(+ residual)        |                       |      (+ residual)
                    +---> norm2 --> mlp ----+  (if MLP exists)

Concretely, the Block class implements:

# Mixer sub-block
hidden_states, residual = norm1(hidden_states, residual, prenorm=True)
hidden_states = mixer(hidden_states)

# MLP sub-block (if present)
hidden_states, residual = norm2(hidden_states, residual, prenorm=True)
hidden_states = mlp(hidden_states)

The pre-norm pattern accumulates the residual stream separately from the normalized hidden states. The normalization module (RMSNorm from flash-attn, or a LayerNorm fallback) handles residual accumulation internally when prenorm=True.

Residual Height Counting

Each block contributes to the “height” of its parent Isotropic module, which is used for output projection scaling during initialization:

  • Lowercase blocks (single residual addition): height += 1
  • Uppercase blocks (mixer + MLP, two residual additions): height += 2

MLP Variants

CMix (RWKV Channel Mixing)

Used by lowercase RWKV codes (w, r). A simple feedforward with relu^2 activation:

# Time-shift mixing
xx = time_shift(x) - x
k = x + xx * x_k

# Feedforward
k = relu(W_key @ k) ** 2    # D -> 4D, relu squared
output = W_value @ k          # 4D -> D

The expansion factor defaults to rwkv_cfg.dim_ffn_mult (default 4.0). CMix includes its own time-shift mixing, independent of the mixer’s time-shift.

SwiGLU

Used by uppercase codes (W, R, T, M). The standard SwiGLU feedforward (Shazeer 2020):

y = W_fc1 @ x                # D -> 2 * D_intermediate
y, gate = split(y)           # Each D_intermediate
y = silu(gate) * y
output = W_fc2 @ y            # D_intermediate -> D

The intermediate dimension defaults to 8/3 * d_model, rounded up to the nearest multiple of 128.
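As a worked example of that default, the sketch below uses integer ceiling division to avoid floating-point edge cases. The function name is illustrative, and the exact rounding code in the implementation may differ.

```python
def swiglu_intermediate(d_model, multiple=128):
    """Smallest multiple of `multiple` that is >= 8/3 * d_model."""
    # Ceiling division on integers: ceil(8 * d_model / (3 * multiple)).
    return -(-8 * d_model // (3 * multiple)) * multiple


print(swiglu_intermediate(192))   # 8/3 * 192 = 512 exactly
print(swiglu_intermediate(256))   # 8/3 * 256 ≈ 682.7, rounded up
```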

Layout String Parsing

Architecture layout strings encode a sequence of block types and their counts. The string is parsed by the Isotropic module using a regex:

re.findall(r"([mMtTrRwW])(\d+)", arch_layout)

Examples:

| Layout String | Parsed Blocks |
|---|---|
| "w4" | 4 RWKV-7+CMix blocks |
| "w4T1r2" | 4 RWKV-7+CMix, 1 MHA+SwiGLU, 2 ROSA+CMix |
| "W8" | 8 RWKV-7+SwiGLU blocks |
| "m2w4m2" | 2 Mamba-2, 4 RWKV-7+CMix, 2 Mamba-2 |

Within a layout string, blocks are instantiated in order with sequential layer_idx values. The total layer count across all block types in the string is used for RWKV-7’s position-dependent weight initialization.
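A short sketch of the parse, using the same regex. `expand_layout` is an illustrative helper rather than part of the Isotropic module; it expands the counts and assigns sequential layer indices.

```python
import re


def expand_layout(arch_layout):
    """Expand a layout string into (layer_idx, block_code) pairs."""
    codes = []
    for code, count in re.findall(r"([mMtTrRwW])(\d+)", arch_layout):
        codes.extend(code for _ in range(int(count)))
    return list(enumerate(codes))


blocks = expand_layout("w4T1r2")
print(blocks)
print(len(blocks))   # total layer count for this layout string
```

Note that re.findall silently skips any text the pattern does not match, so a typo such as an unknown block code produces fewer blocks rather than an error unless the caller validates the string separately.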

The create_block Function

create_block() is the factory function that dispatches on the block code character:

block = create_block(
    arch="w",                    # block code
    d_model=192,                 # hidden dimension
    d_intermediate=512,          # SwiGLU intermediate dim (for uppercase codes)
    ssm_cfg={...},               # Mamba-2 config (for m/M)
    attn_cfg={...},              # MHA config (for t/T)
    rwkv_cfg=RWKVConfig(...),    # RWKV config (for w/W/r/R)
    layer_idx=0,                 # layer index for cache keying
    num_hidden_layers=12,        # total layers for init scheduling
)

The function:

  1. Selects the mixer class based on the code character.
  2. Selects the MLP class: CMix for w/r, SwiGLU for uppercase, nn.Identity for t/m.
  3. Selects the normalization class: flash-attn’s RMSNorm if available, otherwise a LayerNorm fallback with prenorm support.
  4. Constructs and returns a Block instance with the selected components.

Value-First Sharing Across Blocks

When an Isotropic module contains RWKV-7 blocks (w/W), it maintains a shared v_first = [None] container. This mutable list is passed as a mixer_kwarg to every RWKV-7 block:

  • The first RWKV-7 block (layer_idx 0 within the Isotropic) stores its value projection in v_first[0].
  • Subsequent RWKV-7 blocks lerp their value toward v_first[0] via a learnable gate.

This sharing is local to each Isotropic instance – encoder, decoder, and main network at each stage each have their own v_first container.

Pretraining

This chapter describes the byte-level pretraining track of the project. The operational pretraining run is the v2 mixed-corpus byte-level pretrain completed 2026-04-27 — 122k training steps on a 2 GB mixed corpus (FineWeb-Edu + SQaLe + SchemaPile + FinePDFs-lab), next-byte-trained at the architecture described in Architecture. The pretrain produced the backbone that the M2 milestone of Track 1 fine-tunes for Column Type Annotation; the v2 result is the empirical anchor in Training Regime §10.

The chapter is organized in two parts: the operational pretraining track that the v2 run instantiates, and the long-term direction in which ontology-grounded synthetic data feeds successive pretraining generations.

The operational pretraining track

Why byte-level

Standard pretraining uses subword tokenization (BPE, SentencePiece) fit once on a pretraining corpus. Subword tokens fragment tabular data unpredictably — a column value "$1,234.56" may tokenize as five tokens or two depending on the corpus the BPE was fit on, and the boundary between adjacent columns has no consistent representation in the token stream. For column annotation, this fragmentation is structurally harmful: the model has to re-learn the boundary structure that the original CSV / JSON delimiters expressed losslessly. Byte-level input avoids the question — every byte is a primitive — at the cost of longer sequences.

Aegir’s hierarchical architecture (H-Net dynamic chunking on top of RWKV-7 time-mixing) makes byte-level training tractable: the routing module learns where token-like boundaries should be from the sequence’s content, replacing fixed tokenization with a content-adaptive operation that propagates through training. See Architecture for the recursive hierarchy and Hierarchical Dynamic Chunking for the boundary mechanism.

The v2 mixed-corpus pretrain (2026-04-27)

The v2 corpus is 2 GB of mixed text drawn from four sources:

| Source | Role |
|---|---|
| FineWeb-Edu | Curated educational prose; general language-modeling signal. |
| SQaLe | Natural-language → SQL pairs; structured-query reasoning. |
| SchemaPile | Database schemas with metadata; relational-syntax signal. |
| FinePDFs-lab | LIMS-domain PDF text; in-distribution signal for the metadata-tagging target. |

122k training steps were run at the small configuration (56M parameters) on a single GPU in ≈10 h wall clock, with AdamW and a cosine LR schedule. Stratified held-out evaluation against train-time-matched slices shows the result the M2 milestone depends on: non-degenerate representations across all four sources, a drop of ≈2 bpb on the domain-targeted FinePDFs-lab slice relative to a randomly-initialized baseline, and no regression on general prose. The full bits-per-byte table is in Training Regime §10.

The v2 pretrain is the project’s first real backbone. It established that the architecture converges under byte-level pretraining on real corpora — a precondition for any fine-tune work and for any ontology-grounded synthetic regime beyond it. The fine-tune that closes the M2 loop is described in Supervised Bootstrapping; the diagnostic that motivated the v2 pretrain in the first place is in Diagnostic Case Study.

v3 — multi-GPU step-up

The v3 pretrain is conditional on M2 clearing its liveness gate; M3 of Track 1 describes the planned step-up to 6 × RTX 4090 multi-GPU training at the next byte-budget bump (roughly 8 GB at ≈7 h vs. v2’s 10 h on 2 GB single-GPU). The target evaluation thresholds — keep eval.fineweb-held ≤ 1.61, push eval.finepdfs-lab-held below 1.78, no regression on SchemaPile or SQaLe — anchor v3 against the v2 baseline. The v3 corpus mix may incorporate verifier-passing synthetic slices from the ontology-grounded corpus pipeline once that corpus is available at the budget v3 needs; see the next section.

The long-term direction — ontology-grounded synthetic data

The byte-level pretraining track exists alongside a coupled research program that generates structured training data from a deterministically-grounded ontology rather than discovering structure in scraped corpora. The ontology, the rigor program that governs it, and the closed-loop corpus pipeline are documented in the Ontology chapter; the Authors Guide is the canonical reference for every metric, gate, and disposal membrane. The short version of how that program produces pretraining bytes:

  1. Derive and realize the ontology. sdg-ontology is a BFO 2020 / CCO-grounded domain ontology, content-derived from FinePDFs and realized to a HermiT-validated OWL artifact at corpora/ontology/sdg-ontology.{omn,owl} (with a consistency certificate at corpora/ontology/HERMIT_CERTIFICATE.md). Its classes are intermediate-depth subsumers — the property-bearing classes a heterogeneous-but-coherent column belongs to — and they are the CTA/CPA annotation vocabulary. An agent-mediated propose / dispose feedback loop drives the ontology: an engine proposes axioms; deterministic membranes (parse → HermiT with CCO imported as a reasoning authority → OntoClean) dispose and return their reason; the agent refines. The seven family catalogs (src/aegir/ontology/catalog/01…07) are a seed and regression baseline; the live driver is the content-first derivation pipeline, not a fixed template count.
  2. Generate ontology-grounded chapters. scripts/generate_chapter.py synthesizes textbook chapters grounded in the ontology — in the current path, content-first from a FinePDFs harvest (--from-harvest) — calling a generation backend that is either the local gRPC engine (engine/<capability>, $0) or a weighted GLM / Grok mix. Each chapter cites ontology templates, verbalizes their axioms into prose, and embeds RI-true relational tables and views projected from the DDL spine (src/aegir/ontology/ddl.py, realize.py), so each column’s source entity is known by construction.
  3. Verify each chapter. scripts/verify_chapters.py runs a four-scorer verification loop — R_topic (alignment with FinePDFs style anchors; dropped for content-first chapters), R_iri (cited templates’ key terms present in prose), R_density (markdown-table structure), and R_axiom (table headers match slot types) — and composites them as a geometric mean, accepting at τ_accept 0.50.
  4. Mix the accepted chapters into a v3-or-later pretraining corpus alongside real text, and evaluate the pretrain lift on the Track 1 stratified-eval surface to attribute any improvement to the ontology-grounded slice — the paper 2 claim, scoped in Roadmap.
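The compositing rule in step 3 can be sketched as follows. This is an illustration of a geometric-mean gate, not the code in scripts/verify_chapters.py: the function name, the dictionary interface, the assumption that scores lie in [0, 1], and the use of >= at the threshold are all assumptions.

```python
import math


def composite_score(scores, content_first=False, tau_accept=0.50):
    """Geometric mean of the scorers, with R_topic dropped for content-first chapters."""
    used = [s for name, s in scores.items()
            if not (content_first and name == "R_topic")]
    gm = math.prod(used) ** (1.0 / len(used))
    return gm, gm >= tau_accept


scores = {"R_topic": 0.9, "R_iri": 0.4, "R_density": 0.9, "R_axiom": 0.4}
gm, accepted = composite_score(scores)
print(round(gm, 3), accepted)
```

A geometric mean punishes a single weak scorer harder than an arithmetic mean would: one scorer at zero forces the composite to zero regardless of the others.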

The ontology-grounded corpus is also published as an independent deliverable: the SHARE-docs browsable corpus and the corpora/ submodule (zndx/sdg-corpora). Its publication is gated on the ontology’s OQuaRE quality model — sync --push is refused below GREEN, the hard gate the Authors Guide documents.

Why this scales

The bottleneck in conventional table annotation is human labeling. The bottleneck in this synthetic regime is generation-and-verification throughput — the pipeline must generate ontology-grounded chapters and the verifier must score them — which is embarrassingly parallel, and runs at $0 against the local gRPC engine on local GPUs. The diversity of the training data is bounded by the ontology’s expressivity, which is itself growing: the content-first derivation accretes FinePDFs-derived intermediate classes rather than enumerating a fixed catalog. Independent constraints on the regime are tracked as pre-registered gates in EVIDENCE.md — in particular the corpus’s maximum non-repetitive token yield, which caps the ontology-grounded fraction of any large pretraining budget, and the M2 lift that the v3 corpus mix must demonstrate over a no-ontology control.

How this connects to Aegir’s three target tasks

The two pretraining inputs — real corpora (v2) and ontology-grounded synthetic slices (v3 and beyond) — both serve the same three downstream tasks:

  • Column Type Annotation (CTA). Real-corpus pretraining gives general language and tabular-syntax signal; synthetic slices add per-column entity types under known provenance.
  • Column Property Annotation (CPA). Cross-column relationships in real corpora are noisy; synthetic slices supply clean cross-column relations from the ontology’s sdg:* property declarations and the DDL spine’s FK edges.
  • Data Element Discovery. Cross-table groupings under known ontological provenance are the synthetic regime’s distinctive contribution — real corpora do not supply ground-truth data elements at scale.

The first two tasks are addressable from v2 alone. The third benefits most directly from the synthetic regime and is the strongest motivator for completing the corpus pipeline.

Sub-pages

The Training Tactics note and the five Stage-named sub-pages (Stage 1: Ontology Extraction, Stage 2: Schema Projection, Stage 3: Synthetic Data Generation, Stage 4: Training Objective, End-to-End Example) describe an exploratory SysMLv2 / ORM pipeline — a long-horizon framing that preceded the convergence on the ontology-grounded chapter pipeline above. They are preserved in the repository (docs/current/src/pretraining/) as background to the active work; they are not wired into the rendered book.

The Diagnostic Case Study documents the 2026-04-19 SOTAB-CTA representation-collapse incident that motivated the v2 pretrain in the first place; it is the chapter’s primary historical reference.

Diagnostic Case Study: Representation Collapse on SOTAB

A short postmortem of the first SOTAB training attempt. The detailed technical note lives in docs/scratch/2026-04-19/234700_sotab_diagnostic_representation_collapse.md; this chapter extracts the reusable methodology and the lessons.

1. The signal

After 3 epochs of direct SOTAB CTA training, the loss and F1 curves looked like a plateau: train loss 4.13 → 4.10, val loss 4.55 → 4.52, best val macro F1 = 0.0007 reached at epoch 1 and never improved. A casual interpretation is “the model is having trouble learning a hard task.” That interpretation is wrong.

2. The three-phase diagnostic

The script at scripts/sotab_diagnostic.py runs three orthogonal analyses on any trained checkpoint. Each answers a different question; reading them together pinpoints the failure mode.

Phase 1 — Prediction distribution

Count what the classifier is actually predicting across the val set. Report the top-k predicted classes and the exact-match accuracy.

Signal read:

  • If a single class accounts for ≥50% of predictions → mode collapse.
  • If predictions are evenly distributed but wrong → learning rate or schedule problem.
  • If predictions are concentrated in a plausible few classes but mixed up → confusable-class problem, needs per-class analysis.

On our collapsed SOTAB checkpoint: 100% predictions of currency, exact-match 3.27% equalling the val base rate of that class.
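A minimal sketch of this phase, assuming predictions and labels are available as plain lists of class names. It is not the code in scripts/sotab_diagnostic.py; the function name and the returned keys are illustrative.

```python
from collections import Counter


def prediction_report(preds, labels, k=3):
    """Summarise what a classifier predicts and flag single-class dominance."""
    counts = Counter(preds)
    top = counts.most_common(k)
    top_share = top[0][1] / len(preds)
    accuracy = sum(p == y for p, y in zip(preds, labels)) / len(preds)
    return {
        "top_k": top,
        "top_share": top_share,
        "accuracy": accuracy,
        "mode_collapse": top_share >= 0.5,   # the ≥50% rule from the list above
    }


preds = ["currency"] * 4
labels = ["currency", "name", "date", "name"]
report = prediction_report(preds, labels)
print(report)
```

When every prediction is the same class, exact-match accuracy equals that class's share of the labels, which is the signature seen on the collapsed checkpoint.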

Phase 2 — Cluster geometry

Extract pre-classifier pooled embeddings. Compute:

  • Mean embedding norm (scale of representation)
  • Per-dimension variance across samples (spread)
  • Max pairwise L2 distance, probed on 50 random sample pairs
  • Intra-class vs inter-class cosine distance on normalized embeddings

Collapse detector (the most important signal): the ratio of max pairwise L2 to mean embedding norm. If below 1% (equivalently, the spread across samples is below rounding noise on the mean vector), the representation has collapsed to a single point. All downstream cluster analyses become mathematically degenerate.

On our collapsed checkpoint: max pairwise L2 / mean norm = 2.9 × 10⁻³, well below the 10⁻² collapse threshold. Per-dimension variance max 2.96 × 10⁻⁶, median 4.51 × 10⁻⁸.
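The collapse detector can be sketched in pure Python as below. The function name, the fixed seed, and the list-of-lists embedding format are illustrative assumptions; the diagnostic script itself presumably works on tensors.

```python
import math
import random


def collapse_ratio(embeddings, n_pairs=50, seed=0):
    """Max pairwise L2 distance over sampled pairs, divided by the mean embedding norm."""
    rng = random.Random(seed)
    mean_norm = sum(math.hypot(*e) for e in embeddings) / len(embeddings)
    max_dist = 0.0
    for _ in range(n_pairs):
        a, b = rng.sample(embeddings, 2)       # two distinct samples
        max_dist = max(max_dist, math.dist(a, b))
    return max_dist / mean_norm


collapsed = [[1.0, 1.0 + 1e-4 * i] for i in range(10)]
healthy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
print(collapse_ratio(collapsed) < 1e-2)   # below the 1% threshold
print(collapse_ratio(healthy) < 1e-2)
```

Dividing by the mean norm makes the threshold scale-free, so the same 1% rule applies whether the embeddings have norm 0.1 or 100.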

Phase 3 — MCL inflation sweep (van Dongen)

Build the cosine similarity graph of the embeddings. Run MCL at inflation ∈ {1.4, 2.0, 3.0, 4.0}. Report cluster count and cluster purity against both leaf labels and a parent-bucket mapping (for Schema.org, the top-level types: Organization, Place, CreativeWork, Intangible, etc.).

Signal read:

  • Cluster count decreases monotonically as inflation decreases.
  • Parent-level purity rises at coarser inflations → embedding geometry encodes the ontology hierarchy. The classifier head is the bottleneck; the representation is healthy.
  • Parent purity is flat across inflations → no hierarchical structure in the embeddings. Either the representation itself is weak, or the task-ontology pairing isn’t well-encoded. Loss function fixes won’t help.
  • One cluster at every inflation → representation collapse, audit is suspended pending a non-collapsed checkpoint.

On our collapsed checkpoint: one cluster at every inflation. The reported parent purity of 0.851 is simply the fraction of val labels whose Schema.org parent is the modal parent — a property of the label distribution, not the embedding geometry.
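The degenerate purity figure is easy to reproduce. The sketch below assumes the usual majority-label definition of cluster purity (the diagnostic script may define it differently) and uses made-up parent labels; it shows that with a single cluster, purity is just the share of the modal label.

```python
from collections import Counter

def cluster_purity(cluster_ids, labels):
    """Fraction of samples whose label is the majority label of their cluster."""
    by_cluster = {}
    for c, y in zip(cluster_ids, labels):
        by_cluster.setdefault(c, Counter())[y] += 1
    majority = sum(cnt.most_common(1)[0][1] for cnt in by_cluster.values())
    return majority / len(labels)

# Toy parent labels (illustrative, not the SOTAB distribution).
parents = ["Intangible"] * 6 + ["Place"] * 3 + ["Organization"]
one_cluster = [0] * len(parents)

purity = cluster_purity(one_cluster, parents)
modal_share = Counter(parents).most_common(1)[0][1] / len(parents)
print(purity, modal_share)
```

Both values are the same number, so a single-cluster purity says nothing about the embedding geometry.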

3. Localising the collapse

Weight inspection of the saved checkpoint:

| Layer | Norm | Status |
|---|---|---|
| pooler.weight (256×256) | 8.68 | healthy |
| pooler.bias (256) | 0.61 | healthy |
| classifier.weight (91×256) | 6.42 | healthy |
| classifier.bias (91) | 0.44 | healthy |
| residual_proj.weight, outer (256×256) | 2.53 | healthy |
| residual_proj.weight, inner (384×384) | 1.33 | healthy |

No dead heads. The collapse is upstream, inside the backbone.

Boundary-predictor diagnostics recorded per-epoch during training stay in the healthy 0.33-0.65 range (mean_F at each stage), so the dynamic chunker is NOT the mechanism. Suspects in the backbone remain:

  1. RWKV-7 time-decay saturation: the w parameter drifts to a fixed point where the recurrent state either never updates or resets every step, making layer output position-independent.
  2. Value-first sharing collapse — layer 0’s v_first becomes input-independent under gate saturation; all downstream layers inherit the same constant.
  3. STE residual interaction — at init the inner path runs at full strength and the residual_proj-scaled skip is near-zero; if the inner path collapses, the skip can’t rescue.

Determining which of the three is responsible requires layer-by-layer ablation and is postponed; the fix is training-regime-level and does not depend on knowing exactly which sub-pathology dominates.

4. Why gt-signals didn’t collapse but SOTAB did

Same model, same optimizer, same learning rate:

| | gt-signals-dbpedia | SOTAB-Schemaorg-CTA |
|---|---|---|
| Train samples | 1,999 | 116,887 |
| Epochs | 20 | 3 |
| Total gradient steps | ~2,500 | ~22,000 |
| Best val macro F1 | 0.126 | 0.0007 |
| Collapse | no | yes |

Roughly 9× more gradient steps (~22,000 vs ~2,500) at the same aggressive hyperparameters is the differentiator. The reference RWKV-LM training recipes in ref/rwkv-lm/ use lr ≈ 1e-4 with 1000+ warmup steps and always clip gradients. We did none of that.
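The step counts in the comparison can be checked with a little arithmetic. The sketch assumes batch 16 for both runs (stated for SOTAB in the Training Regime chapter, assumed here for gt-signals-dbpedia), a single process, no gradient accumulation, and a final partial batch that is kept.

```python
import math

def gradient_steps(n_samples, batch_size, epochs):
    """Optimizer steps when each epoch keeps its final partial batch."""
    return math.ceil(n_samples / batch_size) * epochs

sotab_steps = gradient_steps(116_887, batch_size=16, epochs=3)
gt_steps = gradient_steps(1_999, batch_size=16, epochs=20)  # batch 16 assumed

print(sotab_steps, gt_steps, round(sotab_steps / gt_steps, 1))
```

Under these assumptions the SOTAB run takes a little under nine times as many steps as the gt-signals run.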

5. Reusable methodology

The diagnostic procedure generalises beyond SOTAB:

uv run --no-sync python scripts/sotab_diagnostic.py --max-val-samples 1500

Output at build/diagnostics/{task}/{run_id}/ includes report.md (human-readable), summary.json (machine-readable), and raw arrays (embeddings, predictions, confusion, MCL clusters).

When to run it:

  • After any “plateau” that doesn’t look like it’s converging.
  • Before concluding that a training recipe “works.”
  • Before publishing any F1 number — the diagnostic confirms the number reflects actual learning rather than a collapsed mode prediction that happens to hit the base rate.

6. What the audit is for once we have a real representation

The MCL-inflation-sweep approach is van Dongen’s core contribution: clusters emerge as attractor basins of stochastic flow at different granularities, controlled by one knob (inflation). Our use of it goes beyond post-hoc evaluation:

  • Geometry audit — does the learned embedding space admit recoverable hierarchical structure? If so, the classifier is a readout layer; if not, the representation needs more work.
  • DED-inference alternative to k-means — for cross-table Data Element Discovery, MCL replaces k-means because we don’t know the true number of data elements a priori. MCL finds them without being told.
  • Ontology-level evaluation — multi-granularity purity reporting (leaf vs parent vs grandparent) is a more honest evaluator than flat leaf-level F1, matching the structure of the label space.

None of this is possible on a collapsed checkpoint. The audit is postponed to after Stage B (pretraining) confirms we have a live representation.

Training Regime: Converting Sparse to Dense

“Even serializing out linear paths from the hierarchical regime would help convert our inherently sparse reward task into a more dense, learnable one.”

This chapter is a postmortem and forward plan, written after the first attempt to train Aegir on SOTAB Column Type Annotation failed by way of total representation collapse. It documents the diagnosis, the reframing the failure prompted, the staged plan that follows, and the operational parameters (optimizer, hygiene, compute envelope) that support it.

1. The first attempt and its failure

Aegir’s inaugural task was SOTAB v2 Schema.org Column Type Annotation (91 leaf classes, 116,887 training columns, 1,769 val). The training run used the small config (56M params), 3 epochs, batch 16, max_length 1024, lr 3e-4, AdamW without gradient clipping or explicit warmup.

The loss curve looked “plateaued” — train/val both hovering around 4.1/4.5 with F1 never exceeding 0.011/0.0007 (micro/macro). On the surface this looks like weak learning. A diagnostic pass (see § Diagnostic case study) told a much more specific story:

  • Every val sample produced the identical pooled embedding to within bf16 rounding noise. Max pairwise L2 across 50 probe samples was 0.020 on vectors of mean norm 6.98 — a relative spread of 2.9 × 10⁻³.
  • The classifier predicted currency on 100% of 1,500 val samples. Exact-match accuracy (3.27%) was exactly the val base rate of the mode class.
  • MCL clustering at every inflation returned one cluster, because there was only one point in embedding space to cluster.

The heads (pooler, classifier, residual projections) all had healthy weight norms. The collapse is upstream of them, inside the RWKV+H-Net backbone.

This is not a hyperparameter bug, and fixing it is not primarily a hyperparameter change.

2. The reframing

Three observations explain the failure together:

2.1 Aegir is architecturally a language model

H-Net dynamic chunking, RWKV-7 time-mixing, hierarchical recursion — every mechanism in the backbone was designed for dense per-token supervision on byte/text corpora. The H-Net paper trains this architecture on next-byte language modeling; the chunker learns boundaries as a byproduct of next-byte prediction. There is no published version of H-Net that’s trained from random on sparse classification. Not because nobody tried — because the architecture requires dense gradients to stabilize the chunker and the recurrent state.

2.2 Our task as formulated is sparse-reward

Direct CTA delivers one label per 1024-byte input. A from-scratch model must simultaneously discover byte statistics, column boundaries, content patterns, cross-column context, and the class→label mapping, from a single gradient signal per forward pass. Compare to REVEAL’s 0.815 micro F1 baseline: RoBERTa-base, pretrained on ~160 GB of text, fine-tuned on CTA. Without the pretraining stage REVEAL does not exist. Our attempt was REVEAL stage 2 without stage 1.

2.3 The collapse is a design-use mismatch, not a bug

The proximate mechanism is RWKV-7 time-decay saturation at 22k update steps of direct-CTA under lr 3e-4 with no gradient clipping and minimal warmup. The underlying problem is that a properly-pretrained backbone would already be in a well-conditioned region of parameter space before CTA fine-tuning began, and that region is robust to the saturation basin our first run fell into. Training hygiene helps at the margin; pretraining fixes the structural problem.

3. Two axes for densifying supervision

The reframing suggests two orthogonal moves to convert the sparse task into a dense one. They compose.

3.1 Axis 1 — Byte-level pretraining on domain corpora

Feed the architecture what it was designed for. Next-byte prediction on raw GitTables byte-serializations. Dense signal: every position is a supervision anchor. The chunker learns column-delimiter recognition, cell-pattern boundaries, content-type signatures for free, without any label. This is RWKV-7’s native training regime, applied to our actual domain. The AegirForCausalLM head in src/aegir/models/heads.py was built for this and has never been used.

Data: ~36 GB of GitTables parquets already on disk (/raid/datasets/gittables/, 562,214 tables, ~24 million annotated columns). Byte serialization produces ~36 billion raw bytes. A first probe uses ~100 M bytes.

3.2 Axis 2 — Linearize ontology paths as prediction targets

Instead of predicting schema:Hotel as one token in a 91-way softmax, predict the Schema.org ancestry chain as a sequence:

<BOS> Thing → Organization → LocalBusiness → LodgingBusiness → Hotel <EOS>

Each step in the chain is a ~10-20-way softmax over the children of the current parent. Per-column supervision becomes 3-5 gradient signals instead of 1. Shallow predictions get partial credit. Rare leaf classes inherit gradient from their parents: 1,000+ examples teach Thing → Place → * even when only 50 examples reach the leaf CivicStructure.

The path-prediction head’s softmax is naturally dual-center: each step’s centroid is a parent-level cluster. The dual center loss we argued for on first principles (see § Hierarchical loss design) falls out of the formulation for free.
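The linearization in this section can be sketched as follows. The `PARENT` map is a hand-written, single-parent fragment that reproduces only the Hotel chain shown above; it is an assumption for illustration, not the serializer built from the label-set spreadsheet, and real Schema.org types can have more than one parent.

```python
# Hypothetical single-parent fragment of the hierarchy (child -> parent).
PARENT = {
    "Organization": "Thing",
    "LocalBusiness": "Organization",
    "LodgingBusiness": "LocalBusiness",
    "Hotel": "LodgingBusiness",
}

def ancestry(leaf, parent=PARENT, root="Thing"):
    """Root-to-leaf chain for one label."""
    chain = [leaf]
    while chain[-1] != root:
        chain.append(parent[chain[-1]])  # KeyError if a parent is unknown
    return chain[::-1]

def linearize(leaf):
    """Target token sequence for path prediction."""
    return ["<BOS>", *ancestry(leaf), "<EOS>"]

def decision_steps(leaf):
    """(parent, child) pairs: one softmax decision per edge of the chain."""
    chain = ancestry(leaf)
    return list(zip(chain, chain[1:]))

print(linearize("Hotel"))
print(decision_steps("Hotel"))
```

For Hotel this yields four parent-to-child decisions, which is where the "3-5 gradient signals instead of 1" figure comes from.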

4. Staged plan

The staged plan isolates the failure mode test from the production fix so attribution stays clean.

4.1 Stage A — Hygiene-only direct-CTA rerun

Question: Does training hygiene alone prevent the collapse in the sparse regime?

Change set (bundled because they’re the same intervention at four knobs):

| Knob | Before | After | Reason |
|---|---|---|---|
| Learning rate | 3e-4 | 5e-5 | 6× reduction matches RWKV from-scratch norms. |
| Warmup | none | 1000 steps | Absorbs the initial-lr transient that otherwise destabilises RWKV time decay. |
| Gradient clip | none | max_norm=1.0 | Reference RWKV-LM recipes always clip; we never did. |
| Weight decay | 1e-2 | 1e-4 | AdamW default is aggressive for recurrent architectures. |

Outcome interpretation:

  • Collapse resolves → some F1 > 1e-2: hygiene was sufficient for the direct regime. Useful baseline but still not a competitive path; we proceed to Stage B-C for the real system.
  • Collapse persists: deeper issue (architectural or task-architecture mismatch). Escalate to structural diagnosis.

4.2 Stage B — Byte-level pretraining on GitTables

Question: Does Aegir’s architecture converge on its native training objective?

Configuration:

  • Model: small (56M params, same config as Stage A for A↔B comparison)
  • Objective: next-byte cross-entropy with AegirForCausalLM
  • Data: raw GitTables parquets serialized to byte streams, 100M-byte budget for the first probe
  • Training: the Stage A hygiene bundle, applied to pretraining where it matters most, with the learning rate at the upper end of the §5 range: lr 1e-4, warmup 1000, grad clip 1.0, wd 1e-4.
  • Batch: 64 with grad accumulation (effective 128-384 across DDP)
  • Instrumentation: per-step boundary_diagnostics logging to catch saturation as it happens, not after the fact.

Load-bearing question: if pretraining also collapses, there is a deeper architecture issue (v_first sharing, DeChunk EMA stability, boundary predictor interaction) that we must isolate independently of any task.

4.3 Stage C — Path-prediction fine-tuning

Question: Does pretraining + hierarchical supervision beat flat direct-CTA at the same compute?

Configuration:

  • Start from Stage B’s pretrained checkpoint
  • Implement Schema.org path serializer from CTA_CPA_label_set_schemaorg.xlsx
  • Add AegirForHierarchicalAnnotation head: pooled embedding → autoregressive decode of <BOS> → parent₁ → parent₂ → ... → leaf → <EOS>. Hierarchical cross-entropy per step.
  • Fine-tune on SOTAB. Evaluate at each ontology depth separately.
  • Run the MCL geometry audit (which was uninformative on the collapsed Stage A checkpoint) — now meaningful.

Success threshold: macro F1 at leaf level > Stage A hygiene-only result, and parent-level F1 (coarser granularity) >> leaf-level F1, confirming the hierarchical regularisation helps where it should.
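One possible reading of "hierarchical cross-entropy per step" is a softmax restricted to the children of the current parent, averaged over the steps of the path. The sketch below implements that reading with NumPy. The `CHILDREN` table is a toy fragment, the per-step logits are passed in as plain dictionaries, and averaging rather than summing is an assumption; none of this is the AegirForHierarchicalAnnotation head itself.

```python
import numpy as np

# Toy children table (illustrative only, not the full Schema.org hierarchy).
CHILDREN = {
    "Thing": ["Organization", "Place", "Person"],
    "Organization": ["LocalBusiness", "Corporation"],
    "LocalBusiness": ["LodgingBusiness", "Store"],
    "LodgingBusiness": ["Hotel", "Motel"],
}

def step_loss(logits, parent, target, children=CHILDREN):
    """Cross-entropy over the siblings under `parent` only."""
    options = children[parent]
    scores = np.array([logits[c] for c in options], dtype=np.float64)
    scores = scores - scores.max()                    # numerical stability
    log_probs = scores - np.log(np.exp(scores).sum())
    return float(-log_probs[options.index(target)])

def path_loss(step_logits, path):
    """Mean per-step loss along a root-to-leaf path, plus the per-step values."""
    steps = list(zip(path, path[1:]))
    losses = [step_loss(lg, p, c) for lg, (p, c) in zip(step_logits, steps)]
    return sum(losses) / len(losses), losses

path = ["Thing", "Organization", "LocalBusiness", "LodgingBusiness", "Hotel"]
# Uniform logits: each step's loss is then ln(number of siblings).
uniform = [{c: 0.0 for c in CHILDREN[p]} for p in path[:-1]]
mean_loss, per_step = path_loss(uniform, path)
print(mean_loss, per_step)
```

Reporting `per_step` separately gives the depth-by-depth evaluation the configuration asks for: a model can be right at the parent level while still wrong at the leaf.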

4.4 Stage D — MuonClip infrastructure (parallel track)

Orthogonal to A/B/C. Muon’s Newton-Schulz step gives spectrally-bounded parameter updates; MuonClip adds post-step Q/K row-norm clipping to bound attention logits. Both are direction-addressing fixes to the same class of failure we just hit (parameter saturation under long schedules), stronger than magnitude-only gradient clipping.

Strategy:

  1. Port existing MuonClip code (prior work referenced in Atelier, need to locate).
  2. Bench Muon vs AdamW on the fast gt-signals-dbpedia task (~30 min per run, two runs).
  3. If Muon matches or beats AdamW on the known-learning task, it becomes the default optimizer for Stage B onwards. If not, an interaction with RWKV-7’s unusual parameter shapes needs to be understood before scaling up.

Muon is infrastructure, not a one-shot experiment. Once in, every subsequent phase (Stage C, Phase 1.5 Mergekit, v3 Phase 2 Nano alignment) benefits from a stronger base optimizer.

5. Training hygiene

The Stage A/B bundle is the minimum hygiene for from-scratch RWKV-7. The rationale per knob:

  • Learning rate 1e-4 to 5e-5. RWKV-LM’s from-scratch recipes for sub-1B models sit in this range. 3e-4 works for BPE transformers at scale; byte-level RWKV doesn’t have the same gradient-scale regime.
  • Warmup 1000+ steps. Protects against the initial loss cliff where the untrained decay parameter swings wildly. Skipping warmup is a common cause of early saturation.
  • Gradient clipping max_norm 1.0. Standard for RWKV. Not optional.
  • Weight decay 1e-4. AdamW’s default 1e-2 is calibrated for transformers; it is too strong a regulariser for recurrent architectures where the time-mix parameters are small and precious.
  • bf16 AMP with fp32 optimizer state accumulation. Already in place.
  • Boundary-diagnostics logging per step (new). The collapse we just experienced was detected only post-run; online diagnostics would have surfaced it within the first 500 steps.
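Two of these knobs, warmup and clipping, are simple enough to show without a framework. The sketch below is plain Python under stated assumptions: the warmup is linear and the rate stays constant afterwards (the chapter does not specify a decay schedule), and the clipping follows the global-norm rule that torch.nn.utils.clip_grad_norm_ applies. The function names are illustrative.

```python
import math

def warmup_lr(step, base_lr=1e-4, warmup_steps=1000):
    """Linear warmup to base_lr, then constant (decay schedule not modelled)."""
    return base_lr * min(1.0, (step + 1) / warmup_steps)

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients together so their joint L2 norm is <= max_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor.
    Returns the rescaled gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = min(1.0, max_norm / (total + eps))
    return [[g * scale for g in grad] for grad in grads], total

clipped, norm_before = clip_by_global_norm([[3.0], [4.0]])
norm_after = math.sqrt(sum(g * g for grad in clipped for g in grad))
print(warmup_lr(0), warmup_lr(999), norm_before, norm_after)
```

Because the rescaling is applied jointly, the direction of the update is preserved and only its magnitude is bounded.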

6. Hierarchical loss design

Once we have a non-collapsed representation, the loss function question becomes active. First-principles observations about the domain:

  1. Labels are ontology nodes, not categorical IDs. Softmax CE treats schema:Hotel, schema:Motel, and schema:Person as equally distant from one another, contradicting the actual semantic geometry.
  2. Class distribution is long-tail. A few parent-level clusters dominate. Dual-center loss with inter-class repulsion prevents rare classes from being subsumed into the dominant cluster.
  3. Surface underspecifies the label; context carries it. But the parent level is usually decidable from surface alone. Uncertainty collapses monotonically up the ontology tree — a useful inductive bias that vanilla softmax does not exploit.
  4. H-Net is already hierarchical at the representation level. Dual centers at the output level (leaf + parent) align with it architecturally.
  5. DED (M2) is clustering. Dual-center embeddings ARE clusters. The same head doing CTA at inference produces the column embeddings we hand to B-cubed evaluation for DED.

Path-prediction (Axis 2 above) subsumes dual center loss — each autoregressive step’s softmax is a dual center at that ontology level. So the loss work is done by the head structure itself once Stages B-C are in place.

7. Methodology: MCL as a geometry audit

Borrowed from van Dongen’s MCL (Markov Cluster) algorithm (2000), developed over two decades for bioinformatic orthology detection. MCL simulates stochastic flow on a similarity graph: expansion (random walk via matrix multiplication) alternates with inflation (entrywise power plus renormalization), producing clusters as attractor basins of the flow. No k required; inflation parameter controls granularity.
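The expansion and inflation loop described above fits in a few lines of dense NumPy. This is a minimal teaching sketch, not the implementation used by the diagnostic script and not a substitute for the reference MCL software: it adds self-loops, skips pruning, and reads clusters off by grouping nodes that share the same attractor rows. The 1e-4 read-out cutoff and the toy matrices are assumptions.

```python
import numpy as np

def mcl(similarity, inflation=2.0, max_iter=100, tol=1e-6):
    """Minimal dense Markov clustering; returns one cluster id per node."""
    m = np.clip(np.asarray(similarity, dtype=np.float64), 0.0, None)
    np.fill_diagonal(m, 1.0)                  # self-loops on every node
    m = m / m.sum(axis=0, keepdims=True)      # column-stochastic matrix
    for _ in range(max_iter):
        prev = m
        m = m @ m                             # expansion: two-step random walk
        m = m ** inflation                    # inflation: entrywise power
        m = m / m.sum(axis=0, keepdims=True)  # renormalise each column
        if np.abs(m - prev).max() < tol:
            break
    # Each column's remaining mass sits on its attractor rows; nodes with the
    # same attractor set belong to the same cluster.
    ids = {}
    return [ids.setdefault(frozenset(np.flatnonzero(m[:, j] > 1e-4)), len(ids))
            for j in range(m.shape[0])]

# Two tight groups with weak cross-similarity, and a fully collapsed geometry.
two_groups = np.full((6, 6), 0.05)
two_groups[:3, :3] = 1.0
two_groups[3:, 3:] = 1.0
print(mcl(two_groups))
print(mcl(np.ones((6, 6))))
```

The second call mirrors the collapsed checkpoint: when every pairwise similarity is identical, the flow has nothing to separate and a single cluster comes back at any inflation.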

Used here not for production clustering but as a geometry audit for embedding spaces:

  • Run the model, extract pre-classifier embeddings, build a cosine similarity graph.
  • Sweep inflation ∈ {1.4, 2.0, 3.0, 4.0}.
  • Report cluster count and purity against leaf labels and Schema.org parent labels at each inflation.

Interpretation:

  • Parent purity rises at coarser inflations → representation has recoverable hierarchical structure. The model’s embeddings encode the ontology even if the classifier head doesn’t read it out. Loss function is the appropriate lever.
  • Parent purity flat across inflations → no hierarchical structure. Representation itself is weak. Architectural or training-regime fix required.
  • Single cluster at all inflations → representation is degenerate (collapse). The audit itself is suspended. This is Stage A territory.

On the failed run, the audit was correctly uninformative — MCL produced one cluster because there was one point. Once a non-collapsed checkpoint exists, the audit becomes the tool for answering “does the embedding geometry encode the ontology, or just the base rates?”

8. Compute envelope and scaling headroom

Training hardware is the tinybox (6 × RTX 4090, 24 GB each, 144 GB aggregate). 4090s have no NVLink; P2P runs over a PCIe-switched fabric.

Per-GPU memory profile

At small config (56M params, our current training point) per-GPU static memory is roughly 800 MB, with 2-4 GB of activations + backward buffers at batch 16 / seq 1024. Peak usage sits around 5-6 GB of 24 GB available — ~18 GB of headroom per card.

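| Config | Params | Static mem | Activations (B=16, L=1024) | Fits 4090? |
|---|---|---|---|---|
| tiny | 13.5M | ~0.2 GB | ~1 GB | ✅ trivially |
| small | 56M | ~0.8 GB | ~3 GB | ✅ abundant |
| base | ~500M | ~7 GB | ~6-8 GB | ✅ comfortably |
| large (~2B) | ~2B | ~22-28 GB | ~10 GB | |
| xl (3B+) | 3B+ | 45 GB+ | growing | |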

The knee between “single-GPU fits” and “FSDP required” lies between 500M and 1.5B parameters, depending on batch size and sequence length.

Scaling levers in ascending order of complexity

  1. DDP (data parallel) — current. Full model per GPU, gradients AllReduce’d. Linear speedup up to bandwidth saturation. Unchanged up through base.
  2. Gradient accumulation — free. Effective batch scales with n_gpus × accum_steps × micro_batch. Gets us to batches of 384+ at base size without any new infra.
  3. Activation checkpointing: wrap the Aegir main network in torch.utils.checkpoint and re-compute activations during backward. Trades ~30% compute for ~3× activation memory savings. Worth implementing proactively before we hit any memory wall; even at current size it enables longer sequences and bigger batches.
  4. ZeRO-2 (optimizer + gradient sharding) — saves 6P/N bytes per GPU. On a hypothetical 2B model with 6 GPUs, that’s ~40 GB reclaimed per card. Minimal throughput cost.
  5. FSDP / ZeRO-3 (full parameter sharding) — sharded forward via AllGather, sharded backward. ~10-20% throughput cost but unlocks models that wouldn’t otherwise fit.
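As a quick check on lever 2, the effective-batch formula can be evaluated directly. The helper names are illustrative, and the micro-batch of 16 is borrowed from the current small-config runs as an assumption about what would be used at base size.

```python
import math

def effective_batch(n_gpus, accum_steps, micro_batch):
    """Samples contributing to one optimizer step under DDP + accumulation."""
    return n_gpus * accum_steps * micro_batch

def accum_steps_for(target_batch, n_gpus, micro_batch):
    """Smallest accumulation count that reaches at least `target_batch`."""
    return math.ceil(target_batch / (n_gpus * micro_batch))

print(effective_batch(n_gpus=6, accum_steps=4, micro_batch=16))
print(accum_steps_for(384, n_gpus=6, micro_batch=16))
```

On the six-GPU box, four accumulation steps at micro-batch 16 reach the 384 figure quoted above without any extra memory per card.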

For our target domain (relational metadata), base (~500M) is competitive with REVEAL-class baselines and does not require FSDP. large (~1-2B) is the stretch goal for DED and Nano distillation; it may or may not need FSDP depending on how aggressive we are with batch size and sequence length. We have runway.

Target-domain parameter sizing

  • REVEAL (RoBERTa-base): 125M params, F1 0.815 on SOTAB CTA
  • TURL: ~110M
  • TabBERT / TaBERT: ~350M
  • Byte-level has a ~2-3× parameter penalty vs BPE for equivalent capability, so the competitive byte-level target is 250-400M.

small (56M) is undersized for competitive CTA. base (~500M) is right-sized or slightly overprovisioned. The Stage B pretraining probe runs on small for speed; Stage C production fine-tuning should step up to base once Stage B validates the pipeline.

Forward-looking instrumentation

Every training run’s metadata.json should carry a peak_cuda_memory_mb field (from torch.cuda.max_memory_allocated()). This is a cheap forward-looking indicator of how close each config is getting to the next scaling threshold. No surprises when we move from base to large.
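A minimal sketch of the bookkeeping, assuming metadata.json is a flat JSON object and that megabytes are reported as MiB. The helper `record_peak_memory` is hypothetical; in a training script the byte count would come from torch.cuda.max_memory_allocated(), which returns bytes. Here a literal value is used so the example runs without a GPU.

```python
import json
import tempfile
from pathlib import Path

def record_peak_memory(metadata_path, peak_bytes):
    """Merge a peak_cuda_memory_mb field into a run's metadata.json."""
    path = Path(metadata_path)
    meta = json.loads(path.read_text()) if path.exists() else {}
    meta["peak_cuda_memory_mb"] = round(peak_bytes / 2**20, 1)  # bytes -> MiB
    path.write_text(json.dumps(meta, indent=2))
    return meta

# Demo with a stand-in value of 6 GiB instead of a real CUDA measurement.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "metadata.json"
    target.write_text(json.dumps({"run_id": "demo"}))
    meta = record_peak_memory(target, peak_bytes=6 * 2**30)
    on_disk = json.loads(target.read_text())

print(meta)
```

Existing fields are preserved, so the call can be made once at the end of a run without disturbing whatever else the trainer wrote.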

9. Empirical validation (overnight, 2026-04-20)

Several claims in this chapter were falsifiable hypotheses when written. Stages A and B, kicked off the same evening, delivered verdicts.

| Claim | Section | Verdict | Evidence |
|---|---|---|---|
| “This is not a hyperparameter bug, and fixing it is not primarily a hyperparameter change.” | §1 | confirmed | Stage A hygiene rerun (lr 3e-4 → 5e-5, weight decay 1e-2 → 1e-4, warmup 10% → 15%, grad clip already 1.0) tracks the original collapsed run almost exactly: train loss 4.1281 vs 4.1286, val loss 4.5470 vs 4.5468, best val macro F1 0.0003 vs 0.0007. Four knobs moved coherently changed the outcome by < 1 part in 10³. |
| §4.2 load-bearing question: “does the architecture converge under its designed training regime?” | §4.2 | yes | Stage B byte-level pretraining on raw GitTables descended from loss 5.68 at step 20 (≈ the uniform-prediction loss for a 260-way softmax, ln 260 ≈ 5.56) to 2.26 at step 3040. 3051 steps, 100M-byte budget, small model, SSD kernel active. Checkpoint at outputs/pretrain/20260420T002455Z/final.pt. |
| §7 geometry criterion: “is the pretrained representation actually alive?” | §7 | yes | Post-training, 8 random byte-sequence inputs produced 8 distinct embeddings. Max pairwise L2 = 21.6 on vectors of mean norm 16.0, a collapse ratio of 1.35 vs the 0.01 threshold that flagged the SOTAB checkpoint. Per-dimension variance: median 0.34, max 1.76. The representation varies with input at the expected scale. |
| §8 compute projection: “small-config pretraining has ~18 GB of headroom on a 4090” | §8 | confirmed | Peak CUDA memory during Stage B was 5.7 GB (instrumented per-step via peak_cuda_mem_mb in metrics.jsonl). Stage C fine-tuning on the same hardware is comfortably within budget. |
| §3.1: “the chunker learns boundaries as a byproduct of next-byte prediction” | §3.1 | untested in this probe | Stage B’s boundary_diagnostics were not logged per-step in this first probe. A follow-on instrumented re-run will confirm. |
| §6: “path-prediction subsumes dual center loss” | §6 | untested | Depends on Stage C. |

The combined verdict — hygiene does not escape sparse-CTA collapse, but pretraining does converge and does produce a varied representation — is the one the staged plan was designed to distinguish. The chapter’s argument now has running-code grounding, not just a first-principles shape.

10. v2 mixed-corpus pretrain (2026-04-27)

The Apr 20 single-slice GitTables pretrain (Stage B) demonstrated that the architecture converges under its native objective. v2 extends that result to a 2 GB mixed-corpus pretrain across nine slices and produces the project’s first real backbone — the empirical anchor that the M2/M3 milestones build on.

Run mechanics:

  • Wall clock: 2026-04-26 23:22 → 2026-04-27 09:26 MDT (~10 h 4 m, single GPU 0)
  • Training steps: 122,070
  • Checkpoint: outputs/mixed-v2/20260426T232240Z/final.pt (~174 MB, small config)
  • Intermediate checkpoints retained every 5,000 steps (24 total)
  • Metrics: metrics.jsonl (per-step), metrics_eval.jsonl (4 trained-time eval slices × 25 evals)

Headline: final training-loader bits-per-byte = 1.179 (vs. v1 mixed at 1.202, vs. FineWeb-only baseline at 1.774). The headline gain is small because the mixture distribution itself shifted between v1 and v2; the actual story is in the stratified held-out eval.
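For readers moving between the loss values quoted for Stage B and the bits-per-byte figures here, the conversion is a division by ln 2. The sketch assumes the logged loss is mean cross-entropy in nats per byte; the source does not state the unit explicitly, although the Stage B starting loss near ln 260 suggests it.

```python
import math

def nats_to_bpb(loss_nats):
    """Convert mean per-byte cross-entropy from nats to bits per byte."""
    return loss_nats / math.log(2)

uniform_bpb = nats_to_bpb(math.log(260))  # uniform guess over 260 symbols
stage_b_bpb = nats_to_bpb(2.26)           # Stage B loss at step 3040

print(round(uniform_bpb, 2), round(stage_b_bpb, 2))
```

Under that assumption a uniform 260-way predictor sits at about 8.02 bpb and the Stage B probe ended near 3.26 bpb.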

Stratified held-out comparison (apples-to-apples, both finals on the same 5 slices):

| Held-out slice | v1 final | v2 final | Δ (v2 − v1) | What it measures |
|---|---|---|---|---|
| eval.fineweb-held | 1.601 | 1.608 | +0.007 | General prose perplexity |
| eval.finepdfs-lab-held | 1.882 | 1.784 | −0.098 | Lab/clinical/regulatory prose |
| eval.schemapile-held | 2.888 | 0.997 | −1.891 | Real-world DDL syntax |
| eval.sqale-held | 2.819 | 0.810 | −2.009 | NL+DDL+SQL alignment |
| eval.spider | 0.752* | 2.155 | n/a | *v1 trained on Spider; 0.752 is contamination, not held-out competence. v2 holds Spider out cleanly; 2.155 is genuine generalization from SQaLe. |

What this validates:

  1. Architecture is learning, not just enjoying easier distribution. v1’s headline gain over the FineWeb-only baseline could have been pure distribution effect. The stratified eval shows ~2 bpb drops on the specific slices the v2 mixture targeted, while general prose stays statistically flat. That is targeted learning.
  2. Trimming FineWeb 0.55 → 0.35 did not hurt prose. eval.fineweb-held is statistically indistinguishable (1.601 → 1.608, +0.4%). 700 MB of FineWeb training (35% × 2 GB) is sufficient at this budget.
  3. FinePDFs-lab vocabulary transfer is real. v2 trained on lab/clinical prose for the first time and held-out prose of the same flavor saw a consistent 0.098 bpb drop.
  4. SQaLe → Spider transfer works. v2 never saw Spider during training; Spider bpb dropped from random-init ~4.5 to 2.155. SQaLe was generated against Spider/BIRD as NL exemplars, and the alignment transfers to the source distribution.

Curve shapes: all four trained-time eval slices descended monotonically and plateaued in the last 5–10 evals. eval.schemapile-held and eval.sqale-held are saturating at the 2 GB budget; eval.fineweb-held and eval.spider could still use more bytes.

Forward implications:

  • Multi-GPU step-up justified at the next byte-budget bump. 8 GB on 6 × 4090 ≈ 7 h, vs. v2’s 10 h on a single GPU at 2 GB. DDP path is proven; what’s new is the budget.
  • v3 corpus mix has a clean baseline to beat. Any v3 mixture must keep eval.fineweb-held ≤ 1.61, push eval.finepdfs-lab-held below 1.78, and not regress on schemapile/sqale.
  • BIRD held-out as a second transfer probe in v3 — same logic as Spider in v2, cleaner test.
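The v3 baseline rule can be written down as an explicit check. This is a sketch under assumptions: the v2 numbers are copied from the stratified table, "not regress" is read as "no higher than v2" with an optional tolerance, and the function name `check_v3` is hypothetical.

```python
# v2 final bits-per-byte on the held-out slices (from the table above).
V2 = {
    "eval.fineweb-held": 1.608,
    "eval.finepdfs-lab-held": 1.784,
    "eval.schemapile-held": 0.997,
    "eval.sqale-held": 0.810,
}

def check_v3(v3, v2=V2, tolerance=0.0):
    """Evaluate a candidate v3 mixture against the v2 baseline rules."""
    return {
        "fineweb <= 1.61": v3["eval.fineweb-held"] <= 1.61,
        "finepdfs-lab < 1.78": v3["eval.finepdfs-lab-held"] < 1.78,
        "schemapile no regression":
            v3["eval.schemapile-held"] <= v2["eval.schemapile-held"] + tolerance,
        "sqale no regression":
            v3["eval.sqale-held"] <= v2["eval.sqale-held"] + tolerance,
    }

# Sanity check: v2 scored against its own bar fails only the lab-prose gate.
baseline = check_v3(V2)
print(baseline)
```

Scoring v2 against itself is a useful sanity check: it passes the hold-the-line gates and fails the one gate that asks for an improvement.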

The session note at docs/scratch/2026-04-27/131700_v2_vs_v1_stratified_comparison.md contains the full comparison narrative including the v1 cross-eval that produced the comparison table.

11. The v2 → SOTAB head fine-tune gate

The v2 backbone is healthy in the unsupervised pretraining regime. It has not yet been validated on a supervised CTA objective. The 2026-04-19 representation collapse on direct-from-random SOTAB CTA was the open wound that motivated v2 in the first place; closing that loop requires a fine-tune from outputs/mixed-v2/20260426T232240Z/final.pt that produces non-degenerate per-class F1.

This is the M2 empirical gate. Three liveness thresholds:

  • ≥ 3 distinct embedding clusters at coarse MCL inflation (vs. the single cluster that flagged collapse in April)
  • ≥ 0.10 macro F1 on the held-out SOTAB v2 Schema.org CTA validation set
  • Predictions distributed across ≥ 10 distinct labels (no mode-class collapse)

These are deliberately undemanding. They distinguish “the model is alive” from “the model has collapsed.” If they fail, the underlying problem is architectural, not vocabulary-related, and vocabulary expansion work pauses until it is debugged.
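The three thresholds translate directly into a pass/fail check. The function below is an illustrative sketch (the name `m2_liveness` and the return shape are assumptions), with the April collapse figures from the diagnostic case study as the failing example.

```python
def m2_liveness(n_clusters, macro_f1, n_distinct_labels):
    """Apply the three M2 liveness thresholds; returns (passed, details)."""
    checks = {
        "clusters >= 3": n_clusters >= 3,
        "macro_f1 >= 0.10": macro_f1 >= 0.10,
        "distinct labels >= 10": n_distinct_labels >= 10,
    }
    return all(checks.values()), checks

# The collapsed April checkpoint: one cluster, macro F1 0.0007, one label.
collapsed_ok, collapsed_detail = m2_liveness(1, 0.0007, 1)
# A hypothetical fine-tune that just clears every threshold.
alive_ok, _ = m2_liveness(3, 0.10, 10)
print(collapsed_ok, collapsed_detail, alive_ok)
```

Returning the per-check dictionary alongside the verdict makes it clear which threshold failed when the gate does not clear.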

If they pass, the Phase 1 supervised roadmap becomes meaningful: competitive F1 numbers against published baselines (SOTAB-CTA macro F1 > 0.85 easy, > 0.65 hard, etc.) become legitimate next targets, vocabulary expansion past the copied baseline begins, and vocab_label_map.json v1.0.0 ships as the first outward release.

12. How this relates to v3

The v3 concept brief proposed a phased plan: Phase 1 (Aegir-only baseline) → Phase 1.5 (Mergekit specialist fusion) → Phase 2 (conditional Nano latent alignment). All three phases assumed “a working Aegir baseline.” The story in this chapter is what “working” means: Aegir cannot be trained from random on sparse classification — it needs pretraining + supervised fine-tune from a healthy backbone. The v2 mixed-corpus pretrain provides the backbone; the M2 head fine-tune provides the supervised half.

Phase 1.5 Mergekit fusion becomes stronger under this picture. The specialists it fuses will each be pretrained-then-task-finetuned, so the task-vectors it combines have genuine semantic structure rather than the small delta between random init and a barely-moved classifier.

Phase 2 Nano alignment becomes better grounded. v3 assumed Aegir had some baseline representation to align to Nano’s; the v2 stratified eval confirms that representation exists in the unsupervised regime. The supervised half of the alignment story still requires the M2 gate to clear.

13. Further reading

  • Diagnostic case study: representation collapse on SOTAB-Schema.org
  • Ontology Charter — the empirical gate formally specified, plus the outward contract Ægir publishes
  • Phase 1 supervised roadmap — current fine-tune-from-v2 plan that supersedes the from-random approach
  • Training tactics (docs/current/src/pretraining/training_tactics.md, not wired into the rendered book) — pre-existing ontology-side training objectives
  • Session notes:
    • docs/scratch/2026-04-19/ and docs/scratch/2026-04-20/ — Stage A/B findings
    • docs/scratch/2026-04-21/061600_overnight_corpus_and_mixed_training.md — v1 mixed-corpus run
    • docs/scratch/2026-04-23/232400_v2_corpus_kickoff.md — v2 setup
    • docs/scratch/2026-04-27/131700_v2_vs_v1_stratified_comparison.md — the v2-vs-v1 stratified result this section reports

Ontology

Aegir is the canonical owner of the bespoke BFO 2020 / CCO-grounded ontology used by the metadata-tagging stack — the Signals Data Governance (SDG) ontology — and of the ontology-grounded synthetic-data pipeline that produces in-distribution pretraining bytes against it. The ontology, its rigor program, and the realized OWL artifact published outward all live in this chapter. The chapter covers what the ontology is now, how its classes function as the annotation vocabulary, the quantitative rigor metrics and the formal publish gate every extension must clear, and the disposal membranes that enforce rigor rather than assert it.

The ontology conditions everything downstream — it is the annotation vocabulary for Column Type / Column Property Annotation (CTA/CPA) over wide relational tables. Its classes are not leaf terms but intermediate-depth subsumers: the property-bearing classes a heterogeneous-but-coherent column belongs to. Defining those classes well is building the annotation vocabulary, and the gates exist to keep every term a coherent, grounded annotation target. Putting the ontology next to the model is the only arrangement where these decisions stay coherent.

What the ontology is now

sdg-ontology is content-derived from FinePDFs (qdrant/ColBERT MaxSim domain filtering over a SKOS index) and realized to a HermiT-validated OWL artifact at corpora/ontology/sdg-ontology.{omn,owl}, with a consistency certificate at corpora/ontology/HERMIT_CERTIFICATE.md. The seven family catalogs (src/aegir/ontology/catalog/01…07) are a seed and regression baseline; FinePDFs-derived intermediate classes accrete in 08_derived.json, and the live driver is the content-first derivation pipeline (scripts/derive_ontology.py, scripts/define_intermediate_classes.py), not a fixed template count.

The architecture is an agent-mediated propose / dispose feedback loop: an engine proposes axioms; a stack of deterministic membranes (parse → HermiT with CCO imported as a reasoning authority → OntoClean) disposes and returns the reason; the agent responds and refines. Rigor is enforced, not asserted. The Authors Guide is the canonical reference for every metric, band, gate, and membrane.

Scope summary

| Concern | Owner | Notes |
|---|---|---|
| SDG ontology IRIs + BFO/CCO grounding | Ægir | src/aegir/ontology/catalog/*.json → realized corpora/ontology/sdg-ontology.{omn,owl} |
| Content-first derivation (FinePDFs → classes) | Ægir | scripts/derive_ontology.py, scripts/define_intermediate_classes.py |
| Grounding-anchor retrieval (CCO + FHIR + accretive) | Ægir | scripts/grounding_anchors.py |
| Rigor metrology + OQuaRE publish gate | Ægir | scripts/ontology_metrology.py, scripts/ontology_oquare.py |
| Disposal membranes (parse / HermiT / OntoClean) | Ægir | scripts/build_realized_ontology.py, src/aegir/ontology/ontoclean.py |
| Ontology-grounded synthetic corpus + DDL spine | Ægir | scripts/generate_chapter.py, scripts/verify_chapters.py, src/aegir/ontology/ddl.py, realize.py |
| CTA / CPA dataset loaders | Ægir | src/aegir/data/table_dataset.py |
| Model training + evaluation | Ægir | train.py, train_pretrain.py, AegirForColumnAnnotation |
| Consumer-side use of the above | downstream projects | Outside Ægir’s design constraints |

A separate sibling project (Atelier) consumes Ægir-produced artifacts as an independent pretraining-efficacy gate. Atelier’s own docs describe what it needs from this contract, but those docs are advisory input here, not specification.

Sub-pages

  • Authors Guide — metrics & quality gates (canonical): the full quantitative metric suite (IOF rigor dimensions, OntoQA/OQuaRE structural metrics, OntoClean proxies), the OQuaRE publish gate with its [1,5] bands and floors, the disposal membranes, and the pre-registered OQ-Rigor / OQ-Structure objectives, with the exact formulas the tooling enforces
  • Charter — Ægir’s internal direction-setter for the ontology scope: provenance discipline, the committed BFO/CCO branch structure, and external-standard anchors
  • Migration — authoring history for the initial bespoke vocabulary
  • Concept brief — RLVR for ontology generation — the design of the long-horizon Signals M4 apparatus: a four-component verifiable reward R(O, I) over OWL artifacts and a GRPO-trained, SAE-instrumented local policy targeting it. That reward is now realized as the deterministic membrane stack (HermiT/CCO, OntoClean, OQuaRE) that the agent-mediated propose/dispose loop — documented in the Authors Guide — is building and proving today
  • Semantic engine — authoritative reference — the operational-state description of the SDG ontology, the rigor program, and the closed-loop synthetic-data pipeline
  • RLVR for ontology generation — the externally-readable methodological chapter for the long-horizon M4 apparatus: the verifier R(O, I), now realized as the membrane stack, and the SAE-instrumented-Qwen policy that GRPO trains against it to autonomously generate ontology extensions

Ontology Authors Guide

This guide is written for an ontology engineer who wants to extend the sdg ontology independently — add classes, definitions, roles, properties — and have a reasonable expectation of passing every quantitative and qualitative gate, and ideally improving on them. It documents the full metric suite, the formal quality gate, the disposal membranes, and the pre-registered objectives, with the exact formulas, bands, and thresholds the tooling enforces. Nothing here is aspirational: every number is the value the code checks.

The governing principle is propose / dispose. You (or an agent) propose axioms; a stack of deterministic membranes disposes — admitting only what is well-formed, logically consistent under the reasoner, and ontologically clean. Rigor is enforced, not asserted, and the two strongest membranes (HermiT and OntoClean) are un-fakeable: you cannot talk your way past a contradiction or an anti-rigidity violation. If your extension passes, it is genuinely rigorous; if it fails, the gate returns the reason and you refine.


1. What you are extending

sdg-ontology is a BFO 2020 / CCO-grounded domain ontology, content-derived from FinePDFs and realized to a HermiT-validated OWL artifact at corpora/ontology/sdg-ontology.{omn,owl} with a consistency certificate at corpora/ontology/HERMIT_CERTIFICATE.md.

Its purpose is to be the annotation vocabulary for Column Type / Column Property Annotation (CTA/CPA) over wide relational tables. That reframes what most of the classes are: they are not leaf terms but intermediate-depth subsumers, the property-bearing classes a heterogeneous-but-coherent column belongs to. A driver_stops_schedule.stops_addresses column holds a mix (origin + destination, residential + business shipping addresses, each bearing an avg-time-on-site); no leaf type fits. The right annotation is the least common subsumer that is still property-bearing, e.g. Address ⊓ ∃has-shipping-role ⊓ ∃avg-time-on-site. Defining these intermediate classes well is building the annotation vocabulary. When you extend the ontology, you are extending that vocabulary, and the gates exist to keep every term a coherent, grounded annotation target.

Namespaces

| prefix | IRI base | use |
| --- | --- | --- |
| bfo: | http://purl.obolibrary.org/obo/BFO_ | upper categories (numeric IRIs, e.g. bfo:0000040) |
| cco: | https://www.commoncoreontologies.org/ | mid-level genera (numeric IRIs, e.g. cco:ont00000713 = Vehicle); note https |
| fhir: | http://hl7.org/fhir/ | clinical/record types, bridged to cco:InformationContentEntity |
| iao: | http://purl.obolibrary.org/obo/IAO_ | annotation properties (iao:0000115 = definition) |
| sdg: | https://signals360.example.org/sdg# | our own classes/properties |
| skos:, rdfs:, owl: | standard | labels, definitions, structure |

BFO categories you will reach for

bfo:0000040 material entity · bfo:0000004 independent continuant · bfo:0000002 continuant · bfo:0000015 process · bfo:0000031 generically dependent continuant (ICE) · bfo:0000019 quality · bfo:0000020 specifically dependent continuant · bfo:0000023 role · bfo:0000016 disposition · bfo:0000034 function · bfo:0000017 realizable entity. Realizable-machinery properties: bfo:0000055 realizes · bfo:0000052 inheres-in · bfo:0000053 bearer-of · bfo:0000054 realized-in.


2. How you author

Classes are authored as catalog templates — a Manchester-syntax skeleton with typed slots — that the realizer renders, grounds, and validates into the OWL artifact. The seven family JSON files live in src/aegir/ontology/catalog/; FinePDFs-derived intermediate classes accrete in 08_derived.json. Edit the family .json files, never combined.json (regenerated).

The slot DSL

{name:Type}                  e.g. {X:Class}, {p:ObjectProperty}, {Y:Class}
{name:Type:Bound}            subtype constraint: {X:Class:bfo:Continuant}

Type ∈ {Class, ObjectProperty, DataProperty, Individual}. A CatalogTemplate carries manchester_template, slot_types, verbal_template (an NL gloss → becomes the definition annotation), bfo_anchor_path, and provenance. Three canonical shapes:

Class: {X:Class} SubClassOf: {Y:Class}                              # primitive (a kind, undefined)
Class: {X:Class} SubClassOf: {p:ObjectProperty} some {Y:Class}      # existential restriction
Class: {X:Class} EquivalentTo: {Y:Class} and {p:ObjectProperty} some {Z:Class}   # DEFINED (genus + differentia)
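To make the slot grammar concrete, here is a minimal sketch that extracts slots from a template string with a regular expression. The names parse_slots and Slot are hypothetical and are not the realizer's API; src/aegir/ontology/SLOT_DSL.md remains the authoritative grammar.

```python
import re
from typing import NamedTuple, Optional

SLOT_TYPES = {"Class", "ObjectProperty", "DataProperty", "Individual"}

# {name:Type} or {name:Type:Bound}. The bound may itself contain a colon
# (e.g. bfo:Continuant), so it captures everything up to the closing brace.
SLOT_RE = re.compile(r"\{(\w+):(\w+)(?::([^{}]+))?\}")


class Slot(NamedTuple):
    name: str
    type: str
    bound: Optional[str]


def parse_slots(template: str) -> list[Slot]:
    """Return the slots of a template in order of first appearance."""
    slots: list[Slot] = []
    seen: set[str] = set()
    for name, slot_type, bound in SLOT_RE.findall(template):
        if slot_type not in SLOT_TYPES:
            raise ValueError(f"unknown slot type {slot_type!r} in slot {name!r}")
        if name not in seen:  # a slot may be referenced more than once
            seen.add(name)
            slots.append(Slot(name, slot_type, bound or None))
    return slots
```

Applied to the existential-restriction shape above, this yields three slots (X, p, Y); a typo such as {X:Klass} raises ValueError instead of silently producing an untyped slot.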

Manchester conventions the membranes enforce

  1. Prefixes are lowercase only: cco: bfo: fhir: sdg:, never CCO:/BFO:.
  2. Every property is prefixed: sdg:hasMeasurement some X, never bare hasMeasurement.
  3. Coined classes/properties use sdg: (camelCase). Do not invent cco:/bfo: names; those are numeric IRIs you must look up (see "Grounding: choosing a genus" below). No # comments (they break the OMN parser).
  4. The genus must be a broader class — a real BFO/CCO/sdg parent, never the class itself.
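The first three conventions are mechanical enough to lint before an axiom reaches the parse membrane. The following is a rough sketch only: lint_manchester is a hypothetical helper, not part of evolve_rigor, and its regular expressions are heuristics that assume the input is the axiom text alone (no annotation strings).

```python
import re

KNOWN_PREFIXES = ("bfo", "cco", "fhir", "iao", "sdg", "skos", "rdfs", "owl", "xsd")
RESTRICTION_KEYWORDS = r"(?:some|only|min|max|exactly|value)"


def lint_manchester(axiom: str) -> list[str]:
    """Return human-readable convention violations (empty list = clean)."""
    problems = []
    if "#" in axiom:
        problems.append("'#' comment found (breaks the OMN parser)")
    # Mask slots such as {X:Class:bfo:Continuant} so their colons are ignored.
    masked = re.sub(r"\{[^{}]*\}", "SLOT", axiom)
    # A prefix is a word immediately followed by ':' and a name character.
    for prefix in re.findall(r"\b([A-Za-z]+):(?=[A-Za-z0-9_])", masked):
        if prefix.lower() in KNOWN_PREFIXES and prefix != prefix.lower():
            problems.append(f"uppercase prefix '{prefix}:' (use '{prefix.lower()}:')")
    # A property is the token directly before a restriction keyword.
    for token in re.findall(rf"(\S+)\s+{RESTRICTION_KEYWORDS}\b", masked):
        prop = token.lstrip("(")
        if prop != "SLOT" and ":" not in prop:
            problems.append(f"bare property '{prop}' (add a prefix, e.g. sdg:{prop})")
    return problems
```

An empty result does not mean the axiom will parse; it only rules out the three most common rejections, which are cheap to catch before involving OWLAPI.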

EquivalentTo vs SubClassOf — the single most important authoring choice

A class with SubClassOf: is primitive (necessary conditions only). A class with EquivalentTo: genus and differentia is defined (necessary and sufficient): anything that is the genus and satisfies the differentia is an instance. Prefer EquivalentTo wherever the differentia are genuinely sufficient; this is the definitional_completeness lever and the IOF discipline (the IOF defines ~55% of its terms this way). Do not force it: a genuine natural kind whose essence is not captured by the stated relations should stay primitive. Reserve EquivalentTo for kinds; model roles with the realizable pattern (see "Roles and the realizable machinery" below), not as defined subclasses.

Grounding: choosing a genus

Every class must chain to a BFO category — directly, or through CCO/FHIR. Look up the real IRI with the grounding-anchor retriever rather than inventing one:

uv run --no-sync python scripts/grounding_anchors.py query "shipping address"
#   0.74 [cco]  Mailing Address     cco:ont00000xxx
#   0.59 [fhir] Address             fhir:Address
#   0.55 [sdg]  StopLocation        sdg:StopLocation   (reuse our own)

The index spans CCO (1431 BFO-aligned genera), FHIR R5 (210 record types, bridged to cco:InformationContentEntity), and our own grounded classes (the index accretes — each class you ground becomes a reusable anchor). Prefer, in order: an existing sdg: class (reuse), a CCO/FHIR genus, then a bare BFO category as a last resort. A generic bfo:0000040 placeholder where you meant “Patient” is grounded but shallow — find the real genus.
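The preference order can be read as a two-level sort: source tier first, retrieval score second. The sketch below illustrates that reading; the Anchor record, pick_genus, and the min_score cutoff are all hypothetical, and the retriever's actual output format may differ from the three-line listing shown above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Lower rank = preferred: reuse sdg:, then a CCO/FHIR genus, then bare BFO.
SOURCE_RANK = {"sdg": 0, "cco": 1, "fhir": 1, "bfo": 2}


@dataclass(frozen=True)
class Anchor:
    score: float   # retrieval similarity
    source: str    # "sdg", "cco", "fhir" or "bfo"
    label: str
    iri: str


def pick_genus(candidates: Iterable[Anchor], min_score: float = 0.5) -> Optional[Anchor]:
    """Pick the preferred genus among candidates scoring at least min_score."""
    eligible = [c for c in candidates if c.score >= min_score]
    if not eligible:
        return None  # nothing relevant enough: rephrase the query
    # Prefer the better source tier; break ties with the higher score.
    return min(eligible, key=lambda c: (SOURCE_RANK.get(c.source, 3), -c.score))
```

With the three hits listed above and the default cutoff, this picks sdg:StopLocation (reuse); raising min_score to 0.6 leaves only the CCO hit. Whether reuse should win over a much higher-scoring CCO genus is a modeling judgment the sketch does not settle.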

Roles and the realizable machinery

A role is anti-rigid and relational (supplier/operator/origin-address: the bearer could stop being it and still exist). Model it as a BFO role, never a rigid subclass:

Class: {OperatorRole:Class} SubClassOf: bfo:0000023,
   bfo:0000052 some {Operator:Class}, bfo:0000054 some {OperationProcess:Class}

The inheres-in (bfo:0000052) and realized-in (bfo:0000054) restrictions are what the realizable_machinery metric counts and what BFO discipline requires.


3. The quantitative metrics

All metrics are computed by scripts/ontology_metrology.py::compute() (pure rdflib, JVM-free) over the realized .owl, with CCO’s subClassOf backbone merged so that cco:-grounded chains resolve to BFO. Run:

uv run --no-sync python scripts/ontology_metrology.py corpora/ontology/sdg-ontology.owl   # or --json

n = number of sdg: named classes. Each metric below lists its formula, its IOF/field target, and the authoring lever that moves it.

3.1 IOF-derived rigor dimensions (what field-standard suites miss)

| metric | formula | target | lever |
| --- | --- | --- | --- |
| definitional_completeness | ‖{c : c owl:equivalentClass …}‖ / n | IOF ≈ 0.55 | write EquivalentTo (genus+differentia) for definable kinds |
| bfo_grounded | ‖{c : subClassOf/≡-genus chain reaches a BFO IRI}‖ / n | 1.0 | ground every class to a BFO/CCO/FHIR genus |
| realizable_machinery | count of restrictions on realizes/inheres/bearer/realized props or some role/disposition/function | IOF ≥ 14 | model roles/dispositions/functions with the realizable pattern |
| def_annotation_coverage | ‖{c : rdfs:comment ∨ iao:0000115 ∨ skos:definition}‖ / n | 1.0 (IAO req) | supply a verbal_template; the realizer emits iao:0000115 + rdfs:comment |

These are the discriminators: an LLM (or a hasty author) recovers taxonomy + existentials (structure) but not sufficiency, full grounding, or BFO role discipline (rigor). They are where the FunctionalAdequacy gate floor lives.
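As an illustration of the three ratio metrics, the sketch below computes them over plain Python dictionaries rather than an RDF graph. The input shapes and the names reaches_bfo and rigor_metrics are assumptions made for this example; the real compute() works on the realized .owl via rdflib with the CCO backbone merged.

```python
def reaches_bfo(iri, parents):
    """True if some SubClassOf / EquivalentTo-genus chain from iri reaches a bfo: IRI."""
    stack, seen = [iri], set()
    while stack:
        node = stack.pop()
        if node.startswith("bfo:"):
            return True
        if node in seen:          # guards against subsumption cycles
            continue
        seen.add(node)
        stack.extend(parents.get(node, ()))
    return False


def rigor_metrics(classes, parents):
    """classes: {sdg_iri: {"equivalent": bool, "definition": bool}}
    parents: {iri: set of genus IRIs}, including any CCO backbone edges.
    """
    n = len(classes)
    if n == 0:
        raise ValueError("no sdg: classes to measure")
    return {
        "definitional_completeness": sum(c["equivalent"] for c in classes.values()) / n,
        "bfo_grounded": sum(reaches_bfo(iri, parents) for iri in classes) / n,
        "def_annotation_coverage": sum(c["definition"] for c in classes.values()) / n,
    }
```

The parents map must carry the CCO edges as well: a class grounded to a cco: genus counts as BFO-grounded only because the merged backbone lets the chain continue up to a bfo: IRI.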

3.2 Field-standard structural metrics (OntoQA / OQuaRE)

| metric | formula | reading |
| --- | --- | --- |
| rr relationship richness | n_∃some / (n_subClassOf + n_∃some) | non-taxonomic richness; a pure tree → 0 |
| ir inheritance richness | n_subClassOf / n | subclasses per class |
| ar attribute richness | n_DatatypeProperty / n | typed data attributes per class |
| aronto axiomatic strength | (n_∃some + n_∀only + n_card) / n | restrictions per class |
| dit depth | longest subClassOf chain | taxonomic depth (more developed = deeper) |
| tm tangledness (inverted) | ‖{c : >1 named parent}‖ / n | multiple-inheritance load; lower is better |
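The structural formulas reduce to a few counts plus a walk over the parent map. A minimal sketch follows; the function names are hypothetical, dit is counted in edges (an assumption, since the table does not say whether the chain is measured in edges or nodes), and the parent map is assumed acyclic and keyed by sdg: classes only.

```python
from functools import lru_cache


def richness(n, n_subclassof, n_some, n_only=0, n_card=0, n_datatype_props=0):
    """Count-based OntoQA/OQuaRE ratios; n is the number of sdg: named classes."""
    denom = n_subclassof + n_some
    return {
        "rr": n_some / denom if denom else 0.0,
        "ir": n_subclassof / n,
        "ar": n_datatype_props / n,
        "aronto": (n_some + n_only + n_card) / n,
    }


def depth_and_tangledness(parents):
    """parents maps each class to its set of named parents (must be acyclic)."""

    @lru_cache(maxsize=None)
    def depth(cls):
        # Longest chain of subClassOf edges above cls; a root has depth 0.
        return max((1 + depth(p) for p in parents.get(cls, ())), default=0)

    n = len(parents)
    dit = max((depth(c) for c in parents), default=0)
    tm = sum(1 for ps in parents.values() if len(ps) > 1) / n if n else 0.0
    return {"dit": dit, "tm": tm}
```

A pure taxonomy (n_some = 0) gives rr = 0, which is the "pure tree" case in the table; adding existential restrictions raises rr and aronto together.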

3.3 OntoClean taxonomic-correctness proxies (un-gameable)

Reasoner-invisible defects that a generic LLM cannot fake (it names meta-properties at ~96% but cannot operationalize them). Computed via the OntoClean classifier (src/aegir/ontology/ontoclean.py).

| metric | formula | target |
| --- | --- | --- |
| subsumption_cycles | classes reachable from themselves via subClassOf (OOPS! P06) | 0 (hard) |
| ontoclean_violations | subClassOf edges where an anti-rigid (role) parent subsumes a non-anti-rigid (rigid) child | 0 |
| sibling_disjointness | fraction of same-parent sibling pairs asserted owl:disjointWith (OOPS! P10) | → 1.0 |
| orphan_rate | fraction of sdg: classes with no parent (OOPS! P04: islands) | → 0 |
| taxonomic_cleanliness | 1 − (subsumption_cycles + ontoclean_violations) / n_subClassOf | 1.0 |
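The proxies are simple graph checks once each class carries a rigidity tag. The sketch below is a hedged illustration over an edge list: taxonomy_proxies and its inputs are hypothetical, the rigidity tags are taken as given (assigning them is the classifier's job in ontoclean.py), and sibling_disjointness is left out for brevity.

```python
def taxonomy_proxies(edges, anti_rigid, sdg_classes):
    """edges: (child, parent) subClassOf pairs; anti_rigid: classes tagged as roles."""
    edges = list(edges)
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)

    def reaches_itself(start):
        # Walk upward from start; a cycle exists if we arrive back at start.
        stack, seen = list(parents.get(start, ())), set()
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        return False

    cycles = sum(reaches_itself(c) for c in parents)
    # OntoClean constraint: an anti-rigid parent must not subsume a rigid child.
    violations = sum(1 for c, p in edges if p in anti_rigid and c not in anti_rigid)
    orphans = sum(1 for c in sdg_classes if not parents.get(c))
    return {
        "subsumption_cycles": cycles,
        "ontoclean_violations": violations,
        "orphan_rate": orphans / len(sdg_classes),
        "taxonomic_cleanliness": 1 - (cycles + violations) / len(edges) if edges else 1.0,
    }
```

None of these defects makes the ontology inconsistent, which is why HermiT cannot see them and a separate membrane has to count them.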

3.4 Consistency

HermiT over the realized ontology with CCO imported. Consumed by the gate from the certificate (isConsistent: true). Zero unsatisfiable classes is the real bar — an ontology can be isConsistent yet contain unsatisfiable classes (classes that can have no instances); both must be clean for a publish.


4. The OQuaRE quality gate

scripts/ontology_oquare.py is the formal publish gate. OQuaRE (Duque-Ramos et al. 2011) adapts ISO/IEC 25000 (SQuaRE) to ontologies: each metric is normalized to [1,5] against fixed, IOF-anchored bands, then aggregated into six characteristics and one holistic score.

uv run --no-sync python scripts/ontology_oquare.py corpora/ontology/sdg-ontology.owl \
    --certificate corpora/ontology/HERMIT_CERTIFICATE.md

4.1 Normalization bands (fixed a priori — a stable distance-to-IOF)

Piecewise-linear interpolation between breakpoints (value, score), clamped to [1,5]:

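| metric | →1 | →3 | →5 |
| --- | --- | --- | --- |
| definitional_completeness | 0.00 | 0.25 | 0.55 |
| bfo_grounded | 0.50 | 0.85 | 1.00 |
| realizable_machinery | 0 | 5 | 14 |
| def_annotation_coverage | 0.00 | 0.70 | 1.00 |
| rr | 0.00 | 0.25 | 0.50 |
| ir | 0.00 | 1.00 | 3.00 |
| ar | 0.00 | 0.30 | 1.00 |
| aronto | 0.00 | 0.60 | 1.50 |
| dit | 1 | 3 | 8 |
| tm (inverted) | 0.50 | 0.15 | 0.00 |
| consistent | inconsistent | unknown | consistent |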
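The banding rule can be sketched as follows, using the breakpoints from the table. The names BANDS and normalize are illustrative and may not match ontology_oquare.py; the categorical consistent row is handled by a separate lookup, and sorting the breakpoints by raw value is what makes the inverted tm band work with the same code path.

```python
BANDS = {
    "definitional_completeness": [(0.00, 1), (0.25, 3), (0.55, 5)],
    "bfo_grounded": [(0.50, 1), (0.85, 3), (1.00, 5)],
    "realizable_machinery": [(0, 1), (5, 3), (14, 5)],
    "def_annotation_coverage": [(0.00, 1), (0.70, 3), (1.00, 5)],
    "rr": [(0.00, 1), (0.25, 3), (0.50, 5)],
    "ir": [(0.00, 1), (1.00, 3), (3.00, 5)],
    "ar": [(0.00, 1), (0.30, 3), (1.00, 5)],
    "aronto": [(0.00, 1), (0.60, 3), (1.50, 5)],
    "dit": [(1, 1), (3, 3), (8, 5)],
    "tm": [(0.50, 1), (0.15, 3), (0.00, 5)],  # inverted: lower raw value scores higher
}
CONSISTENT_SCORE = {"inconsistent": 1, "unknown": 3, "consistent": 5}


def normalize(metric: str, value: float) -> float:
    """Piecewise-linear interpolation between (value, score) breakpoints, clamped."""
    pts = sorted(BANDS[metric])  # ascending by raw value
    if value <= pts[0][0]:
        return float(pts[0][1])
    if value >= pts[-1][0]:
        return float(pts[-1][1])
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if x0 <= value <= x1:
            return y0 + (y1 - y0) * (value - x0) / (x1 - x0)
    raise AssertionError("unreachable: value lies between the outer breakpoints")
```

For example, a definitional_completeness of 0.40 sits halfway between the 0.25 and 0.55 breakpoints and therefore scores about 4.0, while a tm of 0.15 scores 3.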

4.2 Characteristics (which metrics feed each)

| characteristic | constituent metric scores |
| --- | --- |
| Structural | aronto, dit, tm, bfo_grounded, rr |
| FunctionalAdequacy | definitional_completeness, realizable_machinery, def_annotation_coverage |
| Reliability | bfo_grounded, consistent, tm |
| Operability | def_annotation_coverage, rr |
| Maintainability | tm, dit, ir |
| Transferability | bfo_grounded, def_annotation_coverage |

aggregate = mean of the six characteristics.
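A short sketch of the aggregation, assuming each characteristic is the unweighted mean of its constituent [1,5] scores. That per-characteristic mean is an assumption made here (the text only states that the aggregate is the mean of the six characteristics), and the names CHARACTERISTICS and oquare are illustrative.

```python
from statistics import mean

CHARACTERISTICS = {
    "Structural": ["aronto", "dit", "tm", "bfo_grounded", "rr"],
    "FunctionalAdequacy": [
        "definitional_completeness", "realizable_machinery", "def_annotation_coverage",
    ],
    "Reliability": ["bfo_grounded", "consistent", "tm"],
    "Operability": ["def_annotation_coverage", "rr"],
    "Maintainability": ["tm", "dit", "ir"],
    "Transferability": ["bfo_grounded", "def_annotation_coverage"],
}


def oquare(scores):
    """scores: {metric: normalized score in [1, 5]} -> (characteristics, aggregate)."""
    characteristics = {
        name: mean(scores[m] for m in metrics)  # assumption: unweighted mean
        for name, metrics in CHARACTERISTICS.items()
    }
    return characteristics, mean(characteristics.values())
```

Because bfo_grounded, tm and def_annotation_coverage each feed several characteristics, structural and grounding gains can lift the aggregate well above 3.5 while FunctionalAdequacy stays low; this is the situation a separate floor on that characteristic guards against.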

4.3 The gate (GREEN requires all three)

| check | floor |
| --- | --- |
| oquare_aggregate | ≥ 3.5 |
| functional_adequacy | ≥ 3.0 |
| hermit_consistent | == true |

The aim is 3.9, the published OQuaRE class of Brick (3.93) / RealEstateCore (3.91). The FunctionalAdequacy ≥ 3.0 floor is deliberate: it forces definitional rigor and BFO discipline, not structural/grounding gains alone. The gate is hard-wired into aegir.lineup.sync._gate(): sync --push of the ontology is refused below GREEN, so a regression cannot be published.
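The three checks combine as a plain conjunction. The sketch below shows one way to express that and to return the failing reasons; it is not the body of aegir.lineup.sync._gate(), and the function name and return shape are assumptions.

```python
FLOORS = {"oquare_aggregate": 3.5, "functional_adequacy": 3.0}


def publish_gate(oquare_aggregate, functional_adequacy, hermit_consistent):
    """Return (is_green, reasons); GREEN only if all three checks pass."""
    reasons = []
    if oquare_aggregate < FLOORS["oquare_aggregate"]:
        reasons.append(
            f"oquare_aggregate {oquare_aggregate:.2f} < {FLOORS['oquare_aggregate']}"
        )
    if functional_adequacy < FLOORS["functional_adequacy"]:
        reasons.append(
            f"functional_adequacy {functional_adequacy:.2f} < {FLOORS['functional_adequacy']}"
        )
    if hermit_consistent is not True:  # an unknown certificate status also fails
        reasons.append("HermiT certificate does not report isConsistent: true")
    return (not reasons, reasons)
```

An aggregate of 4.33 with a FunctionalAdequacy of 2.0 is refused even though the aggregate clears its own floor comfortably, which is the case the separate floor exists to catch.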


5. The disposal membranes (what rejects your extension, and why)

Your axioms pass through these in order. Each returns a reason, so a failure is a repair instruction, not a dead end (this is the agent-mediated feedback loop; a human author reads the same reasons).

  1. Parse membrane (evolve_rigor.validate_detailed) — renders the axiom standalone and parses it under OWLAPI. Rejects malformed Manchester: uppercase prefixes, bare properties, undeclared entities, # comments. Reason: the parser error or “0 classes.”
  2. Reasoning-authority membrane (build_realized_ontology.consistency_check) — imports CCO and runs HermiT, so your grounding is validated against CCO’s disjointness axioms. A class grounded to a CCO-disjoint or BFO-incompatible genus (e.g. a Plant placed under cco:Vehicle, or a continuant genus where a process is required) is unsatisfiable and rejected. Reason: “genus X is incompatible — re-ground to a compatible parent.” This is un-fakeable.
  3. OntoClean meta-property membrane (src/aegir/ontology/ontoclean.py) — assigns Rigidity / Identity / Unity / Dependence and enforces the OntoClean constraint that an anti-rigid property cannot subsume a rigid one (a role cannot be the parent of a kind). Surfaces as ontoclean_violations. Also un-fakeable — reasoner-invisible yet checkable.

A self-check before you propose:

LD_LIBRARY_PATH=$(pwd)/build/jvm-libs uv run --no-sync python scripts/build_realized_ontology.py --strict-grounding
uv run --no-sync python src/aegir/ontology/ontoclean.py src/aegir/ontology/catalog/08_derived.json
just check-ontology-schema      # TTL parses, labels/definitions present, BFO ancestry, SPARQL totality

6. The pre-registered objectives (EVIDENCE.md)

Two standing objectives define “good enough to publish” and “rigorous”:

  • OQ-Structure: bfo_grounded ≥ 0.95, def_annotation_coverage ≥ 0.90, ar > 0, oquare_aggregate ≥ 3.5. Gate: the sync._gate publish gate.
  • OQ-Rigor: definitional_completeness ≥ 0.45, realizable_machinery > 0. Gate: the OQuaRE FunctionalAdequacy ≥ 3.0 floor.

An extension that holds or raises both objectives is the bar to clear. The standing rule: no sync --push of the ontology Data Product until OQuaRE is GREEN.
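One way to read "holds or raises both objectives" is as two checks: the thresholds are met, and no objective metric dropped relative to the previous snapshot. The sketch below encodes that reading; the helper names and the snapshot dictionaries are hypothetical, and EVIDENCE.md remains the authoritative statement of the objectives.

```python
import operator

OBJECTIVES = {
    "OQ-Structure": [
        ("bfo_grounded", operator.ge, 0.95),
        ("def_annotation_coverage", operator.ge, 0.90),
        ("ar", operator.gt, 0.0),
        ("oquare_aggregate", operator.ge, 3.5),
    ],
    "OQ-Rigor": [
        ("definitional_completeness", operator.ge, 0.45),
        ("realizable_machinery", operator.gt, 0),
    ],
}


def objectives_met(metrics):
    """Map each objective to True if every one of its thresholds holds."""
    return {
        name: all(op(metrics[metric], threshold) for metric, op, threshold in checks)
        for name, checks in OBJECTIVES.items()
    }


def regressions(before, after):
    """Objective metrics whose value dropped between two metric snapshots."""
    tracked = {metric for checks in OBJECTIVES.values() for metric, _, _ in checks}
    return sorted(m for m in tracked if after[m] < before[m])
```

Comparing snapshots taken before and after an extension (for instance from the --json output of the metrology and gate scripts) shows which metric a change cost, even when both objectives still pass.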


7. Worked example — authoring an intermediate class end-to-end

Goal: a class for the stops_addresses column — shipping addresses (origin + destination) bearing an avg-time-on-site.

(1) Decide the modeling. A kind (an Address is rigidly an address) → define it with EquivalentTo. The shipping/origin/destination facet is a role the address bears, not a rigid parent — so it enters as a realizes-style differentia, keeping the genus an Address.

(2) Ground the genus. grounding_anchors.py query "mailing address" → reuse sdg:PostalAddress if present, else cco:ont… (Mailing Address). Coin sdg: only for genuinely new differentia.

(3) Author.

Class: {ShippingStopAddress:Class} EquivalentTo:
   cco:ont00000xxx
   and sdg:bearsShippingRole some {ShippingRole:Class}
   and sdg:hasAverageTimeOnSite some xsd:duration
Annotations: rdfs:label "shipping stop address",
   iao:0000115 "A mailing address that bears a shipping role (origin or destination) on a driver
   stop schedule and has an associated average time on site."

(4) Dispose. Realize → HermiT (the genus cco:…Address is a material/ICE entity; no disjointness violated → satisfiable). OntoClean (the genus is a rigid kind, not a role → no violation). Parse (prefixes lowercase, properties prefixed → admitted).

(5) Measure. Re-run the metrology + OQuaRE gate. This class raises definitional_completeness (an ≡), holds bfo_grounded (real CCO genus), adds def_annotation_coverage (the iao:0000115), and, because the role is modeled with a realizable differentia, nudges realizable_machinery.

(6) Iterate. If HermiT marks it unsatisfiable, the reason names the offending genus; pick a compatible one and re-dispose. If ontoclean_violations rises, you placed a role as a rigid parent — re-model it as a borne role.


8. How to improve upon the gates

Passing is the floor; the AIM is 3.9 and the IOF frontier beyond it. To raise each lever:

  • Definitional completeness toward 0.55+ — convert primitive kinds to EquivalentTo wherever the differentia are sufficient; define the referenced intermediate classes (the subsumers a column needs), not just the heads. This is the highest-leverage dimension and the one shallow extensions miss.
  • Realizable machinery toward 14+ — wherever a relational/anti-rigid concept appears, model it as a BFO role/disposition/function with inheres/realizes differentiae rather than a subclass.
  • OntoClean to a clean sheet — push sibling_disjointness up (assert disjointWith between identity-incompatible siblings) and keep ontoclean_violations/subsumption_cycles at 0. These are the un-gameable signals; a clean OntoClean profile is the field’s blind spot and your differentiator.
  • Annotation rigor — supply genus-differentia definitions (not vacuous label-glosses); the iao:0000115 should be a real sufficient definition, mirroring the EquivalentTo.
  • Contribute patterns — recurring genus-differentia or role shapes belong in src/aegir/ontology/axiom_patterns.json (DOSDP-style: the defining axiom lives in the pattern, you fill slots). Reserve equivalentClass for kinds; emit roles via the realizable pattern and do not conflate them (Neuhaus 2025: roles resist ≡).

These two membranes are the discipline you cannot circumvent: HermiT rejects any grounding that contradicts CCO’s disjointness, and OntoClean rejects any anti-rigid-over-rigid subsumption. Build with them, not around them, and your extension is rigorous by construction.


9. Reference — commands & files

# metrics + gate
uv run --no-sync python scripts/ontology_metrology.py corpora/ontology/sdg-ontology.owl [--json]
uv run --no-sync python scripts/ontology_oquare.py corpora/ontology/sdg-ontology.owl \
    --certificate corpora/ontology/HERMIT_CERTIFICATE.md [--json]
# membranes / realize (LD_LIBRARY_PATH bootstraps the JVM for HermiT/DeepOnto)
LD_LIBRARY_PATH=$(pwd)/build/jvm-libs uv run --no-sync python scripts/build_realized_ontology.py --strict-grounding
uv run --no-sync python src/aegir/ontology/ontoclean.py src/aegir/ontology/catalog/08_derived.json
just check-ontology-schema
# grounding + agent-assisted authoring
uv run --no-sync python scripts/grounding_anchors.py query "<concept>"
LD_LIBRARY_PATH=$(pwd)/build/jvm-libs uv run --no-sync python scripts/define_intermediate_classes.py --rounds 4
uv run --no-sync python scripts/evolve_rigor.py --batch 12       # convert primitives → ≡ / roles

| file | role |
| --- | --- |
| src/aegir/ontology/catalog/*.json | the seven family catalogs (edit these, not combined.json) |
| src/aegir/ontology/SLOT_DSL.md | the slot grammar |
| src/aegir/ontology/axiom_patterns.json | DOSDP-style genus-differentia + role patterns |
| scripts/ontology_metrology.py | every metric (compute()); the single source of truth |
| scripts/ontology_oquare.py | the [1,5] bands, characteristic map, FLOORS, the publish gate |
| src/aegir/ontology/ontoclean.py | the OntoClean classifier + meta-property membrane |
| scripts/build_realized_ontology.py | render → CCO import → HermiT → .owl + certificate |
| scripts/grounding_anchors.py | CCO+FHIR+accretive genus retrieval |
| corpora/ontology/sdg-ontology.{omn,owl} | the realized artifact (+ HERMIT_CERTIFICATE.md) |

Citations. OQuaRE: Duque-Ramos et al. 2011. IOF/BFO signature: Smith et al. 2019. OntoClean: Guarino & Welty. CCO: Common Core Ontologies (CC0). FHIR: HL7 FHIR R5.

Charter

This is Ægir’s internal direction-setter for the ontology scope. It declares what Ægir publishes outward, names the provenance discipline and design constraints that follow, records the committed BFO/CCO branch structure and external-standard anchors, and pins the gate any ontology change must clear before it ships.

Status note. The operational rigor program — the metric suite, the OQuaRE publish gate, the disposal membranes, and the agent-mediated propose/dispose loop — is documented canonically in the Authors Guide; this charter does not duplicate it. The branch structure and external anchors in §§ Domain commitments below remain the committed architecture of the SDG ontology and are load-bearing. Sections of earlier revisions that framed the deliverable as a vocab_label_map.json contract, a src/aegir/synth/ generator library, or a fixed ~520-template catalog under a GRPO/verifier training program describe a superseded plan; the current deliverable is the realized HermiT-validated OWL artifact and the ontology-grounded corpus.

Contract Ægir publishes outward

The outward deliverable is the realized SDG ontology — a HermiT-validated OWL artifact at corpora/ontology/sdg-ontology.{omn,owl} with a consistency certificate (HERMIT_CERTIFICATE.md) — shared through the corpora submodule (zndx/sdg-corpora). It is versioned and independently consumable: any consumer can load the .omn/.owl in an OWL reasoner (Protégé/HermiT, ROBOT, owlready2) and re-verify consistency. The ontology ships alongside the ontology-grounded synthetic corpus and the relational DDL spine derived from it.

Publication is gated — no sync --push of the ontology Data Product until the OQuaRE quality gate is GREEN (see Authors Guide § 4). This is the entire outward obligation; anything else a consumer wants is a feature request, not a constraint on Ægir’s internals.

Provenance discipline

The ontology’s structure carries its own grounding:

  • Public-namespace IRIs (bfo:, cco:, fhir:, iao:, skos:, rdfs:, owl:) are reused directly — their authority comes from the namespace. Numeric IRIs (bfo:0000040, cco:ont00000713) are looked up with the grounding-anchor retriever (scripts/grounding_anchors.py), never invented.
  • SDG-namespace terms (sdg: prefix) are bespoke classes and properties authored by the project. Each chains to a BFO 2020 upper class (directly or through CCO/FHIR) and carries a definition. These are the novel contributions of the work — by construction they do not exist in public reference sets, which is the point of a bespoke ontology.

The discipline is editorial, not algorithmic. A CI script that tried to mechanically verify “novel-vs-derived” would either block legitimate bespoke entities (which by construction appear in no public reference set) or rubber-stamp around its own checks. Provenance lives in PR review: a reviewer who recognizes that a candidate term reads as material lifted from an external source, rather than as the project’s own engineering and conceptual work, raises that the same way they would raise any other authorship concern.

The mechanical checks Ægir does run are about structural integrity, not provenance: that the TTL parses, every term carries a label and definition, and every sdg: term has a BFO subClassOf chain (just check-ontology-schema). The strong, un-fakeable membranes (HermiT and OntoClean) enforce logical and ontological correctness; see Authors Guide § 5.

Design constraints that follow

  1. The ontology lives in source, not in a database. The seven family catalogs and 08_derived.json are text files in version control; mutations are PRs with diffs. The realized .omn/.owl is build output. If a UI ever writes to a DB, the export pipeline reconciles into the catalog, not the other way around.

  2. The ontology drives a synthetic corpus, not a service. The ontology-grounded chapter generation and DDL-spine materialization (scripts/generate_chapter.py, src/aegir/ontology/ddl.py, realize.py) run as importable, seed-deterministic Python that emits chapters, tables, and views to disk for downstream consumers; they are not a daemon or a network service in Ægir’s own usage.

  3. Content-first derivation drives coverage. The live ontology driver is FinePDFs-content derivation (qdrant/ColBERT domain filtering → engine derives intermediate classes → membranes dispose); the seven template families are a seed and regression baseline. Coverage grows by deriving new property-bearing subsumers from text, not by enlarging a fixed template count.

  4. One BFO anchor, multiple operational contexts. SDG forces cross-context concepts to be expressed as shared subclasses of common BFO/CCO ancestors rather than as discipline-specific aliases for the same real-world entity. This cross-context cousining is the load-bearing architectural invariant (§ Branch structure).

Domain commitments — Signals Data Governance (SDG) Ontology

Section added 2026-05-09 after collaborative domain choice; session note at docs/scratch/2026-05-09/232551_domain_choice.md. The branch structure and external anchors below remain the committed architecture of the SDG ontology.

Identity

The bespoke ontology the project authors and publishes is the Signals Data Governance (SDG) Ontology — a vendor-neutral research artifact that Signals 360 implements and extends. The neutral name preserves flexibility for open-source release or sovereign deployments.

  • Ontology IRI prefix: sdg: for bespoke terms; cco:, bfo:, fhir:, iao:, skos:, rdfs: for public-namespace anchors. (See the namespace table in Authors Guide § 1.)
  • Source-of-truth: the family catalogs under src/aegir/ontology/catalog/, realized to corpora/ontology/sdg-ontology.{omn,owl}.
  • Aegir remains the project / codebase identity; SDG is the ontology that the Aegir project hosts.

Branch structure (committed)

Five primary branches plus a belief branch, all anchored in BFO 2020 + CCO. Cross-context cousining (e.g., sdg:Trace and sdg:LabRun share sdg:ObservationProcess) is the load-bearing architectural invariant.

bfo:Continuant
├── cco:IndependentContinuant
│   ├── cco:Artifact ← sdg:Instrument, sdg:Dataset, sdg:SystemBlock,
│   │                  sdg:Program, sdg:Sample, sdg:eBPFProgram,
│   │                  sdg:KernelHook, sdg:Map (eBPF map)
│   └── cco:Person / cco:Organization
└── bfo:GenericallyDependentContinuant
    └── cco:InformationContentEntity
        ├── cco:DesignativeICE ← sdg:Identifier, sdg:AttributeKey,
        │                         sdg:Reference, sdg:Syscall (ID)
        ├── cco:DescriptiveICE ← sdg:Measurement, sdg:Profile,
        │                         sdg:OutlierClaim, sdg:State,
        │                         sdg:Annotation, sdg:AttributeSet,
        │                         sdg:Lift, sdg:Aggregation
        │   └── sdg:BeliefStructure ← sdg:MassFunction,
        │                              sdg:BeliefInterval,
        │                              sdg:Evidence, sdg:Claim
        └── cco:DirectiveICE ← sdg:Requirement, sdg:Control,
                                sdg:Policy, sdg:Constraint
                                (CCO label is "Prescriptive ICE";
                                 SDG names this branch "Directive
                                 ICE" via owl:equivalentClass to
                                 cco:ont00000965 — see naming note
                                 below)

bfo:Occurrent
└── bfo:Process
    ├── sdg:ObservationProcess ← sdg:LabRun, sdg:Trace,
    │                             sdg:Profiling, sdg:OutlierDetection,
    │                             sdg:eBPFEvent
    ├── sdg:DerivationProcess ← sdg:LineageEdge, sdg:Transformation,
    │                            sdg:Allocation
    │                            (PROV-O anchored: subClassOf prov:Activity)
    └── sdg:GovernanceProcess ← sdg:Verification, sdg:Attestation,
                                 sdg:Classification, sdg:Audit

Branch / context mapping

| Professional context | Primary branch hits |
| --- | --- |
| LIMS | sdg:Sample, sdg:Instrument, sdg:LabRun, sdg:Measurement, sdg:Verification; lineage via sdg:LineageEdge |
| MBSE / SysMLv2 (user-level) | sdg:SystemBlock, sdg:Requirement, sdg:State, sdg:Verification, sdg:Allocation, sdg:Constraint |
| Database metadata + EAV + open lineage | sdg:Dataset, sdg:AttributeKey, sdg:Identifier, sdg:Reference, sdg:Profile, sdg:Annotation, sdg:LineageEdge, sdg:Transformation |
| Macrobase modernization | sdg:OutlierDetection, sdg:OutlierClaim, sdg:AttributeSet, sdg:Lift, sdg:Aggregation, sdg:Profile |
| OTel + eBPF cybersec | sdg:Trace (spans), sdg:Instrument (probe/exporter), sdg:Program, sdg:eBPFProgram, sdg:KernelHook, sdg:eBPFEvent, sdg:Syscall, sdg:Map, sdg:AttributeKey (SemConv), sdg:Measurement, sdg:Policy, sdg:Control |

External anchors

| External standard | SDG alignment |
| --- | --- |
| BFO 2020 | Upper structure; every leaf has subClassOf+ to BFO |
| CCO 2.x | Mid-tier (Artifact, ICE branches); imported as a reasoning authority so HermiT validates grounding against CCO’s disjointness axioms |
| FHIR R5 | Clinical/record genera, bridged to cco:InformationContentEntity (210 types in the grounding index) |
| OBI / IAO (OBO Foundry) | iao:0000115 definition annotations; anchor for sdg:LabRun, sdg:Measurement, sdg:Instrument |
| PROV-O (W3C) | OWL-semantics anchor for sdg:DerivationProcess lineage |
| OpenLineage (LF AI&Data) | Operational runtime surface for sdg:LineageEdge; mapped via SSSOM |
| OpenMetadata | Operational runtime alignment for sdg:Dataset, sdg:Annotation |
| OTel SemConv | Mapping target for sdg:AttributeKey (HTTP, DB, RPC, security conventions) |
| SysMLv2 (user-level) | Mapping target for sdg:SystemBlock, sdg:Requirement, sdg:Allocation, sdg:State, sdg:Constraint (KerML metamodel deferred) |
| NIST PII / ISO 19944 | Public reference for sdg:Classification sensitivity tiers |
| W3C DCAT, Schema.org, DBpedia | Public mid-tier for benchmark coverage (SOTAB, GitTables) |

Naming note — DirectiveICE vs CCO’s PrescriptiveICE

CCO’s canonical IRI cco:ont00000965 carries rdfs:label "Prescriptive Information Content Entity". SDG renames this branch “Directive ICE” because directive better captures the normative sense (requirements, controls, policies, constraints) than prescriptive (which can read as recipe-like). The rename is a shorthand convention only — the bespoke sdg:DirectiveICE is declared as owl:equivalentClass cco:ont00000965 so all CCO-side deductions remain available. Reviewers reading CCO source see the canonical “Prescriptive ICE” label; reviewers reading SDG see “Directive ICE”; both ground at the same IRI.

Resolved design decisions (2026-05-09)

The six open questions in docs/scratch/2026-05-09/232551_domain_choice.md resolved as:

  • Q1 — Belief branch: include; sdg:BeliefStructure under cco:DescriptiveICE. Direct alignment with Atelier’s DST evidence fusion; future-proofs federated-intelligence use cases where conflict K and epistemic uncertainty must propagate across nodes.
  • Q2 — eBPF / cybersec depth: eBPF first-class; adds sdg:eBPFProgram, sdg:KernelHook, sdg:Syscall, sdg:Map, sdg:eBPFEvent. OTel remains the primary runtime surface; first-class eBPF preserves semantic grounding without translation loss.
  • Q3 — SysMLv2 depth: user-level primitives only; KerML metamodel deferred. Block, Part, Action, State, Requirement, Allocation, Verification only.
  • Q4 — Lineage anchor: PROV-O for OWL semantics + OpenLineage for runtime surface, mapped via SSSOM. Single deductive core; preserves operational interop.
  • Q5 — Macrobase: pre-anchor lightly (sdg:OutlierClaim, sdg:AttributeSet, sdg:Lift, sdg:Aggregation, plus relations). Modernization team free to extend.
  • Q6 — Ontology name: Signals Data Governance (SDG) Ontology. Vendor-neutral; preserves open-source / sovereign deployment optionality.

These decisions are committed at v0.1 of the SDG ontology. Future revisions require explicit version bumps tracked in docs/scratch/YYYY-MM-DD/ session notes.

What stays out of Ægir

  • Dempster-Shafer fusion, belief/plausibility logic, any specific classification pipeline shape (Atelier’s domain).
  • Gateway / UI features that are not directly about Ægir’s own view of its runs.
  • Customer-deployment glue: mid-run watchers, agent loop governance, FSM session state. These belong with the consumer that owns the deployment lifecycle.
  • Storage schemas (Hive / Iceberg / Postgres) that exist only for sibling-project governance flows. Ægir publishes the realized OWL artifact and the corpus; consumers translate to their own storage shape.

Migration

Plan for relocating the bespoke vocabulary and the synthetic data generators into Ægir without breaking the consumer projects that use them today. Designed to be reversible up to the cutover, and to fail loudly rather than drift silently.

Status (2026-06-29) — historical; partially executed, then superseded. This is the original migration plan as written on 2026-05-09. Phase 0 and the authoring step of Phase 1 happened: the renamed canonical vocabulary now lives in-tree at src/aegir/ontology/sdg-vocab.ttl and the mechanical TTL checks run in CI (just check-ontology-schema). The rest of the plan did not execute as written and is retained here only as a record of intent:

  • The named cutover artifact, vocab_label_map.json v1.0.0, was never produced, and there is no aegir-vocab.ttl in the tree. The project’s vocabulary work moved past a benchmark-label-map deliverable to a content-derived, HermiT-validated realized ontology, corpora/ontology/sdg-ontology.{omn,owl}, which carries a consistency certificate (HERMIT_CERTIFICATE.md), is authored as catalog templates, and is gated by the OQuaRE publish gate. The current reference for that artifact and its gates is the Authors Guide.
  • The _LABEL_DIMS["sotab"] reconciliation in Phase 2 is done: src/aegir/data/table_dataset.py carries "sotab": 82.
  • The synth-migration phases (4–6) and the SOTAB-CTA empirical gate (Phase 3) were not run; current evidence (EVIDENCE.md) and project framing treat SOTAB-CTA as likely the wrong eval and gate the ontology Data Product on OQuaRE instead.

The phased plan below is preserved unchanged for provenance. Read it as history, not as the plan of record. Structural recommendation (flagged, not performed): archive this page, or fold it into a short “vocabulary provenance” note, once a maintainer confirms nothing downstream still links to the phased cutover.

Starting state (2026-05-09)

  • atelier-vocab.ttl lives in a sibling project, ~1052 lines. Operational — used by the sibling’s classification pipeline today.
  • Synthetic generators (synth_generators.py + synth_registry.py + synth.py) live in the same sibling: 316+ hand-coded generators plus a three-tier priority dispatcher. Used by the sibling’s BDD and pytest test infrastructure.
  • Ægir has no src/aegir/ontology/ directory and no .ttl files anywhere in the tree. Greenfield destination.
  • The sibling project has published an ownership-migration note confirming it will become a consumer of trained Ægir artifacts and surrender ownership of the vocabulary. That note is treated here as input, not specification — Ægir designs the move on its own terms.

Provenance discipline

Ægir’s vocabulary admits terms only on positive public-source grounding — see Charter §Provenance discipline for the principle and admission rules. The migration boundary below is where those rules first apply.

Cutover criterion (named, not implicit)

Cutover is when Ægir publishes vocab_label_map.json v1.0.0 and the sibling project commits a release that consumes it. From that commit onward, the sibling’s local TTL snapshot is frozen and any vocabulary change must land in Ægir.

Until that point, both copies are operational; the sibling continues its own development against its own TTL. After that point, the sibling’s TTL is documentation of its starting state, not a live artifact.

This avoids the failure mode where two TTLs evolve in parallel for weeks under the assumption the migration is “in progress” — the kind of two-source-of-truth state that doesn’t fail loudly and is expensive to reconcile after the fact.

Phased plan

Phase 0 — scaffolding (this commit)

  • Create src/aegir/ontology/ and src/aegir/synth/ as empty Python packages with __init__.py.
  • Add the directory layout described in Charter as empty placeholders or stubs that import-test cleanly.
  • Wire CI to no-op gracefully on the empty tree (totality query returns trivially when there are zero labels).
  • This document and the charter, in the published mdbook.

Phase 1 — author the initial vocabulary

src/aegir/ontology/sdg-vocab.ttl is authored fresh. The starting content is what Ægir actually needs at M2: BFO 2020 + CCO upper structure, the Schema.org and DBpedia terms that map to the SOTAB v2 and GitTables benchmark label sets we train against, and whatever bespoke sdg:-namespace entities the project decides to introduce on its own terms.

External TTL working sets (the sibling project’s vocabulary, OBO Foundry exports, etc.) are reference material, not source-of-truth inputs. The author may consult them while writing sdg-vocab.ttl, the same way they’d consult any reference document. A small mechanical helper at scripts/extract_public_terms.py can pull just the public-namespace IRIs out of a reference TTL into a plain text scratchpad — a convenience for sifting through large external files — but it does not produce committable output. The TTL that lands in sdg-vocab.ttl is the author’s edit.

Concrete process:

  1. Draft sdg-vocab.ttl by hand, aiming for benchmark-label coverage that the M2 dataset loader can resolve. Every term:
    • has an rdfs:label,
    • has a skos:definition,
    • and (for sdg:-namespace terms) at least one rdfs:subClassOf edge to a public-namespace ancestor.
  2. Author the SPARQL queries (totality.rq, ancestry.rq, coverage_by_namespace.rq) against Ægir’s own namespace conventions.
  3. Author scripts/build_vocab_label_map.py: TTL in, versioned vocab_label_map.json out. The initial map covers whatever benchmark labels the authored vocabulary supports — likely a partial picture of SOTAB v2 Schema.org and a partial picture of GitTables DBpedia. Coverage extends iteratively as the project demonstrates need.
  4. Author src/aegir/ontology/label_map.py with load(), iri_for(), bfo_ancestry(), labels_for_namespace() helpers.
  5. Wire the mechanical checks into CI: TTL parse + label/definition presence + BFO ancestry presence for sdg: terms + label-map JSON consistency + SPARQL totality.
  6. External working sets are unaffected.

Ongoing discipline lives in PR review, guided by a short src/aegir/ontology/PROVENANCE.md reviewer’s guide. The mechanical TTL checks described in Charter §Provenance discipline cover structural integrity. Whether a candidate term reads as the project’s own engineering work or as material lifted from elsewhere is an authorship judgment, made the way every other code-review judgment is made.

Phase 2 — _LABEL_DIMS reconciliation + label_to_iri resolver

  • Fix the stale _LABEL_DIMS["sotab"] = 91 to 82 in src/aegir/data/table_dataset.py (this is mechanical; the SOTAB v2 CSV union is verified at 82 distinct labels).
  • Add a label_to_iri(benchmark, label) resolver to the table dataset that consumes vocab_label_map.json at load time.
  • This is the smallest end-to-end vertical slice that proves the pipeline: a training run can now ask “what’s the BFO ancestry of the label this column got?” and get an answer from the in-tree map.

Phase 3 — empirical-gate validation

The v2→SOTAB head fine-tune runs from outputs/mixed-v2/20260426T232240Z/final.pt. The v2 → SOTAB head fine-tune gate specifies the liveness thresholds:

  • ≥ 3 distinct embedding clusters at coarse MCL inflation
  • ≥ 0.10 macro F1 on SOTAB v2 CTA validation
  • predictions distributed across ≥ 10 distinct labels

If these fail, vocabulary expansion is paused and the model issue is debugged first. No vocab edits land in main while the gate is red.
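The three liveness thresholds above can be expressed as a single check. This is a minimal sketch: the LivenessReport type, its field names, and the liveness_failures helper are hypothetical and not part of the Ægir codebase; only the three threshold values come from the gate as stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LivenessReport:
    """Measured quantities for one fine-tune run (hypothetical field names)."""
    embedding_clusters: int         # distinct clusters at coarse MCL inflation
    macro_f1: float                 # macro F1 on SOTAB v2 CTA validation
    distinct_predicted_labels: int  # number of labels the model actually predicts


def liveness_failures(report: LivenessReport) -> list[str]:
    """Return the names of failed thresholds; an empty list means the gate is green."""
    failures = []
    if report.embedding_clusters < 3:
        failures.append("embedding_clusters")
    if report.macro_f1 < 0.10:
        failures.append("macro_f1")
    if report.distinct_predicted_labels < 10:
        failures.append("distinct_predicted_labels")
    return failures


green = LivenessReport(embedding_clusters=4, macro_f1=0.12, distinct_predicted_labels=15)
red = LivenessReport(embedding_clusters=4, macro_f1=0.05, distinct_predicted_labels=6)
print(liveness_failures(green))
print(liveness_failures(red))
```

Returning the list of failed thresholds, rather than a bare boolean, keeps a red gate debuggable: the run log shows which signal failed.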

Phase 4 — synth migration (planned, not committed)

  • Decide the consumer-side consumption pattern before moving code. Three options on the table:

    1. Vendored snapshot: sibling ships a frozen sample under its own samples/ for tests; Ægir owns regeneration via scripts/snapshot_synth_corpus.py. Lowest friction.
    2. Sibling-repo dependency: sibling imports from src/aegir/synth/ directly. Requires the sibling to keep Ægir installable in its dev environment.
    3. Thin client: the leaderboard gateway (or a dedicated synth gateway) exposes a generation endpoint. Highest deployment cost, also the most flexible end state. Default plan: vendored snapshot during transition; thin client only if a customer deployment needs out-of-process generation later.
  • Once the consumption pattern is decided, copy the generator code in. Then port the priority registry. Then port the integration tests.

  • Cutover for synth runs in parallel with vocabulary cutover but is not required to be simultaneous. Sibling-project tests against vendored-snapshot fixtures keep passing through the whole transition.

Phase 5 — vocab_label_map.json v1.0.0

  • All benchmark labels in _LABEL_DIMS (sotab, sotab-dbp, sotab-dbp-re, gt-dbp) covered with at least direct or subsumption-reachable BFO ancestry.
  • Liveness gate passed.
  • Synth migration complete or vendored-snapshot stable.
  • Tag the release. The sibling project’s consumer-side commit can land against the tagged URL.

This is the cutover moment. Sibling’s TTL is now documentation.

Phase 6 — vocabulary expansion (Ægir-defined, not Atelier-tiered)

Once v1.0.0 is out and the cutover is done, vocabulary expansion is genuinely Ægir’s call. The Atelier-side three-tier breakdown (measurement zoo, subClassOf plumbing, Product/JobPosting/economics) is a useful reference for what’s missing, but the order, scoping, and implementation milestones are decided here against Ægir’s empirical priorities — which datasets are getting trained against, where coverage gaps are blocking benchmark progress, and where vocabulary expansion can plausibly help model performance vs. just adding label-space breadth.

In particular: if a v3 corpus mix wants ontology-conditioned synthetic slices, the vocabulary work prioritizes whatever the v3 corpus needs. That decision lives in the v3 design notes, not in a tier doc imported from a consumer.

What we do not commit to here

  • A timeline. The empirical gate (Phase 3) is a real gate; if the v2→SOTAB head fine-tune surfaces architectural issues, the rest of the migration waits. We commit to the order, not to dates.
  • A unification of the Schema.org and DBpedia label sets. Both stay separate keys in the JSON map per the charter.
  • A specific synth-migration consumption pattern. We commit to deciding before code moves, not to which option wins.
  • Backwards compatibility with the sibling’s TTL filename or directory structure. The renamed sdg-vocab.ttl is the canonical form.

Concept brief — RLVR for ontology generation

A four-component verifiable reward and GRPO-trained policy for OWL composition

Draft v0.5 — 2026-05-09 — companion to Charter and Migration. Supersedes v0.1, v0.2, v0.3, v0.4.

Status (2026-06-29) — research design of the long-horizon Signals M4 apparatus. This brief is the detailed research design for the RLVR program (the verifier R, the prior-art positioning, and the P0–P9 phase structure) — the Signals M4 apparatus, an SAE-instrumented-Qwen local policy GRPO-trained to autonomously generate ontology extensions. Its four-component verifier R(O, I) is now realized as the deterministic membrane stack (HermiT/CCO, OntoClean, OQuaRE), and the agent-mediated propose/dispose loop is building and proving that reward today, in direct service of M4. This brief remains the authoritative statement of the contribution claim and the seven required prior-art differentiations. For the current state of the program read its two companions, which carry the empirical and externally-readable surfaces:

Where this brief’s specifics have been overtaken, they are corrected inline below; the verifier definition, the prior-art differentiations, and the phase/gate methodology are preserved as written. Note also that this brief concerns the RLVR/GRPO research track, which is distinct from the ontology rigor program (the OQuaRE publish gate, OntoClean membranes, and intermediate-class authoring documented in the Authors Guide); the two share the SDG ontology but are governed separately.

Objective

Define a four-component deterministic verifier R over OWL ontology artifacts and the target text corpus I, then train an LLM policy π_θ via formal RLVR (GRPO with R as the reward) to produce ontologies that (a) exceed prompt-evolved and human-authored baselines on R, and (b) admit downstream verification that each R-component carries predictive validity for byte-level pretraining utility on Aegir’s stratified-eval surface.

The contribution is the verifier R together with the RL training loop that targets it. Pretraining lift is a downstream external validity check; it is the subject of a separate application paper scoped at the end of this brief.

Scope split — two papers

To avoid the chain-too-long failure mode of the v0.1 brief, the research program decomposes into two independently citable papers that share infrastructure:

| Paper | Headline claim | Phases | Status in this brief |
| --- | --- | --- | --- |
| Paper 1 — RLVR for ontology generation | A four-component verifier R and a GRPO-trained policy π_θ produce ontologies exceeding human and prompt-evolved baselines on R; each R-component is shown to discriminate quality on a held-out test set. | P0 → P6 | Primary scope of this brief. |
| Paper 2 — Ontology-grounded byte-level pretraining | Verbalizations from R-passing ontologies (produced by paper 1’s policy) measurably improve byte-level pretraining of a hierarchical sequence model on Aegir’s stratified eval. | P7 → P9 | Application follow-on, scoped here at lower resolution. Earned only if paper 1 lands. |

This brief commits primarily to paper 1. Paper 2 is sketched at the phase level but its methodology is left to be revised after paper 1’s results constrain it.

Contribution claim (locked)

We define a four-component verifier R: OntologyArtifact × Corpus → [0, 1]. We demonstrate three subordinate claims, all tied to R:

  • C1 — Discrimination. R discriminates ontologies of varying quality on a held-out test set (R on known-good ≫ R on known-bad with a margin defended by AUC against a labeled set).
  • C2 — Optimizability. GRPO training of an LLM policy π_θ against R produces ontologies whose mean R exceeds prompt-evolved and human-authored baselines at p < 0.05 over a held-out generation set.
  • C3 — Component validity. Each component of R (R_A, R_B, R_C, R_D) carries predictive validity for downstream pretraining utility, established by ablation in paper 2; relative weights used in paper 1’s aggregation are derived from paper 2’s ablation.
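The AUC in C1 has a direct reading: it is the probability that a randomly chosen known-good ontology receives a higher R than a randomly chosen known-bad one. The sketch below computes it by pairwise comparison; the function name and the example scores are illustrative and do not come from any run.

```python
def auc(good_scores, bad_scores):
    """Probability that a known-good score exceeds a known-bad score (ties count half)."""
    if not good_scores or not bad_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for g in good_scores:
        for b in bad_scores:
            if g > b:
                wins += 1.0
            elif g == b:
                wins += 0.5
    return wins / (len(good_scores) * len(bad_scores))


# Illustrative scores only.
print(auc([0.9, 0.3], [0.5, 0.1]))
```

An AUC of 1.0 means every known-good ontology outscores every known-bad one; 0.5 means R carries no discriminative signal on the labeled set.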

The brief commits to C1 + C2 as paper 1’s headlines. C3 is paper 2’s primary contribution. The two papers cite each other.

Load-bearing novelty (sharpened in v0.5)

Per the v1 P0 literature review at docs/scratch/2026-05-09/225830_lit_review_v1.md, each component recipe of this brief has prior art on adjacent artifacts:

  • RL on graph-structured outputs: AutoGraph-R1 (knowledge graphs), the AMR-RL chain (semantic DAGs), Lehmann & Haase 2012 (symbolic RL on EL++ concepts).
  • Deterministic schema validators with RL: SRL, RL-Struct (JSON Schema).
  • Deterministic execution validators with RL: CodeRL, Reasoning-SQL.
  • Verbalization corpora for KGs: KELM/TEKGEN (Wikidata ABox → retrieval corpus).
  • End-to-end LLM ontology generation: OLLM (NeurIPS 2024, SFT + regulariser, taxonomic backbone).
  • Ontology-guided LLM optimization: OntoTune (SFT self-distillation), K2V (RLVR + KG, but emits QA traces).

What is not present in any of these is the combination this brief proposes:

The OWL artifact is the only graph-structured output where the verifier can run a sound-and-complete reasoner producing both structural and semantic verdicts (consistency, entailment, class-hierarchy coverage). That combination — graph-structured output with a semantically grounded deterministic verifier targeting an LLM policy under RLVR, plus the verbalization-corpus pretraining application — is the gap.

The brief’s contribution is the synthesis of the four precursor recipes (graph-output RL, schema-validator RL, code/SQL execution RL, verbalization-corpus pretraining) onto an artifact (OWL) where the verifier acquires DL deductive semantics that none of the precursors have.

Adjacent prior art — explicit differentiations

Seven adjacent works must be addressed head-on in this brief and any paper that emerges from it. Each differentiation is stated explicitly below.

KELM / TEKGEN (Agarwal, Ge, Shakeri, Aharoni, NAACL 2021). The single closest verbalization corpus recipe: verbalize a structured knowledge source into natural-language sentences, integrate as a model training corpus, measure downstream effect.

KELM verbalizes ABox triples from Wikidata into a retrieval corpus evaluated on QA. We verbalize TBox axioms from a bespoke OWL ontology into a flat byte-level pretraining slice evaluated on stratified held-out bits-per-byte. The structural content of the verbalized text differs (taxonomic and logical class expressions vs. instance-level facts); the integration mechanism differs (pretrain mix vs. retrieval); the evaluation isolates ontology-specific contribution rather than generic QA lift.

OLLM (Lo, Jiang, Li, Jamnik, NeurIPS 2024). Closest end-to-end LLM ontology generation approach. Fine-tunes an LLM with a custom regulariser that reduces overfitting on high-frequency concepts; produces taxonomic backbones from scratch.

OLLM uses plain SFT with a regulariser, not RL; produces taxonomic backbones only, not full OWL with restrictions or equivalentClass axioms; evaluation is graph-similarity against reference ontologies. We train via GRPO against a deterministic verifier that scores OWL-reasoner consistency, axiom complexity, and topic alignment with a target corpus; the policy emits full OWL compositions with restrictions and equivalentClass intersections.

AutoGraph-R1 (Tsang et al., arXiv:2510.15339, ICLR 2026 submission). Closest RL-trained graph-emitting policy. A GRPO-trained LLM policy emits a knowledge graph from text; reward is a “Knowledge-Carrying Reward” computed from the graph’s downstream RAG utility, judged extrinsically.

AutoGraph-R1 emits flat (head, rel, tail) triples — ABox-style instance facts — and uses an extrinsic LLM-judge reward via downstream QA accuracy on retrieved triples. We emit OWL TBox compositions with class axioms and use an intrinsic, semantically grounded deterministic verifier (DL-reasoner consistency check + programmatic structural property checks + topic-model alignment). Sound-and-complete deductive reasoning is unavailable to AutoGraph-R1’s flat-triple output by construction; OWL’s class-axiom expressivity is what makes the DL-reasoner verifier shape possible.

K2V — Knowledge-to-Verification (Yuan et al., ICLR 2026 submission). Closest RLVR + KG-derived reward methodology. Builds a KG from text and frames KG completion as a verifiable QA task to derive dense rule-based rewards for LLM reasoning.

K2V’s policy emits QA reasoning traces, not OWL. The verifier checks subtask correctness, not ontology loadability / axiom density / topic alignment. K2V proves the RLVR-with-KG-derived-reward shape works; we apply that shape to OWL generation directly, with the verifier targeting structural and semantic properties of the artifact rather than QA accuracy on downstream tasks.

OntoTune (Liu et al., WWW 2025). Iteratively refines an LLM against an ontology-grounded objective.

OntoTune iterates an LLM against an ontology-grounded objective via SFT, with the reward implicit in does-LLM-already-know-this gating. The model emits natural-language answers, not OWL. We train an LLM policy via GRPO with an explicit deterministic verifier whose output is a continuous reward in [0, 1], and the policy emits OWL ontology compositions whose well-formedness is guaranteed by the catalog’s typed-slot grammar.

Zaitoun, Sagi, Peleg (AAAI Symposium Series 2024). Closest OWL-specific verbalization-derived training data.

Zaitoun et al. use LLM-assisted verbalization of OWL axioms to create text→OWL supervised fine-tuning pairs. We treat the verbalizations as a flat self-supervised byte-level corpus mixed with general pretraining text, with no instruction-pair framing.

OnT — Language Models as Ontology Encoders (Yang, Chen, He, Gao, Horrocks, arXiv:2507.14334, 2025). Closest TBox-axiom-aware embedding approach. Compositional verbalization of OWL class expressions feeds a pretrained Sentence Transformer, re-trained via hyperbolic-space hierarchy/role/conjunction losses.

OnT verbalizes TBox axioms but uses them as auxiliary-objective embedding training (hyperbolic loss on hierarchy / role / conjunction); the underlying LM is not pretrained on the verbalizations. We treat verbalizations as flat next-token pretraining bytes mixed with general-purpose corpus.

These seven citations are required in the related-work section of any paper that emerges from this brief. Secondary methodological precedents (SRL / RL-Struct, AMR-RL chain, CodeRL / Reasoning-SQL, DRAGON, Lehmann & Haase 2012) are listed in the lit review v1 (docs/scratch/2026-05-09/225830_lit_review_v1.md) citations index.

P0 exit gate is firmly green as of v1; the chain-of-three claim survives a depth-of-search expansion across the three highest-risk axes.

Verifier R — formal definition

The verifier is structured around a procedurally pre-computed catalog C (described in the next subsection). DeepOnto runs offline during catalog construction; the runtime verifier does not call DeepOnto and does not require a JVM. This decision keeps DeepOnto out of the RL loop’s critical path and out of any pretraining inference path.

Let O = compose(C, σ) be an ontology composed by the policy from catalog templates with slot-fill σ. I is a fixed input text corpus. R has four components.

R_A (Well-formedness).

R_A(O) = 1 if all (template, σ_i) pairs in O type-check against C else 0

A composition is well-formed iff every selected template’s slot constraints are satisfied by the chosen fillers (typed term inventory; types declared per-template at catalog construction time). R_A is a hard gate; R(O) = 0 if R_A = 0. By construction, R_A = 1 implies the rendered OWL also passes DeepOnto loadability, because every catalog template was DeepOnto-validated offline.
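A minimal sketch of the R_A type-check, assuming typed slots are written as {name:Type} (the notation used in the catalog's template example) and that the typed term inventory is a plain term-to-type mapping. The ex: terms, the template id, and the function names are hypothetical.

```python
import re

# Typed slots look like {X:Class} or {p:ObjectProperty}.
SLOT_PATTERN = re.compile(r"\{(\w+):(\w+)\}")


def slot_types(manchester_template):
    """Map each slot name in a template to its declared type."""
    return dict(SLOT_PATTERN.findall(manchester_template))


def r_a(composition, catalog, inventory):
    """Return 1 if every (template_id, sigma) pair type-checks against the catalog, else 0."""
    for template_id, sigma in composition:
        template = catalog.get(template_id)
        if template is None:
            return 0  # unknown template
        required = slot_types(template)
        if set(sigma) != set(required):
            return 0  # missing or extra slots
        for slot, filler in sigma.items():
            if inventory.get(filler) != required[slot]:
                return 0  # filler has the wrong type or is not in the inventory
    return 1


# Hypothetical catalog and inventory, for illustration only.
catalog = {"t_exists": "Class: {X:Class} SubClassOf: {p:ObjectProperty} some {Y:Class}"}
inventory = {"ex:Invoice": "Class", "ex:Document": "Class", "ex:hasPart": "ObjectProperty"}
good = [("t_exists", {"X": "ex:Invoice", "p": "ex:hasPart", "Y": "ex:Document"})]
bad = [("t_exists", {"X": "ex:Invoice", "p": "ex:Document", "Y": "ex:Document"})]
print(r_a(good, catalog, inventory), r_a(bad, catalog, inventory))
```

Because the check is a lookup against flat data, it needs no parser and no reasoner at runtime, which is the point of moving validation to catalog construction.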

R_B (Complex-class density).

R_B(O) = min(complex_count(O) / τ_B, 1)

where complex_count(O) is the number of templates in O whose is_complex flag is set in C (DeepOnto-determined offline by running onto.get_asserted_complex_classes() on the rendered template). τ_B is the 95th percentile of complex-count from the structural-shuffle null distribution computed once per catalog.
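As a worked illustration of the R_B formula, the sketch below counts flagged templates and applies the saturating ratio. It assumes the is_complex flags are available as a template-id-to-bool mapping and that τ_B has already been computed; the template ids and the τ_B values are made up.

```python
def complex_count(composition, is_complex):
    """Count templates in the composition whose is_complex flag is set in the catalog."""
    return sum(1 for template_id, _sigma in composition if is_complex[template_id])


def r_b(composition, is_complex, tau_b):
    """R_B(O) = min(complex_count(O) / tau_B, 1)."""
    if tau_b <= 0:
        raise ValueError("tau_B must be positive")
    return min(complex_count(composition, is_complex) / tau_b, 1.0)


# Hypothetical flags and composition: two of the three templates are complex.
is_complex = {"t1": True, "t2": False}
composition = [("t1", {}), ("t2", {}), ("t1", {})]
print(r_b(composition, is_complex, tau_b=4))
```

The min(…, 1) cap means a composition gains nothing from exceeding the null distribution's 95th percentile, which limits the incentive to pad a composition with complex templates.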

R_C (Verbalization quality, as semantic-richness proxy).

By construction, every template in C verbalizes cleanly (offline gate at catalog construction time), so the binary “does it verbalize” question is uninformative at runtime. R_C is repurposed as a continuous semantic-richness proxy:

R_C(O) = clip(mean_verbal_length(O) / L_target, 0, 1)

where mean_verbal_length(O) is the mean character length of the pre-cached verbalizations of templates in O, and L_target is calibrated from the catalog’s distribution (P1 sets this so that the median template gives R_C ≈ 0.5).
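The calibration rule has a simple closed form under one reading: if the median template should score 0.5, then L_target is twice the median of the catalog's verbalization lengths. The sketch below makes that reading explicit; it is an interpretation of the calibration sentence, not a confirmed P1 procedure, and the lengths are invented.

```python
from statistics import mean, median


def calibrate_l_target(catalog_lengths):
    """Choose L_target so that a template of median length scores 0.5."""
    return 2.0 * median(catalog_lengths)


def r_c(lengths_in_composition, l_target):
    """R_C(O) = clip(mean_verbal_length(O) / L_target, 0, 1)."""
    ratio = mean(lengths_in_composition) / l_target
    return max(0.0, min(ratio, 1.0))


# Invented per-template verbalization lengths, in characters.
catalog_lengths = [40, 60, 80, 100, 200]
l_target = calibrate_l_target(catalog_lengths)
print(l_target, r_c([80], l_target), r_c([400], l_target))
```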

R_D (Topic alignment with corpus I).

Let V(O) be the verbalization corpus of O, constructed by concatenating each template’s pre-cached verbalization with the slot-fillers substituted in. Fit BERTopic on V(O); the topic model on I, denoted T_I, is fitted once and frozen at P1. Compute the Hungarian-optimal one-to-one matching between the V(O) topics and the frozen T_I topics under cosine similarity in the c-TF-IDF representation space. R_D is the mean matched cosine similarity, normalized to [0, 1] against the structural-shuffle null distribution.

R_D is the only runtime component that recomputes a topic model; its cost is the dominant per-step verifier cost. Mitigation: the sentence-embedding backbone is computed once on V(O) (small — typically 100s–1000s of sentences for a single composition) so the hot path is HDBSCAN clustering on cached embeddings, ~seconds per ontology on CPU.
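The matching step of R_D can be illustrated on toy vectors. The sketch below finds the optimal one-to-one matching by brute-force permutation search rather than the Hungarian algorithm; both return the same optimum, but brute force is only practical for a handful of topics. Topic vectors stand in for c-TF-IDF rows, the function names are hypothetical, and the result is the raw mean similarity before null normalization.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_matched_similarity(topics_o, topics_i):
    """Mean cosine similarity under the best one-to-one topic matching (brute force)."""
    # Match every topic of the smaller set to a distinct topic of the larger set.
    small, large = sorted((topics_o, topics_i), key=len)
    if not small:
        return 0.0
    sim = [[cosine(u, v) for v in large] for u in small]
    best = max(
        sum(sim[row][col] for row, col in enumerate(cols))
        for cols in permutations(range(len(large)), len(small))
    )
    return best / len(small)


# Toy topic vectors: the two sets contain the same topics in swapped order.
print(mean_matched_similarity([[1, 0], [0, 1]], [[0, 1], [1, 0]]))
```

For realistic topic counts, an assignment solver such as scipy.optimize.linear_sum_assignment (called with maximize=True on the similarity matrix) would replace the permutation search; that substitution is an assumption here, not something the brief specifies.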

Aggregation.

R(O) = R_A(O) · (a · R_B(O) + b · R_C(O) + c · R_D(O))

with a + b + c = 1. Initial weights {a, b, c} are derived from the P2 verifier-validation phase by maximizing AUC against a labeled ontology test set. The weights are not tuned during P5 RL training — they are fixed before the policy sees the verifier. This separation prevents the policy from gaming the weight-discovery process.
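The aggregation is a weighted mean gated by R_A. A minimal sketch, with placeholder weights (the actual {a, b, c} are fixed in P2 and are not stated in this brief):

```python
def aggregate(r_a, r_b, r_c, r_d, weights):
    """R(O) = R_A * (a*R_B + b*R_C + c*R_D), with a + b + c = 1."""
    a, b, c = weights
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return r_a * (a * r_b + b * r_c + c * r_d)


# Placeholder weights for illustration only.
weights = (0.2, 0.3, 0.5)
print(aggregate(1, 0.5, 0.5, 0.5, weights))
print(aggregate(0, 1.0, 1.0, 1.0, weights))
```

Multiplying by R_A, instead of adding it as a fourth weighted term, is what makes well-formedness a hard gate: an ill-formed composition scores 0 regardless of the other components.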

The verifier is implemented as scripts/aegir-verify, deterministic, hash-stable across runs given fixed input. No JVM, no DeepOnto, no Java dependencies at runtime — the catalog encapsulates all DeepOnto-derived knowledge as flat data.

Methodology

Input corpus I, pinned

I is the held-out subset of v2’s mixed-corpus pretrain mix restricted to SchemaPile + FinePDFs-lab (the same choice as v0.1 of this brief, for the same reasons: reproducible, downstream-aligned, disjoint from v2 train-time eval). Size: ~200 MB.

The 200 MB working budget is not validated for BERTopic stability (BERTopic, unlike LDA, is sensitive to corpus size in different ways — too few documents and HDBSCAN clustering is unstable; too many and sentence-embedding compute dominates). P1’s verifier-implementation phase includes a corpus-size sensitivity sweep before I is locked for downstream phases.

Topic model — BERTopic primary, NMF ablation

LDA is not the default; project consensus from earlier work is that LDA’s bag-of-words assumption fails on DDL-heavy text where SQL syntax tokens dominate vocabulary frequency. BERTopic over a sentence-embedding backbone (default: all-MiniLM-L6-v2 for speed; switchable to gte-large for headline runs) is the primary choice. NMF over TF-IDF vectors is the robustness ablation.

For both topic models, the c-TF-IDF representation per topic is computed from the original corpus tokens; alignment is in this shared representation space.

Procedural catalog C — offline DeepOnto, runtime lookup

The catalog is the methodological pivot of v0.3. DeepOnto’s role is moved from runtime to catalog-construction time; the runtime verifier (and any downstream pretraining pipeline that consumes catalog-rendered ontologies) has no DeepOnto dependency.

Catalog structure. C is a flat data artifact: a JSON or SQLite table where each row is a template and contains:

| Field | Source | Purpose |
| --- | --- | --- |
| template_id | catalog assignment | unique handle |
| manchester_template | hand or LLM-authored, pre-validated | OWL Manchester syntax with typed slots, e.g. Class: {X:Class} SubClassOf: {p:ObjectProperty} some {Y:Class} |
| slot_types | derived from template | typed-slot constraints for the policy’s slot-fill |
| is_complex | DeepOnto offline | result of onto.get_asserted_complex_classes() after rendering with placeholder fillers |
| verbal_template | DeepOnto offline | result of OntologyVerbaliser.verbalise_class_expression() with slot fillers as variables |
| mean_verbal_length | DeepOnto offline | character length used by R_C |
| bfo_anchor_path | catalog metadata | rdfs:subClassOf chain to BFO upper class |
| provenance | catalog metadata | author, date, gate-version |
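One way to picture a catalog row is as a typed record that round-trips through JSON. The sketch below mirrors the field names in the table; the Python types, the example values (including the verbalization string, the anchor path, and the provenance entries) are illustrative assumptions and do not reflect the as-built catalog files.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CatalogTemplate:
    """One catalog row; field names follow the table, types are assumed."""
    template_id: str
    manchester_template: str
    slot_types: dict
    is_complex: bool
    verbal_template: str
    mean_verbal_length: float
    bfo_anchor_path: list
    provenance: dict


# Illustrative values only; none are taken from the real catalog.
row = CatalogTemplate(
    template_id="t_exists",
    manchester_template="Class: {X:Class} SubClassOf: {p:ObjectProperty} some {Y:Class}",
    slot_types={"X": "Class", "p": "ObjectProperty", "Y": "Class"},
    is_complex=True,
    verbal_template="every {X} {p} some {Y}",
    mean_verbal_length=22.0,
    bfo_anchor_path=["ex:Leaf", "ex:Intermediate", "ex:UpperClass"],
    provenance={"author": "example", "date": "2026-05-09", "gate_version": "example"},
)

# Rows serialize to flat JSON and load back without loss.
encoded = json.dumps(asdict(row), sort_keys=True)
decoded = CatalogTemplate(**json.loads(encoded))
print(decoded == row)
```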

Construction (P1 deliverable).

  1. Author or generate ~500 candidate templates spanning the OWL 2 axiom shape inventory (subclass, equivalent-with-intersection, existential, universal, cardinality, owl-thing-anchored, etc.).
  2. Spawn one JVM, instantiate each template with placeholder fillers, run DeepOnto’s Ontology loader, get_asserted_complex_classes(), and OntologyVerbaliser. Record is_complex, verbal_template, mean_verbal_length.
  3. Drop templates that fail to load, fail to verbalize, or whose verbalization is shorter than 5 chars.
  4. Generate the structural-shuffle null distribution (200 shuffles) of compositions over C and cache the per-shuffle complex_count and R_D values. τ_B and R_D’s null calibration live in C’s metadata.
  5. Commit C as a versioned artifact. (As built, the catalog is the seven family JSON files in src/aegir/ontology/catalog/, 01_foundation through 07_long_tail, plus the FinePDFs-derived 08_derived.json, not the single C-v0.1.{json,sqlite} file this brief originally proposed; the null statistics live in null_stats_canonical.json and the frozen topic model in T_I_canonical.pkl.)

After P1, the JVM is gone. The RL loop, verifier, and any v3 pretraining pipeline consume C by lookup.

Implications and trade-offs.

  • Pro: the policy’s outputs are well-formed by construction. Slot-typed composition cannot produce OWL that fails to load; R_A becomes a structural type-check rather than a parser exception.
  • Pro: per-step verifier cost drops by ~1–2 orders of magnitude. No JVM init (~5 s amortized over a batch becomes 0). No DeepOnto parse per generation. RL training throughput improves proportionally; ablation runs become cheaper.
  • Pro: pretraining-pipeline cleanliness. v3 verbalization slices drawn from C-compositions inherit no Java dependency; the v3 data path is pure Python + Rust (HF tokenizers, fla kernels).
  • Con: bounded expressivity. The policy can only compose what C contains. Novel axiom shapes the catalog doesn’t cover are unreachable by the policy. Mitigation: catalog coverage is itself a methodological knob; ablations on catalog size establish the expressivity / efficiency frontier. Authoring genuinely novel axiom shapes remains a human-baseline activity.
  • Con: R_C signal is degraded. Runtime “does it verbalize” is trivially true; we repurpose R_C as a length proxy, which is a weaker signal than gaius’s pass/fail. The brief acknowledges this and tests in P2 whether R_C’s reweighted form retains discriminative validity.
  • Con: discrimination claim (C1) needs care. Known-bad ontologies for the verifier-validation test set must be constructable within the catalog (e.g., compositions of only trivial templates). Out-of-catalog “bad” ontologies don’t test the runtime verifier; they test the catalog-construction step. P2 explicitly distinguishes these regimes.

This pivot is the single most important architectural change in v0.3 vs. v0.2. Subsequent sections assume it.

Bespoke ontology authorship — human baseline

The project author produces a baseline aegir-vocab.ttl against the same structural commitments specified earlier:

  • ≥ 50 named classes, of which ≥ 25 sit at depth ≥ 3 in rdfs:subClassOf
  • ≥ 15 complex asserted classes (existential / universal / cardinality / boolean intersections; ≥ 3 owl:equivalentClass with non-trivial Manchester-syntax bodies)
  • BFO 2020 ancestry on every leaf, mediated through CCO
  • rdfs:label and skos:definition on every term
  • All authorship is the project’s own; no content lifted from any non-public reference set

These thresholds are picked-by-convention for the human baseline target. They are not the brief’s structural claim about ontologies in general; they are a target the human author aims for, against which the RL-trained policy is compared.

Verbalization corpus V(O) — from catalog lookup

For each composition O = compose(C, σ):

  1. For each (template_id, σ_i) pair in O, look up the template’s verbal_template in C and substitute the slot fillers σ_i to produce the rendered verbalization sentence.
  2. Filter: drop empty/degenerate substitutions (slot filler produces a 0-length string after substitution).
  3. Deduplicate by sentence-embedding cosine similarity > 0.95 (same encoder as the topic model, to avoid distributional shift between dedup and topic fitting).
  4. Two configurations are compared: plain (each substituted verbalization as one document) and templated (substituted verbalization paired with its immediate sub/super class templates’ verbalizations as adjacent sentences, found by walking the composition’s subClassOf graph).

Plain is the headline configuration; templated is ablation only. Verbalization corpus size per ontology is bounded by composition size and is fully predictable from C.
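Steps 1–3 of the plain configuration can be sketched as follows. This is a minimal illustration, not the project's implementation: `build_plain_corpus`, the dictionary-shaped catalog, and `toy_embed` are hypothetical names, Python `str.format` placeholders stand in for the catalog's slot syntax, and the toy bag-of-words vector stands in for the topic model's sentence encoder. Deduplication is done greedily, keeping the first sentence of any near-duplicate group, which is also an assumption.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# A composition is a list of (template_id, slot fillers) pairs.
Composition = List[Tuple[str, Dict[str, str]]]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_plain_corpus(
    composition: Composition,
    verbal_templates: Dict[str, str],
    embed: Callable[[str], Sequence[float]],
    threshold: float = 0.95,
) -> List[str]:
    sentences = []
    for template_id, fillers in composition:
        # Step 2: drop degenerate substitutions (a filler that is empty).
        if any(value == "" for value in fillers.values()):
            continue
        # Step 1: look up the verbal template and substitute the fillers.
        sentences.append(verbal_templates[template_id].format(**fillers))

    # Step 3: greedy dedup, dropping any sentence whose similarity to an
    # already-kept sentence exceeds the threshold.
    kept, kept_vectors = [], []
    for sentence in sentences:
        vector = embed(sentence)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(sentence)
            kept_vectors.append(vector)
    return kept


def toy_embed(sentence: str) -> List[float]:
    """Hypothetical stand-in for the sentence encoder: bag-of-words counts."""
    vocab = ["every", "invoice", "is", "a", "document",
             "payment", "has", "some", "payer"]
    words = sentence.lower().rstrip(".").split()
    return [float(words.count(word)) for word in vocab]


verbal_templates = {
    "subclass": "Every {sub} is a {sup}.",
    "existential": "Every {sub} has some {filler}.",
}
composition = [
    ("subclass", {"sub": "invoice", "sup": "document"}),
    ("subclass", {"sub": "invoice", "sup": "document"}),     # duplicate
    ("existential", {"sub": "payment", "filler": "payer"}),
    ("subclass", {"sub": "", "sup": "document"}),            # degenerate
]
corpus = build_plain_corpus(composition, verbal_templates, toy_embed)
print(corpus)
```

Because every step is a lookup or a deterministic filter, the corpus size for a given composition is bounded by the number of (template_id, σ_i) pairs, consistent with the predictability noted above.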

Null distributions — properly constructed

The v0.1 brief described Gate D’s null as “shuffling word-topic assignments.” That construction was incoherent — it perturbs the topic-model fit, not the ontology. v0.2 replaced it with a structural shuffle of arbitrary OWL. v0.3 specializes the structural shuffle to catalog compositions and lifts the entire computation into P1a (offline catalog construction), so it does not appear in the runtime verifier path.

Null for R_B. A null composition is generated by drawing catalog templates uniformly at random (preserving count) and filling slots with uniformly-sampled fillers from the typed term inventory. This preserves the structural shape (template-count, axiom-kind distribution) while destroying any topic-aligned selection signal. τ_B = 95th percentile of complex_count over 200 null compositions, computed once per catalog version.
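A minimal sketch of the R_B null construction, under stated assumptions: the catalog is represented as a mapping from template id to typed slots, the typed term inventory as a mapping from slot type to candidate fillers, and `complex_count` is passed in as a callable. All of these names and shapes are hypothetical. The 95th percentile uses the nearest-rank definition, which the brief does not specify.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Composition = List[Tuple[str, Dict[str, str]]]


def null_composition(
    n_templates: int,
    catalog_slots: Dict[str, Dict[str, str]],  # template_id -> {slot: slot_type}
    inventory: Dict[str, List[str]],           # slot_type -> candidate fillers
    rng: random.Random,
) -> Composition:
    """Draw templates uniformly (count preserved) and fill slots uniformly."""
    template_ids = sorted(catalog_slots)
    composition = []
    for _ in range(n_templates):
        template_id = rng.choice(template_ids)
        fillers = {
            slot: rng.choice(inventory[slot_type])
            for slot, slot_type in catalog_slots[template_id].items()
        }
        composition.append((template_id, fillers))
    return composition


def percentile_95(values: List[float]) -> float:
    """Nearest-rank 95th percentile (an assumed definition)."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)
    return ordered[rank - 1]


def tau_b(
    n_templates: int,
    catalog_slots: Dict[str, Dict[str, str]],
    inventory: Dict[str, List[str]],
    complex_count: Callable[[Composition], int],
    n_null: int = 200,
    seed: int = 0,
) -> float:
    """Threshold over n_null null compositions; seeded so it is reproducible."""
    rng = random.Random(seed)
    counts = [
        complex_count(null_composition(n_templates, catalog_slots, inventory, rng))
        for _ in range(n_null)
    ]
    return percentile_95(counts)


# Toy catalog and inventory, for illustration only.
catalog_slots = {
    "subclass": {"sub": "class", "sup": "class"},
    "existential": {"sub": "class", "prop": "property", "filler": "class"},
    "cardinality": {"sub": "class", "prop": "property"},
}
inventory = {
    "class": ["Invoice", "Payment", "Ledger"],
    "property": ["hasPayer", "recordedIn"],
}
COMPLEX_KINDS = {"existential", "cardinality"}


def toy_complex_count(composition: Composition) -> int:
    return sum(1 for template_id, _ in composition if template_id in COMPLEX_KINDS)


threshold = tau_b(20, catalog_slots, inventory, toy_complex_count)
print(threshold)
```

Fixing the seed per catalog version is one way to make the stored τ_B reproducible; the brief only requires that it be computed once and cached.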

Null for R_D. Same null composition as R_B, then render the null verbalization corpus from the cached verbal_templates, fit BERTopic on it, compute alignment against the frozen T_I. R_D is normalized so that null-mean alignment maps to 0 and observed best-case (human-baseline + 2σ headroom) maps to 1. 200 null compositions per catalog version; cost is amortized because T_I is fitted once per I version and the catalog templates’ verbalizations are pre-cached.
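The R_D normalization described here is an affine rescaling, sketched below. Two points are assumptions rather than statements from the brief: σ is taken to be the standard deviation of the null alignments, and the result is clipped to [0, 1]. The function name `normalize_r_d` is hypothetical.

```python
import statistics
from typing import Sequence


def normalize_r_d(
    alignment: float,
    null_alignments: Sequence[float],
    human_alignment: float,
) -> float:
    """Map null-mean alignment to 0 and (human baseline + 2 sigma) to 1."""
    null_mean = statistics.fmean(null_alignments)
    sigma = statistics.pstdev(null_alignments)   # assumed meaning of sigma
    best_case = human_alignment + 2 * sigma
    if best_case <= null_mean:
        raise ValueError("best-case alignment must exceed the null mean")
    score = (alignment - null_mean) / (best_case - null_mean)
    return min(1.0, max(0.0, score))             # clipping is an assumption


nulls = [0.10, 0.20, 0.30]   # toy null alignments, not measured values
human = 0.60                 # toy human-baseline alignment
print(normalize_r_d(0.45, nulls, human))
```

Only `null_mean` and `best_case` need to be stored as catalog metadata for the runtime verifier to apply this mapping without re-running the null construction.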

All null statistics are stored as catalog metadata in catalog/C-v0.1.{json,sqlite}. The runtime verifier reads τ_B and the R_D normalization constants from the catalog; it does not re-run the null construction.

RL policy and training loop

Base policy. (Superseded — see the authoritative reference: the operational policy is now Qwen3.5-9B-Base with a held-out SAE-Res-Qwen3.5-9B-Base-W64K-L0_50 residual-stream adapter, sized to the 6×4090 envelope. The 27B design below is the brief’s original target and the rationale for it still holds at the smaller scale.) SAE-Res-Qwen3.5-27B-W80K-L0_100 (instruct variant), a Qwen 3.5 27B base with a residual-stream sparse autoencoder of width 80K and average L0 ≈ 100 active features per token. Two reasons for this choice:

  1. Capacity. A 27B instruct model handles structured-syntax composition (catalog template selection + typed slot-fill) more reliably than a 7B model, especially with the controlled output space the catalog imposes.
  2. Interpretability dividend. SAE residual-stream features make the policy’s internal reasoning inspectable. At P5 we log SAE feature activations during generation; at P6 the comparison study analyzes which features differentiate gate-passing from gate-failing generations. This is a methodological enhancement that the brief earns “for free” by selecting an SAE-equipped base — vanilla 7B models do not offer this surface.

Parameter strategy. LoRA adapters on the base; SAE weights frozen and read-only (the SAE provides interpretability, not training signal). Full fine-tune of a 27B base is infeasible on 6×4090; LoRA + sharded weights (FSDP or tensor-parallel) is the realistic envelope.

Memory envelope. 27B × 2 bytes (bf16) = 54 GB weights. Sharded across 6×4090 (24 GB each, 144 GB aggregate): ~9 GB per GPU for weights, leaving ~15 GB per card for KV cache, activations, LoRA optimizer state, and group-size-8 generation buffers. Context window is constrained for RL training to ~4–8K tokens (ontology compositions don’t need 80K) to keep memory headroom; the SAE’s 80K width is a representation-space property, not a context constraint.
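The arithmetic above can be reproduced with a few lines. This sketch uses decimal gigabytes as the paragraph does and ignores sharding overheads and framework buffers, so it is a lower bound on per-GPU weight memory rather than a profile. `memory_envelope` and `Envelope` are illustrative names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Envelope:
    weights_gb: float
    aggregate_gb: float
    weights_per_gpu_gb: float
    headroom_per_gpu_gb: float


def memory_envelope(
    params_billion: float,
    bytes_per_param: int = 2,    # bf16
    n_gpus: int = 6,
    gpu_mem_gb: float = 24.0,    # one RTX 4090
) -> Envelope:
    # 1e9 parameters x bytes per parameter = decimal gigabytes.
    weights_gb = params_billion * bytes_per_param
    weights_per_gpu = weights_gb / n_gpus
    return Envelope(
        weights_gb=weights_gb,
        aggregate_gb=n_gpus * gpu_mem_gb,
        weights_per_gpu_gb=weights_per_gpu,
        headroom_per_gpu_gb=gpu_mem_gb - weights_per_gpu,
    )


env = memory_envelope(27)
print(env)
```

The headroom figure is what remains for KV cache, activations, LoRA optimizer state, and generation buffers; whether it suffices at group size 8 is what the P4 smoke test is meant to establish.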

RL algorithm. GRPO (Group Relative Policy Optimization, per DeepSeek-R1 / DeepSeekMath) over groups of 8 samples per prompt (reduced to 4 if memory pressure appears during the P4 smoke test). Reward is R(O_i) computed from the policy’s catalog-composed output. The method is critic-free; the advantage is group-relative.
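The group-relative advantage that replaces the critic can be sketched as follows: each sample's reward is standardized against the other samples drawn for the same prompt. The use of the population standard deviation and the small `eps` stabilizer are implementation assumptions, and the rewards shown are toy values.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's group of samples."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(reward - mean) / (std + eps) for reward in rewards]


# One group of 8 sampled compositions for a single prompt (toy R values).
group_rewards = [0.2, 0.4, 0.4, 0.6, 0.9, 0.1, 0.5, 0.7]
advantages = group_relative_advantages(group_rewards)
print([round(a, 3) for a in advantages])
```

A group in which every sample scores the same yields zero advantage for all samples, so such prompts contribute no gradient signal. This is one reason reward sparsity appears among the P5 failure diagnoses.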

Prompt design. Each prompt is a (domain_seed, structural_constraint) pair: a short natural-language description of the ontology’s intended scope, plus the structural commitments (class count, depth, complex-class count) the policy is to satisfy. The policy emits a structured composition: a sequence of (template_id, slot_fillers) tuples, decoded into rendered OWL by a deterministic post-processor. Domain seeds are drawn from a held-out set so that paper 1’s evaluation runs on a prompt distribution unseen during training.

Training budget. ~1000–3000 GRPO steps, group size 4–8, ~120–200 GPU-hours on 6×4090 with LoRA + tensor-parallel sharding. The catalog-precompute pivot eliminates DeepOnto’s per-step JVM cost; the new dominant cost is generation throughput on the 27B policy. P4’s smoke test confirms the actual per-step wall clock before P5 commits.

Checkpointing. Per-step mean R, per-step max R, and per-step SAE feature-activation summary statistics are logged. The best-R and final-step LoRA adapters are kept, and both are evaluated separately at the P5 exit gate.

Comparison study (P6)

Three policies generate ontologies from the same held-out prompt set:

  • Random LLM (no RL): Qwen2.5-7B-Instruct without any optimization.
  • Prompt-evolved (DSPy/GEPA): following gaius’s approach — evolve the prompt against R without weight updates.
  • GRPO-trained (this brief): the policy from P5.

Each generates 100 ontologies on the held-out prompts. Mean R, median R, and R distribution shape are reported per policy. C2 is evaluated by paired comparison over the same prompts — does the GRPO policy score higher than each baseline at p < 0.05 under a Wilcoxon signed-rank test?
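The paired comparison can be illustrated with a small pure-Python Wilcoxon signed-rank test. This is a sketch under simplifying assumptions: it is one-sided (GRPO greater than baseline), drops zero differences, averages tied ranks, and uses the normal approximation without tie or continuity corrections. An established implementation such as `scipy.stats.wilcoxon` would be preferable for the actual study. The reward lists are synthetic.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W+, one-sided p) for H1: paired values in x exceed those in y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| ascending, giving tied magnitudes their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        while end + 1 < n and abs(diffs[order[end + 1]]) == abs(diffs[order[start]]):
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of 1-based ranks start+1..end+1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1

    w_plus = sum(rank for rank, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return w_plus, p_value


# Synthetic R scores on the same 100 held-out prompts (not real results).
grpo_scores = [0.50 + 0.004 * i for i in range(100)]
baseline_scores = [g - 0.05 - 0.001 * (i % 7) for i, g in enumerate(grpo_scores)]

w_plus, p_value = wilcoxon_signed_rank(grpo_scores, baseline_scores)
print(w_plus, p_value < 0.05)
```

Pairing by prompt matters here: the test operates on per-prompt differences, so each policy must be scored on an identical prompt list in an identical order.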

The human-authored baseline is one ontology per author-week of effort; it is reported as a single point on the R axis with methodology-section discussion of why it is or is not exceeded.

Phase structure with formal gates

Each phase carries an entry gate (preconditions) and an exit gate (verification cases that must pass before progressing). Failure at an exit gate halts forward progress until resolved or the brief is revised.

P0 — Literature review and positioning

  • Scope: Review prior art across (a) ontology learning from text, (b) verbalization-augmented language modeling, (c) verifiable reward in language models, (d) data curation with verifiable signals, (e) topic-model evaluation of ontologies. Produce a positioning document at docs/scratch/YYYY-MM-DD/HHMMSS_lit_review.md.
  • Entry gate: brief approved by project lead.
  • Exit gate: positioning document explicitly states (i) what is done in prior art, with citations; (ii) what is open; (iii) the specific intersection this brief targets, with a “we are not aware of prior work that…” statement supported by the review. If novelty does not survive, brief is revised before P1.
  • Estimated effort: ~2 weeks of focused reading.

P1 — Catalog C construction + runtime verifier

  • Scope: This is the largest engineering phase. Two sub-deliverables:
    • P1a — Catalog construction (offline DeepOnto). Author/generate ~500 candidate axiom templates spanning OWL 2 axiom shapes; spawn one JVM, run DeepOnto loadability + complex-class + verbalizer over each; drop failures; cache results to catalog/C-v0.1.{json,sqlite}. Compute structural-shuffle null distribution (200 shuffles) and cache τ_B, R_D normalization statistics.
    • P1b — Runtime verifier (no JVM). Implement scripts/aegir-verify reading from C: structural type-check for R_A, lookup-based R_B and R_C, BERTopic fit + Hungarian alignment for R_D. Implement aggregation. Lock I + topic model + catalog version.
  • Entry gate: P0 exit passed.
  • Exit gate:
    1. C contains ≥ 200 surviving templates spanning at least 5 distinct axiom shapes (subclass, equivalent-with-intersection, existential, universal, cardinality).
    2. aegir-verify runs end-to-end with no Java/JVM dependency active.
    3. aegir-verify <composition.json> produces deterministic R ∈ [0, 1] hash-stable across 3 independent runs.
    4. Per-composition runtime ≤ 1 s on CPU (target informed by RL throughput budget).
    5. Corpus-size sweep on I shows BERTopic stability (silhouette score variance < 0.05).
  • Estimated effort: ~3 weeks (catalog construction is the bottleneck — template authorship + DeepOnto validation pass takes 1.5 weeks, runtime verifier ~1 week, corpus sweep + locks ~0.5 week).
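Exit gate 3 (deterministic, hash-stable R) could be checked along the following lines. The sketch hashes a canonical JSON serialization of the verifier's result and compares digests across repeated runs. `result_digest`, `is_hash_stable`, and both toy verifiers are hypothetical, and repeated in-process calls stand in for the independent runs the gate actually requires.

```python
import hashlib
import itertools
import json
from typing import Any, Callable, Dict


def result_digest(result: Dict[str, Any]) -> str:
    """SHA-256 of a canonical JSON form (sorted keys, fixed separators)."""
    canonical = json.dumps(result, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_hash_stable(
    verify: Callable[[Dict[str, Any]], Dict[str, Any]],
    composition: Dict[str, Any],
    runs: int = 3,
) -> bool:
    """True if every run of the verifier produces an identical digest."""
    digests = {result_digest(verify(composition)) for _ in range(runs)}
    return len(digests) == 1


def toy_verify(composition: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in verifier: R is the share of 5 axiom shapes that appear."""
    shapes = {template_id for template_id, _ in composition["items"]}
    return {"R": round(min(1.0, len(shapes) / 5), 6)}


_call_counter = itertools.count()


def flaky_verify(composition: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for a verifier that leaks run-dependent state into its output."""
    return {"R": 0.5, "call": next(_call_counter)}


composition = {
    "items": [
        ["subclass", {"sub": "Invoice", "sup": "Document"}],
        ["existential", {"sub": "Payment", "filler": "Payer"}],
    ]
}
print(is_hash_stable(toy_verify, composition))    # stable verifier
print(is_hash_stable(flaky_verify, composition))  # unstable verifier
```

Hashing the whole result rather than comparing R alone also catches nondeterminism in per-component scores that happens to cancel out in the aggregate.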

P2 — Verifier validation (claim C1)

  • Scope: Build a labeled test set of ontologies — known-good (BFO, OBO Foundry exemplars, hand-authored quality samples), known-bad (LLM-generated junk, syntactically valid but conceptually empty, structurally truncated). Compute R on each. Tune aggregation weights {a, b, c} to maximize AUC. Lock weights.
  • Entry gate: P1 exit passed.
  • Exit gate:
    1. Labeled test set ≥ 30 ontologies, ≥ 10 known-good, ≥ 10 known-bad.
    2. AUC of R against label ≥ 0.85.
    3. Mean(R | good) − Mean(R | bad) ≥ 0.30.
    4. Weights {a, b, c} locked and committed to the verifier as a constant; re-running the verifier reproduces the AUC.
  • Estimated effort: ~2 weeks (test set construction is the bottleneck).
  • Failure mode: if AUC < 0.85, the verifier does not discriminate enough to be useful as an RL reward. Diagnose which component is weakest and revise (most likely Gate D is too noisy or Gate B’s threshold is mis-calibrated). Iterate before P3.
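Exit gates 2 and 3 reduce to two statistics over the labeled scores, sketched below. AUC is computed as the probability that a known-good ontology outscores a known-bad one, with ties counted as one half. The function names are illustrative and the scores are toy values, not results.

```python
import statistics
from typing import Dict, Sequence, Union


def auc(good_scores: Sequence[float], bad_scores: Sequence[float]) -> float:
    """Fraction of (good, bad) pairs ranked correctly; ties count one half."""
    wins = 0.0
    for good in good_scores:
        for bad in bad_scores:
            if good > bad:
                wins += 1.0
            elif good == bad:
                wins += 0.5
    return wins / (len(good_scores) * len(bad_scores))


def p2_exit_gate(
    good_scores: Sequence[float],
    bad_scores: Sequence[float],
    min_auc: float = 0.85,
    min_gap: float = 0.30,
) -> Dict[str, Union[float, bool]]:
    """Evaluate the AUC threshold and the mean-gap threshold together."""
    area = auc(good_scores, bad_scores)
    gap = statistics.fmean(good_scores) - statistics.fmean(bad_scores)
    return {"auc": area, "gap": gap, "passed": area >= min_auc and gap >= min_gap}


# Toy R values for known-good and known-bad ontologies.
good = [0.90, 0.80, 0.85, 0.70, 0.75]
bad = [0.20, 0.30, 0.40, 0.72, 0.10]
report = p2_exit_gate(good, bad)
print(report)
```

Because the weights {a, b, c} are tuned to maximize this AUC on the same test set, a held-out split or cross-validation would give a less optimistic estimate; the brief does not say which is used.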

P3 — Human-authored baseline ontology

  • Scope: Project author produces aegir-vocab.ttl against the structural commitments. The author may use the catalog C as a drafting tool (selecting + slot-filling templates) or compose free OWL outside the catalog; the latter is recorded so that the comparison study (P6) can fairly compare catalog-bound policy outputs against catalog-free human authorship. Iterate against the verifier until R(human-authored) ≥ 0.70 (sanity).
  • Entry gate: P2 exit passed.
  • Exit gate:
    1. aegir-vocab.ttl parses, satisfies the structural commitments mechanically.
    2. R(aegir-vocab.ttl) ≥ 0.70 against locked verifier — note that scoring a free-OWL human-authored ontology requires the catalog to be expressive enough to encode the human author’s axiom shapes for verifier purposes; mapping from free OWL to catalog compositions for scoring is a P3-internal subtask.
    3. Manual review confirms the ontology is genuinely the project’s own work (per Charter §Provenance discipline).
  • Estimated effort: ~4–8 weeks. The brief acknowledges this is the largest creative effort and the hardest to bound. Two-week estimates from v0.1 are dropped.

P4 — RL infrastructure smoke test

  • Scope: Stand up the GRPO loop with Qwen2.5-7B + LoRA. Verify the loop converges on a trivial reward (R_trivial = “output contains the string Class:”) within 50 steps. Verify GPU budget per step matches estimates.
  • Entry gate: P3 exit passed.
  • Exit gate:
    1. Trivial-reward training reaches mean R_trivial = 1.0 within 50 steps.
    2. Per-step wall clock and memory profile within 2× of the budget estimate.
    3. Checkpoint save/load round-trips cleanly.
  • Estimated effort: ~1 week.
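The trivial reward and its exit-gate check are simple enough to state directly. In this sketch `r_trivial` follows the definition in the scope line, while `reached_target` and the logged history are hypothetical illustrations of how gate 1 might be evaluated from per-step mean rewards.

```python
from typing import Sequence


def r_trivial(output: str) -> float:
    """1.0 if the generated text contains the string 'Class:', else 0.0."""
    return 1.0 if "Class:" in output else 0.0


def mean_reward(outputs: Sequence[str]) -> float:
    return sum(r_trivial(output) for output in outputs) / len(outputs)


def reached_target(per_step_means: Sequence[float], max_steps: int = 50) -> bool:
    """Gate 1: some step within the first max_steps has mean R_trivial == 1.0."""
    return any(mean == 1.0 for mean in per_step_means[:max_steps])


# Toy group of generations at one step, and a toy per-step history.
outputs = ["Class: Invoice\n  SubClassOf: Document", "no axioms here"]
history = [0.25, 0.5, 0.75, 1.0]
print(mean_reward(outputs), reached_target(history))
```

A reward this easy to satisfy tests only the plumbing (sampling, advantage computation, LoRA updates, checkpointing), which is the stated purpose of P4.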

P5 — RL training run (claim C2)

  • Scope: Train π_θ (LoRA over SAE-Res-Qwen3.5-27B-W80K-L0_100) via GRPO against locked R on the held-out prompt training set. Log per-step mean R, max R, gate-pass rates per component, and SAE feature-activation summary statistics. Save best-R and final-step LoRA adapters.
  • Entry gate: P4 exit passed.
  • Exit gate:
    1. Training completes within budget (≤ 200 GPU-hours total on 6×4090; ~3× v0.2’s 72-hour budget to reflect 27B vs. 7B).
    2. Best-R checkpoint produces R(π_θ) > R(human-authored baseline) on a 50-prompt held-out evaluation set.
    3. Per-component gate-pass rates ≥ 80% on held-out evaluation set (i.e., the policy isn’t exploiting one component while ignoring the others).
    4. SAE feature-activation logs collected; ready for P6 analysis.
  • Failure mode: if the policy fails to exceed human baseline, diagnose (under-trained, reward sparsity, prompt-set distribution too narrow, catalog expressivity bound). Re-train or revise reward composition.

P6 — Comparison study (claim C2 finalized) and paper 1 writeup

  • Scope: Generate 100 ontologies each from random-LLM (SAE-Res-Qwen3.5-27B... no RL), prompt-evolved (DSPy/GEPA over the same model), and GRPO-trained (this brief) policies on 100 held-out prompts. Compute R per ontology. Wilcoxon signed-rank pairwise tests. Effect-size estimation. Per-component analysis. SAE feature-attribution analysis: which features differentiate gate-passing from gate-failing generations? Draft paper 1.
  • Entry gate: P5 exit passed.
  • Exit gate:
    1. GRPO mean R > prompt-evolved mean R at p < 0.05.
    2. GRPO mean R > random-LLM mean R at p < 0.05.
    3. Per-component analysis isolates which gates the GRPO policy improves on (informs C3 and paper 2’s ablations).
    4. SAE feature-attribution analysis identifies ≥ 5 features whose activation differs significantly between high-R and low-R generations (interpretability methodological enhancement).
    5. Paper 1 draft circulated for internal review.
  • Estimated effort: ~3 weeks.

P7 — Verbalization corpus + v3 pretrain (paper 2 begins)

  • Scope: Generate verbalization corpora from (a) human-authored baseline, (b) prompt-evolved best, (c) GRPO best. Run v3 pretraining at the same weight (0.05 of corpus mix) for each configuration plus a v2-replication baseline. Each run is the full v2 schedule (~10 GPU-hours). Add a new eval.ontology-recall slice fit on held-out verbalizations.
  • Entry gate: P6 exit passed.
  • Exit gate:
    1. Four pretrain runs complete; metrics + stratified eval committed.
    2. Comparison table: v2 vs. v3-{human, evolved, GRPO} on the full stratified eval surface.
  • Estimated effort: ~2 weeks (mostly unattended training).

P8 — Pretraining ablations and component validity (claim C3)

  • Scope: For the best-performing v3 configuration from P7, ablate each verifier component: re-train the policy with R’ that drops one component at a time, regenerate verbalization corpus, re-pretrain. Identify which gates carry predictive validity for downstream lift.
  • Entry gate: P7 exit passed and at least one v3 configuration shows ≥ 0.10 bpb lift on eval.ontology-recall over v2. (If no configuration shows lift, paper 2’s contribution is a bounded negative result; ablations refocus on understanding why.)
  • Exit gate:
    1. Ablation table per gate.
    2. Statistical test on whether each gate’s removal causes significant degradation.
    3. C3 statement is empirically supported or empirically refuted.
  • Estimated effort: ~6 weeks (multiple pretrain runs).

P9 — Paper 2 writeup

  • Scope: Draft paper 2 covering P7 + P8 results.
  • Entry gate: P8 exit passed.
  • Exit gate: paper 2 draft circulated.

Resource budget summary

| Resource | P0–P6 (paper 1) | P7–P9 (paper 2) | Total |
| --- | --- | --- | --- |
| GPU-hours (training) | ~200 (P5 RLVR with 27B policy) | ~50 (4 pretrains) + ~150 (8 ablation pretrains) = ~200 | ~400 |
| GPU-hours (catalog construction, P1a) | ~30 (DeepOnto pass over ~500 templates; CPU-bound in practice) | | ~30 |
| Wall-clock | ~16–20 weeks | ~10–12 weeks | ~7 months optimistic, ~10 realistic |
| Author-weeks | ~12 (P0 lit review + P3 ontology authorship dominate; P1a catalog authoring adds ~2) | ~5 | ~17 |
| Paper outputs | 1 | 1 | 2 |

The brief’s largest single risk is P3 (human-authored ontology). A clean, gate-passing, BFO-anchored, project-domain ontology of the required structural shape may consume 6–8 weeks; it can also stall if the project domain doesn’t have a natural ontology shape. Mitigation: choose the project domain in P0 for ontology tractability, not just for downstream-task alignment.

The catalog-construction pivot (v0.3) shifted the engineering profile: P1 grew (~3 weeks vs. v0.2’s 2) because the catalog authoring + DeepOnto pass is now in scope, and P5 became more expensive (27B vs. 7B base policy, ~200 vs. ~75 GPU-hours), but the v3 pretraining and inference paths are now JVM-free. Net engineering cost is comparable; runtime production characteristics are substantially better.

What this brief does not commit to

  • A specific ontology domain. Chosen during P0 against literature review and authorial expertise.
  • A specific RL algorithm beyond “GRPO-family.” If GRPO underperforms in P4, PPO or even REINFORCE with baseline are fallbacks.
  • A specific policy model beyond SAE-Res-Qwen3.5-27B-W80K-L0_100. If memory pressure during P4 forces a smaller model, downsizing to a 7B SAE-equipped variant (or a non-SAE 7B with the interpretability dividend dropped) is permitted with a documented rationale. The brief’s headline claims do not depend on the SAE surface — they depend on GRPO with a deterministic verifier.
  • Catalog C size beyond “≥ 200 surviving templates spanning ≥ 5 axiom shapes.” Larger catalogs improve expressivity at the cost of P1a authorship time.
  • Paper-2 success. Paper 2 stands or falls on P7 + P8 results; paper 1 stands independently on P6.

Risks and bounding negative results

  • C1 fails (verifier doesn’t discriminate). Diagnose at P2; iterate on aggregation weights or component definitions before P3. Worst case: the four-gate framing is insufficient and the brief is revised.
  • C2 fails (RL doesn’t beat baselines). Diagnose at P5; revise RL infrastructure or reward shaping. If paper 1’s headline does not land, the lit-review-bounded claim “GRPO with this verifier produces comparable but not superior ontologies to prompt evolution” is still publishable as a negative result, with diagnostic value for the field.
  • C3 fails (no pretraining lift). Paper 2 reports the bounded negative result: a verifier with strong discrimination and an RL loop that maximizes it does not, on this dataset, produce ontologies whose verbalizations measurably help byte-level pretraining. This is publishable as a bound on RLVR’s reach.
  • Verbalizer brittleness. DeepOnto’s OntologyVerbaliser is template-driven; the policy may exploit verbalizer-friendly axiom patterns at the expense of semantic depth. Mitigation: P2’s test set includes ontologies with diverse axiom shapes; AUC computation surfaces verbalizer exploitation.
  • Topic-model brittleness. BERTopic on small corpora is unstable in HDBSCAN clustering. Mitigation: P1 includes a corpus-size sensitivity sweep; if instability is intractable, NMF becomes the primary method and BERTopic an ablation.
  • Provenance drift. As the policy is trained against the human baseline + held-out prompt distribution, it may converge toward generating ontologies that read as derivative of the baseline. The Charter’s Provenance discipline applies to all ontologies entering the project artifact bundle; policy outputs that are accepted into aegir-vocab.ttl (vs. remaining in the policy’s evaluation set) go through the same PR review.
  • Author-week budget overrun. P3 dominates; if it stalls, paper 1 cannot complete. Mitigation: P3’s “≥ 0.70” threshold can be relaxed if the bottleneck is verifier-strictness rather than authorship quality (revisit P2 weights).

References (committed for P0 lit review)

The P0 lit review is committed — these are the starting set, not exhaustive.

Verifiable reward in language models:

  • Lambert, N., et al. (2024). Tülu 3: Pushing frontiers in open language model post-training.
  • DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. (GRPO applied to reasoning post-training.)
  • Shao, Z., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. (Introduces the GRPO algorithm.)

Ontology engineering with deep learning:

  • He, Y., Chen, J., Antonyrajah, D., Horrocks, I. (2023). DeepOnto: A Python package for ontology engineering with deep learning.
  • Auer, S., et al. (2023). SciQA: A Scientific Question Answering Benchmark for Scholarly Knowledge. (Ontology-grounded LM evaluation.)

Verbalization and language modeling:

  • Petroni, F., et al. (2019). Language models as knowledge bases? (LAMA probing.)
  • Logan, R., et al. (2019). Barack’s wife Hillary: Using knowledge graphs for fact-aware language modeling. (KGLM.)
  • Zhang, Z., et al. (2019). ERNIE: Enhanced language representation with informative entities.
  • Wang, X., et al. (2021). KEPLER: A unified model for knowledge embedding and pre-trained language representation.

Data curation with verifiable signals:

  • Albalak, A., et al. (2023). A survey on data selection for language models.
  • Penedo, G., et al. (2024). FineWeb: Decanting the web for the finest text data at scale.

Topic modeling:

  • Blei, D., Ng, A., Jordan, M. (2003). Latent Dirichlet allocation. (Legacy reference; not the headline method.)
  • Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure. (Headline method.)
  • Lee, D., Seung, H. (1999). Learning the parts of objects by non-negative matrix factorization. (NMF baseline.)

Ontological foundations:

  • Arp, R., Smith, B., Spear, A. (2015). Building ontologies with Basic Formal Ontology. MIT Press.
  • Common Core Ontologies (CCO). github.com/CommonCoreOntology/CommonCoreOntologies.

Aegir / sibling project internal:

  • Charter — outward contract and provenance discipline.
  • Migration — vocabulary authorship that produces the human baseline ontology consumed at P3.
  • Training Regime §10 — v2 baseline against which paper 2’s lift is measured.
  • Sibling project Gaius: scripts/ontology_denovo_pipeline.py and scripts/validate_generated_ontology.py — gates A–C reference implementation; gaius’s gate D is stub-only, replaced by this brief.

Status

  • v0.1 — superseded; named four gates but framed as “RLVR” while describing static validation; mixed two contribution claims; null distribution incoherent; LDA as default; 1–2 week ontology authorship estimate unrealistic.
  • v0.2 — superseded; locked contribution to verifier R + GRPO loop; split into two papers; replaced null distribution with structural-shuffle null; promoted BERTopic; revised P3 estimate to 4–8 weeks; committed literature review as P0. Used Qwen2.5-7B as policy and assumed live DeepOnto in the verifier.
  • v0.3 — superseded; introduced procedural catalog and SAE-Res-Qwen policy, but did not yet incorporate the P0 lit review’s adjacent-work differentiations.
  • v0.4 — superseded; added KELM / OntoTune / Zaitoun et al. differentiation paragraphs from v0 lit review.
  • v0.5 — this document. Incorporates v1 P0 lit review findings (docs/scratch/2026-05-09/225830_lit_review_v1.md): three new must-cite differentiations (OLLM, AutoGraph-R1, K2V) plus OnT for the TBox-embedding axis; a new “Load-bearing novelty” section that sharpens the contribution claim to “the OWL artifact is the only graph-structured output where the verifier can run a sound-and-complete reasoner producing both structural and semantic verdicts”; positions the brief as the synthesis of four precursor recipes (graph-output RL, schema-validator RL, code/SQL execution RL, verbalization-corpus pretraining) onto OWL where the verifier acquires DL deductive semantics. P0 exit gate firmly green. Two architectural changes vs. v0.2:
    1. Procedural catalog C. DeepOnto runs offline at catalog construction time only. The runtime verifier and the v3 pretraining/inference paths have no JVM, no DeepOnto, no Java dependencies. Per-step verifier cost drops by 1–2 orders of magnitude. Trade-off: bounded expressivity (policy can only compose what C contains) and weaker R_C signal.
    2. Policy upgraded to SAE-Res-Qwen3.5-27B-W80K-L0_100 (instruct). Capacity for structured-syntax composition; SAE residual-stream features add interpretability surface for P5/P6 analysis. Memory envelope tighter; GPU-hour budget for P5 grows from ~75 to ~200.
  • v0.3 → v1.0 transition: contingent on P0 exit gate (positioning doc finalized). Until then this brief is provisional and the contribution claim is subject to revision based on lit-review findings.

Updates track in docs/scratch/YYYY-MM-DD/ session notes.

Aegir’s semantic engine: an authoritative reference

Last updated 2026-06-29. This document describes the current operational state of Aegir’s semantic engine for external/advisory audiences. The canonical, code-exact reference for every metric, band, gate, and membrane is the Authors Guide; this document summarizes the system and its empirical footing and does not duplicate the guide’s formulas.

Revision note. Earlier revisions of this page (through 2026-05-12) described a four-component verifier R(O, I) over (ontology composition, corpus) pairs and a GRPO/RLVR policy trained against it, with a 540-template catalog as the deliverable. That framing, the Concept brief / RLVR design (see Concept brief, RLVR), is the long-horizon Signals M4 apparatus. Its verifier R(O, I) is now realized as the deterministic membrane stack (HermiT/CCO, OntoClean, OQuaRE), and the content-first derivation + rigor program documented here and in the Authors Guide is the agent-mediated propose/dispose loop that is building and proving that reward today, in direct service of M4. The RLVR pages carry the M4 research design. Structural note (flagged, not performed): this page now overlaps substantially with the Authors Guide and the Charter; a future editorial pass may want to merge or re-scope it.

Abstract

Aegir’s semantic engine is a BFO 2020 / CCO-grounded domain ontology — the Signals Data Governance (SDG) ontology — that is content-derived from FinePDFs and realized to a HermiT-validated OWL artifact, together with the closed-loop pipeline that turns it into an ontology-grounded synthetic pretraining corpus and a relational DDL spine. The ontology’s classes are intermediate-depth subsumers that serve as the annotation vocabulary for Column Type / Column Property Annotation (CTA/CPA) over wide relational tables.

Rigor is enforced, not asserted, through an agent-mediated propose / dispose feedback loop: an engine proposes axioms; a stack of deterministic membranes — a Manchester parse membrane, a HermiT reasoning-authority membrane (with CCO imported so grounding is checked against CCO’s disjointness axioms), and an OntoClean meta-property membrane — disposes and returns the reason, and the agent responds. The two strongest membranes (HermiT and OntoClean) are un-fakeable: an extension cannot talk its way past a contradiction or an anti-rigidity violation.

The realized artifact lives at corpora/ontology/sdg-ontology.{omn,owl} with a consistency certificate (HERMIT_CERTIFICATE.md). At the time of writing it has 285 named classes, 0 unsatisfiable classes, and clears the pre-registered rigor objectives — definitional completeness 0.554, BFO-grounding 0.896, realizable-machinery 10 — with an OQuaRE aggregate of 4.24 (GREEN).

1. Background

The substantive engineering claim is that a bespoke ontology grounded in upper-level formal foundations (BFO 2020 + CCO) produces verifiable cross-context cousining — a relation between concepts from disparate operational domains that share an upper-class ancestry — and that this ontology can serve as a rigorous, content-grounded CTA/CPA annotation vocabulary. The methodological claim is that rigor can be enforced by deterministic disposal membranes rather than asserted: a generic LLM recovers taxonomy and existential restrictions (structure) but not definitional sufficiency, full grounding, or BFO role discipline (rigor), and the membranes are precisely what hold the line on the latter.

The distinctive position is generative + content-grounded + IOF-rigorous: the ontology is derived from text (not expert-authored like the IOF), yet the IOF-discipline layer is applied and measured on top of it.

The combination — a content-derived OWL ontology disposed by an un-fakeable reasoner/OntoClean membrane, used as a CTA/CPA annotation vocabulary and to seed a verbalization-grounded pretraining corpus — sits at the intersection of several lines of prior work. Each component has prior art on adjacent artifacts; the combination is the gap.

KELM / TEKGEN [Agarwal et al. 2021]. The closest verbalization-corpus recipe: verbalize a structured knowledge source into natural-language sentences, integrate as a model training corpus, measure downstream effect. KELM verbalizes ABox triples from Wikidata; the present system verbalizes TBox axioms from a bespoke OWL ontology into a byte-level pretraining slice. The structural content differs (taxonomic and logical class expressions vs. instance-level facts) and the integration is a pretrain mix rather than a retrieval corpus.

OLLM [Lo et al. 2024]. The closest end-to-end LLM ontology generation approach: fine-tunes an LLM with a custom regulariser to produce taxonomic backbones from scratch. The present system derives classes from FinePDFs content and disposes them with a reasoner + OntoClean membrane, emitting full OWL with restrictions and equivalentClass genus-differentia definitions rather than taxonomic backbones only.

AutoGraph-R1 [Tsang et al. 2026] and K2V [Yuan et al. 2026]. The closest RL-trained graph-emitting and RLVR-with-KG-derived-reward methods. Both establish that verifiable structured-output shaping works; the present system’s disposal authority is an intrinsic, semantically grounded DL reasoner (HermiT over the realized ontology with CCO imported) plus reasoner-invisible OntoClean checks, rather than an extrinsic LLM-judge or QA-accuracy signal. OWL’s class-axiom expressivity is what makes the DL-reasoner membrane possible.

OntoTune [Liu et al. 2025] and Zaitoun, Sagi, Peleg [AAAI Symposium 2024] and OnT [Yang et al. 2024]. Adjacent LLM ontology / OWL-verbalization work — iterative ontology-grounded SFT, LLM-assisted axiom verbalization for SFT pairs, and TBox-axiom-aware embedding training respectively. The present system treats verbalizations as a flat self-supervised byte-level corpus mixed with general pretraining text, and treats the ontology itself as a disposed, publishable artifact.

Secondary methodological precedents — symbolic RL on EL++ concepts, JSON-Schema / execution-validator RL, retrieval-augmented graph reasoning — are listed in the project’s lit-review document and not elaborated here.

3. The Signals Data Governance ontology

SDG is a bespoke OWL ontology designed to underwrite metadata tagging across operational contexts that enterprise data-governance teams ordinarily treat as separate disciplines: LIMS sample tagging, MBSE/SysMLv2 system design, database metadata governance, kernel-trace observability, and PROV-O / OpenLineage data lineage. It is grounded in BFO 2020 and CCO and organized into five top-level branches plus a belief branch:

  • Artifact (CCO) — material things, datasets, programs.
  • DesignativeICE (CCO) — names, identifiers, designators.
  • DescriptiveICE (CCO) — measurements, claims, lineage records; hosts the sdg:BeliefStructure (DST) branch.
  • DirectiveICE (CCO; alias of cco:ont00000965 “Prescriptive ICE”) — requirements, controls, policies, constraints.
  • Process (BFO 2020 bfo:0000015) — observation, derivation, governance activity.

The branches are cross-cousined: every domain context contributes classes to multiple branches, anchored at shared upper-level parents. This is the load-bearing architectural invariant — the ontology is forced to express cross-context concepts as shared subclasses of common BFO/CCO ancestors rather than as discipline-specific aliases for the same real-world entity. (The full committed branch structure and external-standard anchors are in the Charter.)

3.1 Classes as the annotation vocabulary

Serving as the annotation vocabulary reframes what most of the classes are: not leaf terms but intermediate-depth subsumers, the property-bearing classes that a heterogeneous-but-coherent column belongs to. A driver_stops_schedule.stops_addresses column holds a mix (origin + destination, residential + business shipping addresses, each bearing an avg-time-on-site); no leaf type fits, so the right annotation is the least common subsumer that is still property-bearing, e.g. Address ⊓ ∃has-shipping-role ⊓ ∃avg-time-on-site. Defining these intermediate classes well is building the annotation vocabulary, and the rigor gates exist to keep every term a coherent, grounded annotation target.
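The selection rule in the paragraph above can be illustrated on a toy taxonomy. This is a minimal sketch, not the project's implementation: the class names, the single-parent tree shape, and the PROPERTY_BEARING set are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical toy taxonomy (child -> parent); every name here is illustrative.
PARENT = {
    "ResidentialShippingAddress": "ShippingAddress",
    "BusinessShippingAddress": "ShippingAddress",
    "ShippingAddress": "Address",
    "BillingAddress": "Address",
    "Address": "TopLevel",
    "TopLevel": None,
}
# Assumption: classes that carry at least one property restriction.
PROPERTY_BEARING = {"ShippingAddress", "Address"}


def ancestors(cls: str) -> list[str]:
    """Return cls followed by its ancestors, nearest first."""
    chain = []
    current: Optional[str] = cls
    while current is not None:
        chain.append(current)
        current = PARENT[current]
    return chain


def property_bearing_lcs(observed: list[str]) -> Optional[str]:
    """Most specific property-bearing class subsuming every observed type."""
    common = set(ancestors(observed[0]))
    for cls in observed[1:]:
        common &= set(ancestors(cls))
    # Walk upward from the first observed type; the first hit is the most specific.
    for candidate in ancestors(observed[0]):
        if candidate in common and candidate in PROPERTY_BEARING:
            return candidate
    return None
```

A column mixing residential and business shipping addresses resolves to ShippingAddress; adding a billing address widens the annotation to Address.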

3.2 Content-first derivation

The ontology is derived from FinePDFs, concept-filtered by a ColBERT/Qdrant MaxSim domain filter over a SKOS index (scripts/derive_ontology.py::_apply_domain_filter). The seven family catalogs under src/aegir/ontology/catalog/ (01_foundation through 07_long_tail) are a seed and regression baseline; FinePDFs-derived intermediate classes accrete in 08_derived.json. The live driver is the content-first pipeline, not a fixed template count: text → engine derives candidate classes → grounding-anchor retrieval supplies a real genus → the disposal membranes admit or reject. Classes are authored as Manchester-syntax catalog templates with a typed slot DSL that the realizer renders, grounds, and validates into the OWL artifact (scripts/build_realized_ontology.py).

3.3 Cross-context cousining — concrete instances

Cousining is verifiable directly from the catalog. Representative instances:

  • bfo:Process (BFO 2020 bfo:0000015) is shared by sdg:LabRun (LIMS), sdg:Trace (database / MBSE), sdg:eBPFEvent (kernel observability), sdg:Transformation + sdg:Allocation (lineage), and sdg:Audit + governance activity.
  • cco:DescriptiveICE is shared by LIMS measurements, database governance records (sdg:ColumnPolicy), PROV-O lineage edges, and the DST primitives sdg:Evidence / sdg:Claim / sdg:BeliefInterval / sdg:MassFunction.
  • cco:DirectiveICE (alias of cco:ont00000965) is shared by HIPAA rules, column policies, SQL constraints, SysMLv2 constraints, and eBPF security-policy classes — SQL CHECK clauses and SysMLv2 constraint blocks land at the same upper class as a HIPAA Privacy Rule provision.
  • cco:DesignativeICE is shared by database identifiers, the kernel syscall surface, and schema.org alignment properties — syscalls and database identifiers are treated as cousins, not separate disciplines.

3.4 Atelier ↔ Aegir state-fusion via DST

The sdg:BeliefStructure primitives — sdg:MassFunction, sdg:BeliefInterval, sdg:Evidence, sdg:Claim — provide shared structural vocabulary for Dempster-Shafer evidence-fusion pipelines. The Aegir agent-swarm state-fusion layer consumes belief structures emitted by Atelier, a sibling project providing DST-based evidence fusion for enterprise data-classification, using these names directly and without translation. The cousining is at the explicit-uncertainty layer: the same sdg:Evidence → sdg:Claim → sdg:BeliefInterval triplet covers LIMS quality-tier evidence, audit findings, lineage-edge plausibility, and column-tag claims at calibrated confidence levels.

4. Rigor — the metric suite and the publish gate

Rigor is measured by scripts/ontology_metrology.py::compute() (pure rdflib, JVM-free, with CCO’s subClassOf backbone merged so cco:-chains resolve to BFO) and gated by scripts/ontology_oquare.py. The exact formulas, bands, and characteristic mapping are in Authors Guide §§ 3–4; this section gives the operational summary.

4.1 The metric families

  • IOF-derived rigor dimensions (the discriminators a shallow author misses): definitional_completeness (fraction of classes defined with EquivalentTo genus+differentia), bfo_grounded (fraction whose subClassOf/≡-genus chain reaches a BFO IRI), realizable_machinery (count of BFO role/disposition/function restrictions), def_annotation_coverage (fraction carrying an iao:0000115 / rdfs:comment / skos:definition).
  • Field-standard structural metrics (OntoQA / OQuaRE): rr, ir, ar, aronto, dit, and tm (tangledness, inverted).
  • OntoClean taxonomic-correctness proxies (reasoner-invisible yet checkable, hence un-gameable): subsumption_cycles, ontoclean_violations, sibling_disjointness, orphan_rate, taxonomic_cleanliness.
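Two of the fraction-style metrics above can be sketched on a toy class table. This is an illustration under stated assumptions, not the code in scripts/ontology_metrology.py: the dictionary layout is invented here, and sdg:GovernanceActivity and sdg:Orphan are hypothetical class names.

```python
# Toy class table: name -> (parent or None, has an EquivalentTo definition).
# The layout and two of the class names are assumptions for illustration.
CLASSES = {
    "sdg:LabRun": ("bfo:0000015", True),
    "sdg:Audit": ("sdg:GovernanceActivity", False),
    "sdg:GovernanceActivity": ("bfo:0000015", True),
    "sdg:Orphan": (None, False),
}


def reaches_bfo(name: str, classes: dict) -> bool:
    """Follow the parent chain until a bfo: IRI, a dead end, or a cycle."""
    seen = set()
    current = name
    while current is not None and current not in seen:
        if current.startswith("bfo:"):
            return True
        seen.add(current)
        current = classes.get(current, (None, False))[0]
    return False


def bfo_grounded(classes: dict) -> float:
    """Fraction of classes whose parent chain reaches a BFO IRI."""
    return sum(reaches_bfo(name, classes) for name in classes) / len(classes)


def definitional_completeness(classes: dict) -> float:
    """Fraction of classes carrying an EquivalentTo definition."""
    return sum(defined for _, defined in classes.values()) / len(classes)
```

On this toy table three of four classes reach BFO and two of four are defined, so the two metrics come out at 0.75 and 0.5.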

4.2 The OQuaRE publish gate

OQuaRE (Duque-Ramos et al. 2011) adapts ISO/IEC 25000 (SQuaRE) to ontologies: each metric is normalized to [1,5] against fixed, IOF-anchored bands, aggregated into six characteristics (Structural, FunctionalAdequacy, Reliability, Operability, Maintainability, Transferability) and one holistic score. The gate is GREEN only when all three hold: oquare_aggregate ≥ 3.5, functional_adequacy ≥ 3.0, and hermit_consistent == true. The FunctionalAdequacy ≥ 3.0 floor is deliberate — it forces definitional rigor and BFO discipline, not structural/grounding gains alone. The gate is wired HARD into aegir.lineup.sync._gate(): a sync --push of the ontology Data Product is refused below GREEN, so a regression cannot publish. The AIM is 3.9, the published OQuaRE class of Brick (3.93) / RealEstateCore (3.91).
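The three-way conjunction can be written as a small predicate. This is a sketch of the rule as stated above, not the body of aegir.lineup.sync._gate(); the GateInputs type and the reason strings are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateInputs:  # hypothetical container, not the project's type
    oquare_aggregate: float
    functional_adequacy: float
    hermit_consistent: bool


def gate_status(m: GateInputs) -> tuple[bool, list[str]]:
    """Return (is_green, reasons); GREEN requires all three conditions."""
    reasons = []
    if m.oquare_aggregate < 3.5:
        reasons.append("oquare_aggregate below 3.5")
    if m.functional_adequacy < 3.0:
        reasons.append("functional_adequacy below 3.0")
    if not m.hermit_consistent:
        reasons.append("HermiT reports inconsistency")
    return (not reasons, reasons)
```

Returning the failed conditions alongside the boolean mirrors the document's preference for gates that explain a refusal.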

4.3 Current state

Verified against the realized artifact (corpora/ontology/sdg-ontology.owl + HERMIT_CERTIFICATE.md):

| metric | value | target |
|---|---|---|
| definitional_completeness | 0.554 | IOF ≈ 0.55 |
| bfo_grounded | 0.896 | 1.0 |
| realizable_machinery | 10 | IOF ≥ 14 |
| def_annotation_coverage | 0.946 | 1.0 |
| unsatisfiable classes | 0 | 0 (hard) |
| OQuaRE FunctionalAdequacy | 4.55 | ≥ 3.0 (floor) |
| OQuaRE aggregate | 4.24 (GREEN) | ≥ 3.5 (floor), AIM 3.9 |

Both pre-registered objectives are essentially met: OQ-Structure (bfo_grounded ≥ 0.95, def_annotation_coverage ≥ 0.90, ar > 0, oquare_aggregate ≥ 3.5) and OQ-Rigor (definitional_completeness ≥ 0.45, realizable_machinery > 0). Of the thresholds that can be checked against the table above, only bfo_grounded (0.896 against 0.95) falls short. See EVIDENCE.md for the full ledger history.

5. The disposal membranes

Proposed axioms pass through three membranes in order; each returns a reason, so a failure is a repair instruction, not a dead end (this is the agent-mediated feedback loop — a human author reads the same reasons). Full detail in Authors Guide § 5.

  1. Parse membrane (evolve_rigor.validate_detailed) — renders the axiom standalone and parses it under OWLAPI. Rejects malformed Manchester (uppercase prefixes, bare properties, undeclared entities, # comments).
  2. Reasoning-authority membrane (build_realized_ontology.consistency_check) — imports CCO and runs HermiT, so grounding is validated against CCO’s disjointness axioms. A class grounded to a CCO-disjoint or BFO-incompatible genus is unsatisfiable and rejected. Un-fakeable.
  3. OntoClean meta-property membrane (src/aegir/ontology/ontoclean.py) — assigns Rigidity / Identity / Unity / Dependence and enforces that an anti-rigid (role) property cannot subsume a rigid (kind) one. Surfaces as ontoclean_violations. Also un-fakeable — reasoner-invisible yet checkable.
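The ordered, reason-returning shape of the three membranes can be sketched as a short chain. The checks below are stand-ins, not the project's functions: the real parse membrane uses OWLAPI and the real reasoning membrane runs HermiT, whereas these stubs only illustrate the control flow. All names in the block are hypothetical.

```python
from typing import Callable, NamedTuple


class Verdict(NamedTuple):
    admitted: bool
    membrane: str
    reason: str


Check = Callable[[str], tuple[bool, str]]


def parse_stub(axiom: str) -> tuple[bool, str]:
    # Stand-in for the parse membrane: only the '#' comment rule is modelled.
    if "#" in axiom:
        return False, "remove '#' comments; they are not valid Manchester syntax"
    return True, ""


def reasoner_stub(axiom: str) -> tuple[bool, str]:
    # Placeholder: the real membrane imports CCO and runs HermiT.
    return True, ""


def ontoclean_stub(axiom: str) -> tuple[bool, str]:
    # Placeholder: the real membrane checks rigidity meta-properties.
    return True, ""


MEMBRANES: list[tuple[str, Check]] = [
    ("parse", parse_stub),
    ("reasoning-authority", reasoner_stub),
    ("ontoclean", ontoclean_stub),
]


def run_membranes(axiom: str, membranes: list[tuple[str, Check]]) -> Verdict:
    """Apply membranes in order; stop at the first failure and report its reason."""
    for name, check in membranes:
        ok, reason = check(axiom)
        if not ok:
            return Verdict(False, name, reason)
    return Verdict(True, "all", "admitted")
```

Stopping at the first failing membrane keeps each rejection tied to a single repair instruction.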

Grounding-anchor retrieval (scripts/grounding_anchors.py) lets the agent ground proposals to real genera: the index spans CCO (1431 BFO-aligned classes), FHIR R5 (210 record types bridged to cco:InformationContentEntity), and our own grounded classes (it accretes — each class grounded becomes a reusable anchor).

6. The closed-loop synthetic-data pipeline

The realized ontology drives a closed loop that converts organic input corpora into a verifier-scored synthetic training corpus and a relational DDL spine. Input corpora (FinePDFs and others) are domain-filtered and used to derive intermediate classes; the ontology’s classes are verbalized (DeepOnto parse-tree recomposition into slot-faithful procedural frames); verbalizations seed LLM generative chapter text and RI-true relational tables materialized from the ontology’s slot structure; chapters are checked by a 4-scorer verification loop (scripts/verify_chapters.py) and the Semantic-Layer-Upkeep gate (verbalization diversity, value semantics, column-name de-canning). The relational DDL spine (src/aegir/ontology/ddl.py, realize.py) projects ontology → SQL tables/views/FKs and lands in the Atlas-on-AGE provenance graph as a relational Data Product.

Figure 1 — The Aegir closed-loop pipeline. FinePDFs content derives intermediate classes; the disposal membranes (and the OQuaRE publish gate) admit only rigorous additions and return their reasons; verbalized classes seed LLM chapter text and RI-true relational tables; the output corpus becomes a byte-level pretraining slice. The dashed gray arrow indicates the downstream pretraining application (continued-pretraining augmentation on RWKV World v3 — Path A).

7. Repository state and reproducibility

The realized artifact and its certificate are versioned in the corpora submodule (zndx/sdg-corpora); the metrics are reproducible from the committed .owl with the JVM-free metrology:

uv run --no-sync python scripts/ontology_metrology.py corpora/ontology/sdg-ontology.owl --json
uv run --no-sync python scripts/ontology_oquare.py corpora/ontology/sdg-ontology.owl \
    --certificate corpora/ontology/HERMIT_CERTIFICATE.md --json

Re-deriving / re-realizing (the JVM membranes) needs the LD_LIBRARY_PATH bootstrap (DeepOnto/HermiT, see the project CLAUDE.md ontology notes):

LD_LIBRARY_PATH=$(pwd)/build/jvm-libs uv run --no-sync python scripts/build_realized_ontology.py --strict-grounding
just check-ontology-schema      # TTL parses, labels/definitions present, BFO ancestry, SPARQL totality

Consistency is independently re-verifiable: load sdg-ontology.omn (or .owl) in any OWL reasoner (Protégé/HermiT, ROBOT, owlready2) and check consistency against the certificate (isConsistent: true, 0 unsatisfiable). The Aegir repository also contains the metrology and OQuaRE gate, the OntoClean classifier, the grounding-anchor retriever, the content-first derivation pipeline, the corpus generation + verification + DDL-spine tooling, and the lineup KB that surfaces the rigor metrics. External datasets (SchemaPile, FinePDFs, SOTAB v2, GitTables, FineWeb-Edu) are obtained via documented download scripts with stable public distributions.

8. Limitations and threats to validity

  • Grounding is strong but not complete. bfo_grounded is 0.896 and realizable_machinery is 10 (IOF aim 14); a residual fraction of classes still ground shallowly (a bare BFO category where a real CCO genus would be better). These are the active levers, not closed problems.
  • Definitional rigor is at the IOF band, not beyond it. definitional_completeness (0.554) sits at the IOF ≈ 0.55 frontier; the AIM is the 3.9 OQuaRE class plus IOF discipline beyond the current band. Raising it means defining more of the referenced intermediate classes, not just the heads.
  • The corpus’s relational claim is not yet demonstrated at scale. The ontology is load-bearing for slot-type prediction (CPA) but trades raw FinePDFs distribution alignment; whether the ontology-grounded mix lifts relational + Data-Element-elucidation skill over a no-ontology ablation at RWKV-7-matched scale is the pre-registered M2 / M3 gate (EVIDENCE.md), UNTESTED. Do not assume a downstream benchmark target without checking the current plan.
  • Content origin vs. expert authorship. SDG is derived from FinePDFs, not expert-authored like the IOF. The distinctive position is generative + content-grounded + IOF-rigorous; the rigor is the IOF-discipline layer measured on a content-grounded ontology, and the membranes (not assertions) are what make that measurable claim hold.

References

Ontology engineering & quality.

  • Duque-Ramos, A., et al. (2011). OQuaRE: A SQuaRE-based approach for evaluating the quality of ontologies.
  • Smith, B., et al. (2019). Industrial Ontologies Foundry (IOF) / BFO signature.
  • Guarino, N., Welty, C. An overview of OntoClean.
  • Arp, R., Smith, B., Spear, A. (2015). Building Ontologies with Basic Formal Ontology. MIT Press.
  • Common Core Ontologies (CCO). github.com/CommonCoreOntology/CommonCoreOntologies. (CC0)
  • HL7 FHIR R5.
  • He, Y., Chen, J., Antonyrajah, D., Horrocks, I. (2023). DeepOnto: A Python package for ontology engineering with deep learning.

Adjacent LLM ontology / verbalization / structured-RL work.

  • Lo, A., Jiang, A. Q., Li, W., Jamnik, M. (2024). OLLM: Generating ontologies from texts. NeurIPS 2024.
  • Liu, et al. (2025). OntoTune. WWW 2025.
  • Zaitoun, A., Sagi, T., Peleg, M. (2024). LLM-assisted verbalization of OWL axioms. AAAI Symposium Series 2024.
  • Yang, Z., Chen, J., He, Y., Gao, F., Horrocks, I. (2024). OnT — Language Models as Ontology Encoders. arXiv:2507.14334.
  • Agarwal, O., Ge, H., Shakeri, S., Aharoni, R. (2021). Knowledge graph based synthetic corpus generation (KELM / TEKGEN). NAACL 2021.
  • Tsang, et al. (2026). AutoGraph-R1. arXiv:2510.15339.
  • Yuan, et al. (2026). K2V — Knowledge-to-Verification.

Internal references.

  • Aegir Authors Guide — the canonical, code-exact reference for every metric, gate, and membrane.
  • Aegir Charter — outward contract, provenance discipline, committed branch structure and external anchors.
  • Aegir Migration — vocabulary authorship history.
  • Aegir Concept brief / RLVR — the research design for the long-horizon Signals M4 apparatus (the RLVR generator whose reward is now realized as the membrane stack).

RLVR for ontology generation

This chapter is the externally-readable description of the project’s reinforcement-learning-with-verifiable-reward (RLVR) program — paper 1 of the two-paper structure documented in the concept brief. The semantic-engine authoritative reference is the canonical empirical surface; this chapter is the methodological counterpart, accessible without the concept brief’s research-design overhead.

The chapter is organized in five parts: the verifiable-reward setting and why it fits OWL ontology generation; the four-component deterministic verifier R(O, I); the GRPO training program that targets it; how the verifier generalizes as the system scales beyond a single policy; and the paper-2 application that this work enables downstream.

The verifiable-reward setting

Reinforcement learning with a verifiable reward (RLVR) is the training discipline in which a policy is updated against a reward that can be computed deterministically from the policy’s output, without an LLM judge in the loop. The verifier is a function — not a model — and its output is hash-stable: identical inputs produce bit-identical reward values across re-runs. This shape has been demonstrated on mathematics [DeepSeekMath; Shao et al. 2024; DeepSeek-R1; DeepSeek-AI 2025] and on code execution [CodeRL, Reasoning-SQL]; the present project applies it to OWL ontology composition.

OWL is a particularly natural target for RLVR. The artifact is graph-structured (classes, properties, axioms with restrictions and equivalentClass intersections), and a sound-and-complete description-logic reasoner can verify both structural properties (does the artifact load? are slot fills well-typed?) and semantic properties (does the artifact entail what its templates claim it entails?). The verifier R(O, I) combines those checks with a corpus-alignment component that measures whether the ontology’s verbalizations are on-topic for a target corpus I.

The combination — graph-structured output with a semantically grounded deterministic verifier targeting an LLM policy under RLVR — is the contribution claim. Adjacent prior art either emits flat (head, rel, tail) triples without a reasoner-based verifier [AutoGraph-R1; Tsang et al. 2026], emits QA reasoning traces rather than OWL [K2V; Yuan et al. 2026], or uses plain SFT against a custom regulariser [OLLM; Lo et al. 2024]. The concept brief documents each differentiation in detail.

The verifier R(O, I)

Let O = compose(C, σ) denote an OWL ontology composition produced from the procedural catalog C with slot-fill σ. Let I denote a fixed input text corpus. The verifier R: OntologyComposition × Corpus → [0, 1] is defined as

R(O, I) = R_A(O) · ( a · R_B(O) + b · R_C(O) + c · R_D(O, I) )

with four components. R_A is a hard structural gate: it returns 1 iff every (template, σᵢ) pair type-checks against the catalog’s typed slot DSL, and returns 0 otherwise. Because every catalog template was DeepOnto-validated offline at catalog construction time [He et al. 2023], R_A = 1 implies the materialized OWL also passes DeepOnto loadability at runtime. R_B measures complex-class density relative to a structural-shuffle null distribution. R_C is a coarse semantic-richness proxy via cached verbalization length. R_D is the Hungarian-optimal cosine alignment between the composition’s verbalizations and a BERTopic topic model fit to I [Grootendorst 2022], normalized against the same structural-shuffle null.
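The matching step inside R_D can be illustrated without external libraries. This sketch swaps the Hungarian algorithm for a brute-force search over assignments, which finds the same optimum but is only practical for very small inputs. It also assumes that verbalizations and topics are already embedded as plain vectors, and it omits the structural-shuffle null normalization. The function names are hypothetical.

```python
import math
from itertools import permutations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def optimal_alignment(verbalizations: list[list[float]],
                      topics: list[list[float]]) -> float:
    """Mean cosine under the best one-to-one matching of verbalizations to topics."""
    if not verbalizations:
        return 0.0
    if len(verbalizations) > len(topics):
        raise ValueError("sketch assumes at least as many topics as verbalizations")
    sim = [[cosine(v, t) for t in topics] for v in verbalizations]
    # Brute force: try every injective assignment of verbalization i to topic j.
    best = max(
        sum(sim[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(topics)), len(verbalizations))
    )
    return best / len(verbalizations)
```

At realistic sizes a polynomial-time assignment solver would replace the permutation search; the objective stays the same.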

The aggregation weights (a, b, c) = (0.50, 0.05, 0.45) were locked by a sweep over the unit simplex against a 30-ontology hand-authored discrimination test set (15 known-good, 15 known-bad). With those weights, the verifier achieves AUC 0.9956 and mean R-separation 0.336 on the test set. A held-out evaluation set of 50 scenarios (25 good + 25 bad), authored before any policy-side RL work began, gives separation 0.5129 against the locked verifier, which keeps it leakage-free with respect to any policy that subsequently trains against R. The full empirical surface is documented in the semantic-engine authoritative reference.
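The aggregation itself is a gated weighted sum. The sketch below plugs the locked weights into the formula for R(O, I); it assumes each of R_B, R_C, and R_D has already been computed and lies in [0, 1], and the function name is hypothetical.

```python
# Locked weights from the text, in the order (a, b, c).
WEIGHTS = (0.50, 0.05, 0.45)


def verifier_score(r_a: int, r_b: float, r_c: float, r_d: float,
                   weights: tuple[float, float, float] = WEIGHTS) -> float:
    """R = R_A * (a*R_B + b*R_C + c*R_D); R_A is the 0/1 structural gate."""
    a, b, c = weights
    return r_a * (a * r_b + b * r_c + c * r_d)
```

Because the weights sum to 1 and R_A is binary, the score stays in [0, 1] whenever the three graded components do, and any structural failure zeroes the reward regardless of the other terms.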

The verifier is deterministic, hash-stable, and has no JVM or Java dependencies in its runtime hot path. DeepOnto is invoked only at catalog construction time; the runtime verifier reads pre-cached verbal templates from the JSON catalog. Per-sample scoring on CPU takes about 0.02 s once the encoder and topic model are loaded, which means the verifier is not the rate-limiting step in any practical RL training loop.

GRPO at the weight level (paper 1)

The current operational training program is a GRPO-trained policy on Qwen3.5-9B-Base with a LoRA adapter on attention and MLP projections. The corresponding SAE-Res-Qwen3.5-9B-Base residual-stream adapter is held untouched so that the interpretability claim about the trained policy’s representations survives weight updates [sparse-autoencoder feature decomposition; Cunningham et al. 2024]. The training pipeline includes:

  1. Constrained-decode JSON Schema enforcement via lm-format-enforcer, wired through a wrap on model.generate that survives TRL’s unwrap_model indirection. Without this wrap the schema is built but never reached by the generation path, and the policy emits free-form text that R_A clamps to zero — a failure mode that produced a 1690-step zero-reward run before the wrap was added.
  2. Rejection-sampling SFT warm-start. The Base model emits zero-reward outputs cold; the warm-start samples compositions under constrained decoding across rotated few-shot variations, scores them with the verifier, retains the R ≥ 0.3 subset, and supervised-fine-tunes the Base on that retained corpus before GRPO begins.
  3. Per-iteration verifier scoring with the locked R(O, I) and group-relative advantage estimation.
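Steps 2 and 3 can be sketched with the standard library. The advantage function follows the usual GRPO form (reward minus group mean, divided by group standard deviation); the eps stabilizer and both function names are assumptions, and the project's actual training code may differ.

```python
from statistics import mean, pstdev


def rejection_filter(samples: list[tuple[str, float]],
                     threshold: float = 0.3) -> list[tuple[str, float]]:
    """Keep (composition, reward) pairs whose verifier reward meets the threshold."""
    return [(text, r) for text, r in samples if r >= threshold]


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward against the mean and spread of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every sample in a group scores zero, all advantages are zero and the update carries no signal, which is why a policy whose outputs are all clamped by R_A cannot make progress.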

Paper 1’s two subordinate claims are C1 — discrimination (the locked verifier discriminates known-good from known-bad ontologies on the test set; established at AUC 0.9956) and C2 — optimizability (GRPO training of the policy against R produces compositions whose R-distribution exceeds prompt-evolved and human-authored baselines on the held-out 50; under empirical test). The choice of warm-start procedure (Option A: rejection-sampling SFT — currently running; Option B: Instruct-paired model + Instruct-paired SAE adapter; Option C: Self-Distillation Fine-Tuning [SDFT; Shenfeld et al. 2026]) is explicitly under revision; the authoritative reference names what the in-flight run will and will not settle.

Generalizing the verifier: scaling beyond a single policy

As the system scales to larger metadata landscapes — GitTables (1M+ tables), the WikiTables corpus, and streaming sources such as Flink SQL and Spark SQL where schemas appear continuously rather than as static batches — a single-policy weight-trained approach faces two pressures. First, the breadth of in-scope concepts expands past what a single LoRA-fine-tuned policy of fixed capacity can absorb without forgetting. Second, online adaptation to newly-arrived schemas in a streaming context calls for an optimization loop that can react faster than a full GRPO retrain.

The verifier R(O, I) is the durable asset across this transition. Whatever the optimization layer (policy weights, prompts, agent configurations, multi-agent routing), the same deterministic verifier supplies the reward signal. Two recent frameworks make the optimization layers above the weight surface explicit and externalize them as engineering substrate the project can adopt as operational pressure makes it useful:

  • GEPA — Reflective Prompt Evolution [Agrawal et al. 2025; arXiv:2507.19457; ICLR 2026 Oral]. GEPA is a genetic-Pareto prompt optimizer: given an LLM-based system, it samples trajectories, reflects on them in natural language to diagnose failure modes, proposes prompt updates, and combines complementary lessons along the Pareto frontier of its own attempts. The paper reports that reflective prompt evolution can outperform reinforcement learning on certain agentic tasks under a fixed budget. For this project, the relevance is direct: GEPA’s outer loop targets a programmatic fitness signal, and R(O, I) is one. Substituting R(O, I) in place of GEPA’s example fitness gives prompt-level optimization of an ontology-emitting LLM system without weight updates. A reference implementation lives in DSPy as dspy.GEPA.
  • Agent Lightning — RL for agent systems [Microsoft Research 2025; arXiv:2508.03680]. Agent Lightning is a framework that adds RL-based training to agents built on LangChain, Microsoft AutoGen, the OpenAI Agents SDK, or arbitrary custom code, with effectively zero code modification to the agent itself. The framework formalizes agent execution as a Markov decision process, defines a unified data interface, and introduces a hierarchical RL algorithm (LightningRL) with explicit credit assignment so any agent’s trajectories can be decomposed into training transitions. For this project, the relevance is that R(O, I) can act as the RL reward for any agent built on this substrate — including multi-agent configurations where one agent retrieves context (sampled passages from a target corpus I), another proposes catalog compositions, and a third refines slot fills. The agent swarm scaffolding in the codebase is the project-side complement to this generalization.

The unifying point is methodological: paper 1 sets out to establish that R(O, I) discriminates ontology quality (C1, established) and is optimizable via weight-level GRPO (C2, under test). Once both are established, the same verifier becomes the reusable substrate for prompt-level optimization (GEPA) and agent-level RL (Agent Lightning) as the engineering pressure from streaming sources and 10⁶-table metadata landscapes makes those optimization layers necessary. The hardest gate to cross is verifier validity; the rest is engineering on top of a stable substrate.

No project work has been committed to GEPA or Agent Lightning integration yet — that work follows paper 1 and the operational scale-up to GitTables and streaming Flink / Spark SQL tagging. The roadmap names that scale-up as the forcing function.

Paper 2 — ontology-grounded byte-level pretraining

Paper 2’s claim is downstream of paper 1: verbalizations from R-passing ontologies (produced by paper 1’s policy) measurably improve byte-level pretraining of Aegir’s hierarchical sequence model on stratified held-out evaluation. Paper 2 is contingent on paper 1’s policy producing verifier-passing compositions at corpus scale; its methodology will be refined after paper 1’s results constrain it. The pretraining track that paper 2 augments is documented in Pretraining.

What this chapter does not claim

  • Paper 1’s C2 (optimizability) is not yet established. C1 — the verifier discriminates ontology quality on the test set — is established at AUC 0.9956. C2 is the experimental claim currently under test; the in-flight GRPO run is the first empirical attempt.
  • The choice among warm-start procedures (Options A, B, C) is not settled. The current run tests Option A; a comparison against Option B is the next-priority experimental step. Any claim built on the specific warm-start procedure being correct is unsupported until at least one direct comparison is run.
  • GEPA and Agent Lightning are not yet integrated. The scaling argument in the section above identifies them as the methodological layer the project will reach for when single-policy training proves limiting; it is not a description of current work.
  • Paper 2’s lift is not yet measured. Whether R-passing ontology verbalizations measurably improve downstream pretraining utility is a separate experimental question with its own evaluation surface.

References

RLVR core method.

  • Shao, Z., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. (Origin of the GRPO algorithm.)
  • DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning.

Optimization layers above policy weights.

  • Agrawal, L. A., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457; ICLR 2026 Oral. Reference implementation: dspy.GEPA.
  • Microsoft Research. (2025). Agent Lightning: Train ANY AI Agents with Reinforcement Learning. arXiv:2508.03680.
  • Shenfeld, I., Damani, M., Hübotter, J., Agrawal, P. (2026). Self-Distillation Enables Continual Learning. arXiv:2601.19897. (On-policy self-distillation as a candidate warm-start.)

Adjacent OWL / KG / verbalization work.

  • Lo, A., Jiang, A. Q., Li, W., Jamnik, M. (2024). OLLM: Generating ontologies from texts. NeurIPS 2024.
  • Tsang, et al. (2026). AutoGraph-R1. arXiv:2510.15339, ICLR 2026 submission.
  • Yuan, et al. (2026). K2V — Knowledge-to-Verification. ICLR 2026 submission.
  • He, Y., Chen, J., Antonyrajah, D., Horrocks, I. (2023). DeepOnto: A Python package for ontology engineering with deep learning.
  • Liu, et al. (2025). OntoTune. WWW 2025.

Topic modeling and interpretability.

  • Grootendorst, M. (2022). BERTopic: Neural topic modeling with a class-based TF-IDF procedure.
  • Cunningham, H., Ewart, A., Riggs, L., Huben, R., Sharkey, L. (2024). Sparse Autoencoders Find Highly Interpretable Features in Language Models. ICLR 2024.

Internal references.

  • Concept brief — full research design and literature review.
  • Semantic-engine authoritative reference — the empirical surface this chapter’s claims rest on.
  • Charter — the outward contract that the SDG ontology serves.
  • Roadmap — the two-paper milestone structure this chapter sits within.
  • Agent swarm — the project-side scaffolding for the multi-agent generalization sketched in the scaling section.

Skills Library & Closed Generate→Re-ground→Refine Engine — Specification

Status: v0.1 — accepted; DOF 2 & 4 resolved · DOF 1 & 3 deferred (see §3) · Date: 2026-06-05

0. The spine: FinePDFs as immutable fixed point, and the re-grounding invariant

The pipeline is a closed, verifiable loop with FinePDFs as the fixed-point ground at both ends:

FinePDFs ──induce (coverage audit / BERTopic)──► ontology
   ▲                                          + skills library
   │                                                 │
   │                                                 ▼
   │                                    generated corpus (prose ⨉ relational data)
   └──────── re-grounds verifiably (BERTopic topic-recovery) ◄────┘

Re-grounding invariant (the contract every skill and build must satisfy). Every CorpusUnit carries provenance to (a) ≥1 ontology element (the what, induced from FinePDFs) and (b) ≥1 FinePDFs SourceSpan (the ground), such that:

  • Fidelity — the unit’s topic distribution re-grounds to its seeding FinePDFs topics under the cached BERTopic model: topic_recovery ≥ τ_topic.
  • Structure — any relational data type-checks against the ontology slot-types: r_axiom ≥ τ_axiom.
  • No drift — every asserted claim resolves to a SourceSpan or ontology axiom: claim_grounding ≥ τ_ground.

A skill or corpus build is admissible only if it preserves this invariant. This is what converts validation from an external step into an intrinsic consistency condition.
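The invariant reduces to a conjunction of provenance and threshold checks. In this sketch the three τ values are left as parameters because the text does not fix them; the Thresholds and UnitScores types are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:  # tau values are not specified in the text; supply your own
    tau_topic: float
    tau_axiom: float
    tau_ground: float


@dataclass(frozen=True)
class UnitScores:  # hypothetical summary of one CorpusUnit
    topic_recovery: float
    r_axiom: float
    claim_grounding: float
    n_ontology_refs: int
    n_source_spans: int


def violations(s: UnitScores, t: Thresholds) -> list[str]:
    """List every clause of the re-grounding invariant that the unit breaks."""
    out = []
    if s.n_ontology_refs < 1:
        out.append("no ontology element in provenance")
    if s.n_source_spans < 1:
        out.append("no FinePDFs SourceSpan in provenance")
    if s.topic_recovery < t.tau_topic:
        out.append("fidelity: topic_recovery below tau_topic")
    if s.r_axiom < t.tau_axiom:
        out.append("structure: r_axiom below tau_axiom")
    if s.claim_grounding < t.tau_ground:
        out.append("no drift: claim_grounding below tau_ground")
    return out


def is_admissible(s: UnitScores, t: Thresholds) -> bool:
    return not violations(s, t)
```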

Core schemas (referenced by all skill signatures)

| Type | Fields |
|---|---|
| TopicVec | FinePDFs BERTopic distribution (cached model; the re-grounding coordinate) |
| SourceSpan | {doc_id, char_range, text, topic_vec}: a FinePDFs evidence span |
| OntologyRef | {template_id \| class_iri, slot_types, bfo_anchor, verbal_template} (from the catalog) |
| Claim | {text, grounded_to: SourceSpan \| Axiom \| null, status} |
| Provenance | {ontology_refs[], source_span_ids[], skill_id@ver, target_topic_vec, claims[]} |
| CorpusUnit | {kind: prose\|table\|diagram\|example, content_md, schema?, provenance} where schema (tables) = {columns:[{name, slot_type}], fk_edges:[(col→col)]} |

schema.columns[].slot_type is the deterministic CTA/CPA/DED label — ground truth for the downstream model, type-checked against OntologyRef.slot_types.
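One way to read the core schemas is as plain dataclasses. This is a sketch, not the project's types: field types such as tuples of strings, a string identifier for grounded_to, and the Column and TableSchema helper names are assumptions where the table leaves the representation open.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class SourceSpan:
    doc_id: str
    char_range: tuple[int, int]
    text: str
    topic_vec: tuple[float, ...]


@dataclass(frozen=True)
class OntologyRef:
    ref_id: str  # template_id or class_iri
    slot_types: tuple[str, ...]
    bfo_anchor: str
    verbal_template: str


@dataclass(frozen=True)
class Claim:
    text: str
    grounded_to: Optional[str]  # assumed: id of a SourceSpan or axiom; None if ungrounded
    status: str


@dataclass(frozen=True)
class Provenance:
    ontology_refs: tuple[OntologyRef, ...]
    source_span_ids: tuple[str, ...]
    skill_id: str  # "skill_id@ver"
    target_topic_vec: tuple[float, ...]
    claims: tuple[Claim, ...] = ()


@dataclass(frozen=True)
class Column:
    name: str
    slot_type: str  # the CTA/CPA/DED label


@dataclass(frozen=True)
class TableSchema:
    columns: tuple[Column, ...]
    fk_edges: tuple[tuple[str, str], ...] = ()


@dataclass(frozen=True)
class CorpusUnit:
    kind: Literal["prose", "table", "diagram", "example"]
    content_md: str
    provenance: Provenance
    schema: Optional[TableSchema] = None  # present for tables only
```

Freezing the dataclasses keeps provenance immutable once a unit has been emitted, which fits the hash-stable style used elsewhere in the project.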


1. Skills Library Definition

A skill is a typed, versioned generative competency: skill@semver(grounded inputs) → CorpusUnit(s) + provenance, contractually preserving the re-grounding invariant. The library replaces the template catalog as the generation driver: the ontology supplies the what, skills supply the how, FinePDFs verifies that it held.

1.1 First-class skills

S1 · verbalize-axiom

  • (a) Signature: verbalize_axiom(ref: OntologyRef, fillers, evidence: SourceSpan[]) → CorpusUnit(prose)
  • (b) I/O schema: in = one axiom (ref.verbal_template + slot fillers) + evidence spans whose topic_vec defines the target; out = prose unit; provenance.claims each grounded to a span/axiom.
  • (c) Strategy: seed with the DeepOnto verbal_template as a faithful scaffold; LLM elaborates into prose constrained to assert only what the evidence spans support (every sentence → a Claim with grounded_to). Prompt skeleton: “Using only the facts in {evidence}, explain {verbalized axiom}. Cite each claim to a source. Match the register of {evidence}.”

S2 · synthesize-relational-table-with-cross-FKs (the column-annotation-bearing core)

  • (a) Signature: synth_relational_table(refs: OntologyRef[], fillers, evidence: SourceSpan[]) → CorpusUnit(table)
  • (b) I/O schema: out.content_md = markdown table(s); out.schema = {columns:[{name, slot_type}], fk_edges} — the CTA/CPA/DED labels. Cell values drawn from / consistent with evidence.
  • (c) Strategy: map ontology entities→tables, slots→columns (slot_type = label), object-properties→FKs; populate cells from evidence values. Two hard post-conditions: type-check (headers vs slot_types → r_axiom) and value re-grounding (cell distribution re-grounds to evidence.topic_vec). Prompt skeleton: “Construct relational tables instantiating {refs} with realistic values grounded in {evidence}; emit columns with their ontology slot-types and explicit foreign keys.”

S3 · interleave-diagram

  • (a) Signature: interleave_diagram(refs: OntologyRef[], local_structure) → CorpusUnit(diagram)
  • (b) I/O schema: out = mermaid/d2 of ER / taxonomy / dataflow over the same refs; must be cross-consistent with any S2 table (edges ↔ FKs) and S1 prose.
  • (c) Strategy: render the ontology subgraph (BFO anchors + object-property edges); constrain node/edge set to the refs already used in the unit so the diagram cannot introduce ungrounded entities.

S4 · worked-example (the generality driver — Thesis 2)

  • (a) Signature: worked_example(ref: OntologyRef, evidence: SourceSpan[]) → CorpusUnit(prose+data)
  • (b) I/O schema: a concrete instantiated case — a populated record, a query+result, or a short reasoning trace — over real values from evidence.
  • (c) Strategy: instantiate the abstract axiom on concrete grounded data and show the reasoning; this is where transferable skill (not template recall) is taught. Prompt skeleton: “Walk through a concrete instance of {ref} using {evidence}; show each inference step.”

S5 · ground-claim-to-source (the fidelity enforcer; runs inline + as a pass)

  • (a) Signature: ground_claim(claim: Claim, evidence: SourceSpan[]) → GroundedClaim | REJECT
  • (b) I/O schema: attaches grounded_to (span/axiom) or rejects the claim; aggregate → claim_grounding rate for the unit.
  • (c) Strategy: retrieval + entailment check of each claim against evidence/axioms; unsupported claims are dropped or trigger regeneration. This is the skill that makes the corpus verifiable rather than asserted.

S6 · cross-reference (link-concepts)

  • (a) Signature: cross_reference(ref: OntologyRef, fc: FamilyComplex) → CorpusUnit(prose links)
  • (b) I/O schema: prose connecting the unit’s concept to allowable neighbors; the cited family-set must be an allowed simplex (fc.is_allowed, else fc.best_face).
  • (c) Strategy: use the family complex to weave only co-coherent multi-family links (never a measured puncture), producing the relational richness without incoherent cross-family claims.

S7 · topic-anchor (re-grounding conditioner)

  • (a) Signature: topic_anchor(unit: CorpusUnit, target: TopicVec, evidence: SourceSpan[]) → CorpusUnit'
  • (b) I/O schema: rewrites/conditions the unit so its embedded topic distribution moves toward target (the seeding FinePDFs topics).
  • (c) Strategy: the skill that directly closes the loop on topic_recovery — style/lexis/emphasis transfer toward the FinePDFs target without altering grounded claims or schema. Applied last, re-verified by the engine.

(Candidate extensions: define-term, summarize-section, counterexample, pose-and-answer — each admitted only via §1.3.)

1.2 Input/output grounding summary

Every skill takes OntologyRef (+ FamilyComplex where relevant) and FinePDFs SourceSpan[], and emits CorpusUnit with full Provenance. No skill may emit a Claim without a grounded_to, nor a table column without a slot_type. This is the static guarantee behind the re-grounding invariant.
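
The static guarantee can be pictured as a check over a unit's claims and columns. The sketch below is illustrative only: the field names grounded_to and slot_type come from the text, while the dataclasses and the violations helper are simplified, hypothetical stand-ins for the real CorpusUnit schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Claim:
    text: str
    grounded_to: Optional[str] = None  # span or axiom id; None means ungrounded


@dataclass
class Column:
    name: str
    slot_type: Optional[str] = None  # ontology slot-type label


@dataclass
class CorpusUnit:  # simplified stand-in, not the project's actual class
    claims: List[Claim] = field(default_factory=list)
    columns: List[Column] = field(default_factory=list)


def violations(unit: CorpusUnit) -> List[str]:
    """Return the ways a unit breaks the static guarantee (empty if none)."""
    problems = []
    for claim in unit.claims:
        if not claim.grounded_to:
            problems.append(f"claim without grounded_to: {claim.text!r}")
    for column in unit.columns:
        if not column.slot_type:
            problems.append(f"column without slot_type: {column.name!r}")
    return problems


good = CorpusUnit(
    claims=[Claim("A card has an expiry date.", grounded_to="span:doc1:10-42")],
    columns=[Column("expiry", slot_type="ExpiryDate")],
)
bad = CorpusUnit(
    claims=[Claim("Cards never expire.")],
    columns=[Column("col_3")],
)

print(violations(good))  # no problems reported
print(violations(bad))   # one problem per ungrounded claim / untyped column
```

A check of this shape can run before any scoring, so an ungrounded unit never reaches the verifier.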

1.3 Versioning & extension without breaking the invariant

  • Identity: every skill is skill_id@semver; each CorpusUnit.provenance records the exact versions used → reproducible, traceable builds.
  • Pinned skill-set per build: a corpus build pins a skill-set manifest ({skill_id@ver}), so any corpus is reproducible and its fidelity is attributable.
  • Skill admission test (the gate): a new/changed skill version joins the library only if, on a frozen calibration sample of FinePDFs seeds, the units it produces satisfy topic_recovery ≥ τ_topic ∧ r_axiom ≥ τ_axiom ∧ claim_grounding ≥ τ_ground. The invariant is thus enforced at admission, not hoped for at runtime. Append/version-only; deprecations are explicit.
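
A minimal sketch of the admission gate follows. It assumes the strict reading, in which every unit produced on the calibration sample must clear all three thresholds (an aggregate-mean reading is also possible and is not ruled out by the text). The threshold values reuse the initial targets from §2.3; the function name admit_skill is hypothetical.

```python
# Initial targets from section 2.3, used here as placeholder thresholds.
THRESHOLDS = {"topic_recovery": 0.80, "r_axiom": 0.45, "claim_grounding": 0.95}


def admit_skill(unit_metrics, thresholds=THRESHOLDS):
    """Admit a skill version only if every calibration unit clears every gate."""
    if not unit_metrics:
        return False  # no evidence on the calibration sample means no admission
    return all(
        unit[name] >= floor
        for unit in unit_metrics
        for name, floor in thresholds.items()
    )


passing = [
    {"topic_recovery": 0.84, "r_axiom": 0.51, "claim_grounding": 0.97},
    {"topic_recovery": 0.81, "r_axiom": 0.47, "claim_grounding": 0.96},
]
# One drifting unit is enough to block admission under the strict reading.
failing = passing + [
    {"topic_recovery": 0.62, "r_axiom": 0.70, "claim_grounding": 0.99}
]

print(admit_skill(passing))
print(admit_skill(failing))
```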

2. Closed Generate → Re-ground → Refine Engine

2.1 End-to-end control flow (one generation episode)

  1. Seed from FinePDFs — sample a FinePDFs document from a non-holdout (training) BERTopic cluster (§2.5); take its topic mixture (target: TopicVec) and its SourceSpan[] as the re-grounding anchor.
  2. Select ontology content — via the coverage audit, choose OntologyRef[] matched to target, family-diverse, gated by the family complex (is_allowed / best_face).
  3. Compose skills (generation plan): S1 verbalize-axiom → S2 synth-relational-table → S3 interleave-diagram → S4 worked-example → S6 cross-reference, with S5 ground-claim-to-source as an inline guard on every unit, then S7 topic-anchor toward target.
  4. Assemble the units into a chapter with merged Provenance.
  5. Re-ground (VERIFY) — §2.2.
  6. Score & route — compute composite F; if F ≥ τ_F accept into the corpus, else fire the matching refinement trigger (§2.4) and regenerate.
  7. Update family complex — record the cited simplex as allowable (F ≥ floor) or a puncture (measured-below-floor), per the current build_family_complex role.

2.2 Re-grounding step (the loop-closure verifier)

  • Primary — BERTopic topic-recovery (CANONICAL, DOF 2 resolved): embed the generated chapter, run the cached FinePDFs BERTopic model (approximate_distribution), score per-doc recovery against the seeding target topics → topic_recovery (hit@k / cosine). This is the single re-grounding metric, used identically in-loop and at the downstream generality check; the verifier’s prior R_D (MiniLM→KMeans→Hungarian to T_I.pkl) is retired from the in-loop check and retained for offline diagnostics only — so every iteration is scored by the exact embedding+clustering pipeline used downstream. Its earlier 30% (full arm) vs 80% (no-ontology) is the explicit gap to close — not a trade-off.
  • Auxiliary consistency checks:
    • r_axiom — table headers/columns type-check vs ontology slot_types.
    • claim_grounding — fraction of claims with a valid grounded_to.
    • cross_modal — diagram edges ↔ table FKs ↔ prose claims agree.
    • r_iri — cited templates present in prose.
  • Composite fidelity: F = w_t·topic_recovery + w_a·r_axiom + w_g·claim_grounding + w_x·cross_modal (weights are an open DOF — §3; to be calibrated against the downstream signal, not asserted — per the eval-methodology survey).
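
As a worked example of the composite, the sketch below evaluates F as a plain weighted sum. The equal weights are placeholders chosen for illustration only, since the text defers their calibration, and the score values are invented inputs rather than measurements.

```python
def composite_fidelity(scores, weights):
    """Plain weighted sum F = sum_k w_k * score_k over the four terms."""
    return sum(weights[name] * scores[name] for name in weights)


# Placeholder weights only: calibration of w_t, w_a, w_g, w_x is deferred.
weights = {
    "topic_recovery": 0.25,
    "r_axiom": 0.25,
    "claim_grounding": 0.25,
    "cross_modal": 0.25,
}
# Invented example scores, not measured results.
scores = {
    "topic_recovery": 0.30,
    "r_axiom": 0.60,
    "claim_grounding": 0.90,
    "cross_modal": 0.80,
}

F = composite_fidelity(scores, weights)
print(round(F, 3))
```

With these inputs F comes out at 0.65 even though topic_recovery sits at 0.30, which illustrates why §2.3 gates each metric separately instead of relying on the composite alone.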

2.3 Quantitative success criteria (loop termination)

A build is converged/hardened when, over the FinePDFs topic distribution on a held-out calibration sample, all hold and are stable across K iterations:

| Criterion | Symbol | Initial target |
| --- | --- | --- |
| Re-grounding | topic_recovery ≥ τ_topic | ≥ 0.80 (parity with no-ontology, while keeping structure) |
| Structure retained | r_axiom ≥ τ_axiom | ≥ 0.45 (family-complex floor) |
| No hallucination | claim_grounding ≥ τ_ground | ≥ 0.95 |
| Cross-modal consistency | cross_modal ≥ τ_x | ≥ 0.90 |
| Composite, stable | F ≥ τ_F, Δ over K iters < ε | τ_F, K, ε TBD (§3) |
| Coverage | accepted-fraction across topic bins ≥ ρ | ρ TBD |

The headline objective is topic_recovery ≥ 0.80 with r_axiom ≥ 0.45 simultaneously, i.e., closing the 30%→80% re-grounding gap without surrendering structure. Adopted 2026-06-05 as the loop-termination criterion. The conjunction is hard (no weighted trade-off), since any relaxation re-introduces the drift the loop exists to eliminate.
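
One way to read the termination rule is sketched below. It treats "Δ over K iters < ε" as the range of F over the last K iterations, which is an interpretation rather than something the text fixes, and it omits the coverage criterion. The values of τ_F, K, and ε are placeholders because the text leaves them TBD.

```python
FLOORS = {
    "topic_recovery": 0.80,
    "r_axiom": 0.45,
    "claim_grounding": 0.95,
    "cross_modal": 0.90,
}


def converged(history, floors, tau_F, K, eps):
    """True if the last K iterations all clear every floor and F is stable."""
    if len(history) < K:
        return False
    window = history[-K:]
    # Hard conjunction: every metric must clear its own floor, every iteration.
    gates_hold = all(
        it[name] >= floor for it in window for name, floor in floors.items()
    )
    f_values = [it["F"] for it in window]
    f_ok = min(f_values) >= tau_F
    stable = max(f_values) - min(f_values) < eps  # assumed meaning of "delta"
    return gates_hold and f_ok and stable


history_ok = [
    {"topic_recovery": 0.82, "r_axiom": 0.50, "claim_grounding": 0.96, "cross_modal": 0.92, "F": 0.800},
    {"topic_recovery": 0.83, "r_axiom": 0.49, "claim_grounding": 0.97, "cross_modal": 0.91, "F": 0.805},
    {"topic_recovery": 0.82, "r_axiom": 0.51, "claim_grounding": 0.96, "cross_modal": 0.93, "F": 0.803},
]
# Same stable F, but the last iteration trades topic recovery away.
history_traded = history_ok[:2] + [
    {"topic_recovery": 0.70, "r_axiom": 0.80, "claim_grounding": 0.99, "cross_modal": 0.95, "F": 0.804}
]

print(converged(history_ok, FLOORS, tau_F=0.75, K=3, eps=0.01))
print(converged(history_traded, FLOORS, tau_F=0.75, K=3, eps=0.01))
```

The second history keeps F high and stable yet is rejected, because a strong composite cannot compensate for a metric that falls below its floor.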

2.4 The two operational levers

  • Lever A — Ontology refinement (incl. family complex). Fired when r_axiom/structural or coverage checks fail. Actions: fix/extend templates & slot-types; re-induce from FinePDFs coverage gaps (audit); update the family complex (promote simplices that co-generate ≥ floor; record punctures that fail). Mechanizable later by the GRPO policy optimizing ontology selection against F.
  • Lever B — Corpus hardening. Fired to move from “passes on a sample” to “passes across the full FinePDFs distribution at volume.” Actions: scale & diversify accepted units across topics/families/skill-compositions; dedup & balance; enforce thresholds distribution-wide; confirm stable F. The output is the dataset whose model exhibits generality.

A failing topic_recovery routes to skills (S1/S7 — the how drifted) and/or ontology selection; a failing r_axiom routes to Lever A (ontology); a failing claim_grounding routes to S5 / unit rejection.
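
The routing rule above can be written as a small dispatch function. This is a sketch: the route labels paraphrase the text, the floors reuse the §2.3 initial targets, and the function name route_failures is hypothetical.

```python
FLOORS = {"topic_recovery": 0.80, "r_axiom": 0.45, "claim_grounding": 0.95}


def route_failures(metrics, floors=FLOORS):
    """Map each failing metric to the refinement target named in the text."""
    routes = []
    if metrics["topic_recovery"] < floors["topic_recovery"]:
        routes.append("skills S1/S7 and/or ontology selection")
    if metrics["r_axiom"] < floors["r_axiom"]:
        routes.append("Lever A (ontology refinement)")
    if metrics["claim_grounding"] < floors["claim_grounding"]:
        routes.append("S5 / unit rejection")
    return routes


drifted = {"topic_recovery": 0.30, "r_axiom": 0.60, "claim_grounding": 0.97}
unstructured = {"topic_recovery": 0.85, "r_axiom": 0.20, "claim_grounding": 0.90}
clean = {"topic_recovery": 0.85, "r_axiom": 0.50, "claim_grounding": 0.97}

print(route_failures(drifted))
print(route_failures(unstructured))
print(route_failures(clean))
```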

2.5 Held-out FinePDFs generality check (downstream validation)

  • Holdout (DOF 4 resolved — topic-cluster-level): partition FinePDFs at the BERTopic cluster level (the same model used for re-grounding), before any generation; held-out clusters seed/ground no unit, and generation seeds (§2.1.1) are drawn only from non-holdout/training clusters. This enforces cross-region generality (not merely unseen documents) and supplies both the final generality check and intermediate calibration runs.
  • Train the model (DED + CTA/CPA heads) on the hardened corpus.
  • Evaluate on tasks derived from the held-out slice:
    • Data-element discovery (the end model): cluster/embed columns extracted from held-out FinePDFs-derived relational structures into data elements; measure against held-out ground-truth groupings.
    • CTA/CPA: column → slot-type on held-out-derived tables, with the SOTA-grounded protocol — PR-based mAP/LRAP + precision@k (not ROC-AUC), sample-efficiency curves, bootstrap/permutation CIs, vs random-init and a matched-token non-grounded corpus arm.
    • Generality = transfer to held-out FinePDFs-grounded structure, not recall of trained topics/templates.
  • This is Track C, correctly scoped to held-out FinePDFs (the loop’s own ground) — no external benchmark, no template-recognition artifact.
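
The cluster-level holdout can be sketched as a split over cluster ids followed by a document partition. The helper names, the holdout fraction, and the toy document-to-cluster mapping are all hypothetical; in practice the cluster ids would come from the cached BERTopic model.

```python
import random


def split_clusters(cluster_ids, holdout_fraction, seed):
    """Split distinct cluster ids into (train, holdout) sets, reproducibly."""
    ids = sorted(set(cluster_ids))
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_holdout = max(1, round(len(ids) * holdout_fraction))
    return set(ids[n_holdout:]), set(ids[:n_holdout])


def partition_documents(doc_to_cluster, holdout_clusters):
    """Assign each document to train or holdout by its cluster, never by doc."""
    train_docs, holdout_docs = [], []
    for doc_id, cluster in doc_to_cluster.items():
        if cluster in holdout_clusters:
            holdout_docs.append(doc_id)
        else:
            train_docs.append(doc_id)
    return train_docs, holdout_docs


# Toy mapping; real ids would be BERTopic topic assignments.
doc_to_cluster = {"d1": 0, "d2": 0, "d3": 1, "d4": 2, "d5": 2, "d6": 3}

train_clusters, holdout_clusters = split_clusters(
    doc_to_cluster.values(), holdout_fraction=0.25, seed=0
)
train_docs, holdout_docs = partition_documents(doc_to_cluster, holdout_clusters)
print(sorted(train_clusters), sorted(holdout_clusters))
print(train_docs, holdout_docs)
```

Because the split happens on cluster ids before any generation, no document from a held-out cluster can seed or ground a unit.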

3. Open degrees of freedom (for joint review)

  1. Composite F weights (w_t, w_a, w_g, w_x). DEFERRED to the first calibration run: set from the downstream DED/CTA/CPA signal, not asserted (cf. the verifier’s P2 sweep, which over-weighted topic at 0.45 among other terms).
  2. Thresholds τ_topic, τ_axiom, τ_ground, τ_x, τ_F, K, ε, ρ. DEFERRED to calibration; the headline pair τ_topic ≥ 0.80 ∧ r_axiom ≥ 0.45 is adopted now (§2.3); the rest set vs. a held-out sample.
  3. Topic-recovery estimator. RESOLVED 2026-06-05: BERTopic-recovery is the single canonical re-grounding metric, in-loop + downstream; R_D (MiniLM→KMeans→Hungarian to T_I.pkl) retired from the loop, offline-diagnostics only. MDL/codelength remains an optional offline lens.
  4. Skill-composition policy. DEFERRED to the P5 RL run: fixed sequence (§2.1) vs. ontology-driven vs. learned (GRPO), decided on evidence.
  5. Grounding granularity — per-claim vs. per-unit SourceSpan attachment.
  6. Engine mode — single-pass generate-then-verify vs. per-unit iterative repair.
  7. Refinement actuation — human / agent / GRPO-policy for Levers A & B (the P5 run is the RL form).
  8. Holdout protocol. RESOLVED 2026-06-05: topic-cluster-level FinePDFs holdout, partitioned before generation (§2.5). This is the stronger, cross-region test.

Agent Swarm Architecture

The agent-swarm modules in src/aegir/swarm/ are architectural substrate for the multi-agent operational pattern that the project will reach for as the metadata landscape scales beyond what a single-policy training loop can address. The system’s current operational training pipeline — described in the semantic-engine authoritative reference and the RLVR-for-ontology-generation chapter — is single-policy. This chapter documents the swarm modules’ design, the engineering rationale for landing them in the codebase ahead of an operational multi-agent task, and the optimization layers (prompt evolution, agent RL) that will target the same verifier R(O, I) once the swarm becomes operational.

When the swarm becomes operational

Two concrete forcing functions move the project from single-policy training to a multi-agent architecture:

  • Scaling to large metadata landscapes. GitTables (≈ 1M tables, 100% generic column names) and the WikiTables corpus together represent more conceptual breadth than a single LoRA-fine-tuned policy of fixed capacity can hold without forgetting. Splitting the work across multiple agents — each specialized to a region of the metadata landscape, sharing a stable verifier — is the architectural answer.
  • Streaming source tagging (Flink SQL, Spark SQL). Streaming query engines produce schemas continuously rather than as static batches; new domains arrive as new data products are stood up. A full GRPO retrain on every new domain is impractical. An online optimization loop that adapts agent prompts and routing faster than a weight retrain is the practical alternative.

The agent-swarm scaffolding exists in the codebase today so the infrastructure is ready when those scaling pressures arrive. The LatentMAS-informed RWKV-state-sharing design described below is the substrate; the GEPA- and Agent-Lightning-class optimization loops further down are the methodological frameworks the project will adopt for operating it.

The architectural substrate

The swarm shares compact RWKV recurrent state tensors between agents rather than exchanging text messages or attention KV caches — a communication medium that is uniquely efficient for recurrent architectures.

Why RWKV state sharing

RWKV’s recurrent state is constant in sequence length. Each layer’s state is a matrix of shape (H, K, V) where H is the number of heads and K = V = head_size. The total state size per layer is

O(H * head_size^2) = O(d_model * head_size) = O(d^2)

independent of how many tokens the agent has processed. For a swarm of N agents, the cost of sharing all recurrent states is

RWKV:        O(N * d^2)          -- constant in sequence length
Transformer: O(N * n * d)        -- linear in sequence length n

At context lengths of 4k–128k tokens with typical d = 512–4096, RWKV state sharing is orders of magnitude cheaper. The LatentMAS paper (arXiv:2511.20639) quantifies this as 235–471× more information-dense than text-based inter-agent communication, since the recurrent state encodes a compressed summary of the entire processing history.
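
A back-of-the-envelope comparison makes the gap concrete. The sketch assumes d_model = H * head_size, counts elements per layer per agent, and models the Transformer side as a standard multi-head KV cache holding both keys and values with no grouped-query sharing; these are illustrative assumptions, not measurements from Aegir.

```python
def rwkv_state_elements(d_model, head_size):
    """Per-layer recurrent state size: H * K * V with K = V = head_size."""
    num_heads = d_model // head_size
    return num_heads * head_size * head_size


def kv_cache_elements(d_model, seq_len):
    """Per-layer KV cache size: keys and values for every cached token."""
    return 2 * seq_len * d_model


d_model, head_size = 2048, 64
ratios = {}
for seq_len in (4096, 32768, 131072):
    ratios[seq_len] = kv_cache_elements(d_model, seq_len) / rwkv_state_elements(
        d_model, head_size
    )
    print(seq_len, ratios[seq_len])
```

Under these assumptions the ratio simplifies to 2n / head_size, so it grows linearly with context length while the RWKV state stays fixed.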

For Aegir’s column-annotation task, this means a specialist trained on (say) geographic column types can share its accumulated understanding of a table’s structure through a single (H, K, V) tensor per layer, rather than generating and parsing natural-language explanations.

Swarm components

The swarm consists of four modules:

| Module | File | Purpose |
| --- | --- | --- |
| RWKVStateFusion | src/aegir/swarm/state_fusion.py | Combine N agent states into one |
| AlignmentProjection | src/aegir/swarm/alignment.py | Map states between different-sized agents |
| FrozenSpecialist | src/aegir/swarm/specialist.py | Wrap pre-trained models as frozen agents |
| SwarmOrchestrator | src/aegir/swarm/orchestrator.py | Routing + reward shaping |

State fusion modes

RWKVStateFusion supports three strategies for combining agent states:

  1. weighted_sum — Attention-weighted combination using learnable query/key projections. The orchestrator learns which agents to trust per head.
  2. gated — Per-agent softmax gates. Simpler than attention but still differentiable. A reasonable baseline for initial experiments.
  3. concat_project — Concatenate all agent states and project back to single-agent dimensions. Most expressive but O(N) in parameter count.

See RWKV State Fusion for mathematical details.

Optimization layers for a deployed swarm

The agent swarm is the architectural substrate. The verifier R(O, I) described in RLVR for ontology generation is the reward signal. The remaining question is which optimization loop adjusts the swarm against that reward. Three candidate layers are available today, and the project’s plan is to adopt them in order as operational pressure justifies each.

Weight-level (current): GRPO

The current paper-1 training program updates a single policy’s weights via Group Relative Policy Optimization [Shao et al. 2024] against R(O, I). This is the appropriate choice when there is one policy, the corpus is bounded, and training compute is available in chunks. The in-flight run described in the Semantic-engine authoritative reference is the first end-to-end test of this layer.

Prompt-level: GEPA

When the swarm is operational and the optimization target is the prompts of the agents rather than their weights, the project will reach for GEPA [Agrawal et al. 2025; arXiv:2507.19457; ICLR 2026 Oral]. GEPA is a Genetic-Pareto reflective prompt optimizer: it samples trajectories from an LLM-based system, uses an LLM to reflect on those trajectories in natural language to diagnose failures, proposes prompt updates targeted at the real observed failure modes, and combines complementary improvements along the Pareto frontier of its own attempts. The paper reports that reflective prompt evolution outperforms GRPO using up to 35× fewer rollouts on the agentic tasks the authors evaluated.

Two GEPA properties are directly relevant to a deployed swarm:

  • Compound-system support. GEPA optimizes the prompts of an arbitrary LLM-based system — including multi-agent pipelines with retrieval, generation, reranking, and synthesis stages. The DSPy implementation (dspy.GEPA) exposes this for any DSPy module, and the same shape applies to a custom swarm with FrozenSpecialist agents.
  • Actionable Side Information (ASI). GEPA’s feedback channel is not just a scalar reward; it accepts structured error messages, profiling data, and reasoning traces. The deterministic verifier R(O, I) already produces this kind of structured feedback — per-component scores, hard-gate failure reasons, R_D topic-alignment diagnostics — which is the feedback shape GEPA is designed to consume.

For Aegir, GEPA becomes the operational optimization loop when the swarm is composing ontology fragments and the goal is to adapt the system’s behavior to a new domain (a new streaming source, a new compliance regime) faster than a full GRPO retrain can deliver.

Agent-level RL: Agent Lightning

When the optimization target is agent behavior — including tool use, retrieval choices, multi-step interaction, and delayed reward — the project will reach for Agent Lightning [Microsoft Research 2025; arXiv:2508.03680]. Agent Lightning decouples agent execution from RL training: it wraps any agent built on LangChain, AutoGen, CrewAI, the OpenAI Agents SDK, LangGraph, or custom Python with effectively zero code changes. The framework’s LightningRL algorithm formalizes agent execution as a Markov decision process, defines a unified data interface, and handles credit assignment so that any agent’s trajectories can be decomposed into training transitions — including in multi-agent scenarios and dynamic workflows.

A particularly direct precedent for Aegir’s streaming-SQL tagging target exists in Agent Lightning’s documentation: a LangGraph-based SQL agent trained with the VERL RL algorithm against task rewards. The Aegir generalization is to substitute R(O, I) — which already discriminates schema-and-ontology quality and is hash-stable across runs — for the SQL-agent reward and run the same training loop against the Flink-SQL / Spark-SQL streaming-tagging task. Agent Lightning also enables selective optimization that targets specific sub-agents or steps in a multi-agent workflow, which fits the swarm’s FrozenSpecialist + SwarmOrchestrator shape directly.

Selecting an optimization layer

The three layers compose rather than compete:

| Layer | What it adjusts | When to use |
| --- | --- | --- |
| GRPO (weight-level) | Single-policy weights | Bounded corpus; training compute available in chunks; current paper-1 work |
| GEPA (prompt-level) | Prompts of an LLM-based system | Online adaptation to new domains; multi-agent pipelines; rollout-budget-constrained settings |
| Agent Lightning (agent-level RL) | Agent behavior incl. tool use, routing, multi-step | Multi-agent scenarios with delayed reward; framework-agnostic; streaming-SQL targets |

All three target the same verifier R(O, I). That property — the verifier is the durable asset, the optimization layers slot in above it — is the project’s methodological commitment for keeping the verifier work paper-1-ready while leaving room for the swarm generalization downstream.

What this chapter does not commit to

  • The swarm is not yet operational. The current paper-1 training run uses a single policy. The modules above exist in src/aegir/swarm/ but are not exercised by any current training run.
  • GEPA and Agent Lightning are not integrated yet. Both are named here as the methodological frameworks the project will adopt when scaling pressure justifies them. Integration work follows paper 1’s first held-out evaluation.
  • The order of adoption is provisional. Whether prompt-level optimization (GEPA) or agent-level RL (Agent Lightning) becomes operational first depends on which scaling pressure (large-corpus breadth vs. streaming-online adaptation) arrives first. The roadmap tracks both.

References

  • LatentMAS — recurrent-state sharing as multi-agent communication. arXiv:2511.20639.
  • Agrawal, L. A., et al. (2025). GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning. arXiv:2507.19457; ICLR 2026 Oral. Reference implementation: dspy.GEPA.
  • Microsoft Research. (2025). Agent Lightning: Train ANY AI Agents with Reinforcement Learning. arXiv:2508.03680. Documentation includes a LangGraph SQL-agent training example.
  • Shao, Z., et al. (2024). DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. (Source of the GRPO algorithm.)

Internal references.

  • RLVR for ontology generation — the verifier R(O, I) that all three optimization layers target; the methodological chapter for paper 1.
  • Semantic-engine authoritative reference — the operational state of the current single-policy paper-1 work.
  • Roadmap — the two-paper milestone structure and the deferred-work section that names the K2.5 PARL plan as superseded by the layered approach above.

RWKV State Fusion

The RWKVStateFusion module combines recurrent states from multiple specialist agents into a single fused state for the primary agent. Implementation is in src/aegir/swarm/state_fusion.py.

Input Format

Each agent produces a per-layer recurrent state tensor of shape:

(B, H, K, V)

where B is batch size, H = num_heads, and K = V = head_size. Given N agents, the fusion module receives a list of N such tensors and outputs a single tensor of the same shape.

Internally, the input list is stacked into a single tensor of shape (B, N, H, K, V).

Fusion Modes

weighted_sum – Attention Over Agent States

Uses a learnable query vector per head and a key projection to compute attention weights over agents.

Parameters:

  • query: (H, K) – learnable query per attention head
  • key_proj: Linear mapping K*V -> K (no bias)

Computation:

flat   = reshape(stacked, [B, N, H, K*V])
keys   = key_proj(flat)                      # (B, N, H, K)
scores = einsum("bnhk, hk -> bnh", keys, query)
weights = softmax(scores, dim=1)             # (B, N, H)
fused  = einsum("bnh, bnhkv -> bhkv", weights, stacked)

Each head independently learns which agents to attend to. This is the default mode and generally the most effective, since it allows fine-grained per-head routing without excessive parameters.

gated – Learnable Per-Agent Gates

A simpler approach with a single learnable gate vector.

Parameters:

  • gates: (N,) – initialized to 1/N (uniform)

Computation:

weights = softmax(gates, dim=0)   # (N,)
fused   = einsum("n, bnhkv -> bhkv", weights, stacked)

All heads share the same agent weighting. This is cheaper than weighted_sum but less expressive – it cannot learn head-specific preferences for different specialists.
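
The gated computation is small enough to trace by hand. The sketch below is a dependency-free illustration, not the module's implementation: each agent state is flattened to a short list standing in for a (B, H, K, V) tensor, and the helper names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def gated_fuse(gates, agent_states):
    """Convex combination of N flattened agent states using softmax(gates)."""
    weights = softmax(gates)
    length = len(agent_states[0])
    return [
        sum(w * state[i] for w, state in zip(weights, agent_states))
        for i in range(length)
    ]


states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# Equal gates (the 1/N initialization) give a plain average of the states.
uniform = gated_fuse([1 / 3] * 3, states)
# A large gate on the third agent pulls the fused state toward that agent.
skewed = gated_fuse([0.0, 0.0, 5.0], states)

print(uniform)
print(skewed)
```

Because softmax depends only on differences between gates, any constant initialization yields the same uniform weighting as 1/N.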

concat_project – Concatenate and Project

The most expressive mode. Concatenates all agent states along the agent dimension and projects back.

Parameters:

  • proj: Linear mapping N*K*V -> K*V (no bias)

Computation:

flat      = reshape(permute(stacked, [0,2,1,3,4]), [B, H, N*K*V])
projected = proj(flat)           # (B, H, K*V)
fused     = reshape(projected, [B, H, K, V])

This allows arbitrary mixing of information across agents within each head but scales linearly in parameters with the number of agents.

Usage Example

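import torch  # assumes the module operates on PyTorch tensors

from aegir.swarm.state_fusion import RWKVStateFusion

fusion = RWKVStateFusion(
    num_heads=8,
    head_size=64,
    num_agents=3,
    mode="weighted_sum",
)

# agent_states: list of 3 tensors, each (B, 8, 64, 64); random placeholders with B = 2
agent_states = [torch.randn(2, 8, 64, 64) for _ in range(3)]
fused_state = fusion(agent_states)  # (B, 8, 64, 64)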

Mode Selection Guidelines

| Mode | Parameters | Per-head routing | Best for |
| --- | --- | --- | --- |
| weighted_sum | O(H*K + K*V*K) | Yes | General use, default |
| gated | O(N) | No | Quick experiments, few agents |
| concat_project | O(N*K*V*K*V) | Yes | Maximum expressiveness, small N |
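
To put numbers on the parameter column, the sketch below evaluates the three counts for the configuration used in the usage example (8 heads, head size 64, 3 agents). It derives the counts from the parameter shapes listed per mode, assuming no bias terms; the function name is hypothetical.

```python
def fusion_param_count(mode, num_heads, head_size, num_agents):
    """Learnable parameter count implied by the shapes listed for each mode."""
    H, K, V, N = num_heads, head_size, head_size, num_agents
    if mode == "weighted_sum":
        return H * K + (K * V) * K   # query (H, K) + key_proj (K*V -> K)
    if mode == "gated":
        return N                     # gates (N,)
    if mode == "concat_project":
        return (N * K * V) * (K * V)  # proj (N*K*V -> K*V)
    raise ValueError(f"unknown fusion mode: {mode}")


counts = {
    mode: fusion_param_count(mode, num_heads=8, head_size=64, num_agents=3)
    for mode in ("weighted_sum", "gated", "concat_project")
}
for mode, count in counts.items():
    print(f"{mode:15s} {count:>12,d}")
```

At this size concat_project needs roughly 50 million parameters per fusion module against about 263 thousand for weighted_sum, which is the practical reason to reserve it for small N.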

LatentMAS Alignment Projection

The AlignmentProjection module maps recurrent states between agents that may have different architectures (different d_model, num_heads, or head_size). Implementation is in src/aegir/swarm/alignment.py.

Problem

When fusing states from multiple agents, all states must share the same (H, K, V) dimensions. But specialists may have been trained with different model sizes. A CTA specialist with d_model=256 and a CPA specialist with d_model=512 produce incompatible recurrent states. The alignment projection resolves this mismatch.

State Types

RWKV recurrent states consist of two kinds of tensors:

Matrix States (att_kv)

The core recurrent state from time mixing. Shape: (B, H, K, V) where K = V = head_size.

Projection: When source and target have different num_heads or head_size, the matrix state is flattened and linearly projected:

S_flat = reshape(S_source, [B, H_s * K_s * V_s])
S_target = W_matrix @ S_flat
S_out = reshape(S_target, [B, H_t, K_t, V_t])

where W_matrix has shape (H_t * K_t * V_t, H_s * K_s * V_s).
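
It is worth checking how large a dense W_matrix becomes. The arithmetic below simply multiplies out the stated shape for a 4-head source and an 8-head target with head size 64, and assumes the weight is stored densely in 32-bit floats; the helper names are hypothetical.

```python
def matrix_projection_params(src_heads, src_head_size, tgt_heads, tgt_head_size):
    """Entries in W_matrix of shape (H_t*K_t*V_t, H_s*K_s*V_s), K = V = head_size."""
    src = src_heads * src_head_size * src_head_size
    tgt = tgt_heads * tgt_head_size * tgt_head_size
    return tgt * src


def vector_projection_params(src_d_model, tgt_d_model):
    """Entries in a dense (D_target, D_source) vector projection."""
    return tgt_d_model * src_d_model


matrix_params = matrix_projection_params(4, 64, 8, 64)
vector_params = vector_projection_params(256, 512)
matrix_gib_fp32 = matrix_params * 4 / 2**30  # 4 bytes per float32 entry

print(f"matrix projection: {matrix_params:,d} parameters ({matrix_gib_fp32} GiB in fp32)")
print(f"vector projection: {vector_params:,d} parameters")
```

Under these assumptions a single dense matrix projection holds about 537 million parameters, so the flatten-and-project form is only cheap when head counts and head sizes are small.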

The LatentMAS paper (arXiv:2511.20639) proposes using bilinear projection S' = W_l @ S @ W_r^T and computing W_a via ridge regression on paired agent activations. Aegir instead trains the projection end-to-end as part of the swarm’s gradient flow, which avoids the need for a separate alignment data collection phase and allows the projection to co-adapt with the fusion module.

Vector States (att_x_prev, ffn_x_prev)

The previous-timestep hidden state cache used by RWKV’s time-shift mechanism. Shape: (B, D) where D = d_model.

Projection: Simple linear mapping when d_model differs:

x_target = W_vector @ x_source

where W_vector has shape (D_target, D_source).

When Projections Are Needed

The module detects whether projection is needed at initialization:

# Matrix projection: needed when head geometry differs
needs_matrix_proj = (
    source_num_heads != target_num_heads
    or source_head_size != target_head_size
)

# Vector projection: needed when d_model differs
needs_vector_proj = (source_d_model != target_d_model)

When source and target share the same architecture, both projections are identity operations (no parameters allocated).

Usage

from aegir.swarm.alignment import AlignmentProjection

align = AlignmentProjection(
    source_num_heads=4,   source_head_size=64,
    target_num_heads=8,   target_head_size=64,
    source_d_model=256,
    target_d_model=512,
)

# Project matrix state
att_kv_target = align.forward_matrix(att_kv_source)   # (B,4,64,64) -> (B,8,64,64)

# Project vector state
x_prev_target = align.forward_vector(x_prev_source)   # (B,256) -> (B,512)

LatentMAS vs Aegir Approach

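| Aspect | LatentMAS | Aegir |
| --- | --- | --- |
| Alignment method | Ridge regression on collected pairs | End-to-end gradient training |
| Training data | Requires parallel agent runs | Learned during swarm training |
| Adaptability | Fixed after alignment phase | Continuously adapts |
| Projection type | Bilinear W_l @ S @ W_r^T | Flatten + linear (at least as expressive: a dense map on the flattened state can represent any bilinear map, at a higher parameter cost) |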

The end-to-end approach is viable because Aegir’s swarm training already has gradient flow through the fusion module. The alignment projection sits in that gradient path and receives signal from the downstream task loss.

K2.5 PARL Orchestrator

The SwarmOrchestrator coordinates a trainable primary Aegir model with multiple frozen specialist agents, following the Parallel Agent Reinforcement Learning (PARL) pattern from Kimi K2.5 (arXiv:2602.02276). Implementation is in src/aegir/swarm/orchestrator.py.

Architecture

                    +-------------------+
                    | SwarmOrchestrator |
                    +-------------------+
                            |
             +--------------+--------------+
             |              |              |
     SpecialistRouter   Primary      FrozenSpecialists
     (sigmoid gates)    (trainable)   (frozen params)
             |              |              |
             |              |    +---------+---------+
             |              |    |         |         |
             |              |  Spec_0   Spec_1   Spec_N
             |              |    |         |         |
             +--> activation -->  state fusion  <----+
                  mask           (RWKVStateFusion)

The primary model is the only component whose parameters are updated during PARL training. Specialists are frozen checkpoints that contribute their recurrent states when activated by the router.

SpecialistRouter

The router decides which specialists to activate for a given input. It maps the primary agent’s hidden representation to per-specialist activation scores:

scores = sigmoid(W_router @ hidden_states)   # (B, num_specialists)
activation_mask = scores > threshold          # default threshold = 0.5

Sigmoid gating (rather than softmax) allows zero, one, or multiple specialists to be activated simultaneously. This is critical for the column annotation task where a table may require expertise from several domain specialists, or none at all.
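
The difference from softmax routing is easy to see on a toy router. The sketch below uses a hand-picked 3 x 2 weight matrix and plain Python lists in place of tensors; the weights and the function name route are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def route(hidden, router_weights, threshold=0.5):
    """Independent per-specialist gates: scores = sigmoid(W @ hidden)."""
    scores = [
        sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in router_weights
    ]
    mask = [score > threshold for score in scores]
    return scores, mask


# Hand-picked router weights: one row per specialist, one column per feature.
W_router = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

_, mask_none = route([-1.0, -1.0], W_router)  # every logit is negative
_, mask_one = route([1.0, -1.0], W_router)    # only the first logit is positive
_, mask_all = route([1.0, 1.0], W_router)     # every logit is positive

print(mask_none, mask_one, mask_all)
```

Each gate is thresholded on its own, so the number of active specialists is not fixed in advance, whereas a softmax would force the scores to compete for a single unit of probability mass.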

PARL Reward Structure

The combined reward follows K2.5’s formulation:

r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf

Reward Components

r_perf – Performance reward. F1 score on the annotation task (CTA or CPA). This is the primary signal that drives annotation quality.

r_parallel – Parallelism and load balancing reward. Encourages efficient specialist utilization: activate specialists when they help, avoid activating them when they don’t. Adapted from H-Net’s lb_loss which penalizes unbalanced routing across experts.

r_finish – Completion quality reward. All columns in a table must be annotated, and the router must not degenerate into always-on or always-off patterns. Penalizes incomplete annotations and trivial routing strategies.

Lambda Annealing Schedule

Following K2.5, the lambda weights anneal over training:

| Phase | lambda_1 (parallel) | lambda_2 (finish) | Rationale |
| --- | --- | --- | --- |
| Early | 0.3 | 0.1 | Encourage exploration of specialist activation |
| Mid | 0.1 | 0.3 | Shift focus to completion quality |
| Late | 0.05 | 0.05 | Let r_perf dominate for final accuracy |

The initial values (lambda_parallel=0.3, lambda_finish=0.1) are set in the orchestrator constructor. Annealing is managed by the training loop.
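
A training loop could apply the schedule along the following lines. The lambda values are the ones tabulated above; the phase boundaries at one third and two thirds of training, the step-wise (not smooth) transition, and the function names are assumptions made for illustration.

```python
def lambda_weights(progress):
    """Return (lambda_1, lambda_2) for training progress in [0, 1].

    Phase boundaries at 1/3 and 2/3 are an assumption; the text only names
    early, mid, and late phases.
    """
    if progress < 1 / 3:
        return 0.3, 0.1    # early: encourage specialist exploration
    if progress < 2 / 3:
        return 0.1, 0.3    # mid: shift focus to completion quality
    return 0.05, 0.05      # late: let r_perf dominate


def combined_reward(r_perf, r_parallel, r_finish, progress):
    """r = lambda_1 * r_parallel + lambda_2 * r_finish + r_perf."""
    lambda_1, lambda_2 = lambda_weights(progress)
    return lambda_1 * r_parallel + lambda_2 * r_finish + r_perf


early = combined_reward(r_perf=0.8, r_parallel=1.0, r_finish=1.0, progress=0.0)
late = combined_reward(r_perf=0.8, r_parallel=1.0, r_finish=1.0, progress=0.9)
print(round(early, 3), round(late, 3))
```

With identical component rewards the shaping terms contribute 0.4 early and 0.1 late, so the performance term accounts for a growing share of the total as training proceeds.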

Token-Level Clipping RL

K2.5 uses a variant of PPO with token-level clipping rather than trajectory-level. This provides finer-grained credit assignment:

  • Each token’s routing decision gets its own clipped surrogate objective
  • Critical tokens (column boundaries, type-indicative values) receive higher weight
  • The clipping range narrows over training to stabilize converged policies
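
The per-token objective can be sketched with the standard PPO clipped surrogate applied to each token separately. This is the textbook form, not a reproduction of K2.5's exact variant; the critical-token weights and all numeric inputs are invented for illustration.

```python
def clipped_token_objective(ratio, advantage, clip_eps):
    """Standard PPO clipped surrogate for one token.

    ratio is pi_new(token) / pi_old(token); the objective is
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A).
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


def weighted_objective(ratios, advantages, weights, clip_eps):
    """Weighted mean of per-token objectives; weights mark critical tokens."""
    terms = [
        w * clipped_token_objective(r, a, clip_eps)
        for r, a, w in zip(ratios, advantages, weights)
    ]
    return sum(terms) / sum(weights)


ratios = [1.5, 0.9, 1.0]
advantages = [1.0, -1.0, 2.0]
weights = [2.0, 1.0, 1.0]  # hypothetical: first token is a column boundary

objective = weighted_objective(ratios, advantages, weights, clip_eps=0.2)
print(round(objective, 4))
```

The first token's ratio of 1.5 is clipped to 1.2, so an overly large policy step on a heavily weighted token cannot inflate the objective.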

Critical-Steps Optimization

Rather than minimizing total computation, the orchestrator minimizes the critical path – the longest chain of sequential dependencies. Specialist activations that can run in parallel do not increase the critical path even if they increase total FLOPs. This encourages the router to prefer parallel specialist activation over sequential reasoning in the primary model when both achieve similar accuracy.
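
The distinction between total cost and critical path can be checked on a toy dependency graph. The step names and costs below are invented; the function computes the longest chain of sequential dependencies in a DAG.

```python
from functools import lru_cache


def critical_path(costs, deps):
    """Length of the longest dependency chain in a DAG of steps."""

    @lru_cache(maxsize=None)
    def finish_time(step):
        upstream = deps.get(step, ())
        return costs[step] + max((finish_time(u) for u in upstream), default=0)

    return max(finish_time(step) for step in costs)


# Plan A: three specialists run in parallel after a shared encode step.
parallel_costs = {"encode": 1, "spec_cta": 4, "spec_cpa": 4, "spec_geo": 4, "fuse": 1}
parallel_deps = {
    "spec_cta": ("encode",),
    "spec_cpa": ("encode",),
    "spec_geo": ("encode",),
    "fuse": ("spec_cta", "spec_cpa", "spec_geo"),
}

# Plan B: the primary model reasons sequentially without specialists.
sequential_costs = {"encode": 1, "primary_reasoning": 8, "fuse": 1}
sequential_deps = {"primary_reasoning": ("encode",), "fuse": ("primary_reasoning",)}

parallel_total = sum(parallel_costs.values())
parallel_critical = critical_path(parallel_costs, parallel_deps)
sequential_total = sum(sequential_costs.values())
sequential_critical = critical_path(sequential_costs, sequential_deps)

print(parallel_total, parallel_critical)
print(sequential_total, sequential_critical)
```

Plan A costs more in total (14 versus 10 units) but has the shorter critical path (6 versus 10), which is the trade the orchestrator is meant to prefer when accuracy is comparable.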

Forward Pass

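from aegir.swarm.orchestrator import SwarmOrchestrator
from aegir.swarm.state_fusion import RWKVStateFusion

# primary, spec_cta, spec_cpa, spec_geo, tokens, mask, and pooled_hidden
# are assumed to be defined by the surrounding training code.
orchestrator = SwarmOrchestrator(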
    primary_model=primary,
    specialists=[spec_cta, spec_cpa, spec_geo],
    fusion=RWKVStateFusion(num_heads=8, head_size=64, num_agents=3),
    d_model=512,
    activation_threshold=0.5,
)

result = orchestrator(
    input_ids=tokens,
    mask=mask,
    routing_hidden=pooled_hidden,  # from primary's first layer
)

# result["output"]           -- primary model output
# result["specialist_outputs"] -- list of activated specialist results
# result["activation_mask"]  -- (B, num_specialists) boolean mask

When routing_hidden is None, specialist activation is skipped entirely and only the primary model runs. This allows the same orchestrator to be used in both supervised pre-training (no specialists) and PARL training (with specialists).

Roadmap

The project converges two coupled artifacts that share substrate and are cited together: a byte-level sequence model for relational metadata understanding, and a BFO/CCO-grounded ontology + synthetic corpus that grounds and supplies its pretraining data. They are coupled by design — the ontology is the domain-adaptive surface, the H-Net+RWKV model is the ultimate fitness measure, and the reasoner (HermiT) is the model-independent oracle between them. A better ontology yields a better corpus yields a better model; the model’s behaviour, in turn, is the gate that certifies the ontology machinery.

The current go-forward programme — the milestone ladder, its gates, and the scaling discipline — is the Signals Programme, of which this page is the landing view. The end-to-end pipeline and the agent-mediated machinery that runs it are specified in End-to-end + Meta-Harness + reasoner. This page summarises what is delivered, what is in flight, and what is gated, and points to the authoritative reference for each.

Two earlier framings have been reorganized under the Signals Programme — not retired. (1) The four-phase K2.5 PARL roadmap (supervised → reward → PARL → swarm RL) folded into the Signals milestone ladder; the swarm modules persist as infrastructure and the RL approach evolves through the GEPA / Agent-Lightning layers (see Agent Swarm). (2) The RLVR-for-ontology-generation line is long-horizon, not superseded: its four-component verifier R(O, I) is now realized as the deterministic membrane stack (HermiT/CCO, OntoClean, OQuaRE) built this cycle, and an SAE-instrumented-Qwen generator fine-tuned by GRPO against that reward is the Signals M4 apparatus for autonomous, local ontology extension. The agent-mediated propose/dispose loop is building and proving that reward now — current work in direct service of M4. The Concept brief, Semantic engine reference, and RLVR chapter carry the M4 research design; the reward-modeling / PARL / swarm-RL design notes remain in the tree.

Delivered substrate

The model track produced a working, air-gap-deployable training and evaluation stack, and a first real backbone.

  • End-to-end training pipeline. BDD-backed training on the gt-signals-dbpedia benchmark (120 DBpedia labels, 814 tables), with per-epoch boundary diagnostics and checkpoint discipline (outputs/runs/{run_id}/… with sidecars and pre-rendered Bokeh plots).
  • Gateway + leaderboard. A FastAPI gateway (port 8091) + React UI + the ABI-patched flash-attn / mamba-ssm build path. The leaderboard reads outputs/runs/ directly — no tracking daemon, no W&B, no MLflow. Deployment targets: devenv (local), CAI (PGlite), Zarf air-gap K8s.
  • v2 mixed-corpus pretrain (2026-04-27). The project’s first real byte-level backbone — 122k training steps on a mixed corpus (FineWeb-Edu + SQaLe + SchemaPile + FinePDFs-lab), single GPU. Stratified held-out evaluation shows non-degenerate representations across the trained-time slices, with bits-per-byte drops on domain-targeted data and general prose held flat. See Training Regime §10 for the full table.
  • Synthetic-corpus byte pretraining imparts transferable column skill. On a cells-only GitTables-DBpedia CTA frozen-backbone probe, a corpus-pretrained backbone reaches 0.66–0.70 accuracy vs. ≈0.12 at random init, with non-overlapping bootstrap CIs and rising Hewitt-Liang selectivity. Caveat: per-column CTA did not separate the full / no-ontology / no-schema ablation arms — flat CTA is surface-solvable — so the ontology’s load-bearing claim moved to the relational axis (the M2 lift, below). See EVIDENCE.md (E1–E3, “already-supported claims”).

These artifacts are the read surface and the warm-start for everything below.

The ontology + corpus, and its rigor program

The ontology is the annotation vocabulary for Column Type / Column Property Annotation (CTA/CPA) over wide relational tables. It is content-derived from FinePDFs (qdrant/ColBERT conceptual filtering), then realized to a HermiT-validated OWL artifact at corpora/ontology/sdg-ontology.{omn,owl} (with HERMIT_CERTIFICATE.md). Its classes are intermediate-depth subsumers — the property-bearing classes a heterogeneous-but-coherent column belongs to — not leaf terms.

How it is authored, and every metric, gate, and membrane that governs it, is the subject of the canonical Ontology Authors Guide. The standing discipline is propose / dispose: an agent (or a human) proposes axioms; a stack of deterministic membranes disposes — a parse membrane (OWLAPI well-formedness), a reasoning-authority membrane (CCO imported, HermiT validates grounding against CCO’s disjointness axioms in the loop), and an OntoClean meta-property membrane (anti-rigid-cannot-subsume-rigid). The two strongest are un-fakeable; a failure returns its reason, and the author refines (the agent-mediated feedback loop).

The rigor program (delivered, gate GREEN). Benchmarked against the IOF/BFO signature, the realized ontology is metered by scripts/ontology_metrology.py (IOF rigor dimensions — definitional_completeness, bfo_grounded, realizable_machinery, def_annotation_coverage; OntoQA/OQuaRE structural metrics; OntoClean taxonomic-correctness proxies) and gated by the OQuaRE quality gate (scripts/ontology_oquare.py — six SQuaRE characteristics, IOF-anchored [1,5] bands, floors aggregate ≥ 3.5 and FunctionalAdequacy ≥ 3.0 plus HermiT-consistent, aim 3.9, wired HARD into aegir.lineup.sync._gate). Two pre-registered objectives, OQ-Structure and OQ-Rigor, are both MET: current definitional_completeness 0.554, bfo_grounded 0.896, realizable_machinery 10, 0 unsatisfiable classes, OQuaRE 4.24 — GREEN. The standing rule: no sync --push of the ontology Data Product below GREEN. See EVIDENCE.md (OQ-Rigor / OQ-Structure).
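The floors named above compose into a simple conjunctive check. The sketch below is illustrative only: `GateInput` and `oquare_gate` are hypothetical names, the example scores are made up, and the real gate in scripts/ontology_oquare.py may compute or combine its inputs differently.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class GateInput:
    aggregate: float            # OQuaRE aggregate score on the [1, 5] band
    functional_adequacy: float  # FunctionalAdequacy characteristic score
    hermit_consistent: bool     # reasoner verdict on the realized ontology

def oquare_gate(g: GateInput) -> str:
    """Return "GREEN" only if every floor named in the text holds."""
    ok = (
        g.aggregate >= 3.5
        and g.functional_adequacy >= 3.0
        and g.hermit_consistent
    )
    return "GREEN" if ok else "RED"

# Illustrative inputs, not measured values.
passing = oquare_gate(GateInput(4.0, 3.2, True))
failing = oquare_gate(GateInput(4.0, 2.9, True))  # one floor missed
```

Because the check is a conjunction, a high aggregate cannot compensate for a missed FunctionalAdequacy floor or an inconsistent ontology.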

The corpus + DDL spine. The verified-pipeline corpus is the byte-pretraining data and an independent publishable deliverable (corpora/, the sdg-corpora submodule). Generation runs a gRPC engine (+ a GLM/Grok mix) over ontology- and DDL-grounded prompts (scripts/generate_chapter.py), followed by a four-scorer verification loop (scripts/verify_chapters.py). The DDL spine projects the ontology into SQL tables / views / FKs with referential-integrity-true rows — a relational deliverable also projected into Atlas. Path A (scripts/, the TrainingFlow runs) is the continued-pretraining augmentation on RWKV World v3.

Go-forward — the Signals Programme milestone ladder

The programme is a high-dimensional, multi-objective optimization decomposed into a sequence of factored, low-DOF gates over a scaling ladder; the standing EVIDENCE rule is no scaled spend without a green gate. Full charter, the α×β interaction design, and the final-gate text are on the Signals Programme page; the pre-registered hypotheses, instruments, and decision rules are in EVIDENCE.md.

  • M0 — substrate-evolution machinery. DELIVERED (SUPPORTED). The reasoner gates and computes; the harness evolves: the HermiT consistency gate (inc-2a), the single-file harness H₀ (inc-2b), the Meta-Harness outer loop + a discovered harness H₁ (inc-2c), and the realization-as-CPA beachhead (inc-2d). See End-to-end + Meta-Harness and the §Meta-harness entries in EVIDENCE.md.
  • M1 — architecture baseline (the H-Net isolation gate). UNTESTED. Train H-Net+RWKV on RWKV-7’s open corpus, swapping only the tokenizer for byte-level dynamic chunking, and establish parity up the scaling ladder to RWKV-7-matched params. DOF = 1. Gate: H-Net+RWKV ≥ RWKV-7 at matched scale on standard evals.
  • M2 — instrument validity (the decisive corpus gate + proxy calibration). UNTESTED. A same-architecture, matched-budget matrix at ≥2 ladder rungs — arms {grounded mix / no-ontology ablation / standard-only} — with a 2-factor α×β cell (α = ontology-corpus fraction, β = SQL/DDL fraction) replicated across rungs. Eval: a FLOOR (grounded ≈ standard on general LM evals → non-degeneracy) AND a LIFT (grounded > ablation on relational + de-novo Data-Element elucidation, cells-only / control-task / PR-metric, bootstrap CI). It also calibrates the cheap proxies (R1 / coverage-close / corpus-quality) against the pretrain signal. Preconditions, both MET: the corpus’s max non-repetitive token yield is measured (Y_eff ≈ 7.93M byte-tokens, which caps α at scale), and the discriminating relational eval is realization-as-CPA — per chapter, cited templates → a Domain/Range TBox, corpus tables → an ABox, HermiT realizes each column’s type (CPA by a sound-and-complete oracle), with a Domain/Range-ablated matched-token control (pilot selectivity 0.79, 95% BCa CI [0.63, 0.88], permutation p = 1e-4 → CI-clean).
  • M3 — scale + the FINAL PHASE GATE. UNTESTED. Conditioned on M2: climb to RWKV-7-matched params, extrapolate α*(N)/β*(N) to target scale, and confirm the lift persists.
  • M4 — the forward door (unlocked by the final gate). Iterate the machinery: close the generator loop (SAE-instrumented Qwen with process-reward fine-tuning — the anti-Goodhart reward on the ontological-reasoning circuit, not the verdict), and apply the pipeline to a novel domain with minimal re-tuning (RASE generalization).

The final phase gate

At RWKV-7-matched scale, the ontology-grounded data mix yields an H-Net+RWKV model that (a) matches RWKV-7 on general/standard evals — the non-degeneracy floor — AND (b) exceeds the no-ontology-ablation control on relational understanding + de-novo Data-Element elucidation, CI-clean, with the α×β interaction and the mix-optimum scale-drift characterized.

Passing certifies the ontology machinery as a valid instrument for relational domain adaptation and authorizes the M4 forward door. Failing localizes the break to a named arrow — ontology→corpus degeneracy (M2 floor), corpus→model transfer (M2 lift), or scale-drift (M3) — each with its own remediation. This gate supersedes the proxy-only corpus-as-deliverable gate: the proxies are calibrated by it, never trusted ahead of it.

Shared infrastructure

Both artifacts depend on shared substrate beyond the ontology:

  • Gateway, leaderboard, and lineup. The FastAPI gateway + React UI (above), plus the lineup / KB — the LINEUP navigation primitive (Ward Cunningham, credited; not a wiki — no editing/forking) over the KB, which is a build projection of the three Data Products (ontology / relational / content). The read surface for run sidecars, ontology-rigor metrics, and corpus-quality surfaces alike.
  • Lineage substrate. The Atlas-on-AGE provenance graph, with OpenLineage / Marquez compatibility and Atlas deep integration — implemented in full. The discipline that keeps it non-dependent: the ontology is the single source of truth, so Atlas and every projection stay rebuildable-from-the-ontology-or-it’s-a-bug, never a master (Atlas edits suggest, never commit). See the Signals Programme’s source-of-truth diagram.
  • The engine and meta-harness. A gRPC engine serving Qwen3.6-35B-A3B-FP8 via vLLM under strict layering (engine→vLLM, workloads→gRPC), and the agent-mediated RETE/FSM control spine (src/aegir/meta_harness/) that orchestrates membrane-gated proposal.
  • Worktree-aware development tooling. git worktree-based cross-checkout dev with shared .git and per-worktree service gating; the cross-worktree SAE-feature streaming pipe lives here. See Worktree Aware Development.

Design principles

  1. Each milestone produces a usable artifact. M0 produced the substrate machinery (reasoner gates, evolving harness); the model track produced a training loop with leaderboard, an air-gap-deployable gateway, and a pretrained checkpoint; the ontology is itself a HermiT-validated, OQuaRE-GREEN, citable deliverable independent of any downstream model result.

  2. Empirical gates are real gates, not aspirations. No scaled spend proceeds without a green gate. The OQuaRE publish gate refuses sync --push of the ontology below GREEN; the M1 isolation gate blocks M2; the M2 floor-and-lift gates M3; the M3 final gate authorizes the forward door. A gate that honestly held RED (the ontology rigor gate, before the rigor-evolution loop closed it) is the evidence that the gates bind.

  3. Locked artifacts are hash-tracked end-to-end. Every run records its catalog version, locked-weights/null-statistics hashes, and run id in sidecar metadata; a strict-resume policy refuses to resume any run whose locked artifacts have drifted.

  4. Outward contracts stay narrow. The project publishes the sdg-corpora SHARE tier (ontology + SKOS vocabulary + DDL spine + corpus), trained checkpoints, and (when stable) the SAE feature dictionary. Consumers’ internal architectures (DST fusion, FSM session state, governance pipelines) are not the project’s concern; this decoupling is what makes each artifact shippable in isolation.

  5. Complexity is bounded. Each milestone adds exactly one new dimension of complexity (1-DOF where separable; a 2-factor cell only where the interaction is the hypothesis). Failure modes that respect this discipline are easy to diagnose; an honest revision pass on this document is the only protection against drift.
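Principle 3 (hash-tracked locked artifacts with strict resume) can be sketched as follows. The sidecar layout and the `make_sidecar` / `can_resume` helpers are assumptions for illustration; the project's actual sidecar schema is not shown in this document.

```python
import hashlib

def sha256_bytes(data: bytes) -> str:
    """Hex digest used as the identity of a locked artifact."""
    return hashlib.sha256(data).hexdigest()

def make_sidecar(run_id: str, catalog_version: str, locked: dict) -> dict:
    """Record one hash per locked artifact (name -> bytes) next to the run."""
    return {
        "run_id": run_id,
        "catalog_version": catalog_version,
        "locked_hashes": {k: sha256_bytes(v) for k, v in sorted(locked.items())},
    }

def can_resume(sidecar: dict, locked: dict) -> bool:
    """Strict resume: every recorded hash must match the current artifact."""
    current = {k: sha256_bytes(v) for k, v in locked.items()}
    return current == sidecar["locked_hashes"]

# Hypothetical artifacts standing in for locked weights and null statistics.
artifacts = {"locked_weights": b"weights-v1", "null_statistics": b"null-v1"}
sidecar = make_sidecar("run-0001", "catalog-v1", artifacts)
drifted = dict(artifacts, null_statistics=b"null-v2")
```

Here `can_resume(sidecar, artifacts)` holds, while `can_resume(sidecar, drifted)` does not, so a resume against drifted artifacts would be refused.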

Long-horizon work

The agent swarm modules in src/aegir/swarm/ are scaffolding for the long-horizon multi-agent training task; no operational training uses them yet. The four-phase K2.5 PARL roadmap (supervised → reward → PARL → swarm RL) folded into the Signals milestone ladder, and the RL approach evolves through the GEPA / Agent-Lightning layers; the reward-modeling, PARL-training, and swarm-RL design notes remain in the tree as that design record.

Signals Programme — Relational Domain Adaptation (aegir workstream)

Status: programme charter (2026-06-16). The go-forward development programme for the aegir track, framed as a workstream of the cross-repo Signals initiative. Builds on end_to_end_and_meta_harness.md (the substrate-evolution machinery) and the EVIDENCE.md gate discipline. Designed to be lifted into a GitHub project across the component repos (§Component map).

Thesis

Signals co-develops a bounded, verifiable, signal-driven agent ecosystem (Holland Signals & Boundaries; intrinsic verifiability per Gaius RASE). This workstream is its relational domain-adaptation engine, and it rests on three commitments established over the inc-2 work:

  1. The ontology is the domain-adaptive surface — not weight-bound software. A de-novo-curated, reasoner-verified ontology carries domain meaning independent of any model, so it can be evolved before a trusted model exists (the regime where a meta-harness cannot operate).
  2. The reasoner (HermiT) is the model-independent oracle — sound & complete, so coherence and realization are ground truth, not learned proxies. It certifies formal correctness; the corpus and the model certify domain correctness. All three anchors stay live (anti-folie-à-deux).
  3. The H-Net+RWKV model is the ultimate fitness measure — trained from scratch on the ontology-grounded corpus, evaluated on relational understanding and de novo Data Element elucidation with conventional evals/post-training, deliberately kept as a trustworthy, non-co-adapting arbiter.

The generator (SAE-instrumented Qwen, fine-tuned to mint the ontology) and the downstream model (H-Net+RWKV, from scratch) are distinct by design: experiment upstream, keep the instrument boring.

Source of truth & data flow — the ontology is primary; everything else serves it

The single source of truth is the generated ontology artifact itself. Atlas, Qdrant, the build/dev/current projection, the lineup, and even the published sdg-corpora are projections, indices, views, and exports of it — never competing stores. The ontology spans two disclosure tiers: KNOW (the full, curated working ontology) ⊇ SHARE (sdg-corpora, the published subset). Same artifact, two tiers — not two artifacts.

      ┌─────────────────────────────────────────────────────────────┐
      │  ONTOLOGY  — the source of truth (the artifact itself)        │
      │     KNOW (full, curated)   ⊇   SHARE (sdg-corpora, published) │
      └─────────────────────────────────────────────────────────────┘
            │ build              │ glossary-sync     │ index        │ export
            ▼                    ▼                   ▼              ▼
      build/dev/current   ◀──in sync──▶   Atlas     Qdrant      sdg-corpora
      (projection)                        (synced VIEW;          (the SHARE tier)
                                           edits ──suggest──▶ ontology curation)

            the lineup  ── navigates / explores / curates all of the above

  • Dependency arrows point inward. Regenerating the ontology re-projects build/dev/current, re-syncs Atlas, re-indexes Qdrant, re-exports sdg-corpora. That inward-pointing dependency is what keeps a multi-store assembly coherent instead of a web of drifting masters.
  • build/dev/current ⟷ Atlas stay in sync because both project from the ontology — not via a direct link. There is no current↔Atlas channel; both are downstream of the one SoT.
  • Atlas edits suggest, they do not commit. Atlas is a rich glossary-editing surface, but a curator’s edit there is a proposal that round-trips into the ontology’s curation queue (reviewed, reasoner-gated, applied), then re-projects outward. So the Atlas glossary-sync is a suggestion-returning projection, not an authoritative store — its write path is a PR against the ontology, never a commit to it. (Same shape as scratch → current promotion; Atlas is just another suggestion inbox alongside the generator’s minted candidates and the authored scratch notes.)
  • The lineup is the navigate / explore / curate layer — the one place ontology-projection, Atlas-sync, and the Qdrant retrieval text are seen together, and from which curation decisions are made.

The verification membrane (corollary of commitment #2). The anchors partition verification by where reality is authored. Inside the loop (the ontology and everything projected from it: DDL spine, corpus structure), verification is intrinsic: HermiT is the oracle, and correctness is proved, not checked. Validation-after-the-fact (shape checkers, assertion contracts, external catalogs) compensates for not generating-correct-by-construction and has no place inside the boundary. Extrinsic verification is legitimate only at the membrane where un-generated reality enters, namely grounding the corpus against FinePDFs (R1) and the held-out downstream eval, where the system honestly tests itself against a world it did not author. So anchors #2 (formal / intrinsic) and #3 (domain / membrane) are both required and not redundant: they verify opposite sides of the boundary.

Consequence: integrate freely, depend on nothing. Under in-situ RASE engineering, switching costs across the metadata/governance plane (OpenMetadata, OpenLineage, Atlas, Ranger, …) are ~zero, so integration carries no lock-in, which licenses deep, enthusiastic integration rather than abstention. We implement the standing directives in full: OL/Marquez compatibility (external tooling sees our lineage events) and Atlas deep integration (a richly extended glossary/lineage surface). This is non-dependence, not non-integration. The discipline that keeps it non-dependent: the ontology is the single source of truth, so Atlas and every projection stay rebuildable-from-the-ontology-or-it’s-a-bug, never a master (Atlas edits suggest, never commit). We sample patterns and adapt protocols on our terms; we owe the plane no gravity, and that freedom is exactly what makes integrating with it generously safe.

The maturation arc this enables. The current ontology form is the template+slot catalog (the seed crystal). The lineup curation is the forge that converts it into a real lexicon of concrete, Atlas-synced, Qdrant-indexed Terms (Lexicon / Category / Term ≡ Atlas Glossary / Category / Term — vocabulary already aligned). A term is “real” when it clears the contract (HermiT-coherent, R1-grounded, novel), passes curation, gains its AtlasGlossaryTerm + qualifiedName, and is Qdrant-indexed. As real terms accumulate, the spent template instantiations retire to archive/ — while the reusable axiom shapes stay live as the generator’s cross-domain exemplar pool (RASE-in-novel-domains needs them; archive ≠ delete). End state: templates are history and the lineup ≡ Atlas glossary ≡ Qdrant index — three views of one real ontology.

The two faces that make a term “real”: (1) Qdrant augmentation, recorded as SKOS annotation properties (built) — per the BERTSubs §4.3.2 multi-label technique, each term’s distinguishing text-features live in the ontology as skos:prefLabel / skos:altLabel (the BERTSubs multi-label set / MaxSim match surfaces) / skos:definition / skos:scopeNote / skos:example — common SKOS constructs with domain values, not novel ones. The lineup panel renders them, they assemble into the ColBERT/MaxSim retrieval text, and L(c1)×L(c2) over the altLabel sets multiplies the BERTSubs subsumption pairs (→ the hierarchy edges for per-term-panel navigation). build.py records them now (seeded from the term vocabulary); curation refines the altLabel set, verified by retrieval-lift (multi- vs single-label, the §4.3.2 ablation — the annotation-layer oracle, distinct from HermiT on the axiom layer). The subsumption hierarchy over those terms is now realized autonomously by mediate_hierarchy (scripts/, built 2026-06-17): mpnet candidates → Grok proposes (ACP) → a two-layer gate — HermiT consistency/coherence/acyclicity and domain vocabulary overlap — admits only verified edges, no human review (the tools are the arbiter). This is a RASE-pattern increment: the agent realizes the hierarchy capability through the meta-harness, intrinsically verified. The lineup renders the result as per-term Broader/Narrower navigation. (2) Atlas glossary-sync — the AtlasGlossaryTerm/Category projector (extending the existing rdbms_* relational projector), a suggestion-returning projection keyed on qualifiedName, carrying the same SKOS annotations as term attributes.
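The L(c1)×L(c2) multiplication mentioned above is a Cartesian product over the two classes' label sets. The sketch below shows only that combinatorial step; the class names and labels are made-up examples, and `label_pairs` is a hypothetical helper rather than the build.py implementation.

```python
from itertools import product

# Illustrative altLabel sets (including the prefLabel) for two classes.
labels = {
    "PaymentCard": {"payment card", "credit card"},
    "FinancialInstrument": {"financial instrument", "instrument"},
}

def label_pairs(child: str, parent: str, labels: dict) -> list:
    """All (child label, parent label) combinations for one candidate edge."""
    return sorted(product(sorted(labels[child]), sorted(labels[parent])))

pairs = label_pairs("PaymentCard", "FinancialInstrument", labels)
# 2 labels x 2 labels -> 4 text pairs for a single class pair
```

Each additional altLabel on either side adds a full row or column of text pairs, which is why curating the altLabel set changes the size of the subsumption training and matching surface.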

Methodology — factored gates over a scaling ladder

The programme is a high-dimensional, multi-objective optimization decomposed into a sequence of low-DOF gates, valid because the factorization respects the problem’s interaction structure:

  • 1-DOF where separable (architecture ⊥ data-mix; scale is a ladder, not a competing knob).
  • 2-factor where the interaction is the hypothesis — specifically α (ontology-corpus fraction) × β (SQL/DDL fraction): semantics × syntax of relations, the most likely super-additive effect.
  • Report Pareto slices, don’t scalarize — each gate’s output is a frontier (general vs relational vs DE-elucidation); the operating point is chosen once the surface is mapped, not at gate 1.
  • No scaled spend without a green gate (the standing EVIDENCE rule). Each milestone is a pre-registered EVIDENCE entry; a failure localizes to a named arrow, not a diffuse “it didn’t work.”
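Reporting a Pareto slice instead of a scalar score amounts to keeping every arm that no other arm beats on all axes. A minimal sketch, assuming higher is better on each axis; the arm names and scores are invented for illustration and `pareto_front` is a hypothetical helper.

```python
def pareto_front(points: dict) -> list:
    """Names of points not dominated by any other (higher is better on every axis)."""
    def dominates(a, b):
        # a dominates b if it is at least as good everywhere and better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))
    return sorted(
        name for name, p in points.items()
        if not any(dominates(q, p) for other, q in points.items() if other != name)
    )

# Hypothetical (general, relational, DE-elucidation) scores per data-mix arm.
arms = {
    "mix_a": (0.70, 0.55, 0.40),
    "mix_b": (0.68, 0.62, 0.45),
    "mix_c": (0.66, 0.50, 0.38),  # worse than mix_b on every axis
}
front = pareto_front(arms)
```

Both `mix_a` and `mix_b` survive because each wins on a different axis; choosing between them is the operating-point decision the text defers until the surface is mapped.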

Component map (the eventual GitHub project)

| Repo | Role in this workstream |
|---|---|
| aegir | the engine: ontology + reasoner + meta-harness + H-Net+RWKV + pretraining/eval |
| gaius | RASE metamodel + MetaAgent: the shared verifier discipline + calibration precedent |
| corpora (sdg-corpora) | the corpus / SKOS / DDL artifact: the publishable deliverable |
| oss-polyglot | the SQL/DDL syntactic axis: the β data amendments |
| atelier | independent pre-training efficacy gate (blind classification, reference withheld) |
| asf-atlas | provenance / lineage: the digital thread across the pipeline |
| hnet, rwkv-lm | reference architectures (dynamic chunking; RWKV-7 baselines + open corpus) |
| cldr/signals | the umbrella: boundary/signal contracts; the GitHub project’s home |

Milestones

M0 — Foundation (DONE, 2026-06-16). The substrate-evolution machinery: HermiT coherence gate (inc-2a), single-file harness (inc-2b), Meta-Harness outer loop + first discovered harness (inc-2c), realization-as-CPA beachhead (inc-2d). Committed; see EVIDENCE.md. The reasoner gates and computes; the harness evolves.

M1 — Architecture baseline (the H-Net isolation gate). Train H-Net+RWKV on RWKV-7’s open corpus, swapping only the tokenizer for byte-level dynamic chunking; establish parity across the scaling ladder to RWKV-7-matched params. DOF = 1 (the chunking change), corpus held constant. Gate: H-Net+RWKV ≥ RWKV-7 at matched scale on standard evals → architecture certified, isolated.

M2 — Instrument validity (the decisive corpus gate + proxy calibration). A same-architecture, matched-budget matrix at ≥2 ladder rungs:

  • arms: grounded mix / no-ontology ablation / standard-only (the ablation arm already exists).
  • α×β 2-factor cell replicated at two scales — one design yields both the interaction sign and the scale-drift of the mix optimum.
  • eval: FLOOR (grounded ≈ standard on general LM evals → non-degeneracy, the failure mode every upstream proxy is blind to) + LIFT (grounded > ablation on relational + DE-elucidation; cells-only, control tasks, PR-metrics, bootstrap CI).
  • side-product: calibrate the cheap proxies (R1 / coverage-close / corpus-quality) against the pretrain signal — the E1 / RASE calibration loop, run once to certify the proxies that drive iteration.
  • bound to quantify: the corpus’s max non-repetitive token yield (caps α at scale). Gate: floor held AND lift CI-clean AND α×β interaction + scale-drift characterized.
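The floor-and-lift part of the M2 gate can be sketched with a bootstrap confidence interval. Several things here are assumptions: the sketch uses a plain percentile bootstrap (not the BCa interval reported for the pilot), treats "CI-clean" as "the lower bound of the lift interval is above zero", introduces a hypothetical tolerance `floor_tol` for the floor, and does not model the α×β characterization condition at all.

```python
import random

def bootstrap_diff_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(a) - mean(b), resampling each arm."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(sum(ra) / len(ra) - sum(rb) / len(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

def m2_gate(general_gap, floor_tol, lift_ci):
    """Floor: grounded not worse than standard by more than floor_tol.
    Lift: the whole CI for grounded minus ablation lies above zero."""
    floor_ok = general_gap >= -floor_tol
    lift_ok = lift_ci[0] > 0
    return floor_ok and lift_ok

# Made-up per-item correctness flags on a relational eval.
grounded = [1] * 80 + [0] * 20
ablation = [1] * 50 + [0] * 50
lift_ci = bootstrap_diff_ci(grounded, ablation)
green = m2_gate(general_gap=-0.01, floor_tol=0.02, lift_ci=lift_ci)
```

Keeping the floor and the lift as separate booleans mirrors the failure localization in the text: a red result can be attributed to the floor arrow or the lift arrow rather than to a single blended score.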

M3 — Scale + the Final Phase Gate. Conditioned on M2 green: climb to RWKV-7-matched params, extrapolate α*(N)/β*(N) to target scale, confirm the lift persists. (Gate text below.)

M4 — The forward door (unlocked by the final gate). Iterate depth/breadth of the machinery: (a) close the generator loop — SAE-instrumented Qwen with process-reward fine-tuning (anti-Goodhart: reward the ontological-reasoning circuit, not just the verdict), the mutually-affirming ontology↔generator cycle anchored by the downstream model; (b) RASE in a novel domain — apply the pipeline to a second information domain with minimal re-tuning and measure what breaks (topic model, family complex, BFO anchoring, R1). Promotes “valid instrument” → “validated method.”

Final phase gate

At RWKV-7-matched scale, the ontology-grounded data mix yields an H-Net+RWKV model that (a) matches RWKV-7 on general/standard evals — the non-degeneracy floor — AND (b) exceeds the no-ontology-ablation control on relational understanding + de novo Data Element elucidation, CI-clean, with the α×β interaction and the mix-optimum scale-drift characterized.

Passing certifies the ontology machinery as a valid instrument for relational domain adaptation and authorizes the RASE-generalization phase (M4 → novel domains). Failing localizes the break to a named arrow — ontology→corpus degeneracy (M2 floor), corpus→model transfer (M2 lift), or scale-drift (M3) — each of which has its own remediation. This gate supersedes the proxy-only corpus-as-deliverable gate: the proxies are calibrated by it, not trusted ahead of it.

Three external anchors (held live throughout)

The symbolic co-evolution (ontology ↔ generator ↔ harness) optimizes formal + proxy signals only; three non-co-adapting anchors keep it honest: the reasoner (formal ground truth), the corpus (empirical fit), and the held-out H-Net+RWKV (behavioral domain truth). No milestone closes on proxies alone.

End-to-end pipeline + Meta-Harness + reasoner activation — spec

Status: design of record (2026-06-16). Organizes the next phase. Supersedes the control-plane framing in meta_harness_boundary.md (the RETE/FSM spine is demoted to harness H₀, see §2). Two reframes drive it: (a) Meta-Harness (Lee et al. 2026, build/resources/2603.28052.pdf) — optimize the harness (the program wrapping a frozen executor) via an outer-loop coding-agent proposer over a filesystem of candidates; (b) the ontology IS the dynamic computational framework — a reasoner-backed (HermiT) artifact we evolve in situ from domain inputs. Discipline (the lesson that produced this spec): adopt the form, not the language — every new piece must ground out in concrete, measured computation.

0. The end-to-end flow (the complete workflow — keep this in view)

A filesystem-DAG. Each stage writes a dir with manifest.json + a run_id that hashes its inputs → content-addressed lineage (no orchestration engine required; see §5).

S0 INPUT      finepdfs-lab corpus  (/raid/datasets/aegir-corpus-v1/finepdfs-lab/)
S1 COVERAGE   ontology_coverage_audit.py → coverage_v1/<run>/ {topic_coverage.parquet,
              topic_centroids.npy, manifest}   [FinePDFs → topics → gap/borderline/covered]
S2 EVOLVE ◄── THE META-HARNESS. mediate.py (spine + ACP/Grok mint + ContractGate[+reasoner])
   │          → evidence/meta_harness/<run>/ {trace, scorecard, candidates.candidate.json}
   │          [gap topic → mint construct → gate (DeepOnto+polyglot+R1+novelty+schema+CONSISTENCY)
   │           → promote]. THIS stage is what the OUTER LOOP (§2) optimizes.
S2' REVIEW    charter: editorial review → promote .candidate → catalog family files (human-in-loop)
S3 DDL/SKOS   ddl.py (template_to_table→render_ddl→validate_ddl/polyglot) + build_skos_vocab.py
              → DDL spine + 548-concept SKOS + Atlas rdbms_* projection
S4 CORPUS     generate_chapter.py (ontology+DDL → chapters + verifiable JSON + reasoning traces)
              → chapters.parquet + raw.exchange
S5 VERIFY     verify_chapters.py → raw.chapter_verification
S6 RELEASE    build_atelier_release.py (columns/vocabulary/reference blind benchmark) + HF/GitHub
S7 PRETRAIN   train_pretrain.py → byte model
S8 EVAL       eval_cells_cta / eval_edge_probe / REALIZATION-CPA (§3c) → column/relational skill

Feedback edges (the loops): S2’s grown ontology → re-run S1 (coverage-close); S5/S8 scores → the OUTER LOOP reward (§2); S1 gaps → S2 targets. The convergence loop = S1→S2→S3→S4→(S7→S8)→back.
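The content-addressed lineage described at the top of this section (each stage writes a directory with manifest.json and a run_id that hashes its inputs) can be sketched as below. The manifest fields, the 16-character truncation, and the `run_id_for` / `write_stage` helpers are illustrative assumptions, not the project's actual schema.

```python
import hashlib
import json
import tempfile
from pathlib import Path

def run_id_for(stage: str, inputs: dict) -> str:
    """Hash the stage name plus its inputs (name -> content hash or parameter)."""
    payload = json.dumps({"stage": stage, "inputs": inputs}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

def write_stage(root: Path, stage: str, inputs: dict) -> Path:
    """Create <root>/<stage>/<run_id>/manifest.json and return the run directory."""
    rid = run_id_for(stage, inputs)
    out = root / stage / rid
    out.mkdir(parents=True, exist_ok=True)
    manifest = {"stage": stage, "run_id": rid, "inputs": inputs}
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2, sort_keys=True))
    return out

with tempfile.TemporaryDirectory() as tmp:
    s1 = write_stage(Path(tmp), "coverage_v1", {"corpus": "finepdfs-lab"})
    # The downstream stage lists the upstream run_id among its inputs,
    # so changing anything upstream changes every run_id below it.
    s2 = write_stage(Path(tmp), "meta_harness", {"coverage_run": s1.name})
    s2_manifest = json.loads((s2 / "manifest.json").read_text())
```

Because the identifier is derived from the inputs, re-running a stage with unchanged inputs lands in the same directory, and no orchestration engine is needed to tell which outputs are stale.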

1. The two frozen executors (the parallel)

| Meta-Harness (paper) | Our pipeline |
|---|---|
| frozen LLM M | frozen Grok (the minting model) AND frozen HermiT (the reasoner) |
| evolved harness H (a program) | the generation harness (S2) AND the ontology O (a reasoner-executed program) |
| outer-loop coding-agent proposer | the Meta-Harness loop (§2) |
| reward = task accuracy | R1/coverage-close + consistency + realization-accuracy (Pareto vs cost) |

We optimize TWO artifacts against TWO frozen executors: the harness (around Grok) and the ontology (around HermiT). The harness grows the ontology; the reasoner makes the ontology executable.

2. Meta-Harness outer loop (the FORM)

  • A harness H = a single-file program: run(topic, gate) -> (construct, signals) — builds the mint prompt (contract + topic salient terms + exemplars), calls Grok (frozen), gates, iterates. H₀ = the current inc-1 harness, refactored to one clean program (the RETE/FSM ceremony pruned to a minimal loop; let the proposer re-introduce structure only if it earns reward).
  • Candidate filesystem D (the feedback channel): candidates/{NNN}/{harness.py, traces/, scores.json}. Full, uncompressed — NOT the scalar signal vector (the anti-pattern the paper beats).
  • Proposer P = a coding agent (Claude Code/Opus, or Grok-as-coder) + a minimal skill (where to write harnesses, how to grep/cat prior code+traces, what it may edit). It diagnoses from raw traces and rewrites the harness (local edit → full rewrite).
  • Eval / reward = run H on a SEARCH SET of gap topics → batch on-vs-shuffled R1 / coverage-close + cost (Grok tokens, iters) → Pareto frontier. Proposer never sees the HELD-OUT topic set.
  • Loop (Algorithm 1): evaluate initial {H₀,…} → for N iters: P reads D, proposes k harnesses, interface-validate + evaluate + log → return Pareto frontier; final eval on held-out.
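The loop shape in the bullets above can be sketched with toy stand-ins. Everything concrete here is an assumption: the in-memory `log` stands in for the candidate filesystem D, `make_harness` and `propose` are toy placeholders for real harness programs and the coding-agent proposer, and the interface-validation and held-out evaluation steps are omitted.

```python
def evaluate(harness, search_set):
    """Return (quality, cost) for a harness; higher quality, lower cost is better."""
    results = [harness(topic) for topic in search_set]
    quality = sum(q for q, _ in results) / len(results)
    cost = sum(c for _, c in results)
    return quality, cost

def pareto(cands):
    """Keep candidates that no other candidate beats on both quality and cost."""
    def dominated(a):
        return any(
            b["quality"] >= a["quality"] and b["cost"] <= a["cost"]
            and (b["quality"] > a["quality"] or b["cost"] < a["cost"])
            for b in cands
        )
    return [c for c in cands if not dominated(c)]

def outer_loop(initial, propose, search_set, n_iters, k):
    """Evaluate the seeds, then let the proposer read the full log each iteration."""
    log = []  # stands in for the candidate filesystem D
    for h in initial:
        q, c = evaluate(h, search_set)
        log.append({"harness": h, "quality": q, "cost": c})
    for _ in range(n_iters):
        for h in propose(log, k):
            q, c = evaluate(h, search_set)
            log.append({"harness": h, "quality": q, "cost": c})
    return pareto(log)

def make_harness(effort):
    # Toy stand-in: more effort buys quality up to a ceiling, at linear cost.
    def harness(topic):
        return min(1.0, 0.25 * effort), effort
    harness.effort = effort
    return harness

def propose(log, k):
    # Toy proposer: reads the log and tries the next k effort levels.
    top = max(entry["harness"].effort for entry in log)
    return [make_harness(top + i) for i in range(1, k + 1)]

frontier = outer_loop([make_harness(1)], propose, ["t1", "t2"], n_iters=2, k=2)
```

In this toy run the highest-effort candidate is dropped from the frontier because a cheaper candidate reaches the same quality, which is the cost-aware selection the reward bullet describes.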

3. HermiT reasoner activation (make the ontology executable — the FORM, not the word)

HermiT is the sound-and-complete deductive KERNEL (hypertableau, full OWL 2 DL): sound = no false entailments, complete = no missed ones. It is the only formally-guaranteed layer — so consistency, classification, and realization are ground truth, not proxies (R1, verbalize, Grok, the model are the heuristic/stochastic shell; HermiT is the arbiter). DeepOnto integrates it natively: Ontology(path, reasoner_type="hermit") (the DEFAULT — already instantiated on every probe_template load, just never queried) exposes check_consistency(), get_inferred_super_entities/sub_entities(), get_instances(). So activation is calling the loaded reasoner, not wiring one. Three concrete, measured computations:

  • (a) Consistency gate [beachhead]. After a construct passes the syntactic gates, HermiT consistency-checks the cumulative ontology (seed ∪ admitted ∪ candidate). New ContractGate signal consistent; reject if it makes O inconsistent. A deductive check nothing syntactic can do: it is what keeps the in-situ-evolving ontology a coherent computation. Measure: rejections-for-inconsistency; O provably consistent as it grows.
  • (b) Inferred hierarchy (classification). Coverage/structure read HermiT’s inferred subsumption closure, not the asserted SubClassOf chains.
  • (c) Realization-as-CPA [re-homes G-rel]. Map the corpus’s verifiable-JSON rows → an OWL ABox → HermiT realize → the column/entity types & relations computed by the reasoner. CPA/CTA becomes inference, not a tiny-model probe (which floored → G-rel descoped). Eval = realization accuracy vs the held-out reference.parquet. The relational computation relocates to the reasoner; the model becomes a fast amortization of it, not the thing that must learn it.
  • Caveats (real): generated complex-class constructs push the OWL profile's expressivity; keep near OWL 2 EL only as a SPEED fallback if HermiT slows at batch scale (worst-case OWL 2 DL reasoning is N2EXPTIME-complete, but hypertableau is practically tractable on modular BFO/CCO ontologies); the ABox bridge (DDL/JSON → assertions) for realization is a genuine new pipeline piece. The reasoner is already instantiated by DeepOnto (default reasoner_type="hermit"), so activation = calling check_consistency()/get_instances(), not new wiring.
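The control flow of the consistency gate in (a) can be sketched without a reasoner. The sketch below is a minimal illustration under stated assumptions: `toy_is_consistent` is a hypothetical stand-in that only understands subclass, disjointness, and type assertions, and `consistency_gate` is an illustrative name, not the ContractGate implementation. In the real pipeline the `is_consistent` callable would be backed by HermiT.

```python
from typing import Callable, FrozenSet, Iterable, Tuple

# An axiom is (kind, subject, object); kinds used here: "subclass", "disjoint", "type".
Axiom = Tuple[str, str, str]


def toy_is_consistent(axioms: FrozenSet[Axiom]) -> bool:
    """Hypothetical stand-in for a reasoner: an individual may not belong to two disjoint classes."""
    supers = {}
    for kind, sub, sup in axioms:
        if kind == "subclass":
            supers.setdefault(sub, set()).add(sup)

    def ancestors(cls: str) -> set:
        seen, stack = {cls}, [cls]
        while stack:
            for parent in supers.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    types = {}
    for kind, individual, cls in axioms:
        if kind == "type":
            types.setdefault(individual, set()).update(ancestors(cls))
    disjoint = [(a, b) for kind, a, b in axioms if kind == "disjoint"]
    return not any(a in ts and b in ts for ts in types.values() for a, b in disjoint)


def consistency_gate(
    seed: Iterable[Axiom],
    admitted: Iterable[Axiom],
    candidate: Iterable[Axiom],
    is_consistent: Callable[[FrozenSet[Axiom]], bool] = toy_is_consistent,
) -> dict:
    """Check the cumulative ontology (seed ∪ admitted ∪ candidate) and emit the `consistent` signal."""
    cumulative = frozenset(seed) | frozenset(admitted) | frozenset(candidate)
    return {"consistent": is_consistent(cumulative)}


seed = {("disjoint", "Person", "Organization")}
admitted = {("subclass", "Employee", "Person"), ("type", "acme", "Organization")}
good_candidate = {("type", "alice", "Employee")}
bad_candidate = {("type", "acme", "Employee")}  # acme would be both Organization and Person

good_verdict = consistency_gate(seed, admitted, good_candidate)
bad_verdict = consistency_gate(seed, admitted, bad_candidate)
```

The point of the shape is that the candidate is never judged in isolation: the bad candidate is syntactically fine and only fails against what was already admitted.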

4. How they compose

The outer loop (§2) optimizes the harness that grows the ontology; the reasoner (§3) makes the ontology self-consistent and executable; the end-to-end DAG (§0) is where both live. One sentence: a coding agent evolves the program that grows a reasoner-backed, domain-adaptive ontology, judged by what the reasoner and the corpus compute.

5. Orchestration stance (the Airflow question)

  • Now: the filesystem-DAG (§0) + thin drivers (just recipes + small Python runners) + manifest.json/run_id content-addressed lineage. This carries the whole-workflow understanding (legibility) without runtime complexity, and matches the Meta-Harness grain (filesystem + agent, not a DAG engine). The candidate filesystem D (§2) is the same substrate.
  • NOT Airflow now: it’s a runtime orchestrator for stable/recurrent/scheduled flows; ours is in flux, and Airflow’s scheduler/DB/webserver ceremony would ossify a flow we’re still discovering — and over-orchestrate the part the proposer should navigate.
  • Later (convergence-loop maturity): a lightweight orchestrator — Metaflow (the Gaius precedent) or OpenLineage→Atlas (the project’s existing provenance direction) — when S1→S2→…→S8 runs recurrently and lineage/scheduling pays off. components/ (cldr/signals) holds Airflow if we ever need it; default no.

6. Increment ladder

  • inc-2a (beachhead): HermiT consistency gate in ContractGate (consistent signal over the cumulative ontology) + a seed rule. Smallest real reasoner computation; immediately makes O coherent.
  • inc-2b: H₀-clean — refactor the spine+mint+gate into a single-file harness program with a run(topic, gate) interface; stand up candidates/{NNN}/ + interface validation.
  • inc-2c: the Meta-Harness outer loop — proposer + minimal skill + search/eval/Pareto over the candidate filesystem; reward = batch R1/coverage-close vs cost on the search set.
  • inc-2d: realization-as-CPA — the ABox bridge + HermiT realize + the symbolic-CPA eval vs the held-out reference (G-rel re-homed).

7. Reward / decision rules (measurement, so this stays form not language)

  • Harness search reward: batch on-vs-shuffled R1 / coverage-close on the search set, Pareto vs Grok cost; a discovered harness must beat H₀’s frontier on held-out topics to be adopted.
  • Reasoner: consistency-gate must reject ≥1 genuinely-inconsistent construct (instrument validity) and keep O consistent as it grows; realization-CPA valid iff accuracy > control on the v0.3 backbone-free symbolic path, CI-clean vs the held-out reference.
  • Every increment chains to one of these numbers or it does not ship (the standing rule).
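The harness-search rule compares candidates on reward against cost, which is a two-objective Pareto comparison. A minimal sketch follows; `Candidate` and `pareto_frontier` are hypothetical names, and the reward and cost figures are invented for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    name: str
    reward: float  # higher is better (e.g. batch R1 / coverage-close on the search set)
    cost: float    # lower is better (e.g. tokens spent)


def dominates(a: Candidate, b: Candidate) -> bool:
    """a dominates b if it is no worse on both axes and strictly better on at least one."""
    no_worse = a.reward >= b.reward and a.cost <= b.cost
    strictly_better = a.reward > b.reward or a.cost < b.cost
    return no_worse and strictly_better


def pareto_frontier(candidates: List[Candidate]) -> List[Candidate]:
    """Keep every candidate that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


# Illustrative numbers only.
candidates = [
    Candidate("H0", reward=0.39, cost=100.0),
    Candidate("H1", reward=0.45, cost=180.0),  # dominated by H3 (same reward, higher cost)
    Candidate("H2", reward=0.35, cost=150.0),  # dominated by H0
    Candidate("H3", reward=0.45, cost=120.0),
]
frontier = pareto_frontier(candidates)
```

Under the adoption rule, a discovered harness would additionally have to beat H₀'s frontier on the held-out topics, which this sketch does not model.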

Verification

  1. Control plane unchanged where reused; H₀ run reproduces inc-1 (t124 R1 ≈0.39, promote).
  2. inc-2a: consistency gate rejects a hand-crafted inconsistent construct; passes the t124 construct; O stays consistent across a batch.
  3. inc-2c: a discovered harness beats H₀ on held-out coverage-close.
  4. inc-2d: realization-CPA selectivity CI-clean vs reference. Artifacts under evidence/ per stage.

Agent-Mediated Meta-Harness — Signals & Boundaries spine

Status: design of record (2026-06-16). Implements the reactive pivot of just mediate (the agent-mediated reference-ontology builder) as a RETE/FSM control spine within explicit boundary conditions (Holland, Signals and Boundaries). This is a spine, not a feature — built clean and complete up front, because a cogent fact/rule/agenda/FSM architecture cannot be retrofitted out of a procedural loop.

1. The Holland mapping (what the spine actually is)

This is not decoration; it names the structure we established empirically.

| Holland S&B | Aegir meta-harness |
|---|---|
| Boundary / membrane | the contract: the conjunctive gate suite that admits a construct as “inside the ontology”: DeepOnto verbalizes + asserts a complex class · Polyglot validates the DDL · R1 topic-specificity · E6-A topic-survival · BFO-anchor · novelty · schema-realism |
| Signals | the gate verdicts: the dense per-construct/topic scorecard the engine reasons over |
| Internal model / tags | the objectives (pre-registered, scored) the agent navigates by |
| Bounded adaptive agent | the FSM: a signal-responsive navigator that “tacks and jibes” toward a contract-passing construct, never sailing straight into the contract (the one-shot generator proved you can’t) |
| Emergence (many agents) | the scale end-state: concurrent multi-topic mediation as supervised actors; this is where the full CAS lives, and where a true RETE discrimination network earns its keep |

The membrane is the load-bearing asset; the producer is swappable (one-shot LLM → agent → hand). The harness is the producer that navigates the membrane by signal.

2. The spine API (stable; this is the part you cannot retrofit)

Fact(kind, payload, id, rev)                 # unit of working memory
WorkingMemory                                # assert / update / retract / query; global revision counter
Rule(name, salience, when(signals,ctx)->bool, then(ctx)->Effect)   # production rule
Agenda                                       # conflict set → resolve (salience ↓, specificity ↓, recency ↓) → fire one/cycle; log the full set + choice
Objective(name, score(signals)->float, satisfied(signals)->bool)   # scored, pre-registered
Effector  (protocol, EXTERNAL to the spine):
    mint(topic, feedback) -> construct       # the LLM agent (stochastic)
    gate(construct, topic) -> signals        # the deterministic contract (DeepOnto, polyglot, R1, …)
Trace                                        # append-only JSON: every fact change, rule firing, conflict
                                             #   resolution, state transition, effector call

Matching is a swappable internal, not part of the spine. Rule.when is the alpha-test layer. The first implementation evaluates rules by direct scan (correct, O(rules) per cycle — trivial at our scale). A full RETE discrimination network (alpha/beta nodes, join sharing, token propagation) drops in behind this unchanged API when working memory grows to many concurrent constructs × topics. Swapping the matcher is licensed because the spine is clean; it is not retrofitting the spine.
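A minimal sketch of the working-memory and agenda mechanics with the direct-scan matcher. It is an illustration under assumptions, not the engine core: method names such as `assert_fact` and `fire_one` are hypothetical, the `Effect` is reduced to a plain string, and conflict resolution uses salience only (the specificity and recency tie-breaks are omitted).

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Fact:
    kind: str
    payload: Any
    id: int
    rev: int


class WorkingMemory:
    def __init__(self) -> None:
        self._facts: Dict[int, Fact] = {}
        self._next_id = 0
        self.revision = 0  # global revision counter, bumped on every change

    def assert_fact(self, kind: str, payload: Any) -> Fact:
        self.revision += 1
        fact = Fact(kind, payload, self._next_id, self.revision)
        self._facts[fact.id] = fact
        self._next_id += 1
        return fact

    def update(self, fact_id: int, payload: Any) -> Fact:
        self.revision += 1
        fact = self._facts[fact_id]
        fact.payload, fact.rev = payload, self.revision
        return fact

    def retract(self, fact_id: int) -> None:
        self.revision += 1
        del self._facts[fact_id]

    def query(self, kind: str) -> List[Fact]:
        return [f for f in self._facts.values() if f.kind == kind]


@dataclass
class Rule:
    name: str
    salience: int
    when: Callable[[dict, dict], bool]
    then: Callable[[dict], str]


class Agenda:
    def __init__(self, rules: List[Rule]) -> None:
        self.rules = list(rules)
        self.log: List[dict] = []  # full conflict set + choice, one entry per cycle

    def fire_one(self, signals: dict, ctx: dict) -> Optional[str]:
        # Direct scan: evaluate every rule's `when`; O(rules) per cycle.
        conflict_set = [r for r in self.rules if r.when(signals, ctx)]
        chosen = max(conflict_set, key=lambda r: r.salience) if conflict_set else None
        self.log.append({
            "conflict_set": [r.name for r in conflict_set],
            "chosen": chosen.name if chosen else None,
        })
        return chosen.then(ctx) if chosen else None


rules = [
    Rule("deeponto_fail", 90, lambda s, c: not s.get("deeponto_ok", False), lambda c: "fix_verbalizability"),
    Rule("no_construct", 100, lambda s, c: not s.get("has_construct", False), lambda c: "draft_initial"),
]
wm = WorkingMemory()
agenda = Agenda(rules)
signal = wm.assert_fact("signal", {"has_construct": False, "deeponto_ok": False})
effect = agenda.fire_one(signal.payload, ctx={})
```

Because callers only see `fire_one`, a RETE network could replace the list comprehension without touching `Rule`, `WorkingMemory`, or the log format.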

3. The FSM (5 states, rule-driven transitions)

OBSERVE_CONTRACT  →  gate current construct (or seed) → assert signal facts
SELECT_OBJECTIVE  →  run the rule cycle → highest-salience rule sets the `objective` fact
ACT_AND_SIGNAL    →  invoke Effector.mint for the objective (with feedback from signals)
EVALUATE_FEEDBACK →  Effector.gate the new construct → update signal facts
TERMINATE_OR_ITERATE → a rule fires success (promote) or give-up (max iters); else → OBSERVE

The FSM is the outer loop; rules decide the objective (SELECT) and termination (TERMINATE). Signal facts are asserted in OBSERVE/EVALUATE. Every state is itself a fact, so rules match on (state, signals). Determinism lives here.
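The five-state loop can be sketched as a small driver over the two effectors. This is a simplified illustration: `run_fsm`, `contract_ok`, and the fixture effectors are hypothetical, termination is a plain check rather than a rule firing, and the fixtures only exercise the control flow. They make no claim about capability, in line with the fixture discipline in the determinism section.

```python
from enum import Enum, auto


class State(Enum):
    OBSERVE_CONTRACT = auto()
    SELECT_OBJECTIVE = auto()
    ACT_AND_SIGNAL = auto()
    EVALUATE_FEEDBACK = auto()
    TERMINATE_OR_ITERATE = auto()


def run_fsm(topic, mint, gate, select_objective, max_iters=5):
    """Drive mint/gate through the five states; return (outcome, construct, trace)."""
    trace = []
    state = State.OBSERVE_CONTRACT
    construct, signals, objective, iterations = None, {}, None, 0
    while True:
        trace.append(state.name)
        if state is State.OBSERVE_CONTRACT:
            signals = gate(construct, topic)          # gate the current construct (or seed)
            state = State.SELECT_OBJECTIVE
        elif state is State.SELECT_OBJECTIVE:
            objective = select_objective(signals)     # stands in for the rule cycle
            state = State.ACT_AND_SIGNAL
        elif state is State.ACT_AND_SIGNAL:
            construct = mint(topic, {"objective": objective, "signals": signals})
            iterations += 1
            state = State.EVALUATE_FEEDBACK
        elif state is State.EVALUATE_FEEDBACK:
            signals = gate(construct, topic)          # deterministic contract on the new construct
            state = State.TERMINATE_OR_ITERATE
        else:
            if signals.get("contract_ok"):
                return "promote", construct, trace
            if iterations >= max_iters:
                return "give_up", construct, trace
            state = State.OBSERVE_CONTRACT


# Fixture effectors: designed signal scenarios, not a model of a real agent.
def fixture_mint(topic, feedback):
    return {"topic": topic, "quality": feedback["signals"].get("quality", 0) + 1}


def fixture_gate(construct, topic):
    quality = 0 if construct is None else construct["quality"]
    return {"quality": quality, "contract_ok": quality >= 2}


outcome, construct, trace = run_fsm("t124", fixture_mint, fixture_gate, lambda s: "refine")
stuck = run_fsm("t124", lambda t, f: {"topic": t, "quality": 0}, fixture_gate, lambda s: "refine", max_iters=3)
```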

4. The signal vector (dense facts the engine reasons over)

Per (topic, construct): r1_on, r1_shuffled, r1_ci_low (on−shuffled bootstrap CI low), deeponto_ok, deeponto_complex (asserted complex class), polyglot_ok, novelty (max cos vs seed∪admitted), schema_entropy (per-anchor column-set entropy vs SchemaPile), n_cols, iterations, state.
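The r1_ci_low signal is the lower end of a bootstrap confidence interval on the on−shuffled difference. A minimal sketch of one way to compute such a bound, assuming a percentile bootstrap over two independent samples of per-item R1 scores; the resampling scheme, `n_boot`, and `alpha` are assumptions, and the sample values are invented.

```python
import random
from typing import Sequence


def bootstrap_ci_low(
    on: Sequence[float],
    shuffled: Sequence[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> float:
    """Lower percentile-bootstrap bound on mean(on) - mean(shuffled)."""
    rng = random.Random(seed)  # seeded so the signal is reproducible
    diffs = []
    for _ in range(n_boot):
        on_mean = sum(rng.choices(on, k=len(on))) / len(on)
        sh_mean = sum(rng.choices(shuffled, k=len(shuffled))) / len(shuffled)
        diffs.append(on_mean - sh_mean)
    diffs.sort()
    return diffs[int((alpha / 2) * n_boot)]


# Illustrative scores only.
on_topic = [0.40, 0.38, 0.42, 0.39, 0.41]
shuffled = [0.01, 0.02, 0.00, 0.01, 0.02]
ci_low = bootstrap_ci_low(on_topic, shuffled)
```

A construct is "CI-clean" in the sense used by the rules when this bound is strictly positive; when the two samples come from the same distribution the bound falls below zero.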

5. Seed rules (8–12, grounded in the committed gates; salience in brackets)

  1. no_construct [100] → objective draft_initial
  2. deeponto_fail [90] → fix_verbalizability (membrane floor: unparseable ⇒ no cognition)
  3. not_complex [88] → fix_nontriviality
  4. polyglot_fail [80] → fix_ddl (views need cogent relational schema + values)
  5. r1_not_specific [70] (r1_ci_low <= 0) → enrich_domain_terms (the binding constraint)
  6. novelty_low [60] → diversify (not a seed duplicate)
  7. schema_canned [50] (schema_entropy < floor) → de_can
  8. r1_improving [40] (r1_on rose but not CI-clean, iters<max) → refine
  9. contract_satisfied [120] (all gates pass ∧ r1_ci_low > 0) → promote (terminate-success)
  10. budget_exhausted [110] (iterations >= max) → give_up (terminate, logged)

Higher-salience termination/floor rules dominate; the agenda logs the full conflict set each cycle.
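The seed rules can be written as data and resolved by salience. The sketch below is illustrative: the thresholds (`MAX_ITERS`, `NOVELTY_MAX_COS`, `ENTROPY_FLOOR`), the `has_construct` and `r1_on_prev` signal keys, and the exact predicate bodies are assumptions; only the rule names, saliences, and objectives come from the list above.

```python
MAX_ITERS = 8            # hypothetical iteration budget
NOVELTY_MAX_COS = 0.90   # hypothetical ceiling on max cosine vs seed ∪ admitted
ENTROPY_FLOOR = 1.0      # hypothetical schema_entropy floor


def has(s):
    return s["has_construct"]


def gates_pass(s):
    return (s["deeponto_ok"] and s["deeponto_complex"] and s["polyglot_ok"]
            and s["novelty"] <= NOVELTY_MAX_COS and s["schema_entropy"] >= ENTROPY_FLOOR)


# (name, salience, when, objective)
SEED_RULES = [
    ("no_construct", 100, lambda s: not has(s), "draft_initial"),
    ("deeponto_fail", 90, lambda s: has(s) and not s["deeponto_ok"], "fix_verbalizability"),
    ("not_complex", 88, lambda s: has(s) and s["deeponto_ok"] and not s["deeponto_complex"], "fix_nontriviality"),
    ("polyglot_fail", 80, lambda s: has(s) and not s["polyglot_ok"], "fix_ddl"),
    ("r1_not_specific", 70, lambda s: has(s) and s["r1_ci_low"] <= 0, "enrich_domain_terms"),
    ("novelty_low", 60, lambda s: has(s) and s["novelty"] > NOVELTY_MAX_COS, "diversify"),
    ("schema_canned", 50, lambda s: has(s) and s["schema_entropy"] < ENTROPY_FLOOR, "de_can"),
    ("r1_improving", 40, lambda s: has(s) and s["r1_on"] > s["r1_on_prev"]
        and s["r1_ci_low"] <= 0 and s["iterations"] < MAX_ITERS, "refine"),
    ("contract_satisfied", 120, lambda s: has(s) and gates_pass(s) and s["r1_ci_low"] > 0, "promote"),
    ("budget_exhausted", 110, lambda s: s["iterations"] >= MAX_ITERS, "give_up"),
]


def select_objective(signals):
    """Return (objective, conflict-set names); the highest-salience matching rule wins."""
    conflict_set = [(sal, name, obj) for name, sal, when, obj in SEED_RULES if when(signals)]
    if not conflict_set:
        return None, []
    _, _, objective = max(conflict_set, key=lambda t: t[0])
    return objective, [name for _, name, _ in conflict_set]


# Illustrative signal vector: everything passes except topic-specificity.
BASE = dict(has_construct=True, deeponto_ok=True, deeponto_complex=True, polyglot_ok=True,
            r1_on=0.05, r1_on_prev=0.02, r1_ci_low=-0.01, novelty=0.5,
            schema_entropy=2.0, iterations=2)
objective, fired = select_objective(BASE)
```

Here two rules match (r1_not_specific and r1_improving) and the binding-constraint rule wins on salience; once the budget is spent, budget_exhausted outranks both.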

6. Objectives (pre-registered, composable, scored)

draft_initial · enrich_domain_terms · fix_verbalizability · fix_nontriviality · fix_ddl · diversify · de_can · refine · promote · give_up. Each scores its attainability from the current signals; the rules select; the effector executes. Objectives are the only extension point for new behaviour — add an objective + a rule, never a procedural branch.

7. Determinism & reproducibility (precise claim)

  • Control plane is deterministic: given the same signal facts, the same rules fire in the same order (salience → specificity → recency); every firing/transition/conflict-resolution is logged.
  • Generation is stochastic but gated: the LLM effector samples; we do not pretend otherwise. Reproducibility = log every exchange (we already capture reasoning_content to raw.exchange) + seeds. The membrane (contract) is deterministic, so a stochastic mint is always judged identically.
  • Intelligent by default — reasoning is never simulated. Reasoning an ontology out of FinePDFs is irreducibly intelligent; the production effector is ALWAYS a real agent (Qwen / our fine-tunes / Grok). There is no “stub mode” that produces ontology — a deterministic effector that marched a metric upward by formula would be pachinko: it would make the spine read “validated” while the only hard thing is absent from the test. The control plane is tested with fixtures (recorded real exchanges, preferred; or designed signal scenarios = data) that exercise the rule/FSM logic and make no claim about capability. Capability is validated only by a real agent moving R1 (inc-1).

8. Logging substrate (aegir’s, not assumed Atelier infra)

JSON sidecars under evidence/meta_harness/<run>/: trace.jsonl (event stream), scorecard.json (final per-topic contract verdicts), candidates.json (admitted constructs). Exchanges → the existing raw.exchange Iceberg table. No “leaderboard gateway / BDD suite” is assumed — those are not in aegir.

9. Boundaries respected (the anti-patterns)

  • No micro-orchestration: nothing calls just mediate --topics X per tool; the FSM invokes effectors as actions chosen by rules from signals. Sequence is emergent, not scripted.
  • No promotion without the full membrane: a construct is admitted only when the conjunctive contract passes — this is the Goodhart guard (you cannot term-stuff R1 without also clearing DeepOnto-non-triviality, schema-realism/de-canning, novelty, survival). Optimize on a topic’s terms; validate R1 on held-out topics.
  • Architecture must move the metric: the spine earns its complexity only by producing better contract-passing constructs more reliably than a dumb loop. Every increment chains to an R1 / contract delta or it does not ship.

10. Inaugural success criterion (the metric gate on the spine itself)

Run the spine on the gap topics the one-shot generator failed (on-topic R1 ≈ 0.007, Δ+0.003 CI touching 0). The spine is a success iff it drives on-topic R1 to CI-clean-positive on ≥1 of those topics, with every rule activation and state transition logged. “It ran an iteration” is not success; moving the metric is.

Increment ladder

  • inc-0 (this design + engine core): clean spine — facts/WM/rules/agenda/FSM/objectives/trace, naive matcher, stub effector, deterministic end-to-end self-test.
  • inc-1: wire the real effectors (LLM mint; gate = probe_template + render_ddl/polyglot + coverage_r1 + schema_complexity); run the inaugural R1 capability proof.
  • inc-2+: reactivity (message-driven, back-pressure, supervision) + concurrent multi-topic actors (the Holland emergence / scale end-state); adopt the RETE discrimination network behind the matcher API when working-memory growth justifies it; evolve toward / merge with swarm/.

Agent-Mediated Refinement Loop (adopted 2026-06-24)

A layered verification cascade that turns single-shot chapter generation (a truncated creative process) into an iterative, membrane-gated refinement loop. The ACP-wrapped agent is the proposer only; every gate is a deterministic membrane effector the agent cannot run or bypass. Extends the meta-harness (meta_harness_boundary), the HermiT membrane (reasoning_gates.py), the DDL spine, and the corpus pipeline.

Why

Audit (docs/scratch/2026-06-23/audit_chapter_quality_methodology.md) found the corpus structurally rich but semantically hollow: concept-salad assembly, ~9% placeholder cells, an L1 mix that puts ASHRAE 62.1 in a medical-imaging column and makes the model rationalize it, and no depth/refinement. These are symptoms of a single LLM pass with no review and no ground-truth re-entry. The fix is not six patches — it is an agent that proposes, is gated by the membrane, and re-enters with a typed critique. The patches become the agent’s toolbelt; the membrane stays the oracle.

Substrate mapping

  • Orchestrator — the meta-harness FSM (src/aegir/meta_harness/fsm_rete.py): states PROPOSE → VALUE_GATE → RI_GATE → PROSE_GATE → COMMIT, with a CRITIQUE loop-back. RETE rules fire one effector per cycle; the Agenda orders the cascade; the append-only trace is the provenance.
  • Proposer (PROPOSE effector): hermes-agent AIAgent (/home/rch/local/src/oss/hermes-agent/run_agent.py:437), driven as a library, pointed at the local vLLM OpenAI endpoint. Proposes candidate values + scaffold edits + prose into build/dev/scratch/. Scaffold tools (scaffold/synth_column, scaffold/rebuild_table, scaffold/draft_prose) registered via registry.register; handlers call our ontology-realization code.
  • Membrane (gate effectors — deterministic, OUR code, run by the FSM not the agent):
    • VALUE_GATE — value-level HermiT: extend reasoning_gates.render_batch with a value-ontology fragment (entity-value pools as class-instances + domain axioms) and classify once. Checks: class membership, property cardinality, disjointness (kills ASHRAE-in-imaging), value-range. → admission set + unsat/equivalence trace; rejects return a structured delta.
    • RI_GATE — admitted values → transient in-memory relational view (DDL spine loader): FK/RI assertions, CREATE VIEW Data-Element predicates, aggregate coherence (column values within hypernym, row-count bounds). → typed critique on failure.
    • PROSE_GATE — only after relational gates pass: structural isomorphism (prose entities ↔ view keys), semantic entailment (embedding / exact-mention), length distribution vs FinePDFs samples + truncation boundary diagnostics. → causal critique + re-invoke with the verified spine as immutable context.
    • COMMIT — full-cascade success: scratch → current corpus JSONL + chapter artifacts (sdg-corpora RC). Metrics sidecar: HermiT admission rate, RI violations, prose↔table entailment, length distribution, iteration count.
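The cascade order and the critique loop-back can be sketched with toy gates. Everything here is a hypothetical stand-in: `run_cascade`, `Critique`, the value pool, and the fixture proposer illustrate the control flow only, and the toy gates are simple Python checks rather than HermiT, the DDL spine loader, or the prose-correspondence checks.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Critique:
    gate: str
    detail: str


Gate = Callable[[dict], Optional[Critique]]
IMAGING_POOL = {"MRI", "CT", "Ultrasound"}  # hypothetical entity-value pool


def value_gate(p: dict) -> Optional[Critique]:
    bad = [v for v in p["modality"] if v not in IMAGING_POOL]
    return Critique("VALUE_GATE", f"not in imaging pool: {bad}") if bad else None


def ri_gate(p: dict) -> Optional[Critique]:
    dangling = [fk for fk in p["study_patient_ids"] if fk not in p["patient_ids"]]
    return Critique("RI_GATE", f"dangling FK: {dangling}") if dangling else None


def prose_gate(p: dict) -> Optional[Critique]:
    absent = [v for v in p["modality"] if v not in p["prose"]]
    return Critique("PROSE_GATE", f"unmentioned values: {absent}") if absent else None


GATES: List[Tuple[str, Gate]] = [("VALUE_GATE", value_gate), ("RI_GATE", ri_gate), ("PROSE_GATE", prose_gate)]


def run_cascade(propose, gates, max_iters: int = 3):
    """Propose, run gates in order, loop back with the first critique; commit on full success."""
    critique, history = None, []
    for _ in range(max_iters):
        proposal = propose(critique)
        critique = None
        for _name, gate in gates:
            critique = gate(proposal)
            if critique is not None:
                history.append(critique)
                break                      # later gates never see a rejected proposal
        if critique is None:
            return proposal, history       # COMMIT
    return None, history                   # budget exhausted, nothing committed


def fixture_propose(critique: Optional[Critique]) -> dict:
    base = {"patient_ids": [1, 2], "study_patient_ids": [1, 2]}
    if critique is None:
        return {**base, "modality": ["MRI", "ASHRAE 62.1"], "prose": "MRI studies."}
    return {**base, "modality": ["MRI", "CT"], "prose": "MRI and CT studies."}


committed, history = run_cascade(fixture_propose, GATES)
```

The proposer only ever receives a critique object; it has no handle on the gates themselves, which is the shape the invariant relies on.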

Invariant

The agent proposes; the membrane disposes. Gates are FSM effectors external to hermes; the agent can request scaffold tools and draft prose but cannot run HermiT/RI/correspondence or commit. Refinement therefore cannot add confident new errors that survive verification — it can only converge toward an admissible artifact or exhaust its iteration budget.

Key decisions / load-bearing pieces

  • Proposer transport: drive AIAgent as a library for inc-0 (fastest; invariant holds because gates are external). The full ACP-wire form — meta-harness as ACP client exposing scaffold tools to hermes over ACP — is the production refinement (note, not inc-0).
  • Model endpoint: hermes → http://127.0.0.1:8100/v1 (the engine’s vLLM, OpenAI-compat) for inc-0; a thin OpenAI-compat proxy over the gRPC engine for production (preserves strict layering capability_grpc_engine).
  • The value ontology is the critical new authored artifact — the membrane’s value-disjointness/range axioms over the entity-value pools. Bootstrappable by the existing LLM-deriver + HermiT-admission machinery (the value axioms are themselves membrane-gated). VALUE_GATE is only as sharp as this fragment.
  • scratch is the proposal staging (sibling to current/archive) — scratch → current on commit.

Increment plan

  • inc-0 — close the loop on ONE chapter, measure the delta. Take ch0 (the salad/placeholder chapter); PROPOSE (re-synth placeholder columns, flag the salad) → VALUE/RI/PROSE gates → measure placeholder rate, value-coherence, RI, prose-correspondence, length vs the single-shot baseline. Proves the loop improves a chapter. Build order: (a) value-HermiT prototype (aegir env, highest value); (b) hermes smoke against the vLLM; (c) the FSM skeleton wiring the two.
  • inc-1 — value ontology + VALUE_GATE at corpus scope (bootstrap the value axioms, gate the pools).
  • inc-2 — full cascade (RI views + prose correspondence) as meta-harness effectors; dual-register output.
  • inc-3 — skills accumulation + scale (hermes skills library; calibrate-strong-then-distill-local; the ACP-wire transport; reproducibility via cached transcripts/seeds).

Risks

  • Agency must extend to the scaffold (re-select templates, re-synth columns), not just prose — else it polishes the salad. The scaffold tools enforce this.
  • Cost/throughput (~10× single-shot) — free solar GPUs make it time not money; cap iterations; calibrate with a strong agent then run local.
  • Reproducibility — cache agent transcripts + seeds so the corpus stays regenerable-from-truth.
  • Value-ontology coverage — the membrane is bounded by the authored axioms; start with the high-frequency disjointness violations the audit found.

Semantic-Layer-Upkeep — the local-first quality loop

Status: SPEC (2026-06-19, RH). The procedure for keeping the semantic layer (ontology → DDL → views + verbalizations) valuable enough to spend paid-API budget scaling out. The last cycle established structure (RI-true tables/views, SKOS-native names) but the semantic content is thin — and semantic content is what makes the corpus worth pretraining on. This spec adds an embedded-view semantic-quality gate and an upkeep loop we run entirely on local resources before any paid scale-out.

The problem (audited 2026-06-19)

  • Verbalizations are low-entropy. Baseline (scripts/audit_verbalization_entropy.py): 522 templates, 60 distinct syntactic frames, top-5 frames = 69% (§ is a · that · § alone = 167). “X is a Y” monotony. Root cause is our under-use of DeepOnto (it is not a black box): we call OntologyVerbaliser with defaults and take only its single .verbal string — never its config (add_quantifier_word, vocab), never its OntologySyntaxParserRangeNode parse tree, never the relational verbalisers (object_property_domain/range/assertion).
  • Cell values ~67% placeholders (Process 01) + a start_time > end_time bug; ~33% (enums) are real.
  • Column vocabularies “canned” — all BFO anchors below SchemaPile p10 (de-canning), because same-anchor tables inherit an identical attribute set.
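The frame-entropy and top-5-share figures can be computed from a list of frame labels. A minimal sketch, assuming each verbalization has already been reduced to a frame string; how scripts/audit_verbalization_entropy.py extracts frames is not shown here, and `frame_stats` is a hypothetical name with invented sample data.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def frame_stats(frames: Sequence[str], k: int = 5) -> Dict[str, float]:
    """Distinct frame count, Shannon entropy in bits, and the share held by the top-k frames."""
    counts = Counter(frames)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    top_k_share = sum(c for _, c in counts.most_common(k)) / total
    return {"distinct": len(counts), "entropy_bits": entropy, "top_k_share": top_k_share}


# Illustrative frames only: six "X is a Y" sentences and two of another shape.
frames = ["§ is a §"] * 6 + ["§ has §"] * 2
stats = frame_stats(frames, k=1)
```

Low entropy with a high top-k share is the "X is a Y" monotony signature; a more diverse set raises the first and lowers the second.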

The three quality dimensions (metric · floor · lever)

| Dimension | Metric | Floor (provisional, ratchet up) | Lever (local) |
|---|---|---|---|
| Verbalization diversity | skeleton-frame entropy + top-5 share + relational share (audit_verbalization_entropy.py) | top-5 share ↓, frame entropy ↑ vs baseline | DeepOnto parse-tree re-render (config + relational verbalisers) → diverse set; local-LLM elaboration |
| Value semantics | placeholder-ratio + domain-term fraction + time-order integrity | placeholder ≤ 0.30 · domain ≥ 0.40 · 0 time-order violations | richer enums + curated pools (sdg-vocab), intra-row temporal coherence (start<end=start+dur); local-LLM-seeded RI-safe domain entity values |
| Column-name diversity | de-canning column-name entropy h_colset vs SchemaPile p10 (check_decanning_entropy.py; distinct_ratio reported as context) | every anchor ≥ SchemaPile h_colset p10 (we land at/above its median) | enrich anchor DataProperty pool + per-template stratified anchor-attributes |

The embedded-view semantic-quality gate (scripts/semantic_layer_gate.py) composes the three into one per-dimension pass/fail, pre-registered in EVIDENCE.md. A gate is a floor to clear on the way — not the objective (see Non-goals).
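As an illustration of a per-dimension pass/fail, here is a sketch of the value-semantics floors (placeholder ≤ 0.30, domain ≥ 0.40, zero time-order violations). The placeholder pattern, the `value_semantics_gate` name, and the sample cells are assumptions; scripts/semantic_layer_gate.py may measure these quantities differently.

```python
import re
from typing import Dict, Sequence, Set, Tuple

# Hypothetical placeholder pattern; the real detector is not specified here.
PLACEHOLDER = re.compile(r"^(placeholder|value|todo)[_-]?\d*$", re.IGNORECASE)

PLACEHOLDER_MAX = 0.30
DOMAIN_MIN = 0.40


def value_semantics_gate(
    cells: Sequence[str],
    domain_terms: Set[str],
    intervals: Sequence[Tuple[float, float]],
) -> Dict[str, object]:
    """Score one table's cells against the three value-semantics floors."""
    placeholder_ratio = sum(bool(PLACEHOLDER.match(c)) for c in cells) / len(cells)
    domain_fraction = sum(c in domain_terms for c in cells) / len(cells)
    time_order_violations = sum(1 for start, end in intervals if start > end)
    return {
        "placeholder_ratio": placeholder_ratio,
        "domain_fraction": domain_fraction,
        "time_order_violations": time_order_violations,
        "pass": (placeholder_ratio <= PLACEHOLDER_MAX
                 and domain_fraction >= DOMAIN_MIN
                 and time_order_violations == 0),
    }


# Illustrative data only.
cells = ["MRI", "CT", "value_1", "MRI", "Ultrasound"]
domain = {"MRI", "CT", "Ultrasound"}
ok = value_semantics_gate(cells, domain, intervals=[(1, 2), (3, 5)])
bad_time = value_semantics_gate(cells, domain, intervals=[(5, 3)])
```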

Provisional scaffolding / NON-GOALS (load-bearing — RH 2026-06-19)

The simplifications below are expedient scaffolding to get an early result over the line — they are NOT goals, and must never be codified as design targets (the “illustrative, not definitive” discipline; cf. the Provenance DAG). See memory provisional_scaffolding_not_goals.

  • “entity columns are never FKs ⇒ LLM-seeded values are RI-safe” — holds only for today’s simple schemas. The real product has entity columns that are foreign keys in dense webs.
  • one-FK-per-table (cross_family_fks takes refs[0]), slot-derived structure, RI=1.0 by construction over simple tables — current floors, not the shape of the target.
  • de-canning floored on h_colset (entropy), curated/deterministic value pools, realization-CPA firing only on object-property templates: proxies/guards/current-scope, not the destination. The entropy floor is the right metric for ontology-grounded tables (raw distinct_ratio over-penalises legitimate, correct-by-construction shared typed attributes), but matching SchemaPile is still a floor: the north star is concept-specific columns, not a generic anchor pool stratified into variety.

North star: the true final data product carries significant real-world relational complexity — dense many-to-many relations, FK-bearing entity columns, complex multi-table schemas, and domain-real values and prose. The upkeep loop’s job is to advance toward that; when the work matures, the scaffolding is retired, not enshrined.

Local LLM substrate — the Aegir capability/gRPC engine

LLM-using levers run on a local capability/gRPC engine (mirroring Gaius; src/aegir/engine/), serving Qwen 3.6+ via vLLM. Strict layering: the engine is the sole vLLM client and owns the capability→model mapping; workloads connect only to the gRPC engine (Complete), never to vLLM, never handed an endpoint URL. Federation with Gaius’s engine is the roadmap. This is normal local overhead, not a gate.

Thinking-trace retention. Qwen 3.6 reasons verbosely, and the reasoning trace is a corpus value-add (cf. Cerebras GLM reasoning-trace retention in published datasets) — so the engine retains it rather than suppressing it. CompleteResponse carries reasoning_content (the separated trace, when a model/parser splits it cleanly) alongside text and finish_reason; for a checkpoint that embeds its trace inline with no parseable delimiter, the trace is retained within text. The engine is sized for long traces without OOM: max_model_len × max_num_seqs is held at the proven-safe KV footprint (e.g. 16384×8 ≡ 8192×16), and token budgets are generous (the workload accepts the wait). Use client.complete_detailed() to capture the trace for the corpus; complete() returns just the answer text.

Resource principle / sequencing

Local GPU, local LLM, and HermiT/JVM are normal programming overhead — used freely. The only gate is a paid remote API. So the entire upkeep loop runs + is gated locally:

iterate (deterministic + local-LLM levers) → re-run the semantic-quality gate → confirm in the lineup
   → repeat until gate green → ONLY THEN scale out the end-stage corpus with paid Grok/Cerebras

Confirmation surface — the lineup

The lineup Schema lens is where a curator sees the layer advancing: build.py::project_relational surfaces sample rows (from base_rows.parquet), the verbalization, and per-table/anchor quality badges. just kb-build rebuilds; browse /lineup.

Where this sits

Inserted before the paid corpus scale-out that feeds M2/M3. M1 (H-Net isolation, local GPU) is independent and may proceed in parallel. Full implementation plan: ~/.claude/plans/unified-noodling-flurry.md.

Phase 1: Supervised Bootstrapping

Phase 1 fine-tunes a CTA/CPA head from the v2 mixed-corpus pretrain checkpoint, not from random initialization. The original Phase 1 plan (train from random on Column Type Annotation) was invalidated by the 2026-04-19 representation-collapse incident on SOTAB v2 Schema.org CTA; the v2 byte-level pretrain (2026-04-27) produces the well-conditioned starting point that the fine-tune proceeds from. The fine-tune is the M2 empirical gate.

Train the Aegir column-annotation head on Column Type Annotation (CTA) and Column Property Annotation (CPA) benchmarks, starting from the v2 byte-level pretrain checkpoint at outputs/mixed-v2/20260426T232240Z/final.pt. This phase establishes baseline performance and demonstrates that the pretrained backbone escapes the failure mode that the from-random approach hit.

Objective

Produce a single Aegir checkpoint that achieves competitive F1 scores on standard CTA/CPA benchmarks, operating directly on raw byte sequences (no external tokenizer), via head fine-tune from a pretrained backbone.

Why we are not training from random

The 2026-04-19 SOTAB-CTA run (small config, 56M params, 3 epochs, lr 3e-4) produced complete representation collapse:

  • Every val sample produced the identical pooled embedding to within bf16 rounding noise (max pairwise L2 = 0.020 on vectors of mean norm 6.98).
  • The classifier predicted currency on 100% of 1500 val samples; exact-match accuracy equalled the val base rate of the mode class.
  • MCL geometry audit returned one cluster at every inflation tested.
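The first symptom is easy to quantify: the reported maximum pairwise distance of 0.020 is roughly 0.3% of the mean norm of 6.98. A minimal sketch of that spread check on pooled embeddings; `collapse_stats` is a hypothetical helper, and the vectors are toy data rather than the run's embeddings.

```python
import math
from itertools import combinations
from typing import Sequence, Tuple


def collapse_stats(embeddings: Sequence[Sequence[float]]) -> Tuple[float, float, float]:
    """Return (max pairwise L2, mean norm, ratio); a ratio near zero signals collapse."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    mean_norm = sum(norms) / len(norms)
    max_l2 = max(math.dist(a, b) for a, b in combinations(embeddings, 2))
    return max_l2, mean_norm, max_l2 / mean_norm


# Toy vectors: a collapsed set (near-identical) and a spread-out set.
collapsed = [[3.0, 4.0], [3.0, 4.0], [3.0, 4.1]]
healthy = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
collapsed_ratio = collapse_stats(collapsed)[2]
healthy_ratio = collapse_stats(healthy)[2]
```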

A subsequent hygiene-only rerun (lr 5e-5, weight decay 1e-4, warmup, gradient clipping) reproduced the collapse to within 1 part in 10³ on both train and val loss. This is not a hyperparameter bug. The underlying issue is that H-Net + RWKV-7 is architecturally a language model — its mechanisms (dynamic chunker boundary learning, RWKV-7 time decay, recurrent state evolution) are designed for dense per-token supervision, not for sparse classification gradients. From-random direct CTA does not give the architecture the gradient signal it needs to stabilize.

The full diagnosis is in Diagnostic Case Study and the staged plan that resolved it is in Training Regime.

What changed since the original plan

| Element | Original plan | Current plan |
|---|---|---|
| Starting point | Random initialization | v2 mixed-corpus pretrain checkpoint |
| First objective | Direct CTA softmax over 91 SOTAB labels | Head fine-tune; backbone optionally frozen for the first probe |
| Success criterion | F1 > 0.85 macro on SOTAB-CTA easy | Liveness gate first (≥ 0.10 macro F1, ≥ 3 MCL clusters, ≥ 10 distinct predicted labels). Competitive F1 numbers are downstream of liveness. |
| Vocabulary | Schema.org 91 labels (stale; correct count is 82) | 82 labels via vocab_label_map.json; multi-benchmark support via _LABEL_DIMS keys |
| Loss design | Categorical cross-entropy on leaf labels | Same for the first probe; hierarchical path-prediction is a Stage C extension once liveness is established |
| Compute | Up to 6 × RTX 4090 DDP at small config | Single GPU sufficient for fine-tune from healthy v2 backbone; multi-GPU is M3 not M2 |

Target Datasets

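| Dataset | Task | Tables | Columns | Label Classes |
|---|---|---|---|---|
| SOTAB-CTA (Schema.org) | Column Type Annotation | ~50k | ~500k | 82 (verified; fixes stale 91) |
| SOTAB-CTA (DBpedia) | Column Type Annotation | ~50k | ~500k | 101 / 53 (full / restricted) |
| GitTables | CTA (large-scale) | ~1.5M | ~15M | 122 DBpedia types |
| WikiTables | CTA/CPA | ~1.7M | ~6M | DBpedia ontology |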
WikiTablesCTA/CPA~1.7M~6MDBpedia ontology

Liveness gate before competitive F1 targets

Per the v2 → SOTAB head fine-tune gate, the fine-tune must clear three liveness checks before any “competitive F1” target is meaningful:

  • ≥ 3 distinct embedding clusters at coarse MCL inflation
  • ≥ 0.10 macro F1 on the SOTAB v2 Schema.org CTA validation set
  • Predictions distributed across ≥ 10 distinct labels (no mode-class collapse)

These are deliberately undemanding. They distinguish “the model is making different predictions for different inputs” from “the model has collapsed to a constant function.” Until they pass, the F1 targets below are aspirational; once they pass, they become the next thing to optimize.
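The three checks compose into a single pass/fail. A minimal sketch, assuming the MCL cluster count is computed elsewhere and passed in, and that macro F1 is averaged over the union of true and predicted labels; `liveness_gate` and the toy labels are illustrative, not the project's evaluation code.

```python
from typing import Dict, Sequence, Tuple


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of per-label F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def liveness_gate(y_true: Sequence[str], y_pred: Sequence[str], n_clusters: int) -> Tuple[bool, Dict[str, bool]]:
    checks = {
        "clusters": n_clusters >= 3,                   # >= 3 clusters at coarse MCL inflation
        "macro_f1": macro_f1(y_true, y_pred) >= 0.10,  # >= 0.10 macro F1 on validation
        "distinct_labels": len(set(y_pred)) >= 10,     # no mode-class collapse
    }
    return all(checks.values()), checks


# Toy data: 12 labels, one collapsed predictor and one perfect predictor.
y_true = [f"label_{i % 12}" for i in range(120)]
collapsed_pred = ["label_0"] * 120
live_pred = list(y_true)

collapsed_ok, collapsed_checks = liveness_gate(y_true, collapsed_pred, n_clusters=1)
live_ok, live_checks = liveness_gate(y_true, live_pred, n_clusters=4)
```

A constant predictor fails all three checks at once, which is exactly the distinction the gate exists to make.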

Aspirational F1 targets (post-liveness)

| Benchmark | Metric | Target |
|---|---|---|
| SOTAB-CTA (easy) | Macro F1 | > 0.85 |
| SOTAB-CTA (hard) | Macro F1 | > 0.65 |
| SOTAB-CPA | Macro F1 | > 0.75 |

Source: published REVEAL and SOTAB baselines. These are competitive, not state-of-the-art; SOTA on SOTAB-CTA is in the high 0.8s for specialized fine-tunes. We pursue them only after liveness is established.

Byte-Level Input

Aegir operates on raw byte sequences (vocab_size=65536 to cover byte values plus special tokens). Tables are serialized into a linear byte stream with role markers distinguishing the target column from context columns.

Dynamic chunking learns tokenization from raw bytes. The RoutingModule in the hierarchical backbone predicts chunk boundaries based on cosine similarity between adjacent hidden states. The v2 pretrain has already given the chunker a healthy boundary distribution on natural-language and DDL-flavored bytes; the fine-tune extends it to table serializations without re-learning byte statistics.
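A minimal sketch of similarity-based boundary prediction. It assumes the formulation p_t = ½(1 − cos(h_t, h_{t−1})) with the first position always a boundary, which is how the H-Net routing rule is commonly described; the learned query/key projections are omitted, and the repository's RoutingModule may differ in detail.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def boundary_probs(hidden: Sequence[Sequence[float]]) -> List[float]:
    """Boundary probability per position: dissimilar neighbours imply a likely chunk boundary."""
    probs = [1.0]  # the first position always starts a chunk (assumption)
    for prev, cur in zip(hidden, hidden[1:]):
        probs.append(0.5 * (1.0 - cosine(cur, prev)))
    return probs


def boundaries(hidden: Sequence[Sequence[float]], threshold: float = 0.5) -> List[bool]:
    return [p >= threshold for p in boundary_probs(hidden)]


# Toy hidden states: two runs of identical vectors pointing in opposite directions.
hidden = [[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0], [-1.0, 0.0]]
probs = boundary_probs(hidden)
flags = boundaries(hidden)
```

The degenerate patterns to watch for under fine-tuning are the two extremes of this output: every flag true (all-boundary) or only the first flag true (no-boundary).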

Serialization Format

Tables are serialized using the format in src/aegir/data/serialization.py:

[CLS] col_name: val1 | val2 | val3 [SEP] ctx_col1: v1 | v2 [SEP] ctx_col2: ...

The target column comes first, followed by context columns selected via MMR (Maximal Marginal Relevance) to maximize diversity while staying within the byte budget.
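A sketch of the serialization and the MMR selection step. It is illustrative only: the function names, the Jaccard similarity over cell values, the `lam` trade-off, and the byte budget are assumptions, and src/aegir/data/serialization.py may differ on all of them.

```python
from typing import List, Sequence, Tuple

Column = Tuple[str, List[str]]  # (column name, sample values)


def jaccard(a: Sequence[str], b: Sequence[str]) -> float:
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def mmr_select(target_values: Sequence[str], context: Sequence[Column], k: int, lam: float = 0.5) -> List[Column]:
    """Greedy MMR: score = lam * relevance to target - (1 - lam) * max similarity to already-selected."""
    selected: List[Column] = []
    remaining = list(context)
    while remaining and len(selected) < k:
        def score(col: Column) -> float:
            relevance = jaccard(col[1], target_values)
            redundancy = max((jaccard(col[1], s[1]) for s in selected), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


def serialize(target: Column, context: Sequence[Column], max_bytes: int = 512) -> str:
    """Target column first, then context columns while the UTF-8 byte budget allows."""
    def render(col: Column) -> str:
        return f"{col[0]}: " + " | ".join(col[1])
    parts = ["[CLS] " + render(target)]
    for col in context:
        candidate = " [SEP] ".join(parts + [render(col)])
        if len(candidate.encode("utf-8")) > max_bytes:
            break
        parts.append(render(col))
    return " [SEP] ".join(parts)


# Toy columns: c2 duplicates c1, so a diversity-weighted MMR prefers c3 as the second pick.
target = ("price", ["a", "b", "c"])
context = [("c1", ["a", "b", "c"]), ("c2", ["a", "b", "c"]), ("c3", ["a", "x"])]
picked = mmr_select(target[1], context, k=2, lam=0.3)
text = serialize(("price", ["9.99", "12.50"]), [("currency", ["USD", "EUR"])])
```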

Training Configuration

uv run --no-sync python train.py \
    --task sotab-cta \
    --model-size small \
    --pretrained outputs/mixed-v2/20260426T232240Z/final.pt \
    --epochs 10 \
    --batch-size 32 \
    --lr 1e-4 \
    --warmup-steps 500

Hygiene parameters (lr 1e-4 with warmup, gradient clipping at max_norm=1.0, weight decay 1e-4) are the same as the v2 pretrain. The pretrained backbone is in a well-conditioned region of parameter space; the fine-tune does not need to escape from a saturating decay basin.

Single-GPU sufficient for the liveness gate

The liveness gate does not require multi-GPU. A single 4090 fine-tunes the small-config backbone on SOTAB-CTA in under an hour at the budgets that matter for liveness. Multi-GPU step-up belongs to M3, after the gate clears and we are pushing for the aspirational F1 targets.

Model Sizes (current, verified)

| Size | d_model | arch_layout | Approx params |
|---|---|---|---|
| tiny | [128, 192, 192] | ["w2", ["w2", ["w4"], "w2"], "w2"] | ~13.5M |
| small | [256, 384, 384] | ["w4", ["w4", ["w8"], "w4"], "w4"] | ~56M (Apr 19 SOTAB run) |
| base | [768, 1024, 1024] | ["w4", ["w4", ["w12"], "w4"], "w4"] | target ~500M |

The Apr 19 representation-collapse run was at small; the v2 pretrained backbone matching it is the natural starting point for the liveness gate. base is the target for competitive F1 numbers post-M2.

Success Criteria

Phase 1 is complete when, in order:

  1. The v2 → SOTAB head fine-tune passes the liveness gate (docs/current/ontology/charter.md).
  2. Dynamic chunking continues to produce stable boundary predictions on the table-byte distribution (no degenerate all-boundary or no-boundary patterns under the fine-tune).
  3. The model meets or exceeds aspirational F1 targets on SOTAB-CTA/CPA at base config.
  4. The trained checkpoint is frozen and used as a specialist in the far-future Phase 3 PARL training.

Phase Gate — Governance & DDL Spine (v0.3)

Date: 2026-06-07 · Decision: PASS · Commits: f3795da, 3337176 (on the prior AGE-backend hardening, 614ea8f and the rch/signals fork chain).

This gate certifies the relational-metadata substrate for the v0.3 corpus: the corpus’s tables are now backed by deterministically-generated, machine-verified SQL DDL, catalogued in Apache Atlas as a first-class rdbms_* footprint with column-level lineage. It establishes the second verifiable axis of the corpus thesis.

The thesis it certifies

A v0.3 corpus table is a view on a larger relational footprint, and that footprint is verifiable on two independent axes:

| Axis | Source syntax | Deterministic verifier | Coverage measure |
|---|---|---|---|
| Semantic | OWL Manchester | DeepOnto (JVM) | BFO/CCO families, axiom kinds |
| Syntactic | SQL DDL / views | polyglot (Rust) | SQL features: types, constraints, FKs |

polyglot : SQL :: DeepOnto : ontology. Each corpus column carries an ontology type (semantic) and a SQL type (syntactic); CTA is the map between them.

Scope delivered

  1. polyglot vendored as a submodule (components/polyglot = zndx/oss-polyglot @ rch/devenv) — a Rust/PyO3 SQL parser/validator/transpiler (sqlglot port, 34 dialects). Task-built via devenv tasks run polyglot:build (off the uv sync path, per the patched-wheel convention). The fork branch exists for the near-term Kudu dialect extension.
  2. DDL spine (src/aegir/ontology/ddl.py, scripts/build_ddl_spine.py): every catalog template lowered to a CREATE TABLE (reusing type_check schema types), cross-family foreign keys gated by the empirical FamilyComplex, validated against Trino ∩ Spark plus a Spark Iceberg-flavored variant, with a SQL-feature coverage inventory (the syntactic analogue of ontology_coverage_audit.py).
  3. DDL-native Atlas projector (scripts/project_atlas_ddl.py, supersedes the hive prototype): rdbms_instance/db/table/column + rdbms_foreign_key entities, corpus tables as VIEWs, column-level lineage via polyglot OpenLineage → aegir.governance.ol → aegir_hx (new columnLineage-facet ingestion).
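
To make the "template lowered to a CREATE TABLE" step in item 2 concrete, here is a minimal sketch of rendering a resolved column list into generic ANSI-style DDL. It does not use polyglot or the real ddl.py; the `Column` and `ForeignKey` classes and `lower_to_create_table` are hypothetical names, and whether a given dialect accepts inline key constraints is not checked here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    sql_type: str
    nullable: bool = True


@dataclass(frozen=True)
class ForeignKey:
    column: str
    ref_table: str
    ref_column: str


def lower_to_create_table(table: str, columns: list[Column], pk: str,
                          fks: tuple[ForeignKey, ...] = ()) -> str:
    """Render one CREATE TABLE statement from an already-resolved column list."""
    lines = [
        f"  {c.name} {c.sql_type}{'' if c.nullable else ' NOT NULL'}" for c in columns
    ]
    lines.append(f"  PRIMARY KEY ({pk})")
    for fk in fks:
        lines.append(
            f"  FOREIGN KEY ({fk.column}) REFERENCES {fk.ref_table} ({fk.ref_column})"
        )
    return f"CREATE TABLE {table} (\n" + ",\n".join(lines) + "\n)"
```

The output of a function like this is what a deterministic validator would then parse per dialect; keeping generation and validation separate is what lets the coverage inventory count SQL features independently of how they were emitted.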

Gate evidence (verified, restart-durable)

| Criterion | Result |
|---|---|
| DDL spine generates + validates (all 540 templates) | 540/540 canonical (Trino∩Spark) + 540/540 Iceberg variant (Spark) ✅ |
| Relational footprint in Atlas | 91 rdbms_table (56 base + 35 views), 252 rdbms_column |
| Join structure explicit | 36 rdbms_foreign_key resolving table/key_columns/references_* ✅ |
| Column-level lineage (view ← base) | 94 DERIVES_FROM edges ✅ |
| Semantic overlay | 263 classifications (CTA/CPA/domain), OntologyProvenance BM on 56, glossary ✅ |
| Durability | survives devenv restart; names decode in search ✅ |

Significance / what this unblocks

  • The “views on a footprint” thesis is now concrete and browsable in the Atlas UI, not a slide — it is the illustration the corpus paper needs.
  • The corpus is engine-verified for Trino / Spark / Iceberg without running those engines (that runtime stays in Signals) — Aegir asserts forward-compatibility statically.
  • The syntactic-coverage axis is a new, cheap, deterministic signal to drive generation toward under-covered SQL features — complementary to the ontology coverage audit.

Known issues / deferred (none gate-blocking)

  • classification-filter search (filter entities by CTA/CPA tag) — deferred to a follow-up.
  • Kudu dialect — the first polyglot fork extension (required near-term).
  • Hybrid → forward construction — corpus views are still synthesized; next is emitting real CREATE VIEW from the generation pipeline.
  • Cosmetic: Atlas soft-delete dev cruft (UI-invisible); glossary term→entity meanings display.

Operational notes (for the next operator)

  • A long-running Atlas instance can develop a stale AGE connection pool (every create path 500s with “Failed to execute vertex query” while the graph is healthy) — restart Atlas.
  • polyglot OpenLineage wants an object outputDataset and a SELECT (not CREATE VIEW).
  • rdbms_foreign_key needs a name; a businessMetadataDef’s applicableEntityTypes is fixed at create-time (recreate to retarget).
  • maturin --release LTO is pathologically slow; iterate with debug builds.

Phase — SHARE Docs (browsable corpus in sdg-corpora)

Status: Phase A DELIVERED (in-aegir mdbook renderer, commit ae7dbee, just kb-mdbook); content layer MATERIALIZED (collections in-tree); Phase B (graduate to sdg-corpora) PENDING — the gate is met at Phase B. · Decision: scoped 2026-06-17 · Depends on: the lineup (UI-U0/U1/U2, delivered), the aegir.lineup sync SHARE verb (delivered), and a corpus regen against the current ontology (see Dependencies).

This phase turns the public sdg-corpora repo from machine-readable artifacts (parquet / TTL / JSON) into a human-browsable, cross-linked mdbook document — the static SHARE-tier rendering of the lineup. Where the aegir gateway lineup is the dynamic, authenticated navigation for us, this is the portable, public rendering anyone gets by opening the repo on GitHub: no gateway, no model, no auth.

Update — collection-structured, chapters-in-tree (2026-06-17, RH)

Two corrections to the original framing, both now in force:

  1. The corpus is the deliverable; it lives in the distribution tree. The earlier “chapters are release/HF assets, not in the repo” was not a real decision — .gitignore never excluded them; they were simply never committed, and a scale-era LFS note got mis-narrated as “deliberately leaving deliverables out.” The only principled withholding is the Atelier answer-key (reference.parquet, the column→SKOS-code scoring key — held out so the blind eval stays valid). Everything else — chapters (as text), underlying tables, terms — ships in the tree. (Bulk binary parquet via LFS/release is a real concern only at ~100K-table scale; not a v0.3 reason to omit the actual product.)

  2. The unit is a collection, not a flat chapter list. A collection = a FinePDFs-grounded topic (carried forward from the coverage audit) + its chapters (prose + embedded views) + the underlying relational tables the views project from (semantic-column DDL) + the grounding ontology terms + a manifest of the cross-links. One concept serves as the unit of distribution = navigation = mdbook section.

Materialized (scripts/build_collections.py, deterministic): the v0.3 corpus → corpus/collections/topic-NNN-<family>/{README.md, chapters/<id>.md, tables/<name>.sql, manifest.json} + INDEX.md. Current release: 121 populated collections (1,977 chapters, 4,116 table DDLs) + 79 gap topics (no chapters yet), in-tree. The mdbook renderer (below) organizes by collection (a collection = a book section); Phase A/B otherwise unchanged. Known caveat this release surfaces: the chapters’ embedded views are the generation-time (thin) schema, while the underlying tables carry this session’s semantic columns — the divergence motivates the corpus regen.
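
The directory layout above can be sketched as a pure path-building function. This is not build_collections.py; `collection_paths` is a hypothetical helper, and the three-digit zero padding for NNN and the sorted ordering (for determinism) are assumptions.

```python
from pathlib import PurePosixPath


def collection_paths(topic_no: int, family: str, chapter_ids: list[str],
                     table_names: list[str]) -> list[str]:
    """List the files one collection would contain, in a deterministic order."""
    # Assumption: NNN is the topic number zero-padded to three digits.
    root = PurePosixPath("corpus/collections") / f"topic-{topic_no:03d}-{family}"
    paths = [root / "README.md", root / "manifest.json"]
    paths += [root / "chapters" / f"{cid}.md" for cid in sorted(chapter_ids)]
    paths += [root / "tables" / f"{name}.sql" for name in sorted(table_names)]
    return [str(p) for p in paths]
```

For example, `collection_paths(7, "artifact", ["ch2", "ch1"], ["orders"])` lists the README, the manifest, two chapter files, and one table DDL under `corpus/collections/topic-007-artifact/`.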

The thesis it certifies

The v0.3 corpus is not just downloadable; it is browsable as a document. A reader can open a chapter, see its rendered view-tables, click through to the relational tables those views project from, and from each table click through to the fully-resolved ontology entry that grounds it, all as static, cross-linked, attribution-clean pages. The same content ↔ relational ↔ ontology lineup primitive (Ward Cunningham’s lineup, credited; this is a navigation primitive, not a wiki: no editing/forking) is rendered twice: once live for us, and once statically for the public.

This is the presentation half of the SHARE boundary: sync shares the data; this shares the navigable document.

The artifact (what a reader gets)

An mdbook site (the devenv already ships mdbook + d2/katex/mermaid) with three cross-linked layers:

| Page | Renders | Links out to |
|---|---|---|
| Chapter | the chapter prose + its markdown view-tables, rendered | each relational table the views project from |
| Relational table | the table’s columns + types (from ddl_statements.parquet), FK edges, a sample | (1) the ontology entry it instantiates · (2) FK-linked tables · (3) chapters that view it |
| Ontology entry (resolved-template) | see below | the relational table it drives · broader/narrower entries · its family / BFO upper-type |

Locked design decisions

  1. “Hydrated” = resolved-template (confirmed). The ontology entry page shows not the abstract template (Class: {X:Class} SubClassOf: cco:Artifact, sdg:hasSKU some {Y:Class}, slot placeholders) but the template resolved for its specific table: the concrete class name, the Manchester axiom with slots named as the actual columns/relationships the table instantiates, the SKOS definition (verbalization), the BFO/CCO anchor, the broader/narrower neighbours, and the slot→column mapping that drives the table’s DDL. It answers “what does this table mean, ontologically, fully concrete?” and does not show the corpus row instances (that richer instance-hydration is explicitly out of scope for this phase; revisit later if wanted).

  2. The renderer reuses aegir.lineup.build; it does not reinvent the cross-link. build.py already projects the tri-layer KB (ontology / relational / content notes with [[wikilinks]]) into build/dev. The mdbook emitter (scripts/render_lineup_mdbook.py) is an output target over that projection: KB projection → book.toml + src/SUMMARY.md + one flat page per note, with the lineup [[id|label]] wikilinks lowered to mdbook relative links. The SUMMARY leads with the collections × lens pivot (the landing), then the collections, the lenses, and the ontology / relational / content products — the lineup’s panel-trail flattened into a navigable document.

  3. Self-contained from published artifacts. The final renderer reads only what ships in corpora/ontology/catalog/*.json (with broader + slot_types + manchester + verbal + bfo_anchor), ddl/<run>/ddl_statements.parquet (the already-resolved tables — so no need to port template_to_table), and corpus/. A reader cloning sdg-corpora builds the site with zero aegir dependency.
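
The wikilink lowering described in design decision 2 can be illustrated with a short regex pass. This is a sketch, not render_lineup_mdbook.py: `lower_wikilinks` is a hypothetical name, and the page-naming rule (one flat `<id>.md` per note, with `/` in ids flattened to `__`) is an assumption made so that every page sits in one directory.

```python
import re

# Matches [[id]] and [[id|label]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def lower_wikilinks(text: str) -> str:
    """Rewrite lineup wikilinks as markdown links to flat sibling pages."""
    def repl(m: re.Match) -> str:
        note_id = m.group(1).strip()
        label = (m.group(2) or note_id).strip()
        # Assumption: "/" in a note id is flattened to "__" in the page filename.
        return f"[{label}]({note_id.replace('/', '__')}.md)"
    return WIKILINK.sub(repl, text)
```

With that naming rule, `[[relational/orders|orders table]]` becomes `[orders table](relational__orders.md)`, and a bare `[[term-x]]` becomes `[term-x](term-x.md)`.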

Phasing (A → B) — prove before scaffolding

Phase A — prove the renderer in aegir. DELIVERED (commit ae7dbee, #49). An mdbook target over aegir.lineup’s projection (scripts/render_lineup_mdbook.py, run via just kb-mdbook — optionally --build to invoke mdbook build) reuses build.py’s projection. It outputs the cross-linked site, lowering wikilinks to relative page links, with the collections × lens landing at the head of SUMMARY.md. Cheap, where the projection + hydration data already live. No new devenv, no new repo structure.

Phase B — graduate to sdg-corpora as a proper project. PENDING (the gate, #50). Once A is proven, extract a self-contained renderer into sdg-corpora:

  • promote sdg-corpora to a Python project: its own devenv, a src/sdg/ package, and a just docs-sync recipe (mirrors aegir’s just kb-sync ergonomics);
  • the renderer reads only the repo’s published artifacts (design decision 3) → emits docs/ (the mdbook book) inside sdg-corpora;
  • publish the rendered book (GitHub Pages or committed book/).

The gate

PASS when: in a fresh clone of sdg-corpora, just docs-sync produces a browsable mdbook site where every chapter, its rendered view-tables, the relational tables, and the resolved-template ontology entries are cross-linked and all links resolve, built with zero aegir dependency. (Phase A is the in-aegir proof of the renderer + link model, now delivered; the gate is met at Phase B.)
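
The "all links resolve" clause of the gate lends itself to a mechanical check. The sketch below works on an in-memory map of page name to markdown text and assumes a flat page directory; `unresolved_links` is a hypothetical helper, not part of the project.

```python
import re

# Captures the .md target of a markdown link, ignoring any #fragment.
MD_LINK = re.compile(r"\]\(([^)#\s]+\.md)(?:#[^)]*)?\)")


def unresolved_links(pages: dict[str, str]) -> list[tuple[str, str]]:
    """Return (page, target) pairs whose .md target is not a known page."""
    missing = []
    for name, text in sorted(pages.items()):
        for target in MD_LINK.findall(text):
            if target not in pages:  # assumption: all pages share one directory
                missing.append((name, target))
    return missing
```

An empty result over the rendered src/ tree would be one concrete, scriptable form of the PASS condition.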

Dependencies

  • Corpus consistency. The docs are only honest once the corpus is regenerated against the current ontology — otherwise old-ontology chapter view-tables would link to new-ontology entries. Sequence this phase after (or bundled with) a corpus regen via the generate_chapter.py pipeline. Until then, Phase A can run against the current (mismatched) artifacts purely to prove the renderer.
  • sync already done. The ontology Data Product is current in corpora/ as of the SHARE-verb work; this phase consumes those artifacts.

Scale (deferred)

At v0.3 scale (~hundreds of tables / ontology entries / ~2,000 chapters) a flat mdbook renders fine. At the production target (~100K relational tables / ~10K chapters) 100K+ static pages need sharding / pagination / on-demand rendering — that design is deferred to when production data lands and must not block the v0.3 browsable release. log() any truncation if a cap is applied at scale.

Non-goals (this phase)

  • Instance-level hydration (corpus rows as ontology individuals) — resolved-template only.
  • Editing / forking — the lineup is a navigation primitive, not a wiki.
  • Replacing the live gateway lineup — this is its static, public complement, not a substitute.
  • Production-scale rendering — see Scale.

Lineup Landing — the collections × lens pivot

Status: BUILT (the spec that shipped). · Co-designed with RH 2026-06-17/18; landed as the lineup’s current beachhead (collections × lens pivot + the de-flattened many-to-many graph + the TF-IDF lens chords — see §Implications). Defines what a user sees on clicking lineup, and the data model the view sits on. Distinct from (but shares its graph with) the SHARE-Docs phase.

What the user is seeking — a trailhead, not a dashboard

Cunningham’s lineup is trail-following (it is a navigation primitive, not a wiki — no edit/fork). So clicking “lineup” almost never means “show me a report” — it means “give me a good place to start a trail.” The landing’s job is to offer the strongest trailheads and get out of the way. Presume: orientation → first trail step.

Scenario: Orient (the default, first-visit intent)
  Given I open the lineup with no specific target
  When the `current` root loads
  Then I see collections as rows under the default lens, with the shape of the KB
  So I can pick a direction in one glance.

Scenario: Browse the corpus (the richest axis)
  Given I want to explore content
  When I read the landing
  Then collections are first-class rows, each a topic-grounded bundle
  So I can dive into one (its chapters, tables, terms).

Scenario: Re-orient by lens
  Given I am looking at collections under the `terms` lens
  When I switch the lens to `schema` (or `content`)
  Then the column axis swaps (terms → base-tables → topics) while collections stay the rows
  So I see the same collections through a different facet without losing my place.

Scenario: Transpose
  Given a cell links a collection to a term (or topic, or table)
  When I follow it from the terminal side
  Then I get "this term → the collections that realize it" (the inverse trail)
  So the pivot reads both ways.

Scenario: Seek a known thing
  Given I know what I'm after
  When the landing loads
  Then a jump/search box is focusable immediately
  So I skip traversal.

Scenario: Resume
  Given I was mid-investigation
  When I open `scratch` from the root dropdown
  Then I see my authored notes / recent trail
  So I continue where I left off.

The beachhead: collections × lens

current = a pivot. Collections are the fixed rows; the lens selects the column dimension (terms default → schema / content). Cells are incidence, i.e. trailheads (optionally carrying a count). The lens swaps what the columns are; the unit never changes. This is the user’s “compound / pivot-table” intuition, made precise: the lens is the column-axis selector, not a co-equal second data axis.
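
A minimal sketch of the pivot and its transpose, assuming the lens has already been resolved to a list of (collection, terminal) edges. The function names are hypothetical and this is not the project_lenses implementation.

```python
from collections import defaultdict

Edge = tuple[str, str]  # (collection, terminal) under the selected lens


def pivot(edges: list[Edge]) -> dict[str, dict[str, int]]:
    """Rows = collections, columns = lens terminals, cells = incidence counts."""
    table: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for collection, terminal in edges:
        table[collection][terminal] += 1
    return {row: dict(cols) for row, cols in table.items()}


def transpose(table: dict[str, dict[str, int]]) -> dict[str, dict[str, int]]:
    """Read the same cells from the terminal side (terminal -> collections)."""
    out: dict[str, dict[str, int]] = defaultdict(dict)
    for row, cols in table.items():
        for col, count in cols.items():
            out[col][row] = count
    return dict(out)
```

Switching lens only changes which edge list is passed in; the rows stay collections, which is the sense in which the lens is a column-axis selector.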

The truth underneath: a multi-hop graph, every terminal many-to-many

The pivot is a surface over a graph whose hub is the document and whose relational join-node is the view:

collection ──member──▶ document ──realizes/cites──▶ term
                               ──target+style──────▶ topic
                               ──embeds──▶ view ──hydrates from──▶ base table

Each lens reaches a terminal (term / topic / base-table) through an intermediary. Flattening any intermediary collapses a real many-to-many into a false diagonal or many-to-few — which is exactly what an early materialization (build_collections.py, grouping on the single target_topic_id, going straight to base tables) did:

| Lens | Intermediary (must not be flattened) | Terminal | Cardinality vs collections |
|---|---|---|---|
| terms | document (realizes) | term | many-to-many |
| content | document (target + style_topic_ids) | topic | many-to-many |
| schema | document → view | base table | many-to-many |

The data already proves it (v0.3 corpus, before any de-flatten):

  • Topics: 160 distinct topics touched; 160 / 160 already span > 1 collection; one spans 32; a chapter touches ~3 topics (target + style).
  • Base tables: 146 cited; 95 (65%) already feed > 1 collection; one feeds 35; avg 4.5.

So there is no diagonal and no many-to-few — those were dropped-hop artifacts. Every terminal is many-to-many because the document is a shared hub and the terminals recur (a topic styles many documents; a base table hydrates many views). The “collection ≈ topic” 1:1 in the current release is a degenerate, transitional instance of this graph, not its shape.
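
The dropped-hop effect is easy to reproduce on toy data. In this sketch each document lists its target topic first and its style topics after; keeping the document hop yields many-to-many, while keeping only the target topic yields the false diagonal. The data and the function name are illustrative assumptions.

```python
from collections import defaultdict


def collections_per_topic(members: list[tuple[str, str]],
                          topics_of: dict[str, list[str]]) -> dict[str, set[str]]:
    """Walk collection -> document -> topic without collapsing the document hop."""
    reach: dict[str, set[str]] = defaultdict(set)
    for collection, doc in members:
        for topic in topics_of[doc]:
            reach[topic].add(collection)
    return dict(reach)


# Toy data: two collections, one document each; first topic = target, rest = style.
members = [("A", "d1"), ("B", "d2")]
topics_of = {"d1": ["t1", "t2"], "d2": ["t2", "t1"]}

full = collections_per_topic(members, topics_of)
# Flattened variant: keep only the single target topic per document.
flat = collections_per_topic(members, {d: ts[:1] for d, ts in topics_of.items()})
```

Here `full` maps both topics to both collections, while `flat` maps each topic to exactly one collection, which is the 1:1 artifact described above.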

The rule, found independently on two axes: the view is to the schema axis what style_topic_ids is to the content axis — the intermediary that, surfaced, turns a flattened projection back into the real graph.

First-class entities

The model requires two entities that are currently implicit to become first-class lineup nodes:

  • document — the universal hub (everything hangs off it: terms, topics, views).
  • view — the relational join-node (view → base tables; a view may source tables across topics/PDFs; a base table feeds many views). This is the project’s founding thesis made navigable: a corpus table is a view on a larger, shared relational footprint. The Atlas projection already models views (view_<table>, join_<a>__<b>); the lineup/collections layer must too.

Plus the terminals (term, topic, base-table) and the unit (collection).

Lens renderings (all pivots; the card is orthogonal)

  • terms (default — leads with the ontology grounding, the thesis): collection → terms, invertible to term → collections.
  • schema: collection → documents → views → base tables — the shared footprint; pivot center is the view. Surfaces both per-document views and the shared base tables.
  • content: collection ↔ topics (the densest cross-axis). The README-style “expanded topic” card is the drill-in detail of one collection — not the lens itself. Pivot → click a cell → card. Two layers; don’t collapse them (that collapse was the 1:1 artifact).

Affordances the pivot grants for free: transpose (read either direction), counts as density (orientation before drill-in), cell = trailhead (keeps it a lineup, not a report).

Tri-root: a layer dropdown, not a browse axis

Pull archive / current / scratch into a compact top dropdown, current default. It selects the layer, not the browse axis, so it shouldn’t eat left-nav space. Each lands consistently: current → the launchpad (the KNOW projection) · scratch → authored notes + recents (resume / curation) · archive → snapshots + aged.

Loading

Render the trailhead structure immediately (lenses + collections + search are known from the index at once); stream counts and recents. Orientation starts before the data fully lands — the skeleton is the trailheads, not spinners.

Implications & dependencies

The three implications below are DONE in the KB projector (src/aegir/lineup/build.py): the lineup projection — not the standalone scripts/build_collections.py release materializer — is now the live source of the landing’s graph.

  1. De-flatten the materializer. build.py::project_collections carries the intermediaries the early build_collections.py materialization dropped: style_topic_ids (→ content many-to-many) and the view layer (→ schema many-to-many). One change of kind (“stop collapsing the hub and the view”), not three axis-specific ones (tasks #52/#53).
  2. Model document and view as first-class lineup notes (with their edges) so the pivot is computed over the true graph; project_lenses emits the collections × {terms, schema, content} pivots over it, and the TF-IDF lens chords (aegir.viz.lineup_app, task #54) render the live associations.
  3. Shares the graph with SHARE-Docs — the mdbook renders the same collection ↔ document ↔ {terms, topics, views→tables} structure; both are built against this model.

Versioning — namespaced archive snapshots

Before a regen (or any version cut), just kb-snapshot --key <key> freezes current/ into a namespaced, self-contained zettelkasten under archive/<key>/: every note id and [[wikilink]] is prefixed <key>/, so the snapshot coexists with the regenerated current (no id collision) and is internally navigable (clicking inside stays inside). A registry note (kind archive-snapshot) is the Archive-dropdown entry point; a _manifest.json pins the corpus/coverage/catalog it was projected from (reproducible). The snapshot survives kb-build (which rebuilds current only).
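
The namespacing rule (prefix every note id and every wikilink target with `<key>/`) can be sketched as follows. This is not the kb-snapshot implementation; `namespace_note` is a hypothetical helper operating on one note at a time.

```python
import re

# Matches [[id]] and [[id|label]]; group 2 keeps the "|label" part intact.
WIKILINK = re.compile(r"\[\[([^\]|]+)(\|[^\]]+)?\]\]")


def namespace_note(note_id: str, body: str, key: str) -> tuple[str, str]:
    """Prefix the note id and every wikilink target with '<key>/'."""
    def repl(m: re.Match) -> str:
        return f"[[{key}/{m.group(1)}{m.group(2) or ''}]]"
    return f"{key}/{note_id}", WIKILINK.sub(repl, body)
```

Because both the ids and the link targets receive the same prefix, links inside the snapshot keep pointing inside the snapshot, and none of them can collide with an id in the regenerated current layer.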

2026Q2 — pre-regen snapshot (taken 2026-06-18, calendar quarter): 3,397 notes · corpus sdg_corpus_v0_3/d7646714… · catalog ae7dbee. Regenerate with AEGIR_CORPUS_RUN=…/d7646714…/chapters.parquet just kb-build at catalog ae7dbee, then just kb-snapshot --key 2026Q2. This preserves the pre-regen lineup so the corpus regen is safe.

Scale note: merging a full snapshot into the live index ~doubles it per snapshot. Fine for one; when snapshots accumulate, move to per-snapshot frozen indices mounted on demand (keep the live index current-only) — the deferred refinement.

Non-goals

  • A wiki (no edit/fork — it’s Cunningham’s lineup).
  • Replacing the live gateway lineup — this is the live lineup’s landing.
  • A numeric-aggregation pivot — cells are incidence/trailheads (counts are an optional density hint).

Leaderboard → Convergence Observatory (enhancement ideas)

Status: IDEAS, not built. Captured 2026-06-18 (RH) once the FinePDFs→ontology→DDL→corpus-with-embedded-views pipeline materialized and the viz layer became live HoloViews/Panel/Datashader served by a bokeh server behind the gateway proxy (UI-U5; commits 8c42bbd / 6ac9491). The leaderboard’s only prior design intent was “kinda like W&B but with HoloViews.” Now there’s a reason to aim higher.

Where it is today

One row per training run; clicking a run opens a drawer with live HoloViews curves (loss / F1 / per-stage chunking boundary), rendered by aegir.viz.runs_app over the bokeh server, embedded via <PanelView>. Air-gapped, no npm @bokeh/bokehjs.

Built: the Training section + Sweeps + Reward (v1). The lineup left-nav now has a TRAINING group below the lenses. Its first entry, Sweeps, opens a live HoloViews parallel-coordinates panel (aegir.viz.sweeps_app) over the runs: each run is a line across model_size · num_params · lr · epochs · macro/micro F1 · val_loss, colored by best macro-F1, reading live from outputs/runs. The Landing “Runs” card is now “Sweeps” → /lineup?open=training/sweeps. The panel is HoloViews-native by directive (no canonical-PCP widget), which leaves the door open to the superior version (datashade for many runs, hv.link_selections axis brushing). Reward (reward_app) is the GRPO health monitor: the reward R mean±std band (the variance collapse-canary) + R_A pass-rate + z-scored advantage, read off a run’s GRPO metrics_jsonl. The TRAINING group is extensible: the ideas below become further entries (each a kind:"training" note with a viz_app frontmatter + a bokeh-server app).

The reframe

The pipeline now emits coupled Data Products (ontology · relational/DDL footprint · corpus with embedded views) and model runs, and the convergence loop couples them (proxy signals → model eval; see aegir-convergence-loop). With a live viz layer, Atlas provenance, and the lineup all in place, the leaderboard should grow from “training curves” into the convergence observatory: the join of model-runs × data-products × Signals gates. It’s the natural surface to answer “is the ontology load-bearing, and is the model elucidating the relational structure we built?”

Enhancement ideas (roughly prioritized)

  1. Run ↔ data-product lineage (highest leverage). Each run records the corpus snapshot (sdg_corpus_v0_3/<hash>), ontology catalog (ae7dbee), and coverage run it trained on. Surface them as a “provenance” tab that cross-links into the lineup (the collections/chord it consumed) and Atlas (the RE_GROUNDS_TO loop-closure subgraph). Closes the run↔data-product loop visibly — the same live-viz embed the chord uses. See atlas_age_provenance_graph, lineup_kb_projection.

  2. Signals gate panels. Per run, show the M1/M2/M3 criteria status (M1 H-Net isolation; M2 3-arm × α×β instrument validity; final gate: matches RWKV-7 on general non-degeneracy AND beats the no-ontology ablation on relational + DE-elucidation, CI-clean). A pass/fail strip turns the leaderboard into the gate dashboard, not just curves. See signals_programme.

  3. Ablation-arm comparison. The corpus carries full / no-ontology / no-schema arms — overlay their curves (HoloViews overlay / HoloMap by arm) for the same data to read the ontology’s load-bearing-ness directly, instead of eyeballing separate runs.

  4. Eval-instrument-aware panels (per ontology-cpa-eval-methodology). The instrument is the binding constraint, so plot sample-efficiency curves (perf vs #examples) not full-data points, PR metrics not ROC-AUC, with bootstrap/permutation CIs, plus control-task / held-out-type / MDL-probing panels. This is where “W&B-like” stops being enough.

  5. DE-elucidation / CPA progress (the north-star). Track the model’s data-element-elucidation (CPA) over runs and tie it to the lineup Schema-chord densification (already framed as the term↔table many-to-many progress metric). The leaderboard becomes where “is the model learning the relational structure we built?” is answered.

  6. Datashader at scale. Per-step loss/grad traces over millions of steps, and a coverage heatmap of the relational footprint (which tables/views a run’s training data exercised) → server-side Datashader rasterization. Aligns with the tier-one compute posture (compute_posture_tier_one); the live stack already supports it.

  7. Interactive Panel widgets (now unblocked by the live server — the static path couldn’t). Run grouping/tags, metric pickers, cross-run config diff, run notes — Panel widgets / Tabulator.

  8. Corpus-quality panels beside model metrics. Surface the convergence proxies (coverage-close R1, topic-recovery, family-complex) next to model metrics, since the loop is one coupled system — one observatory for both products.
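
Idea 4 calls for bootstrap confidence intervals on eval metrics. A minimal percentile-bootstrap sketch over per-example (or per-seed) scores might look like the following; the resample count, the alpha level, and the function name are assumptions, and a real instrument would likely resample at the table or type level rather than over a flat list.

```python
import random
from statistics import fmean


def bootstrap_ci(values: list[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap CI for the mean of `values` (seeded for repeatability)."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and collect the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Plotting the interval as a band around each point of a sample-efficiency curve is what would distinguish a real difference between arms from seed noise.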

Dependencies / sequencing

  • Most of these need runs to carry data-product provenance in metadata.json (corpus/ontology/coverage hashes + arm), a small RunArtifacts.start addition; do that first (cheap, unblocks #1–#5).
  • #2/#5 depend on the Signals gates + the discriminating relational eval instrument (M2/M3 tasks #42/#43).
  • #6 needs datashader/dask added (deliberate uv pip install; the live viz layer already fits).
  • Keep the live HoloViews/PanelView path — these are all panes on the same bokeh server, embedded the same way; no new viz transport needed.
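
The first dependency (runs carrying data-product provenance in metadata.json) could take a shape like the sketch below. The field names and both helpers are hypothetical, not the actual RunArtifacts.start signature; only the three arm names and the idea of recording corpus / ontology / coverage identifiers come from the text.

```python
import json

ARMS = ("full", "no-ontology", "no-schema")  # the corpus arms named in the text


def provenance_block(corpus: str, ontology: str, coverage: str, arm: str) -> dict:
    """Build the provenance record a run would carry (hypothetical field names)."""
    if arm not in ARMS:
        raise ValueError(f"unknown arm: {arm!r}")
    return {
        "data_products": {"corpus": corpus, "ontology": ontology, "coverage": coverage},
        "arm": arm,
    }


def merge_into_metadata(metadata: dict, block: dict) -> str:
    """Return metadata.json content with the provenance block attached."""
    merged = {**metadata, "provenance": block}
    return json.dumps(merged, indent=2, sort_keys=True)
```

Validating the arm at write time keeps later ablation-arm overlays from silently mixing mislabeled runs.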

RL training (GRPO/RLVR) — the observatory is the reward instrument

P5 already runs GRPO/RLVR (src/aegir/rl/): the policy generates ontology compositions, the deterministic verifier composite R is the reward (parallel_verify / verifier), advantages are z-scored within each group, and there is no critic. GRPO is the right fit and PPO is not the question: a value network is pure overhead (memory + a second model to tune) when the reward is a cheap, parallelized, deterministic grader. PPO earns its keep only with learned/noisy reward models or dense per-token credit, neither of which applies here. The live questions are (a) reward-variance collapse (if R saturates or the structural gate R_A zeroes a whole group, advantages vanish → “0 reward variance forever”) and (b) GRPO refinements (length-bias debias, advantage_normalization choice; already parameterized). Both are observability + reward-design problems → this observatory.
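
The variance-collapse failure mode follows directly from how group advantages are computed. A minimal sketch of z-score group advantages, assuming population standard deviation and a small epsilon in the denominator (both assumptions; the project's own normalization may differ):

```python
from statistics import fmean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Z-score each rollout's reward against its own group (no critic needed)."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group scores the same, for instance because R_A zeroes the whole group, each advantage is exactly 0 and that group contributes no gradient. This is why the reward-variance band works as a health monitor.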

The earlier ideas are not left behind — RL makes them central, and most are nearly free because GRPOMetrics already logs them to a metrics_jsonl:

  • Reward dynamics: the rewards_mean ± rewards_std band + min/max is the headline RL panel, and the reward-variance band IS the GRPO health monitor (the collapse canary). advantage_mean/std = signal strength. Already logged → a panel reading the GRPO metrics_jsonl (like runs_app).
  • Reward-component decomposition: R_A·(0.50·R_B + 0.05·R_C + 0.45·R_D) over training (a stacked / small-multiples / PCP view). Small add: have parallel_verify log the sub-scores, not just R. This is idea #4 (eval-instrument-aware) for RL.
  • Verifier-pass-rate gates: R_A structural-gate %, HermiT-consistency %, coverage-close % = idea #2 (Signals gate panels), as the RL pass-rate dashboard.
  • SAE feature stream — already live (/api/p5/sae/stream); an RL-interpretability panel, datashade-d over GRPO steps (idea #6 at scale).
  • Sweeps PCP (built) → the GRPO hyperparameter tuner: group_size · kl_coefficient · advantage_normalization · lr × outcomes (final reward, pass-rate). This is the canonical RL-sweep use.
  • Unify the two run surfaces: /api/runs (supervised CTA/CPA) + /api/p5/runs (GRPO/RLVR) both flow into the lineup observatory (a Training ▸ Reward entry beside Sweeps). They are separate today.
  • GRPO-vs-PPO, if ever litigated, is an ablation-arm comparison (idea #3) — but the reward shape says GRPO; spend the cycles on reward granularity + curriculum, watched via the variance band.
  • Empirical guardrails — a low-cost GRPO sweep (drive it with the Sweeps PCP): group_size (4–16 is the stable-baseline range), advantage_normalization (z_score vs centered), kl_coefficient scaling, plus a simple length-normalization baseline arm. Settles the refinement questions for the cost of a handful of short runs — exactly what the PCP is for.
  • Downstream coupling metrics — surface one or two proxy-downstream signals beside reward / pass-rate: post-verbalization RWKV byte-per-byte loss on DDL / SchemaPile slices, or CTA/CPA F1 lift on held-out tables. A single panel tying reward ↑ to downstream-loss ↓ is the strongest “the proxy is real” evidence the observatory can show — it validates the convergence loop’s load-bearing assumption (higher-R ontologies → better synthetic corpus → better model) rather than trusting R on faith. This is the open calibration gap in aegir-convergence-loop (cf. E1); the observatory is where it gets watched continuously.
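
The reward-component decomposition bullet above amounts to logging the sub-scores next to the composite. A sketch under stated assumptions: the weights and the gating form are taken from the formula in the text, while the record field names and both function names are hypothetical and not the parallel_verify API.

```python
import json

# Weights from the composite R_A * (0.50*R_B + 0.05*R_C + 0.45*R_D).
WEIGHTS = {"R_B": 0.50, "R_C": 0.05, "R_D": 0.45}


def composite_reward(r_a: float, r_b: float, r_c: float, r_d: float) -> float:
    """R_A multiplies the weighted sum, so R_A = 0 zeroes the whole reward."""
    return r_a * (WEIGHTS["R_B"] * r_b + WEIGHTS["R_C"] * r_c + WEIGHTS["R_D"] * r_d)


def metrics_line(step: int, r_a: float, r_b: float, r_c: float, r_d: float) -> str:
    """One JSONL record carrying the sub-scores as well as the composite R."""
    record = {
        "step": step,
        "R_A": r_a, "R_B": r_b, "R_C": r_c, "R_D": r_d,
        "R": composite_reward(r_a, r_b, r_c, r_d),
    }
    return json.dumps(record)
```

With the sub-scores in the log, a stacked or small-multiples panel can show which component is driving R at each step instead of only the total.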

Real-world use: a verifiable-reward RL observatory (reward dynamics + verifier-pass-rate gates + data-product lineage + reproducible provenance, with the catalog hot-reload closing the loop — edit the ontology, the reward changes on the next rollout) is genuinely product-grade RLVR experiment management, differentiated from W&B by the verifiable-reward + ontology-grounded-lineage semantics.

Provenance — Verifiable Tasks & Lineage (the through-line)

Status: BUILT (instance-level ego-graph) — ILLUSTRATIVE, not definitive. Captured 2026-06-19 (RH); the first slice (a live type-level Atlas DAG) landed same day, and was superseded 2026-06-24 by an instance-level ReactFlow ego-graph (#97, commit 0657569). Reframes the top-line “Tasks” card (an unlinked Statistic, originally conceived as an Atropos-style RL-task surface) into Provenance: Verifiable Tasks & Lineage — the spine the whole pipeline already has. The card links to /lineup?open=training/provenance; the panel is a Training ▸ Provenance sibling of Sweeps and Reward. The verification overlay (per-edge gate verdicts) is still absent — that is the increment that turns the navigable lineage walk into the thesis artifact (§Maturity, §Dependencies).

What is built (v1.5 — the instance-level ego-graph)

The panel renders a node’s first-order neighbourhood as a ReactFlow graph (ui/src/components/ProvenanceGraph.tsx, @xyflow/react 12), read live from the aegir_hx Atlas / Apache AGE graph via the gateway endpoint /api/provenance/ego?focal=<vid> (src/aegir/gateway/app.py). Outgoing/derived neighbours sit right, incoming/sources left, edges are labelled by relationship type; clicking a neighbour opens its ego-graph in a new panel — the lineage is walked node-by-node in panel-trail fashion, not shown as one static type-level DAG. With no focal, the endpoint seeds an anchor (a Dataset / Run / Chapter); each node uses the AGE internal id() as identity, the neighbourhood is capped at 40 (with “more exist” surfaced), and the Atlas-core vertex label and __rdbms_* table internals are excluded. The panel degrades gracefully to an empty state when the graph is down. A provenance/<vid> synthetic note (gateway kb_note) lets a clicked node resolve as a panel; the lineup note is kind:"provenance" carrying an ego_focal seed (src/aegir/lineup/build.py).
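
The ego-graph shaping described above (sources left, derived right, capped at 40 with a "more exist" flag) can be sketched over a plain edge list. This is not the gateway endpoint; `ego_graph` is a hypothetical function, and the way the cap is shared between the two sides is an assumption.

```python
def ego_graph(edges: list[tuple[str, str, str]], focal: str, cap: int = 40) -> dict:
    """First-order neighbourhood of `focal` from (src, rel, dst) edges."""
    neighbours = []
    for src, rel, dst in edges:
        if src == focal and dst != focal:
            neighbours.append(("right", dst, rel))   # outgoing / derived
        elif dst == focal and src != focal:
            neighbours.append(("left", src, rel))    # incoming / sources
    shown = neighbours[:cap]  # assumption: one shared cap across both sides
    return {
        "focal": focal,
        "left": [(n, r) for side, n, r in shown if side == "left"],
        "right": [(n, r) for side, n, r in shown if side == "right"],
        "more_exist": len(neighbours) > cap,
    }
```

Walking the lineage is then repeated calls with a clicked neighbour as the next focal, which matches the panel-trail behaviour described above.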

This replaced the original v1, a type-level HoloViews/bokeh DAG (src/aegir/viz/provenance_app.py: the convergence chain Family/Topic → Template → Chapter → Column/Dataset → Job/Run via networkx.multipartite_layout → hv.Graph). That bokeh app is left in place but is no longer embedded; the live path is the React ReactFlow component. The move was made because the @xyflow/react GraphRenderer renders correctly client-side, where the npm @bokeh/bokehjs build of an hv.Graph did not.

Maturity: illustrative, NOT definitive

The current panel is a legibility sketch. It proves the surface (live Atlas graph → ReactFlow ego-graph → tap-to-walk at the narrow lens width), but several modelling choices remain provisional scaffolding. Do not build heavily on the current shapes. The axes (RH, 2026-06-19), with their status updated to the ego-graph:

  • Granularity — type-level → instance-level. ✅ DONE. The ego-graph nodes are now the versioned artifacts themselves (a specific Chapter, Run, Dataset, Template, …, by AGE node id), not artifact types with aggregated counts. The earlier “shape of the pipeline” cartoon is superseded by the real, walkable lineage neighbourhood.
  • Node→panel routing — coarse type→whole-lens → contextual drill-in. PARTIALLY DONE. Clicking a node now opens that node’s own ego-graph (its identity seeds the next panel), rather than the type’s whole lens. The remaining gap is a richer artifact-detail panel — e.g. a Run/Job should reach a run-detail view (or Sweeps/Reward scoped to that run), not just its lineage neighbourhood.
  • Artifact set & layout — curated whitelist → topology-derived. STILL OPEN. The node set is now derived from the live graph (the focal’s actual neighbours), but the global lineage is still not laid out from topology — there is no whole-graph multipartite/loop-aware view, and the RE_GROUNDS_TO loop-closure edge is a genuine cycle that a single ego-hop does not render as a loop. A topology-derived overview (expand/collapse between the walk and a whole-graph layout) remains future work.
  • The verification overlay is absent — the “Verifiable” half is unbuilt. STILL OPEN. Edges are plain derivations; the point of Verifiable Tasks & Lineage is per-edge gate verdicts (R-pass · HermiT-consistent · coverage-R1 · downstream-eval-lift) encoded on the graph. That is the increment that turns the navigable lineage into the thesis artifact (§Dependencies).

So: current implementation = illustrative, instance-level navigation without verdicts. Definitive = topology-derived overview + contextual artifact-detail panels + a verification overlay. Treat the present node/edge shapes as scaffolding to be extended, not as settled design.

Why the pivot (what Atropos told us)

NousResearch/Atropos is a clean RL-environments gym: a BaseEnv bundles rollout generation + scoring + dataset, runs as a microservice pushing ScoredDataGroups (trajectories + scores + metadata) to a trainer-agnostic Trajectory API (run-api), with verifiable/rule-based rewards front and center (GSM8K exact-match, tool-calling, code-exec). It nails the RL-task half, and the tell is what it lacks: no formal versioning/provenance system (provenance is “implicit in server state” + JSONL lineage). That gap is exactly Aegir’s asset. Atropos’s “task” = a verifiable environment; our insight (a sequence of events and gates over versionable intermediate artifacts) is what Atropos doesn’t model and Atlas only half-models. Their union is the differentiator, so the card should name it.

The data model

A provenance DAG: nodes are versioned artifacts (FinePDFs ground → ontology catalog vN → DDL spine → corpus snapshot → model checkpoint → GRPO/eval run); edges are verifiable events — a derivation that passed a gate / earned a reward / lifted a downstream eval. Each edge carries its verdict.
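A minimal sketch of that structure, assuming nothing about how Atlas actually stores it: versioned artifacts as nodes, and edges that carry their gate verdicts. The class names `Artifact` and `VerifiedEdge` and the helper `chain_is_verified` are hypothetical; the verdict keys reuse the gate names from this chapter.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Artifact:
    """A versioned node, e.g. a catalog version or a checkpoint."""
    kind: str
    version: str


@dataclass
class VerifiedEdge:
    """A derivation between two artifacts plus the verdicts it earned."""
    source: Artifact
    target: Artifact
    verdicts: dict = field(default_factory=dict)  # gate name -> bool

    def passed(self):
        # An edge with no recorded verdicts is unverified, not passing.
        return bool(self.verdicts) and all(self.verdicts.values())


def chain_is_verified(edges):
    """True only if every edge on the chain passed all of its gates."""
    return all(edge.passed() for edge in edges)


# Usage: a two-hop chain, catalog -> corpus -> checkpoint.
catalog = Artifact("catalog", "vN")
corpus = Artifact("corpus", "snapshot-1")
checkpoint = Artifact("checkpoint", "step-50")
chain = [
    VerifiedEdge(catalog, corpus, {"R-pass": True, "HermiT-consistent": True}),
    VerifiedEdge(corpus, checkpoint, {"downstream-eval-lift": False}),
]
print(chain_is_verified(chain))
```

Treating an edge without verdicts as unverified mirrors the current state of the panel, where edges are plain derivations until the overlay lands.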

This single structure subsumes the three things the card was straddling:

  • RL Tasks (Atropos-style) = one edge kind: a rollout scored by the verifier R → a policy/checkpoint. The GRPO loop + parallel_verify already is this.
  • Enterprise lineage (Atlas) = the DAG itself. Atlas is already the provenance store (OpenLineage datasets/jobs/runs + the RE_GROUNDS_TO loop-closure edge — see atlas_age_provenance_graph). So Provenance = the Atlas lineage graph + a verification overlay.
  • Gates = the edge verdicts (Signals M1/M2/M3, HermiT consistency, realization-as-CPA, coverage-R1, the TBD downstream RWKV evals) — the “verifiable” in Verifiable Tasks & Lineage.

It also subsumes the observatory’s run↔data-product lineage (idea #1 in leaderboard_observatory.md) — that was a slice; Provenance is its substrate. And it makes the convergence loop legible as a chain, not a vibe: ontology vN —(R↑, HermiT✓)→ corpus —(byte/byte↓)→ model (cf. aegir-convergence-loop).

Atlas integration

Atlas (OpenLineage on AGE) holds the lineage; Provenance adds the verification overlay on the edges (R-pass, HermiT-consistent, coverage-R1, downstream-eval-lift) and the artifact versions (catalog versions, the lineup archive snapshots, corpus hashes, checkpoints). The Provenance panel sources the Atlas graph live (/api/provenance/ego) and walks it node-by-node — the integration RH sensed. Direction: emit the RL/eval gate events as OpenLineage facets on the existing run/dataset nodes, then render those facets as the per-edge verdict overlay.
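One way the facet direction could look, as a sketch rather than a settled schema: OpenLineage custom facets are JSON objects keyed under a prefixed name and carry `_producer` and `_schemaURL` fields. The facet name `aegir_gateVerdicts`, the function `attach_gate_verdicts`, and the URLs passed in are all hypothetical; the event here is a plain dict, not an object from the OpenLineage client library.

```python
def attach_gate_verdicts(run_event, verdicts, producer, schema_url):
    """Add a custom run facet holding gate verdicts to an event dict.

    `run_event` is a plain OpenLineage-style dict; the facet key and the
    payload layout are assumptions made for this sketch.
    """
    facets = run_event.setdefault("run", {}).setdefault("facets", {})
    facets["aegir_gateVerdicts"] = {
        "_producer": producer,       # URI identifying the emitting code
        "_schemaURL": schema_url,    # URI of the facet's JSON schema
        "verdicts": dict(verdicts),  # gate name -> bool, copied defensively
    }
    return run_event


# Usage with placeholder URIs (not real endpoints).
event = {"eventType": "COMPLETE", "run": {"runId": "example-run"}}
attach_gate_verdicts(
    event,
    {"R-pass": True, "HermiT-consistent": True},
    producer="https://example.invalid/aegir",
    schema_url="https://example.invalid/aegir/gate-verdicts.json",
)
print(event["run"]["facets"]["aegir_gateVerdicts"]["verdicts"])
```

The panel would then read this facet off the run node and paint it on the adjacent edges.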

Adopt-vs-keep Atropos (orthogonal to the pivot)

Provenance wraps whichever RL harness — keep grpo_loop (our verifier R / HermiT / reasoner is a richer reward than exact-match), but Atropos’s microservice + Trajectory-API decoupling is a good pattern to borrow if we grow to many verifiable tasks (DE-elucidation, CPA, downstream RWKV evals as separate environments feeding one trajectory queue). Borrow the shape, not necessarily the code.

The card / panel (as built)

“Tasks” (unlinked stub) → Provenance → a lineup panel rendering a node’s instance-level ego-graph (ReactFlow / @xyflow/react), sourced live from the aegir_hx Atlas graph via /api/provenance/ego. Unifies the lenses (artifacts) + Sweeps/Reward (runs) into one navigable lineage. Landed as a Training ▸ Provenance sibling (not its own nav group): a kind:"provenance" note carrying an ego_focal seed. Unlike the bokeh Sweeps/Reward panels (which mount a viz_app via PanelView), Provenance is a native React component, the move that fixed the client-side graph-render path. The verification overlay (per-edge gate verdicts) is the next increment on top of this surface.

Dependencies / sequencing

  • DONE (v1): cheapest first slice. Rendered the existing Atlas lineage subgraph as a type-level HoloViews graph panel (provenance_app.py: Family/Topic → Template → Chapter → Column/Dataset → Job/Run via networkx.multipartite_layout → hv.Graph), proving the surface. Superseded by the ego-graph below.
  • DONE (v1.5): instance-level ReactFlow ego-graph (ProvenanceGraph.tsx + /api/provenance/ego) replacing the static type-level DAG: a node’s 1-hop neighbourhood, click-to-walk in panel-trail fashion, graceful empty state when Atlas is down (#97, commit 0657569).
  • The verification overlay needs gate verdicts as data: the RLVR reward (have it), HermiT/coverage (have them), downstream RWKV evals (TBD — the observatory’s downstream-coupling metrics feed here).
  • A topology-derived whole-graph overview (loop-aware layout, expand/collapse against the ego-walk).
  • Artifact versions: catalog versions + corpus hashes + lineup archive snapshots already exist; wire them as node versions (the RunArtifacts.start provenance stamp — also the observatory unblocker).

Development Guide

Building and Running

Critical: Always Use --no-sync

uv run --no-sync python main.py

The --no-sync flag prevents uv from re-resolving and reinstalling dependencies before running. This is required because flash-attn, flash-linear-attention (fla), mamba-ssm, and causal-conv1d are patched CUDA extensions that were built manually with corrected CXX11 ABI flags. Running uv run without --no-sync will clobber these patched builds with incompatible PyPI wheels.

Smoke Tests

# Model instantiation and forward pass shapes
uv run --no-sync python main.py

# Training loop validation (tiny model, synthetic data)
uv run --no-sync python train.py --smoke-test --model-size tiny --epochs 3

Multi-GPU Training

# 6x RTX 4090 training
uv run --no-sync torchrun --nproc_per_node=6 train.py \
    --model-size small \
    --epochs 100 \
    --batch-size 64 \
    --lr 1e-4

Training uses DDP (DistributedDataParallel), AMP with bf16, cosine LR schedule with linear warmup, and load balancing loss for dynamic chunking regularization.
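For readers unfamiliar with the schedule named above, a cosine decay with linear warmup can be written as a pure function of the step. This is a generic sketch, not the code in train.py: the function name `lr_at`, the per-step granularity, and the `warmup_steps` and `min_lr` parameters are assumptions; only the base rate of 1e-4 comes from the command above.

```python
import math


def lr_at(step, total_steps, warmup_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at a 0-indexed step: linear warmup, then cosine decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Ramp linearly so the last warmup step reaches base_lr exactly.
        return base_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half a cosine period: base_lr at progress 0, min_lr at progress 1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Usage: inspect a few points of a 100-step schedule with 10 warmup steps.
for s in (0, 9, 10, 55, 100):
    print(s, lr_at(s, total_steps=100, warmup_steps=10))
```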

CUDA Extension Build Notes

The devenv/Nix environment provides GCC 15, which sets _GLIBCXX_USE_CXX11_ABI=1. However, PyTorch’s cu124 wheels are built with _GLIBCXX_USE_CXX11_ABI=0. This ABI mismatch causes segfaults when CUDA extensions link against the wrong ABI.

Patching Procedure

Both mamba-ssm and flash-attn have a CachedWheelsCommand in their setup.py that downloads prebuilt wheels from GitHub releases, bypassing local compilation. To force a local build with the correct ABI:

  1. Set environment variables to force local build:

    export MAMBA_FORCE_BUILD=TRUE
    export FLASH_ATTENTION_FORCE_BUILD=TRUE
    
  2. Use env -i with system GCC-11 to get the correct ABI:

    env -i PATH=/usr/bin:$PATH HOME=$HOME \
        pip install --no-build-isolation /tmp/mamba_src/mamba_ssm-2.3.1/
    
  3. Patch setup.py in each extension to add explicit _abi_flag matching torch’s ABI.

Patched source trees are kept in /tmp/mamba_src/ and /tmp/flash_src/. See docs/scratch/2026-03-28/010808_deps_smoke_train.md for the full step-by-step procedure.

Verifying the Build

After patching, verify that the extensions load correctly:

uv run --no-sync python -c "import mamba_ssm; print('mamba-ssm OK')"
uv run --no-sync python -c "import flash_attn; print('flash-attn OK')"
uv run --no-sync python -c "from fla.ops.rwkv7 import chunk_rwkv7; print('fla OK')"

Adding New Block Types

The architecture supports mixed block types (Mamba2, MHA, RWKV-7, RWKV-8 ROSA) within a single model. To add a new block type:

1. Implement the Mixer Class

Create a new module that implements three methods:

class MyNewMixer(nn.Module):
    def forward(self, hidden_states, inference_params=None, **kwargs):
        """Full-sequence forward pass. Input: (B, L, D). Output: (B, L, D)."""
        ...

    def step(self, hidden_states, inference_params):
        """Single-token autoregressive step. Input: (B, 1, D). Output: (B, 1, D)."""
        ...

    def allocate_inference_cache(self, batch_size, max_seqlen, dtype=None, **kwargs):
        """Allocate KV cache or recurrent state for inference."""
        ...

2. Register in create_block()

Add the new type to src/aegir/modules/block.py:

def create_block(arch, d_model, ...):
    if arch in ("x", "X"):  # new block type code
        from my_module import MyNewMixer
        mixer_cls = partial(MyNewMixer, **factory_kwargs, layer_idx=layer_idx)
    ...

Convention: lowercase letter = mixer only (no MLP), uppercase = mixer + SwiGLU MLP.

3. Add to Isotropic Forward Loop

In src/aegir/modules/isotropic.py, add the new block type to:

  1. The regex pattern that parses layout strings:

    layout_parse = re.findall(r"([mMtTrRwWxX])(\d+)", arch_layout)
    
  2. The forward loop’s block-type dispatch:

    elif arch in ("x", "X"):
        layer_mixer_kwargs = {}  # or whatever kwargs your mixer needs
        if hidden_states.dim() == 2:
            hidden_states = hidden_states.unsqueeze(0)
            residual = None if residual is None else residual.unsqueeze(0)
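To check that a new letter is picked up by the layout regex, the parse step can be exercised on its own. The sketch below reuses the regex from step 1 and the case convention from create_block(); the helper name `parse_layout` and the reading of the digits as a repeat count are assumptions, not a copy of isotropic.py.

```python
import re

LAYOUT_RE = r"([mMtTrRwWxX])(\d+)"  # regex from step 1, including the new x/X code


def parse_layout(arch_layout):
    """Expand a layout string into one (code, has_mlp) pair per block.

    Assumes the number after each letter is how many blocks of that type to
    stack. Uppercase means mixer + SwiGLU MLP, lowercase means mixer only.
    """
    blocks = []
    for code, count in re.findall(LAYOUT_RE, arch_layout):
        blocks.extend([(code, code.isupper())] * int(count))
    return blocks


# Usage: two mixer-only "r" blocks followed by one "X" block with an MLP.
print(parse_layout("r2X1"))
```

A letter missing from the character class is silently skipped by re.findall, so comparing the expanded block count with the intended depth is a cheap check after editing the pattern.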
    

4. Test

# Verify the new block type instantiates and runs
uv run --no-sync python main.py

Project Structure

aegir/
  main.py                          -- Smoke tests
  train.py                         -- Training script (DDP, AMP, cosine LR)
  src/aegir/
    models/
      config.py                    -- AegirConfig, SSMConfig, AttnConfig, RWKVConfig
      aegir.py                     -- Recursive hierarchical backbone
      heads.py                     -- AegirForCausalLM, AegirForColumnAnnotation
    modules/
      block.py                     -- Block factory (create_block)
      isotropic.py                 -- Flat block stack with mixed types
      dc.py                        -- Dynamic chunking (RoutingModule, ChunkLayer, DeChunkLayer)
      rwkv7_tmix.py                -- RWKV-7 full TimeMix (fla kernels)
      rwkv.py                      -- RWKV-8 ROSA time mixing + relu^2 channel mixing
      rosa.py                      -- ROSA suffix automaton (CPU-based)
      mlp.py                       -- SwiGLU MLP
    swarm/
      state_fusion.py              -- RWKVStateFusion (3 modes)
      alignment.py                 -- AlignmentProjection (cross-agent state mapping)
      specialist.py                -- FrozenSpecialist wrapper
      orchestrator.py              -- SwarmOrchestrator (K2.5 PARL)
    data/
      serialization.py             -- Table-to-byte-sequence serialization
      context_select.py            -- MMR context column selection
      table_dataset.py             -- PyTorch dataset for table benchmarks
    utils/
      train.py                     -- Load balancing loss, F1 metrics, param grouping
  docs/                            -- mdbook documentation (this book)
  ref/                             -- Reference papers

Documentation

Build and serve the documentation locally:

just docs-build           # → docs/current/book/
just docs-serve           # http://localhost:3000

# Equivalent invocations without just:
mdbook build docs/current
mdbook serve docs/current

The mdbook layout is the standard mdbook init shape, rooted at docs/current/: configuration in docs/current/book.toml, sources under docs/current/src/, theme overrides in docs/current/theme/, generated HTML emitted to docs/current/book/. Per-session scratch notes and archived chapters live outside this book root, under docs/scratch/ and docs/archive/ respectively, so the build picks up only curated content.

The book uses mdbook with d2 (architecture diagrams) and a client-side MathJax 3 shim for math, all provisioned by devenv.

Worktree Aware Development

Ægir’s dev tooling is worktree aware: just recipes and devenv up detect whether the current checkout is the primary one or a secondary linked worktree (created by git worktree add), and gate shared-state services accordingly. This lets you split a working session across two checkouts — e.g. systems work in one, UI iteration in another — without port collisions, double-launched databases, or two copies of a training run fighting over the same checkpoint directory.

Why bother

Plain git clone gives you one working directory. When you need to look at two branches simultaneously — say, low-level model work on trunk and UI iteration on feat/leaderboard-redesign — the usual options are all expensive:

  • git stash + git checkout repeatedly: loses focus, scrambles your editor state, gates the two contexts behind serial discipline.
  • A second git clone: doubles the disk usage, doubles the dependency install, doesn’t share .git/objects so fetches are duplicated, two separate venvs to keep in sync.
  • A pair of containers: heavyweight, hostile to GPU passthrough, and the in-host tooling has to be reproduced inside each.

git worktree add <path> <branch> is the under-used answer:

# In the primary checkout
git worktree add ../ae-ui-dev rch/ui-dev

This creates a second working directory at ../ae-ui-dev checked out to rch/ui-dev, sharing .git/objects with the primary. Two editor windows, two shells, one repo. The catch: both checkouts share the host. Run devenv up in both and they fight for :5555 (Postgres), :6355 (Qdrant), :8091 (gateway), and the on-disk Postgres data dir.

Ægir’s worktree-aware tooling resolves that without the user having to remember which checkout is “the one with the database in it”.

Detection

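bin/detect-worktree-role.sh reads two filesystem signals: whether .git at the top of the checkout is a directory (the primary checkout) or a regular file (a linked worktree, where git worktree add leaves a gitdir pointer file instead of a directory).

The script’s logic is two test statements: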

if [ -d "$top/.git" ]; then
    echo primary
elif [ -f "$top/.git" ]; then
    echo secondary
fi

That’s it. No state, no daemons, no config files. The check runs in <5 ms and is called from both the Justfile (parse-time) and devenv.nix (via the AEGIR_WORKTREE_ROLE env var that the primary’s enterShell sets for downstream tools).
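For tooling that is not a shell script, the same check is easy to mirror. The following is a sketch of an equivalent in Python, not something shipped in the repository; the function name `detect_worktree_role` is hypothetical, and returning None for a path that is not a checkout corresponds to the shell script printing nothing.

```python
from pathlib import Path


def detect_worktree_role(top):
    """Classify a checkout by the type of its top-level .git entry."""
    git = Path(top) / ".git"
    if git.is_dir():
        return "primary"    # a normal clone keeps the repository in .git/
    if git.is_file():
        return "secondary"  # a linked worktree has a .git pointer file
    return None             # not the top of a checkout


# Usage: classify the current directory.
print(detect_worktree_role("."))
```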

What’s gated

| Service / recipe | Primary | Secondary | Override |
| --- | --- | --- | --- |
| services.postgres (devenv) | enabled | disabled | n/a |
| processes.qdrant (devenv) | enabled | disabled | n/a |
| processes.gateway (devenv) | enabled | disabled | n/a |
| processes.vite-dev (devenv) | enabled | disabled | n/a |
| just gateway | runs | refuses | ALLOW_SECONDARY=1 + AEGIR_GATEWAY_PORT=… |
| just p5-train | runs | refuses | ALLOW_SECONDARY=1 + AEGIR_P5_OUTPUT_DIR=… |
| just ui-dev | runs | runs | (no guard; Vite picks an alt port) |
| just whoami | runs | runs | (diagnostic only) |
| just sync / bdd-* / etc. | runs | runs | (no shared-state writes) |

Secondary worktrees connect to the primary’s services via localhost:<port>. The filesystem path /raid/checkpoints/p5/ is shared — the primary writes, the secondary’s gateway (if running, via ALLOW_SECONDARY) reads, the secondary’s UI subscribes to the SSE stream regardless of which gateway it talks to.

Workflow

# ── In the primary checkout (e.g. systems work) ───────────────
just whoami                 # → primary
devenv up                   # postgres, qdrant, gateway, vite-dev all start
just p5-train               # 9B-local GRPO/RLVR training
                            # writes /raid/checkpoints/p5/sae_features.live.jsonl
# ── In the secondary checkout (e.g. UI iteration) ─────────────
git worktree add ../ae-ui-dev rch/ui-dev
cd ../ae-ui-dev
just whoami                 # → secondary
devenv up                   # skips services with a hint; safe to call
just ui-dev                 # Vite dev server (auto-picks free port)
                            # subscribes to primary's :8091 gateway
                            # SSE: GET /api/p5/sae/stream

The UI sees feature activations within seconds of each GRPO step — see Cross-worktree SAE streaming for the data path.

Overriding the role

Set AEGIR_WORKTREE_ROLE to override the detection script. Useful when:

  • A CI harness wants to mock secondary behavior in the primary checkout.
  • You’ve added a third worktree and want it to behave as primary (and accepted responsibility for picking a non-colliding port set).
  • The secondary checkout is the systems checkout for a session (you flipped roles deliberately).

AEGIR_WORKTREE_ROLE=secondary just whoami     # force-secondary
AEGIR_WORKTREE_ROLE=primary devenv up         # force-primary

devenv reads the env var via lib.maybeEnv "AEGIR_WORKTREE_ROLE" "primary", so the primary default is preserved when the var is unset.

When to bypass the guards

Each guard has an explicit override path so you’re never blocked, just forced to be deliberate:

  • ALLOW_SECONDARY=1 AEGIR_P5_OUTPUT_DIR=/raid/checkpoints/p5-experiment-2 just p5-train --policy-preset 9b-local-l0-100 runs a parallel training run from the secondary checkout against a distinct output directory.
  • ALLOW_SECONDARY=1 AEGIR_GATEWAY_PORT=8092 just gateway runs a satellite gateway from the secondary checkout on a distinct port (e.g. for testing a UI build against an isolated backend).

The pattern is: the recipe refuses by default to surface the conflict, and the env vars give you the explicit knobs needed to make the override collision-free.

Adding new worktree-aware recipes

When adding a recipe that binds a port or writes shared state, branch on {{worktree_role}} early:

my-new-recipe:
    #!/usr/bin/env bash
    set -euo pipefail
    if [ "{{worktree_role}}" = "secondary" ] && [ "${ALLOW_SECONDARY:-0}" != "1" ]; then
        echo "[aegir] worktree role = secondary; refusing my-new-recipe." >&2
        echo "        Run in primary, or ALLOW_SECONDARY=1 with a distinct OUTPUT_DIR." >&2
        exit 2
    fi
    # ... actual recipe body ...

The diagnostic message should name (a) the conflicting resource and (b) the env var(s) that disambiguate an override.

For pure-compute recipes (no port, no shared write) — e.g. the BDD suite, smoke tests, schema validators — leave them unguarded. They’re safe to run in any worktree concurrently.

See also

  • Cross-worktree SAE streaming — the filesystem-and-SSE pipeline that lets a UI worktree observe a training run in the primary worktree in near-real-time.
  • bin/detect-worktree-role.sh — the detection script.
  • Justfile — the worktree_role parse-time variable and per-recipe guards.
  • devenv.nix — the worktreeRole let-binding and lib.mkIf isPrimary service gates.

Cross-worktree SAE streaming

The motivating use case for Worktree Aware Development: a P5 GRPO/RLVR training run executes in the primary worktree while a React UI runs in a secondary worktree, and the UI observes SAE feature activations from the running training process in near-real-time.

Pipeline

The two JSONL files participate at different cadences:

  • sae_features.live.jsonl at the run root (default /raid/checkpoints/p5/sae_features.live.jsonl) — appended every sae_live_spill_every_n_steps GRPO steps (default: 1) by SidecarCallback.on_step_end. Records produced between spills are cleared from the in-memory buffer on each spill, so the live tail and the per-checkpoint snapshots are mutually exclusive (no double-counting).
  • checkpoint-N/sae_features.jsonl under each checkpoint dir — point-in-time snapshot written on every Trainer save event (every 50 steps by default). Primary source for post-hoc analysis; the fallback for the SSE stream when no live tail exists yet (i.e. resume-from-checkpoint where the trainer has saved but hasn’t yet reached the first live-spill cadence tick).
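The preference order described above (live tail first, newest checkpoint snapshot as fallback) can be sketched as a small selection function. This is an illustration under assumptions, not the gateway's implementation: `pick_sae_source` and its return shape are hypothetical, while the two filenames are the documented defaults.

```python
import re
from pathlib import Path


def pick_sae_source(output_dir,
                    live_name="sae_features.live.jsonl",
                    snapshot_name="sae_features.jsonl"):
    """Return (kind, path): 'live', 'snapshot', or ('idle', None)."""
    root = Path(output_dir)
    live = root / live_name
    if live.is_file():
        return "live", live
    best = None  # (step, snapshot path) of the newest checkpoint seen so far
    for child in root.glob("checkpoint-*"):
        match = re.fullmatch(r"checkpoint-(\d+)", child.name)
        snapshot = child / snapshot_name
        if match and snapshot.is_file():
            step = int(match.group(1))
            # Compare numerically so checkpoint-100 beats checkpoint-50.
            if best is None or step > best[0]:
                best = (step, snapshot)
    if best is None:
        return "idle", None
    return "snapshot", best[1]


# Usage against the documented default run root.
print(pick_sae_source("/raid/checkpoints/p5"))
```

Parsing the step as an integer matters: a plain string sort would rank checkpoint-50 above checkpoint-100.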

Endpoints

GET /api/p5/runs

Walks cfg.p5.output_dir for checkpoint-N/ subdirs. Returns one row per checkpoint with the aegir_metadata.json sidecar payload plus a count of SAE feature records spilled to that checkpoint:

{
  "rows": [
    {
      "step": 50,
      "checkpoint": "checkpoint-50",
      "metadata": { "run_id": "…", "catalog_version": "0.7.0-combined", … },
      "n_sae_records": 320,
      "checkpoint_dir": "/raid/checkpoints/p5/checkpoint-50"
    }
  ],
  "count": 1,
  "p5_dir": "/raid/checkpoints/p5",
  "p5_dir_exists": true
}

GET /api/p5/sae/stream

Server-sent events stream of SAE feature records. Polls cfg.p5.output_dir every 0.5 s. Prefers the live JSONL when present; falls back to the latest checkpoint’s snapshot otherwise. Five event kinds:

| Event | Body shape | When |
| --- | --- | --- |
| (default) | SAELogRecord JSON | one per new line in the active source |
| event: source | {source: "live" \| "snapshot:…", path} | active source switched |
| event: checkpoint | {step, checkpoint} | snapshot fallback rolled to a new checkpoint dir |
| event: idle | {reason} | no live log and no checkpoints exist yet |
| event: heartbeat | {} | every ~10 s, keeps proxies awake |
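A client has to separate the unnamed record events from the named control events. The sketch below parses the standard SSE wire format (event: and data: fields, blank line as terminator) from any iterable of text lines. The function name `parse_sse` is hypothetical, and it assumes every data payload is JSON, as the body shapes above suggest. Unnamed events are reported under the SSE default type "message".

```python
import json


def parse_sse(lines):
    """Yield (event_name, payload) pairs from an iterable of SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; dispatch only if it had data.
            if data:
                yield event, json.loads("\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # SSE comment line
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].lstrip(" "))


# Usage on a canned stream: one record, then a heartbeat.
sample = [
    'data: {"step": 42, "layer_index": 16}\n',
    "\n",
    "event: heartbeat\n",
    "data: {}\n",
    "\n",
]
for name, payload in parse_sse(sample):
    print(name, payload)
```

The same function works on a live response as long as the HTTP client yields decoded lines one at a time.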

Record fields

Each default data: event carries one SAELogRecord dict:

{
  "step": 42,
  "token_index": 8,
  "layer_index": 16,
  "top_feature_indices":     [3201, 14872, 5891, …],
  "top_feature_activations": [4.21, 3.87,  3.04, …],
  "reconstruction_loss":     0.0142
}

  • top_feature_indices + top_feature_activations: the K SAE features that fired hardest at this layer for this token (default K=16).
  • reconstruction_loss: ||x − SAE(x)||² on the residual stream. A spike here flags an under-budgeted moment (the L0 sparsity dropped a concept that mattered for this composition).
  • layer_index: which transformer layer’s residual produced this record. The hooked layers are chosen by default_layers_to_hook(num_layers, n) (aegir.rl.policy), which returns n evenly spaced indices with step = num_layers // n. n is the --sae-num-layers flag, default 2, so Qwen3.5-9B-Base (40 layers) hooks [0, 20] and the 27B base (64 layers) hooks [0, 32]. Raising --sae-num-layers to 4 spreads them wider ([0, 10, 20, 30] on the 9B, [0, 16, 32, 48] on the 27B).
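The layer-selection rule is small enough to restate as code. This is a sketch of the behaviour described above, not the source of aegir.rl.policy.default_layers_to_hook; the function name carries a `_sketch` suffix to make that explicit, and the argument validation is an addition.

```python
def default_layers_to_hook_sketch(num_layers, n):
    """Return n evenly spaced layer indices using step = num_layers // n."""
    if n < 1 or n > num_layers:
        raise ValueError("n must be between 1 and num_layers")
    step = num_layers // n
    return [i * step for i in range(n)]


# Usage: the four cases quoted in the text.
for layers, n in ((40, 2), (64, 2), (40, 4), (64, 4)):
    print(layers, n, default_layers_to_hook_sketch(layers, n))
```

Because the indices start at 0 and use floor division, the last hooked layer is always below num_layers, and the final layers are never hooked when n is small.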

Smoke test

From the secondary worktree, while the primary runs just p5-train:

# In the primary checkout
just p5-train --policy-preset 9b-local-l0-50

# In the secondary checkout (a different shell)
curl http://localhost:8091/api/p5/runs | jq '.count, .rows[0]'
curl -N http://localhost:8091/api/p5/sae/stream | head -20

The -N flag on curl disables output buffering; without it, SSE events queue in 4 KB chunks and you won’t see them until enough land.

What the UI gets to visualize

The brief’s morphism reading: input bytes → SAE features → ontology term selection → output bytes. The UI surfaces this in near-real-time so a user can correlate a bad reward with the features that fired during the failing generation:

  • A heatmap of top_feature_activations over (layer, time) shows which layers’ SAE dictionaries dominated each generation.
  • A scatter of reconstruction_loss vs. final R (verifier reward, recorded separately in grpo_metrics.jsonl) tests whether the L0 sparsity budget is dropping concepts that matter — a negative correlation motivates a post-hoc L0=50 ablation.
  • Sustained activation of a small feature subset across gate-passing generations is the post-hoc interpretability claim: those features are the policy’s “vocabulary” for ontology-term selection in the morphism.
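The scatter in the second bullet reduces to a correlation between per-step reconstruction loss and reward. A minimal offline sketch follows, assuming the records are aggregated to a mean loss per GRPO step and that rewards are already available as a step-to-R mapping. The layout of grpo_metrics.jsonl is not documented here, so loading it is left out, and all three function names are hypothetical.

```python
import math
from collections import defaultdict


def mean_loss_by_step(records):
    """Average reconstruction_loss over all SAELogRecord dicts of each step."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in records:
        acc = sums[rec["step"]]
        acc[0] += rec["reconstruction_loss"]
        acc[1] += 1
    return {step: total / count for step, (total, count) in sums.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


def loss_reward_correlation(records, reward_by_step):
    """Correlate mean loss with reward over the steps present in both inputs."""
    losses = mean_loss_by_step(records)
    steps = sorted(set(losses) & set(reward_by_step))
    return pearson([losses[s] for s in steps], [reward_by_step[s] for s in steps])
```

A value near -1 would be the negative correlation the bullet refers to; a value near 0 gives no support for an L0 ablation on these grounds.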

Configuration

aegir.config.P5Cfg (mirroring aegir.rl.checkpointing.CheckpointConfig):

| Field | Default | Override |
| --- | --- | --- |
| output_dir | /raid/checkpoints/p5 | AEGIR_P5_OUTPUT_DIR |
| sae_log_filename | sae_features.jsonl | (HOCON aegir.p5.sae_log_filename) |
| sae_live_log_filename | sae_features.live.jsonl | (HOCON aegir.p5.sae_live_log_filename) |
| metadata_filename | aegir_metadata.json | (HOCON aegir.p5.metadata_filename) |

aegir.rl.checkpointing.CheckpointConfig:

| Field | Default | Notes |
| --- | --- | --- |
| sae_live_spill_every_n_steps | 1 | Cadence of SidecarCallback.on_step_end flush |

The primary and secondary worktrees see the same output_dir because both inherit AEGIR_P5_OUTPUT_DIR (or the default) — the /raid path is host-shared. No additional IPC, no additional ports, no additional state.