Leaderboard → Convergence Observatory (enhancement ideas)
Status: IDEAS (not built). Captured 2026-06-18 (RH) once the FinePDFs→ontology→DDL→corpus-with-embedded-views pipeline materialized and the viz layer became live HoloViews/Panel/Datashader served by a bokeh server behind the gateway proxy (UI-U5; commits 8c42bbd / 6ac9491). The leaderboard’s only prior design intent was “kinda like W&B but with HoloViews.” Now there’s a reason to aim higher.
Where it is today
One row per training run; clicking a run opens a drawer with live HoloViews curves (loss / F1 /
per-stage chunking boundary), rendered by aegir.viz.runs_app over the bokeh server, embedded via
<PanelView>. Air-gapped, no npm @bokeh/bokehjs.
Built: the Training section + Sweeps + Reward (v1). The lineup left-nav now has a TRAINING
group below the lenses. Its first entry, Sweeps, opens a live HoloViews parallel-coordinates panel
(aegir.viz.sweeps_app) over the runs: each run is a line across model_size · num_params · lr ·
epochs · macro/micro F1 · val_loss, colored by best macro-F1, reading live from outputs/runs.
Reward (reward_app) is the GRPO health monitor: the reward R mean±std band (the variance-collapse
canary) + R_A pass-rate + z-scored advantage, read off a run’s GRPO metrics_jsonl. The Landing
“Runs” card is now “Sweeps” → /lineup?open=training/sweeps. The Sweeps panel is HoloViews-native by
directive (no canonical-PCP widget), which leaves the door open to the superior version (datashade
for many runs, hv.link_selections axis brushing). The TRAINING group is extensible: the ideas below
become further entries (each a kind:"training" note with a viz_app frontmatter + a bokeh-server app).
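The per-axis scaling that a parallel-coordinates view like Sweeps relies on can be sketched without any plotting library. This is a minimal illustration, not the aegir.viz.sweeps_app code: the function name `pcp_rows`, the chosen subset of axes, and the sample values are all assumptions made for the example.

```python
from typing import Dict, List, Sequence

# A subset of the axes named above; the exact field names are assumed.
AXES = ["num_params", "lr", "epochs", "macro_f1", "val_loss"]


def pcp_rows(runs: Sequence[Dict[str, float]],
             axes: Sequence[str] = AXES) -> List[List[float]]:
    """Scale every axis to [0, 1] so each run becomes one polyline across the axes."""
    lo = {a: min(r[a] for r in runs) for a in axes}
    hi = {a: max(r[a] for r in runs) for a in axes}
    rows = []
    for r in runs:
        row = []
        for a in axes:
            span = hi[a] - lo[a]
            # A constant axis has no spread; pin it to the middle of the axis.
            row.append(0.5 if span == 0 else (r[a] - lo[a]) / span)
        rows.append(row)
    return rows


# Made-up values, for illustration only.
runs = [
    {"num_params": 1e6, "lr": 1e-3, "epochs": 5, "macro_f1": 0.60, "val_loss": 0.90},
    {"num_params": 4e6, "lr": 3e-4, "epochs": 5, "macro_f1": 0.70, "val_loss": 0.70},
]
rows = pcp_rows(runs)
```

Each inner list holds one run's vertex heights, ready to be drawn as a path across evenly spaced axes. A log scale for lr and num_params would likely read better in practice; linear scaling is used here only to keep the sketch short.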
The reframe
The pipeline now emits coupled Data Products (ontology · relational/DDL footprint · corpus with embedded views) and model runs, and the convergence loop couples them (proxy signals → model eval; see aegir-convergence-loop). With a live viz layer, Atlas provenance, and the lineup all in place, the leaderboard should grow from “training curves” into the convergence observatory: the join of model-runs × data-products × Signals gates. It’s the natural surface to answer “is the ontology load-bearing, and is the model elucidating the relational structure we built?”
Enhancement ideas (roughly prioritized)
- Run ↔ data-product lineage (highest leverage). Each run records the corpus snapshot (sdg_corpus_v0_3/<hash>), ontology catalog (ae7dbee), and coverage run it trained on. Surface them as a “provenance” tab that cross-links into the lineup (the collections/chord it consumed) and Atlas (the RE_GROUNDS_TO loop-closure subgraph). Closes the run↔data-product loop visibly, using the same live-viz embed the chord uses. See atlas_age_provenance_graph, lineup_kb_projection.
- Signals gate panels. Per run, show the M1/M2/M3 criteria status (M1 H-Net isolation; M2 3-arm × α×β instrument validity; final gate: matches RWKV-7 on general non-degeneracy AND beats the no-ontology ablation on relational + DE-elucidation, CI-clean). A pass/fail strip turns the leaderboard into the gate dashboard, not just curves. See signals_programme.
- Ablation-arm comparison. The corpus carries full / no-ontology / no-schema arms: overlay their curves (HoloViews overlay / HoloMap by arm) for the same data to read the ontology’s load-bearing-ness directly, instead of eyeballing separate runs.
- Eval-instrument-aware panels (per ontology-cpa-eval-methodology). The instrument is the binding constraint, so plot sample-efficiency curves (perf vs #examples) not full-data points, PR metrics not ROC-AUC, with bootstrap/permutation CIs, plus control-task / held-out-type / MDL-probing panels. This is where “W&B-like” stops being enough.
- DE-elucidation / CPA progress (the north-star). Track the model’s data-element-elucidation (CPA) over runs and tie it to the lineup Schema-chord densification (already framed as the term↔table many-to-many progress metric). The leaderboard becomes where “is the model learning the relational structure we built?” is answered.
- Datashader at scale. Per-step loss/grad traces over millions of steps, and a coverage heatmap of the relational footprint (which tables/views a run’s training data exercised) → server-side Datashader rasterization. Aligns with the tier-one compute posture (compute_posture_tier_one); the live stack already supports it.
- Interactive Panel widgets (now unblocked by the live server; the static path couldn’t). Run grouping/tags, metric pickers, cross-run config diff, run notes, as Panel widgets / Tabulator.
- Corpus-quality panels beside model metrics. Surface the convergence proxies (coverage-close R1, topic-recovery, family-complex) next to model metrics, since the loop is one coupled system: one observatory for both products.
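The ablation-arm and eval-instrument ideas above both come down to reporting a between-arm difference with an interval rather than a point. A minimal sketch of a percentile bootstrap for that difference follows; it assumes per-item scores (for example per-table F1) are available for each arm, and the function name `bootstrap_diff_ci` and the sample scores are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def bootstrap_diff_ci(a: Sequence[float], b: Sequence[float],
                      n_boot: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float, float]:
    """Percentile-bootstrap CI for mean(a) - mean(b).

    a, b: per-item scores for two arms (assumed independent samples).
    Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)  # seeded so the panel is reproducible
    diffs: List[float] = []
    for _ in range(n_boot):
        # Resample each arm with replacement, keeping its original size.
        ra = [rng.choice(a) for _ in a]
        rb = [rng.choice(b) for _ in b]
        diffs.append(mean(ra) - mean(rb))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return mean(a) - mean(b), lo, hi


# Made-up per-item scores for two arms, for illustration only.
full_arm = [0.80, 0.90, 0.85, 0.80, 0.90, 0.75, 0.85, 0.90]
no_ontology_arm = [0.50, 0.60, 0.55, 0.60, 0.50, 0.65, 0.55, 0.60]
point, ci_low, ci_high = bootstrap_diff_ci(full_arm, no_ontology_arm)
```

An interval that excludes zero is one plausible reading of the “CI-clean” gate; whether the programme uses a bootstrap, a permutation test, or both is not specified here.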
Dependencies / sequencing
- Most of these need runs to carry data-product provenance in metadata.json (corpus/ontology/coverage hashes + arm), a small RunArtifacts.start addition; do that first (cheap, unblocks #1–#5).
- #2/#5 depend on the Signals gates + the discriminating relational eval instrument (M2/M3 tasks #42/#43).
- #6 needs datashader/dask added (deliberate uv pip install; the live viz layer already fits).
- Keep the live HoloViews/PanelView path: these are all panes on the same bokeh server, embedded the same way; no new viz transport needed.
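As a rough illustration of the first dependency, the provenance fields could be merged into a run's metadata.json along these lines. This is a standalone sketch, not the RunArtifacts.start code: the helper name `write_provenance`, the `data_products` key, and its field names are assumptions, and the `<hash>` and `<coverage-run>` values are placeholders left unfilled on purpose.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict


def write_provenance(run_dir: Path, corpus: str, ontology: str,
                     coverage: str, arm: str) -> Dict:
    """Merge a hypothetical 'data_products' block into the run's metadata.json."""
    path = run_dir / "metadata.json"
    # Keep whatever the run already recorded; only add the provenance block.
    meta = json.loads(path.read_text()) if path.exists() else {}
    meta["data_products"] = {
        "corpus": corpus,      # corpus snapshot identifier
        "ontology": ontology,  # ontology catalog commit
        "coverage": coverage,  # coverage run identifier
        "arm": arm,            # full / no-ontology / no-schema
    }
    path.write_text(json.dumps(meta, indent=2, sort_keys=True))
    return meta


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp)
    # Simulate metadata that existed before provenance was added.
    (run_dir / "metadata.json").write_text(json.dumps({"name": "demo-run"}))
    write_provenance(run_dir, corpus="sdg_corpus_v0_3/<hash>", ontology="ae7dbee",
                     coverage="<coverage-run>", arm="full")
    loaded = json.loads((run_dir / "metadata.json").read_text())
```

Recording the arm next to the hashes is what would let the lineage, ablation, and gate panels group runs without parsing run names.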
RL training (GRPO/RLVR) — the observatory is the reward instrument
P5 already runs GRPO/RLVR (src/aegir/rl/): the policy generates ontology compositions, the
deterministic verifier composite R is the reward (parallel_verify → verifier), z-score group
advantages, no critic. GRPO is the right fit and PPO is not the question: a value network is pure
overhead (memory + a second model to tune) when the reward is a cheap, parallelized, deterministic
grader — PPO earns its keep only with learned/noisy reward models or dense per-token credit, neither
of which applies here. The live questions are not PPO; they are (a) reward-variance collapse (if R
saturates or the structural gate R_A zeroes a whole group, advantages vanish → “0 reward variance
forever”) and (b) GRPO refinements (length-bias debias, advantage_normalization choice — already
parameterized). Both are observability + reward-design problems → this observatory.
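The collapse failure mode in (a) follows directly from how critic-free group advantages are computed. The sketch below shows the two normalization choices named above; it is not the src/aegir/rl/ implementation, and the function name, the use of population standard deviation, and the `eps` guard are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float],
                     normalization: str = "z_score",
                     eps: float = 1e-8) -> List[float]:
    """Critic-free advantages for one group of rollouts from the same prompt."""
    mu = mean(rewards)
    centered = [r - mu for r in rewards]
    if normalization == "centered":
        return centered
    if normalization == "z_score":
        sigma = pstdev(rewards)  # population std over the group (assumed)
        # eps keeps the division defined when every reward in the group is equal.
        return [c / (sigma + eps) for c in centered]
    raise ValueError(f"unknown normalization: {normalization!r}")


healthy = group_advantages([0.2, 0.8, 0.5, 0.5])
# A group the structural gate zeroed out: every rollout scores 0.
gated_out = group_advantages([0.0, 0.0, 0.0, 0.0])
# A saturated group: every rollout scores the maximum.
saturated = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Both degenerate groups produce all-zero advantages, so they contribute no gradient signal regardless of which normalization is chosen. That is why the reward-variance band is a useful early warning.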
The earlier ideas are not left behind — RL makes them central, and most are nearly free because
GRPOMetrics already logs them to a metrics_jsonl:
- Reward dynamics: the rewards_mean ± rewards_std band + min/max is the headline RL panel, and the reward-variance band IS the GRPO health monitor (the collapse canary). advantage_mean/std = signal strength. Already logged → a panel reading the GRPO metrics_jsonl (like runs_app).
- Reward-component decomposition: R_A·(0.50·R_B + 0.05·R_C + 0.45·R_D) over training (a stacked / small-multiples / PCP view). Small add: have parallel_verify log the sub-scores, not just R. This is idea #4 (eval-instrument-aware) for RL.
- Verifier-pass-rate gates: R_A structural-gate %, HermiT-consistency %, coverage-close % = idea #2 (Signals gate panels), as the RL pass-rate dashboard.
- SAE feature stream: already live (/api/p5/sae/stream); an RL-interpretability panel, datashade-d over GRPO steps (idea #6 at scale).
- Sweeps PCP (built) → the GRPO hyperparameter tuner: group_size · kl_coefficient · advantage_normalization · lr × outcomes (final reward, pass-rate). This is the canonical RL-sweep use.
- Unify the two run surfaces: /api/runs (supervised CTA/CPA) + /api/p5/runs (GRPO/RLVR) both flow into the lineup observatory (a Training ▸ Reward entry beside Sweeps). They are separate today.
- GRPO-vs-PPO, if ever litigated, is an ablation-arm comparison (idea #3), but the reward shape says GRPO; spend the cycles on reward granularity + curriculum, watched via the variance band.
- Empirical guardrails: a low-cost GRPO sweep (drive it with the Sweeps PCP) over group_size (4–16 is the stable-baseline range), advantage_normalization (z_score vs centered), kl_coefficient scaling, plus a simple length-normalization baseline arm. Settles the refinement questions for the cost of a handful of short runs, which is exactly what the PCP is for.
- Downstream coupling metrics: surface one or two proxy-downstream signals beside reward / pass-rate: post-verbalization RWKV byte-per-byte loss on DDL / SchemaPile slices, or CTA/CPA F1 lift on held-out tables. A single panel tying reward ↑ to downstream-loss ↓ is the strongest “the proxy is real” evidence the observatory can show: it validates the convergence loop’s load-bearing assumption (higher-R ontologies → better synthetic corpus → better model) rather than trusting R on faith. This is the open calibration gap in aegir-convergence-loop (cf. E1); the observatory is where it gets watched continuously.
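The reward-component decomposition in the list above needs the verifier to emit sub-scores alongside R. A minimal sketch of that logging shape follows, using the composite formula as written. The function names, the JSONL field names, and the assumption that every score lies in [0, 1] are illustrative; this is not the parallel_verify code.

```python
import json
from typing import Dict

# Weights taken from the composite R_A·(0.50·R_B + 0.05·R_C + 0.45·R_D).
WEIGHTS = {"R_B": 0.50, "R_C": 0.05, "R_D": 0.45}


def composite_reward(r_a: float, r_b: float, r_c: float, r_d: float) -> Dict[str, float]:
    """Return the sub-scores together with R, so a panel can plot each component."""
    weighted = WEIGHTS["R_B"] * r_b + WEIGHTS["R_C"] * r_c + WEIGHTS["R_D"] * r_d
    # R_A multiplies the whole sum: a failed structural gate zeroes the reward.
    return {"R_A": r_a, "R_B": r_b, "R_C": r_c, "R_D": r_d, "R": r_a * weighted}


def to_jsonl_line(step: int, scores: Dict[str, float]) -> str:
    """One metrics line per rollout; the 'step' field name is an assumption."""
    return json.dumps({"step": step, **scores}, sort_keys=True)


passed = composite_reward(1.0, 0.8, 0.5, 0.6)
gated = composite_reward(0.0, 0.8, 0.5, 0.6)
line = to_jsonl_line(3, passed)
```

With the components in the log, a stacked or small-multiples view can show which term is moving R. It can also show when R_A is the only thing changing, which is the gate-driven collapse case.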
Real-world use: a verifiable-reward RL observatory (reward dynamics + verifier-pass-rate gates + data-product lineage + reproducible provenance, with the catalog hot-reload closing the loop — edit the ontology, the reward changes on the next rollout) is genuinely product-grade RLVR experiment management, differentiated from W&B by the verifiable-reward + ontology-grounded-lineage semantics.