Embeddings
The Embeddings page provides interactive visualization of classification results. It renders 2D projections of embedding vectors, allowing users to explore clusters, search data points, and cross-filter by metadata columns.
Architecture
The viewer runs entirely in the browser. DuckDB WASM loads the parquet data locally, and the EmbeddingAtlas component (from Apple’s embedding-atlas library) renders the visualization using WebGPU, falling back to WebGL 2 where WebGPU is unavailable.
Data Flow
- Backend serves the parquet file via /api/datasets/{id}/data
- React fetches the parquet and loads it into DuckDB WASM via a Mosaic coordinator
- EmbeddingAtlas queries the DuckDB table for rendering: x/y coordinates, categories, text for tooltips
- All filtering, search, and aggregation happen client-side, with no round-trips to the server
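To make the last two steps concrete, here is a minimal sketch of the kinds of queries a viewer could run against a local table. It is not the viewer's code: Python's built-in sqlite3 stands in for DuckDB WASM so the example runs anywhere, and the table name `points`, the sample rows, and the helper functions are all hypothetical.

```python
import sqlite3

# Hypothetical sample rows using the column names from the parquet schema.
ROWS = [
    ("r1", 0.10, 0.20, "customer email address", "email"),
    ("r2", 0.15, 0.25, "contact email", "email"),
    ("r3", 0.90, 0.80, "postal code", "postcode"),
    ("r4", 0.85, 0.75, "zip code of the store", "postcode"),
    ("r5", 0.50, 0.55, "email of the author", "email"),
]


def load_table(rows):
    """Load rows into an in-memory table (stand-in for the local DuckDB table)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE points "
        "(id TEXT PRIMARY KEY, x REAL, y REAL, text TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO points VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def points_for_render(conn):
    """Coordinates and category for every point, as a renderer would need."""
    return conn.execute(
        "SELECT id, x, y, category FROM points ORDER BY id"
    ).fetchall()


def search(conn, term):
    """Ids of points whose tooltip text contains the search term."""
    cursor = conn.execute(
        "SELECT id FROM points WHERE text LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in cursor]


def category_counts(conn, x_min, x_max):
    """Per-category counts restricted to an x-range, like a cross-filter chart."""
    cursor = conn.execute(
        "SELECT category, COUNT(*) FROM points "
        "WHERE x BETWEEN ? AND ? GROUP BY category ORDER BY category",
        (x_min, x_max),
    )
    return dict(cursor.fetchall())


conn = load_table(ROWS)
print(search(conn, "email"))
print(category_counts(conn, 0.0, 0.6))
```

Every function here reads only from the local table, which is the property the data flow relies on: once the parquet is loaded, narrowing a selection or recomputing a chart is a local query rather than a server request.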
Parquet Schema
The Embeddings page expects parquet files with these columns:
| Column | Type | Required | Description |
|---|---|---|---|
| id | string | yes | Unique row identifier |
| x | float32 | yes | 2D projection x-coordinate (UMAP) |
| y | float32 | yes | 2D projection y-coordinate (UMAP) |
| text | string | recommended | Tooltip and search text |
| category | string | recommended | Color-coding category |
Additional columns (e.g., source_table, belief, plausibility) are automatically available as cross-filter charts.
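The schema rules above can be expressed as a small pre-flight check. The sketch below is illustrative only: `check_schema` and its return shape are hypothetical, and it assumes you have already obtained a mapping of column names to the type names used in the table (how to read that from a real parquet file is left out).

```python
# Required and recommended columns, taken from the schema table.
REQUIRED = {"id": "string", "x": "float32", "y": "float32"}
RECOMMENDED = {"text": "string", "category": "string"}


def check_schema(columns):
    """Check a {column name: type name} mapping against the expected schema.

    Returns errors (required columns missing or mistyped), warnings
    (recommended columns missing or mistyped), and the remaining columns,
    which would be offered as cross-filter charts.
    """
    errors = []
    for name, expected in REQUIRED.items():
        actual = columns.get(name)
        if actual is None:
            errors.append(f"missing required column: {name}")
        elif actual != expected:
            errors.append(f"column {name}: expected {expected}, got {actual}")

    warnings = []
    for name, expected in RECOMMENDED.items():
        actual = columns.get(name)
        if actual is None:
            warnings.append(f"missing recommended column: {name}")
        elif actual != expected:
            warnings.append(f"column {name}: expected {expected}, got {actual}")

    known = set(REQUIRED) | set(RECOMMENDED)
    extras = [name for name in columns if name not in known]
    return {"errors": errors, "warnings": warnings, "cross_filter_columns": extras}


report = check_schema({
    "id": "string", "x": "float32", "y": "float32",
    "text": "string", "category": "string",
    "belief": "float32", "plausibility": "float32",
})
print(report)
```

Treating missing recommended columns as warnings rather than errors mirrors the table: the page can still render points without text or category, but tooltips, search, and color-coding would have nothing to work with.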
GitTables Dataset
The initial dataset is derived from the GitTables CTA (column type annotation) benchmark: 2,517 columns extracted from real tables, annotated with 122 DBpedia property types. These instance labels serve as the controlled vocabulary to be grounded in the SIGDG ontology.
To prepare the visualization parquet:
# From signals evaluation output (recommended)
just prepare-gittables ~/local/src/cldr/signals/build/gittables_eval.parquet
# Then seed the database
just seed
The preparation script computes sentence-transformer embeddings and UMAP 2D projections. When the input is the signals evaluation output, the resulting parquet also includes the Dempster-Shafer (DST) evidence fusion columns: belief, plausibility, and uncertainty gap.
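For readers unfamiliar with the DST columns, the following sketch shows how belief and plausibility are defined in Dempster-Shafer theory for a single mass assignment. It is not the signals pipeline's code. The function names and the example masses are hypothetical, and defining the uncertainty gap as plausibility minus belief is an assumption, since this page does not define that column.

```python
def belief(masses, hypothesis):
    """Belief: total mass on non-empty focal sets contained in the hypothesis."""
    return sum(m for focal, m in masses.items() if focal and focal <= hypothesis)


def plausibility(masses, hypothesis):
    """Plausibility: total mass on focal sets that overlap the hypothesis."""
    return sum(m for focal, m in masses.items() if focal & hypothesis)


def uncertainty_gap(masses, hypothesis):
    """Assumed definition: width of the [belief, plausibility] interval."""
    return plausibility(masses, hypothesis) - belief(masses, hypothesis)


# Hypothetical mass assignment for one column over three candidate types.
# Keys are focal sets (frozensets of type labels); the values sum to 1.
masses = {
    frozenset({"email"}): 0.6,
    frozenset({"email", "url"}): 0.3,
    frozenset({"email", "url", "name"}): 0.1,
}

hypothesis = frozenset({"email"})
print(belief(masses, hypothesis))
print(plausibility(masses, hypothesis))
print(uncertainty_gap(masses, hypothesis))
```

In this example only the first focal set lies entirely inside the hypothesis, so belief counts just that mass, while all three focal sets overlap it and contribute to plausibility. A wide gap between the two indicates that much of the evidence is compatible with the type without committing to it.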
Naming: Embeddings vs Apache Atlas
The Embeddings page is powered by Apple’s embedding-atlas library. This is unrelated to Apache Atlas, the Cloudera metadata governance catalog used by the signals pipeline.
- Embeddings (Atelier) — Interactive scatter plot of classification embeddings
- Apache Atlas (Cloudera/signals) — Metadata governance catalog on port 21000
To avoid confusion, all user-facing surfaces use “Embeddings”. The embedding-atlas library name appears only in developer documentation and package.json.