Embeddings

The Embeddings page provides interactive visualization of classification results. It renders 2D projections of embedding vectors, allowing users to explore clusters, search data points, and cross-filter by metadata columns.

Architecture

Diagram: viewer architecture

  • Browser: React App; DuckDB WASM (in-browser SQL engine, loads parquet directly); EmbeddingAtlas Component (WebGPU scatter plot, density contours, search + filters)
  • FastAPI Gateway: /api/datasets/{id}/data
  • Edge labels: GET parquet (browser to gateway), fetch parquet, SQL queries (EmbeddingAtlas to DuckDB WASM)

The viewer runs entirely in the browser. DuckDB WASM loads parquet data locally and the EmbeddingAtlas component (from Apple’s embedding-atlas library) renders the visualization using WebGPU with WebGL 2 fallback.

Data Flow

  1. Backend serves the parquet file via /api/datasets/{id}/data
  2. React fetches the parquet and loads it into DuckDB WASM via a Mosaic coordinator
  3. EmbeddingAtlas queries the DuckDB table for rendering: x/y coordinates, categories, text for tooltips
  4. All filtering, search, and aggregation happen client-side, with no round-trips to the server
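
The client-side query pattern in steps 3 and 4 can be illustrated with a small sketch. It uses Python's built-in sqlite3 as a stand-in for DuckDB WASM, which is an assumption made only so the example runs anywhere. The table name `points`, the sample rows, and the helper functions are hypothetical and do not come from the viewer's code.

```python
import sqlite3

# Hypothetical sample rows: (id, x, y, text, category).
ROWS = [
    ("a", 0.1, 0.2, "birth date of author", "birthDate"),
    ("b", 0.4, 0.1, "city name", "city"),
    ("c", 0.5, 0.9, "date of birth", "birthDate"),
]


def load_points(rows):
    """Load rows into an in-memory table (stand-in for the in-browser engine)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE points (id TEXT, x REAL, y REAL, text TEXT, category TEXT)"
    )
    conn.executemany("INSERT INTO points VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def search(conn, term):
    """Substring search over the text column, answered by a local query."""
    cur = conn.execute(
        "SELECT id FROM points WHERE text LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in cur.fetchall()]


def category_counts(conn):
    """Aggregate rows per category, as a cross-filter chart might."""
    cur = conn.execute(
        "SELECT category, COUNT(*) FROM points GROUP BY category ORDER BY category"
    )
    return dict(cur.fetchall())


conn = load_points(ROWS)
print(search(conn, "birth"))   # ['a', 'c']
print(category_counts(conn))   # {'birthDate': 2, 'city': 1}
```

Once the table is loaded, search and aggregation are ordinary SQL against local data, which is why no server round-trip is needed.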

Parquet Schema

The Embeddings page expects parquet files with these columns:

Column     Type      Required      Description
id         string    yes           Unique row identifier
x          float32   yes           2D projection x-coordinate (UMAP)
y          float32   yes           2D projection y-coordinate (UMAP)
text       string    recommended   Tooltip and search text
category   string    recommended   Color-coding category

Additional columns (e.g., source_table, belief, plausibility) are automatically available as cross-filter charts.
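
A schema check along these lines can catch problems before a file reaches the viewer. This is a minimal sketch, not the project's own validation. It assumes the column types have already been read into a plain name-to-type mapping, and it treats the listed types as strict, which the viewer may not actually require. The function name `check_schema` is hypothetical.

```python
# Column expectations taken from the schema table above.
REQUIRED = {"id": "string", "x": "float32", "y": "float32"}
RECOMMENDED = {"text": "string", "category": "string"}


def check_schema(columns):
    """Check a mapping of column name -> type name against the schema.

    Returns (errors, warnings, extras), where extras are the additional
    columns that would show up as cross-filter charts.
    """
    errors = []
    for name, expected in REQUIRED.items():
        if name not in columns:
            errors.append(f"missing required column: {name}")
        elif columns[name] != expected:
            # Strict type matching is an assumption of this sketch.
            errors.append(f"{name}: expected {expected}, got {columns[name]}")
    warnings = [
        f"missing recommended column: {name}"
        for name in RECOMMENDED
        if name not in columns
    ]
    extras = sorted(
        name for name in columns if name not in REQUIRED and name not in RECOMMENDED
    )
    return errors, warnings, extras


example = {
    "id": "string",
    "x": "float32",
    "y": "float64",
    "text": "string",
    "belief": "float32",
}
errors, warnings, extras = check_schema(example)
print(errors)    # ['y: expected float32, got float64']
print(warnings)  # ['missing recommended column: category']
print(extras)    # ['belief']
```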

GitTables Dataset

The initial dataset is derived from the GitTables CTA benchmark — 2,517 columns extracted from real tables, annotated with 122 DBpedia property types. These instance labels serve as the controlled vocabulary to be grounded in the SIGDG ontology.

To prepare the visualization parquet:

# From signals evaluation output (recommended)
just prepare-gittables ~/local/src/cldr/signals/build/gittables_eval.parquet

# Then seed the database
just seed

The preparation script computes sentence-transformer embeddings and their UMAP 2D projections. When the input is the signals evaluation output, the resulting parquet also includes the Dempster-Shafer theory (DST) evidence fusion columns: belief, plausibility, and uncertainty gap.
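
For readers unfamiliar with the DST columns, the sketch below shows the textbook Dempster-Shafer definitions of belief and plausibility for a single hypothesis. It is not the signals pipeline's code, and reading the "uncertainty gap" as plausibility minus belief is an assumption. The mass values and type names are made up for illustration.

```python
def belief(masses, hypothesis):
    """Sum the mass of every focal set fully contained in the hypothesis."""
    return sum(m for focal, m in masses.items() if focal <= hypothesis)


def plausibility(masses, hypothesis):
    """Sum the mass of every focal set that overlaps the hypothesis."""
    return sum(m for focal, m in masses.items() if focal & hypothesis)


# Hypothetical mass assignment over two candidate types for one column.
masses = {
    frozenset({"city"}): 0.6,
    frozenset({"country"}): 0.1,
    frozenset({"city", "country"}): 0.3,  # mass that cannot tell the two apart
}

hypothesis = frozenset({"city"})
bel = belief(masses, hypothesis)
pl = plausibility(masses, hypothesis)
gap = pl - bel  # assumed meaning of "uncertainty gap"

print(round(bel, 3), round(pl, 3), round(gap, 3))  # 0.6 0.9 0.3
```

Belief never exceeds plausibility, so the gap is non-negative. A wide gap means much of the evidence is compatible with the hypothesis without committing to it.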

Naming: Embeddings vs Apache Atlas

The Embeddings page is powered by Apple’s embedding-atlas library. This is unrelated to Apache Atlas, the Cloudera metadata governance catalog used by the signals pipeline.

  • Embeddings (Atelier) — Interactive scatter plot of classification embeddings
  • Apache Atlas (Cloudera/signals) — Metadata governance catalog on port 21000

To avoid confusion, all user-facing surfaces use “Embeddings”. The embedding-atlas library name appears only in developer documentation and package.json.