Data Sources & Versioning

Atelier organizes classification work around data sources — each source contains input tables, and every pipeline run against a source produces a new dataset version. This replaces the earlier flat dataset model and enables the out-of-the-box (OOTB) onboarding experience.

Data Model

DataSource (1)                      Dataset versions (N)
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "ootb-sample"       │───1:N──│ v3 (active) — 2 min ago  │
│ type: "sample"          │        │ v2 — yesterday           │
│ display: "Sample"       │        │ v1 — built-in            │
│ vocab_mode: "universal" │        └──────────────────────────┘
└─────────────────────────┘
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "hive-prod-default" │───1:N──│ v1 (active) — 1 hour ago │
│ type: "hive"            │        └──────────────────────────┘
│ display: "hive:prod/…"  │
│ vocab_mode: "hive"      │
└─────────────────────────┘

Source Types

| Type | Tables loaded from | Vocabulary | Created by |
|---|---|---|---|
| sample | data/sample/tables/*.csv | Expanded ICE ontology (316 leaves) | Auto-seeded on first boot |
| hive | CAI data connection | Domain annotations from vocab_uri | User creates via Status page |
| synth | data/synth/tables/*.csv | Domain annotations from vocab_uri | Generated by scripts/generate_synth_source.py |

Vocabulary routing: For in-situ classification, the customer’s domain vocabulary IS the classification target — the LLM reads labels and descriptions and classifies into the domain’s hierarchical dot-codes. The annotations table location is configured per source via vocab_uri (e.g. meta.vocab, meta.annotations), decoupling data tables from the vocabulary. Multiple sources can share the same annotations table.

Future work: A portable pre-trained model (classify-ICE-then-map) would classify against the built-in ICE vocabulary and translate results to customer terms via VocabMapping. This requires dedicated training hardware and is not yet implemented.

Database Schema

CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,          -- 'sample' | 'hive' | 'synth'
    source_uri TEXT NOT NULL DEFAULT '',
    display_name TEXT NOT NULL,
    vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
    vocab_uri TEXT NOT NULL DEFAULT '',  -- e.g. 'meta.vocab', 'meta.annotations'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT                       -- JSON: table_count, column_count
);

-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;
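The versioning columns above imply an invariant: at most one version per source is active at a time. A minimal in-memory sketch of how activation might flip `is_active` (using a trimmed-down schema; `activate_version` is a hypothetical helper, not Atelier's actual DAO method):

```python
import sqlite3

# Trimmed-down copies of the tables from the migration above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_sources (id TEXT PRIMARY KEY, source_type TEXT NOT NULL);
CREATE TABLE datasets (
    id TEXT PRIMARY KEY,
    source_id TEXT REFERENCES data_sources(id),
    version_number INTEGER NOT NULL DEFAULT 1,
    is_active BOOLEAN NOT NULL DEFAULT TRUE
);
""")
conn.execute("INSERT INTO data_sources VALUES ('ootb-sample', 'sample')")
for n in (1, 2, 3):
    conn.execute(
        "INSERT INTO datasets VALUES (?, 'ootb-sample', ?, FALSE)",
        (f"ds-v{n}", n),
    )

def activate_version(conn, source_id, version_number):
    """Flip is_active so only the chosen version of this source is active."""
    with conn:  # one transaction: deactivate all, then activate one
        conn.execute(
            "UPDATE datasets SET is_active = FALSE WHERE source_id = ?",
            (source_id,),
        )
        conn.execute(
            "UPDATE datasets SET is_active = TRUE "
            "WHERE source_id = ? AND version_number = ?",
            (source_id, version_number),
        )

activate_version(conn, "ootb-sample", 3)
active = conn.execute(
    "SELECT version_number FROM datasets WHERE is_active"
).fetchall()
print(active)  # [(3,)]
```

Doing both updates in one transaction keeps the "exactly one active version" invariant even if two activations race.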

Vocabulary Routing

When a pipeline run starts, the source_id determines which vocabulary loads:

  • ootb-sample: load_sample_vocabulary() loads data/sample/ontology.json (316 BFO-grounded leaves across the CCO ICE trichotomy)
  • hive/synth: Domain annotations loaded directly from the table specified by vocab_uri. The domain vocabulary IS the classification target — no composition with the universal base. Hive sources always require an annotations table.
  • No source: Falls back to universal vocabulary (16 PII leaves)
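The three routing rules above can be sketched as a single dispatch function. This is a hedged illustration, not Atelier's actual pipeline code: the return tuples and the `"h1"` example source are invented, and only `load_sample_vocabulary`'s role mirrors the text.

```python
# Sketch of the source-to-vocabulary routing described above.
def resolve_vocabulary(source):
    """Pick the vocabulary for a pipeline run based on the source record."""
    if source is None:
        return ("universal", None)  # fallback: 16 PII leaves
    if source["id"] == "ootb-sample":
        return ("sample-ontology", "data/sample/ontology.json")
    if source["source_type"] in ("hive", "synth"):
        vocab_uri = source.get("vocab_uri") or ""
        # Hive sources always require an annotations table.
        if not vocab_uri and source["source_type"] == "hive":
            raise ValueError("hive sources require an annotations table")
        return ("domain-annotations", vocab_uri)
    return ("universal", None)

print(resolve_vocabulary({"id": "ootb-sample", "source_type": "sample"}))
# → ('sample-ontology', 'data/sample/ontology.json')
print(resolve_vocabulary({"id": "h1", "source_type": "hive",
                          "vocab_uri": "meta.vocab"}))
# → ('domain-annotations', 'meta.vocab')
print(resolve_vocabulary(None))
# → ('universal', None)
```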

LLM Robustness

The LLM classification batch uses adaptive sizing to avoid context truncation. With large vocabularies (>200 categories), the system prompt embedding the full category table can consume significant context.

  • Adaptive batch sizing: _estimate_safe_batch_size() reduces columns_per_call for large vocabularies (e.g. 290 categories → 41 columns per call)
  • Truncation retry: When LLMResponse.truncated is detected, the batch is halved and retried recursively until all columns are classified
  • Metrics: truncation_count and effective_batch_size tracked in BootstrapState and exposed via the agent’s check_convergence tool
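The two mechanisms can be sketched as follows. This is an illustration of the shape of the logic only: the token-budget constants are invented, `fake_llm` stands in for the real LLM call, and for simplicity a single-column batch is assumed never to truncate.

```python
def estimate_safe_batch_size(n_categories, default=64, token_budget=12000,
                             tokens_per_category=40, tokens_per_column=10):
    """Shrink columns-per-call as the category table consumes more context.
    All constants here are illustrative, not Atelier's actual values."""
    if n_categories <= 200:
        return default
    remaining = max(token_budget - n_categories * tokens_per_category, 0)
    return max(remaining // tokens_per_column, 1)

def classify_with_retry(columns, batch_size, call_llm):
    """Halve the batch on truncation until every column is classified."""
    results = {}
    i = 0
    while i < len(columns):
        batch = columns[i : i + batch_size]
        resp = call_llm(batch)
        if resp["truncated"] and len(batch) > 1:
            # Retry the same batch recursively at half size.
            results.update(
                classify_with_retry(batch, max(len(batch) // 2, 1), call_llm)
            )
        else:
            results.update(resp["labels"])
        i += len(batch)
    return results

def fake_llm(batch):
    """Stand-in LLM: truncates whenever the batch exceeds 8 columns."""
    if len(batch) > 8:
        return {"truncated": True, "labels": {}}
    return {"truncated": False, "labels": {c: "ICE" for c in batch}}

cols = [f"col_{n}" for n in range(30)]
out = classify_with_retry(cols, 20, fake_llm)
print(len(out))  # 30 — every column classified despite truncations
```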

Sample Source

The built-in “Sample” source (source_id ootb-sample) ships with Atelier so new deployments show meaningful data immediately. When the landing page loads and “Connected” turns green, the stats cards show 316 Terms and 316 Entities. The ootb- prefix in the id is an internal marker distinguishing shipped sources from user-registered connections — it is not shown in the UI.

Expanded Vocabulary (ICE.* Ontology)

The vocabulary follows the CCO ICE (Information Content Entity) trichotomy, grounded in BFO via atelier-vocab.ttl:

ICE (root) ≡ cco:InformationContentEntity
├── ICE.NONSENSITIVE
│   ├── ICE.NONSENSITIVE.DESIGNATIVE   ⊑ cco:DesignativeICE
│   │   ├── .NAME (.PERSON, .ORG, .PRODUCT, .SCIENTIFIC)
│   │   ├── .CODE (.ID, .ABBREV, .POSTAL)
│   │   ├── .GEO  (.COUNTRY, .REGION, .CITY, .LOCATION)
│   │   ├── .REF  (.CITATION, .VERSION, .SOURCE)
│   │   └── .TITLE
│   ├── ICE.NONSENSITIVE.DESCRIPTIVE   ⊑ cco:DescriptiveICE
│   │   ├── .TEXT (.DESCRIPTION, .COMMENT, .ABSTRACT, .DEFINITION)
│   │   ├── .CATEGORICAL (.TYPE, .CATEGORY, .RANK, .LANGUAGE)
│   │   ├── .MEASUREMENT (~20 subtypes)
│   │   └── .TEMPORAL (.DATE, .YEAR, .DURATION, .PERIOD, …)
│   └── ICE.NONSENSITIVE.PRESCRIPTIVE  ⊑ cco:PrescriptiveICE
│       └── .FORMAT, .FORMULA, .ROUTE, .ROLE
├── ICE.SENSITIVE
│   ├── ICE.SENSITIVE.PID (~40 leaves: CONTACT, IDENTITY, FINANCIAL, HEALTH)
│   ├── ICE.SENSITIVE.TECHNICAL (IPADDR, DEVID, URL, HOSTNAME, …)
│   └── ICE.SENSITIVE.BUSINESS (.TRADE_SECRET, .CONTRACT_VALUE, …)
└── ICE.METADATA
    └── .TIMESTAMP, .RECID, .STATUS, .VERSION, .CREATED_BY, …

351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.

Design principle: every category is our own BFO-grounded term. External sources (GitTables, meta-tagging) inform which conceptual space to cover; we never import their raw tags. The mapping goes outward from our vocabulary via atelier-vocab.ttl, not inward.

Sample Tables

25 mixed-domain tables with 316 columns (100 rows each). Tables are deliberately cross-domain — a customers table contains identity, contact, metadata, and categorical columns — so the classification pipeline cannot rely on table name alone.

~25% of columns use opaque names (field_42, var_abc, col_xyz) to exercise the pipeline’s ability to classify from values and context rather than column name heuristics.
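A toy version of the opaque-naming trick might look like the following. The column pool, seed, and row values are invented here; the real generator is scripts/generate_sample_source.py.

```python
import csv
import io
import random

random.seed(7)
named = ["customer_id", "full_name", "email", "signup_date",
         "status", "country", "plan_type", "last_login"]

# Replace roughly a quarter of the names with opaque placeholders so the
# pipeline must classify from values and context, not name heuristics.
columns = [
    c if random.random() > 0.25 else f"field_{i}"
    for i, c in enumerate(named)
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(columns)
for row in range(3):  # the real tables carry 100 rows each
    writer.writerow([f"v{row}_{i}" for i in range(len(columns))])

print(buf.getvalue().splitlines()[0])
```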

Generated by scripts/generate_sample_source.py. The curated reference for the Sample source fixture is committed in data/sample/reference_labels.json (scope: fixture-only, for OOTB demo and unit tests).

For UAT / production evaluation, the curated reference lives at build/meta-tagging-clean/curated_reference.csv (gitignored). It is built by scripts/parity/build_curated_reference.py from direct reference-column evidence plus a name-index lookup, with Ontology > Annotation > Common Names priority. UAT’s own classification outputs are provisional predictions; they are scored against this curated reference in build/results/parity/delta_report.md.
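The Ontology > Annotation > Common Names priority can be sketched as a fixed-order resolver. The source names and example labels below are invented for illustration; only the ordering rule comes from the text.

```python
# Highest-priority evidence source first, per the rule described above.
PRIORITY = ["ontology", "annotation", "common_names"]

def resolve_label(candidates):
    """candidates: {evidence_source: label}. Return the label from the
    highest-priority source that produced one, plus which source won."""
    for source in PRIORITY:
        if source in candidates:
            return candidates[source], source
    return None, None

print(resolve_label({"common_names": "NAME",
                     "ontology": "ICE.SENSITIVE.PID"}))
# → ('ICE.SENSITIVE.PID', 'ontology')
```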

Auto-Import on First Boot

The gateway seeds the Sample source (id ootb-sample) via a FastAPI lifespan context manager:

  1. Check if ootb-sample source has any dataset versions
  2. If none, read sample_source_stats() (table count, column count)
  3. Create dataset version 1 with the stats as metadata
  4. Update source metadata JSON

This runs once at startup. If the database isn’t ready (migrations haven’t run), seeding is silently skipped.
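The seeding flow might be shaped like the lifespan below (you would pass it to `FastAPI(lifespan=...)`). The in-memory `DB` dict and helper bodies are stand-ins for Atelier's real database layer; the table/column counts come from the figures in this document.

```python
import asyncio
from contextlib import asynccontextmanager

# Stand-in for the real database; empty versions list = fresh deployment.
DB = {"sources": {"ootb-sample": {"metadata": None}}, "versions": {}}

def sample_source_stats():
    return {"table_count": 25, "column_count": 316}

@asynccontextmanager
async def lifespan(app):
    try:
        versions = DB["versions"].get("ootb-sample", [])
        if not versions:                       # 1. no dataset versions yet?
            stats = sample_source_stats()      # 2. read table/column counts
            DB["versions"]["ootb-sample"] = [  # 3. create version 1
                {"version_number": 1, "is_active": True, "metadata": stats}
            ]
            DB["sources"]["ootb-sample"]["metadata"] = stats  # 4. update source
    except Exception:
        pass  # DB not ready (migrations haven't run): skip seeding silently
    yield     # the app serves requests here; nothing to do on shutdown

async def boot():
    async with lifespan(app=None):
        return DB["versions"]["ootb-sample"][0]["version_number"]

print(asyncio.run(boot()))  # 1
```

Because the seed is guarded by the "any versions?" check, restarting the gateway never creates a second version.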

API

REST Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List all data sources |
| /api/data-sources | POST | Create a new data source |
| /api/datasets?source_id=X | GET | List versions for a source |
| /api/datasets/{id}/activate | POST | Set a version as active |
| /api/vocabulary/stats?source_id=X | GET | Term count (source-aware) |
| /api/fsm/start?source_id=X | POST | Start pipeline for a source |

gRPC RPCs

| RPC | Description |
|---|---|
| ListDataSources() | List all sources |
| StartClassification(source_id=…) | Start pipeline for a source |

UI Integration

The Status page has two new cards:

  • Data Source card: dropdown selector for sources + version table showing version number, column count, timestamp, and summary. Click a row to activate that version.
  • Classification Pipeline card: “Start Classification” passes activeSourceId to /api/fsm/start?source_id=…

The Landing page stats cards reflect the active source:

  • Terms: vocabulary size for the active source (316 for the Sample source)
  • Entities: column count from the active dataset version
  • Sources badge: shows count when multiple sources exist

DatasetContext

interface DatasetContextValue {
  sources: DataSourceInfo[];
  activeSourceId: string | null;
  setActiveSourceId: (id: string) => void;
  datasets: DatasetInfo[];           // for activeSourceId
  activeDatasetId: string | null;
  setActiveDatasetId: (id: string) => void;
  refreshSources: () => Promise<void>;
  refreshDatasets: () => Promise<void>;
}

Key Files

| File | Role |
|---|---|
| db/migrations/20260414…_data_sources_and_versions.sql | Schema migration |
| src/atelier/db/model.py | DataSource ORM model |
| src/atelier/db/dao.py | Source + version DAO methods |
| src/atelier/classify/sampler.py | load_sample_source(), sample_source_stats() |
| src/atelier/classify/taxonomy.py | load_sample_vocabulary() |
| src/atelier/classify/pipeline.py | Source-aware routing |
| src/atelier/gateway.py | REST endpoints + auto-import lifespan |
| data/sample/ontology.json | Expanded vocabulary (316 leaves) |
| data/sample/tables/*.csv | 25 sample tables |
| data/sample/reference_labels.json | 316-entry Sample-source fixture reference labels |
| build/meta-tagging-clean/curated_reference.csv (gitignored) | UAT-corpus curated reference |
| scripts/expand_vocabulary.py | Vocabulary expansion script |
| scripts/generate_sample_source.py | Sample table generation script |
| ui/src/contexts/DatasetContext.tsx | Source-aware React context |
| ui/src/pages/Status.tsx | Data source + version UI |