Data Sources & Versioning

Atelier organizes classification work around data sources — each source contains input tables, and every pipeline run against a source produces a new dataset version. This replaces the earlier flat dataset model and enables the out-of-the-box (OOTB) onboarding experience.

Data Model

DataSource (1)                      Dataset versions (N)
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "ootb-sample"       │───1:N──│ v3 (active) — 2 min ago  │
│ type: "sample"          │        │ v2 — yesterday           │
│ display: "Sample"       │        │ v1 — built-in            │
│ vocab_mode: "universal" │        └──────────────────────────┘
└─────────────────────────┘
┌─────────────────────────┐        ┌──────────────────────────┐
│ id: "hive-prod-default" │───1:N──│ v1 (active) — 1 hour ago │
│ type: "hive"            │        └──────────────────────────┘
│ display: "hive:prod/…"  │
│ vocab_mode: "hive"      │
└─────────────────────────┘

Source Types

| Type | Tables loaded from | Vocabulary | Created by |
|---|---|---|---|
| sample | data/sample/tables/*.csv | Expanded ICE ontology (316 leaves) | Auto-seeded on first boot |
| hive | CAI data connection | Domain annotations from vocab_uri | User creates via Status page |
| synth | data/synth/tables/*.csv | Domain annotations from vocab_uri | Generated by scripts/generate_synth_source.py |

Vocabulary routing: For in-situ classification, the customer’s domain vocabulary IS the classification target — the LLM reads labels and descriptions and classifies into the domain’s hierarchical dot-codes. The annotations table location is configured per source via vocab_uri (e.g. meta.vocab, meta.annotations), decoupling data tables from the vocabulary. Multiple sources can share the same annotations table.

Future work: A portable pre-trained model (classify-ICE-then-map) would classify against the built-in ICE vocabulary and translate results to customer terms via VocabMapping. This requires dedicated training hardware and is not yet implemented.

Database Schema

CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,          -- 'sample' | 'hive' | 'synth'
    source_uri TEXT NOT NULL DEFAULT '',
    display_name TEXT NOT NULL,
    vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
    vocab_uri TEXT NOT NULL DEFAULT '',  -- e.g. 'meta.vocab', 'meta.annotations'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT                       -- JSON: table_count, column_count
);

-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;
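The versioning columns above imply an invariant: at most one version per source is active at a time. A minimal in-memory sketch of how activation might flip `is_active` (using a trimmed-down schema; `activate_version` is a hypothetical helper, not Atelier's actual DAO method):

```python
import sqlite3

# Trimmed-down copies of the tables from the migration above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE data_sources (id TEXT PRIMARY KEY, source_type TEXT NOT NULL);
CREATE TABLE datasets (
    id TEXT PRIMARY KEY,
    source_id TEXT REFERENCES data_sources(id),
    version_number INTEGER NOT NULL DEFAULT 1,
    is_active BOOLEAN NOT NULL DEFAULT TRUE
);
""")
conn.execute("INSERT INTO data_sources VALUES ('ootb-sample', 'sample')")
for n in (1, 2, 3):
    conn.execute(
        "INSERT INTO datasets VALUES (?, 'ootb-sample', ?, FALSE)",
        (f"ds-v{n}", n),
    )

def activate_version(conn, source_id, version_number):
    """Flip is_active so only the chosen version of this source is active."""
    with conn:  # one transaction: deactivate all, then activate one
        conn.execute(
            "UPDATE datasets SET is_active = FALSE WHERE source_id = ?",
            (source_id,),
        )
        conn.execute(
            "UPDATE datasets SET is_active = TRUE "
            "WHERE source_id = ? AND version_number = ?",
            (source_id, version_number),
        )

activate_version(conn, "ootb-sample", 3)
active = conn.execute(
    "SELECT version_number FROM datasets WHERE is_active"
).fetchall()
print(active)  # [(3,)]
```

Doing both updates in one transaction keeps the "exactly one active version" invariant even if two activations race.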

Vocabulary Routing

When a pipeline run starts, the source_id determines which vocabulary loads:

  • ootb-sample: load_sample_vocabulary() loads data/sample/ontology.json (316 BFO-grounded leaves across the CCO ICE trichotomy)
  • hive/synth: Domain annotations loaded directly from the table specified by vocab_uri. The domain vocabulary IS the classification target — no composition with the universal base. Hive sources always require an annotations table.
  • No source: Falls back to universal vocabulary (16 PII leaves)
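The three routing rules above can be sketched as a single dispatch function. This is a hedged illustration, not Atelier's actual pipeline code: the return tuples and the `"h1"` example source are invented, and only `load_sample_vocabulary`'s role mirrors the text.

```python
# Sketch of the source-to-vocabulary routing described above.
def resolve_vocabulary(source):
    """Pick the vocabulary for a pipeline run based on the source record."""
    if source is None:
        return ("universal", None)  # fallback: 16 PII leaves
    if source["id"] == "ootb-sample":
        return ("sample-ontology", "data/sample/ontology.json")
    if source["source_type"] in ("hive", "synth"):
        vocab_uri = source.get("vocab_uri") or ""
        # Hive sources always require an annotations table.
        if not vocab_uri and source["source_type"] == "hive":
            raise ValueError("hive sources require an annotations table")
        return ("domain-annotations", vocab_uri)
    return ("universal", None)

print(resolve_vocabulary({"id": "ootb-sample", "source_type": "sample"}))
# → ('sample-ontology', 'data/sample/ontology.json')
print(resolve_vocabulary({"id": "h1", "source_type": "hive",
                          "vocab_uri": "meta.vocab"}))
# → ('domain-annotations', 'meta.vocab')
print(resolve_vocabulary(None))
# → ('universal', None)
```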

LLM Robustness

The LLM classification batch uses adaptive sizing to avoid context truncation. With large vocabularies (>200 categories), the system prompt embedding the full category table can consume significant context.

  • Adaptive batch sizing: _estimate_safe_batch_size() reduces columns_per_call for large vocabularies (e.g. 290 categories → 41 columns per call)
  • Truncation retry: When LLMResponse.truncated is detected, the batch is halved and retried recursively until all columns are classified
  • Metrics: truncation_count and effective_batch_size tracked in BootstrapState and exposed via the agent’s check_convergence tool
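The two mechanisms can be sketched as follows. This is an illustration of the shape of the logic only: the token-budget constants are invented, `fake_llm` stands in for the real LLM call, and for simplicity a single-column batch is assumed never to truncate.

```python
def estimate_safe_batch_size(n_categories, default=64, token_budget=12000,
                             tokens_per_category=40, tokens_per_column=10):
    """Shrink columns-per-call as the category table consumes more context.
    All constants here are illustrative, not Atelier's actual values."""
    if n_categories <= 200:
        return default
    remaining = max(token_budget - n_categories * tokens_per_category, 0)
    return max(remaining // tokens_per_column, 1)

def classify_with_retry(columns, batch_size, call_llm):
    """Halve the batch on truncation until every column is classified."""
    results = {}
    i = 0
    while i < len(columns):
        batch = columns[i : i + batch_size]
        resp = call_llm(batch)
        if resp["truncated"] and len(batch) > 1:
            # Retry the same batch recursively at half size.
            results.update(
                classify_with_retry(batch, max(len(batch) // 2, 1), call_llm)
            )
        else:
            results.update(resp["labels"])
        i += len(batch)
    return results

def fake_llm(batch):
    """Stand-in LLM: truncates whenever the batch exceeds 8 columns."""
    if len(batch) > 8:
        return {"truncated": True, "labels": {}}
    return {"truncated": False, "labels": {c: "ICE" for c in batch}}

cols = [f"col_{n}" for n in range(30)]
out = classify_with_retry(cols, 20, fake_llm)
print(len(out))  # 30 — every column classified despite truncations
```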

Sample Source

The built-in “Sample” source (source_id ootb-sample) ships with Atelier so new deployments show meaningful data immediately. When the landing page loads and “Connected” turns green, the stats cards show 316 Terms and 316 Entities. The ootb- prefix in the id is an internal marker distinguishing shipped sources from user-registered connections — it is not shown in the UI.

Expanded Vocabulary (ICE.* Ontology)

The vocabulary follows the CCO ICE (Information Content Entity) trichotomy, grounded in BFO via atelier-vocab.ttl:

ICE (root) ≡ cco:InformationContentEntity
├── ICE.NONSENSITIVE
│   ├── ICE.NONSENSITIVE.DESIGNATIVE   ⊑ cco:DesignativeICE
│   │   ├── .NAME (.PERSON, .ORG, .PRODUCT, .SCIENTIFIC)
│   │   ├── .CODE (.ID, .ABBREV, .POSTAL)
│   │   ├── .GEO  (.COUNTRY, .REGION, .CITY, .LOCATION)
│   │   ├── .REF  (.CITATION, .VERSION, .SOURCE)
│   │   └── .TITLE
│   ├── ICE.NONSENSITIVE.DESCRIPTIVE   ⊑ cco:DescriptiveICE
│   │   ├── .TEXT (.DESCRIPTION, .COMMENT, .ABSTRACT, .DEFINITION)
│   │   ├── .CATEGORICAL (.TYPE, .CATEGORY, .RANK, .LANGUAGE)
│   │   ├── .MEASUREMENT (~20 subtypes)
│   │   └── .TEMPORAL (.DATE, .YEAR, .DURATION, .PERIOD, …)
│   └── ICE.NONSENSITIVE.PRESCRIPTIVE  ⊑ cco:PrescriptiveICE
│       └── .FORMAT, .FORMULA, .ROUTE, .ROLE
├── ICE.SENSITIVE
│   ├── ICE.SENSITIVE.PID (~40 leaves: CONTACT, IDENTITY, FINANCIAL, HEALTH)
│   ├── ICE.SENSITIVE.TECHNICAL (IPADDR, DEVID, URL, HOSTNAME, …)
│   └── ICE.SENSITIVE.BUSINESS (.TRADE_SECRET, .CONTRACT_VALUE, …)
└── ICE.METADATA
    └── .TIMESTAMP, .RECID, .STATUS, .VERSION, .CREATED_BY, …

351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.

Design principle: every category is our own BFO-grounded term. External sources (GitTables, meta-tagging) inform which conceptual space to cover; we never import their raw tags. The mapping goes outward from our vocabulary via atelier-vocab.ttl, not inward.

Sample Tables

25 mixed-domain tables with 316 columns (100 rows each). Tables are deliberately cross-domain — a customers table contains identity, contact, metadata, and categorical columns — so the classification pipeline cannot rely on table name alone.

~25% of columns use opaque names (field_42, var_abc, col_xyz) to exercise the pipeline’s ability to classify from values and context rather than column name heuristics.
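A toy version of the opaque-naming trick might look like the following. The column pool, seed, and row values are invented here; the real generator is scripts/generate_sample_source.py.

```python
import csv
import io
import random

random.seed(7)
named = ["customer_id", "full_name", "email", "signup_date",
         "status", "country", "plan_type", "last_login"]

# Replace roughly a quarter of the names with opaque placeholders so the
# pipeline must classify from values and context, not name heuristics.
columns = [
    c if random.random() > 0.25 else f"field_{i}"
    for i, c in enumerate(named)
]

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(columns)
for row in range(3):  # the real tables carry 100 rows each
    writer.writerow([f"v{row}_{i}" for i in range(len(columns))])

print(buf.getvalue().splitlines()[0])
```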

Generated by scripts/generate_sample_source.py. The curated reference for the Sample source fixture is committed in data/sample/reference_labels.json (scope: fixture-only, for OOTB demo and unit tests).

For UAT / production evaluation, the curated reference lives at build/meta-tagging-clean/curated_reference.csv (gitignored). It is built by scripts/parity/build_curated_reference.py from direct reference-column evidence plus a name-index lookup, with Ontology > Annotation > Common Names priority. UAT’s own classification outputs are provisional predictions; they are scored against this curated reference in build/results/parity/delta_report.md.
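The Ontology > Annotation > Common Names priority can be sketched as a fixed-order resolver. The source names and example labels below are invented for illustration; only the ordering rule comes from the text.

```python
# Highest-priority evidence source first, per the rule described above.
PRIORITY = ["ontology", "annotation", "common_names"]

def resolve_label(candidates):
    """candidates: {evidence_source: label}. Return the label from the
    highest-priority source that produced one, plus which source won."""
    for source in PRIORITY:
        if source in candidates:
            return candidates[source], source
    return None, None

print(resolve_label({"common_names": "NAME",
                     "ontology": "ICE.SENSITIVE.PID"}))
# → ('ICE.SENSITIVE.PID', 'ontology')
```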

Auto-Import on First Boot

The gateway seeds the Sample source (id ootb-sample) via a FastAPI lifespan context manager:

  1. Check if ootb-sample source has any dataset versions
  2. If none, read sample_source_stats() (table count, column count)
  3. Create dataset version 1 with the stats as metadata
  4. Update source metadata JSON

This runs once at startup. If the database isn’t ready (migrations haven’t run), seeding is silently skipped.
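The seeding flow might be shaped like the lifespan below (you would pass it to `FastAPI(lifespan=...)`). The in-memory `DB` dict and helper bodies are stand-ins for Atelier's real database layer; the table/column counts come from the figures in this document.

```python
import asyncio
from contextlib import asynccontextmanager

# Stand-in for the real database; empty versions list = fresh deployment.
DB = {"sources": {"ootb-sample": {"metadata": None}}, "versions": {}}

def sample_source_stats():
    return {"table_count": 25, "column_count": 316}

@asynccontextmanager
async def lifespan(app):
    try:
        versions = DB["versions"].get("ootb-sample", [])
        if not versions:                       # 1. no dataset versions yet?
            stats = sample_source_stats()      # 2. read table/column counts
            DB["versions"]["ootb-sample"] = [  # 3. create version 1
                {"version_number": 1, "is_active": True, "metadata": stats}
            ]
            DB["sources"]["ootb-sample"]["metadata"] = stats  # 4. update source
    except Exception:
        pass  # DB not ready (migrations haven't run): skip seeding silently
    yield     # the app serves requests here; nothing to do on shutdown

async def boot():
    async with lifespan(app=None):
        return DB["versions"]["ootb-sample"][0]["version_number"]

print(asyncio.run(boot()))  # 1
```

Because the seed is guarded by the "any versions?" check, restarting the gateway never creates a second version.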

API

REST Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /api/data-sources | GET | List all data sources |
| /api/data-sources | POST | Create a new data source |
| /api/datasets?source_id=X | GET | List versions for a source |
| /api/datasets/{id}/activate | POST | Set a version as active |
| /api/vocabulary/stats?source_id=X | GET | Term count (source-aware) |
| /api/fsm/start?source_id=X | POST | Start pipeline for a source |

gRPC RPCs

| RPC | Description |
|---|---|
| ListDataSources() | List all sources |
| StartClassification(source_id=…) | Start pipeline for a source |

UI Integration

The Status page has two new cards:

  • Data Source card: dropdown selector for sources + version table showing version number, column count, timestamp, and summary. Click a row to activate that version.
  • Classification Pipeline card: “Start Classification” passes activeSourceId to /api/fsm/start?source_id=…

The Landing page stats cards reflect the active source:

  • Terms: vocabulary size for the active source (316 for the Sample source)
  • Entities: column count from the active dataset version
  • Sources badge: shows count when multiple sources exist

DatasetContext

interface DatasetContextValue {
  sources: DataSourceInfo[];
  activeSourceId: string | null;
  setActiveSourceId: (id: string) => void;
  datasets: DatasetInfo[];           // for activeSourceId
  activeDatasetId: string | null;
  setActiveDatasetId: (id: string) => void;
  refreshSources: () => Promise<void>;
  refreshDatasets: () => Promise<void>;
}

Key Files

| File | Role |
|---|---|
| db/migrations/20260414…_data_sources_and_versions.sql | Schema migration |
| src/atelier/db/model.py | DataSource ORM model |
| src/atelier/db/dao.py | Source + version DAO methods |
| src/atelier/classify/sampler.py | load_sample_source(), sample_source_stats() |
| src/atelier/classify/taxonomy.py | load_sample_vocabulary() |
| src/atelier/classify/pipeline.py | Source-aware routing |
| src/atelier/gateway.py | REST endpoints + auto-import lifespan |
| data/sample/ontology.json | Expanded vocabulary (316 leaves) |
| data/sample/tables/*.csv | 25 sample tables |
| data/sample/reference_labels.json | 316-entry Sample-source fixture reference labels |
| build/meta-tagging-clean/curated_reference.csv (gitignored) | UAT-corpus curated reference |
| scripts/expand_vocabulary.py | Vocabulary expansion script |
| scripts/generate_sample_source.py | Sample table generation script |
| ui/src/contexts/DatasetContext.tsx | Source-aware React context |
| ui/src/pages/Status.tsx | Data source + version UI |