# Data Sources & Versioning
Atelier organizes classification work around data sources — each source contains input tables, and every pipeline run against a source produces a new dataset version. This replaces the earlier flat dataset model and enables the out-of-the-box (OOTB) onboarding experience.
## Data Model
```
DataSource (1)                       Dataset versions (N)
┌─────────────────────────┐         ┌──────────────────────────┐
│ id: "ootb-sample"       │──1:N───│ v3 (active) — 2 min ago  │
│ type: "sample"          │         │ v2 — yesterday           │
│ display: "Sample"       │         │ v1 — built-in            │
│ vocab_mode: "universal" │         └──────────────────────────┘
└─────────────────────────┘

┌─────────────────────────┐         ┌──────────────────────────┐
│ id: "hive-prod-default" │──1:N───│ v1 (active) — 1 hour ago │
│ type: "hive"            │         └──────────────────────────┘
│ display: "hive:prod/…"  │
│ vocab_mode: "hive"      │
└─────────────────────────┘
```
## Source Types

| Type | Tables loaded from | Vocabulary | Created by |
|---|---|---|---|
| `sample` | `data/sample/tables/*.csv` | Expanded ICE ontology (316 leaves) | Auto-seeded on first boot |
| `hive` | CAI data connection | Domain annotations from `vocab_uri` | User creates via Status page |
| `synth` | `data/synth/tables/*.csv` | Domain annotations from `vocab_uri` | Generated by `scripts/generate_synth_source.py` |
Vocabulary routing: for in-situ classification, the customer’s domain vocabulary *is* the classification target — the LLM reads labels and descriptions and classifies into the domain’s hierarchical dot-codes. The annotations table location is configured per source via `vocab_uri` (e.g. `meta.vocab`, `meta.annotations`), decoupling data tables from the vocabulary. Multiple sources can share the same annotations table.
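The `vocab_uri` convention can be illustrated with a small sketch. The `DataSource` shape and the `annotations_table()` helper here are illustrative stand-ins, not the real ORM model or DAO:

```python
from dataclasses import dataclass


@dataclass
class DataSource:
    """Illustrative stand-in for the real DataSource model."""
    id: str
    source_type: str
    vocab_uri: str  # e.g. "meta.vocab" or "meta.annotations"


def annotations_table(source):
    """Split vocab_uri into (schema, table); fail loudly if it is unset."""
    if not source.vocab_uri:
        raise ValueError(f"source {source.id!r} has no annotations table configured")
    schema, _, table = source.vocab_uri.partition(".")
    return schema, table


# Two sources can point at the same annotations table:
a = DataSource("hive-prod-default", "hive", "meta.vocab")
b = DataSource("hive-prod-backup", "hive", "meta.vocab")
assert annotations_table(a) == annotations_table(b) == ("meta", "vocab")
```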
Future work: a portable pre-trained model (classify-ICE-then-map) would classify against the built-in ICE vocabulary and translate results to customer terms via `VocabMapping`. This requires dedicated training hardware and is not yet implemented.
## Database Schema

```sql
CREATE TABLE data_sources (
    id TEXT PRIMARY KEY,
    source_type TEXT NOT NULL,                 -- 'sample' | 'hive' | 'synth'
    source_uri TEXT NOT NULL DEFAULT '',
    display_name TEXT NOT NULL,
    vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
    vocab_uri TEXT NOT NULL DEFAULT '',        -- e.g. 'meta.vocab', 'meta.annotations'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    metadata TEXT                              -- JSON: table_count, column_count
);

-- Datasets gain source + version columns:
ALTER TABLE datasets ADD COLUMN source_id TEXT REFERENCES data_sources(id);
ALTER TABLE datasets ADD COLUMN version_number INTEGER NOT NULL DEFAULT 1;
ALTER TABLE datasets ADD COLUMN is_active BOOLEAN NOT NULL DEFAULT TRUE;
ALTER TABLE datasets ADD COLUMN summary TEXT;
ALTER TABLE datasets ADD COLUMN fsm_run_id TEXT;
```
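A minimal, self-contained exercise of the `data_sources` table (sqlite3 in memory, metadata round-tripped as JSON). The real migration and DAO live in `db/migrations/` and `src/atelier/db/dao.py`; this is only a sketch of how the schema is used:

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE data_sources (
        id TEXT PRIMARY KEY,
        source_type TEXT NOT NULL,
        source_uri TEXT NOT NULL DEFAULT '',
        display_name TEXT NOT NULL,
        vocabulary_mode TEXT NOT NULL DEFAULT 'auto',
        vocab_uri TEXT NOT NULL DEFAULT '',
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        metadata TEXT
    )
""")

# Seed the built-in Sample source with its table/column counts as JSON metadata.
conn.execute(
    "INSERT INTO data_sources (id, source_type, display_name, metadata) VALUES (?, ?, ?, ?)",
    ("ootb-sample", "sample", "Sample",
     json.dumps({"table_count": 25, "column_count": 316})),
)

row = conn.execute(
    "SELECT metadata FROM data_sources WHERE id = 'ootb-sample'"
).fetchone()
print(json.loads(row[0])["column_count"])  # 316
```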
## Vocabulary Routing

When a pipeline run starts, the `source_id` determines which vocabulary loads:

- `ootb-sample`: `load_sample_vocabulary()` → `data/sample/ontology.json` (316 BFO-grounded leaves across the CCO ICE trichotomy)
- hive / synth: domain annotations loaded directly from the table specified by `vocab_uri`. The domain vocabulary *is* the classification target — no composition with the universal base. Hive sources always require an annotations table.
- No source: falls back to the universal vocabulary (16 PII leaves)
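The routing above amounts to a single dispatch on source type. Only `load_sample_vocabulary()` is a name from the codebase; the `Source` shape and the other loaders are illustrative stand-ins:

```python
from dataclasses import dataclass


@dataclass
class Source:
    source_type: str
    vocab_uri: str = ""


# Stand-in loaders; the real ones read ontology.json or the annotations table.
def load_universal_vocabulary():
    return ["PII"] * 16            # 16 PII leaves

def load_sample_vocabulary():
    return ["ICE.*"]               # data/sample/ontology.json

def load_domain_annotations(vocab_uri):
    return [vocab_uri]             # rows from the configured annotations table


def load_vocabulary(source):
    """Pick the vocabulary for a pipeline run from the source's type."""
    if source is None:
        return load_universal_vocabulary()
    if source.source_type == "sample":
        return load_sample_vocabulary()
    if source.source_type in ("hive", "synth"):
        if not source.vocab_uri:
            raise ValueError("hive/synth sources require an annotations table")
        return load_domain_annotations(source.vocab_uri)
    raise ValueError(f"unknown source type: {source.source_type!r}")


assert len(load_vocabulary(None)) == 16
assert load_vocabulary(Source("hive", "meta.vocab")) == ["meta.vocab"]
```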
## LLM Robustness

The LLM classification batch uses adaptive sizing to avoid context truncation. With large vocabularies (>200 categories), the system prompt embedding the full category table can consume significant context.

- Adaptive batch sizing: `_estimate_safe_batch_size()` reduces `columns_per_call` for large vocabularies (e.g. 290 categories → 41)
- Truncation retry: when `LLMResponse.truncated` is detected, the batch is halved and retried recursively until all columns are classified
- Metrics: `truncation_count` and `effective_batch_size` are tracked in `BootstrapState` and exposed via the agent’s `check_convergence` tool
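A hedged sketch of both mechanisms. The token budget and per-category/per-column costs are assumptions, so the numbers will not reproduce the real `_estimate_safe_batch_size()` output exactly; the shape of the logic is what matters:

```python
def estimate_safe_batch_size(n_categories, default=50,
                             budget_tokens=16_000,
                             tokens_per_category=30,
                             tokens_per_column=120):
    """Shrink columns-per-call so the category table plus the batch fits the budget."""
    remaining = budget_tokens - n_categories * tokens_per_category
    return max(1, min(default, remaining // tokens_per_column))


def classify_with_retry(columns, batch_size, call_llm):
    """Halve a batch on truncation and retry until every column is classified."""
    results = []
    for i in range(0, len(columns), batch_size):
        batch = columns[i:i + batch_size]
        resp = call_llm(batch)
        if resp.truncated and len(batch) > 1:
            half = len(batch) // 2
            results += classify_with_retry(batch[:half], half, call_llm)
            results += classify_with_retry(batch[half:], len(batch) - half, call_llm)
        else:
            # A single column that still truncates is accepted as-is.
            results += resp.labels
    return results
```

With these assumed costs, larger vocabularies monotonically shrink the batch, which is the behaviour the pipeline relies on.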
Sample Source
The built-in “Sample” source (source_id ootb-sample) ships with
Atelier so new deployments show meaningful data immediately. When the
landing page loads and “Connected” turns green, the stats cards show
316 Terms and 316 Entities. The ootb- prefix in the id is an
internal marker distinguishing shipped sources from user-registered
connections — it is not shown in the UI.
### Expanded Vocabulary (ICE.* Ontology)

The vocabulary follows the CCO ICE (Information Content Entity) trichotomy, grounded in BFO via `atelier-vocab.ttl`:

```
ICE (root) ≡ cco:InformationContentEntity
├── ICE.NONSENSITIVE
│   ├── ICE.NONSENSITIVE.DESIGNATIVE ⊑ cco:DesignativeICE
│   │   ├── .NAME (.PERSON, .ORG, .PRODUCT, .SCIENTIFIC)
│   │   ├── .CODE (.ID, .ABBREV, .POSTAL)
│   │   ├── .GEO (.COUNTRY, .REGION, .CITY, .LOCATION)
│   │   ├── .REF (.CITATION, .VERSION, .SOURCE)
│   │   └── .TITLE
│   ├── ICE.NONSENSITIVE.DESCRIPTIVE ⊑ cco:DescriptiveICE
│   │   ├── .TEXT (.DESCRIPTION, .COMMENT, .ABSTRACT, .DEFINITION)
│   │   ├── .CATEGORICAL (.TYPE, .CATEGORY, .RANK, .LANGUAGE)
│   │   ├── .MEASUREMENT (~20 subtypes)
│   │   └── .TEMPORAL (.DATE, .YEAR, .DURATION, .PERIOD, …)
│   └── ICE.NONSENSITIVE.PRESCRIPTIVE ⊑ cco:PrescriptiveICE
│       └── .FORMAT, .FORMULA, .ROUTE, .ROLE
├── ICE.SENSITIVE
│   ├── ICE.SENSITIVE.PID (~40 leaves: CONTACT, IDENTITY, FINANCIAL, HEALTH)
│   ├── ICE.SENSITIVE.TECHNICAL (IPADDR, DEVID, URL, HOSTNAME, …)
│   └── ICE.SENSITIVE.BUSINESS (.TRADE_SECRET, .CONTRACT_VALUE, …)
└── ICE.METADATA
    └── .TIMESTAMP, .RECID, .STATUS, .VERSION, .CREATED_BY, …
```
351 total categories: 316 leaves + 35 internal nodes across 5 subtrees.
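Because every category is a hierarchical dot-code, parent and leaf relationships can be derived from the code strings alone, with no separate tree structure. A minimal sketch (the category subset here is illustrative, not the full 351):

```python
# Tiny illustrative subset of the dot-code vocabulary.
codes = {
    "ICE",
    "ICE.NONSENSITIVE",
    "ICE.NONSENSITIVE.DESIGNATIVE",
    "ICE.NONSENSITIVE.DESIGNATIVE.NAME",
    "ICE.NONSENSITIVE.DESIGNATIVE.NAME.PERSON",
    "ICE.SENSITIVE",
    "ICE.SENSITIVE.PID",
}


def parent(code):
    """Drop the last dot-segment; the root has no parent."""
    head, sep, _ = code.rpartition(".")
    return head if sep else None


# A leaf is a code that no other code names as its parent.
leaves = {c for c in codes if not any(parent(other) == c for other in codes)}
assert "ICE.NONSENSITIVE.DESIGNATIVE.NAME.PERSON" in leaves
assert "ICE.NONSENSITIVE" not in leaves
```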
Design principle: every category is our own BFO-grounded term. External sources (GitTables, meta-tagging) inform which conceptual space to cover; we never import their raw tags. The mapping goes outward from our vocabulary via `atelier-vocab.ttl`, not inward.
### Sample Tables

25 mixed-domain tables with 316 columns (100 rows each). Tables are deliberately cross-domain — a `customers` table contains identity, contact, metadata, and categorical columns — so the classification pipeline cannot rely on table name alone.

~25% of columns use opaque names (`field_42`, `var_abc`, `col_xyz`) to exercise the pipeline’s ability to classify from values and context rather than column-name heuristics.
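The opaque-name policy can be sketched as a post-processing step over generated column names. This is illustrative only; the real logic lives in `scripts/generate_sample_source.py` and may differ:

```python
import random


def obfuscate_columns(names, frac=0.25, seed=0):
    """Replace roughly `frac` of column names with opaque ones like field_42."""
    rng = random.Random(seed)                      # seeded for reproducible fixtures
    k = round(len(names) * frac)
    picks = set(rng.sample(range(len(names)), k))  # which positions to obscure
    return [f"field_{i}" if i in picks else n for i, n in enumerate(names)]


print(obfuscate_columns(["customer_id", "email", "signup_date", "plan"]))
```

Seeding the RNG keeps the fixture stable across regenerations, which matters when `reference_labels.json` keys off column names.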
Generated by `scripts/generate_sample_source.py`. The curated reference for the Sample source fixture is committed in `data/sample/reference_labels.json` (scope: fixture-only, for the OOTB demo and unit tests).

For UAT / production evaluation, the curated reference lives at `build/meta-tagging-clean/curated_reference.csv` (gitignored) — built by `scripts/parity/build_curated_reference.py` from direct reference-column evidence plus name-index lookup with Ontology > Annotation > Common Names priority. UAT’s own classification outputs are provisional predictions and are scored against this curated reference in `build/results/parity/delta_report.md`.
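The Ontology > Annotation > Common Names priority amounts to a first-match lookup across three indexes. A sketch with illustrative index contents and label strings (the real builder is `scripts/parity/build_curated_reference.py`):

```python
def resolve_label(column, ontology, annotation, common_names):
    """Return (label, provenance) from the highest-priority index that knows the column."""
    for provenance, index in (("ontology", ontology),
                              ("annotation", annotation),
                              ("common_names", common_names)):
        if column in index:
            return index[column], provenance
    return None, None


# Illustrative indexes: "ssn" is known to both, so ontology wins.
ontology = {"ssn": "ICE.SENSITIVE.PID.IDENTITY"}
annotation = {"ssn": "pii.national_id", "email": "pii.email"}

assert resolve_label("ssn", ontology, annotation, {})[1] == "ontology"
assert resolve_label("email", ontology, annotation, {})[1] == "annotation"
```

Recording provenance alongside each label makes the delta report auditable: a disputed score can be traced back to which index supplied the reference.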
## Auto-Import on First Boot

The gateway seeds the Sample source (id `ootb-sample`) via a FastAPI lifespan context manager:

- Check whether the `ootb-sample` source has any dataset versions
- If none, read `sample_source_stats()` (table count, column count)
- Create dataset version 1 with the stats as metadata
- Update the source metadata JSON

This runs once at startup. If the database isn’t ready (migrations haven’t run), seeding is silently skipped.
## API

### REST Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/api/data-sources` | GET | List all data sources |
| `/api/data-sources` | POST | Create a new data source |
| `/api/datasets?source_id=X` | GET | List versions for a source |
| `/api/datasets/{id}/activate` | POST | Set a version as active |
| `/api/vocabulary/stats?source_id=X` | GET | Term count (source-aware) |
| `/api/fsm/start?source_id=X` | POST | Start pipeline for a source |
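A stdlib-only client sketch for the endpoints above. The base URL and the shape of the JSON responses are assumptions about a local deployment:

```python
import json
import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumed local gateway address


def api_url(path, **params):
    """Build an endpoint URL with properly encoded query parameters."""
    query = urllib.parse.urlencode(params)
    return f"{BASE}{path}" + (f"?{query}" if query else "")


def start_classification(source_id):
    """POST /api/fsm/start?source_id=… and return the parsed JSON response."""
    req = urllib.request.Request(
        api_url("/api/fsm/start", source_id=source_id), method="POST"
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


print(api_url("/api/datasets", source_id="ootb-sample"))
# http://localhost:8000/api/datasets?source_id=ootb-sample
```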
### gRPC RPCs

| RPC | Description |
|---|---|
| `ListDataSources()` | List all sources |
| `StartClassification(source_id=…)` | Start pipeline for a source |
## UI Integration

The Status page has two new cards:

- Data Source card: dropdown selector for sources plus a version table showing version number, column count, timestamp, and summary. Click a row to activate that version.
- Classification Pipeline card: “Start Classification” passes `activeSourceId` to `/api/fsm/start?source_id=…`
The Landing page stats cards reflect the active source:
- Terms: vocabulary size for the active source (316 for the Sample source)
- Entities: column count from the active dataset version
- Sources badge: shows count when multiple sources exist
### DatasetContext

```ts
interface DatasetContextValue {
  sources: DataSourceInfo[];
  activeSourceId: string | null;
  setActiveSourceId: (id: string) => void;
  datasets: DatasetInfo[];        // for activeSourceId
  activeDatasetId: string | null;
  setActiveDatasetId: (id: string) => void;
  refreshSources: () => Promise<void>;
  refreshDatasets: () => Promise<void>;
}
```
## Key Files

| File | Role |
|---|---|
| `db/migrations/20260414…_data_sources_and_versions.sql` | Schema migration |
| `src/atelier/db/model.py` | `DataSource` ORM model |
| `src/atelier/db/dao.py` | Source + version DAO methods |
| `src/atelier/classify/sampler.py` | `load_sample_source()`, `sample_source_stats()` |
| `src/atelier/classify/taxonomy.py` | `load_sample_vocabulary()` |
| `src/atelier/classify/pipeline.py` | Source-aware routing |
| `src/atelier/gateway.py` | REST endpoints + auto-import lifespan |
| `data/sample/ontology.json` | Expanded vocabulary (316 leaves) |
| `data/sample/tables/*.csv` | 25 sample tables |
| `data/sample/reference_labels.json` | 316-entry Sample-source fixture reference labels |
| `build/meta-tagging-clean/curated_reference.csv` (gitignored) | UAT-corpus curated reference |
| `scripts/expand_vocabulary.py` | Vocabulary expansion script |
| `scripts/generate_sample_source.py` | Sample table generation script |
| `ui/src/contexts/DatasetContext.tsx` | Source-aware React context |
| `ui/src/pages/Status.tsx` | Data source + version UI |