Stage 3: Synthetic Data Generation
Given a relational schema with known ontological provenance, the third stage populates its tables with realistic synthetic data. The goal is not merely to fill rows but to produce value distributions that exercise the same patterns and confusable types the model will encounter in real enterprise databases.
Population Pipeline
Value Generation
Each column type maps to a specialized generator that produces realistic values. The generator selection is driven by the ontological provenance of the column – a column traced to BFO:Quality with domain healthcare produces different values than one traced to BFO:Quality with domain finance.
Generator Categories
| Column Semantics | Generator | Example Values |
|---|---|---|
| Person name | Faker (locale-aware) | “Maria Santos”, “James O’Brien” |
| Date/timestamp | Range-bounded random | 2019-03-15, 2024-11-02T14:30:00 |
| Identifier (UUID) | UUIDv4 | f47ac10b-58cc-4372-a567-0e02b2c3d479 |
| Identifier (sequential) | Auto-increment with prefix | PAT-00001, ENC-2024-0042 |
| Medical code (ICD-10) | Sampled from code registry | J18.9, I25.10, E11.65 |
| Financial code (IBAN) | Country-specific format | DE89370400440532013000 |
| Categorical | Weighted sampling from enum | active, closed, pending |
| Free text | Template + Faker | “Patient presents with acute chest pain” |
| Numeric measure | Distribution-sampled | 98.6, 120/80, 72 |
| Boolean flag | Bernoulli(p) | true, false |
| Address | Locale-aware composite | “123 Main St, Springfield, IL 62704” |
| Email | Pattern-based | maria.santos@hospital.org |
| Phone | Country-format | +1-555-0123 |
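The mapping from column semantics and domain to a generator can be sketched as a small registry lookup. This is a minimal illustration, not the pipeline's actual code: the registry keys, the function names (`pick_generator`, `make_sequential`, `make_categorical`), and the weights are all assumptions, and only the standard library is used so the sketch stays self-contained.

```python
import random
import uuid
from typing import Callable, Dict, Tuple

# A generator takes a seeded RNG and returns one value (hypothetical interface).
Generator = Callable[[random.Random], str]

def uuid4_id(rng: random.Random) -> str:
    # Build a UUIDv4 from the seeded RNG so that runs are reproducible.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))

def make_sequential(prefix: str, width: int = 5) -> Generator:
    # Auto-increment identifier with a prefix, e.g. PAT-00001.
    counter = 0
    def gen(rng: random.Random) -> str:
        nonlocal counter
        counter += 1
        return f"{prefix}-{counter:0{width}d}"
    return gen

def make_categorical(weights: Dict[str, float]) -> Generator:
    # Weighted sampling from a fixed enum of values.
    values = list(weights)
    probs = list(weights.values())
    def gen(rng: random.Random) -> str:
        return rng.choices(values, weights=probs, k=1)[0]
    return gen

# Hypothetical registry keyed by (column semantics, domain).
REGISTRY: Dict[Tuple[str, str], Generator] = {
    ("identifier_uuid", "any"): uuid4_id,
    ("identifier_sequential", "healthcare"): make_sequential("PAT"),
    ("categorical_status", "healthcare"): make_categorical(
        {"active": 0.2, "closed": 0.7, "pending": 0.1}  # illustrative weights
    ),
}

def pick_generator(semantics: str, domain: str) -> Generator:
    # Prefer a domain-specific generator, else fall back to a domain-agnostic one.
    key = (semantics, domain)
    if key in REGISTRY:
        return REGISTRY[key]
    return REGISTRY[(semantics, "any")]
```

A call such as `pick_generator("identifier_sequential", "healthcare")(random.Random(0))` yields the next prefixed identifier, while a semantics with no domain-specific entry falls back to the `"any"` generator.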
Referential Integrity
Tables are populated in topological order (parents before children) to guarantee that every foreign key value references an existing parent row. The population engine:
- Sorts tables by foreign key dependencies (detecting and breaking cycles if needed)
- Populates root tables (no FK dependencies) first
- For each child table, samples FK values from the parent table’s primary key column
- Respects cardinality constraints: a NOT NULL FK always gets a valid reference; an optional FK gets NULL with configurable probability
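The steps above can be sketched with the standard library's `graphlib`. The schema dictionary, the `populate` function, and its parameters are hypothetical names chosen for illustration. The sketch only detects cycles (`TopologicalSorter` raises `CycleError`); it does not attempt to break them.

```python
import random
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical schema: table -> {fk_column: (parent_table, nullable)}
SCHEMA = {
    "patient": {},
    "provider": {},
    "encounter": {
        "patient_id": ("patient", False),   # NOT NULL FK
        "provider_id": ("provider", True),  # optional FK
    },
    "diagnosis": {"encounter_id": ("encounter", False)},
}

def populate(schema, rows_per_table=5, null_prob=0.2, seed=0):
    rng = random.Random(seed)
    # Map each table to the set of parent tables it depends on.
    deps = {t: {parent for parent, _ in fks.values()} for t, fks in schema.items()}
    # static_order() yields parents before children; raises CycleError on cycles.
    order = list(TopologicalSorter(deps).static_order())
    data = {}
    for table in order:
        rows = []
        for pk in range(1, rows_per_table + 1):
            row = {"id": pk}
            for col, (parent, nullable) in schema[table].items():
                if nullable and rng.random() < null_prob:
                    row[col] = None  # optional FK left NULL
                else:
                    # Sample an existing primary key from the parent table.
                    row[col] = rng.choice(data[parent])["id"]
            rows.append(row)
        data[table] = rows
    return order, data
```

Because every parent table is filled before its children, `data[parent]` is always available when a child row samples its foreign key.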
Distribution Control
Values in real databases are not uniformly distributed. The generation config controls:
- Cardinality: How many child rows per parent (e.g., 1–30 encounters per patient, following a power-law distribution)
- Null ratio: What fraction of nullable columns contain NULL (typically 5–30% in real data)
- Value entropy: How many distinct values appear in categorical columns (a status column might have 3 values; a diagnosis_code column might have 500)
- Skew: Zipfian distributions for columns where a few values dominate (e.g., 80% of encounters are status='closed')
- Temporal patterns: Dates that follow realistic patterns (weekday-heavy, seasonal, monotonically increasing)
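Three of these controls (cardinality, null ratio, and skew) can be sketched with weighted sampling. The function names and default parameters below are assumptions made for illustration; the exponent `s` would need tuning to hit a specific target such as a dominant value at 80%, and temporal patterns are not covered here.

```python
import random

def zipf_weights(n: int, s: float = 1.0):
    # Weight of rank k is proportional to 1 / k**s, normalized to sum to 1.
    raw = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def sample_children_count(rng: random.Random, lo: int = 1, hi: int = 30, s: float = 1.5) -> int:
    # Truncated power law over integer counts lo..hi (small counts are most likely).
    counts = list(range(lo, hi + 1))
    return rng.choices(counts, weights=zipf_weights(len(counts), s), k=1)[0]

def sample_column(rng: random.Random, values, n_rows: int, null_ratio: float = 0.1, s: float = 1.0):
    # values are ordered from most to least frequent; NULLs are injected first.
    weights = zipf_weights(len(values), s)
    return [
        None if rng.random() < null_ratio
        else rng.choices(values, weights=weights, k=1)[0]
        for _ in range(n_rows)
    ]
```

For example, `sample_column(random.Random(42), ["closed", "active", "pending"], 1000)` produces a status column in which `closed` dominates and roughly one value in ten is `None`.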
Diversity from Source Text
The curated text input drives diversity along two independent axes:
Domain Diversity
Different passages produce different ontological domains, which produce structurally distinct databases:
| Source Domain | Example Tables | Distinctive Patterns |
|---|---|---|
| Healthcare | patient, encounter, diagnosis, medication | ICD-10 codes, temporal encounter sequences |
| Finance | account, transaction, instrument, counterparty | IBAN/SWIFT codes, decimal precision, audit trails |
| Supply Chain | shipment, warehouse, item, carrier | GPS coordinates, weight/volume, tracking IDs |
| Education | student, course, enrollment, grade | GPA calculations, semester cycles |
| HR/Payroll | employee, department, payroll, benefit | SSN patterns, salary ranges, org hierarchies |
Structural Diversity
Even within a single domain, different passages emphasize different relationships, producing varied schema structures:
- A passage about emergency triage produces schemas with acuity levels, wait times, and disposition tracking
- A passage about chronic disease management produces schemas with longitudinal encounters, medication histories, and care plans
- A passage about hospital billing produces schemas with insurance claims, procedure codes, and payment reconciliation
All three are “healthcare databases” but have substantially different table structures, column types, and relationship patterns. This structural diversity is what trains the model to generalize beyond surface patterns.
Confusable Type Injection
A key training challenge is confusable pairs – columns with nearly identical value distributions but different semantic types. The generation pipeline deliberately injects these:
| Confusable Pair | Value Pattern | Distinguishing Context |
|---|---|---|
| Advertising ID vs GUID | Both UUIDv4 format | Table context (ad_events vs generic) |
| Bank account vs payment card | Both numeric strings | Length, check digit algorithm |
| Phone number vs fax number | Both +1-XXX-XXX-XXXX | Column name, co-occurring columns |
| ZIP code vs department code | Both 5-digit numbers | Geographic context vs org context |
| Patient ID vs provider ID | Both XXX-NNNNN format | Foreign key relationships |
By generating schemas where these confusable types coexist – often in the same database – the model learns to resolve ambiguity using cross-column and cross-table context rather than single-column pattern matching.
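A minimal sketch of the injection idea, using two pairs from the table: columns in a pair share a surface format, and only the label, table, and (for cards) the check digit differ. The Luhn algorithm is the standard check digit scheme for payment card numbers. Everything else is an assumption for illustration: the table and column names other than ad_events, the 16-digit account format, the leading "4" on card numbers, and the `inject_confusables` function itself.

```python
import random
import uuid

def uuid4(rng: random.Random) -> str:
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))

def luhn_check_digit(payload: str) -> int:
    # Double every second digit starting from the rightmost payload digit.
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return (10 - total % 10) % 10

def luhn_valid(number: str) -> bool:
    return luhn_check_digit(number[:-1]) == int(number[-1])

def card_number(rng: random.Random) -> str:
    # 15 random payload digits (illustrative leading "4") plus a Luhn check digit.
    payload = "4" + "".join(str(rng.randrange(10)) for _ in range(14))
    return payload + str(luhn_check_digit(payload))

def account_number(rng: random.Random, length: int = 16) -> str:
    # Plain numeric string with no check digit constraint.
    return "".join(str(rng.randrange(10)) for _ in range(length))

def inject_confusables(rng: random.Random, n_rows: int = 3):
    """Return labeled columns whose values look alike but whose types differ."""
    def col(table, column, label, gen):
        return {"table": table, "column": column, "label": label,
                "values": [gen(rng) for _ in range(n_rows)]}
    return [
        col("ad_events", "advertising_id", "advertising_id", uuid4),
        col("audit_log", "record_guid", "guid", uuid4),
        col("payment", "card_number", "payment_card", card_number),
        col("payment", "account_number", "bank_account", account_number),
    ]
```

Note that a random 16-digit account string still passes the Luhn check about one time in ten, so the check digit is a signal across a column rather than proof for a single value.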
Scale Arithmetic
Working through concrete numbers:
| Stage | Count | Basis |
|---|---|---|
| FineWeb-Edu passages | ~500M | 1.3T tokens / ~2,600 tokens per passage |
| Ontology fragments | ~1–5 per passage | Domain-dependent entity density |
| Schemas per fragment | ~1–10 | Normalization and naming variation |
| Tables per schema | ~5–50 | Domain complexity |
| Rows per table | ~100–10,000 | Configurable per generation |
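| Total table instances | >10 billion | Conservative estimate (e.g., 500M passages × 2 fragments × 2 schemas × 5 tables; the strict minimum of every range gives ~2.5 billion) |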
The bottleneck is LLM inference for ontology extraction (Stage 1), not data generation. Once an ontology fragment exists, schema projection and data population are purely procedural and can run on commodity hardware at millions of tables per hour.
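The arithmetic in the table can be checked with a few lines. The ranges are copied from the table; the variable and function names are illustrative. Taking the low end of every range gives about 2.5 billion tables, so the >10 billion figure corresponds to settings slightly above the minimum (for instance 2 fragments per passage and 2 schemas per fragment).

```python
TOTAL_TOKENS = 1_300_000_000_000   # 1.3T tokens
TOKENS_PER_PASSAGE = 2_600

passages = TOTAL_TOKENS // TOKENS_PER_PASSAGE  # 500,000,000

# (low, high) multipliers taken from the table above.
RANGES = {
    "fragments_per_passage": (1, 5),
    "schemas_per_fragment": (1, 10),
    "tables_per_schema": (5, 50),
}

def total_tables(pick) -> int:
    # pick(lo, hi) chooses one multiplier from each range.
    total = passages
    for lo, hi in RANGES.values():
        total *= pick(lo, hi)
    return total

low = total_tables(lambda lo, hi: lo)    # 2,500,000,000
high = total_tables(lambda lo, hi: hi)   # 1,250,000,000,000
example = passages * 2 * 2 * 5           # 10,000,000,000
```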