Stage 3: Synthetic Data Generation

Given a relational schema with known ontological provenance, the third stage populates tables with realistic synthetic data. The goal is not just to fill rows – it is to produce data distributions that exercise the same patterns and confusable types the model will encounter in real enterprise databases.

Population Pipeline

Value Generation

Each column type maps to a specialized generator that produces realistic values. The generator selection is driven by the ontological provenance of the column – a column traced to BFO:Quality with domain healthcare produces different values than one traced to BFO:Quality with domain finance.

Generator Categories

| Column Semantics | Generator | Example Values |
|---|---|---|
| Person name | Faker (locale-aware) | “Maria Santos”, “James O’Brien” |
| Date/timestamp | Range-bounded random | 2019-03-15, 2024-11-02T14:30:00 |
| Identifier (UUID) | UUIDv4 | f47ac10b-58cc-4372-a567-0e02b2c3d479 |
| Identifier (sequential) | Auto-increment with prefix | PAT-00001, ENC-2024-0042 |
| Medical code (ICD-10) | Sampled from code registry | J18.9, I25.10, E11.65 |
| Financial code (IBAN) | Country-specific format | DE89370400440532013000 |
| Categorical | Weighted sampling from enum | active, closed, pending |
| Free text | Template + Faker | “Patient presents with acute chest pain” |
| Numeric measure | Distribution-sampled | 98.6, 120/80, 72 |
| Boolean flag | Bernoulli(p) | true, false |
| Address | Locale-aware composite | “123 Main St, Springfield, IL 62704” |
| Email | Pattern-based | maria.santos@hospital.org |
| Phone | Country-format | +1-555-0123 |
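As an illustration of how generator selection might be keyed on column semantics and domain, here is a minimal sketch using only the Python standard library. The registry keys, the function names, and the `ACC` prefix are hypothetical and chosen for this example; they are not the pipeline's actual API, and the Faker-based generators from the table are left out to keep the block dependency-free.

```python
import random
import uuid
from typing import Callable, Dict, Optional, Tuple

# A generator takes a seeded RNG and returns one value.
Generator = Callable[[random.Random], object]


def gen_uuid(rng: random.Random) -> str:
    # Build a version-4 UUID from seeded random bits so runs are reproducible.
    return str(uuid.UUID(int=rng.getrandbits(128), version=4))


def gen_sequential(prefix: str, width: int = 5) -> Generator:
    # Auto-increment with a prefix, e.g. PAT-00001, PAT-00002, ...
    counter = 0

    def _next(rng: random.Random) -> str:
        nonlocal counter
        counter += 1
        return f"{prefix}-{counter:0{width}d}"

    return _next


def gen_categorical(weights: Dict[str, float]) -> Generator:
    # Weighted sampling from a fixed set of values.
    values, probs = zip(*weights.items())
    return lambda rng: rng.choices(values, weights=probs, k=1)[0]


def gen_boolean(p: float) -> Generator:
    # Bernoulli(p): True with probability p.
    return lambda rng: rng.random() < p


# Hypothetical registry keyed by (column semantics, domain).
# A domain of None marks a domain-agnostic generator.
REGISTRY: Dict[Tuple[str, Optional[str]], Generator] = {
    ("identifier_uuid", None): gen_uuid,
    ("identifier_sequential", "healthcare"): gen_sequential("PAT"),
    ("identifier_sequential", "finance"): gen_sequential("ACC"),
    ("categorical_status", None): gen_categorical(
        {"active": 0.15, "closed": 0.80, "pending": 0.05}
    ),
    ("boolean_flag", None): gen_boolean(0.3),
}


def pick_generator(semantics: str, domain: Optional[str]) -> Generator:
    # Prefer a domain-specific generator, then fall back to a domain-agnostic one.
    key = (semantics, domain)
    return REGISTRY[key] if key in REGISTRY else REGISTRY[(semantics, None)]


def generate_column(semantics: str, domain: Optional[str], n: int, seed: int = 0) -> list:
    rng = random.Random(seed)
    gen = pick_generator(semantics, domain)
    return [gen(rng) for _ in range(n)]
```

Calling `generate_column("identifier_sequential", "finance", 3)` would yield `ACC`-prefixed identifiers, while the same semantics under `"healthcare"` yields `PAT`-prefixed ones. That mirrors the idea that identical semantics can map to different values per domain.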

Referential Integrity

Tables are populated in topological order (parents before children) to guarantee that every foreign key value references an existing parent row. The population engine:

  1. Sorts tables by foreign key dependencies (detecting and breaking cycles if needed)
  2. Populates root tables (no FK dependencies) first
  3. For each child table, samples FK values from the parent table’s primary key column
  4. Respects cardinality constraints: a NOT NULL FK always gets a valid reference; an optional FK gets NULL with configurable probability
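The steps above can be sketched with the standard-library `graphlib.TopologicalSorter` (Python 3.9+). The `SCHEMA` dictionary, the integer `id` primary key, and the `null_prob` parameter are assumptions made for this example rather than the engine's real configuration format. Cycle breaking is not shown: in this sketch a cyclic schema simply raises `graphlib.CycleError`.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical schema description: table -> {fk_column: (parent_table, nullable)}
SCHEMA = {
    "patient": {},
    "provider": {},
    "encounter": {
        "patient_id": ("patient", False),   # NOT NULL FK
        "provider_id": ("provider", True),  # optional FK
    },
    "diagnosis": {"encounter_id": ("encounter", False)},
}


def population_order(schema):
    # TopologicalSorter expects node -> predecessors; a table's predecessors
    # are the parent tables its foreign keys point at.
    deps = {
        table: {parent for parent, _nullable in fks.values()}
        for table, fks in schema.items()
    }
    return list(TopologicalSorter(deps).static_order())


def populate(schema, rows_per_table=5, null_prob=0.2, seed=0):
    rng = random.Random(seed)
    data = {}
    for table in population_order(schema):
        rows = []
        for pk in range(1, rows_per_table + 1):
            row = {"id": pk}
            for col, (parent, nullable) in schema[table].items():
                if nullable and rng.random() < null_prob:
                    row[col] = None  # optional FK left empty
                else:
                    # Parents are already populated, so any sampled key is valid.
                    row[col] = rng.choice(data[parent])["id"]
            rows.append(row)
        data[table] = rows
    return data
```

Because parents are always populated before their children, every non-NULL foreign key value is drawn from rows that already exist, which is what guarantees referential integrity here.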

Distribution Control

Values in real databases are rarely uniformly distributed. The generation config controls:

  • Cardinality: How many child rows per parent (e.g., 1–30 encounters per patient, following a power-law distribution)
  • Null ratio: What fraction of nullable columns contain NULL (typically 5–30% in real data)
  • Value entropy: How many distinct values appear in categorical columns (a status column might have 3 values; a diagnosis_code column might have 500)
  • Skew: Zipfian distributions for columns where a few values dominate (e.g., 80% of encounters are status='closed')
  • Temporal patterns: Dates that follow realistic patterns (weekday-heavy, seasonal, monotonically increasing)
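A few of these controls can be sketched with the standard `random` module. The function names and default parameters (`s`, `alpha`, the 1–30 range) are illustrative assumptions, not the actual config schema, and temporal patterns are omitted for brevity.

```python
import random


def zipf_weights(n, s=1.0):
    # Zipfian weights: the value at rank r gets weight proportional to 1 / r**s.
    raw = [1.0 / (rank ** s) for rank in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def sample_skewed(values, n, rng, s=1.0):
    # Skew: earlier entries in `values` dominate the sample.
    return rng.choices(values, weights=zipf_weights(len(values), s), k=n)


def children_per_parent(rng, lo=1, hi=30, alpha=1.5):
    # Cardinality: truncated discrete power law, P(k) proportional to k**-alpha
    # for k in [lo, hi]. Most parents get few children, a few get many.
    ks = list(range(lo, hi + 1))
    weights = [k ** -alpha for k in ks]
    return rng.choices(ks, weights=weights, k=1)[0]


def apply_nulls(column, null_ratio, rng):
    # Null ratio: each value is independently replaced by None
    # with probability null_ratio.
    return [None if rng.random() < null_ratio else v for v in column]


rng = random.Random(42)
statuses = sample_skewed(["closed", "active", "pending"], 1000, rng, s=2.0)
encounter_counts = [children_per_parent(rng) for _ in range(1000)]
notes = apply_nulls(["text"] * 1000, null_ratio=0.2, rng=rng)
```

A pure Zipfian shape fixes the share of the top value only indirectly through `s`. To pin an exact share, such as the 80% `closed` example, explicit weights would be passed instead.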

Diversity from Source Text

The curated text input drives diversity along two independent axes:

Domain Diversity

Different passages produce different ontological domains, which produce structurally distinct databases:

| Source Domain | Example Tables | Distinctive Patterns |
|---|---|---|
| Healthcare | patient, encounter, diagnosis, medication | ICD-10 codes, temporal encounter sequences |
| Finance | account, transaction, instrument, counterparty | IBAN/SWIFT codes, decimal precision, audit trails |
| Supply Chain | shipment, warehouse, item, carrier | GPS coordinates, weight/volume, tracking IDs |
| Education | student, course, enrollment, grade | GPA calculations, semester cycles |
| HR/Payroll | employee, department, payroll, benefit | SSN patterns, salary ranges, org hierarchies |

Structural Diversity

Even within a single domain, different passages emphasize different relationships, producing varied schema structures:

  • A passage about emergency triage produces schemas with acuity levels, wait times, and disposition tracking
  • A passage about chronic disease management produces schemas with longitudinal encounters, medication histories, and care plans
  • A passage about hospital billing produces schemas with insurance claims, procedure codes, and payment reconciliation

All three are “healthcare databases” but have substantially different table structures, column types, and relationship patterns. This structural diversity is what trains the model to generalize beyond surface patterns.

Confusable Type Injection

A key training challenge is confusable pairs – columns with nearly identical value distributions but different semantic types. The generation pipeline deliberately injects these:

| Confusable Pair | Value Pattern | Distinguishing Context |
|---|---|---|
| Advertising ID vs GUID | Both UUIDv4 format | Table context (ad_events vs generic) |
| Bank account vs payment card | Both numeric strings | Length, check digit algorithm |
| Phone number vs fax number | Both +1-XXX-XXX-XXXX | Column name, co-occurring columns |
| ZIP code vs department code | Both 5-digit numbers | Geographic context vs org context |
| Patient ID vs provider ID | Both XXX-NNNNN format | Foreign key relationships |

By generating schemas where these confusable types coexist – often in the same database – the model learns to resolve ambiguity using cross-column and cross-table context rather than single-column pattern matching.
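To make the bank account versus payment card row concrete, the sketch below generates two 16-digit numeric columns that look alike on the surface. One carries a valid Luhn check digit, as payment card numbers do, and the other is plain random digits standing in for an account number. The function names and the 16-digit length are assumptions for this example, and the source does not say which check digit algorithms the pipeline actually uses.

```python
import random


def luhn_ok(number: str) -> bool:
    # Luhn check: from the right, double every second digit, subtract 9 from
    # any result above 9, and require the sum to be divisible by 10.
    if not number.isdigit():
        return False
    total = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def luhn_check_digit(payload: str) -> str:
    # Exactly one trailing digit makes payload + digit pass the Luhn check.
    for d in "0123456789":
        if luhn_ok(payload + d):
            return d
    raise ValueError("payload must be a non-empty digit string")


def fake_card_number(rng: random.Random, length: int = 16) -> str:
    payload = "".join(rng.choice("0123456789") for _ in range(length - 1))
    return payload + luhn_check_digit(payload)


def fake_account_number(rng: random.Random, length: int = 16) -> str:
    # No check digit: only about one in ten of these pass Luhn by chance.
    return "".join(rng.choice("0123456789") for _ in range(length))


rng = random.Random(7)
cards = [fake_card_number(rng) for _ in range(200)]
accounts = [fake_account_number(rng) for _ in range(200)]
card_pass_rate = sum(luhn_ok(c) for c in cards) / len(cards)
account_pass_rate = sum(luhn_ok(a) for a in accounts) / len(accounts)
```

A single value is ambiguous, since a random account number passes the check about 10% of the time, but the pass rate over a whole column separates the two types. This is one example of a column-level signal, which the model can combine with the cross-column and cross-table context described above.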

Scale Arithmetic

Working through concrete numbers:

| Stage | Count | Basis |
|---|---|---|
| FineWeb-Edu passages | ~500M | 1.3T tokens / ~2,600 tokens per passage |
| Ontology fragments | ~1–5 per passage | Domain-dependent entity density |
| Schemas per fragment | ~1–10 | Normalization and naming variation |
| Tables per schema | ~5–50 | Domain complexity |
| Rows per table | ~100–10,000 | Configurable per generation |
| Total table instances | >10 billion | Conservative lower bound |
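The arithmetic can be checked by multiplying the ranges in the table. This is a back-of-the-envelope sketch: the variable names are made up for the example, and the ranges are taken at face value as independent multipliers, which is an assumption.

```python
# Inputs from the table above.
TOTAL_TOKENS = 1_300_000_000_000   # 1.3T tokens
TOKENS_PER_PASSAGE = 2_600

passages = TOTAL_TOKENS // TOKENS_PER_PASSAGE  # 500,000,000

# (low, high) multipliers per stage.
RANGES = {
    "fragments_per_passage": (1, 5),
    "schemas_per_fragment": (1, 10),
    "tables_per_schema": (5, 50),
}


def table_instances(n_passages, multipliers):
    total = n_passages
    for m in multipliers:
        total *= m
    return total


low = table_instances(passages, [lo for lo, _hi in RANGES.values()])
high = table_instances(passages, [hi for _lo, hi in RANGES.values()])

# One combination near the low end that reaches the quoted figure:
# 2 fragments per passage, 2 schemas per fragment, 5 tables per schema.
near_low = table_instances(passages, [2, 2, 5])

print(f"passages: {passages:,}")
print(f"all minima: {low:,}")
print(f"all maxima: {high:,}")
print(f"2 x 2 x 5:  {near_low:,}")
```

Taking the minimum of every range gives 2.5 billion table instances and taking the maximum gives 1.25 trillion. The quoted figure of more than 10 billion is therefore reached once the multipliers sit modestly above their minima, for example two fragments per passage and two schemas per fragment.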

The bottleneck is LLM inference for ontology extraction (Stage 1), not data generation. Once an ontology fragment exists, schema projection and data population are purely procedural and can run on commodity hardware at millions of tables per hour.