Stage 2: Schema Projection
Schema projection transforms BFO-grounded ontology fragments into relational database schemas through a three-step pipeline: first into SysMLv2 systems engineering models, then into programmatic data objects, and finally into SQL schemas. The intermediate SysMLv2 representation captures structural constraints, lifecycle semantics, and system-level relationships that flat entity-relationship modeling would lose.
Why SysMLv2 as Intermediate Representation
Using SysMLv2 (OMG, approved July 2025) as an intermediate representation between ontology and database schema is unconventional – and deliberate. SysMLv2 provides formal constructs that bridge the gap between abstract ontological entities and concrete data structures:
| SysMLv2 Construct | Ontological Concept | Database Primitive |
|---|---|---|
| Block Definition | Entity type | Table |
| Part Property | Composition | One-to-many FK |
| Reference Property | Association | Many-to-many junction table |
| Port | Interface/boundary | Shared column (FK target) |
| Attribute | Data property | Column |
| Constraint | Axiom | CHECK constraint |
| State Machine | Lifecycle | Status enum + temporal columns |
| Requirement | Validation rule | Application-level validation |
The openCAESAR project provides an OWL2-DL ontology for SysMLv2, making the ontology-to-SysMLv2 projection formally well-defined. This means we’re not hand-waving the transformation – there’s a rigorous mapping from BFO-grounded classes and relations to SysMLv2 blocks and connections.
The critical advantage: SysMLv2 models encode systems with internal structure, constraints, and lifecycle semantics. The generated databases aren’t just flat tables with columns – they’re projections of coherent systems where referential integrity, state transitions, and constraint propagation all have formal justification in the source model.
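As a rough illustration of the State Machine row in the table above, the sketch below derives a status CHECK clause from a set of lifecycle states and guards transitions in application code. The `Lifecycle` class, its method names, and the single `active` to `closed` transition are assumptions made for this example; they are not part of any SysMLv2 tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lifecycle:
    """Hypothetical stand-in for a SysMLv2 state machine."""
    column: str
    states: tuple[str, ...]
    transitions: frozenset[tuple[str, str]]

    def check_clause(self) -> str:
        # Render the state set as a SQL CHECK constraint on the status column.
        allowed = ", ".join(f"'{s}'" for s in self.states)
        return f"CHECK ({self.column} IN ({allowed}))"

    def can_move(self, src: str, dst: str) -> bool:
        # A transition is legal only if the source model declares it.
        return (src, dst) in self.transitions


encounter_lifecycle = Lifecycle(
    column="status",
    states=("active", "closed"),
    transitions=frozenset({("active", "closed")}),
)
print(encounter_lifecycle.check_clause())
```

The CHECK clause only restricts which values the column may hold; ordering between states still has to be enforced elsewhere, which is why `can_move` is kept as a separate application-level guard.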
Projection Pipeline
Step 1: Ontology → SysMLv2
Each BFO class maps to a SysMLv2 construct based on its upper-level category:
- BFO:Object → `part def` (a concrete block with owned parts)
- BFO:Process → `action def` with a state machine (lifecycle semantics)
- BFO:Role → `port def` (an interface that objects can fulfill)
- BFO:Quality → `attribute def` (a typed value property)
- BFO:GDC (Generically Dependent Continuant) → `part def` with `subsets informationEntity` (a record or document)
Relations map to SysMLv2 connections:
- Composition (whole-part) → `part` usage within a block
- Association → `ref` usage with multiplicity
- Participation (Object in Process) → `perform` action usage
Axioms map to constraint def blocks with OCL-like expressions.
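The class-mapping rules in this step amount to a lookup keyed on the BFO upper-level category. A minimal sketch, assuming a hypothetical `project_class` helper and example class names; the emitted text reuses the keywords exactly as quoted in the rules and is not validated against the SysMLv2 grammar:

```python
# Hypothetical lookup mirroring the class-mapping rules of this step.
BFO_TO_SYSML = {
    "Object": "part def",
    "Process": "action def",
    "Role": "port def",
    "Quality": "attribute def",
    "GDC": "part def",
}


def project_class(name: str, bfo_category: str) -> str:
    """Return a one-line SysMLv2-style definition stub for an ontology class."""
    keyword = BFO_TO_SYSML[bfo_category]  # KeyError flags an unmapped category
    # GDCs additionally subset the information-entity element.
    suffix = " subsets informationEntity" if bfo_category == "GDC" else ""
    return f"{keyword} {name}{suffix};"


print(project_class("Patient", "Object"))
print(project_class("ClinicalNote", "GDC"))
```

Raising on an unknown category, rather than falling back to a default keyword, keeps every generated definition attributable to an explicit rule.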
Step 2: SysMLv2 → Programmatic Objects
The SysMLv2 model is projected into Python dataclasses via template-based code generation:
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Patient:
    patient_id: str                # from attribute def
    date_of_birth: date            # from attribute def
    gender: str                    # from attribute def
    encounters: list["Encounter"]  # from part usage (1..*)


@dataclass
class Encounter:
    encounter_id: str              # generated primary key
    patient_id: str                # from owning block (FK)
    encounter_date: datetime       # from attribute def
    status: str                    # from state machine states
    provider_id: str               # from ref usage (FK)
    diagnoses: list["Diagnosis"]   # from part usage (1..*)


@dataclass
class Diagnosis:
    diagnosis_id: str              # generated primary key
    encounter_id: str              # from owning block (FK)
    code: str                      # from attribute def
    description: str               # from attribute def
    coded_by: str                  # from ref usage (FK)
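One way to picture the template-based generation is a string template filled from (field, type, provenance) triples. The sketch below uses the standard library's `string.Template`; the `render_dataclass` helper and the triple format are assumptions for illustration, not the pipeline's actual generator.

```python
from dataclasses import fields, is_dataclass
from string import Template

CLASS_TEMPLATE = Template("@dataclass\nclass $name:\n$body\n")


def render_dataclass(name: str, attrs: list[tuple[str, str, str]]) -> str:
    """Render dataclass source from (field, type, provenance) triples."""
    body = "\n".join(f"    {f}: {t}  # from {src}" for f, t, src in attrs)
    return CLASS_TEMPLATE.substitute(name=name, body=body)


source = render_dataclass(
    "Diagnosis",
    [
        ("diagnosis_id", "str", "generated primary key"),
        ("encounter_id", "str", "owning block (FK)"),
        ("code", "str", "attribute def"),
    ],
)

# Executing the generated source confirms that it is valid Python.
namespace: dict = {}
exec("from dataclasses import dataclass\n" + source, namespace)
Diagnosis = namespace["Diagnosis"]
print(source)
```

Carrying the provenance string into a trailing comment keeps the origin of each field visible in the generated code.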
Step 3: Data Objects → Relational Schema
The dataclasses are mapped to SQLAlchemy models and CREATE TABLE statements:
-- provider(provider_id) is referenced below and is assumed to be defined elsewhere.
CREATE TABLE patient (
    patient_id     VARCHAR(36) PRIMARY KEY,
    date_of_birth  DATE NOT NULL,
    gender         VARCHAR(10) NOT NULL
);

CREATE TABLE encounter (
    encounter_id   VARCHAR(36) PRIMARY KEY,
    patient_id     VARCHAR(36) NOT NULL REFERENCES patient(patient_id),
    encounter_date TIMESTAMP NOT NULL,
    status         VARCHAR(20) NOT NULL CHECK (status IN ('active', 'closed')),
    provider_id    VARCHAR(36) NOT NULL REFERENCES provider(provider_id)
);

CREATE TABLE diagnosis (
    diagnosis_id   VARCHAR(36) PRIMARY KEY,
    encounter_id   VARCHAR(36) NOT NULL REFERENCES encounter(encounter_id),
    code           VARCHAR(10) NOT NULL,
    description    TEXT,
    coded_by       VARCHAR(36) REFERENCES provider(provider_id)
);
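The dataclass-to-DDL step can be sketched with the standard library alone, using dataclass field metadata in place of SQLAlchemy models. The `pk` and `fk` metadata keys, the `SQL_TYPES` mapping, and `create_table_sql` are all hypothetical names chosen for this example.

```python
import sqlite3
from dataclasses import dataclass, field, fields

# Assumed Python-to-SQL type mapping; a real generator would cover more types.
SQL_TYPES = {str: "VARCHAR(36)", int: "INTEGER"}


@dataclass
class Diagnosis:
    diagnosis_id: str = field(metadata={"pk": True})
    encounter_id: str = field(metadata={"fk": "encounter(encounter_id)"})
    code: str = field(metadata={})


def create_table_sql(cls) -> str:
    """Build a CREATE TABLE statement from dataclass field metadata."""
    columns = []
    for f in fields(cls):
        parts = [f.name, SQL_TYPES[f.type]]
        # Non-key columns are treated as required in this sketch.
        parts.append("PRIMARY KEY" if f.metadata.get("pk") else "NOT NULL")
        if "fk" in f.metadata:
            parts.append(f"REFERENCES {f.metadata['fk']}")
        columns.append(" ".join(parts))
    body = ",\n    ".join(columns)
    return f"CREATE TABLE {cls.__name__.lower()} (\n    {body}\n);"


ddl = create_table_sql(Diagnosis)

# SQLite accepts the statement even though `encounter` is not created here,
# because it does not resolve foreign-key targets at CREATE TABLE time.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
print(ddl)
```

Running the generated statement against an in-memory SQLite database is a cheap syntax check; other engines may require the referenced tables to exist first.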
Ontological Mapping Rules
The projection preserves ontological structure through systematic rules:
| Ontological Structure | Relational Mapping | Provenance Preserved |
|---|---|---|
| Entity type | Table | Table name ↔ BFO class |
| Data property | Column | Column name ↔ property IRI |
| Object property (1:N) | Foreign key | FK ↔ relation IRI |
| Object property (M:N) | Junction table | Junction ↔ relation IRI |
| Subsumption hierarchy | Table-per-type inheritance | Parent FK ↔ rdfs:subClassOf |
| Disjointness axiom | CHECK constraint | Constraint ↔ axiom |
| Cardinality constraint | NOT NULL / UNIQUE | Column constraint ↔ cardinality |
The critical property is that every schema element traces back to a specific ontological element. This traceability is what makes the training objective possible: when the model predicts that two columns belong to the same data element, we can verify that prediction against the source ontology.
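To make the verification step concrete, the sketch below stores provenance as a map from (table, column) pairs to property IRIs and checks whether two columns share a source. The `ex:` IRIs, the `pt_master.dob` column, and the `same_data_element` helper are invented for illustration.

```python
# Hypothetical provenance map: (table, column) -> source property IRI.
PROVENANCE = {
    ("patient", "date_of_birth"): "ex:dateOfBirth",
    ("pt_master", "dob"): "ex:dateOfBirth",
    ("encounter", "encounter_date"): "ex:encounterDate",
}


def same_data_element(col_a: tuple[str, str], col_b: tuple[str, str]) -> bool:
    """True when both columns were projected from the same ontology property."""
    return PROVENANCE[col_a] == PROVENANCE[col_b]


print(same_data_element(("patient", "date_of_birth"), ("pt_master", "dob")))
print(same_data_element(("patient", "date_of_birth"), ("encounter", "encounter_date")))
```

Because the labels come from the recorded provenance rather than from column names, two columns with no lexical overlap can still be marked as equivalent.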
Schema Variation
A single ontology fragment can produce multiple valid database schemas through controlled variation:
- Normalization level: 1NF, 2NF, 3NF, or fully denormalized
- Inheritance strategy: Table-per-type, table-per-concrete-class, or table-per-hierarchy (a single table with a discriminator column)
- Naming conventions: `snake_case`, `camelCase`, abbreviated, or obfuscated (`col_1`, `field_a`)
- Type mappings: `DATE` vs `VARCHAR` for dates, `INTEGER` vs `VARCHAR` for codes
This variation is essential for training robustness. Real-world databases use all of these conventions, often mixed within a single schema. By generating diverse schemas from the same ontological source, the model learns to recognize semantic equivalence across surface-level variation.
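The naming-convention axis is the easiest to sketch. The helpers below derive surface variants of one canonical column name; the truncation-based abbreviation scheme and the `col_<n>` obfuscation pattern are assumptions, and a real generator would likely offer several of each.

```python
def to_camel(snake: str) -> str:
    # Keep the first word as-is and capitalize the rest.
    head, *rest = snake.split("_")
    return head + "".join(word.capitalize() for word in rest)


def abbreviate(snake: str, keep: int = 3) -> str:
    # Truncate each word; one of many possible abbreviation schemes.
    return "_".join(word[:keep] for word in snake.split("_"))


def naming_variants(snake: str, position: int) -> dict[str, str]:
    """Return surface variants that all denote the same source property."""
    return {
        "snake_case": snake,
        "camelCase": to_camel(snake),
        "abbreviated": abbreviate(snake),
        "obfuscated": f"col_{position}",
    }


print(naming_variants("date_of_birth", 1))
```

All four outputs would share one provenance entry, which is what lets them serve as positive pairs during training.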
Formal Mapping
The schema projection is a function:
\[ \pi: O \mapsto \mathcal{S} \]
where \(O = (C, R, A, \iota)\) is an ontology fragment and \(\mathcal{S} = \{S_1, \ldots, S_k\}\) is a set of valid relational schemas. Each schema \(S_i = (T, K, F, \Gamma)\) consists of:
- \(T = \{t_1, \ldots, t_n\}\) – tables, each with columns \(\text{cols}(t_j)\)
- \(K\) – primary key constraints
- \(F\) – foreign key constraints
- \(\Gamma\) – CHECK constraints
The projection must satisfy:
\[ \forall\, t \in T,\ \exists\, c \in C : \text{name}(t) \xleftarrow{\pi} c \]
\[ \forall\, f \in F,\ \exists\, r \in R : f \xleftarrow{\pi} r \]
That is, every table traces to a class and every foreign key traces to a relation. This traceability from schema back to ontology is the formal guarantee that makes ontological entity recovery a well-defined training objective.
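The two conditions can be checked mechanically on a generated schema. A minimal sketch, assuming the trace is stored as a plain dictionary from schema element to source IRI; the `untraced` helper, the `ex:` IRIs, and the `audit_log` table are hypothetical.

```python
def untraced(elements: set[str], trace: dict[str, str], sources: set[str]) -> set[str]:
    """Return schema elements whose recorded source is missing from the ontology."""
    return {e for e in elements if trace.get(e) not in sources}


classes = {"ex:Patient", "ex:Encounter", "ex:Diagnosis"}
tables = {"patient", "encounter", "diagnosis"}
table_trace = {
    "patient": "ex:Patient",
    "encounter": "ex:Encounter",
    "diagnosis": "ex:Diagnosis",
}

# An empty result means every table traces to a class.
print(untraced(tables, table_trace, classes))

# A table with no recorded source violates the condition.
print(untraced(tables | {"audit_log"}, table_trace, classes))
```

The same function applies to the foreign-key condition by passing the set \(F\) as `elements` and the relations \(R\) as `sources`.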