Architecture
SynthoHive is organized into six modular packages, each handling a distinct concern in the synthetic data pipeline.
Package Overview
- interface: `Synthesizer` facade, `Metadata`, `TableConfig`, and `PrivacyConfig` entry points. See API.
- core: `DataTransformer` for normalization/encoding, and `CTGAN` (Conditional WGAN-GP) for deep generative modeling. See API.
- relational: `StagedOrchestrator` managing the generation DAG, `SchemaGraph` for dependency analysis, and `LinkageModel` for parent-child cardinality learning. See API.
- privacy: `PIISanitizer` with regex-based detection, and `ContextualFaker` for locale-aware obfuscation. See API.
- validation: `ValidationReport` and `StatisticalValidator` measuring KS/TVD metrics. See API.
- connectors: `SparkIO` for scalable I/O and `RelationalSampler` for stratified data sampling. See API.
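
Of the components listed, regex-based PII detection is the easiest to illustrate in isolation. The following is a minimal sketch, not the `PIISanitizer` implementation: the pattern set, the function names `detect_pii` and `mask_pii`, and the `[REDACTED]` placeholder are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; these are NOT the patterns PIISanitizer ships with.
PII_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def detect_pii(value):
    """Return the names of the patterns that match anywhere in the value."""
    return [name for name, pattern in PII_PATTERNS.items() if pattern.search(value)]


def mask_pii(value, placeholder="[REDACTED]"):
    """Replace every matched span with a fixed placeholder (hypothetical default)."""
    for pattern in PII_PATTERNS.values():
        value = pattern.sub(placeholder, value)
    return value


found = detect_pii("contact alice@example.com")
masked = mask_pii("SSN 123-45-6789 on file")
```

A production sanitizer would likely combine such patterns with column-name hints and a faker for realistic replacements; this sketch only shows the detect-then-mask step.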
Module Interaction
flowchart TD
    User["User Code"] --> Synth["Synthesizer (interface)"]
    Synth --> Orch["StagedOrchestrator (relational)"]
    Synth --> Priv["PIISanitizer (privacy)"]
    Orch --> Graph["SchemaGraph (relational)"]
    Orch --> CTGAN["CTGAN (core.models)"]
    Orch --> Link["LinkageModel (relational)"]
    CTGAN --> Trans["DataTransformer (core.data)"]
    Orch --> IO["SparkIO (connectors)"]
    Synth --> Val["ValidationReport (validation)"]
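
To make the orchestrator's two relational dependencies concrete, here is a rough sketch of how a dependency graph can yield a parents-first table order, and how per-parent child counts can be turned into foreign keys. It uses only the Python standard library. The schema, the `assign_foreign_keys` helper, and the random stand-in for a learned cardinality model are hypothetical and do not reflect the actual `SchemaGraph` or `LinkageModel` internals.

```python
import random
from graphlib import TopologicalSorter

# Hypothetical schema: each table maps to the set of parent tables it references.
schema = {
    "customers": set(),
    "orders": {"customers"},
    "order_items": {"orders"},
}

# TopologicalSorter expects node -> predecessors, so parents are emitted first.
generation_order = list(TopologicalSorter(schema).static_order())


def assign_foreign_keys(parent_ids, child_counts):
    """Repeat each parent id once per child that parent should receive."""
    return [pid for pid, n in zip(parent_ids, child_counts) for _ in range(n)]


rng = random.Random(0)
customer_ids = [1, 2, 3]
# Stand-in for a learned cardinality model: draw 0 to 3 orders per customer.
orders_per_customer = [rng.randint(0, 3) for _ in customer_ids]
order_fks = assign_foreign_keys(customer_ids, orders_per_customer)
```

Because every foreign key is copied from an already-generated parent id, each child row in this sketch points at an existing parent by construction.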
Key Flows
- Fit: The `Synthesizer` delegates to `StagedOrchestrator`, which uses `SchemaGraph` to determine topological order. For each table, `DataTransformer` profiles columns, `CTGAN` trains (optionally conditioned on parent context), and `LinkageModel` learns child counts.
- Sample: Generators produce rows in topological order. `LinkageModel` drives child counts, and referential integrity is enforced via FK assignment. Secondary FKs are populated by random sampling from already-generated parent tables.
- Privacy: `PIISanitizer` detects and masks/fakes PII before training. `ContextualFaker` injects locale-aware replacements based on row context (country, region).
- Validation: `StatisticalValidator` computes KS test (numeric), TVD (categorical), and correlation distance (Frobenius norm). `ValidationReport` generates HTML or JSON reports.
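
The three validation metrics named above have standard textbook definitions, sketched here in plain Python. This is an illustration of those definitions only: the function names are hypothetical, and whether `StatisticalValidator` normalizes, thresholds, or reports p-values alongside these statistics is not stated in this page.

```python
import math
from bisect import bisect_right
from collections import Counter


def ks_statistic(real, synth):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    gap = 0.0
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap


def total_variation_distance(real, synth):
    """TVD: half the L1 distance between the two category frequency vectors."""
    p, q = Counter(real), Counter(synth)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p[c] / len(real) - q[c] / len(synth)) for c in categories)


def correlation_distance(corr_real, corr_synth):
    """Frobenius norm of the element-wise difference of two correlation matrices."""
    return math.sqrt(
        sum(
            (a - b) ** 2
            for row_real, row_synth in zip(corr_real, corr_synth)
            for a, b in zip(row_real, row_synth)
        )
    )
```

All three are zero when the real and synthetic inputs agree. The KS statistic and TVD are bounded above by 1, while the Frobenius distance grows with the number of columns.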
Design Principles
- Per-table models: Each table gets its own CTGAN instance rather than one monolithic model. The number of models grows linearly with the number of tables, and each can be tuned independently.
- Conditional generation: Child tables are trained on data joined with parent context columns, so generated children reflect realistic parent-child correlations.
- Pluggable models: The `ConditionalGenerativeModel` abstract base class allows swapping in custom model implementations via the `model_cls` parameter.
- Privacy before training: PII sanitization occurs upstream of model training, ensuring no raw PII enters the generative model.
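
The per-table and pluggable-model principles can be sketched together. Everything below is a hypothetical stand-in: the real `ConditionalGenerativeModel` may declare different methods and signatures, and `BootstrapModel` and `build_models` are invented here purely to show the shape of the pattern.

```python
import random
from abc import ABC, abstractmethod


class ConditionalGenerativeModelSketch(ABC):
    """Hypothetical stand-in; the real base class may declare different methods."""

    @abstractmethod
    def fit(self, rows, context=None):
        """Learn from training rows, optionally conditioned on parent context."""

    @abstractmethod
    def sample(self, n, context=None):
        """Return n generated rows."""


class BootstrapModel(ConditionalGenerativeModelSketch):
    """Toy model that resamples training rows with replacement."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._rows = []

    def fit(self, rows, context=None):
        self._rows = list(rows)

    def sample(self, n, context=None):
        return [self._rng.choice(self._rows) for _ in range(n)]


def build_models(table_names, model_cls):
    """Create one independent model instance per table."""
    return {name: model_cls() for name in table_names}


models = build_models(["customers", "orders"], model_cls=BootstrapModel)
models["customers"].fit([{"id": 1}, {"id": 2}])
rows = models["customers"].sample(3)
```

Passing the class rather than an instance is what lets the caller obtain a fresh, separately tunable model for each table.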
See Data Flow for a stepwise diagram and Guides for hands-on steps.