Changelog
[1.4.0] - 2026-03-27
Security
- Fixed unsalted SHA-256 hashing in PII sanitizer - now uses HMAC-SHA256 with per-instance random salt to prevent rainbow table attacks
- Fixed SQL injection vulnerability in
save_to_hive() - path values are now validated
- Added security warning for
torch.load(weights_only=False) usage
- Fixed
Synthesizer.load() to type-check deserialized objects
Fixed
Core Models (CTGAN)
- CRITICAL: Embedding layers are now included in the generator optimizer - previously they were never updated during training
- CRITICAL: Checkpoint validation no longer crashes when model uses context conditioning
- Fixed optimizer zero_grad ordering to follow standard PyTorch pattern (zero_grad -> forward -> backward -> step)
sample() now restores generator training mode after evaluation
- Discriminator now uses both dimensions from
discriminator_dim tuple
- Added gradient clipping (max_norm=1.0) to both generator and discriminator optimizers
- Replaced
assert statements with proper ValueError raises
- Added input validation for
num_rows in sample()
- CRITICAL:
fit() now properly resets internal state on re-fit, preventing stale data corruption
- CRITICAL: Per-column seeds now use deterministic SHA-256 hash instead of Python's non-deterministic
hash()
- CRITICAL: All-null numeric columns no longer crash BayesianGMM - handled gracefully with constant representation
- Empty DataFrames now raise clear
ValueError instead of cryptic sklearn errors
- BayesianGMM
n_components is now clamped to available sample count
- Unknown categories during transform are handled gracefully instead of crashing
- Added epsilon to VGM normalization to prevent division by zero
- Removed dead
"number" dtype reference in constraint checking
Relational Orchestration
- CRITICAL: Self-referencing foreign keys (e.g.,
manager_id -> employees.id) no longer raise false cycle errors
- Zero child rows now create empty DataFrames instead of silently skipping (preventing downstream crashes)
write_pandas default mode changed from "append" to "overwrite" to prevent data corruption
- Added validation that
parent_context_cols exist in parent table
- PK assignment now uses actual DataFrame length instead of requested row count
validate_schema() is now called at the start of fit_all()
- NegBinom linkage model handles edge cases (zero variance, underdispersion) with Poisson/constant fallbacks
get_table() return values are now null-checked with clear error messages
- Replaced all
print() calls with structured log.info() logging
Privacy & Sanitization
- Fixed all default PII regex patterns to be anchored (preventing false positives on substrings)
- Added PII detection for names, addresses, and dates of birth
- Added column name alias matching (e.g., "mobile", "cell", "tel" all match "phone")
_mask_value custom fallback now correctly applies per-value instead of per-Series
- Null values are now preserved through masking and hashing operations
- PII detection now uses random sampling instead of biased
head(100)
- Added input validation for
pii_map column names
Interface & Connectors
fit() now warns when sampling_strategy parameter is passed but not yet implemented
Synthesizer serialization now excludes SparkSession (via __getstate__/__setstate__)
- Fixed incorrect error message in
generate_validation_report (said "synthetic" when reading "real" data)
- Integer-to-float FK dtype compatibility is now handled correctly
PrivacyConfig.epsilon now validates positive values
Metadata.add_table() now raises SchemaError instead of generic ValueError
RelationalSampler now cascades sampling through full hierarchy (not just one level)
- Fixed ambiguous column references in relational sampling joins (using semi-join)
Exceptions
- Added
GenerationError exception class for synthesis failures
- Added
PrivacyError exception class for sanitization failures
Tests
- Fixed
AssertionError typo in seed regression test
- Replaced
sys.exit(0) with pytest.skip() in retail test
- Fixed tautological null handling assertions
- Added deterministic seeds to validation and observability tests
- Migrated hardcoded file paths to pytest
tmp_path fixtures
1.3.0
Added
- Pluggable model architecture:
Synthesizer and StagedOrchestrator accept a model_cls parameter for custom ConditionalGenerativeModel implementations.
enforce_constraints parameter on CTGAN.sample() to raise ConstraintViolationError when generated data violates min/max constraints.
- Training observability: structured logging events (
training_start, epoch_end, training_complete) with metrics.
- Validation-metric checkpointing: best model saved based on validation metric computed at configurable intervals.
SchemaValidationError with comprehensive FK validation (type mismatches, missing columns, invalid references).
- Typed exception hierarchy:
SynthoHiveError, SchemaError, TrainingError, SerializationError, ConstraintViolationError.
- SQL injection protection in
save_to_hive() via identifier allowlist.
Changed
- Version bump from 1.2.3 to 1.3.0.
LinkageModel default method changed from GaussianMixture to empirical histogram resampler (with optional NegBinom fit).
1.2.3
Fixed
TypeError in DataTransformer when applying numeric constraints (min, max, dtype) to categorical/string columns.
- Added robust type coercion to ensure constraints are applied correctly to transformed data.
1.2.2
Fixed
- CTGAN embedding cardinality to avoid
IndexError when using high-cardinality categorical columns.
- Databricks example returns in-memory DataFrames and cleans timestamps/nulls for safer Arrow/pandas conversion.
Added
- Initial MkDocs site scaffold with guides, demos, and API reference.