Quality Validation¶
How do you know your synthetic data is good? SynthoHive provides a comprehensive ValidationReport to quantify fidelity.
Metrics¶
1. Kolmogorov-Smirnov (KS) Test¶
- Target: Continuous Columns (Float/Int)
- Measure: The KS statistic, i.e. the maximum vertical distance between the empirical cumulative distribution functions (CDFs) of the real and synthetic data.
- Interpretation:
  - 0.0 = Perfect fit (distributions are identical).
  - 1.0 = Totally different.
  - Typically, < 0.1 is considered excellent quality.
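To make the definition concrete, here is a minimal standard-library sketch of the two-sample KS statistic. It illustrates the metric only and is not SynthoHive's implementation; the function name `ks_statistic` and the sample values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(real, synth):
    """Two-sample KS statistic: max |F_real(x) - F_synth(x)| over all x."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synth)
    n, m = len(real_sorted), len(synth_sorted)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")

    max_gap = 0.0
    # Empirical CDFs only change at observed values, so checking those suffices.
    for x in set(real_sorted) | set(synth_sorted):
        cdf_real = bisect_right(real_sorted, x) / n    # share of real values <= x
        cdf_synth = bisect_right(synth_sorted, x) / m  # share of synthetic values <= x
        max_gap = max(max_gap, abs(cdf_real - cdf_synth))
    return max_gap


# Illustrative data: a near-copy of the real sample and a fully shifted one.
real_values = [float(x) for x in range(100)]
close_values = [x + 0.5 for x in real_values]
shifted_values = [x + 1000.0 for x in real_values]

ks_close = ks_statistic(real_values, close_values)
ks_far = ks_statistic(real_values, shifted_values)
print(f"close sample:   {ks_close:.3f}")
print(f"shifted sample: {ks_far:.3f}")
```

The near-copy scores well under the 0.1 guideline, while the shifted sample, whose values never overlap the real ones, scores 1.0.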
2. Total Variation Distance (TVD)¶
- Target: Categorical Columns
- Measure: Half the sum of absolute differences between the category probabilities observed in the real and synthetic data.
- Interpretation:
  - 0.0 = Perfect fit (category frequencies match exactly).
  - 1.0 = Totally different.
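The TVD formula can be sketched in a few lines of standard-library Python. This is an illustration of the definition, not SynthoHive's own code; `total_variation_distance` and the example categories are hypothetical names chosen for the sketch.

```python
from collections import Counter


def total_variation_distance(real, synth):
    """TVD between the category frequencies of two samples (0.0 to 1.0)."""
    if not real or not synth:
        raise ValueError("both samples must be non-empty")
    real_counts = Counter(real)
    synth_counts = Counter(synth)
    # Use the union of categories: one missing on either side has probability 0.
    categories = set(real_counts) | set(synth_counts)
    return 0.5 * sum(
        abs(real_counts[c] / len(real) - synth_counts[c] / len(synth))
        for c in categories
    )


# Illustrative data: plan frequencies 0.6/0.3/0.1 vs. 0.5/0.3/0.2.
real_plans = ["free"] * 6 + ["pro"] * 3 + ["team"] * 1
synth_plans = ["free"] * 5 + ["pro"] * 3 + ["team"] * 2

tvd_close = total_variation_distance(real_plans, synth_plans)
tvd_disjoint = total_variation_distance(["a", "a"], ["b", "b"])
print(f"similar frequencies: {tvd_close:.2f}")
print(f"no shared category:  {tvd_disjoint:.2f}")
```

Here the differences are 0.1 for "free" and 0.1 for "team", so the TVD is half of 0.2, i.e. 0.1. Two samples with no category in common reach the maximum of 1.0.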
3. Correlation Distance¶
- Target: Column Pairs
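- Measure: We compute correlation matrices (Pearson for continuous columns) for both the real and synthetic datasets. The score is the Frobenius norm of the difference matrix, i.e. the square root of the sum of its squared entries (the L2 norm of the matrix flattened into a vector).
- Goal: Measure how well the model captured relationships between columns (e.g., Age vs. Income). Lower is better; 0.0 means the two correlation matrices are identical.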
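The sketch below walks through the computation with the standard library only: build a Pearson correlation matrix for each dataset, subtract them, and take the Frobenius norm. It is an assumption-laden illustration rather than SynthoHive's implementation; the helper names and the toy columns are hypothetical, and constant columns (zero variance) are not handled.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)  # fails on constant columns


def correlation_matrix(columns, names):
    """Square matrix of pairwise correlations, in the order given by names."""
    return [[pearson(columns[a], columns[b]) for b in names] for a in names]


def correlation_distance(real_columns, synth_columns):
    """Frobenius norm of (real correlation matrix - synthetic correlation matrix)."""
    names = list(real_columns)  # same column order for both datasets
    real_corr = correlation_matrix(real_columns, names)
    synth_corr = correlation_matrix(synth_columns, names)
    squared = sum(
        (r - s) ** 2
        for real_row, synth_row in zip(real_corr, synth_corr)
        for r, s in zip(real_row, synth_row)
    )
    return math.sqrt(squared)


# Toy data: income rises with age in the real data but falls in the synthetic data.
real_cols = {"age": [1, 2, 3, 4], "income": [2, 4, 6, 8]}
synth_cols = {"age": [1, 2, 3, 4], "income": [8, 6, 4, 2]}

distance_inverted = correlation_distance(real_cols, synth_cols)
distance_identical = correlation_distance(real_cols, real_cols)
print(f"inverted relationship: {distance_inverted:.3f}")
print(f"identical data:        {distance_identical:.3f}")
```

In the toy data the Age/Income correlation flips from 1 to -1, so both off-diagonal entries of the difference matrix are 2 and the distance is the square root of 8. Note that the score is not bounded by 1.0: it grows with the number of column pairs that disagree.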
Output Formats¶
HTML Report¶
The ValidationReport.generate() method produces a self-contained HTML file containing:
- Column Validation Metrics: KS test and TVD results per column with pass/fail indicators.
- Correlation Distance: Frobenius norm comparing real vs. synthetic correlation matrices.
- Detailed Statistics: Side-by-side descriptive statistics (mean, std, min, max for numeric; unique count, top value for categorical).
- Row Previews: Snippets of raw data to verify formatting.
JSON Report¶
When the output path ends with .json, the report generates a structured JSON file containing all computed metrics, suitable for programmatic consumption or CI/CD pipelines.
Usage¶
HTML Report¶
from syntho_hive.validation.report_generator import ValidationReport

# real_df and synth_df are pandas DataFrames prepared beforehand.
report = ValidationReport()
report.generate(
    real_data={"users": real_df},    # Dict[str, pd.DataFrame]
    synth_data={"users": synth_df},  # Dict[str, pd.DataFrame]
    output_path="report.html",
)
JSON Report¶
# Reuses the report instance created above; the .json extension selects JSON output.
report.generate(
    real_data={"users": real_df},
    synth_data={"users": synth_df},
    output_path="metrics.json",
)
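For CI/CD use, one option is to scan the JSON file for metric values that exceed a threshold. The layout of metrics.json is not documented here, so the sketch below avoids assuming a schema: it flattens any nested JSON into dotted paths and matches on key suffixes. The example payload, the key names `ks` and `tvd`, and the thresholds are all hypothetical.

```python
import json


def flatten_numeric(node, prefix=""):
    """Yield (dotted_path, value) for every number in a nested JSON structure."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten_numeric(value, child)
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from flatten_numeric(value, f"{prefix}[{index}]")
    elif isinstance(node, (int, float)) and not isinstance(node, bool):
        yield prefix, float(node)


def find_violations(metrics, limits):
    """Return (path, value, limit) for each value above the limit of a matching suffix."""
    violations = []
    for path, value in flatten_numeric(metrics):
        for suffix, limit in limits.items():
            if path.endswith(suffix) and value > limit:
                violations.append((path, value, limit))
    return violations


# Hypothetical payload: the real layout of metrics.json may differ.
example_metrics = json.loads(
    '{"users": {"age": {"ks": 0.04}, "plan": {"tvd": 0.31}}}'
)
example_limits = {".ks": 0.1, ".tvd": 0.2}  # illustrative thresholds

violations = find_violations(example_metrics, example_limits)
for path, value, limit in violations:
    print(f"{path}: {value} exceeds {limit}")
```

In a pipeline, the payload would come from `json.load` on the generated file, and the job would exit with a non-zero status whenever `violations` is non-empty. Adjust the suffixes once you have inspected the actual keys in your report.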
Next Steps¶
- Demo 03: Full runnable example generating both HTML and JSON reports.
- Fitting Guide: Tune training parameters to improve validation scores.