Privacy Guardrails¶
Synthetic data is only useful if it effectively masks sensitive information while retaining utility. SynthoHive applies privacy controls before any fitting or training occurs, ensuring that no raw PII ever enters the generative models.
The Sanitize workflow¶
flowchart LR
A[Raw Data] --> B{PII Analyzer}
B -- Regex/NER --> C[Detected Map]
C --> D[Sanitizer Engine]
D -- Rules --> E(Clean Data)
E --> F[Training]
Faker vs. Embeddings¶
It is important to distinguish when to use Faker versus Embeddings:
- Faker (Sanitization): Used when you need to hide the original values and replace them with plausible dummies. This creates new values (names, emails) that never existed in your real dataset. Use this for PII.
- Embeddings (Modeling): Used when you want the model to learn associations between high-cardinality categories (e.g., "ZipCode" correlated with "Income"). The generative model uses embeddings to capture these patterns but typically reproduces values from the existing vocabulary of the training data.
Tip
Use Faker for privacy (hiding identity). Use Embeddings for utility (preserving statistical relationships in categories like Brands, Cities, or Diagnosis Codes).
Detection Strategies¶
We use a combination of heuristics to detect Personal Identifiable Information (PII):
- Column Naming: Checks for keywords like "email", "ssn", "phone", "address", "name", "dob". Alias matching is supported (e.g., "mobile", "cell", and "tel" all match the "phone" rule).
- Pattern Matching: Scans a random sample of data against a library of anchored regular expressions (Email, IPv4, Credit Cards, Social Security Numbers, Names, Addresses, Dates of Birth). Patterns are fully anchored to prevent false positives on substrings.
- Thresholding: A column is flagged only if > 50% of non-null sampled values match a pattern.
Sampling behavior (v1.4.0)
PII detection now uses a random sample of rows instead of the first 100 rows (head(100)). This eliminates bias when data is sorted or partitioned. Null values are preserved through all masking and hashing operations.
Sanitization Actions¶
Once PII is detected (or manually configured), you can apply one of several actions:
| Action | Description | Use Case |
|---|---|---|
drop |
Removes the column entirely. | High-risk fields with zero utility (e.g., internal system IDs). |
mask |
Replaces all but the last 4 characters with *. |
Credit cards, phone numbers where visual format matters. |
hash |
Replaces value with HMAC-SHA256 hash (salted per sanitizer instance). | Maintaining distinctness for joining without revealing value (e.g., User IDs). |
fake |
Replaces with realistic fake data. | Names, Addresses, Emails that need to look "real" for the model. |
keep |
Retains the original value. | Low-sensitivity fields or when data is already sanitized source. |
custom |
Uses a user-provided function. | complex logic requiring row-level context. |
Context-Aware Faking¶
Standard fakers generate random data. Contextual Faking ensures consistency across columns.
How it works¶
The ContextualFaker looks for specific keys in the row data to determine the locale for generation:
* country
* locale
* region
Example: If a record has Country="US", the sanitizer will generate a US-formatted phone number (e.g., +1-555...). If Country="JP", it generates a Japanese number.
Supported locales include: US, UK/GB, JP, DE, FR, CN, IN.
Internal Mechanism¶
Internally, ContextualFaker leverages the faker python library. It optimizes performance by maintaining a cache of Faker instances initiated for each required locale (e.g., ja_JP for Japan, de_DE for Germany).
- Locale Resolution: It scans the row for the context keys.
- Instance Selection: It maps the value (e.g., "JP") to a specific
Fakerlocale identifier (e.g.,ja_JP) using an internalLOCALE_MAP. - Generation: It delegates the generation to that specific locale's provider. For example,
fake.phone_number()forja_JPproduces a valid Japanese format compliant with local regex rules.
If no context is found, it falls back to a default en_US generator.
Configuration¶
You can customize rules via the PrivacyConfig object:
from syntho_hive.privacy.sanitizer import PIISanitizer, PrivacyConfig, PiiRule
config = PrivacyConfig(rules=[
PiiRule(name="custom_code", patterns=[r"^[A-Z]{2}-\d{4}$"], action="mask"),
PiiRule(name="audit_id", patterns=[r"^AUD-\d+"], action="keep")
])
Custom Logic¶
For complex scenarios, you can define a custom generator function that receives the entire row context. This is useful for conditional sanitization or format-preserving encryption.
def redact_sensitive_id(row: dict) -> str:
# Example: Redact differently based on user role from another column
if row.get('role') == 'admin':
return row['user_id'] # Keep visible for admins (in this hypothetical flow)
return f"REDACTED-{hash(row['user_id']) % 1000}"
config = PrivacyConfig(rules=[
PiiRule(
name="custom_logic",
patterns=[r"user_\d+"],
action="custom",
custom_generator=redact_sensitive_id
)
])