Feature Embeddings

SynthoHive provides specialized handling for high-cardinality categorical columns (e.g., User ID, Zip Code, Product ID) using Entity Embeddings. This avoids the column explosion of One-Hot Encoding and allows the model to learn semantic relationships between categories.

When to use Embeddings vs One-Hot?

| Feature | One-Hot Encoding | Entity Embeddings |
| --- | --- | --- |
| Technique | Creates a binary column for every unique category. | Maps each category to a dense vector of floating-point numbers. |
| Cardinality | Low (< 50 unique values). | High (> 50 unique values). |
| Memory Usage | High (sparse but wide). | Low (compact dense vectors). |
| Relationships | Independent: no relationship between 'A' and 'B'. | Learned: similar categories end up close in vector space. |
| Example | Gender, MaritalStatus. | ZipCode, UserID, ICD9_Code. |

How it Works

The transformation pipeline automatically detects high-cardinality columns based on a threshold.

1. Detection

During DataTransformer.fit(), the system checks the number of unique values in each categorical column. If num_unique > embedding_threshold (default: 50), the column is flagged for embedding.
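In essence, detection is a per-column cardinality check. The sketch below is a minimal stand-in for that logic, not SynthoHive's actual implementation; the function name and return format are invented for illustration.

```python
import pandas as pd

def plan_encodings(df: pd.DataFrame, embedding_threshold: int = 50) -> dict:
    """Decide, per categorical column, between one-hot and embedding."""
    plan = {}
    for col in df.select_dtypes(include=["object", "category"]).columns:
        num_unique = df[col].nunique()
        # Columns above the threshold are flagged for entity embeddings
        plan[col] = "embedding" if num_unique > embedding_threshold else "one_hot"
    return plan
```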

2. Transformation

  • One-Hot: converts string "A" -> binary vector [1, 0, 0].
  • Embedding: converts string "A" -> integer index (e.g., 42), which the model later looks up in a learned embedding matrix (see the sketch below).
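Both paths can be reproduced with plain pandas. This is purely illustrative and does not use SynthoHive's internals:

```python
import pandas as pd

colors = pd.Series(["A", "B", "A", "C"])

# One-hot path: one binary column per category ("A" -> [1, 0, 0])
one_hot = pd.get_dummies(colors)

# Embedding path: each category becomes a single integer index
# ("A" -> 0, "B" -> 1, "C" -> 2); the dense vectors themselves
# live inside the model's embedding layer, not in the data.
codes, categories = pd.factorize(colors)
```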

3. Model Training (CTGAN)

Inside the neural network model:

  • Generator: outputs a probability distribution (logits) over all possible categories.
  • Discriminator: feeds the index (for real data) or a probability-weighted vector (for fake data) into a learnable Embedding Layer.
  • Learning: the model learns to place similar entities near each other. For example, if Zip Codes 10001 and 10002 have similar correlations with Income, their embedding vectors will become similar during training.
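The discriminator-side trick can be sketched in a few lines of PyTorch. The sizes and tensors below are placeholders rather than CTGAN's actual architecture; the point is that real indices use a hard lookup while fake distributions use a differentiable weighted average, so gradients can flow back to the generator:

```python
import torch
import torch.nn as nn

num_categories, embed_dim = 1000, 16
embedding = nn.Embedding(num_categories, embed_dim)

# Real data arrives as integer indices: a direct lookup.
real_idx = torch.tensor([42, 7, 913])
real_vecs = embedding(real_idx)              # shape: (3, 16)

# Fake data arrives as a probability distribution over categories
# (softmaxed generator logits): take the probability-weighted
# average of all embedding vectors instead of a hard lookup.
fake_probs = torch.softmax(torch.randn(3, num_categories), dim=-1)
fake_vecs = fake_probs @ embedding.weight    # shape: (3, 16)
```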

Configuration

You can control the threshold for switching to embeddings globally or per-model.

Global Configuration

Set the embedding_threshold when initializing the synthesizer or calling fit.

```python
synth.fit(
    data=df,
    embedding_threshold=100  # Only use embeddings if > 100 unique values
)
```

Lowering this value forces more columns to use embeddings, which saves memory but may reduce precision for small categorical sets. Raising it routes more columns through One-Hot Encoding, which is more precise but memory-intensive.

Use Cases

1. Geographical Data

Zip codes, cities, and state abbreviations often have hundreds of values. Embeddings allow the model to learn valid geographic relationships (e.g., that "NY" and "NJ" are related) rather than treating them as unrelated tokens.
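After training, you can sanity-check this by comparing learned vectors. The sketch below assumes you have already extracted the embedding weight matrix for the column from the trained model; the matrix here is a random stand-in, and the index positions are hypothetical:

```python
import torch
import torch.nn.functional as F

# Stand-in for a learned embedding matrix: one row per category.
weights = torch.randn(500, 16)

# Hypothetical row positions of zip codes "10001" and "10002".
idx_a, idx_b = 0, 1
similarity = F.cosine_similarity(weights[idx_a], weights[idx_b], dim=0)
print(f"cosine similarity: {similarity.item():.3f}")  # closer to 1.0 = more similar
```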

2. ID Columns

While primary keys like UserID are usually excluded, you might have foreign keys or other identifiers, such as ProductCode, that you want to synthesize while preserving their statistical properties.

3. Medical Codes

ICD-9 or CPT codes have thousands of distinct values. Embeddings are essential for synthesizing electronic health records (EHR) effectively.