Connectors
syntho_hive.connectors.spark_io.SparkIO
Utility for reading and writing datasets via Spark and Delta Lake.
Source code in syntho_hive/connectors/spark_io.py
read_dataset
read_dataset(path_or_table: str, format: str = None, **kwargs: Union[str, int, bool, float]) -> DataFrame
Read a dataset from a table name or filesystem path.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `path_or_table` | `str` | Hive table name or filesystem/URI path. | *required* |
| `format` | `str` | Optional explicit format override. | `None` |
| `**kwargs` | `Union[str, int, bool, float]` | Additional Spark read options. | `{}` |
Returns:

| Type | Description |
|---|---|
| `DataFrame` | Spark DataFrame loaded from the specified source. |
Source code in syntho_hive/connectors/spark_io.py
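The docstring does not show the method body, but a typical `read_dataset` must decide whether its argument is a Hive table name or a filesystem path, and pick a format when none is given. Below is a minimal, hypothetical sketch of that dispatch logic; the helper names and the Delta fallback are assumptions for illustration, not the library's actual code:

```python
from typing import Optional

def looks_like_path(source: str) -> bool:
    # Heuristic: URI schemes ("s3://...") and path separators indicate a
    # filesystem location; bare dotted names ("db.table") are Hive tables.
    return "://" in source or "/" in source

def infer_format(path: str, explicit: Optional[str] = None) -> str:
    # An explicit override always wins; otherwise guess from the file
    # extension. Falling back to "delta" is an assumption, plausible only
    # because the class is described as Delta Lake-aware.
    if explicit:
        return explicit
    for ext, fmt in ((".parquet", "parquet"), (".csv", "csv"), (".json", "json")):
        if path.endswith(ext):
            return fmt
    return "delta"
```

With helpers like these, the method can route table names to `spark.read.table(...)` and paths to `spark.read.format(fmt).options(**kwargs).load(path)`.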
write_dataset
write_dataset(df: DataFrame, target_path: str, mode: str = 'overwrite', partition_by: Optional[str] = None, format: str = 'parquet')
Write a Spark DataFrame to storage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Spark DataFrame to persist. | *required* |
| `target_path` | `str` | Output path (directory or table location). | *required* |
| `mode` | `str` | Save mode. | `'overwrite'` |
| `partition_by` | `Optional[str]` | Optional column name to partition by. | `None` |
| `format` | `str` | Output format. | `'parquet'` |
Source code in syntho_hive/connectors/spark_io.py
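These parameters map directly onto Spark's `DataFrameWriter` chain. The sketch below is a hypothetical reconstruction that renders the chain as a string so the mapping is easy to see; the real method body is not shown in this documentation:

```python
from typing import Optional

def build_writer_chain(mode: str = "overwrite",
                       partition_by: Optional[str] = None,
                       format: str = "parquet") -> str:
    # Assemble the DataFrameWriter calls write_dataset would issue:
    # df.write.format(...).mode(...)[.partitionBy(...)].save(target_path)
    steps = [f"df.write.format({format!r})", f"mode({mode!r})"]
    if partition_by is not None:
        steps.append(f"partitionBy({partition_by!r})")
    steps.append("save(target_path)")
    return ".".join(steps)
```

For example, `build_writer_chain(partition_by="country")` shows that the optional `partition_by` column simply inserts a `partitionBy` step before `save`.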
write_pandas
write_pandas(pdf: DataFrame, target_path: str, mode: str = 'overwrite', format: str = 'parquet')
Write a Pandas DataFrame using Spark-backed persistence.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `pdf` | `DataFrame` | Pandas DataFrame to persist. | *required* |
| `target_path` | `str` | Output path for the written dataset. | *required* |
| `mode` | `str` | Save mode for the Spark writer. | `'overwrite'` |
| `format` | `str` | Storage format. | `'parquet'` |
Source code in syntho_hive/connectors/spark_io.py
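"Spark-backed persistence" suggests `write_pandas` converts the Pandas frame to a Spark DataFrame and delegates to `write_dataset`. A testable stand-in for that delegation pattern is sketched below; the `to_spark` and `writer` injection points are this sketch's invention (so it can run without a SparkSession), not part of the library API:

```python
def write_pandas_sketch(pdf, target_path, mode="overwrite", format="parquet",
                        *, to_spark, writer):
    # to_spark: e.g. spark.createDataFrame; writer: e.g. SparkIO.write_dataset.
    # Convert once, then reuse the Spark write path for persistence.
    sdf = to_spark(pdf)
    writer(sdf, target_path, mode=mode, format=format)
```

The design point this illustrates: keeping a single Spark write path means Pandas inputs inherit the same save modes and formats as Spark DataFrames.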
syntho_hive.connectors.sampling.RelationalSampler
Relational stratified sampler for parent-child table hierarchies.
Source code in syntho_hive/connectors/sampling.py
sample_relational
sample_relational(root_table: str, sample_size: int, stratify_by: Optional[str] = None) -> Dict[str, DataFrame]
Sample a root table and cascade the sample to child tables.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `root_table` | `str` | Name of the parent/root table to sample. | *required* |
| `sample_size` | `int` | Approximate number of rows to retain from the root. | *required* |
| `stratify_by` | `Optional[str]` | Optional column for stratified sampling. | `None` |
Returns:

| Type | Description |
|---|---|
| `Dict[str, DataFrame]` | Dictionary mapping table name to sampled Spark DataFrame. |
Source code in syntho_hive/connectors/sampling.py
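The cascade idea behind `sample_relational` can be shown without Spark: sample the root table, then keep only child rows whose foreign key points at a sampled root row. The plain-Python sketch below is illustrative only; the `id`/`parent_id` column names and single `fk` argument are assumptions, and `stratify_by` is omitted (simple random sampling stands in for stratification):

```python
import random

def sample_relational_sketch(tables, root_table, sample_size,
                             fk="parent_id", seed=0):
    # tables: dict mapping table name -> list of row dicts (stand-in for
    # Spark DataFrames). Sample the root, then filter every child table
    # down to rows referencing a sampled root id, preserving referential
    # integrity across the hierarchy.
    rng = random.Random(seed)
    root_rows = tables[root_table]
    sampled = rng.sample(root_rows, min(sample_size, len(root_rows)))
    kept_ids = {row["id"] for row in sampled}
    result = {root_table: sampled}
    for name, rows in tables.items():
        if name != root_table:
            result[name] = [r for r in rows if r[fk] in kept_ids]
    return result
```

The key property, which the Spark version would share, is that no returned child row is orphaned: every foreign key in the output resolves to a sampled parent.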