Snowflake recently introduced a game-changing feature: Synthetic Data Generation. Having worked across various organizations, I can confidently say this is a much-needed solution that many teams have been eagerly awaiting.
Why Does Synthetic Data Matter?
As a data engineer, testing datasets effectively is crucial, whether for unit testing, performance testing, or other validation activities. However, the reality of development environments is often far from ideal. Typically, the data available for testing is either dummy or incomplete, making it difficult to simulate production-like scenarios accurately.
Even with Snowflake’s advanced cloning capabilities — which allow you to replicate data with minimal additional cost — there are challenges. For instance, due to data classification policies, sensitive production data cannot always be cloned into the development environment. Security and Data Governance restrictions often prevent teams from obtaining the necessary approvals to use real data in non-production settings.
Enter Synthetic Data Generation:
This is where Snowflake’s synthetic data generator comes to the rescue. It allows you to create realistic, production-like test data within the development environment, without any of the risks associated with using sensitive or restricted data.
How Does This Help?
With synthetic data, you can:
- Simulate production scenarios for robust testing.
- Avoid dependency on security and governance approvals.
- Ensure compliance with data privacy and classification policies.
- Save time and effort by generating accurate test data directly within Snowflake.
When to Use It? This feature is perfect for scenarios where production data is off-limits but high-quality test data is essential. For instance, you might need realistic data to test the performance of a query or validate the output of a new pipeline — all without compromising sensitive information.
Advantages of Snowflake’s Synthetic Data Generator
Traditionally, creating synthetic data required using programming languages and tools that demanded significant programming knowledge, a compatible platform, and a considerable amount of time. Despite all this effort, the generated output often fell short of mimicking real production data, leading to challenges for developers.
With Snowflake’s Synthetic Data Generator, those limitations are a thing of the past.
What is Synthetic Data in Snowflake?
Synthetic data in Snowflake is a true substitute for real data. This feature takes real data as input and produces synthetic data as output. The generated synthetic data retains the statistical properties and patterns of the original data while ensuring that no sensitive or classified information is exposed.
Is it Complex to Use in Snowflake?
Not at all! Snowflake has designed this feature to be intuitive and straightforward. Unlike traditional methods, you don’t need advanced programming skills or external tools to generate synthetic data. The process integrates seamlessly within the Snowflake platform, making it accessible to both data engineers and analysts.
By simplifying the creation of realistic test data, Snowflake ensures that developers can focus on what matters most — building and testing robust data solutions — without the usual hurdles.
Snowflake’s Synthetic Data Generator is powered by a built-in stored procedure that users can simply call to generate synthetic data by passing the input table and a few parameters. This eliminates the need for complex setups or external tools, making the process straightforward and efficient.
Syntax:
SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA(
{
'datasets': [
{
'input_table': '<input_table_name>',
'output_table' : '<output_table_name>',
[ 'columns': { '<column_name>': {join_key:[TRUE | FALSE]} [, '<column_name>': {join_key:[TRUE | FALSE]} , ... ]]
}
[ ,
'input_table': '<input_table_name>',
'output_table' : '<output_table_name>',
[ 'columns': { '<column_name>': {join_key:[TRUE | FALSE]} [, '<column_name>': {join_key:[TRUE | FALSE]} , ... ]]
, ... ]
]
[ , 'privacy_filter': <boolean> ]
[ , 'replace_output_tables': <boolean> ]
}
)
When working with Snowflake’s Synthetic Data Generator, the procedure (SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA) accepts the following key input fields:
- Input Tables: Specifies the source tables containing the real data to be synthesized.
- Output Tables: Specifies the name of the output table where the generated synthetic data will be stored.
- Join Keys: These columns are uniquely encoded in the synthetic data to preserve referential integrity. This ensures that relationships between tables are maintained in the synthetic dataset, making it usable for joins and related operations.
- Privacy Filter: A configuration option that adds an additional layer of privacy. It removes rows from the output table if they are too similar to the original input dataset, helping to mitigate potential privacy risks.
- Replace Output Tables: If the output table already exists, set 'replace_output_tables': True to overwrite it with the new synthetic data. If this option is not set, the procedure raises an error when the table already exists.
This flexibility in configuration allows users to generate synthetic data tailored to their specific testing and development needs while ensuring security and consistency.
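To illustrate how these options combine, here is a sketch of a call covering two hypothetical related tables (CUSTOMERS and ORDERS, sharing a CUSTOMER_ID column) with the privacy filter enabled. The database, schema, and table names here are assumptions for the example, not from an actual environment:

```sql
-- Sketch: CUSTOMERS and ORDERS are hypothetical tables that share a
-- CUSTOMER_ID column. Marking CUSTOMER_ID as a join key in both datasets
-- keeps the synthetic tables joinable on that column, while
-- 'privacy_filter' drops synthetic rows that are too similar to the input.
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
  'datasets': [
    {
      'input_table': 'MY_DB.MY_SCHEMA.CUSTOMERS',
      'output_table': 'MY_DB.MY_SCHEMA.CUSTOMERS_SYNTHETIC',
      'columns': {'CUSTOMER_ID': {'join_key': True}}
    },
    {
      'input_table': 'MY_DB.MY_SCHEMA.ORDERS',
      'output_table': 'MY_DB.MY_SCHEMA.ORDERS_SYNTHETIC',
      'columns': {'CUSTOMER_ID': {'join_key': True}}
    }
  ],
  'privacy_filter': True,
  'replace_output_tables': True
});
```

Because CUSTOMER_ID is marked as a join key in both datasets, the two synthetic output tables can still be joined to each other on that column, even though the encoded values differ from the originals.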
Example 1: Simple Scenario for Generating Synthetic Data.
Let’s dive into a straightforward example to demonstrate how to generate synthetic data using Snowflake’s stored procedure. This example will walk you through the execution script and explain its components.
Table 1: CUSTOMER_INFO
-- TABLE 1: CUSTOMER INFORMATION
CREATE OR REPLACE DATABASE SYNTHETIC_DATA_DB;
CREATE OR REPLACE SCHEMA SYNTHETIC_DATA_SCH;
CREATE OR REPLACE TABLE CUSTOMER_INFO (
CUSTOMER_ID INTEGER,
FIRST_NAME STRING,
LAST_NAME STRING,
AGE INTEGER,
IS_ACTIVE BOOLEAN,
JOIN_DATE DATE
);
INSERT INTO CUSTOMER_INFO (CUSTOMER_ID, FIRST_NAME, LAST_NAME, AGE, IS_ACTIVE, JOIN_DATE) VALUES
(1, 'JOHN', 'DOE', 32, TRUE, '2022-01-10'),
(2, 'JANE', 'SMITH', 28, FALSE, '2023-04-20'),
(3, 'EMILY', 'JOHNSON', 45, TRUE, '2021-05-15'),
(4, 'MICHAEL', 'WILLIAMS', 29, FALSE, '2023-07-18'),
(5, 'SARAH', 'BROWN', 37, TRUE, '2022-09-11'),
(6, 'DAVID', 'JONES', 41, TRUE, '2022-03-25'),
(7, 'LAURA', 'GARCIA', 30, FALSE, '2023-11-14'),
(8, 'ROBERT', 'MARTINEZ', 34, TRUE, '2022-10-06'),
(9, 'LINDA', 'RODRIGUEZ', 47, TRUE, '2023-12-22'),
(10, 'JAMES', 'WILSON', 52, FALSE, '2021-08-19'),
(11, 'LISA', 'ANDERSON', 27, TRUE, '2023-02-12'),
(12, 'JOSEPH', 'TAYLOR', 43, TRUE, '2021-06-21'),
(13, 'PATRICIA', 'THOMAS', 36, FALSE, '2024-04-28'),
(14, 'CHRISTOPHER', 'HERNANDEZ', 39, TRUE, '2023-03-02'),
(15, 'BARBARA', 'MOORE', 33, TRUE, '2024-01-15'),
(16, 'KEVIN', 'JACKSON', 46, FALSE, '2022-07-09'),
(17, 'SUSAN', 'WHITE', 42, TRUE, '2023-05-24'),
(18, 'DANIEL', 'HARRIS', 35, FALSE, '2022-08-30'),
(19, 'ELIZABETH', 'MARTIN', 40, TRUE, '2023-10-11'),
(20, 'MATTHEW', 'BROWN', 36, TRUE, '2024-07-30');
Snowflake Generate_Synthetic_Data Procedure:
CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO',
'output_table': 'SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO_SYNTHETIC_COMBINED_TABLES',
'columns': {'CUSTOMER_ID': {'join_key': True}}
}
],
'replace_output_tables':True
});
In the example above:
- Input Table: CUSTOMER_INFO
- Output Table: CUSTOMER_INFO_SYNTHETIC_COMBINED_TABLES
Here, I am using CUSTOMER_ID as the join column to maintain referential integrity and enable join operations between the synthetic data and other related tables.
Once I execute the above procedure, the output will contain the following columns:
- Created_on
- Table_name
- Table_Schema
- Table_Database
- Columns
- Source_Table_Name
- Source_Table_Schema
- Source_Table_Database
- Source_Columns
- Metric_Type
- Metric_Value
The key detail in the output is the Metric_Value, which is calculated as the absolute difference between the correlation coefficient of two non-join columns in the source table and the correlation coefficient of the same two columns in the generated data.
You can analyze this information by reviewing the Columns and Source_Columns fields in the output to ensure that the synthetic data closely mimics the statistical relationships present in the source data. The correlation_coefficient_difference is the metric Snowflake's synthetic data generation process reports to evaluate how well the statistical correlations between columns in the input dataset are preserved in the generated synthetic dataset.
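As a sanity check, you can compute the correlation of a numeric column pair in both the source and synthetic tables yourself and compare the result with the reported metric. The sketch below uses the AGE and JOIN_DATE columns from Example 1 (the column pair is chosen purely for illustration, and JOIN_DATE is converted to days so that CORR can operate on two numeric inputs):

```sql
-- Compare the AGE/JOIN_DATE correlation in the source table with the
-- same pair in the synthetic table; the absolute difference should be
-- in line with the reported metric_value for this column pair.
SELECT
  (SELECT CORR(AGE, DATEDIFF('day', '1970-01-01', JOIN_DATE))
     FROM SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO) AS source_corr,
  (SELECT CORR(AGE, DATEDIFF('day', '1970-01-01', JOIN_DATE))
     FROM SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO_SYNTHETIC_COMBINED_TABLES) AS synthetic_corr,
  ABS(source_corr - synthetic_corr) AS corr_difference;
```

A small corr_difference indicates that the relationship between the two columns has been preserved in the synthetic data.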
What is the correlation_coefficient_difference?
- Purpose: It measures the difference in correlation coefficients between column pairs in the input data and the synthetic data.
- Why It’s Important: Preserving similar correlations helps the synthetic dataset retain the column relationships present in the original dataset, making it more realistic and meaningful for analysis.
Output: CUSTOMER_INFO_SYNTHETIC_COMBINED_TABLES
Customer ID — Referential Integrity Observations
Key points to note:
- Output Table Row Count: The output table contains the same number of rows as the input table. For example, if the input table has 20 records, the output table will also contain 20 records. However, rows with null values are excluded from the synthetic data output. This is the current behavior.
- Redacted Columns: Certain columns are marked as redacted in the synthetic data because generating synthetic values for them is not yet supported. While this may be inconvenient for data engineers in real-world scenarios, it is a current limitation of the feature.
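A couple of quick checks can confirm these observations against the tables from Example 1. The sketch below compares row counts and counts the distinct join-key values in the synthetic output (remember that join-key values are uniquely encoded, so they will not match the original CUSTOMER_ID values directly):

```sql
-- 1) Row counts should match, minus any rows dropped for nulls.
-- 2) The join key should still have one distinct value per row,
--    confirming it remains usable for joins between synthetic tables.
SELECT
  (SELECT COUNT(*)
     FROM SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO) AS source_rows,
  (SELECT COUNT(*)
     FROM SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO_SYNTHETIC_COMBINED_TABLES) AS synthetic_rows,
  (SELECT COUNT(DISTINCT CUSTOMER_ID)
     FROM SYNTHETIC_DATA_DB.SYNTHETIC_DATA_SCH.CUSTOMER_INFO_SYNTHETIC_COMBINED_TABLES) AS distinct_join_keys;
```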
What’s Next?
In the next part, we will explore more advanced features of Snowflake’s Synthetic Data Generator.
If you have encountered any other scenarios related to the synthetic data generator that you found useful or challenging, please feel free to share them in the comments section. If you want to discuss Snowflake or have any questions, don’t hesitate to connect with me on LinkedIn.
About me:
I am a Cloud Data Architect with experience as a Senior Consultant at EY New Zealand. Throughout my career, I have worked on numerous projects involving legacy data warehouses, big data implementations, cloud platforms, and migrations. If you require assistance with certification, data solutions, or implementations, please feel free to connect with me on LinkedIn.