Sitemap

Unlocking the Power of Snowflake’s Synthetic Data Generator — 2 & Final

9 min readDec 8, 2024

--

Thank you all for sharing my LinkedIn post and for the messages you sent about Part 1 of the Synthetic Data Generator. I am truly grateful for the support and humbled by the positive response to my first post.

In this blog, I will tackle some of the most common questions about the Snowflake Synthetic Data Generator that developers have shared with me via LinkedIn messages and Medium. Let’s dive into them one by one!

Question 1: How many tables can I generate synthetic data for at once, and how does the syntax work?

According to Snowflake’s documentation, you can specify up to 5 input tables in a single call to the SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA stored procedure. This allows you to generate synthetic data for multiple tables simultaneously.

To demonstrate, I will use 3 tables as an example. It’s important to note that each table must meet the minimum requirement of having at least 20 distinct rows. Therefore, in the following examples, all three tables will meet this criterion.

Note: The synthetic data generation process operates on individual tables, not on joined datasets. If you need synthetic data for tables based on a join condition, my suggestion would be to create a view combining the input tables with the required join logic. You can then generate synthetic data from that view.

Fig 1 — Input Vs Output Tables

Setup: Creating Sample Tables:

CREATE OR REPLACE TABLE CUSTOMERS (
CUSTOMERID INT PRIMARY KEY,
FIRSTNAME VARCHAR(50),
LASTNAME VARCHAR(50),
EMAIL VARCHAR(100),
PHONE VARCHAR(15),
ADDRESS VARCHAR(200),
CITY VARCHAR(50),
COUNTRY VARCHAR(50)
);

INSERT INTO CUSTOMERS VALUES (2001, 'JOHN', 'DOE', 'JOHN.DOE@GMAIL.COM', '555-1234', '123 ELM STREET', 'WELLINGTON', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2002, 'JANE', 'SMITH', 'JANE.SMITH@GMAIL.COM', '555-2345', '456 OAK STREET', 'CHRISTCHURCH', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2003, 'ALICE', 'JOHNSON', 'ALICE.J@GMAIL.COM', '555-3456', '789 PINE AVENUE', 'AUCKLAND', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2004, 'BOB', 'BROWN', 'BOB.BROWN@YAHOO.COM', '555-4567', '101 MAPLE LANE', 'HAMILTON', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2005, 'TOM', 'WHITE', 'TOM.WHITE@HOTMAIL.COM', '555-5678', '202 CEDAR DRIVE', 'TAURANGA', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2006, 'SARAH', 'GREEN', 'SARAH.G@YAHOO.COM', '555-6789', '303 BIRCH BLVD', 'NAPIER', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2007, 'EMILY', 'CLARK', 'EMILY.CLARK@GMAIL.COM', '555-7890', '404 WALNUT ROAD', 'DUNEDIN', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2008, 'MICHAEL', 'ADAMS', 'M.ADAMS@YAHOO.COM', '555-8901', '505 ASPEN WAY', 'ROTORUA', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2009, 'LAURA', 'WILSON', 'LAURA.W@GMAIL.COM', '555-9012', '606 REDWOOD LANE', 'QUEENSTOWN', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2010, 'DANIEL', 'LEE', 'DANIEL.LEE@HOTMAIL.COM', '555-1235', '707 FIR STREET', 'NELSON', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2011, 'SOPHIA', 'HALL', 'SOPHIA.HALL@GMAIL.COM', '555-2346', '808 CYPRESS AVENUE', 'PALMERSTON NORTH', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2012, 'JAMES', 'SCOTT', 'JAMES.SCOTT@GMAIL.COM', '555-3457', '909 SPRUCE BLVD', 'NEW PLYMOUTH', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2013, 'OLIVIA', 'WRIGHT', 'OLIVIA.WRIGHT@YAHOO.COM', '555-4568', '1010 WILLOW LANE', 'INVERCARGILL', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2014, 'ETHAN', 'HILL', 'ETHAN.HILL@GMAIL.COM', '555-5679', '1111 SYCAMORE WAY', 'WHANGAREI', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2015, 'AVA', 'YOUNG', 'AVA.YOUNG@GMAIL.COM', '555-6780', '1212 MAGNOLIA DRIVE', 'GISBORNE', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2016, 'NOAH', 'KING', 'NOAH.KING@HOTMAIL.COM', '555-7891', '1313 CHESTNUT ROAD', 'WANGANUI', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2017, 'ISABELLA', 'MOORE', 'ISABELLA.MOORE@GMAIL.COM', '555-8902', '1414 DOGWOOD BLVD', 'TIMARU', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2018, 'LIAM', 'ALLEN', 'LIAM.ALLEN@YAHOO.COM', '555-9013', '1515 ALDER AVENUE', 'BLENHEIM', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2019, 'CHARLOTTE', 'MARTIN', 'CHARLOTTE.MARTIN@GMAIL.COM', '555-1236', '1616 POPLAR DRIVE', 'HASTINGS', 'NEW ZEALAND');
INSERT INTO CUSTOMERS VALUES (2020, 'MASON', 'TAYLOR', 'MASON.TAYLOR@GMAIL.COM', '555-2347', '1717 ELM AVENUE', 'TAUPO', 'NEW ZEALAND');


-- TABLE 2: PRODUCTS

CREATE OR REPLACE TABLE PRODUCTS (
PRODUCTID INT PRIMARY KEY,
PRODUCTNAME VARCHAR(100),
CATEGORY VARCHAR(50),
PRICE DECIMAL(10, 2),
STOCK INT
);

INSERT INTO PRODUCTS VALUES (3001, 'SMARTPHONE', 'ELECTRONICS', 699.99, 50);
INSERT INTO PRODUCTS VALUES (3002, 'LAPTOP', 'ELECTRONICS', 999.99, 30);
INSERT INTO PRODUCTS VALUES (3003, 'WASHING MACHINE', 'HOME APPLIANCES', 499.99, 20);
INSERT INTO PRODUCTS VALUES (3004, 'REFRIGERATOR', 'HOME APPLIANCES', 799.99, 15);
INSERT INTO PRODUCTS VALUES (3005, 'JEANS', 'CLOTHING', 39.99, 100);
INSERT INTO PRODUCTS VALUES (3006, 'T-SHIRT', 'CLOTHING', 19.99, 200);
INSERT INTO PRODUCTS VALUES (3007, 'SNEAKERS', 'FOOTWEAR', 59.99, 80);
INSERT INTO PRODUCTS VALUES (3008, 'FORMAL SHOES', 'FOOTWEAR', 89.99, 70);
INSERT INTO PRODUCTS VALUES (3009, 'COOKWARE SET', 'KITCHEN', 99.99, 40);
INSERT INTO PRODUCTS VALUES (3010, 'COFFEE MAKER', 'KITCHEN', 79.99, 60);
INSERT INTO PRODUCTS VALUES (3011, 'VACUUM CLEANER', 'HOME APPLIANCES', 149.99, 25);
INSERT INTO PRODUCTS VALUES (3012, 'MICROWAVE OVEN', 'HOME APPLIANCES', 199.99, 18);
INSERT INTO PRODUCTS VALUES (3013, 'LED TV', 'ELECTRONICS', 399.99, 12);
INSERT INTO PRODUCTS VALUES (3014, 'TABLET', 'ELECTRONICS', 299.99, 20);
INSERT INTO PRODUCTS VALUES (3015, 'BLENDER', 'KITCHEN', 49.99, 45);
INSERT INTO PRODUCTS VALUES (3016, 'TOASTER', 'KITCHEN', 29.99, 70);
INSERT INTO PRODUCTS VALUES (3017, 'SOFA', 'FURNITURE', 599.99, 10);
INSERT INTO PRODUCTS VALUES (3018, 'BED FRAME', 'FURNITURE', 699.99, 8);
INSERT INTO PRODUCTS VALUES (3019, 'WARDROBE', 'FURNITURE', 899.99, 5);
INSERT INTO PRODUCTS VALUES (3020, 'DESK LAMP', 'HOME DECOR', 24.99, 150);

-- TABLE 3:

CREATE OR REPLACE TABLE SALES (
SALEID INT PRIMARY KEY,
CUSTOMERID INT,
PRODUCTID INT,
QUANTITY INT,
SALEDATE DATE,
TOTALAMOUNT DECIMAL(10, 2)
);

INSERT INTO SALES VALUES (1001, 2001, 3001, 2, '2024-11-01', 1399.98);
INSERT INTO SALES VALUES (1002, 2002, 3005, 3, '2024-11-02', 119.97);
INSERT INTO SALES VALUES (1003, 2003, 3002, 1, '2024-11-03', 999.99);
INSERT INTO SALES VALUES (1004, 2004, 3010, 2, '2024-11-04', 159.98);
INSERT INTO SALES VALUES (1005, 2005, 3004, 1, '2024-11-05', 799.99);
INSERT INTO SALES VALUES (1006, 2006, 3006, 5, '2024-11-06', 99.95);
INSERT INTO SALES VALUES (1007, 2007, 3007, 2, '2024-11-07', 119.98);
INSERT INTO SALES VALUES (1008, 2008, 3012, 1, '2024-11-08', 199.99);
INSERT INTO SALES VALUES (1009, 2009, 3009, 3, '2024-11-09', 299.97);
INSERT INTO SALES VALUES (1010, 2010, 3015, 4, '2024-11-10', 199.96);
INSERT INTO SALES VALUES (1011, 2011, 3001, 1, '2024-11-11', 699.99);
INSERT INTO SALES VALUES (1012, 2012, 3003, 1, '2024-11-12', 499.99);
INSERT INTO SALES VALUES (1013, 2013, 3018, 1, '2024-11-13', 699.99);
INSERT INTO SALES VALUES (1014, 2014, 3020, 2, '2024-11-14', 49.98);
INSERT INTO SALES VALUES (1015, 2015, 3008, 1, '2024-11-15', 89.99);
INSERT INTO SALES VALUES (1016, 2016, 3013, 1, '2024-11-16', 399.99);
INSERT INTO SALES VALUES (1017, 2017, 3014, 1, '2024-11-17', 299.99);
INSERT INTO SALES VALUES (1018, 2018, 3002, 1, '2024-11-18', 999.99);
INSERT INTO SALES VALUES (1019, 2019, 3007, 1, '2024-11-19', 59.99);
INSERT INTO SALES VALUES (1020, 2020, 3016, 2, '2024-11-20', 59.98);

Table Screenshot (Customers, Sales and Product):

Fig 2 — Customers, Sales and Product table screenshot

GENERATE_SYNTHETIC_DATA Stored Procedure:

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.CUSTOMERS',
'output_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.CUSTOMERS_SYN_DATA',
'columns': {'CUSTOMERID': {'join_key': True}}
},
{
'input_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.PRODUCTS',
'output_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.PRODUCTS_SYN_DATA',
'columns' : {'PRODUCTID': {'join_key': True}}

},
{
'input_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.SALES',
'output_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.SALES_SYN_DATA',
'columns' : {'SALEID': {'join_key': True}}

}
],
'replace_output_tables':True
});

Output:

Fig 3 — GENERATE_SYNTHETIC_DATA Stored Procedure Output

Here’s an example of what the output synthetic tables might look like after generating synthetic data using Snowflake’s Synthetic Data Generator:

Output Table:

  • CUSTOMERS_SYN_DATA
  • PRODUCTS_SYN_DATA
  • SALES_SYN_DATA
Fig 4 — Synthetic data generated using (Customers, Sales and Product) table

Question 2: Can we generate synthetic tables only with tables?

Answer: No, we can generate synthetic tables not only from tables but also from views. This is particularly useful when you want to generate synthetic data for a dataset that involves joins or other transformations. Instead of directly generating synthetic data for multiple tables, you can create a view combining the required tables and then generate synthetic data from that view.

Example: Generating Synthetic Data Using a View

Here’s a demonstration of creating synthetic data from a view:

  1. Create a View
CREATE OR REPLACE VIEW CUSTOMER_SALES_PRODUCT_VIEW AS
SELECT
C.CUSTOMERID,
C.FIRSTNAME,
C.LASTNAME,
C.EMAIL,
C.PHONE,
C.ADDRESS,
C.CITY,
C.COUNTRY,
S.SALEID,
S.SALEDATE,
S.QUANTITY,
S.TOTALAMOUNT,
P.PRODUCTID,
P.PRODUCTNAME,
P.CATEGORY,
P.PRICE
FROM
SALES S
JOIN
CUSTOMERS C ON S.CUSTOMERID = C.CUSTOMERID
JOIN
PRODUCTS P ON S.PRODUCTID = P.PRODUCTID;

2. Generate Synthetic Data from the View

Once the view is created, use the Snowflake synthetic data generator to generate a synthetic table based on the view.

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.CUSTOMER_SALES_PRODUCT_VIEW',
'output_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.CUSTOMER_SALES_PRODUCT_VIEW_SYN_DATA',
'columns': {'CUSTOMERID': {'join_key': True}}
}
],
'privacy_filter': True,
'replace_output_tables':True
});

3. Result : The output table CUSTOMER_SALES_PRODUCT_VIEW_SYN_DATA will now contain synthetic data derived from the view, ensuring that all joins and transformations in the view are preserved in the synthetic dataset.

Fig 5 — CUSTOMER_SALES_PRODUCT_VIEW_SYN_DATA

Question 3: How much cost is associated with the Synthetic Data Generator?

Answer: The cost of using the Synthetic Data Generator in Snowflake is directly tied to the warehouse size and the duration of the computation for each run. In other words, the only cost incurred is the standard compute cost for running the warehouse during the synthetic data generation process.

Example: Cost of Generating 2 Million Rows of Synthetic Data

Let’s break it down:

  1. Input Table: For this example, we’re using the table SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER, which contains 100000000 rows.
    Note: Synthetic data can be generated for a maximum of 2.3 million rows per input table.
  2. Warehouse Details: Suppose we are using a Large Warehouse (L), which costs approximately 8 credits per hour. The total cost depends on how long the synthetic data generation process runs.
Fig 6 — Number of record in Customer table present in SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL

To overcome number of rows, I decided to reduce the number of input rows to 2 million and ran the synthetic data generation procedure.

Here’s a breakdown of the process:

  1. Create a View for 2 Million Rows: I created a view to limit the dataset to 2 million rows from the TPCDS_SF100TCL.CUSTOMER table in Snowflake's sample data.
CREATE OR REPLACE VIEW SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.TPC_CUSTOMERS_2M as (
SELECT * from SNOWFLAKE_SAMPLE_DATA.TPCDS_SF100TCL.CUSTOMER
LIMIT 2000000
);
  • Total Number of Rows: 2,000,000
  • Total Number of Columns: 18
  • Warehouse Used: Compute_WH (Large 8 Credits/hour)

Running the Synthetic Data Generation Procedure: Using the GENERATE_SYNTHETIC_DATAprocedure, I generated synthetic data for the specified input table. The process also included a privacy filter and replaced the output table.

CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA({
'datasets':[
{
'input_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.TPC_CUSTOMERS_2M',
'output_table': 'SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.CUSTOMER_SYN_DATA_COST',
'columns': {'C_CUSTOMER_ID': {'join_key': True}}
}
],
'privacy_filter': True,
'replace_output_tables':True
});

Total Run Time: 47 minutes

Total Credits Consumed: 6.248

Fig 7 — Details of CALL SNOWFLAKE.DATA_PRIVACY.GENERATE_SYNTHETIC_DATA

Checking the Query Attribution History: To evaluate the cost associated with generating synthetic data, I reviewed the SNOWFLAKE.ACCOUNT_USAGE.QUERY_ATTRIBUTION_HISTORY for the query. This provides insights into the compute credits consumed, resource utilization, and the number of output rows.

  • The query attribution history helps to assess the total cost based on the size of the warehouse, the execution time, and the number of rows processed.
Fig 8 — Select * from SYNTHETIC_ADV_DB.SYNTHETIC_ADV_SCH.CUSTOMER_SYN_DATA_COST
Fig 9 — Total number of Output rows

You might wonder why the input table contains 2 million records, but the output table (Fig 9) only has1,814,620 records. The reason for this discrepancy is that when you call the GENERATE_SYNTHETIC_DATA stored procedure, we have optionally set the 'privacy_filter': True configuration. This option applies a privacy filter to the output table, which removes rows that are too similar to the original input data.

Question 4: How are columns changed to “redacted” in the output table? How can I identify which columns will be mapped as redacted in the output table?

To be honest, I don’t have the answers to these questions at the moment. Ideally, columns containing PII (Personally Identifiable Information) would be marked as redacted. However, I have reached out to the Snowflake product team for clarification and will share more details as soon as I receive them.

If you have encountered any questions or other scenarios related to synthetic data generator that you found useful or challenging, please feel free to share them in the comments section. If you want to discuss Snowflake or have any questions, don’t hesitate to connect with me on LinkedIn.

References: -

About me:

I am a Cloud Data Architect at EY New Zealand. Over the course of my career, I have led and contributed to numerous projects, including legacy data warehouse modernization, big data implementations, cloud platform integrations, and migration initiatives. If you require assistance with certification, data solutions, or implementations, please feel free to connect with me on LinkedIn.

--

--

No responses yet