Cache in snowflake

What is Snowflake Caching ?

Alexander
5 min readOct 14, 2021

Imagine executing a query that takes 10 minutes to complete. Now if you re-run the same query later in the day while the underlying data hasn’t changed, you are essentially doing again the same work and wasting resources

Instead Snowflake caches the results of every query you ran and when a new query is submitted, it checks previously executed queries and if a matching query exists and the results are still cached, it uses the cached result set instead of executing the query. This can greatly reduce query times because Snowflake retrieves the result directly from the cache.

1. Query Results Caching:

The Results cache holds the results of every query executed in the past 24 hours. These are available across virtual warehouses, so query results returned to one user is available to any other user on the system who executes the same query, provided the underlying data has not changed.

Real Time Example:

Starting a new virtual warehouse (with no local disk caching), and executing the below mentioned query

SELECT BIKEID,MEMBERSHIP_TYPE,START_STATION_ID,BIRTH_YEAR FROM TEST_DEMO_TBL ;

Query returned result in around 13.2 Seconds, and demonstrates it scanned around 252.46MB of compressed data, with 0% from the local disk cache. This means it had no benefit from disk caching.

Now we will try to execute same query in same warehouse,

This query returned results in milliseconds, and involved re-executing the query, but with this time, the result cache enabled.

How to disable Snowflake Query Results Caching?
To disable the Snowflake Results cache, run the below query. It should disable the query for the entire session duration

ALTER SESSION USE_CACHED_RESULT=FALSE;

2. Virtual Warehouse Local Disk Caching

Whenever data is needed for a given query it’s retrieved from the Remote Disk storage, and cached in SSD and memory of the Virtual Warehouse. This data will remain until the virtual warehouse is active. When there is a subsequent query fired an if it requires the same data files as previous query, the virtual warehouse might choose to reuse the datafile instead of pulling it again from the Remote disk.

Real Time Example:

Starting a new virtual warehouse (with Query Result Caching set to False), and executing the below mentioned query

SELECT TRIPDURATION,TIMESTAMPDIFF(hour,STOPTIME,STARTTIME),START_STATION_ID,END_STATION_ID
FROM TRIPS;

This query returned in around 33.7 Seconds, and demonstrates it scanned around 53.81% from cache.

Local Bytes Scanned: 213.1 MB

Remote Byte Scanned: 182.9 MB

Now we will try to execute same query in same warehouse,

Same query returned results in 33.2 Seconds, and involved re-executing the query, but with this time, the bytes scanned from cache increased to 79.94%.

Local Bytes Scanned: 316.5 MB

Remote Byte Scanned: 79.4 MB

Snowflake uses a cloud storage service such as Amazon S3 as permanent storage for data (Remote Disk in terms of Snowflake), but it can also use Local Disk (SSD) to temporarily cache data used by SQL queries.

3. Metadata Cache

This is not really a Cache. Instead, It is a service offered by Snowflake. Snowflake automatically collects and manages metadata about tables and micro-partitions

  • Row Count
  • Table Size in Bytes
  • File references and table versions

For Micro-Partitions, Snowflake stores:

  • The range of values (MIN/MAX values)
  • Number of distinct values
  • NULL Count

For Clustering, Snowflake Stores:

  • The total number of Micro-Partitions
  • Number of Micro-Partitions containing values overlapping with each together
  • The depth of overlapping Micro-Partitions
  • This is an indication of how well-clustered a table is since as this value decreases, the number of pruned columns can increase.

All DML operations take advantage of micro-partition metadata for table maintenance. Some operations are metadata alone and require no compute resources to complete, like the query below

Real Time Example:

Starting a new virtual warehouse (with no local disk caching), and executing the below mentioned query

SELECT MIN(BIKEID),MIN(START_STATION_LATITUDE),MAX(END_STATION_LATITUDE) FROM TEST_DEMO_TBL ;

In above screenshot we could see 100% result was fetched directly from Metadata cache.

Snowflake Cache Best Practice

Clearly data caching data makes a massive difference to Snowflake query performance, but what can you do to ensure maximum efficiency when you cannot adjust the cache?

Here’s a few best practice tips:-

  • Auto-Suspend: By default, Snowflake will auto-suspend a virtual warehouse (the compute resources with the SSD cache after 10 minutes of idle time. Best practice? Leave this alone! Keep in mind, you should be trying to balance the cost of providing compute resources with fast query performance. To illustrate the point, consider these two extremes:
  1. If you auto-suspend after 60 seconds: When the warehouse is re-started, it will (most likely) start with a clean cache, and will take a few queries to hold the relevant cached data in memory. (Note: Snowflake will try to restore the same cluster, with the cache intact, but this is not guaranteed).
  2. If you never suspend: Your cache will always be warm, but you will pay for compute resources, even if nobody is running any queries. However, provided you set up a script to shut down the server when not being used, then maybe (just maybe), it may make sense.

Auto-Suspend Best Practice? Is remarkably simple, and falls into one of two possible options:

  1. Online Warehouses: Where the virtual warehouse is used by online query users, leave the auto-suspend at 10 minutes. This means if there’s a short break in queries, the cache remains warm, and subsequent queries use the query cache.
  2. Batch Processing Warehouses: For warehouses entirely deployed to execute batch processes, suspend the warehouse after 60 seconds. The performance of an individual query is not quite so important as the overall throughput, and it’s therefore unlikely a batch warehouse would rely on the query cache.

--

--