Day 3: Snowflake Cortex -> PARSE_DOCUMENT

Alexander
4 min readJust now

--

In today’s post, we’ll dive into one of the powerful function in Snowflake Cortex — the PARSE_DOCUMENT, part of Snowflake’s suite of Large Language Models (LLMs).

What is Snowflake Cortex?

Snowflake Cortex offers instant access to cutting-edge, industry-leading large language models (LLMs) developed by renowned organizations such as Mistral, Reka, Meta, and Google. Among these models is Snowflake Arctic, an enterprise-grade LLM developed specifically by Snowflake.

One of the key advantages of Snowflake Cortex is that it is fully hosted and managed by Snowflake, meaning there’s no setup required. Additionally, since your data remains within the Snowflake ecosystem, you can enjoy the high performance, scalability, and governance that Snowflake is known for.

What is PARSE_DOCUMENT in Snowflake Cortex?

The PARSE_DOCUMENT function in Snowflake Cortex allows you to extract structured content from documents stored in a Snowflake stage. The extracted content is returned as an OBJECT, which contains JSON-encoded strings.

PARSE_DOCUMENT supports two types of extraction:

  • Optical Character Recognition (OCR): Extracts text from images or scanned documents.
  • Layout: Extracts content while preserving the layout and structure of the document.

Use Case: Parsing a Supermarket Bill Using PARSE_DOCUMENT

To illustrate the use of PARSE_DOCUMENT, let’s take a real-world example. I’ve chosen my supermarket bill in PDF format as the document to parse.

Note: For security reasons, I have redacted the PII data with red color.

Fig 1 — Super Market Purchase Receipt

Input Explanation:

When you use the PARSE_DOCUMENT function, the document needs to be uploaded to either a Snowflake-managed internal stage or an external stage. In my scenario, I’ve uploaded the PDF to a Snowflake-managed internal stage for processing.

By using this function, you can easily extract data from documents like receipts or invoices, enabling seamless data processing and analysis within Snowflake.

Fig 2 — Receipt uploaded in Snowflake Stage (NEW_WORLD_RECEIPT)

Database Details:

Database: SNOWFLAKE_PARSE
Schema: SNOWFLAKE_PARSE_SCH
Stages: NEW_WORLD_RECEIPT
File Name: New_World_Doc.pdf

Approach 1: Extraction Using Optical Character Recognition (OCR)

In this approach, we will explore how Optical Character Recognition (OCR) can be used to extract text-based content from a document.

SELECT TO_VARCHAR(
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
'@SNOWFLAKE_PARSE.SNOWFLAKE_PARSE_SCH.NEW_WORLD_RECEIPT',
'New_World_Doc.pdf',
{'mode': 'OCR'}):content
) AS OCR;

Output:

Fig 3 — Extraction Using Optical Character Recognition (OCR)

Approach 2: Extraction Using LAYOUT Mode

In this approach, we will look at how the LAYOUT mode of PARSE_DOCUMENT works to extract structured data from documents, particularly when there are tables involved.

Fig 4 — Extraction Using LAYOUT Mode

From the two approaches mentioned above, we can easily extract any column needed for analysis.

Advantages:

One of the major benefits is that, without the need to train any AI/ML models, I can directly extract the required data using SNOWFLAKE_PARSE.

Additionally, the processing takes place within the secure Snowflake environment, minimizing data movement and enhancing data security and compliance.

Moreover, it’s incredibly easy to use and integrates seamlessly into existing workflows.

SQL Query to Extract Total Amount and Date from My Bill:

To extract the Total Amount and Date from my bill, I used regular expressions combined with the REGEXP_SUBSTR function. Here’s the SQL query I used:

SELECT
-- Extract the Total Amount
REGEXP_SUBSTR(TO_VARCHAR(layout:content),'TOTAL\\s*\\|\\s*(\\$?NZD?\\d+\\.\\d{2})',1,1,'e',1) AS total_amount,

-- Extract the Date
REGEXP_SUBSTR(TO_VARCHAR(layout:content),'\\d{2}[A-Za-z]{3}\\d{2}',1,1) AS transaction_date
FROM (
SELECT
SNOWFLAKE.CORTEX.PARSE_DOCUMENT(
'@SNOWFLAKE_PARSE.SNOWFLAKE_PARSE_SCH.NEW_WORLD_RECEIPT',
'New_World_Doc.pdf',
{'mode': 'LAYOUT'}
) AS layout
);

Output:

Fig 5 — TOTAL_AMOUNT and TRANSACTION_DATE extraction using LAYOUT mode

About me:

I am a Cloud Data Architect at EY New Zealand. Over the course of my career, I have led and contributed to numerous projects, including legacy data warehouse modernization, big data implementations, cloud platform integrations, and migration initiatives. If you require assistance with certification, data solutions, or implementations, please feel free to connect with me on LinkedIn.

--

--

No responses yet