Ippon Blog

Mage - ETL on Snowpark Container Services

Written by Adam Tobin | Jun 18, 2024 11:00:00 AM


Snowpark Container Services allow users to deploy containerized applications directly inside their secure Snowflake environment. In this blog, I will demonstrate how to quickly set up Mage, a data pipeline tool, on Snowpark Container Services.

What Are Snowpark Container Services?

In the simplest terms, Snowpark Container Services are Kubernetes for Snowflake. They allow users to run containerized applications on a platform specifically optimized for Snowflake. While Kubernetes has been a reliable standard for years, Snowpark Container Services offer two key advantages:

  • Your data stays secure and never leaves your Snowflake environment.
  • You eliminate dependencies on third-party tools and can use Snowflake as an all-in-one solution.

How Do Snowpark Container Services Work?

A service consists of two main components: an image and a compute pool. Images are stored in an Image Repository, a schema-level object in Snowflake, to make them accessible to the service. A compute pool, a set of CPU and/or GPU nodes sized to the user's requirements, is also created.

With these components in place, a service is created by specifying an image (or multiple images) and a compute pool in its definition, along with any additional optional configurations. Once the service is up and running, an endpoint is generated for user interaction.
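
To make that concrete, here is a rough sketch of the shape of a service definition. All names here are placeholders, and the full Mage example appears in Step 4 below:

-- A minimal sketch with placeholder names, not part of the demo itself
CREATE SERVICE my_service
  IN COMPUTE POOL my_compute_pool
  FROM SPECIFICATION $$
    spec:
      containers:
      - name: my_container
        image: /my_db/my_schema/my_repository/my_image:latest
      endpoints:
      - name: ui
        port: 8080
        public: true   # public endpoints receive an ingress URL
  $$;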

Why Mage.AI?

With the abundance of ETL tools on the market, it's challenging to stand out, and I only recently discovered Mage. This open-source tool features notebook functionality that makes it easy to visualize data transformations at each step, and it boasts a clean UI. However, the primary reason I chose Mage for this demo was its straightforward architecture, which allows for quick deployment on Snowpark Container Services, especially if you are inexperienced with containerized applications.

Setting Up Mage On Snowpark Container Services

Step 1: Create the required objects:

Begin by creating the following objects in your account. We'll configure most of these using the ACCOUNTADMIN role, but we will also create a dedicated ETL_ROLE for managing and maintaining the service.

  1. Image Repository:
    • Storage location for your images.
  2. Network Rule/External Access Integration:
    • Specify which webpages your service can access.
    • Note: By default, services cannot access internet pages.
  3. Compute Pool:
    • Define the resources your services will utilize (e.g., CPUs, GPUs).
  4. Stage (Optional):
    • Used for storing files for persistence.

USE ROLE ACCOUNTADMIN;
CREATE ROLE ETL_ROLE;
CREATE SCHEMA TOOLS;
USE SCHEMA TOOLS;
CREATE IMAGE REPOSITORY TOOLS.ETL_TOOLS_REPOSITORY;

CREATE COMPUTE POOL BASIC_COMPUTE_POOL
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_XS;

CREATE NETWORK RULE ETL_NETWORK_RULE
  TYPE = HOST_PORT
  MODE = EGRESS
  -- Add any other hostnames your service needs to this list
  VALUE_LIST = ('<your_account>.snowflakecomputing.com');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION ETL_ACCESS_INTEGRATION
  ALLOWED_NETWORK_RULES = (ETL_NETWORK_RULE)
  ENABLED = true;

CREATE STAGE IF NOT EXISTS MAGEAI_FILES ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE');
 

It's important to note that services cannot access web pages on the internet by default. To grant your service access to specific hostnames, they must be listed in the VALUE_LIST of your network rule.
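
If you need to allow additional hostnames after the rule has been created, you can overwrite its value list. Here is a minimal sketch with a placeholder hostname; note that SET VALUE_LIST replaces the entire list, so repeat every hostname you still need:

ALTER NETWORK RULE ETL_NETWORK_RULE SET
  VALUE_LIST = ('<your_account>.snowflakecomputing.com', '<additional_hostname>');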

Step 2: Grant privileges to the ETL_ROLE:

We will be creating the service with ETL_ROLE. First, we need to grant this new role the required privileges to create and manage the service.

GRANT BIND SERVICE ENDPOINT ON ACCOUNT TO ROLE ETL_ROLE;
GRANT USAGE, MONITOR ON COMPUTE POOL BASIC_COMPUTE_POOL TO ROLE ETL_ROLE;
GRANT USAGE ON SCHEMA TOOLS TO ROLE ETL_ROLE;
GRANT CREATE SERVICE ON SCHEMA TOOLS TO ROLE ETL_ROLE;
GRANT USAGE ON INTEGRATION ETL_ACCESS_INTEGRATION TO ROLE ETL_ROLE;
GRANT USAGE ON WAREHOUSE <YOUR_WH> TO ROLE ETL_ROLE;
GRANT READ ON IMAGE REPOSITORY TOOLS.ETL_TOOLS_REPOSITORY TO ROLE ETL_ROLE;
GRANT READ, WRITE ON STAGE MAGEAI_FILES TO ROLE ETL_ROLE;
GRANT ROLE ETL_ROLE TO USER <YOUR_USER>;
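
Before moving on, it's worth sanity-checking the setup by listing the privileges granted to the new role:

SHOW GRANTS TO ROLE ETL_ROLE;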
 
Step 3: Move the Mage image into the Snowflake Image Repository:

Use Docker (or a similar tool) to pull the latest Mage image, tag that image with your repository URL, and push the image into your Snowflake Image Repository.

Run the following commands from a terminal:

# Pull the latest Mage image
docker pull mageai/mageai
# List local images and copy the image ID for the next step
docker images
# Tag the image with your repository URL
docker tag <image_id> <repository_url>/mageai:latest
# Log in to your Snowflake image registry
docker login <registry_hostname> -u <username>
# Push the tagged image to your repository
docker push <repository_url>/mageai:latest
 

To get the values for <repository_url> and <registry_hostname>, run the SHOW IMAGE REPOSITORIES statement in your Snowflake environment and look at the repository_url field. The values should look like this:

  • Registry hostname:
    • <org>-<account>.registry.snowflakecomputing.com
  • Repository URL:
    • <org>-<account>.registry.snowflakecomputing.com/<your_db>/tools/etl_tools_repository
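
For reference, here is the statement scoped to the schema from Step 1 (the IN SCHEMA clause is optional and simply narrows the output):

SHOW IMAGE REPOSITORIES IN SCHEMA TOOLS;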
 

Once this step is complete, confirm that your image exists by running the following command in Snowflake:

SELECT SYSTEM$REGISTRY_LIST_IMAGES('/<YOUR_DB>/TOOLS/ETL_TOOLS_REPOSITORY');
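
Alternatively, Snowflake also provides a SHOW IMAGES command that returns the same information in tabular form, assuming it is available in your account:

SHOW IMAGES IN IMAGE REPOSITORY TOOLS.ETL_TOOLS_REPOSITORY;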
 
Step 4: Create the service:

With the setup complete, we are ready to create the service!

CREATE SERVICE mageservice
  IN COMPUTE POOL BASIC_COMPUTE_POOL
  FROM SPECIFICATION $$
    spec:
      containers:
      - name: mage
        image: /<your_db>/tools/etl_tools_repository/mageai:latest
        volumeMounts:
        - name: magevol
          mountPath: /home/src/
      endpoints:
      - name: mage
        port: 6789
        public: true
      volumes:
      - name: magevol
        source: "@mageai_files"
      networkPolicyConfig:
        allowInternetEgress: true
  $$
  EXTERNAL_ACCESS_INTEGRATIONS = (ETL_ACCESS_INTEGRATION)
  MIN_INSTANCES = 1
  MAX_INSTANCES = 1;
 

We can use the two statements below to track the progress of the service while it starts up:

  • SELECT SYSTEM$GET_SERVICE_STATUS('mageservice');
  • SELECT SYSTEM$GET_SERVICE_LOGS('mageservice', 0, 'mage');

Step 5: Logging in:

After the service has started, you can locate the endpoint by executing the following statement:

SHOW ENDPOINTS IN SERVICE mageservice;

Copy the URL from the ingress_url column and paste it into a new browser window, where you will be prompted to log in with your Snowflake credentials.

Once logged in, you will be redirected to the homepage for your Mage service.

And that's it! From here, I would recommend checking out the example pipeline included with a fresh instance of Mage.

  1. Navigate to the ‘Pipelines’ tab (second icon from the top on the left menu bar).
  2. Click on the ‘example_pipeline’ from the list.
  3. Click on ‘edit_pipeline’ (second icon from the top on the left menu bar).

The example is a three-step pipeline and demonstrates how users can validate their process, Jupyter-notebook style, by running each step independently. Keep in mind that this example pipeline accesses a GitHub web page. The hostname for that page must be included in your network rule (you can add it with ALTER NETWORK RULE, as shown in Step 1), or the service will throw an error.

For more information on Mage, visit their website at https://www.mage.ai/
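
One final housekeeping note: the compute pool continues to consume credits while it runs. When you are done experimenting, you can suspend or remove the pieces with standard Snowflake statements, sketched below:

-- Suspend the service; it can be brought back with ALTER SERVICE mageservice RESUME
ALTER SERVICE mageservice SUSPEND;
-- Suspend the compute pool once no services are running on it
ALTER COMPUTE POOL BASIC_COMPUTE_POOL SUSPEND;
-- Or remove the service entirely
DROP SERVICE mageservice;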

Wrapping it up!

In this article, we demonstrated how easily a user can deploy a containerized application using Snowflake resources. This is a significant step forward and, along with Snowflake's new Git Integration feature, underscores Snowflake's commitment to reducing users' reliance on third-party tools. However, as with any new feature, there are challenges and limitations that may keep users from switching over just yet.