Access Snowflake data using OAuth-based authentication in Amazon SageMaker Data Wrangler

AWS Machine Learning Blog

In this post, we show how to configure a new OAuth-based authentication feature for using Snowflake in Amazon SageMaker Data Wrangler. Snowflake is a cloud data platform that provides data solutions for data warehousing to data science. Snowflake is an AWS Partner with multiple AWS accreditations, including AWS competencies in machine learning (ML), retail, and data and analytics.
Data Wrangler simplifies the data preparation and feature engineering process, reducing the time it takes from weeks to minutes by providing a single visual interface for data scientists to select and clean data, create features, and automate data preparation in ML workflows without writing any code. You can import data from multiple data sources, such as Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, and Snowflake. With this new feature, you can use your own identity provider (IdP) such as Okta, Azure AD, or Ping Federate to connect to Snowflake via Data Wrangler.
Solution overview
In the following sections, we provide steps for an administrator to set up the IdP, Snowflake, and Studio. We also detail the steps that data scientists can take to configure the data flow, analyze the data quality, and add data transformations. Finally, we show how to export the data flow and train a model using SageMaker Autopilot.
Prerequisites
For this walkthrough, you should have the following prerequisites:
For admin:
A Snowflake user with permissions to create storage integrations, and security integrations in Snowflake.
An AWS account with permissions to create AWS Identity and Access Management (IAM) policies and roles.
Access and permissions to configure IDP to register Data Wrangler application and set up the authorization server or API.

For data scientist:
An S3 bucket that Data Wrangler can use to output transformed data.
Access to Amazon SageMaker, an instance of Amazon SageMaker Studio, and a user for Studio. For more information about prerequisites, see Get Started with Data Wrangler.
An IAM role used for Studio with permissions to create and update secrets in AWS Secrets Manager.

Administrator setup
Instead of having your users directly enter their Snowflake credentials into Data Wrangler, you can have them use an IdP to access Snowflake.
The following steps are involved to enable Data Wrangler OAuth access to Snowflake:
Configure the IdP.
Configure Snowflake.
Configure SageMaker Studio.
Configure the IdP
To set up your IdP, you must register the Data Wrangler application and set up your authorization server or API.
Register the Data Wrangler application within the IdP
Refer to the following documentation for the IdPs that Data Wrangler supports:
Azure AD
Okta
Ping Federate
Use the documentation provided by your IdP to register your Data Wrangler application. The information and procedures in this section help you understand how to properly use the documentation provided by your IdP.
Specific customizations in addition to the steps in the respective guides are called out in the subsections.
Select the configuration that starts the process of registering Data Wrangler as an application.
Provide the users within the IdP access to Data Wrangler.
Enable OAuth client authentication by storing the client credentials as a Secrets Manager secret.
Specify a redirect URL using the following format: https://domain-ID.studio.AWS Region.sagemaker.aws/jupyter/default/lab.
You’re specifying the SageMaker domain ID and AWS Region that you’re using to run Data Wrangler. You must register a URL for each domain and Region where you’re running Data Wrangler. Users from a domain and Region that don’t have redirect URLs set up for them won’t be able to authenticate with the IdP to access the Snowflake connection.
Make sure the authorization code and refresh token grant types are allowed for your Data Wrangler application.
Set up the authorization server or API within the IdP
Within your IdP, you must set up an authorization server or an application programming interface (API). For each user, the authorization server or the API sends tokens to Data Wrangler with Snowflake as the audience.
Snowflake uses the concept of roles that are distinct from IAM roles used in AWS. You must configure the IdP to use ANY Role to use the default role associated with the Snowflake account. For example, if a user has systems administrator as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses systems administrator as the role.
Use the following procedure to set up the authorization server or API within your IdP:
From your IdP, begin the process of setting up the server or API.
Configure the authorization server to use the authorization code and refresh token grant types.
Specify the lifetime of the access token.
Set the refresh token idle timeout.
The idle timeout is the time that the refresh token expires if it’s not used. If you’re scheduling jobs in Data Wrangler, we recommend making the idle timeout time greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.
Note that Data Wrangler doesn’t support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.
If the refresh token expires, your users must reauthenticate by accessing the connection that they’ve made to Snowflake through Data Wrangler.
Specify session:role-any as the new scope.
For Azure AD, you must also specify a unique identifier for the scope.
After you’ve set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your IdP to get values for the following fields:
Token URL – The URL of the token that the IdP sends to Data Wrangler
Authorization URL – The URL of the authorization server of the IdP
Client ID – The ID of the IdP
Client secret – The secret that only the authorization server or API recognizes
OAuth scope – This is for Azure AD only
Configure Snowflake
To configure Snowflake, complete the instructions in Import data from Snowflake.
Use the Snowflake documentation for your IdP to set up an external OAuth integration in Snowflake. See the previous section Register the Data Wrangler application within the IdP for more information on how to set up an external OAuth integration.
When you’re setting up the security integration in Snowflake, make sure you activate external_oauth_any_role_mode.
Configure SageMaker Studio
You store the fields and values in a Secrets Manager secret and add it to the Studio Lifecycle Configuration that you’re using for Data Wrangler. A Lifecycle Configuration is a shell script that automatically loads the credentials stored in the secret when the user logs into Studio. For information about creating secrets, see Move hardcoded secrets to AWS Secrets Manager. For information about using Lifecycle Configurations in Studio, see Use Lifecycle Configurations with Amazon SageMaker Studio.
Create a secret for Snowflake credentials
To create your secret for Snowflake credentials, complete the following steps:
On the Secrets Manager console, choose Store a new secret.
For Secret type, select Other type of secret.
Specify the details of your secret as key-value pairs.
Key names require lowercase letters due to case sensitivity. Data Wrangler gives a warning if you enter any of these incorrectly. Input the secret values as key-value pairs Key/value if you’d like, or use the Plaintext option.
The following is the format of the secret used for Okta. If you are using Azure AD, you need to add the datasource_oauth_scope field.

{
“token_url”:”https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token”,
“client_id”:”example-client-id”,
“client_secret”:”example-client-secretsovfDSUoOKAiLe4V6DiZrCLpW44x”,
“identity_provider”:”OKTA”|”AZURE_AD”|”PING_FEDERATE”,
“authorization_url”:”https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize”
}

Update the preceding values with your choice of IdP and information gathered after application registration.
Choose Next.
For Secret name, add the prefix AmazonSageMaker (for example, our secret is AmazonSageMaker-DataWranglerSnowflakeCreds).
In the Tags section, add a tag with the key SageMaker and value true.
Choose Next.
The rest of the fields are optional; choose Next until you have the option to choose Store to store the secret.
After you store the secret, you’re returned to the Secrets Manager console.
Choose the secret you just created, then retrieve the secret ARN.
Store this in your preferred text editor for use later when you create the Data Wrangler data source.
Create a Studio Lifecycle Configuration
To create a Lifecycle Configuration in Studio, complete the following steps:
On the SageMaker console, choose Lifecycle configurations in the navigation pane.
Choose Create configuration.
Choose Jupyter server app.
Create a new lifecycle configuration or append an existing one with the following content:

#!/bin/bash
set -eux
## Script Body
cat > ~/.snowflake_identity_provider_oauth_config
Go to Source
29/03/2023 – 10:42 /Ajjay Govindaram
Twitter: @hoffeldtcom

Admin

About Admin

As an experienced Human Resources leader, I bring a wealth of expertise in corporate HR, talent management, consulting, and business partnering, spanning diverse industries such as retail, media, marketing, PR, graphic design, NGO, law, assurance, consulting, tax services, investment, medical, app/fintech, and tech/programming. I have primarily worked with service and sales companies at local, regional, and global levels, both in Europe and the Asia-Pacific region. My strengths lie in operations, development, strategy, and growth, and I have a proven track record of tailoring HR solutions to meet unique organizational needs. Whether it's overseeing daily HR tasks or crafting and implementing new processes for organizational efficiency and development, I am skilled in creating innovative human capital management programs and impactful company-wide strategic solutions. I am deeply committed to putting people first and using data-driven insights to drive business value. I believe that building modern and inclusive organizations requires a focus on talent development and daily operations, as well as delivering results. My passion for HRM is driven by a strong sense of empathy, integrity, honesty, humility, and courage, which have enabled me to build and maintain positive relationships with employees at all levels.

    You May Also Like

    error: Content is protected !!