All posts by Mohit Mehta

Orchestrate an ETL pipeline using AWS Glue workflows, triggers, and crawlers with custom classifiers

Post Syndicated from Mohit Mehta original https://aws.amazon.com/blogs/big-data/orchestrate-an-etl-pipeline-using-aws-glue-workflows-triggers-and-crawlers-with-custom-classifiers/

Extract, transform, and load (ETL) orchestration is a common mechanism for building big data pipelines. Orchestration for parallel ETL processing requires the use of multiple tools to perform a variety of operations. To simplify the orchestration, you can use AWS Glue workflows. This post demonstrates how to accomplish parallel ETL orchestration using AWS Glue workflows and triggers. We also demonstrate how to use custom classifiers with AWS Glue crawlers to classify fixed width data files.

AWS Glue workflows provide a visual and programmatic tool to author data pipelines by combining AWS Glue crawlers for schema discovery with AWS Glue Spark and Python shell jobs to transform the data. A workflow consists of one or more task nodes arranged as a graph. You can define relationships and pass parameters between task nodes, which enables you to build pipelines of varying complexity. You can trigger workflows on a schedule or on demand. You can track the progress of each node independently or of the entire workflow, making it easier to troubleshoot your pipelines.

You need to define a custom classifier if you want to automatically create a table definition for data that doesn’t match AWS Glue built-in classifiers. For example, if your data originates from a mainframe system that utilizes a COBOL copybook data structure, you need to define a custom classifier when crawling the data to extract the schema. AWS Glue crawlers enable you to provide a custom classifier to classify your data. You can create a custom classifier using a Grok pattern, an XML tag, JSON, or CSV. When the crawler starts, it calls a custom classifier. If the classifier recognizes the data, it stores the classification and schema of the data in the AWS Glue Data Catalog.
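For example, you can define such a classifier programmatically. The following is a minimal sketch using boto3; the classifier name matches the one used later in this post, while the classification label and the truncated Grok pattern are placeholders (the full pattern appears in the appendix):

import boto3

glue = boto3.client("glue")

# Register a Grok-based custom classifier that a crawler can reference by name.
glue.create_classifier(
    GrokClassifier={
        "Name": "RawACHClassifier",            # classifier referenced by the ACH crawler
        "Classification": "ach-fixed-width",   # placeholder label for the data format
        # Truncated for brevity: first 16 characters map to acct_num, next 10 to orig_pmt_date.
        "GrokPattern": "(?<acct_num>.{16})(?<orig_pmt_date>.{10})",
    }
)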

Use case

For this post, we use automated clearing house (ACH) and check payments data ingestion as an example. ACH is a computer-based electronic network for processing transactions, and a check payment is a negotiable instrument drawn against deposited funds that pays the recipient a specific amount on demand. Both ACH and check payments data files, which are in fixed width format, need to be ingested into the data lake incrementally over a time series. As part of the ingestion, these two data types need to be merged to provide a consolidated view of all payments. ACH and check payment records are consolidated into a single table that is useful for performing business analytics using Amazon Athena.

Solution overview

We define an AWS Glue crawler with a custom classifier for each file or data type. We use an AWS Glue workflow to orchestrate the process. The workflow triggers crawlers to run in parallel. When the crawlers are complete, the workflow starts an AWS Glue ETL job to process the input data files. The workflow tracks the completion of the ETL job that performs the data transformation and updates the table metadata in AWS Glue Data Catalog.
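If you want to see how this orchestration fits together outside of the CloudFormation template described next, the following boto3 sketch outlines the same pattern: an on-demand trigger that fans out to two crawlers, and a conditional trigger that starts the ETL job only after both crawlers succeed. The workflow, crawler, and job names here are illustrative, not the exact names the template creates.

import boto3

glue = boto3.client("glue")

glue.create_workflow(Name="payments-workflow")

# On-demand trigger that starts both crawlers in parallel when the workflow is run.
glue.create_trigger(
    Name="start-payment-crawlers",
    WorkflowName="payments-workflow",
    Type="ON_DEMAND",
    Actions=[{"CrawlerName": "ach-crawler"}, {"CrawlerName": "check-crawler"}],
)

# Conditional trigger that starts the ETL job only after both crawlers succeed.
glue.create_trigger(
    Name="start-payments-etl",
    WorkflowName="payments-workflow",
    Type="CONDITIONAL",
    StartOnCreation=True,
    Predicate={
        "Logical": "AND",
        "Conditions": [
            {"LogicalOperator": "EQUALS", "CrawlerName": "ach-crawler", "CrawlState": "SUCCEEDED"},
            {"LogicalOperator": "EQUALS", "CrawlerName": "check-crawler", "CrawlState": "SUCCEEDED"},
        ],
    },
    Actions=[{"JobName": "process-payments-etl"}],
)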

The following diagram illustrates a typical workflow for ETL workloads.

This post is accompanied by an AWS CloudFormation template that creates resources described by the AWS Glue workflow architecture. AWS CloudFormation enables you to model, provision, and manage AWS resources by treating infrastructure as code.

The CloudFormation template creates the following resources:

  • An AWS Glue workflow trigger that is started manually. The trigger starts two crawlers simultaneously to process the data files related to ACH payments and check payments, respectively.
  • Custom classifiers for parsing incoming fixed width files containing ACH and check data.
  • AWS Glue crawlers:
    • A crawler to classify ACH payments in the RAW database. This crawler uses the custom classifier defined for ACH payments raw data. The crawler creates a table named ACH in the Data Catalog’s RAW database.
    • A crawler to classify check payments. This crawler uses the custom classifier defined for check payments raw data. This crawler creates a table named Check in the Data Catalog’s RAW database.
  • An AWS Glue ETL job that runs when both crawlers are complete. The ETL job reads the ACH and check tables, performs transformations using PySpark DataFrames, writes the output to a target Amazon Simple Storage Service (Amazon S3) location, and updates the Data Catalog for the processedpayment table with a new hourly partition (a sketch of this job follows this list).
  • S3 buckets designated as RawDataBucket, ProcessedBucket, and ETLBucket. RawDataBucket holds the raw payment data as it is received from the source system, and ProcessedBucket holds the output after AWS Glue transformations have been applied. This data is suitable for consumption by end-users via Athena. ETLBucket contains the AWS Glue ETL code that is used for processing the data as part of the workflow.
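The following is a simplified sketch of what the ETL job in this workflow does, written as an AWS Glue PySpark script. The database and table names match those used later in this post; the bucket path, selected columns, and partition value are illustrative, so treat this as an outline under those assumptions rather than the exact job the template deploys.

import sys
from datetime import datetime, timezone

from pyspark.context import SparkContext
from pyspark.sql import functions as F
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw tables created by the two crawlers.
ach = glue_context.create_dynamic_frame.from_catalog(
    database="glue-database-raw", table_name="ach").toDF()
check = glue_context.create_dynamic_frame.from_catalog(
    database="glue-database-raw", table_name="check").toDF()

# Keep only the columns needed for the consolidated view (add more as required),
# tag each record with its payment type, and merge the two sources.
payments = (
    ach.select("acct_num").withColumn("pymt_type", F.lit("ACH"))
    .unionByName(check.select("acct_num").withColumn("pymt_type", F.lit("CHECK")))
)

# Add an hourly partition column for the processed table.
payments = payments.withColumn(
    "pmt_hour", F.lit(datetime.now(timezone.utc).strftime("%Y%m%d%H")))

# Write to S3 and register the new partition in the Data Catalog.
sink = glue_context.getSink(
    connection_type="s3",
    path="s3://<processed-bucket>/processedpayment/",   # placeholder bucket
    enableUpdateCatalog=True,
    updateBehavior="UPDATE_IN_DATABASE",
    partitionKeys=["pmt_hour"],
)
sink.setCatalogInfo(catalogDatabase="glue_database_processed",
                    catalogTableName="processedpayment")
sink.setFormat("glueparquet")
sink.writeFrame(DynamicFrame.fromDF(payments, glue_context, "payments"))

job.commit()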

Create resources with AWS CloudFormation

To create your resources with the CloudFormation template, complete the following steps:

  1. Choose Launch Stack:
  2. Choose Next.
  3. Choose Next again.
  4. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  5. Choose Create stack.

Examine custom classifiers for fixed width files

Let’s review the definition of the custom classifier.

  1. On the AWS Glue console, choose Crawlers.
  2. Choose the crawler ach-crawler.
  3. Choose the RawACHClassifier classifier and review the Grok pattern.

This pattern assumes that the first 16 characters in the fixed width file are reserved for acct_num, and the next 10 characters are reserved for orig_pmt_date. When a crawler finds a classifier that matches the data, the classification string and schema are used in the definition of tables that are written to your Data Catalog.

Run the workflow

To run your workflow, complete the following steps:

  1. On the AWS Glue console, select the workflow that the CloudFormation template created.
  2. On the Actions menu, select Run.

This starts the workflow.

  3. When the workflow is complete, on the History tab, choose View run details.

You can review a graph depicting the workflow.
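You can also start and monitor the workflow from code instead of the console. The following is a small boto3 sketch; the workflow name is a placeholder, so substitute the name the CloudFormation stack created.

import time

import boto3

glue = boto3.client("glue")
workflow_name = "payments-workflow"   # placeholder: use the workflow created by the stack

run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]
while True:
    run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
    if run["Status"] in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)

print(run["Status"], run["Statistics"])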

Examine the tables

In the Databases section of the AWS Glue console, you can find a database named glue-database-raw, which contains two tables named ach and check. These tables were created by the respective AWS Glue crawlers using the custom classification patterns specified.
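If you prefer to check the crawler output without opening the console, a quick boto3 sketch like the following lists the columns each crawler inferred (database and table names as above):

import boto3

glue = boto3.client("glue")

for table_name in ("ach", "check"):
    table = glue.get_table(DatabaseName="glue-database-raw", Name=table_name)["Table"]
    columns = [col["Name"] for col in table["StorageDescriptor"]["Columns"]]
    print(table_name, columns)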

Query processed data

To query your data, complete the following steps:

  1. On the AWS Glue console, select the database glue-database-processed.
  2. On the Action menu, choose View data.

The Athena console opens. If this is your first time using Athena, you need to set up the S3 bucket to store the query result.

  3. In the query editor, run the following query:
select acct_num,pymt_type,count(pymt_type)
from glue_database_processed.processedpayment 
group by acct_num,pymt_type;

The query returns the count of each payment type per account from the processedpayment table.
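You can run the same query outside the console as well. The following boto3 sketch submits it through the Athena API; the query results bucket is a placeholder you need to replace.

import time

import boto3

athena = boto3.client("athena")

query = """
select acct_num, pymt_type, count(pymt_type)
from glue_database_processed.processedpayment
group by acct_num, pymt_type
"""

query_id = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://<your-query-results-bucket>/"},  # placeholder
)["QueryExecutionId"]

# Wait for the query to finish, then print the result rows (row 0 is the header).
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"][1:]:
    print([col.get("VarCharValue") for col in row["Data"]])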

Clean up

To avoid incurring ongoing charges, clean up your infrastructure by deleting the CloudFormation stack. However, you first need to empty your S3 buckets.

  1. On the Amazon S3 console, select each bucket created by the CloudFormation stack.
  2. Choose Empty.
  3. On the AWS CloudFormation console, select the stack you created.
  4. Choose Delete.

Conclusion

In this post, we explored how AWS Glue workflows enable data engineers to build and orchestrate a data pipeline to discover, classify, and process standard and non-standard data files. We also discussed how to combine AWS Glue workflows with AWS Glue custom classifiers, crawlers, and ETL jobs to ingest data from multiple sources into a data lake, and how to use Amazon Athena to perform interactive SQL analysis on the results.

For more details on using AWS Glue Workflows, see Performing Complex ETL Activities Using Blueprints and Workflows in AWS Glue.

For more information on AWS Glue ETL jobs, see Build a serverless event-driven workflow with AWS Glue.

For more information on using Athena, see Getting Started with Amazon Athena.


Appendix: Create a regular expression pattern for a custom classifier

Grok is a tool that you can use to parse textual data given a matching pattern. A Grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. AWS Glue uses Grok patterns to infer the schema of your data. When a Grok pattern matches your data, AWS Glue uses the pattern to determine the structure of your data and map it into fields. AWS Glue provides many built-in patterns, or you can define your own. When defining your own pattern, it’s a best practice to test the regular expression before setting up the AWS Glue classifier.

One way to do that is to build and test your regular expression by using https://regex101.com/#PYTHON. For this, you need to take a small sample from your input data. You can visualize the output of your regular expression by completing the following steps:

  1. Copy the following rows from the source file to the test string section.
    111111111ABCDEX 01012019000A2345678A23456S12345678901012ABCDEFGHMJOHN JOE                           123A5678ABCDEFGHIJK      ISECNAMEA                           2019-01-0100000123123456  VAC12345678901234
    211111111BBCDEX 02012019001B2345678B23456712345678902012BBCDEFGHMJOHN JOHN                          123B5678BBCDEFGHIJK      USECNAMEB                           2019-02-0100000223223456  XAC12345678901234

  2. Construct the regex pattern based on the specifications. For example, the first 16 characters represent acct_num followed by orig_pmt_date of 10 characters. You should end up with a pattern as follows:
(?<acct_num>.{16})(?<orig_pmt_date>.{10})(?<orig_rfc_rtn_num>.{8})(?<trace_seq_num>.{7})(?<cls_pmt_code>.{1})(?<orig_pmt_amt>.{14})(?<aas_code>.{8})(?<line_code>.{1})(?<payee_name>.{35})(?<fi_rtn_num>.{8})(?<dpst_acct_num>.{17})(?<ach_pmt_acct_ind>.{1})(?<scndry_payee_name>.{35})(?<r_orig_pmt_date>.{10})(?<r_orig_rfc_rtn_num>.{8})(?<r_trace_seq_num>.{7})(?<type_pmt_code>.{1})(?<va_stn_code>.{2})(?<va_approp_code>.{1})(?<schedule_num>.{14})
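Beyond regex101, you can also sanity-check the pattern locally with a short Python script. Note that Python's re module writes named groups as (?P<name>...) whereas the Grok pattern above uses (?<name>...); the sketch below checks only the first two fields against a sample row saved to a local file (the file name is a placeholder).

import re

# Python named-group syntax differs slightly from the Grok pattern: (?P<name>...) vs (?<name>...).
pattern = re.compile(r"(?P<acct_num>.{16})(?P<orig_pmt_date>.{10})")

with open("sample.txt") as f:          # placeholder: file containing the sample rows above
    first_row = f.readline()

match = pattern.match(first_row)
print(match.group("acct_num"), "|", match.group("orig_pmt_date"))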

After you validate your pattern, you can create a custom classifier and attach it to an AWS Glue crawler.


About the Authors

Mohit Mehta is a leader in the AWS Professional Services Organization with expertise in AI/ML and big data technologies. Prior to joining AWS, Mohit worked as a digital transformation executive at a Fortune 100 financial services organization. Mohit holds an M.S. in Computer Science, all AWS certifications, an MBA from the College of William and Mary, and a GMP from the Michigan Ross School of Business.

Meenakshi Ponn Shankaran is a Senior Big Data Consultant in the AWS Professional Services Organization with expertise in big data. Meenakshi is an SME in working with big data use cases at scale and has experience architecting and optimizing workloads that process petabyte-scale data lakes. When he is not solving big data problems, he likes to coach the game of cricket.

Enabling multi-factor authentication for an Amazon Redshift cluster using Okta as an identity provider

Post Syndicated from Mohit Mehta original https://aws.amazon.com/blogs/big-data/enabling-multi-factor-authentication-for-an-amazon-redshift-cluster-using-okta-as-an-identity-provider/

Many organizations have started using single sign-on (SSO) with multi-factor authentication (MFA) for enhanced security. This additional authentication factor is the new normal, which enhances the security provided by the user name and password model. Using SSO reduces the effort needed to maintain and remember multiple login credentials.

In the past, you needed to manage Amazon Redshift database users separately from the SSO mechanism. Database administrators had to manage authorizations based on user groups and a complex set of database grants. These authorizations were maintained within the database and could easily become disjointed from the corporate directory.

Amazon Redshift now supports a browser-based authentication workflow that works with any MFA configured by the identity federation provider. This browser-based authentication workflow also makes integration with any Security Assertion Markup Language (SAML)-based identity federation easier.

As an Amazon Redshift administrator, you can manage users and groups within the enterprise identity provider (IdP) and pass them to Amazon Redshift at login, along with any additional authentication factors that the IdP enforces.

Okta is one of the most commonly used SSO service providers using SAML 2.0 authentication. Although setting up Okta is mostly the same as configuring any SAML provider, this post specifically describes the steps to set up identity federation for Amazon Redshift using MFA. I cover the steps to set up an Okta.com account, build users and groups within your organization’s directory, and enable SSO into Amazon Redshift with MFA. You can do all this while also maintaining group-level access controls within your data warehouse.

Solution overview

The steps in this post are structured into the following sections:

  • Okta configuration – You set up Okta, which contains your users organized into logical groups.
  • AWS configuration – You set up an AWS Identity and Access Management (IAM) identity provider that establishes a trust relationship between your IdP and AWS, and an IAM role that Okta uses to access Amazon Redshift.
  • Okta advanced configuration – You finalize the Okta configuration by inputting the roles that you just created. You also inform Okta of which groups are allowed to be passed to Amazon Redshift.
  • Amazon Redshift server and client setup – You set up groups within the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

Configuring Okta

In this first step, you set up Okta, add users, and organize them into logical groups. You then add the Amazon Web Services Redshift Okta application.

  1. For your Okta users to connect, you need to create or use an existing Amazon Redshift cluster.

You need the identifier property of the cluster when you configure the connection.

  2. Create an Okta account.

If you don’t already have access to an Okta account, you can start a 30-day free trial.

  3. Sign in to Okta.com using the following URL, where <prefix> is specific to your account and was created at account setup: https://<prefix>-admin.okta.com/admin/dashboard.
  4. Navigate to the Directory page to add people and groups into Okta that match your organization.

Be sure to use lowercase group names (Amazon Redshift expects the group names to be lowercase).

For this use case, I added three users and two groups, where one of the users (Jorge) belongs to both the sales and marketing groups.

  5. Choose Admin.
  6. Choose Add Person.

After you add your users, you can add your groups.

  7. Choose Add Group.

The following screenshot shows our new groups: marketing and sales.

  8. To add a new application, navigate to the Applications page.
  9. Choose Create New App.
  10. On the Create a New Application Integration page, for Platform, choose Web.
  11. For Sign on method, select SAML 2.0.
  12. Choose Create.
  13. On the General Settings page, for App name, enter a name that identifies your application.
  14. Choose Next.
  15. On the Configure SAML page, for Single sign on URL, enter http://localhost:7890/redshift/.
  16. Select Use this for Recipient URL and Destination URL.
  17. For Audience URI (SP Entity ID), enter urn:amazon:webservices.
  18. In the Attribute Statements section, set up the following four attribute mappings:
  • Name: https://aws.amazon.com/SAML/Attributes/Role
    Name format: Unspecified
    Value: arn:aws:iam::<aws_account_id>:saml-provider/<name_for_your_iam_identity_provider>,arn:aws:iam::<aws_account_id>:role/<name_for_your_sso_iam_role>
  • Name: https://aws.amazon.com/SAML/Attributes/RoleSessionName
    Name format: Unspecified
    Value: user.email
  • Name: https://redshift.amazon.com/SAML/Attributes/AutoCreate
    Name format: Unspecified
    Value: true
  • Name: https://redshift.amazon.com/SAML/Attributes/DbUser
    Name format: Unspecified
    Value: user.email

Replace the placeholders with the following information:

  • <aws_account_id> – Account ID of the Amazon Redshift cluster
  • <name_for_your_sso_iam_role> – IAM role for access to the Amazon Redshift cluster
  • <name_for_your_iam_identity_provider> – Descriptive name for the IdP

  19. In the Group Attribute Statements section, add the following attribute:
  • Name: https://redshift.amazon.com/SAML/Attributes/DbGroups
    Name format: Unspecified
    Value: marketing

  20. In the same Group Attribute Statements section, you can add optional filters to control which groups are passed. The following filter conditions are supported:
  • Starts with – Matches groups that start with a specified string. If you want to set up permissions for multiple groups, you can choose a common prefix. For example, you could prefix your group names as aws_marketing and aws_sales. Specifying a filter that starts with aws_ ensures that permissions apply to both groups that start with that prefix.
  • Equals – In our use case, we use the Equals filter condition to specify the group named marketing.
  • Contains – Matches text contained in the group name.
  • Matches Regex – For more complex scenarios, you could use a regular expression. The expression (aws_\S*_retail) matches group names that begin with aws_ and end with _retail.
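If you want to verify a regex filter before saving it in Okta, a quick local check helps. The following is a tiny sketch using Python's re module; the group names are made up for illustration.

import re

group_filter = re.compile(r"aws_\S*_retail")

for group_name in ("aws_emea_retail", "aws_us_retail", "aws_marketing", "marketing"):
    print(group_name, bool(group_filter.fullmatch(group_name)))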

If you have set up SAML correctly, you should have a SAML assertion with the following attributes:

<saml2:AttributeStatement xmlns:saml2="urn:oasis:names:tc:SAML:2.0:assertion">
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/Role" NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified">
<saml2:AttributeValue xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:type="xs:string">arn:aws:iam::<accountid>:saml-provider/redshiftmfa,arn:aws:iam::<accountid>:role/rs-okta-mfa-role</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://aws.amazon.com/SAML/Attributes/RoleSessionName" NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified">
<saml2:AttributeValue xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:type="xs:string">[email protected]</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://redshift.amazon.com/SAML/Attributes/AutoCreate" NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified">
<saml2:AttributeValue xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:type="xs:string">true</saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbUser" NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified">
<saml2:AttributeValue xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:type="xs:string"> [email protected] </saml2:AttributeValue>
</saml2:Attribute>
<saml2:Attribute Name="https://redshift.amazon.com/SAML/Attributes/DbGroups" NameFormat="urn:oasis:names:tc:SAML:2.0:attrname-format:unspecified">
<saml2:AttributeValue xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xs="http://www.w3.org/2001/XMLSchema" xsi:type="xs:string">marketing</saml2:AttributeValue>
</saml2:Attribute>
</saml2:AttributeStatement>
  21. In the Feedback section, keep the options at their default and choose Finish.

You can now download the Okta application metadata.

  22. Navigate to the Amazon Web Services Redshift application’s Settings page.
  23. Choose Sign On.
  24. Choose Identity Provider metadata to download the metadata file in XML format (for example, metadata.xml).

Configuring AWS

In this section, you create an IAM identity provider that establishes a trust relationship between your corporate IdP and AWS, along with an IAM role that Okta uses to access Amazon Redshift.

  1. Sign in to the AWS Management Console with your AWS credentials.
  2. On the IAM console, choose Identity providers.
  3. Choose Create Provider.
  4. For Provider Type, choose SAML.
  5. For Provider name, enter a name for your IdP.
  6. For Metadata Document, upload the metadata file that you downloaded in the previous step.
  7. On the IAM console, navigate to Roles and choose Create role.
  8. For Select type of trusted entity, choose SAML 2.0 federation.
  9. For SAML provider, choose the IdP you created in the previous step.
  10. Select Allow programmatic access and AWS Management Console access.
  11. Choose Next: Permissions.
  12. Choose Create policy.
  13. Create the following custom policy, replacing the region, account, and cluster parameters. These permissions allow the role to use Amazon Redshift to query data, create users, and allow users to join groups:
    {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": [
                "redshift:JoinGroup",
                "redshift:CreateClusterUser",
                "redshift:GetClusterCredentials",
                "redshift:ListSchemas",
                "redshift:ListTables",
                "redshift:ListDatabases",
                "redshift:ExecuteQuery",
                "redshift:FetchResults",
                "redshift:CancelQuery",
                "redshift:DescribeClusters",
                "redshift:DescribeQuery",
                "redshift:DescribeTable"
            ],
            "Resource": [
                "arn:aws:redshift:<region>:<account>:dbgroup:<clusterName>/<groupName>"
            ]
        }]
    }

There are a few important things to note:

  • The group membership lasts only for the duration of the user session.
  • There is no CreateGroup permission because groups need to be manually created and granted DB privileges.

The following screenshot shows the summary page for the role.
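If you prefer to script this instead of clicking through the console, the following boto3 sketch performs the equivalent steps: it registers the Okta metadata as a SAML provider, then creates a role whose trust policy allows SAML federation. The provider and role names match the examples in the SAML assertion shown earlier, the SAML:aud value assumes the single sign-on URL configured in Okta (http://localhost:7890/redshift/), and you still need to attach the Amazon Redshift permissions policy shown above.

import json

import boto3

iam = boto3.client("iam")

# Register the IdP using the metadata.xml file downloaded from Okta.
with open("metadata.xml") as f:
    provider_arn = iam.create_saml_provider(
        SAMLMetadataDocument=f.read(),
        Name="redshiftmfa",
    )["SAMLProviderArn"]

# Trust policy that lets principals federated through this IdP assume the role via SAML.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Federated": provider_arn},
        "Action": "sts:AssumeRoleWithSAML",
        "Condition": {"StringEquals": {"SAML:aud": "http://localhost:7890/redshift/"}},
    }],
}

iam.create_role(
    RoleName="rs-okta-mfa-role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)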

Okta advanced configuration

In this section, you finalize the Okta configuration by adding the roles that you just created. You also tell Okta which groups are allowed to be passed to Amazon Redshift.

You authorize users to use the Amazon Web Services Redshift application by selecting their respective groups or individual user accounts.

For this use case, I authorize users by group. See the following screenshot.

Setting up the Amazon Redshift server and client

Next, you set up groups in the Amazon Redshift database to match the Okta groups. You also authorize these groups to access certain schemas and tables. Finally, you set up your client tools to use your enterprise credentials and sign in to Amazon Redshift.

  1. Log in to your Amazon Redshift cluster with an admin account.
  2. Create groups that match the IdP group names, and grant the appropriate permissions to tables and schemas:
CREATE GROUP marketing;
ALTER DEFAULT PRIVILEGES IN SCHEMA marketing
GRANT SELECT ON TABLES TO GROUP marketing;
GRANT USAGE ON SCHEMA marketing TO GROUP marketing;
GRANT SELECT ON ALL TABLES IN SCHEMA marketing TO GROUP marketing;

Now you’re ready to connect to Amazon Redshift using a SQL client application. Make sure to install the Amazon Redshift JDBC driver; for instructions, see Download an Amazon Redshift JDBC driver. Note that IAM authentication requires using the JDBC driver that includes the AWS SDK, or making sure the AWS SDK is on your Java classpath.
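If you’d rather connect from Python than from a JDBC-based tool, the open-source redshift_connector package supports the same browser-based SAML flow. The following is a rough sketch; the parameter names follow that package’s IAM options, and the cluster, Region, database, and Okta application URL are placeholders you need to replace (verify the options against the package version you install).

import redshift_connector

conn = redshift_connector.connect(
    iam=True,
    database="dev",                                    # placeholder database name
    cluster_identifier="<your-cluster-identifier>",    # placeholder
    region="<your-region>",                            # placeholder
    credentials_provider="BrowserSamlCredentialsProvider",
    login_url="https://<prefix>.okta.com/app/<app-path>/sso/saml",  # the Okta application URL
)

cursor = conn.cursor()
cursor.execute("select current_user")
print(cursor.fetchone())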

  3. Assuming that the Amazon Redshift JDBC driver is installed, set up a new connection to your cluster using your IdP credentials.

For this use case, I use SQLWorkbenchJ.

  4. For URL, be sure to use the jdbc:redshift:iam:// prefix to instruct the driver to authenticate using IAM.
  5. For Username and Password, enter the values that you set in Okta.
  6. Choose OK.
  7. For Edit extended properties, enter the following properties and values:
    1. login_url – Use the application URL from Okta, for example, https://amazonmohitmehta.okta.com/app/amazonorg426936_redshiftmfa_1/exkai5axi2ep4hF6C4x6/sso/saml
    2. plugin_name – Enter com.amazon.redshift.plugin.BrowserSamlCredentialsProvider
  8. Choose OK.
  9. When prompted, enter your Okta user name and password to sign in.
  10. When prompted, retrieve your MFA token from your MFA device and enter it.

You should see a message that you successfully connected to Amazon Redshift. See the following screenshot.

You should now be able to sign in with the users you created. In our use case, [email protected] has access to the tables in the marketing schema only. Using the [email protected] user, you get the following results when trying to query data from each schema:

select campaignid From marketing.campaign


campaignid	
-------
1234
2345
3456
[…]


select * From sales.storeid


An error occurred when executing the SQL command:

[Amazon](500310) Invalid operation: permission denied for schema sales;
1 statement failed.

Execution time: 0.16s

Summary

In this post, you learned how to configure Okta as your IdP with multi-factor authentication to enable single sign-on to an Amazon Redshift cluster. You also configured group membership within your IdP that can be passed along, which enables you to manage user access to Amazon Redshift resources from within your IdP.

If you have questions or suggestions, please leave a comment.


About the Author

Mohit Mehta is a Senior Consultant for AWS Professional Services. He works with our customers to provide leadership on a variety of projects, helping them shorten their time to value when using AWS.