Tag Archives: Analytics

Restrict Amazon Redshift Spectrum external table access to Amazon Redshift IAM users and groups using role chaining

Post Syndicated from Harsha Tadiparthi original https://aws.amazon.com/blogs/big-data/restrict-amazon-redshift-spectrum-external-table-access-to-amazon-redshift-iam-users-and-groups-using-role-chaining/

With Amazon Redshift Spectrum, you can query the data in your Amazon Simple Storage Service (Amazon S3) data lake using a central AWS Glue metastore from your Amazon Redshift cluster. This capability extends your petabyte-scale Amazon Redshift data warehouse with virtually unlimited storage, which allows you to scale to exabytes of data cost-effectively. Like Amazon EMR, you get the benefits of open data formats and inexpensive storage, and you can scale out to thousands of Redshift Spectrum nodes to pull, filter, project, aggregate, group, and sort data. Like Amazon Athena, Redshift Spectrum is serverless and there’s nothing to provision or manage. You pay only $5 for every 1 TB of data scanned. This post discusses how to configure Amazon Redshift security to enable fine-grained access control using role chaining to achieve high-fidelity, user-based permission management.

As you start using the lake house approach, which integrates Amazon Redshift with the Amazon S3 data lake using Redshift Spectrum, you need more flexibility when it comes to granting access to different external schemas on the cluster. For example, in the following use case, you have two Redshift Spectrum schemas, SA and SB, mapped to two databases, A and B, respectively, in an AWS Glue Data Catalog, and you want to allow the following access when queried from Amazon Redshift:

  • Select access on schema SA only for IAM user group Grp1
  • Select access on schema SB only for IAM user group Grp2
  • No access to schemas SA and SB for IAM user group Grp3

By default, the policies defined under the AWS Identity and Access Management (IAM) role assigned to the Amazon Redshift cluster manage Redshift Spectrum table access, and those permissions are inherited by all users and groups in the cluster. The IAM role associated with the cluster cannot easily be restricted to different users and groups. This post details the configuration steps necessary to achieve fine-grained authorization policies for different users in an Amazon Redshift cluster and to control access to different Redshift Spectrum schemas and tables using IAM role chaining. With role chaining, you don’t have to modify the cluster; you can make all modifications on the IAM side, and adding new roles doesn’t require any changes in Amazon Redshift. Even when using AWS Lake Formation, as of this writing, you can’t achieve this level of isolated, coarse-grained access control on Redshift Spectrum schemas and tables. For more information about cross-account queries, see How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3.

Prerequisites

This post uses a TPC-DS 3 TB public dataset from Amazon S3, cataloged in AWS Glue by an AWS Glue crawler, and an example retail department dataset. To get started, you must complete the following prerequisites. The first two prerequisites are outside the scope of this post; you can use an existing cluster and your own dataset in your Amazon S3 data lake.

  1. Create an Amazon Redshift cluster with or without an IAM role assigned to the cluster.
  2. Create an AWS Glue Data Catalog with a database using data from the data lake in Amazon S3, with either an AWS Glue crawler, Amazon EMR, AWS Glue, or Athena. The database should have one or more tables pointing to different Amazon S3 paths. This post uses an industry-standard TPC-DS 3 TB dataset, but you can also use your own dataset.
  3. Create IAM users and groups to use later in Amazon Redshift:
    • Create new IAM groups named grpA and grpB without any policies.
    • Create users a1 and b1 and add them to groups grpA and grpB, respectively. Use lower-case usernames.
    • Add the following policy to all the groups you created to allow the IAM users to request temporary credentials when authenticating against Amazon Redshift:
      {
          "Version": "2012-10-17",
          "Statement": {
              "Effect": "Allow",
              "Action": "redshift:GetClusterCredentials",
              "Resource": "arn:aws:redshift:us-east-1:0123456789:dbuser:*/*"
          }
      }

      For additional security, you may want to restrict this policy to specific database users and groups in the cluster rather than using a wildcard. If you prefer to script these IAM steps, a CLI sketch appears after this list.

  4. Create database users and groups on the Amazon Redshift cluster that match the IAM users and groups, without any password:
    • To create user a1, enter the following code:
      create user a1 password disable;

    • To create grpA, enter the following code:
      create group grpA with user a1;

    • Repeat these steps for user b1 and add the user to grpB.
  5. Install a JDBC SQL query client, such as SQL Workbench/J, on the client machine.
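
If you prefer to script the IAM portion of these prerequisites, the following AWS CLI sketch creates the groups and users from step 3 and attaches the GetClusterCredentials policy shown above (assumed here to be saved locally as redshift-credentials-policy.json; replace the names and file path with your own):

    # Create the IAM groups and users
    aws iam create-group --group-name grpA
    aws iam create-group --group-name grpB
    aws iam create-user --user-name a1
    aws iam create-user --user-name b1
    aws iam add-user-to-group --group-name grpA --user-name a1
    aws iam add-user-to-group --group-name grpB --user-name b1

    # Attach the temporary-credentials policy to both groups
    aws iam put-group-policy --group-name grpA --policy-name AllowGetClusterCredentials --policy-document file://redshift-credentials-policy.json
    aws iam put-group-policy --group-name grpB --policy-name AllowGetClusterCredentials --policy-document file://redshift-credentials-policy.json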

Use case

In the following use case, you have an AWS Glue Data Catalog with a database named tpcds3tb. Tables in this database point to Amazon S3 under a single bucket, but each table is mapped to a different prefix under the bucket. The following screenshot shows the different table locations.

You use the tpcds3tb database and create a Redshift Spectrum external schema named schemaA. You create groups grpA and grpB with different IAM users mapped to the groups. The goal is to grant different access privileges to grpA and grpB on external tables within schemaA.

This post presents two options for this solution:

  • Use the Amazon Redshift grant usage statement to grant grpA access to external tables in schemaA. The group can then access all tables in the data lake defined in that schema, regardless of which Amazon S3 locations those tables are mapped to.
  • Configure role chaining for the external schemas so that each group’s access is isolated to specific data lake locations, and access is denied to tables in the schema that point to other Amazon S3 locations.

Isolating user and group access using the grant usage privilege

You can use the Amazon Redshift grant usage privilege on schemaA, which allows grpA access to all objects under that schema. You don’t grant any usage privilege to grpB; users in that group should see access denied when querying.

  1. Create an IAM role named mySpectrum for the Amazon Redshift cluster to allow Redshift Spectrum to read Amazon S3 objects using the following policy:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": "*"
            }
        ]
    }

  2. Add a trust relationship to allow users in Amazon Redshift to assume roles assigned to the cluster. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

  3. Choose your cluster name.
  4. Choose Properties.
  5. Choose Manage IAM roles.
  6. For Available IAM roles, choose your new role. If you don’t see the role listed, select Enter ARN and enter the role’s ARN.
  7. Choose Done, as seen in the following screenshot.
  8. Use the Amazon Redshift JDBC driver that includes the AWS SDK, which you can download from the Amazon Redshift console (see the following screenshot), and connect to the cluster using the IAM connection string from a SQL client such as SQL Workbench/J. The next screenshots depict the JAR file type to select and the connection configuration in SQL Workbench/J. (You can also request the same temporary credentials from the AWS CLI, as sketched after this list.)
  9. As an Amazon Redshift admin user, create an external schema named schemaA mapped to the AWS Glue database tpcds3tb (use the IAM role you created earlier to allow Redshift Spectrum access to Amazon S3). See the following code:
    create external schema schemaA from data catalog
    database 'tpcds3tb'
    iam_role 'arn:aws:iam::0123456789:role/mySpectrum' region 'us-east-1';

  10. Verify the schema is in the Amazon Redshift catalog with the following code:
    select * from svv_external_schemas;

  11. Grant usage privilege to grpA. See the following code:
    Grant usage on schema schemaA to group grpA;

  12. Query tables in schemaA as user a1 in grpA using your SQL client. See the following code:
    Select * from schemaA.customer limit 3;
    Select * from schemaA.call_center limit 3;

    The following screenshot shows the successful query results.

  13. Query tables in schemaA as user b1 in grpB. The following screenshot shows the error message you receive.
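
The IAM connection string in step 8 works because the driver calls redshift:GetClusterCredentials on your behalf, using the policy from the prerequisites. You can request the same temporary credentials directly from the AWS CLI; the following is a sketch with a placeholder cluster identifier:

    # Request temporary database credentials for user a1 (valid for up to one hour)
    aws redshift get-cluster-credentials \
        --cluster-identifier my-redshift-cluster \
        --db-user a1 \
        --db-name dev \
        --duration-seconds 3600

The returned DbUser and DbPassword values are what the JDBC driver uses when you connect with the IAM connection string.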

This option gives you great flexibility to isolate user access on Redshift Spectrum schemas, but what if user b1 is authorized to access one or more tables in that schema, but not all of them? The second option creates coarse-grained access control policies at the Amazon S3 location level.

Isolating user and group access using IAM policies and role chaining

You can use IAM policies mapped to IAM roles with a trust relationship to specific users and groups, based on Amazon S3 location access, and assign these roles to the cluster. For this use case, grpB is authorized to access only the table catalog_page, located at s3://myworkspace009/tpcds3t/catalog_page/, and grpA is authorized to access all tables except catalog_page, located under s3://myworkspace009/tpcds3t/*. The following steps help you configure for these security requirements.

The following diagram depicts how role chaining works.

You first create IAM roles with policies specific to grpA and grpB. The first role is a generic cluster role that allows users to assume this role using a trust relationship defined in the role.
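
Conceptually, the chain is two sts:AssumeRole calls: Amazon Redshift first assumes the generic cluster role, and then uses those temporary credentials to assume the group-specific role, passing the connected database user's ARN as the external ID. The following AWS CLI sketch only illustrates the mechanism; the trust policies you create below restrict these calls to the Amazon Redshift service itself, so you wouldn't run them manually:

    # Step 1: Amazon Redshift assumes the generic cluster role
    aws sts assume-role \
        --role-arn arn:aws:iam::0123456789:role/myblog-redshift-assumeRole \
        --role-session-name redshift-cluster-session

    # Step 2: with the credentials returned above, it assumes the group-specific role,
    # supplying the dbuser ARN as the external ID so the trust condition is satisfied
    aws sts assume-role \
        --role-arn arn:aws:iam::0123456789:role/myblog-grpA-role \
        --role-session-name grpA-session \
        --external-id arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1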

  1. On the IAM console, create a new role. See the following code:
    RoleName: myblog-redshift-assumeRole

  2. Add the following two policies to this role:
    • Add a managed policy named AWSGlueConsoleFullAccess. For more restrictive security needs, you might consider adding your own inline policy with least privileges instead of this managed policy.
    • Add an inline policy called myblog-redshift-assumerole-inline with the following rules:
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "Stmt1487639602000",
                  "Effect": "Allow",
                  "Action": [
                      "sts:AssumeRole"
                  ],
                  "Resource": "arn:aws:iam::0123456789:role/*"
              }
          ]
      }

  3. Add a trust relationship that allows the users in the cluster to assume this role. You can choose to limit this to specific users as necessary. See the following code:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "redshift.amazonaws.com"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringLike": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/*"
            }
          }
        }
      ]
    }

  4. Create a new Redshift-customizable role specific to grpA with a policy allowing access only to the Amazon S3 locations this group is permitted to read. Make sure you omit the Amazon S3 location for the catalog_page table; you don’t want to authorize this group to view that data.
    • Name the role myblog-grpA-role.
  5. Add the following two policies to this role:
    • Add a managed policy named AWSGlueConsoleFullAccess to the role. For more restrictive security needs, you might consider adding your own inline policy with least privileges instead of this managed policy.
    • Add an inline policy named myblog-grpA-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/call_center/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_returns/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_address/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/customer_demographics/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/date_dim/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/item/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/promotion/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/store_sales/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_page/*",
                      "arn:aws:s3:::myworkspace009/tpcds3t/web_sales/*"
                  ]
              }
          ]
      }

  6. Add a trust relationship explicitly listing all users in grpA so that only they can assume this role. Choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details. The trust relationship has to be updated for each user added to this role, or you can build a new role for each user; it is fairly easy to script and automate updating this trust relationship for each new user.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/a1"
            }
          }
        }
      ]
    }

  7. Create another Redshift-customizable role specific to grpB with a policy restricting access only to the Amazon S3 locations this group is allowed to access.
    • Name the role myblog-grpB-role.
  8. Add the following two policies to this role. These managed and inline policies reflect the data access allowed for each DB group and are attached to the roles that are assumed on the cluster.
    • Add a managed policy named AWSGlueConsoleFullAccess to the role. For more restrictive security needs, you might consider adding your own inline policy with least privileges instead of this managed policy.
    • Add an inline policy named myblog-grpB-access-policy with the following rules (modify it to fit your security needs and allow minimal permissions):
      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Sid": "AllowS3",
                  "Effect": "Allow",
                  "Action": [
                      "s3:*"
                  ],
                  "Resource": [
                      "arn:aws:s3:::myworkspace009/tpcds3t/catalog_page/*"
                  ]
              }
          ]
      }

  9. Add a trust relationship explicitly listing all users in grpB so that only they can assume this role. Choose the Trust relationships tab and edit it to add the following policy, updating the relevant account details. This trust relationship has to be updated for each user added to this role, or you can build a role for each user; it is fairly easy to script and automate updating this trust relationship.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::0123456789:role/myblog-redshift-assumeRole"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "arn:aws:redshift:us-east-1:0123456789:dbuser:<YourRedshiftClusterName>/b1"
            }
          }
        }
      ]
    }

  10. Attach the three roles to the Amazon Redshift cluster and remove any other roles mapped to the cluster. If you don’t find any roles in the drop-down menu, use the role ARN. (A CLI alternative for this step is sketched after this procedure.)
  11. As an admin user, create a new external schema for grpA and grpB, respectively, using role chaining with the two roles you created.
    • For grpA, enter the following code:
      create external schema rc_schemaA from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpA-role' region 'us-east-1';
      
      grant usage on schema rc_schemaA to group grpA;

    • For grpB, enter the following code:
      create external schema rc_schemaB from data catalog
      database 'tpcds3tb'
      iam_role 'arn:aws:iam::0123456789:role/myblog-redshift-assumeRole,arn:aws:iam::0123456789:role/myblog-grpB-role' region 'us-east-1';
      
      grant usage on schema rc_schemaB to group grpB;

  12. Query the external schemas as users in grpA and grpB.
    • To query the customer table and catalog_page table as user a1 in grpA, enter the following code:
      select * from rc_schemaA.customer limit 3;
      select * from rc_schemaA.catalog_page limit 3;

The following screenshot shows the query results; user a1 can access the customer table successfully.

The following screenshot shows that user a1 can’t access catalog_page.

    • Query the customer table and catalog_page table as user b1 in grpB.

The following screenshot shows that user b1 can access catalog_page.

The following screenshot shows that user b1 can’t access the customer table.
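
As referenced in step 10, you can attach the roles from the AWS CLI instead of the console. This is a sketch with a placeholder cluster identifier; add a --remove-iam-roles option if you need to detach previously mapped roles:

    aws redshift modify-cluster-iam-roles \
        --cluster-identifier my-redshift-cluster \
        --add-iam-roles \
            arn:aws:iam::0123456789:role/myblog-redshift-assumeRole \
            arn:aws:iam::0123456789:role/myblog-grpA-role \
            arn:aws:iam::0123456789:role/myblog-grpB-role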

Conclusion

This post demonstrated two different ways to isolate user and group access to external schemas and tables. With the first option of using grant usage statements, the granted group has access to all tables in the schema, regardless of which Amazon S3 data lake paths the tables point to. This approach gives you great flexibility to grant access with ease, but it doesn’t let you allow or deny access to specific tables in that schema.

With the second option, you manage user and group access at the granularity of Amazon S3 locations, which gives you more control over data security and lowers the risk of unauthorized data access. This approach has some additional configuration overhead compared to the first, but can yield better data security.

In both approaches, building the right governance model up front for Amazon S3 paths, external schemas, and table mappings, based on how groups of users access them, is paramount to providing strong security with low operational overhead.

Special acknowledgment goes to AWS colleague Martin Grund for his valuable comments and suggestions.

 


About the Authors

Harsha Tadiparthi is a Specialist Sr. Solutions Architect, AWS Analytics. He enjoys solving complex customer problems in Databases and Analytics and delivering successful outcomes. Outside of work, he loves to spend time with his family, watch movies, and travel whenever possible.

 

 

 

Harshida Patel is a Data Warehouse Specialist Solutions Architect with AWS.

 

 

Enable fine-grained permissions for Amazon QuickSight authors in AWS Lake Formation

Post Syndicated from Adnan Hasan original https://aws.amazon.com/blogs/big-data/enable-fine-grained-permissions-for-amazon-quicksight-authors-in-aws-lake-formation/

We’re excited to announce the integration of Amazon QuickSight with the AWS Lake Formation security model, which provides fine-grained access control for QuickSight authors. Data lake administrators can now use the Lake Formation console to grant QuickSight users and groups permissions to AWS Glue Data Catalog databases, tables, and Amazon Simple Storage Service (Amazon S3) buckets that are registered and managed via Lake Formation.

This new feature enhances the fine-grained access control capability previously introduced in QuickSight, which allows admins to use AWS Identity and Access Management (IAM) policies to scope down QuickSight author access to Amazon S3, Amazon Athena, Amazon Relational Database Service (Amazon RDS), and Amazon Redshift. The scope-down access is enforced by attaching IAM policies to the QuickSight user or a group in the QuickSight portal. For more information, see Introducing Amazon QuickSight fine-grained access control over Amazon S3 and Amazon Athena.

For Athena-based datasets, you’re no longer required to use IAM policies to scope down QuickSight author access to Amazon S3, or Data Catalog databases and tables. You can grant permissions directly in the Lake Formation console. An added benefit is that you can also grant column-level permissions to the QuickSight users and groups. Lake Formation handles all this for you centrally.

This feature is currently available in the QuickSight Enterprise edition in the following Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)

It will soon be available in all Regions where Lake Formation exists as of this post. For more information, see Region Table.

This post compares the new fine-grained permissions model in Lake Formation to the IAM policy-based access control in QuickSight. It also provides guidance on how to migrate fine-grained permissions for QuickSight users and groups to Lake Formation.

QuickSight fine-grained permissions vs. Lake Formation permissions

In QuickSight, you can limit user or group access to AWS resources by attaching a scope-down IAM policy. If no such policies exist for a user or a group (that the user is a member of), QuickSight service role permissions determine access to the AWS resources. The following diagram illustrates how permissions work for a QuickSight user trying to create an Athena dataset.

With the Lake Formation integration, the permissions model changes slightly. The two important differences while creating an Athena dataset are:

  • Users can view the Data Catalog resources (databases and tables) that have one of the following:
    1. The IAMAllowedPrincipal security group is granted Super permission to the resource in Lake Formation.
    2. An ARN for the QuickSight user or group (that the user is a member of) is explicitly granted permissions to the resource in Lake Formation.
  • If the S3 source bucket for the Data Catalog resource is registered in Lake Formation, Amazon S3 access settings in QuickSight are ignored, including scope-down IAM policies for users and groups.

The following diagram shows the change in permission model when a QuickSight user tries to create an Athena dataset.

The following sections dive into how fine-grained permissions work in QuickSight and how you can migrate the existing permissions to the Lake Formation security model.

Existing fine-grained access control in QuickSight

For this use case, a business analyst in the marketing team, lf-qs-author, created an Athena dataset Monthly Sales in QuickSight. It was built using the month_b2bsalesdata table in AWS Glue and the data in the S3 bucket b2bsalesdata.

The following screenshot shows the table details.

The following screenshot shows the dataset details.

The dataset is also shared with a QuickSight group analystgroup. See the following screenshot of the group details.

A fine-grained IAM policy enforces access to the S3 bucket b2bsalesdata for lf-qs-author and analystgroup. The following code is an example of an Amazon S3 access policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "arn:aws:s3:::*"
        },
        {
            "Action": [
                "s3:ListBucket"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::b2bsalesdata"
            ]
        },
        {
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::b2bsalesdata/*"
            ]
        }
    ]
}

Enabling QuickSight permissions in Lake Formation

To migrate QuickSight permissions to Lake Formation, follow the steps described below, in the given order:

1.) Capturing the ARN for the QuickSight user and group

First, capture the QuickSight ARN for the business analyst user and marketing team group. You can use the describe-user API and the describe-group API to retrieve the user ARN and the group ARN, respectively. For example, to retrieve the ARN for the QuickSight group analystgroup, enter the following code in the AWS Command Line Interface (AWS CLI):

aws quicksight describe-group --group-name 'analystgroup' --aws-account-id 253914981264 --namespace default

Record the group ARN from the response, similar to the following code:

{
 "Status": 200,
 "Group": {
 "Arn": "arn:aws:quicksight:us-east-1:253914981264:group/default/analystgroup",
 "GroupName": "analystgroup",
 "PrincipalId": "group/d-906706bd27/3095e3ab-e901-479b-88da-92f7629b202d"
 },
 "RequestId": "504ec460-2ceb-46ca-844b-a33a46bc7080"
}

Repeat the same step to retrieve the ARN for the business analyst lf-qs-author.
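
The corresponding call for the user follows the same pattern. This sketch assumes the same account ID and the default namespace:

aws quicksight describe-user --user-name 'lf-qs-author' --aws-account-id 253914981264 --namespace default

Record the Arn value from the response, as you did for the group.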

2.) Granting permissions in the data lake

To grant permissions to the month_b2bsalesdata table in salesdb, complete the following steps:

  1. Sign in to the Lake Formation console as the data lake administrator.

A data lake administrator can grant any principal (IAM, QuickSight, or Active Directory) permissions to Data Catalog resources (databases and tables) or data lake locations in Amazon S3. For more information about creating a data lake administrator and the data lake security model, see AWS Lake Formation: How It Works.

  2. Choose Tables.
  3. Select month_b2bsalesdata.
  4. From the Actions drop-down menu, choose View permissions.

You see a list of principals with associated permissions for each resource type.

  5. Choose Grant.
  6. For Active Directory and Amazon QuickSight users and groups, enter the QuickSight user ARN.
  7. For Table permissions, select Select.
  8. Optionally, under Column permissions, you can grant column-level permissions to the user. This is a benefit of using Lake Formation permissions over QuickSight policies.
  9. Choose Grant.

  10. Repeat the preceding steps to grant select table permissions to analystgroup, using the ARN you recorded earlier.
  11. Select month_b2bsalesdata.
  12. From the Actions drop-down menu, choose View permissions.

The following screenshot shows the added permissions for the QuickSight user and group.
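
If you prefer to script these grants instead of using the console, the equivalent AWS CLI call looks like the following sketch. It uses the group ARN recorded earlier and the database and table names from this walkthrough; run a similar command with the user ARN for lf-qs-author:

aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:quicksight:us-east-1:253914981264:group/default/analystgroup \
    --permissions "SELECT" \
    --resource '{"Table": {"DatabaseName": "salesdb", "Name": "month_b2bsalesdata"}}'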

3.) Removing IAMAllowedPrincipal group permissions

For Lake Formation permissions to take effect, you must remove the IAMAllowedPrincipal group from the month_b2bsalesdata table.

  1. Select month_b2bsalesdata.
  2. From the Actions drop-down menu, choose View permissions.
  3. Select IAMAllowedPrincipals.
  4. Choose Revoke.

  5. Choose Revoke again to confirm.
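
You can also revoke these permissions from the AWS CLI. The following sketch assumes IAMAllowedPrincipals currently holds Super permission on the table, which the CLI expresses as ALL:

aws lakeformation revoke-permissions \
    --principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
    --permissions "ALL" \
    --resource '{"Table": {"DatabaseName": "salesdb", "Name": "month_b2bsalesdata"}}'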

4.) Registering your S3 bucket in Lake Formation

You can now register the S3 source bucket (b2bsalesdata) in Lake Formation. Registering the S3 bucket switches Amazon S3 authorization from QuickSight scope-down policies to Lake Formation security.

  1. Choose Data lake locations.
  2. Choose Register location.
  3. For Amazon S3 path, enter the path for your source bucket (s3://b2bsalesdata).
  4. For IAM role, choose the role with permissions to that bucket.
  5. Choose Register location.
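
The equivalent registration from the AWS CLI looks like the following sketch; the role ARN is a placeholder for whichever role you chose in step 4:

aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::b2bsalesdata \
    --role-arn arn:aws:iam::253914981264:role/LakeFormationDataAccessRole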

5.) Cleaning up the scope-down policies in QuickSight

You can now remove the scope-down policies for the user and group in QuickSight. To find these policies, under Security and Permissions, choose IAM policy assignments.

6.) Creating a dataset in QuickSight

To create a dataset, complete the following steps:

  1. Log in to QuickSight as a user who is a member of analystgroup (someone besides lf-qs-author).
  2. Choose Manage data.
  3. Choose New data set.
  4. Choose Athena.
  5. For the data source name, enter Marketing Data.
  6. Choose Create data source.
  7. In the list of databases, choose salesdb.
  8. Choose month_b2bsalesdata.
  9. Choose Edit/Preview data.

The following screenshot shows the details of month_b2bsalesdata table.

You can also use custom SQL to query the data.

Conclusion

This post demonstrates how to extend the Lake Formation security model to QuickSight users and groups, which allows data lake administrators to manage data catalog resource permissions centrally from one console. As organizations embark on the journey to secure their data lakes with Lake Formation, having the ability to centrally manage fine-grained permissions for QuickSight authors can extend the data governance and enforcement of security controls at the data consumption (business intelligence) layer. You can enable these fine-grained permissions for QuickSight users and groups at the database, table, or column level, and they’re reflected in the Athena dataset in QuickSight.

Start migrating your fine-grained permissions to Lake Formation today, and leave your thoughts and questions in the comments.

 


About the Author

Adnan Hasan is a Solutions Architect with Amazon QuickSight at Amazon Web Services.

 

Enforce column-level authorization with Amazon QuickSight and AWS Lake Formation

Post Syndicated from Avijit Goswami original https://aws.amazon.com/blogs/big-data/enforce-column-level-authorization-with-amazon-quicksight-and-aws-lake-formation/

Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights and integrates seamlessly with your data lake built on Amazon Simple Storage Service (Amazon S3). QuickSight users in your organization often need access to only a subset of columns for compliance and security reasons. Without a proper solution to enforce column-level security, you have to develop additional mechanisms, such as views, data masking, or encryption.

QuickSight accounts can now take advantage of AWS Lake Formation column-level authorization to enforce granular-level access control for their users.

Overview of solution

In this solution, you build an end-to-end data pipeline using Lake Formation to ingest data from an Amazon Aurora MySQL database to an Amazon S3 data lake and use Lake Formation to enforce column-level access control for QuickSight users.

The following diagram illustrates the architecture of this solution.

Walkthrough overview

The detailed steps in this solution include building a data lake using Lake Formation, which uses an Aurora MySQL database as the source and Amazon S3 as the target data lake storage. You create a workflow in Lake Formation that imports a single table from the source database to the data lake. You then use Lake Formation security features to enforce column-level security for the QuickSight service on the imported table. Finally, you use QuickSight to connect to this data lake and visualize only the columns for which Lake Formation has given access to the QuickSight user.

To implement the solution, you complete the following steps:

  1. Prerequisites
  2. Creating a source database
  3. Importing a single table from the source database
    • Creating a connection to the data source
    • Creating and registering your S3 bucket
    • Creating a database in the Data Catalog and granting permissions
    • Creating and running the workflow
    • Granting Data Catalog permissions
  4. Enforcing column-level security in Lake Formation
  5. Creating visualizations in QuickSight

Prerequisites

For this walkthrough, you should have the following prerequisites:

Creating a source database

In this step, create an Aurora MySQL database cluster and use the DDLs in the following GitHub repo to create an HR schema with associated tables and sample data.

You should then see the schema you created using the MySQL monitor or your preferred SQL client. For this post, I used SQL Workbench. See the following screenshot.

Record the Aurora database JDBC endpoint information; you need it in subsequent steps.
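
One way to capture the endpoint is with the AWS CLI. This is a sketch; the cluster identifier is a placeholder for the one you created:

aws rds describe-db-clusters \
    --db-cluster-identifier my-aurora-hr-cluster \
    --query 'DBClusters[0].[Endpoint,Port]' \
    --output table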

Importing a single table from the source database

Before you complete the following steps, make sure you have set up Lake Formation and met the JDBC prerequisites.

The Lake Formation setup creates a datalake_user IAM user. You need to add the same user as a QuickSight user. For instructions, see Managing User Access Inside Amazon QuickSight. For Role, choose AUTHOR.

Creating a connection to the data source

After you complete the Lake Formation prerequisites, which include creating the IAM users datalake_admin and datalake_user, create a connection for your Aurora database. For instructions, see Create a Connection in AWS Glue. Provide the following information:

  • Connection name – <yourPrefix>-blog-datasource
  • Connection type – JDBC
  • Database connection parameters – JDBC URL, user name, password, VPC, subnet, and security group

Creating and registering your S3 bucket

In this step, you create an S3 bucket named <yourPrefix>-blog-datalake, which you use as the root location of your data lake. After you create the bucket, you need to register the Amazon S3 path. Lastly, grant data location permissions.

Creating a database in the Data Catalog and granting permissions

Create a database in the Lake Formation Data Catalog named <yourPrefix>-blog-database, which stores the metadata tables. For instructions, see Create a Database in the Data Catalog.

After you create the database, you grant data permissions to the metadata tables to the LakeFormationWorkflowRole role, which you use to run the workflows.

Creating and running the workflow

In this step, you copy the EMPLOYEES table from the source database using a Lake Formation blueprint. Provide the following information:

  • Blueprint type – Database snapshot
  • Database connection – <yourPrefix>-blog-datasource
  • Source data path – HR/EMPLOYEES
  • Target database – <yourPrefix>-blog-database
  • Target storage location – <yourPrefix>-blog-datalake
  • Workflow name – <yourPrefix>-datalake-quicksight
  • IAM role – LakeFormationWorkflowRole
  • Table prefix – blog

For instructions, see Use a Blueprint to Create a Workflow.

When the workflow is ready, you can start the workflow and check its status by choosing View graph. When the workflow is complete, you can see the employee table available in your Data Catalog under <yourPrefix>-blog-database. See the following screenshot.

You can also view the imported data using Athena, which is integrated with Lake Formation. To do so, choose View data from the Actions drop-down menu. See the following screenshot.

Granting Data Catalog permissions

In this step, you provide the Lake Formation Data Catalog access to the IAM user datalake_user. This is the same user that you added in QuickSight to create the dashboard. For Database permissions, select Create table and Alter for this use case, but you can change the permission level based on your specific requirements. For instructions, see Granting Data Catalog Permissions.

When this step is complete, you see the permissions for your database <yourPrefix>-blog-database.

Enforcing column-level security in Lake Formation

Now that your table is imported into the data lake, enforce column-level security to the dataset. For this use case, you want to hide the Salary and Phone_Number columns from business intelligence QuickSight users.

  1. In the Lake Formation Data Catalog, choose Databases.
  2. From the list of databases, choose <yourPrefix>-blog-database.
  3. Choose View tables.
  4. Select blog_hr_employees.
  5. From the Actions drop-down menu, choose Grant.

  6. For Active Directory and Amazon QuickSight users and groups, provide the QuickSight user ARN.

You can find the ARN by entering the code aws quicksight list-users --aws-account-id <your AWS account id> --namespace default --region us-east-1 in the AWS Command Line Interface (AWS CLI).

  7. For Database, choose <yourPrefix>-blog-database.
  8. For Table, choose blog_hr_employees.
  9. For Columns, choose Exclude columns.
  10. For Exclude columns, choose salary and phone_number.
  11. For Table permissions, select Select.

You should receive a confirmation on the console that says Permission granted for: datalake_user to Exclude: <yourPrefix>-blog-database.blog_hr_employees.[salary, phone_number].

You can also verify that appropriate permission is reflected for the QuickSight user on the Lake Formation console by navigating to the Permissions tab and filtering for your database and table.

You can also specify column-level permissions in the AWS CLI with the following code:

aws lakeformation grant-permissions --principal DataLakePrincipalIdentifier=<QuickSight User ARN> --permissions "SELECT" --resource '{ "TableWithColumns": {"DatabaseName":"<yourPrefix>-blog-database", "Name":"blog_hr_employees", "ColumnWildcard": {"ExcludedColumnNames": ["salary", "phone_number"]}}}'  --region us-west-2 --profile datalake_admin
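
To confirm the grant from the CLI rather than the console, you can list the permissions on the same table. This is a sketch using the same placeholders as the preceding command:

aws lakeformation list-permissions \
    --resource '{ "Table": {"DatabaseName":"<yourPrefix>-blog-database", "Name":"blog_hr_employees"}}' \
    --region us-west-2 --profile datalake_admin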

Creating visualizations in QuickSight

In this step, you use QuickSight to access the blog_hr_employees table in your data lake. While accessing this dataset from QuickSight, you can see that QuickSight doesn’t show the salary and phone_number columns, which you excluded from the source table in the previous step.

  1. Log in to QuickSight using the datalake_user IAM user.
  2. Choose New analysis.
  3. Choose New dataset.
  4. For the data source, choose Athena.

  5. For your data source name, enter Athena-HRDB.
  6. For Database, choose <yourPrefix>-blog-database.
  7. For Tables, select blog_hr_employees.
  8. Choose Select.

  9. Choose Import to SPICE for quicker analysis or Directly query your data.

For this use case, choose Import to SPICE. This provides faster visualization in a production setup, and you can run a scheduled refresh to make sure your dashboards are referring to the current data. For more information, see Scheduled Refresh for SPICE Data Sets on Amazon QuickSight.

When you complete the previous steps, your data is imported to your SPICE machine and you arrive at the QuickSight visualization dashboard. You can see that SPICE has excluded the salary and phone_number fields from the table. In the following screenshot, we created a pie chart visualization to show how many employees are present in each department.

Cleaning up

To avoid incurring future charges, delete the resources you created in this walkthrough, including your S3 bucket, Aurora cluster, and other associated resources.

Conclusion

Restricting access to sensitive data to various users in a data lake is a very common challenge. In this post, we demonstrated how to use Lake Formation to enforce column-level access to QuickSight dashboard users. You can enhance security further with Athena workgroups. For more information, see Creating a Data Set Using Amazon Athena Data and Benefits of Using Workgroups.

 


About the Author

Avijit Goswami is a Sr. Startups Solutions Architect at AWS, helping startup customers become tomorrow’s enterprises. When not at work, Avijit likes to cook, travel, watch sports, and listen to music.

 

 

How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With pollution indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters’ lives more sustainable and convenient by bringing short-distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than those that are actually going to be used, and we position them just meters away from where our users need them and at the time of the day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and adapt quickly to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.

After the data was in Amazon S3 and adequately partitioned according to its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query it for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks in our team, because most of the data processing compute hours are actually dedicated to transforming the multiple data sources into our data warehouse: consolidating the raw data, modeling it, adding new attributes, and picking the data elements that constitute 95% of our analytics and predictive needs.

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work in the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), finally storing their results back into Amazon S3. Executions of these pipelines are tracked via MLFlow (in a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server) and the final result indicating the fleet operations required is fit into a Kepler map, which is then consumed by the operators on the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When we had to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

 


About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo’s background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI-backed solution to reposition their fleet in real time. “In God we trust. All others must bring data.” – W. Edwards Deming

 

 

 

Streaming web content with a log-based architecture with Amazon MSK

Post Syndicated from Sebastian Doell original https://aws.amazon.com/blogs/big-data/streaming-web-content-with-a-log-based-architecture-with-amazon-msk/

Content such as breaking news or sports scores requires updates in near-real time. To stay up to date, you may be constantly refreshing your browser or mobile app. Building APIs to deliver this content at speed and scale can be challenging. In this post, I present an alternative to an API-based approach. I outline the concept and virtues of a log-based architecture: a software architecture that uses a commit log to capture data changes, making it easy to build new services and databases on other services’ full datasets. These data changes can also be content for the web. The architecture enables you to stream this content in real time. It’s simple and easy to scale.

The following video clip shows you an example of this architecture in action.

In this post, I show you how you can use Amazon Managed Streaming for Apache Kafka (Amazon MSK) to build a log-based architecture, and the other technologies you need to stream content on the web. I also show you an example microblogging service that puts everything into action. For more information, see the GitHub repo. It contains a web application, the backend service that runs on AWS Fargate on Amazon Elastic Container Service (Amazon ECS), and the AWS CloudFormation templates to create the infrastructure on the AWS Cloud. You can walk through running the example on AWS Cloud9 or your local machine.

Benefits of a push model

Most websites and mobile applications distribute content by pulling it from an API. The client has to pull any new content or updates to the content by making a new request on the backend. A common behavior is a browser window refresh, or a pull-to-refresh on a mobile app. Another behavior is to poll the server for new content in defined intervals. This is known as the pull model.

When clients use this model, they create a new request to the server every time they update the content. Every request creates stress on your application. You need to check for updates in a database or cache and send data to the client. This consumes CPU and memory, beyond the eventual need to create new connections to your services.

A different approach is to update the client from the server side, known as the push model. The server pushes new content or updates to the client. It’s an asynchronous communication between the server and the client. The following diagram illustrates the architecture for publish/subscribe (pub/sub) messaging, a common pattern for this asynchronous communication.

In the pub/sub pattern, new or updated content is an event. It originates from the publisher of the event and gets distributed to the subscriber. This pattern is used with the reader and the pub/sub service, whereby the reader subscribes to the service and waits for the service to publish new articles. The pub/sub service subscribes to the article log to consume the articles that the editor published to it. A reader that is subscribed to the pub/sub services is only required to keep the connection to the service open and wait for new articles to flow in. In the example of the microblogging service, the reader uses a simple React app to read the articles. The web application keeps the connection to the backend service open, waits for new articles to be published and updates the displayed articles on publication.

When a new article is published to the log, the article is stored as a message. A message is a key-value pair, and the log preserves the order in which messages are stored. For the example use case, the message stored here is an article, but it can be any kind of data, because messages are stored as serialized bytes. The most common formats are plain text, JSON, Protobuf, and Avro. The pub/sub service consumes the messages as they flow in and publishes them to connected clients. Again, this can be a web application in a browser or a mobile application on iOS or Android. In the example, it is a React app in the browser. The subscribers receive the new articles without the need to pull for the content.

This behavior of pushing the content to the clients is called content streaming, because the client waits to read from a stream of messages. These messages contain the published content. It’s similar to video streaming, in which videos are continuously delivered in pieces to the viewer.

The virtues of log-based architectures

A common approach to give consumers access to your content is building APIs. For a long time, these have been RESTful APIs; more recently, GraphQL-based APIs have gained momentum. However, API-based architectures have several issues:

  • Many content producers define their own schema for representing the data. The names and fields vary between them, especially as the number of endpoints increases and the API evolves.
  • The endpoints made available differ in behavior. They use different semantics and have different request parameters.
  • Many APIs don’t include a feature to notify clients about new content or updates. They need additional notification services or extensive polling mechanisms to offer this behavior.

This post’s approach is to use a log-based architecture to stream content on the web and mobile. The idea is to use a log as generic mechanism to store and distribute content. For more information, see Turning the database inside-out with Apache Samza and Designing Data-Intensive Applications.

Although many different log technologies may exist, Apache Kafka has become an industry standard, and has created a rich ecosystem. Amazon MSK became generally available last year, and is a fully managed service to build applications that use Apache Kafka. The sample microblogging service you run in this post uses Amazon MSK. It appends the published posts to the log in a partition of a topic. Partitions allow you to parallelize the orderly processing of messages in a topic by splitting the data in a topic. The topic in the microblogging example has only one partition because the global order of articles should be guaranteed. With multiple partitions, only the order in a partition is guaranteed. The pub/sub service in the example consumes the articles by reading the log in chronological order and publishing them in that order to the subscribers.
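
For example, creating the single-partition article topic on the MSK cluster could look like the following sketch, run with the standard Apache Kafka command line tools against your cluster’s bootstrap brokers (the topic name and broker address are placeholders; the example repository may use different values):

bin/kafka-topics.sh --create \
    --bootstrap-server b-1.mymskcluster.example.amazonaws.com:9092 \
    --replication-factor 3 \
    --partitions 1 \
    --topic articles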

Using a log to store the articles has advantages compared to a traditional database system. Firstly, they are schema-less. They store simple key-value pairs with some additional metadata. The value can be any kind of binary-encoded data, which means that the producer and consumer are responsible to serialize and de-serialize the stored value. Therefore it’s easy to change the data stored over time without any downtime.

A log is also easy to back up and restore. You can consume and store the single messages in the log as objects to Amazon Simple Storage Service (Amazon S3) and restore the messages by republishing them to a new log. The replay of the messages happens from the object store, which is cost-effective and secure, and uses the capabilities of Amazon S3 like object versioning and lifecycle management.

Replication is also much easier compared to traditional database systems. Because the log is in chronological order, replicas are also always in order, and it’s easy to determine if they are in sync or not.

Furthermore, building derived stores (for example, the latest articles) is much easier this way. The log represents everything needed to build up those stores, whereby databases represent the latest state. Log-based architectures are inherently consistent.

A log is an ordered representation of all events that happened to the system. The events themselves are the changes to the system. This can be a new article or an updated article. The log is used to create materialized views. This can be a NoSQL database like Amazon DynamoDB, which drives a GraphQL API via AWS AppSync, or any other specific view on the data. The data in a log can materialize in any kind of view because the needed state can always be recreated by replaying the messages in the log. These views are used to eventually consume the data. Every service can have its own data store and view of the data. They can expose as much or as little of the data as they need for their view on it. The databases of these services can be more purposeful and are easier to maintain. They can use DynamoDB in one instance or Amazon Aurora in another. The example of the microblogging service, however, doesn’t use a materialized view to show the posts. It publishes and consumes directly to and from the log.

Log-based architectures are appealing to content providers because there is no distinction between accessing current and future data. Consumers are always replaying the data; they get current and all future data from where they start to read the log. The current position of the consumer is the offset. It’s a simple integer number that points to the last record the consumer read from the log. The number of messages between the offset and the latest message in the log is the lag. When a consumer starts to read messages from the offset, everything from this point on fuses to a message stream. If this is combined with a protocol that provides the capability to push data to the browser, you can stream these messages to the web.
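
You can observe this replay behavior with the standard Kafka console consumer: starting from the earliest offset streams every article published so far and then keeps the connection open for anything published afterward. A sketch, using the same placeholder broker and topic as above:

bin/kafka-console-consumer.sh \
    --bootstrap-server b-1.mymskcluster.example.amazonaws.com:9092 \
    --topic articles \
    --from-beginning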

Streaming content with gRPC

gRPC is a high-performance RPC framework that can run in any environment. It's often used to connect services in distributed backend systems, but is also applicable to connecting devices, mobile applications, and browsers in the last mile. An RPC framework allows applications to call a function in a remote process. gRPC uses protocol buffers to define a service and its remote calls. Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data and sending it over the wire; they are a powerful binary serialization tool and language. You define the structure of your data and use source code generation to create the code that reads and writes data in your data structures. You don't need to write object model wrappers in your services or on the client, and protocol buffers support many popular programming languages.

The microblogging example for this post uses Go for the backend service and grpc-web in the React app, which is the client for the service. The code for both languages is generated from one Protobuf definition. If you're developing a mobile application, there is also support for Swift and Android. Eliminating the object model wrappers that unwrap JSON into objects is a big advantage.

The most important feature for streaming content is bidirectional streaming with HTTP/2-based transport. The example uses a server-side streaming RPC to list the published articles. The client sends a request to the server and gets a stream to read the sequence of articles. Because the connection is kept alive, the server continues to push all future articles. gRPC guarantees message ordering within an individual RPC call, which is important for our example because the articles are consumed directly from the log, and the articles in the log are in chronological order.
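
The actual backend is written in Go, but the shape of a server-side streaming RPC can be sketched in Python with grpcio. The module, service, and message names below are hypothetical stand-ins for whatever the real Protobuf definition generates, and read_articles_from_log is a placeholder for a consumer of the article log.

from concurrent import futures
import grpc
# Hypothetical modules generated from the Protobuf definition.
import articles_pb2
import articles_pb2_grpc

def read_articles_from_log():
    # Placeholder: the real service consumes these from the Kafka topic in chronological order.
    return [{"title": "Hello log", "content": "First post"}]

class ArticlesService(articles_pb2_grpc.ArticlesServicer):
    def ListArticles(self, request, context):
        # A server-streaming handler returns an iterator; every yielded message is pushed to the client
        # over the same HTTP/2 connection, preserving the order of the log.
        for article in read_articles_from_log():
            yield articles_pb2.Article(title=article["title"], content=article["content"])

server = grpc.server(futures.ThreadPoolExecutor(max_workers=4))
articles_pb2_grpc.add_ArticlesServicer_to_server(ArticlesService(), server)
server.add_insecure_port("[::]:50051")
server.start()
server.wait_for_termination()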

Microblogging example service

Learning about the virtues of log-based architectures is one thing; building a service on top of this pattern is another. This post provides a complementary microblogging service that uses the pattern and gRPC to exemplify everything that you learn. For more information, see the GitHub repo.

The service has two main components:

  • A simple React app that connects to the backend service via gRPC to list all published articles and create new articles
  • A backend that implements the gRPC service that the app calls

The React app publishes new articles to a topic in Amazon MSK and lists these articles as they are appended to the log. Listing the articles is a call to the ListArticles remote procedure, which subscribes to the topic of articles and reads the log from the beginning. They are pushed to the web application as they are read from the log.

The GitHub repo also contains the CloudFormation templates to create the needed infrastructure and deploy the services. The following diagram illustrates the architecture of the services and infrastructure.

The service uses Amazon Elastic Container Registry (Amazon ECR) to store the Docker images for Fargate. The pub/sub service uses Amazon MSK for its commit log. The service is discovered via AWS Cloud Map. Clients connect to the service via an Application Load Balancer. You use this load balancer because it allows a longer idle timeout on the connection, so data can be pushed from the service to the client without the client having to manage the connection.

Running on AWS Cloud9

You can deploy and run the service from your development machine. If you want to try it out and experiment with it, you can also run it using AWS Cloud9. AWS Cloud9 is a cloud-based integrated development environment (IDE) that lets you write, run, and debug your code with just a browser.

The only prerequisite for completing this walkthrough is an AWS account. For instructions on creating one, see How do I create and activate a new AWS account?

Creating the environment and running the example consumes AWS resources. Make sure you remove all resources when you’re finished to avoid ongoing charges to your AWS account.

Creating the AWS Cloud9 environment

To create your AWS Cloud9 environment, complete the following steps:

  1. On the AWS Cloud9 console, choose Create environment.
  2. Name the environment mskworkshop.
  3. Choose Next step.
  4. For Instance type, choose small.
  5. Accept all default values and choose Next Step.
  6. On the summary page, review your inputs and choose Create environment.

AWS Cloud9 provides a default auto-hibernation setting of 30 minutes for your Amazon Elastic Compute Cloud (Amazon EC2) instances created through it. With this setting, your EC2 instances automatically stop 30 minutes after you close the IDE. They only start again when you reopen the environment.

  1. When your environment is ready, close the Welcome tab.
  2. From the remaining tab, choose New Terminal.

Your environment should look like the following screenshot.

For more information, see Working with Environments in AWS Cloud9.

Preparing the AWS Cloud9 environment

To proceed with the walkthrough, you have to install the most current version of the AWS Command Line Interface (AWS CLI), install the needed tools, and resize the AWS Cloud9 environment.

  1. To view your current version of AWS CLI, enter the following code:

Bash $ > aws --version

  1. To update to the latest version, enter the following code:

Bash $ > pip install --user --upgrade awscli

You can ignore warnings about the outdated pip version and check the installed AWS CLI version. You also need the jq command installed. It's a lightweight and flexible command-line JSON processor that the example scripts use.

  1. Install the tool with the following code:

Bash $ > sudo yum install -y jq

The client runs in the Cloud9 environment. It needs a current version of Node.js and the Yarn package manager to be installed.

  1. To manage the installed Node.js, use the Node Version Manager (nvm). See the following code:

Bash $ > curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.34.0/install.sh | bash

  1. Activate the version manager in your environment with the following code:

Bash $ > . ~/.nvm/nvm.sh

  1. Use nvm to install the current version of Node.js:

Bash $ > nvm install node

  1. Use npm to install the latest version of the Yarn package manager:

Bash $ > npm install yarn -g

You have finished preparing your AWS Cloud9 environment. Next, you clone the example repository and set up the example.

Cloning and setting up the example repository

You first need to change into the folder you want to clone the repository to. See the following code:

Bash $ > cd ~/environment

Clone the repository and enter the folder with the source code:

Bash $ > git clone https://github.com/aws-samples/aws-msk-content-streaming aws-msk-content-streaming && cd $_

You have successfully cloned the source code in your Cloud9 environment.

Resizing the environment

By default, AWS Cloud9 has 8 GB of storage attached. The example service needs various Docker containers that the provided scripts build, and they consume more than the default storage. Therefore, you have to resize the attached storage. See the following code:

Bash $ > make resize

This resizes the attached storage to 20 GB, which is sufficient for the microblogging service. If you encounter an error that the /dev/nvme0n1 device doesn’t exist, you’re not running on a Nitro-based architecture. For instructions on replacing the devices, see Moving an environment or resizing an Amazon EBS volume.

The microblogging service contains the option of a bastion host, which is a special-purpose computer on a network specifically designed to securely access resources in this network. You can access this host via SSH.

Creating an SSH key

To generate an SSH key, enter the following code (if needed, you can use this key to access Amazon MSK and your Fargate containers):

Bash $ > ssh-keygen

Press Enter three times to accept the default choices. Next, upload the public key to your Amazon EC2 Region with the following code:

Bash $ > aws ec2 import-key-pair --key-name ${C9_PROJECT} --public-key-material file://~/.ssh/id_rsa.pub

Set the key as the KEY_PAIR environment variable for the scripts to use. You’re now ready to deploy the example application to your AWS account.

Deploying the application

Before you can run the web application, you have to deploy the needed services to your AWS account. The deployment scripts create two CloudFormation stacks: one stack with the core services (VPC, NAT gateway, internet gateway, and Amazon MSK), and an application stack with the Docker containers and Application Load Balancer.

To deploy these stacks, enter the following code:

Bash $ > make deploy

The deployment process may take some time. When the deployment process is complete, you can start the web application. It starts a local development server that is also exposed to the public. See the following code:

Bash $ > make start

The build process is finished when you see a Compiled successfully! message and the URLs to the development server. You access the preview of the web application by choosing Preview, Preview Running Application on the toolbar. This opens a split view with a window and the web application.

Unfortunately, you can’t post or read any content in this preview because you didn’t configure a custom domain for the Application Load Balancer, and therefore the deployed service is only accessible via HTTP. However, AWS Cloud9 is a secure environment and expects content to be served via HTTPS.

To make the example work, you can either copy the full URL from the preview (https://12345678910.vfs.cloud9.eu-west-1.amazonaws.com/) or choose the Pop Out Into New Window icon next to the browser bar.

In the URL, replace https with http. You can now access the service with HTTP.

You can test the example by creating a new item. Give it a title and add some content. When you’re finished, choose Create Post. If you see an error because of a connection problem, refresh your browser.

Cleaning up

When you’re finished exploring the example and discovering the deployed resources, the last step is to clean up your account. The following code deletes all the resources you created:

Bash $ > make delete

Running on your machine

If you want to run the example on your local machine, you need to install the required tools:

  • Docker
  • AWS CLI
  • Node.js and the Yarn package manager
  • Linux userland with bash
  • GNU Make

Installing them enables you to build the needed Docker containers for the pub/sub service, create the needed infrastructure and deploy the containers, and run the React app to create and read posts. For more information, see the GitHub repo.

To run the infrastructure and access the example, you also clone the repository to your machine and enter the folder. See the following code:

bash $ > git clone https://github.com/aws-samples/aws-msk-content-streaming aws-msk-content-streaming && cd $_

In the next step, you build the backend service, the envoy proxy for the gRPC calls, and the needed infrastructure for the service. The envoy proxy is needed to bridge the gRPC-Web client in the web app to the gRPC server in the backend service. The calls from the web app are text-encoded, while the backend service uses the binary protobuf format. You have to set the following environment variables for the deployment to work:

bash $ > export PROJECT_NAME=<YOUR_PROJECT_NAME>

bash $ > export AWS_ACCOUNT_ID=<YOUR_ACCOUNT_ID>

bash $ > export AWS_DEFAULT_REGION=<YOUR_AWS_REGION>

bash $ > export KEY_PAIR=<YOUR_AWS_EC2_KEY_PAIR>

Replace the <> with your details, the name of the project, the account you want to deploy the infrastructure to, and the Region you want to deploy the stack to.

To deploy the same CloudFormation stacks as in the AWS Cloud9 environment, enter the following code:

bash $ > make deploy

To start a development webserver at localhost:3030 and open a browser window with this URL in your default browser, enter the following code:

bash $ > make start

The client is configured through the REACT_APP_ENDPOINT environment variable, which is set to the URL of the Application Load Balancer.

Create your articles

You can now create a new article with a title and content and publish it to the log. The list of articles should then update automatically as the new article is pushed to it. You can also test this by duplicating the tab and creating a new article in the new tab.

The following diagram illustrates the behavior of the solution from the client perspective:

The remote call to the pub/sub service subscribes the client to the list of articles in the article log, and a unidirectional gRPC stream is created. The pub/sub service pushes all available articles and all new articles to the client. The envoy proxy filters the grpc-web calls, which are in text format, and translates them into the binary gRPC calls for the pub/sub service. Inserting an article is an additional unary gRPC call to the pub/sub service.

Summary

This post discussed log-based architectures and how you can use them to stream content on the web. We talked about the virtues of gRPC to stream the content to the browser or to mobile devices. Furthermore, you experimented with a log-based architecture with a microblogging service built on Amazon MSK. You also saw how to deploy the needed infrastructure and run it with the sample code.

You can use the underlying idea and the provided example to build your own solution. Please share what you build or any questions regarding running log-based architectures on the AWS Cloud.

About the Author

Sebastian Doell is a Startup Solutions Architect at AWS. He helps startups execute on their ideas at speed and at scale. Sebastian also maintains a number of open-source projects and is an advocate of Dart and Flutter.


Process data with varying data ingestion frequencies using AWS Glue job bookmarks

Post Syndicated from Dipankar Kushari original https://aws.amazon.com/blogs/big-data/process-data-with-varying-data-ingestion-frequencies-using-aws-glue-job-bookmarks/

We often have data processing requirements in which we need to merge multiple datasets with varying data ingestion frequencies. Some of these datasets are ingested one time in full, received infrequently, and always used in their entirety, whereas other datasets are incremental, received at certain intervals, and joined with the full datasets to generate output. To address this requirement, this post demonstrates how to build an extract, transform, and load (ETL) pipeline using AWS Glue.

Using AWS Glue

AWS Glue provides a serverless environment to extract, transform, and load a large number of datasets from several sources for analytics purposes. It has a feature called job bookmarks to process incremental data when rerunning a job on a scheduled interval. A job bookmark is composed of the states for various job elements, such as sources, transformations, and targets. This is done by persisting state information from a job run that helps AWS Glue prevent reprocessing old data.

For this use case, we use an AWS Glue job with job bookmarks enabled to process files received at varying frequencies (a full dataset signifying files that are received one time, and incremental datasets signifying files that are received at regular intervals). These files are merged together. In addition to enabling job bookmarks, we also use an optional parameter, transformation_ctx (transformation context), in an AWS Glue PySpark dynamic frame. This acts as a unique identifier for the ETL operator instance to identify state information within a job bookmark for a given operator. AWS Glue uses transformation_ctx to index the key to the bookmark state.

You can capture and maintain state information for incremental datasets and avoid reprocessing by using transformation context. Transformation context is omitted for the full dataset file, which results in the job run state information not getting captured for the full dataset and allowing it to participate in the next processing event in its entirety. Even though the job bookmark flag is enabled at the AWS Glue job level, because transformation context is omitted for the full dataset, every time the job runs, the entire data from the full dataset is used as part of the job. In contrast, only the newly arrived datasets are processed for the incremental datasets.
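
For reference, a bookmark-enabled script also needs the standard job initialization and commit calls; job.commit() is what persists the bookmark state at the end of a successful run. The following minimal sketch shows that boilerplate (the argument list and the placeholder comment are illustrative, not the full script used in this post):

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)  # loads the bookmark state for this job

# ... create dynamic frames (with or without transformation_ctx), join, and write the output ...

job.commit()  # persists the bookmark state so the next run skips already processed data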

Solution overview

To demonstrate the job bookmark utility of AWS Glue, we use TLC Trip Record Data datasets. We use NYC yellow taxi trip data monthly files as the incremental dataset, and NYC taxi zone lookup as the full dataset. The monthly yellow taxi trip data has a field named PULocationID (where a customer was picked up), which is joined with the LocationID field from the NYC taxi zone lookup file to create an output dataset that contains Borough, Zone, and service_zone from the NYC taxi zone lookup dataset and all the fields (except the PULocationID field) from the monthly NYC taxi trip data file.

The following diagram depicts a high-level architecture of the process.

The diagram includes the following components:

  • Two raw Amazon S3 bucket locations store the incoming CSV source data: the NYC taxi monthly files (the incremental dataset) and the NYC taxi zone lookup file (the full dataset).
  • A bookmark-enabled AWS Glue job joins the monthly trip data file with the taxi zone lookup file to generate output Parquet files, and creates the NYC taxi trip table in the AWS Glue Data Catalog and the Amazon Redshift database.
  • A curated S3 bucket stores the processed monthly NYC taxi Parquet files.

Creating the AWS CloudFormation stack

You use the following AWS CloudFormation template to create the required resources in your preferred AWS account and Region.

Additionally, make sure you have an Amazon EC2 key pair created in the account and Region you’re launching the stack from.

To provide the stack parameters, complete the following steps:

  1. For Stack name, enter BigDataBlog-GlueBookmark-Stack.

  1. For RedshiftClusterIdentifier, enter bigdatablogrscluster.
  2. For NodeType, choose large.
  3. For NumberOfNodes, choose 2.
  4. For DatabaseName, enter bigdatablogdev.

  1. For MasterUserName, enter bigdatabloguser.
  2. For MasterUserPassword, enter a password for the master user account.
  3. For Maintenancewindow, enter sun:05:00-sun:05:30.
  4. For EC2InstanceType, choose micro.
  5. For SubscriptionEmail, enter your preferred email.
  6. For MyIPAddressCidr, enter your IP address.

You can find your IP address by browsing to https://www.whatismyip.com/ and looking up the value of My Public IPv4. Add /32 at the end to make it CIDR-compatible and most restrictive.

  1. For DestinationPrefixListId, enter your prefix list ID.

To find your ID, set AWS credentials by entering aws configure in the command prompt. Run aws ec2 describe-prefix-lists to get the PrefixListId where PrefixListName is com.amazonaws.<<AWS region>>.s3 from the output.

  1. For NewS3BucketName, enter the name of your S3 bucket.

  1. For gluedatabase, enter bigdatabloggluedb.
  2. For EC2KeyName, enter the name of your key pair.

For instructions on creating a stack, see Creating a Stack on the AWS CloudFormation Console.

Make sure the stack is complete before moving to the next steps.

Creating the AWS Glue job

To create your AWS Glue job, complete the following steps:

  1. Download NYC yellow monthly trip data for October 2019 and November 2019 and save them under the s3://<<Your S3 Bucket>>/tripdata/ prefix.
  2. Download the NYC Taxi Zone lookup table and save it under the s3://<<Your S3 Bucket>>/tripdata-lookup/ prefix.
  3. Use the following PySpark script and change the piece of the code enclosed inside <<…>>.

You can find the values for the following keys on the Outputs tab for the CloudFormation stack:

    • S3Bucket
    • Snstopic

You can find the values for the following keys on the Parameters tab for the CloudFormation stack:

    • EC2KeyName
    • MyIPAddressCidr
    • NewS3BucketName
    • SubscriptionEmail

  1. When the AWS Glue script is ready, upload it to the S3 bucket under the s3://<<Your S3 Bucket>>/glue-script/ prefix.

You refer to this when you create the AWS Glue job.

  1. On the AWS Glue console, under ETL, choose Jobs.
  2. Choose Create job.
  3. For Name, enter a name for the job. For more information about AWS Glue job naming, see Jobs.
  4. For IAM role, choose the role the CloudFormation template created. Use the value for the key Glueaccessrole from the stack outputs.
  5. For Type, choose Spark.
  6. For Glue version, choose Spark 2.4, Python 3 (Glue Version 1.0).
  7. For This job runs, choose An existing script that you provide.
  8. For S3 path where the script is stored, choose the script file that you saved earlier under the s3://<<Your S3 Bucket>>/glue-script/ prefix.
  9. In the Advanced properties section, for Job bookmark, choose Enable.
  10. For Catalog options, select Use Glue Data Catalog as the Hive metastore.
  11. For Connections, enter the value of the key GlueConnection from the stack outputs.
  12. Choose Save job and edit script.

Creating an Amazon Redshift database schema

Before you run the AWS Glue job, you need to connect to the Amazon Redshift cluster and create an Amazon Redshift database schema named Glue_bookmark_redshift_schema. To connect to the cluster, use one of the JDBC client-based SQL tools, such as SQL Workbench/J. For instructions, see How can I access a private Amazon Redshift cluster from my local machine?

To access the cluster, you use the Amazon Redshift master user bigdatabloguser (the value for MasterUserName on the Parameters tab of the CloudFormation stack) and the password you provided when creating the stack.

Running AWS Glue job

The AWS Glue job takes only one argument: the name of the file being processed. Pass the file name, such as yellow_tripdata_2019-10.csv, while processing that file. This enables you to track the records that belong to a specific file, so it's easier to evaluate the result of multiple job runs using different files.
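
Inside the script, the file name argument can be read with getResolvedOptions; the argument name file_name below is illustrative and should match whatever the actual script expects.

import sys
from awsglue.utils import getResolvedOptions

# Passed as a job parameter, for example: --file_name yellow_tripdata_2019-10.csv
args = getResolvedOptions(sys.argv, ["JOB_NAME", "file_name"])
input_file = args["file_name"]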

When the Glue job run is successful, you can see the output Parquet files under the /tripdata-joined-output/ prefix inside the S3 bucket you created by running the CloudFormation template. You can also use Amazon Athena to query the data from the table created in the Data Catalog. For more information, see Running SQL Queries Using Amazon Athena.

Query the Amazon Redshift database table named redshift_bookmark_table and review the output.

Explaining the solution

A bookmark-enabled AWS Glue job (in PySpark) is created that reads the NYC yellow taxi trip's monthly file, joins it with the NYC taxi zone lookup file, produces files in Parquet format, and saves them in an Amazon S3 location.

A Data Catalog table is created that refers to the Parquet files’ location in Amazon S3. The resulting dataset is also loaded into an Amazon Redshift table using the AWS Glue PySpark job.

The AWS Glue job bookmark transformation context is used while the AWS Glue dynamic frame is created by reading a monthly NYC taxi file, whereas the transformation context is disabled while reading and creating the dynamic frame for the taxi zone lookup file (because the entire file is required for processing each monthly trip file). This allows you to process each monthly trip file exactly one time and reuse the entire taxi zone lookup file as many times as required because the missing transformation context for the lookup file doesn’t allow the bookmark context to be set for that file.

When a new NYC trip data monthly file arrives and the AWS Glue job runs, it only processes the newly arrived monthly file and ignores any previously processed monthly files. Similarly, when the Data Catalog table data is copied into Amazon Redshift, it only copies the newly processed underlying Parquet files’ data and appends it to the Amazon Redshift table. At this time the transformation context is enabled to utilize the job bookmark, and the AWS Glue dynamic frame is created by reading the Data Catalog table.

The following PySpark code uses the transformation context to create an AWS Glue dynamic frame while reading the monthly incremental file:

taxidata = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths": [InputDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'},transformation_ctx = "taxidata")

However, the following code omits transformation context when creating the AWS Glue dynamic frame for the lookup file:

Lookupdata  = GlueContext.create_dynamic_frame_from_options(connection_type="s3",connection_options = {"paths":[InputLookupDir]},format="csv",format_options={"withHeader": True,"separator": ",","quoteChar": '"',"escaper": '"'})

Additionally, the following code uses the transformation context while reading the Data Catalog table, which is loaded into an Amazon Redshift table:

datasource0 = GlueContext.create_dynamic_frame.from_catalog(database = Glue_catalog_database, table_name = Glue_table_name, transformation_ctx = "datasource0")
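
For context, the remaining steps of the job (joining the two dynamic frames and writing the Parquet output) might look roughly like the following sketch. The column names, output path variable, and transformation_ctx value are illustrative and not taken from the original script; GlueContext refers to the same object used in the snippets above.

from awsglue.transforms import Join

# Join the incremental trip data with the full lookup dataset.
joined = Join.apply(taxidata, Lookupdata, "pulocationid", "locationid")

# Write the joined result as Parquet; the transformation_ctx lets the bookmark track this sink as well.
GlueContext.write_dynamic_frame.from_options(
    frame=joined,
    connection_type="s3",
    connection_options={"path": OutputDir},
    format="parquet",
    transformation_ctx="joined_output",
)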

You can see in the screenshot below that the 2019 October yellow taxi trip data file has arrived for processing (the incremental dataset).

To process each month’s data, you need the taxi zone lookup (full dataset).

The following screenshot shows the output of the AWS Glue job after processing the 2019 October trip data, saved in Parquet format.

The following two screenshots show the Amazon Redshift table: the first displays the record count for the October 2019 taxi data, and the second confirms that only the October 2019 taxi data file has been processed so far.

The following screenshot shows that the November 2019 NYC taxi data file has arrived for processing.

The following screenshot shows the output of the AWS Glue job after processing the 2019 November trip data, saved in Parquet format. The job only processed the November data and didn't reprocess the October data because the job bookmark and transformation context were enabled.

The following screenshot shows that the Amazon Redshift table now has both October and November data and shows the total record count.

The following screenshot shows individual record count for each month.

Querying with Athena

You can also review the dataset in Athena, which uses the same Glue Data Catalog. The following screenshot of an Athena query shows the Data Catalog table has both October and November data, with the total record count.

The following screenshot of an Athena query shows the individual record count for each month.

The following screenshot shows the location information, including borough, zone, and service zone, which is available in the taxi zone lookup and is joined with the October taxi trip data.

The following screenshot shows the output for the same query on the November data.

Cleaning up

When you’re done using this solution, you should delete the CloudFormation stack to avoid incurring any further charges.

Conclusion

This post describes how you can merge datasets received in different frequencies as part of your ETL pipeline processing using AWS Glue job bookmarks. The use case demonstrated how to use job bookmarks and transformation context to build an ETL pipeline for processing several incremental datasets.

About the Authors

Dipankar is a Senior Data Architect with AWS Professional Services, helping customers build analytics platform and solutions. He has a keen interest in distributed computing. Dipankar enjoys spending time playing chess and watching old Hollywood movies.


Ashok Padmanabhan is a big data consultant with AWS Professional Services, helping customers build big data and analytics platform and solutions. When not building and designing data lakes, Ashok enjoys spending time at beaches near his home in Florida.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies are looking to access all their data, all the time, by all users and get fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model that allowed cooler data to be maintained at the lowest cost in S3, and maintain the hottest data in Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to make sure costs are maintained. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used during their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column            Type
O_ORDERKEY        int4
O_CUSTKEY         int4
O_ORDERSTATUS     varchar
O_TOTALPRICE      numeric
O_ORDERDATE       date
O_ORDERPRIORITY   varchar
O_CLERK           varchar
O_SHIPPRIORITY    int4
O_COMMENT         varchar
SKIP              varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('select o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate') 
TO 's3://tpc-bucket/orders/' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS parquet allowoverwrite PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible based on the query patterns intended. The query pattern is often by date ranges; for this use case, use the o_orderdate field as the partition key.

Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results to one or more files per node slice (virtual worker in the Amazon Redshift cluster) which allows you to use the Amazon Redshift MPP architecture. However, this can potentially cause files created by every slice to be small. In Moovit’s use case, the default UNLOAD using PARALLEL ON yielded dozens of small (MBs) files. For Moovit, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the LEADER node and wrote it out as a single stream controlling the file size using the MAXFILESIZE option.

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently-used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
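
If you prefer to create and run the crawler programmatically instead of through the console, a hedged boto3 sketch looks like the following; the crawler name and IAM role are placeholders, while the database name and S3 path match the examples in this post.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-crawler",  # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role with S3 and AWS Glue permissions
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://tpc-bucket/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")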

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.
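
You can also submit the same kind of query through the Athena API instead of the console. The following boto3 sketch assumes the datalake database from this walkthrough and a placeholder S3 output location for query results.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString="SELECT count(*) FROM orders",
    QueryExecutionContext={"Database": "datalake"},
    ResultConfiguration={"OutputLocation": "s3://tpc-bucket/athena-results/"},  # placeholder output path
)
print(response["QueryExecutionId"])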

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Redshift clusters to query the same data in the data lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when the SQL query references an external table stored in Amazon S3. Multiple large queries can run in parallel by using Amazon Redshift Spectrum on external tables to scan, filter, aggregate, and return rows from Amazon S3 back to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum 
FROM data catalog 
DATABASE 'datalake' 
iam_role 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS (SELECT * FROM public.orders WHERE o_orderdate >= dateadd('day',-90,date_trunc('day',getdate())) 
UNION ALL SELECT * FROM spectrum.orders WHERE o_orderdate < dateadd('day',-90,date_trunc('day',getdate()))) WITH NO SCHEMA BINDING;

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within Moovit can access the data, analyze it with various tools, and come up with new insights.

As an additional benefit, Moovit reduced their Amazon Redshift utilized storage, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Keeping only hot data on the Amazon Redshift cluster prevents Moovit from deleting data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.


Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.


Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.


Access web interfaces securely on Amazon EMR launched in a private subnet using an Application Load Balancer

Post Syndicated from Hitesh Parikh original https://aws.amazon.com/blogs/big-data/access-web-interfaces-securely-on-amazon-emr-launched-in-a-private-subnet-using-an-application-load-balancer/

Amazon EMR web interfaces are hosted on the master node of an EMR cluster. When you launch an EMR cluster in a private subnet, the EMR master node doesn’t have a public DNS record. The web interfaces hosted in a private subnet aren’t easily accessible outside the subnet. You can use an Application Load Balancer (ALB) as an HTTPS proxy to access EMR web interfaces over the internet without requiring SSH tunneling through a bastion host. This approach greatly simplifies accessing EMR web interfaces.

This post outlines how to use an ALB to securely access EMR web interfaces over the internet for an EMR cluster launched in a private subnet.

Solution overview

Nodes that are launched within a VPC subnet can’t communicate outside of the subnet unless one of the following exists:

  • A network route from the subnet to other subnets in its VPC
  • A route to subnets in other VPCs using VPC peering
  • A route through AWS Direct Connect to the subnet
  • A route to an internet gateway
  • A route to the subnet from a VPN connection

If you want the highest level of security to an EMR cluster, you should place the cluster in a subnet with a minimal number of routes to the cluster. This makes it more difficult to access web interfaces running on the master node of an EMR cluster launched in a private subnet.

This solution uses an internet-facing ALB that acts as an HTTPS proxy to web interface endpoints on the EMR master node. The ALB listens on HTTPS ports for incoming web interface access requests and routes requests to the configured ALB targets that point to the web interface endpoints on the EMR master node.

The following diagram shows the network flow from the client to the EMR master node through Amazon Route 53 and ALB to access the web interfaces running on the EMR master node in a private subnet.

Securing your endpoints

The solution outlined in this post restricts access to EMR web interfaces for a range of client IP addresses using an ingress security group on the ALB. You should further secure the endpoints that are reachable through the ALB by adding a user authentication mechanism such as LDAP or SSO. For more information about Jupyter authentication methods, see Adding Jupyter Notebook Users and Administrators. For more information about Hue, see Configure Hue for LDAP Users. For more information about Hive, see User and Group Filter Support with LDAP Atn Provider in HiveServer2.

Additionally, it may be a good idea to enable access logs through the ALB. For more information about ALB access logs, see Access Logs for Your Application Load Balancer.

Solution walkthrough

When a client accesses an EMR web interface, the process includes the following sequence of steps:

  • A client submits an EMR web interface request from a web browser (for example, YARN Node Manager).
  • Route 53 resolves the HTTPS request using the record set name sample-emr-web in the hosted zone example.com for the registered domain example.com. Route 53 resolves the request URL to the IP address of the ALB, and routes the request to the ALB.
  • The ALB receives the EMR web interface request on its HTTPS listener and forwards it to the web interface endpoint configured in the load balancer target group. There are multiple HTTPS listener and load balancer target group pairs created, one pair for each EMR web interface endpoint.
  • The ALB ingress security group controls what other VPCs or corporate networks can access the ALB.
  • The EMR ingress security group on the master node allows inbound traffic from the ALB to the EMR master node.

The AWS CloudFormation template for this solution creates the following AWS objects in the solution stack:

  • An ALB.
  • HTTPS listener and target pairs; one pair for each EMR web application. It supports Ganglia, YARN Resource Manager, JupyterHub, Livy, and Hue EMR web applications. You can modify the CloudFormation stack to add ALB HTTPS listeners and targets for any other EMR web applications. The following AWS CloudFormation code example shows the code for the ALB, HTTPS listener, and load balancer target:
    # EMR ALB Resources
      # ALB, Target Groups, Listeners and R53 RecordSet
      SampleEmrApplicationLoadBalancer:
        Type: AWS::ElasticLoadBalancingV2::LoadBalancer
        Properties:
          IpAddressType: ipv4
          Name: sample-emr-alb
          Scheme: internet-facing
          SecurityGroups:
            - !Ref AlbIngressSecurityGroup
    
          Subnets:
            - !Ref ElbSubnet1
            - !Ref ElbSubnet2
          LoadBalancerAttributes:
            -
              Key: deletion_protection.enabled
              Value: false
          Tags:
            -
              Key: businessunit
              Value: heloc
            -
              Key: environment
              Value: !Ref EnvironmentName
            -
              Key: name
              Value: sample-emr-alb
    
      ALBHttpGangliaTargetGroup:
        Type: 'AWS::ElasticLoadBalancingV2::TargetGroup'
        Properties:
          HealthCheckIntervalSeconds: 30
          HealthCheckTimeoutSeconds: 5
          HealthyThresholdCount: 3
          UnhealthyThresholdCount: 5
          HealthCheckPath: '/ganglia'
          Matcher:
            HttpCode: 200-399
          Name: sample-emr-ganglia-tgt
          Port: 80
          Protocol: HTTP
          VpcId: !Ref VpcID
          TargetType: instance
          Targets:
           - Id: !Ref EMRMasterEC2NodeId
             Port: 80
          Tags:
            -
              Key: Name
              Value: sample-emr-ganglia-tgt
            -
              Key: LoadBalancer
              Value: !Ref SampleEmrApplicationLoadBalancer
    
      ALBHttpsGangliaListener:
        Type: 'AWS::ElasticLoadBalancingV2::Listener'
        Properties:
          DefaultActions:
            - Type: forward
              TargetGroupArn: !Ref ALBHttpGangliaTargetGroup
          LoadBalancerArn: !Ref SampleEmrApplicationLoadBalancer
          Certificates:
            - CertificateArn: !Ref SSLCertificateArn
          Port: 443
          Protocol: HTTPS

  • The AWS::Route53::RecordSet object (sample-emr-web) in the hosted zone (example.com) for a given registered domain (example.com). The hosted zone and record set name are parameters on the CloudFormation template.
  • An Ingress Security Group attached to the ALB that controls what CIDR blocks can access the ALB. You can modify the template to customize the security group to meet your requirements.

For more information and to download the CloudFormation stack, see the GitHub repo.

Prerequisites

To follow along with this walkthrough, you need the following:

  • An AWS account.
  • A VPC with private and public subnets. An ALB requires at least two Availability Zones, with one public subnet in each Availability Zone. For the sample code to create a basic VPC with private and public subnets, see the GitHub repo.
  • An EMR cluster launched in a private subnet.
  • Web applications such as Ganglia, Livy, Jupyter, and Hue installed on the EMR cluster when the cluster is launched.
  • A hosted zone entry in Route 53 for your domain. If you don’t have a domain, you can register a new one in Route 53. There is a non-refundable cost associated with registering a new domain. For more information, see Amazon Route 53 Pricing.
  • A public certificate to access HTTPS endpoints in the domain. You can request a public certificate if you don’t have one.

Creating an ALB as an HTTPS proxy

To create an ALB as an HTTPS proxy in front of an EMR cluster, you first launch the CloudFormation stack.

  1. Log in to your AWS account.
  2. Select the Region where you’re running your EMR cluster.
  3. To launch your CloudFormation stack, choose Launch Stack.

  4. Enter your parameter values and follow the screen prompts to create the stack.

The following screenshot shows examples of stack parameters.

  1. Modify the EMR master node security group to allow ingress traffic from the ALB.
  2. Create a Custom TCP rule with port range 80–65535.
  3. Add a source security group that is attached to the ALB.

In the following steps, you add an inbound rule to the security group.

  1. Choose the EMR master node security group.
  2. Choose Security group for master on the EMR cluster Summary tab to open the security group.

  1. Choose Edit inbound rules.

  1. Choose Add Rule.

  1. Add a port range and select the ALB security group as a source security group.
  2. Choose Save rules.
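
If you would rather script this change than use the console, a minimal boto3 sketch with placeholder security group IDs adds the same inbound rule:

import boto3

ec2 = boto3.client("ec2")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # EMR master node security group (placeholder)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 80,
        "ToPort": 65535,
        "UserIdGroupPairs": [{"GroupId": "sg-0fedcba9876543210"}],  # ALB security group (placeholder)
    }],
)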

  1. Test the following EMR web interfaces in your browser:
    1. Ganglia: https://sample-emr-web.[web domain]/ganglia/
    2. YARN Resource Manager: https://sample-emr-web.[web domain]:8088/cluster
    3. JupyterHub: https://sample-emr-web.[web domain]:9443/hub/login
    4. Hue: https://sample-emr-web.[web domain]:8888/hue/accounts/login
    5. Livy: https://sample-emr-web.[web domain]:8998/ui

If you don’t get a response from web interface endpoints, disconnect from your VPN connection and test it. Some organizations may block outgoing web requests on ports other than 80.

Sometimes propagation of Route 53 DNS record updates to the worldwide network of DNS servers takes longer than usual. If you don't get a response from the EMR web interfaces, wait a minute or two after the CloudFormation stack is created and test again.

You can add code to support other EMR web interface endpoints in the CloudFormation template. For more information, see View Web Interfaces Hosted on Amazon EMR Clusters.

Locating the public certificate ARN from AWS Certificate Manager

You can find the public certificate ARN from AWS Certificate Manager (ACM) on the ACM console. When you expand the domain for a given certificate, locate the ARN in the Details section.

Creating a hosted zone from Route 53

To create a hosted zone from Route 53, complete the following steps:

  1. On the Route 53 console, choose Hosted zones.
  2. Choose the hosted zone in your domain.
  3. In the Hosted Zone Details section, copy the entry for Domain Name.

  1. Enter the domain name in the R53 Hosted Zone AWS CloudFormation parameter box.

Cost breakdown

The following AWS Cost Explorer report table shows an example total cost and cost breakdown by services for the time it takes to complete this walkthrough. This cost includes the cost for a minimal EMR cluster created without any data stored at the start of the exercise, and other resources that the CloudFormation template creates.

Cleaning up

To avoid incurring future charges, delete the CloudFormation stack to delete all the resources created.

Conclusion

You can now create an ALB as an HTTPS proxy to access EMR web interfaces securely over the internet, without requiring a bastion host for SSH tunneling. This simplifies secure access to the web interfaces of an EMR cluster launched in a private subnet.

About the Authors

Hitesh Parikh is a Cloud Architect with AWS. Hitesh is passionate about partnering with customers on their cloud adoption journey and building innovative and modern cloud native digital solutions on AWS. Outside of work, he loves to spend time with his family, travel, watch movies, and do community service.


James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.


Introducing Cache Analytics

Post Syndicated from Jon Levine original https://blog.cloudflare.com/introducing-cache-analytics/


Today, I’m delighted to announce Cache Analytics: a new tool that gives deeper exploration capabilities into what Cloudflare’s caching and content delivery services are doing for your web presence.

Caching is the most effective way to improve the performance and economics of serving your website to the world. Unsurprisingly, customers consistently ask us how they can optimize their cache performance to get the most out of Cloudflare.

With Cache Analytics, it’s easier than ever to learn how to speed up your website, and reduce traffic sent to your origin. Some of my favorite capabilities include:

  • See what resources are missing from cache, expired, or never eligible for cache in the first place
  • Slice and dice your data as you see fit: filter by hostnames, or see a list of top URLs that miss cache
  • Switch between views of requests and data transfer to understand both performance and cost
An overview of Cache Analytics

Cache Analytics is available today for all customers on our Pro, Business, and Enterprise plans.

In this blog post, I’ll explain why we built Cache Analytics and how you can get the most out of it.

Why do we need analytics focused on caching?

If you want to scale the delivery of a fast, high-performance website, then caching is critical. Caching has two main goals:

First, caching improves performance. Cloudflare data centers are within 100ms of 90% of the planet; putting your content in Cloudflare’s cache gets it physically closer to your customers and visitors, meaning that visitors will see your website faster when they request it! (Plus, reading assets on our edge SSDs is really fast, rather than waiting for origins to generate a response.)

Second, caching helps reduce bandwidth costs associated with operating a presence on the Internet. Origin data transfer is one of the biggest expenses of running a web service, so serving content out of Cloudflare’s cache can significantly reduce costs incurred by origin infrastructure.

Because it’s not safe to cache all content (we wouldn’t want to cache your bank balance by default), Cloudflare relies on customers to tell us what’s safe to cache with HTTP Cache-Control headers and page rules. But even with page rules, it can be hard to understand what’s actually getting cached — or more importantly, what’s not getting cached, and why. Is a resource expired? Or was it even eligible for cache in the first place?

Faster or cheaper? Why not both!

Cache Analytics was designed to help users understand how Cloudflare’s cache is performing, but it can also be used as a general-purpose analytics tool. Here I’ll give a quick walkthrough of the interface.

First, at the top-left, you should decide if you want to focus on requests or data transfer.

Cache Analytics enables you to toggle between views of requests and data transfer.

As a rule of thumb, requests (the default view) is more useful for understanding performance, because every request that misses cache results in a performance hit. Data transfer is useful for understanding cost, because most hosts charge for every byte that leaves their network — every gigabyte served by Cloudflare translates into money saved at the origin.

You can always toggle between these two views while keeping filters enabled.

A filter for every occasion

Let’s say you’re focused on improving the performance of a specific subdomain on your zone. Cache Analytics allows flexible filtering of the data that’s important to you:

Cache Analytics enables flexible filtering of data.

Filtering is essential for zooming in on the chunk of traffic that you’re most interested in. You can filter by cache status, hostname, path, content type, and more. This is helpful, for example, if you’re trying to reduce data transfer for a specific subdomain, or are trying to tune the performance of your HTML pages.

Seeing the big picture

When analyzing traffic patterns, it’s essential to understand how things change over time. Perhaps you just applied a configuration change and want to see the impact, or just launched a big sale on your e-commerce site.


“Served by Cloudflare” indicates traffic that we were able to serve from our edge without reaching your origin server. “Served by Origin” indicates traffic that was proxied back to origin servers. (It can be really satisfying to add a page rule and see the amount of traffic “Served by Cloudflare” go up!)

Note that this graph will change significantly when you switch between “Requests” and “Data Transfer.” Revalidated requests are particularly interesting; because Cloudflare checks with the origin before returning a result from cache, these count as “Served by Cloudflare” for the purposes of data transfer, but “Served by Origin” for the purposes of “requests.”

Slicing the pie

After the high-level summary, we show an overview of cache status, which explains why traffic might be served from Cloudflare or from origin. We also show a breakdown of cache status by Content-Type to give an overview on how different components of your website perform:


Cache statuses are also essential for understanding what you need to do to optimize cache ratios. For example:

  • Dynamic indicates that a request was never eligible for cache, and went straight to origin. This is the default for many file types, including HTML. Learn more about making more content eligible for cache using page rules. Fixing this is one of the fastest ways to reduce origin data transfer cost.
  • Revalidated indicates content that was expired, but after Cloudflare checked the origin, it was still fresh! If you see a lot of revalidated content, it’s a good sign you should increase your Edge Cache TTLs through a page rule or max-age origin directive. Updating TTLs is one of the easiest ways to make your site faster.
  • Expired resources are ones that were in our cache, but were expired. Consider if you can extend TTLs on these, or at least support revalidation at your origin.
  • A miss indicates that Cloudflare has not seen that resource recently. These can be tricky to optimize, but there are a few potential remedies: enable Argo Tiered Caching to check another data center’s cache before going to origin, or use a Custom Cache Key to make multiple URLs match the same cached resource (for example, by ignoring the query string).
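
To act on the Revalidated and Expired statuses above, your origin ultimately has to send longer cache lifetimes. The following minimal Python sketch shows a toy origin that sets the max-age directive discussed above; the paths and TTL values are illustrative assumptions, not recommendations:

# Toy origin: opt static assets into long-lived caching via Cache-Control max-age.
# Paths and TTLs below are illustrative placeholders.
from http.server import BaseHTTPRequestHandler, HTTPServer

class OriginHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        if self.path.startswith("/static/"):
            # Cacheable at the edge and in browsers for 30 days
            self.send_header("Cache-Control", "public, max-age=2592000")
        else:
            # HTML stays dynamic unless a page rule says otherwise
            self.send_header("Cache-Control", "no-cache")
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"ok")

HTTPServer(("0.0.0.0", 8080), OriginHandler).serve_forever()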

For a full explanation of each cache status, see our help center.

To the Nth dimension

Finally, Cache Analytics shows a number of what we call “Top Ns” — various ways to slice and dice the above data on useful dimensions.


It’s often helpful to apply filters (for example, to a specific cache status) before looking at these lists. For example, when trying to tune performance, I often filter to just “expired” or “revalidated,” then see if there are a few URLs that dominate these stats.

But wait, there’s more

Cache Analytics is available now for customers on our Pro, Business, and Enterprise plans. Pro customers have access to up to 3 days of analytics history. Business and Enterprise customers have access to up to 21 days, with more coming soon.

This is just the first step for Cache Analytics. We’re planning to add more dimensions to drill into the data. And we’re planning to add even more essential statistics — for example, about how cache keys are being used.

Finally, I’m really excited about Cache Analytics because it shows what we have in store for Cloudflare Analytics more broadly. We know that you’ve asked for many features, like per-hostname analytics or the ability to see top URLs, for a long time, and we’re hard at work on bringing these to Zone Analytics. Stay tuned!

Best practices for Amazon Redshift Federated Query

Post Syndicated from Joe Harris original https://aws.amazon.com/blogs/big-data/amazon-redshift-federated-query-best-practices-and-performance-considerations/

Amazon Redshift Federated Query enables you to use the analytic power of Amazon Redshift to directly query data stored in Amazon Aurora PostgreSQL and Amazon RDS for PostgreSQL databases. For more information about setting up an environment where you can try out Federated Query, see Accelerate Amazon Redshift Federated Query adoption with AWS CloudFormation.

Federated Query enables real-time data integration and simplified ETL processing. You can now connect live data sources directly in Amazon Redshift to provide real-time reporting and analysis. Previously, you needed to extract data from your PostgreSQL database to Amazon Simple Storage Service (Amazon S3) and load it to Amazon Redshift using COPY or query it from Amazon S3 with Amazon Redshift Spectrum. For more information about the benefits of Federated Query, see Build a Simplified ETL and Live Data Query Solution using Amazon Redshift Federated Query.

This post discusses 10 best practices to help you maximize the benefits of Federated Query when you have large federated data sets, when your federated queries retrieve large volumes of data, or when you have many Redshift users accessing federated data sets. These techniques are not necessary for general usage of Federated Query. They are intended for advanced users who want to make the most of this exciting feature.

The best practices are divided into two sections: the first for advice that applies to your Amazon Redshift cluster, and the second for advice that applies to your Aurora PostgreSQL and Amazon RDS for PostgreSQL environments.

The code examples provided in this post derive from the data and queries in the CloudDataWarehouseBenchmark GitHub repo (based on TPC-H and TPC-DS).

Best practices to apply in Amazon Redshift

The following best practices apply to your Amazon Redshift cluster when using federated queries to access your Aurora or Amazon RDS for PostgreSQL instances.

1. Use separate external schemas for each use case

Consider creating separate Amazon Redshift external schemas, using separate remote PostgreSQL users, for each specific Amazon Redshift use case. This practice gives you extra control over the users and groups who can access the external database. For instance, you might have one external schema, with an associated PostgreSQL user that has broad access, for ETL usage, and another schema, with an associated PostgreSQL user whose access is limited to specific resources, for ad-hoc reporting and analysis.

The following code example creates two external schemas for ETL use and ad-hoc reporting use. Each schema uses a different SECRET_ARN containing credentials for separate users in the PostgreSQL database.

-- ETL usage - broad access
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_etl
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-etl-secret-187Asd'
;
-- Ad-Hoc usage - limited access
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_adhoc
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-adhoc-secret-187Asd'
;

2. Use query timeouts to limit total runtimes

Consider setting a timeout on the users or groups that have access to your external schemas. User queries could unintentionally try to retrieve a very large number of rows from the external relation and remain running for an extended time, which holds open resources in both Amazon Redshift and PostgreSQL.

To limit the total runtime of a user’s queries, you can set a statement_timeout for all of a user’s queries. The following code example sets a 2-hour timeout for an ETL user:

-- Set ETL user timeout to 2 hours
ALTER USER etl_user SET statement_timeout TO 7200000;

If many users have access to your external schemas, it may not be practical to define a statement_timeout for each individual user. Instead, you can add a query monitoring rule in your WLM configuration using the query_execution_time metric. The following screenshot shows an Auto WLM configuration with an Adhoc Reporting queue for users in the adhoc group, with a rule that cancels queries that run for longer than 1,800 seconds (30 minutes).
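
For reference, the following Python sketch shows how such a rule might be applied programmatically with the boto3 ModifyClusterParameterGroup API by setting the wlm_json_configuration parameter. The parameter group name is a placeholder, and the queue and rule JSON layout is an approximation of the WLM schema that you should verify against the WLM configuration reference before use:

import json
import boto3

redshift = boto3.client('redshift')

# Approximate Auto WLM layout: an adhoc queue with a query monitoring rule that
# aborts queries running longer than 1,800 seconds. Field names are assumptions
# based on the documented WLM JSON schema; confirm before applying.
wlm_config = [
    {
        "name": "Adhoc Reporting",
        "user_group": ["adhoc"],
        "query_group": [],
        "auto_wlm": True,
        "rules": [
            {
                "rule_name": "abort_long_adhoc_queries",
                "predicate": [
                    {"metric_name": "query_execution_time", "operator": ">", "value": 1800}
                ],
                "action": "abort"
            }
        ]
    },
    {"name": "Default queue", "user_group": [], "query_group": [], "auto_wlm": True}
]

redshift.modify_cluster_parameter_group(
    ParameterGroupName='my-cluster-parameter-group',  # placeholder
    Parameters=[{
        'ParameterName': 'wlm_json_configuration',
        'ParameterValue': json.dumps(wlm_config)
    }]
)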

3. Make sure the Amazon Redshift query plan is efficient

Review the overall query plan and query metrics of your federated queries to make sure that Amazon Redshift processes them efficiently. For more information about query plans, see Evaluating the query plan.

Viewing the Amazon Redshift query explain plan

You can retrieve the plan for your query by prefixing your SQL with EXPLAIN and running that in your SQL client. The following code example is the explain output for a sample query:

<< REDSHIFT >> QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 XN Aggregate  (cost=6396670427721.37..6396670427721.37 rows=1 width=32)
   ->  XN Hash Join DS_BCAST_INNER  (cost=499986.50..6396670410690.30 rows=6812425 width=32)
         Hash Cond: ("outer".l_partkey = ("inner".p_partkey)::bigint)
         ->  XN Seq Scan on lineitem  (cost=0.00..2997629.29 rows=199841953 width=40)
               Filter: ((l_shipdate < '1994-03-01'::date) AND (l_shipdate >= '1994-02-01'::date))
         ->  XN Hash  (cost=449987.85..449987.85 rows=19999460 width=4)
               ->  XN PG Query Scan part  (cost=0.00..449987.85 rows=19999460 width=4)
                     ->  Remote PG Seq Scan apg_tpch_100g.part  (cost=0.00..249993.25 rows=19999460 width=4)
                           Filter: ((p_type)::text ~~ 'PROMO%'::text)

The operator XN PG Query Scan indicates that Amazon Redshift runs a query against the federated PostgreSQL database for this part of the query; we refer to this as the “federated subquery” in this post. When your query uses multiple federated data sources, Amazon Redshift runs a federated subquery for each source. Amazon Redshift runs each federated subquery from a randomly selected node in the cluster.

Below the XN PG Query Scan line, you can see Remote PG Seq Scan followed by a line with a Filter: element. These two lines define how Amazon Redshift accesses the external data and the predicate used in the federated subquery. You can see that the federated subquery will run against the federated table apg_tpch_100g.part.

You can also see from rows=19999460 that Amazon Redshift estimates that the query can return up to 20 million rows from PostgreSQL. It creates this estimate by asking PostgreSQL for statistics about the table.

Joins

Since each federated subquery runs from a single node in the cluster, Amazon Redshift must choose a join distribution strategy to send the rows returned from the federated subquery to the rest of the cluster to complete the joins in your query. The choice of a broadcast or distribution strategy is indicated in the explain plan. Operators that start with DS_BCAST broadcast a full copy of the data to all nodes. Operators that start with DS_DIST distribute a portion of the data to each node in the cluster.

It’s usually most efficient to broadcast small results and distribute larger results. When the planner has a good estimate of the number of rows that the federated subquery will return, it chooses the correct join distribution strategy. However, if the planner’s estimate isn’t accurate, it may choose broadcast for a result that is too large, which can slow down your query.

Join Order

Joins should use the smaller result as the inner relation. When your query joins two tables (or two federated subqueries), Amazon Redshift must choose how best to perform the join. The query planner may not perform joins in the order declared in your query. Instead, it uses the information it has about the relations being joined to create estimated costs for a variety of possible plans. It uses the plan, including join order, that has the lowest expected cost.

When you use a hash join, the most common join, Amazon Redshift constructs a hash table from the inner table (or result) and compares it to every row from the outer table. You want to use the smallest result as the inner so that the hash table can fit in memory. The chosen join order may not be optimal if the planner’s estimate doesn’t reflect the real size of the results from each step in the query.

Improving query efficiency

The following is high-level advice for improving efficiency. For more information, see Analyzing the query plan.

  • Examine the plan for separate parts of your query. If your query has multiple joins or uses subqueries, you can review the explain plan for each join or subquery to check whether the query benefits from being simplified. For instance, if you use several joins, examine the plan for a simpler query using only one join to see how Amazon Redshift plans that join on its own.
  • Examine the order of outer joins and use an inner join. The planner can’t always reorder outer joins. If you can convert an outer join to an inner join, it may allow the planner to use a more efficient plan.
  • Reference the distribution key of the largest Amazon Redshift table in the join. When a join references the distribution key, Amazon Redshift can complete the join on each node in parallel without moving the rows from the Amazon Redshift table across the cluster.
  • Insert the federated subquery result into a table. Amazon Redshift has optimal statistics when the data comes from a local temporary or permanent table. In rare cases, it may be most efficient to store the federated data in a temporary table first and join it with your Amazon Redshift data.

4. Make sure predicates are pushed down to the remote query

Amazon Redshift’s query optimizer is very effective at pushing predicate conditions down to the federated subquery that runs in PostgreSQL. Review the query plan of important or long-running federated queries to check that Amazon Redshift applies all applicable predicates to each subquery.

Consider the following example query, in which the predicate is inside a CASE statement and the federated relation is within a CTE subquery:

WITH cte 
AS (SELECT p_type, l_extendedprice, l_discount, l_quantity 
    FROM public.lineitem
    JOIN apg_tpch.part --<< PostgreSQL table
         ON l_partkey = p_partkey
    WHERE l_shipdate >= DATE '1994-02-01'
      AND l_shipdate < (DATE '1994-02-01' + INTERVAL '1 month')
)
SELECT CASE WHEN p_type LIKE 'PROMO%' --<< PostgreSQL filter predicate pt1
            THEN TRUE ELSE FALSE END AS is_promo
     , AVG( ( l_extendedprice * l_discount) / l_quantity ) AS avg_promo_disc_val
FROM cte
WHERE is_promo IS TRUE --<< PostgreSQL filter predicate pt2
GROUP BY 1;

Amazon Redshift can still effectively optimize the federated subquery by pushing a filter down to the remote relation. See the following plan:

<< REDSHIFT >> QUERY PLAN
-------------------------------------------------------------------------------------------------------------------
 XN HashAggregate  (cost=17596843176454.83..17596843176456.83 rows=200 width=87)
   ->  XN Hash Join DS_BCAST_INNER  (cost=500000.00..17596843142391.79 rows=6812609 width=87)
         Hash Cond: ("outer".l_partkey = ("inner".p_partkey)::bigint)
         ->  XN Seq Scan on lineitem  (cost=0.00..2997629.29 rows=199841953 width=40)
               Filter: ((l_shipdate < '1994-03-01'::date) AND (l_shipdate >= '1994-02-01'::date))
         ->  XN Hash  (cost=450000.00..450000.00 rows=20000000 width=59)-- Federated subquery >>
               ->  XN PG Query Scan part  (cost=0.00..450000.00 rows=20000000 width=59)
                     ->  Remote PG Seq Scan apg_tpch.part  (cost=0.00..250000.00 rows=20000000 width=59)
                           Filter: (CASE WHEN ((p_type)::text ~~ 'PROMO%'::text) THEN true ELSE false END IS TRUE)-- << Federated subquery

If Amazon Redshift can’t push your predicates down as needed, or the query still returns too much data, consider the advice in the following two sections regarding materialized views and syncing tables. To easily rewrite your queries to achieve effective filter pushdown, consider the advice in the final best practice regarding persisting frequently queried data.

5. Use materialized views to cache frequently accessed data

Amazon Redshift now supports the creation of materialized views that reference federated tables in external schemas.

Cache queries that run often

Consider caching frequently run queries in your Amazon Redshift cluster using a materialized view. When many users run the same federated query regularly, the remote content of the query must be retrieved again for each execution. With a materialized view, the results can instead be retrieved from your Amazon Redshift cluster without getting the same data from the remote database. You can then schedule the refresh of the materialized view to happen at a specific time, depending upon the change rate and importance of the remote data.

The following code example demonstrates the creation, querying, and refresh of a materialized view from a query that uses a federated source table:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_store_quantities_by_quarter AS
SELECT ss_store_sk
     , d_quarter_name
     , COUNT(ss_quantity) AS count_quantity
     , AVG(ss_quantity) AS avg_quantity
FROM public.store_sales
JOIN apg_tpcds.date_dim --<< federated table
    ON d_date_sk = ss_sold_date_sk
GROUP BY ss_store_sk, d_quarter_name
ORDER BY ss_store_sk
;
--Query the materialized view
SELECT * 
FROM mv_store_quantities_by_quarter
WHERE d_quarter_name = '1998Q1'
;
--Refresh the materialized view
REFRESH MATERIALIZED VIEW mv_store_quantities_by_quarter
;
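
One way to schedule that refresh, sketched here under the assumption that you invoke it from an AWS Lambda function on an Amazon EventBridge schedule using the Amazon Redshift Data API, is shown below. The cluster, database, and user names are placeholders:

import boto3

redshift_data = boto3.client('redshift-data')

def lambda_handler(event, context):
    # Submit the refresh asynchronously; the Data API queues the SQL on the cluster.
    response = redshift_data.execute_statement(
        ClusterIdentifier='my-redshift-cluster',  # placeholder
        Database='dev',                           # placeholder
        DbUser='etl_user',                        # placeholder
        Sql='REFRESH MATERIALIZED VIEW mv_store_quantities_by_quarter;'
    )
    return response['Id']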

Cache tables that are used by many queries

Also consider locally caching tables used by many queries using a materialized view. When many different queries use the same federated table, it’s often better to create a materialized view for that federated table, which the other queries can then reference instead.

The following code example demonstrates the creation and querying of a materialized view on a single federated source table:

-- Create the materialized view
CREATE MATERIALIZED VIEW mv_apg_part AS
SELECT * FROM apg_tpch_100g.part
;
--Query the materialized view
SELECT SUM(l_extendedprice * (1 - l_discount)) AS discounted_price
FROM public.lineitem, mv_apg_part
WHERE l_partkey = p_partkey
  AND l_shipdate BETWEEN '1997-03-01' AND '1997-04-01'
;

As of this writing, you can’t reference a materialized view inside another materialized view. Other views that use the cached table need to be regular views.

Balance caching against refresh time and frequency

The use of materialized views is best suited for queries that run quickly relative to the refresh schedule. For example, a materialized view refreshed hourly should run in a few minutes, and a materialized view refreshed daily should run in less than an hour. As of this writing, materialized views that reference external tables aren’t eligible for incremental refresh. A full refresh occurs when you run REFRESH MATERIALIZED VIEW, recreating the entire result.

Limit remote access using materialized views

Also consider using materialized views to reduce the number of users who can issue queries directly against your remote databases. You can grant external schema access only to a user who refreshes the materialized views and grant other Amazon Redshift users access only to the materialized view.

Limiting the scope of access in this way is a general best practice for data security when querying from remote production databases that contain sensitive information.

6. Sync large remote tables to a local copy

Consider keeping a copy of the remote table in a permanent Amazon Redshift table. When your remote table is large and a full refresh of a materialized view is time-consuming, it’s more effective to use a sync process to keep a local copy updated.

Sync newly added remote data

When your large remote table only has new rows added, and existing rows aren’t updated or deleted, you can synchronize your Amazon Redshift copy by periodically inserting the new rows from the remote table into the copy. You can automate this sync process using the example stored procedure sp_sync_get_new_rows on GitHub.

This example stored procedure requires the source table to have an auto-incrementing identity column as its primary key. It finds the current maximum in your Amazon Redshift table, retrieves all rows in the federated table with a higher ID value, and inserts them into the Amazon Redshift table.

The following code examples demonstrate a sync from a federated source table to an Amazon Redshift target table. First, you create a source table with four rows in the PostgreSQL database:

CREATE TABLE public.pg_source ( 
      pk_col   BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    , data_col VARCHAR(20));
INSERT INTO public.pg_source (data_col)
VALUES ('aardvark'),('aardvarks'),('aardwolf'),('aardwolves')
;

Create a target table with two rows in your Amazon Redshift cluster:

CREATE TABLE public.rs_target (
      pk_col   BIGINT PRIMARY KEY
    , data_col VARCHAR(20));
INSERT INTO public.rs_target
VALUES (1,'aardvark'), (2,'aardvarks')
;

Call the Amazon Redshift stored procedure to sync the tables:

CALL sp_sync_get_new_rows(SYSDATE,'apg_tpch.pg_source','public.rs_target','pk_col','public.sp_logs',0);
-- INFO:  SUCCESS - 2 new rows inserted into `target_table`.

SELECT * FROM public.rs_target;
--  pk_col |  data_col  
-- --------+------------
--       1 | aardvark  
--       2 | aardvarks  
--       4 | aardwolves
--       3 | aardwolf 

Merge remote data changes

After you update or insert rows in your remote table, you can synchronize your Amazon Redshift copy by periodically merging the changed rows and new rows from the remote table into the copy. This approach works best when changes are clearly marked in the table so that you can easily retrieve just the new or changed rows. You can automate this sync process using the example stored procedure sp_sync_merge_changes on GitHub.

This example stored procedure requires the source to have a date/time column that indicates the last time each row was modified. It uses this column to find changes that you need to sync and either updates the changed rows or inserts new rows in the Amazon Redshift copy. The stored procedure also requires the table to have a primary key declared. It uses the primary key to identify which rows to update in the local copy of the data.

The following code examples demonstrate a refresh from a federated source table to an Amazon Redshift target table. First, create a sample table with two rows in your Amazon Redshift cluster:

CREATE TABLE public.rs_tbl ( 
      pk_col   INTEGER PRIMARY KEY
    , data_col VARCHAR(20)
    , last_mod TIMESTAMP);
INSERT INTO public.rs_tbl 
VALUES (1,'aardvark', SYSDATE), (2,'aardvarks', SYSDATE);

SELECT * FROM public.rs_tbl;
--  pk_col |  data_col  |     last_mod
-- --------+------------+---------------------
--       1 | aardvark   | 2020-04-01 18:01:02
--       2 | aardvarks  | 2020-04-01 18:01:02

Create a source table with four rows in your PostgreSQL database:

CREATE TABLE public.pg_tbl (
      pk_col   INTEGER GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    , data_col VARCHAR(20)
    , last_mod TIMESTAMP);
INSERT INTO public.pg_tbl (data_col, last_mod)
VALUES ('aardvark', NOW()),('aardvarks', NOW()),('aardwolf', NOW()),('aardwolves', NOW());

Call the Amazon Redshift stored procedure to sync the tables:

CALL sp_sync_merge_changes(SYSDATE,'apg_tpch.pg_tbl','public.rs_tbl','last_mod','public.sp_logs',0);
-- INFO:  SUCCESS - 4 rows synced.

SELECT * FROM public.rs_tbl;
--  pk_col |  data_col  |      last_mod
-- --------+------------+---------------------
--       1 | aardvark   | 2020-04-01 18:09:56
--       2 | aardvarks  | 2020-04-01 18:09:56
--       4 | aardwolves | 2020-04-01 18:09:56
--       3 | aardwolf   | 2020-04-01 18:09:56

Best practices to apply in Aurora or Amazon RDS

The following best practices apply to your Aurora or Amazon RDS for PostgreSQL instances when using them with Amazon Redshift federated queries.

7. Use a read replica to minimize Aurora or RDS impact

Aurora and Amazon RDS allow you to configure one or more read replicas of your PostgreSQL instance. As of this writing, Federated Query doesn’t allow writing to the federated database, so you should use a read-only endpoint as the target for your external schema. This also makes sure that the federated subqueries Amazon Redshift issues have the minimum possible impact on the master database instance, which often runs a large number of small and fast write transactions.

For more information about read replicas, see Adding Aurora Replicas to a DB Cluster and Working with PostgreSQL Read Replicas in Amazon RDS.

The following code example creates an external schema using a read-only endpoint. You can see the -ro naming in the endpoint URI configuration:

--In Amazon Redshift: 
CREATE EXTERNAL SCHEMA IF NOT EXISTS apg_etl
FROM POSTGRES DATABASE 'tpch' SCHEMA 'public'
URI 'aurora-postgres-ro.cr7d8lhiupkf.us-west-2.rds.amazonaws.com' PORT 8192
IAM_ROLE 'arn:aws:iam::123456789012:role/apg-federation-role'
SECRET_ARN 'arn:aws:secretsmanager:us-west-2:123456789012:secret:apg-redshift-etl-secret-187Asd';

8. Use specific and limited PostgreSQL users for each use case

As mentioned in the first best practice regarding separate external schemas, consider creating separate PostgreSQL users for each federated query use case. Having multiple users allows you to grant only the permissions needed for each specific use case. Each user needs a different SECRET_ARN, containing its access credentials, for the Amazon Redshift external schema to use. See the following code:

-- Create an ETL user who will have broad access
CREATE USER redshift_etl WITH PASSWORD '<<example>>';
-- Create an Ad-Hoc user who will have limited access
CREATE USER redshift_adhoc WITH PASSWORD '<<example>>';

Apply a user timeout to limit query runtime

Consider setting a statement_timeout on your PostgreSQL users. A user query could accidentally try to retrieve many millions of rows from the external relation and remain running for an extended time, which holds open resources in both Amazon Redshift and PostgreSQL. To prevent this, specify different timeout values for each user according to their expected usage. The following code example sets timeouts for an ETL user and an ad-hoc reporting user:

-- Set ETL user timeout to 1 hour
ALTER USER redshift_etl SET statement_timeout TO 3600000;
-- Set Ad-Hoc user timeout to 15 minutes
ALTER USER redshift_adhoc SET statement_timeout TO 900000;

9. Make sure the PostgreSQL table is correctly indexed

Consider adding or modifying PostgreSQL indexes to make sure Amazon Redshift federated queries run efficiently. Amazon Redshift retrieves data from PostgreSQL using regular SQL queries against your remote database. Queries are often faster when using an index, particularly when the query returns a small portion of the table.

Consider the following code example of an Amazon Redshift federated query on the lineitem table:

SELECT AVG( ( l_extendedprice * l_discount) / l_quantity ) AS avg_disc_val
FROM apg_tpch.lineitem
WHERE l_shipdate >= DATE '1994-02-01'
  AND l_shipdate < (DATE '1994-02-01' + INTERVAL '1 day');

Amazon Redshift rewrites this into the following federated subquery to run in PostgreSQL:

SELECT pg_catalog."numeric"(l_discount)
     , pg_catalog."numeric"(l_extendedprice)
     , pg_catalog."numeric"(l_quantity) 
FROM public.lineitem  
WHERE (l_shipdate < '1994-02-02'::date) 
  AND (l_shipdate >= '1994-02-01'::date);

Without an index, you get the following plan from PostgreSQL:

<< POSTGRESQL >> QUERY PLAN [No Index]
--------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..16223550.40 rows=232856 width=17)
   Workers Planned: 2
   ->  Parallel Seq Scan on lineitem  (cost=0.00..16199264.80 rows=97023 width=17)
         Filter: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))

You can add the following index to cover exactly the data this query needs:

CREATE INDEX lineitem_ix_covering 
ON public.lineitem (l_shipdate, l_extendedprice, l_discount, l_quantity);

With the new index in place, you see the following plan:

<< POSTGRESQL >> QUERY PLAN [With Covering Index]
------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on lineitem  (cost=7007.35..839080.74 rows=232856 width=17)
   Recheck Cond: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))
   ->  Bitmap Index Scan on lineitem_ix_covering  (cost=0.00..6949.13 rows=232856 width=0)
         Index Cond: ((l_shipdate < '1994-02-02'::date) AND (l_shipdate >= '1994-02-01'::date))

In the revised plan, the maximum cost is 839,080 versus the original 16,223,550, about 19 times less. The reduced cost suggests that the query is faster when using the index, but testing is needed to confirm this.

Indexes require careful consideration. The detailed tradeoffs of adding additional indexes in PostgreSQL, the specific PostgreSQL index types available, and index usage techniques are beyond the scope of this post.

10. Replace restrictive joins with a remote view

Many analytic queries use joins to restrict the rows that the query returns. For instance, you might apply a predicate such as calendar_quarter='2019Q4' to your date_dim table and join to your large fact table. The filter on date_dim reduces the rows returned from the fact table by an order of magnitude. However, as of this writing, Amazon Redshift can’t push such join restrictions down to the federated relation.

Consider the following example query with a join between two federated tables:

SELECT ss_store_sk
,COUNT(ss_quantity) AS count_quantity
,AVG(ss_quantity) AS avg_quantity
FROM apg_tpcds.store_sales
JOIN apg_tpcds.date_dim
  ON d_date_sk = ss_sold_date_sk
WHERE d_quarter_name = '1998Q1'
GROUP BY ss_store_sk
ORDER BY ss_store_sk
LIMIT 100;

When you EXPLAIN this query in Amazon Redshift, you see the following plan:

<< REDSHIFT >> QUERY PLAN [Original]
----------------------------------------------------------------------------------------------------------------------------------
<< snip >>
  ->  XN PG Query Scan store_sales  (cost=0.00..576019.84 rows=28800992 width=12)
        ->  Remote PG Seq Scan store_sales  (cost=0.00..288009.92 rows=28800992 width=12)
  ->  XN Hash  (cost=1643.60..1643.60 rows=73049 width=4)
        ->  XN PG Query Scan date_dim  (cost=0.00..1643.60 rows=73049 width=4)
              ->  Remote PG Seq Scan date_dim  (cost=0.00..913.11 rows=73049 width=4)
                    Filter: (d_quarter_name = '1998Q1'::bpchar)

The query plan shows that date_dim is filtered, but store_sales doesn’t have a filter. This means Amazon Redshift retrieves all rows from store_sales and only then uses the join to filter the rows. Because store_sales is a very big table, this probably takes too long, especially if you want to run this query regularly.

As a solution, you can create the following view in PostgreSQL that encapsulates this join:

CREATE VIEW vw_store_sales_date
AS SELECT ss.*, dd.d_quarter_name ss_quarter_name 
FROM store_sales   ss 
JOIN date_dim      dd 
    ON ss.ss_sold_date_sk = dd.d_date_sk;

Rewrite the Amazon Redshift query to use the view as follows:

SELECT ss_store_sk
      ,COUNT(ss_quantity) AS count_quantity
      ,AVG(ss_quantity) AS avg_quantity
FROM apg_tpcds.vw_store_sales_date
WHERE ss_quarter_name = '1998Q1'
GROUP BY ss_store_sk
ORDER BY ss_store_sk
LIMIT 100;

When you EXPLAIN this rewritten query in Amazon Redshift, you see the following plan:

<< REDSHIFT >> QUERY PLAN [Remote View]
----------------------------------------------------------------------------------------------------------------------------------
<< snip >>
  ->  XN HashAggregate  (cost=30.00..31.00 rows=200 width=8)
        ->  XN PG Query Scan vw_store_sales_date  (cost=0.00..22.50 rows=1000 width=8)
              ->  Remote PG Seq Scan vw_store_sales_date  (cost=0.00..12.50 rows=1000 width=8)
                    Filter: (ss_quarter_name = '1998Q1'::bpchar)

Amazon Redshift now pushes the filter down to your view. The join restriction is applied in PostgreSQL and many fewer rows are returned to Amazon Redshift. You may notice that Remote PG Seq Scan now shows rows=1000; this is a default value that the query optimizer uses when PostgreSQL can’t provide table statistics.

Summary

This post reviewed 10 best practices to help you maximize the performance of Amazon Redshift federated queries. Every use case is unique, so carefully evaluate how you can apply these recommendations to your specific situation.

AWS will continue to enhance and improve Amazon Redshift Federated Query, and welcomes your feedback. If you have any questions or suggestions, leave your feedback in the comments. If you need further assistance in optimizing your Amazon Redshift cluster, contact your AWS account team.

Special thanks go to AWS colleagues Sriram Krishnamurthy, Entong Shen, Niranjan Kamat, Vuk Ercegovac, and Ippokratis Pandis for their help and support with this post.

 


About the Author

Joe Harris is a senior Redshift database engineer at AWS, focusing on Redshift performance. He has been analyzing data and building data warehouses on a wide variety of platforms for two decades. Before joining AWS he was a Redshift customer from launch day in 2013 and was the top contributor to the Redshift forum.

 

Analyzing Google Analytics data with Amazon AppFlow and Amazon Athena

Post Syndicated from Shimura Makoto original https://aws.amazon.com/blogs/big-data/analyzing-google-analytics-data-with-amazon-appflow-and-amazon-athena/

Software as a service (SaaS) applications are rapidly growing in importance. This data is essential to include when performing analytics to influence business decisions. Amazon AppFlow is a fully managed integration service that helps you transfer SaaS data to your data lake securely. You can run data transfer flows on demand, on a schedule, or after an event. You can quickly analyze this data using Amazon Athena and join it with numerous datasets already stored on Amazon Simple Storage Service (Amazon S3). You can join multiple SaaS datasets and combine them with operational data sitting in traditional databases such as Amazon Relational Database Service (Amazon RDS) via the Athena federated query feature.

This post walks you through extracting Google Analytics data using Amazon AppFlow and storing it in Amazon S3 so you can query it with Athena.

Architecture overview

The following diagram shows the flow described in this post. You first create a new flow inside Amazon AppFlow to transfer Google Analytics data to Amazon S3. The format of the transferred data is multi-line JSON, which Athena doesn’t support. An AWS Lambda function transforms this JSON file into Apache Parquet format. This transformation enables you to run queries efficiently and cost-effectively. The function can also include other transformations, such as Amazon S3 prefix changes and storing the data using Hive-style partitions. Amazon AppFlow supports scheduled jobs to extract only new data, so you can develop an automated workflow using an Amazon S3 event trigger and a transformation Lambda function. Amazon AppFlow is currently available in 15 Regions; pick the Region where your S3 bucket is located. In this walkthrough, you use US East (N. Virginia).

In this post, you use a sample Google account, an OAuth client with appropriate permissions, and Google Analytics data. You can also use your own Google resources. To enable Google Analytics access from Amazon AppFlow, you should set up a new OAuth client in advance. Complete the following steps:

  1. On the Google API Console (https://console.developers.google.com), choose Library.
  2. Enter analytics in the search field.
  3. Choose Google Analytics API.
  4. Choose ENABLE and return to the previous page.
  5. Choose Google Analytics Reporting API listed in the search results.
  6. Choose ENABLE and return to the main page.
  7. Choose OAuth consent screen.
  8. Create a new Internal app (if you’re using your personal account, choose External).
  9. Add amazon.com as an authorized domain.
  10. Choose Add scope.
  11. Add ../auth/analytics.readonly as Scopes for Google APIs.
  12. Choose Save.
  13. Choose Credentials.
  14. Add OAuth client ID credentials.
  15. Choose Web application.
  16. Enter https://console.aws.amazon.com/ as an authorized JavaScript origins URL.
  17. Enter https://AWSREGION.console.aws.amazon.com/appflow/oauth as an authorized redirect URL. (Replace AWSREGION with the Region you’re working in. If you’re using Amazon AppFlow in us-east-1, enter https://console.aws.amazon.com/appflow/oauth.)
  18. Choose Save.

Setting up Lambda and Amazon S3

You need to start by creating a new S3 bucket as your Amazon AppFlow transfer destination. Then you develop a new Lambda function to transform JSON format data into Parquet format using pandas and pyarrow modules. Finally, you set an Amazon S3 event trigger to automatically call the Lambda function when a new Amazon S3 object is created.

Creating a new S3 bucket

To create an Amazon S3 bucket, complete the following steps:

  1. On the Amazon S3 console, choose Create bucket.
  2. Enter a name for your bucket; for example, appflow-ga-sample.
  3. Choose Create bucket.

Preparing a .zip file for your Lambda layer

To create a .zip file that includes pandas and pyarrow module, complete the following steps:

  1. Set up any environment that can run Docker.
  2. Run the following command:
mkdir python
docker run -it --rm -v $(pwd)/python:/python python:3.6.8 pip install -t /python pandas==0.23.4 pyarrow==0.11.1
zip -r pandas-pyarrow.zip python
  3. On the Amazon S3 console, choose appflow-ga-sample.
  4. Choose Create folder.
  5. Enter a name for your folder; for example, lambda-layer.
  6. Choose Save.
  7. Choose lambda-layer.
  8. Choose Upload.
  9. Choose pandas-pyarrow.zip and choose Upload.

Creating a Lambda layer for Parquet export

To create a Lambda layer, complete the following steps:

  1. On the Lambda console, choose Layers.
  2. Choose Create layer.
  3. For name, enter a name for your layer; for example, pandas-parquet.
  4. Select Upload a file from Amazon S3.
  5. For Amazon S3 link URL, enter the Amazon S3 path for your .zip file; for example, s3://appflow-ga-sample/lambda-layer/pandas-pyarrow.zip.
  6. For Compatible runtimes, choose Python 3.6.
  7. Choose Create.

Creating a Lambda function for data transformation

To create a Lambda function and trigger an Amazon S3 event, complete the following steps:

  1. On the Lambda console, choose Create function.
  2. Select Author from scratch.
  3. For Function name, enter a name for your function; for example, ga-converter.
  4. For Runtime, choose Python 3.6.
  5. Select Create a new role with basic Lambda permissions.
  6. Choose Create function.

  7. In the Lambda function configuration, enter the following code in the lambda_function area of the Function code section.

This Lambda function downloads the AppFlow output file, extracts the necessary data from the Google Analytics JSON file, and transforms it into Parquet format. Finally, it uploads it to Amazon S3 again with a different key name. You can modify the script, especially the dimension and value names, or add any other transformations according to your needs.

import json
import pandas as pd
import boto3
from datetime import datetime as dt

def lambda_handler(event, context):
    bucket_name = event['Records'][0]['s3']['bucket']['name']
    object_key = event['Records'][0]['s3']['object']['key']

    s3_client = boto3.client('s3')
    raw_object = s3_client.get_object(Bucket=bucket_name, Key=object_key)
    raw_data = json.loads(raw_object['Body'].read().decode('utf-8'))
    
    record_dates = [dt.strptime(r['dimensions'][0], '%Y%m%d%H') for r in raw_data['reports'][0]['data']['rows']]
    devices = [r['dimensions'][1] for r in raw_data['reports'][0]['data']['rows']]
    user_counts = [int(r['metrics'][0]['values'][0]) for r in raw_data['reports'][0]['data']['rows']]
    df = pd.DataFrame({
        'year': [r.year for r in record_dates],
        'month': [r.month for r in record_dates],
        'day': [r.day for r in record_dates],
        'hour': [r.hour for r in record_dates],
        'device': devices,
        'user_count': user_counts
    })
    
    output_file = dt.now().strftime('%Y%m%d%H%M%S')
    output_path = '/tmp/{}.parquet'.format(output_file)
    df.to_parquet(output_path)

    s3_resource = boto3.resource('s3')
    bucket = s3_resource.Bucket(bucket_name)
    bucket.upload_file(output_path, 'ga-data/{}.parquet'.format(output_file))
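
To sanity-check the transformation locally before wiring up the trigger, you can append a small test harness like the following to the script and run it on a machine with AWS credentials. The bucket and key are placeholders and must point to an existing AppFlow output object:

# Hypothetical local test: build a minimal S3 event and call the handler directly.
if __name__ == '__main__':
    test_event = {
        'Records': [{
            's3': {
                'bucket': {'name': 'appflow-ga-sample'},              # placeholder bucket
                'object': {'key': 'raw/example-appflow-output.json'}  # placeholder key
            }
        }]
    }
    lambda_handler(test_event, None)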

Adding layers

To add layers to your Lambda function, complete the following steps:

  1. Choose Layers in the Designer
  2. Choose Add a layer.
  3. Select Select from list of runtime compatible layers.
  4. For Name, choose pandas-pyarrow.
  5. For Version, choose 1.
  6. Choose Add.

Increasing your timeout setting

To increase the Lambda timeout setting, complete the following steps:

  1. On the Configuration tab, under Basic settings, choose Edit.
  2. Set 30 sec as Timeout.
  3. Choose Save.
  4. On the Configuration tab, choose Save.

Adding an event trigger

To add an Amazon S3 event trigger, complete the following steps:

  1. In the Designer section, choose Add trigger.
  2. Choose S3, and choose the bucket you created.
  3. For Event type, choose All object create events.
  4. For Prefix, enter raw.
  5. Choose Add.

Adding permissions

To add appropriate permissions for this Lambda function to read and write Amazon S3 objects, complete the following steps:

  1. On the Permissions tab, enter a role name; for example, ga-converter-role-zm6u0f4g.
  2. On the AWS Identity and Access Management (IAM) console, choose Policies.
  3. Choose Create Policy.
  4. On the JSON tab, enter the following policy (replace the target bucket name arn:aws:s3:::appflow-ga-sample* with your own bucket name).
{	
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
             ],
            "Resource": [
                "arn:aws:s3:::appflow-ga-sample*"
            ]
        }
    ]
}
  5. Choose Review policy.
  6. Enter a name for your new policy; for example, lambda-s3-ga-converter-policy.
  7. Choose Create policy.
  8. On the IAM console, choose Roles.
  9. Enter your role name (ga-converter-role-zm6u0f4g) in the search field.
  10. Choose your role.
  11. Choose Attach policies.
  12. Choose lambda-s3-ga-converter-policy.
  13. Choose Attach policy.

Setting up Amazon AppFlow

Now you can create a new Amazon AppFlow flow to transfer from Google Analytics to Amazon S3. To create a new Amazon AppFlow transfer flow, complete the following steps:

  1. On the Amazon AppFlow console, choose Create flow.
  2. Enter a name for your flow; for example, my-ga-flow.
  3. Choose Next.
  4. For Source name, choose Google Analytics.
  5. Choose Create new connection.
  6. Enter your OAuth client ID and client secret, then name your connection; for example, ga-connection.
  7. In the pop-up window, choose to allow amazon.com access to the Google Analytics API.
  8. For Choose Google Analytics object, choose Reports.
  9. For Choose Google Analytics view, choose All Web Site Data.
  10. For Destination name, choose Amazon S3.
  11. For Bucket details, choose the bucket you created.
  12. Enter raw as a prefix.
  13. Select Run on demand.

  14. Choose Next.
  15. Select Manually map fields.
  16. Select the following three fields for Source field name:
    • Time: DIMENSION: ga:dateHour
    • Platform or Device: DIMENSION: ga:deviceCategory
    • User: METRIC: ga:users
  17. Choose Map fields directly.

  18. Choose Next.
  19. In the Add filters section, choose Next.
  20. Choose Create flow.

Running the flow

After creating your new flow, you can run it on demand:

  1. On the Amazon AppFlow console, choose my-ga-flow.
  2. Choose Run flow.

For this walkthrough, you choose on-demand job execution for ease of understanding. In practice, you can choose a scheduled job and periodically extract only newly added data. The Amazon S3 event trigger also helps you transform data automatically.

Querying via Athena

You need to create an external table before querying. Complete the following steps:

  1. On the Athena console, enter create database appflow_data into the query editor.
  2. Choose Run query.
  3. Enter the following command in the query editor (replace the target bucket name appflow-ga-sample with your own bucket):
CREATE EXTERNAL TABLE appflow_data.ga_sample (
  `year` int,
  `month` int,
  `day` int,
  `hour` int,
  `device` string,
  `user_count` int
 )
STORED AS PARQUET
LOCATION 's3://appflow-ga-sample/ga-data'
tblproperties ("parquet.compression"="SNAPPY")
;
  1. Choose Run query.

Now you can query Google Analytics data. Enter the following query and run it. This query shows what kind of device is popular for accessing your website on an hourly basis:

SELECT
  year
  , month
  , day
  , device
  , count(user_count) as cnt
FROM
  appflow_data.ga_sample
GROUP BY
  year
  , month
  , day
  , device
ORDER BY
  cnt DESC
LIMIT 10
; 

The following screenshot shows the query results.
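
If you prefer to run the same query programmatically rather than from the console, a short boto3 sketch looks like the following; the results output location is a placeholder you must replace with your own bucket:

import time
import boto3

athena = boto3.client('athena')

query = """
SELECT year, month, day, device, count(user_count) AS cnt
FROM appflow_data.ga_sample
GROUP BY year, month, day, device
ORDER BY cnt DESC
LIMIT 10
"""

# Start the query and poll until Athena finishes it.
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'appflow_data'},
    ResultConfiguration={'OutputLocation': 's3://appflow-ga-sample/athena-results/'}  # placeholder
)
query_id = execution['QueryExecutionId']

while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)['QueryExecution']['Status']['State']
    if state in ('SUCCEEDED', 'FAILED', 'CANCELLED'):
        break
    time.sleep(1)

if state == 'SUCCEEDED':
    rows = athena.get_query_results(QueryExecutionId=query_id)['ResultSet']['Rows']
    for row in rows[:5]:
        print([col.get('VarCharValue') for col in row['Data']])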

Summary

This post demonstrated how you can transfer Google Analytics data to Amazon S3 using Amazon AppFlow and analyze it with Amazon Athena. You no longer need to build your own application to extract data from Google Analytics and other SaaS applications. Amazon AppFlow enables you to develop a fully automated data transfer and transformation workflow and an integrated query environment in one place.


About the Author

Makoto Shimura is a specialist solutions architect, analytics at Amazon Web Services. He helps customers develop efficient data pipelines on the AWS platform. Previously, he worked as a data engineer, developing a distributed data platform. Outside of work, he loves to spend time with his family, play with his dog, and also play video games.

 

Health Check Analytics and how you can use it

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/health-check-analytics-and-how-you-can-use-it/


At the end of last year, we introduced Standalone Health Checks – a service that lets you monitor the health of your origin servers and avoid the need to purchase additional third-party services. The more you can control from Cloudflare, the lower your maintenance cost, vendor management overhead, and infrastructure complexity, which ensures you can scale your infrastructure seamlessly as your company grows. Today, we are introducing Standalone Health Check Analytics to help decrease your time to resolution for any potential issues. You can find Health Check Analytics in the sub-menu under the Traffic tab in your Cloudflare Dashboard.


As a refresher, Standalone Health Checks is a service that monitors an IP address or hostname for your origin servers or application and notifies you in near real-time if there happens to be a problem. These Health Checks support fine-tuned configurations based on expected codes, interval, protocols, timeout and more. These configurations enable you to properly target your checks based on the unique setup of your infrastructure. An example of a Health Check can be seen below which is monitoring an origin server in a staging environment with a notification set via email.


Once you set up a notification, you will be alerted when there is a change in the health of your origin server. In the example above, if your staging environment starts responding with anything other than a 200 OK response code, we will send you an email within seconds so you can take the necessary action before customers are impacted.

Introducing Standalone Health Check Analytics

Once you get the notification email, we provide tools that help to quickly debug the possible cause of the issue with detailed logs as well as data visualizations enabling you to better understand the context around the issue. Let’s walk through a real-world scenario and see how Health Check Analytics helps decrease our time to resolution.

A notification email has been sent to you letting you know that Staging is unhealthy. You log into your dashboard and go into Health Check Analytics for this particular Health Check. In the screenshot below, you can see that Staging is up 76% of the time versus 100% of the time for Production. Now that the graph validates the email notification that there is indeed a problem, we need to dig in further. Below the graph you can see a breakdown of the types of errors that have taken place in both the Staging and Production addresses over the specified time period. We see there is only one error taking place, but very frequently, in the staging environment – a TCP Connection Failed error, leading to the lower availability.

This starts to narrow the funnel for what the issue could be. We know that something is wrong with either the Staging server’s ability to receive connections (perhaps an issue during the TCP handshake), or possibly with the router in front of it, in which case the origin server isn’t the problem at all but is suffering a downstream consequence. With this information, you can quickly make the necessary checks to validate your hypothesis and minimize your time to resolution. Instead of having to dig through endless logs or make educated guesses at where the issue could stem from, Health Check Analytics allows you to quickly hone in on the areas that could be the root cause. This in turn maximizes your application reliability and, more importantly, keeps trust and brand expectations with your customers.


Being able to quickly get an overview of your infrastructure is important, but sometimes digging deeper into each health check is more valuable for understanding what is happening at a granular level. In addition to general information like address, response code, round trip time (RTT), and failure reason, we are adding more features to help you understand the Health Check result(s). We have also added extra information to the event table so you can quickly understand a given problem. In the case of a Response Code Mismatch Error, we now provide the expected response code for a given Health Check along with the received code. This removes the need to go back and check a configuration that may have been set up long ago and keeps the focus on the problem at hand.


The availability of different portions of your infrastructure is very important, but this does not provide the complete view. Performance is continuing to skyrocket in importance and value to customers. If an application is not performant, they will quickly go to a competitor without a second thought. Sometimes RTT is not enough to understand why requests have higher latency and where the root of the issue may reside. To better understand where time is spent for a given request, we are introducing the waterfall view of a request within the Event Log. With this view, you can understand the time taken for the TCP connection, time taken for the TLS handshake, and the time to first byte (TTFB) for the request. The waterfall will give you a chronological idea about time spent in different stages of the request.

  1. Time taken for establishing the initial TCP connection.(in dark blue, approx 41ms)
  2. Once the TCP connection is established, time is spent doing the TLS handshake. This is another component that takes up time for HTTPS websites. (in light blue, approx 80ms)
  3. Once the connection and TLS handshake are complete, the time taken for the first byte of the response to be received is also exposed. (in dark orange, approx 222ms)
  4. The total round trip time (RTT) is the time taken to load the complete page once the TLS handshake is complete. The difference between the RTT and the TTFB gives you the time spent downloading content from a page. If your page has a large amount of content, the difference between TTFB and RTT will be high. (in light orange, approx 302ms). The page load time is approximately 80 ms for the address.

The information above suggests a number of steps the website owner can take. The delay in the initial TCP connection time could be decreased by making the website available in different geographic locations around the world; this could also reduce the TLS handshake time, because each round trip would be faster. Another thing that is visible is the page load time of 80 ms. This might be because of the contents of the page; perhaps compression can be applied on the server side to improve the load time, or unnecessary content can be removed. The information in the waterfall view can also tell you if an additional external library increased the time to load the website after a release.
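
To make those stages concrete, the following standalone Python sketch measures the same phases for an arbitrary HTTPS endpoint. The hostname is a placeholder, and this is a rough client-side approximation for illustration, not how Cloudflare’s probes are implemented:

import socket, ssl, time

host = "example.com"  # placeholder endpoint to probe

t0 = time.monotonic()
sock = socket.create_connection((host, 443), timeout=10)
t_tcp = time.monotonic()                      # TCP connection established

ctx = ssl.create_default_context()
tls = ctx.wrap_socket(sock, server_hostname=host)
t_tls = time.monotonic()                      # TLS handshake complete

tls.sendall(f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode())
tls.recv(1)                                   # first byte of the response
t_ttfb = time.monotonic()

while tls.recv(65536):                        # drain the rest of the response
    pass
t_done = time.monotonic()
tls.close()

print(f"TCP connect  : {(t_tcp - t0) * 1000:.0f} ms")
print(f"TLS handshake: {(t_tls - t_tcp) * 1000:.0f} ms")
print(f"TTFB         : {(t_ttfb - t_tls) * 1000:.0f} ms")
print(f"RTT          : {(t_done - t_tls) * 1000:.0f} ms")
print(f"Download     : {(t_done - t_ttfb) * 1000:.0f} ms")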

Cloudflare has over 200 edge locations around the world, making it one of the largest Anycast networks on the planet. When a health check is configured, it can be run across different regions of the Cloudflare infrastructure, enabling you to see the variation in latency around the world for specific Health Checks.

Waterfall from India
Waterfall from Western North America

Based on the new information provided from Health Check Analytics, users can definitively validate that the address performs better from Western North America compared to India due to the page load time and overall RTT.

How do health checks run?

To understand and decipher the logs in the analytics dashboard, it is important to understand how Cloudflare runs the Health Checks. Cloudflare has data centers in more than 200 cities across 90+ countries throughout the world. We don’t run health checks from every single one of these data centers (that would be a lot of requests to your servers!). Instead, we let you pick between one and thirteen regions from which to run health checks.

The Internet is not the same everywhere around the world. So your users may not have the same experience on your application according to where they are. Running Health Checks from different regions lets you know the health of your application from the point of view of the Cloudflare network in each of these regions.

Imagine you configure a Health Check from two regions, Western North America and South East Asia, at an interval of 10 seconds. You may have been expecting to get two requests to your origin server every 10 seconds, but if you look at your server’s logs you will see that you are actually getting six. That is because we send Health Checks not just from one location in each region but three.


For a health check configured from All Regions (thirteen regions) there will be 39 requests to your server per configured interval.

You may wonder: ‘Why do you probe from multiple locations within a region?’ We do this to make sure the health we report represents the overall performance of your service as seen from that region. Before we report a change, we check that at least two locations agree on the status. We added a third one to make sure that the system keeps running even if there is an issue at one of our locations.
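
As a toy illustration of that two-out-of-three agreement rule (a simplified model for intuition, not Cloudflare’s actual implementation), consider the following Python sketch:

# Simplified model: a region reports a status only when at least two probes agree.
from collections import Counter

def region_status(probe_results):
    """probe_results: statuses ('healthy' or 'unhealthy') from the three probes in a region."""
    status, votes = Counter(probe_results).most_common(1)[0]
    return status if votes >= 2 else "undetermined"

print(region_status(["healthy", "healthy", "unhealthy"]))    # healthy
print(region_status(["unhealthy", "unhealthy", "healthy"]))  # unhealthy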

Conclusion

Health Check Analytics is now live and available to all Pro, Business, and Enterprise customers! We are very excited to help decrease your time to resolution and ensure your application reliability is maximized.

Monitor Spark streaming applications on Amazon EMR

Post Syndicated from Amir Shenavandeh original https://aws.amazon.com/blogs/big-data/monitor-spark-streaming-applications-on-amazon-emr/

For applications to be enterprise-ready, you need to consider many aspects of the application before moving to a production environment and have operational visibility of your application. You can get that visibility through metrics that measure your application’s health and performance and feed application dashboards and alarms.

In streaming applications, you need to benchmark different stages and the tasks in each stage. Spark provides interfaces to plug in your own probes for real-time monitoring and observation of your applications. SparkListeners are a flexible and powerful tool for both streaming and batch applications. You can combine them with Amazon CloudWatch metrics, dashboards, and alarms for visibility, and generate notifications when issues arise or automatically scale clusters and services.

This post demonstrates how to implement a simple SparkListener, monitor and observe Spark streaming applications, and set up some alerts. The post also shows how to use alerts to set up automatic scaling on Amazon EMR clusters, based on your CloudWatch custom metrics.

Monitoring Spark streaming applications

For production use cases, you need to plan ahead to determine the amount of resources your Spark application requires. Real-time applications often have SLAs that they need to meet, such as how long each batch execution can safely run or how long each micro-batch can be delayed. Quite often, in the lifecycle of an application, sudden increases of data in the input stream require more application resources to process and catch up with the influx.

For these use cases, you may be interested in common metrics such as the count of records in each micro-batch, the delay on running scheduled micro-batches, and how long each batch takes to run. For example, in Amazon Kinesis Data Streams, you can monitor the IteratorAge metric. With Apache Kafka as a streaming source, you might monitor consumer lag, such as the delta between the latest offset and the consumer offset. For Kafka, there are various open-source tools for this purpose.

You can react in real time or raise alerts based on environment changes by provisioning more resources or reducing unused resources for cost optimization.

Different methods to monitor Spark streaming applications are already available. A very efficient, out-of-the-box feature of Spark is the Spark metrics system. Additionally, Spark can report metrics to various sinks including HTTP, JMX, and CSV files.

You can also monitor and record application metrics from within the application itself by emitting logs. However, this requires actions such as running count().print(), printing metrics inside map operations, or otherwise reading the data, which can introduce delays, add stages to the application, or trigger unwanted shuffles. These techniques can be useful for testing but often prove expensive as a long-term solution.

This post discusses another method: using the StreamingListener interface. The following screenshot shows some of the metrics available on the Spark UI’s Streaming tab.

Apache Spark listeners

Spark internally relies on SparkListeners for event-based communication between its internal components. The Spark scheduler also emits events for SparkListeners whenever the state of a task or stage changes. SparkListeners listen to the events coming from Spark’s DAGScheduler, which is the heart of the Spark execution engine. You can use custom Spark listeners to intercept these scheduler events so you know when a task or stage starts and finishes.

The Spark developer API provides eight methods in the StreamingListener trait that are called on different streaming events, mainly at the start, stop, failure, completion, or submission of receivers, batches, and output operations. You can run application logic at each event by implementing these methods. For more information, see StreamingListener.scala on GitHub.

To register your custom Spark listener, set spark.extraListeners when launching the application, or register it programmatically by calling addSparkListener when setting up the SparkContext in your application.
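
The following is a minimal sketch of both registration approaches; the listener class and the fully qualified name used with spark.extraListeners (com.example.MySparkListener) are placeholders for your own compiled listener:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

// A trivial listener that logs the executor run time of every finished task.
class MySparkListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    // taskMetrics can be null for failed tasks, so guard before reading it.
    Option(taskEnd.taskMetrics).foreach { m =>
      println(s"Task finished, executorRunTime=${m.executorRunTime} ms")
    }
  }
}

val conf = new SparkConf()
  .setAppName("listener-registration-example")
  // Option 1: register by the fully qualified class name at launch time (placeholder name).
  .set("spark.extraListeners", "com.example.MySparkListener")

val sc = new SparkContext(conf)
// Option 2: register programmatically on the SparkContext.
sc.addSparkListener(new MySparkListener())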

SparkStreaming micro-batches

By default, Spark Streaming uses a micro-batch execution model. Spark starts a job at regular intervals on a continuous stream. Each micro-batch contains stages, and stages contain tasks. Stages are based on the DAG and the operations that the application code defines, and the number of tasks in each stage is based on the number of DStream partitions.

At the start of a streaming application, the receivers are assigned to executors, in a round-robin fashion, as long-running tasks.

Receivers create blocks of data based on blockInterval. The received blocks are distributed by the BlockManager of the executors, and the network input tracker running on the driver is informed about the block locations for further processing.

On the driver, an RDD is created for the blocks in each batchInterval. Each block translates to a partition of the RDD and a task is scheduled to process each partition.
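
As an illustration only, with a 1-second batch interval and the default spark.streaming.blockInterval of 200 ms, each receiver produces roughly 1000 / 200 = 5 blocks per batch, so each batch RDD has about 5 partitions per receiver:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("block-interval-example")
  // Default is 200ms; lowering it creates more blocks (and therefore tasks) per batch.
  .set("spark.streaming.blockInterval", "200ms")

// 1-second micro-batches: about 5 blocks per receiver per batch at a 200ms block interval.
val ssc = new StreamingContext(conf, Milliseconds(1000))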

The following diagram illustrates this architecture.

Creating a custom SparkListener and sending metrics to CloudWatch

You can rely on CloudWatch custom metrics to react or raise alarms based on the custom Spark metrics you collect from a custom Spark listener.

You can implement your custom streaming listeners by directly implementing the StreamingListener trait if you’re writing in Scala, or by using its equivalent Java interface or the PySpark wrapper pyspark.streaming.listener.

For this post, you only override onBatchCompleted and onReceiverError because you’re only collecting metrics about micro-batches.

From onBatchCompleted, you submit the following metrics:

  • Heartbeat – A numeric 1 (one) whenever a batch completes so you can sum or average time periods to see how many micro-batches ran
  • Records – The number of records per batch
  • Scheduling delay – The delay from when the batch was scheduled to run until when it actually ran
  • Processing delay – How long the batch execution took
  • Total delay – The sum of the processing delay and scheduling delay

From onReceiverError, you submit a numeric 1 (one) whenever a receiver fails. See the following code:

  /**
   * This method executes when a Spark Streaming batch completes.
   *
   * @param batchCompleted Class having information on the completed batch
   */
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    log.info("CloudWatch Streaming Listener, onBatchCompleted:" + appName)

    // write performance metrics to CloudWatch Metrics
    writeBatchStatsToCloudWatch(batchCompleted)
  }

  /**
   * This method executes when a Spark Streaming receiver encounters an error.
   *
   * @param receiverError Class having information on the receiver errors
   */
  override def onReceiverError(receiverError: StreamingListenerReceiverError): Unit = {
    log.warn("CloudWatch Streaming Listener, onReceiverError:" + appName)

    writeRecieverStatsToCloudWatch(receiverError)
  }

For the full source code of this example for Scala implementation and a sample Spark Kinesis streaming application, see the AWSLabs GitHub repository.
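
The repository contains the full implementation of the helper methods. As a rough sketch only, assuming the AWS SDK for Java v1 CloudWatch client is on the classpath, a listener with a writeBatchStatsToCloudWatch helper could look similar to the following (the class name, metric namespace, and dimension name are placeholders, and the actual repository code differs):

import com.amazonaws.services.cloudwatch.AmazonCloudWatchClientBuilder
import com.amazonaws.services.cloudwatch.model.{Dimension, MetricDatum, PutMetricDataRequest, StandardUnit}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

class SimpleCloudWatchSparkListener(appName: String) extends StreamingListener {
  private val cloudWatch = AmazonCloudWatchClientBuilder.defaultClient()

  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted): Unit =
    writeBatchStatsToCloudWatch(batchCompleted)

  private def writeBatchStatsToCloudWatch(batchCompleted: StreamingListenerBatchCompleted): Unit = {
    val info = batchCompleted.batchInfo
    val appDimension = new Dimension().withName("AppName").withValue(appName)

    // Helper to build one metric datum tagged with the application name.
    def datum(name: String, value: Double, unit: StandardUnit): MetricDatum =
      new MetricDatum().withMetricName(name).withValue(value).withUnit(unit).withDimensions(appDimension)

    val request = new PutMetricDataRequest()
      .withNamespace("SparkStreaming") // placeholder namespace
      .withMetricData(
        datum("heartBeat", 1.0, StandardUnit.Count),
        datum("numRecords", info.numRecords.toDouble, StandardUnit.Count),
        datum("schedulingDelay", info.schedulingDelay.getOrElse(0L).toDouble, StandardUnit.Milliseconds),
        datum("processingDelay", info.processingDelay.getOrElse(0L).toDouble, StandardUnit.Milliseconds),
        datum("totalDelay", info.totalDelay.getOrElse(0L).toDouble, StandardUnit.Milliseconds))

    cloudWatch.putMetricData(request)
  }
}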

To register your custom listener, make an instance of the custom listener object and pass the object to the streaming context, in the driver code, using the addStreamingListener method. See the following code:

val conf = new SparkConf().setAppName(appName)
val batchInterval = Milliseconds(1000)
val ssc = new StreamingContext(conf, batchInterval)
val cwListener = new CloudWatchSparkListener(appName)

ssc.addStreamingListener(cwListener)

When you run the application, you can find your metrics in CloudWatch in the same account as the one the EMR cluster is running in. See the following screenshot.

Using the sample code

This post provides an AWS CloudFormation template, which demonstrates the code. Download the emrtemplate.json file from the GitHub repo. The template launches an EMR cluster in a public subnet and a Kinesis data stream with three shards with the required default AWS Identity and Access Management (IAM) roles. The sample Spark Kinesis streaming application is a simple word count that an Amazon EMR step script compiles and packages with the sample custom StreamListener.

Using application alarms in CloudWatch

The alerts you need to set up mainly depend on the SLA of your application. As a general rule, you don’t want your batches to take longer than the micro-batch interval, because the scheduled batches start to queue and you fall behind the input stream. Also, if your receivers read from the stream faster than the batches can process during a surge, the read records can spill to disk and cause additional delays when they are shuffled across to other executors. You can set up a CloudWatch alarm to notify you when a processing delay is approaching your application’s batchInterval. For instructions on setting up an alarm, see Using Amazon CloudWatch Alarms.

The CloudFormation template for this post has two sample alarms to monitor. One is based on an anomaly detection band on the processingDelays metric; the second is based on a threshold applied to a math expression that calculates the ratio of schedulingDelay to totalDelay, or (schedulingDelay / totalDelay) * 100.
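
As a reference point, an alarm similar to the second one could also be created with the AWS CLI using a metric math expression; the namespace, dimension, threshold, and metric names below are placeholders that you would adjust to match the metrics your listener publishes:

aws cloudwatch put-metric-alarm \
  --alarm-name spark-scheduling-delay-ratio \
  --comparison-operator GreaterThanThreshold \
  --evaluation-periods 3 \
  --threshold 30 \
  --metrics '[
    {"Id": "e1", "Expression": "(m1/m2)*100", "Label": "schedulingDelayRatio", "ReturnData": true},
    {"Id": "m1", "ReturnData": false, "MetricStat": {"Metric": {"Namespace": "SparkStreaming", "MetricName": "schedulingDelay", "Dimensions": [{"Name": "AppName", "Value": "my-streaming-app"}]}, "Period": 300, "Stat": "Average"}},
    {"Id": "m2", "ReturnData": false, "MetricStat": {"Metric": {"Namespace": "SparkStreaming", "MetricName": "totalDelay", "Dimensions": [{"Name": "AppName", "Value": "my-streaming-app"}]}, "Period": 300, "Stat": "Average"}}
  ]'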

Scaling streaming applications

In terms of scaling, as the amount of data grows, you have more DStream partitions, based on the blockIntervals of the streaming application. In addition to the batches that should catch up with the received records and finish within batch intervals, the receivers should also keep up with the influx of records. The source streams should provide enough bandwidth for the receivers to read fast enough from the stream, and there should be enough receivers reading at the right rate to consume the records from the source.

If your DStreams are backed by receivers and WALs, you need to consider the number of receivers in advance. After you launch the application, the number of receivers can’t change without restarting the application.

When a SparkStreaming application starts, by default, the driver schedules the receivers in a round-robin fashion on the available executors unless a preferred location is defined for receivers. When all executors are allocated with receivers, the rest of the required receivers are scheduled on the executors to balance the number of receivers on each executor, and the receivers stay up in the executors as long-running tasks. For more information about scheduling receivers on executors, see ReceiverSchedulingPolicy.scala on GitHub and SPARK-8882 on the Spark issues website.

Sometimes you may want to slow down receivers because you want less data per micro-batch and don’t want to exceed your micro-batch intervals. If your streaming sources can hold on to records when the batches can’t keep up with a surge of records, you can enable the backpressure feature so that the receivers adapt their input rate. To do so, set spark.streaming.backpressure.enabled to true.

Another factor to consider is dynamic allocation for streaming applications. By default, spark.dynamicAllocation.enabled is set to true on Amazon EMR, and it is mutually exclusive with spark.streaming.dynamicAllocation.enabled. If you want the driver to request more executors for your DStream tasks, you need to set spark.dynamicAllocation.enabled to false and spark.streaming.dynamicAllocation.enabled to true. Spark periodically looks at the average batch duration. If it’s above the scale-up ratio, it requests more executors. If it’s below the scale-down ratio, it releases the idle executors, preferably those that aren’t running any receivers. For more information, see ExecutorAllocationManager.scala on GitHub and the Spark Streaming Programming Guide.

The ExecutorAllocationManager already looks at the average batch execution time and requests more executors based on the scale-up and scale-down ratios. Because of this, you can set up automatic scaling in Amazon EMR, preferably on task instance groups, to add and remove nodes based on the ContainerPendingRatio, and assign a PreferredLocation for receivers to core nodes. The example code for this post provides a custom KinesisInputDStream, which allows assigning the preferred location for every receiver you request. It’s essentially a function that returns a hostname on which to preferably place the receiver. The GitHub repo also has a sample application that uses the custom KinesisInputDStream and custom Kinesis receiver, which allow requesting a preferredLocation for every receiver.

At scale-down, Amazon EMR nominates the nodes with the fewest containers running for decommissioning in the task instance group.

For more information about setting up automatic scaling, see Using Automatic Scaling with a Custom Policy for Instance Groups. The example code contains a threshold on schedulingDelay. As a general rule, you should base the threshold on the batchIntervals and processingDelay. A growth in schedulingDelay usually means a lack of resources to schedule a task.

The following table summarizes the configuration attributes to tune when you launch your Spark streaming job.

Configuration Attribute | Default
spark.streaming.backpressure.enabled | False
spark.streaming.backpressure.pid.proportional | 1.0
spark.streaming.backpressure.pid.integral | 0.2
spark.streaming.backpressure.pid.derived | 0.0
spark.streaming.backpressure.pid.minRate | 100
spark.dynamicAllocation.enabled | True
spark.streaming.dynamicAllocation.enabled | False
spark.streaming.dynamicAllocation.scalingInterval | 60 seconds
spark.streaming.dynamicAllocation.minExecutors | max(1, numReceivers)
spark.streaming.dynamicAllocation.maxExecutors | Integer.MAX_VALUE
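
The following spark-submit sketch shows one way to set these attributes at launch time; the application class and JAR names are placeholders:

spark-submit \
  --conf spark.streaming.backpressure.enabled=true \
  --conf spark.streaming.backpressure.pid.minRate=100 \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.streaming.dynamicAllocation.enabled=true \
  --conf spark.streaming.dynamicAllocation.scalingInterval=60s \
  --class com.example.MyStreamingApp my-streaming-app.jar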

Monitoring structured streaming with a listener

Structured Streaming still processes records in micro-batches and triggers queries when there is data from receivers. You can monitor these queries using another listener interface, StreamingQueryListener. This post provides a sample listener for Structured Streaming on Kafka, with a sample application to run. For more information, see CloudWatchQueryListener.scala on GitHub. The following image is a snapshot of a few CloudWatch custom metrics that the custom StreamingQueryListener collects.
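
As a rough illustration of the Structured Streaming listener API (not the exact CloudWatchQueryListener implementation), a minimal StreamingQueryListener looks like the following:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQueryListener
import org.apache.spark.sql.streaming.StreamingQueryListener.{QueryProgressEvent, QueryStartedEvent, QueryTerminatedEvent}

class SimpleQueryListener extends StreamingQueryListener {
  override def onQueryStarted(event: QueryStartedEvent): Unit =
    println(s"Query started: ${event.id}")

  override def onQueryProgress(event: QueryProgressEvent): Unit = {
    // event.progress carries per-trigger metrics such as numInputRows,
    // inputRowsPerSecond, and batchId that you could forward to CloudWatch.
    println(s"Batch ${event.progress.batchId}: ${event.progress.numInputRows} rows")
  }

  override def onQueryTerminated(event: QueryTerminatedEvent): Unit =
    println(s"Query terminated: ${event.id}")
}

val spark = SparkSession.builder.appName("structured-streaming-listener").getOrCreate()
spark.streams.addListener(new SimpleQueryListener)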

Scaling down your EMR cluster

When you launch a Spark streaming application, Spark evenly schedules the receivers on all available executors at the start of the application. When an EMR cluster is set to scale down, Amazon EMR nominates the nodes running fewer tasks in the instance group with an automatic scaling rule. Although Spark receivers are long-running tasks, Amazon EMR waits until yarn.resourcemanager.decommissioning.timeout expires, or until the NodeManagers are decommissioned, before gracefully terminating and shrinking the nodes. You’re always at risk of losing a running executor with a receiver. You should always consider enough Spark block replication and checkpointing for the DStreams, and ideally define a preferredLocation, so you don’t risk losing receivers.

Metrics pricing

In general, Amazon EMR metrics don’t incur CloudWatch costs. However, custom metrics incur charges based on CloudWatch metrics pricing. For more information, see Amazon CloudWatch pricing. Additionally, Spark Kinesis Streaming relies on the Kinesis Client Library, and it publishes custom CloudWatch metrics that also incur charges based on CloudWatch metrics pricing. For more information, see Monitoring the Kinesis Client Library with Amazon CloudWatch.

Conclusion

Monitoring and tuning Spark streaming and real-time applications is challenging, and you must react to environment changes in real time. You also need to monitor your source streams and job outputs to get a full picture. Spark is a very flexible and rich framework that provides multiple options for monitoring jobs. This post looked into an efficient way to monitor the performance of Spark streaming micro-batches using SparkListeners and integrate the extracted metrics with CloudWatch metrics.

 


About the Author

Amir Shenavandeh is a Hadoop systems engineer with AWS. He helps customers with architectural guidance and technical support using open-source applications, develops and advances the applications of the Hadoop ecosystem and works with the open source community. 

 

 

Setting up trust between ADFS and AWS and using Active Directory credentials to connect to Amazon Athena with ODBC driver

Post Syndicated from Arun Alapati original https://aws.amazon.com/blogs/big-data/setting-up-trust-between-adfs-and-aws-and-using-active-directory-credentials-to-connect-to-amazon-athena-with-odbc-driver/

Amazon Athena is a serverless and interactive query service that allows you to easily analyze your raw and processed datasets in Amazon Simple Storage Service (Amazon S3) using standard SQL. The JDBC and ODBC drivers that Athena provides allow you to easily integrate your data analytics tools (such as Microsoft Power BI, Tableau, or SQLWorkBench) with Athena seamlessly and gain insights about your data in minutes.

Before November 2018, you had to connect to Athena with ODBC or JDBC drivers using your IAM user or role credentials. However, with the November 20, 2018 release of support for Microsoft Active Directory Federation Services (ADFS 3.0) and Security Assertion Markup Language (SAML 2.0) in the Athena ODBC/JDBC driver, you can now connect to Athena directly using your Microsoft Active Directory (AD) credentials.

Microsoft ADFS 3.0, a component of Windows Server, supports SAML 2.0 and is integrated with AWS Identity and Access Management (IAM). This integration allows Active Directory (AD) users to federate to AWS using corporate directory credentials, such as a username and password from Microsoft Active Directory.

This post walks you through configuring ADFS 3.0 on a Windows Server 2012 R2 Amazon Elastic Compute Cloud (Amazon EC2) instance and setting up trust between the ADFS 3.0 IdP and AWS through SAML 2.0. The post then demonstrates how to install the Athena ODBC driver on a Red Hat Enterprise Linux (RHEL) EC2 instance and configure it to use ADFS for authentication.

Solution overview

The following architecture diagram shows how an AD user in your organization is authenticated in the Athena ODBC/JDBC driver:

The process includes the following steps:

  1. A user in your organization uses a client application with the JDBC or ODBC driver to request authentication from your organization’s IdP. The IdP is ADFS 3.0.
  2. The IdP authenticates the user against AD, which is your organization’s Identity Store.
  3. The IdP constructs a SAML assertion with information about the user and sends the assertion to the client application via the JDBC or ODBC driver.
  4. The JDBC or ODBC driver calls the AWS Security Token Service AssumeRoleWithSAML API operation, passing it the following parameters:
    • The ARN of the SAML provider
    • The ARN of the role to assume
    • The SAML assertion from the IdP
  5. The API response to the client application via the JDBC or ODBC driver includes temporary security credentials.
  6. The client application uses the temporary security credentials to call Athena API operations, which allows your users to access Athena API operations.

This post walks you through configuring ADFS-AWS trust through SAML and using that trust to federate AD users in the Athena JDBC/ODBC driver.

To implement this solution, you complete the following steps:

  • Configure ADFS (3.0) on a Windows Server 2012 R2 Amazon EC2 instance
  • Set up trust between AWS and ADFS (3.0) through SAML 2.0 rules
  • Install the Athena ODBC driver 1.0.5 on RHEL EC2 instance and configure it to use ADFS

Prerequisites

For this walkthrough, you need to have the following prerequisites:

  • An understanding of the concepts of Active Directory. The steps for configuring Active Directory on a Windows instance are outside the scope of this post.
  • An understanding of IAM roles and concepts.
  • DNS and networking set up between your Active Directory server and the instance on which the Athena ODBC or JDBC driver is installed. This post sets up Active Directory (which runs ADFS) on a Windows Amazon EC2 instance and the ODBC driver on another EC2 instance that are part of the same AWS VPC and subnet. However, for your use case, you need to provide that connectivity between the ADFS server and the ODBC/JDBC instance.

Configuring ADFS (3.0) on a Windows Server 2012 R2 Amazon EC2 instance

In the following steps, you install ADFS 3.0 on a Windows Server 2012 R2 Amazon EC2 instance. As per the prerequisites, you already installed Active Directory on a Windows Server 2012 R2 EC2 instance. For this post, the domain name is arunad.local. For instructions on setting up an Active Directory domain controller on an EC2 instance, see Building Your First Domain Controller on 2012 R2 on the Microsoft TechNet website.

Installing prerequisites for ADFS 3.0

To configure ADFS 3.0 on a Windows domain controller, you must have the following:

  • An SSL certificate – For this post, you can create a self-signed certificate by installing IIS (Internet Information Services)
  • A service account – Create an Active Directory user and add it to the Domain Admins group
  1. Install IIS on Windows Server 2012 R2 Amazon EC2 instance. For instructions, see How to install and configure IIS on Windows Server 2012 R2 on The Solving website. For this post, you can skip Step 2 in the preceding instructions.
  2. After you install IIS, create a self-signed certificate. For instructions, see How to Create a Self Signed SSL Certificate with Windows Server on the Sophos Community website. For this post, you can skip the step about binding the self-signed certificate in the preceding instructions.

To configure a service account in your domain controller, you create a user in your active directory with the name ADFSSVC and add the user to the domain admins group.

  1. Open Server Manager.
  2. Choose Tools.
  3. Choose Active Directory Users and Computers.
  4. Expand your domain (arunad.local).
  5. Choose User (right-click).
  6. Choose New.
  7. Choose User.
  1. Create a user with the name ADFSSVC.
  2. Set the password to never expire.

You can now add user ADFSSVC to the domain admins group.

  1. Choose Users.
  2. Choose ADFSSVC (right-click) and choose Add to group.
  3. In the search bar, enter domain.
  4. Choose Check Names.
  5. Choose Domain Admins.
  6. Choose OK.

You receive a message that the user is added to the group, but you should still verify it.

  1. Choose ADFSSVC (right-click) and choose Properties.
  2. On the Member Of tab, check that Domain Admins is listed.

Installing and configuring ADFS 3.0

Now that you have installed the prerequisites for ADFS 3.0, you can install and configure ADFS 3.0 on Windows Server 2012 R2 EC2 instance.

  1. Open Server Manager.
  2. Choose Roles and Features.
  3. Select Role-based or feature-based installation.

  1. Choose Next until you reach the Select server roles page.
  2. For Roles, select Active Directory Federation Services.

  1. Choose Next until you reach the Confirm installation selections page.
  2. Choose Install.

  1. Choose Configure the Federation Service for this server.
  2. Select Create the first federation server in a federation server farm.
  3. Choose Next.

  1. Choose Next until you reach the Specify Service Properties page.
  2. For SSL Certificate, choose the self-signed certificate you installed earlier.
  3. For Federation Service Display Name, enter ArunADFS.
  4. Choose Next.

  1. On the Specify Service Account page, select Use an existing domain user account or group Managed Service Account.
  2. Choose Select.

  1. In the text box, enter ADFSSVC.
  2. Choose Check names.
  3. When the name is populated, choose OK.

  1. Enter your password and choose Next.

  1. Select Create a database on this server using Windows Internal Database.
  2. Choose Next.

  1. Choose Next until you reach the Pre-requisite Checks page.
  2. Choose Configure.

When the server is successfully configured, you may see the following warning message: 

An error occurred during an attempt to set the SPN for the specified service account. Set the SPN for the service account manually. For more information about setting the SPN of the service account manually, see the AD FS Deployment Guide. Error message: The SPN required for this Federation Service is already set on another Active Directory account. Choose a different Federation Service name and try again.

To fix the problem, run the following command by opening PowerShell as an administrator:

setspn -a host/localhost adfssvc

The following code shows the output.

The ADFS 3.0 configuration is now complete.

  1. To download your ADFS server’s federation XML file, open a browser on your Windows Server and enter the following address: https://<yourservername>/FederationMetadata/2007-06/FederationMetadata.xml.

This file is required to set up trust between ADFS and AWS.

Alternatively, you can download the ADFS server’s federation XML file by running the following command as administrator in PowerShell 3.0+:

wget https://<your-server-name>/FederationMetadata/2007-06/FederationMetadata.xml -OutFile FederationMetadata.xml

You can find your server name by choosing Start (the Windows icon), right-clicking My Computer, choosing Properties, and checking Full Computer Name.

Copy the downloaded XML file on to your local machine so you can use it when creating a SAML-based role in IAM in the next step.

Establishing trust between Windows AD (using ADFS IDP) and AWS via SAML 2.0

Now that you have configured the ADFS 3.0 on a Windows Server, you can establish the trust between AWS and the IdP (ADFS) via SAML assertion claim rules. By establishing this trust, users in your AD can federate into AWS using an IAM role and access AWS resources such as Athena or the AWS Glue Data Catalog.

Setting up this trust requires configuration in both AWS and Active Directory. For AWS, you set up IAM roles and establish a SAML provider. For Active Directory, you write the SAML assertion and claim rules.

Setting up your SAML provider in IAM

To set up your SAML provider, complete the following steps:

  1. On the IAM console, choose Identity provider.
  2. Choose Create provider.
  3. For Provider Type, choose SAML.
  4. For Provider Name, enter MytestADFS.
  5. For Metadata Document, choose the XML file you downloaded earlier.

  1. Create a new role in IAM and choose the trusted entity as SAML 2.0 federation.
  2. For SAML provider, choose the provider you created earlier (MytestADFS).
  3. For Attribute, select SAML:aud.

  1. Add the necessary IAM permissions to this role and create the IAM role.

For this post, attach the AmazonAthenaFullAccess managed policy and name the role ArunADFSTest.

The role name you give in this step is crucial because any users and groups you create in Active Directory as part of establishing trust in the following steps are based on this name.

Configuring the SAML assertion rules in ADFS 3.0 IdP

In this step, you configure the SAML assertion rules in your ADFS so that users can federate into AWS using the IAM role you created.

Determining how you create and delineate your AD groups and IAM roles in AWS is crucial in how you secure access to your account and manage resources. One approach for creating the AD groups that uniquely identify the IAM role mapping is by selecting a common group naming convention.

For this post, create a group in Active Directory with the naming convention AWS-<AccountID>-<IAMRolename>; for example, AWS-123456789012-ArunADFSTest.

This naming convention is extremely important in the next steps because you write SAML assertion claim rules that fetch all the AD groups of your AD user containing the string AWS-<AccountID>- and extract the last portion of the group name to map it to an IAM role in AWS. For example, if the AD user that you’re authenticating to AWS is part of the AD group AWS-123456789012-ArunADFSTest, the claim rules, which you write later, find all groups of the AD user that match the string AWS-123456789012-, extract the last portion of the group name (ArunADFSTest), and send it as the role name to the AWS SAML endpoint in the format arn:aws:iam::<AccountID>:role/ArunADFSTest.
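
For reference, the Role attribute that these claim rules ultimately emit in the SAML assertion is a comma-separated pair of the SAML provider ARN and the role ARN. Using this post’s example account and names, the value looks similar to the following:

arn:aws:iam::123456789012:saml-provider/MytestADFS,arn:aws:iam::123456789012:role/ArunADFSTest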

  1. In Server Manager, under Tools, choose Active directory users and computers.
  2. Choose your domain (right-click) and choose New.
  3. Choose Group.

The following screenshot shows creating an AD group with name AWS-123456789012-ArunADFSTest:

  1. After you create the group in AD with name AWS-123456789012-ArunADFSTest, create a new user in that group. For this post, name the user myldapuser1.

  1. Make sure the E-mail field of the user contains a valid email address, because the claim rules use this field and pass it as the RoleSessionName to AWS when constructing the SAML token.

  1. After you create the user, add the user to the AD group AWS-123456789012-ArunADFSTest.

Now that you’ve created the AD groups, AD users, and IAM roles, you create the relying party trust in ADFS and write the claim rules. The ADFS IdP needs to construct the following values in the SAML assertion and send the values to AWS for authentication:

    • NameID
    • RoleSessionName
    • Roles (which contains your SAML IDP in AWS and role name)

For instructions on setting up the relying trust and claim rules in ADFS, see AWS Federated Authentication with Active Directory Federation Services (AD FS). For this walkthrough, you can start at the Active Directory Federation Services Configuration section.

  1. For Display name, enter My Amazon Portal.

After the configuration, your claim rules look similar to the following screenshots.

The following screenshot shows the rules for NameID.

The following screenshot shows the rules for RoleSessionName.

The following screenshot shows the rules for Get AD Groups.

The following screenshot shows the rules for Roles. The SAML provider in IAM was created with the name MytestADFS (ARN arn:aws:iam::123456789012:saml-provider/MytestADFS), so you need to use that same value (MytestADFS) in the claim script. If you used a different name, replace it with your IdP ARN name.

After you create these four rules, your ADFS relying trust setup is complete.

Verifying your IdP

To verify that you set up your IdP successfully, complete the following steps:

  1. Navigate to the following URL in the browser on your ADFS server Windows instance (use your Windows Server hostname): https://<windows-hostname>/adfs/ls/IdpInitiatedSignOn.aspx.

  1. Select your ADFS display name (My Amazon Portal).

  1. Enter the AD credentials for the user myldapuser1 you created earlier.

Because the domain name for this demo is arunad.local, specify the user name as ARUNAD\myldapuser1, or you can specify it as [email protected].

If everything is successful, you should be able to sign in to the AWS Management Console.

If you encounter any errors, review the preceding steps. For more information about common errors with SAML, see Troubleshooting SAML 2.0 Federation with AWS. For additional information about troubleshooting, see How to View a SAML Response in Your Browser for Troubleshooting.

When initiating a connection to your ADFS server, the Athena ODBC/JDBC driver also uses the same federation URL (https://<windows-hostname>/adfs/ls/IdpInitiatedSignOn.aspx) to retrieve the SAML assertion and AWS properties.

Installing the Athena ODBC driver 1.0.5 on an Amazon EC2 Linux instance and configuring it to use ADFS

Now that you have configured ADFS 3.0 and set up trust with AWS, the last step is to configure the Athena ODBC driver to use ADFS as its authentication mechanism.

As per the prerequisites, you launched the Active Directory Windows EC2 instance and the Athena ODBC driver Linux (RHEL) EC2 instance in the same VPC and subnet and allowed all traffic between both instances. However, in your environment, you need to make sure your ADFS server can communicate with the machine that has Athena JDBC/ODBC installed.

For this post, the domain name servers in the DHCP option set of VPC (in which you launch the Linux EC2) are modified as follows: 172.31.X.X, AmazonProvidedDNS, where 172.31.X.X is the IP address of the EC2 Windows instance on which ADFS is configured with the domain name arunad.local.

Setting up the environment on your EC2 instance

To set up your environment, complete the following steps:

  1. Launch a Linux EC2 instance with an AMI that supports the Red Hat Enterprise Distribution (for example, AMI with RHEL-7.6_HVM_GA-20181017-x86_64-0-Hourly2-GP2).
  2. SSH into the instance and enter the following commands:
sudo yum install telnet
sudo yum install nc
sudo yum install gcc
sudo yum install wget
sudo yum install vim
  1. Verify connectivity between your ADFS server and RHEL EC2 instance with a networking tool of your choice.

This post uses the ping utility. The following code shows the output:

  1. Install the OpenLDAP client and test whether you can search for a user in AD from this Linux instance (replace the values with your user and domain name):
sudo yum install openldap-clients
ldapsearch -h arunad.local -p 389 -D "CN=myldapuser1,CN=Users,DC=arunad,DC=local" -x -W -b "DC=arunad,DC=local"

If these commands are successful, the RHEL EC2 instance can communicate with the AD server and retrieve the credentials.

Installing and configuring UnixODBC Driver Manager 2.3.4

The Athena ODBC driver on Linux requires you to have one of the following ODBC Driver Managers installed to set up the connection:

  • iODBC 3.52.9, 3.52.10, 3.52.11, or 3.52.12
  • unixODBC 2.3.2, 2.3.3, or 2.3.4
  1. Install the UnixODBC driver manager 2.3.4 on your RHEL EC2 instance. For instructions, see unixODBC-2.3.4 on the Beyond Linux From Scratch website.

The command needed for installation on your EC2 instance should look similar to the following code:

wget ftp://ftp.unixodbc.org/pub/unixODBC/unixODBC-2.3.4.tar.gz
tar -zxvf unixODBC-2.3.4.tar.gz
cd unixODBC-2.3.4
./configure --prefix=/usr --sysconfdir=/etc/unixODBC && make
  1. Install the ODBC Driver Manager by switching to the root user. See the following code:
sudo -su root

make install &&

find doc -name "Makefile*" -delete                &&
chmod 644 doc/{lst,ProgrammerManual/Tutorial}/*   &&

install -v -m755 -d /usr/share/doc/unixODBC-2.3.4 &&
cp      -v -R doc/* /usr/share/doc/unixODBC-2.3.4

  1. After the ODBC Driver Manager is installed, make sure the following files are present, which indicate a successful installation of the ODBC driver manager on an RHEL EC2 instance:
    • /etc/unixODBC/odbcinst.ini
    • /etc/unixODBC/odbc.ini
  1. The ODBC Driver Manager library files created as part of the installation are present in the path /usr/lib. Set the shared library path to point to your ODBC Driver Manager libraries by entering the following command as a non-root user:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib

  1. Verify that the ODBC Driver Manager environment configuration is loaded properly by entering the command odbcinst -j.

Installing and configuring the Simba Athena ODBC driver (64-bit)

Now that the ODBC driver manager is configured, the final step is to install the Athena ODBC driver 1.0.5 on this Linux Instance and configure it to use ADFS as the authentication mechanism.

  1. To install the Athena ODBC driver on this EC2 instance, enter the following code:
wget https://s3.amazonaws.com/athena-downloads/drivers/ODBC/SimbaAthenaODBC_1.0.5/Linux/simbaathena-1.0.5.1006-1.x86_64.rpm

sudo yum --nogpgcheck localinstall simbaathena-1.0.5.1006-1.x86_64.rpm

The example ODBC configuration file of the downloaded driver is included in the path /opt/simba/athenaodbc/Setup/odbc.ini.

  1. To configure the ODBC driver to use ADFS as an authentication mechanism, log in as ec2-user and enter the following code:

vim .odbc.ini

  1. Insert the following configuration directives:
[ODBC]
Trace=no

[ODBC Data Sources]
Simba Athena 64-bit=Simba Athena ODBC Driver 64-bit


[Simba Athena 64-bit]
Description=Simba Athena ODBC Driver (64-bit) DSN
Driver=/opt/simba/athenaodbc/lib/64/libathenaodbc_sb64.so


# Connection configurations should be set here.
AwsRegion=us-west-2
Schema=default
S3OutputLocation=s3://aws-athena-query-results-123456780912-us-west-2
AuthenticationType=ADFS
UID=ARUNAD\myldapuser1
PWD=XXXXXXXX
IdP_Host=win-qikm653mpj9.arunad.local
IdP_Port=443
SSL_Insecure=true

Replace the values for IdP_Host and IdP_Port to point to your ADFS server. For this post, these values are win-qikm653mpj9.arunad.local and 443. Similarly, replace UID and PWD with the LDAP user name and password you created earlier. This post uses ARUNAD\myldapuser1.  Also,  replace AwsRegion and S3OutputLocation values according to your environment. For production workloads, make sure that you set SSL_Insecure to false so the driver can verify the server certificate.

You can retrieve the HTTPS port number of your ADFS server by entering the following command in PowerShell on the Windows AD server instance:

Get-AdfsProperties

  1. To enable DEBUG level logging on your Athena ODBC driver, edit the file /opt/simba/athenaodbc/lib/64/simba.athenaodbc.ini and set the following values:
[Driver]
## - Note that this default DriverManagerEncoding of UTF-32 is for iODBC.
ErrorMessagesPath=/opt/simba/athenaodbc/ErrorMessages
LogLevel=5
LogPath=/home/ec2-user/odbclogs/
SwapFilePath=/tmp
  1. Now that the Athena ODBC driver is configured, you can test it by entering the following code:

isql -v "Simba Athena 64-bit"

In the preceding code, Simba Athena 64-bit refers to the name of your DSN you specified in /home/ec2-user/.odbc.ini while connecting to the ODBC driver.

If you’re connected, it means you have successfully connected to Athena through the ODBC driver by authenticating your user against ADFS.

You can also check the connection log to verify the connection URI used by your driver and values returned by ADFS to the Athena ODBC driver.

Entries from the connection log look similar to the following:

May 09 01:40:58.761 DEBUG 50743104 IAMAdfsCredentialsProvider::FormBasedAuthentication: verifySSL: false
May 09 01:40:58.761 DEBUG 50743104 IAMAdfsCredentialsProvider::FormBasedAuthentication: Using URI: https://win-qikm653mpj9.arunad.local:443/adfs/ls/IdpInitiatedSignOn.aspx?loginToRp=urn:amazon:webservices
May 09 01:40:58.821 DEBUG 50743104 IAMSamlPluginCredentialsProvider::GetAWSCredentialsWithSaml: Using RoleArn: arn:aws:iam::143280751103:role/ArunADFSTest, PrincipalArn: arn:aws:iam::143280751103:saml-provider/MytestADFS

Conclusion

This post demonstrated how to configure ADFS 3.0 on your Active Directory and use it as an IdP to federate into AWS using SAML. This post also showed how you can integrate your Athena ODBC driver to ADFS and use your Active Directory credentials directly to connect to Athena. Integrating your Active Directory with the Athena ODBC driver gives you the flexibility to access Athena from BI tools you’re already familiar with and analyze the data in Amazon S3 using SQL, without needing to create separate IAM users.

If your organization has single sign-on (SSO) into AWS enabled with the Okta identity provider, you can use the latest version of the Athena JDBC driver, version 2.0.9, to use Okta as the authentication mechanism. For more information, see Using OKTA in Athena JDBC.

If you have any questions or feedback, please leave a comment.

 


About the Author

Alapati Arun is a Cloud Support Engineer with AWS based out of Dallas. He focuses on supporting customers in using big data technologies. He enjoys travel and watching movies.

 

How Drop used the Amazon EMR runtime for Apache Spark to halve costs and get results 5.4 times faster

Post Syndicated from Michael Chau original https://aws.amazon.com/blogs/big-data/how-drop-used-the-amazon-emr-runtime-for-apache-spark-to-halve-costs-and-get-results-5-4-times-faster/

This is a guest post by Michael Chau, software engineer with Drop, and Leonardo Gomez, AWS big data specialist solutions architect. In their own words, “Drop is on a mission to level up consumer lives, one reward at a time. Through our personalized commerce platform, we intelligently surface the right brands, at the right time, to make our members’ everyday better than it was before. Powered by machine learning, we match consumers with over 200+ partner brands to satisfy two main goals: to earn points from their purchases and redeem them for instant rewards. Calling Toronto home but operating under a global mindset, Drop is building the next-level experience for our 3 million+ members across North America. Learn more by visiting www.joindrop.com.”

Overview

At Drop, our data lake infrastructure plays a foundational role in enabling better data-informed product and business decisions. A critical feature is its ability to process vast amounts of raw data and produce reconciled datasets that follow our data lake’s standardized file format and partitioning structure. Our business intelligence, experimentation analytics, and machine learning (ML) systems use these transformed datasets directly.

This post details how we designed and implemented our data lake’s batch ETL pipeline to use Amazon EMR, and the numerous ways we iterated on its architecture to reduce Apache Spark runtimes from hours to minutes and save over 50% on operational costs.

Building the pipeline

Drop’s data lake serves as the center and source of truth for the company’s entire data infrastructure upon which our downstream business intelligence, experimentation analytics, and ML systems critically rely. Our data lake’s goal is to ingest vast amounts of raw data from various sources and generate reliable and reconciled datasets that our downstream systems can access via Amazon Simple Storage Service (Amazon S3). To accomplish this, we architected our data lake’s batch ETL pipeline to follow the Lambda architecture processing model and used a combination of Apache Spark and Amazon EMR to transform the raw ingested data that lands into our Amazon S3 lake into reconciled columnar datasets. When designing and implementing this pipeline, we adopted the following core guiding principles and considerations:

  • Keep our tech stack simple
  • Use infrastructure as code
  • Work with transient resources

Keeping our tech stack simple

We aimed to keep our tech stack simple by using existing and proven AWS technologies and only adopting services that would drive substantial impact. Drop is primarily an AWS shop, so continuing to use AWS technologies made sense due to our existing experience, the ability to prototype new features quickly, and the inherent integration benefits of using other services within Amazon’s ecosystem.

Another effort to keep our tech stack simple was to limit the overhead and complexity of newly adopted open-source Apache Hadoop technologies. Our engineering team initially had limited experience working with these technologies, so we made a conscious effort to mitigate additional technical overhead to our stack by using proven fully-managed services. We integrated Amazon EMR as part of our idempotent data pipelines because we could use the service when our pipeline operations required it, which eliminated the need to maintain the service when no longer required. This allowed us to reduce the technical overhead of constantly maintaining production clusters.

Using infrastructure as code

We use Apache Airflow to manage and schedule our data lake pipeline operations. Using Airflow enables us to build our entire workflows and infrastructure as code via Airflow Directed Acyclic Graphs (DAGs). This key decision also simplified our engineering development and deployment processes, while providing version control for all aspects of our data infrastructure.

Working with transient resources

To reduce operational costs, we made a key decision to build our data processing pipelines using transient resources. By designing our pipelines to spin up EMR clusters only upon operational demand and terminate upon job completion, we can use Amazon Elastic Compute Cloud (Amazon EC2) Spot and On-Demand Instances without paying for idle resources. This approach has enabled a dramatic reduction in costs associated with idle clusters.

Batch ETL pipeline overview

The following diagram illustrates our batch ETL pipeline architecture.

The pipeline includes the following steps:

  1. A core requirement for the Lambda architecture data model is to have access to both batch and stream data sources of a dataset. The batch ETL pipeline primarily ingests data in batch and stream formats from our Amazon Relational Database Service (Amazon RDS) Postgres database using AWS Database Migration Service (AWS DMS). The pipeline initiates full-migration AWS DMS tasks for comprehensive batch snapshots using Airflow, and ingests stream data using ongoing replication AWS DMS tasks for 1-minute latency change data capture (CDC) files. Data from both batch and stream formats are landed into our Amazon S3 lake, and cataloged in our AWS Glue Data Catalog using AWS Glue crawlers.
  2. The batch ETL pipeline Apache Airflow DAG runs a series of tasks, which begins with uploading our Lambda architecture Spark application to Amazon S3, spinning up an EMR cluster, and ultimately running the Spark application as an Amazon EMR step. Depending on the characteristics of the datasets, the necessary Amazon EMR resources are calibrated to produce the reconciled dataset. To produce the resultant datasets in Apache Parquet format, we must allocate sufficient CPU and memory to our clusters.
  3. Upon completion of all of the Amazon EMR steps, the cluster is terminated, and the newly produced dataset is crawled using AWS Glue crawlers to update the dataset’s metadata within the Data Catalog. The output dataset is now ready for consumer systems to access via Amazon S3 or query using Amazon Athena or Amazon Redshift Spectrum.

Evolving the EMR pipeline

Our engineering team is constantly iterating on the architecture of our batch ETL pipeline in an effort to reduce its runtime duration and operational costs. The following iterations and notable feature enhancements have generated the largest impact to the downstream systems, as well as the end-users that rely on this pipeline.

Migrating from AWS Glue to Amazon EMR

The first iteration of our batch ETL pipeline used AWS Glue to process our Spark applications rather than Amazon EMR due to our limited in-house Hadoop experience in the initial stages. AWS Glue was an appealing first solution due to its “ETL as a service” features and simplified resource allocation. The AWS Glue solution successfully delivered the desired results; however, as we gained experience with Hadoop technologies, we recognized a significant opportunity to use Amazon EMR to improve pipeline performance and reduce operational costs.

The migration from AWS Glue to Amazon EMR was seamless and only required EMR cluster configurations and minor modifications to our Spark application that used AWS Glue libraries. Thanks to this, we achieved the following operational benefits:

  • Shorter cluster bootstrapping and resource provisioning durations. We found that AWS Glue clusters have a cold start time of 10–12 minutes, whereas EMR clusters have a cold start time of 7–8 minutes.
  • An 80% reduction in cost while using equivalent resources. We swapped the standard AWS Glue worker type, at a cost of $0.44 per DPU-hour, for the resource-equivalent m5.xlarge Amazon EMR instance type, which has a Spot Instance price of approximately $0.085 per instance per hour.

File committers

Our original partitioning strategy attempted to use Spark’s dynamic write partitioning feature to reduce the number of written files per run. See the following code:

sparkSession.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

This strategy didn’t translate well in our pipeline’s performance; we quickly experienced the limitations and considerations of working with cloud object stores. By pivoting our Spark application’s file-writing strategy to completely overwrite an existing directory and using the Amazon EMR EMRFS S3-optimized committer, we could realize critical performance gains. In scenarios where datasets were nearly a terabyte, deployment of this optimized file committer reduced runtime from hours to less than half an hour! It’s worth noting that Amazon EMR 5.30.0 includes an optimization that should help with dynamic partitionOverwriteMode.
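
The following is a simplified sketch of this write strategy; the DataFrame name, partition column, and S3 path are placeholders, and the committer setting shown is already the default on Amazon EMR 5.20 and later:

// Use the EMRFS S3-optimized committer for Parquet output (default on EMR 5.20+).
sparkSession.conf.set("spark.sql.parquet.fs.optimized.committer.optimization-enabled", "true")

// Completely overwrite the existing dataset directory instead of relying on
// dynamic partition overwrites.
reconciledDf.write
  .mode("overwrite")
  .partitionBy("created_date")   // placeholder partition column
  .parquet("s3://example-data-lake/reconciled/table_name/")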

Upgrading Amazon EMR versions to 5.28+

Our datasets often exceed billions of rows, which necessitated the comparison and processing of hundreds of thousands of stream files against large batch files. The ability to execute these Spark operations given the input data sources comes at a high cost to query and process the data.

A huge improvement in our pipeline’s overall performance came from using the Amazon EMR runtime for Apache Spark feature introduced in Amazon EMR version 5.28. We saw immediate performance gains by upgrading from Amazon EMR 5.27 to 5.29, without having to make any additional changes to our existing pipeline. Our Spark applications’ total runtime and the subsequent Amazon EMR cost were reduced by over 35% using identical resource configurations. These improvements were benchmarked against two of our datasets and averaged across three production runs.

The following table summarizes the dataset and EMR cluster properties.

Dataset | Table Rows | Total Batch Files Size | Total Stream Files Size | Count Stream Files | EC2 Instance Type | Count EC2 Instances
Dataset A | ~3.5M | ~0.5GB | ~0.2GB | ~100k | m5.xlarge | 10
Dataset B | ~3,500M | ~500GB | ~120GB | ~250k | r5.2xlarge | 30

The following diagrams summarize the Amazon EMR upgrade performance benchmarks and metrics. We calculated these cost metrics with Amazon EMR bootstrapping and resource provisioning time included.

Amazon EMR step concurrency

Early iterations of our pipeline architecture involved creating a new batch ETL pipeline per dataset, as well as a dedicated EMR cluster for that dataset. Cloning new pipelines was a quick and simple way to scale our processing capabilities because our infrastructure was written as code and the operations and resources were self-contained. Although this enabled pipeline generation quickly for our most important datasets, there was ample opportunity for operational improvements.

The following screenshot shows Drop’s batch ETL processing DAG. All of the clusters are named after the Drop engineering team’s pets.

The evolution of the pipeline architecture involved grouping datasets based on their Amazon EMR resource requirements and running them concurrently as Spark application Amazon EMR steps on a common EMR cluster, using Amazon EMR step concurrency. Re-architecting our batch ETL pipelines in this manner allowed us to do the following:

  • Remove the EMR cluster bootstrapping and provisioning duration associated within individual EMR clusters per dataset
  • Reduce overall Spark runtimes in aggregate
  • Simplify our Amazon EMR resource configurations with fewer EMR clusters

On average, our clusters required 8–10 minutes to bootstrap and source the Spot Instances requested. By migrating multiple Spark applications to a common EMR cluster, we removed this bottleneck, and ultimately reduced overall runtime and Amazon EMR costs. Amazon EMR step concurrency also allowed us to run multiple applications at the same time against a dramatically reduced set of resources. For our smaller datasets (under 15 million rows), we learned that running Spark applications concurrently with reduced resources didn’t have a linear effect on overall runtime, and we could achieve shorter runtimes with fewer resources compared to the previous architecture in aggregate. However, our larger datasets (over 1 billion rows) didn’t exhibit the same performance behaviors or gains as the smaller tables when running Amazon EMR steps concurrently. Therefore, EMR clusters for larger tables required additional resources and fewer steps; however, the overall result is still marginally better in terms of cost and overall runtime in aggregate compared to the previous architecture.
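
For reference, the step concurrency level of an existing cluster can be changed with the AWS CLI; the cluster ID and level below are placeholders:

aws emr modify-cluster \
  --cluster-id j-XXXXXXXXXXXXX \
  --step-concurrency-level 5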

Amazon EMR instance fleets

Working with Amazon EMR and Amazon EC2 Spot Instances has allowed us to realize tremendous cost savings, but it can come at the expense of EMR cluster reliability. We have experienced Spot Instance availability issues due to Spot Instance type supply constraints on the available market and losing EC2 instances due to competitive bidding. Both issues directly contribute to overall pipeline performance degradation in the form of longer EMR cluster resource provisioning and longer Spark runtimes due to lost nodes.

To improve our pipeline reliability and protect against these risks, we began to use Amazon EMR instance fleets. Instance fleets addressed both pain points: they work around limited supply of a specific EC2 Spot Instance type by sourcing an alternative instance type, and they can automatically switch to On-Demand Instances if provisioning Spot Instances exceeds a specified threshold duration. Prior to using instance fleets, about 15% of our Amazon EMR production runs were affected by limitations related to Spot Instance supply or price bidding. Since implementing instance fleets, we haven’t had a cluster fail or experienced resource provisioning that was prolonged past programmed thresholds.
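
The following create-cluster sketch illustrates the idea: the core fleet targets Spot capacity across two instance types and switches to On-Demand Instances if Spot capacity can't be provisioned within a timeout. Instance types, counts, release label, and the timeout are placeholders.

aws emr create-cluster \
  --name "transient-etl-cluster" \
  --release-label emr-5.29.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-fleets '[
    {"Name": "Master", "InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
     "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}]},
    {"Name": "Core", "InstanceFleetType": "CORE", "TargetSpotCapacity": 10,
     "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}, {"InstanceType": "m5a.xlarge"}],
     "LaunchSpecifications": {"SpotSpecification": {
       "TimeoutDurationMinutes": 20, "TimeoutAction": "SWITCH_TO_ON_DEMAND"}}}
  ]'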

Conclusion

Amazon EMR has played a critical role in Drop’s ability to use data to make better-informed product and business decisions. We have had tremendous success in capitalizing on Amazon EMR features to improve our data processing pipeline’s overall performance and cost efficiency, and we will continue to explore new ways to constantly improve our pipeline. One of the easiest ways to learn about these new opportunities to improve our systems is to stay current with the latest AWS technologies and Amazon EMR features.

 


About the Authors

Michael Chau is a Software Engineer at Drop. He has experience moving data from A to B and sometimes transforming it along the way.

 

 

 

 

Leonardo Gómez is a Big Data Specialist Solutions Architect at AWS. Based in Toronto, Canada, he works with customers across Canada to design and build big data architectures.

 

 

Using Random Cut Forests for real-time anomaly detection in Amazon Elasticsearch Service

Post Syndicated from Chris Swierczewski original https://aws.amazon.com/blogs/big-data/using-random-cut-forests-for-real-time-anomaly-detection-in-amazon-elasticsearch-service/

Anomaly detection is a rich field of machine learning. Many mathematical and statistical techniques have been used to discover outliers in data, and as a result, many algorithms have been developed for performing anomaly detection in a computational setting. In this post, we take a close look at the output and accuracy of the anomaly detection feature available in Amazon Elasticsearch Service and Open Distro for Elasticsearch, and provide insight as to why we chose Random Cut Forests (RCF) as the core anomaly detection algorithm. In particular, we:

  • Discuss the goals of anomaly detection
  • Share how to use the RCF algorithm to detect anomalies and why we chose RCF for this tool
  • Interpret the output of anomaly detection for Elasticsearch
  • Compare the results of the anomaly detector to commonly used methods

What is anomaly detection?

Human beings have excellent intuition and can detect when something is out of order. Often, an anomaly or outlier can appear so obvious that you just “know it when you see it.” However, you can’t base computational approaches to anomaly detection on such intuition; you must ground them in mathematical definitions of anomalies.

The mathematical definition of an anomaly is varied and typically addresses the notion of separation from normal observation. This separation can manifest in several ways via multiple definitions. One common definition is “a data point lying in a low-density region.” As you track a data source, such as total bytes transferred from a particular IP address, number of logins on a given website, or number of sales per minute of a particular product, the raw values describe some probability or density distribution. A high-density region in this value distribution is an area of the domain where a data point is highly likely to exist. A low-density region is where data tends not to appear. For more information, see Anomaly detection: A survey.

For example, the following image shows two-dimensional data with a contour map indicating the density of the data in that region.

The data point in the bottom-right corner of the image occurs in a low-density region and, therefore, is considered anomalous. This doesn’t necessarily mean that an anomaly is something bad. Rather, under this definition, you can describe an anomaly as behavior that rarely occurs or is outside the normal scope of behavior.

Random Cut Forests and anomaly thresholding

The algorithmic core of the anomaly detection feature consists of two main components:

  • A RCF model for estimating the density of an input data stream
  • A thresholding model for determining if a point should be labeled as anomalous

You can use the RCF algorithm to summarize a data stream, including efficiently estimating its data density, and convert the data into anomaly scores. Anomaly scores are positive real numbers such that the larger the number, the more anomalous the data point. For more information, see Real Time Anomaly Detection in Open Distro for Elasticsearch.

We chose RCF for this plugin for several reasons:

  • Streaming context – Elasticsearch feature queries are streaming, in that the anomaly detector only receives each new feature aggregate one at a time.
  • Expensive queries – Especially on a large cluster, each feature query may be costly in CPU and memory resources. This limits the amount of historical data we can obtain for model training and initialization.
  • Customer hardware – Our anomaly detection plugin runs on the same hardware as our customers’ Elasticsearch cluster. Therefore, we must be mindful of our plugin’s CPU and memory impact.
  • Scalable – It is preferred if you can distribute the work required to determine anomalous data across the nodes in the cluster.
  • Unlabeled data – Even for training purposes, we don’t have access to labeled data. Therefore, the algorithm must be unsupervised.

Based on these constraints and performance results from internal and publicly available benchmarks across many data domains, we chose the RCF algorithm for computing anomaly scores in data streams.

This raises the question: how large an anomaly score is large enough to declare the corresponding data point an anomaly? The anomaly detector uses a thresholding model to answer this question. This thresholding model combines information from the anomaly scores observed thus far and certain mathematical properties of RCFs. This hybrid information approach allows the model to make anomaly predictions with a low false positive rate when relatively little data has been observed, and to adapt effectively to the data in the long run. The model constructs an efficient sketch of the anomaly score distribution using the KLL Quantile Sketch algorithm. For more information, see Optimal Quantile Approximation in Streams.

Understanding the output

The anomaly detector outputs two values: an anomaly grade and a confidence score. The anomaly grade is a measurement of the severity of an anomaly on a scale from zero to one. A zero anomaly grade indicates that the corresponding data point is normal. Any non-zero grade means that the anomaly score output by RCF exceeds the calculated score threshold, and therefore indicates the presence of an anomaly. Using the mathematical definition introduced at the beginning of this post, the grade is inversely related to the anomaly’s density; that is, the rarer the event, the higher the corresponding anomaly grade.

The confidence score is a measurement of the probability that the anomaly detection model correctly reports an anomaly within the algorithm’s inherent error bounds. We derive the model confidence from three sources:

  • A statistical measurement of whether the RCF model has observed enough data. As the RCF model observes more data, this source of confidence approaches 100%.
  • A confidence upper bound comes from the approximation made by the distribution sketch in the thresholding model. The KLL algorithm can only predict the score threshold within particular error bounds with a certain probability.
  • A confidence measurement attributed to each node of the Elasticsearch cluster. If a node is lost, the corresponding model data is also lost, which leads to a temporary confidence drop.
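
To make this concrete, here is a minimal Python sketch of how an application might interpret the two outputs together. It is not part of the plugin; the 0.9 confidence cutoff and the function itself are hypothetical choices for illustration.

def handle_detection(anomaly_grade, confidence, min_confidence=0.9):
    # A grade of zero means the point is normal; any non-zero grade means
    # the RCF anomaly score exceeded the calculated threshold. The grade
    # (0-1) conveys severity; the confidence conveys how much to trust it.
    if anomaly_grade == 0:
        return "normal"
    if confidence >= min_confidence:
        return "anomaly (severity {:.2f})".format(anomaly_grade)
    return "possible anomaly, low model confidence"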

The NYC Taxi dataset

We demonstrate the effectiveness of the new anomaly detector feature on the New York City taxi passenger dataset. This data contains 6 months of taxi ridership volume from New York City aggregated into 30-minute windows. Thankfully, the dataset comes with labels in the form of anomaly windows, which indicate a period of time when an anomalous event is known to occur. Example known events in this dataset are the New York City marathon, where taxi ridership uncharacteristically spiked shortly after the event ended, and the January 2015 North American blizzard, when at one point the city ordered all non-essential vehicles off the streets, which resulted in a significant drop in taxi ridership.
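
The dataset is distributed as a simple CSV of timestamped counts. If you want to inspect it yourself, the following Python sketch loads it with pandas; the file name and the timestamp and value column names are assumptions based on the commonly distributed form of this dataset.

import pandas as pd

# Assumes a CSV with 'timestamp' and 'value' columns, one row per
# 30-minute window of taxi ridership.
taxi = pd.read_csv("nyc_taxi.csv", parse_dates=["timestamp"])
taxi = taxi.set_index("timestamp").sort_index()
print(taxi["value"].describe())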

We compare the results of our anomaly detector to two common approaches for detecting anomalies: a rules-based approach and the Gaussian distribution method.

Rules-based approach

In a rules-based approach, you mark a data point as anomalous if it exceeds a preset, human-specified boundary. This approach requires significant domain knowledge of the incoming data and can break down if the data has any upward or downward trends.

The following graph is a plot of the NYC taxi dataset with known anomalous event periods indicated by shaded regions. The model’s anomaly detection output is shown in red below the taxi ridership values. A set of human labelers received the first month of data (1,500 data points) to define anomaly detection rules. Half of the participants responded by stating they didn’t have sufficient information to confidently define such rules. From the remaining responses, the consolidated rules for anomalies are either that the value is equal to or greater than 30,000, or the value is below 20,000 for 150 points (about three days).
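
As a rough illustration only, the consolidated rules can be expressed in a few lines of Python; the thresholds come from the annotators’ responses above, and the function itself is hypothetical.

def rules_based_labels(values, high=30000, low=20000, low_run=150):
    # A point is anomalous if it is at or above `high`, or if values have
    # stayed below `low` for at least `low_run` consecutive points
    # (about three days of 30-minute windows).
    labels = []
    below_count = 0
    for v in values:
        below_count = below_count + 1 if v < low else 0
        labels.append(v >= high or below_count >= low_run)
    return labels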

In this use case, the human annotators do a good enough job in this particular range of data. However, this approach to anomaly detection doesn’t scale well and may require a large amount of training data before a human can set reasonable thresholds that don’t suffer from a high false positive or high false negative rate. Additionally, as mentioned earlier, if this data develops an upward or downward trend, our team of annotators needs to revisit these constant-value thresholds.

Gaussian distribution method

A second common approach is to fit a Gaussian distribution to the data and define an anomaly as any value that is three standard deviations away from the mean. To improve the model’s ability to adapt to new information, the distribution is typically fit on a sliding window of the observations. Here, we determine the mean and standard deviation from the 1,500 most recent data points and use these to make predictions on the current value. See the following graph.

The Gaussian model detects the clear ridership spike at the marathon but isn’t robust enough to capture the other anomalies. In general, such a model can’t capture certain kinds of temporal anomalies where, for example, there’s a sudden spike in the data or other change in behavior that is still within the normal range of values.
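
For reference, the sliding-window three-sigma detector described above can be sketched as follows; NumPy is assumed, and the window of 1,500 points matches the experiment.

import numpy as np

def gaussian_sliding_window(values, window=1500, k=3.0):
    # Flag a point if it falls more than k standard deviations from the
    # mean of the `window` most recent observations.
    values = np.asarray(values, dtype=float)
    flags = np.zeros(len(values), dtype=bool)
    for i in range(window, len(values)):
        recent = values[i - window:i]
        mu, sigma = recent.mean(), recent.std()
        flags[i] = bool(sigma > 0 and abs(values[i] - mu) > k * sigma)
    return flags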

Anomaly detection tool

Finally, we look at the anomaly detection results from the anomaly detection tool. The taxi data is streamed into an RCF model that estimates the density of the data in real time. The RCF sends these anomaly scores to the thresholding model, which decides whether the corresponding data point is anomalous. If so, the model reports the severity of the anomaly in the anomaly grade. See the following graph.

Five of the seven known anomalous events are successfully detected with zero false positives. Furthermore, with our definition of anomaly grade, we can indicate which anomalies are more severe than others. For example, the NYC Marathon spike is much more severe than those of Labor Day and New Year’s Eve. Based on the definition of an anomaly in terms of data density, the behavior observed at the NYC Marathon lives in a very low-density region, whereas by the time we see the New Year’s Eve spike, this kind of behavior is still rare but no longer as rare.

Summary

In this post, you learned about the goals of anomaly detection and explored the details of the model and output of the anomaly detection feature, now available in Amazon ES and Open Distro for Elasticsearch. We also compared the results of the anomaly detection tool to two common models and observed considerable performance improvement.


About the Authors

Chris Swierczewski is an applied scientist in Amazon AI. He enjoys hiking and backpacking with his family.

Lai Jiang is a software engineer working on machine learning and Elasticsearch at Amazon Web Services. His primary interests are algorithms and math. He is an active contributor to Open Distro for Elasticsearch.

Moving to managed: The case for the Amazon Elasticsearch Service

Post Syndicated from Kevin Fallis original https://aws.amazon.com/blogs/big-data/moving-to-managed-the-case-for-amazon-elasticsearch-service/

Prior to joining AWS, I led a development team that built mobile advertising solutions with Elasticsearch. Elasticsearch is a popular open-source search and analytics engine for log analytics, real-time application monitoring, clickstream analysis, and (of course) search. The platform I was responsible for was essential to driving my company’s business.

My team ran a self-managed implementation of Elasticsearch on AWS. At the time, a managed Elasticsearch offering wasn’t available. We had to build the scripting and tooling to deploy our Elasticsearch clusters across three geographical regions. This included the following tasks (and more):

  • Configuring the networking, routing, and firewall rules to enable the cluster to communicate.
  • Securing Elasticsearch management APIs from unauthorized access.
  • Creating load balancers for request distribution across data nodes.
  • Creating automatic scaling groups to replace the instances if there were issues.
  • Automating the configuration.
  • Managing the upgrades for security concerns.

If I had one word to describe the experience, it would be “painful.” Deploying and managing your own Elasticsearch clusters at scale takes a lot of time and knowledge to do it properly. Perhaps most importantly, it took my engineers away from doing what they do best—innovating and producing solutions for our customers.

Amazon Elasticsearch Service (Amazon ES) launched on October 1, 2015, almost 2 years after I joined AWS. Almost 5 years later, Amazon ES is in the best position ever to provide you with a compelling feature set that enables your search-related needs. With Amazon ES, you get a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch clusters securely and cost-effectively in the AWS Cloud. Amazon ES offers direct access to the Elasticsearch APIs, which makes your existing code and applications using Elasticsearch work seamlessly with the service.

Amazon ES provisions all the resources for your Elasticsearch cluster and launches it in any Region of your choice within minutes. It automatically detects and replaces failed Elasticsearch nodes, which reduces the overhead associated with self-managed infrastructures. You can scale your cluster horizontally or vertically, up to 3 PB of data, with zero downtime through a single API call or a few clicks on the AWS Management Console. With this flexibility, Amazon ES can support any workload from single-node development clusters to production-scale, multi-node clusters.

Amazon ES also provides a robust set of Kibana plugins, free of any licensing fees. Features like fine-grained access control, alerting, index state management, and SQL support are just a few of the many examples. The Amazon ES feature set originates from the needs of customers such as yourself and through open-source community initiatives such as Open Distro for Elasticsearch (ODFE).

You need to factor several considerations into your decision to move to a managed service. Obviously, you want your teams focused on doing meaningful work that propels the growth of your company. Deciding which processes to offload to a managed service and which are best self-managed can be a challenge. Based on my experience managing Elasticsearch at my prior employer, and having worked with thousands of customers who have migrated to AWS, I consider the following sections important topics for you to review.

Workloads

Before migrating to a managed service, you might look to what others are doing in their “vertical,” whether it be in finance, telecommunications, legal, ecommerce, manufacturing, or any number of other markets. You can take comfort in knowing that thousands of customers across these verticals successfully deploy their search, log analytics, SIEM, and other workloads on Amazon ES.

Elasticsearch is, first and foremost, a search engine. Compass uses Amazon ES to scale their search infrastructure and build a complete, scalable, real estate home-search solution. By using industry-leading search and analytical tools, they make every listing in the company’s catalog discoverable to consumers and help real estate professionals find, market, and sell homes faster.

With powerful tools such as aggregations and alerting, Elasticsearch is widely used for log analytics workloads to gain insights into operational activities. As Intuit moves to a cloud hosting architecture, the company is on an “observability” journey to transform the way it monitors its applications’ health. Intuit used Amazon ES to build an observability solution, which provides visibility to its operational health across the platform, from containers to serverless applications.

When it comes to security, Sophos is a worldwide leader in next-generation cybersecurity, and protects its customers from today’s most advanced cyber threats. Sophos developed a large-scale security monitoring and alerting system using Amazon ES and other AWS components because they know Amazon ES is well suited for security use cases at scale.

Whether it be finding a home, detecting security events, or assisting developers with finding issues in applications, Amazon ES supports a wide range of use cases and workloads.

Cost

Any discussion around operational best practices has to factor in cost. With Amazon ES, you can select the optimal instance type and storage option for your workload with a few clicks on the console. If you’re uncertain of your compute and storage requirements, Amazon ES has on-demand pricing with no upfront costs or long-term commitments. When you know your workload requirements, you can lock in significant cost savings with Reserved Instance pricing for Amazon ES.

Compute and infrastructure costs are just one part of the equation. At AWS, we encourage customers to evaluate their Total Cost of Ownership (TCO) when comparing solutions. As an organizational decision-maker, you have to consider all the related cost benefits when choosing to replace your self-managed environment. Some of the factors I encourage my customers to consider are:

  • How much are you paying to manage the operation of your cluster 24/7/365?
  • How much do you spend on building the operational components, such as support processes, as well as automated and manual remediation procedures for clusters in your environment?
  • What are the license costs for advanced features?
  • What costs do you pay for networking between the clusters or the DNS services to expose your offerings?
  • How much do you spend on backup processes and how quickly can you recover from any failures?

The beauty of Amazon ES is that you don’t need to focus on these issues. Amazon ES provides operational teams to manage your clusters, automated hourly backups of your data retained for 14 days, automated remediation of cluster events, and incremental license-free features as basic tenets of the service.

You also need to pay particularly close attention to managing the cost of storing your data in Elasticsearch. In the past, to keep storage costs from getting out of control, self-managed Elasticsearch users had to rely on solutions that were complicated to manage across data tiers and, in some cases, didn’t give them quick access to that data. AWS solved this problem with UltraWarm, a new, low-cost storage tier. UltraWarm lets you store and interactively analyze your data, backed by Amazon Simple Storage Service (Amazon S3) using Elasticsearch and Kibana, while reducing your cost per GB by almost 90% over existing hot storage options.

Security

In my conversations with customers, their primary concern is security. One data breach can cost millions and forever damage a company’s reputation. Providing you with the tools to secure your data is a critical component of our service, and Amazon ES gives you several options for protecting your data.

Many customers want to have a single sign-on environment when integrating with Kibana. Amazon ES offers Amazon Cognito authentication for Kibana. You can choose to integrate identity providers such as AWS Single Sign-On, PingFederate, Okta, and others. For more information, see Integrating Third-Party SAML Identity Providers with Amazon Cognito User Pools.
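
As a rough sketch of what this looks like programmatically, the following boto3 call enables Cognito authentication for Kibana on an existing domain. The domain name, Cognito pool IDs, and IAM role ARN are placeholders you would replace with your own values.

import boto3

es = boto3.client("es")
# Placeholders: substitute your own domain name, Cognito pools, and IAM role.
es.update_elasticsearch_domain_config(
    DomainName="my-domain",
    CognitoOptions={
        "Enabled": True,
        "UserPoolId": "us-east-1_example",
        "IdentityPoolId": "us-east-1:00000000-0000-0000-0000-000000000000",
        "RoleArn": "arn:aws:iam::123456789012:role/CognitoAccessForAmazonES",
    },
)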

Recently, Amazon ES introduced fine-grained access control (FGAC). FGAC provides granular control of your data on Amazon ES. For example, depending on who makes the request, you might want a search to return results from only one index. You might want to hide certain fields in your documents or exclude certain documents altogether. FGAC gives you the power to control who sees what data exists in your Amazon ES domain.

Compliance

Many organizations need to adhere to a number of compliance standards. Those that have experienced auditing and certification activities know that ensuring compliance is an expensive, complex, and long process. However, by using Amazon ES, you benefit from the work AWS has done to ensure compliance with a number of important standards. Amazon ES is PCI DSS, SOC, ISO, and FedRAMP compliant to help you meet industry-specific or regulatory requirements. Because Amazon ES is a HIPAA-eligible service, you can process, store, and transmit PHI, which can help you accelerate the development of these sensitive workloads.

Amazon ES is part of the services in scope of the most recent assessment. You can build solutions on top of Amazon ES with the knowledge that independent auditors acknowledge that the service meets the bar for these important industry standards.

Availability and resiliency

When you build an Elasticsearch deployment either on premises or in cloud environments, you need to think about how your implementation can survive failures. You also need to figure out how you can recover from failures when they occur. At AWS, we like to plan for the fact that things do break, such as hardware failures and disk failures, to name a few.

Unlike virtually every other technology infrastructure provider, each AWS Region has multiple Availability Zones. Each Availability Zone consists of one or more data centers, physically separated from one another, with redundant power and networking. For high availability and performance of your applications, you can deploy applications across multiple Availability Zones in the same Region for fault tolerance and low latency. Availability Zones interconnect with fast, private fiber-optic networking, which enables you to design applications that automatically fail over between Availability Zones without interruption. Availability Zones are more highly available, fault tolerant and scalable than traditional single or multiple data center infrastructures.

Amazon ES offers you the option to deploy your instances across one, two, or three AZs. If you’re running development or test workloads, pick the single-AZ option. Those running production-grade workloads should use two or three Availability Zones.

For more information, see Increase availability for Amazon Elasticsearch Service by Deploying in three Availability Zones. Additionally, deploying in multiple Availability Zones with dedicated master nodes means that you get the benefit of the Amazon ES SLA.

Operations

A 24/7/365 operational team with experience managing thousands of Elasticsearch clusters around the globe monitors Amazon ES. If you need support, you can get expert guidance and assistance across technologies from AWS Support to achieve your objectives faster at lower costs. I want to underline the importance of having a single source for support for your cloud infrastructure. Amazon ES doesn’t run in isolation, and having support for your entire cloud infrastructure from a single source greatly simplifies the support process. AWS also provides you with the option to use Enterprise level support plans, where you can have a dedicated technical account manager who essentially becomes a member of your team and is committed to your success with AWS.

Using tools on Amazon ES such as alerting, which provides you with a means to take action on events in your data, and index state management, which enables you to automate activities like rolling off logs, gives you additional operation features that you don’t need to build.

When it comes to monitoring your deployments, Amazon ES provides you with a wealth of Amazon CloudWatch metrics with which you can monitor all your Amazon ES deployments within a “single pane of glass.” For more information, see Monitoring Cluster Metrics with Amazon CloudWatch.
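
For example, the following boto3 sketch pulls the FreeStorageSpace metric for a domain over the last hour; the domain name and account ID are placeholders.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
# Placeholders: your domain name and the AWS account ID that owns the domain.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},
        {"Name": "ClientId", "Value": "123456789012"},
    ],
    StartTime=datetime.utcnow() - timedelta(hours=1),
    EndTime=datetime.utcnow(),
    Period=300,
    Statistics=["Minimum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])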

Staying current is another important topic. To enable access to newer versions of Elasticsearch and Kibana, Amazon ES offers in-place Elasticsearch upgrades for domains that run versions 5.1 and later. Amazon ES provides access to the most stable and current versions in the open-source community as long as the distribution passes our stringent security evaluations. Our service prides itself on the fact that we offer you a version that has passed our own internal AWS security reviews.

AWS integrations and other benefits

AWS has a broad range of services that seamlessly integrate with Amazon ES. Like many customers, you may want to monitor the health and performance of your native cloud services on AWS. Most AWS services log events into Amazon CloudWatch Logs. You can configure a log group to stream data it receives to your Amazon ES domain through a CloudWatch Logs subscription.

The volume of log data can be highly variable, and you should consider buffering layers when operating at a large scale. Buffering allows you to design stability into your processes. When designing for scale, this is one of easiest ways I know to avoid overwhelming your cluster with spikey ingestion events. Amazon Kinesis Data Firehose has a direct integration with Amazon ES and offers buffering and retries as part of the service. You configure Amazon ES as a destination through a few simple settings, and data can begin streaming to your Amazon ES domain.
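
The following boto3 sketch shows roughly what creating such a delivery stream looks like. Every ARN, bucket, and index name is a placeholder, and you should verify the configuration shapes against the current Firehose API before using it.

import boto3

firehose = boto3.client("firehose")
# All ARNs, names, and the backup bucket below are placeholders.
firehose.create_delivery_stream(
    DeliveryStreamName="logs-to-es",
    DeliveryStreamType="DirectPut",
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-es",
        "DomainARN": "arn:aws:es:us-east-1:123456789012:domain/my-domain",
        "IndexName": "app-logs",
        "IndexRotationPeriod": "OneDay",
        # Buffering smooths out spiky ingestion before it reaches the domain.
        "BufferingHints": {"IntervalInSeconds": 300, "SizeInMBs": 5},
        "RetryOptions": {"DurationInSeconds": 300},
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-es",
            "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",
        },
    },
)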

Increased speed and agility

When building new products and tuning existing solutions, you need to be able to experiment. As part of that experimentation, failing fast is an accepted process that gives your team the ability to try new approaches to speed up the pace of innovation. Part of that process involves using services that allow you to create environments quickly and, should the experiment fail, start over with a new approach or use different features that ultimately allow you to achieve the desired results.

With Amazon ES, you receive the benefit of being able to provision an entire Elasticsearch cluster, complete with Kibana, on the “order of minutes” in a secure, managed environment. If your testing doesn’t produce the desired results, you can change the dimensions of your cluster horizontally or vertically using different instance offerings within the service via a single API call or a few clicks on the console.

When it comes to deploying your environment, native tools such as Amazon CloudFormation provide you with deployment tooling that gives you the ability to create entire environments through configuration scripting via JSON or YAML. The AWS Command Line Interface (AWS CLI) provides command line tooling that also can spin up domains with a small set of commands. For those who want to ride the latest wave in scripting their environments, the AWS CDK has a module for Amazon ES.
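
As an illustration of how little is involved, the following Python SDK (boto3) call provisions a three-Availability Zone domain with dedicated master nodes. The domain name, Elasticsearch version, instance types, and sizes are illustrative assumptions rather than recommendations.

import boto3

es = boto3.client("es")
# Illustrative sizing only; choose instance types and counts for your workload.
es.create_elasticsearch_domain(
    DomainName="experiment-domain",
    ElasticsearchVersion="7.4",
    ElasticsearchClusterConfig={
        "InstanceType": "r5.large.elasticsearch",
        "InstanceCount": 3,
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "c5.large.elasticsearch",
        "DedicatedMasterCount": 3,
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 100},
)

Tearing the domain back down after a failed experiment is a similarly small call, which is what makes the fail-fast approach practical.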

Conclusion

Focusing your teams on doing important, innovative work creating products and services that differentiate your company is critical. Amazon ES is an essential tool to provide operational stability, security, and performance of your search and analytics infrastructure. When you consider the following benefits Amazon ES provides, the decision to migrate is simple:

  • Support for search, log analytics, SIEM, and other workloads
  • Innovative functionality using UltraWarm to help you manage your costs
  • Highly secure environments that address PCI and HIPAA workloads
  • Ability to offload operational processes to an experienced provider that knows how to operate Elasticsearch at scale
  • Plugins at no additional cost that provide fine-grained access, vector-based similarity algorithms, or alerting and monitoring with the ability to automate incident response.

You can get started using Amazon ES with the AWS Free Tier. This tier provides free usage of up to 750 hours per month of a t2.small.elasticsearch instance and 10 GB per month of optional EBS storage (Magnetic or General Purpose).

Over the course of the next few months, I’ll be co-hosting a series of posts that introduce migration patterns to help you move to Amazon ES. Additionally, AWS has a robust partner ecosystem and a professional services team that provide you with skilled and capable people to assist you with your migration.

About the Author

Kevin Fallis (@AWSCodeWarrior) is an AWS specialist search solutions architect. His passion at AWS is to help customers leverage the correct mix of AWS services to achieve success for their business goals. His after-work activities include family, DIY projects, carpentry, playing drums, and all things music.

Monitor and control the storage space of a schema with quotas with Amazon Redshift

Post Syndicated from BP Yau original https://aws.amazon.com/blogs/big-data/monitor-and-control-the-storage-space-of-a-schema-with-quotas-with-amazon-redshift/

Yelp connects people with great local businesses. Since its launch in 2004, Yelp has grown from offering services for just one city—its headquarters home of San Francisco—to a multinational presence spanning major metros across more than 30 countries. The company’s performance-based advertising and transactional business model led to revenues of more than $500 million during 2015, a 46% increase over the prior year. Yelp has evolved into a mobile-centric company, with more than 70% of searches, and more than 58% of content originating from mobile devices.

Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. Yelp immediately benefited from the new Amazon Redshift schema storage quota feature.

“Amazon Redshift is a managed data warehouse service that allows Yelp to focus on data analytics without spending time on database administration,” says Steven Moy, Lead Engineer for Yelp’s Metrics Platform. The Metrics Platform provides long-term persistent data storage and SQL-on-anything query capabilities for Yelp’s Engineering teams. “A key strategy for our data warehouse users to iterate quickly is to have a writable schema called ‘tmp’ for users to prototype various table schema. However, we occasionally faced challenges when there was not enough free space during a query execution, degrading the entire data warehouse query operation. With the new schema quota feature, we can provision a storage quota ceiling on the ‘tmp’ schema to safeguard runaway storage issues. We look forward to all the autonomous features coming from Amazon Redshift.”

Many organizations are moving toward self-service analytics, where different personas create their own insights from the evolving volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operations. This post shows you how to set up Amazon Redshift storage quotas for different personas.

Prerequisites

Before starting this walkthrough, you must have the following:

  • An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
  • A database user with superuser permission.

Setting up and testing the schema quota

To set up the environment and implement the use case, complete the following steps:

  1. Connect to your Amazon Redshift cluster using your preferred SQL client as a superuser or user with CREATE SCHEMA privileges.
  1. Create the user sales with the following code:
CREATE USER sales WITH password 'Abcd1234!';

The user in the preceding code is just for the purposes of this walkthrough. You should use identity federation in AWS Identity and Access Management (IAM) to use a corporate user rather than creating one manually. For more information, see Authentication.

  1. Set up a schema quota with the CREATE SCHEMA statement. See the following code:
CREATE SCHEMA [ IF NOT EXISTS ] schema_name [ AUTHORIZATION username ] 
           [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]
 
CREATE SCHEMA AUTHORIZATION username [ QUOTA {quota [MB | GB | TB] | UNLIMITED} ] [ schema_element [ ... ] ]

The QUOTA parameters define the maximum amount of disk space that the specified schema can use.

  1. Create the schema sales_schema with a quota of 2 GB for the user sales. See the following code:
CREATE SCHEMA sales_schema AUTHORIZATION sales QUOTA 2 GB; 
  1. Impersonate the user sales with the following code:
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;
  1. Create the tables region and lineitem with the following code:
CREATE TABLE sales_schema.region (
  r_regionkey int4,
  r_name char(25),
  r_comment varchar(152)                            
) DISTSTYLE EVEN;
 
CREATE TABLE sales_schema.lineitem (
  l_orderkey int8,
  l_partkey int8,
  l_suppkey int4,
  l_linenumber int4,
  l_quantity numeric(12,2),
  l_extendedprice numeric(12,2),
  l_discount numeric(12,2),
  l_tax numeric(12,2),
  l_returnflag char(1),
  l_linestatus char(1),
  l_shipdate date,
  l_commitdate date,
  l_receiptdate date,
  l_shipinstruct char(25),
  l_shipmode char(10),
  l_comment varchar(44)
) DISTSTYLE EVEN;

Single statement transaction

To test the single statement transaction use case, complete the following steps:

  1. Load data into the table region with the following code:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

 <Your-IAM-Role> is the ARN of your IAM role with the necessary permissions associated with your Amazon Redshift cluster.

 The COPY command should complete with five rows loaded. See the following code:

dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
  1. Check the quota and disk usage for the schema from system view svv_schema_quota_state. See the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should show a 2,048 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  2048 |         30 |           1.46
(1 row)
  1. Load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

You should get an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 40895 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 2798).
Free up disk space or request increased quota for the schema(s).

Amazon Redshift checks each transaction for quota violations before committing the transaction. The size limit can exceed the quota temporarily within a transaction before it’s committed because the quota violation check occurs at the end of a transaction. That’s why you initially see the message loading to table lineitem completed successfully. When a transaction exceeds the quota, Amazon Redshift aborts the transaction, prohibits subsequent ingestions, and reverts all changes until you free up disk space.

The quota violation occurrence is stored in the system table stl_schema_quota_violations.

  1. Run the following code to see the violation:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should show the quota, disk usage, disk usage percentage out of the configured quota, and the timestamp of when the violation happened. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(1 row)
  1. Change the schema quota to 4 GB using the ALTER SCHEMA statement. See the following code:
RESET SESSION AUTHORIZATION; 
ALTER SCHEMA sales_schema QUOTA 4 GB;
SET SESSION AUTHORIZATION 'sales';
SELECT CURRENT_USER;

You have to be a database superuser or user with CREATE SCHEMA permissions to set a schema quota during schema creation. However, only a superuser can change a schema quota. You use RESET SESSION AUTHORIZATION to reset the user impersonation back to the original superuser.

  1. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Because lineitem was a pristine (previously empty) table, the earlier failed COPY automatically reclaimed its disk space when its transaction was aborted. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |         30 |           0.73
(1 row)
  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully
  1. Check the quota and disk usage for the schema with the following code:
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

The statement should return a 4,096 MB quota and 2798 MB disk usage on a four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:

dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |       2798 |          68.31
(1 row)
  1. Rerun the COPY command to load a large amount of data into the table lineitem to exceed the quota. See the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:

dev=# COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 86438 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
  1. Execute the following code to perform subsequent small COPY/INSERT/UPDATE after the quota violation:
COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;

All statements should return that the error transaction is aborted due to exceeding disk space quota. See the following code:

dev=# COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
ERROR:  Transaction 86478 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
ERROR:  Transaction 86479 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).
dev=# UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
ERROR:  Transaction 86483 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5486).
Free up disk space or request increased quota for the schema(s).

For non-pristine tables, although the offending transaction that exceeded the quota was aborted (for example, COPY to lineitem), the transaction still consumes space and the schema is blocked until disk space is reclaimed. That is why the subsequent COPY/INSERT/UPDATE to the region table is aborted as well.

  1. Double-check the quota violation from system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

It should return disk usage of 5,486 MB with a four-node DC2.large cluster. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.653489
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.469287
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.197434
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 20:28:01.344333
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(5 rows)

Single statement shrink-only transaction

A shrink-only transaction is a transaction composed solely of statements that have the potential to directly or indirectly reduce the amount of disk space used, such as DELETE, TRUNCATE, DROP TABLE, and VACUUM.

To test the single statement shrink-only transaction use case, complete the following steps:

  1. Delete all data from the lineitem table with the following code:
DELETE sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;

It should return a record count of zero:

dev=> DELETE sales_schema.lineitem;
DELETE 59986052
dev=> SELECT COUNT(*) FROM sales_schema.lineitem;
 count 
-------
     0
(1 row)

Although the schema quota was exceeded, the DELETE transaction is committed because the shrink-only transaction is an exception to the quota violation check.

The TRUNCATE command is much more efficient than DELETE for removing all rows from a table and doesn’t require a subsequent VACUUM and ANALYZE. The preceding code is for illustration purposes only.

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following error message:

Transaction is aborted due to exceeding the disk space quota

See the following code:

dev=> COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'lineitem' completed, 59986052 record(s) loaded successfully.
ERROR:  Transaction 87024 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 8150).
Free up disk space or request increased quota for the schema(s).

Because DELETE doesn’t automatically free up disk space, the preceding COPY command exceeds the quota and the transaction is aborted.

  1. Double-check the quota violation from the system table with the following code:
SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

The latest row should return a disk usage of 8,150 MB with a four-node DC2.large cluster. This shows that disk usage increases with the aborted transaction. See the following code:

dev=> SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;
 userid | schema_name  | quota | disk_usage | disk_usage_pct |         timestamp          
--------+--------------+-------+------------+----------------+----------------------------
    104 | sales_schema |  4096 |       8150 |         198.97 | 2020-04-20 21:30:54.354669
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.653489
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.469287
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 21:27:29.197434
    104 | sales_schema |  4096 |       5486 |         133.94 | 2020-04-20 20:28:01.344333
    104 | sales_schema |  2048 |       2798 |         136.62 | 2020-04-20 20:09:25.494723
(6 rows)
  1. Run VACUUM to free up disk space:
VACUUM sales_schema.lineitem;

The schema quota is exceeded, but because Amazon Redshift disregards the quota violation and commits transactions in certain cases like VACUUM, this completes without error.

Amazon Redshift now automatically runs the VACUUM DELETE operation in the background to reclaim disk space.

  1. Rerun the COPY command to load data into the table lineitem with the following code:
COPY sales_schema.lineitem FROM 's3://redshift-downloads/TPC-H/10GB/lineitem/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';

It should return the following message:

Load into table 'lineitem' completed, 59986052 record(s) loaded successfully

Because VACUUM reclaimed the disk space, the COPY command ran successfully.

CREATE TABLE AS (CTAS), ALTER TABLE APPEND statements, and transactions on a temporary schema

To test the CREATE TABLE AS (CTAS) and ALTER TABLE APPEND statements and transactions on a temporary schema, complete the following steps:

  1. Execute the CTAS statement to create another table using a query. See the following code:
CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;

It should return that the transaction is aborted due to exceeding the disk space quota. CTAS is subject to similar quota checks in the same schema. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
ERROR:  Transaction 112229 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 4096, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).
  1. Create a temp table and populate data with the following code:
CREATE TEMP TABLE lineitem_temp AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) from lineitem_temp; 

It should return 59,986,052 rows. The schema quota doesn’t consider temporary tables created as part of a temporary namespace or schema.

  1. Create the same table lineitem in the public schema and load data into it using the following code:
CREATE TABLE public.lineitem_ctas AS SELECT * FROM sales_schema.lineitem;
SELECT COUNT(*) FROM lineitem_ctas;

It should return 59,986,052 rows.

Ingestion into a different schema has no effect on the sales_schema quota, so the transaction is committed.

  1. Append data into the table lineitem in sales_schema from another table using the ALTER TABLE APPEND statement. See the following code:
ALTER TABLE sales_schema.lineitem APPEND FROM lineitem_ctas;

It should return an error that the transaction is aborted due to exceeding the disk space quota. The quota violation check applies to the whole schema.

  1. Append data from the sales_schema lineitem table to another table in the public schema with the following code:
ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;
SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';

It should return a record count of zero from the SELECT COUNT statement and show that the disk usage dropped. The ALTER TABLE APPEND statement empties the source table immediately and reclaims its disk space when the transaction is complete. See the following code:

dev=> ALTER TABLE lineitem_ctas APPEND FROM sales_schema.lineitem;
INFO:  ALTER TABLE APPEND "lineitem_ctas" from "lineitem" is complete.
ALTER TABLE APPEND and COMMIT TRANSACTION
dev=> SELECT COUNT(*) FROM sales_schema.lineitem;
 count 
-------
     0
(1 row)
 
dev=> SELECT TRIM(SCHEMA_NAME) "schema_name", QUOTA, disk_usage, disk_usage_pct FROM svv_schema_quota_state WHERE SCHEMA_NAME = 'sales_schema';
 schema_name  | quota | disk_usage | disk_usage_pct 
--------------+-------+------------+----------------
 sales_schema |  4096 |         30 |           0.73
(1 row)

Concurrent transactions

To test the concurrent transactions use case, complete the following steps:

  1. Increase the quota to 6 GB using the following code:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 6 GB;
SET SESSION AUTHORIZATION 'sales';

You need two separate SQL client sessions connected to the Amazon Redshift cluster and run the code concurrently for the following test.

  1. On the first session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
  1. While the first session is still running, on the second session, enter the following code:
CREATE TABLE sales_schema.lineitem_txn2 AS SELECT * FROM lineitem_ctas;

The statement from the first session should return a warning message:

Schema(s) close to reaching their allocated disk space quotas

This warning indicates that the schema is close to its quota, but the statement still completes. See the following code:

dev=> CREATE TABLE sales_schema.lineitem_txn1 AS SELECT * FROM lineitem_ctas;
WARNING:  Schema(s) close to reaching their allocated disk space quotas: (Schema: sales_schema, Quota: 6144, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
SELECT

The statement from the second session should return an error that the transaction is aborted due to exceeding the disk space quota. Because the first transaction from the first session was committed first, the disk space usage increased to close to the quota. The second transaction checks the quota violation before committing the transaction and thus aborts the transaction.

Reset quota below existing disk space usage

To test resetting the quota below the existing disk space usage, complete the following steps:

  1. Run the following code to reset the quota to 2 GB:
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 2 GB;
SET SESSION AUTHORIZATION 'sales';
CREATE TABLE sales_schema.test AS SELECT 1;

You can set the quota below the existing disk space usage; however, all future ingestion transactions are aborted until you free disk space.  See the following code:

dev=> RESET SESSION AUTHORIZATION;
RESET
dev=# ALTER SCHEMA sales_schema QUOTA 2 GB;
ALTER SCHEMA
dev=# SET SESSION AUTHORIZATION 'sales';
SET
dev=> CREATE TABLE sales_schema.test AS SELECT 1;
ERROR:  Transaction 112557 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5610).
Free up disk space or request increased quota for the schema(s).

Multi-statement transactions

To test the multi-statement transaction use case, complete the following steps:

  1. Enter the following code to test a multi-statement transaction:
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
COMMIT;
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
COMMIT;
BEGIN;
	DELETE FROM sales_schema.region WHERE r_regionkey = 0;
	COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
COMMIT;

All INSERT/UPDATE/COPY transactions should be aborted because the quota is exceeded. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> COMMIT;
ERROR:  Transaction 114291 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 1;
UPDATE 1
dev=> COMMIT;
ERROR:  Transaction 114295 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5606).
Free up disk space or request increased quota for the schema(s).
dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.region WHERE r_regionkey = 0;
DELETE 0
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> COMMIT;
ERROR:  Transaction 114303 is aborted due to exceeding the disk space quota in schema(s): (Schema: sales_schema, Quota: 2048, Current Disk Usage: 5624).
Free up disk space or request increased quota for the schema(s).
  1. Enter the following code to test a multi-statement transaction with DROP TABLE at the same transaction:
BEGIN;
	INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
	UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
	COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
	DROP TABLE sales_schema.region;
COMMIT;

The transaction should be committed successfully because the COPY/INSERT/UPDATE statements ingest data into a table that is dropped within the same transaction before it’s committed. See the following code:

dev=> BEGIN;
BEGIN
dev=> INSERT INTO sales_schema.region VALUES(100, 'Test','This is a test');
INSERT 0 1
dev=> UPDATE sales_schema.region SET r_name = 'Fail' WHERE r_regionkey = 100;
UPDATE 3
dev=> COPY sales_schema.region FROM 's3://redshift-downloads/TPC-H/10GB/region/' iam_role '<Your-IAM-Role>' gzip delimiter '|' region 'us-east-1';
INFO:  Load into table 'region' completed, 5 record(s) loaded successfully.
COPY
dev=> DROP TABLE sales_schema.region;
DROP TABLE
dev=> COMMIT;
COMMIT
  1. Run shrink-only statements in a multi-statement transaction with the following code:
BEGIN;
	DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
	DROP TABLE sales_schema.lineitem;
COMMIT;

The transaction should be committed successfully because it contains only shrink-only statements. See the following code:

dev=> BEGIN;
BEGIN
dev=> DELETE FROM sales_schema.lineitem WHERE l_orderkey = 53924162;
DELETE 0
dev=> DROP TABLE sales_schema.lineitem;
DROP TABLE
dev=> COMMIT;
COMMIT

Best practices

These following recommended practices can help you optimize your workload using storage quotas:

  • The schema feature works best when users are aligned with schemas – Use schemas to logically segregate users. For example, data engineers own the MAIN schema with the base and aggregated tables, and data analysts own the ANALYST schema, which is controlled with a storage quota. The analysts perform their own aggregations by querying the base data from the MAIN schema into the ANALYST schema; they only have read permissions on the MAIN schema but read and write permissions on the ANALYST schema.
  • Revoke usage on public schema – All users have CREATE and USAGE privileges on the PUBLIC schema of a database by default. Because storage quota isn’t applicable to the PUBLIC schema by default, all users should be disallowed from creating objects in the PUBLIC schema of a database. They should instead use their own aligned schemas as per the previous best practice. Use the REVOKE command to remove that privilege.
  • Use physical tables instead of temporary tables – Because temporary tables aren’t subject to storage quotas, use physical tables instead, but remember to drop them at the end of the transaction. An alternative is to create a separate schema with a storage quota to cater to temporary physical tables, and create physical tables there instead of using temporary tables. This eases administration, and you can automate cleaning up tables in the dedicated schema on a regular basis.
  • Use data shrinking operations – Use data shrinking operations like DELETE, VACUUM DELETE, DROP, and TRUNCATE to reclaim disk space, especially if it’s nearing the quota.
  • Use workload management (WLM) with query priority to control resources for different users – Data ingestion and aggregation performed by different users might be resource-intensive and impact the throughput of other mission-critical loads or queries running on the cluster. Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.

Monitoring

You should periodically monitor for any storage quota violations so that you can take corrective action and provide business continuity. The system table stl_schema_quota_violations is a good resource to get quota violation events. To query the violation information, enter the following code:

SELECT userid, TRIM(SCHEMA_NAME) "schema_name", quota, disk_usage, disk_usage_pct, timestamp FROM stl_schema_quota_violations WHERE SCHEMA_NAME = 'sales_schema' ORDER BY timestamp DESC;

You can automate the execution of the preceding query and send notifications to users if any violations occur.
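
For example, here is a minimal Python sketch of such automation using the Amazon Redshift Data API and Amazon SNS, suitable for running on a schedule (such as from AWS Lambda). The cluster identifier, database, database user, and SNS topic ARN are placeholders, and the one-hour lookback is an arbitrary choice.

import time
import boto3

VIOLATION_SQL = """
SELECT TRIM(schema_name), quota, disk_usage, disk_usage_pct, timestamp
FROM stl_schema_quota_violations
WHERE timestamp > DATEADD(hour, -1, GETDATE())
ORDER BY timestamp DESC;
"""

redshift_data = boto3.client("redshift-data")
sns = boto3.client("sns")

def check_quota_violations():
    # Placeholders: your cluster, database, database user, and SNS topic.
    run = redshift_data.execute_statement(
        ClusterIdentifier="my-cluster",
        Database="dev",
        DbUser="awsuser",
        Sql=VIOLATION_SQL,
    )
    # The Data API is asynchronous, so poll until the statement finishes.
    status = "SUBMITTED"
    while status not in ("FINISHED", "FAILED", "ABORTED"):
        time.sleep(2)
        status = redshift_data.describe_statement(Id=run["Id"])["Status"]
    if status != "FINISHED":
        raise RuntimeError("Quota check query ended with status " + status)
    rows = redshift_data.get_statement_result(Id=run["Id"])["Records"]
    if rows:
        sns.publish(
            TopicArn="arn:aws:sns:us-east-1:123456789012:schema-quota-alerts",
            Subject="Amazon Redshift schema quota violations detected",
            Message=str(len(rows)) + " quota violation(s) in the last hour.",
        )

check_quota_violations()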

Cleaning up

When you’re done with testing, you can remove the objects and users with the following code:

RESET SESSION AUTHORIZATION;
DROP TABLE IF EXISTS sales_schema.lineitem_txn1; 
DROP TABLE IF EXISTS lineitem_temp;
DROP TABLE IF EXISTS lineitem_ctas;
DROP SCHEMA sales_schema;
DROP USER IF EXISTS sales;

Summary

Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. Managing data storage with quotas is part of an important data governance framework that provides compliance and operational efficiency. A simple QUOTA parameter in a CREATE/ALTER SCHEMA statement enables disk usage quotas on an Amazon Redshift schema. You can assign Amazon Redshift database users who represent personas in the organization to different schemas with quotas to control the amount of disk space they can consume. This is crucial for data governance and cost control, and it helps you avoid operational issues like disk-full errors that can occur when users keep creating objects and ingesting data, which might otherwise halt other mission-critical analytic operations.

This post showed you how easy it is to control the storage quota for Amazon Redshift using schemas. You went through the steps to create and alter the quota on a schema and authorize specific users to the schema. You also explored different use cases of ingesting data into the schema and monitoring the different outcomes and disk usage using the system view. You can adopt this feature to support your business needs.

About the Authors

BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next-generation big data analytics platform using AWS technologies.

Ahmad Barghout is a Software Development Engineer on the Amazon Redshift storage team. He has worked on cluster resize optimizations and currently works on ease-of-use features such as schema quota. Before that, he was a full stack developer at Cisco. He holds a master’s degree from The University of Ottawa.

Himanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.

How Goldman Sachs builds cross-account connectivity to their Amazon MSK clusters with AWS PrivateLink

Post Syndicated from Robert L. Cossin original https://aws.amazon.com/blogs/big-data/how-goldman-sachs-builds-cross-account-connectivity-to-their-amazon-msk-clusters-with-aws-privatelink/

This guest post presents patterns for accessing an Amazon Managed Streaming for Apache Kafka (Amazon MSK) cluster across your AWS account or Amazon Virtual Private Cloud (Amazon VPC) boundaries using AWS PrivateLink. In addition, the post discusses the pattern that the Transaction Banking team at Goldman Sachs (TxB) chose for their cross-account access, the reasons behind their decision, and how TxB satisfies its security requirements with Amazon MSK. Using Goldman Sachs’s implementation as a use case, this post aims to provide you with general guidance that you can use when implementing an Amazon MSK environment.

Overview

Amazon MSK is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming data. When you create an MSK cluster, the cluster resources are available to participants within the same Amazon VPC. This allows you to launch the cluster within specific subnets of the VPC, associate it with security groups, and attach IP addresses from your VPC’s address space through elastic network interfaces (ENIs). Network traffic between clients and the cluster stays within the AWS network, with internet access to the cluster not possible by default.

You may need to allow clients access to an MSK cluster in a different VPC within the same or a different AWS account. You have options such as VPC peering or a transit gateway that allow for resources in either VPC to communicate with each other as if they’re within the same network. For more information about access options, see Accessing an Amazon MSK Cluster.

Although these options are valid, this post focuses on a different approach, which uses AWS PrivateLink. Therefore, before we dive deep into the actual patterns, let’s briefly discuss when AWS PrivateLink is a more appropriate strategy for cross-account and cross-VPC access.

VPC peering, illustrated below, is a bidirectional networking connection between two VPCs that enables you to route traffic between them using private IPv4 addresses or IPv6 addresses.

VPC peering is more suited for environments that have a high degree of trust between the parties that are peering their VPCs. This is because, after a VPC peering connection is established, the two VPCs can have broad access to each other, with resources in either VPC capable of initiating a connection. You’re responsible for implementing fine-grained network access controls with security groups to make sure that only specific resources intended to be reachable are accessible between the peered VPCs.

You can only establish VPC peering connections across VPCs that have non-overlapping CIDRs. This can pose a challenge when you need to peer VPCs with overlapping CIDRs, such as when peering across accounts from different organizations.

Additionally, if you’re running at scale, you can have hundreds of Amazon VPCs, and VPC peering has a limit of 125 peering connections to a single Amazon VPC. You can use a network hub like transit gateway, which, although highly scalable in enabling you to connect thousands of Amazon VPCs, requires similar bidirectional trust and non-overlapping CIDRs as VPC peering.

In contrast, AWS PrivateLink provides fine-grained network access control to specific resources in a VPC instead of all resources by default, and is therefore more suited for environments that want to follow a lower trust model approach, thus reducing their risk surface. The following diagram shows a service provider VPC that has a service running on Amazon Elastic Compute Cloud (Amazon EC2) instances, fronted by a Network Load Balancer (NLB). The service provider creates a configuration called a VPC endpoint service in the service provider VPC, pointing to the NLB. You can share this endpoint service with another Amazon VPC (service consumer VPC), which can use an interface VPC endpoint powered by AWS PrivateLink to connect to the service. The service consumers use this interface endpoint to reach the end application or service directly.

AWS PrivateLink makes sure that connections initiated to a specific set of network resources are unidirectional: the connection can only originate from the service consumer VPC and flow into the service provider VPC, not the other way around. Outside of the network resources backed by the interface endpoint, no other resources in the service provider VPC are exposed. AWS PrivateLink allows VPC CIDR ranges to overlap, and it scales better because thousands of Amazon VPCs can consume each service.

VPC peering and AWS PrivateLink are therefore two connectivity options suited for different trust models and use cases.

Transaction Banking’s micro-account strategy

An AWS account is a strong isolation boundary that provides both access control and reduced blast radius for issues that may occur due to deployment and configuration errors. This strong isolation is possible because you need to deliberately and proactively configure flows that cross an account boundary. TxB designed a strategy that moves each of their systems into its own AWS account, each of which is called a TxB micro-account. This strategy allows TxB to minimize the chances of a misconfiguration exposing multiple systems. For more information about TxB micro-accounts, see the video AWS re:Invent 2018: Policy Verification and Enforcement at Scale with AWS on YouTube.

To further complement the strong gains realized due to a TxB micro-account segmentation, TxB chose AWS PrivateLink for cross-account and cross-VPC access of their systems. AWS PrivateLink allows TxB service providers to expose their services as an endpoint service and use whitelisting to explicitly configure which other AWS accounts can create interface endpoints to these services. This also allows for fine-grained control of the access patterns for each service. The endpoint service definition only allows access to resources attached to the NLBs and thereby makes it easy to understand the scope of access overall. The one-way initiation of connection from a service consumer to a service provider makes sure that all connectivity is controlled on a point-to-point basis.  Furthermore, AWS PrivateLink allows the CIDR blocks of VPCs to overlap between the TxB micro-accounts. Thus the use of AWS PrivateLink sets TxB up for future growth as a part of their default setup, because thousands of TxB micro-account VPCs can consume each service if needed.

MSK broker access patterns using AWS PrivateLink

As a part of their micro-account strategy, TxB runs an MSK cluster in its own dedicated AWS account, and clients that interact with this cluster are in their respective micro-accounts. Considering this setup and the preference to use AWS PrivateLink for cross-account connectivity, TxB evaluated the following two patterns for broker access across accounts.

Pattern 1: Front each MSK broker with a unique dedicated interface endpoint

In this pattern, each MSK broker is fronted with a unique dedicated NLB in the TxB MSK account hosting the MSK cluster. The TxB MSK account contains an endpoint service for every NLB and is shared with the client account. The client account contains interface endpoints corresponding to the endpoint services. Finally, DNS entries identical to the broker DNS names point to the respective interface endpoint. The following diagram illustrates this pattern in the US East (Ohio) Region.

High-level flow

After setup, clients from their own accounts talk to the brokers using their provisioned default DNS names as follows:

  1. The client resolves the broker DNS name to the interface endpoint IP address inside the client VPC.
  2. The client initiates a TCP connection to the interface endpoint IP over port 9094.
  3. With AWS PrivateLink technology, this TCP connection is routed to the dedicated NLB setup for the respective broker listening on the same port within the TxB MSK account.
  4. The NLB routes the connection to the single broker IP registered behind it on TCP port 9094.

High-level setup

The setup steps in this section are shown for the US East (Ohio) Region; modify them if you use a different Region. In the TxB MSK account, complete the following steps (an example AWS CLI sketch follows this list):

  1. Create a target group with target type as IP, protocol TCP, port 9094, and in the same VPC as the MSK cluster.
    • Register the MSK broker as a target by its IP address.
  2. Create an NLB with a listener of TCP port 9094 and forwarding to the target group created in the previous step.
    • Enable the NLB for the same AZ and subnet as the MSK broker it fronts.
  3. Create an endpoint service configuration for each NLB that requires acceptance and grant permissions to the client account so it can create a connection to this endpoint service.
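
The following is a rough AWS CLI sketch of the preceding steps for one broker; the VPC ID, subnet ID, broker IP address, and client account ID are placeholders, and you repeat the sequence for every broker in the cluster:

# 1. Target group (target type IP, protocol TCP, port 9094) in the MSK cluster's VPC
TG_ARN=$(aws elbv2 create-target-group \
  --name broker-1-tg --protocol TCP --port 9094 \
  --vpc-id vpc-0123456789abcdef0 --target-type ip \
  --query 'TargetGroups[0].TargetGroupArn' --output text)

# Register the broker's IP address as the only target
aws elbv2 register-targets --target-group-arn "$TG_ARN" --targets Id=10.0.1.25

# 2. Internal NLB in the same subnet as the broker, listening on TCP 9094
NLB_ARN=$(aws elbv2 create-load-balancer \
  --name broker-1-nlb --type network --scheme internal \
  --subnets subnet-0123456789abcdef0 \
  --query 'LoadBalancers[0].LoadBalancerArn' --output text)

aws elbv2 create-listener --load-balancer-arn "$NLB_ARN" \
  --protocol TCP --port 9094 \
  --default-actions Type=forward,TargetGroupArn="$TG_ARN"

# 3. Endpoint service that requires acceptance, shared with the client account
SVC_ID=$(aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns "$NLB_ARN" --acceptance-required \
  --query 'ServiceConfiguration.ServiceId' --output text)

aws ec2 modify-vpc-endpoint-service-permissions --service-id "$SVC_ID" \
  --add-allowed-principals arn:aws:iam::111122223333:root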

In the client account, complete the following steps (an example AWS CLI sketch follows this list):

  1. Create an interface endpoint in the same VPC the client is in (this connection request needs to be accepted within the TxB MSK account).
  2. Create a Route 53 private hosted zone, with the domain name kafka.us-east-2.amazonaws.com, and associate it with the same VPC as the clients are in.
  3. Create A-Alias records identical to the broker DNS names to avoid any TLS handshake failures and point them to the interface endpoints of the respective brokers.
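
The following is a rough AWS CLI sketch of the client-account steps; the VPC, subnet, security group, endpoint service name, hosted zone ID, and the interface endpoint's DNS name and hosted zone ID (both returned by aws ec2 describe-vpc-endpoints under DnsEntries) are placeholders:

# 1. Interface endpoint to the shared endpoint service (accepted in the TxB MSK account)
aws ec2 create-vpc-endpoint \
  --vpc-endpoint-type Interface \
  --vpc-id vpc-0abcdef01234567890 \
  --service-name com.amazonaws.vpce.us-east-2.vpce-svc-0123456789abcdef0 \
  --subnet-ids subnet-0abcdef01234567890 \
  --security-group-ids sg-0abcdef01234567890

# 2. Private hosted zone that shadows the MSK broker domain inside the client VPC
aws route53 create-hosted-zone \
  --name kafka.us-east-2.amazonaws.com \
  --caller-reference "txb-msk-$(date +%s)" \
  --vpc VPCRegion=us-east-2,VPCId=vpc-0abcdef01234567890

# 3. Alias record matching the broker DNS name, pointing at the interface endpoint
cat > change-batch.json <<'EOF'
{
  "Changes": [{
    "Action": "CREATE",
    "ResourceRecordSet": {
      "Name": "b-1.exampleClusterName.abcde.c2.kafka.us-east-2.amazonaws.com",
      "Type": "A",
      "AliasTarget": {
        "HostedZoneId": "ZVPCEEXAMPLE11111",
        "DNSName": "vpce-0123456789abcdef0-abcdefgh.vpce-svc-0123456789abcdef0.us-east-2.vpce.amazonaws.com",
        "EvaluateTargetHealth": false
      }
    }
  }]
}
EOF
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE --change-batch file://change-batch.json

In Pattern 1, you repeat step 3 for each broker, pointing each record at that broker's own interface endpoint.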

Pattern 2: Front all MSK brokers with a single shared interface endpoint

In this second pattern, all brokers in the cluster are fronted with a single NLB that has cross-zone load balancing enabled. You make this possible by modifying each MSK broker’s advertised.listeners config to advertise a unique port. You create a unique NLB listener-target group pair for each broker and a single shared listener-target group pair for all brokers. You create an endpoint service configuration for this single NLB and share it with the client account. In the client account, you create an interface endpoint corresponding to the endpoint service. Finally, you create DNS entries identical to the broker DNS names that point to the single interface endpoint. The following diagram illustrates this pattern in the US East (Ohio) Region.

High-level flow

After setup, clients from their own accounts talk to the brokers using their provisioned default DNS names as follows:

  1. The client resolves the broker DNS name to the interface endpoint IP address inside the client VPC.
  2. The client initiates a TCP connection to the interface endpoint over port 9094.
  3. The NLB listener within the TxB MSK account on port 9094 receives the connection.
  4. The NLB listener’s corresponding target group load balances the request to one of the brokers registered to it (Broker 1). In response, Broker 1 sends back the advertised DNS name and port (9001) to the client.
  5. The client resolves the broker endpoint address again to the interface endpoint IP and initiates a connection to the same interface endpoint over TCP port 9001.
  6. This connection is routed to the NLB listener for TCP port 9001.
  7. This NLB listener’s corresponding target group is configured to receive the traffic on TCP port 9094, and forwards the request on the same port to the only registered target, Broker 1.

High-level setup

The setup steps in this section are shown for the US East (Ohio) Region; modify them if you use a different Region. In the TxB MSK account, complete the following steps (an example AWS CLI sketch of the listener setup follows this list):

  1. Modify the port that the MSK broker is advertising by running the following command against each running broker. The following example command changes the advertised port on a specific broker b-1 to 9001. For each broker you run the command against, change the values of bootstrap-server, entity-name, CLIENT_SECURE, REPLICATION, and REPLICATION_SECURE. Note that for the REPLICATION and REPLICATION_SECURE values, -internal must be appended to the broker name, and the ports 9093 and 9095 shown in the command must not be changed.
    ./kafka-configs.sh \
    --bootstrap-server b-1.exampleClusterName.abcde.c2.kafka.us-east-2.amazonaws.com:9094 \
    --entity-type brokers \
    --entity-name 1 \
    --alter \
    --command-config kafka_2.12-2.2.1/bin/client.properties \
    --add-config advertised.listeners=[\
    CLIENT_SECURE://b-1.exampleClusterName.abcde.c2.kafka.us-east-2.amazonaws.com:9001,\
    REPLICATION://b-1-internal.exampleClusterName.abcde.c2.kafka.us-east-2.amazonaws.com:9093,\
    REPLICATION_SECURE://b-1-internal.exampleClusterName.abcde.c2.kafka.us-east-2.amazonaws.com:9095]

  2. Create a target group with target type as IP, protocol TCP, port 9094, and in the same VPC as the MSK cluster. The preceding diagram represents this as B-ALL.
    • Register all MSK brokers to B-ALL as targets by their IP addresses.
  3. Create target groups dedicated for each broker (B1, B2) with the same properties as B-ALL.
    • Register the respective MSK broker to each target group by its IP address.
  4. Perform the same steps for additional brokers if needed and create a unique listener-target group pair corresponding to the advertised port of each broker.
  5. Create an NLB that is enabled for the same subnets that the MSK brokers are in and with cross-zone load balancing enabled.
    • Create a TCP listener for every broker’s advertised port (9001, 9002) that forwards to the corresponding target group you created (B1, B2).
    • Create a special TCP listener 9094 that forwards to the B-ALL target group.
  6. Create an endpoint service configuration for the NLB that requires acceptance and grant permissions to the client account to create a connection to this endpoint service.
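
The listener fan-out is the main difference from Pattern 1. The following is a rough AWS CLI sketch of the listener and endpoint service setup (steps 5 and 6), assuming the NLB and the target groups B-ALL, B1, and B2 from the earlier steps already exist and their ARNs are stored in the shell variables shown (all placeholders):

# Dedicated listeners map each advertised port to that broker's target group
aws elbv2 create-listener --load-balancer-arn "$NLB_ARN" \
  --protocol TCP --port 9001 \
  --default-actions Type=forward,TargetGroupArn="$TG_B1_ARN"

aws elbv2 create-listener --load-balancer-arn "$NLB_ARN" \
  --protocol TCP --port 9002 \
  --default-actions Type=forward,TargetGroupArn="$TG_B2_ARN"

# The special 9094 listener forwards bootstrap connections to the shared B-ALL group
aws elbv2 create-listener --load-balancer-arn "$NLB_ARN" \
  --protocol TCP --port 9094 \
  --default-actions Type=forward,TargetGroupArn="$TG_BALL_ARN"

# Cross-zone load balancing is an NLB attribute rather than a creation flag
aws elbv2 modify-load-balancer-attributes --load-balancer-arn "$NLB_ARN" \
  --attributes Key=load_balancing.cross_zone.enabled,Value=true

# Endpoint service for the single NLB, shared with the client account
aws ec2 create-vpc-endpoint-service-configuration \
  --network-load-balancer-arns "$NLB_ARN" --acceptance-required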

In the client account, complete the following:

  1. Create an interface endpoint in the same VPC the client is in (this connection request needs to be accepted within the TxB MSK account).
  2. Create a Route 53 private hosted zone, with the domain name kafka.us-east-2.amazonaws.com and associate it with the same VPC as the client is in.
  3. Under this hosted zone, create A-Alias records identical to the broker DNS names to avoid any TLS handshake failures and point it to the interface endpoint.

Both of these patterns use TLS on TCP port 9094 to talk to the MSK brokers. If your security posture allows plaintext communication between the clients and brokers, these patterns apply in that scenario as well, using TCP port 9092.

With both of these patterns, if Amazon MSK detects a broker failure, it mitigates the failure by replacing the unhealthy broker with a new one. In addition, the new MSK broker retains the same IP address and has the same Kafka properties, such as any modified advertised.listeners configuration.

Amazon MSK allows clients to communicate with the service on TCP ports 9092, 9094, and 2181. As a byproduct of modifying advertised.listeners in Pattern 2, clients are automatically directed to speak with the brokers on the advertised port. If clients in the same account as Amazon MSK need to access the brokers, create a new Route 53 hosted zone in the Amazon MSK account with identical broker DNS names pointing to the NLB DNS name. The Route 53 record sets override the MSK broker DNS and allow all traffic to the brokers to go through the NLB.

Transaction Banking’s MSK broker access pattern

For broker access across TxB micro-accounts, TxB chose Pattern 1, where one interface endpoint per broker is exposed to the client account. TxB streamlined this overall process by automating the creation of the endpoint service within the TxB MSK account and the interface endpoints within the client accounts without any manual intervention.

At the time of cluster creation, the bootstrap broker configuration is retrieved by calling the Amazon MSK APIs and stored in AWS Systems Manager Parameter Store in the client account so that it can be retrieved on application startup. This keeps clients agnostic of the Kafka brokers’ DNS names, even though the cluster is launched in a completely different account.
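
A rough AWS CLI sketch of that retrieval and storage step follows; the cluster ARN and parameter name are placeholders, and the exact bootstrap-broker field returned depends on the cluster’s authentication settings:

# Retrieve the TLS bootstrap broker string from the Amazon MSK API
BOOTSTRAP=$(aws kafka get-bootstrap-brokers \
  --cluster-arn arn:aws:kafka:us-east-2:111122223333:cluster/exampleClusterName/11111111-2222-3333-4444-555555555555-1 \
  --query 'BootstrapBrokerStringTls' --output text)

# Store it in Parameter Store in the client account so applications can read it
# at startup instead of hard-coding broker DNS names
aws ssm put-parameter \
  --name /txb/msk/bootstrap-brokers-tls \
  --type String \
  --value "$BOOTSTRAP" \
  --overwrite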

A key driver for TxB choosing Pattern 1 is that it avoids having to modify a broker property like the advertised port. Pattern 2 creates the need for TxB to track which broker is advertising which port and make sure new brokers aren’t reusing the same port. This adds the overhead of having to modify and track the advertised port of new brokers being launched live and having to create a corresponding listener-target group pair for these brokers. TxB avoided this additional overhead by choosing Pattern 1.

On the other hand, Pattern 1 requires the creation of additional dedicated NLBs and interface endpoint connections when more brokers are added to the cluster. TxB limits this management overhead through automation, which requires additional engineering effort.

Also, using Pattern 1 costs more compared to Pattern 2, because each broker in the cluster has a dedicated NLB and an interface endpoint. For a single broker, it costs $37.80 per month to keep the end-to-end connectivity infrastructure up. The breakdown of the monthly connectivity costs is as follows:

  • NLB running cost – 1 NLB x $0.0225 x 720 hours/month = $16.20/month
  • 1 VPC endpoint spread across three AZs – 1 VPCE x 3 ENIs x $0.01 x 720 hours/month = $21.60/month

Additional charges for NLB capacity used and AWS PrivateLink data processed apply. For more information about pricing, see Elastic Load Balancing pricing and AWS PrivateLink pricing.

To summarize, Pattern 1 is best applicable when:

  • You want to minimize the management overhead associated with modifying broker properties, such as advertised port
  • You have automation that takes care of adding and removing infrastructure when new brokers are created or destroyed
  • Simplified and uniform deployments are primary drivers, with cost as a secondary concern

Transaction Banking’s security requirements for Amazon MSK

The TxB micro-account provides a strong application isolation boundary, and accessing MSK brokers using AWS PrivateLink using Pattern 1 allows for tightly controlled connection flows between these TxB micro-accounts. TxB further builds on this foundation through additional infrastructure and data protection controls available in Amazon MSK. For more information, see Security in Amazon Managed Streaming for Apache Kafka.

The following are the core security tenets that TxB’s internal security team requires for using Amazon MSK:

  • Encryption at rest using Customer Master Key (CMK) – TxB uses the Amazon MSK managed offering of encryption at rest. Amazon MSK integrates with AWS Key Management Service (AWS KMS) to offer transparent server-side encryption to always encrypt your data at rest. When you create an MSK cluster, you can specify the AWS KMS CMK that AWS KMS uses to generate data keys that encrypt your data at rest. For more information, see Using CMKs and data keys.
  • Encryption in transit – Amazon MSK uses TLS 1.2 for encryption in transit. TxB makes client-broker encryption and encryption between the MSK brokers mandatory.
  • Client authentication with TLS – Amazon MSK uses AWS Certificate Manager Private Certificate Authority (ACM PCA) for client authentication. The ACM PCA can either be a root Certificate Authority (CA) or a subordinate CA. If it’s a root CA, you need to install a self-signed certificate. If it’s a subordinate CA, you can choose its parent to be an ACM PCA root, a subordinate CA, or an external CA. This external CA can be your own CA that issues the certificate and becomes part of the certificate chain when installed as the ACM PCA certificate. TxB takes advantage of this capability and uses certificates signed by ACM PCA that are distributed to the client accounts.
  • Authorization using Kafka Access Control Lists (ACLs) – Amazon MSK allows you to use the Distinguished Name of a client’s TLS certificates as the principal of the Kafka ACL to authorize client requests. To enable Kafka ACLs, you must first have client authentication using TLS enabled. TxB uses the Kafka Admin API to create Kafka ACLs for each topic using the certificate names of the certificates deployed on the consumer and producer client instances. For more information, see Apache Kafka ACLs.
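
As an illustration of the last point, the following hypothetical command grants a producer, identified by the Distinguished Name of its TLS certificate, write access to a single topic. It uses the Kafka CLI that ships with Apache Kafka rather than the Admin API that TxB uses programmatically; the DN, topic name, and ZooKeeper connection string are placeholders:

./kafka-acls.sh \
  --authorizer-properties zookeeper.connect=z-1.exampleClusterName.abcde.c2.kafka.us-east-2.amazonaws.com:2181 \
  --add \
  --allow-principal "User:CN=payments-producer.example.internal" \
  --operation Write \
  --topic payments-events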

Conclusion

This post illustrated how the Transaction Banking team at Goldman Sachs approaches an application isolation boundary through the TxB micro-account strategy and how AWS PrivateLink complements this strategy.  Additionally, this post discussed how the TxB team builds connectivity to their MSK clusters across TxB micro-accounts and how Amazon MSK takes the undifferentiated heavy lifting away from TxB by allowing them to achieve their core security requirements. You can leverage this post as a reference to build a similar approach when implementing an Amazon MSK environment.

About the Authors

Robert L. Cossin is a Vice President at Goldman Sachs in New York. Rob joined Goldman Sachs in 2004 and has worked on many projects within the firm’s cash and securities flows. Most recently, Rob is a technical architect on the Transaction Banking team, focusing on cloud enablement and security.

Harsha W. Sharma is a Solutions Architect with AWS in New York. Harsha joined AWS in 2016 and works with Global Financial Services customers to design and develop architectures on AWS, and support their journey on the cloud.

Migrating your Netezza data warehouse to Amazon Redshift

Post Syndicated from John Hwang original https://aws.amazon.com/blogs/big-data/migrating-your-netezza-data-warehouse-to-amazon-redshift/

With IBM announcing that Netezza is reaching end-of-life, you’re faced with the prospect of having to migrate your data and workloads off your analytics appliance. For some, this presents an opportunity to transition to the cloud.

Enter Amazon Redshift.

Amazon Redshift is a cloud-native data warehouse platform built to handle workloads at scale, and it shares key similarities with Netezza that make it an excellent candidate to replace your on-premises appliance. You can migrate your data and applications to Amazon Redshift in less time and with fewer changes than migrating to other analytics platforms. For developers, this means less time spent retraining on the new database. For stakeholders, it means a lower cost of, and time to, migration. For more information, see How to migrate a large data warehouse from IBM Netezza to Amazon Redshift with no downtime.

This post discusses important similarities and differences between Netezza and Amazon Redshift, and how they could impact your migration timeline.

Similarities

Three significant similarities between Netezza and Amazon Redshift are their compatibility with Postgres, their massively parallel processing (MPP) architecture, and the Amazon Redshift feature Advanced Query Accelerator (AQUA) compared to Netezza’s use of FPGAs.

Postgres compatibility

Both Netezza and Amazon Redshift share some compatibility with Postgres, an open-source database. This means that Netezza SQL and Amazon Redshift SQL have a similar syntax. In particular, both support many features of PL/pgSQL, Postgres’s procedural language. Your Netezza stored procedures can translate to Amazon Redshift with little-to-no rewriting of code.

You can also use the AWS Schema Conversion Tool, which can automatically migrate a large percentage of Netezza stored procedures to Amazon Redshift syntax with zero user effort. And because both databases are built for analytics and not transactional workloads, there are similar characteristics between the two databases. For example, neither Netezza nor Amazon Redshift enforces primary keys, which improves performance, though you can still define primary keys on your tables to help the optimizer create better query plans.

MPP architecture

Both Netezza and Amazon Redshift are MPP databases. This means that a query is sent to a leader node, which compiles a set of commands that it sends to multiple compute nodes. Each compute node performs its task in parallel and returns the results to the leader node, where the results are aggregated and returned to the user. This means that you can apply similar architectural strategies from Netezza, such as zone maps and distribution styles, to Amazon Redshift. For more information, see Amazon Redshift Engineering’s Advanced Table Design Playbook: Preamble, Prerequisites, and Prioritization.

Amazon Redshift AQUA and Netezza’s FPGA

AWS recently announced the new Amazon Redshift feature AQUA, which is in preview as of this writing. AQUA uses AWS-designed processors to bring certain compute tasks closer to the storage layer. Compression, encryption, filtering, aggregation, and other tasks can now happen in this intermediary layer between storage and compute, which leaves the CPU free to handle more complex tasks. When it’s released, AQUA is expected to make queries up to 10 times faster.

Netezza uses FPGAs to perform simple compute tasks before data reaches the CPU. Applications designed to employ these FPGA features on Netezza (for example, queries that rely heavily on certain aggregate functions and data filtering) translate well to Amazon Redshift clusters using AQUA.

Differences

For all the similarities that Amazon Redshift and Netezza share, they also have differences. There are three important differences that could have significant impact on your data and application architecture when migrating from Netezza to Amazon Redshift: column store vs. row store, concurrency scaling, and data lake integration.

Column store vs. row store

Netezza stores each row of data onto disk in data blocks, whereas Amazon Redshift stores each column of data. For many analytics workloads, a column store can have dramatic performance benefits over a row store. Typically, an analytics database has tables with many columns, but only a small handful of those columns are used in any one query. For example, assume that you have a table in a row store database with 100 columns and 100 million rows. If you want to sum the entire value of one column in this table, your query has to suffer the I/O penalty of scanning the entire table to retrieve the data in this single column. In a columnar database with the same table, the query only faces I/O for the single column. In addition to enjoying the improved performance for this type of workload in Amazon Redshift, this gives you options for designing wide tables without having to weigh the increase in I/O for typical analytics workloads the way you do in Netezza.

Concurrency scaling

Although both Netezza and Amazon Redshift offer queue priority and short query acceleration to help reduce concurrency issues, Amazon Redshift also uses the benefits of the cloud to offer additional options to handle concurrency.

One option is to take unrelated workloads from a single Netezza appliance and migrate them to separate Amazon Redshift clusters, each with an instance type and number of nodes sized appropriately for the workload it has to support. Netezza’s more traditional license and support pricing model, combined with its limited options for appliance sizes, can make this type of architecture untenable for your organization’s budget.

Additionally, Amazon Redshift offers Concurrency Scaling to scale out (and scale back in) additional compute capacity automatically and on the fly. If you want to add or remove nodes in the cluster, or experiment with other node types, you can do so in just a few minutes using elastic resize (to change the number of nodes) and classic resize (to change instance types).
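
For example, the following AWS CLI sketch shows both kinds of resize; the cluster name, node counts, and node type are placeholders, and elastic resize is the default behavior of resize-cluster while the --classic flag requests a classic resize:

# Elastic resize: change the number of nodes on an existing cluster
aws redshift resize-cluster \
  --cluster-identifier analytics-cluster \
  --number-of-nodes 8

# Classic resize: change the node type as well as the node count
aws redshift resize-cluster \
  --cluster-identifier analytics-cluster \
  --cluster-type multi-node \
  --node-type ra3.4xlarge \
  --number-of-nodes 4 \
  --classic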

This type of scaling and resizing isn’t feasible on Netezza because the on-premises appliance has a fixed number of blades. Concurrency scaling in Amazon Redshift can support virtually unlimited concurrent users and concurrent queries, and its ability to automatically add and remove additional capacity means you only pay for the time the concurrency scaling clusters are in use.

Data lake integration

Netezza offers the ability to create an external table from a data file either on the Netezza host or a remote host. However, querying those tables means migrating the entire dataset internally before running the query. With Amazon Redshift Spectrum, rather than using external tables as a convenient way to migrate entire datasets to and from the database, you can run analytical queries against data in your data lake the same way you do an internal table. As the volume of your data in the lake grows, partitioning can help keep the amount of data scanned for each query (and therefore performance) consistent. You can even join the data from your external tables in Amazon Simple Storage Service (Amazon S3) to your internal tables in Amazon Redshift.
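
As a hypothetical example of such a join, the following command queries the cluster from psql; the connection details, the external schema spectrum_schema and its table over files in Amazon S3, and the internal table are all placeholders:

PGPASSWORD='<password>' psql \
  "host=my-cluster.abc123xyz.us-east-2.redshift.amazonaws.com port=5439 dbname=dev user=awsuser" \
  -c "
SELECT d.region, SUM(s.sales_amount) AS total_sales
FROM spectrum_schema.daily_sales s   -- external table over files in Amazon S3
JOIN public.store_dim d              -- internal Amazon Redshift table
  ON s.store_id = d.store_id
GROUP BY d.region;"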

Not only does this mean true integration of your data warehouse and data lake, which can lower your warehouse’s more expensive storage requirements, but because the Amazon Redshift Spectrum layer uses its own dedicated infrastructure outside of your cluster, you can offload many compute-intensive tasks from your cluster and push them down to the Redshift Spectrum layer. For more information, see Amazon Redshift Spectrum overview.

Conclusion

Amazon Redshift can accelerate migration from Netezza and potentially lower the time and cost of moving your analytics workloads into the cloud. The similarity between the two systems eases the pain of migration, but Amazon Redshift has important differences that can offer additional benefits around performance and cost. The expertise of AWS and its partner network can help with your migration strategy and offer guidance to avoid potential roadblocks and pitfalls.

About the Author

John Hwang is a senior solutions architect at AWS.

Brian McDermott is a senior sales manager at AWS.