All posts by Sriharsh Adari

Build efficient ETL pipelines with AWS Step Functions distributed map and redrive feature

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/build-efficient-etl-pipelines-with-aws-step-functions-distributed-map-and-redrive-feature/

AWS Step Functions is a fully managed visual workflow service that enables you to build complex data processing pipelines involving a diverse set of extract, transform, and load (ETL) technologies such as AWS Glue, Amazon EMR, and Amazon Redshift. You can visually build the workflow by wiring individual data pipeline tasks and configuring payloads, retries, and error handling with minimal code.

While Step Functions supports automatic retries and error handling when data pipeline tasks fail due to momentary or transient errors, there can be permanent failures such as incorrect permissions, invalid data, and business logic failure during the pipeline run. This requires you to identify the issue in the step, fix the issue, and restart the workflow. Previously, to rerun the failed step, you needed to restart the entire workflow from the very beginning. This leads to delays in completing the workflow, especially if it’s a complex, long-running ETL pipeline. If the pipeline has many steps using map and parallel states, this also leads to increased cost because rerunning the pipeline from the beginning incurs additional state transitions.

Step Functions now supports the ability for you to redrive your workflow from a failed, aborted, or timed-out state so you can complete workflows faster and at a lower cost, and spend more time delivering business value. Now you can recover from unhandled failures faster by redriving failed workflow runs, after downstream issues are resolved, using the same input provided to the failed state.

In this post, we show you an ETL pipeline job that exports data from Amazon Relational Database Service (Amazon RDS) tables using the Step Functions distributed map state. Then we simulate a failure and demonstrate how to use the new redrive feature to restart the failed task from the point of failure.

Solution overview

One of the common functionalities involved in data pipelines is extracting data from multiple data sources and exporting it to a data lake or synchronizing the data to another database. You can use the Step Functions distributed map state to run hundreds of such export or synchronization jobs in parallel. Distributed map can read millions of objects from Amazon Simple Storage Service (Amazon S3) or millions of records from a single S3 object, and distribute the records to downstream steps. Step Functions runs the steps within the distributed map as child workflows at a maximum parallelism of 10,000. A concurrency of 10,000 is well above the concurrency supported by many other AWS services such as AWS Glue, which has a soft limit of 1,000 job runs per job.

The sample data pipeline sources product catalog data from Amazon DynamoDB and customer order data from an Amazon RDS for PostgreSQL database. The data is then cleansed, transformed, and uploaded to Amazon S3 for further processing. The data pipeline starts with an AWS Glue crawler to create the Data Catalog for the RDS database. Because starting an AWS Glue crawler is asynchronous, the pipeline has a wait loop to check if the crawler is complete. After the AWS Glue crawler is complete, the pipeline extracts data from the DynamoDB table and RDS tables. Because these two steps are independent, they are run as parallel steps: one using an AWS Lambda function to export, transform, and load the data from DynamoDB to an S3 bucket, and the other using a distributed map with AWS Glue job sync integration to do the same from the RDS tables to an S3 bucket. Note that AWS Identity and Access Management (IAM) permissions are required for invoking an AWS Glue job from Step Functions. For more information, refer to IAM Policies for invoking AWS Glue job from Step Functions.

The following diagram illustrates the Step Functions workflow.

There are multiple tables related to customers and order data in the RDS database. Amazon S3 hosts the metadata of all the tables as a .csv file. The pipeline uses the Step Functions distributed map to read the table metadata from Amazon S3, iterate on every single item, and call the downstream AWS Glue job in parallel to export the data. See the following code:

"States": {
            "Map": {
              "Type": "Map",
              "ItemProcessor": {
                "ProcessorConfig": {
                  "Mode": "DISTRIBUTED",
                  "ExecutionType": "STANDARD"
                },
                "StartAt": "Export data for a table",
                "States": {
                  "Export data for a table": {
                    "Type": "Task",
                    "Resource": "arn:aws:states:::glue:startJobRun.sync",
                    "Parameters": {
                      "JobName": "ExportTableData",
                      "Arguments": {
                        "--dbtable.$": "$.tables"
                      }
                    },
                    "End": true
                  }
                }
              },
              "Label": "Map",
              "ItemReader": {
                "Resource": "arn:aws:states:::s3:getObject",
                "ReaderConfig": {
                  "InputType": "CSV",
                  "CSVHeaderLocation": "FIRST_ROW"
                },
                "Parameters": {
                  "Bucket": "123456789012-stepfunction-redrive",
                  "Key": "tables.csv"
                }
              },
              "ResultPath": null,
              "End": true
            }
          }
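
To make the item flow concrete, the following Python sketch mimics how a CSV file with a header row can be split into one JSON item per row, which is where the `$.tables` reference in the state definition gets its value. It is an illustration only, not how Step Functions is implemented. The table names and the single `tables` column are assumptions; only the `$.tables` path and the `--dbtable` argument come from the definition above.

```python
import csv
import io

# Hypothetical contents of tables.csv. The header name matches the
# "$.tables" path used in the state definition; the rows are made up.
SAMPLE_CSV = "tables\ncustomers\norders\norder_items\n"


def csv_to_items(text):
    """Return one dict per data row, keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def glue_arguments(item):
    """Build the AWS Glue job arguments one child workflow would pass."""
    return {"--dbtable": item["tables"]}


items = csv_to_items(SAMPLE_CSV)
for item in items:
    print(dict(item), "->", glue_arguments(item))
```

Each row becomes the input of one child workflow, so a file with three data rows would start three AWS Glue job runs.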

Prerequisites

To deploy the solution, you need the following prerequisites:

Launch the CloudFormation template

Complete the following steps to deploy the solution resources using AWS CloudFormation:

  1. Choose Launch Stack to launch the CloudFormation stack.
  2. Enter a stack name.
  3. Select all the check boxes under Capabilities and transforms.
  4. Choose Create stack.

The CloudFormation template creates many resources, including the following:

  • The data pipeline described earlier as a Step Functions workflow
  • An S3 bucket to store the exported data and the metadata of the tables in Amazon RDS
  • A product catalog table in DynamoDB
  • An RDS for PostgreSQL database instance with pre-loaded tables
  • An AWS Glue crawler that crawls the RDS table and creates an AWS Glue Data Catalog
  • A parameterized AWS Glue job to export data from the RDS table to an S3 bucket
  • A Lambda function to export data from DynamoDB to an S3 bucket

Simulate the failure

Complete the following steps to test the solution:

  1. On the Step Functions console, choose State machines in the navigation pane.
  2. Choose the workflow named ETL_Process.
  3. Run the workflow with default input.

Within a few seconds, the workflow fails at the distributed map state.

You can inspect the map run errors by accessing the Step Functions workflow execution events for map runs and child workflows. In this example, you can identify that the exception is due to Glue.ConcurrentRunsExceededException from AWS Glue. The error indicates there are more concurrent requests to run an AWS Glue job than are configured. Distributed map reads the table metadata from Amazon S3 and invokes as many AWS Glue jobs as the number of rows in the .csv file, but the AWS Glue job is set with a concurrency of 3 when it is created. This resulted in the child workflow failure, cascading the failure to the distributed map state and then the parallel state. The other step in the parallel state to fetch the DynamoDB table ran successfully. If any step in the parallel state fails, the whole state fails, as seen with the cascading failure.

Handle failures with distributed map

By default, when a state reports an error, Step Functions causes the workflow to fail. There are multiple ways you can handle this failure with distributed map state:

  • Step Functions enables you to catch errors, retry errors, and fall back to another state to handle errors gracefully. See the following code:
    "Retry": [
                          {
                            "ErrorEquals": [
                              "Glue.ConcurrentRunsExceededException"
                            ],
                            "BackoffRate": 20,
                            "IntervalSeconds": 10,
                            "MaxAttempts": 3,
                            "Comment": "Exception",
                            "JitterStrategy": "FULL"
                          }
                        ]
    

  • Sometimes, businesses can tolerate failures. This is especially true when you are processing millions of items and you expect data quality issues in the dataset. By default, when an iteration of map state fails, all other iterations are aborted. With distributed map, you can specify the maximum number of, or percentage of, failed items as a failure threshold. If the failure is within the tolerable level, the distributed map doesn’t fail.
  • The distributed map state allows you to control the concurrency of the child workflows. You can set the concurrency to map it to the AWS Glue job concurrency. Remember, this concurrency is applicable only at the workflow execution level—not across workflow executions.
  • You can redrive the failed state from the point of failure after fixing the root cause of the error.
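
As a rough illustration of the first three options, the following Python sketch computes the retry schedule implied by the retry settings shown earlier (IntervalSeconds 10, BackoffRate 20, MaxAttempts 3) and lists the Map state fields that control concurrency and failure tolerance. The values returned are upper bounds, on the assumption that full jitter picks a random wait between zero and the computed interval. The `ToleratedFailurePercentage` value of 5 is a made-up example; `MaxConcurrency` is set to 3 to match the AWS Glue job concurrency in this post.

```python
def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Return the un-jittered wait (in seconds) before each retry attempt."""
    return [
        interval_seconds * backoff_rate ** attempt
        for attempt in range(max_attempts)
    ]


# Settings taken from the Retry block in this post.
delays = retry_delays(interval_seconds=10, backoff_rate=20, max_attempts=3)
print(delays)  # [10, 200, 4000]

# Fields that could be added to the distributed map state. The tolerated
# failure percentage is a hypothetical value chosen for illustration.
map_state_settings = {
    "MaxConcurrency": 3,
    "ToleratedFailurePercentage": 5,
}
print(map_state_settings)
```

With a backoff rate this high, the third retry can wait more than an hour, so it is worth checking that the schedule fits the expected duration of the AWS Glue jobs.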

Redrive the failed state

The root cause of the issue in the sample solution is the AWS Glue job concurrency. To address this by redriving the failed state, complete the following steps:

  1. On the AWS Glue console, navigate to the job named ExportTableData.
  2. On the Job details tab, under Advanced properties, update Maximum concurrency to 5.

With the redrive feature, you can restart executions of standard workflows that didn’t complete successfully in the last 14 days. These include failed, aborted, or timed-out runs. You can only redrive a failed workflow from the step where it failed using the same input as the last non-successful state. You can’t redrive a failed workflow using a state machine definition that is different from the initial workflow execution. After the failed state is redriven successfully, Step Functions runs all the downstream tasks automatically. To learn more about how distributed map redrive works, refer to Redriving Map Runs.

Because the distributed map runs the steps inside the map as child workflows, the workflow IAM execution role needs permission to redrive the map run to restart the distributed map state:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "states:RedriveExecution"
      ],
      "Resource": "arn:aws:states:us-east-2:123456789012:execution:myStateMachine/myMapRunLabel:*"
    }
  ]
}

You can redrive a workflow from its failed step programmatically, via the AWS Command Line Interface (AWS CLI) or AWS SDK, or using the Step Functions console, which provides a visual operator experience.

  1. On the Step Functions console, navigate to the failed workflow you want to redrive.
  2. On the Details tab, choose Redrive from failure.

The pipeline now runs successfully because there is enough concurrency to run the AWS Glue jobs.

To redrive a workflow programmatically from its point of failure, call the new RedriveExecution API action. The same workflow starts from the last non-successful state and uses the same input as the last non-successful state from the initial failed workflow. The state to redrive from, the workflow definition, and the previous input are immutable.
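
A programmatic redrive could look like the following minimal Python sketch, which assumes the AWS SDK for Python (boto3) is installed and credentials are configured. The execution ARN and the helper names `build_redrive_request` and `redrive` are hypothetical. The request is built separately from the API call so it can be inspected without contacting AWS.

```python
import uuid


def build_redrive_request(execution_arn, client_token=None):
    """Return keyword arguments for the RedriveExecution call."""
    return {
        "executionArn": execution_arn,
        # An idempotency token; a random UUID is used when none is given.
        "clientToken": client_token or str(uuid.uuid4()),
    }


def redrive(execution_arn):
    """Redrive a failed execution. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above works without it

    client = boto3.client("stepfunctions")
    response = client.redrive_execution(**build_redrive_request(execution_arn))
    return response["redriveDate"]


# Hypothetical execution ARN, for illustration only.
EXAMPLE_ARN = (
    "arn:aws:states:us-east-2:123456789012:execution:ETL_Process:example-run"
)
request = build_redrive_request(EXAMPLE_ARN, client_token="demo-token")
print(request)
```

Calling `redrive(EXAMPLE_ARN)` against a real failed execution would restart it from the failed state, provided the caller has the `states:RedriveExecution` permission.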

Note the following regarding different types of child workflows:

  • Redrive for express child workflows – For failed child workflows that are express workflows within a distributed map, the redrive capability ensures a seamless restart from the beginning of the child workflow. This allows you to resolve issues that are specific to individual iterations without restarting the entire map.
  • Redrive for standard child workflows – For failed child workflows within a distributed map that are standard workflows, the redrive feature functions the same way as with standalone standard workflows. You can restart the failed state within each map iteration from its point of failure, skipping unnecessary steps that have already successfully run.

You can use Step Functions status change notifications with Amazon EventBridge for failure notifications such as sending an email on failure.
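
As a sketch of what such a notification rule might match, the following Python snippet defines an EventBridge event pattern for unsuccessful Step Functions executions and a deliberately simplified matcher to show which events it selects. The matcher is not EventBridge's matching engine, and the sample event contains only the fields needed for the illustration.

```python
import json

# Event pattern for executions that end in a non-successful status.
FAILURE_PATTERN = {
    "source": ["aws.states"],
    "detail-type": ["Step Functions Execution Status Change"],
    "detail": {"status": ["FAILED", "TIMED_OUT", "ABORTED"]},
}


def matches(event, pattern=FAILURE_PATTERN):
    """Simplified check covering only the exact-match fields used here."""
    return (
        event.get("source") in pattern["source"]
        and event.get("detail-type") in pattern["detail-type"]
        and event.get("detail", {}).get("status") in pattern["detail"]["status"]
    )


# Trimmed-down, hypothetical event for illustration.
sample_event = {
    "source": "aws.states",
    "detail-type": "Step Functions Execution Status Change",
    "detail": {"status": "FAILED"},
}

print(json.dumps(FAILURE_PATTERN, indent=2))
print(matches(sample_event))
```

The printed pattern can be pasted into an EventBridge rule whose target is, for example, an Amazon SNS topic with an email subscription.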

Clean up

To clean up your resources, delete the CloudFormation stack via the AWS CloudFormation console.

Conclusion

In this post, we showed you how to use the Step Functions redrive feature to redrive a failed step within a distributed map by restarting the failed step from the point of failure. The distributed map state allows you to write workflows that coordinate large-scale parallel workloads within your serverless applications. Step Functions runs the steps within the distributed map as child workflows at a maximum parallelism of 10,000, which is well above the concurrency supported by many AWS services.

To learn more about distributed map, refer to Step Functions – Distributed Map. To learn more about redriving workflows, refer to Redriving executions.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing Tennis.

Joe Morotti is a Senior Solutions Architect at Amazon Web Services (AWS), working with Enterprise customers across the Midwest US to develop innovative solutions on AWS. He has held a wide range of technical roles and enjoys showing customers the art of the possible. He has attained seven AWS certifications and has a passion for AI/ML and the contact center space. In his free time, he enjoys spending quality time with his family exploring new places and overanalyzing his sports team’s performance.

Uma Ramadoss is a specialist Solutions Architect at Amazon Web Services, focused on the Serverless platform. She is responsible for helping customers design and operate event-driven cloud-native applications and modern business workflows using services like Lambda, EventBridge, Step Functions, and Amazon MWAA.

Simplify semi-structured nested JSON data analysis with AWS Glue DataBrew and Amazon QuickSight

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/simplify-semi-structured-nested-json-data-analysis-with-aws-glue-databrew-and-amazon-quicksight/

As the industry grows with more data volume, big data analytics is becoming a common requirement in data analytics and machine learning (ML) use cases. Data comes from many different sources in structured, semi-structured, and unstructured formats. For semi-structured data, one of the most common lightweight file formats is JSON. However, due to the complex nature of data, JSON often includes nested key-value structures. Analysts may want a simpler graphical user interface to conduct data analysis and profiling.

To support these requirements, AWS Glue DataBrew offers an easy visual data preparation tool with over 350 pre-built transformations. You can use DataBrew to analyze complex nested JSON files that would otherwise require days or weeks writing hand-coded transformations. You can then use Amazon QuickSight for data analysis and visualization.

In this post, we demonstrate how to configure DataBrew to work with nested JSON objects and use QuickSight for data visualization.

Solution overview

To implement our solution, we create a DataBrew project and DataBrew job for unnesting data. We profile the unnested data in DataBrew and analyze data in QuickSight. The following diagram illustrates the architecture of this solution.

Prerequisites

Before you get started, make sure you have the following prerequisites:

Prepare the data

To illustrate the DataBrew functionality to support data analysis for nested JSON files, we use a publicly available sample customer order details nested JSON dataset.

Complete the following steps to prepare your data:

  1. Sign in to the AWS Management Console.
  2. Browse to the publicly available datasets on the Amazon S3 console.
  3. Select the first dataset (customer_1.json) and choose Download to save the files on your local machine.
  4. Repeat this step to download all three JSON files.

    You can view the sample data from your local machine using any text editor, as shown in the following screenshot.
  5. Create input and output S3 buckets with subfolders nestedjson and outputjson to capture data.
  6. Choose Upload and upload the three JSON files to the nestedjson folder.

Create a DataBrew project

To create your Amazon S3 connection, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose Create project.
  3. For Project name, enter Glue-DataBew-NestedJSON-Blog.
  4. Select New dataset.
  5. For Dataset name, enter Glue-DataBew-NestedJSON-Dataset.
  6. For Enter your source from S3, enter the path to the nestedjson folder.
  7. Choose Select the entire folder to select all the files.
  8. Under Additional configurations, select JSON as the file type, then select JSON document.
  9. In the Permissions section, choose Choose existing IAM role if you have one available, or choose Create new IAM role.
  10. Choose Create project.
  11. Skip the preview steps and wait for the project to be ready.
    As shown in the following screenshot, the three JSON files were uploaded to the S3 bucket, so three rows of customer order details are loaded.
    The orders column contains nested data. We can use the DataBrew Nest-unnest transform to flatten the columns and rows.
  12. Choose the menu icon (three dots) and choose Nest-unnest.
  13. Depending on the nesting, either choose Unnest to columns or Unnest to rows. In this blog post, we choose Unnest to columns to flatten the example JSON file.

    Repeat this step until all the nested JSON data is flattened. This creates the AWS Glue DataBrew recipe.
  14. Choose Apply.

    DataBrew automatically creates the required recipe steps with updated column values.
  15. Choose Create job.
  16. For Job name, enter Glue-DataBew-NestedJSON-job.
  17. For S3 location, enter the path to the outputjson folder.
  18. In the Permissions section, for Role name, choose the role you created earlier.
  19. Choose Create and run job.

On the Jobs page, you can choose the job to view its run history, details, and data lineage.
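
To give a feel for what flattening nested objects into columns means, here is a small Python sketch that turns nested keys into dotted column names. It is a conceptual illustration only and not DataBrew's implementation. The sample record, the function name `unnest_to_columns`, and the dot separator are assumptions.

```python
def unnest_to_columns(record, parent_key="", sep="."):
    """Flatten nested dicts into a single-level dict of columns."""
    flat = {}
    for key, value in record.items():
        column = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            # Recurse into nested objects, prefixing the parent name.
            flat.update(unnest_to_columns(value, column, sep))
        else:
            # Scalars and lists are kept as-is.
            flat[column] = value
    return flat


# Hypothetical customer order record, for illustration only.
sample = {
    "customer_id": 1,
    "orders": {"order_id": "A100", "shipping": {"city": "Seattle"}},
}
flat = unnest_to_columns(sample)
print(flat)
```

Lists are left untouched in this sketch; expanding them would be closer to the Unnest to rows option.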

Profile the metadata with DataBrew

After you have a flattened file in the S3 output bucket, you can use DataBrew to carry out the data analysis and profiling for the flattened file. Complete the following steps:

  1. On the Datasets page, choose Connect new datasets.
  2. Provide your dataset details and choose Create dataset.
  3. Choose the newly added data source, then choose the Data profile overview tab.
  4. Enter the name of the job and the S3 path to save the output.
  5. Choose Create and run job.

The job takes around two minutes to complete and display all the updated information. You can explore the data further on the Data profile overview and Column statistics tabs.

Visualize the data in QuickSight

After you have the output file generated by DataBrew in the S3 output bucket, you can use QuickSight to query the JSON data. QuickSight is a scalable, serverless, embeddable, ML-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.

Launch QuickSight

On the console, enter quicksight into the search bar and choose QuickSight.

You’re presented with the QuickSight welcome page. If you haven’t signed up for QuickSight, you may have to complete the signup wizard. For more information, refer to Signing up for an Amazon QuickSight subscription.

After you have signed up, QuickSight presents a “Welcome wizard.” You can view the short tutorial, or you can close it.

Grant Amazon S3 access

To grant Amazon S3 access, complete the following steps:

  1. On the QuickSight console, choose your user name, choose Manage QuickSight, then choose Security & permissions.
  2. Choose Add or remove.
  3. Locate Amazon S3 in the list. Choose one of the following:
    1. If the check box is clear, select Amazon S3.
    2. If the check box is already selected, choose Details, then choose Select S3 buckets.
  4. Choose the buckets that you want to access from QuickSight, then choose Select.
  5. Choose Update.
  6. If you changed your Region during the first step of this process, change it back to the Region that you want to use.

Create a dataset

Now that you have QuickSight up and running, you can create your dataset. Complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.

    QuickSight supports several data sources. For a complete list, refer to Supported data sources.
  3. For your data source, choose S3.

    The S3 import requires a data source name and a manifest file.
  4. On your machine, use a text editor to create a manifest file called BlogGlueDataBrew.manifest using the following structure (provide the name of your output bucket):
    {
        "fileLocations": [
            {
                "URIPrefixes": [
                "s3://<output bucket>/outputjson/"
                ]
            }
        ],
        "globalUploadSettings": {
            "format": "CSV",
            "delimiter": ","
        }
    }

    The manifest file points to the folder that you created earlier as part of your DataBrew project. For more information, refer to Supported formats for Amazon S3 manifest files.

  5. Select Upload and navigate to the manifest file to upload it.
  6. Choose Connect to upload data into SPICE, which is an in-memory database built into QuickSight to achieve fast performance.
  7. Choose Visualize.

You can now create visuals by adding different fields.

To learn more about authoring dashboards in QuickSight, check out the QuickSight Author Workshop.
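
If you prefer to generate the manifest file rather than type it by hand, a minimal Python sketch could look like the following. The bucket name `example-output-bucket` and the helper names are placeholders; the structure mirrors the manifest shown in the steps above, with the location written as an `s3://` URI prefix.

```python
import json


def build_manifest(bucket, prefix="outputjson/"):
    """Return a QuickSight S3 manifest pointing at the DataBrew output."""
    return {
        "fileLocations": [{"URIPrefixes": [f"s3://{bucket}/{prefix}"]}],
        "globalUploadSettings": {"format": "CSV", "delimiter": ","},
    }


def write_manifest(path, bucket):
    """Write the manifest to a local file ready for upload."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(build_manifest(bucket), handle, indent=4)


# Placeholder bucket name; replace it with your own output bucket.
manifest = build_manifest("example-output-bucket")
print(json.dumps(manifest, indent=4))
```

Calling `write_manifest("BlogGlueDataBrew.manifest", "<your bucket>")` would produce the file to upload in QuickSight.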

Clean up

Complete the following steps to avoid incurring future charges:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Select the project you created and on the Actions menu, choose Delete.
  3. Choose Jobs in the navigation pane.
  4. Select the job you created and on the Actions menu, choose Delete.
  5. Choose Recipes in the navigation pane.
  6. Select the recipe you created and on the Actions menu, choose Delete.
  7. On the QuickSight dashboard, choose your user name on the application bar, then choose Manage QuickSight.
  8. Choose Account settings, then choose Delete account.
  9. Choose Delete account.
  10. Enter confirm and choose Delete account.

Conclusion

This post walked you through the steps to configure DataBrew to work with nested JSON objects and use QuickSight for data visualization. We used Glue DataBrew to unnest our JSON file and profile the data, and then used QuickSight to create dashboards and visualizations for further analysis.

You can use this solution for your own use cases when you need to unnest complex semi-structured JSON files without writing code. If you have comments or feedback, please leave them in the comments section.


About the authors

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Rahul Sonawane is a Principal Analytics Solutions Architect at AWS with AI/ML and Analytics as his area of specialty.

Amogh Gaikwad is a Solutions Developer at Amazon Web Services. He helps global customers build and deploy AI/ML solutions. His work is mainly focused on computer vision and NLP use cases, and on helping customers optimize their AI/ML workloads for sustainability. Amogh has received his master’s in Computer Science specializing in Machine Learning.

Enable Amazon QuickSight federation with Google Workspace

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/enable-amazon-quicksight-federation-with-google-workspace/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning (ML)-powered business intelligence (BI) service built for the cloud that supports identity federation in both Standard and Enterprise editions. Organizations are working towards centralizing their identity and access strategy across all of their applications, including on-premises, third-party, and applications on AWS. Many organizations use Google Workspace to control and manage user authentication and authorization centrally. You can enable federation to QuickSight accounts without needing to create and manage users. This authorizes users to access QuickSight assets—analyses, dashboards, folders, and datasets—through centrally managed Google Workspace Identities.

In this post, we go through the steps to configure federated single sign-on (SSO) between a Google Workspace instance and a QuickSight account. We demonstrate registering an SSO application in Google Workspace and mapping QuickSight roles (admin, author, and reader) to Google Workspace Identities. These QuickSight roles represent three different personas supported in QuickSight. Administrators can publish the QuickSight app in a Google Workspace Dashboard to enable users to sign in to QuickSight using their Google Workspace credentials.

Solution overview

In your organization, the portal is typically a function of your identity provider (IdP), which handles the exchange of trust between your organization and QuickSight.

On the Google Workspace Dashboard, you can review a list of apps. This post shows you how to configure the custom app for AWS.

The user flow consists of the following steps:

  1. The user logs in to your organization’s portal and chooses the option to go to the QuickSight console.
  2. The portal verifies the user’s identity in your organization.
  3. The portal generates a SAML authentication response that includes assertions that identify the user and include attributes about the user. The portal sends this response to the client browser. Although not discussed here, you can also configure your IdP to include a SAML assertion attribute called SessionDuration that specifies how long the console session is valid.
  4. The client browser is redirected to the AWS single sign-on endpoint and posts the SAML assertion.
  5. The endpoint requests temporary security credentials on behalf of the user, and creates a QuickSight sign-in URL that uses those credentials.
  6. AWS sends the sign-in URL back to the client as a redirect.
  7. The client browser is redirected to the QuickSight console. If the SAML authentication response includes attributes that map to multiple AWS Identity and Access Management (IAM) roles, the user is first prompted to select the role for accessing the console.

The following diagram illustrates the solution architecture.

The following are the high-level steps to set up federated single sign-on access via Google Workspace:

  1. Download the Google IdP information.
  2. Create an IAM IdP with Google as SAML IdP.
  3. Configure IAM policies for QuickSight roles.
  4. Configure IAM QuickSight roles for federated users.
  5. Create a custom user attribute in Google Workspace.
  6. Add the AWS SAML attributes to your Google Workspace user profile.
  7. Set up the AWS SAML app in Google Workspace.
  8. Grant access to users in Google Workspace.
  9. Verify federated access to your QuickSight instance.

Detailed procedures for each of these steps comprise the remainder of this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • A Google Workspace subscription
  • An AWS account with QuickSight subscription
  • Basic understanding of QuickSight roles—admin, author, and reader
  • Basic understanding of IAM and privileges required to create an IAM identity provider, roles, policies, and users

Download the Google IdP information

First, let’s get the SAML metadata that contains essential information to enable your AWS account to authenticate the IdP and locate the necessary communication endpoint locations. Complete the following steps:

  1. Log in to the Google Workspace Admin console.
  2. On the Admin console home page, under Security in the navigation pane, choose Authentication and SSO with SAML applications.
  3. Under IdP metadata, choose Download Metadata.

Create an IAM IdP with Google as SAML IdP

You now configure Google Workspace as your SAML IdP via the IAM console. Complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Add provider.
  3. For Configure provider, select SAML.
  4. For Provider name, enter a name for the IdP (such as Google).
  5. For Metadata document, choose Choose file and specify the SAML metadata document that you downloaded.
  6. Choose Add provider.
  7. Document the Amazon Resource Name (ARN) by viewing the IdP you just created.

The ARN should look similar to arn:aws:iam::<YOURACCOUNTNUMBER>:saml-provider/Google. We need this ARN to configure claim rules later in this post.

Configure IAM policies for QuickSight roles

In this step, we create three IAM policies for different role permissions in QuickSight:

  • QuickSight-Federated-Admin
  • QuickSight-Federated-Author
  • QuickSight-Federated-Reader

Use the following steps to set up the QuickSight-Federated-Admin policy. This policy grants admin privileges in QuickSight to the federated user:

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Choose JSON and replace the existing text with the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateAdmin",
                "Resource": "*"
            }
        ]
    }

  4. Choose Review policy.
  5. For Name, enter QuickSight-Federated-Admin.
  6. Choose Create policy.
  7. Repeat these steps to create QuickSight-Federated-Author, and use the following policy to grant author privileges in QuickSight to the federated user:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateUser",
                "Resource": "*"
            }
        ]
    }

  8. Repeat the steps to create QuickSight-Federated-Reader, and use the following policy to grant reader privileges in QuickSight to the federated user:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateReader",
                "Resource": "*"
            }
        ]
    }
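
Because the three policies differ only in the action they allow, they can also be generated from a single template. The following Python sketch shows one way to do that. The policy names and actions are the ones used in this post; the helper name `build_policy` is made up for illustration.

```python
import json

# Policy name to QuickSight action, as listed in the steps above.
ROLE_ACTIONS = {
    "QuickSight-Federated-Admin": "quicksight:CreateAdmin",
    "QuickSight-Federated-Author": "quicksight:CreateUser",
    "QuickSight-Federated-Reader": "quicksight:CreateReader",
}


def build_policy(action):
    """Return an IAM policy document that allows a single action."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": action, "Resource": "*"}
        ],
    }


policies = {name: build_policy(action) for name, action in ROLE_ACTIONS.items()}
for name, document in policies.items():
    print(name)
    print(json.dumps(document, indent=4))
```

The printed JSON for each policy can be pasted into the JSON editor on the IAM console.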

Configure IAM QuickSight roles for federated users

Next, create the roles that Google IdP users assume when federating into QuickSight. The following steps set up the admin role:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created earlier (Google).
  5. For Attribute, choose SAML:aud.
  6. For Value, enter https://signin.aws.amazon.com/saml.
  7. Choose Next.
  8. On the Add permissions page, select the QuickSight-Federated-Admin IAM policy you created earlier.
  9. Choose Next.
  10. For Role name, enter QuickSight-Admin-Role.
  11. For Role description, enter a description.
  12. Choose Create role.
  13. On the IAM console, in the navigation pane, choose Roles.
  14. Choose the QuickSight-Admin-Role role you created to open the role’s properties.
  15. On the Trust relationships tab, choose Edit trust relationship.
  16. Under Trusted entities, verify that the IdP you created is listed.
  17. Under Condition, verify that SAML:aud with a value of https://signin.aws.amazon.com/saml is present.
  18. Repeat these steps to create author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Federated-Author.
    2. For QuickSight-Reader-Role, use the policy QuickSight-Federated-Reader.
  19. Navigate to the newly created roles and note the ARNs for them.

We use these ARNs to configure claim rules later in this post. They are in the following format:

  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Admin-Role
  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Author-Role
  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Reader-Role
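
For reference, the trust relationship that the console builds for each of these roles can be sketched as follows. This is a hedged illustration, not output copied from the console: the account ID 123456789012 is a placeholder, the function names are hypothetical, and the provider name Google is assumed from the IdP ARN shown earlier.

```python
import json

# Audience value entered in the role creation steps above.
SAML_AUDIENCE = "https://signin.aws.amazon.com/saml"


def build_saml_trust_policy(account_id: str, provider_name: str = "Google") -> dict:
    """Return a trust policy letting users of the SAML provider assume the role."""
    provider_arn = f"arn:aws:iam::{account_id}:saml-provider/{provider_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": provider_arn},
                "Action": "sts:AssumeRoleWithSAML",
                "Condition": {"StringEquals": {"SAML:aud": SAML_AUDIENCE}},
            }
        ],
    }


def build_role_arn(account_id: str, role_name: str) -> str:
    """Return the ARN of an IAM role in the given account."""
    return f"arn:aws:iam::{account_id}:role/{role_name}"


if __name__ == "__main__":
    account_id = "123456789012"  # placeholder account ID (assumption)
    print(json.dumps(build_saml_trust_policy(account_id), indent=4))
    for role in ("QuickSight-Admin-Role", "QuickSight-Author-Role", "QuickSight-Reader-Role"):
        print(build_role_arn(account_id, role))
```

Comparing this structure with the Trust relationships tab is a quick way to confirm that the Federated principal and the SAML:aud condition match what you configured.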

Create a custom user attribute in Google Workspace

Now let’s create a custom user attribute in your Google Workspace. This allows us to add the SAML attributes that the AWS Management Console expects for SAML-based authentication.

  1. Log in to the Google Admin console with admin credentials.
  2. Under Directory, choose Users.
  3. On the More options menu, choose Manage custom attributes.
  4. Choose Add Custom Attribute.
  5. For Select type of trusted entity, choose SAML 2.0 federation.
  6. Configure the custom attribute as follows:
    1. Category: Amazon
    2. Description: Amazon Custom Attributes
  7. For Custom fields, enter the following:
    1. Name: Role
    2. Info type: Text
    3. Visibility: Visible to user and admin
    4. No. of values: Multi-value
  8. Choose Add.

The new category appears on the Manage user attributes page.

Add the AWS SAML attributes to the Google Workspace user profile

Now that we have configured a custom user attribute, let’s add the SAML attributes that we noted earlier to the Google Workspace user profile.

  1. While logged in to the Google Admin console with admin credentials, navigate to the Users page.
  2. In the Users list, find the user. If you need help, see Find a user account.
  3. Choose the user’s name to open their account page.
  4. Choose User information.
  5. Choose the custom attribute you recently created, named Amazon.
  6. Add the value noted earlier to this custom attribute, in the following format: <AWS Role ARN>,<AWS provider/IdP ARN>.
  7. Choose Save.
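
The attribute value is simply the role ARN and the IdP ARN joined by a comma. A minimal Python sketch of that format follows; the function name saml_role_attribute and the account ID 123456789012 are placeholders chosen for illustration, and the provider name Google is assumed from the IdP ARN shown earlier.

```python
def saml_role_attribute(account_id: str, role_name: str, provider_name: str = "Google") -> str:
    """Return '<AWS Role ARN>,<AWS provider/IdP ARN>' for the Role custom attribute."""
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    provider_arn = f"arn:aws:iam::{account_id}:saml-provider/{provider_name}"
    # No space after the comma: the value is a plain comma-separated pair.
    return f"{role_arn},{provider_arn}"


if __name__ == "__main__":
    account_id = "123456789012"  # placeholder account ID (assumption)
    for role in ("QuickSight-Admin-Role", "QuickSight-Author-Role", "QuickSight-Reader-Role"):
        print(saml_role_attribute(account_id, role))
```

Because the Role field was created as multi-value, a user who needs more than one QuickSight role can be given one such value per role.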

Set up the AWS SAML app in Google Workspace

Now that we have everything in place, we’re ready to create a SAML app within our Google Workspace account and provide the QuickSight instance starting URL. This provides the entry point for Google Workspace users to SSO into the QuickSight instance.

  1. While logged in to the Google Admin console with admin credentials, under Apps, choose Web and mobile apps.
  2. Choose Add App, then choose Search for apps.
  3. Enter Amazon Web Services in the search field.
  4. In the search results, hover over the Amazon Web Services SAML app and choose Select.
  5. On the Google Identity Provider details page, choose Continue.
  6. On the Service provider details page, the ACS URL and Entity ID values for Amazon Web Services are configured by default.
  7. For Start URL, enter https://quicksight.aws.amazon.com.
  8. On the Attribute Mapping page, choose the Select field menu and map the following Google directory attributes to their corresponding Amazon Web Services attributes:

    Google Directory Attribute           Amazon Web Services Attribute
    Basic Information > Primary Email    https://aws.amazon.com/SAML/Attributes/RoleSessionName
    Amazon > Role                        https://aws.amazon.com/SAML/Attributes/Role

  9. Choose Finish.

Grant access to users in Google Workspace

When the SAML app is created in Google Workspace, it’s turned off by default. This means the SAML app isn’t visible to users logged in to their Google Workspace account. We now enable the AWS SAML app for your Google Workspace users.

  1. While logged in to the Google Admin console with admin credentials, navigate to the Web and mobile apps page.
  2. Choose Amazon Web Services.
  3. Choose User access.
  4. To turn on a service for everyone in your organization, choose ON for everyone.
  5. Choose Save.

If you don’t want to activate this application for all users, you can alternatively grant access to a subset of users by using Google Workspace organizational units.

Verify federated access to the QuickSight instance

To test your SAML 2.0-based authentication with QuickSight for users in your existing IdP (Google Workspace), complete the following steps:

  1. Open a new browser session, for example, using Chrome, in a new incognito window.
  2. Log in to your Google Workspace account (for the purpose of this demo, we use the Google Workspace admin account).
  3. Choose Amazon Web Services from the list of Google apps.

Conclusion

This post provided a step-by-step guide for configuring Google Workspace as your IdP, and using IAM roles to enable SSO to QuickSight. Now your users have a seamless sign-in experience to QuickSight and have the appropriate level of access related to their role.

Although this post demonstrated the integration of IAM and Google Workspace, you can replicate this solution using your choice of SAML 2.0 IdPs. For other supported federation options, see Using identity federation and single sign-on (SSO) with Amazon QuickSight.

To get answers to your questions related to QuickSight, refer to the QuickSight Community.

If you have any questions or feedback, please leave a comment.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Srikanth Baheti is a Specialized World Wide Sr. Solution Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high traffic web applications, highly scalable and maintainable data pipelines for reporting platforms using AWS services and Serverless computing.