Tag Archives: Amazon QuickSight

Visualize data using Apache Spark running on Amazon EMR with Amazon QuickSight

Post Syndicated from Tom McMeekin original https://aws.amazon.com/blogs/big-data/visualize-data-using-apache-spark-running-on-amazon-emr-with-amazon-quicksight/

Organizations often need to process large volumes of data before serving it to business stakeholders. In this post, we show how to use Amazon EMR to process data with Apache Spark, the go-to platform for in-memory analytics of large data volumes, and how to connect the business intelligence (BI) tool Amazon QuickSight to serve that data to end users.

QuickSight is a fast, cloud-powered BI service that makes it easy to build visualizations, perform ad hoc analysis, and quickly get business insights from your data. With our cloud-based service, you can easily connect to your data, perform advanced analysis, and create stunning visualizations and rich dashboards that can be accessed from any browser or mobile device.

QuickSight supports connectors for big data analytics using Spark. With the SparkSQL connector in QuickSight, you can easily create interactive visualizations over large datasets using Amazon EMR. Amazon EMR provides a simple and cost-effective way to run highly distributed processing frameworks such as Spark.

In this post, we use the public data set, New York City Taxi and Limousine Commission (TLC) Trip Record Data, which contains data of trips taken by taxis and for-hire vehicles in New York City. We use an optimized Parquet version of the CSV public dataset available from the Registry of Open Data on AWS.

This post also explores how to use AWS Glue to create the Data Catalog by crawling the NYC taxi data in an Amazon Simple Storage Service (Amazon S3) bucket, making it immediately queryable for analysis. AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load your data for analytics. To learn more about how to use AWS Glue to transform a dataset from CSV to Parquet, see Harmonize, Query, and Visualize Data from Various Providers using AWS Glue, Amazon Athena, and Amazon QuickSight.

Prerequisites

The following steps assume that you have a VPC with public and private subnets, with NAT configured for private subnets and an available S3 bucket for Amazon EMR logging.

If you create the EMR cluster in a private subnet, you can use AWS Systems Manager Session Manager, a bastion host, or a VPN connection to access the EMR cluster. For this post, we use Session Manager to access our EMR cluster.

In QuickSight Enterprise edition, you can create connections to your VPCs from your QuickSight account. Each connection creates an elastic network interface in your VPC for QuickSight to send traffic to instances in your VPC. For more information, see Connecting to a VPC with Amazon QuickSight. If you haven’t already signed up for QuickSight, you can sign up before getting started. QuickSight offers a free trial so you can try out this solution at no cost.

The following AWS CloudFormation template offers single-click deployment.

We use the US East (N. Virginia) Region as the default; we highly recommend you launch the stack there to optimize querying the public dataset in Amazon S3. You can change to any Region that supports Amazon EMR, AWS Glue, and QuickSight, but it may impact the time it takes to query the data.

If deploying into production, we recommend you secure communication by using the Configure SSL with a QuickSight supported authority step after deployment of the CloudFormation template to enable SSL.

Solution overview

We walk you through the following steps:

  1. Deploy and configure Amazon EMR with a CloudFormation template.
  2. Run AWS Glue crawlers to crawl and populate the Hive-compatible metastore.
  3. Test JDBC connectivity using Beeline.
  4. Visualize the data with QuickSight.

The CloudFormation template provided in the prerequisites deploys a configured Amazon EMR cluster for you to start querying your data with Spark. After deploying the CloudFormation stack, you can skip the first step and start running the AWS Glue crawlers.

Deploy and Configure Amazon EMR

If you want to dive deep and understand what the CloudFormation template deploys, use the following steps to manually deploy Amazon EMR running Spark and connect it to QuickSight:

  1. Create an EMR cluster with 5.30.0 or later release.
  2. Connect to the cluster using Session Manager.
  3. Install and configure OpenLDAP.
  4. Create a user in LDAP.
  5. Start the Thrift server.
  6. Configure SSL using a QuickSight supported authority.

Create an EMR cluster

For this post, we use an EMR cluster with 5.30.0 or later release.

  1. On the Amazon EMR console, choose Create cluster.
  2. For Cluster name, enter a name (for example, visualisedatablog).
  3. For Release, choose your release version.
  4. For Applications, select Spark.
  5. Select Use AWS Glue Data Catalog for table metadata.
  6. For Instance type, choose your instance type.
  7. For EC2 key pair, choose Proceed without an EC2 key pair.
  8. Choose Create cluster.
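
If you prefer to script the cluster creation instead of using the console steps above, the following boto3 sketch shows one possible equivalent. The instance types, subnet ID, and log bucket are placeholder assumptions; adjust them to your environment.

import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="visualisedatablog",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Configurations=[
        {
            # Use the AWS Glue Data Catalog as the metastore for Spark SQL
            "Classification": "spark-hive-site",
            "Properties": {
                "hive.metastore.client.factory.class":
                    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
            },
        }
    ],
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "r5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "r5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2SubnetId": "subnet-xxxxxxxx",  # placeholder: your private subnet
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    LogUri="s3://your-emr-log-bucket/logs/",  # placeholder: your EMR logging bucket
    VisibleToAllUsers=True,
)
print("Cluster ID:", response["JobFlowId"])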

Make sure you enabled run as support for Session Manager.

Connect to the EMR cluster using Session Manager

Session Manager is a fully managed AWS Systems Manager capability that lets you manage your Amazon Elastic Compute Cloud (Amazon EC2) instances, on-premises instances, and virtual machines through an interactive, one-click, browser-based shell or through the AWS Command Line Interface (AWS CLI). Session Manager provides secure and auditable instance management without the need to open inbound ports, maintain bastion hosts, or manage SSH keys.

By default, sessions are launched using the credentials of a system-generated ssm-user account that is created on a managed instance. For Amazon EMR, you can instead launch sessions using the Hadoop user account. Session Manager provides two methods for specifying the Hadoop user operating system account to use. For more information, see Enable run as support for Linux and macOS instances. For those configuring Systems Manager for the first time, review Why is my EC2 instance not appearing under Managed Instances in the Systems Manager console? for helpful tips on getting started with adding managed instances.

After you log in to the primary node of your cluster, run the following commands to install and configure OpenLDAP.

Install and configure OpenLDAP

To install and configure OpenLDAP, complete the following steps (alternatively, you can download the script used in the CloudFormation script and run it):

  1. Run the following commands:
# Install LDAP Server
sudo yum -y install openldap compat-openldap openldap-clients openldap-servers openldap-servers-sql openldap-devel
# Restart LDAP 
sudo service slapd restart

For more about configuring OpenLDAP, see the OpenLDAP documentation.

  2. Run the following command to set a new password for the root account and store the resulting hash:
slappasswd

This command outputs a hash that looks like the following sample:

{SSHA}DmD616c3yZyKndsccebZK/vmWiaQde83
  3. Copy the hash output to a text editor to use in subsequent steps.

Next, we prepare the commands to set the password for the LDAP root.

  4. Run the following code (replace the hash with the one you generated in the previous step, and make sure carriage returns are preserved):
cat > /tmp/config.ldif <<EOF
dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcSuffix
olcSuffix: dc=example,dc=com

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootDN
olcRootDN: cn=dev,dc=example,dc=com

dn: olcDatabase={2}hdb,cn=config
changetype: modify
replace: olcRootPW
olcRootPW: <<REPLACE_WITH_PASSWORD_HASH>>
EOF
  5. Run the following command to apply the preceding configuration to LDAP:
sudo ldapmodify -Y EXTERNAL -H ldapi:/// -f /tmp/config.ldif
  6. Copy the sample database configuration file to /var/lib/ldap and add relevant schemas:
sudo cp /usr/share/openldap-servers/DB_CONFIG.example /var/lib/ldap/DB_CONFIG

sudo ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/cosine.ldif
sudo ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/nis.ldif 
sudo ldapadd -Y EXTERNAL -H ldapi:/// -f /etc/openldap/schema/inetorgperson.ldif

Create a user in LDAP

Next, create a user account with a password in the LDAP directory with the following commands. When prompted for a password, use the LDAP root password that you created in the previous step (for this post, we use sparky as the username). Make sure carriage returns are preserved when copying and entering the code.

# Set the username to create (the LDIF below references $username; for this post, we use sparky)
username=sparky

cat > /tmp/accounts.ldif <<EOF 
dn: dc=example,dc=com
objectclass: domain
objectclass: top
dc: example

dn: ou=dev,dc=example,dc=com
objectclass: organizationalUnit
ou: dev
description: Container for developer entries

dn: uid=$username,ou=dev,dc=example,dc=com
uid: $username
objectClass: inetOrgPerson
userPassword: <<REPLACE_WITH_STRONG_PASSWORD>>
sn: sparky
cn: dev
EOF

Run the following command to apply the preceding entries to LDAP (replace <<LDAP_ROOT_PASSWORD>> with the root LDAP password you set in the previous section):

sudo ldapadd -x -w <<LDAP_ROOT_PASSWORD>> -D "cn=dev,dc=example,dc=com" -f /tmp/accounts.ldif

We have now configured OpenLDAP on the EMR cluster running Spark and created the user sparky that we use to connect to QuickSight.

Start the Thrift server

Start the Thrift server by running the following command. By default, the Thrift server runs on port 10001. Amazon EMR by default places limits on executor sizes to avoid having the executor consume too much memory and interfere with the operating system and other processes running on the instance. To optimize the use of R family instances with the flexibility of also using the smallest supported instance types, we use --executor-memory=18GB --executor-cores=4 for our Thrift server configuration. See the following code:

sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn --executor-memory=18GB --executor-cores=4

Now that we have configured the EMR cluster to accept connections, let’s get our public dataset ready.

Configure SSL using a QuickSight supported authority

If deploying into production, we recommend securing communication between QuickSight and Spark. QuickSight doesn't accept certificates that are self-signed or issued from a non-public CA. For more information, see Amazon QuickSight SSL and CA Certificates. To secure the Thrift connection, you can enable SSL encryption and restart the hive-server2 and Thrift services on the primary EMR instance.

After you have your certificate, you can enable SSL.

In your preferred editor, open and edit /etc/hive/conf/hive-site.xml:

    <property>
        <name>hive.server2.thrift.bind.host</name>
        <value>HOSTNAME</value>
    </property>
    <property>
        <name>hive.server2.use.SSL</name>
        <value>true</value>
    </property>
    <property>
        <name>hive.server2.keystore.path</name>
        <value>PATH_TO_KEYSTORE/KEYSTORE/KEYSTORE.jks</value>
    </property>
    <property>
        <name>hive.server2.keystore.password</name>
        <value>KEYSTORE_PASSWORD</value>
    </property>

Restart the Thrift server by running the following command:

sudo /usr/lib/spark/sbin/stop-thriftserver.sh --master yarn && sudo /usr/lib/spark/sbin/start-thriftserver.sh --master yarn

Run AWS Glue crawlers

Now let’s use AWS Glue crawlers to detect the schema. If you used the CloudFormation template, you already have a crawler ready to start via the AWS Glue console. When the crawler is complete, you should have a table listed in the database.

If you’re configuring the crawler manually on the AWS Glue console, the following screenshot summarizes the crawler configuration.

After the crawler has run, you can go to the Tables page to view the taxi_ny_pub table with the table properties and schema. The following screenshot shows the table details page; here you can find the partitions and various versions of the schema.

The Data Catalog is shared between Amazon EMR, Amazon Athena, and Amazon Redshift Spectrum. You can use Athena to preview the data that is stored in this table.
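
For example, the following boto3 sketch submits a quick preview query through Athena; the results bucket is a placeholder you need to replace with an S3 path you own.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = athena.start_query_execution(
    QueryString="SELECT * FROM taxi_ny_pub LIMIT 10",
    QueryExecutionContext={"Database": "nyc_taxi"},
    ResultConfiguration={"OutputLocation": "s3://your-athena-results-bucket/"},  # placeholder
)
print("Query execution ID:", query["QueryExecutionId"])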

Test JDBC connectivity using Beeline

Now that the EMR cluster is deployed and the data is copied, we can quickly test the JDBC connectivity on the EMR cluster using Beeline. Beeline is an open-source JDBC client, based on the SQLLine CLI, used to connect to your cluster via the command line.

Log in to your EMR cluster using Session Manager. You can use Beeline to connect to the Thrift server and test the connection:

/usr/lib/spark/bin/beeline -u 'jdbc:hive2://<REPLACE_MASTER_PUBLIC_DNS>:10001/default' -n <<USERNAME>> -p <<PASSWORD>> -e "show databases;" 

The preceding command connects to the Spark cluster and shows you the list of databases, as in the following example code:

Connected to: Spark SQL (version 2.3.0) 
Driver: Hive JDBC (version 1.2.1-spark2-amzn-0) 
Transaction isolation: TRANSACTION_REPEATABLE_READ 
+---------------+--+ 
| databaseName | 
+---------------+--+ 
| default | 
| nyc_taxi |
| sampledb | 
+---------------+--+ 
3 rows selected (0.171 seconds) 
Beeline version 1.2.1-spark2-amzn-0 by Apache Hive 
Closing: 0: 
jdbc:hive2://<REPLACE_MASTER_PUBLIC_DNS>:10001/default
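
If you'd rather test the connection from Python than from Beeline, one option (not part of this post's CloudFormation stack) is the open-source PyHive client. This sketch assumes PyHive and its SASL dependencies are installed on the client machine and that you use the LDAP user created earlier.

# Assumes: pip install "pyhive[hive]" (plus sasl/thrift_sasl system dependencies)
from pyhive import hive

conn = hive.connect(
    host="<REPLACE_MASTER_PUBLIC_DNS>",
    port=10001,
    username="sparky",
    password="<REPLACE_WITH_LDAP_PASSWORD>",
    auth="LDAP",
    database="default",
)
cursor = conn.cursor()
cursor.execute("show databases")
print(cursor.fetchall())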

Visualize data with QuickSight

Now let’s connect Amazon EMR to QuickSight and do a quick visualization of this data.

  1. On the QuickSight console, on the Datasets page, choose New dataset.
  2. Choose Spark as your connector.

  3. For Data source name, enter a name (for example, SPARKY).
  4. For Database server, enter your public primary DNS.

To allow QuickSight to connect to your EMR cluster, you must create a security group containing an inbound rule authorizing access from the appropriate IP address range for the QuickSight servers in that Region. For further details on how to create appropriate security group rules, see Authorizing Connections from Amazon QuickSight to Amazon EC2 Instances.

For this post, we use security groups to control network connectivity.
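
The following boto3 sketch shows one way to add that inbound rule. The security group ID is a placeholder, and the CIDR range shown is the documented QuickSight range for US East (N. Virginia) at the time of writing; verify the current range for your Region before using it.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # placeholder: the EMR primary node security group
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 10001,
        "ToPort": 10001,
        # Assumption: QuickSight IP range for us-east-1; check the QuickSight docs for your Region
        "IpRanges": [{"CidrIp": "52.23.63.224/27", "Description": "Amazon QuickSight"}],
    }],
)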

  5. For Port, add TCP port 10001 as an inbound rule to allow for inbound connectivity from QuickSight to Amazon EMR.

If deploying into production, we recommend securing communication between QuickSight and Spark, which we covered in a previous step.

QuickSight Enterprise edition provides full integration with Amazon Virtual Private Cloud (Amazon VPC), which enables you to secure and isolate traffic between resources. For more information, see Connecting to a VPC with Amazon QuickSight. This allows you to deploy your EMR cluster in a private VPC Subnet.

  6. Enter a username and password.
  7. If you configured SSL, select Enable SSL.
  8. Choose Create data source.

The Spark cluster reads the Data Catalog and provides information about the schema and the tables in the schema. You can also choose the table created by the AWS Glue crawler and load the data into SPICE for faster analytics. SPICE is the in-memory calculation engine in QuickSight that provides blazing fast performance at scale. SPICE automatically replicates data for high availability, allowing thousands of users to simultaneously perform fast, interactive analysis, while shielding your underlying data infrastructure, which saves you time and resources. QuickSight supports uploading 250 million rows (and 500 GB) per SPICE dataset. If you have larger datasets than this, you can use the direct query option. In this post, we use SPICE.

Also make sure that you defined the correct permissions to access the S3 bucket for the EMR cluster. For instructions, see Reading and Writing Data to Amazon S3 Using EMRFS.

Let’s create a custom SQL query to perform some aggregations prior to loading into SPICE (see the following screenshot).

  9. Enter the following code:
SELECT 
SUM (cast (fare_amount as double)) as TotalFare 
,AVG(cast (fare_amount as double)) as AvgFare 
,AVG (cast (trip_distance as double)) as AvgTripDistance 
,AVG(passenger_count) as AvgPassengerCount 
,year 
,month
FROM nyc_taxi.taxi_ny_pub
WHERE year BETWEEN 2011 AND 2016
GROUP BY year, month;

The database and table names may vary in your deployment.

  10. For this post, select Import to SPICE for quicker analytics.

Alternatively, because the NYC taxi dataset is larger than 250 million rows, you can choose to directly query your data.

  11. To create a visualization, select the fields in the left panel.

For this post, we review the Average Fare Amount and Passenger Count between 2013–2019, using ML Insights to automatically generate natural language narratives when analyzing the 229.12 GB dataset.

Summary

In less than an hour, we created an EMR cluster, enabled OpenLDAP, and started the Thrift server. We also used AWS Glue to crawl a public dataset and visualize the data. Now you have what you need to get started creating powerful dashboards and reports using QuickSight on your Amazon S3 data using Apache Spark. Feel free to reach out if you have any questions or suggestions.

To learn more about these capabilities and start using them in your dashboards, check out the QuickSight User Guide.

If you have questions and suggestions, you can post them on the QuickSight forum.

Go to the QuickSight website to get started now for free.


About the Author

Tom McMeekin is an Enterprise Solutions Architect with a career in technology spanning over 20 years. Tom has worked across a number of industry verticals including Telecommunications, Manufacturing, Infrastructure and Development, Utilities, Energy, and Retail. Throughout his career, he has focused on solving complex business problems through innovative technologies that deliver the right business outcomes for his customers.

 

Build a centralized granular access control to manage assets and data access in Amazon QuickSight

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/build-a-centralized-granular-access-control-to-manage-assets-and-data-access-in-amazon-quicksight/

A large business intelligence (BI) project with many users and teams and sensitive information demands a multi-faceted security architecture. Such architecture should provide BI administrators and architects with the capability to minimize the amount of information accessible to users. For a straightforward solution to manage Amazon QuickSight user and asset access permissions, you can use the AWS Command Line Interface (AWS CLI) or AWS Management Console to manually edit QuickSight user role and dashboard access. However, in specific cases, an enterprise can easily have hundreds or thousands of users and groups, and these access management methods aren’t efficient. We have received a large number of requests to provide an advanced programmable approach to deploy and manage a centralized QuickSight security architecture.

This post describes the best practices for QuickSight authentication and authorization granular access control, and provides a centralized cloud application with an AWS Cloud Development Kit (AWS CDK) stack to download. One of the advantages of our solution is enterprises can deploy the security framework to administer access control of their BI without leaving AWS.

All configurations are saved in the AWS Systems Manager Parameter Store. Parameter Store provides secure, hierarchical storage for configuration data management and secrets management. You can store data such as user name, user permissions, passwords, and database strings as parameter values. You can reference AWS Systems Manager parameters in your scripts and configuration and automation workflows by using the unique name that you specified when you created the parameter.
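
As an illustration of that pattern (the exact parameter names and value formats are defined by the AWS CDK stack and may differ), a script can write and read these parameters with boto3:

import json
import boto3

ssm = boto3.client("ssm")

# Illustrative value format only: store a group-to-dashboard mapping
ssm.put_parameter(
    Name="/qs/config/access",
    Type="String",
    Overwrite=True,
    Value=json.dumps({"critical": ["dashboard-id-1", "dashboard-id-2"]}),
)

access_map = json.loads(ssm.get_parameter(Name="/qs/config/access")["Parameter"]["Value"])
print(access_map)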

The AWS CDK application template fits into the continuous integration and continuous deployment (CI/CD) infrastructure and grants or revokes all authentications and authorizations based on a defined policy prescribed by AWS. This avoids possible human errors made by BI developers or administrators. BI developers can edit configuration parameters to release new dashboards to end-users. At the same time, BI administrators can edit another set of parameters to manage users or groups. This AWS CDK CI/CD design bridges the gaps between development and operation activities by enforcing automation in building and deploying BI applications.

Security requirements

In enterprise BI application design, multi-tenancy is a common use case, which serves multiple sets of users with one infrastructure. Tenants could either be different customers of an independent software vendor (ISV), or different departments of an enterprise. In a multi-tenancy design, each tenant shares the dashboards, analyses, and other QuickSight assets. Each user can see all other users belonging to the same tenant (for example, when sharing content), but remains invisible to other tenants. Within each tenant, the BI admin team has to create different user groups to control data authorization, including asset access permissions and granular-level data access.

Let’s discuss some use cases of asset access permissions in detail. In a BI application, different assets are usually categorized according to business domains (such as an operational dashboard or executive summary dashboard) and data classification (critical, highly confidential, internal only, and public). For example, you can have two dashboards for analyzing sales results data. The look and feel of both dashboards are similar, but the security classification of the data is different. One dashboard, named Sales Critical Dashboard, contains critical columns and rows of data. The other dashboard, called Sales Highly-Confidential Dashboard, contains highly confidential columns and rows of data. Some users are granted permission to view both dashboards, and others have a lower security clearance and can only access the Sales Highly-Confidential Dashboard.

In the following use case, we address granular-level data access as follows:

  • Row-level security (RLS) – For the users who can access Sales Critical Dashboard, some can only view US data. However, some global users can view the data of all countries, including the US and UK.
  • Column-level security (CLS) – Some users can only view non-personally identifiable information (PII) data columns of a dataset, whereas the HR team can view all the columns of the same dataset.

Large projects might have several tenants, hundreds of groups, and thousands of users in one QuickSight account. The data leader team wants to deploy one protocol for user creation and authentication in order to reduce the maintenance cost and security risk. The architecture and workflow described in this post help the data leader achieve this goal.

Additionally, to avoid human errors in daily operation, we want these security permissions to be granted and revoked automatically, and fit into the CI/CD infrastructure. The details are explained later in this post.

Architecture overview

The following diagram shows the QuickSight account architecture of this solution.

  • Authors create dashboards and update AWS Systems Manager Parameter Store to release dashboards to different groups
  • Admins approve the requests from authors
  • Admins update user management (roles, namespaces) by editing AWS Systems Manager Parameter Store
  • DevOps deploy the updates with AWS CDK

*Groups: Object access permission groups control the owner/viewer of the objects. Data segment groups combined with RLS/CLS control data access.

*Datasets: Contain all data, restricted by row-level security (RLS) and column-level security (CLS)

The following diagram illustrates the authentication workflow of the architecture:

*First-time sign-in to QuickSight: If the QuickSight user isn’t registered before their first sign-in, a reader is created, and this reader can only view the landing page dashboard, which is shared with all users of this account. The landing page provides the list of reports that this user can view.

The following diagram illustrates the authorization workflow of the architecture.

Authorization diagram details:

  1. User information (department, team, geographic location) is stored in Amazon Redshift, Amazon Athena, or any other database. Combined with the group-user mapping, RLS databases are built to control data access.
  2. Hourly permissions assignment:
    1. According to the group-employee name (user) mapping (membership.csv) and group-role mapping (/qs/console/roles), an AWS Lambda function creates groups, registers users, assigns group members, removes group memberships, promotes readers to author or admin, and deletes users if they’re demoted from author or admin to reader (see the sketch after this list).
    2. According to group-dashboard mapping in /qs/config/access, an AWS Lambda function updates dashboard permissions to QuickSight groups.
    3. According to group-namespace mapping in membership.csv, an AWS Lambda function creates QuickSight groups in the specified namespace.
  3. Sample parameters of objects access permissions and data segments:

  4. Sample parameters of QuickSight user role:

  5. Sample data of membership.csv:
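
The following boto3 sketch illustrates the kind of calls the hourly Lambda functions make: creating a group in a namespace, adding a member, and granting the group viewer access to a dashboard. The account ID, user name, and dashboard ID are placeholders; the actual functions in the AWS CDK stack are driven by the parameters and membership.csv described above.

import boto3

qs = boto3.client("quicksight")
account_id = "111122223333"  # placeholder: your AWS account ID
namespace = "default"

# Create an object access permission group and add a member
qs.create_group(AwsAccountId=account_id, Namespace=namespace,
                GroupName="critical", Description="Viewers of critical dashboards")
qs.create_group_membership(AwsAccountId=account_id, Namespace=namespace,
                           GroupName="critical", MemberName="some-quicksight-user")  # placeholder user

# Grant the group viewer permissions on a dashboard (dashboard ID is a placeholder)
group_arn = qs.describe_group(AwsAccountId=account_id, Namespace=namespace,
                              GroupName="critical")["Group"]["Arn"]
qs.update_dashboard_permissions(
    AwsAccountId=account_id,
    DashboardId="sales-critical-dashboard",
    GrantPermissions=[{
        "Principal": group_arn,
        "Actions": [
            "quicksight:DescribeDashboard",
            "quicksight:ListDashboardVersions",
            "quicksight:QueryDashboard",
        ],
    }],
)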

In this solution, custom namespaces are deployed to support multi-tenancy. The default namespace is for all internal users of a company (we call it OkTank). OkTank creates the 3rd-Party namespace for external users. If we have to support more tenants, we can create more custom namespaces. By default, we’re limited to 100 namespaces per AWS account. To increase this limit, contact the QuickSight product team. For more information about multi-tenancy, see Embed multi-tenant analytics in applications with Amazon QuickSight.

In each namespace, we create different types of groups. For example, in the default namespace, we create the BI-Admin and BI-Developer groups for the admin and author users. For reader, we deploy two types of QuickSight groups to control asset access permissions and data access: object access permission groups and data segment groups.

The following table summarizes how the object access permission groups control permissions.

Group Name | Namespace | Permission | Notes
critical | Default | View both dashboards (containing the critical data and highly confidential data) |
highlyconfidential | Default | Only view Sales Highly-Confidential Dashboard |
BI-Admin | Default | Account management and edit all assets | Users in the BI-Admin group are assigned the Admin QuickSight user role.
BI-Developer | Default | Edit all assets | Users in the BI-Developer group are assigned the Author QuickSight user role.
Power-reader | Default | View all assets and create ad hoc analysis to run self-service analytics reports | Users in the Power-reader group are assigned the Author QuickSight user role. However, this group can't save or share their ad hoc reports.
3rd-party | Non-default namespaces (3rd-party namespace, for example) | Can only share with readers (3rd-party-reader group, for example) in the same namespace | In non-default namespaces, we can also create other object access permission groups, similar to the critical group in the default namespace.

For more information about QuickSight groups, users, and user roles, see Managing User Access Inside Amazon QuickSight, Provisioning Users for Amazon QuickSight, and Using administrative dashboards for a centralized view of Amazon QuickSight objects.

The second type of groups (data segment groups), combined with row-level security datasets and column-level security, control data access as described in the following table.

Group Name | Namespace | Permission | Scope
USA | Default | Only view US data on any dashboard | Row-level
GBR | Default | Only view UK data on any dashboard | Row-level
All countries | Default | View data of all countries on any dashboard | Row-level
non-PII | Default | Can't view Social Security numbers, annual income, and all other columns of PII data | Column-level
PII | Default | Can view all columns, including PII data | Column-level

We can set up similar groups in non-default namespaces.

These different groups can overlap each other. For example, if a user belongs to the groups USA, Critical, and PII, they can view US data on both dashboards, with all columns. The following Venn diagram illustrates the relationships between these groups.

In summary, we can define a multi-faceted security architecture by combining QuickSight features, including namespaces, groups, users, RLS, and CLS. All related configurations are saved in the Parameter Store. The QuickSight users list and group-user mapping information are in an Amazon Simple Storage Service (Amazon S3) bucket as a CSV file (named membership.csv). This CSV file could be the output of LDAP queries. Several AWS Lambda functions are scheduled to run hourly (you can also invoke these functions on demand, or daily, weekly, or at any time granularity that fits your requirements) to read the parameters and membership.csv. According to the configuration defined, the Lambda functions create, update, or delete groups, users, and asset access permissions.

When the necessary security configurations are complete, a Lambda function calls the QuickSight APIs to get the updated information and record the results in an S3 bucket as CSV files. The BI admin team can build datasets with these files and visualize the results with dashboards. For more information, see Using administrative dashboards for a centralized view of Amazon QuickSight objects and Building an administrative console in Amazon QuickSight to analyze usage metrics.

In addition, the errors of Lambda functions and the user deletion events are stored in this S3 bucket for the admin team to review.

Automation

The following diagram illustrates the overall workflow of the Lambda functions.

We use a programmable method to create and configure the groups and users automatically. For any ad hoc user registration request (for example, a user who isn’t recorded in membership.csv yet due to latency), as long as the user can be authenticated, they can assume the AWS Identity and Access Management (IAM) role quicksight-fed-user to self-provision as a QuickSight reader. This self-provisioned reader can only view a landing page dashboard, which provides the list of dashboards and corresponding groups. According to the dashboard-group mapping, this new reader can apply for membership of a given group to access the dashboards. If the group owner approves the application, the hourly Lambda functions add the new user into the group the next time they run.

The CI/CD pipeline starts from AWS CDK. The BI administrator and author can update the Systems Manager parameters to release new dashboards or other QuickSight assets in the AWS CDK stack granular_access_stack.py. The BI administrator can update the Systems Manager parameters in the same stack to create, update, or delete namespaces, groups, or users. Then the DevOps team can deploy the updated AWS CDK stack to apply these changes to the Systems Manager parameters or other AWS resources. The Lambda functions are triggered hourly to call APIs to apply changes to the related QuickSight account.
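
As a minimal sketch of this pattern (not the actual granular_access_stack.py, whose resource names and parameters differ), an AWS CDK v1 Python stack can define a Systems Manager parameter and an hourly trigger for a Lambda function like this:

from aws_cdk import core
from aws_cdk import (
    aws_events as events,
    aws_events_targets as targets,
    aws_lambda as _lambda,
    aws_ssm as ssm,
)


class GranularAccessSketchStack(core.Stack):
    def __init__(self, scope, construct_id, **kwargs):
        super().__init__(scope, construct_id, **kwargs)

        # Configuration parameter edited by BI admins/authors (illustrative value)
        ssm.StringParameter(
            self, "DashboardAccess",
            parameter_name="/qs/config/access",
            string_value='{"critical": ["dashboard-id-1"]}',
        )

        # Hourly Lambda function that applies the configuration to QuickSight
        fn = _lambda.Function(
            self, "GranularUserGovernance",
            runtime=_lambda.Runtime.PYTHON_3_8,
            handler="index.handler",
            code=_lambda.Code.from_asset("lambda"),  # placeholder: path to the function code
        )
        rule = events.Rule(
            self, "EveryHour",
            schedule=events.Schedule.rate(core.Duration.hours(1)),
        )
        rule.add_target(targets.LambdaFunction(fn))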

Scale

The Lambda functions are restricted by the maximum runtime of 15 minutes. To overcome this limitation, we can convert the Lambda functions to AWS Glue Python shell scripts with the following high-level steps:

  1. Download Boto3 wheel files from pypi.org.
  2. Upload the wheel file into an S3 bucket.
  3. Download the Lambda functions, merge them into one Python script, and create an AWS Glue Python shell job from it.
  4. Add the S3 path of the Boto3 wheel file into the Python library path. If you have multiple files to add, separate them with a comma.
  5. Schedule this AWS Glue job to run daily.

For more information, see Program AWS Glue ETL Scripts in Python and Using Python Libraries with AWS Glue.
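
As a hedged sketch of the job-creation and scheduling part of these steps (the job name, role, and S3 paths are placeholders), you could use boto3 as follows:

import boto3

glue = boto3.client("glue")

# Create a Python shell job that runs the merged script and picks up the Boto3 wheel file
glue.create_job(
    Name="qs-granular-access",  # placeholder job name
    Role="MyGlueServiceRole",   # placeholder: an IAM role with Glue, QuickSight, and S3 access
    Command={
        "Name": "pythonshell",
        "ScriptLocation": "s3://your-bucket/scripts/granular_access.py",  # placeholder
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--extra-py-files": "s3://your-bucket/libs/boto3-1.x-py3-none-any.whl",  # placeholder
    },
    MaxCapacity=0.0625,
)

# Schedule the job to run daily at 02:00 UTC
glue.create_trigger(
    Name="qs-granular-access-daily",
    Type="SCHEDULED",
    Schedule="cron(0 2 * * ? *)",
    Actions=[{"JobName": "qs-granular-access"}],
    StartOnCreation=True,
)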

Prerequisites

You must have the following prerequisites to implement this solution:

  • A QuickSight Enterprise account
  • Basic knowledge of Python
  • Basic knowledge of SQL
  • Basic knowledge of BI

Create the resources

Create your resources by downloading the AWS CDK stack from the GitHub repo.

In the granular_access folder, run the command cdk deploy granular-access to deploy the resources. For more information, see AWS CDK Intro Workshop: Python Workshop.

Deploy the solution

When you deploy the AWS CDK stack, it creates five Lambda functions, as shown in the following screenshot.

The stack also creates additional supportive resources in your account.

The granular_user_governance function is triggered by the Amazon CloudWatch event rule qs-gc-everyhour. The information of groups and users is defined in the file membership.csv. The S3 bucket name is stored in the parameter store /qs/config/groups. The following diagram shows the flowchart of this function.

  1. Set the destination of granular_user_governance to another Lambda function, downgrade_user, with source=Asynchronous invocation and condition=On Success.

The following diagram is a flowchart of this function.

To avoid breaking critical access to QuickSight assets governed by Admin or Author, we demote an admin or author by deleting the admin or author user and creating a new reader user with the Lambda function downgrade_user. The granular_user_governance function handles downgrading admin to author, or upgrading author to admin.

  2. Set the destination of downgrade_user to the Lambda function granular_access_assets_govenance with source=Asynchronous invocation and condition=On Success.

The following diagram shows a flowchart of this function.

  3. Set the destination of downgrade_user to the Lambda function check_team_members with source=Asynchronous invocation and condition=On Failure.

The check_team_members function simply calls QuickSight APIs to get the namespaces, groups, users, and assets information, and saves the results in the S3 bucket. The S3 keys are monitoring/quicksight/group_membership/group_membership.csv and monitoring/quicksight/object_access/object_access.csv.

Besides the two output files of the previous step, the error logs and user deletion logs (logs of downgrade_user) are also saved in the monitoring/quicksight folder.

  4. Set the destination of granular_access_assets_govenance to the Lambda function check_team_members with source=Asynchronous invocation and condition=On Success or condition=On Failure.

Create row-level security datasets

As a final step, we create RLS datasets. This allows you to restrict the records shown on a dashboard based on the user who views it.

QuickSight supports RLS by applying a system-managed dataset that sub-selects records from the dashboard dataset. The mechanism allows the administrator to provide a filtering dataset (the RLS dataset) with username or groupname columns, which are automatically filtered to the user that is logged in. For example, a user named YingWang belongs to QuickSight group BI, so all the rows of the RLS dataset that correspond to the username YingWang or group name BI are filtered. The rows that remain in the RLS after applying the username and the group name filters are then used to filter the dashboard datasets further by matching columns with the same names. For more information about row-level security, see Using Row-Level Security (RLS) to Restrict Access to a Dataset.

In this solution, we export the sample user information into the file membership.csv, which is stored in an S3 bucket. In this file, we provide some sample groups for RLS dataset definition. These groups are the data segment groups, as described in the overall architecture design. The following screenshot shows some of the groups and the users in those groups.

The granular_user_governance function creates these groups and adds the related users to be members of these groups.

How do we create the RLS dataset? Let’s say we have a table called employee_information in our organization’s HR database. The following screenshot shows some sample data.

Based on the employee_information table, we create a view called rls for an RLS dataset. See the following SQL code:

create view
rls(groupname, username, country, city)
as
(SELECT 
concat('quicksight-fed-'::text, lower(employee_information.country::text)) AS groupname,
concat(concat('quicksight-fed-us-users/'::text, employee_information.employee_login::text),'@oktank.com'::text) AS username,
employee_information.country,
employee_information.city
FROM 
employee_information)

The following screenshot shows our sample data.

Now we have the table ready, we can create the RLS dataset with the following custom SQL:

select distinct 
r.groupname as GroupName,
null as UserName,
r.country,
null as city 
from 
rls as r 
join fact_revenue as f 
on r.country=f.country
union
select distinct 'quicksight-fed-all-countries' as GroupName,
null as UserName,
null as country,
null as city
from rls as r
union
select distinct null as GroupName,
r.username as UserName,
r.country,
r.city 
from 
rls as r
join fact_revenue as f 
on r.country=f.country 
and 
r.city=f.city

The following screenshot shows our sample data.

For the group quicksight-fed-all-countries, we set the username, country, and city as null, which means that all the users in this group can view the data of all countries.

For country level, only the security rules defined in the groupname and country columns are used for filtering. The username and city columns are set as null. The users in the quicksight-fed-usa group can view the data of USA, and the users in the quicksight-fed-gbr group can view the data of GBR.

For each user with groupname set as null, they can only view the specific country and city assigned to their username. For example, TerryRigaud can only view data of Austin, in the US.

In QuickSight, multiple rules in an RLS dataset are combined together with OR.

With these multi-faceted RLS rules, we can define a comprehensive data access pattern.

Clean up

To avoid incurring future charges, delete the resources you created by running the following command:

cdk destroy granular_access 

Conclusion

This post discussed how BI administrators can design and automate QuickSight authentication and authorization granular access control. We combined QuickSight security features like row-level and column-level security, groups, and namespaces to provide a comprehensive solution. Managing these changes through “BIOps” ensures a robust, scalable mechanism for managing QuickSight security. To learn more, sign up for a QuickSight demo.


About the Authors

Ying Wang is a Senior Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

 

 

 

Amir Bar Or is a Principal Data Architect at AWS Professional Services. After 20 years leading software organizations and developing data analytics platforms and products, he is now sharing his experience with large enterprise customers and helping them scale their data analytics in the cloud.

Calculated fields, level-aware aggregations, and evaluation order in Amazon QuickSight

Post Syndicated from Ian Liao original https://aws.amazon.com/blogs/big-data/calculated-fields-level-aware-aggregations-and-evaluation-order-in-amazon-quicksight/

Amazon QuickSight is a fast, cloud-native, serverless, business intelligence service that makes it easy to deliver insights to everyone. QuickSight has carefully designed concepts and features that enable analysis builders, such as QuickSight authors, to design content-rich, interactive, and dynamic dashboards to share with dashboard viewers. As authors build an analysis, QuickSight transforms, filters, and aggregates data from tabular datasets into result sets to answer business questions. You can implement sophisticated data analytics in QuickSight in minutes by using calculated fields, then share within QuickSight in your organization, or embedded into apps or portals to share with thousands of users without any servers or infrastructure to set up.

This post gives you an end-to-end overview of how to perform various calculations in QuickSight and introduces you to the concepts of evaluation order and level-aware aggregation, which allow you to build more advanced analytics that use scalar, aggregate, and table functions. We also explain these approaches using an analogy to SQL.

This post assumes that you have a basic knowledge of analytics, SQL, and QuickSight.

Sample dataset and the business question

For this post, we use the Patient-Info dataset, which holds fictional transactional records for inpatient services. It contains dummy data that is randomly generated by AWS for demonstration purposes. The tabular dataset has the following columns:

  • Patient ID – ID of the patient
  • Admit Date – Date when the patient is admitted
  • Hospital – Name of the hospital
  • Service – Service item provided during inpatient visit
  • Category – Category of the service during inpatient visit
  • Subcategory – Subcategory of the service during inpatient visit
  • Revenue – Revenue from the service rendered
  • Profit – Profit from the service rendered

For instructions on creating a SPICE dataset in QuickSight with this dataset, see Prepare Data.

We use QuickSight to answer the following business question and variations of it from the dataset: What is the average profit ratio across all categories?

This question has a two-step calculation logic, which is common in use cases like goal completion analysis:

  1. Find the profit ratio per category.
  2. Find the average profit ratio across category.

In the process of answering this, we explore potential solutions in different approaches while discussing different features QuickSight has to offer:

  • Scalar functions – Return a single value computed for every row of input data, such as Plus, Division
  • Aggregation functions – Operate against a collection of values and return a single summarized value, such as Avg()
  • Table functions – Operate against a collection of rows and return a collection of rows, such as Rank(), avgOver(), sumOver()
  • Level-aware aggregation – A special type of table function that is evaluated before aggregation or before filtering

Some of these potential solutions don’t lead to the desired answer, but thinking through why they don’t work gives you a deeper understanding of these QuickSight function types. You can also jump to the definition of the calculated field Average Profit Ratio M to see the final solution.

Scalar functions

After the SPICE dataset is created with Patient-Info, let’s create an analysis from the dataset, and then try to find the answer to the business question using scalar functions.

  1. In the analysis editor, on the + Add menu, choose Add calculated field.

  1. In the calculated field editor, enter the name and formula:
Profit Ratio = profit / revenue
  1. Choose Save.

  1. Add Profit Ratio to a KPI visual. Remember to set the aggregate function to Average because we want to find the average profit ratio.

  1. Add category and Profit Ratio to a table visual. Again, we want to set the aggregate function to Average.

What is calculated here? Our dataset is at transactional level, so QuickSight calculates the profit ratio for every transaction and aggregates the results to the desired level defined in Visuals.

The calculation QuickSight has performed is similar to the following code:

select avg(profit/revenue)                                                                      
from dataset                -- to calculate the KPI visual         

select category, avg(profit/revenue)                                                            
from dataset                                                           
group by category                -- to calculate the table visual

This isn’t the answer we’re looking for because the profit ratio for a category is defined as the total profit of the category divided by the total revenue of the category.

Aggregate functions

Let’s try a different approach using aggregate functions:

Profit Ratio with Agg Func = sum(profit)/sum(revenue)

QuickSight is smart enough to figure out that the author wants to aggregate data to the visual level first, and then apply the division.

When we compare the results with Profit Ratio we created earlier, the numbers are quite different! This is because Profit Ratio calculates the transactional-level ratio first and then finds the average; whereas Profit Ratio with Agg Func calculates the category-level totals of the numerator and denominator first and then finds the ratio. Therefore, Profit Ratio is skewed by some big percentage loss in certain transactions, whereas Profit Ratio with Agg Func returns more meaningful data.

The calculation can be modeled in SQL as the following:

select category                                                                                   
, avg(profit/revenue) as "Profit Ratio"                          
, sum(profit)/sum(revenue) as "Profit Ratio with Agg Func"       
from dataset                                                      
group by category   

Profit Ratio with Agg Func returns the category-level profit ratio we wanted. The next step is to find an average of the ratios.

Table functions

Now let’s look for help from table functions. A table function outputs the same number of rows as input, and by default it has to be used on top of another aggregation function. To find the average of profit ratios, we can try avgOver():

avgOver of Profit Ratio = avgOver({Profit Ratio with Agg Func})

The following code is the corresponding SQL:

with aggregation_step as (                                                       
select category                                                  
, sum(profit)/sum(revenue) as "Profit Ratio with Agg Func"       
from dataset                                                     
group by category                                                 
)
select category                                                  
, "Profit Ratio with Agg Func"                                    
, avg("Profit Ratio with Agg Func") over()                                                        
    as "avgOver of Profit Ratio"                                 
from aggregation_step

This example is complicated enough that QuickSight has to follow a sequence of steps to calculate a single visual. By default, QuickSight goes through up to six stages to complete the calculations for a visual:

  1. Simple calculations – Scalar calculations that can be applied before filter and aggregation
  2. Analysis filters – Apply filters on dimensions and measures with no aggregation option selected
  3. Top/bottom N filters – A special type of filter that is defined on a dimension, and sorted by a field that doesn’t contain table functions
  4. ON-VISUAL – Aggregations (evaluate group by and aggregations) and filters (apply filters with aggregation in the having clause)
  5. Table calculations – Calculate table functions and evaluate filters with table functions
  6. Totals and subtotals – Calculate totals and subtotals

With avgOver(), we’ve got the answer we’re looking for: 6.94%. However, the number is displayed for every category, which is not preferred. Actually, we can only get this number when the category is on the visual.

When the category is removed, Profit Ratio with Agg Func is aggregated to the grand total level in the aggregation step, therefore its avgOver remains the same number, as shown in the following screenshot.

To avoid these drawbacks, we need a new tool.

Level-aware aggregations

QuickSight introduced a type of calculation mechanism called level-aware aggregation (LAA) to meet more analytical requirements. Like table functions, LAAs operate against a collection of rows and return the same number of rows. Regular table functions can only be evaluated after the aggregation and filter stage in QuickSight. With LAA, authors can evaluate a group of functions before aggregations or even before filters.

The following diagram illustrates the evaluation order of LAA.

Because LAA is evaluated before aggregation, both its input and output are at the dataset level. Calculated fields with LAA behave similarly to calculated fields with scalar functions. They can be specified as a dimension or a measure. An aggregation function needs to be applied on top of LAA when the calculated field is used as a measure in visuals. When you want to filter on a calculated field with LAA, QuickSight asks you to choose between no aggregation or one aggregation function. Also, duplicated rows are likely populated within the partition groups because the output level of LAA remains at the dataset level.

Let’s return to the business question: What is the average profit ratio across category?

It seems that we can use sumOver with category as the partition group, and then use average as the aggregate function to find the answer:

sumOver(profit) = sumOver(profit,[category],PRE_AGG)
sumOver(revenue) = sumOver(reveune,[category],PRE_AGG)
countOver(profit) = countOver(profit,[category],PRE_AGG)

Average Profit Ratio = avg({sumOver(profit)}/{sumOver(revenue)})

The following screenshot shows the aggregation functions defined for each measure. countOver(profit) with min() as the aggregate simply returns the transaction count per category. It’s also the number of duplicated rows that sumOver(profit) and sumOver(revenue) output.

8.12% is not the correct answer to the business question. The correct average should be 6.94%, as we saw earlier. How does QuickSight come up with the number?

For Average Profit Ratio, QuickSight tried to calculate the following:

with LAA as (
select category
, sum(profit) over (partition by category) as "sumOver(profit)"
, sum(revenue) over (partition by category) as "sumOver(revenue)"
, count(profit) over (partition by category) as "countOver(profit)"
from dataset 
)                        -- notice that LAA is at the same level as the dataset

select category
, avg("sumOver(profit)" / "sumOver(revenue)") as "Average Profit Ratio"
from LAA
group by category;       -- for data at category level

select avg("sumOver(profit)" / "sumOver(revenue)") as "Average Profit Ratio"
from LAA;                --  for data at total level

This is a smart approach, but each category has a different number of transactions, and therefore each category-level Profit Ratio has a different number of duplicated rows. The average in the last step is equivalent to a weighted average of the category-level Profit Ratio—weighted by the number of duplicates.
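
In other words, writing n_c for the number of transactions (duplicated rows) in category c, r_c for that category's profit ratio, and C for the number of categories (notation introduced here for illustration):

Average Profit Ratio (as computed) = ( \sum_c n_c \cdot r_c ) / ( \sum_c n_c )
Desired answer                     = ( \sum_c r_c ) / C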

We want to modify Average Profit Ratio to offset the weights. We start with the following formula:

Average Profit Ratio M = Sum(Profit Ratio per Category)/number of Categories

We know the following:

Profit Ratio from LAA = sumOver(profit) / sumOver(revenue)
number of Categories = distinct_count(category)

How can we handle the duplicated rows? We can divide Profit Ratio by the number of duplicates before summing them up:

Sum(Profit Ratio per Category) = Sum(Profit Ratio from LAA / # of duplicate rows per Category)

# of duplicate rows per Category = countOver(profit)

Put them together, and we can create the following:

Average Profit Ratio M = sum( sumOver(profit) / sumOver(revenue) / countOver(profit) ) / distinct_count(category)

In this dataset, countOver(profit) values are large numbers, so intermediate results may be rounded to zero because they’re smaller than QuickSight’s precision. We therefore add a factor of 10000 to inflate the intermediate results and deflate the final output:

Average Profit Ratio M = sum( 10000 * sumOver(profit) / sumOver(revenue) / countOver(profit) ) / distinct_count(category) / 10000

6.94% in total is what is expected!

For Average Profit Ratio M, QuickSight tried to calculate in the following steps:

with LAA as (
select category
, sum(profit) over (partition by category) as "sumOver(profit)"
, sum(revenue) over (partition by category) as "sumOver(revenue)"
, count(profit) over (partition by category) as "countOver(profit)"
from dataset 
)                        -- notice that LAA is at the same level as the dataset

select category
, sum(10000 * "sumOver(profit)" / "sumOver(revenue)" / "countOver(profit)") 
/ count(distinct category) / 10000 as "Average Profit Ratio"
from LAA
group by category;       -- for data at category level

select sum(10000 * "sumOver(profit)" / "sumOver(revenue)" / "countOver(profit)") 
/ count(distinct category) / 10000 as "Average Profit Ratio"
from LAA;                -- for data at total level

Conclusion

This post discussed how you can build powerful and complicated data analytics using QuickSight. We also used SQL-like scripts to help you better understand QuickSight concepts and features.

Thanks for reading!


About the Author

Ian Liao is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Building well-architected serverless applications: Regulating inbound request rates – part 2

Post Syndicated from Julian Wood original https://aws.amazon.com/blogs/compute/building-well-architected-serverless-applications-regulating-inbound-request-rates-part-2/

This series of blog posts uses the AWS Well-Architected Tool with the Serverless Lens to help customers build and operate applications using best practices. In each post, I address the serverless-specific questions identified by the Serverless Lens along with the recommended best practices. See the introduction post for a table of contents and explanation of the example application.

Reliability question REL1: How do you regulate inbound request rates?

This post continues part 1 of this reliability question. Previously, I cover controlling inbound request rates using throttling. I go through how to use throttling to control steady-rate and burst rate requests. I show some solutions for performance testing to identify the request rates that your workload can sustain before impacting performance.

Good practice: Use, analyze, and enforce API quotas

API quotas limit the maximum number of requests a given API key can submit within a specified time interval. Metering API consumers provides a better understanding of how different consumers use your workload at sustained and burst rates at any point in time. With this information, you can determine fine-grained rate limiting for multiple quota limits. You can set these according to groups of consumer needs, and adjust the limits on a regular basis.

Segregate API consumers’ steady-rate requests and their quota into multiple buckets or tiers

Amazon API Gateway usage plans allow your API consumer to access selected APIs at agreed-upon request rates and quotas. These help your consumers meet their business requirements and budget constraints. Create and attach API keys to usage plans to control access to certain API stages. I show how to create usage plans and how to associate them with API keys in “Building well-architected serverless applications: Controlling serverless API access – part 2”.
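
If you manage usage plans programmatically, the following boto3 sketch creates a plan with a throttle and quota, creates an API key, and associates the two. The API ID, stage name, tier name, and limits are placeholder assumptions.

import boto3

apigw = boto3.client("apigateway")

plan = apigw.create_usage_plan(
    name="gold-tier",  # placeholder tier name
    throttle={"rateLimit": 100.0, "burstLimit": 200},      # steady-state and burst limits
    quota={"limit": 100000, "period": "MONTH"},            # maximum requests per month
    apiStages=[{"apiId": "a1b2c3d4e5", "stage": "prod"}],  # placeholder API and stage
)

key = apigw.create_api_key(name="customer-a", enabled=True)  # placeholder consumer

apigw.create_usage_plan_key(
    usagePlanId=plan["id"],
    keyId=key["id"],
    keyType="API_KEY",
)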

API key associated with usage plan

You can extract utilization data from usage plans to analyze API usage on a per-API key basis. In the example, I show how to use usage plans to see how many requests are made.

View API key usage

This allows you to generate billing documents and determine whether your customers need higher or lower limits. Have a mechanism to allow customers to request higher limits preemptively. When customers anticipate greater API usage, they can take action proactively.
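
One way to pull this utilization data programmatically is the get_usage API; the usage plan ID and date range below are placeholders.

import boto3

apigw = boto3.client("apigateway")

# Returns per-API-key daily usage ([used, remaining]) for the given date range
usage = apigw.get_usage(
    usagePlanId="abc123",  # placeholder
    startDate="2021-09-01",
    endDate="2021-09-07",
)
print(usage["items"])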

API Gateway Lambda authorizers can dynamically associate API keys to a given request. This can be used where you do not control API consumers, or want to associate API keys based on your own criteria. For more information, see the documentation.

You can also visualize usage plans with Amazon QuickSight using enriched API Gateway access logs.

Visualize usage plans with Amazon QuickSight

Define whether your API consumers are end users or machines

Understanding your API consumers helps you manage how they connect to your API. This helps you define a request access pattern strategy, which can distinguish between end users or machines.

Machine consumers make automated connections to your API, which may require a different access pattern to end users. You may decide to prioritize end user consumers to provide a better experience. Machine consumers may be able to handle request throttling automatically.

Best practice: Use mechanisms to protect non-scalable resources

Limit component throughput by enforcing how many transactions it can accept

AWS Lambda functions can scale faster than traditional resources, such as relational databases and cache systems. Protect your non-scalable resources by ensuring that components that scale quickly do not exceed the throughput of downstream systems. This can prevent system performance from degrading. There are a number of ways to achieve this, either directly or via buffer mechanisms such as queues and streams.

For relational databases such as Amazon RDS, you can limit the number of connections per user, in addition to the global maximum number of connections. With Amazon RDS Proxy, your applications can pool and share database connections to improve their ability to scale.

Amazon RDS Proxy


For additional options for using RDS with Lambda, see the AWS Serverless Hero blog post “How To: Manage RDS Connections from AWS Lambda Serverless Functions”.

Cache results, and only connect to and fetch data from databases when needed. This reduces the load on the downstream database. Adjust the maximum number of connections for caching systems. Include a caching expiration mechanism to prevent serving stale records. For more information on caching implementation patterns and considerations, see “Caching Best Practices”.

Lambda provides managed scaling. When a function is first invoked, the Lambda service creates an instance of the function to process the event. This is called a cold start. After completion, the function remains available for a period of time to process subsequent events. These are called warm starts. If other events arrive while the function is busy, Lambda creates more instances of the function to handle these requests concurrently as cold starts. The following example shows 10 events processed in six concurrent requests.

Lambda concurrency


You can control the number of concurrent function invocations to both reserve and limit the maximum concurrency your function can achieve. You can configure reserved concurrency to set the maximum number of concurrent instances for the function. This can protect downstream resources such as a database by ensuring Lambda can only scale up to the number of connections the database can support.

For example, you may have a traditional database or external API that can only support a maximum of 50 concurrent connections. You can set the maximum number of concurrent Lambda functions using the function concurrency settings. Setting the value to 50 ensures that the traditional database or external API is not overwhelmed.

Edit Lambda concurrency


You can also set the Lambda function concurrency to 0, which disables the Lambda function in the event of anomalies.
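As a sketch, the same concurrency controls can be applied with the AWS CLI. The function name and limits below are placeholders rather than values from the example application.

# Cap the function at 50 concurrent executions to match the downstream connection limit
aws lambda put-function-concurrency --function-name my-api-function \
    --reserved-concurrent-executions 50

# Setting the value to 0 throttles every invocation, effectively disabling the function
aws lambda put-function-concurrency --function-name my-api-function \
    --reserved-concurrent-executions 0

# Remove the reservation to return the function to the account's unreserved concurrency pool
aws lambda delete-function-concurrency --function-name my-api-function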

Another solution to protect downstream resources is to use an intermediate buffer. A buffer can persistently store messages in a stream or queue until a receiver processes them. This helps you control how fast messages are processed, which can protect the load on downstream resources.

Amazon Kinesis Data Streams allows you to collect and process large streams of data records in real time, and can act as a buffer. Streams consist of a set of shards that contain a sequence of data records. When using Lambda to process records, it processes one batch of records at a time from each shard.

Kinesis Data Streams controls concurrency at the shard level, meaning that a single shard has a single concurrent invocation. This can reduce downstream calls to non-scalable resources such as a traditional database. Kinesis Data Streams also supports batch windows of up to 5 minutes and configurable batch sizes, which you can use to control how frequently invocations occur.

To learn how to manage scaling with Kinesis, see the documentation. To learn more about how Lambda works with Kinesis, read the blog series “Building serverless applications with streaming data”.
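For illustration, the following AWS CLI sketch creates an event source mapping that reads from a stream in batches of up to 100 records and waits up to five minutes before invoking the function. The function name and stream ARN are placeholders.

aws lambda create-event-source-mapping \
    --function-name my-stream-processor \
    --event-source-arn arn:aws:kinesis:us-east-1:123456789012:stream/my-stream \
    --starting-position LATEST \
    --batch-size 100 \
    --maximum-batching-window-in-seconds 300

Larger batch sizes and longer batching windows reduce how often downstream resources are called, at the cost of higher per-record latency.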

Lambda and Kinesis shards


Amazon Simple Queue Service (SQS) is a fully managed serverless message queuing service that enables you to decouple and scale microservices. You can offload tasks from one component of your application by sending them to a queue and processing them asynchronously.

SQS can act as a buffer, using a Lambda function to process the messages. Lambda polls the queue and invokes your Lambda function synchronously with an event that contains queue messages. Lambda reads messages in batches and invokes your function once for each batch. When your function successfully processes a batch, Lambda deletes its messages from the queue.

You can protect downstream resources using the Lambda concurrency controls. This limits the number of concurrent Lambda functions that pull messages off the queue. The messages persist in the queue until Lambda can process them. For more information, see “Using AWS Lambda with Amazon SQS”.
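A minimal sketch of wiring a queue in front of a function with the AWS CLI might look like the following; the queue name, function name, and batch size are placeholders you would tune for your workload.

# Create the buffer queue
aws sqs create-queue --queue-name orders-buffer

# Let Lambda poll the queue and invoke the function with batches of up to 10 messages
aws lambda create-event-source-mapping \
    --function-name my-queue-processor \
    --event-source-arn arn:aws:sqs:us-east-1:123456789012:orders-buffer \
    --batch-size 10

Combined with reserved concurrency on the function, this bounds how quickly messages are drained from the queue.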

Lambda and SQS


Conclusion

Regulating inbound requests helps you adapt different scaling mechanisms based on customer demand. You can achieve better throughput for your workloads and make them more reliable by controlling requests to a rate that your workload can support.

In this post, I cover using, analyzing, and enforcing API quotas using usage plans and API keys. I show mechanisms to protect non-scalable resources such as using RDS Proxy to protect downstream databases. I show how to control the number of Lambda invocations using concurrency controls to protect downstream resources. I explain how you can use streams and queues as an intermediate buffer to store messages persistently until a receiver processes them.

In the next post in the series, I cover the second reliability question from the Well-Architected Serverless Lens, building resiliency into serverless applications.

For more serverless learning resources, visit Serverless Land.

Improving Retail Forecast Accuracy with Machine Learning

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/improving-retail-forecast-accuracy-with-machine-learning/

The global retail market continues to grow larger and the influx of consumer data increases daily. The rise in volume, variety, and velocity of data poses challenges with demand forecasting and inventory planning. Outdated systems generate inaccurate demand forecasts. This results in multiple challenges for retailers. They are faced with over-stocking and lost sales, and often have to rely on increased levels of safety stock to avoid losing sales.

A recent McKinsey study indicates that AI-based forecasting improves forecasting accuracy by 10–20 percent. This translates to revenue increases of 2–3 percent. An accurate forecasting system can also help determine ideal inventory levels and better predict the impact of sales promotions. It provides a single view of demand across all channels and a better customer experience overall.

In this blog post, we will show you how to build a reliable retail forecasting system. We will use Amazon Forecast, and an AWS-vetted solution called Improving Forecast Accuracy with Machine Learning. This is an AWS Solutions Implementation that automatically produces forecasts and generates visualization dashboards. This solution can be extended to use cases across a variety of industries.

Improving Forecast Accuracy solution architecture

This post will illustrate a retail environment that has an SAP S/4 HANA system for overall enterprise resource planning (ERP). We will show a forecasting solution based on Amazon Forecast to predict demand across product categories. The environment also has a unified platform for customer experience provided by SAP Customer Activity Repository (CAR). Replenishment processes are driven by SAP Forecasting and Replenishment (F&R), and SAP Fiori apps are used to manage forecasts.

The solution is divided into four parts: Data extraction and preparation, Forecasting and monitoring, Data visualization, and Forecast import and utilization in SAP.

Figure 1. Notional architecture for improving forecasting accuracy solution and SAP integration


Data extraction and preparation

Historical demand data such as sales, web traffic, inventory numbers, and resource demand are extracted from SAP and uploaded to Amazon Simple Storage Service (S3). There are multiple ways to extract data from an SAP system into AWS. As part of this architecture, we will use operational data provisioning (ODP) extraction. ODP acts as a data source for OData services, enabling REST-based integrations with external applications. The ODP-Based Data Extraction via OData document details this approach. The steps involved are:

  1. Create a data source using transaction RSO2, allow Change Data Capture for specific data to be extracted
  2. Create an OData service using transaction SEGW
  3. Create a Data model for ODP extraction, which refers to the defined data source, then register the service
  4. Initiate the service from SAP gateway client
  5. In the AWS Management Console, create an AWS Lambda function to extract data and upload to S3. Check out the sample extractor code using Python, referenced in the blog Building data lakes with SAP on AWS

Related data that can potentially affect demand levels can be uploaded to Amazon S3. These could include seasonal events, promotions, and item price. Additional item metadata, such as product descriptions, color, brand, size may also be uploaded. Amazon Forecast provides built-in related time series data for holidays and weather. These three components together form the forecast inputs.

Forecasting and monitoring

An S3 event notification will be initiated when new datasets are uploaded to the input bucket. This, in turn, starts an AWS Step Functions state machine. The state machine combines a series of AWS Lambda functions that build, train, and deploy machine learning models in Amazon Forecast. All AWS Step Functions logs are sent to Amazon CloudWatch. Administrators will be notified with the results of the AWS Step Functions through Amazon Simple Notification Service (SNS).

An AWS Glue job combines raw forecast input data, metadata, predictor backtest exports, and forecast exports. These all go into an aggregated view of forecasts in an S3 bucket. It is then translated to the format expected by the External Forecast import interface. Amazon Athena can be used to query forecast output in S3 using standard SQL queries.
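As an illustrative sketch only, a query against the aggregated forecast export could be started from the AWS CLI as follows. The database, table, column names, and results bucket are placeholders, not part of the solution deployment.

aws athena start-query-execution \
    --query-string "SELECT item_id, date, p50 FROM forecast_db.forecast_export LIMIT 10" \
    --query-execution-context Database=forecast_db \
    --result-configuration OutputLocation=s3://my-athena-results-bucket/queries/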

Data visualization

Amazon QuickSight analyses can be created on a per-forecast basis. This provides users with forecast output visualization across hierarchies and categories of forecasted items. It also displays item-level accuracy metrics. Dashboards can be created from these analyses and shared within the organization. Additionally, data scientists and developers can prepare and process data, and evaluate Forecast outputs using an Amazon SageMaker Notebook Instance.

Forecast import and utilization in SAP

Amazon Forecast outputs located in Amazon S3 will be imported into the Unified Demand Forecast (UDF) module within the SAP Customer Activity Repository (CAR). For more information, see the documentation on importing external forecasts. An AWS Lambda function will be initiated when aggregated forecasts are uploaded to the S3 bucket. The Lambda function performs a remote function call (RFC) to the SAP system through the official SAP JCo Library. The SAP RFC credentials and connection information may be stored securely inside AWS Secrets Manager and read on demand to establish connectivity.

Once imported, forecast values from the solution can be retrieved by SAP Forecasting and Replenishment (F&R). They will be consumed as an input to replenishment processes, which consist of requirements calculation and requirement quantity optimization. SAP F&R calculates requirements based on the forecast, the current stock, and the open purchase orders. The requirement quantity then may be improved in accordance with optimization settings defined in SAP F&R.


Additionally, you have the flexibility to adjust the system forecast as required by the demand situation, or analyze forecasts via the respective SAP Fiori apps.

Sample use case: AnyCompany Stores, Inc.

To illustrate how beneficial this solution can be for retail organizations, let’s consider AnyCompany Stores, Inc. This is a hypothetical customer and a leader in the retail industry with 985 stores across the United States. They struggle with issues stemming from their existing forecasting implementation. That implementation only understands certain categories and does not factor in the entire product portfolio. Additionally, it is limited to available demand history and does not consider related information that may affect forecasts. AnyCompany Stores is looking to improve their demand forecasting system.

Using Improving Forecast Accuracy with Machine Learning, AnyCompany Stores can easily generate AI-based forecasts at appropriate quantiles to address sensitivities associated with respective product categories. This mitigates inconsistent inventory buys, overstocks, out-of-stocks, and margin erosion. The solution also considers all relevant related data in addition to the historical demand data. This ensures that generated forecasts are accurate for each product category.

The generated forecasts may be used to complement existing forecasting and replenishment processes. With an improved forecasting solution, AnyCompany Stores will be able to meet demand, while holding less inventory and improving customer experience. This also helps ensure that potential demand spikes are accurately captured, so staples will always be in stock. Additionally, the company will not overstock expensive items with short shelf lives that are likely to spoil.

Conclusion

In this post, we explored how to implement an accurate retail forecasting solution using a ready-to-deploy AWS Solution. We use generated forecasts to drive inventory replenishment optimization and improve customer experience. The solution can be extended to inventory, workforce, capacity, and financial planning.

We showcase one of the ways in which Improving Forecast Accuracy with Machine Learning may be extended for a use case in the retail industry. If your organization would like to improve business outcomes with the power of forecasting, explore customizing this solution to fit your unique needs.


Analyze Fraud Transactions using Amazon Fraud Detector and Amazon Athena

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/analyze-fraud-transactions-using-amazon-fraud-detector-and-amazon-athena/

Organizations with online businesses have to be on guard constantly for fraudulent activity, such as fake accounts or payments made with stolen credit cards. One way they try to identify fraudsters is by using fraud detection applications. Some of these applications use machine learning (ML).

A common challenge with ML is the need for a large, labeled dataset to create ML models to detect fraud. You will also need the skill set and infrastructure to build, train, deploy, and scale your ML model.

In this post, I discuss how to perform fraud detection on a batch of many events using Amazon Fraud Detector. Amazon Fraud Detector is a fully managed service that can identify potentially fraudulent online activities. These can be situations such as the creation of fake accounts or online payment fraud. Unlike general-purpose ML packages, Amazon Fraud Detector is designed specifically to detect fraud. You can analyze fraud transaction prediction results by using Amazon Athena and Amazon QuickSight. I will explain how to review fraud using Amazon Fraud Detector and Amazon SageMaker built-in algorithms.

Batch fraud prediction use cases

You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. You may want to generate fraud predictions for a batch of events. These might include payment fraud, account takeover or compromise, and free tier misuse, evaluated while performing an offline proof of concept. You can also use batch predictions to evaluate the risk of events on an hourly, daily, or weekly basis depending upon your business need.

Batch fraud insights using Amazon Fraud Detector

Organizations such as ecommerce companies and credit card companies use ML to detect fraud. Some of the most common types of fraud include email account compromise (personal or business), new account fraud, and non-payment or non-delivery (which includes compromised card numbers).

Amazon Fraud Detector automates the time-consuming and expensive steps to build, train, and deploy an ML model for fraud detection. Amazon Fraud Detector customizes each model it creates to your dataset, making the accuracy of models higher than current one-size-fits-all ML solutions. And because you pay only for what you use, you can avoid large upfront expenses.

If you want to analyze fraud transactions after the fact, you can perform batch fraud predictions using Amazon Fraud Detector. Then you can store fraud prediction results in an Amazon S3 bucket. Amazon Athena helps you analyze the fraud prediction results. You can create fraud prediction visualization dashboards using Amazon QuickSight.

The following diagram illustrates how to perform fraud predictions for a batch of events and analyze them using Amazon Athena.

Figure 1. Example architecture for analyzing fraud transactions using Amazon Fraud Detector and Amazon Athena


The architecture flow follows these general steps:

  1. Create and publish a detector. First create and publish a detector using Amazon Fraud Detector. It should contain your fraud prediction model and rules. For additional details, see Get started (console).
  2. Create an input Amazon S3 bucket and upload your CSV file. Prepare a CSV file that contains the events you want to evaluate. Then upload your CSV file into the input S3 bucket. In this file, include a column for each variable in the event type associated with your detector. In addition, include columns for EVENT_ID, ENTITY_ID, EVENT_TIMESTAMP, ENTITY_TYPE. Refer to Amazon Fraud Detector batch input and output files for more details. Read Create a variable for additional information on Amazon Fraud Detector variable data types and formatting.
  3. Create an output Amazon S3 bucket. Create an output Amazon S3 bucket to store your Amazon Fraud Detector prediction results.
  4. Perform a batch prediction. You can use a batch predictions job in Amazon Fraud Detector to get predictions for a set of events that do not require real-time scoring. For more information, see Batch predictions. (A CLI sketch follows this list.)
  5. Review your prediction results. Review your results in the CSV file that is generated and stored in the Amazon S3 output bucket.
  6. Analyze your fraud prediction results.
    • After creating a Data Catalog by using AWS Glue, you can use Amazon Athena to analyze your fraud prediction results with standard SQL.
    • You can develop user-friendly dashboards to analyze fraud prediction results using Amazon QuickSight by creating new datasets with Amazon Athena as your data source.
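The batch prediction in step 4 can also be started from the AWS CLI. The following is a sketch only; the job ID, S3 paths, event type, detector name, and IAM role ARN are placeholders you would replace with your own values.

aws frauddetector create-batch-prediction-job \
    --job-id batch-predictions-001 \
    --input-path s3://my-input-bucket/events.csv \
    --output-path s3://my-output-bucket/predictions/ \
    --event-type-name transaction_event \
    --detector-name transaction_fraud_detector \
    --detector-version 1 \
    --iam-role-arn arn:aws:iam::123456789012:role/fraud-detector-batch-role

# Check the status of the job
aws frauddetector get-batch-prediction-jobs --job-id batch-predictions-001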

Fraud detection using Amazon SageMaker

The Amazon Web Services (AWS) Solutions Implementation, Fraud Detection Using Machine Learning, enables you to run automated transaction processing. This can be on an example dataset or your own dataset. The included ML model detects potentially fraudulent activity and flags that activity for review. The following diagram presents the architecture you can automatically deploy using the solution’s implementation guide and accompanying AWS CloudFormation template.

SageMaker provides several built-in machine learning algorithms that you can use for a variety of problem types. This solution leverages the built-in Random Cut Forest algorithm for unsupervised learning and the built-in XGBoost algorithm for supervised learning. In the SageMaker Developer Guide, you can see how Random Cut Forest and XGBoost algorithms work.

Figure 2. Fraud detection using machine learning architecture on AWS


This architecture can be segmented into three phases.

  1. Develop a fraud prediction machine learning model. The AWS CloudFormation template deploys an example dataset of credit card transactions contained in an Amazon Simple Storage Service (Amazon S3) bucket. An Amazon SageMaker notebook instance then trains different ML models on the dataset.
  2. Perform fraud prediction. The solution also deploys an AWS Lambda function that processes transactions from the example dataset. It invokes the two SageMaker endpoints that assign anomaly scores and classification scores to incoming data points. An Amazon API Gateway REST API initiates predictions using signed HTTP requests. An Amazon Kinesis Data Firehose delivery stream loads the processed transactions into another Amazon S3 bucket for storage. The solution also provides an example of how to invoke the prediction REST API as part of the Amazon SageMaker notebook.
  3. Analyze fraud transactions. Once the transactions have been loaded into Amazon S3, you can use analytics tools and services for visualization, reporting, ad-hoc queries, and more detailed analysis.

By default, the solution is configured to process transactions from the example dataset. To use your own dataset, you must modify the solution. For more information, see Customization.

Conclusion

In this post, we showed you how to analyze fraud transactions using Amazon Fraud Detector and Amazon Athena. You can build fraud insights using Amazon Fraud Detector and Amazon SageMaker built-in algorithms Random Cut Forest and XGBoost. With the information in this post, you can build your own fraud insights models on AWS. You’ll be able to detect fraud faster. Finally, you’ll be able to solve a variety of fraud types. These can be new account fraud, online transaction fraud, and fake reviews, among others.

Read more and get started on building fraud detection models on AWS.

Establish private connectivity between Amazon QuickSight and Snowflake using AWS PrivateLink

Post Syndicated from Maxwell Moon original https://aws.amazon.com/blogs/big-data/establish-private-connectivity-between-amazon-quicksight-and-snowflake-using-aws-privatelink/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include Machine Learning-powered insights. QuickSight dashboards can be accessed from any device, and seamlessly embedded into your applications, portals, and websites.

QuickSight offers several sources for data, including but not limited to Amazon Athena, Amazon Redshift, Amazon Simple Storage Service (Amazon S3), and Snowflake. This post presents solutions to enable you to set up Snowflake as a data source for QuickSight regardless of your network configuration requirements.

We cover the following configurations of Snowflake as a data source for QuickSight:

  • QuickSight connection to Snowflake via AWS PrivateLink
  • QuickSight connection to Snowflake via AWS PrivateLink and virtual private cloud (VPC) peering (same Region)
  • QuickSight connection to Snowflake via AWS PrivateLink and VPC peering (cross-Region)
  • QuickSight connection to Snowflake (public network)

Prerequisites

To complete this solution, you need the following:

QuickSight connection to Snowflake via AWS PrivateLink

First, we show you how to connect to Snowflake with QuickSight over AWS PrivateLink. The following diagram illustrates the solution architecture.

Set up the Snowflake AWS PrivateLink integration

To start, we walk through enabling AWS PrivateLink for your Snowflake account. This includes locating resources in your AWS account, access to the Snowflake UI, and creating a support case with Snowflake.

  1. Identify the VPC you want to use to set up the AWS PrivateLink integration. To do so, retrieve a list of VPCs from the command line, then retrieve the VpcId element from the resulting JSON object for the desired VPC. See the following code:
aws ec2 describe-vpcs --output json
  1. Retrieve your AWS account ID. This post assumes that the account you’re targeting is your default account on your AWS CLI configuration.
aws sts get-caller-identity --output json
  1. If you’re setting up multiple accounts, repeat these steps for all accounts and VPCs (this post assumes you’re setting up a single account and VPC and will use this as the context moving forward).
  2. Contact Snowflake Support with your AWS account ID, VPC ID, and the corresponding account URL you use to access Snowflake (for example, <account id>.snowflakecomputing.com).

Enabling AWS PrivateLink for your Snowflake account can take up to two business days.

  1. After AWS PrivateLink is enabled, retrieve the AWS PrivateLink configuration for your Region by running the following command in a Snowflake worksheet, then retrieve the values for privatelink-account-url and privatelink_ocsp-url from the resulting JSON object. Examples of each value are as follows:
select SYSTEM$GET_PRIVATELINK_CONFIG();

privatelink-vpce-id: com.amazonaws.vpce.<region_id>.vpce-svc-xxxxxxxxxxxxxxxxx
privatelink-account-url: xxxxxxxx.<region>.privatelink.snowflakecomputing.com
privatelink_ocsp-url: ocsp.xxxxxxxx.<region>.privatelink.snowflakecomputing.com
  1. Store these values in a text editor for later use.

Next, we configure the VPC endpoint on the Amazon Virtual Private Cloud (Amazon VPC) console and create all the required security groups.

  1. On the Amazon VPC console, choose Endpoints in the navigation menu.
  2. Choose Create endpoint.
  3. Select Find AWS Service by Name.
  4. For Service Name, enter the value for privatelink-vpce-id that we retrieved earlier.
  5. Choose Verify.

A green alert with “Service Name Found” appears and the VPC and subnet options automatically expand.

Depending on your targeted Region, your resulting screen may show another Region name.

  1. Choose the same VPC ID that you sent to Snowflake.
  2. Select the subnets where you want to create endpoints.

AWS recommends using more than one subnet for high availability.

  1. For Security group, choose Create a new security group.

This opens the Security groups page on the Amazon VPC console in a new tab.

  1. Choose Create security group.

  1. Give your new security group a name (for example, quicksight-doc-snowflake-privatelink-connection) and description.
  2. Choose the VPC ID you used in previous steps.

Next, you add two rules that allow traffic from within your VPC to this VPC endpoint.

  1. Retrieve the CIDR block for your targeted VPC:
aws ec2 describe-vpcs --vpc-ids vpc-xxxxxxxxxxxxxxxxx | jq -r '.Vpcs[].CidrBlock'
  1. Choose Add rule in the Inbound rules section.
  2. Choose HTTPS for the type, leave the source as Custom, and enter the value retrieved from the preceding describe-vpcs call (for example, 10.0.0.0/16).
  3. Choose Add rule in the Inbound rules section.
  4. Choose HTTP for the type, leave the source as Custom, and enter the value retrieved from the preceding describe-vpcs call.
  5. Choose Create security group.

  1. Retrieve the security group ID from the newly created security group.
  2. On the VPC endpoint configuration page, remove the default security group.
  3. Search for and select the new security group ID.

  1. Choose Create endpoint.

You’re redirected to a page that has a link to your VPC endpoint configuration, specified by the VPC ID. The next page has a link to view the configuration in full.

  1. Retrieve the topmost record in the DNS names list.

This can be differentiated from other DNS names because it only includes the Region name (such as us-west-2), and no Availability Zone letter notation (such as us-west-2a).

  1. Store this record in a text editor for later use.
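If you prefer to script this step, the console actions above map roughly to the following AWS CLI sketch. The VPC, subnet, security group, and service name values are placeholders from your own environment.

aws ec2 create-vpc-endpoint \
    --vpc-endpoint-type Interface \
    --vpc-id vpc-xxxxxxxxxxxxxxxxx \
    --service-name com.amazonaws.vpce.us-west-2.vpce-svc-xxxxxxxxxxxxxxxxx \
    --subnet-ids subnet-aaaaaaaa subnet-bbbbbbbb \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx

# Retrieve the Regional DNS name of the endpoint for the Route 53 records created later
aws ec2 describe-vpc-endpoints \
    --vpc-endpoint-ids vpce-xxxxxxxxxxxxxxxxx \
    --query 'VpcEndpoints[].DnsEntries[].DnsName'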

Configure DNS for Snowflake endpoints in your VPC

To configure your Snowflake endpoints, complete the following steps:

  1. On the Route 53 console, choose Hosted Zones in the navigation pane.
  2. Choose Create hosted zone.
  3. For Domain name, enter the value you stored for privatelink-account-url from the previous steps.

In this field, we remove the Snowflake account ID from the DNS name and only use the value starting with the Region identifier (for example, <region>.privatelink.snowflakecomputing.com). We create a resource record set later for the subdomain.

  1. For Type, select Private hosted zone.

Your Region code may not be us-west-2; reference the DNS name returned to you by Snowflake.

  1. In the VPCs to associate with the hosted zone section, choose the Region in which your VPC is located and the VPC ID used in previous steps.

  1. Choose Create hosted zone.

Next, we create two records: one for privatelink-account-url and one for privatelink_ocsp-url.

  1. On the Hosted zones page, choose Create record set.
  2. For Record name, enter your Snowflake account ID (the first eight characters in privatelink-account-url).
  3. For Record type, choose CNAME.
  4. For Value, enter the DNS name for the Regional VPC endpoint we retrieved in the previous section.
  5. Choose Create records.

  1. Repeat these steps for the OCSP record we noted as privatelink_ocsp-url earlier, starting with ocsp through the eight-character Snowflake ID for the record name (for example, ocsp.xxxxxxxx).
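A roughly equivalent AWS CLI sketch follows; the hosted zone ID, Region, Snowflake account ID, and endpoint DNS name are placeholders. Repeat the record change for the OCSP name.

# Create the private hosted zone associated with your VPC
aws route53 create-hosted-zone \
    --name us-west-2.privatelink.snowflakecomputing.com \
    --vpc VPCRegion=us-west-2,VPCId=vpc-xxxxxxxxxxxxxxxxx \
    --caller-reference snowflake-privatelink-zone-001

# Create the CNAME record for the Snowflake account URL, pointing at the Regional VPC endpoint DNS name
aws route53 change-resource-record-sets \
    --hosted-zone-id Z0123456789EXAMPLE \
    --change-batch '{"Changes":[{"Action":"CREATE","ResourceRecordSet":{"Name":"xxxxxxxx.us-west-2.privatelink.snowflakecomputing.com","Type":"CNAME","TTL":300,"ResourceRecords":[{"Value":"vpce-xxxxxxxxxxxxxxxxx-yyyyyyyy.vpce-svc-xxxxxxxxxxxxxxxxx.us-west-2.vpce.amazonaws.com"}]}}]}'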

Configure a Route 53 resolver inbound endpoint for your VPC

QuickSight doesn’t use the standard AWS resolver (the VPC’s .2 resolver). To resolve private DNS from QuickSight, you need to set up Route 53 resolver endpoints.

First, we create a security group for the Route 53 resolver inbound endpoint.

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name for your security group (for example, quicksight-doc-route53-resolver-sg) and a description.
  3. Choose the VPC ID used in previous steps.
  4. Create rules that allow for DNS (Port 53) over UDP and TCP from within the VPC CIDR block.
  5. Choose Create security group.
  6. Note the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.

Now we create the Route 53 resolver inbound endpoint for our VPC.

  1. On the Route 53 console, choose Inbound endpoint in the navigation pane.
  2. Choose Create inbound endpoint.
  3. For Endpoint name, enter a name (for example, quicksight-inbound-resolver).
  4. For VPC in the Region, choose the VPC ID used in previous steps.
  5. For Security group for the endpoint, choose the security group ID you saved earlier.

  1. In the IP address section, choose two Availability Zones and subnets, and leave Use an IP address that is selected automatically selected.

  1. Choose Submit.
  2. Choose the inbound endpoint after it’s created and take note of the two IP addresses for the resolvers.
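The same inbound endpoint can be created and its resolver IP addresses retrieved with the AWS CLI; the security group and subnet IDs are placeholders.

aws route53resolver create-resolver-endpoint \
    --name quicksight-inbound-resolver \
    --direction INBOUND \
    --creator-request-id quicksight-inbound-001 \
    --security-group-ids sg-xxxxxxxxxxxxxxxxx \
    --ip-addresses SubnetId=subnet-aaaaaaaa SubnetId=subnet-bbbbbbbb

# List the IP addresses assigned to the endpoint, for use as QuickSight DNS resolver endpoints
aws route53resolver list-resolver-endpoint-ip-addresses \
    --resolver-endpoint-id rslvr-in-xxxxxxxxxxx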

Connect a VPC to QuickSight

To connect a VPC to QuickSight, complete the following steps:

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-snowflake-privatelink-sg) and a description.
  3. Choose the VPC ID used in previous steps.

Security groups for QuickSight are different from other security groups in that they are stateless, rather than stateful. This means you must explicitly allow return traffic from the targeted security group. The inbound rule in your security group must allow traffic on all ports. It needs to do this because the destination port number of any inbound return packets is set to a randomly allocated port number. For more information, see Inbound Rules.

  1. Choose Create security group.
  2. Take note of the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.
  3. On the Security groups page, search for the security group ID that is used for the VPC endpoint.
  4. Choose Edit inbound rules.
  5. Add rules for both HTTPS and HTTP traffic, using the security group ID for the security group you created as the source.
  6. Choose Save rules.

Next, we move to the QuickSight console to configure the VPC connection.

  1. Navigate to the QuickSight console.
  2. Choose the user name and choose Manage QuickSight.

  1. In the navigation pane, choose Manage VPC connections.
  2. Choose Add a VPC connection.
  3. For VPC connection name, enter a name (for example, snowflake-privatelink).
  4. For VPC ID, choose the VPC used in previous steps.
  5. For Subnet ID, choose one of the subnets that has a VPC endpoint, as specified when you created the endpoint earlier.
  6. For Security group ID, enter the ID of the security group you created.
  7. For DNS resolver endpoints, enter the two IPs for the inbound resolver endpoint you created earlier.

  1. Choose Create.

Set up a Snowflake data source through the VPC

To set up a Snowflake data source, complete the following steps.

  1. On the QuickSight console, choose Datasets in the navigation page.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. For Data source name, enter a name (for example, snowflake).
  5. For Connection type, choose the VPC connection you created earlier (snowflake-privatelink).
  6. For Database server, enter privatelink-account-url.
  7. For Database name, enter the name of your database.
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Upon successful validation, choose Create data source.
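If you automate your QuickSight deployments, the same data source can be created through the QuickSight API. The following is a sketch only; the account ID, VPC connection ARN, host, database, warehouse, and credentials are placeholders, and you may prefer to manage the credentials outside of plain CLI arguments.

aws quicksight create-data-source \
    --aws-account-id 123456789012 \
    --data-source-id snowflake-privatelink-ds \
    --name snowflake \
    --type SNOWFLAKE \
    --data-source-parameters '{"SnowflakeParameters":{"Host":"xxxxxxxx.us-west-2.privatelink.snowflakecomputing.com","Database":"MY_DATABASE","Warehouse":"MY_WAREHOUSE"}}' \
    --credentials '{"CredentialPair":{"Username":"my_user","Password":"my_password"}}' \
    --vpc-connection-properties VpcConnectionArn=arn:aws:quicksight:us-west-2:123456789012:vpcConnection/snowflake-privatelink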

Create your first QuickSight dashboard

In this section, we cover creating a dataset in QuickSight, then using this data in a visualization. We’re using a dummy dataset that has information about fictional employees.

  1. For Schema, choose your schema.
  2. For Tables, select your tables.
  3. Choose Select.

In the Finish dataset creation section, you can determine if QuickSight imports your dataset into SPICE to improve query performance or directly queries your data each time a dashboard is loaded. For more information about SPICE, see Importing Data into SPICE.

  1. For this post, we select Import to SPICE for quicker analytics.
  2. Choose Visualize.

Now that we have the schema, table, and SPICE configuration for the dataset, we can create our first visualization.

  1. Choose a field from the available fields list. For this post, we choose City.
  2. Choose a visualization in the Visual types pane.

This only scratches the surface of the visualization capabilities of QuickSight. For more information, see Working with Amazon QuickSight Visuals.

Next, we cover a network configuration that allows for QuickSight to be connected to one VPC with AWS PrivateLink in another VPC, and use VPC peering to allow QuickSight to use the AWS PrivateLink connection.

QuickSight connection to Snowflake via AWS PrivateLink and VPC peering within the same Region

In this section, we show you how to connect to Snowflake with QuickSight with two VPCs peered and AWS PrivateLink. The following diagram illustrates the solution architecture.

Set up VPC peering

First, we create the VPC peering connection from the requesting VPC.

  1. On the Peering connections page of the Amazon VPC console, choose Create peering connection.
  2. For Select a local VPC to peer with, choose the VPC in which you configured your Snowflake AWS PrivateLink connection.

  1. In the Select another VPC to peer with section, leave the default options for Account and Region (My account and This Region, respectively).
  2. For VPC (Accepter), choose the VPC where your QuickSight is connected to.

  1. Choose Create peering connection.

Next, we accept the VPC connection from the accepting VPC.

  1. On the Peering connections page, select the connection you created.
  2. On the Actions menu, choose Accept.
  3. Review the information about the request. If everything looks correct, choose Yes, Accept.

Next, we configure DNS to resolve between the two VPCs.

  1. On the Peering connections page, choose your new peering connection.
  2. On the DNS tab, check if the two options show as Disabled.

If they’re enabled, you can skip to the steps on creating route tables.

  1. On the Actions menu, choose Edit DNS Settings.

This requires your VPC to have DNS host name and resolution enabled.

  1. Select both check boxes to allow DNS to resolve from both the acceptor and requestor VPCs.
  2. Choose Save.

Next, create the route table entry to allow for routes to propagate between the two VPCs.

  1. On the Route tables page, choose the route tables in your requesting VPC.
  2. On the Route tab, choose Edit routes.
  3. Add a route for the CIDR block that your peered VPC uses (for this post, 172.31.0.0/16).
  4. Choose Save routes.

  1. Repeat for the route tables in your accepter VPC.
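A scripted version of the peering, DNS, and routing steps could look like the following sketch; all IDs and CIDR blocks are placeholders.

# Request the peering connection from the requester VPC
aws ec2 create-vpc-peering-connection \
    --vpc-id vpc-requesterxxxxxxxx --peer-vpc-id vpc-accepterxxxxxxxx

# Accept the request on the accepter side
aws ec2 accept-vpc-peering-connection \
    --vpc-peering-connection-id pcx-xxxxxxxxxxxx

# Allow DNS resolution across the peering connection
aws ec2 modify-vpc-peering-connection-options \
    --vpc-peering-connection-id pcx-xxxxxxxxxxxx \
    --requester-peering-connection-options AllowDnsResolutionFromRemoteVpc=true \
    --accepter-peering-connection-options AllowDnsResolutionFromRemoteVpc=true

# Add a route to the peered VPC's CIDR block in each VPC's route tables
aws ec2 create-route --route-table-id rtb-xxxxxxxxxxxx \
    --destination-cidr-block 172.31.0.0/16 \
    --vpc-peering-connection-id pcx-xxxxxxxxxxxx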

Configure DNS in the accepter VPC

In this section, we associate the accepter VPC with the same private hosted zone as the requester VPC (<region>.privatelink.snowflakecomputing.com).

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Select the hosted zone <region>.privatelink.snowflakecomputing.com and choose Edit.
  3. In the VPCs to associate with the hosted zone section, choose Add VPC.
  4. Choose the Region and VPC ID associated with the accepter VPC.

  1. Choose Save changes.

Configure Route 53 resolver inbound endpoints in the accepter VPC

To configure your Route 53 resolver inbound endpoints, complete the following steps:

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-doc-route53-resolver-sg) and a description.
  3. Choose the VPC ID used in previous steps.
  4. Create rules that allow for DNS (port 53) over UDP and TCP from within the VPC CIDR block (for this post, 172.31.0.0/16).
  5. Choose Create security group.

  1. Take note of the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.

Next, we set up the Route 53 inbound endpoint for this VPC.

  1. On the Route 53 console, choose Inbound endpoint in the navigation pane.
  2. Choose Create inbound endpoint.
  3. Enter a name for the endpoint (for example, quicksight-inbound-resolver).
  4. For VPC in the Region, choose the VPC ID for the accepter VPC.
  5. For Security group, choose the security group ID you saved earlier.
  6. In the IP Address section, select two Availability Zones and subnets, and leave Use an IP address that is selected automatically selected.
  7. Choose Submit.
  8. Choose the inbound endpoint after it’s created.
  9. After the inbound endpoint has provisioned, note the two IP addresses for the resolvers.

Connect the accepter VPC to QuickSight

To start, we need to create a security group for QuickSight to allow traffic to the Route 53 resolver inbound endpoints, the VPC endpoint for AWS PrivateLink, and traffic within the local network.

  1. On the Security groups page of the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-snowflake-privatelink-vpc-peering-sg) and a description.
  3. Choose the VPC ID for the accepter VPC.
  4. Create the following ingress rules:
    1. One rule for the local network for all TCP ports (e.g., 172.31.0.0/16).
    2. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all TCP ports.
    3. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all UDP ports.
    4. One rule allowing traffic to the security group for the VPC endpoint (located in the peered VPC).

As discussed earlier, security groups for QuickSight are different from other security groups. You must explicitly allow return traffic from the targeted security group, and the inbound rule in your security group must allow traffic on all ports. For more information, see Inbound Rules.

Next, we modify the security group for the Route 53 resolver inbound endpoint to allow traffic from the security group we created.

  1. On the Security groups page, search for the security group ID used for the Route 53 resolver inbound endpoint.
  2. Choose Edit inbound rules.
  3. Add rules for both DNS over UDP and DNS over TCP, using the security group ID for the security group we created for QuickSight as the source.

  1. Choose Save rules.

Next, modify the security group that was created for the VPC endpoint for the AWS PrivateLink connection.

  1. On the Security groups page, search for the security group ID used for the VPC endpoint for the AWS PrivateLink connection.
  2. Choose Edit inbound rules.
  3. Add rules for both HTTPS and HTTP, using the security group ID for the security group created for QuickSight as the source.
  4. Choose Save rules.

Next, we set up the VPC connection in QuickSight.

  1. On the QuickSight console, choose the user name and choose Manage QuickSight.
  2. In the navigation pane, choose Manage VPC connections.
  3. Choose Add a VPC connection.
  4. For VPC connection name, enter a name (for example, snowflake-privatelink-vpc-peering).
  5. For Subnet, choose a subnet ID that has a route table with a peering connection to the requester VPC where the AWS PrivateLink connection resides.
  6. For Security group ID, enter the ID of the security group created earlier.
  7. For DNS resolver endpoints, enter the two IPs for the inbound resolver endpoint you created.
  8. Choose Create.

Set up a Snowflake data source in QuickSight through the VPC

To set up a Snowflake data source in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. Enter a data source name (for example, snowflake-dataset).
  5. Choose the VPC connection you created (snowflake-privatelink-vpc-peering).
  6. For Database server, enter the privatelink-account-url.
  7. For Database name, enter the name of your database.
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Upon successful validation, choose Create data source.

For steps to create a dashboard, see the earlier section, Create your first QuickSight dashboard.

In the next section, we cover a similar network configuration, with the difference being that we use cross-Region VPC peering.

QuickSight connection to Snowflake via AWS PrivateLink and VPC peering across Regions

In this section, we show you how to connect to Snowflake with QuickSight over AWS PrivateLink with two VPCs peered across Regions.

We refer to Regions generically throughout this post, denoting the Region that has the Snowflake AWS PrivateLink connection as Region A and the Region in which QuickSight is set up as Region B.

The following diagram illustrates our solution architecture.

Set up VPC peering between two Regions

First, we create the VPC peering connection from the requesting VPC.

  1. Navigate to the Peering connections page on the Amazon VPC console in Region B (the Region in which you plan to use QuickSight to deploy dashboards).
  2. Choose Create peering connection.
  3. In the Select a local VPC to peer with section, for VPC (Requester), choose the VPC in which you have connected or intend to connect QuickSight.

  1. For Select another VPC to peer with, select My account and Another Region.
  2. Choose the Region in which your Snowflake AWS PrivateLink connection exists.
  3. For VPC ID (Accepter), enter the VPC ID for the VPC in which your Snowflake AWS PrivateLink exists.

  1. Choose Create peering connection.
  2. Copy the VPC peering connection ID so we can easily locate it in the next steps (it looks like pcx-xxxxxxxxxxxx).

Next, we accept the VPC peering connection from the Region in which you created your AWS PrivateLink connection.

  1. Navigate to the Amazon VPC console in Region A (where your Snowflake AWS PrivateLink connection exists).
  2. Search for and select the peering connection you created.
  3. On the Actions menu, choose Accept Request.

  1. Review the information about the request. If everything looks correct, choose Yes, Accept.

Next, we configure DNS to resolve between the two VPCs.

  1. On the Peering connections page of the Amazon VPC console, choose your newly created VPC peering connection.
  2. On the DNS tab, check if the two options show Disabled.

If they’re enabled, skip to the steps on creating route tables.

  1. On the Actions menu, choose Edit DNS settings.

This requires your VPC to have DNS host name and resolution enabled.

  1. Select both check boxes to allow DNS to resolve from both the accepter and requestor VPCs.
  2. Choose Save.

Next, we create the route table entry to allow for routes to propagate between the two VPCs for Region B.

  1. Navigate to the Amazon VPC console in Region B (the Region in which you plan to use QuickSight to deploy dashboards).
  2. In the navigation pane, choose Route tables.
  3. Select the route tables in your requesting VPC.
  4. On the Route tab, choose Edit routes.
  5. Add a route for the CIDR block that your peered VPC uses (for this post, 10.0.0.0/16 is the CIDR block for the VPC in which the Snowflake AWS PrivateLink connection resides).
  6. Choose Save routes.

Next, create the route table entry to allow for routes to propagate between the two VPCs for Region A.

  1. Navigate to the Amazon VPC console in Region A (where your Snowflake AWS PrivateLink connection exists).
  2. Repeat the previous steps, using the CIDR block for the peered VPC (in this post, 172.16.0.0/16).
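Cross-Region peering differs from the earlier sketch mainly in the --peer-region flag and in accepting the request from Region A. For example, with placeholder IDs and Regions:

# From Region B, request peering with the VPC in Region A
aws ec2 create-vpc-peering-connection --region us-east-1 \
    --vpc-id vpc-regionbxxxxxxxx --peer-vpc-id vpc-regionaxxxxxxxx \
    --peer-region us-west-2

# From Region A, accept the request
aws ec2 accept-vpc-peering-connection --region us-west-2 \
    --vpc-peering-connection-id pcx-xxxxxxxxxxxx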

Configure DNS in the VPC in Region B

First, we need to associate the VPC in Region B (where you deploy QuickSight) with the same private hosted zone as the VPC in Region A where your Snowflake AWS PrivateLink connection exists (<region>.privatelink.snowflakecomputing.com).

  1. On the Route 53 console, choose Hosted zones in the navigation pane.
  2. Select the private hosted zone <region>.privatelink.snowflakecomputing.com and choose Edit.
  3. In the VPCs to associate with the hosted zone section, choose Add VPC.
  4. Choose the Region and VPC ID associated with the accepter VPC.

  1. Choose Save changes.

Configure the Route 53 resolver inbound endpoint for your VPC in Region B

To configure the resolver inbound endpoint in Region B, complete the following steps:

  1. On the Security groups page on the Amazon VPC console, choose Create security group.
  2. Enter a name (for example, quicksight-doc-route53-resolver-sg) and a description.
  3. Choose the VPC ID used in previous steps.
  4. Create rules that allow for DNS (port 53) over UDP and TCP from within the VPC CIDR block (for this post, 172.16.0.0/16).

  1. Choose Create security group.
  2. Take note of the security group ID, because we now add a rule to allow traffic to the VPC endpoint security group.

Next, we set up the Route 53 inbound endpoint for this VPC.

  1. On the Route 53 console, choose Inbound endpoint in the navigation pane.
  2. Choose Create inbound endpoint.
  3. Enter a name for the endpoint (for example, quicksight-inbound-resolver).
  4. For VPC in the Region, choose the VPC ID used in previous steps.
  5. For Security group, choose the security group ID from the previous step.
  6. In the IP Address section, select two Availability Zones and subnets, and leave Use an IP address that is selected automatically selected.
  7. Choose Submit.
  8. Choose the inbound endpoint after it’s created.
  9. After the inbound endpoint has provisioned, note the two IP addresses for the resolvers.

Connect the VPC to QuickSight in Region B

To start, we need to create a security group for QuickSight to allow traffic to the Route 53 resolver inbound endpoints, the VPC endpoint for AWS PrivateLink, and traffic within the local network.

  1. On the Security groups page of the Amazon VPC console in Region B, choose Create security group.
  2. Enter a name (for example, quicksight-snowflake-sg) and a description.
  3. Choose the VPC ID for the VPC where you previously created the VPC peering connection.
  4. Create the following ingress rules:
    1. One rule for the local network for all TCP ports (for example, 172.16.0.0/16).
    2. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all TCP ports.
    3. One rule allowing DNS traffic from the security group for the Route 53 resolver inbound endpoint for all UDP ports.
    4. One rule allowing traffic for all TCP ports to the CIDR block for the VPC located in Region A, where your Snowflake AWS PrivateLink connection exists (for this post, 10.0.0.0/16).

As discussed earlier, security groups for QuickSight are different from other security groups. You must explicitly allow return traffic from the targeted security group, and the inbound rule in your security group must allow traffic on all ports. For more information, see Inbound Rules.

Next, we modify the security group for the Route 53 resolver inbound endpoint in Region B to allow traffic from the security group we created.

  1. On the Security groups page, search for the security group ID used for the Route 53 resolver inbound endpoint.
  2. Choose Edit inbound rules.
  3. Add rules for both DNS over UDP and DNS over TCP, using the CIDR block for the VPC in Region B (for this post, 172.16.0.0/16).

  1. Choose Save rules.

Next, we need to modify the security group we’re using for the AWS PrivateLink connection.

  1. Navigate to the Security groups page on the Amazon VPC console in Region A.
  2. Search for the security group ID that is used for the VPC endpoint for the AWS PrivateLink connection.
  3. Choose Edit inbound rules.
  4. Add rules for both HTTPS and HTTP, using the CIDR Block for the VPC in Region B as the source (for this post, 172.16.0.0/16).

  1. Choose Save rules.

Finally, we set up the QuickSight VPC connection.

  1. Navigate to the QuickSight console in Region B.
  2. Choose the user name and choose Manage QuickSight.
  3. In the navigation pane, choose Manage VPC connections.
  4. Choose Add a VPC connection.
  5. For VPC connection name, enter a connection name (for example, snowflake-privatelink-cross-region).
  6. For VPC ID, choose the VPC ID of the VPC in Region B.
  7. For Subnet, choose a subnet ID from the VPC in Region B that has a route table with a peering connection to the VPC where the AWS PrivateLink connection resides.
  8. For Security group ID, enter the ID of the security group you created.
  9. For DNS resolver endpoints, enter the two IPs for the inbound resolver endpoint created earlier.

  1. Choose Create.

Set up a Snowflake data source in QuickSight through the VPC

To set up a Snowflake data source in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. Enter a name for your data source (for example, snowflake-dataset).
  5. Choose the VPC connection you created (snowflake-privatelink-cross-region).
  6. For Database server, enter the privatelink-account-url.
  7. For Database name, enter the name of your database.
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Upon successful validation, choose Create data source.

For steps to create a dashboard, see the earlier section, Create your first QuickSight dashboard.

For our last configuration, we cover how to set up a QuickSight connection to Snowflake without AWS PrivateLink.

QuickSight connection to Snowflake without AWS PrivateLink

In this section, we show you how to connect to Snowflake with QuickSight without using AWS PrivateLink.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose the Snowflake option.
  4. Enter a data source name (for example, snowflake-dataset).
  5. Leave the connection type as Public network.
  6. For Database name, enter the name of your database.
  7. For Database server, enter the URL you use to log in to your Snowflake account (xxxxxxxx.snowflakecomputing.com).
  8. For Warehouse, enter the name of a running Snowflake warehouse.
  9. For Username, enter your Snowflake username.
  10. For Password, enter your Snowflake password.
  11. Choose Validate.
  12. Choose Create data source.

For steps to create a dashboard, see the earlier section, Create your first QuickSight dashboard.

Clean up

If your work with QuickSight, Snowflake, and AWS PrivateLink is complete, remove your Route 53 resolver inbound endpoint, Route 53 private hosted zone, and the VPC endpoint for Snowflake in order to avoid incurring additional fees.

Conclusion

In this post, we covered four ways to connect QuickSight to Snowflake as a data source: over AWS PrivateLink within the same VPC, over AWS PrivateLink with VPC peering in the same Region, over AWS PrivateLink with VPC peering across Regions, and without AWS PrivateLink over the public network.

After you set up the data source, you can gain further insights from your data by setting up ML Insights in QuickSight, creating graphical representations of your data using QuickSight visuals, or joining data from multiple datasets, as well as by using all other QuickSight features.


About the Author

Maxwell Moon is a Senior Solutions Architect at AWS working with Independent Software Vendors (ISVs) to design and scale their applications on AWS. Outside of work, Maxwell is a dad to two cats, is an avid supporter of the Wolverhampton Wanderers Football Club, and is patiently waiting for a new wave of ska music.

 

 

Bosco Albuquerque is a Sr. Partner Solutions Architect at AWS and has over 20 years of experience working with database and analytics products from enterprise database vendors and cloud providers. He has helped large technology companies design data analytics solutions, and has led engineering teams in designing and implementing data analytics platforms and data products.

Enable federation to multiple Amazon QuickSight accounts with Microsoft Azure Active Directory

Post Syndicated from Srikanth Baheti original https://aws.amazon.com/blogs/big-data/enable-federation-to-multiple-amazon-quicksight-accounts-with-microsoft-azure-active-directory/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning (ML)-powered business intelligence (BI) service built for the cloud that supports identity federation in both Standard and Enterprise editions. Organizations are working towards centralizing their identity and access strategy across all of their applications, including on-premises, third-party, and applications on AWS. Many organizations use Microsoft Azure Active Directory (Azure AD) to control and manage user authentication and authorization centrally. If your organization uses Azure AD for cloud applications and multiple QuickSight accounts, you can enable federation to all of your QuickSight accounts without needing to create and manage users multiple times. This authorizes users to access QuickSight assets—analyses, dashboards, folders, and datasets—through centrally managed Azure AD.

In this post, we go through the steps to configure federated single sign-on (SSO) between a single Azure AD instance and multiple QuickSight accounts. We demonstrate registering an SSO application in Azure AD, creating roles in Azure AD, and assigning these roles to map to QuickSight roles (admin, author, and reader). These QuickSight roles represent three different personas supported in QuickSight. Administrators can publish the QuickSight app in the Azure App portal to enable users to SSO to QuickSight using their Azure AD credentials.

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  • An Azure AD subscription
  • One or more QuickSight account subscriptions

Solution overview

The walkthrough includes the following steps:

  1. Register an AWS Single Sign-On (AWS SSO) application in Azure AD.
  2. Configure the application in Azure AD.
  3. Add Azure AD as your SAML identity provider (IdP) in AWS.
  4. Configure AWS Identity and Access Management (IAM) policies.
  5. Configure IAM roles.
  6. Create roles in Microsoft Graph Explorer.
  7. Assign the newly created roles through Graph Explorer to users in Azure AD.
  8. Test the application from Azure AD.

Register an AWS SSO application in Azure AD

To configure the integration of an AWS SSO application in Azure AD, you need to add AWS SSO to your list of managed software as a service (SaaS) apps.

  1. Sign in to the Azure portal using a Microsoft account.
  2. Under Azure services, choose Azure Active Directory.

  1. In the navigation pane, under Manage, choose Enterprise Applications.

  1. Choose All applications.
  2. Choose New application.

  1. In the Browse Azure AD Gallery section, search for AWS Single Sign-On.
  2. Choose AWS Single Sign-On from the results panel and add the application.

  1. For Name, enter Amazon QuickSight.
  2. After the application is created, copy the Object ID value from the application overview.

You need this object ID in the later steps.

Configure an AWS SSO application in Azure AD

Follow these steps to enable Azure AD SSO in the Azure portal.

  1. In the Azure portal, on the AWS SSO application you registered in the first step, in the Manage section, choose Single sign-on.
  2. On the Select a single sign-on method page, choose SAML.
  3. Choose the pencil icon.
  4. For Identifier (Entity ID), enter URN:AMAZON:WEBSERVICES.
  5. For Reply URL, enter https://signin.aws.amazon.com/saml.
  6. Leave Sign on URL blank.
  7. For Relay State, enter https://quicksight.aws.amazon.com.
  8. Leave Logout URL blank.
  9. Choose Save.

  1. On the Set up Single Sign-On with SAML page, under User Attributes & Claims, choose Edit.

  1. In the Additional Claims section, configure the SAML token attributes by using the following values (each attribute uses the namespace https://aws.amazon.com/SAML/Attributes):
  • RoleSessionName – source attribute user.userprincipalname
  • Role – source attribute user.assignedroles
  • SessionDuration – source attribute set to a value from 900 seconds (15 minutes) to 43,200 seconds (12 hours)
  1. In the SAML Signing Certificate section, choose Download to download the federation metadata XML file.

You use this XML document later when setting up the SAML provider in IAM.

Add Azure AD as your SAML IdP in AWS

To configure Azure AD as your SAML IdP, complete the following steps:

  1. Open a new tab in your browser.
  2. Sign in to the IAM console in your AWS account with admin permissions.
  3. On the IAM console, under Access Management in the navigation pane, choose Identity providers.
  4. Choose Add provider.

  1. For Provider name, enter AzureActiveDirectory.
  2. Choose Choose file to upload the metadata document you downloaded in the earlier step.
  3. Choose Add provider.

  1. In the banner message that appears, choose View provider.

  1. Copy the ARN to use in a later step.

  1. Repeat these steps in other accounts where you want to enable SSO.
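If you prefer to script this step instead of using the console, you can create the same provider with the AWS CLI. The following is a minimal sketch; it assumes the federation metadata you downloaded earlier is saved locally as azure-ad-metadata.xml (an illustrative file name), and it must be run in each account where you want to enable SSO:

aws iam create-saml-provider \
    --name AzureActiveDirectory \
    --saml-metadata-document file://azure-ad-metadata.xml

The command returns the provider ARN, which you can note down for the later steps instead of retrieving it from the console.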

Configure IAM policies

In this step, you create three IAM policies for mapping to three different roles with permissions in QuickSight (admin, author, and reader).

Use the following steps to set up the QuickSight-Admin-Account1 policy. This policy grants admin privileges in QuickSight to the federated user.

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Choose JSON and replace the existing text with the code from the following table for QuickSight-Admin-Account1.
Policy Name JSON Text
QuickSight-Admin-Account1
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "quicksight:CreateAdmin",
"Resource": "*"
}
]
}

QuickSight-Author-Account1
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": "quicksight:CreateUser",
"Resource": "*"
}
]
}

QuickSight-Reader-Account1
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": " quicksight:CreateReader",
"Resource": "*"
}
]
}

  1. Choose Review policy.
  2. For Name, enter QuickSight-Admin-Account1.
  3. Choose Create policy.
  4. Repeat the steps for QuickSight-Author-Account1 and QuickSight-Reader-Account1.
  5. Repeat these steps in other accounts where you want to enable SSO.
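You can also create these policies from the command line. The following sketch assumes you saved the JSON for the admin policy to a local file named quicksight-admin-account1.json (an illustrative name); repeat the command for the author and reader policies with their respective files:

aws iam create-policy \
    --policy-name QuickSight-Admin-Account1 \
    --policy-document file://quicksight-admin-account1.json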

Configure IAM roles

Next, create the roles that your Azure AD users assume when federating into QuickSight. Use the following steps to set up the admin role:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created earlier (AzureActiveDirectory).
  5. Select Allow programmatic and AWS Management Console access.
  6. For Attribute, choose SAML:aud.
  7. For Value, enter https://signin.aws.amazon.com/saml.

  1. Choose Next: Permissions.
  2. Choose the QuickSight-Admin-Account1 IAM policy you created in the previous step.
  3. Choose Next: Tags.
  4. Choose Next: Review.
  5. For Role name, enter QuickSight-Admin-Role.
  6. For Role description, enter a description.
  7. Choose Create role.

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Choose the QuickSight-Admin-Role role you created to open the role’s properties.
  3. Copy the role ARN to the notepad.
  4. On the Trust Relationships tab, choose Edit Trust Relationship.
  5. Under Trusted Entities, verify that the IdP you created is listed.
  6. Under Conditions, verify that SAML:aud with a value of https://signin.aws.amazon.com/saml is present.

  1. Repeat these steps to create your author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Author-Account1.
    2. For QuickSight-Reader-Role, use the policy QuickSight-Reader-Account1.
  2. Repeat these steps in other accounts where you want to enable SSO.
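If you prefer the AWS CLI for this step, the following sketch creates the admin role and attaches its policy. The trust policy shown is the standard SAML 2.0 federation trust that the console generates for console access; replace <account-id> with your own AWS account ID:

cat > saml-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<account-id>:saml-provider/AzureActiveDirectory"
      },
      "Action": "sts:AssumeRoleWithSAML",
      "Condition": {
        "StringEquals": {
          "SAML:aud": "https://signin.aws.amazon.com/saml"
        }
      }
    }
  ]
}
EOF

aws iam create-role \
    --role-name QuickSight-Admin-Role \
    --assume-role-policy-document file://saml-trust-policy.json

aws iam attach-role-policy \
    --role-name QuickSight-Admin-Role \
    --policy-arn arn:aws:iam::<account-id>:policy/QuickSight-Admin-Account1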

Create roles in Microsoft Graph Explorer

Optionally, you can create roles within Azure AD with paid subscriptions. Open Microsoft Graph Explorer, and then do the following:

  1. Sign in to the Microsoft Graph Explorer site with the domain account for your tenant.

You need sufficient permissions to create the roles.

  1. To grant permissions, choose the ellipsis (three dots) next to your name and choose Select permissions.

  1. On the Permission list, expand Directory.
  2. Select the three directory-level permissions as shown in the following screenshot and choose Consent.

  1. Sign in to Graph Explorer again, and accept the site usage conditions.
  2. Choose GET for the method, and 1.0 for the version.
  3. In the query box, enter https://graph.microsoft.com/v1.0/servicePrincipals/<objectId> (use the object ID you saved earlier).
  4. In the Response preview pane, copy the response to an editor of your choice to modify.

  1. Extract the appRoles property from the service principal object.

Now you generate new roles for your application. These roles must match the IAM roles in AWS that you created earlier.

  1. From the notepad, use the format <Role ARN>,<IdP ARN> (no space after the comma) to create your roles:
    1. arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Admin-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory
    2. arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Author-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory
    3. arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Reader-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory
    4. arn:aws:iam::0xxxxxxxxxx2:role/QS-Admin-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2
    5. arn:aws:iam::0xxxxxxxxxx2:role/QS-Author-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2
    6. arn:aws:iam::0xxxxxxxxxx2:role/QS-Reader-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2
  2. The following JSON code is an example of the appRoles property. Create a similar object to add the roles for your application:
            "appRoles": [
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "User",
                    "displayName": "User",
                    "id": "8774f594-1d59-4279-b9d9-59ef09a23530",
                    "isEnabled": true,
                    "origin": "Application",
                    "value": null
                },
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "msiam_access",
                    "displayName": "msiam_access",
                    "id": "e7f1a7f3-9eda-48e0-9963-bd67bf531afd",
                    "isEnabled": true,
                    "origin": "Application",
                    "value": null
                },
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "Raji Quicksight Admin",
                    "displayName": "RajiQSAdmin",
                    "id": "9a07d03d-667f-405d-b5d7-68bec5b64584",
                    "isEnabled": true,
                    "origin": "ServicePrincipal",
                    "value": "arn:aws:iam::0xxxxxxxxxx2:role/QS-Admin-AZAd-Role,arn:aws:iam::0xxxxxxxxxx2:saml-provider/AzureAd-Acct2"
                },
                {
                    "allowedMemberTypes": [
                        "User"
                    ],
                    "description": "Sri Quicksight Admin",
                    "displayName": "SriQSAdmin",
                    "id": "77dd76d1-f897-4093-bf9a-8f3aaf25f30e",
                    "isEnabled": true,
                    "origin": "ServicePrincipal",
                    "value": "arn:aws:iam::5xxxxxxxxxx9:role/QuickSight-Admin-Role,arn:aws:iam::5xxxxxxxxxx9:saml-provider/AzureActiveDirectory"
                }
            ]

New roles must be added in addition to the existing msiam_access role for the patch operation. You can also add multiple roles, depending on your organization’s needs. Azure AD sends the value of these roles as the claim value in the SAML response.

When adding new roles, you must provide a new GUID for each ID attribute in the JSON payload. You can use an online GUID generation tool to generate a new unique GUID for each role.

  1. In Microsoft Graph Explorer, change the method from GET to PATCH.
  2. Patch the service principal object with the roles you want by updating the appRoles property, like the one shown in the preceding example.
  3. Choose Run Query to run the patch operation. A success message confirms the creation of the role for your AWS application.
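If you prefer to run the same patch outside Graph Explorer, the following curl sketch shows the shape of the request. The object ID and access token are placeholders; the token must carry a Microsoft Graph permission that allows updating service principals (for example, Application.ReadWrite.All), and the appRoles array is the full array you prepared earlier:

curl -X PATCH "https://graph.microsoft.com/v1.0/servicePrincipals/<objectId>" \
    -H "Authorization: Bearer <access-token>" \
    -H "Content-Type: application/json" \
    -d '{ "appRoles": [ ... ] }'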

After the service principal is patched with new roles, you can assign users and groups to their respective roles.

  1. In the Azure portal, go to the QuickSight application you created and choose Users and Groups.
  2. Create your groups.

We recommend creating a new group for every AWS role in order to assign a particular role to the group. This one-to-one mapping means that one group is assigned to one role. You can then add members to the group.

  1. After you create the groups, choose the group and assign it to the application.

Nested groups are not allowed.

  1. To assign the role to the group, choose the role, then choose Assign.

Test the application

In this section, you test your Azure AD SSO configuration by using Microsoft Applications.

  1. Navigate to Microsoft Applications.
  2. On the My Apps page, choose AWS Single Sign-On.

  1. Choose a specific role for the QuickSight account you want to use.

You’re redirected to the QuickSight console.

Summary

This post provided step-by-step instructions to configure federated SSO between a single Azure AD instance and multiple QuickSight accounts. We also discussed how to create new roles and map users and groups in Azure AD to IAM for secure access into multiple QuickSight accounts.

For information about federating from Azure AD to a single QuickSight account, see Enabling Amazon QuickSight federation with Azure AD.


About the Authors

 

Srikanth Baheti is a Specialized World Wide Sr. Solutions Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences and eResearch Technology Inc., where he was responsible for designing and developing high-traffic web applications and highly scalable and maintainable data pipelines for reporting platforms using AWS services and serverless computing.

 

Raji Sivasubramaniam is a Specialist Solutions Architect at AWS, focusing on Analytics. Raji has 20 years of experience in architecting end-to-end enterprise data management, business intelligence, and analytics solutions for Fortune 500 and Fortune 100 companies across the globe. She has in-depth experience in integrated healthcare data and analytics with a wide variety of healthcare datasets, including managed market, physician targeting, and patient analytics. In her spare time, Raji enjoys hiking, yoga, and gardening.

 

Padmaja Suren is a Senior Solutions Architect specialized in QuickSight. She has 20+ years of experience in building scalable data platforms for reporting, analytics, and AI/ML using a variety of technologies. Prior to AWS, in her role as BI Architect at ERT, she designed, engineered, and cloud-enabled the BI and analytics platform for the management of large-scale clinical trial data conducted across the world. She dedicates her free time to her passion project SanghWE, which helps sexual trauma survivors in developing nations heal and recover.

Migrate Amazon QuickSight across AWS accounts

Post Syndicated from Abhinav Sarin original https://aws.amazon.com/blogs/big-data/migrate-amazon-quicksight-across-aws-accounts/

This blog post is co-written by Glen Douglas and Alex Savchenko from Integrationworx.

Enterprises that follow an Agile software development lifecycle (SDLC) process for their dashboard development and deployment typically have distinct environments for development, staging, QA and test, and production. One recommended approach when developing on AWS is to create multiple AWS accounts corresponding to the various environments. Amazon QuickSight is a fully managed, serverless business intelligence service offered by AWS for building interactive dashboards. With QuickSight, you can share dashboards with your internal users, or embed dashboards into your applications for external users or customers, scaling to hundreds of thousands of users with no servers or infrastructure to maintain. A QuickSight account corresponds to the underlying AWS account, so when dashboards are created in the development environment, you need to migrate these assets to a higher environment to align with current DevOps practices. This requires cross-account dashboard migration. This post outlines the steps involved in migrating dashboards from one account to another.

Solution overview

The following diagram shows the architecture of how QuickSight accounts are mapped to AWS accounts. In this post, we outline the steps involved in migrating QuickSight assets in the dev account to the prod account.

Migrating QuickSight dashboards from a dev account to a prod account involves converting the underlying dev dashboard assets into JSON and then recreating them in prod. QuickSight provides a robust set of APIs to create QuickSight assets, such as dashboards, analysis, datasets, and themes. You can run these APIs via the AWS Command Line Interface (AWS CLI) or through SDKs available for various programming languages. For this post, we use the AWS CLI for illustration; for regular usage, we recommend you implement this with the AWS SDK.

To set up the AWS CLI, see Developing Applications with the QuickSight API.

The following diagram illustrates the high-level steps to migrate QuickSight assets from one account to another. First, we need to prepare the dataset and data sources, and create the analysis, template, and dashboards in the dev account.

 Next, we use the dev template with the test dataset and data sources to promote the dashboards.

You can create any QuickSight entity programmatically using APIs (for example, create-data-set and create-data-source). Other APIs generate a JSON representation of an entity that was created (such as describe-data-set and describe-data-source). The analysis entity is an exception: as of this writing, there are no APIs for creating an analysis or generating its JSON representation. All analysis entities must be created using the AWS Management Console the first time. From that point on, you can manage the analysis programmatically using templates.

Definitions

The following table defines the parameters used in this post.

Environment Name | Demo Reference | AWS Account ID | QuickSight Account Name
Source (Dev) | Source account | 31********64 | QS-Dev
Target (Test) | Target account | 86********55 | QS-Test

The following figure summarizes the QuickSight objects and AWS Identity and Access Management (IAM) resources that are referenced in the migration between accounts. The QuickSight objects from the source account are referenced and recreated in the target account.

The following table summarizes the QuickSight objects used in this post for migration from the source account (dev) to the target account (test). As part of the migration, the name of the data source is changed in order to denote the change in underlying data used in the target environment. As we demonstrate in this post, a QuickSight template is created in the source account only. This template contains the information about the source dashboard that is to be created in the target account.

QuickSight Object Type | Object Name (Source) | Object Name (Target)
Data Source | QS_Dev | QS_Test
Dataset | sporting_event_ticket_info | sporting_event_ticket_info
Dashboard | Sporting_event_ticket_info_dashboard | Sporting_event_ticket_info_dashboard
Template | sporting_event_ticket_info_template | (created in the source account only)

For this post, we use an Amazon RDS for PostgreSQL database as the dataset and create a QuickSight visualization using the database table sporting_event_ticket_info. You can use any of the data sources that QuickSight supports or easily test this process using a spreadsheet. The dashboard to be migrated from the development environment shows data from the corresponding dev database (QS_Dev).

Prerequisites

Consider the situation in which you need to migrate a QuickSight dashboard from a source (or dev) environment to a target (or test) environment. The migration requires a QuickSight account subscription in both the source and target AWS accounts, plus the AWS CLI configured for each account, and involves the steps described in the following sections.

Step 1: Enable permissions to migrate QuickSight objects

To facilitate using AWS CLI commands and sharing QuickSight templates from the source account to the target account, we perform the following actions in the source environment:

1a) Create an IAM policy.

1b) Create a new IAM group and attach the policy.

1c) Create a new IAM user and assign it to the group.

1d) Invite the IAM user created to the QuickSight (dev) account.

1e) Create another reader policy in the source (dev) account to grant access to the target (test) account.

1f) Create a deployer role in the source account.

Step 1a: Create an IAM policy

You start by creating IAM resources. For this post, we create an IAM policy called Dev-QuickSightAdmin, which is used by IAM users and roles in the source account. The purpose of this policy is to allow a user in the source account to perform various actions on QuickSight objects.

  1. To get started, as the admin user, sign in to the IAM console.
  2. In the navigation pane, choose Policies.
  3. Choose Create policy.
  4. In the Service section, select Choose a service and choose QuickSight.
  5. Under Actions, select the following permissions:
    1. List – ListAnalysis, ListDashboards, ListDataSets, ListDataSources, ListTemplates
    2. Read – DescribeAnalysis, DescribeDashboard, DescribeDataSet, DescribeDataSource, DescribeTemplate
    3. Write – CreateTemplate, UpdateTemplate
    4. Permissions management – DescribeTemplatePermissions, UpdateTemplatePermissions
  6. Select the appropriate resources.

You can restrict the resources and Region based on your requirement. We allow all resources for this post.

  1. Review and create the policy.

Step 1b: Create a new IAM group

Create a new IAM group Dev-grp-QuickSightAdmin and assign the Dev-QuickSightAdmin policy (from Step 1a) to the group in the source account.

Step 1c: Create a new IAM user

Create a new IAM user called Dev-qs-admin-user and assign it to the Dev-grp-QuickSightAdmin group. You use this user later to run the AWS CLI commands in the source account. Alternately, you can use an existing IAM user for this purpose.

Step 1d: Invite the IAM user to the QuickSight (dev) account

Sign in to QuickSight in the source (dev) account and invite the user from Step 1c to QuickSight. Assign the role of ADMIN and for IAM user, choose Yes to indicate that this is an IAM user.

Step 1e: Create another reader policy in the source (dev) account

In the source (dev) account, create another IAM policy, called Dev-QuickSightReader, to grant access to the target (test) account. The purpose of this policy is to allow the target account to perform list and read actions on QuickSight objects in the source account.

  1. On the IAM console, choose Policies in the navigation pane.
  2. Choose Create policy.
  3. In the Service section, select Choose a service and choose QuickSight.
  4. Under Actions, make sure that All QuickSight actions is not selected.
  5. Under Access level, select List and Read.
  6. Select the following permissions:
    1. List – ListAnalysis, ListDashboards, ListDataSets, ListDataSources, ListTemplates
    2. Read – DescribeAnalysis, DescribeDashboard, DescribeDataSet, DescribeDataSource, DescribeTemplate
  7. Review and create the policy.

When complete, verify that the reader IAM policy Dev-QuickSightReader shows only the List and Read access levels for the QuickSight service.

Step 1f: Create a deployer role in the source account (dev)

You now create an IAM role called Dev-QuickSight-Deployer in the source account (dev). This role is specifically assigned to the target account ID and assigned the QuickSightReader policy, as noted in the previous step. This allows the external AWS target (test) account to read the QuickSight template contained in the source (dev) account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Another AWS account and provide the account ID of the target account.
  4. In the Attach permissions policies section, attach the Dev-QuickSightReader policy.
  5. In the Add tags section, add any tags (optional).
  6. Choose Next.
  7. In the Review section, assign a role name (we use Dev-QuickSight-Deployer).
  8. Enter a role description.
  9. Choose Create role.

You have completed the creation of the policy, group, and user in the source account. The next step is to configure the permissions in the target account.

  1. Switch from the source account to the target account.
  2. In the target account, sign in to the IAM console and repeat the steps in this section to create the Test-QuickSightAdmin policy and Test-grp-QuickSightAdmin group, and assign the policy to the group. Test-QuickSightAdmin should have the following permissions:
    1. List – ListAnalysis, ListDashboards, ListDataSets, ListDataSources, ListTemplates
    2. Read – DescribeAnalysis, DescribeDashboard, DescribeDataSet, DescribeDataSource, DescribeTemplate
    3. Write – CreateTemplate, UpdateTemplate, CreateDataSource, CreateDashboard, UpdateDataSet
    4. Permissions management – DescribeTemplatePermissions, UpdateTemplatePermissions
    5. Tagging – TagResource, UntagResource
  1. Create the IAM user Test-qs-admin-user and add it to the Test-grp-QuickSightAdmin group.
  2. Sign in to QuickSight and invite the test user.

One final step is to add Test-qs-admin-user as a trusted entity in the Dev-QuickSight-Deployer role, because the target account needs cross-account access to the template in the source account. We use the IAM user Test-qs-admin-user to create the dashboard in the test account.

  1. Switch back to the source (dev) account.
  2. Choose Roles and search for the Dev-QuickSight-Deployer role.
  3. Choose Trust relationships.
  4. Edit the trust relationship and add the following ARN in the Principal section of the policy: arn:aws:iam::86XXXXX55:user/Test-qs-admin-user.
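For reference, after this edit the trust policy on Dev-QuickSight-Deployer should look similar to the following sketch (account IDs are masked the same way as in the rest of this post). You can also apply it from the AWS CLI instead of the console:

cat > deployer-trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::86XXXXX55:root",
          "arn:aws:iam::86XXXXX55:user/Test-qs-admin-user"
        ]
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam update-assume-role-policy \
    --role-name Dev-QuickSight-Deployer \
    --policy-document file://deployer-trust-policy.json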

Step 2: Prepare QuickSight objects in the source account

To migrate QuickSight objects to a target environment, we perform the following actions in the source environment:

2a) Prepare the data source file.

2b) Prepare the dataset file.

2c) Create a dashboard template.

Step 2a: Prepare the data source file

Data sources represent connections to specific sources of data and are made available within a QuickSight account. Multiple data source types are supported within QuickSight, including a variety of relational databases, flat files, JSON semi-structured data files, and software as a service (SaaS) data providers.

Any QuickSight dashboards and datasets to be migrated to the target environment must have their corresponding data sources also migrated within the target environment.

In this step, you create a JSON file with the data source information from the source environment. Then you use the create-data-source AWS CLI command in the target environment with the JSON file as input. 

  1. First, identify the data source for your implementation by running the list-data-sources command in the source (dev) account: 
aws quicksight list-data-sources --aws-account-id 31********64

The following code shows the top portion of the command output:

{
    "Status": 200,
    "DataSources": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:31********64:datasource/4b98fee7-4df1-4dc2-8ca3-115c1c1839ab"",
            "DataSourceId": "4b98fee7-4df1-4dc2-8ca3-115c1c1839ab",
            "Name": "QS_Dev",
            "Type": "POSTGRESQL",
            "Status": "CREATION_SUCCESSFUL",
            "CreatedTime": "2020-12-16T20:42:37.280000-08:00",
            "LastUpdatedTime": "2020-12-16T20:42:37.280000-08:00",
            "DataSourceParameters": {
                "RdsParameters": {
                    "InstanceId": "dmslabinstance",
                    "Database": "sportstickets"
                }
            },
            "SslProperties": {
                "DisableSsl": false
            }
        },

  1. Run the describe-data-source command, using the DataSourceId from the previous list-data-source command output.

This command provides details about the data source, which we use to create the data source in the target account.

aws quicksight describe-data-source --aws-account-id 31********64 --data-source-id "4b98fee7-4df1-4dc2-8ca3-115c1c1839ab"

The following is the resulting output:

{
    "Status": 200,
    "DataSource": {
        "Arn": "arn:aws:quicksight:us-east-1:31*******64:datasource/4b98fee7-4df1-4dc2-8ca3-115c1c1839ab",
        "DataSourceId": "4b98fee7-4df1-4dc2-8ca3-115c1c1839ab",
        "Name": "QS_Dev",
        "Type": "POSTGRESQL",
        "Status": "CREATION_SUCCESSFUL",
        "CreatedTime": "2020-12-16T20:42:37.280000-08:00",
        "LastUpdatedTime": "2020-12-16T20:42:37.280000-08:00",
        "DataSourceParameters": {
            "RdsParameters": {
                "InstanceId": "dmslabinstance",
                "Database": "sportstickets"
            }
        },
        "SslProperties": {
            "DisableSsl": false
        }
    },
    "RequestId": "c2455720-118c-442d-9ba3-4f446cb543f1"
}

If you’re migrating more than one data source, you need to repeat the step for each data source.
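If you want to script that repetition, a minimal sketch like the following (assuming the default AWS CLI profile points at the source account) writes one JSON description file per data source:

for id in $(aws quicksight list-data-sources \
    --aws-account-id 31********64 \
    --query 'DataSources[].DataSourceId' --output text); do
  aws quicksight describe-data-source \
      --aws-account-id 31********64 \
      --data-source-id "$id" > "datasource-$id.json"
done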

  1. Use the output from the describe-data-source command to create a JSON file called create-data-source-cli-input.json, which represents the data source that is being migrated.

The contents of the following JSON file reference the data source information (name, host, credentials) for the target environment:

{
    "AwsAccountId": "86********55",
    "DataSourceId": "QS_Test",
    "Name": "QS_Test",
    "Type": "POSTGRESQL",
    "DataSourceParameters": {
        "PostgreSqlParameters": {
            "Host": "dmslabinstance.***********.us-east-1.rds.amazonaws.com",
            "Port": 5432,
            "Database": "sportstickets"
        }
    },
    "Credentials": {
        "CredentialPair": {
            "Username": "xxxxxxx",
            "Password": "yyyyyyy"
        }
    },
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:86********55:user/default/Test-qs-admin-user",
            "Actions": [
                "quicksight:UpdateDataSourcePermissions",
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
                "quicksight:UpdateDataSource",
                "quicksight:DeleteDataSource"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "QS_Test"
        }
    ]
}

For this post, because the target environment is connecting to the same data source as the source environment, the values in the JSON can simply be provided from the previous describe-data-source command output.

Step 2b: Prepare the dataset file

Datasets provide an abstraction layer in QuickSight, representing prepared data from a data source in the QuickSight account. The intent of prepared datasets is to enable reuse in multiple analyses and sharing among QuickSight users. A dataset can include calculated fields, filters, and changed field names or data types. When based on a relational database, datasets can join tables within QuickSight or as part of the underlying SQL query used to define the dataset.

The sample dataset sporting_event_ticket_info represents a single table.

Similar to the process used for data sources, you create a JSON file representing the datasets from the source account.

  1. Run the list-data-sets command to get all datasets from the source account:
aws quicksight list-data-sets --aws-account-id 31********64

The following code is the output:

{
            "Arn": "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
            "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
            "Name": "sporting_event_ticket_info",
            "CreatedTime": "2020-12-16T20:45:36.672000-08:00",
            "LastUpdatedTime": "2020-12-16T20:45:36.835000-08:00",
            "ImportMode": "SPICE"
}

  1. Run the describe-data-set command, specifying the DataSetId from the previous command’s response:
aws quicksight describe-data-set --aws-account-id 31********64 --data-set-id "24b1b03a-86ce-41c7-9df7-5be5343ff9d9"

The following code shows the output:

{
    "Status": 200,
    "DataSet": {
        "Arn": "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
        "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
        "Name": "sporting_event_ticket_info",
        "CreatedTime": "2020-12-16T20:45:36.672000-08:00",
        "LastUpdatedTime": "2020-12-16T20:45:36.835000-08:00",
        "PhysicalTableMap": {
            "d7ec8dff-c136-4c9a-a338-7017a95a4473": {
                "RelationalTable": {
                    "DataSourceArn": "arn:aws:quicksight:us-east-1:31********64:datasource/f72f8c2a-f9e2-4c3c-8221-0b1853e920b2",
                    "Schema": "dms_sample",
                    "Name": "sporting_event_ticket_info",
                    "InputColumns": [
                        {
                            "Name": "ticket_id",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "event_id",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "sport",
                            "Type": "STRING"
                        },
                        {
                            "Name": "event_date_time",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "home_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "away_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "location",
                            "Type": "STRING"
                        },
                        {
                            "Name": "city",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_level",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "seat_section",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_row",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat",
                            "Type": "STRING"
                        },
                        {
                            "Name": "ticket_price",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "ticketholder",
                            "Type": "STRING"
                        }
                    ]
                }
            }
        },
        "LogicalTableMap": {
            "d7ec8dff-c136-4c9a-a338-7017a95a4473": {
                "Alias": "sporting_event_ticket_info",
                "DataTransforms": [
                    {
                        "TagColumnOperation": {
                            "ColumnName": "city",
                            "Tags": [
                                {
                                    "ColumnGeographicRole": "CITY"
                                }
                            ]
                        }
                    }
                ],
                "Source": {
                    "PhysicalTableId": "d7ec8dff-c136-4c9a-a338-7017a95a4473"
                }
            }
        },
        "OutputColumns": [
            {
                "Name": "ticket_id",
                "Type": "DECIMAL"
            },
            {
                "Name": "event_id",
                "Type": "INTEGER"
            },
            {
                "Name": "sport",
                "Type": "STRING"
            },
            {
                "Name": "event_date_time",
                "Type": "DATETIME"
            },
            {
                "Name": "home_team",
                "Type": "STRING"
            },
            {
                "Name": "away_team",
                "Type": "STRING"
            },
            {
                "Name": "location",
                "Type": "STRING"
            },
            {
                "Name": "city",
                "Type": "STRING"
            },
            {
                "Name": "seat_level",
                "Type": "DECIMAL"
            },
            {
                "Name": "seat_section",
                "Type": "STRING"
            },
            {
                "Name": "seat_row",
                "Type": "STRING"
            },
            {
                "Name": "seat",
                "Type": "STRING"
            },
            {
                "Name": "ticket_price",
                "Type": "DECIMAL"
            },
            {
                "Name": "ticketholder",
                "Type": "STRING"
            }
        ],
        "ImportMode": "SPICE",
        "ConsumedSpiceCapacityInBytes": 318386511
    },
    "RequestId": "8c6cabb4-5b9d-4607-922c-1916acc9da1a"
}
  1. Based on the dataset description, create a JSON file based on the template file (create-data-set-cli-input-sql.json) with the details listed in the describe-data-set command output:
{

    "AwsAccountId": "86********55",
    "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
    "Name": "person",
    "PhysicalTableMap": {
        "23fb761f-df37-4242-9f23-ba61be20a0df": {
            "RelationalTable": {
                "DataSourceArn": "arn:aws:quicksight:us-east-1: 86********55:datasource/QS_Test",
                "Schema": "dms_sample",
                "Name": " sporting_event_ticket_info ",
                "InputColumns": [

                        {
                            "Name": "ticket_id",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "event_id",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "sport",
                            "Type": "STRING"
                        },
                        {
                            "Name": "event_date_time",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "home_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "away_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "location",
                            "Type": "STRING"
                        },
                        {
                            "Name": "city",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_level",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "seat_section",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_row",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat",
                            "Type": "STRING"
                        },
                        {
                            "Name": "ticket_price",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "ticketholder",
                            "Type": "STRING"
                        }
                    ]
                }
            }
        },
        "LogicalTableMap": {
            "23fb761f-df37-4242-9f23-ba61be20a0df": {
                "Alias": "person",
                "Source": {
                    "PhysicalTableId": "23fb761f-df37-4242-9f23-ba61be20a0df"
                }
            }
        },
    "ImportMode": "SPICE",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1: 86********55:user/default/Test-qs-admin-user",
            "Actions": [
                "quicksight:UpdateDataSetPermissions",
                "quicksight:DescribeDataSet",
                "quicksight:DescribeDataSetPermissions",
                "quicksight:PassDataSet",
                "quicksight:DescribeIngestion",
                "quicksight:ListIngestions",
                "quicksight:UpdateDataSet",
                "quicksight:DeleteDataSet",
                "quicksight:CreateIngestion",
                "quicksight:CancelIngestion"
            ]
        }
    ],
    "Tags": [
        {
            "Key": "Name",
            "Value": "QS_Test"
        }
    ]
}

The DataSourceArn should reference the ARN of the data source in the target account (the QS_Test data source that you create in Step 3a); you update this value before creating the dataset in Step 3b.

Step 2c: Create a QuickSight template

A template in QuickSight is an entity that encapsulates the metadata that describes an analysis. A template provides a layer of abstraction from a specific analysis by using placeholders for the underlying datasets used to create the analysis. When we replace dataset placeholders in a template, we can recreate an analysis for a different dataset that follows the same schema as the original analysis.

You can also share templates across accounts. This feature, combined with the dataset placeholders in a template, provides the means to migrate a dashboard from one account to another.

  1. To create a template, begin by using the list-dashboards command to get list of available dashboards in the source environment:
aws quicksight list-dashboards --aws-account-id 31********64

The command output can be lengthy, depending on the number of dashboards in the source environment.

  1. Search for the “Name” of the desired dashboard in the output file and copy the corresponding DashboardId to use in the next step:
{
            "Arn": "arn:aws:quicksight:us-east-1:31********64:dashboard/c5345f81-e79d-4a46-8203-2763738489d1",
            "DashboardId": "c5345f81-e79d-4a46-8203-2763738489d1",
            "Name": "Sportinng_event_ticket_info_dashboard",
            "CreatedTime": "2020-12-18T01:30:41.209000-08:00",
            "LastUpdatedTime": "2020-12-18T01:30:41.203000-08:00",
            "PublishedVersionNumber": 1,
            "LastPublishedTime": "2020-12-18T01:30:41.209000-08:00"
        }
  1. Run the describe-dashboard command for the DashboardId copied in the preceding step:
aws quicksight describe-dashboard --aws-account-id 31********64 --dashboard-id "c5345f81-e79d-4a46-8203-2763738489d1"

The response should look like the following code:

{
    "Status": 200,
    "Dashboard": {
        "DashboardId": "c5345f81-e79d-4a46-8203-2763738489d1",
        "Arn": "arn:aws:quicksight:us-east-1:31********64:dashboard/c5345f81-e79d-4a46-8203-2763738489d1",
        "Name": "Sportinng_event_ticket_info_dashboard",
        "Version": {
            "CreatedTime": "2020-12-18T01:30:41.203000-08:00",
            "Errors": [],
            "VersionNumber": 1,
            "SourceEntityArn": "arn:aws:quicksight:us-east-1:31*******64:analysis/daddefc4-c97c-4460-86e0-5b488ec29cdb",
            "DataSetArns": [
                "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9"
            ]
        },
        "CreatedTime": "2020-12-18T01:30:41.209000-08:00",
        "LastPublishedTime": "2020-12-18T01:30:41.209000-08:00",
        "LastUpdatedTime": "2020-12-18T01:30:41.203000-08:00"
    },
    "RequestId": "be9e60a6-d271-4aaf-ab6b-67f2ba5c1c20"
}
  1. Use the details obtained from the describe-dashboard command to create a JSON file based on the file create-template-cli-input.json.

The following code represents the input for creating a QuickSight template:

 {
    "AwsAccountId": "31********64",
    "TemplateId": "Sporting_event_ticket_info_template",
    "Name": "Sporting event ticket info template",
    "SourceEntity": {
        "SourceAnalysis": {
            "Arn": "arn:aws:quicksight:us-east-1:31********64:analysis/daddefc4-c97c-4460-86e0-5b488ec29cdb",
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "TicketInfo",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:31********64:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9"
                }
            ]
        }
    },
    "VersionDescription": "1"
}
  1. Run the create-template command to create a template object based on the file you created.

For example, the JSON file named create-template-cli-input.json would be run as follows:

aws quicksight create-template --cli-input-json file://./create-template-cli-input.json

The following is the expected response for the create-template command:

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template",
    "VersionArn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template/version/1",
    "TemplateId": "Sporting_event_ticket_info_template",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "c32e4eb1-ecb6-40fd-a2da-cce86e15660a" 
}

The template is created in the background, as noted by the CreationStatus. Templates aren’t visible within the QuickSight UI; they’re a developer-managed or admin-managed asset that is only accessible via the APIs.
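For example, because templates don't appear in the UI, a quick way to confirm that the template exists (and to find its ID later) is to list the templates in the source account:

aws quicksight list-templates --aws-account-id 31********64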

  1. To check status of the template, run the describe-template command:
aws quicksight describe-template --aws-account-id 31********64 --template-id "Sporting_event_ticket_info_template"

The expected response for the describe-template command should indicate a Status of CREATION_SUCCESSFUL:

{
    "Status": 200,
    "Template": {
        "Arn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template",
        "Name": "Sporting event ticket info template",
        "Version": {
            "CreatedTime": "2020-12-18T01:56:49.135000-08:00",
            "VersionNumber": 1,
            "Status": "CREATION_SUCCESSFUL",
            "DataSetConfigurations": [
                {
                    "Placeholder": "TicketInfo",
                    "DataSetSchema": {
                        "ColumnSchemaList": [
                            [
                        {
                            "Name": "ticket_id",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "event_id",
                            "Type": "INTEGER"
                        },
                        {
                            "Name": "sport",
                            "Type": "STRING"
                        },
                        {
                            "Name": "event_date_time",
                            "Type": "DATETIME"
                        },
                        {
                            "Name": "home_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "away_team",
                            "Type": "STRING"
                        },
                        {
                            "Name": "location",
                            "Type": "STRING"
                        },
                        {
                            "Name": "city",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_level",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "seat_section",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat_row",
                            "Type": "STRING"
                        },
                        {
                            "Name": "seat",
                            "Type": "STRING"
                        },
                        {
                            "Name": "ticket_price",
                            "Type": "DECIMAL"
                        },
                        {
                            "Name": "ticketholder",
                            "Type": "STRING"
                        }
                        ]
                    },
                    "ColumnGroupSchemaList": []
                }
            ],
            "Description": "1",
            "SourceEntityArn": "arn:aws:quicksight:us-east-1:31********64:analysis/daddefc4-c97c-4460-86e0-5b488ec29cdb"
        },
        "TemplateId": "Sporting_event_ticket_info_template",
        "LastUpdatedTime": "2020-12-18T01:56:49.125000-08:00",
        "CreatedTime": "2020-12-18T01:56:49.125000-08:00"
    },
    "RequestId": "06c61098-2a7e-4b0c-a27b-1d7fc1742e06"
}
  1. Take note of the TemplateArn value in the output to use in subsequent steps.
  2. After you verify the template has been created, create a second JSON file (TemplatePermissions.json) and replace the Principal value with the ARN for the target account:
[
{
"Principal": "arn:aws:iam::86*******55:root",
"Actions": ["quicksight:UpdateTemplatePermissions","quicksight:DescribeTemplate"]
}
]
  1. Use this JSON file as the input for the update-template-permissions command, which allows cross-account read access from the source template (source account) to the target account:
aws quicksight update-template-permissions --aws-account-id 31********64 --template-id "Sporting_event_ticket_info_template" --grant-permissions file://./TemplatePermissions.json --profile default

This command permits the target account to view the template in the source account. The expected response for the update-template-permissions command should look like the following code:

{
    "Status": 200,
    "TemplateId": "Sporting_event_ticket_info_template",
    "TemplateArn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template",
    "Permissions": [
        {
            "Principal": "arn:aws:iam::86********55:root",
            "Actions": [
                "quicksight:UpdateTemplatePermissions",
                "quicksight:DescribeTemplate"
            ]
        }
    ],
    "RequestId": "fa153511-5674-4891-9018-a8409ad5b8b2"
}

At this point, all the required work in the source account is complete. The next steps use the AWS CLI configured for the target account.
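One convenient way to switch between the two accounts is to keep a separate named AWS CLI profile for the target account and pass it explicitly; the profile name below is illustrative:

aws configure --profile qs-target

aws quicksight list-data-sources --aws-account-id 86********55 --profile qs-target

The remaining commands in this post omit --profile for brevity; add it if you use named profiles.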

Step 3: Create QuickSight resources in the target account

To create the data sources and data templates in the target account, you perform the following actions in the target environment using Test-qs-admin-user:

3a) Create a data source in the target account.

3b) Create datasets in the target account.

3c) Create dashboards in the target account.

Step 3a: Create a data source in the target account

To create a data source in your target account, complete the following steps:

  1. Use the data source file created in Step 2a to run the create-data-source command in the target environment:
aws quicksight create-data-source --cli-input-json file://./create-data-source-cli-input.json

The response from the command should indicate the creation is in progress:

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:86********55:datasource/QS_Test",
    "DataSourceId": "QS_Test",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "3bf160e2-a5b5-4c74-8c67-f651bef9b729"
}
  1. Use the describe-data-source command to validate that the data source was created successfully:
aws quicksight describe-data-source --aws-account-id 86********55 --data-source-id "QS_Test"

The following code shows the response:

{
    "Status": 200,
    "DataSource": {
        "Arn": "arn:aws:quicksight:us-east-1:86********55:datasource/QS_Test",
        "DataSourceId": "QS_Test",
        "Name": "QS_Test",
        "Type": "POSTGRESQL",
        "Status": "CREATION_SUCCESSFUL",
        "CreatedTime": "2020-12-21T12:11:30.734000-08:00",
        "LastUpdatedTime": "2020-12-21T12:11:31.236000-08:00",
        "DataSourceParameters": {
            "PostgreSqlParameters": {
                "Host": "dmslabinstance.*************.us-east-1.rds.amazonaws.com",
                "Port": 5432,
                "Database": "sportstickets"
            }
        },
        "SslProperties": {
            "DisableSsl": false
        }
    },
    "RequestId": "fbb72f11-1f84-4c57-90d2-60e3cbd20454"}
  1. Take note of the DataSourceArn value to reference later when creating the dataset.

The data source should now be available within QuickSight. Keep in mind that the data source is visible to the Test-qs-admin-user user, so you must sign in as Test-qs-admin-user and open QuickSight. For this post, the input JSON file renamed the data source to reflect the test environment. Alternatively, sign in to the target QuickSight account and choose Create new dataset to view the available data source.
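If you want to capture just the ARN for use in the dataset file, you can add a --query filter to the describe call instead of copying it from the full JSON output:

aws quicksight describe-data-source \
    --aws-account-id 86********55 \
    --data-source-id "QS_Test" \
    --query 'DataSource.Arn' \
    --output text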

Step 3b: Create datasets in the target account

Now that the data source is in the target environment, you’re ready to create your datasets.

  1. Use the create-data-set command to create the dataset using the create-data-set-cli-input-sql.json created in Step 2b.

Make sure to replace the DataSourceArn in create-data-set-cli-input-sql.json with the DataSourceArn value shown in the describe-data-source command output in Step 3a.

aws quicksight create-data-set --cli-input-json file://./create-data-set-cli-input-sql.json

The following code shows our results:

{
    "Status": 201,
    "Arn": "arn:aws:quicksight:us-east-1:86********55:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
    "DataSetId": "24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
    "IngestionArn": "arn:aws:quicksight:us-east-1:86********55:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9/ingestion/d819678e-89da-4392-b550-04221a0e4c11",
    "IngestionId": "d819678e-89da-4392-b550-04221a0e4c11",
    "RequestId": "d82caa28-ca43-4c18-86ab-5a92b6b5aa0c"
}
  1. Make note of the ARN of the dataset to use in a later step.
  2. Validate that the dataset was created using the describe-data-set command:
aws quicksight describe-data-set --aws-account-id 86********55 --data-set-id "24b1b03a-86ce-41c7-9df7-5be5343ff9d9"

Alternatively, sign in to QuickSight to see the new dataset in the list.

Step 3c: Create dashboards in the target account

Now that you shared the template with the target account in Step 2c, the final step is to create a JSON file that contains details about the dashboard to migrate to the target account.

  1. Create a JSON file (create-dashboard-cli-input.json) based on the following sample code, and provide the target account and the source account that contains the template:
{
    "AwsAccountId": "86********55",
    "DashboardId": "TicketanalysisTest",
    "Name": "Sportinng_event_ticket_info_dashboard",
    "Permissions": [
        {
            "Principal": "arn:aws:quicksight:us-east-1:86********55:user/default/Test-qs-admin-user",
            "Actions": [
                "quicksight:DescribeDashboard",
                "quicksight:ListDashboardVersions",
                "quicksight:UpdateDashboardPermissions",
                "quicksight:QueryDashboard",
                "quicksight:UpdateDashboard",
                "quicksight:DeleteDashboard",
                "quicksight:DescribeDashboardPermissions",
                "quicksight:UpdateDashboardPublishedVersion"
            ]
        }
    ],
    "SourceEntity": {
        "SourceTemplate": {
            "DataSetReferences": [
                {
                   "DataSetPlaceholder": "TicketInfo",
                    "DataSetArn": "arn:aws:quicksight:us-east-1:86********55:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9"
                }
            ],
            "Arn": "arn:aws:quicksight:us-east-1:31********64:template/Sporting_event_ticket_info_template"
        }
    },
    "VersionDescription": "1",
    "DashboardPublishOptions": {
        "AdHocFilteringOption": {
            "AvailabilityStatus": "DISABLED"
        },
        "ExportToCSVOption": {
            "AvailabilityStatus": "ENABLED"
        },
        "SheetControlsOption": {
            "VisibilityState": "EXPANDED"
        }
    }
}

The preceding JSON file has a few important values:

  • The Principal value in the Permissions section references a QuickSight user (Test-qs-admin-user) in the target QuickSight account, which is assigned various actions on the new dashboard to be created.
  • The DataSetPlaceholder in the SourceTemplate must use the same name as specified in the template created in Step 2c. This applies to all DataSetPlaceholder values if more than one is referenced in the dashboard.
  • The DataSetArn value is the ARN of the dataset created in Step 3b.
  • The ARN value in the SourceTemplate section references the ARN of the template created in the source account in Step 2c.
  1. After you create the file, run the create-dashboard command to create the dashboard in the target QuickSight account:
aws quicksight create-dashboard --cli-input-json file://./create-dashboard-cli-input.json

The following code shows the response:

{
    "Status": 202,
    "Arn": "arn:aws:quicksight:us-east-1:86********55:dashboard/TicketanalysisTest",
    "VersionArn": "arn:aws:quicksight:us-east-1:86********55:dashboard/TicketanalysisTest/version/1",
    "DashboardId": "TicketanalysisTest",
    "CreationStatus": "CREATION_IN_PROGRESS",
    "RequestId": "d57fa0fb-4736-441d-9e74-c5d64b3e4024"
}
  1. Open QuickSight for the target account to see the newly created dashboard.

Use the same Test-qs-admin-user to sign in to QuickSight. The following screenshot shows that the provided DashboardId is part of the URL.

Like any QuickSight dashboard, you can share the dashboard with users and groups within the target QuickSight account.

The QuickSight CLI includes additional commands for performing operations to update existing datasets and dashboards in a QuickSight account. You can use these commands to promote new versions of existing dashboards.
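
For example, to promote a new version of an existing dashboard after its source template changes, you can call update-dashboard and then publish the version it creates. The following Boto3 sketch assumes the same placeholder account, dataset, and template identifiers used earlier in this post; adjust them to your environment.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

target_account = "111122223333"      # placeholder: target account ID
source_account = "444455556666"      # placeholder: source account that owns the template
dashboard_id = "TicketanalysisTest"

# Create a new dashboard version from the (updated) source template
update = quicksight.update_dashboard(
    AwsAccountId=target_account,
    DashboardId=dashboard_id,
    Name="Sporting_event_ticket_info_dashboard",
    SourceEntity={
        "SourceTemplate": {
            "DataSetReferences": [
                {
                    "DataSetPlaceholder": "TicketInfo",
                    "DataSetArn": f"arn:aws:quicksight:us-east-1:{target_account}:dataset/24b1b03a-86ce-41c7-9df7-5be5343ff9d9",
                }
            ],
            "Arn": f"arn:aws:quicksight:us-east-1:{source_account}:template/Sporting_event_ticket_info_template",
        }
    },
)

# The new version number is the last path element of the returned VersionArn
version_number = int(update["VersionArn"].rsplit("/", 1)[-1])

# Point the published dashboard at the new version so readers see it
quicksight.update_dashboard_published_version(
    AwsAccountId=target_account,
    DashboardId=dashboard_id,
    VersionNumber=version_number,
)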

Clean up your resources

As a matter of good practice, when migration is complete, you should revoke the cross-account role created in Step 1 that allows trusted account access. Similarly, you should disable or remove any other user accounts and permissions created as part of this process.

Conclusion

This post demonstrated an approach to migrate QuickSight objects from one QuickSight account to another. You can use this solution as a general-purpose method to move QuickSight objects between any two accounts or as a way to support SDLC practices for managing and releasing versions of QuickSight solutions in operational environments.

For more information about automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation, check out the video Admin Level-Up Virtual Workshop, V2 on YouTube.


About the authors

Abhinav Sarin is a senior partner solutions architect at Amazon Web Services; his core interests include databases, data analytics, and machine learning. He works with AWS customers and partners to provide guidance and technical assistance on database projects, helping them improve the value of their solutions when using AWS.

 

Michael Heyd is a Solutions Architect with Amazon Web Services and is based in Vancouver, Canada. Michael works with enterprise AWS customers to transform their business through innovative use of cloud technologies. Outside work he enjoys board games and biking.

 

 

Glen Douglas is an Enterprise Architect with over 25 years of IT experience and is a Managing Partner at Integrationworx. He works with clients to solve challenges in data integration, master data management, analytics, and data engineering across a variety of industries and computing platforms. Glen is involved in all aspects of client solution definition through project delivery and has been TOGAF 8 and 9 certified since 2008.

 

Alex Savchenko is a Senior Data Specialist with Integrationworx. He is TOGAF 9 certified and has over 22 years of experience applying deep knowledge in data movement, processing, and analytics across a variety of industries and platforms.

Power operational insights with Amazon QuickSight

Post Syndicated from Luis Gerardo Baeza original https://aws.amazon.com/blogs/big-data/power-operational-insights-with-amazon-quicksight/

Organizations need a consolidated view of their applications, but typically application health status is siloed: end-users complain on social media platforms, operational data coming from application logs is stored on complex monitoring tools, formal ticketing systems track reported issues, and synthetic monitoring data is only available for the tool administrators.

In this post, we show how to use Amazon QuickSight, the fully managed, cloud-native business intelligence (BI) service from AWS, to quickly build a dashboard that consolidates:

  • Operational data from application logs coming from Amazon CloudWatch.
  • Issues reported on Jira software as a service (SaaS) edition.
  • Public posts on Twitter.
  • Synthetic monitoring performed using a CloudWatch Synthetics canary.

This dashboard can provide a holistic view of the health status of a workload. For example, you can:

  • Trace end-users complaining on Twitter or creating issues on Jira back to the application logs where the error occurred.
  • Identify when customers are complaining about system performance or availability on social media.
  • Correlate reports with monitoring metrics (such as availability and latency).
  • Track down errors in application code using log information.
  • Prioritize issues already being addressed based on Jira information.

Solution overview

The following architecture for the solution consists of a subscription filter for CloudWatch Logs that continuously sends the application logs to an Amazon Simple Storage Service (Amazon S3) bucket, an AWS Glue crawler to update the Amazon S3 log table metadata, a view in Amazon Athena that formats the data in the bucket, and three QuickSight datasets: Athena, Jira, and Twitter.
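
The CloudFormation template you deploy later creates the subscription filter for you. For reference only, a minimal Boto3 sketch of the equivalent API call looks like the following; the log group, delivery stream, and role names are placeholders, not the names the template actually uses.

import boto3

logs = boto3.client("logs")

# Placeholders: substitute the resources created in your account
logs.put_subscription_filter(
    logGroupName="/insights/application-logs",
    filterName="insights-to-firehose",
    filterPattern="",  # an empty pattern forwards every log event
    destinationArn="arn:aws:firehose:us-east-1:111122223333:deliverystream/insights-delivery-stream",
    roleArn="arn:aws:iam::111122223333:role/CWLtoFirehoseRole",
)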

To implement this solution, complete the following steps:

  1. Set up CloudWatch and AWS Glue resources.
  2. Set up a QuickSight dataset for Athena.
  3. Set up a QuickSight dataset for CloudWatch Synthetics.
  4. Set up a QuickSight dataset for Twitter.
  5. Set up a QuickSight dataset for Jira.
  6. Create a QuickSight overview analysis.
  7. Create a QuickSight detailed analysis.
  8. Publish your QuickSight dashboard.

Prerequisites

To get started, make sure you have the following prerequisites:

  • An AWS account.
  • Previous experience working with the AWS Management Console.
  • A Twitter account.
  • A Jira SaaS account. Make sure that the DNS name of your Jira Cloud is accessible to QuickSight.
  • Access to the Athena engine v2.

Set up CloudWatch and AWS Glue resources

Start by deploying a Lambda transformation function to use with your Amazon Kinesis Data Firehose delivery stream:

  1. On the Lambda console, launch a new function using the kinesis-firehose-cloudwatch-logs-processor blueprint.
  2. Enter a function name and choose Create function.
  3. Modify the transformLogEvent function in the Lambda code:
function transformLogEvent(logEvent) {
    // Emit each log event as a CSV line: epoch timestamp, raw message
    return Promise.resolve(`${logEvent.timestamp},${logEvent.message}\n`);
}
  4. Choose Deploy to update the function.

As part of these steps, you create a new AWS Identity and Access Management (IAM) role with basic permissions; you attach an IAM policy created by AWS CloudFormation to this role later.

  5. Choose Copy ARN and save the ARN temporarily for later use.

To create the sample resources, complete the following steps:

  1. Choose Launch Stack:

  2. Choose Next.
  3. Enter a stack name.
  4. For TransformationLambdaArn, enter the function ARN you copied earlier.
  5. Choose Next twice, then acknowledge the message about IAM capabilities.
  6. Choose Create stack.

By default, when you launch the template, you’re taken to the AWS CloudFormation Events page. After 5 minutes, the stack launch is complete.

Test the Lambda function

We can test the log generator function so that it writes sample logs into the log group.

  1. On the Resources page of the AWS CloudFormation console, search for LogGenerator.
  2. Choose the physical ID of the Lambda function.
  3. On the Lambda console, choose the function you created.
  4. Choose Test, enter sample for the Event name, and leave the other configurations at their default values.
  5. Choose Create.
  6. Choose Test again and you should receive the message “Logs generated.”
  7. On the Functions page of the Lambda console, choose the transformation function you created.

Change the AWS Lambda function configuration

  1. On the Configuration tab, choose General configuration.
  2. Choose Edit.
  3. For Memory, set to 256 MB.
  4. For Timeout, increase to 5 minutes.
  5. Choose Save.
  6. Choose Permissions, then choose the role name ID to open the IAM console.
  7. Choose Attach policies.
  8. Search for and select InsightsTransformationFunctionPolicy.
  9. Choose Attach policy.

The policy you attached allows your Lambda transformation function to put records into your Kinesis Data Firehose delivery stream.

Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. The Firehose delivery stream automatically partitions data by date.

  1. On the Amazon S3 console, locate and choose the bucket insightslogsbucket.
  2. Choose the cwlogs prefix and navigate through the partitions created by Kinesis Data Firehose (year/month/day/hour).

Make sure the bucket contains at least one file. (It may take up to 5 minutes to show because Kinesis Data Firehose buffers the data by default.)
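
You can also confirm programmatically that partitioned objects are arriving. The following Boto3 sketch lists objects under the cwlogs prefix; the bucket name is a placeholder for the bucket the CloudFormation template created in your account.

import boto3

s3 = boto3.client("s3")

bucket = "insightslogsbucket-EXAMPLE"  # placeholder: use the exact bucket name from the stack

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix="cwlogs/"):
    for obj in page.get("Contents", []):
        # Keys follow the year/month/day/hour layout written by Kinesis Data Firehose
        print(obj["Key"], obj["Size"])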

To optimize the log storage, you can later enable record transformation to Parquet, a columnar data format. For more information, see Converting Input Record Format (Console).

Run the AWS Glue crawler

To complete this section, run the AWS Glue crawler to discover bucket metadata:

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler InsightsLogCrawler and choose Run crawler.

  3. Wait for the crawler to complete, then choose Tables.
  4. Choose the filter bar and choose the resource attribute Database.
  5. Enter insightsdb and press Enter.
  6. You should see a table with a CSV classification.
  7. On the Athena console, enter the following into the query editor:
select * from cwlogs limit 5;
  8. Choose Run query.

You should see a table like the following screenshot.

Set up a QuickSight dataset for Athena

If you haven’t signed up for a QuickSight subscription, do so before creating your dataset.

To use the table created in the AWS Glue Data Catalog, you have to authorize connections to Athena.

  1. On the QuickSight console, choose your QuickSight username and choose Manage QuickSight.
  2. Choose Security & permissions in the navigation pane.
  3. Choose Add or remove to set up access to Amazon S3.

  4. Choose your S3 bucket insightslogsbucket created by the CloudFormation template to allow QuickSight to access it.
  5. Choose the QuickSight logo to exit the management screen.

Add the Athena dataset

Now we can set up the Athena dataset.

  1. On the QuickSight console, choose Datasets in the navigation pane, then choose New dataset.
  2. Choose Athena from the available options.
  3. For Data source name, enter cwlogs.
  4. Leave the default workgroup selected (primary).
  5. Choose Create data source.
  6. Open the list of databases and choose insightsdb.
  7. Choose the cwlogs table and choose Use custom SQL.
  8. Replace New custom SQL with cwlogs and enter the following SQL code:
SELECT col1 as logmessage, col0 as datetime FROM "insightsdb"."cwlogs"
  9. Choose Confirm query.

You receive a confirmation of the dataset creation.

  10. Choose Edit/Preview Data.
  11. For Dataset name, enter cwlogs.

Separate severity levels

The logs contain a severity level (INFO, WARN, ERRR) embedded in the message. To enable analysis of the logs based on this value, you separate the severity level from the message using QuickSight calculated fields.

  1. Choose Dataset below the query editor and choose the datetime field.
  2. Change the type to Date.

  3. Open the Fields panel on the left and choose Add calculated field.
  4. For Add name, enter level.
  5. Enter the following code:
substring(logmessage,1,4)
  6. Choose Save.
  7. Choose Add calculated field.
  8. For Add name, enter message.
  9. Enter the following code:
replace(logmessage,concat(level," - "),"")
  10. Choose Save.
  11. Choose the options icon (three dots) next to the logmessage field and choose Exclude field.

The final dataset should be similar to the following screenshot.

  12. Choose Save.

Set up a QuickSight dataset for CloudWatch Synthetics

You add Synthetics monitoring metrics to your QuickSight dashboard to have visibility into the availability of your website. For this, you use the Athena CloudWatch connector.

Create a CloudWatch Synthetics canary

To create a CloudWatch Synthetics canary that monitors your website using heartbeats (sample requests) to test availability, complete the following steps:

  1. On the CloudWatch Synthetics console, choose Create canary.
  2. Make sure the blueprint Heartbeat monitoring is selected.
  3. For Name, enter webstatus.
  4. For Application or endpoint URL, enter your website’s URL.
  5. Under Schedule, enter a frequency that works best for you (1–60 minutes).

The default setup is every 5 minutes.

  6. Enter an S3 location to store artifacts.
  7. Choose Create canary.
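
If you prefer to script this step, the following Boto3 sketch shows the equivalent CreateCanary call. The execution role, buckets, script location, handler, and runtime version are placeholders; check the Synthetics documentation for currently supported runtime versions.

import boto3

synthetics = boto3.client("synthetics")

synthetics.create_canary(
    Name="webstatus",
    ArtifactS3Location="s3://DOC-EXAMPLE-BUCKET/canary-artifacts/",              # placeholder
    ExecutionRoleArn="arn:aws:iam::111122223333:role/CloudWatchSyntheticsRole",  # placeholder
    Schedule={"Expression": "rate(5 minutes)"},
    RuntimeVersion="syn-nodejs-puppeteer-3.9",                                   # placeholder version
    Code={
        # Placeholder location of a zipped heartbeat script and its handler
        "S3Bucket": "DOC-EXAMPLE-BUCKET",
        "S3Key": "canary-code/webstatus.zip",
        "Handler": "pageLoadBlueprint.handler",
    },
)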

Set up the Athena CloudWatch connector

CloudWatch Synthetics sends availability data from the canary to CloudWatch Metrics. To query the metrics, you can use the Athena CloudWatch Metrics connector.
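
Before setting up the connector, you can optionally spot-check that the canary is publishing data by querying CloudWatch directly. The following Boto3 sketch pulls the last day of SuccessPercent averages for the webstatus canary created earlier.

from datetime import datetime, timedelta
import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.utcnow()
result = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "success",
            "MetricStat": {
                "Metric": {
                    "Namespace": "CloudWatchSynthetics",
                    "MetricName": "SuccessPercent",
                    "Dimensions": [{"Name": "CanaryName", "Value": "webstatus"}],
                },
                "Period": 300,
                "Stat": "Average",
            },
        }
    ],
    StartTime=now - timedelta(days=1),
    EndTime=now,
)["MetricDataResults"][0]

# Each data point is the average availability reported by the canary for that period
for timestamp, value in zip(result["Timestamps"], result["Values"]):
    print(timestamp.isoformat(), value)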

  1. On the Athena console, choose Data sources.
  2. Choose Connect data source and choose Query a data source.
  3. Choose Amazon CloudWatch Metrics and choose Next.
  4. Choose Configure new AWS Lambda function.

 A new tab opens in the browser, which you return to after deploying the Lambda function.

  5. For SpillBucket, enter the name of your S3 bucket insightslogsbucket created by the CloudFormation template you deployed.
  6. For AthenaCatalogName, enter cwmetrics.
  7. Select I acknowledge that this app creates custom IAM roles.
  8. Choose Deploy.
  9. Close the browser tab and go to the Athena tab you were on before.
  10. Refresh the list of Lambda functions by choosing the refresh icon.
  11. Under Lambda function, choose the Lambda function you just created.
  12. For Catalog name, enter cwmetrics.
  13. Choose Connect.

Set up permissions for QuickSight to use the connector

The CloudWatch connector runs on Lambda, which uses a spill bucket to handle large queries, so QuickSight needs permission to invoke the Lambda function and write to the spill bucket. For more information, see the GitHub repo. Let's set up permission to allow QuickSight to use the CloudWatch connector.

  1. On the QuickSight console admin page, choose Security & permissions.
  2. Choose Add or remove.
  3. Choose Athena.
  4. On the S3 tab, specify write permissions for your insightslogsbucket S3 bucket.
  5. On the Lambda tab, choose your Lambda function cwmetrics.

Create the new QuickSight dataset

Now we set up the QuickSight dataset for CloudWatch Synthetics.

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset and choose Athena.
  3. For Data source name, enter cwmetrics.
  4. Choose Create data source.
  5. Open the list of catalogs and choose cwmetrics.
  6. Choose the metric_samples table and choose Use custom SQL.
  7. For New custom SQL field, enter cwmetrics.
  8. Enter the following SQL code:
SELECT timestamp, value, dimension.dim_name, dimension.dim_value
    FROM cwmetrics.default.metric_samples CROSS JOIN UNNEST(dimensions) AS t(dimension)
    WHERE namespace='CloudWatchSynthetics' AND metric_name='SuccessPercent'
        AND statistic='Average'
        AND dimension.dim_name='CanaryName' AND dimension.dim_value='webstatus'
        AND timestamp BETWEEN To_unixtime(Now() - INTERVAL '7' DAY)
        AND To_unixtime(Now())
  9. Choose Confirm query.

You receive a confirmation of the dataset creation.

  10. Choose Edit/Preview data.
  11. Choose Dataset below the query editor and choose the timestamp field.
  12. Change the type to Date.
  13. Choose Save.

Set up a QuickSight dataset for Twitter

To set up the Twitter dataset, complete the following steps:

  1. On the QuickSight console, create a new dataset with Twitter as the source.
  2. For Data source name, enter twitterds.
  3. For Query, enter a hashtag or keyword to analyze from Twitter posts.
  4. Choose Create data source.

A new window opens requesting you to give QuickSight OAuth authorization for Twitter.

  5. Sign in to Twitter and choose Authorize application.

  6. Choose the table Twitt, then choose Edit/Preview data.

It might take a couple of minutes for the records to be imported into SPICE, the QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. You can continue with the tutorial while SPICE finishes in the background.

Let’s create a calculated field that classifies tweets as a Good experience or Bad experience by searching for the words “error,” “problem,” or “expensive.” You can choose other words that fit your use case.

  1. For Dataset name, enter twitterds.
  2. Choose Add calculated field.
  3. For Add name, enter Experience.
  4. Enter the following code:
ifelse(locate(toLower({Text}),"error")<>0,"BAD",locate(toLower({Text}),"problem")<>0,"BAD",locate(toLower({Text}),"expensive")<>0,"BAD","GOOD")
  5. Choose Save and then choose Save & visualize.

The Twitter Standard Search API returns data for 7 days only. For more information, see Supported Data Sources.

Set up a QuickSight dataset for Jira Cloud

QuickSight can connect to SaaS data sources, including Jira Cloud. To set up the Jira dataset, complete the following steps:

  1. To create an API token, open the Jira website within an authenticated browser.
  2. Choose Create API token.
  3. For Label, enter QuickSight.
  4. Choose Create.

  5. Create a new dataset in QuickSight and choose Jira as the data source.
  6. For Data source name, enter jirads.
  7. For Site base URL, enter the URL you use to access Jira.
  8. For Username, enter your Jira username.
  9. For API token or password, enter the token you created.
  10. Choose Create and choose the Issues table.
  11. Choose Select and then choose Visualize.

You’re redirected to a newly created QuickSight analysis.

Create a QuickSight overview analysis

We use the previously created analysis as a starting point for our overview analysis.

  1. Choose the Edit data icon (pencil).

  2. Choose Add dataset.
  3. Choose the Issues dataset and choose Select.
  4. Repeat the steps for the twitterds, cwlogs, and cwmetrics datasets.

You should see the four datasets added to the QuickSight visual.

In the navigation pane, you can find the available fields for the selected dataset and on the right, the visuals. Each visual is tied to a particular dataset. When the tutorial instructs you to create a new visual from a dataset, you must choose the dataset from the list first, then choose + Add.

  5. Choose the visual, then the options icon (three dots).
  6. Choose Delete.

  7. Update the QuickSight analysis name to Operational Dashboard.

Add a visual for the Twitter dataset

To add your first visual, complete the following steps:

  1. Add a new visual from the Twitter dataset.
  2. Change the visual type to KPI.
  3. Choose the field RetweetCount.
  4. Name the visual Bad experience retweet count.
  5. Use the resize button to set the size to a fourth of the screen width.

  6. In the navigation pane, choose Filter and then choose Create one.
  7. Choose the Created field and then choose the filter.
  8. Change the filter type to Relative dates and choose Hours.
  9. Choose Last N hours and enter 24 for Number of hours.
  10. Choose Apply.

This filter allows us to see only the most up-to-date information from the last 24 hours.

  11. Add a second filter and choose the Experience field.
  12. Leave BAD selected and choose Apply.

Now you only see information about Twitter customers with a bad experience.

Add a visual for CloudWatch metrics

Next, we add a new visual for our CloudWatch metrics.

  1. Choose Visualize and then choose the cwmetrics dataset.
  2. Add a new gauge chart.
  3. Choose Field wells to open the visual field configuration.
  4. Choose Value and change Aggregate to Average.

  5. Drag and drop the value field from the field list into Target value and change the aggregate to
  6. Name the visual System health status.

  7. Similar to what you did on the previous visual, add a filter to include only the last 24 hours based on the timestamp field.

Add a visual for CloudWatch error logs

Next, we create a visual for our CloudWatch logs dataset.

  1. Choose the cwlogs dataset and add a new KPI visual.
  2. Drag and drop the message field into the Value box.
  3. Name the visual Error log count.
  4. Create a filter using the level field, filtering by its list of values.
  5. Deselect all but ERRR.
  6. Choose Apply.

This filters only logs where there was an error found.

  7. Create a filter for the last 24 hours.

Add a visual for Jira issues

Now we add a visual for open Jira issues.

  1. Choose the Issues dataset.
  2. Create a KPI visual using the Id field for Value.
  3. Add a filter for issues of type bug.
    1. Use the IssueType_Name field and select only the records with value Error.
  4. Add a filter for issues open.
    1. Use the Status_Name field and select only the records with the value On-going or To-do.
  5. Add a filter for the last 24 hours using the Date_Created field.
  6. Name the filter Bug Issues open.
  7. Resize the visuals and organize them as needed.

Complete the overview dashboard

We're almost done with our overview dashboard.

  1. Create four new visuals as specified in the following table.
Dataset   | Visual type          | Field on X-axis                | Field on Value                | Field on Color
twitterds | Stacked line chart   | Created (aggregate: hour)      | RetweetCount (aggregate: sum) | (none)
cwmetrics | Stacked line chart   | timestamp (aggregate: hour)    | Value (aggregate: avg)        | (none)
cwlogs    | Stacked line chart   | datetime (aggregate: hour)     | message (aggregate: count)    | (none)
Issues    | Vertical stacked bar | Date_Created (aggregate: hour) | Id (aggregate: count)         | Status_Name
  2. Modify every filter you created so that it applies to all the visuals.

  3. Choose the Sheet 1 tab name twice to edit it.
  4. Enter Overview.
  5. Press Enter.

The overview analysis of our application health dashboard is complete.

QuickSight provides a drill-up and drill-down feature to view data at different levels of a hierarchy; the feature is added automatically for date fields. For more information, see Adding Drill-Downs to Visual Data in Amazon QuickSight.

In the previous step, you applied a drill-down when you changed the aggregation to hour.

Create a QuickSight detailed analysis

To create a detailed analysis, we create new tabs for CloudWatch logs, Tweets, and Jira issues.

Create a tab for CloudWatch logs

To create a tab to analyze our CloudWatch logs, complete the following steps:

  1. Choose the add icon next to the Overview tab and name it Logs.
  2. Create three visuals from the cwlogs dataset:
    1. Donut chart with the level field for Group/Color and message (count) for Value.
    2. Stacked combo bar with the datetime (aggregate: hour) field for X axis box, level (count) for Bars, and level for Group/Color for bars.
    3. Table with the fields level, message, and datetime for Value.

To improve the analysis of log data, format the level field based on its content using QuickSight conditional formatting: red for ERRR and green for DEBG.

  1. Choose the table visual and choose the visual options icon (three dots), then choose Conditional formatting.
  2. Choose the add icon and select the level field, then choose Add Text color.
  3. For Value, enter ERRR.
  4. For Color, choose red.
  5. Choose Add condition.
  6. For Value, enter DEBG.
  7. For Color, choose green.

  8. Choose Apply.
  9. Resize the visuals and update their titles as needed.

To enable data exploration, let’s set up an action on the Logs tab.

  1. Choose the Top Logs visual and choose Actions.
  2. Choose Filter same-sheet visuals.
  3. Choose ERRR on the donut chart.

The Latest Logs table now shows only the ERRR level rows.

  1. Create a filter for the last 24 hours applicable to all visuals from the dataset.

Create a tab for Tweets

To create a tab for Twitter analysis and add visuals, complete the following steps:

  1. Choose the add icon next to the Logs tab and name the new tab Tweets.
  2. Delete the visual added and create a new donut chart from the twitterds dataset.
  3. Choose the field Source and name the visual Twit Source.
  4. Create a new word cloud visual and choose the Username field.

With QuickSight, you can exclude visual elements to focus the analysis on certain data. If you see a big “Other” username on the word cloud, choose it and then choose Hide “other” categories.

  5. To narrow down the elements on the word cloud to the top 50, choose the mesh icon.
  6. Under the Group by panel, enter 50 for Number of words.
  7. Choose the mesh icon again and choose Allow vertical words.
  8. Name the visual Top 50 Users.

Let’s create a table with the Twitter details.

  1. Add a new table visual.
  2. Drag and drop the field Text into the Group by box and RetweetCount into Value.
  3. Name the visual Top Retweet.
  4. Resize the columns on the table using the headers border as needed.
  5. To sort the table from the top retweeted posts, choose the header of the field RetweetCount and choose the sort descending icon.

Let’s add a color and an icon based on number of retweets.

  1. Choose the configuration icon (three dots) and choose conditional formatting.
  2. Choose the RetweetCount field, then choose the add icon and choose the three bars icon set.

  3. Choose the Custom conditions option and enter the Value field as follows:
    1. Condition #1 – Value: 10000; color: red
    2. Condition #2 – Start Value: 2000; End Value: 10000; color: orange
    3. Condition #3 – Value: 2000; color: keep the default

Now you can see the field RetweetCount formatted with an icon and color based on the value.

Now we add the user location to the analysis.

  1. Add a new horizontal bar chart visual.
  2. Use the UserLocation field for Y axis and the RetweetCount as Value.
  3. Sort descending by RetweetCount.
  4. Choose the mesh icon to expand the Y-axis panel and enter 10 for Number of data points to show.
  5. If you see an empty country, choose it and choose Exclude empty.
  6. Name the visual Top 10 Locations.

To complete this tab of your analysis, resize the visuals and organize them as follows.

  1. Similarly, as you did before, choose every visual and add the action Filter same-sheet visuals, which allows you to explore your data.

For example, you can choose one location or source and the Top users and Retweet tables are filtered.

  1. Create a filter for the last 24 hours applicable to all visuals from the dataset.

Create a tab for Jira issues

Finally, we create a tab for Jira issue analysis.

  1. Choose the add icon next to the Tweets tab and name the new tab Issues.
  2. Delete the visual created and create a new horizontal stacked 100% bar chart visual from the Issues dataset.
  3. Drag and drop the fields as follows (because this dataset has many fields, you can find them using the Field list search bar):
    1. Y-axis: Status_Name
    2. Value: Id (count)
    3. Group/Color: Assigne_DisplayName

This visual shows you how issues have progressed among assignees.

  4. Add a new area line chart visual with the field Date_Updated for X axis and TimeEstimate for Value.
  5. Add another word cloud visual to find out who the top issue reporters are; use Reporter_DisplayName for Group by and Id (count) for Size.
  6. The last visual you add for this tab is a table; include all the necessary fields in the Value box to be able to investigate. We suggest you include Id, Key, Summary, Votes, WatchCount, Priority, and Reporter_DisplayName.
  7. Resize and rearrange the visuals as needed.

  1. As you did before, choose every visual and add the action Filter same-sheet visuals, which allows you to explore your data.

For example, you can choose one reporter display name or a status and the other visuals are filtered.

  1. Create the filter for the last 24 hours applicable to all visuals from the dataset.

Publish your QuickSight dashboard

The analysis that you’ve been working on is automatically saved. To enable other people to view your findings with read-only capabilities, publish your analysis as a dashboard.

  1. Choose Share and choose Publish dashboard.
  2. Enter a name for your dashboard, such as holistic health status, and choose Publish dashboard.
  3. Optionally, select a person or group to share the dashboard with by entering a name in the search bar.
  4. Choose Share.

Your dashboard is now published and ready to use. You can easily correlate errors in application logs with posts on Twitter and availability data from your website, and quickly identify which errors are already being addressed based on open Jira bug issues.

By default, this dashboard can only be accessed by you, but you can share your dashboard with other people in your QuickSight account.

When you created the QuickSight datasets for Twitter and Jira, the data was automatically imported into SPICE, accelerating query performance. You can also set up SPICE for other dataset types. Remember that data imported into SPICE must be refreshed to stay current.

Conclusion

In this post, you created a dashboard with a holistic view of your workload health status, including application logs, issue tracking on Jira, social media comments on Twitter, and monitoring data from CloudWatch Synthetics. To expand on this solution, you can include data from Amazon CloudFront logs or Application Load Balancer access logs so you can have a complete view of your application. Also, you could easily embed your dashboard into a custom application.

You can also use machine learning to discover hidden data trends, saving hours of manual analysis with QuickSight ML Insights, or use QuickSight Q to power data discovery using natural language questions on your dashboards. Both features are ready to use in QuickSight without machine learning experience required.


About the author

Luis Gerardo Baeza is an Amazon Web Services solutions architect with 10 years of experience in business process transformation, enterprise architecture, agile methodologies adoption, and cloud technologies integration. Luis has worked with education, healthcare, and financial companies in México and Chile.

Building a Showback Dashboard for Cost Visibility with Serverless Architectures

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-showback-dashboard-for-cost-visibility-with-serverless-architectures/

Enterprises with centralized IT organizations and multiple lines of business frequently use showback or chargeback mechanisms to hold their departments accountable for their technology usage and costs. Chargeback involves actually billing a department for the cost of their division's usage. Showback focuses on visibility to make the department more cost conscious and to encourage operational efficiency.

Building a showback mechanism can be potentially challenging for business and financial analysts of an AWS Organization. You may not have the scripting or data engineering skills needed to coordinate workflows and build reports at scale. Although you can use AWS Cost Explorer as a starting point, you may want greater customizability, larger datasets beyond a one-year period, and more of a business intelligence (BI) experience.

In this post, we discuss the benefits of building a showback dashboard using the AWS Cost and Usage Report (AWS CUR). You can track costs by cost center, business unit, or project using managed services. Using a showback strategy, you can consolidate and present costs to a business unit to show resource use over a set period of time for your entire AWS Organization. Building this solution with managed services allows you to spend time understanding your costs rather than maintaining the underlying infrastructure.

This solution highlights AWS Glue DataBrew to prepare your data into the appropriate format for dashboards. We recommend DataBrew because it provides a no-code environment for data transformation. It allows anyone to create dashboards similar to those built in the Cloud Intelligence Dashboards Workshop for your Organization.

Figure 1. QuickSight showback dashboard using CUR data transformed by Glue DataBrew and leveraging QuickSight insights

Tags for cost allocation

The success of your showback dashboard partially depends on your cost allocation tagging strategy. Typically, customers use business tags such as cost center, business unit, or project to associate AWS costs with traditional financial reporting dimensions within their organization.

The CUR supports the ability to break down AWS costs by tag. For example, if a group of resources are labeled with the same tag, you’ll be able to see the total cost and usage of that group of resources. Read more about Tagging Best Practices to develop a tagging strategy for your organization.

A serverless data workflow for showback dashboards

You can build showback dashboards with managed services such as Amazon QuickSight, without the need to write any code or manage any servers.

Figure 2. A serverless architecture representing data workflow

AWS automatically delivers the data you need for showback dashboards through the CUR. Once this data arrives in an Amazon Simple Storage Service (S3) bucket, you can transform the data without the need to write any code by using DataBrew. You can also automatically identify the data schema, and catalog the data’s properties to run queries using Amazon Athena. Lastly, you can visualize the results by publishing and sharing dashboards to key stakeholders within your organization using Amazon QuickSight.

The key benefits of this approach are:

  • Automatic data delivery
  • No-code data transformation
  • Automatic cataloging and querying
  • Serverless data visualization

Let’s take a look at each in more detail.

Automatic data delivery

The CUR is the source for historical cost and usage data. The CUR provides the most comprehensive set of cost and usage data available and will include your defined cost allocation tags for your entire Organization. You configure CUR to deliver your billing data to an Amazon S3 bucket at the payer account level. This will consolidate data for all linked accounts. After delivery starts, Amazon updates the CUR files at least once a day.
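
If you haven't created a CUR yet, you can define one programmatically as well as through the console. The following Boto3 sketch is a minimal example; the report name, bucket, and prefix are placeholders, the bucket needs a policy that allows CUR delivery, and the Cost and Usage Report API is only available in the us-east-1 Region.

import boto3

cur = boto3.client("cur", region_name="us-east-1")  # the CUR API endpoint lives in us-east-1

cur.put_report_definition(
    ReportDefinition={
        "ReportName": "showback-cur",               # placeholder
        "TimeUnit": "DAILY",
        "Format": "Parquet",
        "Compression": "Parquet",
        "AdditionalSchemaElements": ["RESOURCES"],  # include resource IDs for tag-level analysis
        "S3Bucket": "DOC-EXAMPLE-CUR-BUCKET",       # placeholder bucket in the payer account
        "S3Prefix": "cur",
        "S3Region": "us-east-1",
        "ReportVersioning": "OVERWRITE_REPORT",
        "RefreshClosedReports": True,
    }
)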

No-code data transformation

You can use DataBrew to transform the data in the Amazon S3 bucket aggregating cost and usage according to your tagging strategy. DataBrew summarizes your data for discovery. You can also run transformations called “jobs” in DataBrew without writing any code, using over 250 built-in transforms. Figures 3 through 5 show several job examples.

Figure 3. DataBrew recipe action: rename column

Figure 4. DataBrew recipe action: Create column from function

Figure 5. DataBrew recipe action: fill missing values

For a full list of columns available in CUR, review the CUR Data Dictionary. Following is a list of relevant columns for an executive summary showback dashboard:

  • bill_billing_period_start_date
  • line_item_usage_account_id
  • line_item_line_item_type
  • product_product_name
  • product_product_family
  • product_product_transfer_type
  • savings_plan_savings_plan_effective_cost
  • reservation_effective_cost
  • line_item_unblended_cost

Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.
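
As an illustration, the following Boto3 sketch attaches a 12-hour schedule to an existing DataBrew job; the job and schedule names are placeholders for the resources you create in your account.

import boto3

databrew = boto3.client("databrew")

databrew.create_schedule(
    Name="cur-aggregation-every-12-hours",       # placeholder schedule name
    JobNames=["cur-showback-aggregation-job"],   # placeholder: your DataBrew recipe job
    # Run at minute 0 of hours 0 and 12 (UTC) every day
    CronExpression="cron(0 0/12 * * ? *)",
)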

Automatic cataloging and querying

You can use a Glue crawler to automatically classify your data to determine the data’s format, schema, and associated properties. The crawlers write metadata to an AWS Glue Data Catalog to help data users find the data they need.

With the results in Amazon S3 and the metadata in the Glue Data Catalog, you can run standard SQL queries with Athena. This will help you make more informed business decisions by tracking financial metrics and optimizing costs. This is done directly in Amazon S3 without having to move data around. Using standard SQL, you can create views that aggregate cost and usage by your defined tags.
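
For example, a view like the following aggregates unblended cost by billing period, account, and a cost allocation tag, and the sketch submits it through the Athena API. The database, table, tag column (here resource_tags_user_cost_center), and query results location are assumptions; adjust them to match your own CUR table.

import boto3

athena = boto3.client("athena")

create_view_sql = """
CREATE OR REPLACE VIEW showback_by_cost_center AS
SELECT
    bill_billing_period_start_date,
    line_item_usage_account_id,
    resource_tags_user_cost_center,        -- assumed cost allocation tag column
    SUM(line_item_unblended_cost) AS unblended_cost
FROM cur_database.cur_table                -- assumed Glue database and table for the CUR
GROUP BY 1, 2, 3
"""

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "cur_database"},                          # assumed database
    ResultConfiguration={"OutputLocation": "s3://DOC-EXAMPLE-ATHENA-RESULTS/"},  # placeholder
)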

Serverless data visualization

You can use Amazon QuickSight to create and share dashboards with your teams for cost visibility. QuickSight provides native integration with Athena and S3, and lets you easily create and publish interactive BI dashboards that include ML-powered insights. When building a showback dashboard such as the example in Figure 1, QuickSight authors create visuals and publish interactive dashboards.

Readers log in using your preferred authentication mechanism to view the shared dashboard. You can then filter data based on billing periods, account number, or cost allocation tags. You can also drill down to details using a web browser or mobile app.

Conclusion

In this blog, we’ve discussed designing and building a data transformation process and a showback dashboard. This gives you highly granular cost visualization without having to provision and manage any servers. You can use managed services such as AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight to crawl, catalog, analyze, and visualize your data.

We recommend defining your organization tagging strategy to be able to view costs by tags. You can then get started by creating Cost and Usage Reports. With the data in Amazon S3, you can use the services described in this post to transform the data that works for your business. Additionally, you can get started today by experimenting with the Cloud Intelligence Dashboards Workshop. This workshop provides examples of visualizations that you can build using native AWS services on top of your Cost and Usage Report. You will be able to get cost, usage, and operational insights about your AWS Cloud usage.

Enhancing Existing Building Systems with AWS IoT Services

Post Syndicated from Lewis Taylor original https://aws.amazon.com/blogs/architecture/enhancing-existing-building-systems-with-aws-iot-services/

With the introduction of cloud technology and by extension the rapid emergence of Internet of Things (IoT), the barrier to entry for creating smart building solutions has never been lower. These solutions offer commercial real estate customers potential cost savings and the ability to enhance their tenants’ experience. You can differentiate your business from competitors by offering new amenities and add new sources of revenue by understanding more about your buildings’ operations.

There are several building management systems to consider in commercial buildings, such as air conditioning, fire, elevator, security, and grey/white water. Each system continues to add more features and become more automated, meaning that control mechanisms use all kinds of standards and protocols. This has led to fragmented building systems and inefficiency.

In this blog, we’ll show you how to use AWS for the Edge to bring these systems into one data path for cloud processing. You’ll learn how to use AWS IoT services to review and use this data to build smart building functions. Some common use cases include:

  • Provide building facility teams a holistic view of building status and performance, alerting them to problems sooner and helping them solve problems faster.
  • Provide a detailed record of the efficiency and usage of the building over time.
  • Use historical building data to help optimize building operations and predict maintenance needs.
  • Offer enriched tenant engagement through services like building control and personalized experiences.
  • Allow building owners to gather granular usage data from multiple buildings so they can react to changing usage patterns in a single platform.

Securely connecting building devices to AWS IoT Core

AWS IoT Core supports connections with building devices, wireless gateways, applications, and services. Devices connect to AWS IoT Core to send and receive data from AWS IoT Core services and other devices. Buildings often use different device types, and AWS IoT Core has multiple options to ingest data and enable connectivity within your building. AWS IoT Core is made up of the following components:

  • Device Gateway is the entry point for all devices. It manages your device connections and supports HTTPS and MQTT (3.1.1) protocols.
  • Message Broker is an elastic and fully managed pub/sub message broker that securely transmits messages (for example, device telemetry data) to and from all your building devices.
  • Registry is a database of all your devices and associated attributes and metadata. It allows you to group devices and services based upon attributes such as building, software version, vendor, class, floor, etc.

The architecture in Figure 1 shows how building devices can connect into AWS IoT Core. AWS IoT Core supports multiple connectivity options:

  • Native MQTT – Many building management systems and device controllers support MQTT out of the box.
  • AWS IoT Device SDK – This option supports MQTT protocol and multiple programming languages.
  • AWS IoT Greengrass – The previous options assume that devices are connected to the internet, but this isn’t always possible. AWS IoT Greengrass extends the cloud to the building’s edge. Devices can connect directly to AWS IoT Greengrass and send telemetry to AWS IoT Core.
  • AWS for the Edge partner products – There are several partner solutions, such as Ignition Edge from Inductive Automation, that offer protocol translation software to normalize in-building sensor data.

Figure 1. Data ingestion options from on-premises devices to AWS

Challenges when connecting buildings to the cloud

There are two common challenges when connecting building devices to the cloud:

  • You need a flexible platform to aggregate building device communication data
  • You need to transform the building data to a standard protocol, such as MQTT

Building data is made up of various protocols and formats. Many of these are system-specific or legacy protocols. To overcome this, we suggest processing building device data at the edge, extracting important data points/values before transforming to MQTT, and then sending the data to the cloud.

Transforming protocols can be complex because they can abstract naming and operation types. AWS IoT Greengrass and partner products such as Ignition Edge make it possible to read that data, normalize the naming, and extract useful information for device operation. Combined with AWS IoT Greengrass, this gives you a single way to validate the building device data and standardize its processing.
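
As a simple illustration of what a normalized message can look like once it reaches the cloud side, the following Boto3 sketch publishes a JSON telemetry payload to an AWS IoT Core MQTT topic from a gateway application. The topic hierarchy and payload fields are assumptions, not a prescribed schema.

import json
import boto3

# Look up the account's data endpoint, then create the data-plane client
iot = boto3.client("iot")
endpoint = iot.describe_endpoint(endpointType="iot:Data-ATS")["endpointAddress"]
iot_data = boto3.client("iot-data", endpoint_url=f"https://{endpoint}")

# Assumed topic hierarchy and payload shape for a normalized HVAC reading
iot_data.publish(
    topic="buildings/hq/floor3/hvac/zone12/telemetry",
    qos=1,
    payload=json.dumps({
        "deviceId": "hvac-zone12",
        "temperatureCelsius": 21.4,
        "setpointCelsius": 21.0,
        "fanStatus": "ON",
        "timestamp": "2021-09-01T12:00:00Z",
    }),
)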

Using building data to develop smart building solutions

The architecture in Figure 2 shows an in-building lighting system. It is connected to AWS IoT Core and reports on devices’ status and gives users control over connected lights.

The architecture in Figure 2 has two data paths, which we’ll provide details on in the following sections, but here’s a summary:

  1. The “cold” path gathers all incoming data for batch data analysis and historical dashboarding.
  2. The “warm” bidirectional path is for faster, real-time data. It gathers devices’ current state data. This path is used by end-user applications for sending control messages, real-time reporting, or initiating alarms.

Figure 2. Architecture diagram of a building lighting system connected to AWS IoT Core

Cold data path

The cold data path gathers all lighting device telemetry data, such as power consumption, operating temperature, health data, etc. to help you understand how the lighting system is functioning.

Building devices can often deliver unstructured, inconsistent, and large volumes of data. AWS IoT Analytics helps clean up this data by applying filters, transformations, and enrichment from other data sources before storing it. By using Amazon Simple Storage Service (Amazon S3), you can analyze your data in different ways. Here we use Amazon Athena and Amazon QuickSight for building operational dashboard visualizations.

Let’s discuss a real-world example. For building lighting systems, understanding your energy consumption is important for evaluating energy and cost efficiency. Data ingested into AWS IoT Core can be stored long term in Amazon S3, making it available for historical reporting. Athena and QuickSight can quickly query this data and build visualizations that show lighting state (on or off) and annual energy consumption over a set period of time. You can also overlay this data with sunrise and sunset data to provide insight into whether you are using your lighting systems efficiently. For example, adjusting the lighting schedule accordingly to the darker winter months versus the brighter summer months.

Warm data path

In the warm data path, the AWS IoT Device Shadow service makes the device state available. Shadow updates are forwarded by an AWS IoT rule into downstream services such as AWS IoT Events, which tracks and monitors multiple devices and data points and then initiates actions based on specific events. Further, you could build APIs that interact with AWS IoT Device Shadow. In this architecture, we have used AWS AppSync and AWS Lambda to enable building controls via a tenant smartphone application.
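
For example, a backend behind AWS AppSync or Lambda could control a light by writing its desired state to the device shadow and reading the reported state back. The following Boto3 sketch is illustrative only; the thing name and state document are assumptions.

import json
import boto3

iot = boto3.client("iot")
endpoint = iot.describe_endpoint(endpointType="iot:Data-ATS")["endpointAddress"]
iot_data = boto3.client("iot-data", endpoint_url=f"https://{endpoint}")

thing_name = "meeting-room-light-7"  # assumed thing name

# Request that the light turn on and dim to 40% by setting the desired state
iot_data.update_thing_shadow(
    thingName=thing_name,
    payload=json.dumps({"state": {"desired": {"power": "ON", "brightness": 40}}}),
)

# Read the shadow back to compare desired versus reported state
shadow = json.loads(iot_data.get_thing_shadow(thingName=thing_name)["payload"].read())
print(shadow["state"])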

Let’s discuss a real-world example. In an office meeting room lighting system, maintaining a certain brightness level is important for health and safety. If that space is unoccupied, you can save money by turning the lighting down or off. AWS IoT Events can take inputs from lumen sensors, lighting systems, and motorized blinds and put them into a detector model. This model calculates and prompts the best action to maintain the room’s brightness throughout the day. If the lumen level drops below a specific brightness threshold in a room, AWS IoT Events could prompt an action to maintain an optimal brightness level in the room. If an occupancy sensor is added to the room, the model can know if someone is in the room and maintain the lighting state. If that person leaves, it will turn off that lighting. The ongoing calculation of state can also evaluate the time of day or weather conditions. It would then select the most economical option for the room, such as opening the window blinds rather than turning on the lighting system.

Conclusion

In this blog, we demonstrated how to collect and aggregate the data produced by on-premises building management platforms. We discussed how augmenting this data with the AWS IoT Core platform allows for the development of smart building solutions such as building automation and operational dashboarding. AWS products and services can enable your buildings to be more efficient while also providing engaging tenant experiences. For more information on how to get started, please check out our getting started with AWS IoT Core developer guide.

Building a Cloud-based OLAP Cube and ETL Architecture with AWS Managed Services

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-cloud-based-olap-cube-and-etl-architecture-with-aws-managed-services/

For decades, enterprises used online analytical processing (OLAP) workloads to answer complex questions about their business by filtering and aggregating their data. These complex queries were compute and memory-intensive. This required teams to build and maintain complex extract, transform, and load (ETL) pipelines to model and organize data, oftentimes with commercial-grade analytics tools.

In this post, we discuss building a cloud-based OLAP cube and ETL architecture that will yield faster results at lower costs without sacrificing performance by:

  • Connecting your on-premises database to the cloud for data profiling, discovery, and transformation
  • Running OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate data
  • Using AWS Glue Data Catalog, Amazon Athena, Amazon QuickSight, and Amazon SageMaker to catalog and visualize data with machine learning (ML)

Data analytics pipeline with AWS Managed Services

The proposed architecture in Figure 1 relies on AWS Managed Services. AWS Glue DataBrew is a no-code data transformation service that you can use to quickly build your transformation jobs. AWS Glue crawlers collect metadata from the transformed data and catalogs it for analytics and visualization using Athena and QuickSight. SageMaker will build, train, and deploy ML models.

This architecture will help you get answers from your data to your users as fast as possible without needing to migrate your data to AWS. There is no coding required, so you can leverage data transformation, cataloging, analytics, and ML quickly.

Figure 1. Example architecture using AWS Managed Services

Benefits of AWS Managed Services for data analytics

Immediate connectivity to on-premises databases

The example architecture in Figure 1 begins with an online transaction processing (OLTP) database running in your corporate data center. Figure 2 shows how you can establish a Java database connectivity (JDBC) connection from the OLTP database to DataBrew running in AWS to run OLAP workloads. DataBrew supports data sources using JDBC for common data stores such as Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.
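
In practice, the JDBC connection is usually defined as an AWS Glue connection, and a DataBrew dataset then points at a source table through it. The following Boto3 sketch is a minimal illustration; the connection name, table name, and staging bucket are assumptions for your environment.

import boto3

databrew = boto3.client("databrew")

databrew.create_dataset(
    Name="oltp-sales-orders",                                     # placeholder dataset name
    Input={
        "DatabaseInputDefinition": {
            "GlueConnectionName": "onprem-oltp-jdbc-connection",  # assumed Glue JDBC connection
            "DatabaseTableName": "sales.orders",                  # assumed source table
            # DataBrew stages intermediate data in S3 when reading over JDBC
            "TempDirectory": {"Bucket": "DOC-EXAMPLE-DATABREW-STAGING"},
        }
    },
)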

Figure 2. DataBrew – JDBC connection to data source

Automatic data discovery

Figures 3 through 6 show how DataBrew summarizes your data for discovery. You can profile your data to understand patterns and detect anomalies. You can also run transformations called “jobs” in DataBrew without writing any code using over 250 built-in transforms.

Figure 3. DataBrew – dataset profiling overview

 

Figure 4. DataBrew – data correlation patterns

 

Figure 5. DataBrew – data points distribution

No-code data transformation and cataloging

To run OLAP-type transactions, you can create jobs based on the transformation steps shown in Figure 6. These steps collectively are referred to as DataBrew recipes. A recipe can be run as a job, with the results written to an Amazon Simple Storage Service (Amazon S3) bucket.

Figure 6. A DataBrew project user interface view with sample data and transformation functions

Scheduled DataBrew jobs act similarly to scheduled ETL pipelines in OLAP. Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours). This can be run at a particular time of day, or as defined by a valid CRON expression. This helps you automate your transformation workflows.

The OLAP catalog is a set of metadata that sits between the actual OLAP data stored and applications. To create a Data Catalog, you can use AWS Glue crawlers to automatically classify your data to determine the data's format, schema, and associated properties. Figure 7 shows a crawler's results written to the Data Catalog as metadata, which helps data users find the data they need.

Figure 7. AWS Glue crawler metadata table output of column names and data types

Data analytics without third-party software licenses

You can run analytics on your data by referring to the metadata definitions in the Data Catalog as references to the actual data in Amazon S3 using Athena. Athena is well suited for running one-time queries using standard SQL to query the transformed data directly in Amazon S3 without having to move data around. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Enterprises often supplement their OLAP workloads with separate visualization and business intelligence (BI) tools. These tools often come with their own licensing, server management, and security considerations.

You can visualize curated data using QuickSight, a scalable, serverless, embeddable, ML-powered BI service. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights, as shown in Figure 8. These dashboards can be shared with other users and embedded within your own applications.

Figure 8. A sample of data visualization options with Amazon QuickSight

Finally, you can incorporate ML workloads to OLAP workloads using SageMaker. In the past, ML workloads were often expensive, resource-intensive, and inaccessible. SageMaker provides a fully managed ML service to quickly and easily build and train ML models and directly deploy them into a production-ready hosted environment.

Conclusion

In this post, we show you how to connect your on-premises database using a JDBC connection to DataBrew for data profiling, discovery, and transformation. We looked at how you can use DataBrew recipes and jobs to run OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate any data. We also looked at AWS capabilities in data cataloging, visualization, and machine learning using Data Catalog, Athena, QuickSight, and SageMaker without having to manage any servers.

Laying the foundation to modernize an analytics workflow is critical for many enterprises that are looking to reduce the time it takes to understand their business. With AWS, you can perform enterprise-scale analytics with our portfolio of analytics services.

 

DOCOMO empowers business units with self-service knowledge access thanks to agile AWS QuickSight business intelligence

Post Syndicated from Daiki Itoh original https://aws.amazon.com/blogs/big-data/docomo-empowers-business-units-with-self-service-knowledge-access-thanks-to-agile-aws-quicksight-business-intelligence/

NTT DOCOMO is the largest telecom company in Japan. It provides innovative, convenient, and secure mobile services that enable customers to realize smarter lives. More than 73 million customers in Japan connect through its advanced wireless networks, including a nationwide LTE network and one of the world’s most progressive LTE Advanced networks. In addition to a wide range of communications-related services, DOCOMO offers value-added Smart Life offerings, including those under its +d initiative. These include the d POINT CLUB customer loyalty point program and d Payment, which enables customers to shop online and make electronic payments using their mobile devices.

All of these services create tremendous amounts of data, providing an opportunity for the company to extract insights that drive business value. To accomplish this goal, the company uses Amazon QuickSight and AWS data technologies to help better understand customers and support sales teams.

Many products, one data team

The company’s data team manages the marketing platform, which includes capturing, analyzing, and reporting on data for DOCOMO Smart Life businesses. For data collection and aggregation, it uses Amazon Redshift as a data warehouse. Amazon Redshift is ideal for storing and querying large amounts of structured data with high performance and reliability.

“With millions of connected customers, we need a highly capable data platform,” says Issei Nishimura, Manager, Marketing Platform Planning Department at DOCOMO. “AWS delivers the right levels of performance.”

In addition to regular reporting from the data warehouse, the company’s business leadership is interested in real-time key performance indicators (KPIs). For d Payment, these include metrics such as active users on a monthly and daily basis. Based on these analyses, leadership can decide how to improve the usage or the sales of each service.

Helping business users access analytics

However, when non-technical decision-makers requested self-service access, the data team had no easy way to provide it—until it decided to adopt QuickSight. QuickSight is a fast, cloud-powered business intelligence service that was easy to deploy and required no on-premises infrastructure.

“Because of native integration with existing AWS services, especially Amazon Redshift, we were able to roll out Amazon QuickSight quickly and easily,” Nishimura says. “In fact, it only took one day to build our first QuickSight dashboards.”

The automated data pipeline starts with Amazon Elastic Compute Cloud (Amazon EC2) to perform extract, transform, and load (ETL) on data, which is then pushed to Amazon Redshift. Amazon SageMaker aggregates and exports it to Amazon Simple Storage Service (Amazon S3), from which QuickSight accesses data for dashboards.

The following is a sample dashboard from NTT DOCOMO. For every marketing campaign, NTT DOCOMO analyzes the number of new unique users of d Payment relative to the number of registrations to the campaign. This allows them to understand how much effect the campaign had on each user category.

With pay-per-session pricing, the company can provide ad hoc data access for line of business decision-makers without capital investment and at low total cost. At the same time, QuickSight can scale to support as many users as needed.

The dashboards can be accessed from any device, providing convenience to the product management teams. It’s easy and intuitive enough for non-technical users—there’s no need for them to write SQL or scripts.

Faster insights to compete in an accelerated marketplace

Previously, it would take a few days to meet requests for ad hoc reports. Now, when people want to check a KPI, they can do it instantly.

“Our team can focus on managing the data warehouse and the regular reporting cadence because the volume of out-of-band requests has been reduced,” Nishimura says.

The solution has had immediate benefits for d Payment sales representatives, who work closely with retailers that use the payment service. These sales representatives want to be able to present relevant KPIs and demographics to the retailers to show trends and improve the services. With QuickSight, the sales team can generate appealing, up-to-date visualizations of the relevant information. They no longer have to spend time building graphics because the QuickSight visualizations are ready to use right away.

Summary

DOCOMO is a data-driven company, using insights to continuously improve its services. AWS enables them to run an enterprise data warehouse that ingests millions of data points with unlimited scale—and provides services such as QuickSight that give non-technical users rapid access to the real-time information they need.

“With these solutions, DOCOMO is breaking down barriers to enable greater use of analytics across the organization,” Nishimura says.


About the Authors

Daiki Itoh is a Business Development Manager for Amazon QuickSight in Japan.

Chirag Dhull is a Principal Product Marketing Manager for Amazon QuickSight.

Build a data quality score card using AWS Glue DataBrew, Amazon Athena, and Amazon QuickSight

Post Syndicated from Nitin Aggarwal original https://aws.amazon.com/blogs/big-data/build-a-data-quality-score-card-using-aws-glue-databrew-amazon-athena-and-amazon-quicksight/

Data quality plays an important role when building an extract, transform, and load (ETL) pipeline for sending data to downstream analytical applications and machine learning (ML) models. The adage "garbage in, garbage out" aptly describes why it's important to filter out bad data before further processing. Continuously monitoring data quality and comparing it with predefined target metrics helps you comply with your governance frameworks.

In November 2020, AWS announced the general availability of AWS Glue DataBrew, a new visual data preparation tool that helps you clean and normalize data without writing code. This reduces the time it takes to prepare data for analytics and ML by up to 80% compared to traditional approaches to data preparation.

In this post, we walk through a solution in which we apply various business rules to determine the quality of incoming data and separate good and bad records. Furthermore, we publish a data quality score card using Amazon QuickSight and make records available for further analysis.

Use case overview

For our use case, we use a public dataset that is available for download at Synthetic Patient Records with COVID-19. It contains 100,000 synthetic patient records in CSV format. Data hosted within SyntheticMass has been generated by Synthea™, an open-source patient population simulation made available by The MITRE Corporation.

When we unzip the 100k_synthea_covid19_csv.zip file, we see the following CSV files:

  • Allergies.csv
  • Careplans.csv
  • Conditions.csv
  • Devices.csv
  • Encounters.csv
  • Imaging_studies.csv
  • Immunizations.csv
  • Medications.csv
  • Observations.csv
  • Organizations.csv
  • Patients.csv
  • Payer_transitions.csv
  • Payers.csv
  • Procedures.csv
  • Providers.csv
  • Supplies.csv

We perform the data quality checks categorized by the following data quality dimensions:

  • Completeness
  • Consistency
  • Integrity

For our use case, these CSV files are maintained by your organization's data ingestion team, which uploads the updated CSV files to Amazon Simple Storage Service (Amazon S3) every week. The good and bad records are separated through a series of data preparation steps, and the business team uses the output data to create business intelligence (BI) reports.

Architecture overview

The following architecture uses DataBrew for data preparation and building KPIs, Amazon Athena for data analysis with standard SQL, and QuickSight for building the data quality score card.

The workflow includes the following steps:

  1. The ingestion team receives CSV files in an S3 input bucket every week.
  2. The DataBrew job scheduled to run every week triggers the recipe job.
  3. DataBrew processes the input files and generates output files that contain additional fields depending on the recipe job logic.
  4. After the output data is written, we create an external table on top of it by creating and running an AWS Glue crawler.
  5. The good and bad records are separated by creating views on top of the external table.
  6. Data analysts can use Athena to analyze good and bad records.
  7. The records can also be separated directly using QuickSight calculated fields.
  8. We use QuickSight to create the data quality score card in the form of a dashboard, which fetches data through Athena.

Prerequisites

Before beginning this tutorial, make sure you have the required permissions to create the resources required as part of the solution.

Additionally, create the S3 input and output buckets to capture the data, and upload the input data into the input bucket.

Create DataBrew datasets

To create a DataBrew dataset for the patient data, complete the following steps:

  1. On the DataBrew console, choose Datasets.
  2. Choose Connect new dataset.
  3. For Dataset name, enter a name (for this post, Patients).
  4. For Enter your source from S3, enter the S3 path of the patients input CSV.
  5. Choose Create Dataset.

Repeat these steps to create datasets for other CSV files, such as encounters, conditions, and so on.

Create a DataBrew project

To create a DataBrew project for the patient data, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create a project.
  3. For Project name, enter a name (for this post, patients-data-quality).
  4. For Select a dataset, select My datasets.
  5. Select the patients dataset.
  6. Under Permissions, for Role name, choose an AWS Identity and Access Management (IAM) role that allows DataBrew to read from your Amazon S3 input location.

You can choose a role if you already created one, or create a new one. For more information, see Adding an IAM role with data resource permissions.

  1. Wait till the dataset is loaded (about 1–2 minutes).
  2. To make a consistency check, choose Birthdate.
  3. On the Create menu, choose Flag column.
  4. Under Create column, for Values to flag, select Custom value.
  5. For Source column, choose BIRTHDATE.
  6. For Values to flag, enter the regular expression (?:(?:18|19|20)[0-9]{2}).
  7. For Flag values as, choose Yes or no.
  8. For Destination column, enter BIRTHDATE_flagged.

The new column BIRTHDATE_FLAGGED now displays Yes for a valid four-digit year within BIRTHDATE.
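If you want to sanity-check this regular expression before applying it in DataBrew, you can reproduce the flag logic locally. The following is a minimal Python sketch; the sample BIRTHDATE values are made up for illustration.

import re

# Same pattern used in the flag transform: a four-digit year from 1800 to 2099
pattern = re.compile(r"(?:(?:18|19|20)[0-9]{2})")

# Hypothetical BIRTHDATE values for illustration only
for birthdate in ["1987-04-12", "2003-11-30", "0420-01-01", ""]:
    flagged = "Yes" if pattern.search(birthdate) else "No"
    print(f"{birthdate or '<missing>'} -> {flagged}")

Any value without a plausible four-digit year is flagged No, which is what the consistency check relies on.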

  1. To create a completeness check, repeat the preceding steps to create a DRIVERS_FLAGGED column by choosing the DRIVERS column to mark missing values.
  2. To create an integrity check, choose the JOIN transformation.
  3. Choose the encounters dataset and choose Next.
  4. For Select join type, select Left join.
  5. For Join keys, choose Id for Table A and Patient for Table B.
  6. Under Column list, unselect all columns from Table B except for Patient.
  7. Choose Finish.
  8. Choose the Patient column and create another flag column PATIENTS_FLAG to mark missing values from the Patient column.

For our use case, we created three new columns to demonstrate data quality checks for data quality dimensions in scope (consistency, completeness, and integrity), but you can integrate additional transformations on the same or additional columns as needed.

  1. After you finish applying all your transformations, choose Publish on the recipe.
  2. Enter a description of the recipe version and choose Publish.

Create a DataBrew job

Now that our recipe is ready, we can create a job for it, which runs the recipe against the dataset.

  1. On the DataBrew console, choose Jobs.
  2. Choose Create a job.
  3. For Job name, enter a name (for example, patient-data-quality).

Your recipe is already linked to the job.

  1. Under Job output settings, for File type, choose your final storage format (for this post, we choose CSV).
  2. For S3 location, enter your final S3 output bucket path.
  3. For Compression, choose the compression type you want to apply (for this post, we choose None).
  4. For File output storage, select Replace output files for each job run.

We choose this option because our use case is to publish a data quality score card for every new set of data files.

  1. Under Permissions, for Role name, choose your IAM role.
  2. Choose Create and run job.

Create an Athena table

If you’re familiar with Apache Hive, you may find creating tables on Athena to be familiar. You can create tables by writing the DDL statement on the query editor, or by using the wizard or JDBC driver. To use the query editor, enter the following DDL statement to create a table:

CREATE EXTERNAL TABLE `blog_output`(
  `id` string, 
  `birthdate` string, 
  `birthdate_flagged` string, 
  `deathdate` string, 
  `ssn` string, 
  `drivers` string, 
  `drivers_flagged` string, 
  `passport` string, 
  `prefix` string, 
  `first` string, 
  `last` string, 
  `suffix` string, 
  `maiden` string, 
  `marital` string, 
  `race` string, 
  `ethnicity` string, 
  `gender` string, 
  `birthplace` string, 
  `address` string, 
  `city` string, 
  `state` string, 
  `county` string, 
  `zip` bigint, 
  `lat` double, 
  `lon` double, 
  `healthcare_expenses` double, 
  `healthcare_coverage` double, 
  `patient` string, 
  `patient_flagged` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<your-bucket>/blog_output/';

Let’s validate the table output in Athena by running a simple SELECT query. The following screenshot shows the output.

Create views to filter good and bad records (optional)

To create a good records view, enter the following code:

CREATE OR REPLACE VIEW good_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'Yes' AND
drivers_flagged = 'No' AND
patient_flagged = 'No'

To create a bad records view, enter the following code:

CREATE OR REPLACE VIEW bad_records AS
SELECT * FROM "databrew_blog"."blog_output"
where 
birthdate_flagged = 'No' OR
drivers_flagged = 'Yes' OR 
patient_flagged = 'Yes'

Now you have the ability to query the good and bad records in Athena using these views.

Create a score card using QuickSight

Now let’s complete our final step of the architecture, which is creating a data quality score card through QuickSight by connecting to the Athena table.

  1. On the QuickSight console, choose Athena as your data source.
  2. For Data source name, enter a name.
  3. Choose Create data source.
  4. Choose your catalog and database.
  5. Select the table you have in Athena.
  6. Choose Select.

Now you have created a dataset.

To build the score card, you add calculated fields by editing the dataset blog_output.

  1. Locate your dataset.
  2. Choose Edit dataset.
  3. Choose Add calculated field.
  4. Add the field DQ_Flag with value ifelse({birthdate_flagged} = 'No' OR {drivers_flagged} = 'Yes' OR {patient_flagged} = 'Yes' , 'Invalid', 'Valid').

Similarly, add the other calculated fields listed below; a local sanity-check sketch follows the list.

  1. Add the field % Birthdate Invalid Year with value countIf({birthdate_flagged}, {birthdate_flagged} = 'No')/count({birthdate_flagged}).
  2. Add the field % Drivers Missing with value countIf({drivers_flagged}, {drivers_flagged} = 'Yes')/count({drivers_flagged}).
  3. Add the field % Patients missing encounters with value countIf({patient_flagged}, {patient_flagged} = 'Yes')/count({patient_flagged}).
  4. Add the field % Bad records with the value countIf({DQ_Flag}, {DQ_Flag} = 'Invalid')/count({DQ_Flag}).
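If you want to double-check the dashboard numbers outside of QuickSight, the same metrics can be computed locally with pandas. This is a rough sketch that assumes you downloaded the DataBrew output CSV from the S3 output bucket; the file name is a placeholder.

import pandas as pd

# Placeholder path to the DataBrew output downloaded from the S3 output bucket
df = pd.read_csv("blog_output.csv")

pct_birthdate_invalid = (df["birthdate_flagged"] == "No").mean()
pct_drivers_missing = (df["drivers_flagged"] == "Yes").mean()
pct_patients_missing = (df["patient_flagged"] == "Yes").mean()

# A record is invalid if any of the three checks fails
invalid = (
    (df["birthdate_flagged"] == "No")
    | (df["drivers_flagged"] == "Yes")
    | (df["patient_flagged"] == "Yes")
)

print(f"% Birthdate invalid year:      {pct_birthdate_invalid:.2%}")
print(f"% Drivers missing:             {pct_drivers_missing:.2%}")
print(f"% Patients missing encounters: {pct_patients_missing:.2%}")
print(f"% Bad records:                 {invalid.mean():.2%}")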

Now we create the analysis blog_output_analysis.

  1. Change the format of the calculated fields to display the Percent format.
  2. Start adding visuals by choosing Add visual on the + Add menu.

Now you can create a quick report to visualize your data quality score card, as shown in the following screenshot.

If QuickSight is using SPICE storage, you need to refresh the dataset in QuickSight after you receive notification about the completion of the data refresh. If the QuickSight report is running an Athena query for every request, you might see a “table not found” error when data refresh is in progress. We recommend using SPICE storage to get better performance.

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

Conclusion

This post explains how to create a data quality score card using DataBrew, Athena queries, and QuickSight.

This gives you a great starting point for using this solution with your datasets and applying business rules to build a complete data quality framework to monitor issues within your datasets. We encourage you to use various built-in transformations to get the maximum value for your project.


About the Authors

Nitin Aggarwal is a Senior Solutions Architect at AWS, where he helps digital native customers with architecting data analytics solutions and provides technical guidance on various AWS services. He brings more than 16 years of experience in software engineering and architecture roles for various large-scale enterprises.

Gaurav Sharma is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

Vivek Kumar is a Solutions Architect at AWS. He works with digital native business customers providing architectural guidance on AWS services.

Create threshold-based alerts in Amazon QuickSight

Post Syndicated from Lillie Atkins original https://aws.amazon.com/blogs/big-data/create-threshold-based-alerts-in-amazon-quicksight/

Every business has a set of key metrics that stakeholders focus on to make the most accurate, data-driven decisions, such as sales per week, inventory turnover rate, daily website visitors, and so on. With threshold-based alerts in Amazon QuickSight, we’re making it simpler than ever for consumers of QuickSight dashboards to stay informed about their key metrics. This blog post walks you through the process of setting up threshold-based alerts to track important metrics on QuickSight dashboards.

Set up an alert

Threshold alerts are set up on dashboards and can be created from KPIs or gauge charts. To set a threshold alert, choose the visual and then choose the alert icon. The alert takes into account all of the filters currently applied to the visual and creates a new alert rule.

The following two screenshots show first a KPI visual and then a gauge visual as well as where to locate the alert icon:

You can set up multiple alerts from a visual, which lets you monitor the data for different sets of changes or conditions.

After you choose the alert icon, you must provide a few configuration details. The alert name auto-fills to the name of the visual. The Alert value is the value in the data that the threshold is checked against based on your rule; it defaults to the primary value of the visual. For the gauge, this means the percent of the goal already achieved (currently 79.81%). In the following screenshot, we see that for the KPI it means the week-over-week difference in the forecasted revenue (currently -$367,456).

Let’s say you want to be alerted whenever the forecasted new monthly revenue dips below $300,000—even if that is multiple times a day. To configure this alert, complete the following steps:

  1. For Alert value, choose the actual value rather than the difference value (which was the default).
  2. For Condition, choose Is below.
  3. Enter the value 300,000.
  4. For Notification preference, choose As frequently as possible.
  5. Choose Save.

Let’s now say you change your mind and only want to get notified once a week at most when this alert is going off. This is controlled through the notification preference. To make changes to the alert, you go to the management portal, which can be found by choosing Alerts on the navigation bar.

Here you can edit, delete, enable, or disable the alert. After the alert has triggered, you can see it in a list along with other historical alerts (90 days of history). This alert doesn't have any history yet, because it hasn't been triggered.

To update your notification preference, choose Edit, then under Notification preference choose Weekly at most, and choose Save.

When an alert is triggered, you receive an email notification customized to the alert value you configured. You can quickly get to the dashboard by choosing View Dashboard.

Alerts are created based on the visual at that point in time and don’t update with changes to the visual in the future. This means the visual can change or be deleted and the alert continues to work as long as the data in the dataset remains valid.

The evaluation schedule for threshold alerts is based on the dataset. For SPICE datasets, alert rules are checked against the data after a successful data refresh. For datasets that query your data sources directly, alerts are evaluated daily at a random time between 6 PM and 8 AM, based on the Region of the dataset. We're working on a control that lets direct query dataset owners set their own schedules for checking alerts and increase the frequency up to hourly.

The admin for the QuickSight account can restrict who has access to set threshold alerts through custom permissions. For more information, see Customizing user permissions in Embed multi-tenant analytics in applications with Amazon QuickSight.

Pricing and availability

Threshold alerts are billed for each evaluation, and follow the familiar pricing used for anomaly detection, starting at $0.50 per 1,000 evaluations. For example, if you set up an alert on a SPICE dataset that refreshes daily, you have 30 evaluations of the alert rule in a month, which costs 30 * $0.5/1000 = $0.015 in a month. For more information, see Amazon QuickSight Pricing.
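If you want to estimate the cost for your own refresh schedule, the arithmetic is simple. The following small helper mirrors the example above; the numbers are illustrative only.

def monthly_alert_cost(evaluations_per_month, price_per_1000=0.50):
    """Estimated monthly cost in USD for threshold alert evaluations."""
    return evaluations_per_month * price_per_1000 / 1000

# A SPICE dataset that refreshes daily triggers roughly 30 evaluations per month
print(monthly_alert_cost(30))  # 0.015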

Threshold alerts are a QuickSight Enterprise Edition feature and available for dashboards consumed in the QuickSight website. Threshold alerts aren’t yet available in embedded QuickSight dashboards or on the mobile app.

Conclusion

In this post, we demonstrated how to set up threshold-based alerts to track important metrics on QuickSight dashboards. This makes it even easier for consumers of QuickSight dashboards to stay up to date on their key metrics. For more information, see the Amazon QuickSight documentation.


About the Author

Lillie Atkins is a Product Manager for Amazon QuickSight, Amazon Web Services' cloud-native, fully managed BI service.

Securely analyze your data with AWS Lake Formation and Amazon QuickSight

Post Syndicated from Julia Soscia original https://aws.amazon.com/blogs/big-data/securely-analyze-your-data-with-aws-lake-formation-and-amazon-quicksight/

Many useful business insights can arise from analyzing customer preferences, behavior, and usage patterns. With this information, businesses can innovate faster and improve the customer experience, leading to better engagement and accelerating product adoption. More and more businesses are looking for ways to securely store and restrict access to customer data, which may include personally identifiable information (PII) and other sensitive information. Accessing customer data for use cases such as analytics and Machine Learning requires careful management of access permissions and the ability to audit usage.

This post explores the ways in which you can securely store, analyze, and visualize sensitive customer data. We start by storing encrypted sample test data in our Amazon Simple Storage Service (Amazon S3) based data lake. We use AWS Lake Formation to configure fine-grained permissions to restrict user access, and finally analyze the data and visualize business insights using Amazon QuickSight.

Prerequisites

For this post, you should be familiar with the following:

The AWS CloudFormation template associated with this post automatically sets up the different architecture components. You then need to take additional steps to configure user permissions directly in Lake Formation. If you already manage access using AWS Glue resource permissions and IAM, launching this CloudFormation template in the same account could result in access being denied to existing users. We recommend launching this template in a new AWS account or one not running business-critical workloads.

Architecture overview

The following diagram illustrates our solution’s architecture.

The workflow includes the following steps:

  1. Use Lake Formation to create a data lake on Amazon S3 that is protected with IAM and encrypted with AWS Key Management Service (AWS KMS).
  2. AWS Glue crawlers scan your datasets and populate the Data Catalog.
  3. The Data Catalog serves as a central repository to store the metadata of the datasets.
  4. Athena accesses the data for ad hoc queries, using the Data Catalog.
  5. You can securely visualize your data with QuickSight.

Use case and business value

The following use case helps illustrate the challenge we’re trying to solve.

Assume you run an ecommerce company and to help improve the customer experience, you need to collect and store customers’ purchasing history. It’s common for the credit card holder to be different than the individual purchasing a product. If you can identify the relationship between the account holder and the cardholder, you may be able to create targeted recommendations.

For example, the account holder and cardholder share the same last name and home address but have a different first name, and the cardholder’s age is greater. You can use this information to deduce with high certainty that the account holder is using the credit card of a family member. You could analyze the account holder’s purchasing history and correlate it with third-party information collected on the cardholder to create a holistic view of the customer and their circle of close family influencers.

With this new information, you can now make personal product recommendations to improve the customer experience. In the following sections, we demonstrate how to accomplish this by setting up a secure data lake, encrypting data at rest, masking sensitive fields, and restricting access to data using fine-grained permissions.

For similar customer use cases related to security and data analytics on AWS, see AWS Cloud Security.

Set up the environment

After you sign in to your test AWS account, launch the CloudFormation template by choosing Launch Stack:  

This template configures the following resources:

  • An encrypted S3 bucket representing the data lake that is registered with Lake Formation
  • Two IAM groups that represent the data lake admin and analyst
  • Two IAM users ([email protected] and [email protected])
  • IAM roles and policies that go along with these services
  • An AWS Glue crawler to populate the Data Catalog with tables
  • An AWS Glue role for running within the data lake
  • An AWS Glue extract, transform, and load (ETL) job to mask the credit card information
  • A KMS customer master key (CMK) for the Amazon S3 server-side encryption (SSE)

When you launch the template, enter values for the following parameters:

  • AdminPassword – An initial password for the [email protected] user
  • AnalystPassword – An initial password for the [email protected] user
  • BucketNameParam – A unique name to be used for the data lake bucket
  • CFNExecRoleOrUserArnParam – The ARN of the principal (user or role) executing the CloudFormation stack. You can find the ARN of the logged-in user in IAM.

The passwords you provide must comply to your account’s password policy. When you later log in as those users, you’re asked to reset these passwords. Make note of the passwords you choose.

Configure your data lake’s security

This section doesn’t cover all the available security offerings in AWS, but rather focuses on restricting user access and protecting data at rest. You should always start by securing all access to your data and resources. It’s important to create policies that use least privilege from the start, even before any data lands in the data lake. You can do this with IAM and AWS KMS.

Set up IAM access policies for your users

First, consider the types of users that access the AWS account that hosts a data lake and use analytic services to access the data assets in the data lake. For this post, we discuss two types of AWS users:

  • Data lake administrator – Responsible for configuring the data lake and managing Lake Formation permissions to the data lake users
  • Data analyst – Responsible for data visualization

When you work in your production account (not the test account), you consider a third user type: the data engineer. The data engineer is responsible for data transformation, loading data into Amazon S3, and making the necessary changes to the data structures so the analysts don’t have access to sensitive data directly. For this post, the data lake administrator acts as the data engineer.

On the IAM console, on the Users page, you can see the following users created by the CloudFormation template.

The recommended best practice is to provide your users with the least access privileges.

For central access control for data in the data lake, we use the Lake Formation permission model. This requires us to create a data lake administrator who can grant any principal (including themselves) any permission on any Data Catalog resource and data location. The designated data lake administrator user can then grant more granular permissions on resources to other principals. The permission scope of the data lake administrator should be limited to managing the data lake within Lake Formation and the AWS Glue Data Catalog. The data lake administrator should have the ability to add existing IAM users and roles to the data lake, but not to create the IAM users themselves.

The permission scope of the data engineer user should be limited to the IAM policies required for them to successfully do their job, following the principle of least privilege. Specifically, this means running ETL jobs to mask sensitive information from the data records prior to analysis. Also, because the aim of the use case is to maintain compliance, it's imperative that no individual users have access to cardholder data, either visually or programmatically.

It’s considered a best practice to use AWS managed policies whenever possible to reduce the operational overhead of maintaining your own policies. The data lake admin group role should be granted the AWSLakeFormationDataAdmin, AWSGlueConsoleFullAccess, and AmazonS3ReadOnlyAccess managed policies to administer data pipelines with AWS Glue, configure Lake Formation permissions, run AWS Glue crawlers, and store data on Amazon S3. For more information about IAM managed permissions, see Data Lake Administrator Permissions.

Encryption at rest within the data lake

In addition to using IAM to delegate permissions to your data lake, we recommend encrypting the data at rest. In this post, Amazon S3 acts as the basis for storage of our data lake. Amazon S3 offers several options for encryption based on your preference and compliance requirements. For more information, see Protecting data using encryption.

In this post, we use server-side encryption with AWS KMS (SSE-KMS) because it provides low operational overhead. SSE-KMS uses a CMK stored in AWS KMS to protect customer data, and AWS KMS provides fine-grained audit and access control over requests for encryption keys. SSE-KMS and AWS Glue both use data keys protected by your CMK. You can view the default encryption details by navigating to the Amazon S3 console, choosing the bucket created as part of the CloudFormation template, and viewing its properties.

Encryption in transit within the data lake

After you secure the data at rest, let’s make sure that the S3 service endpoints are protected with SSL. By default, public S3 service endpoints can be accessed via HTTPS or HTTP. Therefore, to ensure that traffic is encrypted in transit, use the default HTTPS endpoint for the Amazon S3 API. To ensure that data in Amazon S3 is protected with SSL, you must force SSL on the bucket using a bucket policy. This policy is enforced on all of your buckets where encryption is required to meet your security needs. For details on encrypting your data in transit as you move your data into your data lake in the AWS Well-Architected Framework, see How do you anticipate, respond to, and recover from incidents?
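The following Boto3 sketch shows one common way to attach such a policy. It's a minimal example: the bucket name is a placeholder, and you should merge the statement into any existing bucket policy rather than overwrite it.

import json
import boto3

s3 = boto3.client("s3")
bucket = "your-data-lake-bucket"  # placeholder

# Deny any S3 request that does not arrive over TLS
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                f"arn:aws:s3:::{bucket}",
                f"arn:aws:s3:::{bucket}/*",
            ],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }
    ],
}

s3.put_bucket_policy(Bucket=bucket, Policy=json.dumps(policy))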

Preprocess data to restrict access to sensitive data

Not all consumers of the data are allowed to see sensitive information, so we need to split the data in a way that gives us more control over how data is exposed to users. To do that, we transform it using PySpark running in AWS Glue. This allows us to aggregate the data for our analysts and mask the sensitive information before saving it back to Amazon S3 for analysis.
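As an illustration of the kind of masking such a job can perform, the following PySpark sketch keeps only the last four digits of a card number before writing the result back to Amazon S3 in columnar format. The paths and the card_number column name are assumptions for illustration; the clean_cards_data job provided by the CloudFormation template may differ in its details.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-card-data").getOrCreate()

# Placeholder input path and column name
cards = spark.read.option("header", "true").csv("s3://<data-lake-bucket>/cards/")

# Replace the full card number with a masked value that keeps only the last four digits
masked = cards.withColumn(
    "card_number",
    F.concat(F.lit("****-****-****-"), F.substring(F.col("card_number"), -4, 4)),
)

# Write columnar output for efficient downstream analysis
masked.write.mode("overwrite").parquet("s3://<data-lake-bucket>/clean_cards/")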

Configure Lake Formation

First, we use Lake Formation to create a central data lake repository on Amazon S3 to store and analyze your data.

  1. On the Lake Formation console, under Data Catalog, choose Settings.
  2. Deselect the two check boxes associated with the Data Catalog using only IAM permissions.

To maintain backward compatibility with AWS Glue, these settings are enabled by default for new Data Catalog resources. These settings effectively cause access to Data Catalog resources to be controlled solely by IAM policies. Deselect the permissions so that you can give individual permissions to Data Catalog resources from Lake Formation.

  1. Choose Save.

Next, we revoke generic IAM principal access to tables in our database. This makes sure that only permissions applied using Lake Formation will apply.

  1. On the Lake Formation console, choose Administrative roles and tasks.
  2. Under Database Creators¸ select IAMAllowedPrincipals.
  3. Choose Revoke.
  4. For Catalog permissions, select Create database.
  5. Leave all other settings at their default and choose Revoke.

We now need to revoke permissions for IAMAllowedPrincipals.

  1. On the Data permissions page, revoke all grants to the group IAMAllowedPrincipals.

Lake Formation is now the single pane of glass for data governance within your data lake. To configure user permissions in Lake Formation, you must be a data lake admin. The CloudFormation template already created [email protected] as our data lake admin. When you’re logged in as the admin, you need to grant them the ability to manage permissions for users.

  1. On the IAM console, choose Users.
  2. Choose the [email protected] user.
  3. On the Security Credentials tab, copy the link for that user to log in.
  4. Open the link in a new browser or private browser window.
  5. Reset the password (on your first login).
  6. On the Lake Formation console, choose Data permissions.
  7. Choose Grant.
  8. Make sure the admin user has both database and grantable Super permissions on the db1 database.
  9. On the Databases page, select the db1 database.
  10. On the Actions menu, choose Edit.
  11. Choose the S3 bucket created as part of the CloudFormation stack as the database location.

The naming convention of the S3 bucket is <account_number><region_name><bucket_name>. For example, it should appear similar to 111122223333virginiamybucketname. Do not choose the bucket whose name ends with athenaoutput.

  1. Deselect Use only IAM access control for new tables in this database.
  2. Choose Save.

After this step, if you see IAMAllowedPrincipals under Data permissions, follow the steps as explained before to revoke the permissions.

The next permission we want to grant is the ability for our AWS Glue execution role to create new tables in our db1 database.

  1. On the Data permissions page, choose Grant.
  2. For IAM users and roles, choose the AWS Glue role created as part of the CloudFormation stack.
  3. For Database, choose the db1 database.
  4. For Database permissions, select Create table.
  5. Make sure that no options are selected for Grantable permissions.
  6. Choose Grant.

Now that the data lake admin is set up and Lake Formation is managing permissions, we can create table definitions for the cards, customers, and sales data in the Lake Formation Data Catalog. Let's first verify the files created by the CloudFormation template in the S3 bucket folders.

  1. On the Amazon S3 console, choose the bucket that you chose for the db1 location.

The following CSV files are in their respective folders cards, customers, and sales:

  • cards.csv
  • customers.csv
  • sales.csv

Now that we’ve verified the files, let’s catalog it in the Lake Formation Data Catalog using AWS Glue crawlers.

  1. On the AWS Glue console, choose Crawlers.
  2. Select the crawler security-blog-crawler and choose Run crawler.

This crawler was created by the CloudFormation template. It can crawl multiple data stores like cards, customers, and sales to populate the Data Catalog.

After you run the crawler, you should see the tables in the Data Catalog. To view the tables, switch to the Lake Formation console, choose Databases, select the db1 database, and choose View tables.

Configure data access controls

Now that our datasets are cataloged, it’s important to define access controls to ensure only authorized users are allowed to see sensitive information. We use Lake Formation to define these fine-grained controls for users who need access to data in the data lake.

Configure data encryption for AWS Glue

We use an AWS Glue security configuration to secure data when it's accessed by AWS Glue crawlers and ETL jobs. This covers the data the ETL jobs write to Amazon S3 targets and the logs they send to Amazon CloudWatch. The security settings were automatically configured by the CloudFormation template and can be viewed on the AWS Glue console.

Process the result set

Our dataset includes information about our customer demographics and references between customers that share credit cards when making purchases. We develop a simple job using PySpark to combine the purchasing user’s information with the cardholder. You can perform other transformations and enrichment such as masking sensitive fields or looking up additional details in external systems. When the job is complete, it outputs the data in columnar format to give us better performance and reduce cost when we later analyze it. For more information on how this process works and the tools required, see How to extract, transform, and load data for analytic processing using AWS Glue (Part 2).

To demonstrate this capability in action, you run the AWS Glue ETL jobs created by the CloudFormation template. To run the script, you log in as an admin user, but ideally, you should have a data engineer managing the ETL at this point. For the sake of simplicity, we configured the data lake administrator to have these permissions.

Let’s run the ETL jobs to clean the cards and sales data. They create new files under the clean_cards and clean_sales S3 folders with the modifications. We start with cleaning the card data. The job replaces full card numbers with the last four digits of the card numbers and create a new file in the clean_cards folder.

  1. Make sure you’re signed in as the data lake admin with username [email protected].
  2. On the AWS Glue console, choose Jobs.
  3. Select the job clean_cards_data and on the Action menu, choose Run job.
  4. Expand Security configuration, script libraries, and job parameters.
  5. Under Job parameters, add the key --output_s3_bucket_name and the value as the bucket name that contains the CSV files.
  6. Choose Run job.

Next, we clean up our sales data. The dollar amounts for the purchase prices are stored as strings with a dollar sign ($) in them. To make analytics easier downstream, we want those cast as decimals without the dollar signs.
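A minimal PySpark sketch of that conversion might look like the following. The purchase_amount column name and the paths are assumptions, and the provided clean_sales_data job may implement this differently.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clean-sales-data").getOrCreate()

# Placeholder input path and column name
sales = spark.read.option("header", "true").csv("s3://<data-lake-bucket>/sales/")

# Strip the dollar sign (and any thousands separators) and cast to a decimal type
cleaned = sales.withColumn(
    "purchase_amount",
    F.regexp_replace(F.col("purchase_amount"), r"[$,]", "").cast("decimal(10,2)"),
)

cleaned.write.mode("overwrite").parquet("s3://<data-lake-bucket>/clean_sales/")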

  1. Follow the same procedure to run the clean_sales_data job.

Now that we have generated our clean cards and clean sales data in the S3 bucket, we run security-blog-crawler to add the clean cards and clean sales tables to our Data Catalog.

  1. In the navigation pane, choose Crawlers.
  2. Select the crawler called security-blog-crawler and choose Run crawler.

Now that we have our new tables with masked card data and cleaned sales data, you can grant the analyst user permission to access them in Lake Formation.

  1. On the Lake Formation console, grant the Select permission to the clean_cards and clean_sales tables for the user [email protected].

This completes the permissions scope for the analyst user.

Query and visualize data with QuickSight

Now that our data is ready, we use QuickSight to visualize the data. We first add [email protected] to QuickSight as an author who can create different analyses for viewers. We use Lake Formation fine-grained permissions to grant secure access to the analyst user, who can then prepare analyses in QuickSight.

You need to be logged in as your account administrator, not the analyst or data lake admin. Your account must be subscribed to QuickSight as an Enterprise user to enable integration with Lake Formation fine-grained access control.

Enable fine-grained permission for QuickSight users

Fine-grained permissions defined in Lake Formation are mapped to QuickSight authors or a group of viewers. To enable fine-grained permissions for viewers, you must first add them to a QuickSight group that can be configured in Lake Formation with fine-grained permissions. For this post, we create the QuickSight user [email protected].

  1. Sign in to your AWS account with the admin user (not the Lake Formation admin user).
  2. In a new tab, open the QuickSight console.
  3. Choose the logged-in user and choose Manage QuickSight.
  4. Add the user [email protected] with the author role.

Now we create the QuickSight group Analyst and add the QuickSight user [email protected] to the group. We use the AWS Command Line Interface (AWS CLI) for this purpose.

  1. Run the following command to create the group (provide your account ID):
    aws quicksight create-group --aws-account-id=<account_id> --namespace=default --group-name="Analyst" --description="Data Analyst Group"

  2. Run the following command to add the user to the group:
    aws quicksight create-group-membership --group-name 'Analyst' --aws-account-id <account_id> --namespace default [email protected]

  3. Run the following command to get the ARN of the QuickSight group:
    aws quicksight describe-group --group-name=Analyst --aws-account-id <account_id> --namespace=default

  4. Make a note of this ARN.

We use this ARN to configure access permissions to this QuickSight group in Lake Formation.

Because we configured the data lake bucket and Athena output bucket with CMKs, we need to grant the following key operations to the QuickSight role.

  1. Enter the following AWS CLI command to grant these key operations to the QuickSight service role that was created when you subscribed to QuickSight (also provide the KMS key ID created by the CloudFormation stack):
    aws kms create-grant --key-id <kms-key> --grantee-principal arn:aws:iam::<accountid>:role/service-role/aws-quicksight-s3-consumers-role-v0 --operations Decrypt Encrypt DescribeKey GenerateDataKey GenerateDataKeyPair

  2. Sign in with the [email protected] user.
  3. On the Lake Formation console, choose Data permissions.
  4. Choose Grant.
  5. For SAML and Amazon QuickSight users and groups, enter the Analyst group ARN you copied earlier.
  6. For Database, choose db1.
  7. For Tables, choose clean_cards and clean_sales.
  8. For Table permissions, select Select.
  9. Choose Grant.

Now let’s grant permissions to the customers table by excluding the address and email fields.

  1. On the Data permissions page, choose Grant.
  2. For SAML and Amazon QuickSight users and groups, enter the Analyst group ARN you copied earlier.
  3. For Database, choose db1.
  4. For Tables, choose customers.
  5. For Columns, choose Exclude columns.
  6. For Exclude columns, choose address and email.
  7. For Table permissions, select Select.
  8. Choose Grant.

Now the Analyst group has the Lake Formation permission for the proper datasets.

Additional security measures

In addition to configuring permissions for QuickSight, you can implement security measures to ensure that you’re visualizing sensitive data properly.

  1. On the QuickSight console, choose Security & permissions.

Because we’re using Athena to connect to source data in data lake, we grant write permissions to the S3 bucket for the Athena query result output. The naming convention of the Athena S3 bucket is <account_number><region_name><bucket_name>. For example, it should appear similar to 111122223333virginiamybucketname.

Visualize the user data

This post would be incomplete without some data science that demonstrates how we can create correlations between users. We have secured our infrastructure, scrubbed our sensitive data in preprocessing, output the results to columnar tables in our AWS Glue Data Catalog managed by Lake Formation, and configured security settings in QuickSight. The next step is to import the dataset into QuickSight to analyze the data. For more information, see Working with Datasets.

In our use case, we logically grouped our users and cardholders into a columnar format in preprocessing, which we can now use to sort through and derive value. In QuickSight, we use the heat map visualization to map the buyer to the number of times they used someone else’s credit card. This gives us a visual of which users used different credit cards the most.

  1. Sign in using [email protected].
  2. On the QuickSight console, choose New analysis.
  3. Choose New datasets.
  4. For Create a Dataset, choose Athena.
  5. Enter ABCCompany Analysis as the data source.
  6. Choose Create data source.
  7. Choose the database db1.

As shown in the following screenshot, you’re only shown the table permissions granted to the QuickSight Analyst group.

  1. Choose Use custom SQL.
  2. Enter the query name as Sales-Cards-Query.

We now run queries to verify that the analyst user doesn’t have access to the cards and sales tables.

  1. Enter the following query for the cards table:
    SELECT * from "db1"."cards"

The following screenshot shows that we get a permission error.

  1. Similarly, you can verify the permissions for the sales table by running the following query. You should see the same permission error as for the cards table.
    SELECT * from "db1"."sales"

  2. Enter the following query for the customers table:
    SELECT * from "db1"."customers"

The following screenshot shows that the analyst only has access to customer fields other than address and email.

The analyst only has access to the clean_sales, clean_cards, and customers tables (excluding the address and email fields).

  1. Enter the following SQL and choose Edit/Preview Data:
    SELECT "db1"."clean_sales"."customer_id", COUNT("db1"."clean_sales"."customer_id") as num_diff_card_used FROM "db1"."clean_sales" JOIN "db1"."customers" ON "db1"."clean_sales"."customer_id"="db1"."customers"."customer_id" AND "db1"."clean_sales"."card_id" != "db1"."customers"."card_id" GROUP BY "db1"."clean_sales"."customer_id" ORDER BY num_diff_card_used DESC

The following screenshot shows our query results.

  1. Choose Save & visualize to create a visualization.
  2. Choose + Add and choose Add visual.
  3. Choose the heat map visual type.
  4. Set Rows to customer_id.
  5. Set Columns and Values to num_diff_card_used.
  6. On the Values drop-down menu, choose Aggregate sum.

The following screenshot shows our QuickSight analysis. You can change the color by choosing Format visual.

From this visualization, we can see that several customers are making purchases with more than one card that they don’t own. We can also add further visualizations that add more context to our data like customer IDs and the total number of purchases made with cards that customers don’t own.

The following are some of the additional datasets and visualizations that you can add to your analysis.

This data may provide valuable insights into the relationships between users and also provide a starting point for forensic investigations into customers that may be making fraudulent purchases.

For instructions on creating a similar dataset, see Creating a Dataset Using Amazon Athena Data. For instructions on creating visualizations with your datasets, see Creating an Amazon QuickSight Visual.

Clean up

When you’re done with this exercise, complete the following steps to delete your resources and stop incurring costs:

  1. On the Amazon S3 console, delete the objects in the buckets created by the CloudFormation stack.
  2. On the CloudFormation console, select your stack and choose Delete.

This cleans up all the resources created by the stack.

  1. Cancel your QuickSight account.

Conclusion

In this post, we used an example of credit card purchases to discuss different ways to protect sensitive data, based on separation of responsibilities with least privileges, encryption, and fine-grained access control. With AWS, you gain the control and confidence you need to securely run your business with the most flexible and secure cloud computing environment available today. To show this in action, we created separate IAM users and granted permissions based on the principle of least privilege to allow users to perform their duties. We reviewed the required AWS KMS configuration to protect data at rest using server-side encryption. We used AWS Glue to prepare the data, remove sensitive information, and further protect data access using Lake Formation’s fine-grained access controls. After we applied the security controls, we showed you how the analyst user can safely create different analyses using QuickSight for visualization.


References

For more information about the steps in this solution, see the following:


About the Authors

Julia Soscia is a Solutions Architect Manager with Amazon Web Services on the Startup team, based out of New York City. Her main focus is to help startups create well-architected environments on the AWS cloud platform and build their business. She enjoys skiing on the weekends in Vermont and visiting the many art museums across New York City.

Mitesh Patel is a Senior Solutions Architect at AWS. He works with customers in SMB to help them develop scalable, secure and cost effective solutions in AWS. He enjoys helping customers in modernizing applications using microservices and implementing serverless analytics platform.

Run usage analytics on Amazon QuickSight using AWS CloudTrail

Post Syndicated from Sunil Salunkhe original https://aws.amazon.com/blogs/big-data/run-usage-analytics-on-amazon-quicksight-using-aws-cloudtrail/

Amazon QuickSight is a cloud-native BI service that allows end users to create and publish dashboards in minutes, without provisioning any servers or requiring complex licensing. You can view these dashboards on the QuickSight product console or embed them into applications and websites. After you deploy a dashboard, it's important to assess how it and other assets are being adopted, accessed, and used across various departments or customers.

In this post, we use a QuickSight dashboard to present the following insights:

  • Most viewed and accessed dashboards
  • Most updated dashboards and analyses
  • Most popular datasets
  • Active users vs. idle users
  • Idle authors
  • Unused datasets (wasted SPICE capacity)

You can use these insights to reduce costs and create operational efficiencies in a deployment. The following diagram illustrates this architecture.


Solution components

The following list summarizes the AWS services and resources that this solution uses:

  • AWS CloudTrail logs (CloudTrailMultiAccount) – Captures all API calls for all AWS services across all AWS Regions for this account. You can use AWS Organizations to consolidate trails across multiple AWS accounts.
  • AWS Glue crawlers (QSCloudTrailLogsCrawler and QSProcessedDataCrawler) – Ensure that all CloudTrail data is crawled periodically and that partitions are updated in the AWS Glue Data Catalog.
  • AWS Glue ETL job (QuickSightCloudTrailProcessing) – Reads catalogued data from the crawler, processes, transforms, and stores it in an S3 output bucket.
  • AWS Lambda function (ExtractQSMetadata_func) – Extracts event data using the AWS SDK for Python (Boto3). The event data is enriched with QuickSight metadata objects like user, analysis, datasets, and dashboards.
  • Amazon Simple Storage Service (Amazon S3) buckets (CloudTrailLogsBucket and QuickSight-BIonBI-processed) – One bucket stores CloudTrail data; the other stores processed data.
  • Amazon QuickSight analysis (Quicksight_BI_On_BO_Analysis) – Visualizes the processed data.

Solution walkthrough

AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. You can use CloudTrail to log, continuously monitor, and retain account activity related to actions across your AWS infrastructure. You can define a trail to collect API actions across all AWS Regions. Although we have enabled a trail for all Regions in our solution, the dashboard shows the data for a single Region only.

After you enable CloudTrail, it starts capturing all API actions and then, at 15-minute intervals, delivers logs in JSON format to a configured Amazon Simple Storage Service (Amazon S3) bucket. Before the logs are made available to our ad hoc query engine, Amazon Athena, they must be parsed, transformed, and processed by the AWS Glue crawler and ETL job.


This is handled by the AWS Glue crawler and AWS Glue ETL job. The AWS Glue crawler crawls through the data every day and populates new partitions in the Data Catalog. The data is later made available as a table on the Athena console for processing by the AWS Glue ETL job. The ETL job QuickSightCloudtrail_GlueJob.txt filters the logs and processes only those events where the event source is QuickSight (for example, eventSource = quicksight.amazonaws.com).

The following screenshot shows the sample JSON for the QuickSight API calls.


The job processes those events and creates a Parquet file. The following table summarizes the file’s data points.

Quicksightlogs
Field Name Data Type
eventtime Datetime
eventname String
awsregion String
accountid String
username String
analysisname String
Date Date

The processed data is stored in an S3 folder at s3://<BucketName>/processedlogs/. For performance optimization during querying and for connecting this data to QuickSight for visualization, these logs are partitioned by the date field. For this reason, we recommend that you configure the AWS Glue crawler to detect the new data and partitions and update the Data Catalog for subsequent analysis. We have configured the crawler to run once a day.

We need to enrich this log data with metadata from QuickSight, such as a list of analyses, users, and datasets. This metadata can be extracted using the describe_analysis, describe_user, and describe_data_set APIs in the AWS SDK for Python (Boto3).

We provide an AWS Lambda function that is ideal for this extraction. We configured it to be triggered once a day through Amazon EventBridge. The extracted metadata is stored in the S3 folder at s3://<BucketName>/metadata/.
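The following is a simplified Boto3 sketch of what such a Lambda function can look like. It is not the ExtractQSMetadata_func code itself: pagination is omitted, and the account ID and bucket name are placeholders.

import json
import boto3

ACCOUNT_ID = "111122223333"   # placeholder AWS account ID
BUCKET = "<BucketName>"       # bucket that holds the metadata/ prefix

quicksight = boto3.client("quicksight")
s3 = boto3.client("s3")

def lambda_handler(event, context):
    # Pull QuickSight metadata (first page only; add pagination for larger accounts)
    users = quicksight.list_users(AwsAccountId=ACCOUNT_ID, Namespace="default")["UserList"]
    analyses = quicksight.list_analyses(AwsAccountId=ACCOUNT_ID)["AnalysisSummaryList"]
    datasets = quicksight.list_data_sets(AwsAccountId=ACCOUNT_ID)["DataSetSummaries"]

    # Store each object list as JSON under the metadata/ prefix
    for name, payload in (("users", users), ("analysis", analyses), ("datasets", datasets)):
        s3.put_object(
            Bucket=BUCKET,
            Key=f"metadata/{name}.json",
            Body=json.dumps(payload, default=str),
        )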

Now that we have processed logs and metadata for enrichment, we need to prepare the data visualization in QuickSight. Athena allows us to build views that can be imported into QuickSight as datasets.

We build the following views based on the tables populated by the Lambda function and the ETL job:

CREATE VIEW vw_quicksight_bionbi 
AS 
  SELECT Date_parse(eventtime, '%Y-%m-%dT%H:%i:%SZ') AS "Event Time", 
         eventname  AS "Event Name", 
         awsregion  AS "AWS Region", 
         accountid  AS "Account ID", 
         username   AS "User Name", 
         analysisname AS "Analysis Name", 
         dashboardname AS "Dashboard Name", 
         Date_parse(date, '%Y%m%d') AS "Event Date" 
  FROM   "quicksightbionbi"."quicksightoutput_aggregatedoutput" 

CREATE VIEW vw_users 
AS 
  SELECT usr.username "User Name", 
         usr.role     AS "Role", 
         usr.active   AS "Active" 
  FROM   (quicksightbionbi.users 
          CROSS JOIN Unnest("users") t (usr)) 

CREATE VIEW vw_analysis 
AS 
  SELECT aly.analysisname "Analysis Name", 
         aly.analysisid   AS "Analysis ID" 
  FROM   (quicksightbionbi.analysis 
          CROSS JOIN Unnest("analysis") t (aly)) 

CREATE VIEW vw_analysisdatasets 
AS 
  SELECT alyds.analysesname "Analysis Name", 
         alyds.analysisid   AS "Analysis ID", 
         alyds.datasetid    AS "Dataset ID", 
         alyds.datasetname  AS "Dataset Name" 
  FROM   (quicksightbionbi.analysisdatasets 
          CROSS JOIN Unnest("analysisdatasets") t (alyds)) 

CREATE VIEW vw_datasets 
AS 
  SELECT ds.datasetname AS "Dataset Name", 
         ds.importmode  AS "Import Mode" 
  FROM   (quicksightbionbi.datasets 
          CROSS JOIN Unnest("datasets") t (ds))

QuickSight visualization

Follow these steps to connect the prepared data with QuickSight and start building the BI visualization.

  1. Sign in to the AWS Management Console and open the QuickSight console.

You can set up QuickSight access for end users through SSO providers such as AWS Single Sign-On (AWS SSO), Okta, Ping, and Azure AD so they don’t need to open the console.


  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset to create a dataset for our analysis.


  1. For Create a Data Set, choose Athena.

In the previous steps, we prepared all our data in the form of Athena views.

  1. Configure permission for QuickSight to access AWS services, including Athena and its S3 buckets. For information, see Accessing Data Sources.


  1. For Data source name, enter QuickSightBIbBI.
  2. Choose Create data source.


  1. On Choose your table, for Database, choose quicksightbionbi.
  2. For Tables, select vw_quicksight_bionbi.
  3. Choose Select.


  1. For Finish data set creation, there are two options to choose from:
    1. Import to SPICE for quicker analytics – Built from the ground up for the cloud, SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, and machine code generation to run interactive queries on large datasets and get rapid responses. We use this option for this post.
    2. Directly query your data – You can connect to the data source in real time, but if the data query is expected to bring bulky results, this option might slow down the dashboard refresh.
  2. Choose Visualize to complete the data source creation process.


Now you can build your visualization sheets. QuickSight refreshes the data source first. You can also schedule a periodic refresh of your data source.


The following screenshot shows some examples of visualizations we built from the data source.


This dashboard presents us with two main areas for cost optimization:

  • Usage analysis – We can see how analyses and dashboards are being consumed by users. This area highlights an opportunity for cost savings by looking at datasets that haven't been used in any analysis for the last 90 days but are still holding a major chunk of SPICE capacity.
  • Account governance – Because author subscriptions are charged on a fixed-fee basis, it's important to monitor whether they are actively used. The dashboard helps us identify authors who have been idle for the last 60 days.

Based on the information in the dashboard, we could do the following to save costs:

  • Delete (or stop refreshing) SPICE datasets that haven't been used in any analysis for the last 90 days to reclaim SPICE capacity.
  • Review authors who have been idle for the last 60 days, and convert them to readers or remove their subscriptions.

Conclusion

In this post, we showed how you can use CloudTrail logs to review the use of QuickSight objects, including analyses, dashboards, datasets, and users. You can use the information available in the dashboard to save money on storage and subscriptions, understand the maturity of QuickSight adoption, and more.


About the Author

Sunil Salunkhe is a Senior Solution Architect working with Strategic Accounts on their vision to leverage the cloud to drive aggressive growth strategies. He practices customer obsession by solving their complex challenges in all aspects of the cloud journey, including scale, security, and reliability. When not working, he enjoys playing cricket and cycling with his wife and son.

Building an administrative console in Amazon QuickSight to analyze usage metrics

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/building-an-administrative-console-in-amazon-quicksight-to-analyze-usage-metrics/

Given the scalability of Amazon QuickSight to hundreds and thousands of users, a common use case is to monitor QuickSight group and user activities, analyze the utilization of dashboards, and identify usage patterns of an individual user and dashboard. With timely access to interactive usage metrics, business intelligence (BI) administrators and data team leads can efficiently plan for stakeholder engagement and dashboard improvements. For example, you can remove inactive authors to reduce license cost, as well as analyze dashboard popularity to understand user acceptance and stickiness.

This post demonstrates how to build an administrative console dashboard and serverless data pipeline. We combine QuickSight APIs with AWS CloudTrail logs to create the datasets to collect comprehensive information of user behavior and QuickSight asset usage patterns.

This post provides a detailed workflow that covers the data pipeline, sample Python code, and a sample dashboard of this administrative console. With the guidance of this post, you can configure this administrative console in your own environment.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they have collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get QuickSight namespace, group, user, and assets access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. CloudTrail logs are stored in an S3 bucket.
  3. Three Amazon Athena tables and several views are created based on the Amazon S3 files that contain the user-group information and the QuickSight asset access permissions, as well as the view-dashboard and user login events in the CloudTrail logs. Optionally, the BI engineer can combine these tables with employee information tables to display human resources information about the users.
  4. Two QuickSight datasets fetch the data from the Athena tables created in Step 3 through SPICE mode. Then, based on these datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • Amazon QuickSight
    • Amazon Athena
    • AWS Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Optionally, Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by launching the following AWS CloudFormation stack:

After the stack creation is successful, you have one Amazon CloudWatch Events rule, one Lambda function, one S3 bucket, and the corresponding AWS Identity and Access Management (IAM) policies.

To create the resources in a Region other than us-east-1, download the Lambda function.

Creating Athena tables

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight APIs list_namespaces, list_users, list_user_groups, list_dashboards, list_datasets, list_datasources, list_analyses, list_themes, describe_data_set_permissions, describe_dashboard_permissions, describe_data_source_permissions, describe_analysis_permissions, and describe_theme_permissions to get QuickSight users and assets access permissions information. Finally, this function creates two files, group_membership.csv and object_access.csv, and saves these files to an S3 bucket.

Run the following SQL query to create two Athena tables (group_membership and object_access):

CREATE EXTERNAL TABLE `group_membership`(
`namespace` string,   
`group` string, 
`user` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/group_membership/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',',
  'typeOfData'='file');
CREATE EXTERNAL TABLE `object_access`(
`aws_region` string,   
`object_type` string, 
`object_name` string,
`object_id` string,
`principal_type` string,
`principal_name` string,
`namespace` string,
`permissions` string
)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/object_access/'
TBLPROPERTIES (
  'areColumnsQuoted'='false', 
  'classification'='csv', 
  'columnsOrdered'='true', 
  'compressionType'='none', 
  'delimiter'=',',
  'typeOfData'='file')
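
To verify that the Lambda output is being picked up, you can preview a few rows from each table. Note that group is a reserved word in Athena queries, so it (and, for safety, user) is double-quoted:

SELECT "namespace", "group", "user" FROM group_membership LIMIT 10;

SELECT object_type, object_name, principal_type, principal_name, permissions FROM object_access LIMIT 10;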

The following screenshot is sample data of the group_membership table.

The following screenshot is sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.
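
If you want to create that table without leaving this post, the following abridged sketch is based on the standard CloudTrail table definition from the Athena documentation, trimmed to the columns used by the views that follow; point LOCATION at your own CloudTrail bucket and cross-check the full statement in the linked post:

CREATE EXTERNAL TABLE cloudtrail_logs (
  useridentity STRUCT<
    type: STRING,
    principalid: STRING,
    arn: STRING,
    accountid: STRING,
    username: STRING,
    sessioncontext: STRUCT<
      sessionissuer: STRUCT<
        type: STRING,
        principalid: STRING,
        arn: STRING,
        accountid: STRING,
        username: STRING>>>,
  eventtime STRING,
  eventsource STRING,
  eventname STRING,
  awsregion STRING,
  serviceeventdetails STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<your_cloudtrail_bucket>/AWSLogs/<aws_account_id>/CloudTrail/'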

Creating views in Athena

Now we have the tables ready in Athena and can run SQL queries against them to generate some views to analyze the usage metrics of dashboards and users.

Create a view of a user’s role status with the following code:

CREATE OR REPLACE VIEW users AS
(SELECT "namespace",
 "group",
 "user",
(CASE 
WHEN "group" IN ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin') 
THEN 'Author' 
ELSE 'Reader' 
END) 
AS author_status
FROM "group_membership");

Create a view of GetDashboard events that happened in the last 3 months with the following code:

CREATE OR REPLACE VIEW getdashboard AS 
(SELECT "useridentity"."type",
"split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) AS "assumed_role",
COALESCE("useridentity"."username", "concat"("split_part"("useridentity"."arn", '/', 2), '/', "split_part"("useridentity"."arn", '/', 3))) AS "user_name",
awsregion,
"split_part"("split_part"("serviceeventdetails", 'dashboardName":', 2), ',', 1) AS dashboard_name,
"split_part"("split_part"("split_part"("split_part"("serviceeventdetails", 'dashboardId":', 2), ',', 1), 'dashboard/', 2), '"}', 1) AS dashboardId,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time,
max(date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) AS latest_event_time
FROM cloudtrail_logs
WHERE 
eventsource = 'quicksight.amazonaws.com' 
AND
eventname = 'GetDashboard' 
AND
DATE_TRUNC('day', date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) > cast(current_date - interval '3' month AS date)
GROUP BY 1, 2, 3, 4, 5, 6, 7)

In the preceding query, the conditions defined in the where clause only fetch the records of GetDashboard events of QuickSight.
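
For example, you can query this view directly in Athena to rank dashboards by distinct viewers over the last 3 months (a sketch; the dashboard built later surfaces similar metrics visually):

SELECT dashboard_name,
       count(DISTINCT user_name) AS distinct_viewers
FROM   getdashboard
GROUP  BY 1
ORDER  BY distinct_viewers DESC
LIMIT  3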

How can we design queries to fetch records of other events? We can review the CloudTrail logs to look for the information. For example, let’s look at the sample GetDashboard CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "<principal_id>: <user_name>",
        "arn": "arn:aws:sts:: <aws_account_id>:assumed-role/<IAM_role_ name>/<user_name>",
        "accountId": "<aws_account_id>",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "<principal_id>",
                …
            }
        }
    },
    "eventTime": "2021-01-13T16:55:36Z",
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "GetDashboard",
    "awsRegion": "us-east-1",
    "eventID": "a599c8be-003f-46b7-a40f-2319efb6b87a",
    "readOnly": true,
    "eventType": "AwsServiceEvent",
    "serviceEventDetails": {
        "eventRequestDetails": {
            "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>"
        },
        "eventResponseDetails": {
            "dashboardDetails": {
                "dashboardName": "Admin Console",
                "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>",
                "analysisIdList": [
                    "arn:aws:quicksight:us-east-1: <aws_account_id>:analysis/<analysis_id>"
            }
        }
    }
}

With eventSource = 'quicksight.amazonaws.com' and eventName = 'GetDashboard', we can get all the view QuickSight dashboard events.

Similarly, we can define the condition as eventname = 'AssumeRoleWithSAML' to fetch the user login events. (This solution assumes that the users log in to their QuickSight account with identity federation through SAML.) For more information about querying CloudTrail logs to monitor other interesting user behaviors, see Using administrative dashboards for a centralized view of Amazon QuickSight objects.
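
The following is a minimal sketch of such a login-events view, mirroring the structure of the getdashboard view (the view name userlogin is illustrative):

CREATE OR REPLACE VIEW userlogin AS
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name",
        awsregion,
        date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE eventname = 'AssumeRoleWithSAML'
GROUP BY 1, 2, 3)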

Furthermore, we can join with employee information tables to get a QuickSight user’s human resources information.

Finally, we can generate a view called admin_console with QuickSight group and user information, assets information, CloudTrail logs, and, optionally, employee information. The following screenshot shows an example preview.

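As a minimal sketch, assuming the users and getdashboard views above (how QuickSight user names map to the CloudTrail user_name depends on your identity federation setup, so adjust the join key accordingly), the admin_console view could look like the following:

CREATE OR REPLACE VIEW admin_console AS
SELECT u."namespace",
       u."group",
       u."user",
       u.author_status,
       d.dashboard_name,
       d.dashboardid,
       d.awsregion,
       d.latest_event_time
FROM   users u
       LEFT JOIN getdashboard d
              ON u."user" = d.user_name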

Creating datasets

With the Athena views ready, we can build some QuickSight datasets. We can load the admin_console view to build a SPICE dataset called admin_console and schedule it to refresh hourly. Optionally, you can create a similar dataset called admin_console_login_events with the Athena table based on eventname = 'AssumeRoleWithSAML' to analyze QuickSight user login events. Depending on the usage metrics requirements in your organization, you can create other datasets to serve different requests.

Building dashboards

Now we can build a QuickSight dashboard as the administrative console to analyze usage metrics. The following steps are based on the dataset admin_console. The schema of the optional dataset admin_console_login_events is the same as admin_console. You can apply the same logic to create the calculated fields to analyze user login activities.

  1. Create parameters.

For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

  2. Create controls based on the parameters.

  3. Create calculated fields.

For instance, we can create a calculated field to detect the active or inactive status of QuickSight authors. If the time span between the latest view-dashboard activity and now is greater than or equal to the number defined in the InActivityMonths control, the author status is Inactive. The following screenshot shows the relevant code.

Depending on end-user requirements, we can define several calculated fields to perform the analysis.
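
As a sketch, assuming the dataset exposes the latest_event_time field from the getdashboard view (adjust the field name to match your dataset), the author status calculated field could look like the following QuickSight expression:

ifelse(
    dateDiff({latest_event_time}, now(), "MM") >= ${InActivityMonths},
    "Inactive",
    "Active"
)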

  4. Create visuals.

For example, we create an insight to display the top three dashboards viewed by readers and a visual to display the authors of these dashboards.

  5. Add a URL action to define extra features, such as emailing inactive authors or checking details of users.

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!

The following screenshots show an example dashboard that you can make using our data.

The following is the administrative console landing page. It provides an overview, terminology explanations, and thumbnails of the other two tabs.

The following screenshots show the User Analysis tab.

The following screenshots show the Dashboards Analysis tab.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The public template ARN of this dashboard is 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console'.

Additional usage metrics

Additionally, we can perform more complicated analysis to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze the readers who log in but don't view any dashboards (see the following code). This helps their clients identify and prevent wasted reader session fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
with login as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventname = 'AssumeRoleWithSAML'
GROUP BY  1,2,3),
dashboard as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
GROUP BY  1,2,3),
users as 
(select Namespace,
Group,
User,
(case
when Group in (‘quicksight-fed-bi-developer’, ‘quicksight-fed-bi-admin’)
then ‘Author’
else ‘Reader’
end)
as author_status
from "group_membership" )
select l.* 
from login as l 
join dashboard as d 
join users as u 
on l.user_name=d.user_name 
and 
l.awsregion=d.awsregion 
and 
l.user_name=u.user_name
where d.event_time>(l.event_time + interval '30' minute ) 
and 
d.event_time<l.event_time 
and 
u.author_status='Reader'

Cleaning up

To avoid incurring future charges, delete the resources you created with the CloudFormation template.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the administrative console dashboard.

You can request a demo of this administrative console to try for yourself.


About the Authors

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Jill Florant manages Customer Success for the Amazon QuickSight Service team.

Create a custom data connector to Slack’s Member Analytics API in Amazon QuickSight with Amazon Athena Federated Query

Post Syndicated from Pablo Redondo Sanchez original https://aws.amazon.com/blogs/big-data/create-a-custom-data-connector-to-slacks-member-analytics-api-in-amazon-quicksight-with-amazon-athena-federated-query/

Amazon QuickSight recently added support for Amazon Athena Federated Query, which allows you to query data in place from various data sources. With this capability, QuickSight can extend support to query additional data sources like Amazon CloudWatch Logs, Amazon DynamoDB, and Amazon DocumentDB (with Mongo DB compatibility) via their existing Amazon Athena data source. You can also use the Athena Query Federation SDK to write custom connectors and query any source accessible with a Java API, whether it is relational, non-relational, object, or a custom data endpoint.

A common analytics use case is to access data from a REST API endpoint and blend it with information from other sources. In this post, I walk you through the process of setting up a custom federated query connector in Athena to query data from a REST API endpoint and build a QuickSight dashboard that blends data from the REST API endpoint with other data sources.

To illustrate this use case, we work with Slack, the makers of a leading channel-based messaging platform, to test their Member Analytics API, which can help our mock company, Example Corp, understand Slack adoption and member engagement across different teams.

How the Slack Member Analytics API works

The following diagram illustrates the Slack Member Analytics API.

The Slack Member Analytics API is a REST API endpoint available for Slack Enterprise Grid customers. Authorized users and services can access the member usage stats dataset via the admin.analytics.getFile endpoint of the Slack Web API. The data consists of a newline-delimited JSON file with daily Slack activity stats at the member level. A record looks like the following code:

{ 
    "enterprise_id":"AAAAAAA",
    "date":"2020-11-10",
    "user_id":"U01ERHY4589",
    "email_address":"[email protected]",
    "is_guest":false,
    "is_billable_seat":false,
    "is_active":true,
    "is_active_ios":false,
    "is_active_android":false,
    "is_active_desktop":true,
    "reactions_added_count":3,
    "messages_posted_count":10, 
    "channel_messages_posted_count":0,
    "files_added_count":0
}

To request data, you must provide a date argument in the format YYYY-MM-DD, a type argument with the value member, and an OAuth bearer token in the request header. The response is a compressed (.gzip) JSON file with data for the requested date. See the following sample request:

curl -X GET -H "Authorization: Bearer xoxp-..." "https://slack.com/api/admin.analytics.getFile?date=2020-09-01&type=member" > data.gzip

Building the solution for Example Corp

For our use case, Example Corp has recently purchased Slack for 1,000 users and as the Collaboration team onboards new teams to Slack, they want to measure Slack adoption and engagement within each new team. If they see low adoption or engagement within a group at the company, they can work with that group to understand why they aren’t using Slack and provide education and support, as needed.

Example Corp wants to provide analysts access to the Slack member usage stats to run ad hoc queries in place (directly from the source) without maintaining a new extract, transform, and load (ETL) pipeline. They use the QuickSight cross data source join feature to blend their Slack usage stats with their HR dataset.

To achieve this, Example Corp implements the following steps:

  1. Authorize the custom federated query connector with Slack to access the Member Analytics API.
  2. Develop and deploy a custom federated query connector in the Example Corp AWS account.
  3. Create a dataset in the Example Corp QuickSight environment that reads Slack member usage data for the last 30 days and blends it with an HR dataset.
  4. Create a QuickSight dashboard that shows usage trends of provisioned vs. active users.

Example Corp program managers can now monitor Slack engagement using their QuickSight dashboard (see the following screenshot).

The following diagram illustrates the overall architecture of the solution.

The following sections describe the components in detail and provide sample code to implement the solution in your environment.

Authorizing the custom federated query connector to access the Slack Analytics API

Data REST API endpoints typically have an authentication mechanism such as standard HTTP authentication or a bearer token. In the case of the Slack Web API, a bearer token is required on every request. The Slack Member Analytics API uses an OAuth protocol to authorize applications’ read access to data from an organization’s Slack environment.

To perform the OAuth handshake, Example Corp deploys a custom web application on Amazon Elastic Compute Cloud (Amazon EC2) and registers it as a new Slack application. When it’s deployed, Example Corp Slack admins can access the web application UI to authenticate with Slack and authorize read access to the custom federated query connector. After successful authentication, the custom web application stores the bearer token as a secret in AWS Secrets Manager. Only the custom application server and the federated query connector have access to this secret.

The following is an architecture diagram and brief description of the OAuth authorization workflow between Slack.com and the custom web application. As a prerequisite, you need to register your custom application with Slack.

  1. The Slack admin accesses the custom application UI from their browser and chooses Add to Slack to begin the authorization process.

  2. The custom application redirects the admin to Slack.com to authenticate and authorize the client with admin.analytics:read access for the Example Corp Slack Enterprise Grid.

  3. Slack.com redirects the admin back to the custom application UI, passing a temporary authorization code in the request.
  4. On the backend, the custom application retrieves Slack client secrets from a Secrets Manager secret. The Slack client secrets are obtained during the Slack application registration.
  5. The custom application server makes a request for a bearer token to the Slack API, passing both the temporary authorization code and the Slack client secrets.
  6. If both the temporary authorization code and the client secrets are valid, the Slack API returns a bearer token to the custom application server.
  7. The custom application saves the bearer token in the Secrets Manager secret.
  8. Finally, the application sends a confirmation of successful authorization to the admin.

Slack admins can revoke access to the application from the organization’s console at any time.

You can find the source code and detailed instructions to deploy this sample OAuth web application in the GitHub repo. When the authorization workflow is complete, you can pause or stop the resources running the web application. Going forward, the federated query connector accesses the token from Secrets Manager.

Deploying the custom federated query connector

When the OAuth workflow is complete, we can deploy the custom federated query connector in the Example Corp AWS environment. For Example Corp, we develop a custom AWS Lambda function using the Athena Query Federation Java SDK and a Java HTTP client to connect with the Slack Member Analytics REST API. Finally, we register it as a new data source within Athena.

The following is a diagram of how the custom connector workflow operates.

The workflow includes the following steps:

  1. Users submit a query to Athena, for example: select * from <catalog_name>.slackanalytics.member_analytics where date='2020-11-10', where <catalog_name> is the name specified when creating the Athena data source.
  2. Athena compiles the query and runs the Lambda function to retrieve the Slack authorization token from Secrets Manager and determine the number of partitions based on the query predicates (where clause).
  3. The Slack Member Analytics Connector partitions the data by date and runs a Lambda function for each partition (date) specified in the query. For example, if the predicate is WHERE date IN ('2020-11-10', '2020-11-12'), Athena runs two instances of the Lambda function. When no dates are specified in the where clause, the connector gets data for the last 30 days.
  4. Each instance of the Lambda function makes a request to the Slack Member API to retrieve data for each day.
  5. Finally, Athena performs any aggregation and computation specified in the query and returns the results to the client.

You can deploy this sample Slack Member Analytics Lambda function in your AWS environment via AWS CloudFormation with the following template. If you want to modify and build the connector from scratch, you can find the source code and instructions in the GitHub repo.

After the Lambda function has been deployed, create a new data source in Athena. For step-by-step instructions, see Deploying a Connector and Connecting to a Data Source.

  1. On the Athena console, in the query editor, choose Connect data source.

  2. Select All other data sources.
  3. Point your catalog to your new Lambda function.

You should be able to browse your new catalog within Athena from the Athena console and query the Slack Member Analytics API using SQL.

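For example, assuming the data source was registered under the catalog name slackanalytics_catalog (the name used later in this post), the following query retrieves activity for two specific days:

SELECT date,
       user_id,
       is_active,
       messages_posted_count
FROM   slackanalytics_catalog.slackanalytics.member_analytics
WHERE  date IN ('2020-11-10', '2020-11-12')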

Creating a dataset that reads Slack member usage data and blends it with an HR dataset

As a prerequisite to query the Slack Member Analytics API from QuickSight, we must provide the proper permission for QuickSight to access the federated query data source in Athena. We do this directly from the QuickSight admin UI following these steps:

  1. As an admin, on the Admin menu, choose Manage QuickSight.
  2. Under Security & Permissions, choose QuickSight access to AWS services.
  3. Choose Add or Remove services.
  4. Select Athena.
  5. Choose Next when prompted to set the Amazon Simple Storage Service (Amazon S3) bucket and Lambda function permissions.

QuickSight browses the Athena catalogs and displays any Lambda functions associated with your account. If you don’t see a Lambda function, it means you haven’t mapped a data source within Athena.

  6. Select the function.
  7. Choose Finish.

When the Example Corp QuickSight environment has the proper permissions, analysts can query the Slack Analytics Member API using their existing Athena data source. For instructions on creating your own dataset, see Creating a Dataset Using Amazon Athena Data.

The custom connector appears as a new Catalog, Database, and Tables option.

  1. In QuickSight, on the Datasets page, choose New dataset.

  2. Choose Athena as your data source.
  3. Choose Create dataset.

  4. Choose your table or, for this use case, choose Use custom SQL.

For this analysis, we write a custom SQL query that gets member activity since the start of the current month:

SELECT date,
       is_active,
       email_address,
       messages_posted_count
FROM   slackanalytics_catalog.slackanalytics.member_analytics
WHERE  date >= date_format(date_trunc('month',current_date),'%Y-%m-%d')

With the QuickSight cross data source join feature, analysts can enrich the Slack member stats with their HR info. For this use case, we imported a local HR_dataset.csv file containing the list of subscribed users with their respective Example Corp department, and joined them via the employee_email field.

The result is a dataset with Slack activity by employee and department. We’ve also updated the date field from a String type to a Date type using the QuickSight Data Prep page to take advantage of additional visualization features with Date type fields.

Creating a QuickSight dashboard that shows usage trends of provisioned vs. active users

Example Corp Analysts want to visualize the trend of provisioned users vs. active users and understand Slack adoption by department. To support these visualizations, we created the following calculated fields within our QuickSight analysis:

  • active: distinct_countIf(employee, {is_active} = 'true')
  • provisioned: distinct_count(employee)

You can also create these calculated fields when you create your dataset. This way, you can reuse them in other QuickSight analyses. 

We use QuickSight narrative insights, a line chart, a bar chart, and a pivot table with conditional formatting to create the following analysis.

From this analysis, Example Corp can see that the adoption trend is positive; however, there is an adoption gap within the Marketing team. The program managers can engage the Marketing department leads and focus their training resources to improve their adoption.

This dashboard can now be published to stakeholders within the organization as needed—either within the QuickSight app or embedded within existing enterprise applications. 

Conclusion

With the recent integration of QuickSight and Athena Federated Query, organizations can access additional data sources beyond those already supported by QuickSight. Analysts can leverage QuickSight capabilities to analyze and build dashboards that blend data from a variety of data sources, and with the Athena Query Federation SDK, you can build custom connectors to access relational, non-relational, object, and custom data endpoints using standard SQL.

To get started, try the lab Athena Deploying Custom Connector.


About the Author

Pablo Redondo Sanchez is a Senior Solutions Architect at Amazon Web Services. He is a data enthusiast and works with customers to help them achieve better insights and faster outcomes from their data analytics workflows. In his spare time, Pablo enjoys woodworking and spending time outdoors with his family in Northern California.