Tag Archives: Amazon QuickSight

Run usage analytics on Amazon QuickSight using AWS CloudTrail

Post Syndicated from Sunil Salunkhe original https://aws.amazon.com/blogs/big-data/run-usage-analytics-on-amazon-quicksight-using-aws-cloudtrail/

Amazon QuickSight is a cloud-native BI service that allows end users to create and publish dashboards in minutes, without provisioning any servers or requiring complex licensing. You can view these dashboards on the QuickSight product console or embed them into applications and websites. After you deploy a dashboard, it's important to assess how it and other assets are being adopted, accessed, and used across various departments or customers.

In this post, we use a QuickSight dashboard to present the following insights:

  • Most viewed and accessed dashboards
  • Most updated dashboards and analyses
  • Most popular datasets
  • Active users vs. idle users
  • Idle authors
  • Unused datasets (wasted SPICE capacity)

You can use these insights to reduce costs and create operational efficiencies in a deployment. The following diagram illustrates this architecture.

Solution components

The following table summarizes the AWS services and resources that this solution uses.

Resource Type | Name | Purpose
AWS CloudTrail logs | CloudTrailMultiAccount | Capture all API calls for all AWS services across all AWS Regions for this account. You can use AWS Organizations to consolidate trails across multiple AWS accounts.
AWS Glue crawlers | QSCloudTrailLogsCrawler, QSProcessedDataCrawler | Ensure that all CloudTrail data is crawled periodically and that partitions are updated in the AWS Glue Data Catalog.
AWS Glue ETL job | QuickSightCloudTrailProcessing | Reads catalogued data from the crawler, processes, transforms, and stores it in an S3 output bucket.
AWS Lambda function | ExtractQSMetadata_func | Extracts event data using the AWS SDK for Python (Boto3). The event data is enriched with QuickSight metadata objects like users, analyses, datasets, and dashboards.
Amazon Simple Storage Service (Amazon S3) | CloudTrailLogsBucket, QuickSight-BIonBI-processed | One bucket stores CloudTrail data; the other stores processed data.
Amazon QuickSight | Quicksight_BI_On_BO_Analysis | Visualizes the processed data.

Solution walkthrough

AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account. You can use CloudTrail to log, continuously monitor, and retain account activity related to actions across your AWS infrastructure. You can define a trail to collect API actions across all AWS Regions. Although we have enabled a trail for all Regions in our solution, the dashboard shows the data for a single Region only.
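
If you prefer to script the trail setup, a minimal AWS SDK for Python (Boto3) sketch might look like the following. The bucket name is hypothetical, and the bucket must already have a bucket policy that allows CloudTrail to write to it.

import boto3

cloudtrail = boto3.client('cloudtrail')

# Create a multi-Region trail; the target S3 bucket must already grant
# CloudTrail write access via its bucket policy.
trail = cloudtrail.create_trail(
    Name='CloudTrailMultiAccount',
    S3BucketName='cloudtrail-logs-<aws_account_id>',  # hypothetical bucket
    IsMultiRegionTrail=True,
)
cloudtrail.start_logging(Name=trail['TrailARN'])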

After you enable CloudTrail, it starts capturing all API actions and then, at 15-minute intervals, delivers logs in JSON format to a configured Amazon Simple Storage Service (Amazon S3) bucket. Before the logs are made available to our ad hoc query engine, Amazon Athena, they must be parsed, transformed, and processed by the AWS Glue crawler and ETL job.

These steps are handled by an AWS Glue crawler and an AWS Glue ETL job. The AWS Glue crawler crawls through the data every day and populates new partitions in the Data Catalog. The data is later made available as a table on the Athena console for processing by the AWS Glue ETL job. The ETL job QuickSightCloudtrail_GlueJob.txt filters the logs and processes only those events whose event source is QuickSight (for example, eventSource = 'quicksight.amazonaws.com').

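The full job script isn't reproduced here, but the heart of the filtering step might look like the following PySpark sketch (the Data Catalog database and table names are assumptions):

import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read the crawled CloudTrail table from the Data Catalog
# (database and table names are assumptions).
logs = glue_context.create_dynamic_frame.from_catalog(
    database='cloudtrail_db', table_name='cloudtrail_logs').toDF()

# Keep only QuickSight events and write the result out as Parquet.
logs.filter(logs['eventsource'] == 'quicksight.amazonaws.com') \
    .write.mode('overwrite').parquet('s3://<BucketName>/processedlogs/')

job.commit()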

The following screenshot shows the sample JSON for the QuickSight API calls.

The job processes those events and creates a Parquet file. The following table summarizes the file’s data points.

Quicksightlogs
Field Name | Data Type
eventtime | Datetime
eventname | String
awsregion | String
accountid | String
username | String
analysisname | String
date | Date

The processed data is stored in an S3 folder at s3://<BucketName>/processedlogs/. For performance optimization during querying and for connecting this data to QuickSight for visualization, these logs are partitioned by the date field. For this reason, we recommend that you configure the AWS Glue crawler to detect the new data and partitions and update the Data Catalog for subsequent analysis. We have configured the crawler to run once a day.

We need to enrich this log data with metadata from QuickSight, such as a list of analyses, users, and datasets. This metadata can be extracted using the describe_analysis, describe_user, and describe_data_set operations in the AWS SDK for Python (Boto3).

We provide an AWS Lambda function that is ideal for this extraction. We configured it to be triggered once a day through Amazon EventBridge. The extracted metadata is stored in the S3 folder at s3://<BucketName>/metadata/.
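
A condensed sketch of such a function follows; it lists QuickSight users and writes them to the metadata folder. The account ID and bucket placeholders mirror the ones used above, and the analysis and dataset calls would follow the same pattern.

import json

import boto3

quicksight = boto3.client('quicksight')
s3 = boto3.resource('s3')

def lambda_handler(event, context):
    # Page through all users in the default namespace.
    users, token = [], None
    while True:
        kwargs = {'AwsAccountId': '<aws_account_id>', 'Namespace': 'default'}
        if token:
            kwargs['NextToken'] = token
        page = quicksight.list_users(**kwargs)
        users.extend(page['UserList'])
        token = page.get('NextToken')
        if token is None:
            break

    # Persist the extracted metadata for later enrichment in Athena.
    s3.Object('<BucketName>', 'metadata/users.json').put(
        Body=json.dumps(users, default=str))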

Now that we have processed logs and metadata for enrichment, we need to prepare the data visualization in QuickSight. Athena allows us to build views that can be imported into QuickSight as datasets.

We build the following views based on the tables populated by the Lambda function and the ETL job:

CREATE VIEW vw_quicksight_bionbi 
AS 
  SELECT Date_parse(eventtime, '%Y-%m-%dT%H:%i:%SZ') AS "Event Time", 
         eventname  AS "Event Name", 
         awsregion  AS "AWS Region", 
         accountid  AS "Account ID", 
         username   AS "User Name", 
         analysisname AS "Analysis Name", 
         dashboardname AS "Dashboard Name", 
         Date_parse(date, '%Y%m%d') AS "Event Date" 
  FROM   "quicksightbionbi"."quicksightoutput_aggregatedoutput" 

CREATE VIEW vw_users 
AS 
  SELECT usr.username "User Name", 
         usr.role     AS "Role", 
         usr.active   AS "Active" 
  FROM   (quicksightbionbi.users 
          CROSS JOIN Unnest("users") t (usr)) 

CREATE VIEW vw_analysis 
AS 
  SELECT aly.analysisname "Analysis Name", 
         aly.analysisid   AS "Analysis ID" 
  FROM   (quicksightbionbi.analysis 
          CROSS JOIN Unnest("analysis") t (aly)) 

CREATE VIEW vw_analysisdatasets 
AS 
  SELECT alyds.analysesname "Analysis Name", 
         alyds.analysisid   AS "Analysis ID", 
         alyds.datasetid    AS "Dataset ID", 
         alyds.datasetname  AS "Dataset Name" 
  FROM   (quicksightbionbi.analysisdatasets 
          CROSS JOIN Unnest("analysisdatasets") t (alyds)) 

CREATE VIEW vw_datasets 
AS 
  SELECT ds.datasetname AS "Dataset Name", 
         ds.importmode  AS "Import Mode" 
  FROM   (quicksightbionbi.datasets 
          CROSS JOIN Unnest("datasets") t (ds))

QuickSight visualization

Follow these steps to connect the prepared data with QuickSight and start building the BI visualization.

  1. Sign in to the AWS Management Console and open the QuickSight console.

You can set up QuickSight access for end users through SSO providers such as AWS Single Sign-On (AWS SSO), Okta, Ping, and Azure AD so they don’t need to open the console.

  2. On the QuickSight console, choose Datasets.
  3. Choose New dataset to create a dataset for our analysis.

  4. For Create a Data Set, choose Athena.

In the previous steps, we prepared all our data in the form of Athena views.

  5. Configure permission for QuickSight to access AWS services, including Athena and its S3 buckets. For information, see Accessing Data Sources.

  6. For Data source name, enter QuickSightBIbBI.
  7. Choose Create data source.

  8. On Choose your table, for Database, choose quicksightbionbi.
  9. For Tables, select vw_quicksight_bionbi.
  10. Choose Select.

  11. For Finish data set creation, there are two options to choose from:
    1. Import to SPICE for quicker analytics – Built from the ground up for the cloud, SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, and machine code generation to run interactive queries on large datasets and get rapid responses. We use this option for this post.
    2. Directly query your data – You can connect to the data source in real time, but if the data query is expected to bring bulky results, this option might slow down the dashboard refresh.
  12. Choose Visualize to complete the data source creation process.

Now you can build your visualization sheets. QuickSight refreshes the data source first. You can also schedule a periodic refresh of your data source.

The following screenshot shows some examples of visualizations we built from the data source.

This dashboard presents us with two main areas for cost optimization:

  • Usage analysis – We can see how analyses and dashboards are being consumed by users. This area highlights an opportunity for cost savings: datasets that haven't been used in any analysis for the last 90 days but still hold a major chunk of SPICE capacity.
  • Account governance – Because author subscriptions are charged on a fixed-fee basis, it's important to monitor whether they are actively used. The dashboard helps us identify authors who have been idle for the last 60 days.

Based on the information in the dashboard, we could, for example, remove long-unused datasets to reclaim SPICE capacity and review the subscriptions of idle authors to save costs.

Conclusion

In this post, we showed how you can use CloudTrail logs to review the use of QuickSight objects, including analyses, dashboards, datasets, and users. You can use the information available in the dashboards to save money on storage and subscriptions, understand the maturity of QuickSight adoption, and more.


About the Author

Sunil Salunkhe is a Senior Solution Architect working with Strategic Accounts on their vision to leverage the cloud to drive aggressive growth strategies. He practices customer obsession by solving their complex challenges in all aspects of the cloud journey, including scale, security, and reliability. When not working, he enjoys playing cricket and cycling with his wife and son.

Building an administrative console in Amazon QuickSight to analyze usage metrics

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/building-an-administrative-console-in-amazon-quicksight-to-analyze-usage-metrics/

Given the scalability of Amazon QuickSight to hundreds and thousands of users, a common use case is to monitor QuickSight group and user activities, analyze the utilization of dashboards, and identify usage patterns of an individual user and dashboard. With timely access to interactive usage metrics, business intelligence (BI) administrators and data team leads can efficiently plan for stakeholder engagement and dashboard improvements. For example, you can remove inactive authors to reduce license cost, as well as analyze dashboard popularity to understand user acceptance and stickiness.

This post demonstrates how to build an administrative console dashboard and serverless data pipeline. We combine QuickSight APIs with AWS CloudTrail logs to create the datasets to collect comprehensive information of user behavior and QuickSight asset usage patterns.

This post provides a detailed workflow that covers the data pipeline, sample Python code, and a sample dashboard of this administrative console. With the guidance of this post, you can configure this administrative console in your own environment.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they have collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get QuickSight namespace, group, user, and assets access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. CloudTrail logs are stored in an S3 bucket.
  3. Three Amazon Athena tables and several views are created, based on the file in Amazon S3 that contains the user-group information, the QuickSight asset access permissions, and the view-dashboard and user login events in the CloudTrail logs. Optionally, the BI engineer can combine these tables with employee information tables to display human resources information about the users.
  4. Two QuickSight datasets fetch the data in the Athena tables created in Step 3 through SPICE mode. Then, based on these datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • Amazon QuickSight
    • Amazon Athena
    • AWS Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Optionally, Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by launching the following AWS CloudFormation stack:

After the stack creation is successful, you have one Amazon CloudWatch Events rule, one Lambda function, one S3 bucket, and the corresponding AWS Identity and Access Management (IAM) policies.

To create the resources in a Region other than us-east-1, download the Lambda function.

Creating Athena tables

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight APIs list_namespaces, list_users, list_user_groups, list_dashboards, list_data_sets, list_data_sources, list_analyses, list_themes, describe_data_set_permissions, describe_dashboard_permissions, describe_data_source_permissions, describe_analysis_permissions, and describe_theme_permissions to get QuickSight users and assets access permissions information. Finally, this function creates two files, group_membership.csv and object_access.csv, and saves these files to an S3 bucket.
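
A simplified sketch of the group-membership portion follows (pagination and the permissions file are omitted for brevity; the account ID placeholder and the S3 key mirror the table definitions shown next):

import csv
import io

import boto3

quicksight = boto3.client('quicksight')
s3 = boto3.resource('s3')
ACCOUNT_ID = '<aws_account_id>'

def build_group_membership(bucket_name):
    # Collect one (namespace, group, user) row per group member.
    rows = []
    groups = quicksight.list_groups(
        AwsAccountId=ACCOUNT_ID, Namespace='default')['GroupList']
    for group in groups:
        members = quicksight.list_group_memberships(
            GroupName=group['GroupName'],
            AwsAccountId=ACCOUNT_ID,
            Namespace='default')['GroupMemberList']
        for member in members:
            rows.append(['default', group['GroupName'], member['MemberName']])

    # Write the CSV where the group_membership Athena table expects it.
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    s3.Object(bucket_name,
              'monitoring/quicksight/group_membership/group_membership.csv'
              ).put(Body=buf.getvalue())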

Run the following SQL queries to create the two Athena tables (group_membership and object_access):

CREATE EXTERNAL TABLE `group_membership`(
`namespace` string,
`group` string,
`user` string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/group_membership/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'typeOfData'='file');

CREATE EXTERNAL TABLE `object_access`(
`aws_region` string,
`object_type` string,
`object_name` string,
`object_id` string,
`principal_type` string,
`principal_name` string,
`namespace` string,
`permissions` string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://admin-console<aws_account_id>/monitoring/quicksight/object_access/'
TBLPROPERTIES (
  'areColumnsQuoted'='false',
  'classification'='csv',
  'columnsOrdered'='true',
  'compressionType'='none',
  'delimiter'=',',
  'typeOfData'='file');

The following screenshot is sample data of the group_membership table.

The following screenshot is sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

Creating views in Athena

Now we have the tables ready in Athena and can run SQL queries against them to generate some views to analyze the usage metrics of dashboards and users.

Create a view of a user’s role status with the following code:

CREATE OR REPLACE VIEW users AS
(select "namespace",
 "group",
 "user",
(case 
when "group" in ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin') 
then 'Author' 
else 'Reader' 
end) 
as author_status
from "group_membership" );

Create a view of GetDashboard events that happened in the last 3 months with the following code:

CREATE OR REPLACE VIEW getdashboard AS 
(SELECT 
"useridentity"."type", 
"split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) AS "assumed_role", 
COALESCE("useridentity"."username", "concat"("split_part"("useridentity"."arn", '/', 2), '/', "split_part"("useridentity"."arn", '/', 3))) AS "user_name", 
awsregion, 
"split_part"("split_part"("serviceeventdetails", 'dashboardName":', 2), ',', 1) AS dashboard_name, 
"split_part"("split_part"("split_part"("split_part"("serviceeventdetails", 'dashboardId":', 2), ',', 1), 'dashboard/', 2), '"}', 1) AS dashboardId, 
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time, 
max(date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) AS latest_event_time
FROM cloudtrail_logs
WHERE 
eventsource = 'quicksight.amazonaws.com' 
AND eventname = 'GetDashboard' 
AND DATE_TRUNC('day', date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ')) > cast(current_date - interval '3' month AS date)
GROUP BY 1, 2, 3, 4, 5, 6, 7)

In the preceding query, the conditions defined in the where clause only fetch the records of GetDashboard events of QuickSight.

How can we design queries to fetch records of other events? We can review the CloudTrail logs to look for the information. For example, let’s look at the sample GetDashboard CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "<principal_id>: <user_name>",
        "arn": "arn:aws:sts:: <aws_account_id>:assumed-role/<IAM_role_ name>/<user_name>",
        "accountId": "<aws_account_id>",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "<principal_id>",
                …
            }
        }
    },
    "eventTime": "2021-01-13T16:55:36Z",
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "GetDashboard",
    "awsRegion": "us-east-1",
    "eventID": "a599c8be-003f-46b7-a40f-2319efb6b87a",
    "readOnly": true,
    "eventType": "AwsServiceEvent",
    "serviceEventDetails": {
        "eventRequestDetails": {
            "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>"
        },
        "eventResponseDetails": {
            "dashboardDetails": {
                "dashboardName": "Admin Console",
                "dashboardId": "arn:aws:quicksight:us-east-1: <aws_account_id>:dashboard/<dashboard_id>",
                "analysisIdList": [
                    "arn:aws:quicksight:us-east-1: <aws_account_id>:analysis/<analysis_id>"
                ]
            }
        }
    }
}

With eventSource = 'quicksight.amazonaws.com' and eventName = 'GetDashboard', we can get all the QuickSight dashboard view events.

Similarly, we can define the condition as eventname = 'AssumeRoleWithSAML' to fetch the user login events. (This solution assumes that the users log in to their QuickSight account with identity federation through SAML.) For more information about querying CloudTrail logs to monitor other interesting user behaviors, see Using administrative dashboards for a centralized view of Amazon QuickSight objects.

Furthermore, we can join with employee information tables to get a QuickSight user’s human resources information.

Finally, we can generate a view called admin_console with QuickSight group and user information, assets information, CloudTrail logs, and, optionally, employee information. The following screenshot shows an example preview.

Creating datasets

With the Athena views ready, we can build some QuickSight datasets. We can load the view called admin_console to build a SPICE dataset called admin_console and schedule this dataset to be refreshed hourly. Optionally, you can create a similar dataset called admin_console_login_events with the Athena table based on eventname = 'AssumeRoleWithSAML' to analyze QuickSight user login events. According to the usage metrics requirements in your organization, you can create other datasets to serve the different requests.

Building dashboards

Now we can build a QuickSight dashboard as the administrative console to analyze usage metrics. The following steps are based on the dataset admin_console. The schema of the optional dataset admin_console_login_events is the same as admin_console. You can apply the same logic to create the calculated fields to analyze user login activities.

  1. Create parameters.

For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

  2. Create controls based on the parameters.

  3. Create calculated fields.

For instance, we can create a calculated field to detect the active or inactive status of QuickSight authors. If the time span between the latest dashboard view activity and now is greater than or equal to the number defined in the Inactivity Months control, the author status is Inactive. The following screenshot shows the relevant code.

According to the end users' requirements, we can define several calculated fields to perform the analysis.

  4. Create visuals.

For example, we create an insight to display the top three dashboards viewed by readers and a visual to display the authors of these dashboards.

  5. Add a URL action to define extra features, such as emailing inactive authors or checking the details of users.

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!

The following screenshots show an example dashboard that you can make using our data.

The following is the administrative console landing page. We provide the overview, terminology explanations, and thumbnails of the other two tabs on this page.

The following screenshots show the User Analysis tab.

The following screenshots show the Dashboards Analysis tab.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The public template of this dashboard has the template ARN 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console'.
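
For example, a Boto3 call of the following shape can create the dashboard from that template (the dashboard ID, name, dataset placeholder, and dataset ARN are assumptions you would replace with your own values):

import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

quicksight.create_dashboard(
    AwsAccountId='<aws_account_id>',
    DashboardId='admin-console',   # hypothetical ID
    Name='Admin Console',          # hypothetical name
    SourceEntity={
        'SourceTemplate': {
            'Arn': 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console',
            'DataSetReferences': [{
                # The placeholder name must match the one defined in the
                # template; 'admin_console' is an assumption.
                'DataSetPlaceholder': 'admin_console',
                'DataSetArn': '<admin_console dataset ARN>',
            }],
        }
    },
)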

Additional usage metrics

Additionally, we can perform some complicated analysis to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze readers who log in but don't view any dashboards (see the following code). This helps their clients identify and prevent wasted reader session fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
with login as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventname = 'AssumeRoleWithSAML'
GROUP BY 1,2,3),
dashboard as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
GROUP BY 1,2,3),
users as 
(select "namespace",
"group",
"user",
(case
when "group" in ('quicksight-fed-bi-developer', 'quicksight-fed-bi-admin')
then 'Author'
else 'Reader'
end)
as author_status
from "group_membership" )
select l.* 
from login as l 
join dashboard as d 
  on l.user_name = d.user_name 
  and l.awsregion = d.awsregion 
join users as u 
  on l.user_name = u."user"
-- keep only dashboard views that fall outside the 30-minute window after login
where (d.event_time > (l.event_time + interval '30' minute) 
   or d.event_time < l.event_time) 
and u.author_status = 'Reader'

Cleaning up

To avoid incurring future charges, delete the resources you created with the CloudFormation template.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the administrative console dashboard.

You can request a demo of this administrative console to try for yourself.


About the Authors

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Jill Florant manages Customer Success for the Amazon QuickSight Service team.

Create a custom data connector to Slack’s Member Analytics API in Amazon QuickSight with Amazon Athena Federated Query

Post Syndicated from Pablo Redondo Sanchez original https://aws.amazon.com/blogs/big-data/create-a-custom-data-connector-to-slacks-member-analytics-api-in-amazon-quicksight-with-amazon-athena-federated-query/

Amazon QuickSight recently added support for Amazon Athena Federated Query, which allows you to query data in place from various data sources. With this capability, QuickSight can extend support to query additional data sources like Amazon CloudWatch Logs, Amazon DynamoDB, and Amazon DocumentDB (with MongoDB compatibility) via their existing Amazon Athena data source. You can also use the Athena Query Federation SDK to write custom connectors and query any source accessible with a Java API, whether it is relational, non-relational, object, or a custom data endpoint.

A common analytics use case is to access data from a REST API endpoint and blend it with information from other sources. In this post, I walk you through the process of setting up a custom federated query connector in Athena to query data from a REST API endpoint and build a QuickSight dashboard that blends data from the REST API endpoint with other data sources.

To illustrate this use case, we work with Slack, the makers of a leading channel-based messaging platform, to test their Member Analytics API, which can help our mock company, Example Corp, understand Slack adoption and member engagement across different teams.

How the Slack Member Analytics API works

The following diagram illustrates the Slack Member Analytics API.

The Slack Member Analytics API is a REST API endpoint available for Slack Enterprise Grid customers. Authorized users and services can access the member usage stats dataset via the admin.analytics.getFile endpoint of the Slack Web API. The data consists of a newline-delimited JSON file with daily Slack activity stats at the member level. A record looks like the following code:

{ 
    "enterprise_id":"AAAAAAA",
    "date":"2020-11-10",
    "user_id":"U01ERHY4589",
    "email_address":"[email protected]",
    "is_guest":false,
    "is_billable_seat":false,
    "is_active":true,
    "is_active_ios":false,
    "is_active_android":false,
    "is_active_desktop":true,
    "reactions_added_count":3,
    "messages_posted_count":10, 
    "channel_messages_posted_count":0,
    "files_added_count":0
}

To request data, you must provide a date argument in the format of YYYY-MM-DD, a type argument with the value member, and an OAuth bearer token as the header. The response is a compressed (.gzip) JSON file with data for the requested date. See the following code of a sample request:

curl -X GET -H "Authorization: Bearer xoxp-..." "https://slack.com/api/admin.analytics.getFile?date=2020-09-01&type=member" > data.gzip
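
The equivalent request in Python, using the third-party requests library, might look like this sketch:

import gzip
import io
import json

import requests

def get_member_stats(token, date):
    # Request one day of member analytics; the response body is a
    # gzip-compressed, newline-delimited JSON file.
    resp = requests.get(
        'https://slack.com/api/admin.analytics.getFile',
        params={'date': date, 'type': 'member'},
        headers={'Authorization': f'Bearer {token}'},
        timeout=30)
    resp.raise_for_status()
    with gzip.open(io.BytesIO(resp.content)) as f:
        return [json.loads(line) for line in f]

records = get_member_stats('xoxp-...', '2020-09-01')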

Building the solution for Example Corp

For our use case, Example Corp has recently purchased Slack for 1,000 users and as the Collaboration team onboards new teams to Slack, they want to measure Slack adoption and engagement within each new team. If they see low adoption or engagement within a group at the company, they can work with that group to understand why they aren’t using Slack and provide education and support, as needed.

Example Corp wants to provide analysts access to the Slack member usage stats to run ad hoc queries in place (directly from the source) without maintaining a new extract, transform, and load (ETL) pipeline. They use the QuickSight cross data source join feature to blend their Slack usage stats with their HR dataset.

To achieve this, Example Corp implements the following steps:

  1. Authorize the custom federated query connector with Slack to access the Member Analytics API.
  2. Develop and deploy a custom federated query connector in the Example Corp AWS account.
  3. Create a dataset in the Example Corp QuickSight environment that reads Slack member usage data for the last 30 days and blends it with an HR dataset.
  4. Create a QuickSight dashboard that shows usage trends of provisioned vs. active users.

Example Corp program managers can now monitor Slack engagement using their QuickSight dashboard (see the following screenshot).

The following diagram illustrates the overall architecture of the solution.

The following sections describe the components in detail and provide sample code to implement the solution in your environment.

Authorizing the custom federated query connector to access the Slack Analytics API

Data REST API endpoints typically have an authentication mechanism such as standard HTTP authentication or a bearer token. In the case of the Slack Web API, a bearer token is required on every request. The Slack Member Analytics API uses an OAuth protocol to authorize applications’ read access to data from an organization’s Slack environment.

To perform the OAuth handshake, Example Corp deploys a custom web application on Amazon Elastic Compute Cloud (Amazon EC2) and registers it as a new Slack application. When it’s deployed, Example Corp Slack admins can access the web application UI to authenticate with Slack and authorize read access to the custom federated query connector. After successful authentication, the custom web application stores the bearer token as a secret in AWS Secrets Manager. Only the custom application server and the federated query connector have access to this secret.

The following is an architecture diagram and brief description of the OAuth authorization workflow between Slack.com and the custom web application. As a prerequisite, you need to register your custom application with Slack.

  1. The Slack admin accesses the custom application UI from their browser and chooses Add to Slack to begin the authorization process.

  2. The custom application redirects the admin to Slack.com to authenticate and authorize the client with admin.analytics:read access for the Example Corp Slack Enterprise Grid.

  3. Slack.com redirects the admin back to the custom application UI, passing a temporary authorization code in the request.
  4. On the backend, the custom application retrieves Slack client secrets from a Secrets Manager secret. The Slack client secrets are obtained during the Slack application registration.
  5. The custom application server makes a request for a bearer token to the Slack API, passing both the temporary authorization code and the Slack client secrets.
  6. If both the temporary authorization code and the client secrets are valid, then the Slack API returns a bearer token to the custom application server.
  7. The custom application saves the bearer token in the Secrets Manager secret.
  8. Finally, the application sends a confirmation of successful authorization to the admin.

Slack admins can revoke access to the application from the organization’s console at any time.

You can find the source code and detailed instructions to deploy this sample OAuth web application in the GitHub repo. When the authorization workflow is complete, you can pause or stop the resources running the web application. Going forward, the federated query connector accesses the token from Secrets Manager.
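
As an illustration of steps 5-7 above, the token exchange and storage might be implemented along these lines (the secret name is hypothetical, and the exact response shape depends on your Slack app configuration):

import boto3
import requests

def exchange_and_store_token(code, client_id, client_secret):
    # Trade the temporary authorization code for a bearer token (step 5).
    resp = requests.post(
        'https://slack.com/api/oauth.v2.access',
        data={'code': code,
              'client_id': client_id,
              'client_secret': client_secret},
        timeout=30).json()
    if not resp.get('ok'):
        raise RuntimeError(resp.get('error', 'OAuth exchange failed'))

    # Save the bearer token in Secrets Manager (step 7); the secret
    # name is hypothetical.
    boto3.client('secretsmanager').put_secret_value(
        SecretId='slack/bearer-token',
        SecretString=resp['access_token'])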

Deploying the custom federated query connector

When the OAuth workflow is complete, we can deploy the custom federated query connector in the Example Corp AWS environment. For Example Corp, we develop a custom AWS Lambda function using the Athena Query Federation Java SDK and a Java HTTP client to connect with the Slack Member Analytics REST API. Finally, we register it as a new data source within Athena.

The following is a diagram of how the custom connector workflow operates.

The workflow includes the following steps:

  1. Users submit a query to Athena, for example select * from <catalog_name>.slackanalytics.member_analytics where date='2020-11-10', where <catalog_name> is the name specified when creating the Athena data source.
  2. Athena compiles the query and runs the Lambda function to retrieve the Slack authorization token from Secrets Manager and determine the number of partitions based on the query predicates (where clause).
  3. The Slack Member Analytics Connector partitions the data by date and runs a Lambda function for each partition (date) specified in the query. For example, if the predicate is WHERE date IN ('2020-11-10', '2020-11-12'), Athena runs two instances of the Lambda function. When no dates are specified in the where clause, the connector gets data for the last 30 days (see the sketch after this list).
  4. Each instance of the Lambda function makes a request to the Slack Member API to retrieve data for each day.
  5. Finally, Athena performs any aggregation and computation specified in the query and returns the results to the client.
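
The connector itself is written in Java with the Athena Query Federation SDK, but the date-partitioning rule from step 3 can be sketched in a few lines of Python:

from datetime import date, timedelta

def partition_dates(predicate_dates=None):
    # Fan out over the dates named in the WHERE clause, or default to
    # the last 30 days when the query doesn't constrain the date.
    if predicate_dates:
        return sorted(predicate_dates)
    today = date.today()
    return [today - timedelta(days=n) for n in range(30, 0, -1)]

# WHERE date IN ('2020-11-10', '2020-11-12') -> two partitions,
# so Athena runs two instances of the Lambda function.
print(partition_dates([date(2020, 11, 10), date(2020, 11, 12)]))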

You can deploy this sample Slack Member Analytics Lambda function in your AWS environment via AWS CloudFormation with the following template. If you want to modify and build the connector from scratch, you can find the source code and instructions in the GitHub repo.

After the Lambda function has been deployed, create a new data source in Athena. For step-by-step instructions, see Deploying a Connector and Connecting to a Data Source.

  1. On the Athena console, in the query editor, choose Connect data source.

  2. Select All other data sources.
  3. Point your catalog to your new Lambda function.

You should be able to browse your new catalog within Athena from the Athena console and query the Slack Member Analytics API using SQL.

Creating a dataset that reads Slack member usage data and blends it with an HR dataset

As a prerequisite to query the Slack Member Analytics API from QuickSight, we must provide the proper permission for QuickSight to access the federated query data source in Athena. We do this directly from the QuickSight admin UI following these steps:

  1. As an admin, on the Admin menu, choose Manage QuickSight.
  2. Under Security & Permissions, choose QuickSight access to AWS services.
  3. Choose Add or Remove services.
  4. Select Athena.
  5. Choose Next when prompted to set the Amazon Simple Storage Service (Amazon S3) bucket and Lambda function permissions.

QuickSight browses the Athena catalogs and displays any Lambda functions associated with your account. If you don’t see a Lambda function, it means you haven’t mapped a data source within Athena.

  6. Select the function.
  7. Choose Finish.

When the Example Corp QuickSight environment has the proper permissions, analysts can query the Slack Analytics Member API using their existing Athena data source. For instructions on creating your own dataset, see Creating a Dataset Using Amazon Athena Data.

The custom connector appears as a new Catalog, Database, and Tables option.

  1. In QuickSight, on the Datasets page, choose New dataset.

  2. Choose Athena as your data source.
  3. Choose Create dataset.

  4. Choose your table or, for this use case, choose Use custom SQL.

For this analysis, we write a custom SQL that gets member activity for the last 30 days:

SELECT date,
       is_active,
       email_address,
       messages_posted_count
FROM   slackanalytics_catalog.slackanalytics.member_analytics
WHERE  date >= date_format(date_trunc('month',current_date),'%Y-%m-%d')

With the QuickSight cross data source join feature, analysts can enrich the Slack member stats with their HR info. For this use case, we imported a local HR_dataset.csv file containing the list of subscribed users with their respective Example Corp department, and joined them via the employee_email field.

The result is a dataset with Slack activity by employee and department. We’ve also updated the date field from a String type to a Date type using the QuickSight Data Prep page to take advantage of additional visualization features with Date type fields.

Creating a QuickSight dashboard that shows usage trends of provisioned vs. active users

Example Corp Analysts want to visualize the trend of provisioned users vs. active users and understand Slack adoption by department. To support these visualizations, we created the following calculated fields within our QuickSight analysis:

  • active: distinct_countIf(employee, {is_active}='true')
  • provisioned: distinct_count(employee)

You can also create these calculated fields when you create your dataset. This way, you can reuse them in other QuickSight analyses. 

We use QuickSight narrative insights, a line chart, a bar chart, and a pivot table with conditional formatting to create the following analysis.

From this analysis, Example Corp can see that the adoption trend is positive; however, there is an adoption gap within the Marketing team. The program managers can engage the Marketing department leads and focus their training resources to improve their adoption.

This dashboard can now be published to stakeholders within the organization as needed—either within the QuickSight app or embedded within existing enterprise applications. 

Conclusion

With the recent integration of QuickSight and Athena Federated Query, organizations can access additional data sources beyond those already supported by QuickSight. Analysts can leverage QuickSight capabilities to analyze and build dashboards that blend data from a variety of data sources, and with the Athena Query Federation SDK, you can build custom connectors to access relational, non-relational, object, and custom data endpoints using standard SQL.

To get started, try the lab Athena Deploying Custom Connector.


About the Author

Pablo Redondo Sanchez is a Senior Solutions Architect at Amazon Web Services. He is a data enthusiast and works with customers to help them achieve better insights and faster outcomes from their data analytics workflows. In his spare time, Pablo enjoys woodworking and spending time outdoors with his family in Northern California.


How the Yahoo! JAPAN Smart Devices Team is improving voice user interfaces with Amazon QuickSight business intelligence

Post Syndicated from Kazuhide Fujita original https://aws.amazon.com/blogs/big-data/how-the-yahoo-japan-smart-devices-team-is-improving-voice-user-interfaces-with-amazon-quicksight-business-intelligence/

This is a guest blog post by Kazuhide Fujita, Product Manager at Yahoo! JAPAN.

Yahoo! JAPAN is a large internet search and media company, with Yahoo! JAPAN's web portal being one of the most commonly used websites in Japan. Our smart devices team is responsible for building and improving Yahoo! JAPAN apps for voice user interfaces (VUI) such as Amazon Alexa and Google Assistant. We see VUI as a market that will grow exponentially in the future, and we want to be ready to lead the consumer experience with such devices. In this post, I discuss how we're using Amazon QuickSight business intelligence (BI) to help our product teams improve these services.

Enhanced access to insights at lower cost

To continuously improve our services, we use data to understand how consumers are interacting with the software and to identify growth trends. However, the data we get directly from smart device makers is limited. So, we built our own log system to capture more granular data, such as the types of commands customers are using, the time of day they use the application, and how frequently they use it.

Early on, we used Amazon Elasticsearch Service (Amazon ES) and Kibana to analyze data. Although this solution was very capable, it came at a higher price point than we were targeting. Another option was to export data directly to Microsoft Excel for ad hoc analysis. However, this was very time consuming and limited us to working with extracts of historical data rather than the latest information.

We decided to look for a solution that would suit the full spectrum of our needs while being cost-effective for our specific use case. While we were searching, our data team made the decision to standardize on a data lake architecture using Amazon Simple Storage Service (Amazon S3) and Amazon Athena. This approach provided a high level of flexibility and scalability. To visualize our data, it made sense to use QuickSight, the serverless BI solution with pay-per-session pricing on AWS.

Unifying data to understand customers better

This system has proven to be a good fit for our needs. The data lake allows us to accumulate different types of data from monitoring many KPIs and VUI products. For example, we might want to know the number of active users over a given period, and then drill down into how active those users were in the 2 weeks from when they registered. The data lake makes this possible. It’s easy to maintain even though the data is very diverse. For aggregating and performing calculations on the data, we use Athena because it provides optimal performance for complex queries thanks to the distributed computing model.

For ad hoc analysis, dashboards, and reporting, QuickSight connects seamlessly to our data lake. QuickSight makes it easy to view trends in customer behavior such as the time of usage, method of interaction, typical settings, and so on. The following screenshot shows a sample dashboard in QuickSight.

For example, the default wake word for Alexa-powered devices is to say the name of the voice assistant: “Hey, Alexa.” However, Japanese customers may prefer to say “ohayō,” which means “good morning” in Japanese. Which setting customers prefer could be an important trend for us to know when we configure our offerings. With QuickSight, it’s easy to compare trends for this type of behavior across other user characteristics.

This is only one small example of the kinds of insights we glean by using QuickSight. Another use case is regarding initiatives to increase product usage through marketing or incentives. We can track the outcome of these programs using QuickSight by tracking whether they result in an uptick in usage relative to the communications we send out.

The freedom to focus on what matters to the business

One of the big advantages of using QuickSight and other AWS services is that we don’t have to worry about maintaining on-premises systems for our data lake and analytics. It’s easy to manage and we can focus on gaining insights and improving our products—not running data center infrastructure. Building our end-to-end data-to-insights pipeline on AWS ensures that we can easily apply security and governance policies to all our data.

Overall, QuickSight provides us with the flexibility to analyze all kinds of data quickly, so we can aim to be the market leader in the VUI marketplace. We’re excited to see what the future holds for this powerful tool—and to apply the knowledge we gain to improving our services.


About the Author

Kazuhide Fujita is the Skill Team Product Manager, Smart Device Division, at Yahoo Japan Corporation.

How Edmunds GovTech unifies data and analytics data for municipalities with Amazon QuickSight

Post Syndicated from Edmunds GovTech original https://aws.amazon.com/blogs/big-data/how-edmunds-govtech-unifies-data-and-analytics-data-for-municipalities-with-amazon-quicksight/

This is a guest post from an Amazon QuickSight customer, Edmunds GovTech

Over the past 30 years, Edmunds GovTech has grown to provide enterprise resource planning (ERP) solutions to thousands of East Coast municipalities. We also serve cities and towns in 25 other states. In this blog, I’ll talk about how we used Amazon QuickSight embedded business intelligence (BI) to quickly bring powerful dashboards to our on-premises and cloud-based customers.

Unifying insights

Our customers rely on our suite of solutions to manage finances, personnel, revenue, and municipal management activities such as permits, land management, business licensing, and fleet maintenance. They can access a wide variety of reports and data analysis tools tailored to the needs of users in finance, operations, and other departments. Recent acquisitions have also added new capabilities to our offering, each with its own set of reporting tools.

These reports serve specialist users well. However, we wanted to add the ability to aggregate and visualize information in one easy-to-consume service. Time-starved executives, boards, and decision-makers needed a better way to gain key insights into spending trends and implement better cost and cash management strategies. They strive to better achieve the financial goals of their municipalities without having to spend time running reports in different areas of the solution.

Production-ready in record time

With this vision in place, our primary directive was speed to market, with the aim of releasing a production-ready solution in just 4 months. We carefully evaluated our priorities and functional requirements, and ultimately, outsourcing infrastructure management was key. QuickSight, a fully managed, cloud-native BI service from AWS, was the only option that allowed us to deliver so quickly.

Just as importantly, our professional services team saves an extensive amount of time implementing the solution and training customers. That means immediate value for the customer and more time for our professional services team to spend on other activities, increasing profitability. We sell the embedded dashboard service as a subscription-based add-on, so customers can easily purchase and use it.

Flexible and future-proof

Although many of our customers use traditional client/server configurations in their own data centers, our cloud-hosted solution is becoming increasingly popular, especially with increasing numbers of remote workers. We’re also developing a software as a service (SaaS) version of our suite and continue to acquire other vendors to add functionality. All these factors mean our QuickSight dashboard service needs to be platform-agnostic. It must work with any source application, whether in AWS or on premises.

We accomplished this using Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Storage Service (Amazon S3). The source application emits events about finance accounts, vendors, and yearly budgets using Amazon SQS, with Amazon S3 available to ingest large messages that exceed the limits of Amazon SQS. We rely on AWS Lambda serverless functions to handle the ingestion and routing of the messages. Each customer has an individual reporting store, separate from the database of the source system.

This system transforms data from the customer’s system into a format that is normalized for QuickSight reporting. By pointing QuickSight at these schemas, we enable it to report on that data. The customer dashboard is embedded in the ERP application, so the customer doesn’t need to go to the QuickSight website to access it.

Any source application that can adhere to the messaging format can be reported on. The source system is responsible for the number-crunching, so any customizations the customer has applied are reflected in the reports.

The following diagram illustrates this architecture.

The high-level architecture is as follows:

  1. The application sends a JSON message to the SQS queue. If the message is too large, it is added to an S3 bucket, and the message contains a reference to the large message. Note that the source can be any application, as long as it produces messages adhering to the predefined JSON schema.
  2. The Lambda consumer ingests a batch of messages, validates the payloads, and transforms the payloads from JSON into the tenant's MySQL Aurora reporting database, which uses a star schema. The consumer can ingest small messages directly from the SQS event or retrieve large messages from Amazon S3 (see the sketch after this list).
  3. The QuickSight namespace for the tenant contains a dashboard created from a master template that points to the appropriate reporting database.
  4. The source application requests the dashboard on the user's behalf. The dashboard is embedded within the source application UI.
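
A minimal sketch of the consumer in step 2 follows; the message fields and the loader helper are illustrative assumptions, not Edmunds GovTech's actual code:

import json

import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # Triggered by SQS: each record carries either the payload itself or
    # a pointer to a large payload staged in S3 (field names illustrative).
    for record in event['Records']:
        message = json.loads(record['body'])
        if 's3_reference' in message:
            obj = s3.get_object(Bucket=message['s3_reference']['bucket'],
                                Key=message['s3_reference']['key'])
            message = json.loads(obj['Body'].read())
        load_into_reporting_db(message)

def load_into_reporting_db(message):
    # Validate the payload and upsert it into the tenant's star-schema
    # reporting database (details omitted).
    pass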

Because the system relies on Lambda functions, it’s a modern, decoupled architecture that is inherently future-proof and scalable. We don’t have to manage cloud or on-premises servers, and we only pay for what clients actually use.

Additionally, we were able to build a user interface that makes it easy to deploy new customers with just a few clicks. We use the installer to create the infrastructure for new clients using the AWS Command Line Interface (AWS CLI). The customer simply pushes a button from the source system to push data to the dashboard. They can be using the dashboard in less than an hour from start to finish.

Continuously increasing customer value

QuickSight has rolled out a lot of new features just in the short time we’ve been using it, and we’re already taking advantage of them. For example, QuickSight now remembers filter selections by user, so that the choices a user makes are remembered from session to session. That saves our customers time and effort and helps them get the information they need faster.

Embedded authoring is another significant feature that we’re looking forward to rolling out soon. As of this writing, we manage and maintain reporting templates for customers and push them out to clients using the AWS CLI. With the new embedded authoring capability of QuickSight, customers will be able to explore data in a self-service manner, perform ad hoc analysis, and create new dashboards. This will greatly increase the utility of the service while maintaining ease of use for customers and simplicity of management for our team. We’re also adopting the new namespace functionality to help customers maintain data separation from others in our multi-tenant solution.

Together today and tomorrow

Working with AWS has been a great experience. Our account representative has always been available for questions and feedback, which helped us succeed especially on such an accelerated timeframe. In addition to bringing QuickSight to our customers, we value the relationship we’ve developed with AWS and look forward to building on it as we move forward with our cloud solutions. Partnering with AWS has led to many benefits across our entire organization.

Marketing and sales teams in our organization are leading client demos with the QuickSight dashboard because it looks great and works seamlessly, and it’s something a lot of customers have been asking for. For department heads, executives, and other leaders, the ability to quickly visualize current and historical budget information is huge. They can also show their boards the information they need in a very easy-to-consume way. By giving customers one place to go for a high-level strategic view across their municipality, we’re helping them make better decisions and ultimately serve their constituents more effectively.


About the Author

Thomas Mancini is the VP, Concept Development at Edmunds GovTech.

Centrally tracking dashboard lineage, permissions, and more with Amazon QuickSight administrative dashboards

Post Syndicated from Jesse Gebhardt original https://aws.amazon.com/blogs/big-data/centrally-tracking-dashboard-lineage-permissions-and-more-with-amazon-quicksight-administrative-dashboards/

This post is co-written with Shawn Koupal, an Enterprise Analytics IT Architect at Best Western International, Inc.

A common ask from Amazon QuickSight administrators is to understand the lineage of a given dashboard (what analysis is it built from, what datasets are used in the analysis, and what data sources do those datasets use). QuickSight APIs allow us to capture the metadata from each object and build a complete picture of the linkages between each object. As a QuickSight administrator, you can build a dashboard that displays the lineage from dashboard to data source, along with the permissions for each asset type. It can be helpful to see all permissions assigned to each of your assets as well as the relationships between them, all in one place.

Solution overview

In this solution, you build an end-to-end data pipeline using QuickSight to ingest data from an AWS Glue table.

The following diagram illustrates the architecture of the solution.

You can invoke the QuickSight APIs via the AWS Software Development Kit (AWS SDK) or the AWS Command Line Interface (AWS CLI). For this post, we use the AWS SDK.

The solution starts with an AWS Lambda function that calls the QuickSight list APIs (list_data_sources, list_data_sets, list_analyses, list_templates, and list_dashboards) depending on the event message to build lists of assets in chunks of 100, which are iterated through by a second Lambda function. The reason for splitting the work into two functions is to work around the 15-minute time limit in Lambda. You can schedule the Lambda function to run on each asset type based on an event rule trigger. See the following code:

import boto3
import os
import time
import datetime
import json

AWS_ACCOUNT_ID=os.environ['AWS_ACCOUNT_ID']
AWS_REGION=os.environ['AWS_REGION']
QS_S3_BUCKET=os.environ['QS_S3_BUCKET']
DownloaderFunctionName=os.environ['DownloaderFunctionName']

client = boto3.client('quicksight', region_name=AWS_REGION)
lambda_client = boto3.client('lambda')

def invoke_downloader(iteration, apicall, list_results):
    #Map the list API name to the asset-type name the second function expects
    apicall=apicall.replace("list_data_sources","datasource").replace("list_data_sets","dataset").replace("list_analyses","analysis").replace("list_dashboards","dashboard").replace("list_templates","template")
    msg = {"Iteration": iteration, "api": apicall, "Objects": list_results}
    invoke_response = lambda_client.invoke(FunctionName=DownloaderFunctionName,
                                           InvocationType='Event',
                                           Payload=json.dumps(msg, default=datetime_handler))

def datetime_handler(x):
    #Serialize datetime objects when the event payload is dumped to JSON
    if isinstance(x, datetime.datetime):
        return x.isoformat()
    raise TypeError("Unknown type")

def file_cleanup(apicall):
    #Replace the apicall with the S3 folder name
    object_type=apicall.replace("list_data_sources","datasource").replace("list_data_sets","dataset").replace("list_analyses","analysis").replace("list_dashboards","dashboard").replace("list_templates","template")

    s3_path='quicksight_lineage/'+object_type+'/'
    s3_path2='quicksight_lineage/'+object_type+'_permissions/'
    fileprefix="QuickSight_"+object_type
    #Use the function's Region rather than a hardcoded one
    botoSession = boto3.Session(region_name=AWS_REGION)
    s3_session = botoSession.resource('s3')
    bucket = s3_session.Bucket(QS_S3_BUCKET)
    #Delete any files with prefix in s3_path and s3_path2
    bucket.objects.filter(Prefix=s3_path+fileprefix).delete()
    bucket.objects.filter(Prefix=s3_path2+fileprefix).delete()

def lambda_handler(event, context):

    if event == {}:
        #Call all asset APIs
        apicall_list=['list_data_sources','list_data_sets','list_analyses','list_dashboards','list_templates']
    elif event["api"] == 'datasource':
        apicall_list=['list_data_sources']
    elif event["api"] == 'dataset':
        apicall_list=['list_data_sets']
    elif event["api"] == 'analysis':
        apicall_list=['list_analyses']
    elif event["api"] == 'dashboard':
        apicall_list=['list_dashboards']
    elif event["api"] == 'template':
        apicall_list=['list_templates']
    else:
        print("[WARN] Exception: Invalid Event Type.")
        return
    for apicall in apicall_list:
        try:
            #Clean up files from previous run
            file_cleanup(apicall)
            #Reset variables for each apicall
            iteration=0
            user_token = None
            api_method = getattr(client, apicall)

            while True:
                iteration+=1
                print("Calling ", apicall, iteration)

                if user_token is None:
                    results = api_method(AwsAccountId=AWS_ACCOUNT_ID, MaxResults=100)
                else:
                    results = api_method(AwsAccountId=AWS_ACCOUNT_ID, MaxResults=100, NextToken=user_token)

                invoke_downloader(iteration, apicall, results)
                #The KeyError raised on the last page (no NextToken) ends the loop
                user_token = results["NextToken"]
                print(user_token)
        except KeyError:
            print("NextToken not found.")

The second Lambda function consumes the list of assets from the event parameter from the first function and uses the QuickSight describe APIs (describe_datasource, describe_dataset, describe_analysis, describe_template, and describe_dashboard). The details of each QuickSight asset are written to CSV files in an Amazon Simple Storage Service (Amazon S3) bucket in groups of 100. Because the first function calls the second function in parallel, it’s recommended to set the reserved concurrency to 2 in the second Lambda function to avoid throttling errors (if you use the AWS CloudFormation template provided later in this post, this is automatically configured for you). See the following code:

import boto3
import os
import time
import datetime
import json

AWS_ACCOUNT_ID=os.environ['AWS_ACCOUNT_ID']
AWS_REGION=os.environ['AWS_REGION']
QS_S3_BUCKET=os.environ['QS_S3_BUCKET']

client = boto3.client('quicksight', region_name=AWS_REGION)
lambda_client = boto3.client('lambda')
s3 = boto3.client('s3')

def process_dashboards(list_dashboard,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DashboardId,Name,SourceEntityArn,VersionCreatedTime,VersionNumber,CreatedTime,DataSetArns,LastPublishedTime,LastUpdatedTime" + '\n')

    for dashboard in list_dashboard["DashboardSummaryList"]:
        dashboard_desc= client.describe_dashboard(AwsAccountId=AWS_ACCOUNT_ID,DashboardId=dashboard["DashboardId"])
        
        source_entity_arn = dashboard_desc["Dashboard"]["Version"]["SourceEntityArn"]
        version_created_time = dashboard_desc["Dashboard"]["Version"]["CreatedTime"].isoformat()
        version_number = str(dashboard_desc["Dashboard"]["Version"]["VersionNumber"])
        created_time = dashboard_desc["Dashboard"]["CreatedTime"].isoformat()

        last_published_time = dashboard_desc["Dashboard"]["LastPublishedTime"].isoformat()
        last_updated_time = dashboard_desc["Dashboard"]["LastUpdatedTime"].isoformat()
        try:
            for arn in dashboard_desc["Dashboard"]["Version"]["DataSetArns"]:
                f.write(dashboard["DashboardId"]+',"'+ dashboard["Name"] + '",' + source_entity_arn + ',' + version_created_time + ',' + version_number + ','   + created_time + ','+ arn + ',' + last_published_time + ',' + last_updated_time +'\n')
        except Exception as e:
            print(e)
            dataset_arn=''
            f.write(dashboard["DashboardId"]+',"'+ dashboard["Name"] + '",' + source_entity_arn + ',' + version_created_time + ',' + version_number + ','   + created_time + ','+ dataset_arn + ',' + last_published_time + ',' + last_updated_time +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            
        

def process_dashboards_permissions(list_dashboard,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DashboardId,Name,Principal,Permission,Iteration" + '\n')
	
    for dashboard in list_dashboard["DashboardSummaryList"]:

        try:
            list_permissions = client.describe_dashboard_permissions(AwsAccountId=AWS_ACCOUNT_ID,DashboardId=dashboard["DashboardId"])
        except:
            print("Error Listing Permissions for:"+dashboard["DashboardId"])
            continue

        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteDashboard" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"

            f.write(dashboard["DashboardId"]+',"'+ dashboard["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)      


def process_analysis(list_analyses,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("AnalysisId,Name,AnalysisArn,CreatedTime,LastUpdatedTime,DataSetArn,Iteration" + '\n')

    for analysis in list_analyses["AnalysisSummaryList"]:
        #Call describe_analysis
        analysis_desc= client.describe_analysis(AwsAccountId=AWS_ACCOUNT_ID,AnalysisId=analysis["AnalysisId"])

        analysis_arn = analysis_desc["Analysis"]["Arn"]
        created_time = analysis_desc["Analysis"]["CreatedTime"].isoformat()
        last_updated_time = analysis_desc["Analysis"]["LastUpdatedTime"].isoformat()

        try:
            for arn in analysis_desc["Analysis"]["DataSetArns"]:
                f.write(analysis["AnalysisId"]+',"'+ analysis["Name"] + '",' + analysis_arn + ',' + created_time + ','  + last_updated_time + ',' + arn + ',' + iteration  +'\n')
        except Exception as e:
            print(e)
            dataset_arn=''
            f.write(analysis["AnalysisId"]+',"'+ analysis["Name"] + '",' + analysis_arn + ',' + created_time + ','  + last_updated_time + ',' + dataset_arn  + ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            

        
def process_analysis_permissions(list_analyses,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("AnalysisId,Name,Principal,Permission,Iteration" + '\n')
	
    for analysis in list_analyses["AnalysisSummaryList"]:

        try:
            list_permissions = client.describe_analysis_permissions(AwsAccountId=AWS_ACCOUNT_ID,AnalysisId=analysis["AnalysisId"])
        except:
            print("Error Listing Permissions for:"+analysis["AnalysisId"])
            continue
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteAnalysis" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"

            f.write(analysis["AnalysisId"]+',"'+ analysis["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')
    
    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)      


def process_templates(list_templates,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("TemplateId,Name,TemplateArn,CreatedTime,LastUpdatedTime,SourceEntityArn,VersionNumber,Iteration" + '\n')

    for template in list_templates["TemplateSummaryList"]:
        #Call describe_template
        template_desc= client.describe_template(AwsAccountId=AWS_ACCOUNT_ID,TemplateId=template["TemplateId"])

        template_arn = template_desc["Template"]["Arn"]
        created_time = template_desc["Template"]["CreatedTime"].isoformat()
        last_updated_time = template_desc["Template"]["LastUpdatedTime"].isoformat()
        source_entity_arn = template_desc["Template"]["Version"]["SourceEntityArn"]
        version_number = str(template_desc["Template"]["Version"]["VersionNumber"])
        f.write(template["TemplateId"]+',"'+ template["Name"] + '",' + template_arn + ',' + created_time + ','  + last_updated_time + ',' + source_entity_arn + ',' + version_number +  ',' + iteration  +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            

        
def process_templates_permissions(list_templates,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("TemplateId,Name,Principal,Permission,Iteration" + '\n')
	
    for template in list_templates["TemplateSummaryList"]:

        try:
            list_permissions = client.describe_template_permissions(AwsAccountId=AWS_ACCOUNT_ID,TemplateId=template["TemplateId"])
        except:
            print("Error Listing Permissions for:"+template["TemplateId"])
            continue
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteTemplate" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"

            f.write(template["TemplateId"]+',"'+ template["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')
    
    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)      


def process_datasources(list_data_sources,iteration,object_type):
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DataSourceId,DataSourceArn,Name,Type,LastUpdatedTime,CreatedTime,Status,ErrorInfo,Iteration" + '\n')

    for datasource in list_data_sources["DataSources"]:
        datasource_id=datasource["DataSourceId"]
        name=datasource["Name"]
        datasource_type=datasource["Type"]
        try:
            status=datasource["Status"]
        except:
            status=''
        CreatedTime=str(datasource["CreatedTime"])
        LastUpdatedTime=str(datasource["LastUpdatedTime"])
        try:
            ErrorInfo="Type: "+datasource["ErrorInfo"]["Type"]+" Message: "+datasource["ErrorInfo"]["Message"]
        except:
            ErrorInfo="Null"

        f.write( datasource_id + ',' + datasource["Arn"] + ',"' + name + '",'  + datasource_type + ',' + LastUpdatedTime+ ',' + CreatedTime + ',' + status + ',' + ErrorInfo+ ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            


def process_datasources_permissions(list_data_sources,iteration,object_type):
    
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write("DataSourceID,Name,Principal,Permission,Iteration" + '\n')

    for datasource in list_data_sources["DataSources"]:
        try:
            list_permissions = client.describe_data_source_permissions(AwsAccountId=AWS_ACCOUNT_ID,DataSourceId=datasource["DataSourceId"])
        except:
            print("Error Listing Permissions for:"+datasource["DataSourceId"])
            continue
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteDataSource" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"
                
            f.write(datasource["DataSourceId"]+',"'+ datasource["Name"] + '",' + permission["Principal"] +  ',' + action +  ',' + iteration +'\n')
    
    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            
    

def process_datasets(list_datasets,iteration,object_type):

    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    #CSV Header
    f.write('DatasetId,DataSetArn,Name,SpiceSize,ImportMode,LastUpdatedTime,CreatedTime,DataSourceArn,DataSourceName,DataSourceType,Source,Columns,Iteration' + '\n')
    
    for dataset in list_datasets["DataSetSummaries"]:
        
        try:
            response= client.describe_data_set(AwsAccountId=AWS_ACCOUNT_ID,DataSetId=dataset["DataSetId"])
        except Exception as e:
            print("Dataset ID: ", dataset["DataSetId"], e)
            f.write( dataset["DataSetId"] + ',' + dataset["Arn"] + ',"' + dataset["Name"] + '",' + '0' + ',' + dataset["ImportMode"] + ',' + str(dataset["LastUpdatedTime"])+ ','+ str(dataset["CreatedTime"])+ ',' + 'n/a' + ',"' + 'n/a' + '",' +  'n/a'  + ',' +  'n/a' + ',"'  + 'n/a'+ '",' + iteration +'\n')
            continue

        dataset_id=response["DataSet"]["DataSetId"]
        dataset_name=response["DataSet"]["Name"]
        dataset_size=response["DataSet"]["ConsumedSpiceCapacityInBytes"]
        ImportMode=response["DataSet"]["ImportMode"]
        LastUpdatedTime=response["DataSet"]["LastUpdatedTime"].isoformat()
        CreatedTime=response["DataSet"]["CreatedTime"].isoformat()

        #Initialize so the fallback write in the except block below is safe
        Columns=""
        try:
            for key in response["DataSet"]["PhysicalTableMap"].keys():
                
                if key == 's3PhysicalTable':
                    
                    source='S3Source'
                    DataSourceArn=response["DataSet"]["PhysicalTableMap"]["s3PhysicalTable"]["S3Source"]["DataSourceArn"]
                    Columns=response["DataSet"]["PhysicalTableMap"]["s3PhysicalTable"]["S3Source"]["InputColumns"]
                    #SqlQuery="Null"

                else:

                    try:
                        DataSourceArn=response["DataSet"]["PhysicalTableMap"][key]["RelationalTable"]["DataSourceArn"]
                        Columns=""
                        source="VisualEditor"
                    except:
                        DataSourceArn=response["DataSet"]["PhysicalTableMap"][key]["CustomSql"]["DataSourceArn"]
                        Columns=response["DataSet"]["PhysicalTableMap"][key]["CustomSql"]["Columns"]
                        source="CustomSql"

                DataSourceName=""
                DataSourceType=""
                
                f.write( dataset_id + ',' + dataset["Arn"] + ',"' + dataset_name + '",' + str(dataset_size) + ',' + ImportMode + ',' + LastUpdatedTime+ ','+ CreatedTime+ ',' + DataSourceArn + ',"' + DataSourceName + '",' +  DataSourceType  + ',' +  source + ',"'  + str(Columns) + '",' + iteration +'\n')
                
        except:
            print("[DEBUG]: Exception in main write for: " + str(dataset))
            f.write( dataset_id  + ',' + dataset["Arn"] +',"' + dataset_name + '",' + str(dataset_size) + ',' + ImportMode + ',' + LastUpdatedTime+ ',' + CreatedTime + ',,,,Unknown,"'  + str(Columns) + '",' + iteration +'\n')

    f.close()
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)


def process_datasets_permissions(list_datasets,iteration,object_type):
    
    filename="QuickSight_"+object_type+"_"+iteration+".csv"
    filePath = os.path.join("/tmp",filename)
    f=open(filePath,"w")
    f.write('DataSetID,Name,Principal,Permission,Iteration'+'\n')

    for dataset in list_datasets["DataSetSummaries"]:
        try:
            list_permissions = client.describe_data_set_permissions(AwsAccountId=AWS_ACCOUNT_ID,DataSetId=dataset["DataSetId"])
        except:
            print("Error Listing Permissions for:"+dataset["DataSetId"])
            continue
        
        for permission in list_permissions["Permissions"]:
            #If Action includes delete operation then principal has co-owner permissions
            if "quicksight:DeleteDataSet" in permission["Actions"]:
                action = "Co-Owner"
            else:
                action = "View"
                
            f.write(dataset["DataSetId"]+',"'+ dataset["Name"] + '",' + permission["Principal"] +  ',' + action+  ',' + iteration +'\n')

    f.close()        
    s3_path='quicksight_lineage/'+object_type+'/'
    s3.upload_file("{}/{}".format("/tmp",  filename), QS_S3_BUCKET, s3_path + filename)            


def lambda_handler(event, context):

    list_objects=event["Objects"]
    iteration=str(event["Iteration"])
    
    print("Iteration: ", iteration)
    print("[INFO]Processing QuickSight:", event["api"] )
    
    if  event["api"] == 'datasource':
        process_datasources(list_objects, iteration, event["api"])
        process_datasources_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'dataset':
        process_datasets(list_objects, iteration, event["api"])
        process_datasets_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'analysis':
        process_analysis(list_objects, iteration, event["api"])
        process_analysis_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'dashboard':
        process_dashboards(list_objects, iteration, event["api"])
        process_dashboards_permissions(list_objects, iteration, event["api"]+'_permissions')
    elif event["api"] == 'template':
        process_templates(list_objects, iteration, event["api"])
        process_templates_permissions(list_objects, iteration, event["api"]+'_permissions')
    else:
        print("[WARN] Exception: Invalid Event Type.")
        return
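After both functions have run, you can spot-check the generated output with a quick listing; a minimal sketch (the bucket name is a placeholder, use the bucket created by the CloudFormation stack):

import boto3

s3 = boto3.client('s3')
# List the dashboard CSVs written under the quicksight_lineage prefix
response = s3.list_objects_v2(
    Bucket='my-quicksight-lineage-bucket',
    Prefix='quicksight_lineage/dashboard/'
)
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])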

Afterwards, the S3 bucket has the directory structure under the quicksight_lineage folder as shown in the following screenshot.

You then use AWS Glue to store the metadata of each file in an AWS Glue table, which allows you to query the information from QuickSight using an Amazon Athena or Amazon Redshift Spectrum data source (if you run the CloudFormation stack, the tables are set up for you).

The following diagram shows the tables and relationships.

Walkthrough overview

The workflow comprises the following high-level steps:

  1. Deploy the CloudFormation template to build the Lambda functions, AWS Identity and Access Management (IAM) roles, S3 bucket, AWS Glue database, and AWS Glue tables.
  2. Run the Python Lambda functions to build CSV files that contain the QuickSight object details.
  3. Visualize the data in QuickSight. To do so, you must create your data source, dataset, and then analysis.

For this post, we use Athena as the query engine. To use Redshift Spectrum, you must modify the provided queries.
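The main change is the date-parsing syntax, because the Presto-style date_parse calls used in this post have no direct Redshift equivalent. A rough sketch of the kind of rewrite involved (the replace-and-parse approach and the format tokens are assumptions to verify on your cluster):

-- Athena (Presto) syntax used in the queries in this post:
date_parse(substr(r.createdtime, 1, 26), '%Y-%m-%dT%H:%i:%s.%f')

-- One possible Redshift Spectrum equivalent:
TO_TIMESTAMP(replace(substring(r.createdtime, 1, 26), 'T', ' '), 'YYYY-MM-DD HH24:MI:SS.US')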

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An IAM user with access to AWS resources used in this solution (CloudFormation, IAM, Amazon S3, AWS Glue, Athena, QuickSight)
  • Athena configured with a query result location
  • QuickSight Enterprise Edition

Creating resources

Create your resources by launching the following CloudFormation stack:

During the stack creation process, you must provide an S3 bucket name in the S3BucketName parameter (AWSAccountNumber is appended to the bucket name provided to make it unique).

After the stack creation is successful, you have two Lambda functions, two S3 buckets, an AWS Glue database and tables, and the corresponding IAM roles and policies.

Running the Lambda function

To run your Lambda function, complete the following steps:

  1. On the Lambda console, navigate to the QuickSight-Lineage-Dispatcher function.
  2. From the Select a test event menu, choose Configure test events.

  3. Select Create new test event.

You create one test event for all QuickSight assets.

  4. For Event name, enter all.
  5. Enter an empty JSON object ({}).

  6. Choose Test to run the Lambda function and generate CSV files of the assets.

Alternatively, you can create test events for each QuickSight object (Data Source, DataSet, Analysis, Dashboard, and Template) for larger QuickSight environments:

  • Test event DataSource code:
    {
      "api": "datasource"
    }

  • Test event DataSet code:
    {
      "api": "dataset"
    }

  • Test event Analysis code:
    {
      "api": "analysis"
    }

  • Test event Dashboard code:
    {
      "api": "dashboard"
    }

  • Test event Template code:
    {
      "api": "template"
    }

The following screenshot shows the configuration of a test event for Analysis.
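You can also trigger the same runs outside the console; a minimal boto3 sketch using the dashboard payload shown above (the function name is the dispatcher created by the CloudFormation stack):

import boto3
import json

lambda_client = boto3.client('lambda')

# Asynchronous invoke, matching how the dispatcher is normally driven
lambda_client.invoke(
    FunctionName='QuickSight-Lineage-Dispatcher',
    InvocationType='Event',
    Payload=json.dumps({"api": "dashboard"})
)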

Creating your data source and lineage data set

In this step, you use QuickSight to access the tables in your AWS Glue database.

  1. Log in to QuickSight.
  2. Choose Manage QuickSight.
  3. Choose Security & permissions.
  4. Ensure that access to the S3 bucket (that was created through CloudFormation) is enabled.
  5. Choose New analysis.
  6. Choose New dataset.
  7. For the data source, choose Athena.

  8. For your data source name, enter QuickSight-Lineage.
  9. Choose Create data source.

QuickSight prompts you to select your schema or database.

  10. Choose Use custom SQL.

  11. Update the query name from New custom SQL to QuickSight Lineage.
  12. Enter the following code into the query box:
    select 
       a.analysisid      as analysis_id,
       a.name            as analysis_name,
       a.analysisarn     as analysis_arn,
       date_parse(substr(a.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')     as analysis_createdtime,
       date_parse(substr(a.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as analysis_lastupdatedtime,
       a.datasetarn      as analysis_datasetarn,
       r.dashboardid        as dashboard_id,
       r.name               as dashboard_name,
       r.name||' - ID: '||r.dashboardid as dashboard_name_w_id,
       date_parse(substr(r.versioncreatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as dashboard_versioncreatedtime,
       r.versionnumber      as dashboard_versionnumber     ,
       date_parse(substr(r.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')  as dashboard_createdtime,
       date_parse(substr(r.lastpublishedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as dashboard_lastpublishedtime ,
       date_parse(substr(r.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as dashboard_lastupdatedtime,
       d.datasetid        as dataset_id,
       d.datasetarn       as dataset_arn,
       d.name             as dataset_name,
       d.spicesize        as dataset_spicesize,
       d.importmode       as dataset_importmode,
       date_parse(substr(d.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')  as dataset_lastupdatedtime,
       date_parse(substr(d.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')      as dataset_createdtime,
       d.source           as dataset_source,
       d.columns          as dataset_columns,
       s.datasourceid     as datasource_id,
       s.datasourcearn    as datasource_arn,
       s.name             as datasource_name,
       s.type             as datasource_type,
       date_parse(substr(s.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as datasource_lastupdatedtime,
       date_parse(substr(s.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f') as datasource_createdtime,
       s.status           as datasource_status,
       s.errorinfo        as datasource_errorinfo,
       t.templateid       as template_id,
       t.name             as template_name,
       t.templatearn      as template_arn,
       date_parse(substr(t.createdtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')      as template_createtime,
       date_parse(substr(t.lastupdatedtime,1,26),'%Y-%m-%dT%H:%i:%s.%f')  as template_lastupdatedtime
    from "quicksight-lineage"."dashboard" r
    left join  "quicksight-lineage"."analysis" a
    on a.analysisarn = r.sourceentityarn and a.datasetarn=r.datasetarns
    left join "quicksight-lineage"."template" t
    on t.templatearn = r.sourceentityarn
    left join  "quicksight-lineage"."dataset" d
    on d.datasetarn = r.datasetarns
    left join  "quicksight-lineage"."datasource" s
    on s.datasourcearn = d.datasourcearn

  13. Choose Confirm query.

  14. Select Import to SPICE for quicker analytics.
  15. Choose Visualize.

In the new analysis, one empty visual is loaded by default.

  16. Change the visual type to pivot table.
  17. Choose (single-click) dashboard_name, analysis_name, template_name, dataset_name, and datasource_name in the Fields list.

You can search for name in the Fields list to make this step easier.

  18. Confirm that all fields were also added to the Rows field well.

If you have assets with duplicate names, it can be helpful to add the corresponding ID columns to the visual; for example, dashboard_id, analysis_id, template_id, dataset_id, and datasource_id.

Visualizing your assets and lineage

You now create five new visuals, one for each asset type (Dashboard, Analysis, Template, Dataset, Data Source), to display the additional columns pulled from the APIs.

  1. From the Add menu, choose Add visual.

  2. For the first new visual, choose the table visual type.
  3. Search for dashboard_ in the Fields list.
  4. Choose (single-click) all matching columns.

  5. For the second visual, choose the table visual type.
  6. Search for analysis_ in the Fields list.
  7. Choose (single-click) all matching columns.
  8. Move the second visual underneath the first visual.
  9. Repeat the same steps for template_, dataset_, and datasource_.

Creating your permissions data set

You now create your new data set.

  1. Leave the analysis by choosing the QuickSight logo on the top left.
  2. In the navigation pane, choose Datasets.
  3. Choose New dataset.
  4. Locate and choose the QuickSight-Lineage data source created earlier in the FROM EXISTING DATA SOURCES section.
  5. In the QuickSight Lineage data source window, choose Create data set.
  6. Choose Use custom SQL.

  7. Update the name from New custom SQL to QuickSight Lineage Permissions.
  8. Enter the following code into the query box:
    select distinct 'datasource' as QuickSightObjectType, sp.datasourceid as "QuickSightID",sp.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id
     from "quicksight-lineage"."datasource_permissions"  sp
     inner join  "quicksight-lineage"."datasource" s
      on s.datasourceid = sp.datasourceid
     left join  "quicksight-lineage"."dataset" d
       on s.datasourcearn = d.datasourcearn
     left join  "quicksight-lineage"."dashboard" r
       on d.datasetarn = r.datasetarns
    union
    select distinct 'dataset' as QuickSightObjectType, dp.datasetid as "QuickSightID",dp.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id 
     from "quicksight-lineage"."dataset_permissions" dp
     inner join  "quicksight-lineage"."dataset" d
       on d.datasetid = dp.datasetid
     left join  "quicksight-lineage"."dashboard" r
       on d.datasetarn = r.datasetarns
    union
    select distinct 'analysis' as QuickSightObjectType, ap.analysisid as "QuickSightID",ap.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id  
     from "quicksight-lineage"."analysis_permissions" ap
      inner join  "quicksight-lineage"."analysis" a
       on a.analysisid = ap.analysisid
      left join  "quicksight-lineage"."dashboard" r
       on a.analysisarn = r.sourceentityarn  
    union
    select distinct 'template' as QuickSightObjectType, tp.templateid as "QuickSightID",tp.name,
    split_part(principal,':',6) as principal,permission, r.name||' - ID: '||r.dashboardid as dashboard_name_w_id  
     from "quicksight-lineage"."template_permissions" tp
      inner join  "quicksight-lineage"."template" t
       on t.templateid = tp.templateid
      left join  "quicksight-lineage"."dashboard" r
       on t.templatearn = r.sourceentityarn     
    union
    select distinct 'dashboard' as QuickSightObjectType, dashboardid as "QuickSightID",name,
    split_part(principal,':',6) as principal,permission, name||' - ID: '||dashboardid as dashboard_name_w_id
     from "quicksight-lineage"."dashboard_permissions"

  9. Choose Edit / Preview data.
  10. Choose Apply.
  11. For Query mode, select SPICE.

  12. Choose Save.
  13. Navigate to the Analyses page and open the analysis created earlier.
  14. Choose the pencil icon to add the new dataset.

  15. Choose Add data set.

  16. Select QuickSight Lineage Permissions.
  17. Choose Select.

  18. Make sure the new QuickSight Lineage Permissions dataset is active in the Data set drop-down menu.

Visualizing your permissions

You now add a new visual to display permissions. 

  1. Choose the table visual type.
  2. Choose (single-click) name, principal, and permission in the Field List.
  3. In the navigation pane, choose Filter.
  4. Choose +.
  5. Choose quicksightobjecttype.

  6. Choose the new filter.
  7. Deselect Select all.
  8. Select dashboard.
  9. Choose Apply.

  10. Choose Close.
  11. Move the new permissions visual so it’s to the right of the dashboard visual.

  12. On the new permissions visual, choose the menu options (…).
  13. Choose Duplicate visual.
  14. Repeat this step four times.
  15. Modify the quicksightobjecttype filter on each new permission visual so you have one visual for each asset type.
  16. Move the visual to the right of the corresponding asset type visual.

  

Creating parameters for filtering

At this point all the visuals are created; next you need to create a parameter. You can simplify the following steps by using the new simplified filter control creation process. For more information, see Amazon QuickSight adds support for on-sheet filter controls. The following steps still work fine, but to add filter controls to an analysis, you don’t need to create parameters anymore. 

  1. Navigate to the Parameters menu.
  2. Choose Create one.
  3. For Name, enter DashboardNameWithID.
  4. Choose Create.

  5. Choose Create a new control for a filter or a calculated field.
  6. For Display name, enter Dashboard Name with ID.
  7. For Style, choose Single select drop down.
  8. For Values, select Link to a data set field.
  9. For Select a data set, choose QuickSight Lineage Permissions.
  10. For Select a column, choose dashboard_name_w_id.
  11. Choose Add.

  12. Choose the first visual (Count of Records by Dashboard_name, Template_name, Dataset_name, Datasource_name, and Analysis_name).
  13. Add a filter on the dashboard_name_w_id field.
  14. Choose the newly added filter.
  15. Set the filter scope to All visuals.
  16. For Filter type, choose Custom filter.
  17. Select Use parameters.
  18. From the drop-down menu, choose DashboardNameWithID.
  19. Choose Apply.
  20. Choose Close.

  21. Choose the first permissions visual (Permission, Principal, and Name).
  22. Add a filter on the dashboard_name_w_id field.
  23. Set the filter scope to All visuals.
  24. For Filter type, choose Custom filter.
  25. Select Use parameters.
  26. From the drop-down menu, choose DashboardNameWithID.
  27. Choose Apply.
  28. Choose Close.

The analysis build is complete and can be published as a dashboard.

Creating additional visuals

You can also create additional visuals for different use cases.

Visualizing SPICE usage across all your SPICE datasets

To visualize SPICE usage across all your SPICE datasets, complete the following steps:

  1. Use the QuickSight Lineage dataset and choose the donut chart visual.
  2. For Group/Color, add dataset_name.
  3. For Value, add dataset_spicesize.
  4. Change the aggregation of dataset_spicesize to Average, because a dataset can appear in the lineage data multiple times if it’s reused across multiple dashboards.

This visual can be useful to track down what is consuming SPICE storage.
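You can get the same picture directly in Athena; a rough sketch against the AWS Glue tables created by the stack (the average guards against datasets that appear on multiple rows):

select d.name as dataset_name,
       avg(d.spicesize) / 1024.0 / 1024.0 as avg_spice_mb
from "quicksight-lineage"."dataset" d
where d.importmode = 'SPICE'
group by d.name
order by avg_spice_mb desc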

Visualizing SPICE refreshes by hour

To visualize SPICE refreshes by hour, complete the following steps:

  1. Use the QuickSight Lineage dataset to create a vertical stacked bar chart.
  2. For X axis, add dataset_lastupdatedtime aggregated by HOUR.
  3. For Value, add dataset_id aggregated by Count distinct.
  4. For Group/Color, add dataset_name.
  5. Create a filter on dataset_importmode equal to SPICE.

This visual can be useful to see when all the SPICE dataset refreshes last occurred. The source data is a snapshot in time, so you need to update the source data by running the Lambda function on a regular basis.
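One way to automate that is a scheduled EventBridge rule that invokes the dispatcher function; a minimal boto3 sketch (the rule name, schedule, account ID, and ARNs are placeholders):

import boto3
import json

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Run the dispatcher once a day
rule = events.put_rule(
    Name='quicksight-lineage-daily',
    ScheduleExpression='rate(1 day)',
    State='ENABLED'
)
events.put_targets(
    Rule='quicksight-lineage-daily',
    Targets=[{
        'Id': 'lineage-dispatcher',
        'Arn': 'arn:aws:lambda:us-east-1:111122223333:function:QuickSight-Lineage-Dispatcher',
        'Input': json.dumps({})  # an empty event refreshes all asset types
    }]
)
# Allow EventBridge to invoke the dispatcher function
lambda_client.add_permission(
    FunctionName='QuickSight-Lineage-Dispatcher',
    StatementId='allow-eventbridge-lineage',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)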

Cleaning up

To avoid incurring future charges, delete the resources you created in this walkthrough by deleting the CloudFormation stack. Also, be sure to delete the analysis and dataset (to free up SPICE usage).

Conclusion

In this post, we showed how to use the QuickSight APIs to build a dashboard that centrally tracks the lineage and permissions of your dashboards, analyses, templates, datasets, and data sources. We also created some visuals to display SPICE usage by data set as well as the last refresh time per data set, allowing you to view the health of your SPICE refreshes and to free up SPICE capacity by cleaning up older data sets.

Give this technique of building administrative dashboards from data collected via the QuickSight APIs a try, and share your feedback and questions in the comments.


About the Authors

Jesse Gebhardt is a senior global business development manager focused on analytics. He has spent over 10 years in the Business Intelligence industry. At AWS, he aids customers around the globe gain insight and value from the data they have stored in their data lakes and data warehouses. Jesse lives in sunny Phoenix, and is an amateur electronic music producer.

 

 

Arun Santhosh is a Specialized World Wide Solution Architect for Amazon QuickSight. Arun started his career at IBM as a developer and progressed on to be an Application Architect. Later, he worked as a Technical Architect at Cognizant. Business Intelligence has been his core focus in these prior roles as well.

 

 

Shawn Koupal is an Enterprise Analytics IT Architect at Best Western International, Inc.

 

Accessing and visualizing data from multiple data sources with Amazon Athena and Amazon QuickSight

Post Syndicated from Saurabh Bhutyani original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-data-from-multiple-data-sources-with-amazon-athena-and-amazon-quicksight/

Amazon Athena now supports federated query, a feature that allows you to query data in sources other than Amazon Simple Storage Service (Amazon S3). You can use federated queries in Athena to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources. Athena queries, including federated queries, can be run from the Athena console, a JDBC or ODBC connection, the Athena API, the Athena CLI, the AWS SDK, or AWS Tools for Windows PowerShell.

The goal for this post is to discuss how we can use different connectors to run federated queries with complex joins across different data sources with Athena and visualize the data with Amazon QuickSight.

Athena Federated Query

Athena uses data source connectors that run on AWS Lambda to run federated queries. A data source connector is a piece of code that translates between your target data source and Athena. You can think of a connector as an extension of the Athena query engine. Prebuilt Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB (with MongoDB compatibility), Amazon Elasticsearch Service (Amazon ES), Amazon ElastiCache for Redis, and JDBC-compliant relational data sources such as MySQL, PostgreSQL, and Amazon Redshift under the Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. After you deploy data source connectors, the connector is associated with a catalog name that you can specify in your SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query.
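For example, a single query can join tables that live in two different catalogs; a sketch in which the catalog names match connectors registered later in this post, while the schema, table, and column names are purely illustrative:

select c.customer_name,
       sum(o.order_total) as total_spend
from mysql.sales.customers c
join dynamo.default.orders o
  on o.customer_id = c.customer_id
group by c.customer_name
order by total_spend desc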

When a query is submitted against a data source, Athena invokes the corresponding connector to identify parts of the tables that need to be read, manages parallelism, and pushes down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Because connectors run in Lambda, you can use them to access data from any data source in the cloud or on premises that is accessible from Lambda.

Prerequisites

Before creating your development environment, you must have the following prerequisites:

Configuring your data source connectors

After you deploy your CloudFormation stack, follow the instructions in the post Extracting and joining data from multiple data sources with Athena Federated Query to configure various Athena data source connectors for HBase on Amazon EMR, DynamoDB, ElastiCache for Redis, and Amazon Aurora MySQL.

You can run Athena federated queries in the AmazonAthenaPreviewFunctionality workgroup created as part of the CloudFormation stack or you could run them in the primary workgroup or other workgroups as long as you’re running with Athena engine version 2. As of this writing, Athena Federated Query is generally available in the Asia Pacific (Mumbai), Asia Pacific (Tokyo), Europe (Ireland), US East (N. Virginia), US East (Ohio), US West (N. California), and US West (Oregon) Regions. If you’re running in other Regions, use the AmazonAthenaPreviewFunctionality workgroup.

For information about changing your workgroup to Athena engine version 2, see Changing Athena Engine Versions.

Configuring QuickSight

The next step is to configure QuickSight to use these connectors to query data and visualize with QuickSight.

  1. On the AWS Management Console, navigate to QuickSight.
  2. If you’re not signed up for QuickSight, you’re prompted with the option to sign up. Follow the steps to sign up to use QuickSight.
  3. After you log in to QuickSight, choose Manage QuickSight under your account.
  4. In the navigation pane, choose Security & permissions.
  5. Under QuickSight access to AWS services, choose Add or remove.

A page appears for enabling QuickSight access to AWS services.

  6. Choose Athena.

  7. In the pop-up window, choose Next.

  8. On the S3 tab, select the necessary S3 buckets. For this post, I select the athena-federation-workshop-<account_id> bucket and another one that stores my Athena query results.
  9. For each bucket, also select Write permission for Athena Workgroup.

  10. On the Lambda tab, select the Lambda functions corresponding to the Athena federated connectors that Athena federated queries use. If you followed the post Extracting and joining data from multiple data sources with Athena Federated Query when configuring your Athena federated connectors, you can select dynamo, hbase, mysql, and redis.

For information about registering a data source in Athena, see the appendix in this post.

  11. Choose Finish.

  12. Choose Update.
  13. On the QuickSight console, choose New analysis.
  14. Choose New dataset.
  15. For Datasets, choose Athena.
  16. For Data source name, enter Athena-federation.
  17. For Athena workgroup, choose primary.
  18. Choose Create data source.

As stated earlier, you can use the AmazonAthenaPreviewFunctionality workgroup or another workgroup as long as you’re running Athena engine version 2 in a supported Region.

  19. For Catalog, choose the catalog that you created for your Athena federated connector.

For information about creating and registering a data source in Athena, see the appendix in this post.

  20. For this post, I choose the dynamo catalog, which does a federation to the Athena DynamoDB connector.

I can now see the database and tables listed in QuickSight.

  21. Choose Edit/Preview data to see the data.
  22. Choose Save & Visualize to start using this data for creating visualizations in QuickSight.

  23. To do a join with another Athena data source, choose Add data and select the catalog and table.
  24. Choose the join link between the two datasets and choose the appropriate join configuration.
  25. Choose Apply.

You should be able to see the joined data.

Running a query in QuickSight

Now we use the custom SQL option in QuickSight to run a complex query with multiple Athena federated data sources.

  1. On the QuickSight console, choose New analysis.
  2. Choose New dataset.
  3. For Datasets, choose Athena.
  4. For Data source name, enter Athena-federation.
  5. For the workgroup, choose primary.
  6. Choose Create data source.
  7. Choose Use custom SQL.
  8. Enter the query for ProfitBySupplierNation.
  9. Choose Edit/Preview data.


Under Query mode, you have the option to view your query in either SPICE or direct query. SPICE is the QuickSight Super-fast, Parallel, In-memory Calculation Engine. It’s engineered to rapidly perform advanced calculations and serve data. Using SPICE can save time and money because your analytical queries process faster, you don’t need to wait for a direct query to process, and you can reuse data stored in SPICE multiple times without incurring additional costs. You also can refresh data in SPICE on a recurring basis as needed or on demand. For more information about refresh options, see Refreshing Data.

With direct query, QuickSight doesn’t use SPICE data and sends the query every time to Athena to get the data.
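If you import into SPICE, you can also trigger a refresh programmatically with the CreateIngestion API; a minimal boto3 sketch (the Region, account ID, dataset ID, and ingestion ID are placeholders):

import boto3

quicksight = boto3.client('quicksight', region_name='us-east-1')

# Kicks off a SPICE refresh; IngestionId just needs to be unique per request
response = quicksight.create_ingestion(
    AwsAccountId='111122223333',
    DataSetId='your-spice-dataset-id',
    IngestionId='manual-refresh-001'
)
print(response['IngestionStatus'])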

  10. Select SPICE.
  11. Choose Apply.
  12. Choose Save & visualize.

  13. On the Visualize page, under Fields list, choose nation and sum_profit.

QuickSight automatically chooses the best visualization type based on the selected fields. You can change the visual type based on your requirement. The following screenshot shows a pie chart for Sum_profit grouped by Nation.


You can add more datasets using Athena federated queries and create dashboards. The following screenshot is an example of a visual analysis over various datasets that were added as part of this post.


When your analysis is ready, you can choose Share to create a dashboard and share it within your organization.

Summary

QuickSight is a powerful visualization tool, and with Athena federated queries, you can run analysis and build dashboards on various data sources like DynamoDB, HBase on Amazon EMR, and many more. You can also easily join relational, non-relational, and custom object stores in Athena queries and use them with QuickSight to create visualizations and dashboards.

For more information about Athena Federated Query, see Using Amazon Athena Federated Query and Query any data source with Amazon Athena’s new federated query.


Appendix

To register a data source in Athena, complete the following steps:

  1. On the Athena console, choose Data sources.
  2. Choose Connect data source.

  3. Select Query a data source.
  4. For Choose a data source, select a data source (for this post, I select Redis).
  5. Choose Next.

  6. For Lambda function, choose your function.

For this post, I use the redis Lambda function, which I configured as part of configuring the Athena federated connector in the post Extracting and joining data from multiple data sources with Athena Federated Query.

  7. For Catalog name, enter a name (for example, redis).

The catalog name you specify here is the one that is displayed in QuickSight when selecting Lambda functions for access.

  8. Choose Connect.

When the data source is registered, it’s available in the Data source drop-down list on the Athena console.



About the Author

Saurabh Bhutyani is a Senior Big Data Specialist Solutions Architect at Amazon Web Services. He is an early adopter of open-source big data technologies. At AWS, he works with customers to provide architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. In his free time, he likes to watch movies and spend time with his family.

 

 

 

Building a Controlled Environment Agriculture Platform

Post Syndicated from Ashu Joshi original https://aws.amazon.com/blogs/architecture/building-a-controlled-environment-agriculture-platform/

This post was co-written by Michael Wirig, Software Engineering Manager at Grōv Technologies.

A substantial percentage of the world’s habitable land is used for livestock farming for dairy and meat production. The dairy industry has leveraged technology to gain insights that have led to drastic improvements and are continuing to accelerate. A gallon of milk in 2017 involved 30% less water, 21% less land, a 19% smaller carbon footprint, and 20% less manure than it did in 2007 (US Dairy, 2019). By focusing on smarter water usage and sustainable land usage, livestock farming can grow to provide sustainable and nutrient-dense food for consumers and livestock alike.

Grōv Technologies (Grōv) has pioneered the Olympus Tower Farm, a fully automated Controlled Environment Agriculture (CEA) system. Unique amongst vertical farming startups, Grōv is growing cattle feed to improve the sustainable use of land for livestock farming while increasing the economic margins for dairy and beef producers.

The challenges of CEA

The set of growing conditions for a CEA is called a “recipe,” which is a combination of ingredients like temperature, humidity, light, carbon dioxide levels, and water. The optimal recipe is dynamic and is sensitive to its ingredients. Crops must be monitored in near-real time, and CEAs should be able to self-correct in order to maintain the recipe. To build a system with these capabilities requires answers to the following questions:

  • What parameters are needed to measure for indoor cattle feed production?
  • What sensors enable the accuracy and price trade-offs at scale?
  • Where do you place the sensors to ensure a consistent crop?
  • How do you correlate the data from sensors to the nutrient value?

To progress from a passively monitored system to a self-correcting, autonomous one, the CEA platform also needs to address:

  • How to maintain optimum crop conditions
  • How the system can learn and adapt to new seed varieties
  • How to communicate key business drivers such as yield and dry matter percentage

Grōv partnered with AWS Professional Services (AWS ProServe) to build a digital CEA platform addressing the challenges posed above.

Olympus Tower - Grov Technologies

Tower automation and edge platform

The Olympus Tower is instrumented for measuring recipe ingredients by combining the mechanical, electrical, and domain expertise of the Grōv team with the IoT edge and sensor expertise of the AWS ProServe team. The teams identified a primary set of features such as height, weight, and evenness of the growth to be measured at multiple stages within the Tower. Sensors were also added to measure secondary features such as water level, water pH, temperature, humidity, and carbon dioxide.

The teams designed and developed a purpose-built modular and industrial sensor station. Each sensor station has sensors for direct measurement of the features identified. The sensor stations are extended to support indirect measurement of features using a combination of Computer Vision and Machine Learning (CV/ML).

The trays with the growing cattle feed circulate through the Olympus Tower. A growth cycle starts on a tray with seeding, circulates through the tower over the cycle, and returns to the starting position to be harvested. The sensor station at the seeding location on the Olympus Tower tags each new growth cycle in a tray with a unique “Grow ID.” As trays pass by, each sensor station in the Tower collects the feature data. The firmware, jointly developed for the sensor station, uses AWS IoT SDK to stream the sensor data along with the Grow ID and metadata that’s specific to the sensor station. This information is sent every five minutes to an on-site edge gateway powered by AWS IoT Greengrass. Dedicated AWS Lambda functions manage the lifecycle of the Grow IDs and the sensor data processing on the edge.
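To make the data flow concrete, here is a rough sketch of what a sensor-station publish could look like with the AWS IoT Device SDK for Python; the endpoint, certificate paths, topic, and payload fields are all illustrative assumptions, not Grōv's actual firmware:

from AWSIoTPythonSDK.MQTTLib import AWSIoTMQTTClient
import json
import time

# Endpoint and credential paths are placeholders for a real device configuration
client = AWSIoTMQTTClient("sensor-station-42")
client.configureEndpoint("xxxxxxxxxxxxxx-ats.iot.us-west-2.amazonaws.com", 8883)
client.configureCredentials("AmazonRootCA1.pem", "private.pem.key", "certificate.pem.crt")
client.connect()

reading = {
    "growId": "GROW-2021-0001",   # tag assigned at the seeding station
    "stationId": "station-42",
    "timestamp": int(time.time()),
    "temperatureC": 21.4,
    "humidityPct": 63.0,
    "co2Ppm": 910
}
# Publish with QoS 1 so the edge gateway receives each reading at least once
client.publish("grov/tower/station-42/telemetry", json.dumps(reading), 1)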

The Grōv team developed AWS Greengrass Lambda functions running at the edge to ingest critical metrics from the operation automation software running the Olympus Towers. This information provides the ability to not just monitor the operational efficiency, but to provide the hooks to control the feedback loop.

The two sources of data were augmented with site-level data by installing sensor stations at the building level or site level to capture environmental data such as weather and energy consumption of the Towers.

All three sources of data are streamed to AWS IoT Greengrass and are processed by AWS Lambda functions. The edge software also fuses the data and correlates all categories of data together. This enables two major actions for the Grōv team – operational capability in real-time at the edge and enhanced data streamed into the cloud.

Grov Technologies - Architecture

Cloud pipeline/platform: analytics and visualization

As the data is streamed to AWS IoT Core via AWS IoT Greengrass, AWS IoT rules route the ingested data to Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB for storage. The data pipeline also includes Amazon Kinesis Data Streams for batching and additional processing on the incoming data.

A ReactJS-based dashboard application is powered using Amazon API Gateway and AWS Lambda functions to report relevant metrics such as daily yield and machine uptime.

A data pipeline is deployed to analyze data using Amazon QuickSight. AWS Glue is used to create a dataset from the data stored in Amazon S3. Amazon Athena is used to query the dataset to make it available to Amazon QuickSight. This gives the extended Grōv team of research scientists the ability to perform a series of what-if analyses on the data coming in from the Tower systems, beyond what is available in the ReactJS-based dashboard.

Data pipeline - Grov Technologies

Completing the data-driven loop

Now that the data has been collected from all sources and stored it in a data lake architecture, the Grōv CEA platform established a strong foundation for harnessing the insights and delivering the customer outcomes using machine learning.

The integrated and fused data from the edge (sourced from the Olympus Tower instrumentation, Olympus automation software data, and site-level data) is correlated with the lab analysis performed by the Grōv Research Center (GRC). Harvest samples are routinely collected and sent to the lab, which performs wet chemistry and microbiological analysis. The lab results for each sampled tray are joined to the corresponding sensor data through the tray’s Grow ID. This serves as a mechanism for labeling and correlating the recipe data with the parameters used by dairy and beef producers – dry matter percentage, micro and macronutrients, and the presence of mycotoxins.

Grōv has chosen Amazon SageMaker to build a machine learning pipeline on its comprehensive data set, which will enable fine tuning the growing protocols in near real-time. Historical data collection unlocks machine learning use cases for future detection of anomalous sensors readings and sensor health monitoring, as well.

Because the solution is flexible, the Grōv team plans to integrate data from animal studies on their health and feed efficiency into the CEA platform. Machine learning on the data from animal studies will enhance the tuning of recipe ingredients that impact the animals’ health. This will give the farmer an unprecedented view of the impact of feed nutrition on the end product and consumer.

Conclusion

Grōv Technologies and AWS ProServe have built a strong foundation for an extensible and scalable CEA platform architecture that will nourish animals for better health and yield, produce healthier foods, and enable continued research into dairy production, rumination, and animal health to empower sustainable farming practices.

New charts, formatting, and layout options in Amazon QuickSight

Post Syndicated from Sapna Maheshwari original https://aws.amazon.com/blogs/big-data/new-charts-formatting-and-layout-options-in-amazon-quicksight/

Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to create and deliver insights to everyone in your organization. In this post, we explore how authors of QuickSight dashboards can use some of the new chart types, layout options, and dashboard formatting controls to deliver dashboards that intuitively deliver insights to all your users, whether within QuickSight or embedded in your websites or multi-tenant apps.

This blog post explores some of the visualization and dashboard customization features offered in Amazon QuickSight with the following datasets:

  • Ads
  • Student Performance
  • P&L
  • Patient Info

Picking the right chart for your use case

In this section, we explore some of the new charts QuickSight introduced in 2020 and how these help with various use cases.

Funnel charts

Funnel charts help visualize the progressive reduction of data as it passes from one phase to another. Data in each of these phases is represented as different portions of 100% (the whole). The most common use of the funnel chart is in visualizing conversion data. For example, you can represent sales lead generation showing different stages of sales conversion from first contact to lead generation.

To build a funnel chart with our Ads dataset, complete the following steps:

  1. On the analysis page, choose Visualize.
  2. Choose Add, then choose Add visual.
  3. In the Visual types pane, choose the funnel chart icon.
  4. For Group by, choose Stage.
  5. For Value, choose Ad name.
  6. To change default configuration, choose the gear icon.
  7. In the Data labels section, for Metric label style, choose Value and percent of first.

The video below demonstrates these steps.

Stacked area charts

Stacked area charts are best used to visualize part-to-whole relationships, to show how each category contributes to the cumulative total. For this post, we create a stacked area chart with the Ads dataset.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the stacked area chart icon.
  • For X axis, choose Date (MONTH).
  • For Value, choose Cost (Sum).
  • For Color, choose Segment.
  • Choose the gear icon.
  • Under Legend, deselect Legend title.
  • Under Y-Axis, select Show Y axis label.
  • Under Data labels, select Show data labels.
  • Choose your desired position, font size, font color, and label pattern.

Histograms

Histograms help visualize the frequency distribution of a dataset and display numerical data by grouping data into bins of equal width. Each bin is plotted as a bar whose height corresponds to the number of data points within the bin.

For this post, we use the Student Performance dataset to create a histogram.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the histogram icon.
  • For Value, choose math score.

You can customize the histogram to define bins by bin count, bin width, or a custom start value. For this post, we define bins by width.

  • Under Histogram, select Bin width.
  • For Bin width, enter 5.

Box plots

A box plot (also called a box and whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. This is useful to determine whether data is symmetrical, skewed, or tightly grouped. Box plots also show outliers.

For this post, we create a box plot on the Student Performance dataset.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the box plot icon.
  • For Group by, choose Gender.
  • For Value, choose writing score and reading score.
  • In the visual settings, under Box plot, select Show outliers and Show all data points.
  • Under Legend, deselect Show legend title.

Waterfall charts

Waterfall charts help you understand the cumulative effect of sequentially introduced positive or negative values. This is great for understanding contributions to a whole, for example, the main contributors to a monthly increase in revenue, or the breakdown of revenue vs. costs.

We use the P&L dataset to create a waterfall chart.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual selection pane, choose the waterfall chart icon.
  • For Category, choose Line item.
  • For Value, choose Value (Sum).
  • Under Legend, deselect Show legend title.
  • For Position, select Bottom.
  • Under Title, deselect Show title.

Choropleth maps

Choropleth maps use differences in shading or coloring within geographical areas or regions to indicate the value of a variable in those areas.

We use the Patient Info dataset to create a choropleth map.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the filled map icon.
  • For Location, choose State.
  • For Color, choose Revenue (Sum).
  • Choose the menu options icon (…) and choose Conditional formatting.
  • For Column, choose Revenue.
  • For Fill type, select Gradient.
  • For Max value, choose a color (for this post, we choose blue).

The video below demonstrates these steps.

You can also control the color and shading of the geographic areas using conditional formatting as shown below.

The video below demonstrates these steps.

Customization and formatting options

QuickSight also supports several formatting options that allow you to streamline visualizations and convey additional information in your dashboards.

Table/pivot table formatting

Pin or unpin and add custom text totals

You can now pin totals to the top or bottom of tables and pivot tables in QuickSight. This feature helps you view the totals even while scrolling through the tables.

  • Go to the visual settings (the gear icon on the visual menu).
  • Under Total, select Pin totals.
  • For Position, choose a position (for this post, we choose Bottom).

Additionally, you can edit the text you want to show on totals and subtotals.

  • For Total label, enter your custom text (for this post, we enter Grand Total).

Table alignment and wrapping

You can now set horizontal alignment (left, right, center, auto) and vertical alignment (top, middle, bottom) on column headers and cell values in a table visual. Additionally, you can apply text wrapping on table and pivot table headers so that long headers are readable instead of requiring you to scroll across them.

These options are available under Table options.

Hide +/- buttons on pivot tables

You can now show or hide +/- buttons on pivot tables. This allows you to improve the presentation of pivot tables by removing these icons and keeping the pivot table simple. This option is available under Styling.

Visual customization options

In this section, we discuss additional customization options in QuickSight.

Custom sorting

If you want to sort your charts in a custom order different from the default alphabetical order, you can now do so in QuickSight. For example, you can sort geographical regions in the order of East, West, Central, and South by ranking these regions 1–4 and then sorting on this rank field. See the video below to learn how.

You can also sort on any other critical metric field that isn't part of the visual. Choose your field well and choose Sort options to see the available sort order options.

The following screenshot shows your sorted visualizations.

Adding descriptive text, images, and links

You can add images or logos to your dashboard using QuickSight’s narrative component available in Enterprise Edition.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the insights icon.
  • Choose Customize insight.
  • Remove any existing text and add your custom text.

You can also add hyperlinks to text and images. Upload your image to a secure location where the image is accessible to QuickSight and your users.

The video below demonstrates these steps.

Customizing colors and fonts

QuickSight offers easy-to-build themes that allow customization of the color palette, background and foreground colors, spacing, fonts, and more. Themes can be created by authors and shared within an account, and are also accessible via APIs for programmatic management. Themes can also be defaulted for all users in the organization using APIs.
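
As a sketch of the API route, the following Boto3 call sets an account-wide default theme. The account ID is a placeholder, and the theme ARN shown is one of the built-in starter themes; substitute the ARN of your own theme as needed.

import boto3

qs = boto3.client("quicksight")

# Default every user in the account's default namespace to a chosen theme.
qs.update_account_customization(
    AwsAccountId="123456789012",  # placeholder account ID
    Namespace="default",
    AccountCustomization={
        "DefaultTheme": "arn:aws:quicksight::aws:theme/MIDNIGHT"
    },
)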

You can also prioritize the colors that you want to use in your dashboard by prioritizing them within your theme’s color palette.

You can apply predefined themes available out of the box or create your own themes that fit your corporate branding. The following screenshots show how a dashboard looks in both dark and light themes.

For more information, see Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities.

Null value customization

You may occasionally have null values in your data and want to represent them with different annotations. For each dimension and metric in the dataset, you can provide custom text for null values. Go to More formatting options for any field.

The option is available under Null values.

Reference lines

You can draw reference lines based on a calculated field or a constant value. Choose the gear icon and navigate to the reference lines section. In the following screenshot, the orange reference line is based on a calculation (average profit) and the black reference line is plotted at a constant value of 1,000.

You can also link reference lines to parameters via a calculated field, which allows you to create what-if scenarios within your dashboard.

Custom colors on heat and tree maps

Color gradient customization on heat and tree maps allows you to select colors for lower, intermediate, and upper limits so that the gradient is applied within these colors. You can configure this under Color in the visual settings.

Using logarithmic scale

If your metric values span several orders of magnitude, with some extremely high and others low (for example, stock prices for different entities or pandemic spread rates), you can represent them on a logarithmic scale so they're normalized, yet relative. To use a logarithmic scale, go to the visual settings and under Y-Axis, for Scale, select Logarithmic.

The following screenshot shows your visualization after applying logarithmic scale.

Adjustable font size

You can now apply different font sizes to all visual content and visual titles. In pivot tables and tables, you can set font sizes for table headers, cells, totals, and subtotals. In key performance indicators (KPIs), you can set font sizes for primary and comparison values, which allows you to keep dashboards dense and add more KPIs.

Actions

Finally, to all these charts, you can apply the following actions:

  • Filter actions – Select points on a chart to filter across the dashboard. QuickSight supports hierarchical filter actions that allow you to trigger one filter action from more than one chart. For more information, see Enhancing dashboard interactivity with Amazon QuickSight Actions.
  • URL actions – Trigger navigation from the dashboard to an external website and pass dynamic values within a URL.

Layout enhancements

QuickSight dashboards default to auto-fit mode, which makes them responsive based on screen size. However, in many situations, it’s preferable that the view you design is exactly what end-users see, whether on a laptop or a large monitor. QuickSight offers optimized layouts that allow you to pick a specific screen resolution to optimize for (such as the screen size most of your users use on a daily basis), and QuickSight automatically scales the dashboard view to render appropriately on larger or smaller screens. This doesn’t affect mobile devices—QuickSight automatically optimizes for mobile devices using a single-column layout. To adjust the scaling mode, choose Settings in the navigation pane while in dashboard authoring (analysis) mode.

If you build your dashboard for a 1024 px screen, for example, QuickSight scales that view to a larger or smaller screen to ensure that all users see the same content (mobile devices continue to fall back to a single-column, mobile-specific layout to ensure usability). Opting for the optimized mode also makes sure that your email reports look exactly like the dashboard that your viewers interact with.

On-sheet filter controls

You can now add filters to your dashboard directly without having to create parameters. Choose the field that you need to filter on, and choose Add filter for this field. Choose the newly added filter and choose Add to sheet.

If you need to pin it to the controls section, choose the filter and choose Pin to top.

The video below demonstrates these steps.

QuickSight allows you to choose from any of these control types to add to dashboards – single-select drop-downs, multi-select drop-downs, date and time pickers, single-sided sliders, single-line text boxes, time range pickers, relative date selections, and numeric range sliders. Learn more about on-sheet controls in this blog post.

Other launches in 2020

While this post covers the key charting and visualization launches in 2020, you can take a look at new features enabled across other areas of QuickSight in this blog post.

Conclusion

With these new QuickSight feature releases, you can now choose the chart type that is best suited to represent your data. You can provide richer dashboards for your readers by using the new formatting table options, dynamic titles, and reference lines. For more information about authoring dashboards in QuickSight, watch the virtual workshop Build Advanced Analytics and Dashboards with Amazon QuickSight and consider subscribing to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs.


About the Author

Sapna Maheshwari is a Specialist Solutions Architect for Amazon QuickSight. She is passionate about telling stories with data. In her previous roles at American Express and Early Warning Services, she managed and led several projects in the data and analytics space. She enjoys helping customers unearth actionable insights from their data.

Field Notes: Ingest and Visualize Your Flat-file IoT Data with AWS IoT Services

Post Syndicated from Paul Ramsey original https://aws.amazon.com/blogs/architecture/field-notes-ingest-and-visualize-your-flat-file-iot-data-with-aws-iot-services/

Customers who maintain manufacturing facilities often find it challenging to ingest, centralize, and visualize IoT data that is emitted in flat-file format from their factory equipment. While modern IoT-enabled industrial devices can communicate over standard protocols like MQTT, there are still some legacy devices that generate useful data but are only capable of writing it locally to a flat file. This results in siloed data that is either analyzed in a vacuum without the broader context, or it is not available to business users to be analyzed at all.

AWS provides a suite of IoT and Edge services that can be used to solve this problem. In this blog, I walk you through one method of leveraging these services to ingest hard-to-reach data into the AWS cloud and extract business value from it.

Overview of solution

This solution provides a working example of an edge device running AWS IoT Greengrass with an AWS Lambda function that watches a Samba file share for new .csv files (presumably containing device or assembly line data). When it finds a new file, it will transform it to JSON format and write it to AWS IoT Core. The data is then sent to AWS IoT Analytics for processing and storage, and Amazon QuickSight is used to visualize and gain insights from the data.

Samba file share solution diagram

Since we don’t have an actual on-premises environment to use for this walkthrough, we’ll simulate pieces of it:

  • In place of the legacy factory equipment, an EC2 instance running Windows Server 2019 will generate data in .csv format and write it to the Samba file share.
    • We’re using a Windows Server for this function to demonstrate that the solution is platform-agnostic. As long as the flat file is written to a file share, AWS IoT Greengrass can ingest it.
  • An EC2 instance running Amazon Linux will act as the edge device and will host AWS IoT Greengrass Core and the Samba share.
    • In the real world, these could be two separate devices, and the device running AWS IoT Greengrass could be as small as a Raspberry Pi.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS Account
  • Access to provision and delete AWS resources
  • Basic knowledge of Windows and Linux server administration
  • If you’re unfamiliar with AWS IoT Greengrass concepts like Subscriptions and Cores, review the AWS IoT Greengrass documentation for a detailed description.

Walkthrough

First, we’ll show you the steps to launch the AWS IoT Greengrass resources using AWS CloudFormation. The AWS CloudFormation template is derived from the template provided in this blog post. Review the post for a detailed description of the template and its various options.

  1. Create a key pair. This will be used to access the EC2 instances created by the CloudFormation template in the next step.
  2. Launch a new AWS CloudFormation stack in the N. Virginia (us-east-1) Region using iot-cfn.yml, which represents the simulated environment described in the preceding bullet.
    •  Parameters:
      • Name the stack IoTGreengrass.
      • For EC2KeyPairName, select the EC2 key pair you just created from the drop-down menu.
      • For SecurityAccessCIDR, use your public IP with a /32 CIDR (for example, 1.1.1.1/32).
      • You can also accept the default of 0.0.0.0/0 if you are comfortable having SSH and RDP open to all sources on the EC2 instances in this demo environment.
      • Accept the defaults for the remaining parameters.
  •  View the Resources tab after stack creation completes. The stack creates the following resources:
    • A VPC with two subnets, two route tables with routes, an internet gateway, and a security group.
    • Two EC2 instances, one running Amazon Linux and the other running Windows Server 2019.
    • An IAM role, policy, and instance profile for the Amazon Linux instance.
    • A Lambda function called GGSampleFunction, which we’ll update with code to parse our flat-files with AWS IoT Greengrass in a later step.
    • An AWS IoT Greengrass Group, Subscription, and Core.
    • Other supporting objects and custom resource types.
  • View the Outputs tab and copy the IPs somewhere easy to retrieve. You’ll need them for multiple provisioning steps below.

3. Review the AWS IoT Greengrass resources created on your behalf by CloudFormation:

    • Search for IoT Greengrass in the Services drop-down menu and select it.
    • Click Manage your Groups.
    • Click file_ingestion.
    • Navigate through the Subscriptions, Cores, and other tabs to review the configurations.

Leveraging a device running AWS IoT Greengrass at the edge, we can now interact with flat-file data that was previously difficult to collect, centralize, aggregate, and analyze.

Set up the Samba file share

Now, we set up the Samba file share where we will write our flat-file data. In our demo environment, we’re creating the file share on the same server that runs the Greengrass software. In the real world, this file share could be hosted elsewhere as long as the device that runs Greengrass can access it via the network.

  • Follow the instructions in setup_file_share.md to set up the Samba file share on the AWS IoT Greengrass EC2 instance.
  • Keep your terminal window open. You’ll need it again for a later step.

Configure Lambda Function for AWS IoT Greengrass

AWS IoT Greengrass provides a Lambda runtime environment for user-defined code that you author in AWS Lambda. Lambda functions that are deployed to an AWS IoT Greengrass Core run in the Core’s local Lambda runtime. In this example, we update the Lambda function created by CloudFormation with code that watches for new files on the Samba share, parses them, and writes the data to an MQTT topic.
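
Before walking through the console steps, here is a minimal sketch of what such a long-lived function can look like, assuming the share is mounted at /samba/share and the target topic is iot/data (names taken from this walkthrough); it is not the exact code in the provided zip file.

import csv
import json
import os
import time

import greengrasssdk

client = greengrasssdk.client("iot-data")
WATCH_DIR = "/samba/share"  # assumed local mount point of the Samba share
TOPIC = "iot/data"
seen = set()

def handler(event, context):
    # Long-lived Greengrass Lambdas do their work outside the handler.
    return

# Module-level loop: runs indefinitely because the function is configured
# as long-lived in the Greengrass group settings.
while True:
    for name in sorted(os.listdir(WATCH_DIR)):
        if name.endswith(".csv") and name not in seen:
            seen.add(name)
            with open(os.path.join(WATCH_DIR, name)) as f:
                for row in csv.DictReader(f):
                    # Each CSV row becomes one JSON message on the MQTT topic.
                    client.publish(topic=TOPIC, payload=json.dumps(row))
    time.sleep(5)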

  1. Update the Lambda function:
    • Search for Lambda in the Services drop-down menu and select it.
    • Select the file_ingestion_lambda function.
    • From the Function code pane, click Actions then Upload a .zip file.
    • Upload the provided zip file containing the Lambda code.
    • Select Actions > Publish new version > Publish.

2. Update the Lambda Alias to point to the new version.

    • Select the Version: X drop-down (“X” being the latest version number).
    • Choose the Aliases tab and select gg_file_ingestion.
    • Scroll down to Alias configuration and select Edit.
    • Choose the newest version number and click Save.
    • Do NOT use $LATEST as it is not supported by AWS IoT Greengrass.

3. Associate the Lambda function with AWS IoT Greengrass.

    • Search for IoT Greengrass in the Services drop-down menu and select it.
    • Select Groups and choose file_ingestion.
    • Select Lambdas > Add Lambda.
    • Click Use existing Lambda.
    • Select file_ingestion_lambda > Next.
    • Select Alias: gg_file_ingestion > Finish.
    • You should now see your Lambda associated with the AWS IoT Greengrass group.
    • Still on the Lambda function tab, click the ellipsis and choose Edit configuration.
    • Change the following Lambda settings then click Update:
      • Set Containerization to No container (always).
      • Set Timeout to 25 seconds (or longer if you have large files to process).
      • Set Lambda lifecycle to Make this function long-lived and keep it running indefinitely.

Deploy AWS IoT Greengrass Group

  1. Restart the AWS IoT Greengrass daemon:
    • A daemon restart is required after changing containerization settings. Run the following commands on the Greengrass instance to restart the AWS IoT Greengrass daemon:
 cd /greengrass/ggc/core/
 sudo ./greengrassd stop
 sudo ./greengrassd start

2. Deploy the AWS IoT Greengrass Group to the Core device.

    • Return to the file_ingestion AWS IoT Greengrass Group in the console.
    • Select Actions > Deploy.
    • Select Automatic detection.
    • After a few minutes, you should see a Status of Successfully completed. If the deployment fails, check the logs, fix the issues, and deploy again.

Generate test data

You can now generate test data that is ingested by AWS IoT Greengrass, written to AWS IoT Core, and then sent to AWS IoT Analytics and visualized by Amazon QuickSight.

  1. Follow the instructions in generate_test_data.md to generate the test data.
  2. Verify that the data is being written to AWS IoT Core following these instructions (use iot/data for the MQTT Subscription Topic instead of hello/world).

Setup AWS IoT Analytics

Now that our data is in IoT Cloud, it only takes a few clicks to configure AWS IoT Analytics to process, store, and analyze our data.

  1. Search for IoT Analytics in the Services drop-down menu and select it.
  2. Set Resources prefix to file_ingestion and Topic to iot/data. Click Quick Create.
  3. Populate the data set by selecting Data sets > file_ingestion_dataset > Actions > Run now. If you don't get data on the first run, you may need to wait a couple of minutes and run it again.

Visualize the Data from AWS IoT Analytics in Amazon QuickSight

We can now use Amazon QuickSight to visualize the IoT data in our AWS IoT Analytics data set.

  1. Search for QuickSight in the Services drop-down menu and select it.
  2. If your account is not signed up for QuickSight yet, follow these instructions to sign up (use Standard Edition for this demo).
  3. Build a new report:
    • Click New analysis > New dataset.
    • Select AWS IoT Analytics.
    • Set Data source name to iot-file-ingestion and select file_ingestion_dataset. Click Create data source.
    • Click Visualize. Wait a moment while your rows are imported into SPICE.
    • You can now drag and drop data fields onto field wells. Review the QuickSight documentation for detailed instructions on creating visuals.
    • Following is an example of a QuickSight dashboard you can build using the demo data we generated in this walkthrough.

Cleaning up

Be sure to clean up the objects you created to avoid ongoing charges to your account.

  • In Amazon QuickSight, Cancel your subscription.
  • In AWS IoT Analytics, delete the datastore, channel, pipeline, data set, role, and topic rule you created.
  • In CloudFormation, delete the IoTGreengrass stack.
  • In Amazon CloudWatch, delete the log files associated with this solution.

Conclusion

Gaining valuable insights from device data that was once out of reach is now possible thanks to AWS’s suite of IoT services. In this walkthrough, we collected and transformed flat-file data at the edge and sent it to IoT Cloud using AWS IoT Greengrass. We then used AWS IoT Analytics to process, store, and analyze that data, and we built an intelligent dashboard to visualize and gain insights from the data using Amazon QuickSight. You can use this data to discover operational anomalies, enable better compliance reporting, monitor product quality, and many other use cases.

For more information on AWS IoT services, check out the overviews, use cases, and case studies on our product page. If you’re new to IoT concepts, I’d highly encourage you to take our free Internet of Things Foundation Series training.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Transform data and create dashboards simply using AWS Glue DataBrew and Amazon QuickSight

Post Syndicated from Prithiviraj Jothikumar original https://aws.amazon.com/blogs/big-data/transform-data-and-create-dashboards-simply-using-aws-glue-databrew-and-amazon-quicksight/

Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. The range and complexity of data transformation steps required depends on the visuals you would like in your dashboard. Often, the data transformation process is time-consuming and highly iterative, especially when you are working with large datasets.

In this post, we show you how to apply data transformations on COVID-related tweets (with sentiment scores) using AWS Glue DataBrew. AWS Glue DataBrew's visual and interactive interface allows us to apply data transformations without any coding. Some examples of the transformations we apply are changing date formats, transforming text strings, and performing table pivots. We then use Amazon QuickSight to visualize the transformed data in a dashboard.

See the AWS architecture diagram below for an overview of the complete and serverless data pipeline.


There are seven main steps in the data pipeline:

  1. Upload CSV file with tweets to S3
  2. Trigger AWS Lambda to partition the dataset if the number of rows exceeds a threshold, and store the output in S3. The threshold prevents out-of-memory failures when computing VADER sentiment scores.
  3. Use Amazon SQS and AWS Lambda to compute VADER sentiment scores for tweets and store the VADER output as parquet files in S3
  4. Crawl the VADER output and produce a Data Catalog using AWS Glue, so that the data can be queried easily using Amazon Athena
  5. Use the Data Catalog connector in AWS Glue DataBrew to create a recipe and perform transformations
  6. Setup a Recipe job to perform the transformation and send the output to an S3 location; repeat step 5 to produce a data catalog of the transformed data
  7. Query and submit the data to Amazon QuickSight to create visuals for the dashboard

Each section below corresponds to one (or more) of the seven steps described above.

Steps 1 to 3: Setting up and deploying the AWS Lambda pipeline

To test the solution, we can use the AWS CloudFormation template found here. The template automatically creates the following for you: the S3 bucket to store the CSV and Parquet files, the Lambda function to partition the files, the SQS queue, and the VADER Lambda function.

Deploying the CloudFormation Template

  1. Create an S3 bucket with the name of your choice and note the bucket name. Please refer to this guide for detailed steps on how to create an S3 bucket.
  2. Upload all four individual zip files to the S3 bucket created in step 1. To do so, download the zip file from here, unzip the downloaded file, and place each of the four individual zip files into the S3 bucket created in step 1.
  3. Open the AWS Management Console in the AWS Region you want to deploy the solution to, and on the Services menu, choose CloudFormation.
  4. Choose Create Stack, choose Upload a template to Amazon S3, and then choose the file databrew-cloudformation.yaml included in the solution you downloaded earlier (if you haven't downloaded it yet, use the link provided in this step).
  5. Set Stack name to databrew-stack. Specify the Amazon S3 bucket that contains the compressed version of AWS Lambda function code and layers uploaded in step 2.
  6. For Options, you can specify tags for your stack and an optional IAM role to be used by AWS CloudFormation to create resources. If the role isn’t specified, a new role is created. You can also perform additional configuration for rollback settings and notification options.
  7. The review section shows a recap of the information. Be sure to select the three AWS CloudFormation acknowledgements to allow AWS CloudFormation to create resources with custom names on your behalf. Also, create a change set, because the AWS CloudFormation template includes the AWS::Serverless-2016-10-31 transform.

  8. Click Execute.
  9. The Outputs for the stack lists all the resources created.

Downloading the dataset

  1. Download the dataset from here.

Testing the solution

  1. Navigate to the S3 console.
  2. Click on the bucket name (created by CloudFormation template and listed in the Outputs section of the stack).
  3. Create folder named input in the S3 bucket.
  4. Upload the downloaded dataset to the folder created in step 3.
  5. Uploading the dataset triggers a Lambda function that chunks the files into smaller files, and then another Lambda function, triggered by an SQS event, performs the following preprocessing steps: removal of https links, computation of VADER sentiment scores across the four categories (compound, positive, negative, and neutral), and saving the output as Parquet files under the S3 path processedv4. A minimal sketch of the sentiment-scoring step follows.
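
The sketch below illustrates the scoring step only, assuming the vaderSentiment package is available to the function (for example, via a Lambda layer); it is not the deployed function code, and the sample tweet is made up.

import re

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_tweet(text):
    # Remove https links before scoring, as described above.
    cleaned = re.sub(r"https\S+", "", text)
    # polarity_scores returns the four VADER categories:
    # compound, pos, neg, and neu.
    return analyzer.polarity_scores(cleaned)

print(score_tweet("Feeling hopeful about the rollout! https://t.co/example"))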

Step 4: Setup AWS Glue Data Catalog

An AWS Glue Data Catalog allows us to easily import data into AWS Glue DataBrew. Follow these steps to create an AWS Glue crawler that crawls the raw data with VADER output (the partitioned Parquet files in S3) and determines the schema:

  1. Choose a crawler name.
  2. Use the default options for Crawler source type.
  3. Provide the S3 location of the parquet files.
  4. Choose/Create an IAM role that has read/write permissions to S3 and the AWSGlueServiceRole policy attached.
  5. Set the frequency as Run on demand.
  6. Choose a database name for the crawler’s output.
  7. Leave all remaining options as default.

Once the crawler has been created, select and run the new crawler. Upon completion, the table schema is generated and visible within Tables under the Databases section in the AWS Glue console.

Step 5: Creating a DataBrew project

To get started with AWS Glue DataBrew, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project Name, enter covid19-tweets.
  4. For Attached recipe, choose Create new recipe.
  5. For Recipe name, enter covid19-tweets-recipe.
  6. For Select a dataset, select New dataset.
  7. For Dataset name, enter covid19-tweets-data.
  8. Import the AWS Glue table <YOUR-GLUE-TABLE-NAME> from the AWS Glue database <YOUR-GLUE-DATABASE-NAME>.
  9. Create a new AWS Identity and Access Management (IAM) policy and IAM role by following the steps on the AWS Glue DataBrew console, which provides DataBrew the necessary permissions to access Amazon S3, Amazon Athena and AWS Glue.
  10. Select Create Project.

Step 6: Data transformations, creating a AWS Glue DataBrew recipe and recipe job

Exploratory data analysis (EDA) is an essential component of the data transformation workflow. EDA allows us to gain an intuitive understanding of the dataset by summarizing its main characteristics such as the distribution of data across columns, the corresponding data types and summary statistics. For more information on EDA using AWS Glue DataBrew, see the Exploratory Data Analysis section of the post Data preprocessing for machine learning on Amazon EMR made easy with AWS Glue DataBrew.

The set of data transformation steps listed below are based on the EDA and the attributes we would like to visualize in the Amazon QuickSight dashboard.

  1. Delete the following columns: user_description, user_created, user_verified, is_retweet, user_followers, user_friends, user_favourites, and user_name. These columns are not needed for the Amazon QuickSight dashboard.

Based on the schema view of the dataset, we can observe that the date column is represented as a string. AWS Glue DataBrew provides various date-time format transformations to convert date types.

  1. From the dropdown, select Date functions, then DATEFORMAT.
    1. Create a new column based on the values in the source column.
    2. For source column, select date.
    3. For date format, select mm/dd/yy HH:MM.
  2. Delete the date column and rename the date_DATEFORMAT column to date.


A visual representation of the positive, negative, neutral, and compound scores generated by the VADER AWS Lambda function is helpful for understanding the sentiments of the tweets over time. Individual scores can vary widely between rows, leading to large fluctuations on the line graph. We can use the rolling window transformation over observations in a column to create a smoother, less fluctuating line graph (a pandas equivalent of this rolling average appears after the following steps). To apply the rolling window function, create a less granular version of the date column that is truncated to remove the hours and minutes.

  1. From the dropdown, select Date functions, then DATEFORMAT.
    1. Create a new column based on the values in the source column.
    2. For source column, select date.
    3. For date format, select mm/dd/yy to transform the observations at a date level.
    4. For destination column, enter date_level.
  2. From the dropdown, select Window functions, then ROLLING AVERAGE.
    1. For source column, select compound.
    2. Select 3 rows before and after. This is the size of the rolling window that slides over the column and affects the smoothness of the average score from one row to another; larger window sizes produce smoother rolling average scores and vice versa.
    3. Order the observations by the date column, because its granularity is at the hour and minute level.
    4. Group the observations by the date_level column, which represents the date without the time.
    5. For the destination column name, enter rolling_compound.

  3. Repeat the ROLLING AVERAGE step for each of the source columns pos, neu, and neg to create rolling_pos, rolling_neu, and rolling_neg.
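
For readers more familiar with code than with the DataBrew console, this pandas sketch expresses the same 3-before/3-after rolling average; the sample values are made up, and it is not part of the DataBrew recipe itself.

import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2020-07-01", periods=8, freq="H"),
    "compound": [0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1, 0.5],
})
df["date_level"] = df["date"].dt.date
df = df.sort_values("date")

# window=7 covers the current row plus 3 rows before and 3 after (center=True),
# grouped by the date-only column as in the DataBrew transform.
df["rolling_compound"] = (
    df.groupby("date_level")["compound"]
      .transform(lambda s: s.rolling(window=7, center=True, min_periods=1).mean())
)
print(df[["date", "compound", "rolling_compound"]])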

To support downstream visuals in Amazon QuickSight, apply the unpivot transformation to convert the selected columns into row values.

  1. Select the Pivot transformation.
    1. For pivot type, select Unpivot: Columns to rows.
    2. In the unpivot columns dropdown, select the columns rolling_compound, rolling_pos, rolling_neu, and rolling_neg.
    3. For column name, enter Rolling_category, and for column value, enter VADER_score.

The hashtags column contains an array of hashtags. To visualize the hashtags as a word cloud in Amazon QuickSight, create a new row for each hashtag from the array by applying the unnest transformation.

  1. Select the unnest transformation.
    1. For source column, select hashtags.
    2. For the destination column name, enter unnest_hashtags.

Normalize the data in the unnest_hashtags column:

  1. Format the data in the unnest_hashtags column to lowercase.
  2. Remove all special characters by selecting the clean transformation.

The data preparation phase is now complete, and the set of 20 transformations, consisting of date formatting, rolling window functions, and column normalization, is combined into a recipe.

The transformations above were applied to a sample of the first 500 rows of the dataset. An AWS Glue DataBrew recipe job provides the ability to scale the set of transformation steps from a sample of data to the entire dataset.

To create a recipe job, complete the following steps:

  1. On the AWS Glue DataBrew console, choose Jobs.
  2. Choose Create recipe job.
  3. For Job name, enter a name.
  4. Create a new folder in Amazon S3 for the recipe job output, select the file type as GLUEPARQUET and compression as Snappy.

If your dataset is updated on a regular basis, AWS Glue DataBrew provides an option to schedule jobs.

To query the newly transformed data from S3 into Amazon QuickSight, create another crawler/table in AWS Glue similar to the steps provided earlier (refer to the section Step 4: Setup AWS Glue Data Catalog).

Use Amazon QuickSight to visualize transformed data

Before proceeding, make sure your Amazon QuickSight account has IAM permissions to access Amazon Athena and Amazon S3. Add a new dataset by choosing Datasets in the left panel and then choosing New dataset. Select Athena from the list of data sources and provide a data source name. In the next section, select the database and table produced by the AWS Glue DataBrew transformation, click Use custom SQL, change the New custom SQL name to TableOrderedData, and paste the following SQL query before selecting Edit/Preview data:

SELECT * FROM "<db>"."<table_name>" t ORDER BY t.date;

Replace <db> with your database and <table_name> with your table name. Leave the quotes in place.

Click Apply to make some further modifications.

  1. For the date column, change the data type from String to Date and provide the format of the date as it is presented in the column (that is, MM/dd/yy HH:mm). Similarly, change the date_level column to Date.
  2. Click Save and visualize to proceed to the next step, which is to analyze the data. In the Analyses section, select Horizontal bar chart under Visual types, then choose vader_score and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate.
  3. For this visual, select Actions and select Filter same-sheet visuals under Quick create.
  4. Create another panel by selecting Add to create a line graph of the rolling scores. Select Line chart under Visual types and select the following in the presented order: vader_score, date, and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate. Likewise, change date (Day) to date (Hour) as the Aggregate. Similar to the previous panel, create a quick action item.
  5. Further panels can also be created, such as a word cloud of hashtags or a pivot table of the different VADER categories with score and date. The following image is based on selecting the rolling_pos bar in the first quadrant, which applies the rolling_pos filter and cascades it to the rest of the panels.
  6. The first quadrant depicts the overall average of each category across the entire dataset. The second quadrant depicts the rolling average aggregated by the hour. The third quadrant is a table representation of the rolling average aggregated by the day. The word cloud panel is built from the hashtags column.

Summary

In this post, we showed you how to interactively analyze and visualize a dataset using AWS Lambda, AWS Glue DataBrew, and Amazon QuickSight. We began with a COVID-19 tweets dataset and computed sentiment scores using the VADER algorithm. We then cleaned and prepared the tweets and their sentiment scores in AWS Glue DataBrew, and finally visualized key characteristics of the dataset in Amazon QuickSight.

We encourage you to extend this pipeline to your own data analytics and visualization use cases – there are many more pre-built transformations in AWS Glue DataBrew and pre-built visuals in Amazon QuickSight to explore.

Happy Building!


About the Authors

Prithiviraj Jothikumar, PhD, is a Data Scientist with AWS Professional Services, where he helps customers build solutions using machine learning. He enjoys watching movies and sports and spending time to meditate.

Kartik Kannapur is a Data Scientist with AWS Professional Services. He holds a Master’s degree in Applied Mathematics and Statistics from Stony Brook University and focuses on using machine learning to solve customer business problems.

Bala Krishnamoorthy is a Data Scientist with AWS Professional Services, where he helps customers solve problems and run machine learning workloads on AWS. He has worked with customers across diverse industries, including software, finance, and healthcare. In his free time, he enjoys spending time outdoors, running with his dog, beating his family and friends at board games and keeping up with the stock market.

Paritosh Walvekar is a Cloud Application Architect with AWS Professional Services, where he helps customer build cloud native applications. He has a Master’s degree in Computer Science from University at Buffalo. In his free time, he enjoys watching movies and is learning to play the piano.

Building an ad-to-order conversion engine with Amazon Kinesis, AWS Glue, and Amazon QuickSight

Post Syndicated from Gandhi Raketla original https://aws.amazon.com/blogs/big-data/building-an-ad-to-order-conversion-engine-with-aws-glue-amazon-kinesis-data-streams-and-amazon-quicksight/

Businesses in ecommerce have the challenge of measuring their ad-to-order conversion ratio for ads or promotional campaigns displayed on a webpage. Tracking the number of users that clicked on a particular promotional ad and the number of users who actually added items to their cart or placed an order helps measure the ad’s effectiveness. Utilizing promotional ads that have higher conversion rates enables you to effectively utilize limited space on your ecommerce websites and applications.

This post demonstrates how to sessionize and aggregate clickstream and order data, compute the conversion ratio in real time, and generate data visualizations. We use Amazon Kinesis Data Streams to ingest and send data to Amazon Simple Storage Service (Amazon S3), and AWS Glue, Amazon Athena, and Amazon QuickSight to catalog, analyze, and visualize the data, respectively.

Solution overview

To measure ad-to-order conversion, you need two important pieces of data: user clicks and orders. Clickstream data is captured as users navigate through the site, recording each click on the webpage and the metadata associated with those clicks. Depending on the user base and the number of active users at any moment, clickstream data can amount to a large volume generated per second. Typically, every ecommerce system has a centralized order management system that captures orders created from different channels, like a web portal or mobile app. To compute an ad-to-order conversion rate, you join clickstream data and order data over time: (total number of orders / total number of clicks) * 100.
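
To make the join concrete, the sketch below shows the shape such a computation could take as an Athena view submitted with Boto3. The view, table, database, and output-bucket names are illustrative assumptions; the actual views used later in this post come from the saved queries that the CloudFormation stack provides.

import boto3

athena = boto3.client("athena")

CONVERSION_SQL = """
CREATE OR REPLACE VIEW view_conversion_ratio AS
SELECT c.promotion,
       o.order_count,
       c.click_count,
       (CAST(o.order_count AS DOUBLE) / c.click_count) * 100 AS conversion_pct
FROM view_clicks_aggregate c
JOIN view_orders_aggregate o ON c.promotion = o.promotion
"""

athena.start_query_execution(
    QueryString=CONVERSION_SQL,
    QueryExecutionContext={"Database": "ad_to_order_conversion"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)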

The following diagram illustrates the architecture of our solution.

The solution has six main categories.

  • Data generators – Clickstream and order data is generated with the help of AWS Lambda functions. One function is triggered by a scheduled Amazon CloudWatch Events rule every minute and generates random clicks for ingestion into a Kinesis data stream. Similarly, another function triggered by a CloudWatch event generates random orders for ingestion into a second data stream (a sketch of such a generator appears after this list). In a production environment, this data comes from clickstream generators and a centralized order management system.
  • Data ingestion – Kinesis data streams ingest clickstream and order data as they are generated.
  • Data sessionization – Data sessionization helps group related data. For clickstream data, we can group clicks on an ad by different users or time periods. For order data, we can group orders by different ads. We use Amazon Kinesis Data Analytics for SQL to analyze streaming data in real time with standard SQL. Sessionized clickstream and order data is ingested into another in-application stream.
  • Data processing and storage – The sessionization stream from Kinesis Data Analytics for SQL is ingested into an Amazon Kinesis Data Firehose delivery stream, which delivers the data to a pre-configured S3 bucket.
  • Data Catalog – You use AWS Glue to crawl the clickstream and orders data in their respective S3 buckets, as well as build metadata definitions and tables in Athena. AWS Glue crawlers run every hour to update table definitions, and Athena views are built to compute the ad-to-order conversion.
  • Data visualization – You use QuickSight to generate visualizations.
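
The following is a minimal sketch of what the click generator can look like, assuming a stream named ClickStream and the attribute values shown later in this post; the function deployed by the CloudFormation stack may differ in detail.

import json
import random
import time

import boto3

kinesis = boto3.client("kinesis")

def handler(event, context):
    # Emit one random promotional click per invocation; values are illustrative.
    click = {
        "UserId": random.randint(1, 1000),
        "Device": random.choice(["Mobile", "Desktop", "Tablet"]),
        "Event": random.choice(["P601", "P701"]),
        "EventType": "Promotion",
        "Timestamp": int(time.time()),
    }
    kinesis.put_record(
        StreamName="ClickStream",
        Data=json.dumps(click),
        PartitionKey=str(click["UserId"]),
    )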

Prerequisites

Before getting started, you must provision your resources with AWS CloudFormation. 

  1. Choose Launch Stack.
  2. Choose Next.
  3. For Stack name, enter a name for the stack.
  4. For Bucket Name for Clicks, enter the name of the S3 bucket that holds clickstream data (for this post, click-stream).
  5. For Bucket Name for Orders, enter the name of the S3 bucket that holds order data (order-stream).
  6. Enter any tags you wish to assign to the stack.
  7. Choose Next.
  8. Verify that the stack has been created successfully.

If you have never used QuickSight in this account before, sign up for QuickSight before moving on to the next step. Keep in mind that admin access to the Enterprise Edition QuickSight instance is needed to complete setup. 

Generating and ingesting clickstream data

On the Lambda console, view your function ingest-clickstream for ingesting clickstream data. The clickstream data attributes include UserId, Device, Event, EventType, and Timestamp. The event contains promotional ad information on the webpage clicked by the user. This function generates random clickstreams and ingests it into the data stream ClickStream. The following screenshot shows your function details on the console.

A CloudWatch Events rule invokes this function every minute. The following screenshot shows sample data that was ingested into the data stream. The Event column represents the portion of the webpage the user clicked; every click on the webpage has a unique ID and type assigned (for example, P601 has the event type Promotion, C301 has the event type Checkout).

Generating and ingesting order data

On the AWS Lambda console, view your function ingest-order for ingesting order data. This function ingests random orders.

Each order has order lines, which contain the attributes ItemId, Promotion, UnitPrice, and Quantity (see the following screenshot). The promotion attribute indicates the ad the user clicked before adding the item to their shopping cart. This function generates random orders and ingests it into OrderStream. The Promotion attribute joins clickstream data and order data.

Sessionizing the data

To sessionize the data, complete the following steps:

  1. On the Kinesis Data Analytics console, select <Stack Name>-ClickStreamApplication.
  2. Choose Run.
  3. Repeat the same step for <Stack Name>-OrderAnalysisApp.
  4. When the status changes to Running, choose the application name.
  5. Under Real time analytics, choose Go to SQL results.
  6. Choose the Real-time analytics tab.

The application groups clicks in 1-minute intervals. Let's take the ad P701 as an example. If this ad is clicked by multiple users, the SQL code counts all the clicks by different users in the last minute. If five users clicked P701 in the last minute, the function outputs a ClickCount of 5. A stagger window is used because it's well-suited for analyzing groups of data that arrive at inconsistent times.
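
For reference, the sketch below shows the general shape of such a stagger-window aggregation in Kinesis Data Analytics SQL, held here as a Python constant; the stream and column names are assumptions, and the application code deployed by the stack may differ.

# In-application SQL for the clickstream aggregation, as it might appear in
# Kinesis Data Analytics; stream and column names are assumed.
CLICK_AGG_SQL = """
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    "Event"      VARCHAR(8),
    "ClickCount" INTEGER);

CREATE OR REPLACE PUMP "CLICK_PUMP" AS
    INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM "Event", COUNT(*) AS "ClickCount"
    FROM "SOURCE_SQL_STREAM_001"
    WHERE "EventType" = 'Promotion'
    WINDOWED BY STAGGER (
        PARTITION BY "Event" RANGE INTERVAL '1' MINUTE);
"""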

  1. On the Kinesis Data Analytics console, choose OrderAnalysisApp.
  2. Choose Go to SQL results.
    This application groups orders by Promotion, as shown in the following screenshot.

Processing and storing the data

In the data processing and storage stage, aggregated clickstream and order data is delivered to a Kinesis Data Firehose delivery stream. Kinesis Data Firehose delivers clickstream aggregated records and orders to the click-stream and order-stream buckets, respectively. The data is partitioned by year, month, and day. The following screenshot shows the delivery streams on the console.

Analyzing the data

To analyze your data, complete the following steps:

  1. Verify that the S3 bucket was created for clickstream and orders.

The data in the bucket is partitioned by year, month, date, and hour.

  1. On the AWS Glue console, view the clickstream and orders crawlers.

These two crawlers crawl the click-stream and order-stream buckets every 15 minutes and create tables.

  1. To run the crawlers on demand, choose Run crawler.

When the crawler is finished, the Tables added column displays 1.

  1. In the navigation pane, choose Tables.
  2. Verify that the crawlers created the tables.
  3. On the Athena console, choose Saved queries.

You can see three queries have been created.

  1. Select view_clicks_aggregate to load it in the query editor.
  2. Select ad_to_order_conversion and choose Run Query.

If the Amazon S3 bucket name has -, the crawler replaces - with _ while creating the table.

  1. Replace - with _ in the table name when creating the view.
  2. Repeat the same process for view_orders_aggregate and view_conversion_ratio.

Make sure you run view_clicks_aggregate and view_orders_aggregate before running view_conversion_ratio.

  1. Choose view_conversion_ratio and choose Preview.

Orders and clicks for each promotion and the corresponding conversion ratio are displayed.

Visualizing the data

To visualize your data, you first load it into QuickSight. You can then create visualizations. In this section, we also configure a scheduled data refresh.

Loading the data

To visualize your data, you must first load your data into QuickSight.

  1. On the QuickSight console, from the Admin drop-down menu, choose Manage QuickSight.
  2. In the navigation pane, choose Security & Permissions.
  3. Choose Add or remove.
  4. Select Amazon Athena.
  5. Select Amazon S3 to edit QuickSight access to your S3 buckets.
  6. Choose the Details link next to Amazon S3.
  7. Choose Select S3 buckets.
  8. Select the bucket names you provided for clicks and orders.
  9. Choose Finish.
  10. Choose Update.
  11. Choose the QuickSight icon on the top left of the admin panel to proceed back to the home screen.
  12. In the navigation pane, choose Datasets.
  13. Choose New dataset.
  14. Choose Athena.
  15. For Data source name, enter Ad-To-Order-Conversion.
  16. Choose Validate Connection.
  17. After your connection is validated, choose Create data source.
  18. For Database, choose ad-to-order-conversion.
  19. For Tables, select view_conversion_ratio.
  20. Choose Select.
  21. Choose Visualize.

Creating visualizations

In this section, we create two visualizations of our data. We first make a horizontal bar chart.

  1. From the Add menu, choose Add Calculated Field.
  2. Enter Clicks_to_Orders.
  3. Enter the formula sum(orders)/sum(clicks).
  4. Choose Save.
  5. Choose the options menu next to Clicks_to_Orders.
  6. For Show as, choose Percent.
  7. For Visual type, choose Horizontal bar chart.
  8. Drag promotion to Y-axis.
  9. Drag clicks_to_orders to Value.
  10.  Drag date to Group/Color.

The following screenshot shows our visualization.

We now make our second visualization, a vertical bar chart.

  1. Choose the + icon next to Sheet1.
  2. For Visual types, choose Vertical bar chart.
  3. Drag promotion to the X-axis.
  4. Drag clicks and orders to Value.

This graph displays clicks and orders for each promotion.

  1. Choose Insights on the left panel to see a summary of your insights.

Refreshing the data

We can also set up a scheduled refresh for our data.

  1. Choose Manage Data.
  2. Choose view_conversion_ratio.
  3. Choose Schedule refresh.
  4. Choose Create.
  5. For Repeats, choose Hourly.
  6. Choose Create.

You see a confirmation message that the refresh is configured to run one time per hour.

Conclusion

In this post, we showed you how to use AWS analytics and storage services to address business challenges that require handling large volumes of data. Kinesis Data Streams and Kinesis Data Analytics let you ingest large volumes of data and sessionize the data. We also showed you how to analyze and visualize the clickstream and order data using AWS Glue, Athena, and QuickSight.


About the Authors

Gandhi Raketla is a Senior Solutions Architect for AWS. He works with AWS customers and partners on cloud adoption, architecting solutions that help customers foster agility and innovation.

Nick Sack is a DevOps Consultant for AWS Professional Services. He is passionate about working with customers and building automated solutions to help customers on their cloud journeys. When not working, Nick enjoys hiking, playing soccer, reading, and learning about technology.

Amazon QuickSight: 2020 in review

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-2020-in-review/

As 2020 draws to a close, we’ve put together this post to walk you through all that’s changed in Amazon QuickSight in 2020. For your reading convenience, this post is broken up into the following sections:

  • Embedded Analytics at scale
  • Faster insights with Q & Machine Learning (ML)
  • Business Intelligence (BI) with QuickSight
    • Build Rich, Interactive Dashboards
    • Scale your data with SPICE
    • Centralized governance, security, and administration
  • QuickSight learning resources

Embedded Analytics at scale

Embedded analytics has been a key focus area for both independent software vendors (ISVs) and enterprises in the past year. Thousands of customers have chosen to launch embedded dashboards in QuickSight for both their internal and external end-users, including the NFL, Blackboard, Comcast, Panasonic Avionics, EHE Health, and more, with the ability to scale from tens of users to tens of thousands without provisioning or managing any servers. Embedding analytics within existing end-user workflows has meant better engagement in apps and portals, and less fatigue for users because they don't need to browse multiple apps for insights.

Dashboard embedded within the website of a fictional company

In 2020, we launched namespaces in QuickSight, adding native support for multi-tenancy, an important consideration for developers and ISVs looking for in-app analytics. Together with embedded QuickSight authoring (dashboard creation capabilities), namespaces allow developers and ISVs to use QuickSight to provide self-service data exploration and dashboard creation options to specific users of an application or portal. For example, ISVs can empower power users to build their own dashboards and share them with other end-users from the same organization, which provides feature differentiation for the ISV's app as well as new revenue opportunities. We also updated the QuickSight JavaScript SDK to allow you to pass parameters into an embedded dashboard, switch tabs, change languages, and listen to events within an embedded dashboard.
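
A sketch of the namespace APIs is shown below; the account ID, namespace, and user details are placeholders for illustration.

import boto3

qs = boto3.client("quicksight")

# One namespace per tenant keeps each tenant's users and assets isolated.
qs.create_namespace(
    AwsAccountId="123456789012",
    Namespace="tenant-acme",
    IdentityStore="QUICKSIGHT",
)

# Authors registered in this namespace can build and share dashboards
# only with users in the same namespace.
qs.register_user(
    AwsAccountId="123456789012",
    Namespace="tenant-acme",
    IdentityType="QUICKSIGHT",
    UserName="acme-author",
    Email="author@acme.example.com",
    UserRole="AUTHOR",
)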

As developers, ISVs, and enterprises roll out analytics with QuickSight to hundreds of thousands of users, we also saw the need for a pricing model that scales with growth. The QuickSight per-user model for dashboard readers caps charges at $5 per reader per month, a low per-user price point for BI use cases. However, as dashboards are rolled out to hundreds of thousands of users, we find that average session consumption per user is usually low (fewer than 10 sessions a month), but aggregate session usage across users within the account is high.

To ensure scalable pricing for growth, we launched a new session capacity pricing model that allows you to commit to session usage in bulk and benefit from increased per-session discounts with commitments. With session capacity pricing, sessions are simply charged in 30-minute blocks of usage starting with first access. Unlike traditional options for capacity pricing, which require a server with annual commitments, session capacity pricing allows you to get started easily with a $250 per month starter option. After starting with the monthly option, you can move to one of the annual session capacity tiers as your session usage increases, ensuring that costs scale with growing usage. Annual session capacities allow sessions to be consumed across the year, providing flexibility in ramping up production traffic and balancing session needs across busy or lean periods of the year (for example, the first or last week of the month may be busy, while holidays may be slow or busy depending on the nature of the business). For more details on the new pricing options, see Amazon QuickSight Pricing.
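
As a toy illustration of this metering (not an official pricing calculator), each reader visit bills in 30-minute blocks starting at first access:

import math

# Toy illustration of 30-minute session blocks; not an official pricing tool.
def billable_sessions(visit_minutes):
    # Each visit bills in 30-minute blocks, starting at first access.
    return sum(math.ceil(m / 30) for m in visit_minutes)

# Visits of 5, 25, and 40 minutes bill as 1 + 1 + 2 = 4 sessions.
print(billable_sessions([5, 25, 40]))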

With this new model in place, we now also support embedding QuickSight dashboards in scenarios where provisioning and managing end-user reader identities might not be convenient (or even possible). Examples of these situations include embedded dashboards for sharing public statistics within a city or county, dashboards of company-wide statistics available to all users to be embedded on an internal corporate portal, or embedded dashboards with the same data intended for tens of thousands of users per customer, department, or job location or site within an app.

To make it easier to get started with QuickSight embedded analytics, we also have a new embedded analytics developer portal. Learn more about how to embed, view code samples, see capabilities of the JavaScript SDK, and interact with embedded dashboards as well as the authoring experience without having to sign up or code!

Faster insights with Q and ML

While embedding dashboards provides a way to present users with key statistics and insights that are curated ahead of time, end users often want to dig in further and ask detailed questions. While embedded authoring capabilities allow for such detailed exploration, it is often power users or analysts who use such features. For most regular end users, it is either the app developer (for embedded apps) or the BI team (for BI dashboards) that provides answers to such detailed questions. Both app developers and BI teams are often backlogged and take time to respond, making this workflow tedious, time-consuming, and disruptive to both the decision-makers and the data teams. This prompted us to develop Q, the machine learning (ML)-powered natural language query capability in QuickSight. With Q, business users of QuickSight can ask questions in everyday business language and get answers in seconds. For more information on Amazon QuickSight Q, see the product page and our blog post New – Amazon QuickSight Q answers natural-language questions about business data. Q is now available in preview; sign up now.

Separately, we have also continued our efforts to integrate AWS advances in ML into QuickSight to provide easy-to-use ML capabilities for business analysts and end users. The most popular feature of our ML-Insights suite has been natural language narratives in QuickSight, which use natural language generation to provide key insights that you can customize and add to dashboards. We added the ability to add images and links to these insights (and to show them conditionally as needed), allowing you to use them as a versatile component of your dashboards that presents up-to-date insights in simple business language for easy consumption by end users.

We also made Amazon SageMaker integration generally available this year. With this integration, you can perform ML-powered inferences on your datasets with just a few clicks. These include predicting the likelihood of customer churn, scoring leads to prioritize sales activity, assessing credit risk for loan applications, and more. With a native integration, data scientists and data engineering teams no longer have to do the heavy lifting of writing code for ETL jobs, consuming inference APIs, storing output in a queryable source, and finally importing to QuickSight. For more information, see Visualizing Amazon SageMaker machine learning predictions with Amazon QuickSight. For customers who want to understand anomalies in their data better without setting up or managing SageMaker, we continue to enhance anomaly detection in QuickSight. Anomaly detection runs against your data in a database, data warehouse, data lake, or in the QuickSight SPICE in-memory store, and alerts you to anomalous data points. For more information, see Setting Up ML-Powered Anomaly Detection for Outlier Analysis.

Business Intelligence (BI) with QuickSight

With a serverless architecture, QuickSight allows customers to get started with rich interactive BI dashboards in no time. Customers such as Capital One, Best Western, Rio Tinto, and the NFL have taken the stage at re:Invent previously to talk about how a serverless model has allowed them to launch to tens of thousands of users without the typical infrastructure planning and monitoring work. Additionally, the completely web-based dashboard authoring experience in QuickSight means analysts don't need to download any clients to create and publish dashboards. Administrators also don't spend weeks of time and effort on software updates, because QuickSight is fully managed, with updates twice a month. Overall, this translates to an easy setup and rollout of QuickSight as a business intelligence (BI) solution for thousands of users.

Build Rich, Interactive Dashboards

We introduced a significant number of enhancements around dashboards and the dashboard creation experience in 2020. We introduced six new chart types: filled (choropleth) map, histogram, funnel chart, stacked area chart, waterfall chart, and boxplot, bringing the total number of supported charts to 28. QuickSight now also supports reference lines, based either on a value from the chart itself or on an external calculated field, which allow authors to convey additional information to the reader. QuickSight charts now also support custom sorting using a field not present in the visual, enabling custom ordering of fields based on business context. Through the narrative component available as part of ML-Insights, you can also add relevant text, images, and links to dashboards, allowing for rich visual layouts.

Tables and pivot tables are among the most popular visual types among business users, and we’ve introduced a number of enhancements to these, including text wrap for headers, text alignment for content and headers, customization of header and total text, representing null data, and more. You can also now export tables and pivot tables to Excel, in addition to the existing CSV option.

On the filtering front, we've added support for on-sheet filter controls. You can create these controls with a single click, and also scope them to work across multiple datasets used on the dashboard sheet. Available controls include drop-downs, date/time pickers, slider controls, date range controls, and relative date controls, allowing for a breadth of options for dashboards. You can set these controls to cascade so that one filters the other and only shows relevant values to the end user.

We also followed up on filter actions launched last year by adding the ability to cascade these interactive options for users. This means you can choose a state from a visual on a dashboard, which filters the entire dashboard by the selected state, and then drill down further based on the counties presented within the state from another visual. Lastly, we introduced dashboard persistence for readers: filter control and sheet selections are retained for users based on where they last left off.

Authors of dashboards continue to see new calculations available, with support added for:

  • Minimums and maximums for date fields
  • Modulo operations
  • First and last values of metrics and dimensional values
  • Discrete and continuous percentile

Themes for dashboards now also support a selection of fonts, allowing you to pick the font that aligns well with your design language and corporate standards.

Lastly, QuickSight dashboards now support an optimized scaling mode, which ensures that a dashboard is optimized for the most common screen size expected, yet scaled beautifully when on a projector or a larger screen, or emailed to end-users.

We also continue to invest in our mobile apps (iOS and Android), which allow you to access your dashboards on the go, and interact with filters, actions, drill downs, and more.


Scale your data with SPICE

One of the key benefits of choosing QuickSight for your BI needs is QuickSight's SPICE in-memory data store, which provides fast, interactive access to data for both authors and readers, and automatically scales to meet high concurrency needs. Each dataset in SPICE can now be up to 250 million rows (or 500 GB) in size, up from 100 million rows previously. SPICE data can be ingested from supported database or SaaS sources via an API or on a schedule, with email notifications available upon failure.
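
For example, you can trigger and monitor a SPICE refresh programmatically. The following boto3 sketch assumes an existing dataset; the account and dataset IDs are placeholders:

import time
import uuid
import boto3

# Sketch: trigger a SPICE refresh for a dataset and poll its status.
quicksight = boto3.client("quicksight", region_name="us-east-1")
account_id = "123456789012"        # placeholder
dataset_id = "my-spice-dataset"    # placeholder
ingestion_id = str(uuid.uuid4())   # unique ID for this refresh

quicksight.create_ingestion(
    AwsAccountId=account_id, DataSetId=dataset_id, IngestionId=ingestion_id
)

while True:
    status = quicksight.describe_ingestion(
        AwsAccountId=account_id, DataSetId=dataset_id, IngestionId=ingestion_id
    )["Ingestion"]["IngestionStatus"]
    if status in ("COMPLETED", "FAILED", "CANCELLED"):
        print(status)
        break
    time.sleep(10)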

For relational data sources and Athena, QuickSight supports both SPICE and direct query options, giving you the choice depending on your use case and performance needs. Choosing SPICE with summarized sets of data allows you to reduce traffic on your backend data stores while also automatically scaling to peaks in usage (such as at the end of the month) without lifting a finger. At the same time, direct query is useful when data may change frequently, or you need access to historical data residing in a data warehouse for occasional analysis. Dashboards in QuickSight can support a combination of both SPICE and direct query datasets, with cross visual filters and actions allowing fluid interactions.

Authors and data admins in QuickSight can combine data across multiple data sources (e.g., flat file, Amazon Redshift, Snowflake, Amazon RDS) and bring the resultant data into SPICE for dashboarding and analysis. In 2020, we added support for joining datasets using custom SQL statements with tables from another data source, other custom SQL datasets, or flat files. For more information, see Joining Data.

Centralized governance, security, and administration

With many organizations moving operations almost completely remote, governance, security, and administration have been key areas of focus this year. Within QuickSight, all operations are logged to AWS CloudTrail, providing auditability, alerting, and traceability as users make changes and access data or assets in QuickSight. Unlike many BI options in the market, QuickSight doesn't follow a disjoint client/server model, but instead offers centralized control over data, assets, and user capabilities. We launched two key features that build on this: folders and user permissions.

Folders in QuickSight come in two types: personal folders that authors can use to organize content for themselves, and shared folders that admins can create and share with authors and readers (at the user or group level), with the ability to delegate permissions to authors. Shared folders allow strong permissions to be enforced across the organization and help control user access to data and sharing. User permissions customization allows admins to restrict user permissions for actions such as sharing assets, creating folders, or downloading data as CSV. The combination of CloudTrail logging, shared folders, and user-level permissions management allows admins to create strong governance around a QuickSight account.
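
Folders can also be managed programmatically. The following is a minimal boto3 sketch; the folder name, IDs, and the reader group ARN are placeholders:

import boto3

# Sketch: create a shared folder and grant a group permission to view it.
quicksight = boto3.client("quicksight", region_name="us-east-1")

quicksight.create_folder(
    AwsAccountId="123456789012",   # placeholder
    FolderId="sales-reports",
    Name="Sales Reports",
    FolderType="SHARED",
    Permissions=[{
        # Placeholder group ARN; grants read-only visibility into the folder.
        "Principal": "arn:aws:quicksight:us-east-1:123456789012:group/default/sales-readers",
        "Actions": ["quicksight:DescribeFolder"],
    }],
)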

A key element in a governed analytics setup is enforcing security controls and common definitions in datasets used across the organization. We introduced column-level security as a native option in QuickSight, easily configurable from the UI or APIs. This feature complements the existing row-level security feature. Together, they give authors control over who can access what data within their visualizations. QuickSight now also allows data owners to create datasets that provide pre-built calculations, column-level metadata, and other customizations that make it easier for authors consuming the dataset to understand the data and build on it.
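
Programmatically, column-level security is expressed as column-level permission rules on a dataset. Because the update API replaces the full dataset definition rather than patching it, a sketch reads the current definition first; all IDs, ARNs, and column names below are placeholders:

import boto3

# Sketch: restrict two sensitive columns to a single author.
quicksight = boto3.client("quicksight", region_name="us-east-1")
account_id = "123456789012"  # placeholder

# UpdateDataSet requires the full definition, so fetch it first.
ds = quicksight.describe_data_set(
    AwsAccountId=account_id, DataSetId="sales-ds"
)["DataSet"]

quicksight.update_data_set(
    AwsAccountId=account_id,
    DataSetId="sales-ds",
    Name=ds["Name"],
    PhysicalTableMap=ds["PhysicalTableMap"],
    LogicalTableMap=ds["LogicalTableMap"],
    ImportMode=ds["ImportMode"],
    # Only the listed principals can see these columns.
    ColumnLevelPermissionRules=[{
        "Principals": ["arn:aws:quicksight:us-east-1:123456789012:user/default/finance-author"],
        "ColumnNames": ["salary", "ssn"],
    }],
)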

QuickSight is the only BI tool to provide native AWS Identity and Access Management (IAM) permissions-based control over Amazon Simple Storage Service (Amazon S3), Amazon Athena, and other AWS native data sources. QuickSight has AWS PrivateLink support for secure data connectivity to databases in your VPC, data warehouses such as Amazon Redshift or Snowflake, or big data options such as Presto. QuickSight now also supports AWS Lake Formation permissions for data accessed via Athena and has added native connectivity to Amazon Timestream and Amazon Elasticsearch Service (Amazon ES). QuickSight also supports Oracle databases, either Amazon Relational Database Service (Amazon RDS) or running on Amazon Elastic Compute Cloud (Amazon EC2) or on premises.

On the account administration and management front, we launched APIs (including analysis and theme APIs) that enable admins to create a centralized view of assets within the account, and offer customizations that allow ISVs to provide a more integrated experience for embedded analytics users, including a default organization-specific theme for authors.
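
For example, an administrator can build a simple inventory of dashboards, analyses, and themes with the list APIs. A boto3 sketch with a placeholder account ID:

import boto3

# Sketch: inventory account assets by paging through the list APIs.
quicksight = boto3.client("quicksight", region_name="us-east-1")
account_id = "123456789012"  # placeholder

def list_all(api_name, result_key):
    # Page through one QuickSight list API and collect asset names.
    method = getattr(quicksight, api_name)
    kwargs = {"AwsAccountId": account_id}
    names = []
    while True:
        page = method(**kwargs)
        names.extend(item["Name"] for item in page[result_key])
        if "NextToken" not in page:
            return names
        kwargs["NextToken"] = page["NextToken"]

print("dashboards:", list_all("list_dashboards", "DashboardSummaryList"))
print("analyses:", list_all("list_analyses", "AnalysisSummaryList"))
print("themes:", list_all("list_themes", "ThemeSummaryList"))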

Admins will also appreciate QuickSight's support for a variety of authentication options, including Active Directory (using AWS AD Connector acting as a proxy to the on-premises AD) or federated IAM users via an identity provider such as Azure AD, Okta, Ping, or others. We refined the end-user experience for the latter by introducing options for the user to authenticate from the QuickSight login page (SP-initiated login), as opposed to navigating to the identity provider every time, which makes for a more business user-friendly experience. Customers wanting to turn off non-federated users within accounts can also do so by logging a support request.

Finally, we continue to make QuickSight accessible to more users and use cases, expanding to the India (Mumbai) and AWS GovCloud (US-West) Regions and adding support for five new European languages: Danish, Dutch, Finnish, Norwegian, and Swedish. QuickSight is now available in 12 AWS Regions and 15 languages.

QuickSight learning resources

New to QuickSight? Use our 2-month free trial (four authors free for 2 months) to take a quick test drive. You will be asked to sign up for AWS if you don’t have an AWS account. If you already have an account from your company or organization, navigate to the AWS Management Console and choose QuickSight to get started. To learn more about embedding QuickSight, use our embedded developer portal and check out our API documentation.

If you’re an existing QuickSight user, or have used QuickSight in the past, check out our YouTube channel, or dive into one of the following workshops to learn more about your area of interest:

  • Build Advanced Analytics and Dashboards with Amazon QuickSight Part 1, Part 2 – If you’re new to QuickSight or looking to build advanced calculations in QuickSight, this workshop is for you. It provides step-by-step instructions to grow your dashboard building skills from basic to advanced level. You learn how to set up QuickSight in your AWS account, import and prepare data, and build advanced visuals to analyze data in a meaningful way.
  • Embed Amazon QuickSight Dashboards – In this virtual, instructor-led workshop, our Global QuickSight Solutions Architect walks you through setting up a reusable embedding framework. We also include an AWS CloudFormation template to quickly spin up this framework.
  • Administration on QuickSight – This virtual, instructor-led workshop is designed with a real-world ISV use case in mind. The ISV pipeline for data curation, data analysis, and dashboard publishing is addressed with distinct end-user personas. We also discuss how development, test, and production environments can be managed and operationalized by admins in a single or multiple QuickSight accounts.
  • Administration Level-Up – This virtual, instructor-led workshop is designed with a real-world ISV use case in mind. Learning objectives include automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation.
  • Cost Intelligence Dashboard Setup – Existing AWS admins can check out this video of how to use QuickSight to quickly set up dashboards of their AWS Cost and Usage Report data and provide access to business users within the organization.

Looking ahead

To see the full list of 2020 launches, see What’s New in Amazon QuickSight or subscribe to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs. You can also check out the repository of QuickSight blogs in the AWS Big Data Blog.

We have a packed roadmap for 2021 and continue to focus on helping you derive insights from all your data and share them with your users, without having to worry about operations and servers. Thank you for your support.

We wish you all the very best in the New Year!


About the Authors

Jose Kunnackal John is a principal product manager for Amazon QuickSight.


Sahitya Pandiri is a technical program manager with Amazon Web Services.


New – Amazon QuickSight Q Answers Natural-Language Questions About Business Data

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-quicksight-q-to-answer-ad-hoc-business-questions/

We launched Amazon QuickSight as the first Business Intelligence (BI) service with Pay-per-Session pricing. Today, we are happy to announce the preview of Amazon QuickSight Q, a Natural Language Query (NLQ) feature powered by machine learning (ML). With Q, business users can now use QuickSight to ask questions about their data using everyday language and receive accurate answers in seconds.

For example, in response to questions such as, “What is my year-to-date year-over-year sales growth?” or “Which products grew the most year-over-year?” Q automatically parses the questions to understand the intent, retrieves the corresponding data, and returns the answer in the form of a number, chart, or table in QuickSight. Q uses state-of-the-art ML algorithms to understand the relationships across your data and build indexes to provide accurate answers. Also, because Q does not require BI teams to pre-build data models on specific datasets, you can ask questions across all your data.

The Need for Q
Traditionally, BI engineers and analysts create dashboards to make it easier for business users to view and monitor key metrics. When a new business question arises and no answers are found in the data displayed on an existing dashboard, the business user must submit a data request to the BI Team, which is often thinly staffed, and wait several weeks for the question to be answered and added to the dashboard.

A sales manager looking at a dashboard that outlines daily sales trends may want to know what their overall sales were for last week, in comparison to last month, the previous quarter, or the same time last year. They may want to understand how absolute sales compare to growth rates, or how growth rates are broken down by different geographies, product lines, or customer segments to identify new opportunities for growth. This may require a BI team to reconstruct the data, create new data models, and answer additional questions. This process can take from a few days to a few weeks. Such specific data requests increase the workload for BI teams that may be understaffed, increase the time spent waiting for answers, and frustrate business users and executives who need the data to make timely decisions.

How Q Works
To ask a question, you simply type your question into the QuickSight Q search bar. Once you start typing in your question, Q provides autocomplete suggestions with key phrases and business terms to speed up the process. It also automatically performs spell check, and acronym and synonym matching, so you don’t have to worry about typos or remember the exact business terms in the data. Q uses natural language understanding techniques to extract business terms (such as revenue, growth, and allocation) and intent from your questions, retrieves the corresponding data from the source, and returns the answers in the form of numbers and graphs.

Q further learns from user interactions from within the organization to continually improve accuracy. For example, if Q doesn’t understand a phrase in a question, such as what “my product” refers to, Q prompts the user to choose from a drop-down menu of suggested options in the search bar. Q then remembers the phrase for next time, thus improving accuracy with use. If you ask a question about all your data, Q provides an answer using that data. Users are not limited to asking questions that are confined to a pre-defined dashboard and can ask any questions relevant to your business.

Let’s see a demo. Assume a company has a sales dashboard.


The business users of the dashboard can drill down and slice and dice the data simply by typing their questions on the Q search bar above.

Let’s use the Q search bar to ask a question, “Show me last year’s weekly sales in California.” Q generates numbers and a graph within seconds.


You can click “Looks good” or “Not quite right” on the answer. When clicking “Not quite right,” you can submit your feedback to your BI team to help improve Q. You can also investigate the answer further. Let’s add “versus New York” to the end of the question and hit enter. A new answer will pop up.


Next, let’s investigate further in California. Type in “What are the best-selling categories in California?”


With Q, you can easily change the presentation. Let’s see another diagram for the same question.


Next, let’s take a look at the biggest industry, “Finance.” Type in “Show me the sales growth % week over week in the Finance sector” to Q, and specify “Line chart” to check weekly sales revenue growth.

The sales revenue shows growth, but it has peak and off-peak spikes. With these insights, you might now consider how to stabilize for a better profit structure.

Getting Started with Amazon QuickSight Q
A new “Q Topics” link will appear on the left navigation bar. Topics are a collection of one or more datasets and are meant to represent a subject area that users can ask questions about. For example, a marketing team may have Q Topics for “Ad Spending,” “Email Campaign,” “Website Analytics,” and others. Additionally, as an author, you can:

  • Add friendly names, synonyms, and descriptions to datasets and columns to improve Q’s answers.
  • Share the Topic to your users so they can ask questions about the Topic.
  • See questions your users are asking, how Q answered these questions, and improve upon the answer.

Select Topics, and set the Topic name and its Description.


After clicking the Continue button, you can add datasets to a topic in two ways: You can add one or more datasets directly to your topic by selecting Add datasets, or you can import all the datasets in an existing dashboard into your topic by selecting Import dashboard.

The next step is to make your datasets natural-language friendly. Generally, names of datasets and columns are based on technical naming conventions and do not reflect how they are referred to by end users. Q relies heavily on names to match the right dataset and column with the terms used in questions. Therefore, such technical names must be converted to user-friendly names to ensure that they can be mapped correctly. Below are examples:

  • Dataset name: D_CUST_DLY_ORD_DTL → Friendly name: Customer Daily Order Details
  • Column name: pdt_cd → Friendly name: Product Code

Also, you can set up synonyms for each column so users can use the terms they are most comfortable with. For example, some users might type “client” or “segment” instead of “industry.” Q can suggest the right term as users type their query, but BI operators can also set up synonyms for frequently used words. Click “Topics” in the left pane and choose the dashboard where you want to set synonyms.

Then, choose “Datasets.”

Now, we can set a Friendly Name or synonyms as Aliases, such as “client” for “Customer,” or “Segment” for “Industry.”


After adding synonyms, a user can save the changes and start asking questions in the Q search bar.

Amazon QuickSight Q Preview Available Today
Q is available in preview for US East (N. Virginia), US West (Oregon), US East (Ohio), and Europe (Ireland). Getting started with Q is just a few clicks away from QuickSight. You can use Q with AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3, or third-party commercial sources such as SQL Server, Teradata, and Snowflake. Q works with all data sources supported by QuickSight, including business applications such as Salesforce, ServiceNow, and Adobe Analytics, and files such as Excel.

Learn more about Q and get started with the preview today.

– Kame


New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding without user provisioning, and developer portal for embedded analytics

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/new-in-amazon-quicksight-embedding-without-user-provisioning-session-capacity-pricing-and-embedded-developer-portal/

Amazon QuickSight Enterprise edition now offers a new, session capacity-based pricing model starting at $250/month, with annual commitment options that provide scalable pricing for embedded analytics and BI rollouts to hundreds of thousands of users. QuickSight now also supports embedding dashboards in apps, websites, and wikis without the need to provision and manage users (readers) in QuickSight, which utilizes this new pricing model. Lastly, we also have a new developer portal for embedded analytics that allows you to learn more about the different embedded solutions available with QuickSight and experience them first-hand.

Session Capacity Pricing

Amazon QuickSight’s new session capacity-based pricing model provides scalable pricing for large-scale deployments. Session capacity pricing allows developers, independent software vendors (ISVs), and enterprises to benefit from lower per-session rates as they roll out embedded analytics and BI to hundreds of thousands of users. In such scenarios, average session consumption per user is usually low (fewer than 10 sessions per month), but aggregate session usage across users within the account is high. With session capacity pricing, sessions are simply charged in 30-minute blocks of usage starting with first access. Session capacity pricing is also required for embedding without user management, where per-user pricing is not meaningful.

Unlike traditional options for capacity pricing, which require a server with annual commitments, QuickSight’s session capacity pricing allows you to get started easily with a $250/month starter option. QuickSight’s session capacities do not slow down with increased user concurrency or higher analytical complexity of dashboards (both common in a server-based model), but instead automatically scale to ensure a consistently fast end-user experience. After starting with the monthly option, you can move to one of QuickSight’s annual session capacity tiers as your session usage increases, ensuring that costs scale with growing usage. Any usage beyond the committed levels (monthly or annual) is charged at the indicated overage session rates, with no manual intervention needed for scaling: no more scrambling to add servers when you see bursts in usage or simply greater success with your application or website.

Annual session capacities are billed for monthly usage of sessions, with consumption of all committed sessions expected by the end of the period. Annual session capacities allow sessions to be consumed across the year, providing flexibility in ramping up production traffic and balancing session needs across busy or lean periods of the year (for example, the first or last week of the month may be busy, while holidays may be slow or busy depending on the nature of the business). For more details on the new pricing options, visit the QuickSight pricing page.

Embedding without user provisioning

Before this launch, Amazon QuickSight supported embedding dashboards in apps and portals where each end user could be identified and provisioned in QuickSight, and charged using QuickSight’s per-user pricing model. This works well for situations where each end user can be uniquely identified, and often has permissions associated with their data access levels. The NFL chose QuickSight to embed dashboards in their next-gen stats portal and share insights with clubs, broadcasters, and editorial teams. Thousands of customers have since chosen to launch embedded dashboards in QuickSight for both their internal and external end users, including Blackboard, Comcast, Panasonic Avionics, and EHE Health. Customers can scale from tens of users to hundreds of thousands without any server provisioning or management, and also benefit from embedded QuickSight authoring capabilities that enable self-service dashboard creation.

With today’s launch, QuickSight now also enables use cases with dashboards for hundreds of thousands of readers, where it is not possible (or is highly inconvenient) to provision and manage users.

Examples of these situations include embedded dashboards for sharing public statistics within a city or county, dashboards of company-wide statistics available to all users to be embedded on an internal corporate portal, or embedded dashboards with the same data intended for tens of thousands of users per customer, department, or job location or site within an app.

Let’s take a look at a couple of examples and then how to embed these dashboards. First, a dashboard that shows a live stream of the three main stock indices (S&P 500, Dow Jones, and NASDAQ) that uses QuickSight’s newly launched connector to Amazon Timestream to provide a real-time view of the market index in US Central Time.

Second, a dashboard showing industries and count of firms in those industries using a choropleth map at the state level with the ability to drill down to county-level data.

Both dashboards are set up so that they can be accessed without any user restrictions, and we don’t have to set up users to roll this out. These dashboards are accessible to anyone here. To try this out, you can use the AWS Command Line Interface (AWS CLI); note that the AWS CLI is used here simply to illustrate the process. For actual integration into a website or app, you have to use the AWS SDK to obtain an embeddable URL for every new visit to the page.

Four steps to embed a dashboard

  1. Configure an AWS Identity and Access Management (IAM) role for your application to use for embedding
    arn:aws:iam::xxxxxxxxxxxx:role/YourApplicationRole

  2. Attach the following policy to the role so the role can run the GetDashboardEmbedURL API for anonymous identity type:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Action":[
                "quicksight:GetDashboardEmbedUrl",
                "quicksight:GetAnonymousUserEmbedUrl"
             ],
             "Resource":[
                "replace with ARN for dashboard1",
                "replace with ARN for dashboard2",
                "replace with ARN for dashboard3"
             ]
          }
       ]
    }

  3. Run the GetDashboardEmbedURL API with IdentityType set to ANONYMOUS. This returns a response with the EmbedURL:
    INPUT
    aws quicksight get-dashboard-embed-url \
        --region us-west-2 \
        --namespace default \
        --dashboard-id c6e91d20-0909-4836-b87f-e4c115a88b65 \
        --identity-type ANONYMOUS \
        --aws-account-id 123456789012
        
    RESPONSE
    {
        "Status": 200,
        "EmbedUrl": "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
        "RequestId": "1c82321c-6934-45b0-83f4-a8ce2f641067"
    }

You can embed the returned URL in the IFRAME code of your application. Make sure your application is added to the allow list in QuickSight. This is a single-use URL and has to be generated dynamically by invoking get-dashboard-embed-url from the backend compute layer upon each load of the parent page; a boto3 sketch of this backend call follows the next step.

  4. Embed the dashboard in your application with the following HTML code:
    <!DOCTYPE html>
    <html>
    <head>
        <title>Basic Embed</title>
        <!-- You can download the latest QuickSight embedding SDK version from https://www.npmjs.com/package/amazon-quicksight-embedding-sdk -->
        <!-- Or you can do "npm install amazon-quicksight-embedding-sdk", if you use npm for javascript dependencies -->
        <script src="./quicksight-embedding-js-sdk.min.js"></script>
        <script type="text/javascript">
            var dashboard;
    
            function embedDashboard() {
                var containerDiv = document.getElementById("embeddingContainer");
                var options = {
                    // replace this dummy url with the one generated via embedding API
                    url: "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/dashboardId?isauthcode=true&identityprovider=quicksight&code=authcode",  
                    container: containerDiv,
                    scrolling: "no",
                    height: "700px",
     
                    footerPaddingEnabled: true
                };
                dashboard = QuickSightEmbedding.embedDashboard(options);
            }
        </script>
    </head>
    
    <body onload="embedDashboard()">
        <div id="embeddingContainer"></div>
    </body>
    
    </html> 
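
In a real backend, the same call is a single AWS SDK invocation per page load. A boto3 sketch using the sample account and dashboard IDs from the preceding steps:

import boto3

# Sketch: generate a fresh, single-use anonymous embed URL per page load.
quicksight = boto3.client("quicksight", region_name="us-west-2")

response = quicksight.get_dashboard_embed_url(
    AwsAccountId="123456789012",
    DashboardId="c6e91d20-0909-4836-b87f-e4c115a88b65",
    IdentityType="ANONYMOUS",
    Namespace="default",
    SessionLifetimeInMinutes=600,  # maximum allowed lifetime
)
embed_url = response["EmbedUrl"]   # hand this to the embedding SDK in the page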

If you use npm to manage your front-end dependencies, run npm install amazon-quicksight-embedding-sdk. You can then use the following code to embed the URL in your application:

import { embedDashboard } from 'amazon-quicksight-embedding-sdk';

var options = {
    url: "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    container: document.getElementById("embeddingContainer"),
    parameters: {
        country: "United States",
        states: [
            "California",
            "Washington"
        ]
    },
    scrolling: "no",
    height: "700px",
    width: "1000px",
    locale: "en-US",
    footerPaddingEnabled: true
};
const dashboardSession = embedDashboard(options);

If you want to embed multiple dashboards and switch between them, you can pass more dashboard IDs by using the additional-dashboard-ids option while generating the URL. This generates the URL with authorization for all specified dashboard IDs and launches the dashboard specified under the dashboard-id option. See the following code:

INPUT
aws quicksight get-dashboard-embed-url \
    --region us-west-2 \
    --namespace default \
    --dashboard-id c6e91d20-0909-4836-b87f-e4c115a88b65 \
    --identity-type ANONYMOUS \
    --aws-account-id 123456789012 \
    --additional-dashboard-ids dashboardid1 dashboardid2
    
RESPONSE
{
    "Status": 200,
    "EmbedUrl": "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    "RequestId": "1c82321c-6934-45b0-83f4-a8ce2f641067"
}

In the preceding code, we assign our primary dashboard’s ID to the --dashboard-id option, and the other dashboard IDs to the --additional-dashboard-ids option as a space-separated value. You can pass up to 20 dashboard IDs in this option.

The EmbedURL value in the response is the URL for the primary dashboard. You can embed this dashboard in your app, wiki, or website and use the JavaScript SDK to switch between dashboards. To switch to another dashboard without having to generate a fresh embed URL, invoke the navigateToDashboard function (available in our JavaScript library) with any of the dashboard IDs that were initially included in the get-dashboard-embed-url call. See the following example code:

var options = {
    dashboardId: "dashboardid1",
    parameters: {
        country: ["United States"]
    }
};
dashboard.navigateToDashboard(options);

For more information about the JavaScript SDK, see the GitHub repo.

Embedding without users is now generally available to all Amazon QuickSight Enterprise Edition users and requires session capacity pricing. 

To embed dashboards without user provisioning, or simply to enable discounted QuickSight usage for large-scale deployments, you can opt in to session capacity pricing, starting with the $250/month option, via the “Manage QuickSight” > “Your subscriptions” page available to QuickSight administrators.

Embedded Developer Portal

We have a new embedded developer portal at https://developer.quicksight.aws that allows you to quickly interact with three key embedded scenarios: 1) embedded dashboards accessible to anyone visiting a website or portal (no user provisioning required), 2) embedded dashboards accessible only to authenticated users, and 3) embedded dashboard authoring for power users of apps.

The interactive dashboards, code snippets, and setup instructions let you learn more about the rich capabilities of QuickSight embedding and easily get started with embedding QuickSight in your own applications.

Summary

With this new embedding capability, QuickSight offers a modern, serverless approach to deploying dashboards and visualizations into websites, apps, and corporate portals in hours, without any user provisioning or management needed to scale to hundreds of thousands of users. Unlike traditional server-based models, QuickSight’s session capacity model allows you to start at a low $250/month, month-to-month price point. As usage grows, the available annual commitment models offer scalable pricing by reducing the per-session cost, enabling both ISVs and enterprises to roll out embedded QuickSight dashboards at large scale, whether for internal or external users. Finally, with QuickSight’s new developer portal, you have instant access to samples of interactive embedded QuickSight dashboards, as well as steps for integrating various embedded capabilities into your own websites, apps, and portals.


About the Authors

Jose Kunnackal John is Sr. Manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.


Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight, he was with AWS Marketplace and Amazon retail as a PM. Kareem started his career as a developer, and then as a PM for call center technologies, Local Expert, and Ads at Expedia. He worked as a consultant with McKinsey and Company for a short while.


Arun Santhosh is a Specialized World Wide Solution Architect for Amazon QuickSight. Arun started his career at IBM as a developer and progressed on to be an Application Architect. Later, he worked as a Technical Architect at Cognizant. Business Intelligence has been his core focus in these prior roles as well.

Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight

Post Syndicated from James Sun original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-external-tables-in-an-apache-hive-metastore-with-amazon-athena-and-amazon-quicksight/

Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena due to its serverless nature; Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas hosted in an external Hive metastore. In response to customers’ requests, AWS announced Athena’s support for Hive metastore in early 2020. This extends the ability in Athena to query external data stores, with a combined benefit of better performance and lower cost.

In this post, we provide an AWS CloudFormation template that configures a remote Hive metastore based on Amazon Relational Database Service (Amazon RDS) for MySQL, with Amazon EMR located in a private subnet to perform ETL tasks. We then demonstrate how you can use a Spark step to pull COVID-19 datasets from a public repository and transform the data into a performant Parquet columnar storage format. We also walk through the steps to query the data with Athena and visualize it with Amazon QuickSight. QuickSight is a fully managed data visualization service; it lets you easily create and publish interactive dashboards by analyzing data from various data sources, including Athena.

Solution walkthrough

The following diagram shows the architecture for our solution.


As shown in the preceding architecture, we have an Availability Zone within a VPC in an AWS Region. The Availability Zone hosts subnets that are either public or private. A multi-master EMR cluster that has Hive software components installed is launched in a private subnet with egress internet traffic through a NAT gateway to download data from public sites for analysis. The multi-master feature also ensures that the primary nodes are highly available. The Hive metastore is backed by a remote RDS for MySQL instance located in the same private subnet.

We also have an Amazon Simple Storage Service (Amazon S3)-based data lake. A Spark step in Amazon EMR retrieves the data in CSV format, saves it in the provisioned S3 bucket, and transforms the data into Parquet format. The Spark step creates an external Hive table referencing the Parquet data and is ready for Athena to query.

With Athena, you can create a data source connector based on an AWS Lambda function to access the Hive metastore hosted on the EMR cluster by using the Apache Thrift interface.

The connector is registered in Athena as a catalog; when the catalog is referenced in a SQL statement, Athena invokes the Lambda function. The function exits if the connector is inactive for 15 minutes. For queries that run longer than 15 minutes, it’s recommended to let the query complete before retrieving the query results from the Amazon S3 location you can specify.

In Athena, you can compose a SQL statement against the Hive tables with predicates to further limit the size of the query result for faster visualization by QuickSight.
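
You can also submit such a query programmatically. The following boto3 sketch assumes the catalog name created later in this post; the results bucket is a placeholder:

import boto3

# Sketch: run an Athena query against the Hive metastore catalog.
athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT * FROM covid_confirmed_cases WHERE country = 'US' LIMIT 10",
    QueryExecutionContext={"Catalog": "demo_hive_metastore", "Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-results-bucket/athena/"},
)
print(response["QueryExecutionId"])  # poll get_query_execution for completion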

Deploying the resources with AWS CloudFormation

To demonstrate our solution, we provide a CloudFormation template that you can download to easily deploy the necessary AWS resources. The template creates the following resources to simulate the environment:

  • VPC and subnets – A VPC with one public and one private subnet. A NAT gateway is also created to allow outbound internet access from the EMR cluster to download public COVID-19 datasets from the Johns Hopkins GitHub repo.
  • EMR cluster – A multi-master EMR cluster with Hive, running on three primary nodes (m5.xlarge) and two core nodes (m5.xlarge), is launched to support the Thrift connection required by the Athena Lambda connectors.
  • Amazon RDS for MySQL database – An RDS for MySQL primary instance is launched in the same subnet as the EMR cluster. The RDS instance serves as the Hive metastore backend data store.
  • S3 bucket – An S3 bucket stores files in Parquet format by Amazon EMR and is accessed later by Athena.
  • AWS IAM users – Two AWS Identity and Access Management (IAM) users belonging to different user groups. The first user, the data engineer, has permissions to access the Lambda-based Athena data source connector. The other user, the salesperson, does not have permissions to access the connector.

To get started, you need to have an AWS account. If you don’t have one, go to aws.amazon.com to sign up for one. Then complete the following steps:

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template:

This template has been tested in the US East (N. Virginia) Region.

  3. Choose Next.

You’re prompted to enter a few launch parameters.

  4. For Stack name, enter a name for the stack (for example, athena-hms).
  5. For Hive Metastore Host Number, choose the Amazon EMR primary node to connect to (or use the default value).
  6. Continue to choose Next and leave other parameters at their default.
  7. On the review page, select the three check boxes to confirm that AWS CloudFormation might create resources.
  8. Choose Create stack.

The stack takes 15–20 minutes to complete.

  9. On the Outputs tab of the stack details, save the key-value pairs to use later.

When the EMR cluster is provisioned, it uses a bootstrap action to install the necessary Python libraries. It runs an Amazon EMR Spark step (a PySpark script) that downloads three COVID-19 datasets for confirmed, recovered, and death cases from the Johns Hopkins GitHub repo in CSV format and stores them in the csv subfolder of the S3 bucket created by the CloudFormation stack. Lastly, the final transformed data is converted to Parquet format and external Hive tables are created referencing the Parquet data located in the parquet subfolder of the S3 bucket.

The source code for the bootstrap and Spark step actions is available for your reference; a condensed sketch of the Spark step follows.
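
As an illustration only, here is a condensed sketch of what the Spark step does. The bucket name, dataset URL, and table schema are placeholders; the actual scripts provisioned by the stack are more complete:

# spark_step_sketch.py - condensed illustration of the EMR Spark step.
import urllib.request
import boto3
from pyspark.sql import SparkSession

bucket = "your-provisioned-bucket"  # placeholder: use the stack's S3 bucket
url = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")

# Download the raw CSV and stage it in the csv subfolder of the S3 bucket.
urllib.request.urlretrieve(url, "/tmp/confirmed.csv")
boto3.client("s3").upload_file("/tmp/confirmed.csv", bucket, "covid_data/csv/confirmed.csv")

spark = SparkSession.builder.appName("covid-etl").enableHiveSupport().getOrCreate()

# Transform the CSV to Parquet in the parquet subfolder.
df = spark.read.option("header", True).csv(f"s3://{bucket}/covid_data/csv/confirmed.csv")
df.write.mode("overwrite").parquet(f"s3://{bucket}/covid_data/parquet/confirmed/")

# Register an external Hive table over the Parquet data (illustrative schema).
spark.sql(f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS default.covid_confirmed_cases (
        country STRING, state STRING, lat DOUBLE, lon DOUBLE,
        `date` STRING, confirmed BIGINT)
    STORED AS PARQUET
    LOCATION 's3://{bucket}/covid_data/parquet/confirmed/'
""")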

To validate the data, on the Amazon S3 console, choose the bucket name from the CloudFormation template outputs. You should see a covid_data folder in the bucket. The folder contains the two subfolders, csv and parquet, which store the raw CSV and transformed Parquet data, respectively.

Querying the data in Athena

The CloudFormation template creates two users belonging to two different AWS IAM groups. The de_user_us-east-1_athena-hms user is a data engineer with permissions to access the Lambda function to communicate with the Hive metastore using the Athena data source connector. They belong to the group athena-hms-DataEngineerGroup-xxxxxxx. The sales_user_us-east-1_athena-hms user is a salesperson with no access to the connector. They belong to the group athena-hms-SalesGroup-xxxxx. 

To query the data, first retrieve your secret values in AWS Secrets Manager:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose DataEngineerUserCreds.
  3. Choose Retrieve secret value.
  4. Save the username and password.
  5. Repeat these steps for SalesUserCreds.

With the data in place, we can now use Athena to query it.

Accessing data as the data engineer

To query the data as the de_user_us-east-1_athena-hms user, complete the following steps:

  1. Sign in to the Athena console as the de_user_us-east-1_athena-hms user with the credentials retrieved from Secrets Manager.

After logging in, we need to create a data source for Hive metastore.

  2. On the navigation bar, choose Data sources.
  3. Choose Connect data source.

  4. For Choose where your data is located, select Query data in Amazon S3.
  5. For Choose a metadata catalog, select Apache Hive metastore.
  6. Choose Next.

  7. For Lambda function, choose the function the CloudFormation template created with the key HiveMetastoreFunctionName.
  8. For Catalog name, enter a name for the Athena catalog (for example, demo_hive_metastore).
  9. Choose Connect.

You should now have a catalog named demo_hive_metastore with the catalog type Hive metastore.

  10. On the navigation bar, choose Query editor.
  11. Enter the following SQL statement:
    SELECT *
    FROM demo_hive_metastore.default.covid_confirmed_cases
    WHERE country = 'US'
            OR country = 'Italy'
            OR country = 'India'
            OR country = 'Brazil'
            OR country = 'Spain'
            OR country = 'United Kingdom'

This SQL statement selects all the columns in the covid_confirmed_cases table with predicates to only include a few countries of interest. We use a table name with the pattern <catalog name>.<hive database name>.<hive table name in the database>, which for this post translates to demo_hive_metastore.default.covid_confirmed_cases.

  12. Choose Run query.

The following screenshot shows your query results.

Make sure you completely sign out of the console before moving on to the next steps.

Accessing data as the salesperson

Sign in to the console as sales_user_us-east-1_athena-hms user. Because the salesperson user doesn’t have the appropriate IAM policies to access the Hive metastore connection, you can’t see the tables.


Permissions policies

The data engineer has additional policies attached to their IAM group in addition to the managed AmazonAthenaFullAccess policy: the <stack-name>-DataBucketReadAccessPolicy-xxxxx and <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policies created by the CloudFormation template. Therefore, the data engineer can view the tables, but the salesperson can’t.

These policies are available on the IAM console, on the Permissions tab for the group <stack-name>-DataEngineerGroup-xxxxx.

The following sample JSON code is the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx policy to allow access to the provisioned S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::<provisioned bucket name>",
                "arn:aws:s3:::<provisioned bucket name>/",
                "arn:aws:s3:::<provisioned bucket name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The following sample JSON code is the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy to allow access to the Lambda Hive metastore function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "lambda:GetFunction",
                "lambda:GetLayerVersion",
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:<account id>:function:athena-hms-HiveMetastoreFunction"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
               "arn:aws:s3:::<provisioned bucket name>/hms_spill"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Next, we walk through using QuickSight to visualize the results. Make sure you completely sign out of the console as the salesperson user before proceeding to the next steps. 

Signing up for QuickSight

You can skip this section if you have already signed up for QuickSight previously.

  1. Sign in to the console as the IAM power user who deployed the CloudFormation template or any user with enough IAM privileges to set up QuickSight.
  2. On the QuickSight console, choose Sign up for QuickSight.
  3. Select either the Standard or Enterprise edition for QuickSight.
  4. Choose Continue.

  5. For QuickSight account name, enter your AWS account ID.
  6. For Notification email address, enter your email.
  7. Select Amazon S3.

  8. Select the provisioned S3 bucket to grant QuickSight permission to access.
  9. Choose Finish.

Your QuickSight account should now be set up.

Attaching the Hive metastore access policy

Before you can use QuickSight, you have to attach the Hive metastore access policy to the QuickSight service role.

  1. On the IAM console, search for the service role aws-quicksight-service-role-v0.
  2. Choose the role.

  3. Search for the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy.
  4. Select the policy and attach it to the QuickSight service role (a scripted alternative follows these steps).
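
You can attach the policy from code as well; in this boto3 sketch, the policy ARN stands in for the stack-generated name:

import boto3

# Sketch: attach the Hive metastore access policy to the QuickSight role.
iam = boto3.client("iam")

iam.attach_role_policy(
    RoleName="aws-quicksight-service-role-v0",
    # Placeholder ARN: substitute the policy name generated by your stack.
    PolicyArn="arn:aws:iam::123456789012:policy/athena-hms-HiveMetastoreDataSourceAccessPolicy-xxxxx",
)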

Creating your data source and performing data conversions

Before we can create visualizations, we need to set up our data source.

  1. Download the SQL script covid-19.sql.
  2. On the QuickSight console, choose Datasets in the navigation pane.

  3. Choose New dataset.

  4. Choose Athena.

  5. In the pop-up window, for Data source name, enter demo_hive_metastore.
  6. Choose Validate.
  7. When the connection is validated, choose Create data source.

  8. In the next window, choose Use custom SQL.

  9. Enter the content of the covid-19.sql script in the query window.
  10. Choose Confirm query.

  11. Leave Import to SPICE for quicker analytics selected.
  12. Choose Visualize.

Now we perform a few data type conversions before visualizing the data.

  1. Choose the Edit icon next to Data set on the menu bar.

  2. Choose the … icon.
  3. Choose Edit.

  4. Expand date and choose Change data type.
  5. Choose Date.

  6. Enter yyyy-MM-dd to convert the date format.
  7. Choose Update.

Now we create a coordinate using the latitude and longitude values.

  8. Expand lat and choose Add to coordinates.

  9. Leave Create new geospatial coordinates selected.
  10. Choose Add.

  11. In the pop-up window, for Name your coordinates, enter coordinate.
  12. For Field to use for longitude, choose lon.
  13. Choose Create coordinates.

  14. Choose Save and visualize on the menu bar.

Creating visualizations in QuickSight

Now we can visualize our data. For this post, we create a map visualization.

  1. For Visual types, choose the map visual.
  2. For Geospatial, drag coordinates.
  3. For Size, drag confirmed.
  4. For Color, drag country.

This world map shows the accumulated confirmed cases for selected countries over time; you need to use a filter to look at confirmed cases on a specific date.

  1. In the navigation pane, choose Filter.
  2. Choose the + icon.
  3. For Filter type, choose Time range.
  4. Choose start and end dates, such as 2020-09-10 00:00 and 2020-09-11 00:00, respectively.
  5. Choose Apply.

This plots the confirmed cases on September 10, 2020, for these countries.

Similarly, you can choose other visual types, such as a line chart, and generate the mortality rate for selected countries over time.

Using highly available primary nodes of the Amazon EMR cluster

The EMR cluster has a multi-master configuration with three primary nodes running to meet high availability requirements. At any time, the Lambda function communicates with one of these three EMR primary nodes. In the rare event that this node goes down, you can quickly re-establish the Athena data source connector to the external Hive metastore by failing over to another active primary node.

To perform this failover, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose athena-hms.
  3. Choose Update.
  4. Select Use current template.
  5. Choose Next.
  6. For Hive Metastore Host Number, choose a host other than the current one you’re using.
  7. Choose Next.
  8. Acknowledge that AWS CloudFormation might create IAM resources.
  9. Choose Update stack.

In less than a minute, you should be able to access the Hive metastore and continue to query on the Athena console.
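
Alternatively, you can perform the same failover from the AWS CLI. The following is a minimal sketch, assuming the template exposes a parameter key named HiveMetastoreHostNumber (check your template for the actual key):

aws cloudformation update-stack \
    --stack-name athena-hms \
    --use-previous-template \
    --parameters ParameterKey=HiveMetastoreHostNumber,ParameterValue=2 \
    --capabilities CAPABILITY_IAM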

Cleaning up

You may want to clean up the demo environment when you’re done. To do so, on the AWS CloudFormation console, select the stack and choose Delete.

This action also deletes the S3 bucket and any data in it. If you want to retain the data for future use, you should make a copy of the bucket before you delete it.

Summary

In this post, we walked through a solution using Athena to query external Hive tables with public COVID-19 datasets hosted in an S3 bucket and visualizing the data with QuickSight. We provided a CloudFormation template to automate the deployment of necessary AWS services for the demo. We encourage you to use these managed and scalable services for your specific use cases in production.


About the Authors

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.


Chinmayi Narasimhadevara is a Solutions Architect with Amazon Web Services. Chinmayi has over 15 years of experience in information technology and has worked with organizations ranging from large enterprises to mid-sized startups. She helps AWS customers leverage the correct mix of AWS services to achieve success for their business goals.


Gagan Brahmi is a Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Applying row-level and column-level security on Amazon QuickSight dashboards

Post Syndicated from Niyati Upadhyay original https://aws.amazon.com/blogs/big-data/applying-row-level-and-column-level-security-on-amazon-quicksight-dashboards/

Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources. On a single data dashboard, QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more.

QuickSight users in your organization should have access to only certain data for compliance and security reasons. Without proper integrated features to enforce row-level and column-level security, you have to develop additional solutions such as views, data masking, or encryption, or try to integrate third-party solutions around your data to enforce security.

In this post, we walk through how to implement row-level and column-level security in QuickSight.

Overview of solution

You can use row-level and column-level security to share a dashboard with several users who can only view the data that they have permissions to. For example, if you have an internal team of sales agents who contact a list of potential customers, you need a single dashboard that is accessible to all the agents but displays only the list of prospects assigned to them.

Another example could be regarding financial data. You might want to restrict access to sensitive fields to a small set of users. You could further restrict access based on some other field, such as geography. Let’s assume we have three users: A, B and C. Users A and B are admin users and have access to financial data, but user C does not. Furthermore, users A and B are admins for different countries and have access to financial data for their respective countries only.

In this post, we set up row-level and column-level security permissions in our dataset for different sets of users and experiment with grant and deny access rules. We walk you through the following steps:

  1. Upload the dataset and apply the permissions to it.
  2. Create appropriate filters and reports for users.
  3. Log in as different users and see how the dashboard’s data has been tailored just to what they are allowed to see.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • A QuickSight Enterprise edition subscription, because row-level and column-level security are Enterprise edition features

In the Enterprise edition of QuickSight, you can restrict access to a dataset by configuring row-level and column-level security. Only the people you share the dataset with can see the data.

You can also send dashboard snapshots as emails to groups of readers, with each reader receiving custom reports as per the security configurations set on the dashboard.

Granting row-level access

For our use case, a business analyst in the marketing team creates a dataset of monthly sales in QuickSight using the file month_b2bsalesdata.xlsx. The following screenshot shows the dataset details. We use the location field to restrict user access to the data.

To grant row-level access, complete the following steps:

  1. Upload the permission file and define appropriate user access.

In the permission file, the location fields are the same as in our primary month_b2bsalesdata dataset. It doesn’t matter what order the fields are in, but all the field names and values are case-sensitive and must exactly match those in the dataset. You must have at least one field that identifies either users or groups. The field you use for users or groups can have any name you choose.

If you’re specifying groups, use only QuickSight groups or Microsoft AD groups.

The following screenshot shows the dataset rules (permissions) for row-level security. Here we have defined access for users to different states and cities.
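
For reference, a minimal permissions file consistent with the access rules used later in this post might look like the following (the UserName field name is an assumption; blank values grant access to all values of that field):

UserName,state,city
AlejandroRosalez,California,Los Angeles
AlejandroRosalez,Texas,Fort Worth
SaanviSarkar,Texas,
MarthaRivera,,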

  2. On the Datasets tab, choose the dataset you want to enable row-level security for.
  3. Choose Row-level security for this dataset.
  4. For Selected data set rules, select the appropriate permission file (for this post, permissions.csv).
  5. For Permission Policy, select Grant access to data set.

By default, access is denied for all users, and you use the permissions file to allow access to the data. For example, if a user has no row in the permissions file, they don’t see any data in the report. If a user’s row contains blanks (empty strings, “”), they see everything.

  6. Choose Apply data set.

Testing row-level restrictions

You can now test the restrictions by logging in as various users.

  1. Log in as the user AlejandroRosalez.

This user has access to data where the state is California or Texas, and the city is Los Angeles or Fort Worth. The following screenshot shows the data visible to AlejandroRosalez.

  2. Log in using SaanviSarkar.

The following screenshot shows the data visible to SaanviSarkar. They can see data for any city, but only if the state is Texas.

  3. Log in using MarthaRivera.

The following screenshot shows the data visible to MarthaRivera. Martha can see the data for any city or state.

  4. Log in using the workshop user, which isn’t present in the permissions.csv file.

The following screenshot shows that no data is visible to the workshop user.

Granting column-level access

We can also grant access at the column level to specific users and groups.

  1. On the QuickSight console, on the Datasets page, choose the dataset you want to enable column-level security for.
  2. Choose Column-level security.

By default, all users and groups have access to the data. Column-level security allows you to manage access to specific columns in your dataset.

You can restrict column access to particular users and groups. In the following screenshot, I’m restricting access for the sales column only to PauloSantos and ZhangWei.

Testing column-level access

To test these additional restrictions, log in as the ZhangWei user.

The following screenshot shows the data visible to ZhangWei. They can see all the columns.

If you log in as the workshop user, Sales is marked with the Restricted symbol in the Fields list pane. This user doesn’t have access to view or use the Sales field in any visual within the analysis.

Pivot tables and regular tables

If the column is in the Values field, you can see Not Authorised in pivot tables and regular tables.

In the following screenshots, you can see Sales is in the Values field.

If the column is in the Group by, Row, or Column field, you get an error message saying You can’t access this visual because this field is restricted: Sales.

In the following screenshots, you can see Sales is in the Column and Group by fields.

Cleaning up

To avoid incurring future charges, delete the QuickSight users and Enterprise account.

Conclusion

This post showed how QuickSight allows you to set up row-level and column-level security at the user and group level. QuickSight also supports emailing dashboards with row-level security enabled. You can send dashboard snapshots as emails to groups of readers, and each reader receives custom reports as per the security configurations set on the dataset. For more information, see Sending Reports by Email.

You can try this solution for your own use cases. If you have comments or feedback, please leave them in the comments.


About the Author

Niyati Upadhyay is a Solutions Architect at AWS. She joined AWS in 2019 and specializes in building and supporting Big Data solutions that help customers analyze and get value out of their data.

Event-driven refresh of SPICE datasets in Amazon QuickSight

Post Syndicated from Dylan Qu original https://aws.amazon.com/blogs/big-data/event-driven-refresh-of-spice-datasets-in-amazon-quicksight/

Businesses are increasingly harnessing data to improve their business outcomes. To enable this transformation to a data-driven business, customers are bringing together data from structured and unstructured sources into a data lake. Then they use business intelligence (BI) tools, such as Amazon QuickSight, to unlock insights from this data.

To provide fast access to datasets, QuickSight provides a fully managed calculation engine called SPICE—the Super-fast, Parallel, In-Memory Calculation Engine. At the time of writing, SPICE enables you to cache up to 250 million rows or 500 GB of data per dataset.

To extract value from the data quickly, you need access to new data as soon as it’s available. In this post, we describe how to achieve this by refreshing SPICE datasets as part of your extract, transform, and load (ETL) pipelines.

Solution architecture

In this post, you automate the refresh of SPICE datasets by implementing the following architecture.

This architecture consists of two parts: an example ETL job and a decoupled event-driven process to refresh SPICE.

For the ETL job, you use Amazon Simple Storage Service (Amazon S3) as your primary data store. Data lands in an S3 bucket, which we refer to as the raw zone. An Amazon S3 trigger configured on this bucket triggers an AWS Lambda function, which starts an AWS Glue ETL job. This job processes the raw data and outputs processed data into another S3 bucket, which we refer to as the processed zone.

This sample ETL job converts the data to Apache Parquet format and stores it in the processed S3 bucket. You can modify the ETL job to achieve other objectives, like more granular partitioning, compression, or enriching of the data. The Glue Data Catalog stores the metadata and QuickSight datasets are created using Amazon Athena data sources.

To trigger the SPICE dataset refresh, after the ETL job finishes, an Amazon EventBridge rule triggers a Lambda function that initiates the refresh.

In summary, this pipeline transforms your data and updates QuickSight SPICE datasets upon completion.

Deploying the automated data pipeline using AWS CloudFormation

Before deploying the AWS CloudFormation template, make sure you have signed up for QuickSight in one of the 11 supported Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • EU (Frankfurt)
  • EU (Ireland)
  • EU (London)

This post works with both Standard and Enterprise editions of QuickSight. Enterprise Edition provides richer features and higher limits compared to Standard Edition.

  1. After you sign up for QuickSight, you can use CloudFormation templates to create all the necessary resources by choosing Launch stack:
  2. Enter a stack name; for example, SpiceRefreshBlog.
  3. Acknowledge the AWS Identity and Access Management (IAM) resource creation.
  4. Choose Create stack.

The CloudFormation template creates the following resources in your AWS account:

  • Three S3 buckets to store the following:
    • AWS Glue ETL job script
    • Raw data
    • Processed data
  • Three Lambda functions to do the following:
    • Create the ETL job
    • Initiate the ETL job upon upload of new data in the raw zone
    • Initiate the SPICE dataset refresh when the ETL job is complete
  • An AWS Glue database
  • Two AWS Glue tables to store the following:
    • Raw data
    • Processed data
  • An ETL job to convert the raw data from CSV into Apache Parquet format
  • Four IAM roles: One each for the Lambda functions and one for the ETL job
  • An EventBridge rule that triggers on an AWS Glue job state change event with a state of Succeeded and invokes a Lambda function that performs the SPICE dataset refresh

Importing the dataset

For this post, you use the trip record data made publicly available by the NYC Taxi & Limousine Commission (TLC). You upload monthly data in CSV format to the raw zone S3 bucket.

This data is available in Amazon S3 through Open Data on AWS, a service designed to let you spend more time on data analysis rather than data acquisition.

You start by copying the For Hire Vehicle (FHV) data for March 2020. Because the data is already available in Amazon S3 through Open Data, run the following command to copy the data into the raw zone. Make sure you replace <raw bucket name> with the name of the raw bucket created by the CloudFormation template:

aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_2020-03.csv" s3://<raw bucket name>

After you copy the data into the raw zone, the Amazon S3 event trigger invokes the Lambda function that triggers the ETL job. You can see the job status on the AWS Glue console by choosing Jobs in the navigation pane. The process takes about 2 minutes.

When the job is complete, check that you can see the Parquet files in the processed zone S3 bucket.

Creating a QuickSight analysis of the data

To visualize the taxi data, we create a QuickSight analysis.

First, you need to give QuickSight the necessary permissions to access the processed zone S3 bucket. For instructions, see I Can’t Connect to Amazon S3.

Then complete the following steps to create an analysis of the taxi data:

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset.
  3. Choose Athena and provide a name for the data source (such as Athena).
  4. Choose Create data source.
  5. For Database, choose the name of the taxi AWS Glue database (starting with taxigluedatabase).
  6. For Tables, select processed_taxi_data as the table to visualize.
  7. Choose Select.
  8. Ensure Import to SPICE for quicker analytics is selected and choose Visualize.

After the data is imported into SPICE, you can create visuals to display the data. For example, the following screenshot shows a key performance indicator (KPI) of the number of taxi journeys aggregated at the month level and the number of journeys over time.

We use this dashboard to visualize the dataset again after we refresh SPICE with more data.

Automating the SPICE refresh

To refresh the SPICE dataset when the ETL job is complete, the CloudFormation template we deployed created an EventBridge rule that triggers a Lambda function each time an AWS Glue ETL job successfully completes. The following screenshot shows the code for the event pattern.
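
For reference, the event pattern for such a rule takes the following form, matching AWS Glue job state change events with a state of SUCCEEDED:

{
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {
        "state": ["SUCCEEDED"]
    }
}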

We need to configure the Lambda function with the ETL job name and the ID of the SPICE dataset we created in QuickSight.

  1. Locate the ETL job name on the AWS Glue console, named TaxiTransformationGlueJob-<unique id>.
  2. To find the SPICE dataset ID, run the following command using the AWS Command Line Interface (AWS CLI):
    aws quicksight list-data-sets --aws-account-id <your AWS account id> 

    The following screenshot shows the output with the dataset ID.

  3. On the Lambda console, open the Lambda function named SpiceRefreshBlog-QuicksightUpdateLambda-<unique id>.
  4. Update line 9 of the code to replace ReplaceWithGlueJobName with the AWS Glue job name and ReplaceWithYourDatasetID with the dataset ID.

Once a Glue job succeeds, this Lambda function is triggered. The EventBridge event that triggers the Lambda contains the name of the job. You can access this from the event as follows, as seen on line 25 of the function:

succeededJob = event['detail']['jobName']

The Lambda function looks up the job name in the data_set_map dictionary. If the dictionary contains the job name, the dataset ID is accessed and the function calls the QuickSight Create Ingestion API to refresh the SPICE datasets.

You can extend the data_set_map dictionary to include additional job names and associated SPICE dataset IDs to be refreshed. If using this approach at scale, you might choose to move this configuration information to an Amazon DynamoDB table.
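
As an illustration, the following is a minimal Python sketch of such a function; the job name and dataset ID in data_set_map are placeholders, and the deployed function differs in its details:

import time
import boto3

# Hypothetical mapping of AWS Glue job names to QuickSight SPICE dataset IDs;
# replace with your own job names and dataset IDs.
data_set_map = {
    "TaxiTransformationGlueJob-example": "11111111-2222-3333-4444-555555555555",
}

quicksight = boto3.client("quicksight")

def handler(event, context):
    # The EventBridge event carries the name of the Glue job that succeeded
    succeeded_job = event["detail"]["jobName"]
    data_set_id = data_set_map.get(succeeded_job)
    if data_set_id is None:
        return  # no SPICE dataset registered for this job
    # CreateIngestion starts a SPICE refresh; the ingestion ID only needs to
    # be unique per dataset, so a timestamp suffices for this sketch.
    quicksight.create_ingestion(
        AwsAccountId=event["account"],
        DataSetId=data_set_id,
        IngestionId="refresh-{}".format(int(time.time())),
    )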

  5. Save the Lambda function by choosing Deploy.

Testing the automated refresh

Now that you have configured the Lambda function, we can test the ETL end-to-end process and make the next month’s data available for analysis.

To add the FHV data for April, run the following AWS CLI command:

aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_2020-04.csv" s3://<raw bucket name>

As before, this upload to the raw zone triggers the Lambda function that starts the ETL job. You can see the progress of the job on the AWS Glue console.

When the job is complete, navigate to QuickSight and open the taxi analysis (or, if you still have it open, refresh the window).

You can now see that both months’ data is available for analysis. This step might take 1–2 minutes to load.

To see the status of each SPICE refresh, navigate back to the dataset on the QuickSight console and choose View History.

The following screenshot shows the status of previous refreshes and the number of rows that have been ingested into SPICE.

Now that you have tested the end-to-end process, you can try copying more FHV data to the raw zone and see the data within your QuickSight analysis.

Cleaning up

To clean up the resources you created by following along with this post, complete the following steps:

  1. Delete the QuickSight analysis you created.
  2. Delete the QuickSight dataset that you created.
  3. Delete the QuickSight data source:
    1. Choose New dataset.
    2. Select the data source and choose Delete data source.
  4. On the Amazon S3 console, delete the contents of the raw and processed S3 buckets.
  5. On the AWS CloudFormation console, select the stack SpiceRefreshBlog and choose Delete.

Conclusion

Using an event-based architecture to automate the refresh of your SPICE datasets makes sure that your business analysts are always viewing the latest available data. This reduction in time to analysis can help your business unlock insights quicker without having to wait for a manual or scheduled process. Additionally, by only refreshing SPICE when new data is available, the underlying data storage resources are used efficiently, so you only pay for what you need!

Get started with QuickSight today!


About the Authors

Rob Craig is a Senior Solutions Architect with AWS. He supports customers in the UK with their cloud journey, providing them with architectural advice and guidance to help them achieve their business outcomes.


Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

Using administrative dashboards for a centralized view of Amazon QuickSight objects

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/using-administrative-dashboards-for-a-centralized-view-of-amazon-quicksight-objects/

“Security is job 0” is the primary maxim of all endeavors undertaken at AWS. Amazon QuickSight, the fast-growing, cloud-native business intelligence (BI) platform from AWS, allows security controls in a variety of means, including web browsers and API calls. These controls apply to various functions, such as user management, authorization, authentication, and data governance.

This post demonstrates how to build a workflow to enable a centralized visualization of QuickSight groups and user information, as well as QuickSight objects access permission auditing information. Combined with AWS CloudTrail logs, the solution enables your security team to detect any abnormal behavior in near-real time to ensure security compliance.

Benefits of a centralized dashboard

A group in QuickSight consists of a set of users. Using groups makes it easy to manage access and security. For example, you can configure three groups, called Marketing, HR, and BI Developer, and each has specific access privileges:

  • The users in the Marketing group can only view the dashboards with marketing data
  • The users in the HR group can only view the human resources data
  • The users in the BI Developer group can edit all objects, including data sources, datasets, and dashboards

After the users and groups are configured, BI administrators can check and edit the object access permission by choosing Share for dashboards, datasets, and all other objects. The following screenshot shows the Manage dashboard sharing page on the QuickSight console.

As of this writing, individual object permission information is available on the QuickSight console, and user information is provided on the user management view on the QuickSight console. Our solution integrates QuickSight APIs with other AWS services to create an administrative dashboard that provides a centralized view of essential security information. This dashboard covers not only user lists and individual object access permission information available on the current platform, but also additional security information like group lists, user-group mapping information, and overall objects access permissions. This dashboard allows you to acquire unique security insights with its collection of comprehensive security information.

This post provides a detailed workflow that covers the data pipeline, sample Python codes, the AWS CloudFormation template, and a sample administrative dashboard. With the guidance of this post, you can configure a centralized information center in your own environment.

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. A new user creation event in the CloudTrail log triggers the Amazon CloudWatch Events rule CreateUser.
  2. The CreateUser rule triggers the AWS Lambda function User_Initiation. This function checks if the new user belongs to an existing group (for this post, we assume that their AWS Identity and Access Management (IAM) role equates to the group they should belong to in QuickSight). If such a group exists, the function adds the user into the group (CreateGroupMembership). Otherwise, it creates a new group (CreateGroup). The following is the process flow diagram of the Lambda function.
  3. If the CreateGroupMembership event occurs, it triggers the Lambda function Data_Prepare. This function calls QuickSight APIs to get QuickSight group, user, and object access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. If the Lambda function User_Initiation creates a new QuickSight group in Step 2, it triggers a CloudWatch rule CreateGroup and the Lambda function Group_Initiation. The Group_Initiation function updates the QuickSight objects permission for the new group, such as granting it permission to view a dashboard.
  5. The update object permission event in Step 4 triggers the Lambda function Data_Prepare, which updates the object access permissions information and saves the updated information to an S3 bucket.
  6. The DeleteUser and DeleteGroup events also trigger the Lambda function Data_Prepare.
  7. Based on the files in Amazon S3 that contain the user-group mapping information and the QuickSight objects access permissions information, Amazon Athena tables are created.
  8. A QuickSight dataset fetches the data in the Athena table created in Step 7 through DirectQuery. Another QuickSight dataset is created based on the CloudTrail logs data. Then, based on these two datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • QuickSight
    • Athena
    • Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by downloading the following AWS Cloud Development Kit (AWS CDK) stack from the GitHub repo.

Pull the Administrative Dashboard folder and run the command cdk deploy QuickSightStack to deploy the resources. For more information, see AWS CDK Intro Workshop: Python Workshop.

Implementing the solution

This solution assumes that the users log in to their QuickSight account with identity federation through SAML or OIDC. For instructions on setting up SAML SSO, see Single Sign-On Access to Amazon QuickSight Using SAML 2.0, Federate Amazon QuickSight access with Okta, and Enabling Amazon QuickSight federation with Azure AD. For OIDC SSO, see Use Amazon QuickSight Federated Single Sign-On with Amazon Cognito User Pools.

After you set up the IAM policy of the web identity or SAML federation role, you don’t need to invite users manually. A QuickSight user is provisioned automatically when opening QuickSight for the first time.

In this solution, one SAML federation role corresponds to a QuickSight group. There are four sample SAML roles: Marketing, HR, BI-Admin, and BI Developer.

The following code is the sample CreateUser CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAZEAWJBC3FBJ7KDH2N:[email protected]",
        "arn": "arn:aws:sts::<aws_account_id>:assumed-role/ BI-Developer/[email protected]",
        "accountId": <aws_account_id>,
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAZEAWJBC3FBJ7KDH2N",
                "arn": "arn:aws:iam:: <aws_account_id>:role/BI-Developer",
                "accountId": <aws_account_id>,
                "userName": " BI-Developer"}
        }
    },
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "CreateUser",
    "awsRegion": "us-east-1",
    "eventType": "AwsServiceEvent",
…    
}

This event triggers the CloudWatch events rule CreateUser. The following screenshot shows the details of this rule.

The CreateUser rule triggers the Lambda function User_Initiation. This function gets the QuickSight group name (Marketing, HR, BI-Admin, or BI Developer) and compares the group name with the existing group list. If such a group exists, it adds this new user into that group (CreateGroupMembership). Otherwise, it creates a new group (CreateGroup).
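
The following is a minimal Python sketch of that logic. Rather than comparing against a fetched group list, this sketch simply retries after a ResourceNotFoundException; the account ID is a placeholder:

import boto3

quicksight = boto3.client("quicksight")
ACCOUNT_ID = "111122223333"  # replace with your AWS account ID

def ensure_membership(user_name, group_name):
    """Add the user to the group, creating the group first if it doesn't exist."""
    try:
        quicksight.create_group_membership(
            MemberName=user_name, GroupName=group_name,
            AwsAccountId=ACCOUNT_ID, Namespace="default")
    except quicksight.exceptions.ResourceNotFoundException:
        # The group doesn't exist yet: create it (CreateGroup), then retry
        quicksight.create_group(
            GroupName=group_name, AwsAccountId=ACCOUNT_ID, Namespace="default")
        quicksight.create_group_membership(
            MemberName=user_name, GroupName=group_name,
            AwsAccountId=ACCOUNT_ID, Namespace="default")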

The Data_Prepare Lambda function is triggered by the event of adding a new user into a group (CreateGroupMembership). This function calls the QuickSight API describe_data_set_permissions, describe_dashboard_permissions, or describe_data_source_permissions to get the object access permissions. It also calls the APIs list_user_groups and list_users to get the list of users and the groups of each user. Finally, this function creates two files containing QuickSight group, user, and object access information, and saves them into an S3 bucket.
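
As a sketch of the user and group portion of that function, the following shows how list_users and list_user_groups can be combined into the user-group mapping rows written to Amazon S3 (the account ID is a placeholder; pagination is omitted for brevity):

import boto3

quicksight = boto3.client("quicksight")
ACCOUNT_ID = "111122223333"  # replace with your AWS account ID

def user_group_rows():
    """Return one row per (user, group) pair, ready to write to S3."""
    rows = []
    users = quicksight.list_users(
        AwsAccountId=ACCOUNT_ID, Namespace="default")["UserList"]
    for user in users:
        groups = quicksight.list_user_groups(
            UserName=user["UserName"],
            AwsAccountId=ACCOUNT_ID, Namespace="default")["GroupList"]
        for group in groups:
            rows.append({"user_name": user["UserName"],
                         "group_name": group["GroupName"]})
    return rows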

If a new QuickSight group is created, it triggers the Lambda function Group_Initiation to update the QuickSight dashboard, dataset, or data source permission for this new group. For example, if the HR group is created, the Group_Initiation function lets the HR group view the Employee Information dashboard.

The UpdateDashboardPermissions, UpdateDatasetPermissions, and UpdateDatasourcePermissions events trigger the Lambda function Data_Prepare to update the object access permissions information stored in the S3 bucket.

To create two Athena tables (Groups and Objects), run an AWS Glue crawler.
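
You can run the crawler from the AWS Glue console or from the CLI; for example (the crawler name is a placeholder from your own deployment):

aws glue start-crawler --name <your-crawler-name>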

The following screenshot is sample data of the Groups table.

The following screenshot is sample data of the Objects table.

You can create a DirectQuery dataset in QuickSight with the two new Athena tables joined. See the following screenshot.

The Objects table contains the information about objects (such as dashboards and datasets) belonging to each group or user. Furthermore, we can create a calculated field called Ownership based on the permissions information (the actions column in the Objects table) to distinguish object owners from viewers or users (an object owner can delete the object, whereas a viewer or user can’t).

The following screenshot shows the relevant code.
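
One plausible form of that calculated field, assuming the actions column holds the permission actions as text, is:

Ownership = ifelse(locate({actions}, 'Delete') > 0, 'Owner', 'Viewer/User')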

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

After that, run the following SQL query to build an Athena view with QuickSight events for the last 24 hours:

CREATE OR REPLACE VIEW qsctlog_last_24h AS 
SELECT "useridentity"."type", "split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) "group_name"
, COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) "user_name"
, "eventtime"
, "eventname"
, "awsregion"
, "resources"
, "eventtype"
, "recipientaccountid"
, "serviceeventdetails"
FROM default.cloudtrail_logs
WHERE (("eventsource" = 'quicksight.amazonaws.com') AND (CAST("split_part"("eventtime", 'T', 1) AS date) > "date_add"('hour', -24, "now"()))) 

Running queries in Athena

Now we have the datasets ready in Athena and can run SQL queries against them to answer some common administrative questions.

To create a QuickSight dataset to catch all orphan users that don’t belong to any group, as well as the events done by these users in the last 24 hours, run the following SQL query:

SELECT g.*
,log.group_name as role_name
,log.user_name as log_user_name
,log.type
,log.eventtime
,log.eventname
,log.awsregion
,log.eventtype
,log.recipientaccountid
,log.serviceeventdetails
FROM "default"."qsctlog_last_24h" as log 
full outer join 
"default"."groups" as g 
on log.awsregion=g.aws_region AND log.group_name=g.group_name AND log.user_name=g.user_name 
where g.group_name is null or g.group_name=''

To create a QuickSight dataset to list objects belonging to each group or user, run the following query:

SELECT group_name AS "Group/User Name"
, object_name
, object_type
, if((actions LIKE '%Delete%'), 'Owner', 'Viewer/User') AS Ownership
FROM "default"."object" full outer
JOIN "default"."groups"
    ON group_name=principal_name
WHERE principal_type='group'
UNION
SELECT user_name AS "Group/User Name"
, object_name
, object_type
, if((actions LIKE '%Delete%'), 'Owner', 'Viewer/User') AS Ownership
FROM "default"."object" full outer
JOIN "default"."groups"
    ON user_name=principal_name
WHERE principal_type='user'
ORDER BY  "Group/User Name" asc;

The following screenshot shows the sample data.

Building dashboards

In addition to running queries directly in Athena, we can build a dashboard using this same data in QuickSight. The following screenshot shows an example dashboard that you can make using our data.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

Cleaning up

To avoid incurring future charges, delete the resources you created by running the following command:

cdk destroy QuickSightStack 

Then, on the Amazon S3 console, delete the S3 bucket administrative-dashboard<your_aws_account_id>.

Conclusion

This post discussed how BI administrators can use the QuickSight dashboard, Lambda functions, and other AWS services to create a centralized view of groups, users, and objects access permission information and abnormal access auditing. We also presented a serverless data pipeline to support the administrative dashboard. This dashboard can provide you with unique security insights with its collection of comprehensive security information.


About the Author

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Analyze and improve email campaigns with Amazon Simple Email Service and Amazon QuickSight

Post Syndicated from Apoorv Gakhar original https://aws.amazon.com/blogs/messaging-and-targeting/analyze-and-improve-email-campaigns-with-amazon-simple-email-service-and-amazon-quicksight/

Email is a popular channel for applications, used in both marketing campaigns and other outbound customer communications. The challenge with email is that it can become increasingly complex to manage for companies that must send large quantities of messages per month. This complexity is especially true when companies need to measure detailed email engagement metrics to track campaign success.

As a marketer, you want to monitor several metrics, including open rates, click-through rates, bounce rates, and delivery rates. If you do not track your email results, you could potentially be wasting your campaign resources. Monitoring and interpreting your sending results can help you deliver the best content possible to your subscribers’ inboxes, and it can also ensure that your IP reputation stays high. Mailbox providers prioritize inbox placement for senders that deliver relevant content. As a business professional, tracking your emails can also help you stay on top of hot leads and important clients. For example, if someone has opened your email multiple times in one day, it might be a good idea to send out another follow-up email to touch base.

Building a large-scale email solution is a complex and expensive challenge for any business. You would need to build infrastructure, assemble your network, and warm up your IP addresses. Alternatively, working with some third-party email solutions requires contract negotiations and upfront costs.

Fortunately, Amazon Simple Email Service (SES) has a highly scalable and reliable backend infrastructure to reduce the preceding challenges. It has improved content filtering techniques, reputation management features, and a vast array of analytics and reporting functions. These features help email senders reach their audiences and make it easier to manage email channels across applications. Amazon SES also provides API operations to monitor your sending activities through simple API calls. You can publish these events to Amazon CloudWatch, Amazon Kinesis Data Firehose, or Amazon Simple Notification Service (Amazon SNS).

In this post, you learn how to build and automate a serverless architecture that analyzes email events. We explore how to track important metrics such as open and click rate of the emails.

Solution overview


The metrics that you can measure using Amazon SES are referred to as email sending events. You can use Amazon CloudWatch to retrieve Amazon SES event data. You can also use Amazon SNS to interpret Amazon SES event data. However, in this post, we are going to use Amazon Kinesis Data Firehose to monitor our user sending activity.

You enable Amazon SES configuration sets with open and click metrics and publish email sending events to Amazon Kinesis Data Firehose as JSON records. A Lambda function parses the JSON records and publishes the content to an Amazon S3 bucket.

Ingested data lands in an Amazon S3 bucket that we refer to as the raw zone. To make that data available, you have to catalog its schema in the AWS Glue Data Catalog. You create and run the AWS Glue crawler that crawls your data sources and constructs your Data Catalog. The Data Catalog uses pre-built classifiers for many popular source formats and data types, including JSON, CSV, and Parquet.

When the crawler is finished creating the table definition and schema, you analyze the data using Amazon Athena. It is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL. Point to your data in Amazon S3, define the schema, and start querying using standard SQL, with most results delivered in seconds.

Now you can build visualizations, perform ad hoc analysis, and quickly get business insights from the Amazon SES event data using Amazon QuickSight. You can easily run SQL queries using Amazon Athena on data stored in Amazon S3, and build business dashboards within Amazon QuickSight.


Deploying the architecture:

Configuring Amazon Kinesis Data Firehose to write to Amazon S3:

  1. Navigate to Amazon Kinesis in the AWS Management Console. Choose Kinesis Data Firehose and create a delivery stream.
  2. Enter delivery stream name as “SES_Firehose_Demo”.
  3. Under the source category, select “Direct Put or other sources”.
  4. On the next page, make sure to enable data transformation of source records with AWS Lambda. We use AWS Lambda to parse the notification contents so that we process only the information required for the use case.
  5. Click “Create New” to create a new Lambda function.
  6. Click on the “General Kinesis Data Firehose Processing” Lambda blueprint. This opens the Lambda console. Enter the following values:
    • Name: SES-Firehose-Json-Parser
    • Execution role: Create a new role with basic Lambda permissions.
  7. Click “Create Function”. Now replace the Lambda code with the following code and save the function.
    • 'use strict';
      console.log('Loading function');
      
      exports.handler = (event, context, callback) => {
          /* Process the list of records and transform them: decode the base64
             payload, keep only the fields we need, and re-encode the result. */
          const output = event.records.map((record) => {
              const payload = JSON.parse(Buffer.from(record.data, 'base64').toString());
      
              // Click events carry the source IP under "click"; the open events
              // handled by this stream carry it under "open".
              const result = (payload.eventType === 'Click')
                  ? {
                        eventType: payload.eventType,
                        destinationEmailId: payload.mail.destination[0],
                        sourceIp: payload.click.ipAddress,
                    }
                  : {
                        eventType: payload.eventType,
                        destinationEmailId: payload.mail.destination[0],
                        sourceIp: payload.open.ipAddress,
                    };
      
              return {
                  recordId: record.recordId,
                  result: 'Ok',
                  data: Buffer.from(JSON.stringify(result)).toString('base64'),
              };
          });
      
          console.log(`Processing completed. Successful records ${output.length}.`);
          callback(null, { records: output });
      };

      Please note: For this blog, we extract only three fields: the event type, the destination email address, and the source IP. If you want to store other parameters, you can modify the code accordingly. For the full list of information included in these notifications, see the following document:

      https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-retrieving-firehose-examples.html

  8. Now, navigate back to your Amazon Kinesis Data Firehose console and choose the newly created Lambda function.
  9. Keep the convert record format disabled and click “Next”.
  10. In the destination, choose Amazon S3 and select a target Amazon S3 bucket. Create a new bucket if you do not want to use the existing bucket.
  11. Enter the following values for the Amazon S3 prefix and error prefix; these prefixes partition the event data by timestamp when it is published.
    • Prefix:
      fhbase/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
    • Error Prefix:
      fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/
  12. You may use the preceding values for the Amazon S3 prefix and error prefix. If you use your own prefixes, make sure to update the target path in AWS Glue accordingly in the later steps.
  13. Keep the Amazon S3 backup option disabled and click “Next”.
  14. On the next page, under the Permissions section, select create a new role. This opens up a new tab and then click “Allow” to create the role.
  15. Navigate back to the Amazon Kinesis Data Firehose console and click “Next”.
  16. Review the changes and click on “Create delivery stream”.

Configure Amazon SES to publish event data to Kinesis Data Firehose:

  1. Navigate to Amazon SES console and select “Email Addresses” from the left side.
  2. Click on “Verify a New Email Address” at the top. Enter the email address that you will use to send a test email.
  3. Go to your email inbox and click on the verify link. Navigate back to the Amazon SES console and you will see verified status on the email address provided.
  4. Open the Amazon SES console and select Configuration set from the left side.
  5. Create a new configuration set. Enter “SES_Firehose_Demo” as the configuration set name and click “Create”.
  6. Choose Kinesis Data Firehose as the destination and provide the following details.
    • Name: OpenClick
    • Event Types: Open and Click
  7. In the IAM Role field, select ‘Let SES make a new role’. This allows SES to create a new role and add sufficient permissions for this use case in that role.
  8. Click “Save”.

Sending a Test email:

  1. Navigate to Amazon SES console, click on “Email Addresses” on the left side.
  2. Select your verified email address and click on “Send a Test email”.
  3. Make sure you select the raw email format. You may use the following format to send out a test email from the console. Make sure you send out this email to a recipient inbox to which you have the access.
    • X-SES-CONFIGURATION-SET: SES_Firehose_Demo
      X-SES-MESSAGE-TAGS: Email=NULL
      From: [email protected]
      To: [email protected]
      Subject: Test email
      Content-Type: multipart/alternative;
          		boundary="----=_boundary"
      
      ------=_boundary
      Content-Type: text/html; charset=UTF-8
      Content-Transfer-Encoding: 7bit
      This is a test email.
      
      <a href="https://aws.amazon.com/">Amazon Web Services</a>
      ------=_boundary
  4. Once the email arrives in the recipient’s inbox, open it and click the link inside. This generates a click and open event and sends the response back to SES.

Creating Glue Crawler:

  1. Navigate to the AWS Glue console, select “crawler” from the left side, and then click on “Add crawler” on the top.
  2. Enter the crawler name as “SES_Firehose_Crawler” and click “Next”.
  3. Under Crawler source type, select “Data stores” and click “Next”.
  4. Select Amazon S3 as the data source and provide the required path. Include the path up to the “fhbase” folder.
  5. Select “No” under the Add another data source section.
  6. In the IAM role section, select the option to ‘Create an IAM role’. Enter the name as “SES_Firehose-Crawler”. This automatically grants the necessary permissions to the newly created role.
  7. In the frequency section, select run on demand and click “Next”. You may choose this value as per your use case.
  8. Click on add Database and provide the name as “ses_firehose_glue_db”. Click on create and then click “Next”.
  9. Review your Glue crawler settings and click on “Finish”.
  10. Run the crawler you created. It crawls the data from the specified Amazon S3 bucket and creates a catalog and table definition.
  11. Now navigate to “tables” on the left, and verify a “fhbase” table is created after you run the crawler.

If you want to analyze the data stored until now, you can use Amazon Athena and test the queries. If not, you can move to the Amazon Quicksight directly.

Analyzing the data using Amazon Athena:

  1. Open the Athena console and select the database that was created using AWS Glue.
  2. Click on “setup a query result location in Amazon S3” as shown in the following screenshot.
  3. Navigate to the Amazon S3 bucket created in earlier steps and create a folder called “AthenaQueryResult”. We store our Athena query result in this bucket.
  4. Now navigate back to Amazon Athena and select the Amazon S3 bucket with the folder location as shown in the following screenshot and click “Save”.
  5. Run the following query to test the sample output and accordingly modify your SQL query to get the desired output.
    • SELECT * FROM "ses_firehose_glue_db"."fhbase"

Note: If you want to track opened emails by unique IP addresses, you can modify your SQL query accordingly. This is because every time an email is opened, you receive a notification, even if the same email was previously opened.
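
For example, a sketch of such a query, assuming the crawler derived lowercase column names from the Lambda output fields, counts distinct source IPs per recipient:

SELECT destinationemailid,
       COUNT(DISTINCT sourceip) AS unique_open_ips
FROM "ses_firehose_glue_db"."fhbase"
WHERE eventtype = 'Open'
GROUP BY destinationemailid;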


Visualizing the data in Amazon QuickSight dashboards:

  1. Now, let’s analyze this data with Amazon Athena via Amazon QuickSight.
  2. Log in to Amazon QuickSight and choose Manage data, New dataset. Choose Amazon Athena as the new data source.
  3. Enter the data source name as “SES-Demo” and click on “Create the data source”.
  4. Select your database from the drop-down as “ses_firehose_glue_db” and the table “fhbase” that you created in AWS Glue.
  5. Add a custom SQL based on your use case and click on “Confirm query”. Refer to the example below.
  6. You can perform ad hoc analysis and modify your query according to your business needs, as shown in the following image. Click “Save & Visualize”.
  7. You can now visualize your event data on the Amazon QuickSight dashboard. You can use various graphs to represent your data. For this demo, the default graph is used and two fields are selected to populate the graph, as shown below.
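
As an illustration of step 5, a custom SQL for this dataset might simply select the parsed fields (a sketch, assuming the Glue database and table names used earlier):

SELECT eventtype, destinationemailid, sourceip
FROM "ses_firehose_glue_db"."fhbase"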


Conclusion:

This architecture shows how to track your email sending activity at a granular level. You set up Amazon SES to publish event data to Amazon Kinesis Data Firehose based on fine-grained email characteristics that you define. You can also track several types of email sending events, including sends, deliveries, bounces, complaints, rejections, rendering failures, and delivery delays. This information can be useful for operational and analytical purposes.

To get started with Amazon SES, follow this quick start guide and you can learn more about monitoring sending activity here.

About the Authors

Chirag Oswal is a solutions architect and AR/VR specialist working with the public sector in India. He works with AWS customers to help them adopt the cloud operating model at a large scale.

Apoorv Gakhar is a Cloud Support Engineer and an Amazon SES expert. He works with AWS to help customers integrate their applications with various AWS services.


Additional Resources:

Amazon SES Dedicated IP Pools

Amazon Personalize optimizer using Amazon Pinpoint events

Template Personalization using Amazon Pinpoint