Tag Archives: Amazon QuickSight

Building a Controlled Environment Agriculture Platform

Post Syndicated from Ashu Joshi original https://aws.amazon.com/blogs/architecture/building-a-controlled-environment-agriculture-platform/

This post was co-written by Michael Wirig, Software Engineering Manager at Grōv Technologies.

A substantial percentage of the world’s habitable land is used for livestock farming for dairy and meat production. The dairy industry has leveraged technology to gain insights that have led to drastic improvements and are continuing to accelerate. A gallon of milk in 2017 involved 30% less water, 21% less land, a 19% smaller carbon footprint, and 20% less manure than it did in 2007 (US Dairy, 2019). By focusing on smarter water usage and sustainable land usage, livestock farming can grow to provide sustainable and nutrient-dense food for consumers and livestock alike.

Grōv Technologies (Grōv) has pioneered the Olympus Tower Farm, a fully automated Controlled Environment Agriculture (CEA) system. Unique among vertical farming startups, Grōv grows cattle feed to improve the sustainable use of land for livestock farming while increasing the economic margins for dairy and beef producers.

The challenges of CEA

The set of growing conditions for a CEA is called a “recipe,” which is a combination of ingredients like temperature, humidity, light, carbon dioxide levels, and water. The optimal recipe is dynamic and is sensitive to its ingredients. Crops must be monitored in near-real time, and CEAs should be able to self-correct in order to maintain the recipe. To build a system with these capabilities requires answers to the following questions:

  • What parameters need to be measured for indoor cattle feed production?
  • Which sensors provide the right accuracy and price trade-offs at scale?
  • Where do you place the sensors to ensure a consistent crop?
  • How do you correlate the data from sensors to the nutrient value?

To progress from a passively monitored system to a self-correcting, autonomous one, the CEA platform also needs to address:

  • How to maintain optimum crop conditions
  • How the system can learn and adapt to new seed varieties
  • How to communicate key business drivers such as yield and dry matter percentage

Grōv partnered with AWS Professional Services (AWS ProServe) to build a digital CEA platform addressing the challenges posed above.

Olympus Tower - Grov Technologies

Tower automation and edge platform

The Olympus Tower is instrumented for measuring recipe ingredients by combining the mechanical, electrical, and domain expertise of the Grōv team with the IoT edge and sensor expertise of the AWS ProServe team. The teams identified a primary set of features such as height, weight, and evenness of the growth to be measured at multiple stages within the Tower. Sensors were also added to measure secondary features such as water level, water pH, temperature, humidity, and carbon dioxide.

The teams designed and developed a purpose-built modular and industrial sensor station. Each sensor station has sensors for direct measurement of the features identified. The sensor stations are extended to support indirect measurement of features using a combination of Computer Vision and Machine Learning (CV/ML).

The trays with the growing cattle feed circulate through the Olympus Tower. A growth cycle starts on a tray with seeding, circulates through the tower over the cycle, and returns to the starting position to be harvested. The sensor station at the seeding location on the Olympus Tower tags each new growth cycle in a tray with a unique “Grow ID.” As trays pass by, each sensor station in the Tower collects the feature data. The firmware, jointly developed for the sensor station, uses AWS IoT SDK to stream the sensor data along with the Grow ID and metadata that’s specific to the sensor station. This information is sent every five minutes to an on-site edge gateway powered by AWS IoT Greengrass. Dedicated AWS Lambda functions manage the lifecycle of the Grow IDs and the sensor data processing on the edge.
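The following is a minimal, hypothetical sketch (in Python, using the AWS IoT Greengrass Core SDK) of how a sensor station function might publish readings tagged with a Grow ID; the topic name, field names, and values are illustrative assumptions, not Grōv's actual firmware.

import json
import time

import greengrasssdk  # available in the Greengrass Core Lambda runtime

iot_client = greengrasssdk.client("iot-data")

def publish_reading(grow_id, station_id, reading):
    # Each message carries the Grow ID plus sensor-station metadata.
    payload = {
        "grow_id": grow_id,
        "station_id": station_id,
        "timestamp": int(time.time()),
        "reading": reading,  # e.g. {"water_ph": 6.1, "temp_c": 21.4}
    }
    iot_client.publish(
        topic="grov/tower/{}/telemetry".format(station_id),  # hypothetical topic
        payload=json.dumps(payload),
    )

def lambda_handler(event, context):
    # A long-lived Greengrass Lambda can loop and publish on a fixed cadence.
    while True:
        publish_reading("grow-0001", "station-07", {"water_ph": 6.1, "temp_c": 21.4})
        time.sleep(300)  # every five minutes, matching the description above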

The Grōv team developed AWS IoT Greengrass Lambda functions that run at the edge to ingest critical metrics from the operational automation software running the Olympus Towers. This information makes it possible not just to monitor operational efficiency, but also to provide the hooks needed to control the feedback loop.

These two sources of data were augmented with a third: sensor stations installed at the building or site level capture environmental data such as weather and the energy consumption of the Towers.

All three sources of data are streamed to AWS IoT Greengrass and are processed by AWS Lambda functions. The edge software also fuses the data and correlates all categories of data together. This enables two major actions for the Grōv team – operational capability in real-time at the edge and enhanced data streamed into the cloud.

Grov Technologies - Architecture

Cloud pipeline/platform: analytics and visualization

The data is streamed to AWS IoT Core via AWS IoT Greengrass. AWS IoT rules route the ingested data to Amazon Simple Storage Service (Amazon S3) and Amazon DynamoDB for storage. The data pipeline also includes Amazon Kinesis Data Streams for batching and additional processing of the incoming data.
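As a rough illustration, an AWS IoT rule that routes telemetry messages to Amazon S3 and Amazon DynamoDB could be created with boto3 as sketched below; the topic filter, bucket, table, and role ARNs are placeholders rather than the actual configuration.

import boto3

iot = boto3.client("iot")

iot.create_topic_rule(
    ruleName="route_telemetry_to_s3_and_dynamodb",
    topicRulePayload={
        "sql": "SELECT * FROM 'grov/tower/+/telemetry'",  # hypothetical topic filter
        "awsIotSqlVersion": "2016-03-23",
        "actions": [
            {
                "s3": {
                    "roleArn": "arn:aws:iam::123456789012:role/IotRuleActionRole",
                    "bucketName": "example-sensor-data",
                    "key": "${topic()}/${timestamp()}.json",
                }
            },
            {
                "dynamoDBv2": {
                    "roleArn": "arn:aws:iam::123456789012:role/IotRuleActionRole",
                    "putItem": {"tableName": "example-sensor-data"},
                }
            },
        ],
    },
)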

A ReactJS-based dashboard application, powered by Amazon API Gateway and AWS Lambda functions, reports relevant metrics such as daily yield and machine uptime.

A data pipeline is deployed to analyze the data with Amazon QuickSight. AWS Glue creates a dataset from the data stored in Amazon S3, and Amazon Athena queries that dataset to make it available to Amazon QuickSight. This gives the extended Grōv team of research scientists the ability to perform what-if analyses on the data coming in from the Tower systems, beyond what is available in the React-based dashboard.
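For scientists who prefer working outside the dashboard, the same AWS Glue/Athena tables can also be queried programmatically. The sketch below uses boto3 with placeholder database, table, column, and bucket names.

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT grow_id, avg(water_ph) AS avg_ph, avg(temp_c) AS avg_temp "
        "FROM sensor_readings GROUP BY grow_id"  # hypothetical table and columns
    ),
    QueryExecutionContext={"Database": "grov_cea"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
print(response["QueryExecutionId"])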

Data pipeline - Grov Technologies

Completing the data-driven loop

With data collected from all sources and stored in a data lake architecture, the Grōv CEA platform has established a strong foundation for harnessing insights and delivering customer outcomes using machine learning.

The integrated and fused data from the edge (sourced from the Olympus Tower instrumentation, the Olympus automation software, and the site-level sensors) is correlated with the lab analysis performed by the Grōv Research Center (GRC). Harvest samples are routinely collected and sent to the lab, which performs wet chemistry and microbiological analysis. The lab results for each sampled tray are associated with its sensor data through the corresponding Grow ID. This serves as the mechanism for labeling the recipe data with the parameters dairy and beef producers care about – dry matter percentage, micro- and macronutrients, and the presence of mycotoxins.
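A hedged sketch of that labeling join, using pandas with hypothetical paths and column names, might look like the following.

import pandas as pd

# Sensor features aggregated per Grow ID (placeholder paths and columns).
sensor_features = pd.read_parquet("sensor_features_by_grow_id.parquet")
lab_results = pd.read_csv("grc_lab_results.csv")

labeled = sensor_features.merge(
    lab_results[["grow_id", "dry_matter_pct", "mycotoxins_detected"]],
    on="grow_id",
    how="inner",
)
# 'labeled' now pairs each growth cycle's recipe and sensor data with the
# outcomes dairy and beef producers care about, ready for model training.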

Grōv has chosen Amazon SageMaker to build a machine learning pipeline on its comprehensive dataset, which will enable fine-tuning the growing protocols in near-real time. The historical data collection also unlocks future machine learning use cases such as detecting anomalous sensor readings and monitoring sensor health.

Because the solution is flexible, the Grōv team plans to integrate data from animal studies on their health and feed efficiency into the CEA platform. Machine learning on the data from animal studies will enhance the tuning of recipe ingredients that impact the animals’ health. This will give the farmer an unprecedented view of the impact of feed nutrition on the end product and consumer.

Conclusion

Grōv Technologies and AWS ProServe have built a strong foundation: an extensible and scalable architecture for a CEA platform that will nourish animals for better health and yield, produce healthier foods, and enable continued research into dairy production, rumination, and animal health to empower sustainable farming practices.

New charts, formatting, and layout options in Amazon QuickSight

Post Syndicated from Sapna Maheshwari original https://aws.amazon.com/blogs/big-data/new-charts-formatting-and-layout-options-in-amazon-quicksight/

Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to create and deliver insights to everyone in your organization. In this post, we explore how authors of QuickSight dashboards can use some of the new chart types, layout options, and dashboard formatting controls to build dashboards that intuitively deliver insights to all your users, whether within QuickSight or embedded in your websites or multi-tenant apps.

This blog post explores some of the visualization and dashboard customization features offered in Amazon QuickSight using the following datasets: an Ads dataset, a Student Performance dataset, a P&L dataset, and a Patient Info dataset.

Picking the right chart for your use case

In this section, we explore some of the new charts QuickSight introduced in 2020 and how these help with various use cases.

Funnel charts

Funnel charts help visualize the progressive reduction of data as it passes from one phase to another. Data in each of these phases is represented as different portions of 100% (the whole). The most common use of the funnel chart is in visualizing conversion data. For example, you can represent sales lead generation showing different stages of sales conversion from first contact to lead generation.

To build a funnel chart with our Ads dataset, complete the following steps:

  1. On the analysis page, choose Visualize.
  2. Choose Add, then choose Add visual.
  3. In the Visual types pane, choose the funnel chart icon.
  4. For Group by, choose Stage.
  5. For Value, choose Ad name.
  6. To change default configuration, choose the gear icon.
  7. In the Data labels section, for Metric label style, choose Value and percent of first.

The video below demonstrates these steps.

Stacked area charts

Stacked area charts are best used to visualize part-to-whole relationships, to show how each category contributes to the cumulative total. For this post, we create a stacked area chart with the Ads dataset.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the stacked area chart icon.
  • For X axis, choose Date (MONTH).
  • For Value, choose Cost (Sum).
  • For Color, choose Segment.
  • Choose the gear icon.
  • Under Legend, deselect Legend title.
  • Under Y-Axis, select Show Y axis label.
  • Under Data labels, select Show data labels.
  • Choose your desired position, font size, font color, and label pattern.

Histograms

Histograms help visualize the frequency distribution of a dataset and display numerical data by grouping data into bins of equal width. Each bin is plotted as a bar whose height corresponds to the number of data points within the bin.
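To make the binning behavior concrete, the short sketch below groups illustrative math scores (not the actual Student Performance data) into bins of width 5, matching the setting used later in this walkthrough.

import numpy as np

math_scores = np.array([72, 69, 90, 47, 76, 71, 88, 40, 64, 38, 58, 41])
counts, bin_edges = np.histogram(math_scores, bins=np.arange(35, 100, 5))
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print("{:>2.0f}-{:<2.0f}: {}".format(left, right, "#" * count))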

For this post, we use the Student Performance dataset to create a histogram.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the histogram icon.
  • For Value, choose math score.

You can customize the histogram to define bins by bin count, bin width, or a custom start value. For this post, we define bins by bin width.

  • Under Histogram, select Bin width.
  • For Bin width, enter 5.

Box plots

A box plot (also called a box-and-whisker plot) is a standardized way of displaying the distribution of data based on a five-number summary: minimum, first quartile (Q1), median, third quartile (Q3), and maximum. This is useful for determining whether data is symmetrical, skewed, or tightly grouped. Box plots also show outliers.
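The five-number summary behind a box plot can be computed directly, as in the sketch below (illustrative scores, not the actual Student Performance data).

import numpy as np

writing_scores = np.array([44, 90, 38, 78, 75, 83, 50, 95, 61, 39, 88, 92])
minimum, q1, median, q3, maximum = np.percentile(writing_scores, [0, 25, 50, 75, 100])
iqr = q3 - q1
# Points beyond 1.5 * IQR from the quartiles are commonly flagged as outliers.
outliers = writing_scores[(writing_scores < q1 - 1.5 * iqr) | (writing_scores > q3 + 1.5 * iqr)]
print(minimum, q1, median, q3, maximum, outliers)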

For this post, we create a box plot on the Student Performance dataset.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the box plot icon.
  • For Group by, choose Gender.
  • For Value, choose writing score and reading score.
  • In the visual settings, under Box plot, select Show outliers and Show all data points.
  • Under Legend, deselect Show legend title.

 

Waterfall charts

Waterfall charts help you understand the cumulative effect of sequentially introduced positive or negative values. This is great to understand contributions to a whole, for example the main contributors to a monthly increase in revenue, or the breakdown of revenue vs costs.

We use the P&L dataset to create a waterfall chart.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual selection pane, choose the waterfall chart icon.
  • For Category, choose Line item.
  • For Value, choose Value (Sum).
  • Under Legend, deselect Show legend title.
  • For Position, select Bottom.
  • Under Title, deselect Show title.

Choropleth maps

Choropleth maps use differences in shading or coloring within geographical areas or regions to indicate the value of a variable in those areas.

We use the Patient Info dataset to create a choropleth map.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the filled map (choropleth) icon.
  • For Location, choose State.
  • For Color, choose Revenue (Sum).
  • Choose the menu options icon (…) and choose Conditional formatting.
  • For Column, choose Revenue.
  • For Fill type, select Gradient.
  • For Max value, choose a color (for this post, we choose blue).

The video below demonstrates these steps.

You can also control the color and shading of the geographic areas using conditional formatting as shown below.

The video below demonstrates these steps.


Customization and formatting options

QuickSight also supports several formatting options that allow you to streamline visualizations and convey additional information in your dashboards.

Table/pivot table formatting

Pin or unpin totals and add custom total text

You can now pin totals to the top or bottom of tables and pivot tables in QuickSight. This feature helps you view the totals even while scrolling through the tables.

  • Go to visual setting (gear icon on the visual menu).
  • Under Total, select Pin totals.
  • For Position, choose a position (for this post, we choose Bottom).

Additionally, you can edit the text you want to show on totals and subtotals.

  • For Total label, enter your custom text (for this post, we enter Grand Total).

Table alignment and wrapping

You can now set horizontal (left, right, center, auto) and vertical (top, middle, bottom) alignment for column headers and cell values in a table visual. Additionally, you can apply text wrapping on table and pivot table headers so that long headers remain readable without having to scroll over the header.

These options are available under Table options.

Hide +/- buttons on pivot tables

You can now show or hide the +/- buttons on pivot tables. This lets you improve the presentation of pivot tables by removing these icons and keeping the table simple. This option is available under Styling.

Visual customization options

In this section, we discuss additional customization options in QuickSight.

Custom sorting

If you want to sort your charts in a custom-defined order that differs from the default alphabetical order, you can now do so in QuickSight. For example, you can sort geographical regions in the order East, West, Central, and South by ranking these regions 1–4 and then sorting on that rank field. See the video below to learn how.
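If your dataset doesn't already contain such a rank field, you can add one during data preparation; the pandas sketch below uses hypothetical column names.

import pandas as pd

df = pd.DataFrame({"region": ["Central", "East", "South", "West"],
                   "sales": [120, 340, 90, 210]})
custom_order = {"East": 1, "West": 2, "Central": 3, "South": 4}
df["region_rank"] = df["region"].map(custom_order)
# Sorting the visual on region_rank yields East, West, Central, South.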

You can also sort on any other critical metric fields that aren't part of the visual. Choose your field well and choose Sort options to see the available sort order options.

The following screenshot shows your sorted visualizations.

 

Adding descriptive text, images, and links

You can add images or logos to your dashboard using QuickSight’s narrative component available in Enterprise Edition.

  • On the analysis page, choose Visualize.
  • Choose Add, then choose Add visual.
  • In the Visual types pane, choose the insights icon.
  • Choose Customize insight.
  • Remove any existing text and add your custom text.

You can also add hyperlinks to text and images. Upload your image to a secure location where the image is accessible to QuickSight and your users.

The video below demonstrates these steps.

Customizing colors and fonts

QuickSight offers easy-to-build themes that allow customization of the color palette, background and foreground colors, spacing, fonts, and more. Themes can be created by authors and shared within an account, and are also accessible via APIs for programmatic management. Themes can also be defaulted for all users in the organization using APIs.

You can also prioritize the colors that you want to use in your dashboard by prioritizing them within your theme’s color palette.

You can apply predefined themes available out of the box or create your own themes that fit your corporate branding. The following screenshots show how a dashboard looks in both dark and light themes.

For more information, see Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities.

Null value customization

You may occasionally have null values in your data and want to represent nulls with different annotations. For each of the dimensions and metrics in the dataset, you can provide custom text for null values. Go to More formatting options for any field.

The option is available under Null values.

Reference lines

You can draw reference lines based on a calculated field or a constant value. Choose the gear icon and navigate to the Reference lines section. In the following screenshot, the orange reference line is based on a calculation (average profit) and the black reference line is plotted at a constant value of 1,000.

You can also link reference lines to parameters via a calculated field, which allows you to create what-if scenarios within your dashboard.

Custom colors on heat and tree maps

Color gradient customization on heat and tree maps allows you to select colors for lower, intermediate, and upper limits so that the gradient is applied within these colors. You can configure this under Color in the visual settings.

Using logarithmic scale

If your metric values aren't all in the same order of magnitude, with some on the extreme high end and others on the low end (for example, representing stock prices for different entities or pandemic spread rates), you can represent them on a logarithmic scale so they're normalized yet relative. To use a logarithmic scale, go to the visual settings and, under Y-Axis, for Scale, select Logarithmic.

The following screenshot shows your visualization after applying logarithmic scale.

Adjustable font size

You can now apply different font sizes to all visual content and visual titles. In pivot tables and tables, you can set font sizes for table headers, cells, totals, and subtotals. In key performance indicators (KPIs), you can set font sizes for primary and comparison values, which allows you to keep dashboards dense and add more KPIs.

Actions

Finally, to all these charts, you can apply the following actions:

  • Filter actions – Select points on a chart to filter across the dashboard. QuickSight supports hierarchical filter actions that allow you to trigger one filter action from more than one chart. For more information, see Enhancing dashboard interactivity with Amazon QuickSight Actions.
  • URL actions – Trigger navigation from the dashboard to an external website and pass dynamic values within a URL.

Layout enhancements

QuickSight dashboards default to auto-fit mode, which makes them responsive based on screen size. However, in many situations, it’s preferable that the view you design is exactly what end-users see, whether on a laptop or a large monitor. QuickSight offers optimized layouts that allow you to pick a specific screen resolution to optimize for (such as the screen size most of your users use on a daily basis), and QuickSight automatically scales the dashboard view to render appropriately on larger or smaller screens. This doesn’t affect mobile devices—QuickSight automatically optimizes for mobile devices using a single-column layout. To adjust the scaling mode, choose Settings in the navigation pane while in dashboard authoring (analysis) mode.

If you build your dashboard for a 1024 px screen, for example, QuickSight scales that view to a larger or smaller screen to ensure that all users see the same content (mobile devices continue to fall back to a single-column, mobile-specific layout to ensure usability). Opting for the optimized mode also makes sure that your email reports look exactly like the dashboard that your viewers interact with.

On-sheet filter controls

You can now add filters to your dashboard directly without having to create parameters. Choose the field that you need to filter on, and choose Add filter for this field. Choose the newly added filter and choose Add to sheet.

If you need to pin it to the controls section, choose the filter and choose Pin to top.

The video below demonstrates these steps.

QuickSight lets you choose from any of these control types to add to dashboards: single-select drop-downs, multi-select drop-downs, date and time pickers, single-sided sliders, single-line text boxes, time range pickers, relative date selection, and numeric range sliders. Learn more about on-sheet controls in the blog post here.

Other launches in 2020

While this blog covers the key charting and visualization launches in 2020, you can take a look at new features across other areas of QuickSight in this blog post here.

Conclusion

With these new QuickSight feature releases, you can now choose the chart type that is best suited to represent your data. You can provide richer dashboards for your readers by using the new formatting table options, dynamic titles, and reference lines. For more information about authoring dashboards in QuickSight, watch the virtual workshop Build Advanced Analytics and Dashboards with Amazon QuickSight and consider subscribing to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs.


About the Author

Sapna Maheshwari is a Specialist Solutions Architect for Amazon QuickSight. She is passionate about telling stories with data. In her previous roles at American Express and Early Warning Services, she managed and led several projects in the data and analytics space. She enjoys helping customers unearth actionable insights from their data.

Field Notes: Ingest and Visualize Your Flat-file IoT Data with AWS IoT Services

Post Syndicated from Paul Ramsey original https://aws.amazon.com/blogs/architecture/field-notes-ingest-and-visualize-your-flat-file-iot-data-with-aws-iot-services/

Customers who maintain manufacturing facilities often find it challenging to ingest, centralize, and visualize IoT data that is emitted in flat-file format from their factory equipment. While modern IoT-enabled industrial devices can communicate over standard protocols like MQTT, there are still some legacy devices that generate useful data but are only capable of writing it locally to a flat file. This results in siloed data that is either analyzed in a vacuum without the broader context, or it is not available to business users to be analyzed at all.

AWS provides a suite of IoT and Edge services that can be used to solve this problem. In this blog, I walk you through one method of leveraging these services to ingest hard-to-reach data into the AWS cloud and extract business value from it.

Overview of solution

This solution provides a working example of an edge device running AWS IoT Greengrass with an AWS Lambda function that watches a Samba file share for new .csv files (presumably containing device or assembly line data). When it finds a new file, it will transform it to JSON format and write it to AWS IoT Core. The data is then sent to AWS IoT Analytics for processing and storage, and Amazon QuickSight is used to visualize and gain insights from the data.
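A simplified, hypothetical version of such a function is sketched below in Python; the walkthrough that follows uses the pre-built Lambda code provided with the solution, and the share paths shown here are placeholders.

import csv
import glob
import json
import os
import time

import greengrasssdk  # available in the Greengrass Core Lambda runtime

iot_client = greengrasssdk.client("iot-data")
WATCH_DIR = "/share/input"          # Samba share visible to the Greengrass Core
PROCESSED_DIR = "/share/processed"  # files are moved here after ingestion

def process_file(path):
    # Convert each CSV row to JSON and publish it to the iot/data topic.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            iot_client.publish(topic="iot/data", payload=json.dumps(row))
    os.rename(path, os.path.join(PROCESSED_DIR, os.path.basename(path)))

def lambda_handler(event, context):
    # Configured as a long-lived function, so it keeps polling for new files.
    while True:
        for path in glob.glob(os.path.join(WATCH_DIR, "*.csv")):
            process_file(path)
        time.sleep(10)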

Samba file share solution diagram

Since we don’t have an actual on-premises environment to use for this walkthrough, we’ll simulate pieces of it:

  • In place of the legacy factory equipment, an EC2 instance running Windows Server 2019 will generate data in .csv format and write it to the Samba file share.
    • We’re using a Windows Server for this function to demonstrate that the solution is platform-agnostic. As long as the flat file is written to a file share, AWS IoT Greengrass can ingest it.
  • An EC2 instance running Amazon Linux will act as the edge device and will host AWS IoT Greengrass Core and the Samba share.
    • In the real world, these could be two separate devices, and the device running AWS IoT Greengrass could be as small as a Raspberry Pi.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS Account
  • Access to provision and delete AWS resources
  • Basic knowledge of Windows and Linux server administration
  • If you’re unfamiliar with AWS IoT Greengrass concepts like Subscriptions and Cores, review the AWS IoT Greengrass documentation for a detailed description.

Walkthrough

First, we’ll show you the steps to launch the AWS IoT Greengrass resources using AWS CloudFormation. The AWS CloudFormation template is derived from the template provided in this blog post. Review the post for a detailed description of the template and its various options.

  1. Create a key pair. This will be used to access the EC2 instances created by the CloudFormation template in the next step.
  2. Launch a new AWS CloudFormation stack in the N. Virginia (us-east-1) Region using iot-cfn.yml, which represents the simulated environment described above.
    •  Parameters:
      • Name the stack IoTGreengrass.
      • For EC2KeyPairName, select the EC2 key pair you just created from the drop-down menu.
      • For SecurityAccessCIDR, use your public IP with a /32 CIDR (i.e. 1.1.1.1/32).
      • You can also accept the default of 0.0.0.0/0 if you're comfortable having SSH and RDP open to all sources on the EC2 instances in this demo environment.
      • Accept the defaults for the remaining parameters.
  •  View the Resources tab after stack creation completes. The stack creates the following resources:
    • A VPC with two subnets, two route tables with routes, an internet gateway, and a security group.
    • Two EC2 instances, one running Amazon Linux and the other running Windows Server 2019.
    • An IAM role, policy, and instance profile for the Amazon Linux instance.
    • A Lambda function called GGSampleFunction, which we’ll update with code to parse our flat-files with AWS IoT Greengrass in a later step.
    • An AWS IoT Greengrass Group, Subscription, and Core.
    • Other supporting objects and custom resource types.
  • View the Outputs tab and copy the IPs somewhere easy to retrieve. You’ll need them for multiple provisioning steps below.

3. Review the AWS IoT Greengrass resources created on your behalf by CloudFormation:

    • Search for IoT Greengrass in the Services drop-down menu and select it.
    • Click Manage your Groups.
    • Click file_ingestion.
    • Navigate through the Subscriptions, Cores, and other tabs to review the configurations.

Leveraging a device running AWS IoT Greengrass at the edge, we can now interact with flat-file data that was previously difficult to collect, centralize, aggregate, and analyze.

Set up the Samba file share

Now, we set up the Samba file share where we will write our flat-file data. In our demo environment, we’re creating the file share on the same server that runs the Greengrass software. In the real world, this file share could be hosted elsewhere as long as the device that runs Greengrass can access it via the network.

  • Follow the instructions in setup_file_share.md to set up the Samba file share on the AWS IoT Greengrass EC2 instance.
  • Keep your terminal window open. You’ll need it again for a later step.

Configure Lambda Function for AWS IoT Greengrass

AWS IoT Greengrass provides a Lambda runtime environment for user-defined code that you author in AWS Lambda. Lambda functions that are deployed to an AWS IoT Greengrass Core run in the Core’s local Lambda runtime. In this example, we update the Lambda function created by CloudFormation with code that watches for new files on the Samba share, parses them, and writes the data to an MQTT topic.

  1. Update the Lambda function:
    • Search for Lambda in the Services drop-down menu and select it.
    • Select the file_ingestion_lambda function.
    • From the Function code pane, click Actions then Upload a .zip file.
    • Upload the provided zip file containing the Lambda code.
    • Select Actions > Publish new version > Publish.

2. Update the Lambda Alias to point to the new version.

    • Select the Version: X drop-down (“X” being the latest version number).
    • Choose the Aliases tab and select gg_file_ingestion.
    • Scroll down to Alias configuration and select Edit.
    • Choose the newest version number and click Save.
    • Do NOT use $LATEST as it is not supported by AWS IoT Greengrass.

3. Associate the Lambda function with AWS IoT Greengrass.

    • Search for IoT Greengrass in the Services drop-down menu and select it.
    • Select Groups and choose file_ingestion.
    • Select Lambdas > Add Lambda.
    • Click Use existing Lambda.
    • Select file_ingestion_lambda > Next.
    • Select Alias: gg_file_ingestion > Finish.
    • You should now see your Lambda associated with the AWS IoT Greengrass group.
    • Still on the Lambda function tab, click the ellipsis and choose Edit configuration.
    • Change the following Lambda settings then click Update:
      • Set Containerization to No container (always).
      • Set Timeout to 25 seconds (or longer if you have large files to process).
      • Set Lambda lifecycle to Make this function long-lived and keep it running indefinitely.

Deploy AWS IoT Greengrass Group

  1. Restart the AWS IoT Greengrass daemon:
    • A daemon restart is required after changing containerization settings. Run the following commands on the Greengrass instance to restart the AWS IoT Greengrass daemon:
 cd /greengrass/ggc/core/
 sudo ./greengrassd stop
 sudo ./greengrassd start

2. Deploy the AWS IoT Greengrass Group to the Core device.

    • Return to the file_ingestion AWS IoT Greengrass Group in the console.
    • Select Actions > Deploy.
    • Select Automatic detection.
    • After a few minutes, you should see a Status of Successfully completed. If the deployment fails, check the logs, fix the issues, and deploy again.

Generate test data

You can now generate test data that is ingested by AWS IoT Greengrass, written to AWS IoT Core, and then sent to AWS IoT Analytics and visualized by Amazon QuickSight.

  1. Follow the instructions in generate_test_data.md to generate the test data.
  2. Verify that the data is being written to AWS IoT Core following these instructions (Use iot/data for the MQTT Subscription Topic instead of hello/world).

screenshot

Setup AWS IoT Analytics

Now that our data is in the AWS Cloud, it only takes a few clicks to configure AWS IoT Analytics to process, store, and analyze it.

  1. Search for IoT Analytics in the Services drop-down menu and select it.
  2. Set Resources prefix to file_ingestion and Topic to iot/data. Click Quick Create.
  3. Populate the data set by selecting Data sets > file_ingestion_dataset > Actions > Run now. If you don’t get data on the first run, you may need to wait a couple of minutes and run it again.

Visualize the Data from AWS IoT Analytics in Amazon QuickSight

We can now use Amazon QuickSight to visualize the IoT data in our AWS IoT Analytics data set.

  1. Search for QuickSight in the Services drop-down menu and select it.
  2. If your account is not signed up for QuickSight yet, follow these instructions to sign up (use Standard Edition for this demo)
  3. Build a new report:
    • Click New analysis > New dataset.
    • Select AWS IoT Analytics.
    • Set Data source name to iot-file-ingestion and select file_ingestion_dataset. Click Create data source.
    • Click Visualize. Wait a moment while your rows are imported into SPICE.
    • You can now drag and drop data fields onto field wells. Review the QuickSight documentation for detailed instructions on creating visuals.
    • Following is an example of a QuickSight dashboard you can build using the demo data we generated in this walkthrough.

Cleaning up

Be sure to clean up the objects you created to avoid ongoing charges to your account.

  • In Amazon QuickSight, Cancel your subscription.
  • In AWS IoT Analytics, delete the datastore, channel, pipeline, data set, role, and topic rule you created.
  • In CloudFormation, delete the IoTGreengrass stack.
  • In Amazon CloudWatch, delete the log files associated with this solution.

Conclusion

Gaining valuable insights from device data that was once out of reach is now possible thanks to AWS’s suite of IoT services. In this walkthrough, we collected and transformed flat-file data at the edge and sent it to IoT Cloud using AWS IoT Greengrass. We then used AWS IoT Analytics to process, store, and analyze that data, and we built an intelligent dashboard to visualize and gain insights from the data using Amazon QuickSight. You can use this data to discover operational anomalies, enable better compliance reporting, monitor product quality, and many other use cases.

For more information on AWS IoT services, check out the overviews, use cases, and case studies on our product page. If you’re new to IoT concepts, I’d highly encourage you to take our free Internet of Things Foundation Series training.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Transform data and create dashboards simply using AWS Glue DataBrew and Amazon QuickSight

Post Syndicated from Prithiviraj Jothikumar original https://aws.amazon.com/blogs/big-data/transform-data-and-create-dashboards-simply-using-aws-glue-databrew-and-amazon-quicksight/

Before you can create visuals and dashboards that convey useful information, you need to transform and prepare the underlying data. The range and complexity of data transformation steps required depends on the visuals you would like in your dashboard. Often, the data transformation process is time-consuming and highly iterative, especially when you are working with large datasets.

In this post, we show you how to apply data transformations to COVID-related tweets (with sentiment scores) using AWS Glue DataBrew. AWS Glue DataBrew’s visual and interactive interface allows us to apply data transformations without any coding. Some examples of the transformations we apply are changing date formats, transforming text strings, and performing table pivots. We then use Amazon QuickSight to visualize the transformed data in a dashboard.

See the AWS architecture diagram below for an overview of the complete and serverless data pipeline.


There are seven main steps in the data pipeline:

  1. Upload CSV file with tweets to S3
  2. Trigger AWS Lambda to partition the dataset if the number of rows exceeds a threshold, and store the output in S3. The threshold prevents a memory exception failure when computing VADER sentiment scores.
  3. Use Amazon SQS and AWS Lambda to compute VADER sentiment scores for tweets and store the VADER output as parquet files in S3
  4. Crawl the VADER output and produce a Data Catalog using AWS Glue, so that the data can be queried easily using Amazon Athena
  5. Use the Data Catalog connector in AWS Glue DataBrew to create a recipe and perform transformations
  6. Setup a Recipe job to perform the transformation and send the output to an S3 location; repeat step 5 to produce a data catalog of the transformed data
  7. Query and submit the data to Amazon QuickSight to create visuals for the dashboard

Each section below corresponds to one (or more) of the seven steps described above.

Steps 1 to 3: Setting up and deploying the AWS Lambda pipeline

To test the solution, we can use the AWS CloudFormation template found here. The AWS CloudFormation template automatically creates the following for you: the S3 bucket to store the CSV and parquet files, the Lambda function to partition the files, the SQS queue, and the VADER Lambda function.

Deploying the CloudFormation Template

  1. Create an S3 bucket with the name of your choice and note the bucket name. Please refer to this guide for detailed steps on how to create an S3 bucket.
  2. Upload all four individual zip files to the S3 bucket created in step 1. To do so, download the zip file from here, unzip it, and place each of the four individual zip files it contains into the S3 bucket created in step 1.
  3. Open the AWS Management Console in the AWS Region you want to deploy the solution to, and on the Services menu, choose CloudFormation.
  4. Choose Create Stack, choose Upload a template to Amazon S3, and then choose the file databrew-cloudformation.yaml included in the solution you downloaded earlier (if you haven’t downloaded it yet, use the link provided in this step).
  5. Set Stack name to databrew-stack. Specify the Amazon S3 bucket that contains the compressed version of AWS Lambda function code and layers uploaded in step 2.
  6. For Options, you can specify tags for your stack and an optional IAM role to be used by AWS CloudFormation to create resources. If the role isn’t specified, a new role is created. You can also perform additional configuration for rollback settings and notification options.
  7. The review section shows a recap of the information. Be sure to select the three AWS CloudFormation acknowledgements to allow AWS CloudFormation to create resources with custom names on your behalf. Also, create a change set, because the AWS CloudFormation template includes the AWS::Serverless-2016-10-31 transform.

  8. Click Execute.
  9. The Outputs for the stack lists all the resources created.

Downloading the dataset

  1. Download the dataset from here.

Testing the solution

  1. Navigate to the S3 console.
  2. Click on the bucket name (created by CloudFormation template and listed in the Outputs section of the stack).
  3. Create folder named input in the S3 bucket.
  4. Upload the downloaded dataset to the folder created in step 3.
  5. The upload triggers a Lambda function that chunks the file into smaller files, which in turn (via an SQS event) triggers another Lambda function that performs the following preprocessing steps: removing https links, computing VADER sentiment scores for the four categories (compound, positive, negative, and neutral), and saving the output as parquet files in the S3 path called processedv4 (see the sketch below).
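The sentiment-scoring step itself is straightforward; the sketch below shows the core of what the VADER Lambda does, using the vaderSentiment package (field handling and batching in the deployed function differ).

import re

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def score_tweet(text):
    text = re.sub(r"https?://\S+", "", text)  # remove https links
    return analyzer.polarity_scores(text)     # keys: neg, neu, pos, compound

print(score_tweet("Vaccines are rolling out faster than expected! https://example.com"))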

Step 4: Setup AWS Glue Data Catalog

An AWS Glue Data Catalog allows us to easily import data into AWS Glue DataBrew. Follow these steps to create an AWS Glue crawler that crawls the raw data with VADER output (partitioned parquet files in S3) and determines the schema:

  1. Choose a crawler name.
  2. Use the default options for Crawler source type.
  3. Provide the S3 location of the parquet files.
  4. Choose/Create an IAM role that has read/write permissions to S3 and the AWSGlueServiceRole policy attached.
  5. Set the frequency as Run on demand.
  6. Choose a database name for the crawler’s output.
  7. Leave all remaining options as default.

Once the crawler has been created, select and run the new crawler. Upon completion, the table schema is generated and visible within Tables under the Databases section in the AWS Glue console.

Step 5: Creating a DataBrew project

To get started with AWS Glue DataBrew, complete the following steps:

  1. On the DataBrew console, choose Projects.
  2. Choose Create project.
  3. For Project Name, enter covid19-tweets.
  4. For Attached recipe, choose Create new recipe.
  5. For Recipe name, enter covid19-tweets-recipe.
  6. For Select a dataset, select New dataset.
  7. For Dataset name, enter covid19-tweets-data.
  8. Import the AWS Glue table <YOUR-GLUE-TABLE-NAME> from the AWS Glue database <YOUR-GLUE-DATABASE-NAME>.
  9. Create a new AWS Identity and Access Management (IAM) policy and IAM role by following the steps on the AWS Glue DataBrew console, which provides DataBrew the necessary permissions to access Amazon S3, Amazon Athena and AWS Glue.
  10. Select Create Project.

Step 6: Data transformations, creating a AWS Glue DataBrew recipe and recipe job

Exploratory data analysis (EDA) is an essential component of the data transformation workflow. EDA allows us to gain an intuitive understanding of the dataset by summarizing its main characteristics such as the distribution of data across columns, the corresponding data types and summary statistics. For more information on EDA using AWS Glue DataBrew, see the Exploratory Data Analysis section of the post Data preprocessing for machine learning on Amazon EMR made easy with AWS Glue DataBrew.

The set of data transformation steps listed below are based on the EDA and the attributes we would like to visualize in the Amazon QuickSight dashboard.

  1. Delete the following columns: user_description, user_created, user_verified, is_retweet, user_followers, user_friends, user_favourites, and user_name. These columns aren’t needed for the Amazon QuickSight dashboard.

Based on the schema view of the dataset, we can observe that the date column is represented as a string. AWS Glue DataBrew provides various date-time transformations to convert date formats.

  1. From the dropdown, select Date functions, then DATEFORMAT.
    1. Create a new column based on the values in the source column.
    2. For source column, select date.
    3. For date format, select mm/dd/yy*HH:MM.
  2. Delete the date column and rename the date_DATEFORMAT column to date.


A visual representation of the positive, negative, neutral, and compound scores generated by the VADER AWS Lambda function is helpful for understanding the sentiment of the tweets over time. Individual scores can vary widely between rows, leading to large fluctuations in a line graph. We can use the rolling window transformation over observations in a column to create a smoother, less fluctuating line graph (a pandas sketch of the same idea follows these steps). To apply the rolling window function, first create a less granular version of the date column that is truncated of the hours and minutes.

  1. From the dropdown, select Date functions, then DATEFORMAT.
    1. Create a new column based on the values in the source column
    2. For source column, select date
    3. For date format, select mm/dd/yy to transform the observations at a date-level
    4. For destination column, enter date_level.
  2. From the dropdown, select Window functions, then ROLLING AVERAGE.
    1. For source column, select compound
    2. Select 3 rows before and after. This is the size of the rolling window that slides over the column and affects the smoothness of the average score from one row to another; larger window sizes produce smoother rolling average scores and vice versa.
    3. Order the observations by the date column because its granularity is at the hour and minute level
    4. Group the observations by the date_level column, which is represented as a date only, without the time
    5. For the destination column name, enter rolling_compound.

  3. Repeat step 5 above for each of the source columns pos, neu and neg to create rolling_pos, rolling_neu and rolling_neg.
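For reference, a hedged pandas equivalent of this rolling window step (a centered window of 3 rows before and after, ordered by date and grouped by date_level) could look like the following; the input path is a placeholder.

import pandas as pd

df = pd.read_parquet("vader_output.parquet")
df = df.sort_values("date")
df["rolling_compound"] = (
    df.groupby("date_level")["compound"]
      .transform(lambda s: s.rolling(window=7, center=True, min_periods=1).mean())
)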

To support downstream visuals in Amazon QuickSight, apply the unpivot transformation to convert the selected columns into row values.

  1. Select Pivot transformation
    1. For pivot type, select Unpivot: Columns to rows
    2. In the unpivot columns dropdown, select the columns rolling_compound, rolling_pos, rolling_neu and rolling_neg
    3. For column name, enter Rolling_category and for column value enter VADER_score

The hashtags column contains an array of hashtags. To visualize the hashtags as a word cloud in Amazon QuickSight, create a new row for each hashtag from the array by applying the unnest transformation.

  1. Select unnest
    1. For source column, select hashtags
    2. For the destination column name, enter unnest_hashtags

Normalize the data in the unnest_hashtags column

  1. Format the data in unnest_hashtags column to lowercase
  2. Remove all special characters by selecting the clean transformation

The data preparation phase is now complete, and the set of 20 transformations, consisting of date formatting, rolling window functions, and column normalization, is combined into a recipe.

The transformations above were applied to a sample of the first 500 rows of the dataset. An AWS Glue DataBrew recipe job lets you scale the set of transformation steps from a sample of the data to the entire dataset.

To create a recipe job, complete the following steps:

  1. On the AWS Glue DataBrew console, choose Jobs.
  2. Choose Create recipe job.
  3. For Job name, enter a name.
  4. Create a new folder in Amazon S3 for the recipe job output, select the file type as GLUEPARQUET and compression as Snappy.

If your dataset is updated on a regular basis, AWS Glue DataBrew provides an option to schedule jobs.

To query the newly transformed data from S3 in Amazon QuickSight, create another crawler/table in AWS Glue, similar to the steps provided earlier (refer to the section Step 4: Setup AWS Glue Data Catalog).

Use Amazon QuickSight to visualize transformed data

Before proceeding, make sure your Amazon QuickSight account has IAM permissions to access Amazon Athena and Amazon S3. Add a new dataset by choosing Datasets from the left panel and then New dataset. Select Athena from the list of data sources and provide a Data source name. In the next section, select the database and table that contain the data produced by the AWS Glue DataBrew transformation, choose Use custom SQL, change the New custom SQL name to TableOrderedData, and paste the following SQL query before selecting Edit/Preview data:

SELECT * FROM "<db>"."<table_name>" t ORDER BY t.date;

Replace <db> with your database and <table_name> with your table name. Leave the quotes in place.

Click Apply to make some further modifications.

  1. For the date column, change the data type from String to Date and provide the date format as it is presented in the column (i.e. MM/dd/yy HH:mm). Similarly, change the date_level column to Date.
  2. Click Save and visualize to proceed to the next step, which is to analyze the data. In the Analyses section, select Horizontal bar chart under Visual types, then choose vader_score and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate.
  3. For this visual, select Actions and select Filter same-sheet visuals under Quick create.
  4. Create another panel by selecting Add to create a line graph of the rolling scores. Select Line chart under Visual types and select the following in the presented order: vader_score, date, and rolling_category. Under Field wells, change vader_score (Sum) to vader_score (Average) as the Aggregate. Likewise, change date (Day) to date (Hour) as the Aggregate. As with the previous panel, create a quick action item.
  5. Further panels can also be created, such as a word cloud of hashtags or a pivot table of the different VADER categories with score and date. The following image is based on selecting the rolling_pos bar in the first quadrant, which applies the rolling_pos filter and cascades it to the rest of the panels.
  6. The first quadrant depicts the overall average of each category across the entire dataset. The second quadrant depicts the rolling average aggregated by the hour. The third quadrant is a table representation of the rolling average aggregated by the day. The word cloud panel is built from the hashtags column.

Summary

In this post, we showed you how to interactively analyze and visualize a dataset using AWS Lambda, AWS Glue DataBrew, and Amazon QuickSight. We began with a COVID-19 tweets dataset and computed sentiment scores using the VADER algorithm. We then cleaned and prepared the tweets and their sentiment scores in AWS Glue DataBrew, and finally visualized key characteristics of the dataset in Amazon QuickSight.

We encourage you to extend this pipeline to your own data analytics and visualization use cases – there are many more pre-built transformations in AWS Glue DataBrew and pre-built visuals in Amazon QuickSight to explore.

Happy Building!


About the Authors

Prithiviraj Jothikumar, PhD, is a Data Scientist with AWS Professional Services, where he helps customers build solutions using machine learning. He enjoys watching movies and sports and spending time to meditate.

 

 

 

Kartik Kannapur is a Data Scientist with AWS Professional Services. He holds a Master’s degree in Applied Mathematics and Statistics from Stony Brook University and focuses on using machine learning to solve customer business problems.

 

 

 

Bala Krishnamoorthy is a Data Scientist with AWS Professional Services, where he helps customers solve problems and run machine learning workloads on AWS. He has worked with customers across diverse industries, including software, finance, and healthcare. In his free time, he enjoys spending time outdoors, running with his dog, beating his family and friends at board games and keeping up with the stock market.

 

Paritosh Walvekar is a Cloud Application Architect with AWS Professional Services, where he helps customer build cloud native applications. He has a Master’s degree in Computer Science from University at Buffalo. In his free time, he enjoys watching movies and is learning to play the piano.

 

Building an ad-to-order conversion engine with Amazon Kinesis, AWS Glue, and Amazon QuickSight

Post Syndicated from Gandhi Raketla original https://aws.amazon.com/blogs/big-data/building-an-ad-to-order-conversion-engine-with-aws-glue-amazon-kinesis-data-streams-and-amazon-quicksight/

Businesses in ecommerce have the challenge of measuring their ad-to-order conversion ratio for ads or promotional campaigns displayed on a webpage. Tracking the number of users that clicked on a particular promotional ad and the number of users who actually added items to their cart or placed an order helps measure the ad’s effectiveness. Utilizing promotional ads that have higher conversion rates enables you to effectively utilize limited space on your ecommerce websites and applications.

This post demonstrates how to sessionize and aggregate clickstream and order data, compute the conversion ratio in real time, and generate data visualizations. We use Amazon Kinesis Data Streams to ingest and send data to Amazon Simple Storage Service (Amazon S3), and AWS Glue, Amazon Athena, and Amazon QuickSight to catalog, analyze, and visualize the data, respectively.

Solution overview

To measure ad-to-order conversion, you need two important pieces of data: user clicks and orders. Clickstream data captures each click users make as they navigate through the site, along with the metadata associated with those clicks. Depending on the user base and the number of active users at any moment, clickstream data can be a large amount of data generated per second. Typically, every ecommerce system has a centralized order management system that captures orders created from different channels like a web portal or mobile app. To compute an ad-to-order conversion rate, you join clickstream data and order data over time: (total number of orders / total number of clicks) * 100.
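As a concrete illustration of that math, the sketch below computes the ratio per promotion from already-aggregated click and order counts (illustrative values; in this solution the join is done in an Athena view).

import pandas as pd

clicks = pd.DataFrame({"promotion": ["P601", "P701"], "click_count": [500, 350]})
orders = pd.DataFrame({"promotion": ["P601", "P701"], "order_count": [40, 70]})

conversion = clicks.merge(orders, on="promotion")
conversion["conversion_pct"] = conversion["order_count"] / conversion["click_count"] * 100
print(conversion)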

The following diagram illustrates the architecture of our solution.

The solution has six main categories.

  • Data generators – Clickstream and order data is generated with the help of an AWS Lambda function. The function is triggered by a scheduled Amazon CloudWatch Events event every minute and generates random clicks for ingestion into a Kinesis data stream. Similarly, another function triggered by a CloudWatch event generates random orders for ingestion into a second data stream. In a production environment, this data comes from clickstream generators and a centralized order management system.
  • Data ingestion – Kinesis data streams ingest clickstream and order data as they are generated.
  • Data sessionization – Data sessionization helps group related data. For clickstream data, we can group clicks on an ad by different users or time periods. For order data, we can group orders by different ads. We use Amazon Kinesis Data Analytics for SQL to analyze streaming data in real time with standard SQL. Sessionized clickstream and order data is ingested into another in-application stream.
  • Data processing and storage – The sessionization stream from Kinesis Data Analytics for SQL is ingested into an Amazon Kinesis Data Firehose delivery stream, which delivers the data to a pre-configured S3 bucket.
  • Data Catalog – You use AWS Glue to crawl the clickstream and orders data in their respective S3 buckets, as well as build metadata definitions and tables in Athena. AWS Glue crawlers run every hour to update table definitions, and Athena views are built to compute the ad-to-order conversion.
  • Data visualization – You use QuickSight to generate visualizations.

Prerequisites

Before getting started, you must provision your resources with AWS CloudFormation. 

  1. Choose Launch Stack.
  2. Choose Next.
  3. For Stack name, enter a name for the stack.
  4. For Bucket Name for Clicks, enter the name of the S3 bucket that holds clickstream data (for this post, click-stream).
  5. For Bucket Name for Orders, enter the name of the S3 bucket that holds order data (order-stream).
  6. Enter any tags you wish to assign to the stack.
  7. Choose Next.
  8. Verify that the stack has been created successfully.

If you have never used QuickSight in this account before, sign up for QuickSight before moving on to the next step. Keep in mind that admin access to the Enterprise Edition QuickSight instance is needed to complete setup. 

Generating and ingesting clickstream data

On the Lambda console, view your function ingest-clickstream for ingesting clickstream data. The clickstream data attributes include UserId, Device, Event, EventType, and Timestamp. The Event attribute contains promotional ad information for the portion of the webpage the user clicked. This function generates random clickstream records and ingests them into the data stream ClickStream. The following screenshot shows your function details on the console.

A CloudWatch Events rule invokes this function every minute. The following screenshot shows sample data that was ingested into the data stream. The Event column represents the portion of the webpage the user clicked; every click on the webpage has a unique ID and type assigned (for example, P601 has the event type Promotion, C301 has the event type Checkout).
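A simplified sketch of what such a generator function might look like follows; the actual ingest-clickstream function is created by the CloudFormation stack, and the attribute values here are illustrative.

import json
import random
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    event_id = random.choice(["P601", "P701", "C301"])
    record = {
        "UserId": str(uuid.uuid4()),
        "Device": random.choice(["Mobile", "Desktop", "Tablet"]),
        "Event": event_id,
        "EventType": "Promotion" if event_id.startswith("P") else "Checkout",
        "Timestamp": datetime.now(timezone.utc).isoformat(),
    }
    kinesis.put_record(
        StreamName="ClickStream",
        Data=json.dumps(record),
        PartitionKey=record["UserId"],
    )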

Generating and ingesting order data

On the AWS Lambda console, view your function ingest-order for ingesting order data. This function ingests random orders.

Each order has order lines, which contain the attributes ItemId, Promotion, UnitPrice, and Quantity (see the following screenshot). The Promotion attribute indicates the ad the user clicked before adding the item to their shopping cart. This function generates random orders and ingests them into OrderStream. The Promotion attribute joins clickstream data and order data.

Sessionizing the data

To sessionize the data, complete the following steps:

  1. On the Kinesis Data Analytics console, select <Stack Name>-ClickStreamApplication.
  2. Choose Run.
  3. Repeat the same step for <Stack Name>-OrderAnalysisApp.
  4. When the status changes to Running, choose the application name.
  5. Under Real time analytics, choose Go to SQL results.
  6. Choose the Real-time analytics tab.

The application groups clicks in 1-minute intervals. Let’s take the ad P701 as an example. If this ad is clicked by multiple users, this SQL function adds all the clicks by different users in the last minute. If five users clicked on P701 in the last minute, the function outputs a ClickCount of 5. A stagger window is used because it’s well-suited for analyzing groups of data that arrive at inconsistent times.

  1. On the Kinesis Data Analytics console, choose OrderAnalysisApp.
  2. Choose Go to SQL results.
    This application groups orders by Promotion, as shown in the following screenshot.

Processing and storing the data

In the data processing and storage stage, aggregated clickstream and order data is delivered to a Kinesis Data Firehose delivery stream. Kinesis Data Firehose delivers clickstream aggregated records and orders to the click-stream and order-stream buckets, respectively. The data is partitioned by year, month, and day. The following screenshot shows the delivery streams on the console.

Analyzing the data

To analyze your data, complete the following steps:

  1. Verify that the S3 bucket was created for clickstream and orders.

The data in the bucket is partitioned by year, month, day, and hour.

  1. On the AWS Glue console, view the clickstream and orders crawlers.

These two crawlers crawl the click-stream and order-stream buckets every 15 minutes and create tables.

  1. To run the crawlers on demand, choose Run crawler.

When the crawler is finished, the Tables added column displays 1.
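
If you prefer to start the crawlers programmatically instead of from the console, a minimal boto3 call does the same thing (the crawler names below are placeholders; substitute the names created by the CloudFormation stack):

import boto3

glue = boto3.client("glue")

# Placeholder names: replace with the crawler names created by the stack
for crawler_name in ["clickstream-crawler", "orders-crawler"]:
    glue.start_crawler(Name=crawler_name)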

  1. In the navigation pane, choose Tables.
  2. Verify that the crawlers created the tables.
  3. On the Athena console, choose Saved queries.

You can see three queries have been created.

  1. Select view_clicks_aggregate to load it in the query editor.
  2. Select ad_to_order_conversion and choose Run Query.

If the Amazon S3 bucket name contains a hyphen (-), the crawler replaces it with an underscore (_) when creating the table.

  1. Replace - with _ in the table name when creating the view.
  2. Repeat the same process for view_orders_aggregate and view_conversion_ratio.

Make sure you run view_clicks_aggregate and view_orders_aggregate before running view_conversion_ratio.

  1. Choose view_conversion_ratio and choose Preview.

Orders and clicks for each promotion and the corresponding conversion ratio are displayed.
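
You can also run the conversion query from a script rather than the console by using the Athena API. The following is a hedged boto3 sketch; the database name and S3 output location are assumptions you should replace with your own values:

import time
import boto3

athena = boto3.client("athena")

# Assumed names: replace with the database the crawlers created and an S3
# location you own for query results.
response = athena.start_query_execution(
    QueryString="SELECT * FROM view_conversion_ratio",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then print the result rows
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])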

Visualizing the data

To visualize your data, you first load it into QuickSight. You can then create visualizations. In this section, we also configure a scheduled data refresh.

Loading the data

To visualize your data, you must first load your data into QuickSight.

  1. On the QuickSight console, from the Admin drop-down menu, choose Manage QuickSight.
  2. In the navigation pane, choose Security & Permissions.
  3. Choose Add or remove.
  4. Select Amazon Athena.
  5. Select Amazon S3 to edit QuickSight access to your S3 buckets.
  6. Choose the Details link next to Amazon S3.
  7. Choose Select S3 buckets.
  8. Select the bucket names you provided for clicks and orders.
  9. Choose Finish.
  10. Choose Update.
  11. Choose the QuickSight icon on the top left of the admin panel to proceed back to the home screen.
  12. In the navigation pane, choose Datasets.
  13. Choose New dataset.
  14. Choose Athena.
  15. For Data source name, enter Ad-To-Order-Conversion.
  16. Choose Validate Connection.
  17. After your connection is validated, choose Create data source.
  18. For Database, choose ad-to-order-conversion.
  19. For Tables, select view_conversion_ratio.
  20. Choose Select.
  21. Choose Visualize.

Creating visualizations

In this section, we create two visualizations of our data. We first make a horizontal bar chart.

  1. From the Add menu, choose Add Calculated Field.
  2. Enter Clicks_to_Orders.
  3. Enter the formula sum(orders)/sum(clicks).
  4. Choose Save.
  5. Choose the options menu next to Clicks_to_Orders.
  6. For Show as, choose Percent.
  7. For Visual type, choose Horizontal bar chart.
  8. Drag promotion to Y-axis.
  9. Drag clicks_to_orders to Value.
  10.  Drag date to Group/Color.

The following screenshot shows our visualization.

We now make our second visualization, a vertical bar chart.

  1. Choose the + icon next to Sheet1.
  2. For Visual types, choose Vertical bar chart.
  3. Drag promotion to the X-axis.
  4. Drag clicks and orders to Value.

This graph displays clicks and orders for each promotion.

  1. Choose Insights on the left panel to see a summary of your insights.

Refreshing the data

We can also set up a scheduled refresh for our data.

  1. Choose Manage Data.
  2. Choose view_conversion_ratio.
  3. Choose Schedule refresh.
  4. Choose Create.
  5. For Repeats, choose Hourly.
  6. Choose Create.

You see a confirmation message that the refresh is scheduled to run once per hour.
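
The hourly schedule above is configured in the QuickSight UI. If you also want to trigger a SPICE refresh on demand, for example from a data pipeline, the QuickSight API provides CreateIngestion; the following boto3 sketch uses placeholder account and dataset IDs:

import uuid
import boto3

quicksight = boto3.client("quicksight")

# Placeholders: your AWS account ID and the QuickSight dataset ID for view_conversion_ratio
response = quicksight.create_ingestion(
    AwsAccountId="123456789012",
    DataSetId="your-dataset-id",
    IngestionId=str(uuid.uuid4()),  # any unique ID for this refresh
)
print(response["IngestionStatus"])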

Conclusion

In this post, we showed you how to use AWS analytics and storage services to address business challenges that require handling large volumes of data. Kinesis Data Streams and Kinesis Data Analytics let you ingest large volumes of data and sessionize the data. We also showed you how to analyze and visualize the clickstream and order data using AWS Glue, Athena, and QuickSight.


About the Authors

Gandhi Raketla is a Senior Solutions Architect for AWS. He works with AWS customers and partners on cloud adoption, architecting solutions that help customers foster agility and innovation.

Nick Sack is a DevOps Consultant for AWS Professional Services. He is passionate about working with customers and building automated solutions to help customers on their cloud journeys. When not working, Nick enjoys hiking, playing soccer, reading, and learning about technology.

Amazon QuickSight: 2020 in review

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-2020-in-review/

As 2020 draws to a close, we’ve put together this post to walk you through all that’s changed in Amazon QuickSight in 2020. For your reading convenience, this post is broken up into the following sections:

  • Embedded Analytics at scale
  • Faster insights with Q & Machine Learning (ML)
  • Business Intelligence (BI) with QuickSight
    • Build Rich, Interactive Dashboards
    • Scale your data with SPICE
    • Centralized governance, security, and administration
  • QuickSight learning resources

Embedded Analytics at scale

Embedded analytics has been a key focus area for both independent software vendors (ISVs) and enterprises in the past year. Thousands of customers, including the NFL, Blackboard, Comcast, Panasonic Avionics, EHE Health, and more, have chosen to launch embedded dashboards in QuickSight for both their internal and external end-users, with the ability to scale from tens of users to tens of thousands without any server provisioning or management. Embedding analytics within existing end-user workflows has meant better engagement in apps and portals, and less fatigue for users because they do not need to browse multiple apps for insights.

Dashboard embedded within the website of a fictional company

In 2020, we launched namespaces in QuickSight, adding native support for multi-tenancy, an important consideration for developers and ISVs looking for in-app analytics. Together with embedded QuickSight authoring (dashboard creation capabilities), namespaces allow developers and ISVs to use QuickSight to provide self-service data exploration and dashboard creation options to specific users of an application or portal. For example, ISVs can empower power users to build their own dashboards and share them with other end-users from the same organization, which provides feature differentiation for the ISV's app and opens new revenue opportunities. We also updated the QuickSight JavaScript SDK to allow you to pass parameters into an embedded dashboard, switch tabs, change languages, and listen to events within an embedded dashboard.

As developers, ISVs, and enterprises roll out analytics with QuickSight to hundreds of thousands of users, we also saw the need for a pricing model that scales with growth. The QuickSight per-user model for dashboard readers caps charges at $5 per reader per month, providing a low per-user price point for BI use cases. However, as dashboards are rolled out to hundreds of thousands of users, we find that average session consumption per user is usually low (less than 10 sessions a month), but aggregate session usage across users within the account is high.

To ensure scalable pricing for growth, we launched a new session capacity pricing model that allows you to commit to session usage in bulk, and benefit from increased per-session discounts with commitments. With session capacity pricing, sessions are simply charged in 30-minute blocks of usage starting with first access. Unlike traditional options for capacity pricing, which require a server with annual commitments, session capacity pricing allows you to get started easily with a $250 per month starter option. After starting with the monthly option, you can move to one of the annual session capacity tiers as your session usage increases—ensuring that costs scale with growing usage. Annual session capacities allow sessions to be consumed across the year, providing flexibility in ramping up on production traffic, balancing session needs across busy or lean periods of the year (the first or last week of the month are busy, holidays may be slow or busy depending on the nature of the business). For more details on the new pricing options, see Amazon QuickSight Pricing.

With this new model in place, we now also support embedding QuickSight dashboards in scenarios where provisioning and managing end-user reader identities might not be convenient (or even possible). Examples of these situations include embedded dashboards for sharing public statistics within a city or county, dashboards of company-wide statistics available to all users to be embedded on an internal corporate portal, or embedded dashboards with the same data intended for tens of thousands of users per customer, department, or job location or site within an app.

To make it easier to get started with QuickSight embedded analytics, we also have a new embedded analytics developer portal. Learn more about how to embed, view code samples, see capabilities of the JavaScript SDK, and interact with embedded dashboards as well as the authoring experience without having to sign up or code!

Faster insights with Q and ML

While embedding dashboards provides a way to present users with key statistics and insights that are curated ahead of time, end-users often want to dig in further and ask detailed questions. While embedded authoring capabilities allow for such detailed exploration, it is often power users or analysts who leverage such features. For most regular end users, it is either the app developer (in case of embedded apps) or the BI team (in case of BI dashboards) that provides answers to such detailed questions. Both app devs and BI teams are often backlogged and take time to respond. As a result, this workflow can be tedious and time consuming, and disruptive to both the decision-makers and the data teams. This prompted us to develop Q, the machine learning (ML)-powered natural language query capability in QuickSight. With Q, business users of QuickSight can ask questions in everyday business language and get answers in seconds. For more information on Amazon QuickSight Q, see the product page and our blog post New – Amazon QuickSight Q answers natural-language questions about business data. Q is now available in preview; sign up now.

Separately, we have also continued our efforts to integrate AWS advances in ML into QuickSight to provide easy-to-use ML capabilities for business analysts and end-users. The most popular feature of our ML-Insights suite has been natural language narratives in QuickSight, which use natural language generation to provide key insights that you can customize and add to dashboards. We added the ability to add images and links to these insights (and to show them conditionally as needed), allowing you to use these as a versatile component of your dashboards, which can present up-to-date insights in simple business language for easy consumption by end-users.

We also made Amazon SageMaker integration generally available this year. With this integration, you can perform ML-powered inferences on your datasets with just a few clicks. These include predicting the likelihood of customer churn, scoring leads to prioritize sales activity, assessing credit risk for loan applications, and more. With a native integration, data scientists and data engineering teams no longer have to do the heavy lifting of writing code for ETL jobs, consuming inference APIs, storing output in a queryable source, and finally importing to QuickSight. For more information, see Visualizing Amazon SageMaker machine learning predictions with Amazon QuickSight. For customers who want to understand anomalies in their data better without setting up or managing SageMaker, we continue to enhance anomaly detection in QuickSight. Anomaly detection runs against your data in a database, data warehouse, data lake, or in the QuickSight SPICE in-memory store, and alerts you to anomalous data points. For more information, see Setting Up ML-Powered Anomaly Detection for Outlier Analysis.

Business Intelligence (BI) with QuickSight

With a serverless architecture, QuickSight allows customers to get started with rich interactive BI dashboards in no time. Customers such as Capital One, Best Western, Rio Tinto, and the NFL have taken the stage at re:Invent previously to talk about how a serverless model has allowed them to launch to tens of thousands of users without the typical infrastructure planning and monitoring overhead. Additionally, the completely web-based dashboard authoring experience in QuickSight means analysts don't need to download any clients to create and publish dashboards. Administrators also don't spend weeks of time and effort in software updates, because QuickSight is fully managed, with updates twice a month. Overall, this translates to an easy setup and rollout of QuickSight as a business intelligence (BI) solution for thousands of users.

Build Rich, Interactive Dashboards

We introduced a significant number of enhancements around dashboards and the dashboard creation experience in 2020. We introduced six new chart types: filled (choropleth) map, histogram, funnel chart, stacked area chart, waterfall chart, and boxplot, bringing the total number of supported charts to 28. QuickSight now also supports reference lines, either based on a value from the chart itself or an external calculated field, which allow authors to convey additional information to the reader. QuickSight charts now also support custom sorting using a field not present in the visual, thereby performing custom ordering of fields based on business context. Through the narrative component available as part of ML-Insights, you can also add relevant text, images, and links into dashboards, allowing for rich visual layouts.

Tables and pivot tables are among the most popular visual types among business users, and we’ve introduced a number of enhancements to these, including text wrap for headers, text alignment for content and headers, customization of header and total text, representing null data, and more. You can also now export tables and pivot tables to Excel, in addition to the existing CSV option.

On the filtering front, we’ve added support for on-sheet filter controls. You can create these controls with a single click, and also scope them to work across multiple datasets used on the dashboard sheet. Available controls include drop-downs, date/time pickers, slider controls, date range controls, and relative data controls—allowing for a breadth of options for dashboards. You can set these controls to cascade so that one filters the other and only shows relevant values to the end-user.

We also followed up on filter actions launched last year to add the ability to cascade these interactive options for users. This means you can choose a state from a visual on a dashboard, which filters the entire dashboard by the selected state, and then drill down further into the counties within that state presented on another visual. Lastly, we introduced dashboard persistence for readers, allowing them to pick up dashboards where they left off. This means that filter control and sheet selections are retained for users between visits.

Authors of dashboards continue to see new calculations available, with support added for:

  • Minimums and maximums for date fields
  • Modulo operations
  • First and last values of metrics and dimensional values
  • Discrete and continuous percentile

Themes for dashboards now also support a selection of fonts, allowing you to pick the font that aligns well with your design language and corporate standards.

Lastly, QuickSight dashboards now support an optimized scaling mode, which ensures that a dashboard is optimized for the most common screen size expected, yet scaled beautifully when on a projector or a larger screen, or emailed to end-users.

We also continue to invest in our mobile apps (iOS and Android), which allow you to access your dashboards on the go, and interact with filters, actions, drill downs, and more.


Scale your data with SPICE

One of the key benefits of choosing QuickSight for your BI needs is QuickSight's SPICE in-memory data store, which provides fast, interactive access to data for both authors and readers, and automatically scales to meet high concurrency needs. Each dataset in SPICE can now be up to 250 million rows in size (or 500 GB), up from 100 million rows previously. SPICE data can be ingested from supported database or SaaS sources via an API or on a schedule, with email notifications available upon failure.

For relational data sources and Athena, QuickSight supports both SPICE and direct query options, giving you the choice depending on your use case and performance needs. Choosing SPICE with summarized sets of data allows you to reduce traffic on your backend data stores while also automatically scaling to peaks in usage (such as at the end of the month) without lifting a finger. At the same time, direct query is useful when data may change frequently, or you need access to historical data residing in a data warehouse for occasional analysis. Dashboards in QuickSight can support a combination of both SPICE and direct query datasets, with cross visual filters and actions allowing fluid interactions.

Authors and data admins in QuickSight can combine data across multiple data sources (e.g., flat file, Amazon Redshift, Snowflake, Amazon RDS) and bring the resultant data into SPICE for dashboarding and analysis. In 2020, we added support for joining datasets using custom SQL statements with tables from another data source, other custom SQL datasets, or flat files. For more information, see Joining Data.

Centralized governance, security, and administration

With many organizations moving operations almost completely remote, governance, security, and administration have been key areas of focus this year. Within QuickSight, all operations are logged to AWS CloudTrail, providing auditability, alerting, and traceability as users make changes and access data or assets in QuickSight. Unlike many BI options in the market, QuickSight doesn't follow a disjoint client/server model, but instead offers centralized control over data, assets, and user capabilities. We launched two key features that further emphasize these: folders and user permissions.

Folders in QuickSight come in two types: personal folders that authors can use to organize content for themselves, or shared folders that admins can create and share with authors and readers (at the user or group level), with the ability to delegate permissions to authors. Shared folders allow strong permissions to be enforced across the organization and help control user access to data and sharing. User permissions customization allows admins to restrict user permissions to actions such as sharing assets, creating folders, or downloading data as CSV. The combination of CloudTrail logging, shared folders, and user-level permissions management allows admins to create strong governance around a QuickSight account.

A key element in a governed analytics setup is enforcing security controls and common definitions in datasets used across the organization. We introduced column-level security as a native option in QuickSight, easily configurable from the UI or APIs. This feature complements the existing row-level security feature. Together, they give authors control over who can access what data within their visualizations. QuickSight now also allows data owners to create datasets that provide pre-built calculations, column-level metadata, and other customizations that make it easier for authors consuming the dataset to understand the data and build on it.

QuickSight is the only BI tool to provide native AWS Identity and Access Management (IAM) permissions-based control over Amazon Simple Storage Service (Amazon S3), Amazon Athena, and other AWS native data sources. QuickSight has AWS PrivateLink support for secure data connectivity to databases in your VPC, data warehouses such as Amazon Redshift or Snowflake, or big data options such as Presto. QuickSight now also supports AWS Lake Formation permissions for data accessed via Athena and has added native connectivity to Amazon Timestream and Amazon Elasticsearch Service (Amazon ES). QuickSight also supports Oracle databases, either Amazon Relational Database Service (Amazon RDS) or running on Amazon Elastic Compute Cloud (Amazon EC2) or on premises.

On the account administration and management front, we launched APIs (including analysis and theme APIs) that enable admins to create a centralized view of assets within the account, and offer customizations that allow ISVs to provide a more integrated experience for embedded analytics users, including a default organization-specific theme for authors.
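
For example, a short boto3 sketch of using these APIs to inventory the dashboards, analyses, and themes in an account might look like the following (the account ID is a placeholder, and pagination is omitted for brevity):

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
account_id = "123456789012"  # placeholder

# List the main asset types an admin typically tracks
for dashboard in quicksight.list_dashboards(AwsAccountId=account_id)["DashboardSummaryList"]:
    print("dashboard:", dashboard["Name"])
for analysis in quicksight.list_analyses(AwsAccountId=account_id)["AnalysisSummaryList"]:
    print("analysis:", analysis["Name"])
for theme in quicksight.list_themes(AwsAccountId=account_id, Type="CUSTOM")["ThemeSummaryList"]:
    print("theme:", theme["Name"])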

Admins will also appreciated QuickSight’s support for a variety of authentication options, including Active Directory (using AWS AD Connector acting as a proxy to the on-premises AD) or federated IAM users via an identity provider such as Azure AD, Okta, Ping, or others. We refined the end-user experience for the latter by introducing options for the user to authenticate from the QuickSight login page (SP-initiated login), as opposed to navigating to the identity provider every time. This enables a business user-friendly experience. Customers wanting to turn off non-federated users within accounts can also do so by logging a support request.

Finally, we continue to make QuickSight accessible to more users and use cases, with expansion to the India (Mumbai) and AWS GovCloud (US-West) Regions, as well as support for five new European languages: Danish, Dutch, Finnish, Norwegian, and Swedish, making us available in 12 AWS Regions and 15 languages.

QuickSight learning resources

New to QuickSight? Use our 2-month free trial (four authors free for 2 months) to take a quick test drive. You will be asked to sign up for AWS if you don’t have an AWS account. If you already have an account from your company or organization, navigate to the AWS Management Console and choose QuickSight to get started. To learn more about embedding QuickSight, use our embedded developer portal and check out our API documentation.

If you’re an existing QuickSight user, or have used QuickSight in the past, check out our YouTube channel, or dive into one of the following workshops to learn more about your area of interest:

  • Build Advanced Analytics and Dashboards with Amazon QuickSight Part 1, Part 2 – If you’re new to QuickSight or looking to build advanced calculations in QuickSight, this workshop is for you. It provides step-by-step instructions to grow your dashboard building skills from basic to advanced level. You learn how to set up QuickSight in your AWS account, import and prepare data, and build advanced visuals to analyze data in a meaningful way.
  • Embed Amazon QuickSight Dashboards – In this virtual, instructor-led workshop, our Global QuickSight Solutions Architect walks you through setting up a reusable embedding framework. We also include an AWS CloudFormation template to quickly spin up this framework.
  • Administration on QuickSight – This virtual, instructor-led workshop is designed with a real-world ISV use case in mind. The ISV pipeline for data curation, data analysis, and dashboard publishing is addressed with distinct end-user personas. We also discuss how development, test, and production environments can be managed and operationalized by admins in a single or multiple QuickSight accounts.
  • Administration Level-Up – This virtual, instructor-led workshop is designed with a real-world ISV use case in mind. Learning objectives include automating dashboard deployment, customizing access to the QuickSight console, configuring for team collaboration, and implementing multi-tenancy and client user segregation.
  • Cost Intelligence Dashboard Setup – Existing AWS admins can check out this video of how to use QuickSight to quickly set up dashboards of their AWS Cost and Usage Report data and provide access to business users within the organization.

Looking ahead

To see the full list of 2020 launches, see What’s New in Amazon QuickSight or subscribe to the Amazon QuickSight YouTube channel for the latest training and feature walkthroughs. You can also check out the repository of QuickSight blogs in the AWS Big Data Blog.

We have a packed roadmap for 2021, and continue to focus on enabling you with insights from all your data that you can share with your users, while not having to worry about operations and servers. Thank you for your support.

We wish you all the very best in the New Year!


About the Authors

Jose Kunnackal John is a principal product manager for Amazon QuickSight.

Sahitya Pandiri is a technical program manager with Amazon Web Services.

New – Amazon QuickSight Q Answers Natural-Language Questions About Business Data

Post Syndicated from Harunobu Kameda original https://aws.amazon.com/blogs/aws/amazon-quicksight-q-to-answer-ad-hoc-business-questions/

We launched Amazon QuickSight as the first Business Intelligence (BI) service with Pay-per-Session pricing. Today, we are happy to announce the preview of Amazon QuickSight Q, a Natural Language Query (NLQ) feature powered by machine learning (ML). With Q, business users can now use QuickSight to ask questions about their data using everyday language and receive accurate answers in seconds.

For example, in response to questions such as, “What is my year-to-date year-over-year sales growth?” or “Which products grew the most year-over-year?” Q automatically parses the questions to understand the intent, retrieves the corresponding data and returns the answer in the form of a number, chart, or table in QuickSight. Q uses state-of-the-art ML algorithms to understand the relationships across your data and build indexes to provide accurate answers. Also, since Q does not require BI teams to pre-build data models on specific datasets, you can ask questions across all your data.

The Need for Q
Traditionally, BI engineers and analysts create dashboards to make it easier for business users to view and monitor key metrics. When a new business question arises and no answers are found in the data displayed on an existing dashboard, the business user must submit a data request to the BI Team, which is often thinly staffed, and wait several weeks for the question to be answered and added to the dashboard.

A sales manager looking at a dashboard that outlines daily sales trends may want to know what their overall sales were for last week, in comparison to last month, the previous quarter, or the same time last year. They may want to understand how absolute sales compare to growth rates, or how growth rates are broken down by different geographies, product lines, or customer segments to identify new opportunities for growth. This may require a BI team to reconstruct the data, create new data models, and answer additional questions. This process can take from a few days to a few weeks. Such specific data requests increase the workload for BI teams that may be understaffed, increases the time spent waiting for answers, and frustrates business users and executives who need the data to make timely decisions.

How Q Works
To ask a question, you simply type your question into the QuickSight Q search bar. Once you start typing in your question, Q provides autocomplete suggestions with key phrases and business terms to speed up the process. It also automatically performs spell check, and acronym and synonym matching, so you don’t have to worry about typos or remember the exact business terms in the data. Q uses natural language understanding techniques to extract business terms (e.g., revenue, growth, allocation, etc.) and intent from your questions, retrieves the corresponding data from the source, and returns the answers in the form of numbers and graphs.

Q further learns from user interactions from within the organization to continually improve accuracy. For example, if Q doesn’t understand a phrase in a question, such as what “my product” refers to, Q prompts the user to choose from a drop-down menu of suggested options in the search bar. Q then remembers the phrase for next time, thus improving accuracy with use. If you ask a question about all your data, Q provides an answer using that data. Users are not limited to asking questions that are confined to a pre-defined dashboard and can ask any questions relevant to your business.

Let’s see a demo. We assume that there is a dashboard of sales for a company.

Dashboard of Quicksight

The business users of the dashboard can drill down and slice and dice the data simply by typing their questions on the Q search bar above.

Let’s use the Q search bar to ask a question, “Show me last year’s weekly sales in California.” Q generates numbers and a graph within seconds.

Generated dashboard

You can click “Looks good” or “Not quite right” on the answer. When clicking “Not quite right,” you can submit your feedback to your BI team to help improve Q. You can also investigate the answer further. Let’s add “versus New York” to the end of the question and hit enter. A new answer will pop up.

Generated new graph

Next, let’s investigate further in California. Type in “What are the best selling categories in California.

categories detail

With Q, you can easily change the presentation. Let’s see another diagram for the same question.


Next, let’s take a look at the biggest industry, “Finance.” Type in “Show me the sales growth % week over week in the Finance sector” to Q, and specify “Line chart” to check weekly sales revenue growth.

The sales revenue shows growth, but it has peak and off-peak spikes. With these insights, you might now consider how to stabilize for a better profit structure.

Getting Started with Amazon QuickSight Q
A new “Q Topics” link will appear on the left navigation bar. Topics are a collection of one or more datasets and are meant to represent a subject area that users can ask questions about. For example, a marketing team may have Q Topics for “Ad Spending,” “Email Campaign,” “Website Analytics,” and others. Additionally, as an author, you can:

  • Add friendly names, synonyms, and descriptions to datasets and columns to improve Q’s answers.
  • Share the Topic to your users so they can ask questions about the Topic.
  • See questions your users are asking, how Q answered these questions, and improve upon the answer.

Select Topics, and set Topic name and its Description.

setting up topics

After clicking the Continue button, you can add datasets to a topic in two ways: You can add one or more datasets directly to your topic by selecting Add datasets, or you can import all the datasets in an existing dashboard into your topic by selecting Import dashboard.

The next step is to make your datasets natural-language friendly. Generally, names of datasets and columns are based on technical naming conventions and do not reflect how they are referred to by end users. Q relies heavily on names to match the right dataset and column with the terms used in questions. Therefore, such technical names must be converted to user-friendly names to ensure that they can be mapped correctly. Below are examples:

  • Dataset name: D_CUST_DLY_ORD_DTL → Friendly name: Customer Daily Order Details
  • Column name: pdt_cd → Friendly name: Product Code

Also, you can set up synonyms for each column so users can use the terms they are most comfortable with. For example, some users might input the term “client” or “segment” instead of “industry.” Q provides a feature to correct to the right name when typing the query, but BI operators can also set up synonyms for frequently used words. Click “Topics” in the left pane and choose the dashboard where you want to set synonyms.

Then, choose “datasets.

Now, we can set a Friendly Name or synonyms as Aliases, such as “client” for “Customer,” or “Segment” for “Industry.”

setting up Friendly Name

After adding synonyms, a user can save the changes and start asking questions in the Q search bar.

Amazon QuickSight Q Preview Available Today
Q is available in preview for US East (N. Virginia), US West (Oregon), US East (Ohio) and Europe (Ireland). Getting started with Q is just a few clicks away from QuickSight. You can use Q with AWS data sources such as Amazon Redshift, Amazon RDS, Amazon Aurora, Amazon Athena, and Amazon S3; third-party commercial sources such as SQL Server, Teradata, and Snowflake; and business applications such as Salesforce, ServiceNow, Adobe Analytics, and Excel. Q automatically works with all data sources supported by QuickSight.

Learn more about Q and get started with the preview today.

– Kame

New in Amazon QuickSight – session capacity pricing for large scale deployments, embedding without user provisioning, and developer portal for embedded analytics

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/new-in-amazon-quicksight-embedding-without-user-provisioning-session-capacity-pricing-and-embedded-developer-portal/

Amazon QuickSight Enterprise edition now offers a new, session capacity-based pricing model starting at $250/month, with annual commitment options that provide scalable pricing for embedded analytics and BI rollouts to 100s of 1000s of users. QuickSight now also supports embedding dashboards in apps, websites, and wikis without the need to provision and manage users (readers) in QuickSight, which utilizes this new pricing model. Lastly, we also have a new developer portal for embedded analytics that allows you to learn more about the different embedded solutions available with QuickSight and experience it first-hand.

Session Capacity Pricing

Amazon QuickSight’s new session capacity-based pricing model provides scalable pricing for large scale deployments. Session capacity pricing allows Developers, Independent Software Vendors (ISVs) and Enterprises to benefit from lower per-session rates as they roll out embedded analytics and BI to 100s of 1000s of users. In such scenarios, average session consumption per user is usually low (<10 sessions/month) but aggregate session usage across users within the account is high. With session capacity pricing, sessions are simply charged in 30-minute blocks of usage starting with first access. Session capacity pricing is also required for embedding without user management, where per-user pricing is not meaningful. Unlike traditional options for capacity pricing which require a server with annual commitments, QuickSight’s session capacity pricing allows you to get started easily with a $250/month starter option. QuickSight’s session capacities do not slow down with increased user concurrency or higher analytical complexity of dashboards (both common in a server-based model), but instead automatically scale to ensure a consistent, fast, end-user experience. After starting with the monthly option, you can move to one of QuickSight’s annual session capacity tiers as your sessions usage increases – ensuring that costs scale with growing usage. Any usage beyond the committed levels (monthly or annual) is charged at the overage session rates indicated, with no manual intervention for scaling needed – no more scrambling to add servers as you have bursts in usage or just greater success with your application/website. Annual session capacities are billed for monthly usage of sessions, with consumption of all committed sessions expected by end of the period. Annual session capacities allow sessions to be consumed across the year, providing flexibility in ramping up on production traffic, balancing session needs across busy/lean periods of the year (e.g., first/last week of the month are busy, holidays may be slow or busy depending on the nature of the business). For more details on the new pricing options, visit the QuickSight pricing page.

Embedding without user provisioning

Before this launch, Amazon QuickSight provided embedding dashboards in apps and portals where each end-user could be identified and provisioned in QuickSight, and charged using QuickSight’s per-user pricing model. This works well for situations where each end user can be uniquely identified, and often has permissions associated with their data access levels. The NFL chose QuickSight to embed dashboards in their next-gen stats portal and share insights with clubs, broadcasters and editorial teams. 1000s of customers have since chosen to launch embedded dashboards in QuickSight for both their internal and external end-users, including Blackboard, Comcast, Panasonic Avionics, EHE Health. Customers can scale from 10s of users to 100s of 1000s without any server provisioning or management and also benefit from being able to utilize embedded QuickSight authoring capabilities to enable self-service dashboard creation.

With today’s launch, QuickSight will now also enable use cases with dashboards for 100s of 1000s of readers, where it is not possible to provision and manage users (or is highly inconvenient to do so).

Examples of these situations include embedded dashboards for sharing public statistics within a city/county, dashboards of company-wide statistics available to all users to be embedded on an internal corporate portal, or embedded dashboards with the same data intended for 10s or 1000s of users per customer, department or job location/site within an app.

Let’s take a look at a couple of examples and then how to embed these dashboards. First, a dashboard that shows a live stream of the three main stock indices (S&P 500, DOW Jones, and NASDAQ) that uses QuickSight’s newly launched connector to Amazon Timestream to provide a real-time view of the market index in US Central Time.

Second, a dashboard showing industries and count of firms in those industries using a choropleth map at the state level with the ability to drill down to county-level data.

Both dashboards are set up so that they can be accessed without any user restrictions, and we don't have to set up users to roll this out. You can see these dashboards accessible for anyone here. To try this out, you can use the AWS Command Line Interface (AWS CLI); note that the AWS CLI used here is simply to illustrate this process, and for actual integration into a website/app you have to use the AWS SDK to obtain an embeddable URL for every new visit to the page.

4 steps to embed a dashboard

  1. Configure an AWS Identity and Access Management (IAM) role for your application to use for embedding
    arn:aws:iam::xxxxxxxxxxxx:role/YourApplicationRole

  2. Attach the following policy to the role so the role can run the GetDashboardEmbedURL API for anonymous identity type:
    {
       "Version":"2012-10-17",
       "Statement":[
          {
             "Effect":"Allow",
             "Action":[
                "quicksight:GetDashboardEmbedUrl",
                "quickSight:GetAnonymousUserEmbedUrl"
             ],
             "Resource":[
                "replace with ARN for dashboard1",
                "replace with ARN for dashboard2",
                "replace with ARN for dashboard3"
             ]
          }
       ]
    }

  3. Run the GetDashboardEmbedURL API with IdentityType set to ANONYMOUS. This returns a response with the EmbedURL:
    INPUT
    aws quicksight get-dashboard-embed-url \
        --region us-west-2 \
        --namespace default \
        --dashboard-id c6e91d20-0909-4836-b87f-e4c115a88b65 \
        --identity-type ANONYMOUS \
        --aws-account-id 123456789012
        
    RESPONSE
    {
        "Status": 200,
        "EmbedUrl": "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
        "RequestId": "1c82321c-6934-45b0-83f4-a8ce2f641067"
    }

You can embed the returned URL in the IFRAME code of your application. Make sure your application is added to the allow list in QuickSight. This is a single-use URL and has to be generated dynamically by invoking get-dashboard-embed-url from the backend compute layer upon each load of the parent page.
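
In an application backend, the equivalent call is made with an AWS SDK on every load of the parent page; the following is a minimal boto3 sketch that reuses the account and dashboard IDs from the CLI example above:

import boto3

quicksight = boto3.client("quicksight", region_name="us-west-2")

# Generate a fresh, single-use embed URL on every page load
response = quicksight.get_dashboard_embed_url(
    AwsAccountId="123456789012",
    DashboardId="c6e91d20-0909-4836-b87f-e4c115a88b65",
    IdentityType="ANONYMOUS",
    Namespace="default",
    SessionLifetimeInMinutes=60,
)
embed_url = response["EmbedUrl"]  # pass this to the embedding SDK or IFRAME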

  4. Embed the dashboard in your application with the following HTML code:
    <html>
    <head>
        <title>Basic Embed</title>
        <!-- You can download the latest QuickSight embedding SDK version from https://www.npmjs.com/package/amazon-quicksight-embedding-sdk -->
        <!-- Or you can do "npm install amazon-quicksight-embedding-sdk", if you use npm for javascript dependencies -->
        <script src="./quicksight-embedding-js-sdk.min.js"></script>
        <script type="text/javascript">
            var dashboard;
    
            function embedDashboard() {
                var containerDiv = document.getElementById("embeddingContainer");
                var options = {
                    // replace this dummy url with the one generated via embedding API
                    url: "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/dashboardId?isauthcode=true&identityprovider=quicksight&code=authcode",  
                    container: containerDiv,
                    scrolling: "no",
                    height: "700px",
     
                    footerPaddingEnabled: true
                };
                dashboard = QuickSightEmbedding.embedDashboard(options);
            }
        </script>
    </head>
    
    <body onload="embedDashboard()">
        <div id="embeddingContainer"></div>
    </body>
    
    </html> 

If you use NPM to manage your front-end dependencies, run npm install amazon-quicksight-embedding-sdk. Then you can use the following code to embed the URL in your application:

import { embedDashboard } from 'amazon-quicksight-embedding-sdk';

var options = {
    url: "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    container: document.getElementById("embeddingContainer"),
    parameters: {
        country: "United States",
        states: [
            "California",
            "Washington"
        ]
    },
    scrolling: "no",
    height: "700px",
    width: "1000px",
    locale: "en-US",
    footerPaddingEnabled: true
};
const dashboardSession = embedDashboard(options);

If you want to embed multiple dashboards and switch between them, you can pass more dashboard IDs by using the additional-dashboard-ids option while generating the URL. This generates the URL with authorization for all specified dashboard IDs and launches the dashboard specified under the dashboard-id option. See the following code:

INPUT
aws quicksight get-dashboard-embed-url \
    --region us-west-2 \
    --namespace default \
    --dashboard-id c6e91d20-0909-4836-b87f-e4c115a88b65 \
    --identity-type ANONYMOUS \
    --aws-account-id 123456789012 \
    --additional-dashboard-ids dashboardid1 dashboardid2
    
RESPONSE
{
    "Status": 200,
    "EmbedUrl": "https://us-west-2.quicksight.aws.amazon.com/embed/bc973ae439ce45b49e011c9fc8c855ea/dashboards/c6e91d20-0909-4836-b87f-e4c115a88b65?code=AYABeJcLJ0WqjqtWBi0sdFZ2GP8AAAABAAdhd3Mta21zAEthcm46YXdzOmttczp1cy13ZXN0LTI6ODQ1MzU0MDA0MzQ0OmtleS85ZjYzYzZlOS0xMzI3LTQxOGYtODhmZi1kM2Y3ODExMzI5MmIAuAECAQB421ynKsVxdYWD7qmNX3Zzbra88wGZIZL-RXp78eF_lpIBMX2cuRvnCU-OpFLUps57PQAAAH4wfAYJKoZIhvcNAQcGoG8wbQIBADBoBgkqhkiG9w0BBwEwHgYJYIZIAWUDBAEuMBEEDOfOUSMMuEepqj8bzAIBEIA77tMfykw7WnJT__2sRSInn0gymHK1_vXmBAWAlyG2mwcNsD-HGI3xNNUoaSEUdvFQ6c0XuFQAgLz8ufICAAAAAAwAABAAAAAAAAAAAAAAAAAAHkJ674fL1usbIVG0oIYfCv____8AAAABAAAAAAAAAAAAAAABAAAAm3wfjZv5rICOROeYqIPsu6jFmWxU6fBEnSHTBkaw4ZPLnIGr3Cr1HU0D7DJM90dmCQ6t9kTVOy2XdgwNm606yqoEhSjwq4OWU-_rjGilwbKpes_5uKZR0IZNh2SMqgUPuu4Q1z884FhHQmX3yRI_RxWEyTnjR2sajl1m6OQCgvRJ3kEeh3cB0wWSsSdcUeZt-iNxYRbckKa3Eb6viPXHYRs-Q_skcSTsjfJ6GQ%3D%3D&identityprovider=quicksight&isauthcode=true",
    "RequestId": "1c82321c-6934-45b0-83f4-a8ce2f641067"
}

In the preceding code, we assign our primary dashboard’s ID to the --dashboard-id option, and the other dashboard IDs to the --additional-dashboard-ids option as a space-separated value. You can pass up to 20 dashboard IDs in this option.

The EmbedURL value in the response is the URL for the primary dashboard. You can embed this dashboard in your app, wiki, or website and use the JavaScript SDK to switch between dashboards. To switch to another dashboard without having to generate a fresh embed URL, invoke the navigateToDashboard function (available in our JavaScript library) with any of the dashboard IDs that were initially included in the get-dashboard-embed-url call. See the following example code:

var options = {
    dashboardId: "dashboardid1",
    parameters: {
        country: [
            "United States"
        ]
    }
};
dashboard.navigateToDashboard(options);

For more information about the JavaScript SDK, see the GitHub repo.

Embedding without users is now generally available to all Amazon QuickSight Enterprise Edition users and requires session capacity pricing. 

To embed dashboards without user provisioning, or simply to enable discounted QuickSight usage for large scale deployments, you can opt in to session capacity pricing starting with the $250/month option via the “Manage QuickSight” > “Your subscriptions” page accessible to QuickSight administrators.

Embedded Developer Portal

We have a new embedded developer portal at https://developer.quicksight.aws that allows you to quickly interact with three key embedded scenarios – 1) embedded dashboards accessible to anyone accessing a website or portal (no user provisioning required), 2) embedded dashboards accessible to only authenticated users, and 3) embedded dashboard authoring for power users of apps.

The interactive dashboards, code snippets, and setup instructions allow you to learn more about the rich capabilities of QuickSight embedding and get started with embedding easily.

Summary

With this new embedding capability, QuickSight offers a modern, serverless approach to deploying dashboards and visualizations into websites, apps and corporate portals in hours, without any user provisioning or management needed to scale to 100s of 1000s of users. Unlike traditional server-based models, QuickSight's session capacity model allows you to start at a low $250/month, month-to-month price point. As usage grows, the available annual commitment models offer scalable pricing by reducing the per-session cost – enabling both ISVs and Enterprises to roll out embedded QuickSight dashboards at large scale, whether for internal or external users. Finally, with QuickSight's new developer portal, you have instant access to samples of interactive embedded QuickSight dashboards as well as steps for integrating various embedded capabilities into your own websites, apps and portals.


About the Authors

Jose Kunnackal John is Sr. Manager for Amazon QuickSight, AWS' cloud-native, fully managed BI service. Jose started his career with Motorola, writing software for telecom and first responder systems. Later he was Director of Engineering at Trilibis Mobile, where he built a SaaS mobile web platform using AWS services. Jose is excited by the potential of cloud technologies and looks forward to helping customers with their transition to the cloud.

Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight, he was with AWS Marketplace and Amazon retail as a PM. Kareem started his career as a developer and then was a PM for call center technologies, Local Expert, and Ads for Expedia. He worked as a consultant with McKinsey and Company for a short while.

Arun Santhosh is a Specialized World Wide Solution Architect for Amazon QuickSight. Arun started his career at IBM as a developer and progressed on to be an Application Architect. Later, he worked as a Technical Architect at Cognizant. Business Intelligence has been his core focus in these prior roles as well.

Accessing and visualizing external tables in an Apache Hive metastore with Amazon Athena and Amazon QuickSight

Post Syndicated from James Sun original https://aws.amazon.com/blogs/big-data/accessing-and-visualizing-external-tables-in-an-apache-hive-metastore-with-amazon-athena-and-amazon-quicksight/

Many organizations have an Apache Hive metastore that stores the schemas for their data lake. You can use Amazon Athena due to its serverless nature; Athena makes it easy for anyone with SQL skills to quickly analyze large-scale datasets. You may also want to reliably query the rich datasets in the lake, with their schemas hosted in an external Hive metastore. In response to customers’ requests, AWS announced Athena’s support for Hive metastore in early 2020. This extends the ability in Athena to query external data stores, with a combined benefit of better performance and lower cost.

In this post, we provide an AWS CloudFormation template that configures a remote Hive metastore based on Amazon Relational Database Service (Amazon RDS) and MySQL with Amazon EMR located in a private subnet to perform ETL tasks. We then demonstrate how you can use a Spark step to pull COVID-19 datasets from a public repository and transform the data into a performant Parquet columnar storage format. We also walk through the steps to query the data with Athena and visualize it with Amazon QuickSight. QuickSight is a fully managed data visualization service; it lets you easily create and publish interactive dashboards by analyzing data from various data sources, including Athena.

Solution walkthrough

The following diagram shows the architecture for our solution.


As shown in the preceding architecture, we have an Availability Zone within a VPC in an AWS Region. The Availability Zone hosts subnets that are either public or private. A multi-master EMR cluster that has Hive software components installed is launched in a private subnet with egress internet traffic through a NAT gateway to download data from public sites for analysis. The multi-master feature also ensures that the primary nodes are highly available. The Hive metastore is backed by a remote RDS for MySQL instance located in the same private subnet.

We also have an Amazon Simple Storage Service (Amazon S3)-based data lake. A Spark step in Amazon EMR retrieves the data in CSV format, saves it in the provisioned S3 bucket, and transforms the data into Parquet format. The Spark step creates an external Hive table referencing the Parquet data and is ready for Athena to query.

With Athena, you can create a data source connector based on an AWS Lambda function to access the Hive metastore hosted on the EMR cluster by using the Apache Thrift interface.

The connector is registered as a catalog; when the catalog is referenced in a SQL statement in Athena, Athena invokes the Lambda function. The function exits if the connector is not active for 15 minutes. For queries that run longer than 15 minutes, it's recommended to let the query complete before retrieving the query results in an Amazon S3 location you can specify.

In Athena, you can compose a SQL statement against the Hive tables with predicates to further limit the size of the query result for faster visualization by QuickSight.

Deploying the resources with AWS CloudFormation

To demonstrate our solution, we provide a CloudFormation template that you can download to easily deploy the necessary AWS resources. The template creates the following resources to simulate the environment:

  • VPC and subnets – A VPC with one public subnet and one private subnet. A NAT gateway is also created to allow outbound internet access from the EMR cluster to download public COVID-19 datasets from the Johns Hopkins GitHub repo.
  • EMR cluster – A multi-master EMR cluster with Hive, running on three primary nodes (m5.xlarge) and two core nodes (m5.xlarge), is launched to support the thrift connection required by the Athena Lambda connectors.
  • Amazon RDS for MySQL database – An RDS for MySQL primary instance is launched in the same subnet as the EMR cluster. The RDS instance serves as the Hive metastore backend data store.
  • S3 bucket – An S3 bucket stores files in Parquet format by Amazon EMR and is accessed later by Athena.
  • AWS IAM users – Two AWS Identity and Access Management (IAM) users belonging to different user groups. The first user, the data engineer, has permissions to access the Lambda-based Athena data source connector. The other user, the salesperson, does not have permissions to access the connector.

To get started, you need to have an AWS account. If you don’t have one, go to aws.amazon.com to sign up for one. Then complete the following steps:

  1. Sign in to the AWS Management Console as an IAM power user, preferably an admin user.
  2. Choose Launch Stack to launch the CloudFormation template:

This template has been tested in the US East (N. Virginia) Region.

  1. Choose Next.

You’re prompted to enter a few launch parameters.

  1. For Stack name, enter a name for the stack (for example, athena-hms).
  2. For Hive Metastore Host Number, choose the Amazon EMR primary node to connect to (or use the default value).
  3. Continue to choose Next and leave other parameters at their default.
  4. On the review page, select the three check boxes to confirm that AWS CloudFormation might create resources.
  5. Choose Create stack.

The stack takes 15–20 minutes to complete.

  1. On the Outputs tab of the stack details, save the key-value pairs to use later.

When the EMR cluster is provisioned, it uses a bootstrap action to install the necessary Python libraries. It runs an Amazon EMR Spark step (a PySpark script) that downloads three COVID-19 datasets for confirmed, recovered, and death cases from the Johns Hopkins GitHub repo in CSV format and stores them in the csv subfolder of the S3 bucket created by the CloudFormation stack. Lastly, the final transformed data is converted to Parquet format and external Hive tables are created referencing the Parquet data located in the parquet subfolder of the S3 bucket.

The bootstrap action and the Spark step script referenced by the CloudFormation template perform this download and transformation work.
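
The original scripts are not reproduced here; the following is a simplified PySpark sketch of what the Spark step does for one of the three datasets. The bucket name, column handling, and table name are assumptions, and the actual step also processes the recovered and death datasets:

# Hedged sketch of the Spark step; names and paths are assumptions.
import pandas as pd
from pyspark.sql import SparkSession

BUCKET = "your-covid-data-bucket"  # assumption: the S3 bucket created by the stack
URL = ("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
       "csse_covid_19_data/csse_covid_19_time_series/"
       "time_series_covid19_confirmed_global.csv")

spark = SparkSession.builder.appName("covid-etl").enableHiveSupport().getOrCreate()

# Download the confirmed-cases time series and reshape the wide per-date columns into rows
pdf = pd.read_csv(URL).rename(columns={"Country/Region": "country", "Province/State": "province"})
pdf["province"] = pdf["province"].fillna("")
long_pdf = pdf.melt(id_vars=["province", "country", "Lat", "Long"],
                    var_name="report_date", value_name="confirmed")
df = spark.createDataFrame(long_pdf)

# Land the raw data as CSV and the transformed data as Parquet in the data lake
df.write.mode("overwrite").option("header", "true").csv(f"s3://{BUCKET}/covid_data/csv/confirmed/")
df.write.mode("overwrite").parquet(f"s3://{BUCKET}/covid_data/parquet/confirmed/")

# Register an external (unmanaged) Hive table over the Parquet data so Athena can query it
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS default.covid_confirmed_cases
    USING PARQUET
    LOCATION 's3://{BUCKET}/covid_data/parquet/confirmed/'
""")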

To validate the data, on the Amazon S3 console, choose the bucket name from the CloudFormation template outputs. You should see a covid_data folder in the bucket. The folder contains the two subfolders, csv and parquet, which store the raw CSV and transformed Parquet data, respectively.

Querying the data in Athena

The CloudFormation template creates two users belonging to two different AWS IAM groups. The de_user_us-east-1_athena-hms user is a data engineer with permissions to access the Lambda function to communicate with the Hive metastore using the Athena data source connector. They belong to the group athena-hms-DataEngineerGroup-xxxxxxx. The sales_user_us-east-1_athena-hms user is a salesperson with no access to the connector. They belong to the group athena-hms-SalesGroup-xxxxx. 

To query the data, first retrieve your secret values in AWS Secrets Manager:

  1. On the Secrets Manager console, choose Secrets.
  2. Choose DataEngineerUserCreds.

  1. Choose Retrieve secret value.

 

  1. Save the username and password.

 

  1. Repeat these steps for SalesUserCreds. 

With the data in place, we can now use Athena to query it.

Accessing data as the data engineer

To query the data as the de_user_us-east-1_athena-hms user, complete the following steps:

  1. Sign in to the Athena console as the de_user_us-east-1_athena-hms user with the credentials retrieved from Secrets Manager.

After logging in, we need to create a data source for the Hive metastore.

  1. On the navigation bar, choose Data sources.
  2. Choose Connect data source.

  1. For Choose where your data is located, select Query data in Amazon S3.
  2. For Choose a metadata catalog, select Apache Hive metastore.
  3. Choose Next.

  1. For Lambda function, choose the function the CloudFormation template created with the key HiveMetastoreFunctionName.
  2. For Catalog name, enter a name for the Athena catalog (for example, demo_hive_metastore).
  3. Choose Connect.

You should now have a catalog named demo_hive_metastore with the catalog type Hive metastore.

  1. On the navigation bar, choose Query editor.
  2. Enter the following SQL statement:
    SELECT *
    FROM demo_hive_metastore.default.covid_confirmed_cases
    WHERE country = 'US'
            OR country = 'Italy'
            OR country = 'India'
            OR country = 'Brazil'
            OR country = 'Spain'
            OR country = 'United Kingdom'

This SQL statement selects all the columns in the covid_confirmed_cases table with predicates to only include a few countries of interest. We use a table name with the pattern <catalog name>.<hive database name>.<hive table name in the database>, which for this post translates to demo_hive_metastore.default.covid_confirmed_cases.

  1. Choose Run query.

The following screenshot shows your query results.

Make sure you completely sign out of the console before moving on to the next steps.

Accessing data as the salesperson

Sign in to the console as the sales_user_us-east-1_athena-hms user. Because the salesperson doesn't have the appropriate IAM policies to access the Hive metastore connector, you can't see the tables.

 

Permissions policies

In addition to the managed AmazonAthenaFullAccess policy, the data engineer's IAM group has the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx and <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policies created by the CloudFormation template attached. Therefore, the data engineer can view the tables, but the salesperson can't.

These policies are available on the IAM console, on the Permissions tab for the group <stack-name>-DataEngineerGroup-xxxxx.

The following sample JSON code is the <stack-name>-DataBucketReadWriteAccessPolicy-xxxxx policy to allow access to the provisioned S3 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:Get*",
                "s3:List*",
                "s3:Put*"
            ],
            "Resource": [
                "arn:aws:s3:::<provisioned bucket name>",
                "arn:aws:s3:::<provisioned bucket name>/",
                "arn:aws:s3:::<provisioned bucket name>/*"
            ],
            "Effect": "Allow"
        }
    ]
}

The following sample JSON code is the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy to allow access to the Lambda Hive metastore function:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "lambda:GetFunction",
                "lambda:GetLayerVersion",
                "lambda:InvokeFunction"
            ],
            "Resource": [
                "arn:aws:lambda:us-east-1:<account id>:function:athena-hms-HiveMetastoreFunction"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:PutObject",
                "s3:ListMultipartUploadParts",
                "s3:AbortMultipartUpload"
            ],
            "Resource": [
               "arn:aws:s3:::<provisioned bucket name>/hms_spill"
            ],
            "Effect": "Allow"
        },
        {
            "Action": [
                "lambda:ListFunctions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

Next, we walk through using QuickSight to visualize the results. Make sure you completely sign out of the console as the salesperson user before proceeding to the next steps. 

Signing up for QuickSight

You can skip this section if you have already signed up for QuickSight.

  1. Sign in to the console as the IAM power user who deployed the CloudFormation template or any user with enough IAM privileges to set up QuickSight.
  2. On the QuickSight console, choose Sign up for QuickSight.
  3. Select either the Standard or Enterprise edition for QuickSight.
  4. Choose Continue.

  1. For QuickSight account name, enter your AWS account ID.
  2. For Notification email address, enter your email.
  3. Select Amazon S3.

  1. Select the provisioned S3 bucket to grant QuickSight permission to access.
  2. Choose Finish.

Your QuickSight account should now be set up.

Attaching the Hive metastore access policy

Before you can use QuickSight, you have to attach the Hive metastore access policy to the QuickSight service role.

  1. On the IAM console, search for the service role aws-quicksight-service-role-v0.
  2. Choose the role.

  1. Search for the <stack-name>-HiveMetastoreDataSourceAccessPolicy-xxxxx policy.
  2. Select the policy and attach it to the QuickSight service role.

Creating your data source and performing data conversions

Before we can create visualizations, we need to set up our data source.

  1. Download the SQL script covid-19.sql.
  2. On the QuickSight console, choose Datasets in the navigation pane.

  1. Choose New dataset.

  1. Choose Athena.

  1. In the pop-up window, for Data source name, enter demo_hive_metastore.
  2. Choose Validate.
  3. When the connection is validated, choose Create data source.

  1. In the next window, choose Use custom SQL.

  1. Enter the content of the covid-19.sql script in the query window.
  2. Choose Confirm query.

  1. Leave Import to SPICE for quicker analytics selected.
  2. Choose Visualize.

Now we perform a few data type conversions before visualizing the data.

  1. Choose the Edit icon next to Data set on the menu bar.

  1. Choose the … icon.
  2. Choose Edit.

  1. Expand date and choose Change data type.
  2. Choose Date.

  1. Enter yyyy-MM-dd to convert the date format.
  2. Choose Update.

Now we create a coordinate using the latitude and longitude values.

  1. Expand lat and choose Add to coordinates.

  1. Leave Create new geospatial coordinates selected.
  2. Choose Add.

  1. In the pop-up window, for Name your coordinates, enter coordinate.
  2. For Field to use for longitude, choose lon.
  3. Choose Create coordinates.

  1. Choose Save and visualize on the menu bar.

Creating visualizations in QuickSight

Now we can visualize our data. For this post, we create a map visualization.

  1. For Visual types, choose the map visual.

  1. For Geospatial, drag coordinates.
  2. For Size, drag confirmed.
  3. For Color, drag country.

This world map shows the accumulated confirmed cases for selected countries over time; you need to use a filter to look at confirmed cases on a specific date.

  1. In the navigation pane, choose Filter.
  2. Choose the + icon.
  3. For Filter type, choose Time range.
  4. Choose start and end dates, such as 2020-09-10 00:00 and 2020-09-11 00:00, respectively.
  5. Choose Apply.

This plots the confirmed cases on September 10, 2020, for these countries.

Similarly, you can choose other visual types, such as a line chart, and generate the mortality rate for selected countries over time.

Using highly available primary nodes of the Amazon EMR cluster

The EMR cluster has a multi-master configuration with three primary nodes running to meet high availability requirements. At any time, the Lambda function communicates with one of these three EMR primary nodes. In the rare event that this node goes down, you can quickly re-establish the Athena data source connector to the external Hive metastore by failing over to another active primary node.

To perform this failover, complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks.
  2. Choose athena-hms.
  3. Choose Update.
  4. Select Use current template.
  5. Choose Next.
  6. For Hive Metastore Host Number, choose a host other than the current one you’re using.

  1. Choose Next.
  2. Acknowledge that AWS CloudFormation might create IAM resources.
  3. Choose Update stack.

In less than a minute, you should be able to access the Hive metastore and continue to query on the Athena console.
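If you'd rather script this failover, a minimal boto3 sketch follows. The parameter key HiveMetastoreHostNumber is an assumption based on the template's parameter label, so confirm the actual key on the stack's Parameters tab before using it:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# Update only the Hive metastore host parameter, reusing the current template
cfn.update_stack(
    StackName="athena-hms",
    UsePreviousTemplate=True,
    Parameters=[
        # Assumed parameter key; verify it against the stack's Parameters tab
        {"ParameterKey": "HiveMetastoreHostNumber", "ParameterValue": "2"},
        # Pass every other template parameter with UsePreviousValue=True, for example:
        # {"ParameterKey": "SomeOtherParameter", "UsePreviousValue": True},
    ],
    Capabilities=["CAPABILITY_IAM", "CAPABILITY_NAMED_IAM"],
)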

Cleaning up

You may want to clean up the demo environment when you’re done. To do so, on the AWS CloudFormation console, select the template and choose Delete.

This action also deletes the S3 bucket and any data in it. If you want to retain the data for future use, you should make a copy of the bucket before you delete it.

Summary

In this post, we walked through a solution using Athena to query external Hive tables with public COVID-19 datasets hosted in an S3 bucket and visualizing the data with QuickSight. We provided a CloudFormation template to automate the deployment of necessary AWS services for the demo. We encourage you to use these managed and scalable services for your specific use cases in production.


About the Authors

James Sun is a Senior Solutions Architect with Amazon Web Services. James has over 15 years of experience in information technology. Prior to AWS, he held several senior technical positions at MapR, HP, NetApp, Yahoo, and EMC. He holds a PhD from Stanford University.

 

 

Chinmayi Narasimhadevara is a Solutions Architect with Amazon Web Services. Chinmayi has over 15 years of experience in information technology and has worked with organizations ranging from large enterprises to mid-sized startups. She helps AWS customers leverage the correct mix of AWS services to achieve success for their business goals.

 

 

Gagan Brahmi is a Solutions Architect focused on Big Data & Analytics at Amazon Web Services. Gagan has over 15 years of experience in information technology. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Applying row-level and column-level security on Amazon QuickSight dashboards

Post Syndicated from Niyati Upadhyay original https://aws.amazon.com/blogs/big-data/applying-row-level-and-column-level-security-on-amazon-quicksight-dashboards/

Amazon QuickSight is a cloud-scale business intelligence (BI) service that you can use to deliver easy-to-understand insights to the people you work with, wherever they are. QuickSight connects to your data in the cloud and combines data from many different sources. On a single data dashboard, QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more.

QuickSight users in your organization should have access to only certain data for compliance and security reasons. Without proper integrated features to enforce row-level and column-level security, you have to develop additional solutions such as views, data masking, or encryption, or try to integrate third-party solutions around your data to enforce security.

In this post, we walk through how to implement row-level and column-level security in QuickSight.

Overview of solution

You can use row-level and column-level security to share a dashboard with several users who can only view the data that they have permissions to. For example, if you have an internal team of sales agents who contact a list of potential customers, you need a single dashboard that is accessible to all the agents but displays only the list of prospects assigned to them.

Another example could be regarding financial data. You might want to restrict access to sensitive fields to a small set of users. You could further restrict access based on some other field, such as geography. Let’s assume we have three users: A, B and C. Users A and B are admin users and have access to financial data, but user C does not. Furthermore, users A and B are admins for different countries and have access to financial data for their respective countries only.

In this post, we set up row-level and column-level security permissions in our dataset for different sets of users and experiment with grant and deny access rules. We walk you through the following steps:

  1. Upload the dataset and apply the permissions to it.
  2. Create appropriate filters and reports for users.
  3. Log in as different users and see how the dashboard’s data has been tailored just to what they are allowed to see.

Prerequisites

For this walkthrough, you should have the following prerequisites:

In the Enterprise edition of QuickSight, you can restrict access to a dataset by configuring row-level and column-level security. Only the people you shared with can see any of the data.

You can also send dashboard snapshots as emails to groups of readers, with each reader receiving customer reports as per the security configurations set on the dashboard.

Granting row-level access

For our use case, a business analyst in the marketing team creates a dataset of monthly sales in QuickSight using the file month_b2bsalesdata.xlsx. The following screenshot shows the dataset details. We use the location field to restrict user access to the data.

To grant row-level access, complete the following steps:

  1. Upload the permission file and define appropriate user access.

In the permissions file, the location fields are the same as in our primary month_b2bsalesdata dataset. It doesn't matter what order the fields are in. However, all the fields are case-sensitive and must exactly match the field names and values in the dataset. You must have at least one field that identifies either users or groups. The field you use for users or groups can have any name you choose.

If you’re specifying groups, use only QuickSight groups or Microsoft AD groups.

The following screenshot shows the dataset rules (permissions) for row-level security. Here we have defined access for users to different states and cities.

  1. On the Datasets tab, choose the dataset you want to enable row-level security for.
  2. Choose Row-level security.

  1. For Selected data set rules, select the appropriate permission file (for this post, permissions.csv).
  2. For Permission Policy, you can grant permissions. For this post, select Grant access to data set.

By default, access is denied for all users, and you use the permissions file to allow access to the data. For example, if a user doesn't have a row in the permissions file, they don't see any data in the report. If a user's row has blank values (empty strings ""), they see everything.
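To make the rules concrete, the following sketch writes a permissions file that matches the access described in the next section. The UserName column name is arbitrary (any name works for the user field), and state and city are illustrative stand-ins for the location fields in month_b2bsalesdata:

import csv

# Illustrative dataset rules, consistent with the access described later in this post:
# AlejandroRosalez -> California/Los Angeles and Texas/Fort Worth only,
# SaanviSarkar     -> any city, but only where the state is Texas,
# MarthaRivera     -> any state and any city (blank values mean "everything").
rows = [
    {"UserName": "AlejandroRosalez", "state": "California", "city": "Los Angeles"},
    {"UserName": "AlejandroRosalez", "state": "Texas",      "city": "Fort Worth"},
    {"UserName": "SaanviSarkar",     "state": "Texas",      "city": ""},
    {"UserName": "MarthaRivera",     "state": "",           "city": ""},
]

with open("permissions.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["UserName", "state", "city"])
    writer.writeheader()
    writer.writerows(rows)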

  1. Choose Apply data set.

Testing row-level restrictions

You can now test the restrictions by logging in as various users.

  1. Log in as the user AlejandroRosalez.

This user has access to data where the state is California or Texas, and the city is Los Angeles or Fort Worth. The following screenshot shows the data visible to AlejandroRosalez.

  1. Log in using SaanviSarkar.

The following screenshot shows the data visible to SaanviSarkar. They can see data for any city, but only where the state is Texas.

  1. Log in using MarthaRivera.

The following screenshot shows the data visible to MarthaRivera. Martha can see the data for any city or state.

  1. Log in using the workshop user, which isn’t present in the permissions.csv file.

The following screenshot shows that no data is visible to the workshop user.

Granting column-level access

We can also grant access at the column level to specific users and groups.

  1. On the QuickSight console, on the Datasets page, choose the dataset you want to enable column-level security for.
  2. Choose Column-level security.

By default, all users and groups have access to the data. Column-level security allows you to manage access to specific columns in your dataset.

You can restrict column access to particular users and groups. In the following screenshot, I’m restricting access for the sales column only to PauloSantos and ZhangWei.

Testing column-level access

To test these additional restrictions, log in as the ZhangWei user.

The following screenshot shows the data visible to the ZhangWei user. They can see all the columns.

If you log in as the workshop user, Sales is marked with the Restricted symbol in the Fields list pane. This user doesn’t have access to view or use the Sales field in any visual within the analysis.

Pivot tables and regular tables

If the column is in the Values field, you can see Not Authorised in pivot tables and regular tables.

In the following screenshots, you can see Sales is in the Values field.

If the column is in the Group by, Row, or Column field, you get an error message saying You can’t access this visual because this field is restricted : Sales.

In the following screenshots, you can see Sales is in the Column and Group by fields.

Cleaning up

To avoid incurring future charges, delete the QuickSight users and Enterprise account.

Conclusion

This post showed how QuickSight allows you to set up row-level and column-level security at the user and group level. QuickSight also supports emailing dashboards with row-level security enabled. You can send dashboard snapshots as emails to groups of readers, and each reader receives custom reports as per the security configurations set on the dataset. For more information, see Sending Reports by Email.

You can try this solution for your own use cases. If you have comments or feedback, please leave them in the comments.


About the Author

Niyati Upadhyay is a Solutions Architect at AWS. She joined AWS in 2019 and specializes in building and supporting Big Data solutions that help customers analyze and get value out of their data.

Event-driven refresh of SPICE datasets in Amazon QuickSight

Post Syndicated from Dylan Qu original https://aws.amazon.com/blogs/big-data/event-driven-refresh-of-spice-datasets-in-amazon-quicksight/

Businesses are increasingly harnessing data to improve their business outcomes. To enable this transformation to a data-driven business, customers are bringing together data from structured and unstructured sources into a data lake. Then they use business intelligence (BI) tools, such as Amazon QuickSight, to unlock insights from this data.

To provide fast access to datasets, QuickSight provides a fully managed calculation engine called SPICE—the Super-fast, Parallel, In-Memory Calculation Engine. At the time of writing, SPICE enables you to cache up to 250 million rows or 500 GB of data per dataset.

To extract value from the data quickly, you need access to new data as soon as it’s available. In this post, we describe how to achieve this by refreshing SPICE datasets as part of your extract, transform, and load (ETL) pipelines.

Solution architecture

In this post, you automate the refresh of SPICE datasets by implementing the following architecture.

This architecture consists of two parts: an example ETL job and a decoupled event-driven process to refresh SPICE.

For the ETL job, you use Amazon Simple Storage Service (Amazon S3) as your primary data store. Data lands in an S3 bucket, which we refer to as the raw zone. An Amazon S3 trigger configured on this bucket triggers an AWS Lambda function, which starts an AWS Glue ETL job. This job processes the raw data and outputs processed data into another S3 bucket, which we refer to as the processed zone.

This sample ETL job converts the data to Apache Parquet format and stores it in the processed S3 bucket. You can modify the ETL job to achieve other objectives, like more granular partitioning, compression, or enriching of the data. The Glue Data Catalog stores the metadata and QuickSight datasets are created using Amazon Athena data sources.

To trigger the SPICE dataset refresh, after the ETL job finishes, an Amazon EventBridge rule triggers a Lambda function that initiates the refresh.

In summary, this pipeline transforms your data and updates QuickSight SPICE datasets upon completion.

Deploying the automated data pipeline using AWS CloudFormation

Before deploying the AWS CloudFormation template, make sure you have signed up for QuickSight in one of the 11 supported Regions:

  • US East (Ohio)
  • US East (N. Virginia)
  • US West (Oregon)
  • Asia Pacific (Mumbai)
  • Asia Pacific (Seoul)
  • Asia Pacific (Singapore)
  • Asia Pacific (Sydney)
  • Asia Pacific (Tokyo)
  • EU (Frankfurt)
  • EU (Ireland)
  • EU (London)

This post works with both Standard and Enterprise editions of QuickSight. Enterprise Edition provides richer features and higher limits compared to Standard Edition.

  1. After you sign up for QuickSight, you can use CloudFormation templates to create all the necessary resources by choosing Launch stack:
  2. Enter a stack name; for example, SpiceRefreshBlog.
  3. Acknowledge the AWS Identity and Access Management (IAM) resource creation.
  4. Choose Create stack.

The CloudFormation template creates the following resources in your AWS account:

  • Three S3 buckets to store the following:
    • AWS Glue ETL job script
    • Raw data
    • Processed data
  • Three Lambda functions to do the following:
    • Create the ETL job
    • Initiate the ETL job upon upload of new data in the raw zone
    • Initiate the SPICE dataset refresh when the ETL job is complete
  • An AWS Glue database
  • Two AWS Glue tables to store the following:
    • Raw data
    • Processed data
  • An ETL job to convert the raw data from CSV into Apache Parquet format
  • Four IAM roles: one for each of the three Lambda functions and one for the ETL job
  • An EventBridge rule that triggers on an AWS Glue job state change event with a state of Succeeded and invokes a Lambda function that performs the SPICE dataset refresh

Importing the dataset

For this post, you use the trip record data made publicly available by the NYC Taxi & Limousine Commission (TLC). You upload monthly data in CSV format to the raw zone S3 bucket.

This data is available in Amazon S3 through Open Data on AWS, a service designed to let you spend more time on data analysis rather than data acquisition.

You start by copying the For Hire Vehicle (FHV) data for March 2020. Because the data is already available in Amazon S3 through Open Data, run the following command to copy the data into the raw zone. Make sure you replace <raw bucket name> with the name of the raw bucket created by the CloudFormation template:

aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_2020-03.csv" s3://<raw bucket name>

After you copy the data into the raw zone, the Amazon S3 event trigger invokes the Lambda function that triggers the ETL job. You can see the job status on the AWS Glue console by choosing Jobs in the navigation pane. The process takes about 2 minutes.

When the job is complete, check that you can see the Parquet files in the processed zone S3 bucket.

Creating a QuickSight analysis of the data

To visualize the taxi data, we create a QuickSight analysis.

First, you need to give QuickSight the necessary permissions to access the processed zone S3 bucket. For instructions, see I Can’t Connect to Amazon S3.

Then complete the following steps to create an analysis of the taxi data:

  1. On the QuickSight console, choose Datasets.
  2. Choose New dataset.
  3. Choose Athena and provide a name for the data source (such as Athena).
  4. Choose Create data source.
  5. For Database, choose the name of the taxi AWS Glue database (starting with taxigluedatabase).
  6. For Tables, select processed_taxi_data as the table to visualize.
  7. Choose Select.
  8. Ensure Import to SPICE for quicker analytics is selected and choose Visualize.

After the data is imported into SPICE, you can create visuals to display the data. For example, the following screenshot shows a key performance indicator (KPI) of the number of taxi journeys aggregated at the month level and the number of journeys over time.

We use this dashboard to visualize the dataset again after we refresh SPICE with more data.

Automating the SPICE refresh

To refresh the SPICE dataset when the ETL job is complete, the CloudFormation template we deployed created an EventBridge rule that triggers a Lambda function each time an AWS Glue ETL job successfully completes. The following screenshot shows the code for the event pattern.
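The deployed rule matches AWS Glue job state change events where the job succeeded. Expressed as a Python dictionary, the event pattern looks roughly like the following (the pattern in your deployed rule may include additional fields):

import json

# Approximate shape of the EventBridge event pattern used by the rule
event_pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Job State Change"],
    "detail": {"state": ["SUCCEEDED"]},
}

print(json.dumps(event_pattern, indent=2))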

We need to configure the Lambda function with the ETL job name and the ID of the SPICE dataset we created in QuickSight.

  1. Locate the ETL job name on the AWS Glue console, named TaxiTransformationGlueJob-<unique id>.
  2. To find the SPICE dataset ID, run the following command using the AWS Command Line Interface (AWS CLI):
    aws quicksight list-data-sets --aws-account-id <your AWS account id> 

    The following screenshot shows the output with the dataset ID.

  3. On the Lambda console, open the Lambda function named SpiceRefreshBlog-QuicksightUpdateLambda-<unique id>.
  4. Update line 9 of the code to replace ReplaceWithGlueJobName with the AWS Glue job name and ReplaceWithYourDatasetID with the dataset ID.

Once a Glue job succeeds, this Lambda function is triggered. The EventBridge event that triggers the Lambda contains the name of the job. You can access this from the event as follows, as seen on line 25 of the function:

succeededJob = event['detail']['jobName']

The Lambda function looks up the job name in the data_set_map dictionary. If the dictionary contains the job name, the dataset ID is accessed and the function calls the QuickSight Create Ingestion API to refresh the SPICE datasets.

You can extend the data_set_map dictionary to include additional job names and associated SPICE dataset IDs to be refreshed. If using this approach at scale, you might choose to move this configuration information to an Amazon DynamoDB table.
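A trimmed-down sketch of that refresh logic might look like the following; the job name, dataset ID, and ingestion ID format are placeholders for the values you configure:

import time

import boto3

# Placeholder mapping of AWS Glue job names to the SPICE dataset IDs to refresh
data_set_map = {
    "TaxiTransformationGlueJob-xxxxxxxx": "11111111-2222-3333-4444-555555555555",
}

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]


def handler(event, context):
    # The EventBridge event carries the name of the AWS Glue job that succeeded
    succeeded_job = event["detail"]["jobName"]

    dataset_id = data_set_map.get(succeeded_job)
    if dataset_id is None:
        return  # not a job we track

    # Start a SPICE refresh for the mapped dataset
    quicksight.create_ingestion(
        AwsAccountId=account_id,
        DataSetId=dataset_id,
        IngestionId=f"refresh-{int(time.time())}",
    )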

  1. Save the Lambda function by choosing Deploy.

Testing the automated refresh

Now that you have configured the Lambda function, we can test the end-to-end ETL process and make the next month's data available for analysis.

To add the FHV data for April, run the following AWS CLI command:

aws s3 cp "s3://nyc-tlc/trip data/fhv_tripdata_2020-04.csv" s3://<raw bucket name>

As before, this upload to the raw zone triggers the Lambda function that starts the ETL job. You can see the progress of the job on the AWS Glue console.

When the job is complete, navigate to QuickSight and open the taxi analysis (or, if you still have it open, refresh the window).

You can now see that both months’ data is available for analysis. This step might take 1–2 minutes to load.

To see the status of each SPICE refresh, navigate back to the dataset on the QuickSight console and choose View History.

The following screenshot shows the status of previous refreshes and the number of rows that have been ingested into SPICE.
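You can also retrieve the same refresh history programmatically. The following is a small sketch using the list_ingestions API; the dataset ID is a placeholder:

import boto3

quicksight = boto3.client("quicksight")
account_id = boto3.client("sts").get_caller_identity()["Account"]

# List recent SPICE refreshes for the dataset (dataset ID is a placeholder)
response = quicksight.list_ingestions(
    AwsAccountId=account_id,
    DataSetId="11111111-2222-3333-4444-555555555555",
)

for ingestion in response["Ingestions"]:
    print(
        ingestion["IngestionStatus"],
        ingestion.get("RowInfo", {}).get("RowsIngested"),
        ingestion["CreatedTime"],
    )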

Now that you have tested the end-to-end process, you can try copying more FHV data to the raw zone and see the data within your QuickSight analysis.

Cleaning up

To clean up the resources you created by following along with this post, complete the following steps:

  1. Delete the QuickSight analysis you created.
  2. Delete the QuickSight dataset that you created.
  3. Delete the QuickSight data source:
    1. Choose New dataset.
    2. Select the data source and choose Delete data source.
  4. On the Amazon S3 console, delete the contents of the raw and processed S3 buckets.
  5. On the AWS CloudFormation console, select the stack SpiceRefreshBlog and choose Delete.

Conclusion

Using an event-based architecture to automate the refresh of your SPICE datasets makes sure that your business analysts are always viewing the latest available data. This reduction in time to analysis can help your business unlock insights quicker without having to wait for a manual or scheduled process. Additionally, by only refreshing SPICE when new data is available, the underlying data storage resources are used efficiently, so you only pay for what you need!

Get started with QuickSight today!


About the Authors

Rob Craig is a Senior Solutions Architect with AWS. He supports customers in the UK with their cloud journey, providing them with architectural advice and guidance to help them achieve their business outcomes.

 

 

 

 

Dylan Qu is an AWS solutions architect responsible for providing architectural guidance across the full AWS stack with a focus on Data Analytics, AI/ML and DevOps.

Using administrative dashboards for a centralized view of Amazon QuickSight objects

Post Syndicated from Ying Wang original https://aws.amazon.com/blogs/big-data/using-administrative-dashboards-for-a-centralized-view-of-amazon-quicksight-objects/

“Security is job 0” is the primary maxim of all endeavors undertaken at AWS. Amazon QuickSight, the fast-growing, cloud-native business intelligence (BI) platform from AWS, allows security controls in a variety of means, including web browsers and API calls. These controls apply to various functions, such as user management, authorization, authentication, and data governance.

This post demonstrates how to build a workflow to enable a centralized visualization of QuickSight groups and user information, as well as QuickSight objects access permission auditing information. Combined with AWS CloudTrail logs, the solution enables your security team to detect any abnormal behavior in near-real time to ensure security compliance.

Benefits of a centralized dashboard

A group in QuickSight consists of a set of users. Using groups makes it easy to manage access and security. For example, you can configure three groups, called Marketing, HR, and BI Developer, and each has specific access privileges:

  • The users in the Marketing group can only view the dashboards with marketing data
  • The users in the HR group can only view the human resources data
  • The users in the BI Developer group can edit all objects, including data sources, datasets, and dashboards

After the users and groups are configured, BI administrators can check and edit the object access permission by choosing Share for dashboards, datasets, and all other objects. The following screenshot shows the Manage dashboard sharing page on the QuickSight console.

As of this writing, individual object permission information is available on the QuickSight console, and user information is provided on the user management view on the QuickSight console. Our solution integrates QuickSight APIs with other AWS services to create an administrative dashboard that provides a centralized view of essential security information. This dashboard covers not only user lists and individual object access permission information available on the current platform, but also additional security information like group lists, user-group mapping information, and overall objects access permissions. This dashboard allows you to acquire unique security insights with its collection of comprehensive security information.

This post provides a detailed workflow that covers the data pipeline, sample Python codes, the AWS CloudFormation template, and a sample administrative dashboard. With the guidance of this post, you can configure a centralized information center in your own environment.

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. A new user creation event in the CloudTrail log triggers the Amazon CloudWatch Events rule CreateUser.
  2. The CreateUser rule triggers the AWS Lambda function User_Initiation. This function checks if the new user belongs to an existing group (for this post, we assume that their AWS Identity and Access Management (IAM) role equates to the group they should belong to in QuickSight). If such a group exists, the function adds the user into the group (CreateGroupMembership). Otherwise, it creates a new group (CreateGroup). The following is the process flow diagram of the Lambda function.
  3. If the CreateGroupMembership event occurs, it triggers the Lambda function Data_Prepare. This function calls QuickSight APIs to get QuickSight group, user, and object access permissions information and saves the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. If the Lambda function User_Initiation creates a new QuickSight group in Step 2, it triggers a CloudWatch rule CreateGroup and the Lambda function Group_Initiation. The Group_Initiation function updates the QuickSight objects permission for the new group, such as granting it permission to view a dashboard.
  5. The update object permission event in Step 4 triggers the Lambda function Data_Prepare, which updates the object access permissions information and saves the updated information to an S3 bucket.
  6. The DeleteUser and DeleteGroup events also trigger the Lambda function Data_Prepare.
  7. Based on the file in S3 that contains user-group mapping information and the QuickSight objects access permissions information, an Amazon Athena table is created.
  8. A QuickSight dataset fetches the data in the Athena table created in Step 7 through DirectQuery. Another QuickSight dataset is created based on the CloudTrail logs data. Then, based on these two datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Access to the following AWS services:
    • QuickSight
    • Athena
    • Lambda
    • Amazon S3
  • Basic knowledge of Python
  • Security Assertion Markup Language 2.0 (SAML 2.0) or OpenID Connect (OIDC) single sign-on (SSO) configured for QuickSight access

Creating resources

Create your resources by downloading the following AWS Cloud Development Kit (AWS CDK) stack from the GitHub repo.

Pull the Administrative Dashboard folder and run the command cdk deploy QuickSightStack to deploy the resources. For more information, see AWS CDK Intro Workshop: Python Workshop.

Implementing the solution

This solution assumes that the users log in to their QuickSight account with identity federation through SAML or OIDC. For instructions on setting up SAML SSO, see Single Sign-On Access to Amazon QuickSight Using SAML 2.0, Federate Amazon QuickSight access with Okta, and Enabling Amazon QuickSight federation with Azure AD. For OIDC SSO, see Use Amazon QuickSight Federated Single Sign-On with Amazon Cognito User Pools.

After you set up the IAM policy of the web identity or SAML federation role, you don’t need to invite users manually. A QuickSight user is provisioned automatically when opening QuickSight for the first time.

In this solution, one SAML federation role corresponds to a QuickSight group. There are four sample SAML roles: Marketing, HR, BI-Admin, and BI Developer.

The following code is the sample CreateUser CloudTrail event:

{
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "AROAZEAWJBC3FBJ7KDH2N:[email protected]",
        "arn": "arn:aws:sts::<aws_account_id>:assumed-role/ BI-Developer/[email protected]",
        "accountId": <aws_account_id>,
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "AROAZEAWJBC3FBJ7KDH2N",
                "arn": "arn:aws:iam:: <aws_account_id>:role/BI-Developer",
                "accountId": <aws_account_id>,
                "userName": " BI-Developer"}
        }
    },
    "eventSource": "quicksight.amazonaws.com",
    "eventName": "CreateUser",
    "awsRegion": "us-east-1",
    "eventType": "AwsServiceEvent",
…    
}

This event triggers the CloudWatch events rule CreateUser. The following screenshot shows the details of this rule.

The CreateUser rule triggers the Lambda function User_Initiation. This function gets the QuickSight group name (Marketing, HR, BI-Admin, or BI Developer) and compares the group name with the existing group list. If such a group exists, it adds this new user into that group (CreateGroupMembership). Otherwise, it creates a new group (CreateGroup).
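A condensed sketch of that logic follows. The namespace, account ID placeholder, and the way the role and user names are parsed from the CloudTrail event are assumptions based on the sample event above, and the sketch collapses the two paths (existing group versus new group) into a single flow:

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
ACCOUNT_ID = "<aws_account_id>"  # placeholder
NAMESPACE = "default"            # assumed QuickSight namespace


def handler(event, context):
    detail = event["detail"]

    # The SAML role name (for example, BI-Developer) acts as the QuickSight group name
    role_arn = detail["userIdentity"]["sessionContext"]["sessionIssuer"]["arn"]
    group_name = role_arn.split("/")[-1]

    # Federated QuickSight user names typically look like "RoleName/session-name"
    user_name = "/".join(detail["userIdentity"]["arn"].split("/")[-2:])

    # Create the group if it doesn't exist yet, then add the user to it
    groups = quicksight.list_groups(AwsAccountId=ACCOUNT_ID, Namespace=NAMESPACE)["GroupList"]
    if group_name not in [g["GroupName"] for g in groups]:
        quicksight.create_group(
            AwsAccountId=ACCOUNT_ID, Namespace=NAMESPACE, GroupName=group_name
        )

    quicksight.create_group_membership(
        AwsAccountId=ACCOUNT_ID,
        Namespace=NAMESPACE,
        GroupName=group_name,
        MemberName=user_name,
    )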

The Data_Prepare Lambda function is triggered when a new user is added to a group (the CreateGroupMembership event). This function calls the QuickSight API describe_data_set_permissions, describe_dashboard_permissions, or describe_data_source_permissions to get the object access permissions. It also calls the list_user_groups and list_users APIs to get the list of users and the groups each user belongs to. Finally, this function creates two files containing the QuickSight group, user, and object access information, and saves these files into an S3 bucket.
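The user and group half of that function might look like the following sketch; the output key and record layout are placeholders chosen to line up with the Groups table columns shown later:

import json

import boto3

ACCOUNT_ID = "<aws_account_id>"  # placeholder
NAMESPACE = "default"            # assumed QuickSight namespace
BUCKET = "administrative-dashboard<aws_account_id>"  # bucket created by the CDK stack

quicksight = boto3.client("quicksight", region_name="us-east-1")
s3 = boto3.client("s3")


def export_user_group_mapping():
    rows = []
    users = quicksight.list_users(AwsAccountId=ACCOUNT_ID, Namespace=NAMESPACE)["UserList"]
    for user in users:
        groups = quicksight.list_user_groups(
            UserName=user["UserName"], AwsAccountId=ACCOUNT_ID, Namespace=NAMESPACE
        )["GroupList"]
        for group in groups:
            rows.append({
                "aws_region": "us-east-1",
                "group_name": group["GroupName"],
                "user_name": user["UserName"],
            })

    # Save the mapping where the AWS Glue crawler and Athena can pick it up (placeholder key)
    s3.put_object(
        Bucket=BUCKET,
        Key="group_membership/group_membership.json",
        Body="\n".join(json.dumps(r) for r in rows),
    )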

If a new QuickSight group is created, it triggers the Lambda function Group_Initiation to update the QuickSight dashboard, dataset, or data source permission for this new group. For example, if the HR group is created, the Group_Initiation function lets the HR group view the Employee Information dashboard.

The UpdateDashboardPermissions, UpdateDatasetPermissions, and UpdateDatasourcePermissions events trigger the Lambda function Data_Prepare to update the object access permissions information stored in the S3 bucket.

To create two Athena tables (Groups and Objects), run an AWS Glue crawler.

The following screenshot is sample data of the Groups table.

The following screenshot is sample data of the Objects table.

You can create a DirectQuery dataset in QuickSight with the two new Athena tables joined. See the following screenshot.

The Objects table contains the information about the objects (such as dashboards and datasets) belonging to each group or user. Furthermore, we can create a calculated field called Ownership based on the permissions information (the actions column in the Objects table) to distinguish owners from viewers or users (an object owner can delete the object, whereas a viewer or user can't).

The following screenshot shows the relevant code.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

After that, run the following SQL query to build an Athena view with QuickSight events for the last 24 hours:

CREATE OR REPLACE VIEW qsctlog_last_24h AS 
SELECT "useridentity"."type", "split_part"("useridentity"."sessioncontext"."sessionissuer"."arn", '/', 2) "group_name"
, COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) "user_name"
, "eventtime"
, "eventname"
, "awsregion"
, "resources"
, "eventtype"
, "recipientaccountid"
, "serviceeventdetails"
FROM default.cloudtrail_logs
WHERE (("eventsource" = 'quicksight.amazonaws.com') AND (CAST("split_part"("eventtime", 'T', 1) AS date) > "date_add"('hour', -24, "now"()))) 

Running queries in Athena

Now we have the datasets ready in Athena and can run SQL queries against them to answer some common administrative questions.

To create a QuickSight dataset to catch all orphan users that don’t belong to any group, as well as the events done by these users in the last 24 hours, run the following SQL query:

SELECT g.*
,log.group_name as role_name
,log.user_name as log_user_name
,log.type
,log.eventtime
,log.eventname
,log.awsregion
,log.eventtype
,log.recipientaccountid
,log.serviceeventdetails
FROM "default"."qsctlog_last_24h" as log 
full outer join 
"default"."groups" as g 
on log.awsregion=g.aws_region AND log.group_name=g.group_name AND log.user_name=g.user_name 
where g.group_name is null or g.group_name=''

To create a QuickSight dataset to list objects belonging to each group or user, run the following query:

SELECT group_name AS "Group/User Name"
, object_name
, object_type
, if((actions LIKE '%Delete%'), 'Owner', 'Viewer/User') AS Ownership
FROM "default"."object" full outer
JOIN "default"."groups"
    ON group_name=principal_name
WHERE principal_type='group'
UNION
SELECT user_name AS "Group/User Name"
, object_name
, object_type
, if((actions LIKE '%Delete%'), 'Owner', 'Viewer/User') AS Ownership
FROM "default"."object" full outer
JOIN "default"."groups"
    ON user_name=principal_name
WHERE principal_type='user'
ORDER BY  "Group/User Name" asc;

The following screenshot shows the sample data.

Building dashboards

In addition to running queries directly in Athena, we can build a dashboard using this same data in QuickSight. The following screenshot shows an example dashboard that you can make using our data.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

Cleaning up

To avoid incurring future charges, delete the resources you created by running the following command:

cdk destroy QuickSightStack 

Then, on the Amazon S3 console, delete the S3 bucket administrative-dashboard<your_aws_account_id>.

Conclusion

This post discussed how BI administrators can use the QuickSight dashboard, Lambda functions, and other AWS services to create a centralized view of groups, users, and objects access permission information and abnormal access auditing. We also presented a serverless data pipeline to support the administrative dashboard. This dashboard can provide you with unique security insights with its collection of comprehensive security information.


About the Author

Ying Wang is a Data Visualization Engineer with the Data & Analytics Global Specialty Practice in AWS Professional Services.

Analyze and improve email campaigns with Amazon Simple Email Service and Amazon QuickSight

Post Syndicated from Apoorv Gakhar original https://aws.amazon.com/blogs/messaging-and-targeting/analyze-and-improve-email-campaigns-with-amazon-simple-email-service-and-amazon-quicksight/

Email is a popular channel for applications, used in both marketing campaigns and other outbound customer communications. The challenge with email is that it can become increasingly complex to manage for companies that must send large quantities of messages per month. This complexity is especially true when companies need to measure detailed email engagement metrics to track campaign success.

As a marketer, you want to monitor several metrics, including open rates, click-through rates, bounce rates, and delivery rates. If you do not track your email results, you could potentially be wasting your campaign resources. Monitoring and interpreting your sending results can help you deliver the best content possible to your subscribers’ inboxes, and it can also ensure that your IP reputation stays high. Mailbox providers prioritize inbox placement for senders that deliver relevant content. As a business professional, tracking your emails can also help you stay on top of hot leads and important clients. For example, if someone has opened your email multiple times in one day, it might be a good idea to send out another follow-up email to touch base.

Building a large-scale email solution is a complex and expensive challenge for any business. You would need to build infrastructure, assemble your network, and warm up your IP addresses. Alternatively, working with some third-party email solutions requires contract negotiations and upfront costs.

Fortunately, Amazon Simple Email Service (SES) has a highly scalable and reliable backend infrastructure to reduce the preceding challenges. It has improved content filtering techniques, reputation management features, and a vast array of analytics and reporting functions. These features help email senders reach their audiences and make it easier to manage email channels across applications. Amazon SES also provides API operations to monitor your sending activities through simple API calls. You can publish these events to Amazon CloudWatch, Amazon Kinesis Data Firehose, or Amazon Simple Notification Service (Amazon SNS).

In this post, you learn how to build and automate a serverless architecture that analyzes email events. We explore how to track important metrics such as open and click rate of the emails.

Solution overview

 

The metrics that you can measure using Amazon SES are referred to as email sending events. You can use Amazon CloudWatch to retrieve Amazon SES event data. You can also use Amazon SNS to interpret Amazon SES event data. However, in this post, we are going to use Amazon Kinesis Data Firehose to monitor our user sending activity.

Enable Amazon SES configuration sets with open and click metrics and publish email sending events to Amazon Kinesis Data Firehose as JSON records. A Lambda function parses the JSON records and publishes the content to the Amazon S3 bucket.

Ingested data lands in an Amazon S3 bucket that we refer to as the raw zone. To make that data available, you have to catalog its schema in the AWS Glue Data Catalog. You create and run the AWS Glue crawler that crawls your data sources and constructs your Data Catalog. The Data Catalog uses pre-built classifiers for many popular source formats and data types, including JSON, CSV, and Parquet.

When the crawler is finished creating the table definition and schema, you analyze the data using Amazon Athena. It is an interactive query service that makes it easy to analyze data in Amazon S3 using SQL. Point to your data in Amazon S3, define the schema, and start querying using standard SQL, with most results delivered in seconds.

Now you can build visualizations, perform ad hoc analysis, and quickly get business insights from the Amazon SES event data using Amazon QuickSight. You can easily run SQL queries using Amazon Athena on data stored in Amazon S3, and build business dashboards within Amazon QuickSight.

 

Deploying the architecture:

Configuring Amazon Kinesis Data Firehose to write to Amazon S3:

  1. Navigate to the Amazon Kinesis console in the AWS Management Console. Choose Kinesis Data Firehose and create a delivery stream.
  2. Enter delivery stream name as “SES_Firehose_Demo”.
  3. Under the source category, select “Direct Put or other sources”.
  4. On the next page, make sure to enable Data Transformation of source records with AWS Lambda. We use AWS Lambda to parse the notification contents so that we process only the information required for the use case.
  5. Click the “Create New” Lambda function.
  6. Click on the "General Kinesis Data Firehose Processing" Lambda blueprint; this opens the Lambda console. Enter the following values in Lambda:
    • Name: SES-Firehose-Json-Parser
    • Execution role: Create a new role with basic Lambda permissions.
  7. Click “Create Function”. Now replace the Lambda code with the following provided code and save the function.
    • 'use strict';
      console.log('Loading function');
      exports.handler = (event, context, callback) => {
         /* Process the list of records and transform them */
          const output = event.records.map((record) => {
              console.log(record.recordId);
              const payload =JSON.parse((Buffer.from(record.data, 'base64').toString()))
              console.log("payload : " + payload);
              
              if (payload.eventType == "Click") {
              const resultPayLoadClick = {
                      eventType : payload.eventType,
                      destinationEmailId : payload.mail.destination[0],
                      sourceIp : payload.click.ipAddress,
                  };
              console.log("resultPayLoad : " + resultPayLoadClick.eventType + resultPayLoadClick.destinationEmailId + resultPayLoadClick.sourceIp);
              
              //const parsed = resultPayLoad[0];
              //console.log("parsed : " + (Buffer.from(JSON.stringify(resultPayLoad))).toString('base64'));
              
              
              return{
                  recordId: record.recordId,
                  result: 'Ok',
                  data: (Buffer.from(JSON.stringify(resultPayLoadClick))).toString('base64'),
              };
              }
              else {
                  const resultPayLoadOpen = {
                      eventType : payload.eventType,
                      destinationEmailId : payload.mail.destination[0],
                      sourceIp : payload.open.ipAddress,
                  };
              console.log("resultPayLoad : " + resultPayLoadOpen.eventType + resultPayLoadOpen.destinationEmailId + resultPayLoadOpen.sourceIp);
              
              //const parsed = resultPayLoad[0];
              //console.log("parsed : " + (Buffer.from(JSON.stringify(resultPayLoad))).toString('base64'));
              
              
              return{
                  recordId: record.recordId,
                  result: 'Ok',
                  data: (Buffer.from(JSON.stringify(resultPayLoadOpen))).toString('base64'),
              };
              }
          });
          console.log("Output : " + output.data);
          console.log(`Processing completed.  Successful records ${output.length}.`);
          callback(null, { records: output });
      };

      Please note:

      For this blog, we keep only three fields: eventType, destinationEmailId, and sourceIp. If you want to store other parameters, you can modify the code accordingly. For the list of information included in the notifications, you may check out the following document.

      https://docs.aws.amazon.com/ses/latest/DeveloperGuide/event-publishing-retrieving-firehose-examples.html

  8. Now, navigate back to your Amazon Kinesis Data Firehose console and choose the newly created Lambda function.
  9. Keep the convert record format disabled and click “Next”.
  10. In the destination, choose Amazon S3 and select a target Amazon S3 bucket. Create a new bucket if you do not want to use the existing bucket.
  11. Enter the following values for the Amazon S3 prefix and error prefix, which are applied when event data is published.
    • Prefix:
      fhbase/year=!{timestamp:yyyy}/month=!{timestamp:MM}/day=!{timestamp:dd}/hour=!{timestamp:HH}/
    • Error Prefix:
      fherroroutputbase/!{firehose:random-string}/!{firehose:error-output-type}/!{timestamp:yyyy/MM/dd}/
  12. You can use the preceding values for the Amazon S3 prefix and error prefix. If you use your own prefixes, make sure to update the target values in AWS Glue accordingly, as described later in this process.
  13. Keep the Amazon S3 backup option disabled and click “Next”.
  14. On the next page, under the Permissions section, select create a new role. This opens up a new tab and then click “Allow” to create the role.
  15. Navigate back to the Amazon Kinesis Data Firehose console and click “Next”.
  16. Review the changes and click on “Create delivery stream”.

Configure Amazon SES to publish event data to Kinesis Data Firehose:

  1. Navigate to Amazon SES console and select “Email Addresses” from the left side.
  2. Click on “Verify a New Email Address” on the top. Enter your email address to which you send a test email.
  3. Go to your email inbox and click on the verification link. Navigate back to the Amazon SES console and you will see a verified status for the email address you provided.
  4. Open the Amazon SES console and select Configuration set from the left side.
  5. Create a new configuration set. Enter “SES_Firehose_Demo”  as the configuration set name and click “Create”.
  6. Choose Kinesis Data Firehose as the destination and provide the following details.
    • Name: OpenClick
    • Event Types: Open and Click
  7. In the IAM Role field, select ‘Let SES make a new role’. This allows SES to create a new role and add sufficient permissions for this use case in that role.
  8. Click “Save”.

Sending a Test email:

  1. Navigate to Amazon SES console, click on “Email Addresses” on the left side.
  2. Select your verified email address and click on “Send a Test email”.
  3. Make sure you select the raw email format. You may use the following format to send out a test email from the console. Make sure you send this email to a recipient inbox to which you have access.
    • X-SES-CONFIGURATION-SET: SES_Firehose_Demo
      X-SES-MESSAGE-TAGS: Email=NULL
      From: [email protected]
      To: [email protected]
      Subject: Test email
      Content-Type: multipart/alternative;
          		boundary="----=_boundary"
      
      ------=_boundary
      Content-Type: text/html; charset=UTF-8
      Content-Transfer-Encoding: 7bit
      This is a test email.
      
      <a href="https://aws.amazon.com/">Amazon Web Services</a>
      ------=_boundary
  4. Once the email is received in the recipient's inbox, open the email and click the link in it. This generates open and click events and sends the event data back to SES.
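If you'd rather send the test email programmatically than from the console, a sketch like the following produces the same events. The sender and recipient addresses are placeholders and must be verified in SES (or your account must be out of the SES sandbox):

import boto3

ses = boto3.client("ses", region_name="us-east-1")

# Raw MIME message mirroring the console example above (addresses are placeholders)
raw_message = """From: sender@example.com
To: recipient@example.com
Subject: Test email
Content-Type: multipart/alternative; boundary="----=_boundary"

------=_boundary
Content-Type: text/html; charset=UTF-8
Content-Transfer-Encoding: 7bit

This is a test email.
<a href="https://aws.amazon.com/">Amazon Web Services</a>
------=_boundary--
"""

# The configuration set routes the resulting open and click events to Kinesis Data Firehose
ses.send_raw_email(
    Source="sender@example.com",
    Destinations=["recipient@example.com"],
    RawMessage={"Data": raw_message},
    ConfigurationSetName="SES_Firehose_Demo",
)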

Creating Glue Crawler:

  1. Navigate to the AWS Glue console, select “crawler” from the left side, and then click on “Add crawler” on the top.
  2. Enter the crawler name as “SES_Firehose_Crawler” and click “Next”.
  3. Under Crawler source type, select “Data stores” and click “Next”.
  4. Select Amazon S3 as the data source and provide the required path. Include the path up to the "fhbase" folder.
  5. Select “no” under Add another data source section.
  6. In the IAM role, select the option to ‘Create an IAM role’. Enter the name as “SES_Firehose-Crawler”. This provides the necessary permissions automatically to the newly created role.
  7. In the frequency section, select run on demand and click “Next”. You may choose this value as per your use case.
  8. Click on add Database and provide the name as “ses_firehose_glue_db”. Click on create and then click “Next”.
  9. Review your Glue crawler setting and click on “Finish”.
  10. Run the crawler you just created. This crawls the data from the specified Amazon S3 bucket and creates a catalog and table definition.
  11. Now navigate to “tables” on the left, and verify a “fhbase” table is created after you run the crawler.

If you want to analyze the data stored so far, you can use Amazon Athena and test the queries. If not, you can move directly to Amazon QuickSight.

Analyzing the data using Amazon Athena:

  1. Open the Athena console and select the database that was created using AWS Glue.
  2. Click on "set up a query result location in Amazon S3" as shown in the following screenshot.
  3. Navigate to the Amazon S3 bucket created in earlier steps and create a folder called “AthenaQueryResult”. We store our Athena query result in this bucket.
  4. Now navigate back to Amazon Athena and select the Amazon S3 bucket with the folder location as shown in the following screenshot and click “Save”.
  5. Run the following query to test the sample output and accordingly modify your SQL query to get the desired output.
    • Select * from "ses_firehose_glue_db"."fhbase"

Note: If you want to track opened emails by unique IP addresses, you can modify your SQL query accordingly. This is because every time an email is opened, you receive a notification, even if the same email was previously opened.

 

Visualizing the data in Amazon QuickSight dashboards:

  1. Now, let’s analyze this data using Amazon Athena via Amazon Quicksight.
  2. Log into Amazon Quicksight and choose Manage data, New dataset. Choose Amazon Athena as a new data source.
  3. Enter the data source name as “SES-Demo” and click on “Create the data source”.
  4. Select your database from the drop-down as “ses_firehose_glue_db” and table “fhbase” that you have created in AWS Glue.
  5. And add a custom SQL based on your use case and click on “Confirm query”. Refer to the example below.
  6. You can perform ad hoc analysis and modify your query according to your business needs as shown in the following image. Click “Save & Visualize”.
  7. You can now visualize your event data on the Amazon QuickSight dashboard. You can use various graphs to represent your data. For this demo, the default graph is used and two fields are selected to populate the graph, as shown below.

 

Conclusion:

This architecture shows how to track your email sending activity at a granular level. You set up Amazon SES to publish event data to Amazon Kinesis Data Firehose based on fine-grained email characteristics that you define. You can also track several types of email sending events, including sends, deliveries, bounces, complaints, rejections, rendering failures, and delivery delays. This information can be useful for operational and analytical purposes.

To get started with Amazon SES, follow this quick start guide and you can learn more about monitoring sending activity here.

About the Authors

Chirag Oswal is a solutions architect and AR/VR specialist working with the public sector in India. He works with AWS customers to help them adopt the cloud operating model on a large scale.

Apoorv Gakhar is a Cloud Support Engineer and an Amazon SES expert. He works with AWS customers to help them integrate their applications with various AWS services.

 

Additional Resources:

Amazon SES Dedicated IP Pools

Amazon Personalize optimizer using Amazon Pinpoint events

Template Personalization using Amazon Pinpoint

Amazon QuickSight adds support for on-sheet filter controls

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-adds-support-for-on-sheet-filter-controls/

Amazon QuickSight now supports easy and intuitive filter controls that you can place beside visuals on dashboards, allowing readers to quickly slice and dice data in the context of its visual representation. You can create these filter controls from existing or new filters with a single click, and configure them to support different operations, such as filtering specific dates, relative dates, or date ranges; setting upper and lower thresholds for numeric values; adding drop-downs with single-select or multi-select options; and more.

In this post, we review how these filtering improvements, together with themes and dashboard layout options, let you create stunning, interactive dashboards that you can share with tens of thousands of users, whether in QuickSight or using embedded dashboards within apps, without any server provisioning or management needed, and paying for what you use. For this use case, we use the COVID-19 public dataset for Washington state. The following screenshot shows the dashboard with on-sheet filters added.

Using the new filter controls

Let us take a deeper look at the new filter controls, which are now placed across the sheet between visuals. Creating these controls is easy: simply add a new filter on the required dimension or metric, set the scope to either filter the entire dashboard or specific visuals within it, and add it to the sheet. QuickSight now automatically maps filters across multiple datasets used within a sheet, so actions on the filter can apply to every visual on the sheet if so desired. For example, to add a new filter on the county field, create a filter on the field using the left-hand navigation pane when authoring the dashboard.

Set the scope of the filter as desired, and choose Add to sheet to add a control to the sheet.

QuickSight then creates a moveable control that you can place anywhere on the sheet. QuickSight chooses the control type depending on the type of filter created. For example, when creating filters on dimensional fields, QuickSight adds a multi-select drop-down control by default. You can change the control type by choosing the Settings icon in the control’s visual menu.

In the Edit control section, you can make further updates to your control.

You can also place these controls in the control drawer on top of the sheet by choosing Pin to top.

On-sheet controls

The on-sheet controls currently supported include those previously supported with QuickSight parameter controls (single-select drop-downs, multi-select drop-downs, date and time picker, single-sided slider, single-line text box). QuickSight now supports new controls for date and time range selection, relative date selection, and numeric range selection. You can move existing parameter controls on the sheet and place them beside the new filter controls.

Let’s take a quick look at these new controls. You can use date and time range selection controls when you have a BETWEEN date range filter on your dashboard.

Relative date controls provide readers with powerful functionality to apply date filters at yearly, quarterly, monthly, weekly, daily, hourly, and minute levels. For example, you can choose to filter by current year, previous year, year to date, and last or next N years. These controls provide a great way to ensure your users always see the latest data whether viewing data in dashboards or email reports.

With a two-sided slider control, you can set lower and upper bounds on metric filters. To add a two-sided filter, you simply add the numeric BETWEEN filter to the sheet.

Additional dashboard features

In this section, we look at some other aspects of the dashboard.

Customizing your dashboard layout and theme

This dashboard uses a different set of colors and highlights than the regular palette in QuickSight. We achieve this by using a custom theme, which lets you pick a color palette for data values, background and foreground colors, fonts, and more. You can also choose to remove borders around the visual, show or remove spacing between visuals, and add margins around the sheet.

To create themes for your dashboards, choose Themes from the left-hand menu when authoring the dashboard. You can choose one of the starter themes available in QuickSight and choose Save as to customize your own, then use the theme editor to preview changes before saving the theme.

On the Main page of the theme editor, you can customize your background color, foreground color, and font.

On the Data page, you can customize your data colors.

For more information about QuickSight themes, see Evolve your analytics with Amazon QuickSight’s new APIs and theming capabilities.

Images and rich text

You can use the QuickSight insight editor (available in Enterprise Edition) to add rich text and images to dashboards. To add an image, go to the insight editor, choose the Image icon, and provide the hosted location of the image. You can also hyperlink this image to any URL.

Reference lines

The bar chart within the dashboard also includes a reference line, which you can easily add to a QuickSight line or bar visual. You can configure a reference line from the visual menu.

Conditional formatting

Tables in the dashboard use conditional formatting, which you can add from the table menu.

Scaling

QuickSight dashboards default to auto-fit mode, so they are responsive based on screen size. Instead, you can choose to pick a specific screen resolution to optimize for, based on the devices your audience most commonly uses to view the dashboard. To adjust the scaling mode, choose Settings in the navigation pane while in dashboard authoring (analysis) mode.

For this post, our dashboard was built for a 1366px screen, and scales that view to a larger or smaller screen to ensure that all users see the same content (mobile devices continue to fall back to a single column, mobile-specific layout to ensure usability). Opting for the optimized mode also makes sure that your email reports look exactly like the dashboard that your viewers interact with.

Conclusion

With denser dashboards, custom themes, and new on-sheet filter controls in QuickSight, you can provide richer dashboards for your readers. Visit our user guide to learn more about on-sheet filter controls, themes, dashboard scaling options and more. For more information about authoring dashboards in QuickSight, watch Part 1 and Part 2 of our two-part interactive workshop.


About the authors

Jose Kunnackal John is a principal product manager for Amazon QuickSight.

Sahitya Pandiri is a technical program manager with Amazon Web Services. Sahitya has been in product/program management for 6 years now, and has built multiple products in the retail, healthcare, and analytics space.

ICBiome uses Amazon QuickSight to empower hospitals in dealing with harmful pathogens

Post Syndicated from Chirag Dhull original https://aws.amazon.com/blogs/big-data/icbiome-uses-amazon-quicksight-to-empower-hospitals-in-dealing-with-harmful-pathogens/

In response to the COVID-19 pandemic, hospitals and healthcare organizations are increasingly employing genetic sequencing to screen, track, and contain harmful pathogens. ICBiome is a startup that has been working on this problem for several years, creating innovative data analytics products using AWS to help hospitals and researchers address both community-associated and hospital-acquired infections. Building on its early focus on methicillin-resistant Staphylococcus aureus (MRSA), the company is expanding its solution to cover COVID-19 and other types of infections.

ICBiome has now integrated Amazon QuickSight embedded analytics into its solution to provide hospital investigators and other users with easy access to powerful analytics and visualizations on any device.

Empowering hospitals and communities

Many hospitals lack access to sequencing capabilities, and hospitals that have such capabilities often lack complex data processing and advanced bioinformatics expertise. Open-source platforms are limited, complex, and scale poorly. They require hardware resources and management beyond the reach of most hospitals and lack visualization software to interactively explore results without specialized know-how.

BiomeMRSA, ICBiome’s cloud-based genomics service, overcomes these gaps by providing easy access to sequencing through its Clinical Laboratory Improvement Amendments (CLIA)-certified provider, with storage, computing, analytics, and visualization delivered from the AWS Cloud. With the addition of QuickSight embedded analytics, hospital investigators can easily monitor pathogens coming through their hospital environments.

Seeing the whole picture

The company’s innovative data processing methodology enables the comparison of full genomes, as opposed to open-source platforms that only compare portions of them. With a vast and growing database of MRSA mutations, the company enables rapid identification of existing and new strains to help hospitals combat infection.

BiomeMRSA monitors intensive care unit (ICU) settings and identifies emerging MRSA transmissions. By using BiomeMRSA to monitor their ICU settings, hospitals can proactively target local reservoirs and reduce overall incidence rates of MRSA. To support its R&D in pathogen genomics, the company has received more than $1 million in funding from NIAID and state grants. In 2019, ICBiome completed an evaluation of BiomeMRSA at a major hospital. It’s now evaluating and validating BiomeMRSA at two additional hospitals.

Improving patient care and treatment cost

This capability gives hospitals the potential to transform how they manage outbreaks. Intake and weekly screenings of patients lack the resolution to determine how the disease was acquired. This means hospitals can be financially and reputationally penalized for a disease that was acquired before admission.

BiomeMRSA allows continuous monitoring of ICUs by routinely sequencing all samples and returning those samples to hospitals within days. This allows for earlier detection, mitigation of outbreaks, and the reduction of colonization rates of ICU patients by identifying transmissions where the patient was colonized but not infected. Lowering colonization rates can reduce the risk of post-care infections and costly readmissions.

ICBiome is now expanding this work to include tracking pandemics such as COVID-19. The Virginia-based company is adding a COVID-19 application, BiomeCOVID, to address the ongoing crisis. Previously deployed for real-time operations, BiomeCOVID will use genetics to greatly increase the accuracy of contact tracing by separating isolates within different lineages and further identifying unique clusters within those lineages. The company is accelerating the development of BiomeCOVID to meet the critical healthcare need.

Visualizing genomics to combat disease

Since its first NIH grant award in 2017, ICBiome has built its solutions in the AWS Cloud. To deliver rich, in-depth analytics capabilities to its customers, the company chose QuickSight for embedded dashboards in its products. According to Dr. Srini Iyer, founder and CEO of ICBiome, QuickSight provides the agility and scalability the company needs to deliver in-depth solutions quickly. “ICBiome has developed an end-to-end cloud architecture for processing genetic sequence data from bacterial and viral pathogens,” says Dr. Iyer. “Over the last year, ICBiome looked at several data analytics tools that can serve as the visual interface for our product line. We did a formal analysis of both cloud and on-premises business intelligence solutions, and ultimately decided to choose QuickSight for the security, scalability, cost, and reliability.”

The following image shows the Analytics Dashboard for BiomeMRSA:

ICBiome’s customers are centered in large hospitals and public health agencies. Security and data privacy are critical to these decision-makers, particularly because bacterial and viral samples are taken from patients in the course of clinical care. QuickSight uses the same HIPAA-compliant security architecture used by the rest of the company’s data ingestion and processing systems built on AWS.

Cost-efficient and highly scalable

Scalability is critical because the company expects to handle exponentially growing amounts of data due to the growth in whole-genome sequencing to identify pathogen strains. “We opted for Amazon QuickSight as it streamlines operational scaling and integration not just within the application layer but across the entire AWS ecosystem,” says Dante Martinez, Chief Technology Officer of ICBiome.

As a startup providing affordable solutions for hospitals, cost is a critical concern for ICBiome. Many others in its space have opted to use open-source business intelligence (BI) solutions for this reason. However, the company’s analysis showed that this wasn’t the most cost-effective route. “Many open-source BI platforms require a high level of maintenance to ensure reliability both for routine operations and lifecycle management,” says Martinez. “Given the critical nature of our application, we did not want to compromise product integrity by choosing a difficult-to-maintain open-source platform or an entry-level BI tool. Amazon QuickSight is highly cost-effective, but it is also full-featured and mature, providing an experience on par with much more expensive competitors.”

The following diagram illustrates ICBiome’s cloud architecture.

Visualizing a healthier future

The company has big plans for the embedded analytics functionality enabled by QuickSight. “Once we complete the launches of BiomeCOVID and BiomeMRSA as commercial SaaS products, we intend to bring to market other surveillance products such as BiomeCRE and BiomeSTD to track other critical threats to public health,” says Dr. Iyer. “We will also be targeting other biomedical areas where we can develop new classes of products that provide critical data analytical capabilities that are currently not feasible.”

With the landscape of public health now transformed in the wake of the COVID-19 pandemic, ICBiome will continue to expand its SaaS portfolio with Amazon to provide hospitals innovative data analytics tools in infection prevention, epidemiology, and patient care.


About the Author

Chirag Dhull is a Principal Product Marketing Manager for Amazon QuickSight.

Field Notes: Gaining Insights into Labeling Jobs for Machine Learning

Post Syndicated from Michael Graumann original https://aws.amazon.com/blogs/architecture/field-notes-gaining-insights-into-labeling-jobs-for-machine-learning/

In an era where more and more data is generated, it becomes critical for businesses to derive value from it. With the help of supervised learning, it is possible to generate models to automatically make predictions or decisions by leveraging historical data. For example, image recognition for self-driving cars, predicting anomalies on X-rays, fraud detection in finance and more. With supervised learning, these models learn from labeled data. The success of those models is highly dependent on readily available, high quality labeled data.

However, you might encounter cases where a high percentage of your pre-existing data is unlabeled. In these situations, providing correct labeling to previously unlabeled data points would directly translate to higher model accuracy.

Amazon SageMaker Ground Truth helps you with exactly that. It lets you build highly accurate training datasets for machine learning quickly. SageMaker Ground Truth provides your labelers with built-in workflows and interfaces for common labeling tasks. This process could take several hours or more depending on the size of your unlabeled dataset, and you might have a need to track the progress easily, preferably in the form of a dashboard.

In this blogpost we show how to gain deep insights into the progress of labeling and the performance of the workers by using Amazon Athena and Amazon QuickSight. We first use Amazon Athena to set up several views with specific insights into the labeling progress. Finally, we reference these views in Amazon QuickSight to visualize the data in a dashboard.

This approach also works for combining multiple AWS services in general. AWS provides many building blocks that you can mix and match to create a unique, integrated solution with cohesive insights. In this blog post we use data produced by one service (Ground Truth), prepare it with another (Athena), and visualize it with a third (QuickSight). The following diagram shows this architecture.

Solution Architecture

ML Solution Architecture

Mapping a JSON structure to a table structure

Ground Truth creates several directories in your Amazon S3 output path. These directories contain the results of your labeling job and other artifacts of the job. The top-level directory for a labeling job has the same name as your labeling job, while the output directories are placed inside it. We will create all insights from what SageMaker Ground Truth calls worker responses.

All respective JSON files reside in the path s3://bucket/<job-name>/annotations/worker-response/.

To analyze the labeling data with Amazon Athena we need to understand the structure of the underlying JSON files. Let’s review the example below. For each item that was labeled, we see the label itself, followed by the submission time and a workerId pointing to an identity. This identity lives in Amazon Cognito, a fully managed service that provides the user directory for our labelers.

{
    "answers": [
        {
            "answerContent": {
                "crowd-classifier": {
                    "label": "Compute"
                }
            },
            "submissionTime": "2020-03-27T10:31:04.210Z",
            "workerId": "private.eu-west-1.1111111111111111",
            "workerMetadata": {
                "identityData": {
                    "identityProviderType": "Cognito",
                    "issuer": "https://cognito-idp.eu-west-1.amazonaws.com/eu-west-1_111111111",
                    "sub": "11111111-1111-1111-1111-111111111111"
                }
            }
        },
        ...
    ]
}

Although the data is stored in Amazon S3 object storage, we are able to use SQL to access the data by using Amazon Athena. Since we now understand the JSON structure shown in the preceding code, we use Athena to define how to interpret the data that is relevant to us. We do so by first creating a database using the Athena Query Editor:

CREATE DATABASE analyze_labels_db;

Once inside the database, we add the table schema. The actual files remain on Amazon S3, but using the metadata catalog, Athena then knows where the data lies and how to interpret it. The AWS Glue Data Catalog is a central repository to store structural and operational metadata for all your data assets. For a given dataset, you can store its table definition and physical location, add business-relevant attributes, and track how this data has changed over time. Besides Athena, the AWS Glue Data Catalog also provides out-of-the-box integration with Amazon EMR and Amazon Redshift Spectrum. Once you add your table definitions to the Glue Data Catalog, they are available for ETL. They are also readily available for querying in Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum so that you can have a common view of your data between these services.

When going from JSON to SQL, we are crossing format boundaries. To make the JSON-formatted data easier to read, we use SerDe properties to replace the hyphen in crowd-classifier with an underscore due to DDL constraints. Finally we point the location to our Amazon S3 bucket containing the single worker responses. Notice in the following script that we translate the nested structure of the JSON file itself into a hierarchical, nested data structure in the schema definition. Also, we could leave out the workerMetadata as we don’t need it at this time. The data would still stay in the files on Amazon S3, so that we could later change and add the workerMetadata STRUCT into the table definition for our analysis.

CREATE EXTERNAL TABLE annotations_raw (
  answers array<
    struct<answercontent: 
      struct<crowd_classifier: 
        struct<label: string>
      >,
      submissionTime: string,
      workerId: string,
      workerMetadata: 
        struct<identityData: 
          struct<identityProviderType: string, issuer: string, sub: string>
        >
    >
  >
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
  "mapping.crowd_classifier"="crowd-classifier" 
) 
LOCATION 's3://<YOUR_BUCKET>/<JOB_NAME>/annotations/worker-response/'

Creating Views in Athena

Now, we have nested data in our annotations_raw table. For many use cases, especially for analytical uses, representing data in a tabular fashion—as rows—is more natural. This is also the standard way when using SQL and business intelligence tools. To unnest the hierarchical data into flattened rows, we create the following view, which will serve as the foundation for the other views we create. For an in-depth look into unnesting data with Amazon Athena, read this blog post.

Some of the information we’re interested in might not be part of the document, but is encoded in the path. We use a trick in Athena by using the $path variable from the Presto Hive Connector. This determines which Amazon S3 file contains data that is returned by a specific row in an Athena table. This way we can find out which data object an annotation belongs to. Since Athena is built on top of Presto, we are able to use Presto’s built-in regexp_extract function to find out the iteration as well as the data object id per labeling result. We also cast the submission time in date format to later determine the labeling progress per day.

CREATE OR REPLACE VIEW annotations_view AS
SELECT 
  regexp_extract("$path", 'iteration-[0-9]*') as iteration,
  regexp_extract("$path", '(iteration-[0-9]*\/([0-9]*))',2) as dataRecord,
  answer.answercontent.crowd_classifier.label,
  cast(from_iso8601_timestamp(answer.submissionTime) as timestamp) as submissionTime,
  cast(from_iso8601_timestamp(answer.submissionTime) as date) as submissionDay,
  answer.workerId,
  answer.workerMetadata.identityData.identityProviderType,
  answer.workerMetadata.identityData.issuer,
  answer.workerMetadata.identityData.sub,
  "$path" path
FROM 
  annotations_raw
CROSS JOIN UNNEST(answers) AS t(answer)

This view, annotations_view, will be the starting point for the other views we create later in this post.

Visualizing with QuickSight

In this section, we explore a way to visualize the views we build in Athena by pointing Amazon QuickSight to the respective view. Amazon QuickSight lets you create and publish interactive dashboards that include ML Insights. Dashboards can then be accessed from any device, and embedded into your applications, portals, and websites.

Thanks to the tight integration between Athena and QuickSight, we are able to map one dataset in QuickSight to one Athena view. In order to further optimize the performance of the dashboard, we can optionally import the datasets into the in-memory optimized calculation engine for Amazon QuickSight called SPICE. With the datasets in place we can now create an analysis in order to interact with the visuals we’re going to add. You can think of an analysis as a container for a set of related visuals. You can use multiple datasets in an analysis, although any given visual can only use one of those datasets. After you create an analysis and an initial visual, you can expand the analysis. You can do this for example by adding datasets and visuals.
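If you choose to import a dataset into SPICE, you can also refresh it programmatically. The following is a minimal boto3 sketch; the account ID and dataset ID are placeholders for your own values.

import uuid

import boto3

quicksight = boto3.client("quicksight")

# Trigger a SPICE refresh (ingestion) for an existing QuickSight dataset.
quicksight.create_ingestion(
    AwsAccountId="123456789012",            # placeholder account ID
    DataSetId="annotations-view-dataset",   # placeholder dataset ID
    IngestionId=str(uuid.uuid4()),          # any unique ID for this refresh
)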

Let’s start with our first insight.

Annotations per worker

We’d like to gain insights not only into the total number of labeled items but also into the level of contribution of each individual worker. This could give us an indication of whether the labels were created by a diverse crowd of labelers or by a few productive ones. A largely disproportionate amount of contributions from a handful of workers could introduce whatever biases those workers brought along.

SageMaker Ground Truth calls labeled data objects annotations; an annotation is the result of a single worker’s labeling task.

Luckily we encapsulated all the heavy lifting of format conversion in the annotations_view, so that it is now easy to create a view for the annotations per user:

CREATE OR REPLACE VIEW annotations_per_user AS
SELECT COUNT(sub) AS LabeledItems,
sub AS User
FROM annotations_view
GROUP BY sub
ORDER BY LabeledItems DESC

Next we visualize this view in QuickSight. We add a visual to our analysis, select the respective dataset for the view and use the AutoGraph feature, which chooses the most appropriate visual type. Since we already arranged our view in Athena by the number of labeled items in descending order, there is no need now to sort the data in QuickSight. In the following screenshot, worker c4ef78e4... contributed more labels compared to their peers.

Annotations per worker

This view gives you an indicator to check for a bias that the leading worker might have brought along.

Annotations per label

One thing we want to be aware of is potential imbalances between classes in our dataset. Simple machine learning models in particular may learn to frequently predict a label that is massively overrepresented in the dataset. If we can identify an imbalance, we can apply mitigation actions such as upsampling data of underrepresented classes. With the following view we list the total number of annotations per label.

CREATE OR REPLACE VIEW annotations_per_label AS
SELECT Count(dataRecord) AS TotalLabels, label As Label 
FROM annotations_view
GROUP BY label
ORDER BY TotalLabels DESC, Label;

As before, we create a dataset in QuickSight pointing to the annotations_per_label view, open the analysis, add a new visual and leverage the AutoGraph functionality. The result is the following visual representation:

Annotations per worker 2

One can clearly see that the Analytics & AI/ML class is massively underrepresented. At this point, you might want to try getting more data or think about upsampling data for that class.

Annotations per day

Seeing the total number of annotations per label and per worker is good, but we are also interested in how the labeling progress changes over time. This way we might see spikes related to labeler activations. We can also estimate how long it takes to reach a certain goal of annotations given the current pace. For this purpose we create the following view aggregating the total annotations per day.

CREATE OR REPLACE VIEW annotations_per_day AS
SELECT COUNT(datarecord) AS LabeledItems,
submissionDay
FROM annotations_view
GROUP BY submissionDay
ORDER BY submissionDay, LabeledItems DESC

This time the QuickSight AutoGraph provides us with the following line chart. You might have noticed that the axis labels do not match the column names in Athena. That is because we renamed them in QuickSight for better readability.

Total annotations per day

In the preceding chart we see that there is no consistent pace of labeling, which makes it hard to predict when a certain amount of labeled data will be reached. In this example, after starting strong, the progress immediately went down. Knowing this, we might want to take action to motivate our workers to contribute more, and validate the effectiveness of these actions with the help of this chart. The spikes indicate an effective short-term action.

Distribution of total annotations by user

We already have insights into annotations per worker, per label, and per day. Let us now see what insights we can get from aggregating some of this information.

The bigger your labeling workforce gets, the harder it can become to see the whole picture. For that reason we will now create a histogram consisting of five buckets. Each bucket represents an interval of total annotations (for example, 0-25 annotations) mapped to the number of users whose amount of total annotations lies in that interval. This allows us to get a sense of what kind of bias might be introduced by the majority of annotations being contributed by a small amount of workers.

To do that, we use the Presto function width_bucket, which assigns each worker’s number of labeled data objects to one of the five buckets we defined, each with a size of 25. We define these buckets by creating an array with five elements that specify the boundaries.

CREATE OR REPLACE VIEW users_per_bucket_annotations AS
SELECT 
bucket,numberOfUsers,
CASE
   WHEN bucket=5 THEN 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '+'
   ELSE 'B' || cast(bucket AS VARCHAR(10)) || ': ' || cast(((bucket-1) * 25) AS VARCHAR(10)) || '-' || cast((bucket * 25) AS VARCHAR(10))
END AS NumberOfAnnotations
FROM
(SELECT width_bucket(labeleditems,ARRAY[0,25,50,75,100]) AS bucket,
 count(user) AS numberOfUsers
FROM annotations_per_user
GROUP BY 1
ORDER BY bucket)

A SELECT * FROM users_per_bucket_annotations produces the following result:

A SELECT FROM users_per_bucket_annotations

Let’s now investigate the same data via QuickSight:

Annotations per User in buckets of Size 25

Now that we can look at the data visually it becomes clear that we have a bimodal distribution, with many labelers having done very little, and many labelers doing quite a lot. This may warrant interviewing some labelers to find out if there is something holding back users from progressing, or if we can keep engagement high over time.

Putting it all together in QuickSight

Since we created all the previous visuals in one analysis, we can now utilize it as a central place to consume our insights in a user-friendly way. Moreover, we can share our insights with others as a read-only snapshot, which QuickSight calls a dashboard. Users who are dashboard viewers can view and filter the dashboard data as below:

Groundtruth dashboard

Furthermore, you can generate a report and let QuickSight send it either once or on a schedule (daily, weekly or monthly) to your peers. This way users do not have to sign in and they can get reminders to check the progress of the labeling job. Lastly, sending out those reports is an opportunity to stay in touch with the labelers and keep the engagement high.

Conclusion

In this blogpost, we have shown one example of combining multiple AWS services in order to build a solution tailored to your needs. We took the Amazon S3 output generated by SageMaker Ground Truth and showed how it can be further processed and analyzed with Athena. Finally, we created a central place to consume our insights in a user-friendly way with QuickSight. By putting it all together in a dashboard we were able to share our insights with our peers.

You can take the same pattern and apply it to other situations: take some of the many building blocks AWS provides and mix-and-match them to create a unique, integrated solution with cohesive insights just as we did with Ground Truth, Athena, and QuickSight.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Why Deployment Requirements are Important When Making Architectural Choices

Post Syndicated from Yusuf Mayet original https://aws.amazon.com/blogs/architecture/why-deployment-requirements-are-important-when-making-architectural-choices/

Introduction

Too often, architects fall into the trap of thinking the architecture of an application is restricted to just the runtime part of the architecture. By doing this we focus on only a single customer (such as the application’s users and how they interact with the system) and we forget about other important customers like developers and DevOps teams. This means that requirements regarding deployment ease, deployment frequency, and observability are delegated to the back burner during design time and tacked on after the runtime architecture is built. This leads to increased costs and reduced ability to innovate.

In this post, I discuss the importance of key non-functional requirements, and how they can and should influence the target architecture at design time.

Architectural patterns

When building and designing new applications, we usually start by looking at the functional requirements, which will define the functionality and objective of the application. These are all the things that the users of the application expect, such as shopping online, searching for products, and ordering. We also consider aspects such as usability to ensure a great user experience (UX).

We then consider the non-functional requirements, the so-called “ilities,” which typically include requirements regarding scalability, availability, latency, etc. These are constraints around the functional requirements, like response times for placing orders or searching for products, which will define the expected latency of the system.

These requirements—both functional and non-functional together—dictate the architectural pattern we choose to build the application. These patterns include Multi-tier, event-driven architecture, microservices, and others, and each one has benefits and limitations. For example, a microservices architecture allows for a system where services can be deployed and scaled independently, but this also introduces complexity around service discovery.

Aligning the architecture to technical users’ requirements

Amazon is a customer-obsessed organization, so it’s important for us to first identify who the main customers are at each point so that we can meet their needs. The customers of the functional requirements are the application users, so we need to ensure the application meets their needs. For the most part, we will ensure that the desired product features are supported by the architecture.

But who are the users of the architecture? Not the applications’ users—they don’t care if it’s monolithic or microservices based, as long as they can shop and search for products. The main customers of the architecture are the technical teams: the developers, architects, and operations teams that build and support the application. We need to work backwards from the customers’ needs (in this case the technical team), and make sure that the architecture meets their requirements. We have therefore identified three non-functional requirements that are important to consider when designing an architecture that can equally meet the needs of the technical users:

  1. Deployability: flow and agility to consistently deploy new features
  2. Observability: feedback about the state of the application
  3. Disposability: the ability to dispose of resources and provision new ones quickly

Together these form part of the Developer Experience (DX), which is focused on providing developers with APIs, documentation, and other technologies to make the system easy to understand and use. This ensures that we design with Day 2 operations in mind.

Deployability: Flow

There are many reasons that organizations embark on digital transformation journeys, which usually involve moving to the cloud and adopting DevOps. According to Stephen Orban, GM of AWS Data Exchange, in his book Ahead in the Cloud, faster product development is often a key motivator, meaning the most important non-functional requirement is achieving flow, the speed at which you can consistently deploy new applications, respond to competitors, and test and roll out new features. The architecture therefore needs to be designed upfront to support deployability. If the architectural pattern is a monolithic application, this will hamper the developers’ ability to quickly roll out new features to production. So we need to choose and design the architecture to support easy and automated deployments. Results from years of research show that leaders use DevOps to achieve high levels of throughput:

Graphic - Using DevOps to achieve high levels of throughput

Decisions on the pace and frequency of deployments will dictate whether to use rolling, blue/green, or canary deployment methodologies. This will then inform the architectural pattern chosen for the application.

Using AWS, in order to achieve flow of deployability, we use services such as AWS CodePipeline, AWS CodeBuild, AWS CodeDeploy, and AWS CodeStar.

Observability: feedback

Once you have achieved a rapid and repeatable flow of features into production, you need a constant feedback loop of logs and metrics in order to detect and avoid problems. Observability is a property of the architecture that will allow us to better understand the application across the delivery pipeline and into production. This requires that we design the architecture to ensure that health reports are generated to analyze and spot trends. This includes error rates and stats from each stage of the development process, how many commits were made, build duration, and frequency of deployments. This not only allows us to measure code characteristics such as test coverage, but also developer productivity.

On AWS, we can leverage Amazon CloudWatch to gather and search through logs and metrics, AWS X-Ray for tracing, and Amazon QuickSight as an analytics tool to measure CI/CD metrics.

Disposability: automation

In his book, Cloud Strategy: A Decision-based Approach to a Successful Cloud Journey, Gregor Hohpe, Enterprise Strategist at AWS, notes that cloud and automation add a new “-ility”: disposability, which is the ability to set up and dispose of new servers in an automated and pain-free manner. Having immutable, disposable infrastructure greatly enhances your ability to achieve high levels of deployability and flow, especially when used in a CI/CD pipeline, which can create new resources and kill off the old ones.

At AWS, we can achieve disposability with serverless using AWS Lambda, or with containers running on Amazon Elastic Container Service (ECS) or Amazon Elastic Kubernetes Service (EKS), or using AWS Auto Scaling with Amazon Elastic Compute Cloud (EC2).

Three different views of the architecture

Once we have designed an architecture that caters for deployability, observability, and disposability, it exposes three lenses across which we can view the architecture:

3 views of the architecture

  1. Build lens: the focus of this part of the architecture is on achieving deployability, with the objective to give the developers an easy-to-use, automated platform that builds, tests, and pushes their code into the different environments, in a repeatable way. Developers can push code changes more reliably and frequently, and the operations team can see greater stability because environments have standard configurations and rollback procedures are automated.
  2. Runtime lens: the focus is on the users of the application and on maximizing their experience by making the application responsive and highly available.
  3. Operate lens: the focus is on achieving observability for the DevOps teams, allowing them to have complete visibility into each part of the architecture.

Summary

When building and designing new applications, the functional requirements (such as UX) are usually the primary drivers for choosing and defining the architecture to support those requirements. In this post I have discussed how DX characteristics like deployability, observability, and disposability are not just operational concerns that get tacked on after the architecture is chosen. Rather, they should be as important as the functional requirements when choosing the architectural pattern. This ensures that the architecture can support the needs of both the developers and users, increasing quality and our ability to innovate.

Enabling Amazon QuickSight federation with Azure AD

Post Syndicated from Adnan Hasan original https://aws.amazon.com/blogs/big-data/enabling-amazon-quicksight-federation-with-azure-ad/

Customers today want to establish a single identity and access strategy across all of their own apps, such as on-premises apps, third-party cloud apps (SaaS), or apps in AWS. If your organization uses Azure Active Directory (Azure AD) for cloud applications, you can enable single sign-on (SSO) for applications like Amazon QuickSight without needing to create another user account or remember passwords. You can also enable role-based access control to make sure users get appropriate role permissions in QuickSight based on their entitlement stored in Active Directory attributes or granted through Active Directory group membership. The setup also allows administrators to focus on managing a single source of truth for user identities in Azure AD while having the convenience of configuring access to other AWS accounts and apps centrally.

In this post, we walk through the steps required to configure federated SSO between QuickSight and Azure AD. We also demonstrate ways to assign a QuickSight role based on Azure AD group membership. Administrators can publish the QuickSight app in the Azure App portal to enable users to SSO to QuickSight using their Azure AD or Active Directory credentials.

The solution in this post uses an identity provider (IdP)-initiated SSO, which means your end-users must log in to Azure AD and choose the published QuickSight app in the Azure Apps portal to sign in to QuickSight.

Registering a QuickSight application in Azure AD

Your first step is to create a QuickSight application in Azure AD.

  1. Log in to your Azure portal using the administrator account in the Azure AD tenant where you want to register the QuickSight application.
  2. Under Azure Services, open Azure Active Directory and under Manage, choose Enterprise Application.
  3. Choose New Application.
  4. Select Non-gallery application.
  5. For Name, enter Amazon QuickSight.
  6. Choose Add to register the application.

Creating users and groups in Azure AD

You can now create new users and groups or choose existing users and groups that can access QuickSight.

  1. Under Manage, choose All applications and open Amazon QuickSight
  2. Under Getting Started, choose Assign users and groups.
  3. For this post, you create three groups, one for each QuickSight role:
    1. QuickSight-Admin
    2. QuickSight-Author
    3. QuickSight-Reader

For instructions on creating groups in Azure AD, see Create a basic group and add members using Azure Active Directory.

Configuring SSO in Azure AD

You can now start configuring the SSO settings for the app.

  1. Under Manage, choose Single sign-on.
  2. For Select a single sign-on method, choose SAML.
  3. To configure the sections, choose Edit.
  4. In the Basic SAML Configuration section, for Identifier (Entity ID), enter URN:AMAZON:WEBSERVICES.

This is the entity ID passed during the SAML exchange. Azure requires that this value be unique for each application. For additional AWS applications, you can append a number to the string; for example, URN:AMAZON:WEBSERVICES2.

  5. For Reply URL, enter https://signin.aws.amazon.com/saml.
  6. Leave Sign on URL blank.
  7. For Relay State, enter https://quicksight.aws.amazon.com.
  8. Leave Logout Url blank.

  9. Under SAML Signing Certificate, choose Download next to Federation Metadata XML.

You use this XML document later when setting up the SAML provider in AWS Identity and Access Management (IAM).

  10. Leave this tab open in your browser while moving on to the next steps.

Creating Azure AD as your SAML IdP in AWS

You now configure Azure AD as your SAML IdP.

  1. Open a new tab in your browser.
  2. Log in to the IAM console in your AWS account with admin permissions.
  3. On the IAM console, choose Identity providers.
  4. Choose Create provider.
  5. For Provider name, enter AzureActiveDirectory.
  6. Choose Choose File to upload the metadata document you downloaded earlier.
  7. Choose Next Step.
  8. Verify the provider information and choose Create.
  9. On the summary page, record the value for the provider ARN (arn:aws:iam::<AccountID>:saml-provider/AzureActiveDirectory).

You need this ARN to configure claims rules later in this post.

You can also complete this configuration using the AWS Command Line Interface (AWS CLI).
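For example, the following boto3 sketch creates the same provider; the path to the downloaded metadata file is a placeholder.

import boto3

iam = boto3.client("iam")

# Read the federation metadata XML downloaded from Azure AD (placeholder path).
with open("federation_metadata.xml") as f:
    metadata = f.read()

response = iam.create_saml_provider(
    SAMLMetadataDocument=metadata,
    Name="AzureActiveDirectory",
)
print(response["SAMLProviderArn"])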

Configuring IAM policies

In this step, you create three IAM policies for different role permissions in QuickSight:

  • QuickSight-Federated-Admin
  • QuickSight-Federated-Author
  • QuickSight-Federated-Reader

Use the following steps to set up QuickSight-Federated-Admin policy. This policy grants admin privileges in QuickSight to the federated user:

  1. On the IAM console, choose Policies.
  2. Choose Create Policy.
  3. Choose JSON and replace the existing text with the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateAdmin",
                "Resource": "*"
            }
        ]
    }

  4. Choose Review policy
  5. For Name enter QuickSight-Federated-Admin.
  6. Choose Create policy.

Now repeat the steps to create QuickSight-Federated-Author and QuickSight-Federated-Reader policy using the following JSON codes for each policy:

QuickSight-Federated-Author

The following policy grants author privileges in QuickSight to the federated user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "quicksight:CreateUser",
            "Resource": "*"
        }
    ]
}

QuickSight-Federated-Reader

The following policy grants reader privileges in QuickSight to the federated user:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "quicksight:CreateReader",
            "Resource": "*"
        }
    ]
}
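If you prefer to script this step, the following boto3 sketch creates all three policies with the same documents shown above.

import json

import boto3

iam = boto3.client("iam")

# Map each policy name to the single QuickSight action it allows.
policies = {
    "QuickSight-Federated-Admin": "quicksight:CreateAdmin",
    "QuickSight-Federated-Author": "quicksight:CreateUser",
    "QuickSight-Federated-Reader": "quicksight:CreateReader",
}

for name, action in policies.items():
    document = {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": action, "Resource": "*"}],
    }
    iam.create_policy(PolicyName=name, PolicyDocument=json.dumps(document))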

Configuring IAM roles

Next, create the roles that your Azure AD users assume when federating into QuickSight. The following steps set up the admin role:

  1. On the IAM console, choose Roles.
  2. Choose Create role.
  3. For Select type of trusted entity, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created earlier (AzureActiveDirectory).
  5. Select Allow programmatic and AWS Management Console access.
  6. For Attribute, make sure SAML:aud is selected.
  7. Value should show https://signin.aws.amazon.com/saml.
  8. Choose Next: Permissions.
  9. Choose the QuickSight-Federated-Admin IAM policy you created earlier.
  10. Choose Next: Tags.
  11. Choose Next: Review
  12. For Role name, enter QuickSight-Admin-Role.
  13. For Role description, enter a description.
  14. Choose Create role.
  15. On the IAM console, in the navigation pane, choose Roles.
  16. Choose the QuickSight-Admin-Role role you created to open the role’s properties.
  17. Record the role ARN to use later.
  18. On the Trust Relationships tab, choose Edit Trust Relationship.
  19. Under Trusted Entities, verify that the IdP you created is listed.
  20. Under Conditions, verify that SAML:aud with a value of https://signin.aws.amazon.com/saml is present.
  21. Repeat these steps to create your author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Federated-Author.
    2. For QuickSight-Reader-Role, use the policy QuickSight-Federated-Reader.
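The following boto3 sketch shows the equivalent of these steps for the admin role; the account ID is a placeholder, and you can repeat the calls with the author and reader policies for the other two roles.

import json

import boto3

iam = boto3.client("iam")
account_id = "123456789012"  # placeholder account ID

# Trust policy that lets the Azure AD SAML provider assume the role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:saml-provider/AzureActiveDirectory"
            },
            "Action": "sts:AssumeRoleWithSAML",
            "Condition": {
                "StringEquals": {"SAML:aud": "https://signin.aws.amazon.com/saml"}
            },
        }
    ],
}

iam.create_role(
    RoleName="QuickSight-Admin-Role",
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
iam.attach_role_policy(
    RoleName="QuickSight-Admin-Role",
    PolicyArn=f"arn:aws:iam::{account_id}:policy/QuickSight-Federated-Admin",
)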

Configuring user attributes and claims in Azure AD

In this step, you return to the application in Azure portal and configure the user claims that Azure AD sends to AWS.

By default, several SAML attributes are populated for the new application, but you don’t need these attributes for federation into QuickSight. Under Additional Claims, select the unnecessary claims and choose Delete.

For this post, you create three claims:

  • Role
  • RoleSessionName
  • SAML_SUBJECT

Creating the Role claim

To create the Role claim, complete the following steps:

  1. Under Manage, choose Single sign-on.
  2. Choose Edit in the User Attributes & Claims section.
  3. Choose Add new claim.
  4. For Name, enter Role.
  5. For Namespace, enter https://aws.amazon.com/SAML/Attributes.
  6. Under Claim conditions, add a condition for the admin, author, and reader roles. For each condition, set User type to Any and Source to Attribute, and use the following scoped group and value pairs:
    • QuickSight-Admin: arn:aws:iam::253914981264:role/QuickSight-Admin-Role,arn:aws:iam::253914981264:saml-provider/AzureActiveDirectory
    • QuickSight-Author: arn:aws:iam::253914981264:role/QuickSight-Author-Role,arn:aws:iam::253914981264:saml-provider/AzureActiveDirectory
    • QuickSight-Reader: arn:aws:iam::253914981264:role/QuickSight-Reader-Role,arn:aws:iam::253914981264:saml-provider/AzureActiveDirectory

Creating the RoleSessionName claim

To create your RoleSessionName claim, complete the following steps:

  1. Choose Add new claim.
  2. For Name, enter RoleSessionName.
  3. For Namespace, enter https://aws.amazon.com/SAML/Attributes.
  4. For Source, choose Transformation.
  5. For Transformation, enter ExtractMailPrefix().
  6. For Parameter 1, enter user.userprincipalname.

We use the ExtractMailPrefix() function to extract the name from the userprincipalname attribute. For example, the function extracts the name joe from the user principal name value of [email protected]. IAM uses RoleSessionName to build the role session ID for the user signing into QuickSight. The role session ID is made up of the Role name and RoleSessionName, in Role/RoleSessionName format. Users are registered in QuickSight with the role session ID as the username.

Creating the SAML_SUBJECT claim

To create your final claim, SAML_SUBJECT, complete the following steps:

  1. Choose Add new claim.
  2. For Name, enter SAML_SUBJECT.
  3. For Namespace, enter https://aws.amazon.com/SAML/Attributes.
  4. For Source, choose Attribute.
  5. For Source attribute, enter “Azure AD - QuickSight SSO”.

Testing the application

You’re now ready to test the application.

  1. In the Azure portal, on the Azure Active Directory page, choose All groups.
  2. Update the group membership of the QuickSight-Admin group by adding the current user to it.
  3. Under Enterprise Applications, choose Amazon QuickSight.
  4. Under Manage, choose Single sign-on.
  5. Choose Test this application to test the authentication flow.
  6. Log in to QuickSight as an admin.

The following screenshot shows you the QuickSight dashboard for the admin user.

  7. Remove the current user from the QuickSight-Admin Azure AD group and add it to the QuickSight-Author group.

When you test the application flow, you log in to QuickSight as an author.

  8. Remove the current user from the QuickSight-Author group and add it to the QuickSight-Reader group.

When you test the application flow again, you log in as a reader.

Removing the user from the Azure AD group does not automatically remove the registered user in QuickSight. You have to remove the user manually in the QuickSight admin console. User management inside QuickSight is documented in this article.

Deep-linking QuickSight dashboards

You can share QuickSight dashboards using the sign-on URL for the QuickSight application published in the Azure Apps portal. This allows users to federate directly into the QuickSight dashboard without having to land first on the QuickSight homepage.

To deep-link to a specific QuickSight dashboard with SSO, complete the following steps:

  1. Under Enterprise Applications, choose Amazon QuickSight
  2. Under Manage, choose Properties.
  3. Locate the User access URL.
  4. Append a ?RelayState parameter containing the URL of your dashboard to the end of the User access URL. For example, https://myapps.microsoft.com/signin/Amazon%20QuickSight/a06d28e5-4aa4-4888-bb99-91d6c2c4eae8?RelayState=https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/224103be-0470-4de4-829f-390e55b3ef96.

You can test it by creating a custom sign-in URL using the RelayState parameter pointing to an existing dashboard. Make sure the user signing in to the dashboard has been granted proper access.
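As a quick sketch, the deep link can be assembled as shown below; both URLs are placeholders, and depending on your setup you may also want to URL-encode the RelayState value.

# Build the deep link from the Azure user access URL and a dashboard URL (placeholders).
user_access_url = "https://myapps.microsoft.com/signin/Amazon%20QuickSight/<application-id>"
dashboard_url = "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/<dashboard-id>"

deep_link = f"{user_access_url}?RelayState={dashboard_url}"
print(deep_link)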

Summary

This post provided step-by-step instructions to configure a federated SSO with Azure AD as the IdP. We also discussed how to map users and groups in Azure AD to IAM roles for secure access into QuickSight.

If you have any questions or feedback, please leave a comment.


About the Author

Adnan Hasan is a Global GTM Analytics Specialist at Amazon Web Services, helping customers transform their business using data, machine learning and advanced analytics. 

 

Bringing the power of embedded analytics to your apps and services with Amazon QuickSight

Post Syndicated from Dorothy Li original https://aws.amazon.com/blogs/big-data/bringing-the-power-of-embedded-analytics-to-your-apps-and-services-with-amazon-quicksight/

In the world we live in today, companies need to quickly react to change—and to anticipate it. Customers tell us that their reliance on data has never been greater than it is today. To improve your decision-making, you have two types of data transformation needs: data agility, the speed at which data turns into insights, and data transparency, the need to present insights to decision makers. Going forward, we expect data transformation projects to become a centerpiece in every organization, big or small.

Furthermore, applications are migrating to the cloud faster than ever. Applications need to scale quickly to potentially millions of users, have global availability, manage petabytes of data, and respond in milliseconds. Such modern applications are built with a combination of these new architecture patterns, operational models, and software delivery processes, and allow businesses to innovate faster while reducing risk, time-to-market, and total cost of ownership.

An emerging area from these two trends is to combine the power of application modernization with data transformation. This emerging trend is often called embedded analytics, and is the focus of this post.

The case for embedded analytics

Applications generate a high volume of structured and unstructured data. This could be clickstream data, sales data, data from IoT devices, social data, and more. Customers who are building these applications (such as software-as-a-service (SaaS) apps or enterprise portals) often tell us that their end-users find it challenging to derive meaning from this data because traditional business intelligence (BI) approaches don’t always work.

Traditional BI tools live in disparate systems and require data engineering teams to provide connectivity and continuous integration with the application, adding to complexity and delays in the overall process. Even after the connectivity is built, you must switch back and forth between your application and the BI tool, causing frustration and decreasing the overall pace of decision-making. Customers tell us that their development teams are constantly looking for new ways to delight their users, and embedding the BI capability directly into their applications is one of the most requested asks from their end-users.

Given the strategic importance of this capability, you can use embedded analytics to differentiate your application and up-sell it as a new service. Gartner research demonstrates that 63% of CEOs expect to adopt a product-as-a-service model in the next two years, making this a major market opportunity. For example, if you provide financial services software, you can empower users to perform detailed analysis of portfolio performance trends. An HR solution might enable managers to visualize and predict turnover rates. A supply chain management solution could embed the ability to slice and dice KPIs and better understand the efficiency of logistics routes.

Comparing common approaches to embedded analytics

The approach to building an embedded analytics capability needs to deliver on the requirements of modern applications. It must be scalable, handle large amounts of data without compromising agility, and seamlessly integrate with the application’s user experience. Choosing the right methodology becomes especially important in the face of these needs.

Building your own embedded analytics solution gives you maximum control, but it has a number of disadvantages. You have to hire specialized resources (such as data engineers for building data connectivity and UX developers for building dashboards) and maintain dedicated infrastructure to manage the data processing needs of the application. This can be expensive, resource-intensive, and complex to build.

Embedding traditional BI solutions that are available in the market has limitations as well, because they're not purpose-built for embedding use cases. Most solutions are server-based, meaning they're challenging to scale and require additional infrastructure setup and ongoing maintenance. These solutions also have restrictive, pay-per-server pricing, which doesn't fully meet the needs of end-users who consume applications or portals via a session-based usage model.

A new approach to embedded analytics

At AWS re:Invent 2019, we launched new capabilities in Amazon QuickSight that make it easy to embed analytics into your applications and portals, empowering your customers to gain deeper insights into your application’s data. Unlike building your own analytics solution, which can be time-consuming and hard to scale, QuickSight allows you to quickly embed interactive dashboards and visualizations into your applications without compromising on the ability to personalize the look and feel of these new features.

QuickSight has a serverless architecture that automatically scales your applications from a few users to hundreds of thousands of users without the need to build, set up, and manage your own analytics infrastructure. These capabilities allow you to deliver embedded analytics at hyperscale. So, why does hyperscale matter? Traditional BI tools run on a fixed amount of hardware resources; as a result, more users, more concurrency, or more complex queries degrade performance for all users, which forces you to add more capacity (leading to higher costs).

The following diagram illustrates a traditional architecture, which requires additional servers (and higher upfront cost) to scale.

With QuickSight, you have access to the power and scale of the AWS Cloud. You get automatically scaled, consistent performance regardless of concurrency or the size of your user base, and a truly pay-per-use model, meaning you only pay when your users access the dashboards or reports. The following diagram illustrates how QuickSight scales seamlessly with its serverless architecture, powered by the AWS Cloud.

Furthermore, QuickSight gives your users machine learning-powered insights such as anomaly detection, forecasting, and natural language queries. It also has a rich set of APIs that allow you to programmatically manage your analytics workflows, such as moving dashboards across accounts, automating deployments, and managing access for users with single sign-on (SSO).
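
As an illustration of these APIs, the following is a minimal sketch of fetching a short-lived dashboard embed URL with the AWS SDK for Python (Boto3). It assumes a registered QuickSight reader and an existing dashboard; the account ID, dashboard ID, and user ARN are placeholders.

import boto3

# All identifiers below are placeholders; replace them with your own values
quicksight = boto3.client("quicksight", region_name="us-east-1")

response = quicksight.get_dashboard_embed_url(
    AwsAccountId="111122223333",
    DashboardId="example-dashboard-id",
    IdentityType="QUICKSIGHT",           # embed for a registered QuickSight user
    UserArn="arn:aws:quicksight:us-east-1:111122223333:user/default/embed-reader",
    SessionLifetimeInMinutes=60,
)

# The short-lived URL is typically passed to the QuickSight Embedding SDK in the browser
print(response["EmbedUrl"])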

New features in QuickSight Embedded Analytics

We recently announced the launch of additional embedding capabilities that allow you to do even more with QuickSight embedded analytics. QuickSight now allows you to embed dashboard authoring within applications (such as SaaS applications and enterprise portals), allowing you to empower your end-users to create their own visualizations and reports.

These ad hoc data analysis and self-service data exploration capabilities mean you don’t have to repeatedly create custom dashboards based on requests from your end-users, and can provide end-users with even greater agility and transparency with their data. This capability helps create product differentiation and up-sell opportunities within customer applications.
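
To embed the authoring experience itself, the application back end can request a console session embed URL. The following is a minimal sketch using Boto3, assuming a registered QuickSight author; the account ID and user ARN are placeholders.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

# EntryPoint "/start" opens the QuickSight authoring experience; identifiers are placeholders
response = quicksight.get_session_embed_url(
    AwsAccountId="111122223333",
    EntryPoint="/start",
    SessionLifetimeInMinutes=600,
    UserArn="arn:aws:quicksight:us-east-1:111122223333:user/default/embed-author",
)

print(response["EmbedUrl"])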

With this launch, QuickSight also provides namespaces, a multi-tenant capability that allows you to easily maintain data isolation while supporting multiple workloads within the same QuickSight account. For example, if you’re an independent software vendor (ISV), you can now assign dedicated namespaces to different customers within the same QuickSight account. This allows you to securely manage multiple customer workloads as users (authors or readers) within one namespace, and they can only discover and share content with other users within the same namespace, without exposing any data to other parties.
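
The following is a minimal sketch of creating a namespace for one tenant and registering a reader into it with Boto3; the namespace name, account ID, user name, and email are placeholders. Users registered in a namespace can only discover and share content within that namespace.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

# Create an isolated namespace for one tenant (identifiers are placeholders)
quicksight.create_namespace(
    AwsAccountId="111122223333",
    Namespace="customer-a",
    IdentityStore="QUICKSIGHT",
)

# Register a reader into that tenant's namespace
quicksight.register_user(
    AwsAccountId="111122223333",
    Namespace="customer-a",
    IdentityType="QUICKSIGHT",
    Email="reader@example.com",
    UserRole="READER",
    UserName="customer-a-reader",
)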

Even without namespaces, you can set up embedded dashboards for hundreds of thousands of users with QuickSight. For example, see the following dashboard for our fictional company, Oktank Analytica.

With namespaces in place, you can extend this to provide ad-hoc authoring capabilities using curated datasets specific to each customer, created and shared by the developer or ISV. See the following screenshot.

For more information about these new features, see Embed multi-tenant analytics in applications with Amazon QuickSight.

Customer success stories

Customers are already using embedded analytics in QuickSight to great success. In this section, we share the stories of a few customers.

Blackboard

Blackboard is a leading EdTech company, serving higher education, K-12, business, and government clients around the world.

“The recent wave in digital transformation in the global education community has made it clear that it’s time for a similar transformation in the education analytics tools that support that community,” says Rachel Scherer, Sr. Director of Data & Analytics at Blackboard. “We see a need to support learners, teachers, and leaders in education by helping to change their relationship with data and information—to reduce the distance between information and experience, between ‘informed’ and ‘acting.’

“A large part of this strategy involves embedding information directly where our users are collaborating, teaching, and learning—providing tools and insights that aid in assessment, draw attention to opportunities learners may be missing, and help strategic and academic leadership identify patterns and opportunities for intervention. We’re particularly interested in making the experience of being informed much more intuitive—favoring insight-informed workflows and/or embedded prose over traditional visualizations that require interpretation.

“By removing the step of interpretation, embedded visualizations make insights more useful and actionable. With QuickSight, we were able to deliver on our promise of embedding visualizations quickly, supporting the rapid iteration that we require, at the large scale needed to support our global user community.”

For more information about Blackboard’s QuickSight use case, see the AWS Online Tech Talk Embedding Analytics in your Applications with Amazon QuickSight at the 25:50 mark.

Comcast

Syndication Insights (SI) enables Comcast’s syndicated partners to access the same level of rich data insights that Comcast uses for platform and operational improvements.

“The SI platform enables partners to gain deeper business insights, such as early detection into anomalies for users, while ensuring a seamless experience through embedded, interactive reports,” says Ajay Gavagal, Sr. Manager of Software Development at Comcast. “From the start, scalability was a core requirement for us. We chose QuickSight as it is scalable, enabling SI to extend to multiple syndicated partners without having to provision or manage additional infrastructure. Furthermore, QuickSight provides interactive dashboards that can be easily embedded into an application. Lastly, QuickSight’s rich APIs abstract away a lot of functionality that would otherwise need to be custom built.”

For more information about how Comcast uses QuickSight, see the AWS Online Tech Talk Embedding Analytics in your Applications with Amazon QuickSight at the 38:05 mark.

Panasonic Avionics Corporation

Panasonic Avionics Corporation provides customized in-flight entertainment and communications systems to more than 300 airlines worldwide.

“Our cloud-based solutions collect large amounts of anonymized data that help us optimize the experience for both our airline partners and their passengers,” says Anand Desikan, Director of Cloud Operations at Panasonic Avionics Corporation. “We started using Amazon QuickSight to report on in-flight Wi-Fi performance, and with its rich APIs, pay-per-session pricing, and ability to scale, we quickly rolled out QuickSight dashboards to hundreds of users. The constant evolution of the platform has been impressive: ML-powered anomaly detection, Amazon SageMaker integration, embedding, theming, and cross-visual filtering. Our users consume insights via natural language narratives, which allows them to read all their information right off the dashboard with no complex interpretation needed.”

EHE Health

EHE Health is a national preventive health and primary care Center of Excellence provider system.

“As a 106-year-old organization moving toward greater agility and marketplace nimbleness, we needed to drastically upgrade our ability to be transparent within our internal and external ecosystems,” says David Buza, Chief Technology Officer at EHE Health. “With QuickSight, we are not constrained by pre-built BI reports, and can easily customize and track the right operational metrics, such as product utilization, market penetration, and available inventory to gain a holistic view of our business. These inputs help us to understand current performance and future opportunity so that we can provide greater partnership to our clients, while delivering on our brand promise of creating healthier employee populations.

“QuickSight allowed our teams to seamlessly communicate with our clients—all viewing the same information, simultaneously. QuickSight’s embedding capabilities, along with its secure platform, intuitive design, and flexibility, allowed us to service all stakeholders—both internally and externally. This greater flexibility and customization allowed us to fit the client’s needs seamlessly.”

Conclusion

Where data agility and transparency are critical to business success, embedded analytics can open a universe of possibilities, and we are excited to see what our customers will do with these new capabilities.

About the Author

Dorothy Li is the Vice President and General Manager for Amazon QuickSight.

How Aruba Networks built a cost analysis solution using AWS Glue, Amazon Redshift, and Amazon QuickSight

Post Syndicated from Siddharth Thacker original https://aws.amazon.com/blogs/big-data/how-aruba-networks-built-a-cost-analysis-solution-using-aws-glue-amazon-redshift-and-amazon-quicksight/

This is a guest post co-written by Siddharth Thacker and Swatishree Sahu from Aruba Networks.

Aruba Networks is a Silicon Valley company based in Santa Clara that was founded in 2002 by Keerti Melkote and Pankaj Manglik. Aruba is the industry leader in wired, wireless, and network security solutions. Hewlett-Packard acquired Aruba in 2015, making it a wireless networking subsidiary with a wide range of next-generation network access solutions.

Aruba Networks provides a cloud-based platform called Aruba Central for network management and AI Ops. The Aruba cloud platform runs thousands of workloads that support the customer-facing production environment, as well as a separate development platform for Aruba engineering.

The motivation to build the solution presented in this post was to understand the unit economics of the AWS resources used by multiple product lines across different organization pillars. Aruba wanted a faster, effective, and reliable way to analyze cost and usage data and visualize that into a dashboard. This solution has helped Aruba in multiple ways, including:

  • Visibility into costs – Multiple Aruba teams can now analyze the cost of their application via data surfaced with this solution
  • Cost optimization – The solution helps teams identify new cost-optimization opportunities by making them aware of the higher-cost resources with low utilization so they can optimize accordingly
  • Cost management – The Cloud DevOps organization, the group who built this solution, can effectively plan at the application level and have a direct positive impact on gross margins
  • Cost savings – With daily cost data available, engineers can see the monetary impact of right-sizing compute and other AWS resources almost immediately
  • Big picture as well as granular – Users can visualize cost data from the top down and track cost at a business level and a specific resource level

Overview of the solution

This post describes how Aruba Networks automated the solution, from generating the AWS Cost & Usage Report (AWS CUR) to its final visualization on Amazon QuickSight. In this solution, they start by configuring the CUR on their primary payer account, which publishes the billing reports to an Amazon Simple Storage Service (Amazon S3) bucket. Then they use an AWS Glue crawler to define and catalog the CUR data. As the new CUR data is delivered daily, the data catalog is updated, and the data is loaded into an Amazon Redshift database using Amazon Redshift Spectrum and SQL. The reporting and visualization layer is built using QuickSight. Finally, the entire pipeline is automated by using AWS Data Pipeline.

The following diagram illustrates this architecture.

Aruba prefers the AWS CUR to AWS Cost Explorer because Cost Explorer provides usage information at a high level, without enough granularity for detailed operations such as data transfer cost. The AWS CUR provides the most detailed information available about your AWS costs and usage, at an hourly granularity. This allows the Aruba team to drill down into costs by hour or day, by product or product resource, or by custom tags, enabling them to achieve their goals.

Aruba implemented the solution with the following steps:

  1. Set up the CUR delivery to a primary S3 bucket from the billing dashboard.
  2. Use Amazon S3 replication to copy the primary payer S3 bucket to the analytics bucket. Having a separate analytics account helps prevent direct access to the primary account.
  3. Create and schedule the crawler to crawl the CUR data. This is required to make the metadata available in the Data Catalog and update it quickly when new data arrives.
  4. Create respective Amazon Redshift schema and tables.
  5. Orchestrate an ETL flow to load data to Amazon Redshift using Data Pipeline.
  6. Create and publish dashboards using QuickSight for executives and stakeholders.

Insights generated

The Aruba DevOps team built various reports that provide cost classifications by AWS service, weekly cost by application, cost by product, infrastructure, resource type, and much more using the detailed CUR data, as shown in the following screenshot.

For example, using the following screenshot, Aruba can conveniently see that compute cost is the biggest contributor compared to other costs. To reduce the cost, they can consider cost-optimization methods such as Reserved Instances, Savings Plans, or Spot Instances wherever applicable.

Similarly, the following screenshot highlights that cost doubled compared to the first week of April. This helps Aruba identify anomalies quickly and make informed decisions.

Setting up the CUR delivery

For instructions on setting up a CUR, see Creating Cost and Usage Reports.

To reduce complexity in the workflow, Aruba chose to create resources in the same region with hourly granularity, mainly to see metrics more frequently.

To lower storage costs for the data files and maximize the effectiveness of querying the data with serverless technologies like Amazon Athena and Amazon Redshift Spectrum over an Amazon S3 data lake, save the CUR in Apache Parquet format. The following screenshot shows the configuration for delivery options.

The following table shows some example CUR data.

bill_payer_account_id | line_item_usage_account_id | line_item_usage_start_date | line_item_usage_end_date | line_item_product_code | line_item_usage_type | line_item_operation
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:VolumeP-IOPS.piops | CreateVolume-P-IOPS
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-APN1-AWS-In-Bytes | LoadBalancing-PublicIP-In
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-DataProcessing-Bytes | LoadBalancing
123456789 | 111222333444 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:SnapshotUsage | CreateSnapshot
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:SnapshotUsage | CreateSnapshot
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-EBS:SnapshotUsage | CreateSnapshot
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-DataTransfer-Regional-Bytes | InterZone-In
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonS3 | USW2-Requests-Tier2 | ReadLocation
123456789 | 555666777888 | 00:00.0 | 00:00.0 | AmazonEC2 | USW2-DataTransfer-Regional-Bytes | InterZone-In

Replicating the CUR data to your analytics account

For security purposes, other teams aren't allowed to access the primary (payer) account, and therefore can't access CUR data generated from that account. Aruba replicated the data to their analytics account and built the cost analysis solution there. Other teams can access the cost data without needing permissions on the primary account. The data is replicated across accounts by adding an Amazon S3 replication rule to the bucket. For more information, see Adding a replication rule when the destination bucket is in a different AWS account.
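
The replication rule can be configured in the Amazon S3 console or programmatically. The following is a hedged sketch with Boto3, applied to the payer account's CUR bucket; the bucket names, account ID, prefix, and IAM role ARN are placeholders, and versioning must be enabled on both buckets for replication to work.

import boto3

s3 = boto3.client("s3")

# Replicate new CUR objects from the payer bucket to the analytics account's bucket
# (bucket names, account ID, prefix, and IAM role ARN are placeholders)
s3.put_bucket_replication(
    Bucket="payer-cur-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/cur-replication-role",
        "Rules": [
            {
                "ID": "replicate-cur-to-analytics",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": "cur/"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::analytics-cur-bucket",
                    "Account": "444455556666",
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
                "DeleteMarkerReplication": {"Status": "Disabled"},
            }
        ],
    },
)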

Cataloging the data with a crawler and scheduling it to run daily

Because AWS delivers all daily reports for a report date range into a report-prefix/report-name/yyyymmdd-yyyymmdd folder, Aruba uses AWS Glue crawlers to crawl through the data and update the catalog.

AWS Glue is a fully managed ETL service that makes it easy to prepare and load the data for analytics. Once AWS Glue is pointed to the data stored on AWS, it discovers the data and stores the associated metadata (such as table definitions and schema) in the Data Catalog. After the data is cataloged, it is immediately searchable, queryable, and available for ETL. For more information, see Populating the AWS Glue Data Catalog.
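
The following is a minimal sketch of defining such a crawler and its daily schedule with Boto3; the crawler name, IAM role, and schedule expression are placeholders, while the database name and S3 path match the examples used later in this post.

import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Crawl the Parquet CUR data daily and keep the Data Catalog table up to date
# (crawler name, role, and schedule are placeholders)
glue.create_crawler(
    Name="cur-parquet-crawler",
    Role="arn:aws:iam::444455556666:role/glue-cur-crawler-role",
    DatabaseName="aruba_curr_db",
    Targets={"S3Targets": [{"Path": "s3://curS3bucket/Parquet/"}]},
    Schedule="cron(0 2 * * ? *)",  # run once a day at 02:00 UTC
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)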

The following screenshot shows the crawler created on Amazon S3 location of the CUR data.

The following code is an example table definition populated by the crawler:

CREATE EXTERNAL TABLE `cur_parquet`(
  `identity_line_item_id` string, 
  `identity_time_interval` string, 
  `bill_invoice_id` string, 
………
………
  `resource_tags_user_infra_role` string)

PARTITIONED BY ( 
  `year` string, 
  `month` string )

ROW FORMAT SERDE  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://curS3bucket/Parquet/'

Transforming and loading using Amazon Redshift

Next, for the analytics layer, Aruba chose Amazon Redshift over Athena. Aruba needs to integrate the cost data with other tables already present in Amazon Redshift, so using the same service makes that integration straightforward. To filter and transform the data at the same time and simplify the multi-step ETL, Aruba chose Amazon Redshift Spectrum, which efficiently queries and loads CUR data from Amazon S3. For more information, see Getting started with Amazon Redshift Spectrum.

Use the following query to create an external schema and map it to the AWS Glue database created earlier in the Data Catalog:

--Choose a schema name of your choice; cur_redshift_spectrum_external_schema is just an example--
 create external schema cur_redshift_spectrum_external_schema from data catalog database 
 'aruba_curr_db' iam_role 'arn:aws:iam::xxxxxxxxxxxxx:role/redshiftclusterrole' 
 create external database if not exists;

The table created in the Data Catalog appears under the Amazon Redshift Spectrum schema. The schema, table, and records created can be verified with the following SQL code:

SELECT Count(*) 
FROM   cur_redshift_spectrum_external_schema.<TABLE>; 

--Query the right partition; year=2020 and month=2 is used as an example
SELECT Count(*) 
FROM   cur_redshift_spectrum_external_schema.<TABLE> 
WHERE  year='2020' 
AND    month='2';

Next, transform and load the data into the Amazon Redshift table. Aruba started by creating an Amazon Redshift table to contain the data. The following SQL code can be used to create the production table with the desired columns:

CREATE TABLE redshift_schema.redshift_table 
  ( 
     usage_start_date TIMESTAMP, 
     usage_end_date   TIMESTAMP, 
     service_region   VARCHAR (256), 
     service_az       VARCHAR (256), 
     aws_resource_id  VARCHAR (256), 
     usage_amount     FLOAT (17), 
     charge_currency  VARCHAR (256), 
     aws_product_name VARCHAR (256), 
     instance_family  VARCHAR (256), 
     instance_type    VARCHAR (256), 
     unblended_cost   FLOAT (17), 
     usage_cost       FLOAT (17)
  ); 

CUR is dynamic in nature, which means that some columns may appear or disappear with each update. When creating the table, we take static columns only. For more information, see Line item details.

Next, ingest the data from Amazon S3 into the Amazon Redshift table with an insert-and-refresh approach. Each CUR update is cumulative, which means that each version of the CUR includes all the line items and information from the previous version.

The reports generated throughout the month are estimated and subject to change during the rest of the month. AWS finalizes the report at the end of each month. Finalized reports have the calculations for the blended and unblended costs, and cover all the usage for the month. For this use case, Aruba reloads the last 45 days of data to make sure the finalized cost is captured. The following sample query loads that window from the external table into the Amazon Redshift table:

-- Insert the last 45 days of CUR data from the external table
 INSERT INTO redshift_schema.redshift_table
            (usage_start_date, 
             usage_end_date, 
             service_region, 
             service_az, 
             aws_resource_id, 
             usage_amount, 
             charge_currency, 
             aws_product_name, 
             instance_family, 
             instance_type, 
             unblended_cost,
             Usage_Cost ) 
 SELECT line_item_usage_start_date, 
       line_item_usage_end_date, 
       line_item_operation, 
       line_item_availability_zone, 
       line_item_resource_id, 
       line_item_usage_amount, 
       line_item_currency_code, 
       product_product_name, 
       product_instance_family, 
       product_instance_type, 
       line_item_unblended_cost,
       case when line_item_type='Usage' then line_item_unblended_cost
            else 0
            end as usage_cost 
 FROM   cur_redshift_spectrum_external_schema.cur_parquet
 WHERE  line_item_usage_start_date >= date_add('day', -45, getdate()) 
       AND line_item_usage_start_date < date_add('day', 1, getdate()); 
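
Because each CUR delivery is cumulative, the same 45-day window must be cleared before it is re-inserted, or the rows would be duplicated. The following is a minimal sketch of that refresh logic in Python using the redshift_connector package; the cluster endpoint, credentials, and the insert_last_45_days.sql helper file are placeholders, and in Aruba's setup the equivalent SQL runs daily under Data Pipeline, as described in the next section.

import redshift_connector

# Connection details are placeholders; use your own cluster endpoint and credentials
conn = redshift_connector.connect(
    host="example-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="example-password",
)

cursor = conn.cursor()

# Clear the 45-day window being refreshed, then reload it from the Spectrum external table
cursor.execute(
    "DELETE FROM redshift_schema.redshift_table "
    "WHERE usage_start_date >= date_add('day', -45, getdate());"
)

# insert_last_45_days.sql is a hypothetical file holding the INSERT ... SELECT shown above
cursor.execute(open("insert_last_45_days.sql").read())

conn.commit()
conn.close()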

Using Data Pipeline to orchestrate the ETL workflow

To automate this ETL workflow, Aruba chose Data Pipeline. Data Pipeline helps to reliably process and move data between different AWS compute and storage services, as well as on-premises data sources. With Data Pipeline, Aruba can regularly access their data where it’s stored, transform and process it at scale, and efficiently transfer the results to AWS services such as Amazon S3, Amazon Relational Database Service (Amazon RDS), Amazon DynamoDB, and Amazon EMR. Although the detailed steps of setting up this pipeline are out of scope for this blog, there is a sample workflow definition JSON file, which can be imported after making the necessary changes.

Data Pipeline workflow

The following screenshot shows the multi-step ETL workflow using Data Pipeline. Data Pipeline is used to run the INSERT query daily, which inserts and updates the latest CUR data into our Amazon Redshift table from the external table.

To copy data to Amazon Redshift, a RedshiftDataNode and a RedshiftCopyActivity can be used and scheduled to run periodically.

Sharing metrics and creating visuals with QuickSight

To share the cost and usage with other teams, Aruba chose QuickSight with Amazon Redshift as the data source. QuickSight is a native AWS service that seamlessly integrates with other AWS services such as Amazon Redshift, Athena, Amazon S3, and many other data sources.

As a fully managed service, QuickSight lets Aruba easily create and publish interactive dashboards that include ML Insights. In addition to building powerful visualizations, QuickSight provides data preparation tools that make it easy to filter and transform the data into exactly the dataset needed. As a cloud-native service, dashboards can be accessed from any device and embedded into applications and portals, allowing other teams to monitor their resource usage easily. For more information about creating a dataset, see Creating a Dataset from a Database. QuickSight visuals can then be created from this dataset.
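
The following is a minimal sketch of registering the Amazon Redshift cluster as a QuickSight data source with Boto3; the account ID, cluster details, credentials, and principal ARN are placeholders. A dataset and the visuals below can then be built on top of this data source in the QuickSight console or via the CreateDataSet API.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

# Register the Redshift cluster holding the CUR tables as a QuickSight data source
# (all identifiers and credentials are placeholders)
quicksight.create_data_source(
    AwsAccountId="444455556666",
    DataSourceId="cur-redshift-source",
    Name="CUR cost analysis",
    Type="REDSHIFT",
    DataSourceParameters={
        "RedshiftParameters": {
            "Host": "example-cluster.abc123xyz.us-east-1.redshift.amazonaws.com",
            "Port": 5439,
            "Database": "dev",
            "ClusterId": "aruba-analytics-cluster",
        }
    },
    Credentials={
        "CredentialPair": {"Username": "quicksight_ro", "Password": "example-password"}
    },
    Permissions=[
        {
            "Principal": "arn:aws:quicksight:us-east-1:444455556666:user/default/cost-admin",
            "Actions": [
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
            ],
        }
    ],
)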

The following screenshot shows a visual comparison of device cost and count to help find the cost per device. This visual helped Aruba quickly identify the cost per device increase in April and take necessary actions.

Similarly, the following visualization helped Aruba identify an increase in data transfer cost and helped them decide to invest in rearchitecting their application.

The following visualization classifies the cost spend per resource.

Conclusion

In this post, we discussed how Aruba Networks was able to successfully achieve the following:

  • Generate CUR and use AWS Glue to define data, catalog the data, and update the metadata
  • Use Amazon Redshift Spectrum to transform and load the data to Amazon Redshift tables
  • Query, visualize, and share the data stored using QuickSight
  • Automate and orchestrate the entire solution using Data Pipeline

Aruba uses this solution to automatically generate a daily cost report and share it with their stakeholders, including executives and the cloud operations team.

 


About the Authors

Siddharth Thacker works in Business & Finance Strategy in the Cloud Software division at Aruba Networks. Siddharth has a Master's in Finance, with experience in industries like banking, investment management, and cloud software, and focuses on business analytics, margin improvement, and strategic partnerships at Aruba. In his spare time, he likes exploring the outdoors and participating in team sports.

Swatishree Sahu is a Technical Data Analyst at Aruba Networks. She lived and worked in India for 7 years as an SME for SOA-based integration tools before coming to the US to pursue her master's in Business Analytics from UT Dallas. Breaking down and analyzing data is her passion. She is a Star Wars geek, and in her free time, she loves gardening, painting, and traveling.

Ritesh Chaman is a Technical Account Manager at Amazon Web Services. With 10 years of experience in the IT industry, Ritesh has a strong background in Data Analytics, Data Management, and Big Data systems. In his spare time, he loves cooking (spicy Indian food), watching sci-fi movies, and playing sports.

 

 

 

Kunal Ghosh is a Solutions Architect at AWS. His passion is to build efficient and effective solutions on the cloud, especially involving Analytics, AI, Data Science, and Machine Learning. Besides family time, he likes reading and watching movies, and is a foodie.