Tag Archives: Amazon QuickSight

Enhancing dashboard interactivity with Amazon QuickSight Actions

Post Syndicated from Sahitya Pandiri original https://aws.amazon.com/blogs/big-data/enhancing-dashboard-interactivity-with-amazon-quicksight-actions/

Amazon QuickSight now offers enhanced dashboard interactivity through QuickSight Actions. Actions provide advanced filtering through single point-and-click interactions on dashboards. With Actions, you can link visuals within a dashboard so that selecting a dimensional point on one visual provides granular insights on that point in other visuals within your dashboard. You can therefore start with summaries and dive deep into the details of your business metrics, all within the same dashboard sheet. You define which visuals within your dashboard are interactive and how they interact with each other. As of this writing, QuickSight Actions let you define two primary types of interactivity: filter actions and URL actions. URL actions within Amazon QuickSight are not new, but the point of entry to create them is now consolidated with Actions.

You can apply QuickSight Actions to any supported chart that holds at least one dimension. This post provides examples of getting started with Actions, configuring different actions on a dashboard, and enabling different forms of interactivity for each configured action.

This post uses the following data sets:

B2B Sales
This data set holds order details for a fictitious company, ABCDO, for 2016 and 2017. The dashboard we build reports on sales metrics with industry, segment, and region as primary dimensions, and also provides granular details for each order purchased.

Product Availability
This data set holds available quantity for every product by ID.

Prerequisites

Before implementing Actions on your Amazon QuickSight dashboards, review how to create and publish dashboards.

Getting started with QuickSight Actions

The following screenshot shows a dashboard built from the two data sets above. It shows sales by category, industry, and region on row 1; segment sales by quarter and industry sales by segment on row 2; total profit, sales, discount, and quantity sold on row 3; an order details pivot on row 4; and a shipping details pivot on row 5.

Before getting started, note the following terminology:

  • Source visual – The visual on which an action is created. Choosing a point on the source visual triggers the action, and the dimensional values chosen are passed as filters to target visuals.
  • Target visual – The visual that is filtered by the dimensional values chosen on the source visual.
  • Activation – You can trigger an action either by selecting it directly (left-click) or by choosing it from the menu options (right-click).
  • Action type – You can configure two types of actions on a visual: filter actions and URL actions. Filter actions pass selected or all dimensions as filters to selected or all visuals on the dashboard. URL actions let you navigate from the dashboard to an external site or a different sheet within the same dashboard, with the selected dimension passed along.

Setting up a click/select action

To set up a click/select action, complete the following steps:

  1. Select the source visual on the analysis and choose Actions. The following screenshot shows this step.
  2. Within Actions, you can either create and configure your own actions or use the 1-click action setup. The 1-click setup creates a default action on the visual so that when a point on the visual is selected, all dimensions of that point are passed as filters to all other visuals on the dashboard. Under Quick create, choose Filter same-sheet visuals.

This creates a default action, which you can modify at any time.

After creating the 1-click action, you can select any point on the source visual. The following screenshot shows that the selected point is highlighted and all the underlying dimensions are used as filters on all other visuals on the analysis or dashboard.

In this screenshot, selecting Copper and Diamonds for the year 2017 passes these two as filters to all other visuals.

To verify what filter actions are applied on a particular visual, choose the filter icon on the top right. This shows all the filters applied through filter actions and also regular filters.

Setting up custom filter actions

To perform further analysis on the sales metrics for any segment, configure an action on the Industry sales by Segment visual. To configure a filter action to pass segment as the filter onto the KPI charts, complete the following steps:

  1. Choose the source visual and choose Actions.
  2. Choose Define a custom action.
  3. For Action name, enter a name for your action. You can also add dynamic placeholder values to the name to make it more descriptive for your readers by choosing the plus sign next to the Action name. You have four configuration settings. While this post uses values specific to this use case, you can choose other values according to your needs.
  4. For Activation, select Menu option.
  5. For Action type, choose Filter action.
  6. For Filter scope, select Selected fields and choose segment.
  7. For Target visuals, select Select visuals and check the target visuals Total quantity, Total discount offered, and Total profit by year.
  8. Choose Save.

To view menu options and choose a filter action, right-click any segment on the visual. This post selects Strategic. The following screenshot shows that all the Strategic points are highlighted and the KPIs are updated.

Adding filter actions to pivot tables

Filter actions on pivot tables provide exceptional flexibility to pass anything from the innermost to the outermost dimensions. Additionally, when a filter action is triggered, the source pivot table is highlighted to show values that are passed as filters to the target visuals.

This post adds click/select filter actions on the Order details pivot table to analyze shipping details for each order. This post also adds two filter actions to the menu options: one to filter by the product name and the other by the customer name.

To create click/select filter actions on order_id, complete the following steps:

  1. Choose the Order details pivot table chart.
  2. Choose Actions.
  3. Choose Define custom action.
  4. For Action name, enter your action name.
  5. For Activation, select Select.
  6. For Filter scope, select Selected fields and check order_id.
  7. For Target visuals, select Select visuals, and check Shipping details.
  8. Choose Save.

The following screenshot shows that when choosing any point on the pivot table, the order pertaining to the point is highlighted and the Shipping details table shows shipping details for the selected order.

You can create only one select-activated action on a source visual, but any number of menu option actions.

You can now create additional actions in the menu. To create the new actions, go to Actions, choose the plus sign, and create these two actions.

After creating the two actions, they are visible in the Actions menu. See the following screenshot.

After setting up these action filters, you can right-click on the visual and trigger these filters from the menu.

The following screenshot shows the menu that appears when you right-click a product name.

The following screenshot shows the menu that appears when you right-click a customer name.

Removing applied filter actions

There are three ways to remove a triggered filter action:

  • On source visuals with filter actions triggered via select or menu options, clear the filter action by choosing the selected point again or a different point.
  • On source visuals with empty spaces within the visual (such as bar chart family or pie chart), clicking within the empty space also clears the filter action selection. This deselection is not applicable to heat maps and tree maps. Deselect points on these charts by choosing an already selected point.
  • On target visuals with a highlighted filter icon, choosing the X on the filter actions displayed removes the filter for this visual and all other visuals the filter is applied to.

Using action filters when the source and target visuals use different data sets

If you use multiple data sets within an analysis, Amazon QuickSight automatically maps fields across these data sets by the field name. For example, product_id is a field in the B2B Sales data set and Product Availability data set. When setting a filter action on product_id from a source visual using the B2B Sales data set, the target visual (the adjacent KPI) showing product quantity availability shows results for the selected product_id from the source visual. This implicit mapping of fields is case sensitive.

You can add the new data set Product Availability to this analysis. For more information, see Adding a Data Set to an Analysis.

Add a KPI chart to show count of product_id as in the screenshot below.

To show the total quantity available for selected products, create a filter action on the menu options, as shown in the screenshot below:

To drill down into product quantity availability, choose (right-click) product_id on the order details pivot table. The following screenshot shows the product availability details for the selected product_id.

Creating custom URL actions

Custom URL actions aren’t new to Amazon QuickSight, but the point of entry to create URL actions is now consolidated with Actions. To create a custom URL action on the Order details pivot table to see Google search results on a customer, complete the following steps:

  1. Choose the Order details pivot table.
  2. Choose Actions.
  3. Choose Add a new action.
  4. For Action name, enter an action name.
  5. For Action type, choose URL action.
  6. Enter the URL https://www.google.com/search?q=<<customer_name>>. Here, <<customer_name>> is a placeholder that QuickSight replaces with the value of the selected customer_name dimension (see the example following these steps).
  7. Choose New browser tab.
  8. Choose Save.
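
For example, if the customer name selected on the pivot table were the hypothetical value Acme, the action would open a URL similar to the following in a new browser tab:

https://www.google.com/search?q=Acme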

When you right-click on the pivot table, menu options appear with your two filter actions created previously and the new custom URL action. See the following screenshot.

Conclusion

With Amazon QuickSight, you can create filter actions on any chart type with at least one dimensional value. You can also create URL actions on all chart types, including KPIs and gauge charts, which do not hold dimensional values. As of this writing, QuickSight Actions don’t support selecting multiple data points within the same source visual, cascading filter actions in which you choose one filter action on one visual and a subsequent filter on another visual, or adding custom mappings between different data sets used in the dashboard. You can expect to see these features in future releases.

QuickSight Actions are available in both Enterprise and Standard editions in all QuickSight-supported Regions.

About the Author

Sahitya Pandiri is a technical program manager with Amazon Web Services. Sahitya has been in product/program management for 6 years now, and has built multiple products in the retail, healthcare, and analytics space.

Amazon QuickSight announces the all-new QuickSight Mobile app

Post Syndicated from Tina Kelleher original https://aws.amazon.com/blogs/big-data/announcing-the-new-mobile-app-for-amazon-quicksight/

AWS is happy to announce the release of QuickSight Mobile for iOS and Android devices. This release is both a major update to the existing iOS app and the launch of a new Android application. The app enables you to securely get insights from your data from anywhere; favorite, browse, and interact with your dashboards; explore your data with drilldowns and filters; stay ahead of the curve via forecasting; get email alerts when unexpected changes happen in your data; and share those insights with colleagues.

To download the QuickSight Mobile app, visit the iOS App Store and Google Play.

The new QuickSight Mobile app optimizes the dashboard consumption experience with newly added features and enhanced interactivity. In this post, we walk through the new mobile experience in detail.

Features and capabilities

Home page

After authenticating with the app, you land on the QuickSight home screen. From here, you have quick access to your favorite dashboards (synced from the browser application) and your most recently viewed dashboards. The recently viewed list can display up to the 100 dashboards you most recently opened.

The following screenshots show your favorited and recently viewed dashboards.

Dashboards

You can browse through all your dashboards via the Dashboards tab. You can sort the dashboards by name or updated (published) date and change to a list or a grid view. You can also search for dashboards by name. Lastly, you can easily add any dashboard to your favorites by choosing the star icon next to the dashboard’s name.

The following screenshots show dashboards in grid view, display preferences and search tool.

Dashboard view

The dashboard view is optimized for your mobile device by showing the visuals in vertical stack view, with each visual expanded to full width. A multi-sheet dashboard shows the first sheet by default. To navigate across sheets, choose the sheet title and select the sheet name you want to view.

The following screenshots show an opened dashboard with visuals in vertical stack view and how you navigate across different sheets within the dashboard.

To apply filters to visuals, choose the funnel icon to view all available filter controls and options for your dashboard sheet. After you’ve made your selection, choose Apply, and choose Done. You can see the number of filters applied via the small blue tag next to the funnel icon. You can also expand the filter control view and reset a given filter to the default.

The following screenshots show dashboard filters, expanded filter pane and how you select filter values.

Some visuals have a scroll bar, which allows you to customize the range to zoom in on a visual. A long press on a data point on most charts, such as a line or a bar, brings up the context menu. You can focus on this data point, drill up or down if the visual has a drill hierarchy, and exclude the chosen data point.

The following screenshots show interactions with zoom bars and context menu.

To go back to the full view, choose the … icon in the upper right and choose Undo. To exit the dashboard view, choose the back arrow.

If a visual is an anomaly detection widget, you can choose Explore anomalies to go to the detailed anomalies page. Here, you can see all the anomalies detected for the latest period. To perform contribution analysis and uncover the hidden drivers, long press the anomaly data point; the option appears in the bottom sheet. To find out more about anomaly detection and contribution analysis, see Amazon QuickSight Announces General Availability of ML Insights.

You can expect contribution analysis to extend support to all visuals beyond anomaly detection on the app in the near future.

The following screenshots show the anomaly details and contribution analysis flow.

Visual view

From the dashboard view, choose the expand icon in the upper right corner to expand the desired visual to full-screen. The app supports both portrait and landscape modes in the expanded visual, and you can use the same interactions such as zoom, drill, focus, and exclude.

The following screenshots show expanded visuals in both portrait and landscape modes.

Settings

The Settings tab allows you to change language, Region, and biometric authentication methods. You can also provide feedback by choosing Provide feedback, which initiates an email to our mobile team.

The following screenshot shows the options on the Settings page.

Setup and authentication

The QuickSight Mobile app supports authentication by any provisioned user. You can authenticate by using either your Amazon QuickSight username and password (with MFA, if this option is enabled), or through your own identity provider (IdP). After authentication, you can re-authenticate using biometric authentication such as face ID or touch ID, if supported by your mobile device, or an application-specific PIN.

The following screenshots show the generic sign-in flow.

IdP-initiated login

For IdP-initiated logins, set up your account with identity federation. For more information, see Federate Amazon QuickSight access with Okta. You can federate using the standard mobile federation processes that your IdP provides, either through an app or through the web. This post uses the Okta mobile app. After you federate, you receive a prompt to continue your session either in the mobile web experience or in the QuickSight Mobile app.

Account administrators can set the duration of mobile sessions to be anywhere from 1–30 days through the administrative console. The default session length for new accounts is 30 days.

The following screenshots show IdP-initiated login flow.

Microsoft Active Directory

If your account uses Microsoft Active Directory, you can log in to the app using your current AD credentials. If your account requires MFA, you also need to provide your MFA token to authenticate. For more information, see Can I use AWS Directory Service for Microsoft Active Directory to authenticate users in Amazon QuickSight?

Email Invitations

If you receive an email invitation to Amazon QuickSight, follow the instructions in the email to set your username and password through the website (mobile or desktop). Use these credentials to authenticate with the mobile applications.

IAM Users

If you sign in as an IAM user, you need to provision yourself as an Amazon QuickSight user before authenticating in the mobile application. For more information on provisioning users, see Provisioning Users for Amazon QuickSight.

Summary

This post provided an overview of the key features and functionalities as well as sign-in flows for the new QuickSight Mobile app. The app is available as a free download in Google Play and the iOS App Store. You can expect the app to continue to evolve its features and capabilities to give you the best experience for getting insights from your data while on the go!

If you have any questions or feedback, please leave a comment.

About the Authors

Susan Fang is a senior product manager for QuickSight with AWS.

Brian Stein is a software development manager at AWS.

Joining across data sources on Amazon QuickSight

Post Syndicated from Rakshith Dayananda original https://aws.amazon.com/blogs/big-data/joining-across-data-sources-on-amazon-quicksight/

Amazon QuickSight announced the launch of Cross Data Source Join, which allows you to connect to multiple data sources and join data across these sources in Amazon QuickSight directly to create data sets used to build dashboards. For example, you can join transactional data in Amazon Redshift that contains customer IDs with Salesforce tables that contain customer profile data to create an interactive dashboard with order and customer details. You can slice and dice transactional data by various customer dimensional data such as segment, geographic, or demographic without first pulling the data to a single source outside Amazon QuickSight.

With cross data source joins, you can join across all data sources supported by Amazon QuickSight, including file-to-file, file-to-database, and database-to-database joins using the built-in drag-and-drop UI, without heavily relying on the BI and data engineering teams to set up complex and time-consuming ETLs. Whether it is local CSV files, Amazon RDS databases, or JSON objects on an S3 bucket, you can now join any of these data sources together to create data sets.

Finally, you can set up a scheduled refresh as frequently as every hour to ensure that the joined data set is always up to date with the latest information.

Getting started with Cross Data Source Join

The screenshot below shows all data sources you can connect to on QuickSight.

Amazon QuickSight allows you to connect to different data sources. It is common for businesses to have data spread across multiple data sources, depending on your data requirements. For example, you might have your web server logs stored in Amazon S3, customer details on Amazon Redshift tables, and order details on RDS. You may need to build reports from data combined from two or more of these different data sources.

You can accomplish this to some extent by building data pipelines that consolidate data from multiple data sources into a single data source. However, creating these data pipelines results in data duplication across various AWS services, and adds cost in terms of the effort and time needed to move data to a single data source. You would then build Amazon QuickSight data sets from this single data source. With cross data source join available directly in Amazon QuickSight, you can eliminate this problem.

There is no size restriction on your largest source, as long as the post-join table can fit into your SPICE capacity per data set. The rest of the tables together need to be within 1 GB in size. For example, if you have 20 numeric columns in your smaller table, you can fit about 5 million rows until you exceed the 1 GB memory limit.

This post demonstrates how to create data sets from two CSV files and how to join an RDS table with an S3 file. The post uses an example table with orders-related data in one data source, and returns-related data in another data source. The final goal is to create one single data set that contains both orders and returns data.

Prerequisites

Before getting started, download these CSV files to your local machine:

Also, learn how to create, edit, and delete QuickSight data sources in Working with Data Sources in Amazon QuickSight.

Joining multiple CSV files

To join two CSV files, complete the following steps:

  1. Use the orders CSV file downloaded from the S3 bucket above and upload it to QuickSight.
  2. After selecting the orders sheet, go to the edit/preview data page, where your data set details appear.
  3. From the top menu, choose Add data. A pop-up window appears with the option to either switch data sources or upload a new CSV to join with the current data set. The window also shows the existing source.
  4. Choose Upload a file and upload the returns CSV file. After uploading the file, you can see sheet names for both CSVs. The following screenshot shows orders and returns.
  5. Choose the two circles between the files. This step allows you to edit the join configuration between the two CSVs. In the Join configuration section, you can select the join type (inner, left, right, or full) and the column on which to apply the join. This post uses an inner join.
  6. For Join type, choose Inner. This post joins the two files on their Order ID columns.
  7. For Join clauses, select Order ID in both drop-downs.
  8. Choose Apply.

You now have a data set that contains both orders and returns data from two different CSVs. You can save your data set and continue to create your analysis.
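
Conceptually, the inner join configured through the UI behaves like the following SQL sketch. The table and column names (orders, returns, and Order ID) are assumptions based on the sample files, and QuickSight performs the join for you when it prepares the data set:

-- Conceptual equivalent of the inner join configured in the QuickSight UI
-- (table and column names are assumptions based on the sample CSV files)
SELECT o.*, r.*
FROM   orders o
       INNER JOIN returns r
       ON o."Order ID" = r."Order ID";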

Joining an RDS table with S3

In this example, you have orders data in RDS and returns data as a JSON file in an S3 bucket. To join this data, complete the following steps:

  1. Create a data set on QuickSight from the RDS table. Review Create a Database Data Set and an Analysis to learn how to connect to RDS to create data sets.
  2. Go to Manage data and select the RDS data set. This post uses the orders data set.
  3. Choose Edit data set. A page with your data set details appears.
  4. From the top menu, choose Add data. A pop-up window appears with the option to either switch data sources or upload a new CSV to join with the current data set. The window also shows the existing source.
  5. Choose Switch data source. A list appears of the different data sets and their sources.
  6. Choose a data set. This post chooses Returns. You can now see both data sets linked together.
  7. Choose the two circles between the data sets.
  8. Under Join configuration, choose your desired Join type. This post chooses Inner.
  9. For Join clauses, select the column on which to apply the join from the drop-down for each data set. This post chooses order_id for orders and Order ID for Returns.
  10. Choose Apply.

Your new data set contains both orders and returns data from two different data sources. You can save your data set and continue to create your analysis.

Conclusion

This post showed you how to join data from two files and how to join an RDS table with data in S3. You can join data from any two data sources (except IoT) on Amazon QuickSight with this method. Cross data source join capability is now available in both Enterprise and Standard editions in all Amazon QuickSight Regions.

About the Author

Rakshith Dayananda is an IT app dev engineer at AWS.

Analyze Google Analytics data using Upsolver, Amazon Athena, and Amazon QuickSight

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyze-google-analytics-data-using-upsolver-amazon-athena-and-amazon-quicksight/

In this post, we present a solution for analyzing Google Analytics data using Amazon Athena. We’re including a reference architecture built on moving hit-level data from Google Analytics to Amazon S3, performing joins and enrichments, and visualizing the data using Amazon Athena and Amazon QuickSight. Upsolver is used for data lake automation and orchestration, enabling customers to get started quickly.

Google Analytics is a popular solution for organizations that want to understand the performance of their web properties and applications. Google Analytics data is collected and aggregated to help users extract insights quickly. This works great for simple analytics. It’s less than ideal, however, when you need to enrich Google Analytics data with other datasets to produce a comprehensive view of the customer journey.

Why analyze Google Analytics data on AWS?

Google Analytics has become the de-facto standard web analytics tool. It is offered for free at lower data volumes and provides tracking, analytics, and reporting.  It enables non-technical users to understand website performance by answering questions such as: where are users coming from? Which pages have the highest conversion rates? Where are users experiencing friction and abandoning their shopping cart?

While these questions can be answered within the Google Analytics UI, there are some limitations, such as:

  • Data sampling: Google Analytics standard edition displays sampled data when running ad hoc queries on time periods that contain more than 500,000 sessions. Large websites can easily exceed this number on a weekly or even daily basis. This can create reliability issues between different reports, as each query can be fed by a different sample of the data.
  • Difficulty integrating with existing AWS stack: Many customers have built or are in the process of building their data and analytics platform on AWS. Customers want to use the AWS analytics and machine learning capabilities with their Google Analytics data to enable new and innovative use cases.
  • Joining with external data sources: Seeing the full picture of a business’ online activity might require combining web traffic data with other sources. Google Analytics does not offer a simple way to either move raw data in or out of the system. Custom dimensions in Google Analytics can be used, but they are limited to 20 for the standard edition and are difficult to use.
  • Multi-dimensional analysis: Google Analytics custom reports and APIs are limited to seven dimensions per query. This limits the depth of analysis and requires various workarounds for more granular slicing and dicing.
  • Lack of alternatives: Google Analytics 360, which allows users to export raw data to Google BigQuery, carries a hefty annual fee. This can be prohibitive for organizations. And even with this upgrade, the native integration is only with BigQuery, which means users still can’t use their existing AWS stack.

Building or buying a new web analytics solution (including cookie-based tracking) is also cost-prohibitive, and can interrupt existing workflows that rely on Google Analytics data.

Customers are looking for a solution to enable their analysts and business users to incorporate Google Analytics data into their existing workflows using familiar AWS tools.

Moving Google Analytics data to AWS: Defining the requirements

To provide an analytics solution with the same or better level of reporting as Google Analytics, we designed our solution around the following tenets:

  1. Analytics with a low technical barrier to entry: Google Analytics is built for business users, and our solution is designed to provide a similar experience. This means that beyond ingesting the data, we want to automate the data engineering work that goes into making the data ready for analysis.  This includes data retention, partitioning, and compression. All of this work must be done under the hood and remain invisible to the user querying the data.
  2. Hit-level data: Google Analytics tracks clickstream activity based on Hits – the lowest level of interaction between a user and a webpage. These hits are then grouped into Sessions – hits within a given time period, and Users – groups of sessions (more details here). The standard Google Analytics API is limited to session and user-based queries, and does not offer any simple way of extracting hit-level data. Our solution, however, does provide access to this granular data.
  3. Unsampled data: By extracting the data from Google Analytics and storing it on Amazon S3, we are able to bypass the 500K sessions limitation. We also have access to unsampled data for any query at any scale.
  4. Data privacy: If sensitive data is stored in Google Analytics, relying on third-party ETL tools can create risks around data privacy, especially in the era of GDPR. Therefore, our solution encrypts data in transit and relies exclusively on processing within the customer’s VPC.

Solution overview

The solution is built on extracting hit-level data and storing it in a data lake architecture on Amazon S3. We then use Amazon Athena and Amazon QuickSight for analytics and reporting. Upsolver, an AWS premier solution provider, is used to automate ingestion, ETL, and data management on S3. Upsolver also orchestrates the entire solution with a simple-to-use graphical user interface. The following diagram shows the high-level architecture of our solution.

Reference architecture showing the flow of data across Google Analytics, Amazon Athena, and Amazon QuickSight

Using Upsolver’s GA connector we extract unsampled, hit-level data from Google Analytics. This data is then automatically ingested according to accepted data lake best practices and stored in an optimized form on Amazon S3. The following best practices are applied to the data:

  • Store data in Apache Parquet columnar file format to improve read performance and reduce the amount of data scanned per query.
  • Partition data by event (hit) time rather than by API query time.
  • Perform periodic compaction by which small files are merged into larger ones improving performance and optimizing compression.

Once data is stored on S3, we use Upsolver’s GUI to create structured fact tables from the Google Analytics data. Users can query them using Amazon Athena and Amazon Redshift. Upsolver provides simple to use templates to help users quickly create tables from their Google Analytics data.  Finally, we use Amazon QuickSight to create interactive dashboards to visualize the data.

The result is a complete view of our Google Analytics data. This view provides the level of self-service analytics that users have grown accustomed to, at any scale, and without the limitations outlined earlier.

Building the solution: Step by step guide

In this section, we walk through the steps to set up the environment, configure Upsolver’s Google Analytics plugin, extract the data, and begin exploring.

Step 1: Installation and permissions

  1. Sign up for Upsolver (can also be done via the AWS Marketplace).
  2. Allow Upsolver access to read data from Google Analytics and add new custom dimensions. Custom dimensions enable Upsolver to read non-sampled hit-level data directly from Google Analytics instead of creating parallel tracking mechanisms that aren’t as trustworthy.
  3. To populate the custom dimensions that were added to Google Analytics, allow Upsolver to run a small JavaScript code on your website. If you’re using GA360, this is not required.

Step 2: Review and clean the raw data

For supported data sources, Upsolver automatically discovers the schema and collects key statistics for every field in the table. Doing so gives users a glimpse into their data.

In the following screenshot, you can see schema-on-read information on the left side, stats per field and value distribution on the right side.


Step 3: Publishing to Amazon Athena

Upsolver comes with four templates for creating tables in your AWS based data lake according to the Google Analytics entity being analyzed:

  • Pageviews – used to analyze user flow and behavior on specific sections of the web property using metrics such as time on page and exit rate.
  • Events – user-defined interactions such as scroll depth and link clicks.
  • Sessions – monitor a specific journey in the web property (all pageviews and events).
  • Users – understand a user’s interaction with the web property or app over time.

All tables are partitioned by event time, which helps improve query performance.

Upsolver users can choose to run the templates as-is, modify them first or create new tables unique to their needs.

The following screenshot shows the schema produced by the Pageviews template:


The following screenshot shows the Pageviews and Events tables as well as the Amazon Athena views for Sessions and Users generated by the Upsolver templates.


The following are a couple example queries you may want to run to extract specific insights:

-- Popular page titles 
SELECT page_title, 
       Count(*) AS num_hits 
FROM   ga_hits_pageviews 
GROUP  BY page_title 
ORDER  BY 2 DESC 
-- User aggregations from hit data 
SELECT user_id, 
       Count(*)                   AS num_hits, 
       Count(DISTINCT session_id) AS num_of_sessions, 
       Sum(session_duration)      AS total_sessions_time 
FROM   ga_hits_pageviews 
GROUP  BY user_id 
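
Because the tables are partitioned by event time, adding a filter on the partition column keeps the amount of data scanned per query (and therefore cost) low. The following is a sketch of such a query; the partition column name event_date is an assumption and may differ in the schema Upsolver generates:

-- Restrict the scan to one week of event-time partitions
-- (the partition column name is an assumption)
SELECT page_title, 
       Count(*) AS num_hits 
FROM   ga_hits_pageviews 
WHERE  event_date BETWEEN DATE '2019-09-01' AND DATE '2019-09-07' 
GROUP  BY page_title 
ORDER  BY 2 DESC 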

Step 4: Visualization in Amazon QuickSight

Now that the data has been ingested, cleansed, and written to S3 in a structured manner, we are ready to visualize it with Amazon QuickSight. Start by creating a dashboard to mimic the one provided by Google Analytics. But we don’t need to stop there. We can use QuickSight ML Insights to extract deeper insights from our data. We can also embed Amazon QuickSight visualizations into existing web portals and applications, making insights available to everyone.

Screenshot of QuickSight visualization showing several sections, one with a graph, several others with various statistics

Screenshot of QuickSight showing a global map with usage concentrations marked by bubbles, alongside a pie graph.

Screenshot of QuickSight showing a bar graph, alongside a table with various data values.

Conclusion

With minimal setup, we were able to extract raw hit-level Google Analytics data, prepare it, and store it in a data lake on Amazon S3. Using Upsolver, combined with Amazon Athena and Amazon QuickSight, we built a feature-complete solution for analyzing web traffic collected by Google Analytics on AWS.

Key technical benefits:

  • Schema-on-read means data consumers don’t need to model the data into a table structure, and can instantly understand what their top dimensions are. For example: 85% of my users navigate my website using the Google Chrome browser.
  • Graphical user interface that enables self-service consumption of Google Analytics data.
  • Fast implementation using pre-defined templates that map raw data from Google Analytics to tables in the data lake.
  • Ability to replay historical Google Analytics data stored on Amazon S3.
  • Ability to partition the data on Amazon S3 by hit time, reducing the complexity of handling late-arriving events.
  • Optimize data on Amazon S3 automatically for improved query performance.
  • Automatically manage tables and partitions in AWS Glue Data Catalog.
  • Fully integrated with a suite of AWS native services – Amazon S3, Amazon Athena, Amazon Redshift and Amazon QuickSight.

Now that we have feature parity, we can begin to explore integrating other data sources such as CRM, sales, and customer profile to build a true 360-degree view of the customer.  Furthermore, you can now begin using AWS Machine Learning services to optimize traffic to your websites, forecast demand and personalize the user experience.

We’d love to hear what you think. Please feel free to leave a comment with any feedback or questions you may have.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors


Roy Hasson is the global business development lead of analytics and data lakes at AWS. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan, cheering his team on and hanging out with his family.

Eran Levy is the director of marketing at Upsolver.

Visualizing Sensor Data in Amazon QuickSight

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/visualizing-sensor-data-in-amazon-quicksight/

This post is courtesy of Moheeb Zara, Developer Advocate, AWS Serverless

The Internet of Things (IoT) is a term used wherever physical devices are networked in some meaningful connected way. Often, this takes the form of sensor data collection and analysis. As the number of devices and size of data scales, it can become costly and difficult to keep up with demand.

Using AWS Serverless Application Model (AWS SAM), you can reduce the cost and time to market of an IoT solution. This guide demonstrates how to collect and visualize data from a low-cost, Wi-Fi connected IoT device using a variety of AWS services. Much of this can be accomplished within the AWS Free Usage Tier, which is necessary for the following instructions.

Services used

The following services are used in this example:

What’s covered in this post?

This post covers:

  • Connecting an Arduino MKR 1010 Wi-Fi device to AWS IoT Core.
  • Forwarding messages from an AWS IoT Core topic stream to a Lambda function.
  • Using a Kinesis Data Firehose delivery stream to store data in S3.
  • Analyzing and visualizing data stored in S3 using Amazon QuickSight.

Connect the device to AWS IoT Core using MQTT

The Arduino MKR 1010 is a low-cost, Wi-Fi enabled, IoT device, shown in the following image.

An Arduino MKR 1010 Wi-Fi microcontroller

Its analog and digital input and output pins can be used to read sensors or to write to actuators. Arduino provides a detailed guide on how to securely connect this device to AWS IoT Core. The following steps build upon it to push arbitrary sensor data to a topic stream and ultimately visualize that data using Amazon QuickSight.

  1. Start by following this comprehensive guide to using an Arduino MKR 1010 with AWS IoT Core. Upon completion, your device is connected to AWS IoT Core using MQTT (Message Queuing Telemetry Transport), a protocol for publishing and subscribing to messages using topics.
  2. In the Arduino IDE, choose File, Sketch, Include Library, and Manage Libraries.
  3. In the window that opens, search for ArduinoJson and select the library by Benoit Blanchon. Choose install.

4. Add #include <ArduinoJson.h> to the top of your sketch from the Arduino guide.

5. Modify the publishMessage() function with this code. It publishes a JSON message with two keys: time (ms) and the current value read from the first analog pin (an example payload follows these steps).

void publishMessage() {  
  Serial.println("Publishing message");

  // send message, the Print interface can be used to set the message contents
  mqttClient.beginMessage("arduino/outgoing");
  
  // create json message to send
  StaticJsonDocument<200> doc;
  doc["time"] = millis();
  doc["sensor_a0"] = analogRead(0);
  serializeJson(doc, mqttClient); // print to client
  
  mqttClient.endMessage();
}

6. Save and upload the sketch to your board.
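
Once the sketch is running, the board publishes a JSON payload similar to the following to the arduino/outgoing topic every five seconds (the values shown are illustrative):

{
  "time": 1567023375013,
  "sensor_a0": 456
}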

Create a Kinesis Firehose delivery stream

Amazon Kinesis Data Firehose is a service that reliably loads streaming data into data stores, data lakes, and analytics tools. Amazon QuickSight requires a data store to create visualizations of the sensor data. This simple Kinesis Data Firehose delivery stream continuously uploads data to an S3 storage bucket. The next sections cover how to add records to this stream using a Lambda function.

  1. In the Kinesis Data Firehose console, create a new delivery stream, called SensorDataStream.
  2. Leave the default source as a Direct PUT or other sources and choose Next.
  3. On the next screen, leave all the default values and choose Next.
  4. Select Amazon S3 as the destination and create a new bucket with a unique name. This is where records are continuously uploaded so that they can be used by Amazon QuickSight.
  5. On the next screen, choose Create New IAM Role, Allow. This gives the Firehose delivery stream permission to upload to S3.
  6. Review and then choose Create Delivery Stream.

It can take some time to fully create the stream. In the meantime, continue on to the next section.

Invoking Lambda using AWS IoT Core rules

Using AWS IoT Core rules, you can forward messages from devices to a Lambda function, which can perform actions such as uploading to an Amazon DynamoDB table or an S3 bucket, or running data against various Amazon Machine Learning services. In this case, the function transforms and adds a message to the Kinesis Data Firehose delivery stream, which then adds that data to S3.

AWS IoT Core rules use the MQTT topic stream to trigger interactions with other AWS services. An AWS IoT Core rule is created by using an SQL statement, a topic filter, and a rule action. The Arduino example publishes messages every five seconds on the topic arduino/outgoing. The following instructions show how to consume those messages with a Lambda function.

Create a Lambda function

Before creating an AWS IoT Core rule, you need a Lambda function to consume forwarded messages.

  1. In the AWS Lambda console, choose Create function.
  2. Name the function ArduinoConsumeMessage.
  3. For Runtime, choose Author From Scratch, Node.js 10.x. For Execution role, choose Create a new role with basic Lambda permissions. Choose Create.
  4. On the Execution role card, choose View the ArduinoConsumeMessage-role-xxxx on the IAM console.
  5. Choose Attach Policies. Then, search for and select AmazonKinesisFirehoseFullAccess.
  6. Choose Attach Policy. This applies the necessary permissions to add records to the Firehose delivery stream.
  7. In the Lambda console, in the Designer card, select the function name.
  8. Paste the following in the code editor, replacing SensorDataStream with the name of your own Firehose delivery stream. Choose Save.
const AWS = require('aws-sdk')

const firehose = new AWS.Firehose()
const StreamName = "SensorDataStream"

exports.handler = async (event) => {
    
    console.log('Received IoT event:', JSON.stringify(event, null, 2))
    
    // Reshape the incoming IoT message into the record to store in S3
    let payload = {
        time: new Date(event.time),
        sensor_value: event.sensor_a0
    }
    
    // Write a single record to the Kinesis Data Firehose delivery stream
    let params = {
            DeliveryStreamName: StreamName,
            Record: { 
                Data: JSON.stringify(payload)
            }
        }
        
    return await firehose.putRecord(params).promise()

}

Create an AWS IoT Core rule

To create an AWS IoT Core rule, follow these steps.

  1. In the AWS IoT console, choose Act.
  2. Choose Create.
  3. For Rule query statement, copy and paste SELECT * FROM 'arduino/outgoing'. This subscribes to the outgoing message topic used in the Arduino example.
  4. Choose Add action, Send a message to a Lambda function, Configure action.
  5. Select the function created in the last set of instructions.
  6. Choose Create rule.

At this stage, any message published to the arduino/outgoing topic forwards to the ArduinoConsumeMessage Lambda function, which transforms and puts the payload on the Kinesis Data Firehose stream and also logs the message to Amazon CloudWatch. If you’ve connected an Arduino device to AWS IoT Core, it publishes to that topic every five seconds.

The following steps show how to test functionality using the AWS IoT console.

  1. In the AWS IoT console, choose Test.
  2. For Publish, enter the topic arduino/outgoing.
  3. Enter the following test payload:
    {
      "time": 1567023375013,
      "sensor_a0": 456
    }
  4. Choose Publish to topic.
  5. Navigate back to your Lambda function.
  6. Choose Monitoring, View logs in CloudWatch.
  7. Select a log item to view the message contents, as shown in the following screenshot.

Visualizing data with Amazon QuickSight

To visualize data with Amazon QuickSight, follow these steps.

  1. In the Amazon QuickSight console, sign up.
  2. Choose Manage Data, New Data Set. Select S3 as the data source.
  3. A manifest file is necessary for Amazon QuickSight to be able to fetch data from your S3 bucket. Copy the following into a file named manifest.json. Replace YOUR-BUCKET-NAME with the name of the bucket created for the Firehose delivery stream.
    {
       "fileLocations":[
          {
             "URIPrefixes":[
                "s3://YOUR-BUCKET-NAME/"
             ]
          }
       ],
       "globalUploadSettings":{
          "format":"JSON"
       }
    }
  4. Upload the manifest.json file.
  5. Choose Connect, then Visualize. You may have to give Amazon QuickSight explicit permissions to your S3 bucket.
  6. Finally, design the Amazon QuickSight visualizations in the drag and drop editor. Drag the two available fields into the center card to generate a Sum of Sensor_value by Time visual.

Conclusion

This post demonstrated visualizing data from a securely connected remote IoT device. This was achieved by connecting an Arduino to AWS IoT Core using MQTT, forwarding messages from the topic stream to Lambda using IoT Core rules, putting records on an Amazon Kinesis Data Firehose delivery stream, and using Amazon QuickSight to visualize the data stored within an S3 bucket.

With these building blocks, it is possible to implement highly scalable and customizable IoT data collection, analysis, and visualization. With the use of other AWS services, you can build a full end-to-end platform for an IoT product that can reliably handle volume. To further explore how hardware and AWS Serverless can work together, visit the Amazon Web Services page on Hackster.

Federate Amazon QuickSight access with Okta

Post Syndicated from Loc Trinh original https://aws.amazon.com/blogs/big-data/federate-amazon-quicksight-access-with-okta/

Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization. As a fully managed service, Amazon QuickSight lets you easily create and publish interactive dashboards that can then be accessed from any device and embedded into your applications, portals, and websites.

Amazon QuickSight supports identity federation through Security Assertion Markup Language 2.0 (SAML 2.0) in both Standard and Enterprise editions. With federation, you can manage users using your enterprise identity provider (IdP) and pass them to Amazon QuickSight at log-in. Such IdPs include Microsoft Active Directory Federation Services, Ping One Federation Server, and Okta.

This post provides step-by-step guidance for how to use Okta to federate access to Amazon QuickSight.

Create an Okta application

Sign in to your Okta admin dashboard. You can create a free Okta Developer Edition account.

  1. From the Okta admin dashboard ribbon, choose Applications.
  2. If you are viewing the Developer Console, switch to Classic UI, as shown in the following screenshot.
  3. Choose Add Application.
  4. Search for Amazon Web Services and choose Add.
  5. Rename Application label to Amazon QuickSight and choose Next.
  6. For Sign-On Options, choose SAML 2.0.
  7. For Default Relay State, type https://quicksight.aws.amazon.com.
  8. Right-click on Identity Provider metadata and choose Save Link As…
  9. Save the XML file to disk and choose Done. You need to use this file in the next steps.

Create a SAML provider in AWS

Open a new window and sign in to the AWS Management Console.

  1. Open the IAM console.
  2. In the navigation pane, choose Identity Providers, Create Provider.
  3. For Provider Type, choose SAML and provide a Provider Name (for example, Okta).
  4. For Metadata Document, upload the XML file from the previous steps.
  5. Choose Next Step, Create.
  6. Locate the IdP that you just created and make note of the Provider ARN.

Create a role for federated users

This section describes the steps for creating an IAM SAML 2.0 federation role. While Okta is used for single sign-on, there are two ways to provision users in Amazon QuickSight:

  • Grant the federation role permission to create new Amazon QuickSight users when a user visits for the first time.
  • Pre-provision Amazon QuickSight users using the API and add users to the appropriate groups. This is preferred for adding users to groups within Amazon QuickSight, because you can provision the user and add them to the groups at the same time.

The following steps demonstrate how to create a federation role with permission to create new Amazon QuickSight users. If you would rather pre-provision Amazon QuickSight users, instructions for using the API are at the end of this post.

  1. Open the IAM console.
  2. In the navigation pane, choose Roles, Create Role, Select type of trusted entity as SAML 2.0 federation.
  3. For SAML provider, select the IdP that you created in the previous steps (Okta).
  4. Select Allow programmatic and AWS Management Console access.
  5. Choose Next: Permissions, Create policy.
  6. In the Create policy window, navigate to the JSON tab and type the following:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "sts:AssumeRoleWithSAML",
                "Resource": "<YOUR SAML IDENTITY PROVIDER ARN>",
                "Condition": {
                    "StringEquals": {
                        "saml:aud": "https://signin.aws.amazon.com/saml"
                    }
                }
            },
            {
                "Action": [
                    "quicksight:CreateReader"
                ],
                "Effect": "Allow",
                "Resource": [
                    "arn:aws:quicksight::<YOUR ACCOUNT ID>:user/${aws:userid}"
                ]
            }
        ]
    }

The IAM policy above grants the federation role permission to self-provision an Amazon QuickSight reader with the quicksight:CreateReader action. Best practice is to grant users in your organization reader access, and then upgrade users from within the application. Instructions for upgrading users are at the end of this post.

If you would rather pre-provision Amazon QuickSight users using the API, do not include any actions in the permission policy.

  1. Choose Review Policy.
  2. For Name, enter a value (for example, QuicksightOktaFederatedPolicy) and choose Create policy.
  3. On the Create role page, choose Refresh and select your new policy.
  4. Choose Next: Tags and Next: Review.
  5. Provide a Role name (for example, QuicksightOktaFederatedRole) and Role description.
  6. Choose Create role.

Create an AWS access key for Okta

To create an access key for Okta, follow these steps.

  1. Open the IAM console.
  2. In the navigation pane, choose Users, Add user.
  3. For User name, enter a value (for example, OktaSSOUser).
  4. For Access type, choose Programmatic access.
  5. Choose Next: Permissions, Attach existing policies directly, and Create policy.
  6. On the Create policy page, navigate to the JSON tab and type the following:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "iam:ListRoles",
                    "iam:ListAccountAliases"
                ],
                "Resource": "*"
            }
        ]
    }

  7. Choose Review Policy.
  8. For a Name, enter a value (for example, OktaListRolesPolicy) and choose Create policy.
  9. On the Create user page, choose Refresh and select your new policy.
  10. Choose Next: Tags, Next: Review, and Create user.
  11. To save your access key and secret key, choose Download .csv.

Configure the Okta application

Return to the window with your Okta admin dashboard.

  1. For Identity Provider ARN (Required only for SAML SSO), provide the ARN (for example, arn:aws:iam::<YOUR ACCOUNT ID>:saml-provider/Okta) of the IdP that you created in previous steps.
  2. Choose Done.
  3. From the Applications dashboard, choose Provisioning.
  4. Choose Configure API Integration.
  5. Select Enable API Integration.
  6. For Access Key and Secret Key, provide the access key and secret key that you downloaded in previous steps.
  7. Choose Test API Credentials, then Save.
  8. From the SETTINGS pane, navigate to To App.
  9. Choose Edit.
  10. Enable Create Users and choose Save.
  11. Choose Assignments, Assign and then select the users or groups to which to grant federated access.
  12. Select the Roles and SAML User Roles to grant to the users, as shown in the following screenshot.
  13. Choose Save and Go Back, Done.

Launch Amazon QuickSight

Log in to your Okta Applications dashboard with a user that has been granted federated access (if you are using an admin account, switch to user mode). You should see a new application with your label (for example, Amazon QuickSight). Choose the application icon to launch Amazon QuickSight.

You can now manage your users and groups using Okta as your IdP and federate access to Amazon QuickSight.

Pre-provisioning Amazon QuickSight users

The outlined steps demonstrate how to grant users permission to self-provision Amazon QuickSight users when they visit Amazon QuickSight for the first time. If you would rather pre-provision Amazon QuickSight users, you can use the API to create users and groups and then add users to those groups.

  1. To create an Amazon QuickSight user, run the following AWS CLI command. Link the Amazon QuickSight user to your federated Okta username by providing the user-name parameter in the format <role name>\<email> (for example, QuicksightOktaFederatedRole\user@example.com).
    aws quicksight register-user \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --email=<USER EMAIL> \
        --user-name=<ROLE NAME>\<USER EMAIL> \
        --identity-type=QUICKSIGHT \
        --user-role=READER

  2. Optionally, create an Amazon QuickSight group.
    aws quicksight create-group \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --group-name="<YOUR GROUP NAME>" \
        --description="<YOUR GROUP DESCRIPTION>"

  3. Add users to groups.
    aws quicksight create-group-membership \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --group-name="<YOUR GROUP NAME>" \
        --member-name="<YOUR MEMBER USER NAME>"

By using the Amazon QuickSight API, you can manage users, groups, and group membership. After they’re created, groups automatically become available for use when modifying permissions to data sets, analyses, or dashboards by typing in the group name instead of a specific user. For other supported group and user management functions, see List of Actions by Function.
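
If you pre-provision users and groups this way, you can verify the results with the corresponding list commands. For example, the following command lists the members of a group:

    aws quicksight list-group-memberships \
        --aws-account-id=<YOUR ACCOUNT ID> \
        --namespace=default \
        --group-name="<YOUR GROUP NAME>"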

Managing users

You can upgrade users between reader and author or admin in the Manage users tab of the Manage QuickSight screen.

  1. In the Amazon QuickSight console, choose your user name in the upper-right corner and choose Manage QuickSight.
  2. In the navigation pane, choose Manage users.
  3. Locate the user to upgrade, and select the role to grant from the Role drop-down list.

Deep-linking dashboards

Amazon QuickSight dashboards can be shared using the Okta application’s single sign-on URL so that users can be federated directly to specific dashboards.

To deep link to a specific Amazon QuickSight dashboard with single sign-on, first locate the Okta application’s single sign-on URL. This can be found by opening the metadata XML file that you downloaded in the Create an Okta application steps above. The URL is the value of the Location attribute in the md:SingleSignOnService element and ends with /sso/saml.

After you have the Okta application’s single sign-on URL, append ?RelayState= to the end of the URL followed by the URL to your Amazon QuickSight dashboard. For example, your deep link URL might look as follows:

https://my-test-org.okta.com/app/amazon_aws/abcdefg12345XYZ678/sso/saml?RelayState=https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/11111111-abcd-1234-efghi-111111111111

By deep-linking dashboards, you can provide users with a way to use single sign-on and directly access specific dashboards.
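
If you generate these deep links programmatically, it is worth URL-encoding the RelayState value. The following Python sketch reuses the example URLs shown above and is only one possible approach:

from urllib.parse import quote

# Okta application's single sign-on URL (from the md:SingleSignOnService element)
sso_url = "https://my-test-org.okta.com/app/amazon_aws/abcdefg12345XYZ678/sso/saml"

# Target Amazon QuickSight dashboard URL
dashboard_url = (
    "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/"
    "11111111-abcd-1234-efghi-111111111111"
)

# URL-encoding the RelayState value keeps the deep link valid even if the
# dashboard URL contains query parameters of its own
deep_link = f"{sso_url}?RelayState={quote(dashboard_url, safe='')}"
print(deep_link)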

Summary

This post provided a step-by-step guide for configuring Okta as your IdP, and using IAM roles to enable single sign-on to Amazon QuickSight. It also showed how users and groups can be managed using the Amazon QuickSight API.

Although this post demonstrated the integration of IAM and Okta, you can replicate this solution using your choice of SAML 2.0 IdPs. For other supported federation options, see Enabling Single Sign-On Access to Amazon QuickSight Using SAML 2.0.

If you have any questions or feedback, please leave a comment.

 


About the Authors

Loc Trinh is a solutions architect at Amazon Web Services.


Naresh Gautam is a senior solutions architect at Amazon Web Services.


Create advanced insights using Level Aware Aggregations in Amazon QuickSight

Post Syndicated from Arun Baskar original https://aws.amazon.com/blogs/big-data/create-advanced-insights-using-level-aware-aggregations-in-amazon-quicksight/

Amazon QuickSight recently launched Level Aware Aggregations (LAA), which enables you to perform calculations on your data to derive advanced and meaningful insights. In this blog post, we go through examples of applying these calculations to a sample sales dataset so that you can start using these for your own needs.

What are Level Aware Aggregations?

Level aware aggregations are aggregation calculations that can be computed at a desired level in the overall query evaluation order of QuickSight. Please check this link for details on QuickSight’s Order of Evaluation. Up until now, the only types of aggregations possible in QuickSight were Display-level and Table calculation aggregation types.

  • Display-level aggregations are aggregations that are defined by the dimensions and metrics present in the field wells of a QuickSight visual.
  • Table calculations are computed by windowing/rolling-up over the display-level aggregated values of the visual. Hence, by definition, these are calculated after the Display-level aggregations are computed.

With Level Aware Aggregations, QuickSight now allows you to aggregate values before the Display-level aggregation. For more information, please visit the Level Aware Aggregations documentation.

Customer use cases

Distribution of customers by lifetime orders

Customer question: How many customers have made one order, two orders, three orders, and so forth?

In this case, we first want to aggregate the total number of orders made by each customer, then use the output of that as a visual dimension. This isn’t feasible to compute without LAA.

Solution using LAA

1.) Compute the number of orders per customer.

Calculated field name : NumberOrdersPerCustomer

Calculated field expression : countOver({order_id}, [{Customer Id}], PRE_AGG)

This computes the number of orders per customer, before the display-level aggregation of the visual.
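
If it helps to reason about what PRE_AGG does, the following sketch reproduces the same logic outside QuickSight with pandas on a small made-up order table. It is only an analogy, not QuickSight code:

import pandas as pd

# Hypothetical order-level data: one row per order
orders = pd.DataFrame({
    "customer_id": ["c1", "c1", "c2", "c3", "c3", "c3"],
    "order_id":    ["o1", "o2", "o3", "o4", "o5", "o6"],
})

# countOver({order_id}, [{Customer Id}], PRE_AGG): a per-row count of orders,
# computed per customer before any visual-level aggregation
orders["orders_per_customer"] = (
    orders.groupby("customer_id")["order_id"].transform("count")
)

# Visual: distinct customers per order-count bucket (the histogram)
histogram = orders.groupby("orders_per_customer")["customer_id"].nunique()
print(histogram)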

2.) Create the visual.

Create the visual with the above field NumberOrdersPerCustomer in the "X-Axis" field well. Add "Count Distinct" of "Customer Id" to the "Value" section of the field wells to create a histogram of the number of orders made by customers.

As we can see, there are around 5000 unique customers with one order, around 3500 customers with two orders, and so on.

Filter out customers based on lifetime spends

Customer Question: How do I filter out customers with life-time spend less than $100,000? My visual’s dimension (group by) and metric definitions are independent of total spend per customer.

If the grouping dimensions of the aforementioned aggregation (spend) are exactly the same as the grouping dimensions in the field wells, you can achieve this using the aggregated filters feature. But that's not always the case. As mentioned in the customer question, the visual's definition can be different from the filter's aggregation.

Solution using LAA

1.) Compute sum of sales per customer.

Calculated field name :  salesPerCustomer

Calculated field expression : sumOver(sales,[{customer_id}],PRE_AGG)

PRE_AGG indicates that the computation must occur before display-level aggregation.

2.) Create the visuals.

The visual on the left shows sum of sales per segment and the visual on the right shows the total number of customers. Note that there are no filters applied at this point.

3.) Create the filter on salesPerCustomer. 

Create a filter on top of the above field salesPerCustomer to select items greater than $100,000.

4.) Apply the filter.

The above image shows applying the filter on “salesPerCustomer” greater than $100,000.

With the filter applied, we have excluded the customers whose total spend is less than $100,000, regardless of what we choose to display in the visuals.

Fixed percent of total sales even with filters applied

Customer Question: How much is the contribution of each industry to the entire company’s profit (percent of total)? I don’t want the total to recompute when filters are applied.

The existing table calculation function percentOfTotal isn’t able to solve this problem, since filters on categories are applied before computing the total. Using percentOfTotal would recalculate the total every time filters are applied. We need a solution that doesn’t consider the filtering when computing the total.

Solution using LAA

1.) Compute total sales before filters through a calculated field.

Calculated field name : totalSalesBeforeFilters

Calculated field expression : sumOver(sales,[],PRE_FILTER)

PRE_FILTER indicates that this computation must be done prior to applying filters.

The partition dimension list (second argument) is empty since we want to compute the overall total.

2.) Compute the fixed percent of total sales.

Calculated field name : fixedPercentOfTotal

Calculated field expression : sum(sales) / min(totalSalesBeforeFilters)

Note: totalSalesBeforeFilters is the same for every row of the unaggregated data. Since we want to use it post-aggregation, we apply the min aggregation on top of it. Because all values are the same, max or avg would serve the same purpose.
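
As an analogy only (not QuickSight code), the following pandas sketch on made-up data shows why anchoring the denominator to a pre-filter total keeps the percentage fixed:

import pandas as pd

# Hypothetical row-level sales data
df = pd.DataFrame({
    "industry": ["Retail", "Retail", "Finance", "Energy"],
    "sales":    [100.0, 200.0, 300.0, 400.0],
})

# sumOver(sales, [], PRE_FILTER): a grand total computed before any filters
total_before_filters = df["sales"].sum()   # 1000.0

# Apply a filter afterwards, for example keep only two industries
filtered = df[df["industry"].isin(["Retail", "Finance"])]

# sum(sales) / min(totalSalesBeforeFilters), per industry group
result = filtered.groupby("industry")["sales"].sum() / total_before_filters
print(result)  # percentages stay anchored to the unfiltered total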

3.) Create the visual.

Add the "industry" field to the "Rows" field well. Add "sales (SUM)" and "fixedPercentOfTotal" to the "Values" section. Now, the percent of total metric remains fixed even if we filter the data based on any underlying dimension or measure.

The visual shows sales per industry along with percent of total, computed using the table calculation percentOfTotal and using Level Aware Aggregation as described above. Both the percent of total values are currently the same since there aren’t any filters applied.

The visual shows the same metrics, but with industries filtered down to 5 of them. As we can see, "Percent of total sales" was readjusted to represent only the filtered data, whereas "Fixed Percent of total sales" remains the same even after filtering. Both metrics address valuable customer use cases that are now feasible in QuickSight.

Compare sales in a category to industry average

Customer question: How do I compare sales in a category to the industry average? I want the industry average to include all categories even after filtering.

Since we want the industry average to stay fixed even with filtering, we need PRE_FILTER aggregation to achieve this.

Solution using LAA

1.) Compute the industry average.

Calculated field name : IndustryAverage

Calculated field expression : avgOver(sumOver(sales,[{category}],PRE_FILTER),[],PRE_FILTER)

We first compute the sum of sales per category and then average it across all categories. It’s important to note here that we first computed a finer level aggregation and fed that into a coarser level aggregation.

2.) Compute the difference from IndustryAverage.

Calculated field name : FixedDifferenceFromIndustryAverage

Calculated field expression : sum(sales) - min(IndustryAverage)

As mentioned in one of the examples above, we use the min aggregation because IndustryAverage is the same for every row, and we need an aggregated value to use it after the display-level aggregation.

3.) Create the visual.

Create the visual by adding “Category” in “X axis” field well and SUM(Sales), IndustryAverage and FixedDifferenceFromIndustryAverage as the values in a bar chart.

Visual shows total sales per category, the average across all industries and each category’s difference from average.

This visual shows the same metrics, but with categories filtered to include only 6 of them. As we can see, the industry average remained the same before and after filtering, keeping the difference the same whether you choose to show all categories, some of them, or just one.

Categorize customers based on lifetime spend

Customer question: How do I classify customers based on cumulative sales contribution? I then want to use that classification as my visual’s grouping.

The objective here is to create custom-sized bins to classify the customers. Even though we could do this classification post display-level aggregation, we wouldn't be able to use it as a dimension/group-by in the visual.

Solution using LAA

1.) Compute sales per customer before display-level aggregation.

Calculated field name : salesPerCustomer

Calculated field expression : sumOver({sales amount},[{customer id}],PRE_AGG)

2.) Categorize Customers.

Calculated field name : Customer Category

Calculated field expression : ifelse(salesPerCustomer < 1000, "VERY_LOW", salesPerCustomer < 10000, "LOW", salesPerCustomer < 100000, "MEDIUM", "HIGH")

3.) Create the visual.

Create the visual by adding "Customer Category" to the "Y-axis" field well and "Count Distinct" of "customer id" to the "Value" field well.

The above image shows the number of unique customers per customer category.

Filtering can be done on top of these categories as well to build other relevant visuals, since the categories are tagged before aggregation.

The above image shows the number of unique customers per customer category, split by gender.

Availability

Level aware aggregations are available in both Standard and Enterprise editions, in all supported AWS Regions. For more information, see the Amazon QuickSight documentation.

 


About the Author

Arun Baskar is a software development engineer for QuickSight at Amazon Web Services.


Creating custom Pinpoint dashboards using Amazon QuickSight, part 2

Post Syndicated from Brent Meyer original https://aws.amazon.com/blogs/messaging-and-targeting/creating-custom-pinpoint-dashboards-using-amazon-quicksight-part-2/

Note: This post was written by Manan Nayar and Aprajita Arora, Software Development Engineers on the AWS Digital User Engagement team.


In our previous post, we discussed the process of visualizing specific, pre-aggregated Amazon Pinpoint metrics—such as delivery rate or open rate—using the Amazon Pinpoint Metrics APIs. In that example, we showed how to create a Lambda function that retrieves your metrics, and then make those metrics available for creating visualizations in Amazon QuickSight.

This post shows a different approach to exporting data from Amazon Pinpoint and using it to build visualizations. Rather than retrieve specific metrics, you can use the event stream feature in Amazon Pinpoint to export raw event data. You can use this data in Amazon QuickSight to create in-depth analyses of your data, as opposed to visualizing pre-calculated metrics. As an added benefit, when you use this solution, you don’t have to modify any code, and the underlying data is updated every few minutes.

Step 1: Configure the event stream in Amazon Pinpoint

The Amazon Pinpoint event stream includes information about campaign events (such as campaign.send) and application events (such as session.start). It also includes response information related to all of the emails and SMS messages that you send from your Amazon Pinpoint project, regardless of whether they were sent from campaigns or on a transactional basis. When you enable event streams, Amazon Pinpoint automatically sends this data to your S3 bucket (via Amazon Kinesis Data Firehose) every few minutes.

To set up the event stream

  1. Sign in to the Amazon Pinpoint console at http://console.aws.amazon.com/pinpoint. In the list of projects, choose the project that you want to enable event streaming for.
  2. Under Settings, choose Event stream.
  3. Choose Edit, and then configure the event stream to use Amazon Kinesis Data Firehose. If you don’t already have a Kinesis Data Firehose stream, follow the link to create one in the Kinesis console. Configure the stream to send data to an S3 bucket. For more information about creating streams, see Creating an Amazon Kinesis Data Firehose Delivery Stream.
  4. Under IAM role, choose Automatically create a role. Choose Save.
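
If you prefer to script this step, the event stream can also be configured through the Amazon Pinpoint API. The following boto3 sketch assumes you already have a Kinesis Data Firehose delivery stream and an IAM role that Amazon Pinpoint can assume; the project ID and ARNs shown are placeholders:

import boto3

pinpoint = boto3.client("pinpoint", region_name="us-east-1")

# Enable the event stream for a Pinpoint project (application); all values are placeholders
pinpoint.put_event_stream(
    ApplicationId="<YOUR PINPOINT PROJECT ID>",
    WriteEventStream={
        "DestinationStreamArn": "arn:aws:firehose:us-east-1:<YOUR ACCOUNT ID>:deliverystream/<YOUR STREAM NAME>",
        "RoleArn": "arn:aws:iam::<YOUR ACCOUNT ID>:role/<YOUR PINPOINT ROLE>",
    },
)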

Step 2: Add a data set in Amazon QuickSight

Now that you’ve started streaming your Amazon Pinpoint data to S3, you can set Amazon QuickSight to look for data in the S3 bucket. You connect QuickSight to sources of data by creating data sets.

To create a data set

    1. In a text editor, create a new file. Paste the following code:
      {
          "fileLocations": [
              {
                  "URIPrefixes": [ 
                      "s3://<bucketName>/"          
                  ]
              }
          ],
          "globalUploadSettings": {
              "format": "JSON"
          }
      }

      In the preceding code, replace <bucketName> with the name of the S3 bucket that you’re using to store the event stream data. Save the file as manifest.json.

    2. Sign in to the QuickSight console at https://quicksight.aws.amazon.com.
    3. Create a new S3 data set. When prompted, choose the manifest file that you created in step 1. For more information about creating S3 data sets, see Creating a Data Set Using Amazon S3 Files in the Amazon QuickSight User Guide.
    4. Create a new analysis. From here, you can start creating visualizations of your data. To learn more, see Creating an Analysis in the Amazon QuickSight User Guide.
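
As an alternative to the console steps above, you could upload the manifest and register the S3 data source programmatically. The following boto3 sketch is one possible approach; the data source ID and name are placeholders, and it assumes Amazon QuickSight already has permission to read the bucket:

import boto3

s3 = boto3.client("s3")
quicksight = boto3.client("quicksight", region_name="us-east-1")
account_id = "<YOUR ACCOUNT ID>"
bucket = "<bucketName>"

# Upload the manifest file created in step 1
s3.upload_file("manifest.json", bucket, "manifest.json")

# Register the bucket as an S3 data source in QuickSight using the manifest.
# The data source ID and name below are placeholders; QuickSight must already
# be allowed to read this bucket (Manage QuickSight > Security & permissions).
quicksight.create_data_source(
    AwsAccountId=account_id,
    DataSourceId="<YOUR DATA SOURCE ID>",
    Name="Pinpoint event stream",
    Type="S3",
    DataSourceParameters={
        "S3Parameters": {
            "ManifestFileLocation": {"Bucket": bucket, "Key": "manifest.json"}
        }
    },
)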

Step 3: Set the refresh rate for the data set

You can configure your data sets in Amazon QuickSight to automatically refresh on a certain schedule. In this section, you configure the data set to refresh every day, one minute before midnight.

To set the refresh schedule

  1. Go to the QuickSight start page at https://quicksight.aws.amazon.com/sn/start. Choose Manage data.
  2. Choose the data set that you created in the previous section.
  3. Choose Schedule refresh. Follow the prompts to set up a daily refresh schedule.

Step 4: Create your visualizations

From this point, you can start creating visualizations of your data. To learn more about creating visualizations, see Creating an Analysis in the Amazon QuickSight User Guide.

Analyzing AWS WAF logs with Amazon ES, Amazon Athena, and Amazon QuickSight

Post Syndicated from Aaron Franco original https://aws.amazon.com/blogs/big-data/analyzing-aws-waf-logs-with-amazon-es-amazon-athena-and-amazon-quicksight/

AWS WAF now includes the ability to log all web requests inspected by the service. AWS WAF can store these logs in an Amazon S3 bucket in the same Region, but most customers deploy AWS WAF across multiple Regions—wherever they also deploy applications. When analyzing web application security, organizations need the ability to gain a holistic view across all their deployed AWS WAF Regions.

This post presents a simple approach to aggregating AWS WAF logs into a central data lake repository, which lets teams better analyze and understand their organization’s security posture. I walk through the steps to aggregate regional AWS WAF logs into a dedicated S3 bucket. I follow that up by demonstrating how you can use Amazon ES to visualize the log data. I also present an option to offload and process historical data using AWS Glue ETL. With the data collected in one place, I finally show you how you can use Amazon Athena and Amazon QuickSight to query historical data and extract business insights.

Architecture overview

The case I highlight in this post is the forensic use of the AWS WAF access logs to identify distributed denial of service (DDoS) attacks by a client IP address. This solution provides your security teams with a view of all incoming requests hitting every AWS WAF in your infrastructure.

I investigate what the IP access patterns look like over time and assess which IP addresses access the site multiple times in a short period of time. This pattern suggests that the IP address could be an attacker. With this solution, you can identify DDoS attackers for a single application, and detect DDoS patterns across your entire global IT infrastructure.

Walkthrough

This solution requires separate tasks for architecture setup, which allows you to begin receiving log files in a centralized repository, and analytics, which processes your log data into useful results.

Prerequisites

To follow along, you must have the following resources:

  • Two AWS accounts. Following AWS multi-account best practices, create two accounts:
    • A logging account
    • A resource account that hosts the web applications using AWS WAF. For more information about multi-account setup, see AWS Landing Zone. Using multiple accounts isolates your logs from your resource environments. This helps maintain the integrity of your log files and provides a central access point for auditing all application, network, and security logs.
  • The ability to launch new resources into your account. The resources might not be eligible for Free Tier usage and so might incur costs.
  • An application running with an Application Load Balancer, preferably in multiple Regions. If you do not already have one, you can launch any AWS web application reference architecture to test and implement this solution.

For this walkthrough, you can launch an Amazon ECS example from the ecs-refarch-cloudformation GitHub repo. This is a “one click to deploy” example that automatically sets up a web application with an Application Load Balancer. Launch this in two different Regions to simulate a global infrastructure. You ultimately set up a centralized bucket that both Regions log into, which your forensic analysis tools then draw from. Choose Launch Stack to launch the sample application in your Region of choice.

Setup

Architecture setup allows you to begin receiving log files in a centralized repository.

Step 1: Provide permissions

Begin this process by providing appropriate permissions for one account to access resources in another. Your resource account needs cross-account permission to access the bucket in the logging account.

  1. Create your central logging S3 bucket in the logging account and attach the following bucket policy to it under the Permissions tab. Make a note of the bucket’s ARN. You need this information for future steps.
  2. Change RESOURCE-ACCOUNT-ID and CENTRAL-LOGGING-BUCKET-ARN to the correct values based on the actual values in your accounts:
     // JSON Document
     {
       "Version": "2012-10-17",
       "Statement": [
          {
             "Sid": "Cross Account AWS WAF Account 1",
             "Effect": "Allow",
             "Principal": {
                "AWS": "arn:aws:iam::RESOURCE-ACCOUNT-ID:root"
             },
             "Action": [
                "s3:GetObject",
                "s3:PutObject"
             ],
             "Resource": [
                "CENTRAL-LOGGING-BUCKET-ARN/*"
             ]
          }
       ]
    }
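
If you would rather attach this policy with a script, the following boto3 sketch applies the same policy document to the central logging bucket; the bucket name and the two placeholder values still need to be replaced with your own:

import json
import boto3

s3 = boto3.client("s3")

bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Cross Account AWS WAF Account 1",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::RESOURCE-ACCOUNT-ID:root"},
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["CENTRAL-LOGGING-BUCKET-ARN/*"],
        }
    ],
}

# Attach the policy to the central logging bucket (run this in the logging account)
s3.put_bucket_policy(
    Bucket="<YOUR CENTRAL LOGGING BUCKET>",
    Policy=json.dumps(bucket_policy),
)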

Step 2: Manage Lambda permissions

Next, the Lambda function that you create in your resource account needs permissions to access the S3 bucket in your central logging account so it can write files to that location. You already provided basic cross-account access in the previous step, but Lambda still needs the granular permissions at the resources level. Remember to grant these permissions in both Regions where you launched the application that you intend to monitor with AWS WAF.

  1. Log in to your resource account.
  2. To create an IAM policy for the Lambda function, in the IAM console, choose Policies, Create policy.
  3. Choose JSON, and enter the following policy document. Replace YOUR-SOURCE-BUCKET and YOUR-DESTINATION-BUCKET with the ARNs of the buckets that you are using for this walkthrough.
    // JSON document
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListSourceAndDestinationBuckets",
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket",
                    "s3:ListBucketVersions"
                ],
                "Resource": [
                    "YOUR-SOURCE-BUCKET",
                    "YOUR-DESTINATION-BUCKET"
                ]
            },
            {
                "Sid": "SourceBucketGetObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion"
                ],
                "Resource": "YOUR-SOURCE-BUCKET/*"
            },
            {
                "Sid": "DestinationBucketPutObjectAccess",
                "Effect": "Allow",
                "Action": [
                    "s3:PutObject"
                ],
                "Resource": "YOUR-DESTINATION-BUCKET/*"
            }
        ]
     }

  4. Choose Review policy, enter your policy name, and save it.
  5. With the policy created, create a new role for your Lambda function and attach the custom policy to that role. To do this, navigate back to the IAM dashboard.
  6. Select Create role and choose Lambda as the service that uses the role. Select the custom policy that you created earlier in this step and choose Next. You can add tags if required and then name and create this new role.
  7. You must also add S3 as a trusted entity in the Trust Relationship section of the role. Choose Edit trust relationship and add s3.amazonaws.com to the policy, as shown in the following example.

Lambda and S3 now appear as trusted entities under the Trust relationships tab, as shown in the following screenshot.

Step 3: Create a Lambda function and copy log files

Create a Lambda function in the same Region as your resource account’s S3 bucket. This function reads log files from the resource account bucket and then copies that content to the logging account’s bucket. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

  1. Log in to your resource account.
  2. Navigate to Lambda in your console and choose Create Function.
  3. Choose the Author from scratch function and name it. Choose the IAM role you created in the previous step and attach it to the Lambda function.
  4. Choose Create function.
  5. This Lambda function receives a document from S3 that contains nested JSON string data. To handle this data, you must extract the JSON from this string to retrieve key names of both the document and the bucket. Your function then uses this information to copy the data to your central logging account bucket in the next step. To create this function, copy and paste this code into the Lambda function that you created. Replace the bucket names with the names of the buckets that you created earlier. After you decide on a partitioning strategy, modify this script later.
    // Load the AWS SDK
    const aws = require('aws-sdk');
    
    // Construct the AWS S3 Object 
    const s3 = new aws.S3();
    
    // Destination bucket in the central logging account (replace with your bucket name)
    const destBucket = "YOUR-DESTINATION-BUCKET";
    
    //Main function
    exports.handler = (event, context, callback) => {
        console.log("Got WAF Item Event")
        var _srcBucket = event.Records[0].s3.bucket.name;
        let _key = event.Records[0].s3.object.key;
        let _keySplit = _key.split("/")
        let _objName = _keySplit[ (_keySplit.length - 1) ];
        let _destPath = _keySplit[0]+"/"+_keySplit[1]+"/YOUR-DESTINATION-BUCKET/"+_objName;
        let _sourcePath = _srcBucket + "/" + _key;
        console.log(_destPath)
        let params = { Bucket: destBucket, ACL: "bucket-owner-full-control", CopySource: _sourcePath, Key: _destPath };
        s3.copyObject(params, function(err, data) {
            if (err) {
                console.log(err, err.stack);
            } else {
                console.log("SUCCESS!");
            }
        });
        callback(null, 'All done!');
    };

Step 4: Set S3 to Lambda event triggers

This step sets up event triggers in your resource account’s S3 buckets. These triggers send the file name and location logged by AWS WAF logs to the Lambda function. The triggers also notify the Lambda function that it must move the newly arrived file into your central logging bucket. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

  1. Go to the S3 dashboard and choose your S3 bucket, then choose the Properties tab. Under Advanced settings, choose Events.
  2. Give your event a name and select PUT from the Events check boxes.
  3. Choose Lambda from the Send To option and select your Lambda function as the destination for the event.
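
You can also configure this notification programmatically. The following boto3 sketch is a rough equivalent of the console steps; it assumes the Lambda function already allows invocation from S3 (added with lambda add-permission), and the bucket name and function ARN are placeholders:

import boto3

s3 = boto3.client("s3")

# Send an event to the copy function whenever a new log object is written (PUT).
# The function must already have a resource policy allowing invocation from S3
# (principal s3.amazonaws.com); the bucket name and function ARN are placeholders.
s3.put_bucket_notification_configuration(
    Bucket="<YOUR-SOURCE-BUCKET>",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "<YOUR LAMBDA FUNCTION ARN>",
                "Events": ["s3:ObjectCreated:Put"],
            }
        ]
    },
)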

Step 5: Add AWS WAF to the Application Load Balancer

Add an AWS WAF to the Application Load Balancer so that you can start logging events. You can optionally delete the original log file after Lambda copies it. This reduces costs, but your business and security needs might err on the side of retaining that data.

Create a separate prefix for each Region in your central logging account bucket waf-central-logs so that AWS Glue can properly partition them. For best practices of partitioning with AWS Glue, see Working with partitioned data in AWS Glue. AWS Glue ingests your data and stores it in a columnar format optimized for querying in Amazon Athena. This helps you visualize the data and investigate the potential attacks.

Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF. The procedure assumes that you already have an AWS WAF enabled that you can use for this exercise. To move forward with the next step, you need AWS WAF enabled and connected to Amazon Kinesis Data Firehose for log delivery.

Setting up and configuring AWS WAF

If you don’t already have a web ACL in place, set up and configure AWS WAF at this point. This solution handles logging data from multiple AWS WAF logs in multiple Regions from more than one account.

To do this efficiently, you should consider your partitioning strategy for the data. You can grant your security teams a comprehensive view of the network. Create each partition based on the Kinesis Data Firehose delivery stream for the specific AWS WAF associated with the Application Load Balancer. This partitioning strategy also allows the security team to view the logs by Region and by account. As a result, your S3 bucket name and prefix look similar to the following example:

s3://central-waf-logs/<account_id>/<region_name>/<kinesis_firehose_name>/...filename...

Step 6: Copying logs with Lambda code

This step updates the Lambda function to start copying log files. Keep your partitioning strategy in mind as you update the Lambda function. Repeat this step for every Region where you launched the application that you intend to monitor with AWS WAF.

To accommodate the partitioning, modify your Lambda code to match the examples in the GitHub repo.

Replace <kinesis_firehose_name> in the example code with the name of the Kinesis Data Firehose delivery stream attached to the AWS WAF. Replace <central logging bucket name> with the S3 bucket name from your central logging account.

Kinesis Data Firehose should now begin writing files to your central S3 logging bucket with the correct partitioning. To generate logs, access your web application.

Analytics

Now that Kinesis Data Firehose can write collected files into your logging account’s S3 bucket, create an Elasticsearch cluster in your logging account in the same Region as the central logging bucket. You also must create a Lambda function to handle S3 events as the central logging bucket receives new log files. This creates a connection between your central log files and your search engine. Amazon ES gives you the ability to query your logs quickly to look for potential security threats. The Lambda function loads the data into your Amazon ES cluster. Amazon ES also includes a tool named Kibana, which helps with managing data and creating visualizations.

Step 7: Create an Elasticsearch cluster

  1. In your central Logging Account, navigate to the Elasticsearch Service in the AWS Console.
  2. Select Create Cluster, enter a domain name for your cluster, and choose version 3 from the Elasticsearch version dropdown. Choose Next. In this example, don’t implement any security policies for your cluster and only use one instance. For any real-world production tasks, keep your Elasticsearch cluster inside your VPC.
  3. For network configuration, choose Public access and choose Next.
  4. For the access policy in this tutorial, allow access to the domain only from a specified account ID or ARN. In this case, use your account ID to gain access.
  5. Choose Next, and on the final screen, confirm your settings. You generally want to create strict access policies for your domain and not allow public access. This example only uses these settings to quickly demonstrate the capabilities of AWS services. I would never recommend this in a production environment.

AWS takes a few minutes to create and activate your Amazon ES domain. When it goes live, you can see two endpoints. The Endpoint URL is the URL you use to send data to the cluster.

Step 8: Create a Lambda function to copy log files

Add an event trigger to your central logs bucket. This trigger tells your Lambda function to write the data from the log file to Amazon ES. Before you create the S3 trigger, create a Lambda function in your logging account to handle the events.

For this Lambda function, we use code from the aws-samples GitHub repository that streams data from an S3 file line by line into Amazon ES. This example uses code taken from amazon-elasticsearch-lambda-samples. Name your new Lambda function myS3toES.

  1. Copy and paste the following code into a text file named index.js:
    // Load the AWS SDK and construct the S3 client
    const aws = require('aws-sdk');
    const s3 = new aws.S3();
    
    // Central logging bucket used as the copy destination (replace with your bucket name)
    const destBucket = "<central logging bucket name>";
    
    exports.handler = (event, context, callback) => {
        // get the source bucket name
        var _srcBucket = event.Records[0].s3.bucket.name;
            // get the object key of the file that landed on S3
        let _key = event.Records[0].s3.object.key;
        
        // split the key by "/"
        let _keySplit = _key.split("/")
            // get the object name
        let _objName = _keySplit[ (_keySplit.length - 1) ];
            // reset the destination path
        let _destPath = _keySplit[0]+"/"+_keySplit[1]+"/<kinesis_firehose_name>/"+_objName;
            // setup the source path
        let _sourcePath = _srcBucket + "/" + _key;
            // build the params for the copyObject request to S3
        let params = { Bucket: destBucket, ACL: "bucket-owner-full-control", CopySource: _sourcePath, Key: _destPath };
            // execute the copyObject request
        s3.copyObject(params, function(err, data) {
            if (err) {
                console.log(err, err.stack);
            } else {
                console.log("SUCCESS!");
            }
        });
        callback(null, 'All done!');
    };

  2. Copy and paste this code into a text file and name it package.json:
    //JSON Document
    {
      "name": "s3toesfunction",
      "version": "1.0.0",
      "description": "",
      "main": "index.js",
      "scripts": {},
      "author": "",
      "dependencies": {
        "byline": "^5.0.0",
        "clf-parser": "0.0.2",
        "path": "^0.12.7",    "stream": "0.0.2"
      }
    }

  3. Execute the following command in the folder containing these files: npm install
  4. After the installation completes, create a .zip file that includes the index.js file and the node_modules folder.
  5. Log in to your logging account.
  6. Upload your .zip file to the Lambda function. For Code entry type, choose Upload a .zip file.
  7. This Lambda function needs an appropriate service role with a trust relationship to S3. Choose Edit trust relationships and add s3.amazonaws.com and lambda.amazonaws.com as trusted entities.
  8. Set up your IAM role with the following permissions: S3 Read Only permissions and Lambda Basic Execution. To grant the role the appropriate access, assign it to the Lambda function from the Lambda Execution Role section in the console.
  9. Set Environment variables for your Lambda function so it knows where to send the data. Add an endpoint and use the endpoint URL you created in Step 7. Add an index and enter your index name. Add a value for region and detail the Region where you deployed your application.

Step 9: Create an S3 trigger

After creating the Lambda function, create the event triggers on your S3 bucket to execute that function. This completes your log delivery pipeline to Amazon ES. This is a common pipeline architecture for streaming data from S3 into Amazon ES.

  1. Log in to your central logging account.
  2. Navigate to the S3 console, select your bucket, then open the Properties pane and scroll down to Events.
  3. Choose Add notification and name your new event s3toLambdaToEs.
  4. Under Events, select the check box for PUT. Leave Prefix and Suffix blank.
  5. Under Send to, select Lambda Function, and enter the name of the Lambda function that you created in the previous step—in this example, myS3toES.
  6. Choose Save.

With this complete, Lambda should start sending data to your Elasticsearch index whenever you access your web application.

Step 10: Configure Amazon ES

Your pipeline now automatically adds data to your Elasticsearch cluster. Next, use Kibana to visualize the AWS WAF logs in the central logging account’s S3 bucket. This is the final step in assembling your forensic investigation architecture.

Kibana provides tools to create visualizations and dashboards that help your security teams view log data. Using the log data, you can filter by IP address to see how many times an IP address has hit your firewall each month. This helps you track usage anomalies and isolate potentially malicious IP addresses. You can use this information to add web ACL rules to your firewall that adds extra protection against those IP addresses.

Kibana produces visualizations like the following screenshot.

In addition to the Number of IPs over Time visualization, you can also correlate the IP address to its country of origin. Correlation provides even more precise filtering for potential web ACL rules to protect against attackers. The visualization for that data looks like the following image.

Elasticsearch setup

To set up and visualize your AWS WAF data, follow this How to analyze AWS WAF logs using Amazon Elasticsearch Service post. With this solution, you can investigate your global dataset instead of isolated Regions.

An alternative to Amazon ES

Amazon ES is an excellent tool for forensic work because it provides high-performance search capability for large datasets. However, Amazon ES requires cluster management and complex capacity planning for future growth. To get top-notch performance from Amazon ES, you must adequately scale it. With the more straightforward data of these investigations, you could instead work with more traditional SQL queries.

Forensic data grows quickly, so using a relational database means you might quickly outgrow your capacity. Instead, take advantage of AWS serverless technologies like AWS Glue, Athena, and Amazon QuickSight. These technologies enable forensic analysis without the operational overhead you would experience with Elasticsearch or a relational database. To learn more about this option, consult posts like How to extract, transform, and load data from analytic processing using AWS Glue and Work with partitioned data in AWS Glue.

Athena query

With your forensic tools now in place, you can use Athena to query your data and analyze the results. This lets you refine the data for your Kibana visualizations, or directly load it into Amazon QuickSight for additional visualization. Use the Athena console to experiment until you have the best query for your visual needs. Having the database in your AWS Glue Catalog means you can make ad hoc queries in Athena to inspect your data.

In the Athena console, create a new Query tab and enter the following query:

# SQL Query
SELECT date_format(from_unixtime("timestamp"/1000), '%Y-%m-%d %h:%i:%s') as event_date, client_ip, country, account_id, waf_name, region FROM "<your-database-name>"."<your-table-name>" where year='2018' and month='12';

Replace <your-database-name> and <your-table-name> with the appropriate values for your environment. This query converts the numerical timestamp to a readable date format using SQL date functions, per the Presto 0.176 documentation. It should return the following results.

You can see which IP addresses hit your environment the most over any period of time. In a production environment, you would run an ETL job to re-partition this data and transform it into a columnar format optimized for queries. If you would like more information about doing that, see the Best Practices When Using Athena with AWS Glue post.
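
If you want to run the same query outside the console, for example from a scheduled job, the following boto3 sketch submits it to Athena; the database, table, and results bucket are placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT date_format(from_unixtime("timestamp"/1000), '%Y-%m-%d %h:%i:%s') AS event_date,
       client_ip, country, account_id, waf_name, region
FROM "<your-database-name>"."<your-table-name>"
WHERE year = '2018' AND month = '12'
"""

# Submit the query; results land in the S3 output location as CSV
response = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "<your-database-name>"},
    ResultConfiguration={
        "OutputLocation": "s3://aws-athena-query-results-<YOUR ACCOUNT ID>-us-east-1/"
    },
)
print(response["QueryExecutionId"])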

Amazon QuickSight visualization

Now that you can query your data in Athena, you can visualize the results using Amazon QuickSight. First, grant Amazon QuickSight access to the S3 bucket where your Athena query results live.

  1. In the Amazon QuickSight console, log in.
  2. Choose Admin/username, Manage QuickSight.
  3. Choose Account settings, Security & permissions.
  4. Under QuickSight access to AWS services, choose Add or remove.
  5. Choose Amazon S3, then choose Select S3 buckets.
  6. Choose the output bucket for your central AWS WAF logs. Also, choose your Athena query results bucket. The query results bucket begins with aws-athena-query-results-*.

Amazon QuickSight can now access the data sources. To set up your visualizations, follow these steps:

  1. In the QuickSight console, choose Manage data, New data set.
  2. For Source, choose Athena.
  3. Give your new dataset a name and choose Validate connection.
  4. After you validate the connection, choose Create data source.
  5. Select Use custom SQL and give your SQL query a name.
  6. Input the same query that you used earlier in Athena, and choose Confirm query.
  7. Choose Import to SPICE for quicker analytics, Visualize.

Allow Amazon QuickSight several minutes to complete the import. It alerts you when the import is finished.

Now that you have imported your data into your analysis, you can apply a visualization:

  1. In Amazon QuickSight, choose New analysis.
  2. Select the last dataset that you created earlier and choose Create analysis.
  3. At the bottom left of the screen, choose Line Chart.
  4. Drag and drop event_date to the X-Axis field well.
  5. Drag and drop client_ip to the Value field well. This should create a visualization similar to the following image.
  6. Choose the right arrow at the top left of the visualization and choose Hide “other” categories. This should modify your visualization to look like the following image.

You can also map the countries from which the requests originate, allowing you to track global access anomalies. You can do this in QuickSight by selecting the “Points on map” visualization type and choosing the country as the data point to visualize.

You can also add a count of IP addresses to see if you have any unusual access patterns originating from specific IP addresses.

Conclusion

Although Amazon ES and Amazon QuickSight offer similar final results, there are trade-offs to the technical approaches that I highlighted. If your use case requires the analysis of data in real time, then Amazon ES is more suitable for your needs. If you prefer a serverless approach that doesn’t require capacity planning or cluster management, then the solution with AWS Glue, Athena, and Amazon QuickSight is more suitable.

In this post, I described an easy way to build operational dashboards that track key metrics over time. Doing this with AWS Glue, Athena, and Amazon QuickSight relieves the heavy lifting of managing servers and infrastructure. To monitor metrics in real time instead, the Amazon ES solution provides a way to do this with little operational overhead. The key here is the adaptability of the solution: putting different services together can provide different solutions to your problems to fit your exact needs.

For more information and use cases, see the following resources:

Hopefully, you have found this post informative and the proposed solutions intriguing. As always, AWS welcomes all feedback or comment.

 


About the Authors

Aaron Franco is a solutions architect at Amazon Web Services.


Detect fraudulent calls using Amazon QuickSight ML insights

Post Syndicated from Guy Ben Baruch original https://aws.amazon.com/blogs/big-data/detect-fraudulent-calls-using-amazon-quicksight-ml-insights/

The financial impact of fraud in any industry is massive. According to the Financial Times article Fraud Costs Telecoms Industry $17bn a Year (paid subscription required), fraud costs the telecommunications industry $17 billion in lost revenues every year.

Fraudsters constantly look for new technologies and devise new techniques. This changes fraud patterns and makes detection difficult. Companies commonly combat this with a rules-based fraud detection system. However, once the fraudsters realize their current techniques or tools are being identified, they quickly find a way around it. Also, rules-based detection systems tend to struggle and slow down with a lot of data. This makes it difficult to detect fraud and act quickly, resulting in loss of revenue.

Overview

There are several AWS services that implement anomaly detection and could be used to combat fraud, but let’s focus on the following three:

When trying to detect fraud, there are two high-level challenges:

  • Scale – The amount of data to be analyzed. For example, each call generates a call detail record (CDR) event. These CDRs include many pieces of information such as originating and terminating phone numbers, and duration of call. Multiply these CDR events times the number of telephone calls placed each day and you can get an idea of the scale that operators must manage.
  • Machine learning knowledge and skill – The right set of skills to help solve business problems with machine learning. Developing these skills or hiring qualified data scientists with adequate domain knowledge is not simple.

Introducing Amazon QuickSight ML Insights

Amazon QuickSight is a fast, cloud-powered BI service that makes it easy for everyone in an organization to get business insights from their data through rich, interactive dashboards. With pay-per-session pricing and a dashboard that can be embedded into your applications, BI is now even more cost-effective and accessible to everyone.

However, as the volume of data that customers generate grows daily, it’s becoming more challenging to harness their data for business insights. This is where machine learning comes in. Amazon is a pioneer in using machine learning to automate and scale various aspects of business analytics in the supply chain, marketing, retail, and finance.

ML Insights integrates proven Amazon technologies into Amazon QuickSight to provide customers with ML-powered insights beyond visualizations.

  • ML-powered anomaly detection to help customers uncover hidden insights by continuously analyzing across billions of data points.
  • ML-powered forecasting and what-if analysis to predict key business metrics with point-and-click simplicity.
  • Auto-narratives to help customers tell the story of their dashboard in a plain-language narrative.

In this post, I demonstrate how a Telecom provider with little to no ML expertise can use Amazon QuickSight ML capabilities to detect fraudulent calls.

Prerequisites

To implement this solution, you need the following resources:

  • Amazon S3 to stage a Ribbon call detail record sample in CSV format.
  • AWS Glue running an ETL job in PySpark.
  • AWS Glue crawlers to discover the schema of the tables and update the AWS Glue Data Catalog.
  • Amazon Athena to query the Amazon QuickSight dataset.
  • Amazon QuickSight to build visualizations and perform anomaly detection using ML Insights.

Diagram of fraudulent call-detecting architecture, using a PySpark script to prepare the data and transform it into Parquet and an AWS Glue crawler to build the AWS Glue Data Catalog.

The dataset

For this post, I use a synthetic dataset, thanks to Ribbon Communications. The data was generated by call test generators, and is not customer or sensitive data.

Inspecting the data

The example below is a typical CDR. This STOP CDR is generated after a call has been terminated.


As you can see, there are a lot of values here. Most of them are not relevant for fraud identification or prevention.

Revenue shared fraud

Revenue shared fraud is one of the most common fraud schemes threatening the telecom industry today. It involves using fraudulent or stolen numbers to repeatedly call a premium rate B-number, who then shares the cash generated with the fraudster.

Say that you’d like to detect national and international revenue share fraud using Amazon QuickSight ML. Consider the typical traits of a revenue share fraud phone call. The pattern for revenue share fraud is multiple A-numbers calling the same B-number or a range of B-numbers with the same prefix. The call duration is usually higher than average and could be up to two hours, which is the maximum length of time international switches allow. Generally, the calls originate from one cell or a group of cells.

One SIM may make short test calls to a variety of B-numbers as a precursor to the fraud itself, which most often happens when the risk of detection is lowest, for example, Friday night, weekends, or holidays. Conference calling may be used to make several concurrent calls from one A-number.

Often, SIMs used for this type of fraud are sold or activated in bulk from the same distributor or group of distributors. SIMs could be topped up using fraudulent online or IVR payments, such as using stolen credit card numbers. Both PAYG credit and bundles may be used.

Based on the above use case, the following pieces of information are most relevant to detecting fraud:

  • Call duration
  • Calling number (A number)
  • Called number (B number)
  • Start time of the call
  • Accounting ID

You can use this reference to help identify those fields in a CDR.

Figure 2: Decoded CDR data, highlighting the relevant fields.

I identified the columns that I need out of 235 columns in the CDR.

Inspecting the raw sample data, I quickly see that it’s missing a header.

To make life easier, I converted the raw CSV data, added the column names, and converted to Parquet.

Discovering the data

In the AWS Glue console, set up a crawler and name it CDR_CRAWLER.

Point the crawler to s3://telco-dest-bucket/blog where the Parquet CDR data resides.

Next, create a new IAM role to be used by the AWS Glue crawler.

For Frequency, leave the default definition of Run on Demand.

Next, choose Add database and define the name of the database. This database contains the table discovered by the AWS Glue crawler.

Choose next and review the crawler settings. When you’re satisfied, choose Finish.

Next, choose Crawlers, select the crawler that you just created (CDR_CRAWLER), and choose Run crawler.

The AWS Glue crawler starts crawling the database. This can take one minute or more to complete.
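
If you prefer to start and monitor the crawler from a script instead of the console, the following boto3 sketch is one way to do it:

import time
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Start the crawler created above and poll until the run completes
glue.start_crawler(Name="CDR_CRAWLER")

while True:
    state = glue.get_crawler(Name="CDR_CRAWLER")["Crawler"]["State"]
    print(f"Crawler state: {state}")
    if state == "READY":  # READY means the crawler has finished its run
        break
    time.sleep(30)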

When it’s complete, under Data catalog, choose Databases.  You should be able to see the new database created by the AWS Glue crawler. In this case, the name of the database is blog.

To view the tables created under this database, select the relevant database and choose Tables. The crawler’s table also points to the location of the Parquet format CDRs.

To see the table’s schema, select the table created by the crawler.

Data preparation

You have defined the relevant dimensions to use in the ML model to detect fraud. Now, you can use a PySpark script that I built earlier using an Amazon SageMaker notebook and an AWS Glue endpoint. The script covers the following tasks:

  • Reduce the dataset and focus only on the relevant columns.
  • Create a timestamp column, which you need for creating an analysis using Amazon QuickSight.
  • Transform files from CSV to Parquet for improved performance.

You can run the PySpark script on the raw CSV format of the CDRs that you are using. Here is the location of the raw CSV format:

s3://telco-source-bucket/machine-learning-for-all/v1.0.0/data/cdr-stop/cdr_stop.csv

Here is the PySpark script that I created.

import sys    
from awsglue.transforms import *    
from awsglue.utils import getResolvedOptions    
from pyspark.context import SparkContext    
from awsglue.context import GlueContext    
from awsglue.job import Job    
import pyspark.sql.functions as fn    
from awsglue.dynamicframe import DynamicFrame    
    
    
sc = SparkContext.getOrCreate()    
glueContext = GlueContext(sc)    
spark = glueContext.spark_session    
    
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = "demo_ml", table_name = "cdr_stop_csv", transformation_ctx = "datasource0")    
#apply mapping from source table to destination , we pick only the relevant columns     
applymapping1 = ApplyMapping.apply(frame = datasource0, mappings = [("col2", "string", "Accounting_ID", "string"), ("col13", "long", "Call_service_duration", "long"), ("col5", "string", "Start_Time_(MM/DD/YYYY)", "string"), ("col6", "string", "Start_Time_(HH/MM/SSs)", "string"), ("col19", "long", "Calling number", "string"), ("col20", "long", "Called number", "string")], transformation_ctx = "applymapping1")    
    
resolvechoice2 = ResolveChoice.apply(frame = applymapping1, choice = "make_struct", transformation_ctx = "resolvechoice2")    
resolvechoice2.printSchema()    
    
resolvechoice3 = ResolveChoice.apply(frame = resolvechoice2, choice = "MATCH_CATALOG", database = "demo_ml", table_name = "cdr_stop_csv", transformation_ctx = "resolvechoice3")    
resolvechoice3.printSchema()    
    
customDF = resolvechoice3.toDF()    
#create timestamp column    
customDF = customDF.withColumn('timestamp', fn.concat(fn.col("Start_Time_(MM/DD/YYYY)"),fn.lit(" "),fn.col("Start_Time_(HH/MM/SSs)")))    
    
#create timestamp2 column which is a substring of timestamp column    
customDF = customDF.withColumn('timestamp2',fn.substring(fn.col("timestamp"),1,19))    
#create Date column     
customDF =customDF.withColumn("Date",fn.unix_timestamp(fn.col("timestamp2"),"MM/dd/yyyy HH:mm:ss").cast("timestamp"))    
    
#remove temporary fields     
customDF = customDF.drop('timestamp','timestamp2')    
    
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDF_df")    
#transform to parquet format and land in S3 path    
datasink4 = glueContext.write_dynamic_frame.from_options(frame = customDynamicFrame, connection_type = "s3", connection_options = {"path": "s3://telco-dest-bucket/blog/"}, format = "parquet", transformation_ctx = "datasink4")    

The dataset has been cataloged in AWS Glue Data Catalog and is queryable using Athena.

Amazon QuickSight and anomaly detection

Next, build out anomaly detection using Amazon QuickSight. To get started, follow these steps.

  1. In the Amazon QuickSight console, choose New analysis.
  2. Choose Create new data set.
  3. Select Athena.
  4. Enter a data source name.
  5. Choose Create data source.
  6. From the drop-down lists, select the relevant database and table that were created by the AWS Glue crawlers, and choose Select.
  7. Select Directly query your data and choose Visualize.

Visualizing the data using Amazon QuickSight

  1. Under visual types, choose Line chart.
  2. Drag call_service_duration to the Value field well.
  3. Drag timestamp_new to the X axis field well.

Amazon QuickSight generates a dashboard, as in the following screenshot.

The x-axis is the timestamp. By default, it’s based on the aggregates of one day. This can be changed by choosing a different value.

Because the timestamp is currently aggregated by day, the call duration shown is the sum of all call durations from all call records within a day. I can begin the search by looking for days where the total call duration is high.

Anomaly detection

Now look at how to start using the ML insights anomaly detection feature.

  1. On the top of the Insights panel, choose Add anomaly to sheet. This creates an insights visual for anomaly detection.
  2. On the top of the screen, choose Field Wells and add at least one field to the Categories, as in the following example. I added the calling/called number, as those become relevant for fraud use cases; for example, one A-number calling multiple B-numbers or multiple A-numbers calling B-numbers.
    The categories represent the dimensional values by which Amazon QuickSight splits the metric. For example, you can analyze anomalies on sales across all product categories and product SKUs—assuming there are 10 product categories, each with 10 product SKUs. Amazon QuickSight splits the metric by the 100 unique combinations and runs anomaly detection on each split metric.
  3. To configure the anomaly detection job, choose Get Started.
  4. On the anomaly detection configuration screen, set up the following options:
  • Analyze all combinations of these categories—By default, if you have selected three categories, Amazon QuickSight runs anomaly detection on the following combinations hierarchically: A, AB, ABC. If you select this option, QuickSight analyzes all combinations: A, AB, ABC, BC, AC. If your data is not hierarchical, check this option.
  • Schedule—Set this option to run anomaly detection on your data hourly, daily, weekly, or monthly, depending on your data and needs. For Start schedule on and Timezone, enter values and choose OK. Important: the schedule does not take effect until you publish the analysis as a dashboard. Within the analysis, you have the option to run the anomaly detection manually (without the schedule).
  • Contribution analysis on anomaly—You can select up to four additional dimensions for Amazon QuickSight to analyze the top contributors when an anomaly is detected. For example, Amazon QuickSight can show you the top customers that contributed to a spike in sales. In my current example, I added one additional dimension: the accounting ID. If you think about a telecom fraud case, you can also consider fields like charging time or cell ID as additional dimensions.
  1. After setting the configuration, choose Run Now to execute the job manually, which includes the “Detecting anomalies… This may take a while…” message. Depending on the size of your dataset, this may take a few minutes or up to an hour.
  2. When the anomaly detection job is complete, any anomalies are called out in the insights visual. By default, only the top anomalies for the latest time period in the data are shown in the insights visuals.

    Anomaly detection reveals several B numbers being called from multiple A numbers with a high call service duration on August 29, 2018. That looks interesting!
  3. To explore all anomalies for this insight, select the menu on the top-right corner of the visual and choose Explore Anomalies.
  4. On the Anomalies detailed page, you can see all the anomalies for the latest period.
    In the view, you can see that two anomalies were detected, shown as two time series. The title of each visual represents the metric that is run on the unique combination of the categorical fields. In this case:
  • [All] | 9645000024
  • 3512000024 | [ALL]So the system detected anomalies for multiple A-numbers calling 9645000024, and 351200024 calling multiple B numbers. In both cases, it observed a high call duration. The labeled data point on the chart represents the most recent anomaly that is detected for that time series.
  9. To expose a date picker, choose Show anomalies by date at the top-right corner. This chart shows the number of anomalies that were detected for each day (or hour, depending on your anomaly detection configuration). You can select a particular day to see the anomalies detected for that day. For example, selecting August 10, 2018 on the top chart shows the anomalies for that day:

    Important: The first 32 points in the dataset are used for training and are not scored by the anomaly detection algorithm, so you may not see any anomalies on the first 32 data points.
    You can expand the filter controls on the top of the screen. With the filter controls, you can change the anomaly threshold to show high, medium, or low significance anomalies. You can choose to show only anomalies that are higher than expected or lower than expected. You can also filter by the categorical values that are present in your dataset to look at anomalies only for those categories.
  10. Look at the contributors columns. When you configured the anomaly detection, you defined the accounting ID as another dimension. If this were real call traffic instead of practice data, you would be able to single out specific accounting IDs that contribute to the anomaly.
  11. When you’re done, choose Back to analysis.

Summary

In this post, I explored a common fraud pattern called shared revenue fraud. I looked at how to extract the relevant data for training the anomaly detection model in Amazon QuickSight. I then used this data to detect anomalies based on call duration, calling party, and called party, looking at additional contributors like the accounting ID. The entire process used serverless technologies and required little to no machine learning experience.

For more information about options and strategies, see Amazon QuickSight Announces General Availability of ML Insights.

If you have questions or suggestions, please comment below.

 


About the Author

Guy Ben Baruch is a solutions architect with Amazon Web Services.

 

 

 

Introducing Amazon QuickSight fine-grained access control over Amazon S3 and Amazon Athena

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/introducing-amazon-quicksight-fine-grained-access-control-over-amazon-s3-and-amazon-athena/

Today, AWS is excited to announce the availability of fine-grained access control for AWS Identity and Access Management (IAM)-permissioned resources in Amazon QuickSight. Fine-grained access control allows Amazon QuickSight account administrators to control authors’ default access to connected AWS resources, and to use IAM policies to scope down access permissions, limiting specific authors’ access to specific items within those resources. Administrators can now apply this new level of access control to Amazon S3, Amazon Athena, and Amazon RDS/Redshift database discovery.

Fine-Grained Access Control Setup

Here’s how fine-grained access control works in Amazon QuickSight:

Imagine an AWS customer, Acme. Acme’s account contains three S3 buckets, called b1, b2, and b3. Amazon QuickSight is configured to read data via Athena. Assume that Acme’s administrators have configured Amazon QuickSight service role permissions with access to all three buckets. The new default access setting introduced today enables administrators to limit access to Acme’s data for all users by default. Administrators grant access to specific buckets (b1, b2, b3) to individual users or groups within Acme via IAM policies.

In the following example, the policies assigned to Acme users A and B, and Group X grant them access to buckets 1, 2, and 3. Group Y is not assigned, as shown in the following diagram.

When User A attempts to read data via Athena from bucket 1, AWS evaluates the IAM policy associated with the user. Since the policy assigned to User A grants access to bucket 1, the query succeeds. User A can access data from bucket 1. Similarly, User B and users in group X can access data in buckets 2 and 3, respectively.

However, when a user from group Y tries to access bucket 2, QuickSight doesn’t allow any access to data. Remember, group Y has no user-level assignments. Users from group Y are denied access to bucket 2 because Amazon QuickSight requires explicit permissions to access data, as shown in the following diagram.

In an S3 data lake, Amazon QuickSight enables administrators to restrict each author’s data access using IAM policies. (Other AWS mechanisms, outside of Amazon QuickSight, provide for the modification of the policies themselves.) This fine-grained access control relies on the underlying IAM policy evaluation infrastructure. Currently, there is a ten-policy limit on policies that may be linked together at runtime to evaluate the user’s permissions.
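For illustration, a scoped-down policy such as the one User A would need for bucket b1 could be created with the following boto3 sketch. This is a minimal example, not the exact policy Acme would use; the policy name and the set of S3 actions are assumptions.

```
import json
import boto3

iam = boto3.client("iam")

# Hypothetical policy that limits a QuickSight author to reading objects in bucket b1 only.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": ["arn:aws:s3:::b1", "arn:aws:s3:::b1/*"],
        }
    ],
}

response = iam.create_policy(
    PolicyName="quicksight-author-b1-read",
    PolicyDocument=json.dumps(policy_document),
)
print(response["Policy"]["Arn"])  # use this ARN in a QuickSight policy assignment
```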

Amazon QuickSight also offers an updated UI for AWS permissions management. Administrators can now access an account’s default resource setting, as shown in the following screenshot:

You can set Default resource access to Allow or Deny, based on the administration model. Choose the appropriate option and press Update. (Or, if you decide not to proceed, press Cancel.)

As before, you can specify AWS resource permissions for the Amazon QuickSight account through the new UI. Select resources from the checkboxes on the right side of the screen, as shown in the following screenshot:

And, as shown in the next screenshot, S3 access can be granted to buckets within the AWS account or other AWS accounts:

To control access for specific users or groups, use the newly introduced fine-grained access control feature, illustrated in this screenshot:

The IAM policy assignments button leads you to a page displaying all assignments in the account and lets you create a new assignment, as shown in the following screenshot.

Creating a new policy assignment involves only two steps:

  1. Pick an IAM policy from those in the AWS account.
  2. Assign it to specific users or groups.

If you haven’t yet configured groups in Amazon QuickSight, you can do so using AWS APIs for accounts using SSO or Amazon QuickSight-native credentials. Groups are also natively available in an AD-connected account.
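As a rough sketch of the API flow (the account ID, group name, and policy ARN below are placeholders), creating a group and attaching an IAM policy assignment to it might look like the following:

```
import boto3

quicksight = boto3.client("quicksight")
account_id = "111122223333"  # placeholder AWS account ID

# Create a QuickSight group (for accounts using SSO or QuickSight-native credentials).
quicksight.create_group(
    AwsAccountId=account_id,
    Namespace="default",
    GroupName="marketing-authors",
    Description="Authors allowed to query bucket b1",
)

# Assign a scoped-down IAM policy to that group.
quicksight.create_iam_policy_assignment(
    AwsAccountId=account_id,
    Namespace="default",
    AssignmentName="marketing-b1-access",
    AssignmentStatus="ENABLED",
    PolicyArn=f"arn:aws:iam::{account_id}:policy/quicksight-author-b1-read",
    Identities={"Group": ["marketing-authors"]},
)
```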

Fine-grained access control in Amazon QuickSight, when combined with data in S3 and Athena, allows you to set up a secure environment for data exploration across the organization. This feature is available in Amazon QuickSight Enterprise Edition in all supported AWS Regions starting today.

 


About the Author

Jose Kunnackal is a principal product manager for Amazon QuickSight.

 

 

 

Amazon QuickSight updates: Multiple sheets in dashboards, axis label orientation options, and more

Post Syndicated from Sahitya Pandiri original https://aws.amazon.com/blogs/big-data/amazon-quicksight-updates-multiple-sheets-in-dashboards-axis-label-orientation-options-and-more/

Today, we are pleased to announce a set of updates to Amazon QuickSight:

  • Richer dashboards with multiple sheets in your regular and embedded dashboards
  • Multiple axis label orientation options for better readability of dashboards
  • More calculations such as standard deviation, variance and conditional string functions on SPICE
  • Enhanced URL actions for supporting a broader set of interaction scenarios
  • One-click duplication of visuals for faster authoring

Multiple sheets in Dashboards

First, let’s look at the update to dashboards that allows multiple sheets, accessed via a new tab control. This allows better organization of information within the dashboard, where visuals related to specific subject areas or topics can be organized in separate sheets and distinctly identified through the tab name. As a reader, you can then navigate to a single dashboard to get a comprehensive view of all the insights related to a topic.

Each sheet within a dashboard has a distinct URL, which allows you to bookmark the specific sheet, or share a link to the specific sheet with others. Additionally, dashboards can also be set up to filter content at a dashboard level, or at a per-sheet level, with the filters being passed either via on-screen controls or using the URL.

Filtering through on-screen controls

Navigating to another dashboard and passing filters via URL

If you are authoring a dashboard, you will find that every dashboard now starts off as a single sheet. You can use the “add sheet” icon in the tabs section to add a new sheet to your dashboard. Note that the tab control only shows up on a dashboard if there is more than one sheet. When authoring, you will also be able to rename a sheet, change the order of sheets or delete a sheet. The first sheet will always be the one presented to your readers by default, so this would be where you want to provide a summary of the dashboard with KPIs, ML-Insights, or high level trends. Email reports will send the first sheet of the dashboard to all your viewers; we are working on additional rendering options for email reports.

Previously, on the Big Data Blog, we covered ways to add parameters and on-screen controls to dashboards. Parameter values are global across a dashboard, which allows you to maintain user inputs and filter content according to user selections across the dashboard. On-screen controls are specific to each sheet, since the context of these controls is often confined to the sheet itself. If you want to discreetly filter data based on user or group-level attributes, you may also use row-level security to do so.

Multi-sheet dashboards can be shared through the regular website, or can be embedded in your custom application. To embed a dashboard in your application, visit Embed interactive dashboards in your application with Amazon QuickSight.
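For reference, requesting an embed URL for a multi-sheet dashboard uses the same API call as for any other dashboard. The following boto3 sketch assumes placeholder account and dashboard IDs:

```
import boto3

quicksight = boto3.client("quicksight")

# Placeholders: substitute your own AWS account ID and dashboard ID.
response = quicksight.get_dashboard_embed_url(
    AwsAccountId="111122223333",
    DashboardId="sales-overview-dashboard",
    IdentityType="IAM",
    SessionLifetimeInMinutes=600,
)

embed_url = response["EmbedUrl"]  # hand this URL to your web application
print(embed_url)
```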

Axis label orientation

When authoring a dashboard, you will see that Amazon QuickSight now automatically selects a horizontal or angled layout to provide better readability. You can also manually navigate to a visual’s options menu and choose “Format visual”, and then choose the axis label orientation that is most suited to your dashboard.

Duplicate visual option

When authoring a dashboard, there are often situations where you want to quickly clone a visual and make changes. With the new “Duplicate visual” option, Amazon QuickSight allows you to achieve this via a one-click action from the visual’s options menu. This new option creates a copy of the visual, along with its filters and other settings such as visual size and format, within the same sheet or a different sheet within the dashboard, thus removing the overhead of creating visuals from scratch each time.

URL action open options

URL actions allow you to configure actions outside of QuickSight that can be triggered by a user viewing a QuickSight dashboard. With today’s release, you can choose how these URLs are opened: replace the viewer’s current QuickSight browser tab, open a new tab, or open a new browser window altogether.

In addition to these features that improve author and reader experience, QuickSight now supports additional statistical capabilities through new functions in calculated fields. These statistical functions include standard deviation and variance calculations and conditional functions on string fields. With standard deviation and variance, you can now determine dispersion of a metric relative to the mean. You can do this across your entire data set or partition by a select dimension. For more information, see Functions by Category in the Amazon QuickSight User Guide.

With conditional string functions on SPICE, you can now use coalesce, ifelse, isNotNull, isNull, nullif while building reports on SPICE datasets. For more information, see Calculated Field Function and Operator Reference for Amazon QuickSight in the Amazon QuickSight user guide.

All of these new features are now available in both Standard and Enterprise Editions in all supported Amazon QuickSight regions – US East (N. Virginia and Ohio), US West (Oregon), EU (Ireland), and Asia Pacific (Singapore, Sydney and Tokyo).


About the Author

Sahitya Pandiri is a technical program manager with Amazon Web Services. Sahitya has been in product/program management for five years and has built multiple products in the retail, healthcare, and analytics spaces. She enjoys problem solving and leveraging technology to simplify processes.

 

 

 

Enabling serverless security analytics using AWS WAF full logs, Amazon Athena, and Amazon QuickSight

Post Syndicated from Umesh Ramesh original https://aws.amazon.com/blogs/security/enabling-serverless-security-analytics-using-aws-waf-full-logs/

Traditionally, analyzing data logs required you to extract, transform, and load your data before using a number of data warehouse and business intelligence tools to derive business intelligence from that data—on top of maintaining the servers that ran behind these tools.

This blog post will show you how to analyze AWS Web Application Firewall (AWS WAF) logs and quickly build multiple dashboards, without booting up any servers. With the new AWS WAF full logs feature, you can now log all traffic inspected by AWS WAF into Amazon Simple Storage Service (Amazon S3) buckets by configuring Amazon Kinesis Data Firehose. In this walkthrough, you’ll create an Amazon Kinesis Data Firehose delivery stream to which AWS WAF full logs can be sent, and you’ll enable AWS WAF logging for a specific web ACL. Then you’ll set up an AWS Glue crawler job and an Amazon Athena table. Finally, you’ll set up Amazon QuickSight dashboards to help you visualize your web application security. You can use these same steps to build additional visualizations to draw insights from AWS WAF rules and the web traffic traversing the AWS WAF layer. Security and operations teams can monitor these dashboards directly, without needing to depend on other teams to analyze the logs.

The following architecture diagram highlights the AWS services used in the solution:

Figure 1: Architecture diagram

AWS WAF is a web application firewall that lets you monitor HTTP and HTTPS requests that are forwarded to an Amazon API Gateway API, to Amazon CloudFront or to an Application Load Balancer. AWS WAF also lets you control access to your content. Based on conditions that you specify—such as the IP addresses from which requests originate, or the values of query strings—API Gateway, CloudFront, or the Application Load Balancer responds to requests either with the requested content or with an HTTP 403 status code (Forbidden). You can also configure CloudFront to return a custom error page when a request is blocked.

Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk. With Kinesis Data Firehose, you don’t need to write applications or manage resources. You configure your data producers to send data to Kinesis Data Firehose, and it automatically delivers the data to the destination that you specified. You can also configure Kinesis Data Firehose to transform your data before delivering it.

AWS Glue can be used to run serverless queries against your Amazon S3 data lake. AWS Glue can catalog your S3 data, making it available for querying with Amazon Athena and Amazon Redshift Spectrum. With crawlers, your metadata stays in sync with the underlying data (more details about crawlers later in this post). Amazon Athena and Amazon Redshift Spectrum can directly query your Amazon S3 data lake by using the AWS Glue Data Catalog. With AWS Glue, you access and analyze data through one unified interface without loading it into multiple data silos.

Amazon Athena is an interactive query service that makes it easy to analyze data directly in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

Amazon QuickSight is a business analytics service you can use to build visualizations, perform one-off analysis, and get business insights from your data. It can automatically discover AWS data sources and also works with your data sources. Amazon QuickSight enables organizations to scale to hundreds of thousands of users and delivers responsive performance by using a robust in-memory engine called SPICE.

SPICE stands for Super-fast, Parallel, In-memory Calculation Engine. SPICE supports rich calculations to help you derive insights from your analysis without worrying about provisioning or managing infrastructure. Data in SPICE is persisted until it is explicitly deleted by the user. SPICE also automatically replicates data for high availability and enables Amazon QuickSight to scale to hundreds of thousands of users who can all simultaneously perform fast interactive analysis across a wide variety of AWS data sources.

Step one: Set up a new Amazon Kinesis Data Firehose delivery stream

  1. In the AWS Management Console, open the Amazon Kinesis Data Firehose service and choose the button to create a new stream.
    1. In the Delivery stream name field, enter a name for your new stream that starts with aws-waf-logs- as shown in the screenshot below. AWS WAF filters all streams starting with the keyword aws-waf-logs when it displays the delivery streams. Note the name of your stream since you’ll need it again later in the walkthrough.
    2. For Source, choose Direct PUT, since AWS WAF logs will be the source in this walkthrough.

      Figure 2: Select the delivery stream name and source

  2. Next, you have the option to enable AWS Lambda if you need to transform your data before transferring it to your destination. (You can learn more about data transformation in the Amazon Kinesis Data Firehose documentation.) In this walkthrough, there are no transformations that need to be performed, so for Record transformation, choose Disabled.
    Figure 3: Select “Disabled” for record transformations

    1. You’ll have the option to convert the JSON object to Apache Parquet or Apache ORC format for better query performance. In this example, you’ll be reading the AWS WAF logs in JSON format, so for Record format conversion, choose Disabled.

      Figure 4: Choose “Disabled” to not convert the JSON object

  3. On the Select destination screen, for Destination, choose Amazon S3.
    Figure 5: Choose the destination

    1. For the S3 destination, you can either enter the name of an existing S3 bucket or create a new S3 bucket. Note the name of the S3 bucket since you’ll need the bucket name in a later step in this walkthrough.
    2. For Source record S3 backup, choose Disabled, because the destination in this walkthrough is an S3 bucket.

      Figure 6: Enter the S3 bucket name, and select “Disabled” for the source record S3 backup

  4. On the next screen, leave the default settings for Buffer size, Buffer interval, S3 compression, and S3 encryption as they are. However, we recommend that you set Error logging to Enabled initially, for troubleshooting purposes.
    1. For IAM role, select Create new or choose. This opens up a new window that will prompt you to create firehose_delivery_role, as shown in the following screenshot. Choose Allow in this window to accept the role creation. This grants the Kinesis Data Firehose service access to the S3 bucket.

      Figure 7: Select “Allow” to create the IAM role “firehose_delivery_role”

  5. On the last step of configuration, review all the options you’ve chosen, and then select Create delivery stream. This will cause the delivery stream to display as “Creating” under Status. In a couple of minutes, the status will change to “Active,” as shown in the below screenshot.

    Figure 8: Review the options you selected
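The console flow above can also be scripted. The following boto3 sketch creates an equivalent delivery stream; the bucket ARN and IAM role ARN are placeholders, and the buffering values are only illustrative:

```
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="aws-waf-logs-us-east-1",  # must start with aws-waf-logs-
    DeliveryStreamType="DirectPut",               # AWS WAF writes directly to the stream
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose_delivery_role",  # placeholder
        "BucketARN": "arn:aws:s3:::my-waf-logs-bucket",                      # placeholder
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 300},
        "CompressionFormat": "UNCOMPRESSED",
    },
)
```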

Step two: Enable AWS WAF logging for a specific Web ACL

  1. From the AWS Management Console, open the AWS WAF service and choose Web ACLs. Open your Web ACL resource, which can either be deployed on a CloudFront distribution or on an Application Load Balancer.
    1. Choose the Web ACL for which you want to enable logging. (In the below screenshot, we’ve selected a Web ACL in the US East Region.)
    2. On the Logging tab, choose Enable Logging.

      Figure 9: Choose “Enable Logging”

  2. The next page displays all the delivery streams that start with aws-waf-logs. Choose the Amazon Kinesis Data Firehose delivery stream that you created for AWS WAF logs at the start of this walkthrough. (In the screenshot below, our example stream name is “aws-waf-logs-us-east-1”.)
    1. You can also choose to redact certain fields that you wish to exclude from being captured in the logs. In this walkthrough, you don’t need to choose any fields to redact.
    2. Select Create.

      Figure 10: Choose your delivery stream, and select “Create”

After a couple of minutes, you’ll be able to inspect the S3 bucket that you defined in the Kinesis Data Firehose delivery stream. The log files are created in directories by year, month, day, and hour.
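The same logging association can be made through the API. The sketch below assumes a global (CloudFront-scoped) web ACL and uses placeholder ARNs; for a web ACL attached to an Application Load Balancer, you would use the waf-regional client instead:

```
import boto3

waf = boto3.client("waf")  # classic global WAF client for CloudFront web ACLs

waf.put_logging_configuration(
    LoggingConfiguration={
        # Placeholder ARNs: your web ACL and the aws-waf-logs- delivery stream from step one.
        "ResourceArn": "arn:aws:waf::111122223333:webacl/EXAMPLE-WEB-ACL-ID",
        "LogDestinationConfigs": [
            "arn:aws:firehose:us-east-1:111122223333:deliverystream/aws-waf-logs-us-east-1"
        ],
        "RedactedFields": [],  # this walkthrough does not redact any fields
    }
)
```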

Step three: Set up an AWS Glue crawler job and Amazon Athena table

The purpose of a crawler within your Data Catalog is to traverse your data stores (such as S3) and extract the metadata fields of the files. The output of the crawler consists of one or more metadata tables that are defined in your Data Catalog. When the crawler runs, the first classifier in your list to successfully recognize your data store is used to create a schema for your table. AWS Glue provides built-in classifiers to infer schemas from common files with formats that include JSON, CSV, and Apache Avro.
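If you prefer to define the crawler programmatically rather than through the console wizard that follows, a minimal boto3 sketch might look like the one below. The role, database, table prefix, and S3 path are placeholders; the schedule line is optional if you only want to run the crawler on demand.

```
import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="waf-logs-crawler",
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRole-waf-logs",  # placeholder role
    DatabaseName="sampledb",                                            # placeholder database
    Targets={"S3Targets": [{"Path": "s3://my-waf-logs-bucket/"}]},      # placeholder path
    TablePrefix="waflogs_",
    Schedule="cron(0 1 * * ? *)",  # optional: run daily at 01:00 UTC; omit for on-demand only
)

glue.start_crawler(Name="waf-logs-crawler")
```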

  1. In the AWS Management Console, open the AWS Glue service and choose Crawlers to set up a crawler job.
  2. Choose Add crawler to launch a wizard to set up the crawler job. For Crawler name, enter a relevant name. Then select Next.

    Figure 11: Enter “Crawler name,” and select “Next”

  3. For Choose a data store, select S3 and include the path of the S3 bucket that stores your AWS WAF logs, which you made note of in step 1.3. Then choose Next.

    Figure 12: Choose a data store

  4. When you’re given the option to add another data store, choose No.
  5. Then, choose Create an IAM role and enter a name. The role grants access to the S3 bucket for the AWS Glue service to access the log files.

    Figure 13: Choose “Create an IAM role,” and enter a name

  6. Next, set the frequency to Run on demand. You can also schedule the crawler to run periodically to make sure any changes in the file structure are reflected in your data catalog.

    Figure 14: Set the “Frequency” to “Run on demand”

  7. For output, choose the database in which the Athena table is to be created and add a prefix to identify your table name easily. Select Next.

    Figure 15: Choose the database, and enter a prefix

  8. Review all the options you’ve selected for the crawler job and complete the wizard by selecting the Finish button.
  9. Now that the crawler job parameters are set up, on the left panel of the console, choose Crawlers to select your job and then choose Run crawler. The job creates an Amazon Athena table. The duration depends on the size of the log files.

    Figure 16: Choose “Run crawler” to create an Amazon Athena table

  10. To see the Amazon Athena table created by the AWS Glue crawler job, from the AWS Management Console, open the Amazon Athena service. You can filter by your table name prefix.
      1. To view the data, choose Preview table. This displays the table data with certain fields showing data in JSON object structure.
    Figure 17: Choose “Preview table” to view the data

Step four: Create visualizations using Amazon QuickSight

  1. From the AWS Management Console, open Amazon QuickSight.
  2. In the Amazon QuickSight window, in the top left, choose New Analysis. Choose New Data set, and for the data source choose Athena. Enter an appropriate name for the data source and choose Create data source.

    Figure 18: Enter the “Data source name,” and choose “Create data source”

  3. Next, choose Use custom SQL to extract all the fields in the JSON object using the following SQL query:
    
        ```
        with d as (select
            waf.timestamp,
            waf.formatversion,
            waf.webaclid,
            waf.terminatingruleid,
            waf.terminatingruletype,
            waf.action,
            waf.httpsourcename,
            waf.httpsourceid,
            waf.HTTPREQUEST.clientip as clientip,
            waf.HTTPREQUEST.country as country,
            waf.HTTPREQUEST.httpMethod as httpMethod,
            map_agg(f.name,f.value) as kv
        from sampledb.jsonwaflogs_useast1 waf,
        UNNEST(waf.httprequest.headers) as t(f)
        group by 1,2,3,4,5,6,7,8,9,10,11)
        select d.timestamp,
            d.formatversion,
            d.webaclid,
            d.terminatingruleid,
            d.terminatingruletype,
            d.action,
            d.httpsourcename,
            d.httpsourceid,
            d.clientip,
            d.country,
            d.httpMethod,
            d.kv['Host'] as host,
            d.kv['User-Agent'] as UA,
            d.kv['Accept'] as Acc,
            d.kv['Accept-Language'] as AccL,
            d.kv['Accept-Encoding'] as AccE,
            d.kv['Upgrade-Insecure-Requests'] as UIR,
            d.kv['Cookie'] as Cookie,
            d.kv['X-IMForwards'] as XIMF,
            d.kv['Referer'] as Referer
        from d;
        ```        
        

  4. To extract individual fields, copy the previous SQL query and paste it in the New custom SQL box, then choose Edit/Preview data.
    Figure 19: Paste the SQL query in “New custom SQL query”

    1. In the Edit/Preview data view, for Data source, choose SPICE, then choose Finish.

      Figure 20: Choose “Spice” and then “Finish”

  5. Back in the Amazon QuickSight console, under the Fields section, select the drop-down menu for the timestamp field and change its data type to Date.

    Figure 21: In the Amazon Quicksight console, change the data type to “Date”

  6. After you see the Date column appear, enter an appropriate name for the visualizations at the top of the page, then choose Save.

    Figure 22: Enter the name for the visualizations, and choose “Save”

  7. You can now create various visualization dashboards with multiple visual types by using the drag-and-drop feature. You can drag and drop combinations of fields such as Action, Client IP, Country, Httpmethod, and User Agents. You can also add filters on Date to view dashboards for a specific timeline. Here are some sample screenshots:
    Figure 23a: Visualization dashboard samples

    Figure 23b: Visualization dashboard samples

    Figure 23c: Visualization dashboard samples

    Figure 23d: Visualization dashboard samples

Conclusion

You can enable AWS WAF logs to Amazon S3 buckets and analyze the logs while they are being streamed by configuring Amazon Kinesis Data Firehose. You can further enhance this solution by automating the streaming of data and using AWS Lambda for any data transformations based on your specific requirements. Using Amazon Athena and Amazon QuickSight makes it easy to analyze logs and build visualizations and dashboards for executive leadership teams. Using these solutions, you can go serverless and let AWS do the heavy lifting for you.


Umesh Kumar Ramesh

Umesh is a Cloud Infrastructure Architect with Amazon Web Services. He delivers proof-of-concept projects, topical workshops, and lead implementation projects to various AWS customers. He holds a Bachelor’s degree in Computer Science & Engineering from National Institute of Technology, Jamshedpur (India). Outside of work, Umesh enjoys watching documentaries, biking, and practicing meditation.


Muralidhar Ramarao

Muralidhar is a Data Engineer with the Amazon Payment Products Machine Learning Team. He has a Bachelor’s degree in Industrial and Production Engineering from the National Institute of Engineering, Mysore, India. Outside of work, he loves to hike. You will find him with his camera or snapping pictures with his phone, and always looking for his next travel destination.

Amazon QuickSight Announces General Availability of ML Insights

Post Syndicated from Luis Wang original https://aws.amazon.com/blogs/big-data/amazon-quicksight-announces-general-availability-of-ml-insights/

At re:Invent 2018, we announced the preview of ML Insights, a set of out-of-the-box machine learning and natural language features that provide Amazon QuickSight users with business insights beyond visualization. Today, we are announcing the general availability of ML Insights.

As the volume of data that customers generate continues to grow every day, it’s becoming more challenging to harness that data for business insights. This is where machine learning comes into play. Amazon is a pioneer in using machine learning to automate and scale various aspects of business analytics.

With new ML Insights features, Amazon QuickSight can help you discover hidden data trends, identify key business drivers, forecast future results, and summarize your data in easy-to-read, natural language narratives, saving hours of manual analysis and investigation. You can build comprehensive BI solutions that integrate out-of-the-box machine learning with the analytical richness of Amazon QuickSight and distribute interactive dashboards to everyone in your organization. ML Insights makes machine learning approachable, allowing anyone, regardless of their technical or ML skill set, to get insights from their data in minutes rather than weeks. ML Insights features include:

  • ML-powered anomaly detection to uncover hidden insights by continuously analyzing billions of data points.
  • ML-powered forecasting to predict growth and business trends with point-and-click simplicity.
  • Auto-narratives to tell customers the story of their dashboard using plain-language narratives.

Check out this video to get a quick overview of ML Insights:

To get you started with ML Insights, this blog post will walk you through new ML-powered capabilities.

Customer use cases

During the past three months of ML Insights preview availability, customers from a broad range of industries, including telecommunications, entertainment, marketing, retail, energy, financial services, and healthcare, have used ML Insights to harness their growing volume of data on AWS and on-premises for business insights. Here are some of the cool things customers are doing with ML Insights:

Expedia Group is the world’s travel platform, and its purpose is to bring the world within reach.

“At Expedia Group two of our key strategic imperatives are to be customer centric and locally relevant on a global basis. This is why tools such as Amazon QuickSight are so helpful in making it easier to measure, report, and act on our business metrics to help our customers find the best matches for their travel searches. Amazon QuickSight’s out-of-the-box machine learning insights help us to continuously monitor our business for anomalies, alert stakeholders when outliers occur, and help our business project future trends, which in turn allows teams to focus on other priorities instead of building out these capabilities from scratch.”

Amit Marwah, Director of Technology, Flights Data & Analytics, Expedia Group

Ricoh Company, Ltd., is a global corporation that provides imaging equipment for offices, production print solutions, document management systems, IT services, and more, for approximately 200 countries and regions throughout the world.

“Machine learning is becoming more important than ever to meet our growing data volume and BI needs. Amazon QuickSight’s ML Insights functionality makes powerful machine learning easy to use in just a few clicks. It allows us to continuously monitor for unexpected usage behaviors in our fleet of smart devices worldwide, forecast usage trends and deliver comprehensive dashboards that incorporate these machine generated insights as auto-narratives to our line of business users. With ML Insights, we can quickly pinpoint and take actions on anomalies down to the specific devices and features, to improve the experience and add value to our customers.”

Naoki Umehara, Group Leader, Ricoh Company, Ltd.

Tata Consultancy Services Limited is an Indian multinational information technology (IT) service and consulting company headquartered in Mumbai, Maharashtra, with international presence in 46 countries.

“Amazon QuickSight allows us to quickly and easily integrate our Amazon Connect contact center metrics with our client ticketing tools in order to deliver interactive and automatic dashboards that our customers love. We revolutionized our staffing, training, and outage reporting with Amazon QuickSight ML Insights, predicting where the call flow is moving and react accordingly in order to prevent call spikes and provide a better service to our customers.”

Marco David Martinez, Cloud Manager, Tata Consultancy Services

Siemens is a global powerhouse focusing on electrification, automation, and digitalization. The company is also a leading supplier of power generation and transmission systems.

“Amazon QuickSight’s out-of-the-box ML Insights and usage-based pricing make it easy and cost effective to deliver robust machine-learning-based anomaly detection for our customers to analyze performance of their manufacturing process, detect faults in the production line, monitor downtime duration across hundreds of machineries, and understand the root cause of the failures — without heavy investment machine learning and custom development. This allows line supervisors and production managers to receive automated alerts on unexpected events and take actions to optimize the manufacturing process and improve performance.”

Massimilliano Ponticelli, Product Manager, Siemens

Daiso Industries Co., Ltd. is a global retailer and franchise of 100-yen shops founded in Japan.

“With Amazon QuickSight, we were able to build out our BI environment that handles the data of 5,000 stores x 70,000 products in 2 months. Precise sales forecast and inventory optimization are the most important challenges in our business. Amazon QuickSight’s ML Insights allow us to easily and quickly identify unexpected trend changes across our products and improve sales forecast and inventory optimization.”

Kenjiro Marumoto, Section Chief, Daiso Industries Co., Ltd.

Getting started

ML Insights is only available on the Enterprise Edition of Amazon QuickSight. If you are using the Standard Edition, you can easily upgrade with 1-click on the Manage QuickSight page.

To get started with ML Insights, you’ll need to connect a data source to Amazon QuickSight. Data sets can be accessed by direct query to the SQL-compatible database source or by using SPICE.

For this walkthrough, your data must have the following properties:

  • At least one date field.
  • At least one metric, such as sales, orders, shipped units, or sign ups.
  • At least one category dimension, such as product, channel, segment, or industry.
  • More than 40 historical data points per metric.

For optimal results, make sure that your data set has enough historical data points. The built-in ML algorithm requires at least 40 historical data points to learn and train the model, and it will use up to the most recent 1,000 data points. For example, if you’re analyzing daily sales by geographic region, make sure that you have at least 40 days of data. Three to twelve months of data is preferred, depending on the seasonality of your business.

You can use your own data set, or you can download the following sample data set. We’ll use this dataset for the walkthrough:

https://s3.amazonaws.com/quicksight-ml-insights/ML-Insights-Sample-Dataset-V1.csv
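If you'd like a quick look at the sample file before importing it into Amazon QuickSight (entirely optional), a short pandas sketch such as the following works; it only assumes you have pandas installed locally:

```
import pandas as pd

# Public sample data set referenced above.
url = "https://s3.amazonaws.com/quicksight-ml-insights/ML-Insights-Sample-Dataset-V1.csv"

df = pd.read_csv(url)
print(df.shape)    # number of rows and columns
print(df.dtypes)   # check for at least one date field, one metric, and one category
print(df.head())   # first few records
```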

Once you have created a data set in Amazon QuickSight, create a new analysis from the data set. For more information about creating data sets and analyses in Amazon QuickSight, see the Amazon QuickSight User Guide.

Suggested insights

ML Insights automatically interprets your data and provides contextual insights called suggested insights. Different visuals may result in different types of insights. For example, if you have a time-series visual, you may get insights such as period-over-period changes, anomalies, and forecasts.

Let’s walk through an example.

1. Create a line chart with a metric and a date, such as revenue over time, aggregated daily.

2. Choose Insights in the top left-hand corner of the visual.

You should then see a list of suggested insights on the left pane. Suggested insights provide you with a quick summary of the data in plain language. As you add visuals to your analyses, you will see additional suggested insights on the left pane, grouped by the visual name.

You can choose a suggested insight such as day over day change to highlight the data point or segment on the visual. Choose it again to deselect it.

ML-powered anomaly detection

With ML Insights, you can run ML-powered anomaly detection on up to a million metrics simultaneously to discover hidden trends and outliers that are often buried in aggregates. To learn more about pricing for anomaly detection, go to Amazon QuickSight pricing and choose ML Insights.

Let’s get started with anomaly detection.

1. Choose Add on the application bar, and then choose Add anomaly to sheet. This creates an insights visual for anomaly detection.

2. Expand the field wells on the top of the page and add at least one category field.

The categories represent the dimensional values by which Amazon QuickSight will split the metric. For example, let’s say you are analyzing anomalies in revenue across all product categories and product SKUs. Assuming there are 10 product categories, each with 10 product SKUs, Amazon QuickSight will split the metric by the 100 unique combinations and run anomaly detection on each of the split metrics.

3. Choose Get Started on the insights visual to configure the anomaly detection job.

4. On the anomaly detection configuration pane, configure the following options:

  • Analyze all combinations of these categories – If you select three categories, Amazon QuickSight will run anomaly detection on the following combinations, hierarchically: A, AB, ABC. If you select this option, Amazon QuickSight will analyze all combinations, including: A, AB, ABC, BC, AC. If your data is not hierarchical, you should select this option.
  • Number of anomalies to show – This setting allows you to control the number of top anomalies you want to display on the insights cards as narratives.
  • Schedule – Set the schedule to run anomaly detection on your data hourly, daily, weekly or monthly, depending on your data. Choose the start time and the time zone of the start time.
  • Contribution analysis – You can select up to four additional dimensions for Amazon QuickSight to analyze for top contributors when an anomaly is detected. For example, Amazon QuickSight can show you the top customers that contributed to a spike in sales in the USA for Home Improvement products. If you have additional dimensions in your data (dimensions not used in the anomaly detection), you can add them here for contribution analysis. For this example, choose the Geo for contribution analysis.

5. Choose OK. Amazon QuickSight will not implement the schedule until you publish the analysis as a dashboard. Within an analysis, you will have the option to run anomaly detection manually without the schedule.

6. After the configuration is set, choose Run Now to run detection manually. You will see an “Analyzing for anomalies… This may take a while…” message. Depending on the size of your data set, analysis may take anywhere from a few minutes to an hour.

Once anomaly detection is complete, you will see the top anomalies for the latest period in your data listed in the insights visual. Amazon QuickSight also computes and displays the expected value so you can better understand the significance of the anomaly.

7. To see all anomalies for this data, choose the selector in the upper right of the visual and choose Explore Anomalies.

On the detailed anomalies page, you can see all of the anomalies detected for the latest period. The title of the visual represents the metric that is applied to the unique combination of the categorical fields. The highlighted data point on the chart—on the far right of the chart—represents the most recent anomaly detected for that time series.

On the left pane, you will see the top contributors to the anomaly based on the dimensions you have predefined. When you hover over the top contributors, Amazon QuickSight displays an explanation of the significance of the contribution.

8. To see anomalies by date, choose Show Anomalies by Date from the top of the visual to expose a date picker. The chart will display the number of anomalies detected for each unit of your anomaly detection configuration. You can choose a particular date to see the anomalies for that date. For example, if you choose Nov. 1st, 2018, from the graph, then the bar chart highlights the anomalies for that date.

Important: Amazon QuickSight uses the first 40 data points in a data set for training; these data points will not be scored by the anomaly detection algorithm. You may not see any anomalies on the first 40 data points.

9. Use the filter controls at the top of the pane to change the anomaly threshold to show anomalies with high, medium or low significance or to show only anomalies that are higher than expected or lower than expected. You can also filter by the categorical values that are present in your data set to look at anomalies only for those categories.

10. To go back to your analysis, choose Back to analysis at the top of the page.

ML-powered forecasting

Using the built-in ML algorithm, you can now forecast business metrics with point-and-click simplicity without having to write code or build a complex spreadsheet.

1. On your time series chart, choose the selector in the upper right corner of the visual, and then choose Add forecast. Amazon QuickSight will analyze the historical data using ML and present a forecast for the next 14 periods.

2. On the Forecast properties pane at the left, you can customize forecast settings. For example, you can change the number of periods to forecast into the future or add “forecast” periods into the past to compare historical actuals against ML-based expectations.

You can adjust the width of the prediction band by changing the prediction interval and manually setting the seasonality (number of periods). Choose Apply to save your changes.

3. Select a forecasted data point on the chart and choose What-if analysis. With What-if analysis, you can set target value for a particular date or date range, and Amazon QuickSight will adjust the forecast gracefully to meet the target.

4. Choose Apply to see the new forecast adjusted for the target along with the original forecast. You can hover over the data points to see details.

With ML-powered forecasting, Amazon QuickSight allows you to forecast complex, real-world scenarios such as data with multiple seasonality. Outliers will be excluded automatically, and missing values will be imputed.

To export the forecasting data in CSV format, choose the selector in the upper right corner of the visual, and then choose Export to CSV.

Auto-narratives

Auto narratives allow you to create natural language summaries of your visuals. You can embed these summaries into your dashboard to highlight key insights that are important for your readers, allowing them to access the data without having to sift through the entire dashboard. When you define a template, narratives update automatically as the data in your data set refreshes, just like a visual. The following steps show you how to get started with auto narratives.

1. In your time series chart, choose Insights again to show the suggested insights.

2. An easy way to add a narrative insight to your analysis is to choose the plus sign (+) next to a suggested insight. For the purpose of this walkthrough, choose the Day Over Day Change insight.

You’ll see an insight visual on your analysis with the predefined template. Notice that the field wells have the date, metric, and category filled in. These settings are populated from the visual that you used to create the insight visual. You can customize the fields as needed.

3. To edit the narrative, choose the insight visual menu and select Customize Narrative. You’ll see the Configure narrative pane, where you can edit your insights template. You can format the content with different sizes and colors using the formatting toolbar. You can also insert expressions and conditional statements like IF and FOR statements.

In the left pane, you can add computations to your narrative. Computations are predefined calculations such as period-over-period, period-to-date, growth-rate, max, min, and top movers, that you can reference in your template to describe your data. Currently, Amazon QuickSight supports 13 different types of computations. In this example, PeriodOverPeriod is added by default since we selected the Day Over Day Change insight from the suggested insights pane.

4. To add a new computation, choose Add computation in the bottom left corner of the pane. You’ll be prompted to select from a list of computations. For the purpose of this walkthrough, select the Growth rate computation type, and select Next.

5. You can configure certain aspects of the computation. In the case of growth rate, you can change the number of periods over which that you want to compute growth. After you make your selections, choose OK.

6. Now expand Computations on the left pane. You should see both PeriodOverPeriod and GrowthRate options.

Please note that computation names must be unique. When you create a computation, assign a unique name. You can reference multiple computations of the same type in your template. For example, if you have two metrics, such as $sales and units sold, you can create a GrowthRate computation for each of the metrics, with a different name for each computation. The specific computations can then be referenced by name in the template.

Also be aware that anomaly computation is not compatible with all other computation types. For example, if you have a PeriodOverPeriod or GrowthRate computation, you will not be able to add an Anomaly computation to the same insight visual.

7. To add growth rate to your narrative, enter the phrase Compounded Growth Rate for the last in the narrative template. From the Computations pane, choose GrowthRate, and then choose timePeriods to insert the expression GrowthRate.timePeriods into your narrative. This expression references the number of periods set in the configuration.

8. Complete the sentence by entering days is. Then add another expression from the Computations pane by choosing GrowthRate, then compoundedGrowthRate, and then formattedValue. The selection formattedValue returns a phrase formatted according to the format applied to the metric on the field. To see a raw value in integer or decimal format, choose value instead of formattedValue.

Now, let’s try using a conditional statement.

1. To insert an IF statement, place the cursor at the end of your narrative template. From the Insert Code menu, choose Inline IF.

2. You’ll be prompted to enter some code. On the left pane, choose GrowthRate, then choose compoundedGrowthRate, and then choose value. After the inserted expression, enter > 3, and choose Save.

3. For the conditional content, enter Great! Select the text and use the format menu to change the color to green.

4. Repeat the previous steps, entering < 3 for the growth rate value. For the conditional content, enter Bad! and format the text as red.

5. Choose Apply. You should see the results similar to the following.

The template provides you with a sophisticated tool to customize your narrative. Within the template, you can also reference parameters in your analysis or dashboard and leverage a set of built-in functions to perform more calculations.

The best way to get started with auto narratives and to learn the syntax is to use the existing templates built from suggested insights.  But you can also create insight visuals from scratch by choosing Add and then choosing Add Insight.

Try it yourself! Try creating a narrative that enumerates the top selling products for the last three months.

Conclusion

As you can see from the walkthrough, ML Insights helps you perform large-scale anomaly detection and create business forecasts in a few simple clicks. You can build rich and user-friendly auto-narratives within your dashboards in minutes, without any custom development or ML skills necessary.

 


About the Author

Luis Wang is a principal product manager for Amazon QuickSight. He’s been with AWS for over 6 years, working on various services including Amazon EC2 and then launching Amazon QuickSight. Luis is now focused on the application of machine learning and AI to business intelligence and analytics at QuickSight. He enjoys running, watching sitcoms and spending time with his family.

How to visualize Amazon GuardDuty findings: serverless edition

Post Syndicated from Ben Romano original https://aws.amazon.com/blogs/security/how-to-visualize-amazon-guardduty-findings-serverless-edition/

Note: This blog provides an alternate solution to Visualizing Amazon GuardDuty Findings, in which the authors describe how to build an Amazon Elasticsearch Service-powered Kibana dashboard to ingest and visualize Amazon GuardDuty findings.

Amazon GuardDuty is a managed threat detection service powered by machine learning that can monitor your AWS environment with just a few clicks. GuardDuty can identify threats such as unusual API calls or potentially unauthorized users attempting to access your servers. Many customers also like to visualize their findings in order to generate additional meaningful insights. For example, you might track resources affected by security threats to see how they evolve over time.

In this post, we provide a solution to ingest, process, and visualize your GuardDuty finding logs in a completely serverless fashion. Serverless applications automatically run and scale in response to events you define, rather than requiring you to provision, scale, and manage servers. Our solution covers how to build a pipeline that ingests findings into Amazon Simple Storage Service (Amazon S3), transforms their nested JSON structure into tabular form using Amazon Athena and AWS Glue, and creates visualizations using Amazon QuickSight. We aim to provide both an easy-to-implement and cost-effective solution for consuming and analyzing your GuardDuty findings, and to more generally showcase a repeatable example for processing and visualizing many types of complex JSON logs.

Many customers already maintain centralized logging solutions using Amazon Elasticsearch Service (Amazon ES). If you want to incorporate GuardDuty findings with an existing solution, we recommend referencing this blog post to get started. If you don’t have an existing solution or previous experience with Amazon ES, if you prefer to use serverless technologies, or if you’re familiar with more traditional business intelligence tools, read on!

Before you get started

To follow along with this post, you’ll need to enable GuardDuty in order to start generating findings. See Setting Up Amazon GuardDuty for details if you haven’t already done so. Once enabled, GuardDuty will automatically generate findings as events occur. If you have public-facing compute resources in the same region in which you’ve enabled GuardDuty, you may soon find that they are being scanned quite often. All the more reason to continue reading!

You’ll also need Amazon QuickSight enabled in your account for the visualization sections of this post. You can find instructions in Setting Up Amazon QuickSight.

Architecture from end to end

 

Figure 1: Complete architecture from findings to visualization

Figure 1 highlights the solution architecture, from finding generation all the way through final visualization. The steps are as follows:

  1. Deliver GuardDuty findings to Amazon CloudWatch Events
  2. Push GuardDuty Events to S3 using Amazon Kinesis Data Firehose
  3. Use AWS Lambda to reorganize S3 folder structure
  4. Catalog your GuardDuty findings using AWS Glue
  5. Configure Views with Amazon Athena
  6. Build a GuardDuty findings dashboard in Amazon QuickSight

Below, we’ve included an AWS CloudFormation template to launch a complete ingest pipeline (Steps 1-4) so that we can focus this post on the steps dedicated to building the actual visualizations (Steps 5-6). We cover steps 1-4 briefly in the next section to provide context, and we provide links to the pertinent pages in the documentation for those of you interested in building your own pipeline.
 
Select this image to open a link that starts building the CloudFormation stack

Ingest (Steps 1-4): Get Amazon GuardDuty findings into Amazon S3 and AWS Glue Data Catalog

 

Figure 2: In this section, we’ll cover the services highlighted in blue

Step 1: Deliver GuardDuty findings to Amazon CloudWatch Events

GuardDuty integrates with Amazon CloudWatch Events and can deliver findings to it. To perform this manually, follow the instructions in Creating a CloudWatch Events Rule and Target for GuardDuty.

Step 2: Push GuardDuty events to Amazon S3 using Kinesis Data Firehose

Amazon CloudWatch Events can write to a Kinesis Data Firehose delivery stream to store your GuardDuty events in S3, where you can use AWS Lambda, AWS Glue, and Amazon Athena to build the queries you’ll need to visualize the data. You can create your own delivery stream by following the instructions in Creating a Kinesis Data Firehose Delivery Stream and then adding it as a target for CloudWatch Events.
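As a rough sketch of wiring steps 1 and 2 together programmatically (the delivery stream and role ARNs are placeholders), the CloudWatch Events rule and its Firehose target might be created like this:

```
import json
import boto3

events = boto3.client("events")

# Rule that matches every GuardDuty finding event.
events.put_rule(
    Name="guardduty-findings-to-firehose",
    EventPattern=json.dumps({
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
    }),
    State="ENABLED",
)

# Placeholder ARNs: the Kinesis Data Firehose delivery stream and an IAM role that
# allows CloudWatch Events to put records onto it.
events.put_targets(
    Rule="guardduty-findings-to-firehose",
    Targets=[
        {
            "Id": "guardduty-firehose",
            "Arn": "arn:aws:firehose:us-east-1:111122223333:deliverystream/guardduty-findings",
            "RoleArn": "arn:aws:iam::111122223333:role/cwe-to-firehose",
        }
    ],
)
```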

Step 3: Use AWS Lambda to reorganize Amazon S3 folder structure

Kinesis Data Firehose will automatically create a datetime-based file hierarchy to organize the findings as they come in. Due to the variability of the GuardDuty finding types, we recommend reorganizing the file hierarchy with a folder for each finding type, with separate datetime subfolders for each. This will make it easier to target findings that you want to focus on in your visualization. The provided AWS CloudFormation template utilizes an AWS Lambda function to rewrite the files in a new hierarchy as new files are written to S3. You can use the code provided in it along with Using AWS Lambda with S3 to trigger your own function that reorganizes the data. Once the Lambda function has run, the S3 bucket structure should look similar to the structure we show in figure 3.
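To give a sense of what such a function does, here is a simplified sketch of a handler that regroups newly delivered objects by finding type. It is not the template's actual code: it assumes the delivered objects contain one JSON finding event per line and simply prefixes the original key with the finding type.

```
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")


def handler(event, context):
    """Rewrite newly delivered objects into per-finding-type prefixes (simplified sketch)."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        # Group findings in this object by their GuardDuty finding type.
        by_type = {}
        for line in body.splitlines():
            if not line.strip():
                continue
            finding = json.loads(line)
            finding_type = finding.get("detail", {}).get("type", "UNKNOWN")
            by_type.setdefault(finding_type, []).append(line)

        # Write each group under a finding-type prefix, keeping the original datetime path.
        for finding_type, lines in by_type.items():
            s3.put_object(
                Bucket=bucket,
                Key=f"{finding_type}/{key}",
                Body="\n".join(lines).encode("utf-8"),
            )

        # Remove the original object once the regrouped copies exist.
        s3.delete_object(Bucket=bucket, Key=key)
```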
 

Figure 3: Sample S3 bucket structure

Step 4: Catalog the GuardDuty findings using AWS Glue

With the reorganized findings stored in S3, use an AWS Glue crawler to scan and catalog each finding type. The CloudFormation template we provided schedules the crawler to run once a day. You can also run it on demand as needed. To build your own crawler, refer to Cataloging Tables with a Crawler. Assuming GuardDuty has generated findings in your account, you can navigate to the GuardDuty findings database in the AWS Glue Data Catalog. It should look something like figure 4:
 

Figure 4: List of finding type tables in the AWS Glue Catalog

Note: Because AWS Glue crawlers will attempt to combine similar data into one table, you might need to generate sample findings to ensure enough variability for each finding type to have its own table. If you only intend to build your dashboard from a small subset of finding types, you can opt to just edit the crawler to have multiple data sources and specify the folder path for each desired finding type.

Explore the table structure

Before moving on to the next step, take some time to explore the schema structure of the tables. Selecting one of the tables will bring you to a page that looks like what’s shown in figure 5.
 

Figure 5: Schema information for a single finding table

You should see that most of the columns contain basic information about each finding, but there’s a column named detail that is of type struct. Select it to expand, as shown in figure 6.
 

Figure 6: The "detail" column expanded

Ah, this is where the interesting information is tucked away! The tables for each finding may differ slightly, but in all cases the detail column will hold the bulk of the information you’ll want to visualize. See GuardDuty Active Finding Types for information on what you should expect to find in the logs for each finding type. In the next step, we’ll focus on unpacking detail to prepare it for visualization!

Process (Step 5): Unpack nested JSON and configure views with Amazon Athena

 

Figure 7: In this section, we'll cover the services highlighted in blue

Note: This step picks up where the CloudFormation template finishes

Explore the table structure (again) in the Amazon Athena console

Begin by navigating to Athena from the AWS Management Console. Once there, you should see a drop-down menu with a list of databases. These are the same databases that are available in the AWS Glue Data Catalog. Choose the database with your GuardDuty findings and expand a table.
 

Figure 8: Expanded table in the Athena console

This should look very similar to the table information you explored in step 4, including the detail struct!

You’ll need a method to unpack the struct in order to effectively visualize the data. There are many methods and tools to approach this problem. One that we recommend (and will show) is to use SQL queries within Athena to construct tabular views. This approach will allow you to push the bulk of the processing work to Athena. It will also allow you to simplify building visualizations when using Amazon QuickSight by providing a more conventional tabular format.

Extract details for use in visualization using SQL

The following examples contain SQL statements that provide everything you need to extract the relevant fields from the detail struct of the Recon:EC2/PortProbeUnprotectedPort finding to build the Amazon QuickSight dashboard we showcase in the next section. The examples also cover most of the operations you’ll need to work with the elements found in GuardDuty findings (such as deeply nested data with lists), and they serve as a good starting point for constructing your own custom queries. In general, you’ll want to traverse the nested layers (for example, root.detail.service.count) and create new records for each item in an embedded list that you want to target using the UNNEST function. See this blog for even more examples of constructing queries on complex JSON data using Amazon Athena.

Simply copy the SQL statements that you want into the Athena query field to build the port_probe_geo and affected_instances views.

Note: If your account has yet to generate Recon:EC2/PortProbeUnprotectedPort findings, you can generate sample findings to follow along.


CREATE OR REPLACE VIEW "port_probe_geo" AS

WITH getportdetails AS (
    SELECT id, portdetails
    FROM by_finding_type
    CROSS JOIN UNNEST(detail.service.action.portProbeAction.portProbeDetails) WITH ORDINALITY AS p (portdetails, portdetailsindex)
)

SELECT 
    root.id AS id,
    root.region AS region,
    root.time AS time,
    root.detail.type AS type,
    root.detail.service.count AS count, 
    portdetails.localportdetails.port AS localport, 
    portdetails.localportdetails.portname AS localportname, 
    portdetails.remoteipdetails.geolocation.lon AS longitude, 
    portdetails.remoteipdetails.geolocation.lat AS latitude, 
    portdetails.remoteipdetails.country.countryname AS country, 
    portdetails.remoteipdetails.city.cityname AS city 

FROM 
    by_finding_type  as root, getPortDetails
    
WHERE 
    root.id = getportdetails.id;

-- Run this second statement separately in the Athena query editor
CREATE OR REPLACE VIEW "affected_instances" AS

SELECT 
    max(root.detail.service.count) AS count,
    date_parse(root.time,'%Y-%m-%dT%H:%i:%sZ') as time,
    root.detail.resource.instancedetails.instanceid

FROM 
    recon_ec2_portprobeunprotectedport  AS root

GROUP BY  
    root.detail.resource.instancedetails.instanceid, 
    time
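
After you create both views, a quick check such as the following confirms that they return rows:

SELECT * FROM port_probe_geo LIMIT 10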

Visualize (Step 6): Build a GuardDuty findings dashboard in Amazon QuickSight

 

Figure 9: In this section we will cover the services highlighted in blue

Now that you’ve created tabular views using Athena, you can jump into Amazon QuickSight from the AWS Management Console and begin visualizing! If you haven’t already done so, enable Amazon QuickSight in your account by following the instructions for Setting Up Amazon QuickSight.

For this example, we’ll leverage the port_probe_geo view to build a geographic visualization and see the locations from which nefarious actors are launching port probes.

Creating an analysis

In the upper left-hand corner of the Amazon QuickSight console select New analysis and then New data set.
 

Figure 10: Create a new analysis

To utilize the views you built in the previous step, select Athena as the data source. Give your data source a name (in our example, we use “port probe geo”), and select the database that contains the views you created in the previous section. Then select Visualize.
 

Figure 11: Available data sources in Amazon QuickSight. Be sure to choose Athena!

 

Figure 12: Select the "port_probe_geo" view you created in step 5

Viz time!

From the Visual types menu in the bottom left corner, select the globe icon to create a map. Then select the latitude and longitude geospatial coordinates. Choose count (with a max aggregation) for size. Finally, select localportname to break the data down by color.
 

Figure 13: A visual containing a map of port probe scans in Amazon QuickSight

Voila! A detailed map of your environment’s attackers!

Build out a dashboard

Once you like how everything looks, you can move on to adding more visuals to create a full monitoring dashboard.

To add another visual to the analysis, select Add and then Add visual.
 

Figure 14: Add another visual using the 'Add' option from the Amazon QuickSight menu bar

If the new visual will use the same dataset, then you can immediately start selecting fields to build it. If you want to create a visual from a different data set (our example dashboard below adds the affected_instances view), follow the Creating Data Sets guide to add a new data set. Then return to the current analysis and associate the data set with the analysis by selecting the pencil icon shown below and selecting Add data set.
 

Figure 15: Adding a new data set to your Amazon QuickSight analysis

Repeat this process until you’ve built out everything you need in your monitoring dashboard. Once it’s completed, you can publish the dashboard by selecting Share and then Publish dashboard.
 

Figure 16: Publish your dashboard using the "Share" option of the Amazon QuickSight menu

Here’s an example of a dashboard we created using the port_probe_geo and affected_instances views:
 

Figure 17: An example dashboard created using the "port_probe_geo" and "affected_instances" views

What does something like this cost?

To get an idea of the scale of the cost, we’ve provided a small pricing example (accurate as of the writing of this blog) that assumes 10,000 GuardDuty findings per month with an average payload size of 5KB.

Service | Pricing structure | Amount consumed | Total cost
Amazon CloudWatch Events | $1 per million events | 10,000 events | $0.01
Amazon Kinesis Data Firehose | $0.029 per GB ingested | 0.05 GB ingested | $0.00145
Amazon S3 | $0.029 per GB stored per month | 0.1 GB stored | $0.00230
AWS Lambda | First million invocations free | ~200 invocations | $0
Amazon Athena | $5 per TB scanned | 0.003 TB scanned (assumes 2 full data scans per day to refresh views) | $0.015
AWS Glue | $0.44 per DPU-hour (2 DPU minimum and 10 minute minimum) = $0.15 per crawler run | 30 crawler runs | $4.50
Total processing cost | | | $4.53

Oh, the joys of a consumption-based model: Less than five dollars per month for all of that processing!

From here, all that remains are your visualization costs using Amazon QuickSight. This pricing is highly dependent upon your number of users and their respective usage patterns. See the Amazon QuickSight pricing page for more specific details.

Summary

In this post, we demonstrated how you can ingest your GuardDuty findings into S3, process them with AWS Glue and Amazon Athena, and visualize with Amazon QuickSight. All serverless! Each portion of what we showed can be used in tandem or on its own for this or many other data sets. Go launch the template and get started monitoring your AWS environment!

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Ben Romano

Ben is a Solutions Architect in AWS supporting customers in their journey to the cloud with a focus on big data solutions. Ben loves to delight customers by diving deep on AWS technologies and helping them achieve their business and technology objectives.

Author

Jimmy Boyle

Jimmy is a Solutions Architect in AWS with a background in software development. He enjoys working with all things serverless because he doesn’t have to maintain infrastructure. Jimmy enjoys delighting customers to drive their business forward and design solutions that will scale as their business grows.

Visualize over 200 years of global climate data using Amazon Athena and Amazon QuickSight

Post Syndicated from Joe Flasher original https://aws.amazon.com/blogs/big-data/visualize-over-200-years-of-global-climate-data-using-amazon-athena-and-amazon-quicksight/

Climate change continues to have a profound effect on our quality of life, and as a result, interest in sustainability research is growing. Researchers in both the public and private sectors are planning for the future by studying recorded climate history and using climate forecast models.

To help explain these concepts, this post introduces the Global Historical Climatology Network Daily (GHCN-D). This registry is used by the global climate change research community.

This post also provides a step-by-step demonstration of how Amazon Web Services (AWS) services improve access to this data for climate change research. Data scientists and engineers previously had to access hundreds of nodes on high-performance computers to query this data. Now they can get the same data by using a few steps on AWS.

Background

Global climate analysis is essential for researchers to assess the implications of climate change on the Earth’s natural capital and ecosystem resources. This activity requires high-quality climate datasets, which can be challenging to work with because of their scale and complexity. To have confidence in their findings, researchers must be confident about the provenance of the climate datasets that they work with. For example, researchers may be trying to answer questions like: has the climate of a particular food producing area changed in a way that impacts food security? They must be able to easily query authoritative and curated datasets.

The National Centers for Environmental Information (NCEI) in the U.S. maintains a dataset of climate data that is based on observations from weather stations around the globe. It’s the Global Historical Climatology Network Daily (GHCN-D) — a central repository for daily weather summaries from ground-based stations. It comprises millions of quality-assured observations that are updated daily.

The most common parameters recorded are daily temperatures, rainfall, and snowfall. These are useful parameters for assessing risks for drought, flooding, and extreme weather.

The challenge

The NCEI makes the GHCN_D data available in CSV format through an FTP server, organized by year. Organizing the data by year means that a complete copy of the archive requires over 255 files (the first year in the archive is 1763). Traditionally, if a researcher wants to work on this dataset they must download it and work on it locally. For a researcher to be sure of using the latest data for their analysis, they must repeat this download every day.

For researchers, deriving insight from this data can be a challenge. Fully engaging with the data requires technical skill, computing resources, and subject matter expertise.

A new efficient approach

Through AWS’s collaboration with the NOAA Big Data Project, a daily snapshot of the GHCN_D dataset is now available on AWS. The data is publicly accessible through an Amazon S3 bucket. For more information, see the Registry of Open Data on AWS.

Having the data available in this way offers several advantages:

  • The data is globally available to a community of users. Users no longer must download data to work on it. Everyone can work with the same, authoritative copy.
  • Time to insight is reduced. By taking advantage of AWS services, researchers can immediately start to perform analysis.
  • The cost of research is reduced. Researchers can switch off resources as soon as their analysis is finished.

This blog post illustrates a workflow using Amazon S3, Amazon Athena, AWS Glue, and Amazon QuickSight that demonstrates how quickly one can derive insights from this dataset.

The workflow presented in this post follows these general steps:

  • Extract data files from the NOAA bucket and make the data available as tables.
  • Use SQL to query the data contained in the tables.
  • Show how to speed up analysis by creating tables from queries and storing those tables in a private Amazon S3 bucket.
  • Visualize the data to gain insight.

Overview of the GHCN_D dataset

The GHCN-D is a quality-assured dataset that contains daily weather summaries from weather stations across global land areas. It has the following properties:

  • Data is integrated from approximately 30 sources that provide weather observations from various national and international networks.
  • It is a comprehensive dataset for the US, with good coverage for many parts of the world.
  • There are many types of daily weather observations in this dataset, but the majority are maximum temperature, minimum temperature, precipitation, snowfall, and snow depth. These observations come from:
    • Over 35,000 temperature stations.
    • Over 100,000 precipitation stations.
    • Over 50,000 snowfall or snow depth stations.
  • The source of each datum, the term used for a single record, is contained in the dataset. Each datum has a quality control flag associated with it.
  • The dataset is updated daily. The historic sources are reprocessed weekly.

You can see in the graphic below how the data volume has grown in recent decades.

Figure 1: 1763 to 2018. For 1763 there are less than a thousand observations. For 2017 there are over 34 million observations.

Organization of the data on Amazon S3

As previously mentioned, the GHCN-D dataset is accessible through an Amazon S3 bucket. The details of the dataset are on the Registry of Open Data on AWS (RODA). The landing page for the dataset on RODA contains a link to a comprehensive readme file for the dataset. This readme contains all of the lookup tables and variable definitions.

This section shows the pertinent information required to start working with the dataset.

The data is in a text, or comma-separated values (CSV), format and is contained in the Amazon S3 bucket called noaa-ghcn-pds.

The noaa-ghcn-pds bucket contains virtual folders, and is structured like this:

  • noaa-ghcn-pds. This is the root of the bucket with two subdirectories and a number of useful files. For the purposes of this exercise, we use only the ghcnd-stations.txt file. This file contains information about the observation stations that produced the data for the GHCN_D dataset. You must download the ghcnd-stations.txt file.
  • noaa-ghcn-pds/csv/. This virtual folder contains all of the observations from 1763 to the present organized in .csv files, one file for every year. For this exercise, we’ll collate this data into a single table.

Also for the purpose of this exercise, the data from ‘ghcnd-stations.txt’ and the data contained in noaa-ghcn-pds/csv/ are extracted and added to two separate tables. These tables are the basis of the analysis.

The tables are labeled as:

  • tblallyears. This table contains all the records stored in the yearly .csv files from 1763 to present.
  • tblghcnd_stations. This table contains information for over 106,430 weather stations.

Point of interest: the .csv file from the year 1763 contains the data for one weather station. That station was located in the center of Milan, Italy.

The tools

To implement the general workflow in this exercise, we’re using the following tools:

  • Amazon Simple Storage Service (Amazon S3) to stage the data for analysis. The GHCN_D dataset is stored in a bucket on Amazon S3. We also use a private bucket to store new tables created from queries.
  • Amazon Athena to query data stored on Amazon S3 using standard SQL.
  • AWS Glue to extract and load data into Athena from the Amazon S3 buckets in which it is stored. AWS Glue is a fully managed extract, transform, and load (ETL) service.
  • AWS Glue Data Catalog to catalog the data that we query with Athena.
  • Amazon QuickSight to build visualizations, perform ad hoc analyses, and get insights from the dataset. Queries and tables from Athena can be read directly from Amazon QuickSight. Amazon QuickSight can also run queries against tables in Athena.

To implement the processes outlined in this post, you need an AWS Account. For more information about creating an AWS account, see Getting Started with AWS. You also must create a private Amazon S3 bucket located in the N. Virginia AWS Region. For more information, see Create a Bucket.

When you create the bucket, it must contain the following empty directories:

  1. [your_bucket_name]/stations_raw/
  2. [your_bucket_name]/ghcnblog/
  3. [your_bucket_name]/ghcnblog/stations/
  4. [your_bucket_name]/ghcnblog/allyears/
  5. [your_bucket_name]/ghcnblog/1836usa/

The following is an overview of how the various AWS services interact in this workflow.

Note

The AWS services are in the same AWS Region. One of the Amazon S3 buckets is the existing one that stores the GHCN_D data. The other Amazon S3 bucket is the bucket that you use for storing tables.

Figure 2: How the AWS services work together to compose this workflow.

The workflow

Now that we have the tools and the data, we are ready to:

  1. Extract the yearly .csv files and add them to a table in Amazon Athena.
  2. Extract the stations text file and add it to a separate table in Amazon Athena.

Extract the yearly .csv files and add them to a table in Amazon Athena

The complete set of daily weather observations is organized by year in one folder of the Amazon S3 bucket in .csv format. The path to the data is s3://noaa-ghcn-pds/csv/.

Each file is named by year beginning with 1763.csv and progressing one year at a time up to the present.

From the AWS console, click on AWS Athena. This takes you to the main dashboard for Athena. From here, click on AWS Glue Data Catalog. This brings you to AWS Glue.

In AWS Glue, choose the Tables section on the left side. Then, in the Add table drop-down menu, choose Add table manually. A series of forms displays for you to add the following information:

  • Set up your table’s properties:
    • Give the new table a name, for example, tblallyears
    • Create a database and name it ghcnblog.

The database then appears in the Athena dashboard.

  • Add a data store:
    • Choose the Specified path in another account option, and enter the following path in the text box: s3://noaa-ghcn-pds/csv/
  • Choose a data format:
    • Select CSV, then select Comma as the delimiter.
  • Define a schema:
    • Add the following columns as string variables:
      • id
      • year_date
      • element
      • data_value
      • m_flag
      • q_flag
      • s_flag
      • obs_time

For a full description of the variables and data structures, see the readme file.

  • Choose OK, then Finish.

Now return to the Athena dashboard, and choose the database that you created. The table will appear in the list of tables on the left. You can now preview the data by choosing the ‘Preview table’ option to the right of the table.
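
If you prefer writing DDL to using the AWS Glue console, you can create a roughly equivalent table directly from the Athena query editor. The following statement is only a sketch that mirrors the columns and S3 location listed in the steps above:

CREATE EXTERNAL TABLE ghcnblog.tblallyears (
  id string,
  year_date string,
  element string,
  data_value string,
  m_flag string,
  q_flag string,
  s_flag string,
  obs_time string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://noaa-ghcn-pds/csv/'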

Use CTAS to speed up queries

As a final step, create a table using the SQL statement called CREATE TABLE AS SELECT (CTAS). Store the table in a private Amazon S3 bucket.

This step dramatically speeds up queries, because we extract the data once and store it in a columnar format (Parquet) in the private Amazon S3 bucket.

To illustrate the improvement in speed, here are two examples:

  • A query that counts all of the distinct IDs, meaning unique weather stations, takes around 55 seconds and scans around 88 GB of data.
  • The same query on the converted data takes around 13 seconds and scans about 5 GB of data.

To create the table for this final step:

  1. Open the Athena console.
  2. In the dashboard, select New query, then enter the query as shown in the following example. Make sure to enter the information that’s applicable to your particular situation, such as your bucket name.
    Figure 3: Query to create a table, converting the data into Parquet and storing it in your S3 bucket.
  3. Make sure that the data format is Parquet.
  4. Name your table tblallyears_qa.
  5. Point the table to this folder in the private Amazon S3 bucket: [your_bucket_name]/ghcnblog/allyearsqa/. Replace your_bucket_name with your specific bucket name.

The new table appears in your database, listed on the left side of the Athena dashboard. This is the table that we work with going forward.
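
Because the query in figure 3 appears only as an image, the following sketch shows what such a CTAS statement could look like. The table name, Parquet format, and storage location follow the steps above; replace your_bucket_name with your own bucket, and note that any additional filtering in the original query is omitted here:

CREATE TABLE ghcnblog.tblallyears_qa
WITH (
  format = 'PARQUET',
  external_location = 's3://[your_bucket_name]/ghcnblog/allyearsqa/'
) AS
SELECT id, year_date, element, data_value, m_flag, q_flag, s_flag, obs_time
FROM ghcnblog.tblallyears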

Extract the stations text file and add it to a separate table in Amazon Athena

The stations text file contains information about the weather stations, such as location, nationality, and ID. This data is kept in a separate file from the yearly observations. We need to import this data to look at the geographical spread of weather observations. While dealing with this file is a bit more complicated, the steps to importing this data into Athena are similar to what we have already done.

To import this data into Athena:

  1. Download the ghcnd-stations text file.
  2. Open the file in a spreadsheet program and use the fixed width-delimited data import function. The fixed widths of the columns are described in the readme file in the section called FORMAT OF “ghcnd-stations.txt” file.
  3. After you successfully import the data, save the spreadsheet as a .csv text file.
  4. Copy the new .csv file to [your_bucket_name]/stations_raw/. Replace your_bucket_name with your specific bucket name.
  5. Using this new .csv file, follow the Add table process steps in AWS Glue, as described earlier in this post.
    • Use the following field names:
      • id
      • latitude
      • longitude
      • elevation
      • state
      • name
      • gsn_flag
      • hcn_flag
      • wmo_id
    • Name the table tblghcnd_stations.
  6. After the table is created, follow the CREATE TABLE AS SELECT (CTAS) steps for this table as described earlier in this post.
    • Name the new table tblghcnd_stations_qa.
    • Store the new table in [your_bucket_name]/ghcnblog/stations/. Replace your_bucket_name with your specific bucket name.

The two most important datasets of GHCN_D are now in Athena.

In the next section, we run queries against these tables and visualize the results using Amazon QuickSight.

Exploratory data analysis and visualization

With our two tables created, we are now ready to query and visualize to gain insight.

Exploratory data analysis

In the Athena query window, run the following queries to get an idea of the size of the dataset.

Query #1: the total number of observations since 1763:

Figure 4: Query for total number of observations. This was run in autumn 2018. The dataset is continually growing over time.
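
The query in figure 4 is shown as an image; a simple count along the following lines, run against the tblallyears_qa table, produces that total:

SELECT count(*) AS total_observations
FROM tblallyears_qa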

Query #2: the number of stations since 1763:

Figure 5: Query for total number of stations that have made observations since 1763. Deactivated stations are included.
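
The query in figure 5 is also an image; counting distinct station IDs gives the number of stations that have reported observations:

SELECT count(DISTINCT id) AS number_of_stations
FROM tblallyears_qa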

Average weather parameters for the Earth

The following figure shows a query that calculates the average maximum temperature (Celsius), average minimum temperature (Celsius), and average rainfall (mm) for the Earth since 1763.

In the query, we must convert the data_value from a String variable to a Real variable. We also must divide by 10, because the temperature and precipitation measurements are in tenths of their respective units. For more information about these details and the element codes (TMIN, TMAX, and PRCP), see the readme file.

Figure 6. Querying for global averages to get to Earth’s pulse.
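
The query in figure 6 is shown as an image. A sketch of a query that computes these averages, using the casting and division by 10 described above and the element codes from the readme file, might look like this:

SELECT
  avg(CASE WHEN element = 'TMAX' THEN CAST(data_value AS real) / 10 END) AS avg_max_temp_c,
  avg(CASE WHEN element = 'TMIN' THEN CAST(data_value AS real) / 10 END) AS avg_min_temp_c,
  avg(CASE WHEN element = 'PRCP' THEN CAST(data_value AS real) / 10 END) AS avg_rainfall_mm
FROM tblallyears_qa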

It would be convenient if we could just run simple queries, such as this one, on this dataset and accept that the results are correct.

The previous query assumes an even and equal spread of weather stations around the world since 1763. In fact, the number and spread of weather stations varied over time.

Visualizing the growth in numbers of weather stations over time

The following steps use Amazon QuickSight to visualize the number of weather stations reporting in each year since 1763.

Note: You must be signed up for Amazon QuickSight to complete these steps. During the sign-up process, you are prompted to manage your Amazon QuickSight data source permissions. At this time, use step 3 in the following procedure to grant access to the Amazon S3 buckets and to Athena.

The steps are as follows:

  1. Open the Amazon QuickSight console.
  2. On the far right of the dashboard, choose Manage QuickSight.
  3. Choose Account Settings, then Manage QuickSight permissions. Give Amazon QuickSight permission to access Athena, and to read the Amazon S3 bucket that contains the new tables.
  4. Return to Amazon QuickSight by choosing the logo on the top left side of the screen.
  5. From the Amazon QuickSight dashboard, choose New analysis, then New data set.
  6. From the Create a Data Set tiles, choose Athena. Name the data source, for example ghcnblog, then choose Create data source.
  7. Choose the option to add a custom SQL, then add the SQL, as shown in the following example:

Figure 7: Location to add a custom SQL query.
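
The custom SQL in figure 7 is an image. A query along these lines, assuming that year_date is stored as a yyyymmdd string as in the GHCN-D CSV files, returns the year and number_of_stations fields used in the chart:

SELECT
  substr(year_date, 1, 4) AS year,
  count(DISTINCT id) AS number_of_stations
FROM tblallyears_qa
GROUP BY substr(year_date, 1, 4)
ORDER BY year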

  8. Choose Confirm query.
  9. Choose Directly query your data.
  10. Choose Visualize.
  11. To make the graph, choose the line chart graph. Add year to the X-axis and number_of_stations to the Value field wells. The options appear to the left of the visualization dashboard.

Figure 8. The number of global weather stations used by GHCN_D over time.

The resulting graph shows that the number and spread of stations around the world has varied over time.

A look at the development of observation in the US

1836 is the year of the first US observation station in the data set. To get an insight into the development of observations in the US, we extracted a subset of US data from the main data source (tblallyears_qa). This dataset features annual data every 30th year from 1836 to 2016.

This query generates a large dataset. To improve performance, save the query as a table stored in an Amazon S3 bucket using the previously described procedure.

The query to do this in one step is shown in the following figure.

Figure 9: The SQL to create a table from a query and store it in Parquet format in a user-specified Amazon S3 bucket.
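
Because the statement in figure 9 is an image, the following sketch shows one way to build such a table. It assumes that US station IDs begin with 'US' and that year_date is a yyyymmdd string; the exact columns and filters in the original query may differ, and you should replace your_bucket_name with your own bucket:

CREATE TABLE ghcnblog.tbl1836every30thyear
WITH (
  format = 'PARQUET',
  external_location = 's3://[your_bucket_name]/ghcnblog/1836usa/'
) AS
SELECT o.id, o.year_date, o.element, o.data_value, s.state, s.latitude, s.longitude
FROM tblallyears_qa o
JOIN tblghcnd_stations_qa s ON o.id = s.id
WHERE o.id LIKE 'US%'
  AND CAST(substr(o.year_date, 1, 4) AS integer) BETWEEN 1836 AND 2016
  AND (CAST(substr(o.year_date, 1, 4) AS integer) - 1836) % 30 = 0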

The table appears in the Amazon Athena dashboard as tbl1836every30thyear and it forms the basis for our analysis.

In the Amazon QuickSight console, use the following SQL to generate a new dataset.

Figure 10: The SQL to create a dataset for viewing USA data in Amazon QuickSight.
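
As with the earlier custom SQL, figure 10 is an image; a sketch of a dataset query that yields the state, year, and number_of_stations fields used in the map could look like this:

SELECT
  state,
  substr(year_date, 1, 4) AS year,
  count(DISTINCT id) AS number_of_stations
FROM tbl1836every30thyear
GROUP BY state, substr(year_date, 1, 4)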

  1. Choose Confirm query.
  2. Choose Directly query your data.
  3. Choose Visualize.

This brings you back to the visualization dashboard. From this dashboard, choose the Points on a map visualization, and set up the fields as follows:

  • Geospatial: state
  • Size: number_of_stations, aggregate by count.
  • Color: year

The results should be the following map of the US showing the growth of weather stations used by GHCN_D from 1836 to 2016 at 30-year increments. In 1836, there was one station. By 2016, there were thousands.

Figure 11: The growth of the number of observations stations in the US.

Interestingly, some states had more stations in 1956 than they did in 1986. This is also illustrated in the following figure. The data for the figure was derived from the previous dataset.

Figure 12: This heat map illustrates the number of stations per state over time. This is a 30th year snapshot.

A look at more data

We now have a data lake of GHCN_D data. By using the tools that we have assembled, we are free to experiment with the data. It is now possible to construct queries and visualizations on this huge dataset to gain insights.

The following figure shows the heat map that we created. It shows the average minimum temperature in US states over time. As before, we are looking at 30-year snapshots; that is to say, every 30th year we take a yearly average.

Figure 13: This heat map illustrates the minimum temperature for each state over time. A yearly average every 30th year starting at 1836.

Summary

Our headlines are full of climate change and sustainability stories, and research and analysis have become more crucial than ever.

We showed researchers, analysts, and scientists how AWS services have reduced the level of technical skills required to fully use the GHCN_D dataset.

The GHCN-D is available on AWS. The details can be found on the Registry of Open Data on AWS. This data is available to researchers studying climate change and weather impacts.

This blog post demonstrated a typical workflow that a researcher could use to engage with and analyze this important data by using Amazon Athena, AWS Glue, and Amazon S3, and how they can visualize insights by using Amazon QuickSight.

By making this data available, Amazon has removed the heavy lifting that was traditionally required to work with the GHCN_D, thus expanding the opportunity for new research and new discoveries.

 


About the Authors

Joe Flasher is the Open Geospatial Data Lead at Amazon Web Services, helping organizations most effectively make data available for analysis in the cloud. Joe has been working with geospatial data and open source projects for the past decade, both as a contributor and maintainer. He has been a member of the Landsat Advisory Group, and has worked on projects, ranging from building GIS software to making the space shuttle fly. Joe’s background is in astrophysics, but he kindly requests you don’t ask him any questions about constellations.

 

 

Conor Delaney, PhD. is an environmental data scientist.

Our data lake story: How Woot.com built a serverless data lake on AWS

Post Syndicated from Karthik Kumar Odapally original https://aws.amazon.com/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws/

In this post, we talk about designing a cloud-native data warehouse as a replacement for our legacy data warehouse built on a relational database.

At the beginning of the design process, the simplest solution appeared to be a straightforward lift-and-shift migration from one relational database to another. However, we decided to step back and focus first on what we really needed out of a data warehouse. We started looking at how we could decouple our legacy Oracle database into smaller microservices, using the right tool for the right job. Our process wasn’t just about using the AWS tools. More than that, it was about a shift in mindset toward cloud-native technologies to get us to our final state.

This migration required developing new extract, transform, load (ETL) pipelines to get new data flowing in while also migrating existing data. Because of this migration, we were able to deprecate multiple servers and move to a fully serverless data warehouse orchestrated by AWS Glue.

In this blog post, we are going to show you:

  • Why we chose a serverless data lake for our data warehouse.
  • An architectural diagram of Woot’s systems.
  • An overview of the migration project.
  • Our migration results.

Architectural and design concerns

Here are some of the design points that we considered:

  • Customer experience. We always start with what our customer needs, and then work backwards from there. Our data warehouse is used across the business by people with varying level of technical expertise. We focused on the ability for different types of users to gain insights into their operations and to provide better feedback mechanisms to improve the overall customer experience.
  • Minimal infrastructure maintenance. The “Woot data warehouse team” is really just one person—Chaya! Because of this, it’s important for us to focus on AWS services that enable us to use cloud-native technologies. These remove the undifferentiated heavy lifting of managing infrastructure as demand changes and technologies evolve.
  • Responsiveness to data source changes. Our data warehouse gets data from a range of internal services. In our existing data warehouse, any updates to those services required manual updates to ETL jobs and tables. The response times for these data sources are critical to our key stakeholders. This requires us to take a data-driven approach to selecting a high-performance architecture.
  • Separation from production systems. Access to our production systems is tightly coupled. To allow multiple users, we needed to decouple it from our production systems and minimize the complexities of navigating resources in multiple VPCs.

Based on these requirements, we decided to change the data warehouse both operationally and architecturally. From an operational standpoint, we designed a new shared responsibility model for data ingestion. Architecturally, we chose a serverless model over a traditional relational database. These two decisions ended up driving every design and implementation decision that we made in our migration.

As we moved to a shared responsibility model, several important points came up. First, our new way of data ingestion was a major cultural shift for Woot’s technical organization. In the past, data ingestion had been exclusively the responsibility of the data warehouse team and required customized pipelines to pull data from services. We decided to shift to “push, not pull”: Services should send data to the data warehouse.

This is where shared responsibility came in. For the first time, our development teams had ownership over their services’ data in the data warehouse. However, we didn’t want our developers to have to become mini data engineers. Instead, we had to give them an easy way to push data that fit with the existing skill set of a developer. The data also needed to be accessible by the range of technologies used by our website.

These considerations led us to select the following AWS services for our serverless data warehouse: Amazon Kinesis Data Firehose, Amazon S3, AWS Lambda, AWS Glue, Amazon Athena, and Amazon QuickSight.

The following diagram shows at a high level how we use these services.

Tradeoffs

These components together met all of our requirements and enabled our shared responsibility model. However, we made a few tradeoffs compared to a lift-and-shift migration to another relational database:

  • The biggest tradeoff was upfront effort vs. ongoing maintenance. We effectively had to start from scratch with all of our data pipelines and introduce a new technology into all of our website services, which required a concerted effort across multiple teams. Minimal ongoing maintenance was a core requirement. We were willing to make this tradeoff to take advantage of the managed infrastructure of the serverless components that we use.
  • Another tradeoff was balancing usability for nontechnical users vs. taking advantage of big data technologies. Making customer experience a core requirement helped us navigate the decision-making when considering these tradeoffs. Ultimately, only switching to another relational database would mean that our customers would have the same experience, not a better one.

Building data pipelines with Kinesis Data Firehose and Lambda

Because our site already runs on AWS, using an AWS SDK to send data to Kinesis Data Firehose was an easy sell to developers. Our considerations included the following:

  • Direct PUT ingestion for Kinesis Data Firehose is natural for developers to implement, works in all languages used across our services, and delivers data to Amazon S3.
  • Using S3 for data storage means that we automatically get high availability, scalability, and durability. And because S3 is a global resource, it enables us to manage the data warehouse in a separate AWS account and avoid the complexity of navigating multiple VPCs.

We also consume data stored in Amazon DynamoDB tables. Kinesis Data Firehose again provided the core of the solution, this time combined with DynamoDB Streams and Lambda. For each DynamoDB table, we enabled DynamoDB Streams and then used the stream to trigger a Lambda function.

The Lambda function cleans the DynamoDB stream output and writes the cleaned JSON to Kinesis Data Firehose using boto3. After doing this, it converges with the other process and outputs the data to S3. For more information, see How to Stream Data from Amazon DynamoDB to Amazon Aurora using AWS Lambda and Amazon Kinesis Firehose on the AWS Database Blog.

Lambda gave us more fine-grained control and enabled us to move files between accounts:

  • We enabled S3 event notifications on the S3 bucket and created an Amazon SNS topic to receive notifications whenever Kinesis Data Firehose put an object in the bucket.
  • The SNS topic triggered a Lambda function, which took the Kinesis output and moved it to the data warehouse account in our chosen partition structure.

S3 event notifications can trigger Lambda functions, but we chose SNS as an intermediary because the S3 bucket and Lambda function were in separate accounts.

Migrating existing data with AWS DMS and AWS Glue

We needed to migrate data from our existing RDS database to S3, which we accomplished with AWS DMS. DMS natively supports S3 as a target, as described in the DMS documentation.

Setting this up was relatively straightforward. We exported data directly from our production VPC to the separate data warehouse account by tweaking the connection attributes in DMS. The string that we used was this:

"cannedAclForObjects=BUCKET_OWNER_FULL_CONTROL;compressionType=GZIP;addColumnName=true;”

This code gives ownership to the bucket owner (the destination data warehouse account), compresses the files to save on storage costs, and includes all column names. After the data was in S3, we used an AWS Glue crawler to infer the schemas of all exported tables and then compared against the source data.

With AWS Glue, some of the challenges we overcame were these:

  • Unstructured text data, such as forum and blog posts. DMS exports these to CSV. This approach conflicted with the commas present in the text data. We opted to use AWS Glue to export data from RDS to S3 in Parquet format, which is unaffected by commas because it encodes columns directly.
  • Cross-account exports. We resolved this by including the code

"glueContext._jsc.hadoopConfiguration().set("fs.s3.canned.acl", "BucketOwnerFullControl”)”

at the top of each AWS Glue job to grant bucket owner access to all S3 files produced by AWS Glue.

Overall, AWS DMS was quicker to set up and great for exporting large amounts of data with rule-based transformations. AWS Glue required more upfront effort to set up jobs, but provided better results for cases where we needed more control over the output.

If you’re looking to convert existing raw data (CSV or JSON) into Parquet, you can set up an AWS Glue job to do that. The process is described in the AWS Big Data Blog post Build a data lake foundation with AWS Glue and Amazon S3.

Bringing it all together with AWS Glue, Amazon Athena, and Amazon QuickSight

After data landed in S3, it was time for the real fun to start: actually working with the data! Can you tell I’m a data engineer? For me, a big part of the fun was exploring AWS Glue:

  • AWS Glue handles our ETL job scheduling.
  • AWS Glue crawlers manage the metadata in the AWS Glue Data Catalog.

Crawlers are the “secret sauce” that enables us to be responsive to schema changes. Throughout the pipeline, we chose to make each step as schema-agnostic as possible, which allows any schema changes to flow through until they reach AWS Glue.

However, raw data is not ideal for most of our business users, because it often has duplicates or incorrect data types. Most importantly, the data out of Firehose is in JSON format, but we quickly observed significant query performance gains from using Parquet format. Here, we used one of the performance tips in the Big Data Blog post Top 10 performance tuning tips for Amazon Athena.

With our shared responsibility model, the data warehouse and BI teams are responsible for the final processing of data into curated datasets ready for reporting. Using Lambda and AWS Glue enables these teams to work in Python and SQL (the core languages for Amazon data engineering and BI roles). It also enables them to deploy code with minimal infrastructure setup or maintenance.

Our ETL process is as follows:

  • Scheduled triggers.
  • Series of conditional triggers that control the flow of subsequent jobs that depend on previous jobs.
  • A similar pattern across many jobs of reading in the raw data, deduplicating the data, and then writing to Parquet. We centralized this logic by creating a Python library of functions and uploading it to S3. We then included that library in the AWS Glue job as an additional Python library. For more information on how to do this, see Using Python Libraries with AWS Glue in the AWS Glue documentation.

We also migrated complex jobs used to create reporting tables with business metrics:

  • The AWS Glue use of PySpark simplified the migration of these queries, because you can embed SparkSQL queries directly in the job.
  • Converting to SparkSQL took some trial and error, but ultimately required less work than translating SQL queries into Spark methods. However, for people on our BI team who had previously worked with Pandas or Spark, working with Spark dataframes was a natural transition. As someone who used SQL for several years before learning Python, I appreciate that PySpark lets me quickly switch back and forth between SQL and an object-oriented framework.

Another hidden benefit of using AWS Glue jobs is that the AWS Glue version of Python (like Lambda) already has boto3 installed. Thus, ETL jobs can directly use AWS API operations without additional configuration.

For example, some of our longer-running jobs created read inconsistency if a user happened to query that table while AWS Glue was writing data to S3. We modified the AWS Glue jobs to write to a temporary directory with Spark and then used boto3 to move the files into place. Doing this reduced read inconsistency by up to 90 percent. It was great to have this functionality readily available, which may not have been the case if we managed our own Spark cluster.

Comparing previous state and current state

After we had all the datasets in place, it was time for our customers to come on board and start querying. This is where we really leveled up the customer experience.

Previously, users had to download a SQL client, request a user name and password, set it up, and learn SQL to get data out. Now, users just sign in to the AWS Management Console through automatically provisioned IAM roles and run queries in their browser with Athena. Or if they want to skip SQL altogether, they can use our Amazon QuickSight account with accounts managed through our pre-existing Active Directory server.

Integration with Active Directory was a big win for us. We wanted to enable users to get up and running without having to wait for an account to be created or managing separate credentials. We already use Active Directory across the company for access to multiple resources. Upgrading to Amazon QuickSight Enterprise Edition enabled us to manage access with our existing AD groups and credentials.

Migration results

Our legacy data warehouse was developed over the course of five years. We recreated it as a serverless data lake using AWS Glue in about three months.

In the end, it took more upfront effort than simply migrating to another relational database. We also dealt with more uncertainty because we used many products that were relatively new to us (especially AWS Glue).

However, in the months since the migration was completed, we’ve gotten great feedback from data warehouse users about the new tools. Our users have been amazed by these things:

  • How fast Athena is.
  • How intuitive and beautiful Amazon QuickSight is. They love that no setup is required—it’s easy enough that even our CEO has started using it!
  • That Athena plus the AWS Glue Data Catalog have given us the performance gains of a true big data platform, but for end users it retains the look and feel of a relational database.

Summary

From an operational perspective, the investment has already started to pay off. Literally: Our operating costs have fallen by almost 90 percent.

Personally, I was thrilled that recently I was able to take a three-week vacation and didn’t get paged once, thanks to the serverless infrastructure. And for our BI engineers in addition to myself, the S3-centric architecture is enabling us to experiment with new technologies by integrating seamlessly with other services, such as Amazon EMR, Amazon SageMaker, Amazon Redshift Spectrum, and Lambda. It’s been exciting to see how these services have grown in the time since we’ve adopted them (for example, the recent AWS Glue launch of Amazon CloudWatch metrics and Athena’s launch of views).

We are thrilled that we’ve invested in technologies that continue to grow as we do. We are incredibly proud of our team for accomplishing this ambitious migration. We hope our experience can inspire other engineers to dive in to building a data lake of their own.

For additional information, see the related posts on the AWS Big Data Blog.


About the authors

Chaya Carey is a data engineer at Woot.com. At Woot, she’s responsible for managing the data warehouse and other scalable data solutions. Outside of work, she’s passionate about Seattle’s bar and restaurant scene, books, and video games.

 

 

 

Karthik Odapally is a senior solutions architect at AWS. His passion is to build cost-effective and highly scalable solutions on the cloud. In his spare time, he bakes cookies and cupcakes for family and friends here in the PNW. He loves vintage racing cars.

 

 

 

 

Analyze and visualize nested JSON data with Amazon Athena and Amazon QuickSight

Post Syndicated from Mariano Kamp original https://aws.amazon.com/blogs/big-data/analyze-and-visualize-nested-json-data-with-amazon-athena-and-amazon-quicksight/

Although structured data remains the backbone for many data platforms, increasingly unstructured or semistructured data is used to enrich existing information or to create new insights. Amazon Athena enables you to analyze a wide variety of data. This includes tabular data in comma-separated value (CSV) or Apache Parquet files, data extracted from log files using regular expressions, and JSON-formatted data. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.

In this blog post, I show you how to use JSON-formatted data and translate a nested data structure into a tabular view. For data engineers, using this type of data is becoming increasingly important. For example, you can use API-powered data feeds from operational systems to create data products. Such data can also help to add more finely grained facets to your understanding of customers and interactions. Understanding the fuller picture helps you better understand your customers and tailor experiences or predict outcomes.

To illustrate, I use an end-to-end example. It processes financial data retrieved from an API operation that is formatted as JSON. We analyze the data in Amazon Athena and visualize the results in Amazon QuickSight. Along the way, we compare and contrast alternative options.

The result looks similar to what you see below.

Analyzing JSON-formatted data

For our end-to-end example, we use financial data as provided by IEX. The financials API call pulls income statement, balance sheet, and cash flow data from four reported years of a stock.

Following, you can see example output. On the top level is an attribute called symbol, which identifies the stock described here: Apple. On the same level is an attribute called financials. This is a data container. The actual information is one level below, including such attributes as reportDate, cashflow, and researchAndDevelopment.

The data container is an array. In the example following, financial data for only one year is shown. However, the {...} indicates that there might be more. In our case, data for four years is returned when making the actual API call.

{
  "symbol": "AAPL",
  "financials": [
    {
      "reportDate": "2017-03-31",
      "grossProfit": 20591000000,
      "costOfRevenue": 32305000000,
      "operatingRevenue": 52896000000,
      "totalRevenue": 52896000000,
      "operatingIncome": 14097000000,
      "netIncome": 11029000000,
      "researchAndDevelopment": 2776000000,
      "operatingExpense": 6494000000,
      "currentAssets": 101990000000,
      "totalAssets": 334532000000,
      "totalLiabilities": 200450000000,
      "currentCash": 15157000000,
      "currentDebt": 13991000000,
      "totalCash": 67101000000,
      "totalDebt": 98522000000,
      "shareholderEquity": 134082000000,
      "cashChange": -1214000000,
      "cashFlow": 12523000000,
      "operatingGainsLosses": null
    } // , { ... }
  ]
}

Data is provided for free by IEX (see the IEX Terms of Use).

It has become commonplace to use external data from API operations as feeds into Amazon S3. Although this is usually done in an automated fashion, in our case we manually acquire the API call’s results.

To download the data, you can use a script, described following.

Alternatively, you can click the following three links: 1, 2, 3. You can then save the resulting JSON files to your local disk and upload the JSON to an Amazon S3 bucket. In my case, the location of the data is s3://athena-json/financials, but you should create your own bucket. The result looks similar to this:

You can also use a Unix-like shell on your local computer or on an Amazon EC2 instance to populate a S3 location with the API data:

$ curl -s "https://api.iextrading.com/1.0/stock/aapl/financials?period=annual" > aapl.json 
$ curl -s "https://api.iextrading.com/1.0/stock/nvda/financials?period=annual" > nvda.json 
$ curl -s "https://api.iextrading.com/1.0/stock/fb/financials?period=annual" > fb.json 

$ ls -ltrh *.json
-rw-r--r--  1 mkamp  ANT\Domain Users   2.2K Nov 21 16:57 aapl.json
-rw-r--r--  1 mkamp  ANT\Domain Users   2.1K Nov 21 16:57 nvda.json
-rw-r--r--  1 mkamp  ANT\Domain Users   2.1K Nov 21 16:57 fb.json 

$ aws s3 sync . s3://athena-json/financials/ --exclude "*" --include "*.json"
upload: ./aapl.json to s3://athena-json/financials/aapl.json   
upload: ./nvda.json to s3://athena-json/financials/nvda.json   
upload: ./fb.json to s3://athena-json/financials/fb.json       

$ aws s3 ls s3://athena-json/financials/
2018-11-21 16:58:30       2245 aapl.json
2018-11-21 16:58:30       2162 fb.json
2018-11-21 16:58:30       2150 nvda.json

Mapping JSON structures to table structures

Now we have the data in S3. Let’s make it accessible to Athena. This is a simple two-step process:

  1. Create metadata. Doing so is analogous to traditional databases, where we use DDL to describe a table structure. This step maps the structure of the JSON formatted data to columns.
  2. Specify where to find the JSON files.

We can use all information of the JSON file at this time, or we can concentrate on mapping the information that we need today. The new data structure in Athena overlays the files in S3 only virtually. Therefore, even though we just map a subset of the contained information at this time, all information is retained in the files and can be used later on as needed. This is a powerful concept and enables an iterative approach to data modeling.

You can use the following SQL statement to create the table. The table is then named financials_raw—see (1) following. We use that name to access the data from this point on. We map the symbol and the list of financials as an array and some figures. We define that the underlying files are to be interpreted as JSON in (2), and that the data lives following s3://athena-json/financials/ in (3).

CREATE EXTERNAL TABLE financials_raw ( -- (1)
    symbol string,
    financials array<
        struct<reportdate: string,
             grossprofit: bigint,
             totalrevenue: bigint,
             totalcash: bigint,
             totaldebt: bigint,
             researchanddevelopment: bigint>>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' -- (2)
LOCATION 's3://athena-json/financials/' -- (3)

You can run this statement using the Athena console as depicted following:

After you run the SQL statement on the left, the just-created table financials_raw is listed under the heading Tables. Now let’s have a look what’s in this table. Choose the three vertical dots to the right of the table name and choose Preview table. Athena creates a SELECT statement to show 10 rows of the table:

Looking at the output, you can see that Athena was able to understand the underlying data in the JSON files. Specifically, we can see two columns:

  • symbol, which contains flat data, the symbol of the stock
  • financials, which now contains an array of financials reports

If you look closely and observe the reportdate attribute, you find that the row contains more than one financial report.

Even though the data is nested—in our case financials is an array—you can access the elements directly from your column projections:

SELECT
  symbol, 
  financials[1].reportdate one_report_date, -- indexes start with 1
  financials[1].totalrevenue one_total_revenue,
  financials[2].reportdate another_report_date,
  financials[2].totalrevenue another_total_revenue
FROM
  financials_raw
ORDER BY
  1 -- the 1 indicates to order by the first column

As you can see preceding, all data is accessible. From this point on, it is structured, nested data, but not JSON anymore.

It’s still not tabular, though. We come back to this in a minute. First let’s have a look at a different way that would also have brought us to this point.

Alternative approach: Deferring the JSON extraction to query time

There are many different ways to use JSON formatted data in Athena. In the previous section, we use a simple, explicit, and rigid approach. In contrast, we now see a rather generic, dynamic approach.

In this case, we defer the final decisions about the data structures from table design to query design. To do that, we leave the data untouched in its JSON form as long as possible. As a consequence, the CREATE TABLE statement is much simpler than in the previous section:

CREATE EXTERNAL TABLE financials_raw_json (
  -- Using a mixed approach, extracting the symbol for 
  -- convenience directly from the JSON data
  symbol string,
  -- But otherwise storing the RAW JSON Data as string
  financials string 
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://athena-json/financials/' 

Executing the following statement shows that the data is accessible:

SELECT * FROM financials_raw_json

Even though the data is now accessible, it is only treated as a single string or varchar. This type is generic and doesn’t reflect the rich structure and the attributes of the underlying data.

But before diving into the richness of the data, I want to acknowledge that it's hard to see from the query results which data type a column has. When running your queries, the focus is on the actual data, so seeing the data types all the time can be distracting. However, when designing your queries and data structures, it is useful to check the types with typeof. For example, use the following SQL statement:

SELECT typeof(financials) FROM financials_raw_json

Using this SQL statement, you can verify for yourself that the column is treated as a varchar.

To now introduce the data structure during query design, Athena provides specific functionality covered in the documentation to work with JSON formatted data.

The following examples show how to extract the data, starting at the root of the record in the first example and then navigating further down the document tree. Each expression can be used in a SQL statement like SELECT <expr> FROM financials_raw_json, where <expr> is replaced by the expression shown.

  • Expression: json_extract(financials, '$')
    Result: [{.., "reportdate":"2017-12-31", ..}, {..}, {..}, {.., "reportdate":"2014-12-31", ..}]
    Type: json
    Description: Selects the root of the document (financials).

  • Expression: json_extract(financials, '$[0]')
    Result: {.., "reportdate":"2017-12-31", "totalcash":"41711000000", ..}
    Type: json
    Description: Selects the first element of the financials array. The indexing starts at 0, as opposed to 1, which is customary in SQL.

  • Expression: json_extract(financials, '$[0].reportdate')
    Result: "2017-12-31"
    Type: json
    Description: Selects the reportdate attribute of the first element of the financials array.

  • Expression: json_extract_scalar(financials, '$[0].reportdate')
    Result: 2017-12-31
    Type: varchar
    Description: As preceding, but the type is now varchar because we are using json_extract_scalar.

  • Expression: json_size(financials, '$')
    Result: 4
    Type: bigint
    Description: The size of the financials array; 4 represents the four years contained in each JSON file.

For our example, these functions already give us more than enough to work with, so we can leave it at that. However, there are more functions for going back and forth between JSON and Athena. You can find more information in the Apache Presto documentation. Athena is our managed service based on Apache Presto, so when looking for information it is also helpful to consult the Presto documentation.
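As a small illustration of going back and forth, the following sketch parses the financials string into a JSON value and formats it back into a string; json_parse and json_format are standard Presto JSON functions:

SELECT
  symbol,
  json_format(json_parse(financials)) financials_roundtrip
FROM
  financials_raw_json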

Let’s put the JSON functions introduced preceding to use:

SELECT 
  symbol,
  -- indexes start with 0, as is customary with JavaScript/JSON
  json_extract_scalar(financials, '$[0].reportdate') one_report_date,  
  json_extract_scalar(financials, '$[0].totalrevenue') one_total_revenue,
  json_extract_scalar(financials, '$[1].reportdate') another_report_date,
  json_extract_scalar(financials, '$[1].totalrevenue') another_total_revenue
FROM
  financials_raw_json
ORDER BY 
  1

As with the first approach, we still have to deal with the nested data inside the rows. By doing so, we can get rid of the explicit indexing of the financial reports as used preceding.

Comparing approaches

If you go back and compare our latest SQL query with our earlier SQL query, you can see that they produce the same output. On the surface, they even look alike because they project the same attributes. But a closer look reveals that the first statement uses a structure that has already been created during CREATE TABLE. In contrast, the second approach interprets the JSON document for each column projection as part of the query.

Interpreting the data structures during the query design enables you to change the structures across different SQL queries or even within the same SQL query. Different column projections in the same query can interpret the same data, even the same column, differently. This can be extremely powerful, if such a dynamic and differentiated interpretation of the data is valuable. On the other hand, it takes more discipline to make sure that during maintenance different interpretations are not introduced by accident.
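For example, the following query (a minimal sketch using the tables from this post) interprets the very same financials column in three different ways within a single statement: as a JSON document, as a string, and as a number.

SELECT
  symbol,
  -- interpret the column as a JSON document
  json_extract(financials, '$[0]') latest_report,
  -- interpret the same column as a scalar string
  json_extract_scalar(financials, '$[0].reportdate') latest_report_date,
  -- interpret the same column as a number
  CAST(json_extract_scalar(financials, '$[0].totalrevenue') AS BIGINT) latest_total_revenue
FROM
  financials_raw_json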

In both approaches, the underlying data is not touched. Athena only overlays the physical data, which makes changing the structure of your interpretation fast. Which approach better suits you depends on the intended use.

To determine this, you can ask the following questions. As a rule of thumb, are your intended users data engineers or data scientists? Do they want to experiment and change their mind frequently? Maybe they even want to have different use-case-specific interpretations of the same data. If so, they would fare better with the latter approach of leaving the JSON data untouched until query design. They would also then likely be willing to invest in learning the JSON extensions to gain access to this dynamic approach.

If on the other hand your users have established data sources with stable structures, the former approach fits better. It enables your users to query the data with SQL only, with no need for information about the underlying JSON data structures.

Use the following comparison to choose the appropriate approach for your case at hand.

Interpretation at table creation time:

  • The interpretation of data structures is scoped to the whole table. All subsequent queries use the same structures.
  • The interpretation of data structures evolves centrally.
  • It is easy to provide a single version of the truth, because there is just a single interpretation of the underlying data structures.
  • Applicable to well-understood data structures that are slowly and consciously evolving. A single interpretation of the underlying data structures is valued more than change velocity.
  • Production data pipelines benefit from this approach.

Interpretation at query creation time:

  • The interpretation of data is scoped to an individual query. Each query can potentially interpret the data differently.
  • The interpretation of data structures can be changed on a per-query basis, so different queries can evolve at different speeds and in different directions.
  • A single version of the truth is hard to maintain and needs coordination across the different queries that use the same data. Rapidly evolving data interpretations can easily go hand in hand with an evolving understanding of use cases.
  • Applicable to experimental, rapidly evolving interpretations of data structures and use cases. Change velocity is more important than a single, stable interpretation of data structures.
  • Exploratory data analysis benefits from this approach.

Both approaches can serve well at different times in the development lifecycle, and each approach can be migrated to the other.

In any case, this is not a black and white decision. In our example, we keep the tables financials_raw and financials_raw_json, both accessing the same underlying data. The data structures are just metadata, so keeping both around doesn’t store the actual data redundantly.

For example, financials_raw might be used by data engineers as the source of productive pipelines where the attributes and their meaning are well-understood and stable across use cases. At the same time, data scientists might use financials_raw_json for exploratory data analysis where they refine their interpretation of the data rapidly and on a per-query basis.

Working with nested data

At this point, we can access data that is JSON formatted through Athena. However, the underlying structure is still hierarchical, and the data is still nested. For many use cases, especially for analytical uses, expressing data in a tabular fashion—as rows—is more natural. This is also the standard way when using SQL and business intelligence tools. To unnest the hierarchical data into flattened rows, we need to reconcile these two approaches.

To simplify, we can set the financial reports example aside for the moment. Instead, let’s experiment with a narrower example. Reconciling different ways of thinking can sometimes be hard to follow. The narrow example and hands-on experimentation should make this easier. Copy the code we discuss into the Athena console to play along.

The following code is self-contained and uses synthetic data. This lends itself particular well to experimentation:

SELECT 
  parent, children
FROM (
  VALUES
    ('Parent 1', ARRAY['Child 1.1', 'Child 1.2']),
    ('Parent 2', ARRAY['Child 2.1', 'Child 2.2', 'Child 2.3'])
) AS t(parent, children)

Looking at the data, this is similar to our situation with the financial reports. There we had multiple financial reports for one stock symbol, multiple children for each parent. To flatten the data, we first unnest the individual children for each parent. Then we cross-join each child with its parent, which creates an individual row for each child that contains the child and its parent.

In the following SQL statement, UNNEST takes the children column from the original table as a parameter. It creates a new dataset with the new column child, which is later cross-joined. The enclosing SELECT statement can then reference the new child column directly.

SELECT 
  parent, child
FROM (
  VALUES
    ('Parent 1', ARRAY['Child 1.1', 'Child 1.2']),
    ('Parent 2', ARRAY['Child 2.1', 'Child 2.2', 'Child 2.3'])
) AS t(parent, children)
CROSS JOIN UNNEST(children) AS t(child)

If you played along with the simplified example, it should be easy now to see how this method can be applied to our financial reports:

SELECT 
    symbol,
    report
FROM 
    financials_raw
CROSS JOIN UNNEST(financials) AS t(report)

Bam! Now that was easy, wasn’t it?

Using this as a basis, let’s select the data that we want to provide to our business users and turn the query into a view. The underlying data has still not been touched, is still formatted as JSON, and is still expressed using nested hierarchies. The new view makes all of this transparent and provides a tabular view.

Let’s create the view:

CREATE OR REPLACE VIEW financial_reports_view AS
SELECT 
  symbol,
  CAST(report.reportdate AS DATE) reportdate,
  report.totalrevenue,
  report.researchanddevelopment
FROM 
  financials_raw
CROSS JOIN UNNEST(financials) AS t(report)
ORDER BY 1 ASC, 2 DESC

Then let's check our work:

SELECT
  *
FROM
  financial_reports_view

This is a good basis and acts as an interface for our business users.

The previous steps were based on the initial approach of mapping the JSON structures directly to columns. Let's also explore the alternative path that we discussed before. What does this look like when we keep the data JSON formatted for longer, as we did in our alternative approach?

For variety, this approach also shows json_parse, which is used here to parse the whole JSON document and convert the list of financial reports with their contained key-value pairs into an ARRAY(MAP(VARCHAR, VARCHAR)). This array is then used in the unnesting, and its elements eventually in the column projections. With element_at, you can access the values in each map by name. You can also see the use of WITH to define subqueries, which helps to structure the SQL statement.

If you run the following query, it returns the same result as the approach preceding. You can also turn this query into a view.

WITH financial_reports_parsed AS (
  SELECT 
    symbol,   
    CAST(json_parse(financials) AS ARRAY(MAP(VARCHAR, VARCHAR))) financial_reports
  FROM         
    financials_raw_json)
SELECT 
  symbol,
  CAST(element_at(report, 'reportdate') AS DATE) reportdate,  
  element_at(report, 'totalrevenue') totalrevenue,
  element_at(report, 'researchanddevelopment') researchanddevelopment
FROM
  financial_reports_parsed
CROSS JOIN UNNEST(financial_reports) AS t(report)
ORDER BY 1 ASC, 2 DESC

Visualizing the data

Let’s get back to our example. We created the financial_reports_view that acts as our interface to other business intelligence tools. In this blog post, we use it to provide data for visualization using Amazon QuickSight. Amazon QuickSight can directly access data through Athena. Its pay-per-session pricing enables you to put analytical insights into the hands of everyone in your organization.

Let’s set this up together. We first need to create a new data source in Amazon QuickSight that points to Athena and selects our view, and then we use this data source to populate the visualization.

We are creating the visual that is displayed at the top of this post. If you just want the data and are not interested in condensing it into a visual story, you can skip ahead to the Conclusion section of this post.

Creating an Athena data source in Amazon QuickSight

Before we can use the data in Amazon QuickSight, we need to first grant access to the underlying S3 bucket. If you haven’t done so already for other analyses, see our documentation on how to do so.

On the Amazon QuickSight home page, choose Manage data from the upper-right corner, then choose New data set and pick Athena as data source. In the following dialog box, give the data source a descriptive name and choose Create data source.

Choose the default database and our view financial_reports_view, then choose Select to confirm. If you used multiple schemas in Athena, you could pick them here as your database.

In the next dialog box, you can choose if you want to import the data into SPICE for quicker analytics or to directly query the data.

SPICE is the super-fast, parallel, in-memory calculation engine in Amazon QuickSight. For our example, you can go either way. Using SPICE results in the data being loaded from Athena only once, until it is either manually refreshed or automatically refreshed (using a schedule). Using direct query means that all queries are run on Athena.

Our view now is a data source for Amazon QuickSight and we can turn to visualizing the data.

Creating a visual in Amazon QuickSight

You can see the data fields on the left. Notice that reportdate is shown with a calendar symbol and researchanddevelopment as a number. Amazon QuickSight picks up the data types that we defined in Athena.

The canvas on the right is still empty. Before we populate it with data, let’s select Line Chart from the available visual types.

To populate the graph, drag and drop the fields from the field list on the left onto their respective destinations. In our case, we put reportdate onto the X axis well. We put our metric researchanddevelopment onto the Value well, so that it's displayed on the y-axis. We put symbol onto the Color well, helping us to tell the different stocks apart.

An initial version of our visualization is now shown on the canvas. Drag the handle at the lower-right corner to adjust the size to your liking. Also, pick Format visual from the drop-down menu in the upper right corner. Doing this opens a dialog with more options to enhance the visualization.

Expand the Data labels section and choose Show data labels. Your changes are immediately reflected in the visualization.

You can also interact with the data directly. Given that Amazon QuickSight picked up on the reportdate being a DATE, it provides a date slider at the bottom of the visual. You can use this slider to adjust the time frame shown.

You can add further customizations. These can include changing the title of the visual or the axis, adjusting the size of the visual, and adding additional visualizations. Other possible customizations are adding data filters and capturing the combination of visuals into a dashboard. You might even turn the dashboard into a scheduled report that gets sent out once a day by email.

Conclusion

We have seen how to use JSON formatted data that is stored in S3. We contrasted two approaches to map the JSON formatted data to data structures in Athena:

  • Mapping the JSON structures at table creation time to columns.
  • Leaving the JSON structures untouched and instead mapping the contents as a whole to a string, so that the JSON contents remain intact. The JSON contents can then be interpreted at query creation time, when the structures are mapped to columns.

The approaches are not mutually exclusive, but can be used in parallel for the same underlying data.

Furthermore, JSON data can be hierarchical. To provide such data in a flattened, tabular fashion, it must be unnested and cross-joined.

For our example, we provided the data in a tabular fashion and created a view that encapsulates the transformations, hiding the complexity from its users. We used the view as an interface to Amazon QuickSight. Amazon QuickSight directly accesses the Athena view and visualizes the data.

More on using JSON

JSON features blend nicely into the existing SQL-oriented functions in Athena, but they are not ANSI SQL compatible. Also, each JSON record is expected to be on a separate line (see the JSON Lines website).

In the documentation for the JSON SerDe Libraries, you can find how to use the property ignore.malformed.json to indicate whether malformed JSON records should be turned into nulls or cause an error. Further information about the two possible JSON SerDe implementations is linked in the documentation. If necessary, you can dig deeper and find out how to take explicit control of how column names are parsed, for example to avoid clashing with reserved keywords.
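For example, a tolerant variant of our raw table could look like the following sketch. The SerDe property name comes from the documentation; the table name financials_raw_json_tolerant is just for illustration:

CREATE EXTERNAL TABLE financials_raw_json_tolerant (
  symbol string,
  financials string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ('ignore.malformed.json' = 'true')
LOCATION 's3://athena-json/financials/'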

How to efficiently store data

During our excursions, we never touched the actual data. We only defined different ways to interpret the data. This approach works well for us here, because we are only dealing with a small amount of data. If you want to use these concepts at scale, consider how to apply partitioning of data and possibly how to consolidate data into larger files.
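As a minimal sketch, assuming a hypothetical layout in which the files are organized by year under Hive-style prefixes such as s3://athena-json/financials/year=2018/, a partitioned variant of the raw table could look like this:

CREATE EXTERNAL TABLE financials_raw_partitioned (
  symbol string,
  financials string
)
PARTITIONED BY (year string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://athena-json/financials/'

-- pick up the partitions that follow the year=... naming convention
MSCK REPAIR TABLE financials_raw_partitioned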

Depending on the data, also consider whether storing it in a columnar fashion, using for example Apache Parquet might be beneficial. You can find additional practical suggestions in our AWS Big Data Blog post Top 10 Performance Tuning Tips for Amazon Athena.
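For example, a CREATE TABLE AS SELECT (CTAS) statement along the following lines could materialize the flattened view into Parquet files; the target location is just an example and must not contain data yet:

CREATE TABLE financial_reports_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://athena-json/financial-reports-parquet/'
) AS
SELECT
  symbol,
  reportdate,
  totalrevenue,
  researchanddevelopment
FROM
  financial_reports_view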

All these options don’t replace what you learned in this article, but benefit from your being able to compare and contrast JSON formatted data and nested data. They can be used in a complementary fashion.

Further, this AWS Big Data Blog post walks you through a real-world scenario showing how to store and query data efficiently.


About the Author

Mariano Kamp is a principal solutions architect with Amazon Web Services. He works with financial services customers in Germany and has more than 25 years of industry experience covering a wide range of technologies. His area of depth is Analytics.

In his spare time, Mariano enjoys hiking with his wife.

Embed interactive dashboards in your application with Amazon QuickSight

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/embed-interactive-dashboards-in-your-application-with-amazon-quicksight/

Starting today, you can embed Amazon QuickSight dashboards in your applications. This means you can now quickly and efficiently enhance your applications with rich, interactive data visualizations and analytics capabilities without any custom development. You don't need specialized expertise on your team to develop, maintain, and evolve the analytics components of your applications, nor do you need to manage and scale analytics servers and infrastructure as your applications grow in popularity. This saves you time and money and lets you focus on your core application functionality!

Embedded Amazon QuickSight dashboards allow you to utilize Amazon QuickSight’s serverless architecture and easily scale your insights with your growing user base, while ensuring you only pay for usage with Amazon QuickSight’s unique pay-per-session pricing model. Viewers of embedded dashboards are authenticated via Federated Single Sign-On (SAML, OpenID Connect or AWS IAM Federation). This means that every user receives a secure, personalized dashboard while requiring no user-facing QuickSight-specific authentication. Dashboard embedding is available in Amazon QuickSight Enterprise Edition in all supported regions.

To facilitate an easy embedding experience, AWS is also launching the Amazon QuickSight JavaScript SDK and the first set of Amazon QuickSight API operations. The Amazon QuickSight JavaScript SDK lets you efficiently integrate Amazon QuickSight dashboards in your application pages, set defaults, connect controls, and handle errors. The new user and group API operations allow you to programmatically manage users and groups. This helps you roll out Amazon QuickSight to your users faster and makes it easier to manage Amazon QuickSight resources using groups. User API operations are available for both Standard and Enterprise Editions, while group API operations are available in Enterprise Edition.

Let’s now take a quick look at how you can embed a dashboard in your application in the following four steps.

Step 1: In Amazon QuickSight, create your dashboards and whitelist your domains

In this step, you create the dashboards in Amazon QuickSight and then share them with the readers who you would like to access them in embedded form. Each unique viewer of the embedded dashboard must be a member of the Amazon QuickSight account and can be provisioned in the account either ahead of time or at the point of access using the user API operations. To provide each viewer with access to the dashboard, share the dashboard with a group, and then add users to the group as they are provisioned in the Amazon QuickSight account.

When you publish a dashboard, there is a new option that allows you to enable or disable the advanced filtering and download-to-CSV options on the dashboard, so you can provide a simplified dashboard experience if desired.

To avoid unauthorized embedding of dashboards from your account, an administrator in your account must explicitly enable domains where your Amazon QuickSight dashboards are embedded. This is done by using the Manage QuickSight option that is available to all administrators. All domains in use (dev/staging/production) must be whitelisted and must use https.

Step 2: In your AWS account, set up permissions for embedded viewers

To provision users of your application as members of your Amazon QuickSight account, create an AWS Identity and Access Management (IAM) role that a user can assume. This role should grant the visitor permissions to:

  • Become a reader in the Amazon QuickSight account (quicksight:RegisterUser).
  • Retrieve the specific embedded dashboards (quicksight:GetDashboardEmbedUrl).

The following is a sample policy with these permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "quicksight:RegisterUser",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Action": "quicksight:GetDashboardEmbedUrl",
            "Resource": "arn:aws:quicksight:us-west-2:293424211206:dashboard/26a7dcad-7831-4266-9f75-ffd9265aa335",
            "Effect": "Allow"
        }
    ]
}

Your application’s IAM identity must have a trust policy associated with it to allow access to the role you just created. This means that as a user accesses your application, your application can assume the QuickSightEmbed role on the user's behalf and provision the user in the Amazon QuickSight account. The following example shows QuickSightEmbed as the role that has the previous policy associated with it. More details are available regarding trust policies for OpenID Connect or SAML authentication.

{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::293424211206:role/QuickSightEmbed"
    }
}

Step 3: On your application server, authenticate the user and get the embedded dashboard URL

When a user lands in your application, you can assume the IAM role created in Step 2 (for example, QuickSightEmbed) on the user’s behalf, and provision the user in the Amazon QuickSight account. To ensure that each user of the application is uniquely provisioned in Amazon QuickSight and has permissions (row level security) or defaults (dynamic defaults for parameters) enforced, you must pass an identifier, such as a user name or email address, as the role session name.

The following shows an AWS CLI command that achieves this:

aws sts assume-role --role-arn "arn:aws:iam::293424211206:role/QuickSightEmbed" --role-session-name [email protected]

Running this command sets the role session ID of the user visiting your website to "QuickSightEmbed/[email protected]". This is also the user name in Amazon QuickSight.

You can use this pattern to provision your users in Amazon QuickSight ahead of time, or to provision them the first time they access your application. With Amazon QuickSight's pay-per-session mode, you get charged only when provisioned users actually access Amazon QuickSight, so you can provision thousands of users. The following is an AWS Command Line Interface (AWS CLI) command that you can use to provision a user.

aws quicksight register-user --aws-account-id 293424211206 --namespace default --identity-type IAM --iam-arn "arn:aws:iam::293424211206:role/QuickSightEmbed" --user-role READER --session-name [email protected] --email [email protected]

On the first access, you can also add this user to a group, with which the dashboard has been shared.

aws quicksight create-group-membership --aws-account-id=293424211206 --namespace=default --group-name=financeusers --member-name="QuickSightEmbed/[email protected]"

You now have a user of your application who is also an Amazon QuickSight user and has access to the dashboard. For the last part of Step 3, call get-dashboard-embed-url to get a signed URL for the dashboard that can be embedded.

aws quicksight get-dashboard-embed-url --aws-account-id 293424211206 --dashboard-id 26a7dcad-7831-4266-9f75-ffd9265aa335 --identity-type IAM

Step 4: On your application page, use the Amazon QuickSight JavaScript SDK to embed the dashboard

You can download the Amazon QuickSight JavaScript SDK here. Use it to place the signed URL obtained in Step 3 in your application page. The SDK lets you place the dashboard on an HTML page, pass parameters into the dashboard and also handle error states with messages that are customized to your application.

function embedDashboard() {
                var containerDiv = document.getElementById("dashboardContainer");
                var options = {
                    url: "<signed URL from Step 3>",
                    container: containerDiv,
                    parameters: {
                        country: 'United States'
                    },
                    scrolling: "no",
                    height: "700px",
                    width: "1000px"
                };
                dashboard = QuickSightEmbedding.embedDashboard(options);
                dashboard.on('error', onError);          // invoke your error handler on embedding errors
                dashboard.on('load', onDashboardLoad);   // invoke your load handler when the dashboard has loaded
            }

The CLI commands referenced preceding are also available in the AWS SDKs, so you can achieve the same result in all AWS SDK supported languages. For more information, see Tools for Amazon Web Services.

To summarize, these four steps let you create your dashboards in Amazon QuickSight, embed them in your application, and enable your users with rich, interactive analytics!

Additional Reading

If you found this post useful, be sure to check out Create Amazon QuickSight dashboards that have impact with parameters, on-screen controls, and URL actions, and Advanced analytics with table calculations in Amazon QuickSight.

About the Author

Jose Kunnackal is a principal product manager for Amazon QuickSight.

Amazon QuickSight announces ML Insights in preview

Post Syndicated from Luis Wang original https://aws.amazon.com/blogs/big-data/amazon-quicksight-announces-ml-insights-in-preview/

Amazon QuickSight is a fast, cloud-powered BI service that makes it easy for everyone in an organization to get business insights from their data through rich, interactive dashboards. With pay-per-session pricing and embedded dashboards, we made BI even more cost-effective and accessible to everyone.

However, as the volume of data that customers generate continues to grow every day, it's becoming more challenging for them to harness that data for business insights. This is where machine learning comes into play. Amazon is a pioneer in using machine learning to automate and scale various aspects of business analytics in its supply chain, marketing, retail, and finance operations.

With AWS, you can use these capabilities through our extensive ML and AI services. Today, we are announcing three new features that integrate proven Amazon technologies into Amazon QuickSight to provide customers with ML-powered insights beyond visualizations:

  1. ML-powered anomaly detection to help customers uncover hidden insights by continuously analyzing across billions of data points.
  2. ML-powered forecasting and what-if analysis to predict key business metrics with point-and-click simplicity.
  3. Auto-narratives to help customers tell the story of their dashboard in a plain-language narrative.

ML-powered anomaly detection

Today, there's no shortage of tools to visualize and report on business performance and metrics, but the real insights are often buried deep in the aggregates. These insights generally don't manifest in existing reports and dashboards but are discovered only several days or weeks later. To add to the problem, growing volumes of data in the cloud make manual analysis such as slicing, dicing, and pivoting too time-consuming and hard to scale. As a result, customers often resort to looking at only a small subset of their data rather than the whole picture.

Amazon QuickSight uses proven Amazon technology to continuously run ML-powered anomaly detection on millions of metrics and billions of data points. This anomaly detection enables you to get deep insights that are often buried in the aggregates, not visible in plain sight, and not scalable with manual analysis. When anomalies are detected, Amazon QuickSight automatically sends you timely email alerts detailing the top contributors to each anomaly. With ML-powered anomaly detection, there’s no need for manual analysis, custom development, or ML domain expertise.

For example, say that you operate a business that sells products across various geographical regions. You can now use the ML-powered anomaly detection in Amazon QuickSight to continuously monitor your sales and orders at scale across all products, geographies, and customers. Amazon QuickSight’s ML algorithm continuously learns from your historical data patterns at the most granular level. It can alert you when it detects higher- or lower-than-expected sales for a particular product, in a specific city, or even for an individual customer.

With pay-per-session, we introduced an industry-first, usage-based pricing model for BI. Continuing down that path, we plan for ML-powered anomaly detection to have a usage-based pricing model based on the number of metrics processed in a month. To learn more about planned pricing and examples, see here.

Note: Customers will not be charged during preview. Pricing will apply when the feature becomes generally available.

ML-powered forecasting

We see customers wanting to perform common forecasting tasks that are too cumbersome and often out of reach for most business users. For example, suppose that you are a business manager and you want to forecast sales to see if you are going to meet your goal by the end of the year. Or suppose that you expect a large deal to come through in two weeks that is going to affect your overall forecast. Today, most business forecasting involves tedious, error-prone, and rigid processes using Excel that don’t scale. This is an area where Amazon QuickSight is using ML to simplify.

With ML-powered forecasting and what-if analyses in Amazon QuickSight, nontechnical users can now easily forecast their key business metrics with just a few clicks. No ML expertise or Excel data modeling is required. The built-in ML algorithm in Amazon QuickSight is designed to handle complex real-world scenarios, and Amazon QuickSight uses it to provide more reliable forecasts than traditional means.

For example, you can forecast your business revenue with multiple levels of seasonality (for example, sales with both weekly and quarterly trends). Amazon QuickSight automatically excludes anomalies in the data (for example, a spike in sales due to price drop or promotion) from influencing the forecast. Additionally, you don’t have to clean and reprep the data with missing values because Amazon QuickSight automatically handles it under the hood. Finally, with ML-powered forecasting, you can perform interactive what-if analyses to determine the growth trajectory needed to meet business goals.

Auto-narratives

We’ve all been through this before. You open up a dashboard and spend hours staring at the data, sifting and drilling down the charts and deciphering rows and columns in the tables. You try to understand what the data is trying to say and engage in a lengthy discussion with your peers on your interpretations. As companies continue to expand their BI and reporting across all aspects of their business, this problem only gets worse.

Auto-narratives is a new feature that provides key insights in everyday language, embedded contextually in your dashboard, saving hours of manual analysis. With auto-narratives, Amazon QuickSight automatically interprets the charts and tables in your dashboard and provides a number of Suggested Insights in natural language. Depending on the shape and form of your data, you might get suggestions such as what the day-over-day changes look like, what the highest sales date was, what the growth rate is, and what the forecast looks like for the next seven days. As the author of the dashboard, you have complete flexibility to customize the computations and business language for your unique needs. You can use auto-narratives to effectively tell the story of your data in plain language.

Sign up for the preview today!

ML Insights are available for preview starting today. You don’t need any ML or domain expertise to use these features. Simply sign up for the preview here.


About the Author

Luis Wang is a principal product manager for Amazon QuickSight. He’s been with AWS for over 6 years, working on various services including Amazon EC2 and then launching Amazon QuickSight. Luis is now focused on the application of machine learning and AI to business intelligence and analytics at QuickSight. He enjoys running, watching sitcoms and spending time with his family.