Tag Archives: Amazon QuickSight

Introduction to Amazon QuickSight ML Insights

Post Syndicated from Rashid Sajjad original https://aws.amazon.com/blogs/big-data/introduction-to-amazon-quicksight-ml-insights/

Amazon QuickSight was launched in November 2016 as a fast, cloud-powered business analytics service to build visualizations, perform ad hoc analysis, and quickly get business insights from a variety of data sources. In 2018, ML Insights for QuickSight (Enterprise Edition) was announced to add machine learning (ML)-powered forecasting and anomaly detection with a few clicks. These insights are automatically generated as suggested insights, and you can also add custom insights to your analysis. Because they're written out in narrative format, they're easily consumable by any non-technical user and are a great way to increase adoption of your dashboards. Let's dive deeper into how these insights are built and how to set up your data correctly to maximize the Suggested Insights feature.

What are ML Insights?

QuickSight uses ML to help uncover hidden insights and trends in your data. It does this with an ML model that continually learns and improves as more data is fed into QuickSight, providing three key features (as of this writing):

  • ML-powered anomaly detection – Detect outliers that show significant variance from the dataset. This can help identify significant changes in your business metrics, such as low-performing stores or products, or top-selling items.
  • ML-powered forecasting – Detect trends and seasonality to forecast based on historical data. This can help project sales, orders, website traffic, and more.
  • Autonarratives – Embed narratives in your dashboard to tell the story of your data in plain language. This can help convey a shared understanding of the data within your organization. You can use either the suggested autonarrative or you can customize the computations and language to meet your organization’s unique requirements.

How does the ML model work?

QuickSight uses a built-in version of the Random Cut Forest (RCF) algorithm. This is a special type of Random Forest (RF) algorithm, a widely used and successful technique in ML. It takes a set of random data points, cuts them down to the same number of points, and then builds a collection of models. Here, a model corresponds to a decision tree (hence the name "forest"). Because RFs can't easily be updated incrementally, RCFs were invented with variables in tree construction that are designed to allow incremental updates.

The key takeaway is that RCF is great for finding anomalies and building forecasts. This algorithm is good at finding data points that are outliers or finding trends and patterns to forecast future values.
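QuickSight runs RCF for you behind the scenes, so there is nothing to implement yourself. For intuition only, the following sketch uses scikit-learn's IsolationForest (a related tree-ensemble outlier detector, not the algorithm QuickSight uses) to show how a tree-based model flags a point that deviates from the rest of a metric series.

# Intuition-only sketch: tree-based outlier detection on a daily sales series.
# IsolationForest is a stand-in for illustration; QuickSight's RCF is built in.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
daily_sales = rng.normal(loc=1000, scale=50, size=60)  # 60 days of "normal" sales
daily_sales[45] = 2500                                 # inject an obvious spike

X = daily_sales.reshape(-1, 1)
labels = IsolationForest(contamination=0.02, random_state=0).fit_predict(X)  # -1 = outlier

for day, (value, label) in enumerate(zip(daily_sales, labels)):
    if label == -1:
        print(f"Day {day}: {value:.0f} flagged as an anomaly")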

One important thing to know about ML models is that each model is good at a certain set of predictive activities, but no one model is good for all activities.

Now that you understand what the RCF model is good at, namely anomaly detection and forecasting, you need to make sure the data meets certain requirements, so let’s walk through those steps.

Best practices for setting up data

To maximize the RCF model’s efficiency, the data that is being imported needs to contain certain properties:

  • At least one metric – Whatever you’re measuring (sold units, orders, and so on).
  • At least one dimension – The category or slice by which you look at the metric (product category, industry, customer type, and so on).
  • Data volumes – Your dataset requirements depend on your objective:
    • Anomaly detection – Requires at least 15 data points. For example, if you have Bicycles as a product category and want to detect anomalies at a daily level, you need at least 15 days of transactions (you could have multiple rows for multiple transactions in a given day) for Bicycles in the dataset.
    • Forecasting – This works best with a large dataset simply because the more history you have, the better the model can extract patterns and trends and generate future probable values. If you have daily aggregates, you need at least 38 days of data.
  • At least one date column – Required if you want to analyze anomalies or forecasts in the dataset.

QuickSight supports a wide variety of connections, like Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Apache Spark. For more information about supported connections and some connection examples, refer to Amazon QuickSight Connection examples.

Get started with Suggested Insights

Let’s use a sample dataset and walk through an example of how to use the Suggested Insights feature.

To get started, let’s download a sample dataset from the public domain. For this post, we use House Sales in King County, USA. You need to have a Kaggle account to download the resource.

  1. Download and unzip the file.

If you inspect the CSV file, you will notice it has the right grain (date), metrics (price, bedrooms), and categories (zipcode, waterfront).

Depending on what your analysis needs are, even bedrooms could be a category by which you analyze price. So your metrics and categories ultimately depend on your analysis goals.
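Before uploading, you can also sanity-check the file against the data volume guidance from earlier. The following pandas sketch (column names taken from kc_house_data.csv) confirms there is a date column, at least one metric and one category, and enough daily history for anomaly detection (15 data points) and forecasting (38 days).

# Rough readiness check against the thresholds discussed earlier.
import pandas as pd

df = pd.read_csv("kc_house_data.csv")
df["date"] = pd.to_datetime(df["date"])

assert "price" in df.columns, "need at least one metric"
assert "zipcode" in df.columns, "need at least one dimension"

days_of_history = df["date"].dt.normalize().nunique()
print(f"Distinct days in the file: {days_of_history}")
print("Anomaly detection minimum (15 data points):", "OK" if days_of_history >= 15 else "not enough")
print("Forecasting minimum (38 days):", "OK" if days_of_history >= 38 else "not enough")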

  2. Log in to your QuickSight account or sign up for a QuickSight Enterprise Edition account to use ML Insights.

We need to create a dataset first before we can create a QuickSight analysis.

  3. Choose New dataset.
  4. Choose Upload a file.
  5. Choose the unzipped CSV file.
  6. In the pop-up window, confirm the file upload settings, then choose Edit settings and prepare data.

You’re redirected to the data preparation editor. This is one of the most important yet overlooked functions in QuickSight.

This editor allows you to review your imported fields and their data types, specify if the field will be used as a dimension or measure, along with many other important data import functions. For production datasets, you should spend time reviewing how the dataset has been set up here.

For our sample CSV file, it's imported into SPICE by default. SPICE is an in-memory engine for fast querying of imported data. For more details, see Importing data into SPICE.

  7. Choose Save & publish to start importing the CSV file into the SPICE engine.

The default dataset name is the file name that was imported, so in our case it’s kc_house_data. You can choose the dataset on the Datasets page to see the import stats for the dataset.
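If you prefer to check the import programmatically, the SPICE ingestion status is also available through the QuickSight API. The following is a minimal Boto3 sketch (the account ID and dataset ID are placeholders):

# Check the most recent SPICE ingestions for a dataset (IDs are placeholders).
import boto3

quicksight = boto3.client("quicksight")

response = quicksight.list_ingestions(
    AwsAccountId="111122223333",
    DataSetId="YOUR-DATASET-ID",
    MaxResults=5,
)

for ingestion in response["Ingestions"]:
    rows = ingestion.get("RowInfo", {}).get("RowsIngested")
    print(ingestion["IngestionId"], ingestion["IngestionStatus"], rows)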

  8. Choose Create analysis to start creating your QuickSight analysis.

The analysis editor page starts by showing a blank Sheet 1 on your workspace. On the top right, your dataset’s import stats are shown again (this becomes important when importing or refreshing large datasets because the import job might still be in progress).

Let’s start by creating our first visual. The default visual type is AutoGraph, which will try to pick the best visual type based on the fields being selected.

  9. Choose the date field.

The visual changes to Count of Records by Date, with the date aggregation set to Day.

  10. To change the aggregation to monthly, choose the down arrow next to date on the X axis.
  11. Choose the price field.

The AutoGraph detects that the date is a dimension (blue color) and the price is a measure (green color) because these were set up like that in the dataset editor screen (I mentioned earlier how important the data preparation editor was).

Because these fields are already set up as dimensions and measures, the AutoGraph automatically changes to Sum of Price by Date.

This visualization isn’t very helpful. What we’re really looking for is the average price per month.

  12. For Field wells, choose price for Value and change the aggregate to Average.

We now have a nice visual that shows us the average sale price of homes in King County by month.

Now comes the fun part—ML Insights!

  13. In the navigation pane, choose Insights.

Voila! QuickSight has already run the RCF model along with other statistical computations and has generated insights that are ready to be added.

These suggested insights change based on the type of visual and data that is currently in the visual. We look at how suggested insights change later in this post.

Two immediately useful insights are Highest Month and Lowest Month.

Hover over the Highest Month insight and choose the plus sign to add it to the current Sheet 1.

I can start rearranging insights and visuals and format the price field to give my current layout a more polished look.

  14. For this post, change the format of the price field to 1,234 to remove decimals.
  15. You can also add titles for the insights and rename the X axis label date to Aggregate.
  16. To add another sheet, choose the plus sign next to Sheet 1.

By default, we start again with an AutoGraph visual.

  17. Under Visual types, choose the vertical bar chart.
  18. Choose the price and zipcode fields.
  19. Change the aggregation of price from Sum to Average.
  20. Choose Insights in the navigation pane.

Suggested Insights now displays a completely different set of data highlights compared to Sheet 1.

Although the vertical bar chart may already tell you the top three and bottom three zip codes, Suggested Insights already recognized the type of analysis and selected the best insights to display.

Although you might eventually build a visual to portray the intended story, Suggested Insights speeds up the process of showcasing the highlights in your data and adding them to your worksheet to quickly give the reader the most important insights from your visuals.

Anomaly detection

An anomaly in QuickSight is a data point that falls outside the overall pattern of distribution. ML-powered anomaly detection in QuickSight enables you to identify the causations and correlations to make data-driven decisions.

We already talked about data preparation for anomaly detection earlier. QuickSight already ran the RCF model during data import. As soon as a visual is added, QuickSight notifies you on the visual if it has detected an "Anomaly Insight." This is part of Suggested Insights. You can choose Setup anomaly detection to add this to your sheet.

You can also manually add an ML insight to detect anomalies.

  1. Let’s go back to Sheet 1 with the line chart displayed.
  2. When you choose the first suggested insight, it starts creating a widget for anomaly detection.

You can add up to five dimension fields (not calculated fields, unless they were created in the data prep screen). QuickSight splits the metrics using the fields in the Categories section. We use the date field (our time dimension), price (our metric), and yr_built (our category) to create an anomaly detection insight. The question we are trying to answer is "Were there any monthly outliers in price based on the year built?"

  3. Choose Get started to set up anomaly detection.
  4. For Combinations to be analyzed, choose your field combinations.

Choosing Exact means that the date and price are analyzed against the yr_built dimension. You can also choose Hierarchical or All. These latter options become relevant when you choose multiple dimensions in the Categories list. For more information about these options, refer to Adding an ML insight to detect outliers and key drivers.

  5. Choose Save to return to Sheet 1.

Our widget is configured at this point.

  6. Choose Run now to start analyzing the data for anomalies.

Depending on the volume of data and the number of data points in the analysis, anomaly detection may take a while to run.

Keep in mind that at least 15 data points are needed to run anomaly detection. You can also change the aggregation of a field to zoom out and view anomalies at a higher level.

For example, if you choose the date field and change Aggregate to Monthly, you get the top anomalies at the monthly level.

In our test case, QuickSight identified a top anomaly. This is a great widget that immediately draws the reader to highlights in data that are outliers and might require further investigation.

Forecast

With ML-powered forecasting, you can forecast your key business metrics in QuickSight easily. The ML algorithm in QuickSight is designed to handle complex real-world scenarios. Not only does QuickSight provide the capability to create forecasts, it also provides Forecast as a Suggested Insight.

  1. Going back to Sheet 1, choose the line chart and expand Insights.

At the bottom you will see a suggested forecast insight. Forecast insights, along with all other suggested insights, are dynamic: when your data updates or a user applies filters, the values in the insight update immediately. After you add the insight to your sheet, you can even customize how many future periods the forecast displays by editing the Narrative and then editing the forecast Calculation.

What if we wanted to customize the price forecasting on this line chart and add it in the visual?

  2. Choose the options menu (three dots) at the top right of the visual and choose Add forecast.
  3. For Periods forward, enter 6.

This is the number of future periods to forecast, based on the time interval of the visual (months, in our case).

  4. Set Prediction interval to 70.

The prediction interval controls the width of the forecast band: a higher value produces wider upper and lower bounds around the forecasted values, and a lower value produces narrower bounds.

  5. Leave Seasonality set to Automatic.

Seasonality takes into account complex seasonal trends in your data. You can experiment with both settings to see how it affects the forecast. For our scenario, because house sales are seasonal, we chose Automatic.

  6. Choose Apply.

With just a few clicks, we have added a forecast to our visual, as shown in the following screenshot. The orange shaded area represents the upper and lower bound of the forecasted price.

This is another great way to add intelligence to your data and quickly let analysts focus on key data points and trends.

Conclusion

The Suggested Insights feature in QuickSight allows you to speed up the discovery and highlighting of key data elements. You can find insights in your data faster, and because they're written out in narrative format, non-technical users can quickly understand the most interesting trends in the data with no ML training needed.

For more details on QuickSight ML Insights, refer to the QuickSight documentation or interact with the QuickSight Community.

As always, AWS is customer obsessed and we are ready to help with any specific questions.


About the Author

Rashid Sajjad is a Partner Management Solutions Architect focused on Big Data & Analytics with Amazon Web Services. He works with APN Partners to help develop their Migration, Data & Analytics, and AI/ML Practices with enterprise, mission-critical solutions for their end customers.

A serverless operational data lake for retail with AWS Glue, Amazon Kinesis Data Streams, Amazon DynamoDB, and Amazon QuickSight

Post Syndicated from Gandhi Raketla original https://aws.amazon.com/blogs/big-data/a-serverless-operational-data-lake-for-retail-with-aws-glue-amazon-kinesis-data-streams-amazon-dynamodb-and-amazon-quicksight/

Do you want to reduce stockouts at stores? Do you want to improve order delivery timelines? Do you want to provide your customers with accurate product availability, down to the millisecond? A retail operational data lake can help you transform the customer experience by providing deeper insights into a variety of operational aspects of your supply chain.

In this post, we demonstrate how to create a serverless operational data lake using AWS services, including AWS Glue, Amazon Kinesis Data Streams, Amazon DynamoDB, Amazon Athena, and Amazon QuickSight.

Retail operations is a critical functional area that gives retailers a competitive edge. An efficient retail operation can optimize the supply chain for a better customer experience and cost reduction. An optimized retail operation can reduce frequent stockouts and delayed shipments, and provide accurate inventory and order details. Today, a retailer's channels aren't just store and web—they include mobile apps, chatbots, connected devices, and social media channels. The data is both structured and unstructured. This, coupled with multiple fulfillment options like buy online and pick up at store, ship from store, or ship from distribution centers, increases the complexity of retail operations.

Most retailers use a centralized order management system (OMS) for managing orders, inventory, shipments, payments, and other operational aspects. These legacy OMSs are unable to scale in response to the rapid changes in retail business models. The enterprise applications that are key for efficient and smooth retail operations rely on a central OMS. Applications for ecommerce, warehouse management, call centers, and mobile all require an OMS to get order status, inventory positions of different items, shipment status, and more. Another challenge with legacy OMSs is they’re not designed to handle unstructured data like weather data and IoT data that could impact inventory and order fulfillment. A legacy OMS that can’t scale prohibits you from implementing new business models that could transform your customer experience.

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. An operational data lake addresses this challenge by providing easy access to structured and unstructured operational data in real time from various enterprise systems. You can store your data as is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning (ML)—to guide better decisions. This can ease the burden on OMSs that can instead focus on order orchestration and management.

Solution overview

In this post, we create an end-to-end pipeline to ingest, store, process, analyze, and visualize operational data like orders, inventory, and shipment updates. We use the following AWS services as key components:

  • Kinesis Data Streams to ingest all operational data in real time from various systems
  • DynamoDB, Amazon Aurora, and Amazon Simple Storage Service (Amazon S3) to store the data
  • AWS Glue DataBrew to clean and transform the data
  • AWS Glue crawlers to catalog the data
  • Athena to query the processed data
  • A QuickSight dashboard that provides insights into various operational metrics

The following diagram illustrates the solution architecture.

The data pipeline consists of stages to ingest, store, process, analyze, and finally visualize the data, which we discuss in more detail in the following sections.

Data ingestion

Orders and inventory data is ingested in real time from multiple sources like web applications, mobile apps, and connected devices into Kinesis Data Streams. Kinesis Data Streams is a massively scalable and durable real-time data streaming service. Kinesis Data Streams can continuously capture gigabytes of data per second from hundreds of thousands of sources, such as web applications, database events, inventory transactions, and payment transactions. Frontend systems like ecommerce applications and mobile apps ingest the order data as soon as items are added to a cart or an order is created. The OMS ingests orders when the order status changes. OMSs, stores, and third-party suppliers ingest inventory updates into the data stream.

To simulate orders, an AWS Lambda function is triggered by a scheduled Amazon CloudWatch event every minute to ingest orders to a data stream. This function simulates the typical order management system lifecycle (order created, scheduled, released, shipped, and delivered). Similarly, a second Lambda function is triggered by a CloudWatch event to generate inventory updates. This function simulates different inventory updates such as purchase orders created from systems like the OMS or third-party suppliers. In a production environment, this data would come from frontend applications and a centralized order management system.
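The order-generator function itself isn't listed in this post; the following is a simplified sketch of what such a producer might look like (the stream name and order attributes are illustrative, not the exact code deployed by the stack).

# Simplified order producer: puts a simulated order event onto a Kinesis data stream.
import json
import random
import uuid
from datetime import datetime, timezone

import boto3

kinesis = boto3.client("kinesis")

def lambda_handler(event, context):
    order = {
        "ordernumber": str(uuid.uuid4()),
        "status": "Created",
        "orderdatetime": datetime.now(timezone.utc).isoformat(),
        "channel": random.choice(["web", "mobile", "store"]),
    }
    kinesis.put_record(
        StreamName="orders-data-stream",          # illustrative stream name
        Data=json.dumps(order).encode("utf-8"),
        PartitionKey=order["ordernumber"],
    )
    return {"ingested": order["ordernumber"]}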

Data storage

There are two types of data: hot and cold data. Hot data is consumed by frontend applications like web applications, mobile apps, and connected devices. The following are some example use cases for hot data:

  • When a customer is browsing products, the real-time availability of the item must be displayed
  • Customers interacting with Alexa to know the status of the order
  • A call center agent interacting with a customer needs to know the status of the customer order or its shipment details

The systems, APIs, and devices that consume this data need the data within seconds or milliseconds of the transactions.

Cold data is used for long-term analytics like orders over a period of time, orders by channel, top 10 items by number of orders, or planned vs. available inventory by item, warehouse, or store.

For this solution, we store the hot data for orders in DynamoDB. DynamoDB is a fully managed NoSQL database that delivers single-digit millisecond performance at any scale. A Lambda function processes records in the Kinesis data stream and stores them in a DynamoDB table.
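The following is a sketch of what that stream-to-DynamoDB consumer could look like (the table name and attribute names are illustrative):

# Simplified Kinesis consumer: decodes order events and writes them to DynamoDB.
import base64
import json

import boto3

table = boto3.resource("dynamodb").Table("orders-hot-data")  # illustrative table name

def lambda_handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        table.put_item(Item={
            "ordernumber": payload["ordernumber"],
            "status": payload["status"],
            "orderdatetime": payload["orderdatetime"],
        })
    return {"processed": len(event["Records"])}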

Inventory hot data is stored in an Amazon Aurora MySQL-Compatible Edition database. Inventory is transactional data that requires high consistency so that customers aren't over-promised or under-promised when they place orders. Aurora MySQL is a fully managed database that is up to five times faster than standard MySQL databases and three times faster than standard PostgreSQL databases. It provides the security, availability, and reliability of commercial databases at a tenth of the cost.

Amazon S3 is object storage built to store and retrieve any amount of data from anywhere. It’s a simple storage service that offers industry-leading durability, availability, performance, security, and virtually unlimited scalability at very low cost. Order and inventory cold data is stored in Amazon S3.

Amazon Kinesis Data Firehose reads the data from the Kinesis data stream and stores it in Amazon S3. Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, and Splunk, enabling near-real-time analytics.

Data processing

The data processing stage involves cleaning, preparing, and transforming the data to help downstream analytics applications easily query the data. Each frontend system might have a different data format. In the data processing stage, data is cleaned and converted into a common canonical form.

For this solution, we use DataBrew to clean and convert orders into a common canonical form. DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to prepare data with an interactive, point-and-click visual interface without writing code. DataBrew provides over 250 built-in transformations to combine, pivot, and transpose the data without writing code. The cleaning and transformation steps in DataBrew are called recipes. A scheduled DataBrew job applies the recipes to the data in an S3 bucket and stores the output in a different bucket.

AWS Glue crawlers can access data stores, extract metadata, and create table definitions in the AWS Glue Data Catalog. You can schedule a crawler to crawl the transformed data and create or update the Data Catalog. The AWS Glue Data Catalog is your persistent metadata store. It’s a managed service that lets you store, annotate, and share metadata in the AWS Cloud in the same way you would in an Apache Hive metastore. We use crawlers to populate the Data Catalog with tables.
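Both the DataBrew job and the crawlers run on a schedule in this solution, but you can also trigger them on demand, for example from a maintenance script. The job and crawler names below are placeholders for the resources the stack creates.

# Kick off the DataBrew cleaning job and a Glue crawler on demand (names are placeholders).
import boto3

databrew = boto3.client("databrew")
glue = boto3.client("glue")

run = databrew.start_job_run(Name="orders-clean-job")
print("DataBrew run ID:", run["RunId"])

glue.start_crawler(Name="orders-datalake-crawler")
print("Crawler started")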

Data analysis

We can query orders and inventory data from S3 buckets using Athena. Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. Views are created in Athena that can be consumed by business intelligence (BI) services like QuickSight.
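You can run the same queries programmatically as well. The following is a minimal Boto3 sketch; the database and table names match the ones used later in this post, and the results bucket is a placeholder.

# Run an ad hoc Athena query against the cataloged order data.
import time

import boto3

athena = boto3.client("athena")

execution = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS orders FROM orders_clean GROUP BY status",
    QueryExecutionContext={"Database": "orderdatalake"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},  # placeholder
)

query_id = execution["QueryExecutionId"]
state = "RUNNING"
while state in ("QUEUED", "RUNNING"):
    time.sleep(1)
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])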

Data visualization

We generate dashboards using QuickSight. QuickSight is a scalable, serverless, embeddable BI service powered by ML and built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights.

QuickSight also has features to forecast orders, detect anomalies in the order, and provide ML-powered insights. We can create analyses such as orders over a period of time, orders split by channel, top 10 locations for orders, or order fulfillment timelines (the time it took from order creation to order delivery).

Walkthrough overview

To implement this solution, you complete the following high-level steps:

  1. Create solution resources using AWS CloudFormation.
  2. Connect to the inventory database.
  3. Load the inventory database with tables.
  4. Create a VPC endpoint using Amazon Virtual Private Cloud (Amazon VPC).
  5. Create gateway endpoints for Amazon S3 on the default VPC.
  6. Enable CloudWatch rules via Amazon EventBridge to ingest the data.
  7. Transform the data using AWS Glue.
  8. Visualize the data with QuickSight.

Prerequisites

Complete the following prerequisite steps:

  1. Create an AWS account if you don't already have one.
  2. Sign up for QuickSight if you’ve never used QuickSight in this account before. To use the forecast ability in QuickSight, sign up for the Enterprise Edition.

Create resources with AWS CloudFormation

To launch the provided CloudFormation template, complete the following steps:

  1. Choose Launch Stack:
  2. Choose Next.
  3. For Stack name, enter a name.
  4. Provide the following parameters:
    1. The name of the S3 bucket that holds all the data for the data lake.
    2. The name of the database that holds the inventory tables.
    3. The database user name.
    4. The database password.
  5. Enter any tags you want to assign to the stack and choose Next.
  6. Select the acknowledgement check boxes and choose Create stack.

The stack takes 5–10 minutes to complete.

On the AWS CloudFormation console, you can navigate to the stack’s Outputs tab to review the resources you created.

If you open the S3 bucket you created, you can observe its folder structure. The stack creates sample order data for the last 7 days.

Connect to the inventory database

To connect to your database in the query editor, complete the following steps:

  1. On the Amazon RDS console, choose the Region you deployed the stack in.
  2. In the navigation pane, choose Query Editor.

    If you haven’t connected to this database before, the Connect to database page opens.
  3. For Database instance or cluster, choose your database.
  4. For Database username, choose Connect with a Secrets Manager ARN.
    The database user name and password provided during stack creation are stored in AWS Secrets Manager. Alternatively, you can choose Add new database credentials and enter the database user name and password you provided when creating the stack.
  5. For Secrets Manager ARN, enter the value for the key InventorySecretManager from the CloudFormation stack outputs.
  6. Optionally, enter the name of your database.
  7. Choose Connect to database.

Load the inventory database with tables

Enter the following DDL statement in the query editor and choose Run:

CREATE TABLE INVENTORY (
    ItemID varchar(25) NOT NULL,
    ShipNode varchar(25) NOT NULL,
    SupplyType varchar(25) NOT NULL,
    SupplyDemandType varchar(25) NOT NULL,
    ItemName varchar(25),
    UOM varchar(10),
    Quantity int(11) NOT NULL,
    ETA varchar(25),
    UpdatedDate DATE,
    PRIMARY KEY (ItemID,ShipNode,SupplyType)
);

Create a VPC endpoint

To create your VPC endpoint, complete the following steps:

  1. On the Amazon VPC console, choose VPC Dashboard.
  2. Choose Endpoints in the navigation pane.
  3. Choose Create Endpoint.
  4. For Service category, select AWS services.
  5. For Service name, search for rds and choose the service name ending with rds-data.
  6. For VPC, choose the default VPC.
  7. Leave the remaining settings at their default and choose Create endpoint.

Create a gateway endpoint for Amazon S3

To create your gateway endpoint, complete the following steps:

  1. On the Amazon VPC console, choose VPC Dashboard.
  2. Choose Endpoints in the navigation pane.
  3. Choose Create Endpoint.
  4. For Service category, select AWS services.
  5. For Service name, search for S3 and choose the service name with type Gateway.
  6. For VPC, choose the default VPC.
  7. For Configure route tables, select the default route table.
  8. Leave the remaining settings at their default and choose Create endpoint.

Wait for both the gateway endpoint and VPC endpoint status to change to Available.

Enable CloudWatch rules to ingest the data

We created two CloudWatch rules via the CloudFormation template to ingest the order and inventory data to Kinesis Data Streams. To enable the rules via EventBridge, complete the following steps:

  1. On the CloudWatch console, under Events in the navigation pane, choose Rules.
  2. Make sure you’re in the Region where you created the stack.
  3. Choose Go to Amazon EventBridge.
  4. Select the rule Ingest-Inventory-Update-Schedule-Rule and choose Enable.
  5. Select the rule Ingest-Order-Schedule-Rule and choose Enable.

After 5–10 minutes, the Lambda functions start ingesting orders and inventory updates to their respective streams. You can check the S3 buckets orders-landing-zone and inventory-landing-zone to confirm that the data is being populated.
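You can also enable the rules from code instead of the console. The following is a short sketch using the rule names above:

# Enable the two scheduled ingestion rules created by the CloudFormation stack.
import boto3

events = boto3.client("events")

for rule_name in ["Ingest-Inventory-Update-Schedule-Rule", "Ingest-Order-Schedule-Rule"]:
    events.enable_rule(Name=rule_name)
    print(f"Enabled {rule_name}")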

Perform data transformation

Our CloudFormation stack included a DataBrew project, a DataBrew job that runs every 5 minutes, and two AWS Glue crawlers. To perform data transformation using our AWS Glue resources, complete the following steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Choose the project OrderDataTransform.

    You can review the project and its recipe on this page.
  3. In the navigation pane, choose Jobs.
  4. Review the job status to confirm it’s complete.
  5. On the AWS Glue console, choose Crawlers in the navigation pane.
    The crawlers crawl the transformed data and update the Data Catalog.
  6. Review the status of the two crawlers, which run every 15 minutes.
  7. Choose Tables in the navigation pane to view the two tables the crawlers created.
    If you don’t see these tables, you can run the crawlers manually to create them.

    You can query the data in the tables with Athena.
  8. On the Athena console, choose Query editor.
    If you haven’t created a query result location, you’re prompted to do that first.
  9. Choose View settings or choose the Settings tab.
  10. Choose Manage.
  11. Select the S3 bucket to store the results and choose Choose.
  12. Choose Query editor in the navigation pane.
  13. Choose either table (right-click) and choose Preview Table to view the table contents.

Visualize the data

If you have never used QuickSight in this account before, complete the prerequisite step to sign up for QuickSight. To use the ML capabilities of QuickSight (such as forecasting), sign up for the Enterprise Edition using the steps in this documentation.

While signing up for QuickSight, make sure to use the same Region where you created the CloudFormation stack.

Grant QuickSight permissions

To visualize your data, you must first grant relevant permissions to QuickSight to access your data.

  1. On the QuickSight console, on the Admin drop-down menu, choose Manage QuickSight.
  2. In the navigation pane, choose Security & permissions.
  3. Under QuickSight access to AWS services, choose Manage.
  4. Select Amazon Athena.
  5. Select Amazon S3 to edit QuickSight access to your S3 buckets.
  6. Select the bucket you specified during stack creation (for this post, operational-datalake).
  7. Choose Finish.
  8. Choose Save.

Prepare the datasets

To prepare your datasets, complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena.
  4. For Data source name, enter retail-analysis.
  5. Choose Validate connection.
  6. After your connection is validated, choose Create data source.
  7. For Database, choose orderdatalake.
  8. For Tables, select orders_clean.
  9. Choose Edit/Preview data.
  10. For Query mode, select SPICE.
    SPICE (Super-fast, Parallel, In-memory Calculation Engine) is the robust in-memory engine that QuickSight uses.
  11. Choose the orderdatetime field (right-click), choose Change data type, and choose Date.
  12. Enter the date format as MM/dd/yyyy HH:mm:ss.
  13. Choose Validate and Update.
  14. Change the data types of the following fields to QuickSight geospatial data types:
    1. billingaddress.zipcode – Postcode
    2. billingaddress.city – City
    3. billingaddress.country – Country
    4. billingaddress.state – State
    5. shippingaddress.zipcode – Postcode
    6. shippingaddress.city – City
    7. shippingaddress.country – Country
    8. shippingaddress.state – State
  15. Choose Save & publish.
  16. Choose Cancel to exit this page.

    Let’s create another dataset for the Athena table inventory_landing_zone.
  17. Follow steps 1–7 to create a new dataset. For Table selection, choose inventory_landing_zone.
  18. Choose Edit/Preview data.
  19. For Query mode, select SPICE.
  20. Choose Save & publish.
  21. Choose Cancel to exit this page.

    Both datasets should now be listed on the Datasets page.
  22. Choose each dataset and choose Refresh now.
  23. Select Full refresh and choose Refresh.

To set up a scheduled refresh, choose Schedule a refresh and provide your schedule details.

Create an analysis

To create an analysis in QuickSight, complete the following steps:

  1. On the QuickSight console, choose Analyses in the navigation pane.
  2. Choose New analysis.
  3. Choose the orders_clean dataset.
  4. Choose Create analysis.
  5. To adjust the theme, choose Themes in the navigation pane, choose your preferred theme, and choose Apply.
  6. Name the analysis retail-analysis.

Add visualizations to the analysis

Let’s start creating visualizations. The first visualization shows orders created over time.

  1. Choose the empty graph on the dashboard and for Visual type, choose the line chart.
    For more information about visual types, see Visual types in Amazon QuickSight.
  2. Under Field wells, drag orderdatetime to X axis and ordernumber to Value.
  3. Set ordernumber to Aggregate: Count distinct.

    Now we can filter these orders by Created status.
  4. Choose Filter in the navigation pane and choose Create one.
  5. Search for and choose status.
  6. Choose the status filter you just created.
  7. Select Created from the filter list and choose Apply.
  8. Choose the graph (right-click) and choose Add forecast.
    The forecasting ability is only available in the Enterprise Edition. QuickSight uses a built-in version of the Random Cut Forest (RCF) algorithm. For more information, refer to Understanding the ML algorithm used by Amazon QuickSight.
  9. Leave the settings as default and choose Apply.
  10. Rename the visualization to “Orders Created Over Time.”

If the forecast is applied successfully, the visualization shows the expected number of orders as well as upper and lower bounds.

If you get the following error message, allow for the data to accumulate for a few days before adding the forecast.

Let’s create a visualization on orders by location.

  1. On the Add menu, choose Add visual.
  2. Choose the points on map visual type.
  3. Under Field wells, drag shippingaddress.zipcode to Geospatial and ordernumber to Size.
  4. Change ordernumber to Aggregate: Count distinct.

    You should now see a map indicating the orders by location.
  5. Rename the visualization accordingly.

    Next, we create a drill-down visualization on the inventory count.
  6. Choose the pencil icon.
  7. Choose Add dataset.
  8. Select the inventory_landing_zone dataset and choose Select.
  9. Choose the inventory_landing_zone dataset.
  10. Add the vertical bar chart visual type.
  11. Under Field wells, drag itemname, shipnode, and invtype to X axis, and quantity to Value.
  12. Make sure that quantity is set to Sum.

    The following screenshot shows an example visualization of order inventory.
  13. To determine how many face masks were shipped out from each ship node, choose Face Masks (right-click) and choose Drill down to shipnode.
  14. You can drill down even further to invtype to see how many face masks in a specific ship node are in which status.

The following screenshot shows this drilled-down inventory count.

As a next step, you can create a QuickSight dashboard from the analysis you created. For instructions, refer to Tutorial: Create an Amazon QuickSight dashboard.

Clean up

To avoid any ongoing charges, on the AWS CloudFormation console, select the stack you created and choose Delete. This deletes all the created resources. On the stack’s Events tab, you can track the progress of the deletion, and wait for the stack status to change to DELETE_COMPLETE.

The Amazon EventBridge rules generate orders and inventory data every 15 minutes. To avoid generating a huge amount of data, be sure to delete the stack after you finish testing.

If the deletion of any resources fails, ensure that you delete them manually. For deleting Amazon QuickSight datasets, you can follow these instructions. You can delete the QuickSight Analysis using these steps. For deleting the QuickSight subscription and closing the account, you can follow these instructions.

Conclusion

In this post, we showed you how to use AWS analytics and storage services to build a serverless operational data lake. Kinesis Data Streams lets you ingest large volumes of data, and DataBrew lets you cleanse and transform the data visually. We also showed you how to analyze and visualize the order and inventory data using AWS Glue, Athena, and QuickSight. For more information and resources for data lakes on AWS, visit Analytics on AWS.


About the Authors

Gandhi Raketla is a Senior Solutions Architect for AWS. He works with AWS customers and partners on cloud adoption, as well as architecting solutions that help customers foster agility and innovation. He specializes in the AWS data analytics domain.

Sindhura Palakodety is a Solutions Architect at AWS. She is passionate about helping customers build enterprise-scale Well-Architected solutions on the AWS Cloud and specializes in the containers and data analytics domains.

Automate your validated dataset deployment using Amazon QuickSight and AWS CloudFormation

Post Syndicated from Jeremy Winters original https://aws.amazon.com/blogs/big-data/automate-your-validated-dataset-deployment-using-amazon-quicksight-and-aws-cloudformation/

A lot of the power behind business intelligence (BI) and data visualization tools such as Amazon QuickSight comes from the ability to work interactively with data through a GUI. Report authors create dashboards using GUI-based tools, then in just a few clicks can share the dashboards with business users and decision-makers. This workflow empowers authors to create and manage the QuickSight resources and dashboards they’re responsible for providing.

Developer productivity is a great benefit of UI-based development, but enterprise customers often need to consider additional factors in their BI implementation:

  • Promoting objects through environments (development, testing, production, and so on)
  • Scaling for hundreds of authors and thousands of users
  • Implementing data security, such as row-level and column-level rules to filter the data elements visible to specific users
  • Regulatory requirements, processes, and compliance controls.

Approaches such as source control-backed CI/CD pipelines allow you to address compliance requirements and security gates with automation. For example, a hypothetical build pipeline for a Java Spring Boot application may enable developers to build and deploy freely to a dev environment, but the code must pass tests and vulnerability scans before being considered for promotion to upper environments. A human approval step then takes place before the code is released into production. Processes such as this provide quality, consistency, auditability, and accountability for the code being released.

The QuickSight API provides functionality for automation pipelines. Pipeline developers can use the API to migrate QuickSight resources from one environment to another. The API calls for handling QuickSight datasets enable inspection of the JSON representation of the dataset definition.

This post presents an example of how a QuickSight administrator can automate data resource management and security validation through use of the QuickSight API and AWS CloudFormation.

Solution overview

Our model implements security rules that rely on naming conventions for tables and columns. Instead of relying on naming conventions, you may want to use a lookup table or a similar approach to store the relationships between data tables and security tables.

We guide you through the following steps:

  1. Create relational database tables to be secured.
  2. Create a QuickSight dataset in your dev account.
  3. Generate a CloudFormation template using a Python script that allows you to enforce row-level and column-level security in each environment. You can customize this script to the needs of your organization.
  4. Use the generated CloudFormation template to deploy through dev, test, and prod using your change management process.

You can use AWS CloudFormation to manage several types of QuickSight resources, but dataset resources are a critical junction for security, so they are our focus in this post.

To implement data security rules in a large organization, controls must be in place to agree upon and implement the rules from a process perspective. This post dives deep into using code to validate security aspects of your QuickSight deployment, but data security requires more than code. The approaches put forward are intended as a part of a larger change management process, much of which is based around human review and approval.

In addition to having a change management process in place, we suggest managing your AWS resources using a CI/CD pipeline. The nature of change management and CI/CD processes can vary greatly, and are outside the scope of this post.

Prerequisites

This post assumes a basic command of the following:

We don’t go into the broader picture of integrating into a full CI/CD process, so an understanding of CI/CD is helpful, but not required.

Security rules for your organization

Before we can write a script to confirm security rules have been applied correctly, we need to know what the security rules actually are. This means we need to determine the following:

  • What – What is the data we are trying to secure? Which fields in the database are sensitive? Which field values will be used to filter access?
  • Who – Who are the users and groups that should be provided access to the data and fields we have identified?

In concrete terms, we need to match identities (users and groups) to actual data values (used in row-level security) and sensitive fields (for column-level security). Identities such as users and groups typically correlate to entities in external systems such as Active Directory, but you can use native QuickSight users and groups.

For this post, we define the following rules that indicate the relationship between database objects (tables and fields) and how they should be secured. Keep in mind that these example rules may not apply to every organization. Security should be developed to match your requirements and processes.

  • Any field name with _sensitive appended to it is identified as containing sensitive data. For example, a column named salary_usd_sensitive should be restricted. For our scenario, we say that the user should be a member of the QuickSight restricted group in order to access sensitive fields. No other groups are allowed access to these fields.
  • For a given table, a companion table with _rls appended to the name contains the row-level security rules used to secure the table. In this model, the row-level security rules for the employees table are found in the employees_rls table.
  • Row-level security rules must be sourced 100% from the underlying data store. This means that you can’t upload rules via the QuickSight console, or use custom SQL in QuickSight to create the rules. Rules can be provided as views (if supported by the underlying data store) as long as the view definition is managed using a change management process.
  • The dataset name should match the name of the underlying database table.

These rules rely on a well-structured change management process for the database. If users and developers have access to change database objects in production, the rules won’t carry much weight. For examples of automated schema management using open-source CI/CD tooling, refer to Deploy, track, and roll back RDS database code changes using open source tools Liquibase and Jenkins and How to Integrate Amazon RDS Schema Changes into CI/CD Pipelines with GitLab and Liquibase.

From the QuickSight perspective, our database becomes the source of the “what” and “who” we discussed earlier. QuickSight doesn’t own the security rules, it merely implements the rules as defined in the database.

Security rule management with database objects

For this post, we source data from a Postgres database using a read-only user created for QuickSight.

First, we create our schema and a data table with a few rows inserted:

create schema if not exists ledger;

--the table we are securing
drop table if exists ledger.transactions;
create table if not exists ledger.transactions (
    txn_id integer,
    txn_type varchar(100),
    txn_desc varchar(100),
    txn_amt float,
    department varchar(100),
    discount_sensitive float
);

insert into ledger.transactions (
    txn_id,
    txn_type,
    txn_desc,
    txn_amt,
    department,
    discount_sensitive
) 
values
(1, 'expense', 'commission', -1000.00, 'field sales', 0.0),
(2, 'revenue', 'widgets',  15000.00, 'field sales', 1000.00),
(3, 'revenue', 'interest', 1000.00, 'corporate', 0.0),
(4, 'expense', 'taxes', -1234.00, 'corporate', 0.0),
(5, 'revenue', 'doodads', 1000.00, 'field sales', 100.0)
;

Note the field discount_sensitive. In our security model, any field name with _sensitive appended to it is identified as containing sensitive data. This information is used later when we implement column-level security. In our example, we have the luxury of using naming conventions to tag the sensitive fields, but that isn’t always possible. Other options could involve the use of SQL comments, or creating a table that provides a lookup for sensitive fields. Which method you choose depends upon your data and requirements, and should be supported by a change management process.

Row-level security table

The following SQL creates a table containing the row-level security rules for the ledger.transactions table, then inserts rules that match the example discussed earlier:

drop table if exists ledger.transactions_rls;
create table ledger.transactions_rls (
    groupname varchar(100),
    department varchar(1000)
);


insert into ledger.transactions_rls (groupname, department) 
values
('restricted', null), --null indicates all values
('anybody', 'field sales');

For more information about how to restrict access to a dataset using row-level security, refer to Using row-level security (RLS) with user-based rules to restrict access to a dataset.

These rules match the specified QuickSight user groups to values in the department field of the transactions table.

Our last step in Postgres is to create a user that has read-only access to our tables. All end-user or SPICE refresh queries from QuickSight are run using this user. See the following code:

drop role if exists qs_user;
create role qs_user login password 'GETABETTERPASSSWORD';
grant connect on database quicksight TO qs_user;
grant usage on schema ledger to qs_user;
grant select on ledger.transactions to qs_user;
grant select on ledger.transactions_rls to qs_user;

Create user groups

Our security model provides permissions based on group membership. Although QuickSight allows for these groups to be sourced from external systems such as Active Directory, our example uses native QuickSight groups.

We create our groups using the following AWS Command Line Interface (AWS CLI) commands. Take note of the restricted group we’re creating; this is the group we use to grant access to sensitive data columns.

aws quicksight create-group \
--aws-account-id YOUR_AWS_ACCOUNT_ID_HERE \
--namespace default \
--group-name restricted

aws quicksight create-group \
--aws-account-id YOUR_AWS_ACCOUNT_ID_HERE \
--namespace default \
--group-name anybody

You can also add a user to your group with the following code:

aws quicksight create-group-membership \
--aws-account-id YOUR_AWS_ACCOUNT_ID_HERE \
--namespace default \
--group-name anybody \
--member-name [email protected]

The Python script

Now that we have set up our database and groups, we switch focus to the Python script used for the following actions:

  • Extracting the definition of a manually created dataset using the QuickSight API
  • Ensuring that the dataset definition meets security standards
  • Restructuring the dataset definition into the format of a CloudFormation template
  • Writing the CloudFormation template to a JSON file

In the header of the script, you can see the following variables, which you should set to values in your own AWS environment:

# Parameters for the source data set
region_name = 'AWS_REGION_NAME'
aws_account_id = "AWS_ACCOUNT_ID"
source_data_set_id = "ID_FOR_THE_SOURCE_DATA_SET"

# Parameters are used when creating the cloudformation template
target_data_set_name = "DATA_SET_DISPLAY_NAME"
target_data_set_id = "NEW_DATA_SET_ID"
template_file_name = "dataset.json"

QuickSight datasets have a name and an ID. The name is displayed in the QuickSight UI, and the ID is used to reference the dataset behind the scenes. The ID must be unique for a given account and Region, which is why QuickSight uses UUIDs by default, but you can use any unique string.

Create the datasets

You can use the QuickSight GUI or Public API to create a dataset for the transactions_rls and transactions tables. For instructions, refer to Creating a dataset from a database. Connect to the database, create the datasets, then apply transactions_rls as the row-level security for the transactions dataset. You can use the following list-data-sets AWS CLI call to verify that your tables were created successfully:

$ aws quicksight list-data-sets --aws-account-id YOURACCOUNT            
{
    "DataSetSummaries": [
       {
            "Arn": "arn:aws:quicksight:us-west-2:YOURACCOUNT:dataset/<ID>",
            "DataSetId": "<ID>",
            "Name": "transactions",
            "CreatedTime": "2021-09-15T15:41:56.716000-07:00",
            "LastUpdatedTime": "2021-09-15T16:38:03.658000-07:00",
            "ImportMode": "SPICE",
            "RowLevelPermissionDataSet": {
                "Namespace": "default",
                "Arn": "arn:aws:quicksight:us-west-2: YOURACCOUNT:dataset/<RLS_ID>",
                "PermissionPolicy": "GRANT_ACCESS",
                "FormatVersion": "VERSION_1",
                "Status": "ENABLED"
            },
            "RowLevelPermissionTagConfigurationApplied": false,
            "ColumnLevelPermissionRulesApplied": true
        },
        {
            "Arn": "arn:aws:quicksight:us-west-2: YOURACCOUNT:dataset/<RLS_ID>",
            "DataSetId": "<RLS_ID>",
            "Name": "transactions_rls",
            "CreatedTime": "2021-09-15T15:42:37.313000-07:00",
            "LastUpdatedTime": "2021-09-15T15:42:37.520000-07:00",
            "ImportMode": "SPICE",
            "RowLevelPermissionTagConfigurationApplied": false,
            "ColumnLevelPermissionRulesApplied": false
        }
    ]
}

Script overview

Our script is based around the describe_data_set method of the Boto3 QuickSight client. This method returns a Python dictionary containing all the attributes associated with a dataset resource. Our script analyzes these dictionaries, then coerces them into the structure required for dataset creation using AWS CloudFormation. The structure of the describe_data_set method and the AWS::QuickSight::DataSet CloudFormation resource are very similar, but not quite identical.

The following are the top-level fields in the response for the Boto3 QuickSight client describe_data_set method:

{
    'DataSet': {
        'Arn': 'string',
        'DataSetId': 'string',
        'Name': 'string',
        'CreatedTime': datetime(2015, 1, 1),
        'LastUpdatedTime': datetime(2015, 1, 1),
        'PhysicalTableMap': {},
        'LogicalTableMap': {...},
        'OutputColumns': [...],
        'ImportMode': 'SPICE'|'DIRECT_QUERY',
        'ConsumedSpiceCapacityInBytes': 123,
        'ColumnGroups': [...],
        'FieldFolders': {...},
        'RowLevelPermissionDataSet': {...},
        'ColumnLevelPermissionRules': [...]
    },
    'RequestId': 'string',
    'Status': 123
}

Our script converts the response from the API to the structure required for creating a dataset using AWS CloudFormation.

The following are the top-level fields in the AWS::QuickSight::DataSet CloudFormation resource:

{
  "Type" : "AWS::QuickSight::DataSet",
  "Properties" : {
      "AwsAccountId" : String,
      "ColumnGroups" : [ ColumnGroup, ... ],
      "ColumnLevelPermissionRules" : [ ColumnLevelPermissionRule, ... ],
      "DataSetId" : String,
      "FieldFolders" : {Key : Value, ...},
      "ImportMode" : String,
      "IngestionWaitPolicy" : IngestionWaitPolicy,
      "LogicalTableMap" : {Key : Value, ...},
      "Name" : String,
      "Permissions" : [ ResourcePermission, ... ],
      "PhysicalTableMap" : {Key : Value, ...},
      "RowLevelPermissionDataSet" : RowLevelPermissionDataSet,
      "Tags" : [ Tag, ... ]
    }
}

The key differences between both JSON structures are as follows:

  • describe_data_set contains Arn, CreatedTime, and LastUpdatedTime, which are useful fields but only relevant to an existing resource
  • AWS CloudFormation requires AwsAccountId when creating the resource
  • AWS CloudFormation accepts tags for the dataset, but describe_data_set doesn’t provide them
  • The AWS CloudFormation Permissions property allows for assigning AWS Identity and Access Management (IAM) permissions at the time of creation

Our script is able to selectively choose the top-level properties we want from the describe_data_set response, then add the fields that AWS CloudFormation requires for resource creation.
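A condensed sketch of that coercion (not the full script) might look like the following, continuing from the describe_data_set call and the parameters shown earlier:

# Condensed sketch: keep the properties AWS CloudFormation accepts, add what it requires.
data_set_def = describe_response["DataSet"]

properties = {
    "AwsAccountId": {"Ref": "AWS::AccountId"},
    "DataSetId": target_data_set_id,
    "Name": target_data_set_name,
    "ImportMode": data_set_def["ImportMode"],
    "PhysicalTableMap": data_set_def["PhysicalTableMap"],
    "LogicalTableMap": data_set_def["LogicalTableMap"],
}

# Optional properties are copied only when present on the source dataset.
# Environment-specific ARNs (such as the RLS dataset ARN) would be parameterized in practice.
for key in ["ColumnGroups", "FieldFolders", "RowLevelPermissionDataSet", "ColumnLevelPermissionRules"]:
    if data_set_def.get(key):
        properties[key] = data_set_def[key]

data_set = {"Type": "AWS::QuickSight::DataSet", "Properties": properties}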

Validate security

Before the script creates the CloudFormation template, it performs validations to ensure that our dataset conforms to the defined security rules.

The following is the snippet from our script that performs validation for row-level security:

if 'RowLevelPermissionDataSet' in describe_response['DataSet']:
    if describe_response['DataSet']['RowLevelPermissionDataSet'] is None:
        raise Exception("row level permissions must be applied!")
    else:
        # now we look up the rls data set so that we can confirm that it conforms to our rules
        rls_dataset_id = describe_response['DataSet']['RowLevelPermissionDataSet']['Arn'].split('/')[-1]
        rls_response = client.describe_data_set(
            AwsAccountId = aws_account_id,
            DataSetId = rls_dataset_id
        )
        
        rls_table_map = rls_response['DataSet']['PhysicalTableMap']

        # rls table must not be custom SQL
        if 'CustomSql' in rls_table_map[list(rls_table_map.keys())[0]]:
            raise Exception("RLS data set can not contain custom SQL!")

        # confirm that the database table name is what we expect it to be 
        if rls_response['DataSet']['Name'] != describe_response['DataSet']['Name'] + '_rls':
            raise Exception("RLS data set name must match pattern tablename_rls!")

The steps in the code are as follows:

  1. Ensure that any row-level security is applied (this is the bare minimum).
  2. Look up the dataset that contains the row-level security rules using another Boto3 call.
  3. Confirm that the row-level security dataset is not custom SQL.
  4. Confirm that the name of the table is as expected, with _rls appended to the name of the table being secured.

The use of custom SQL for sourcing row-level security rules isn't secure in our case, because a QuickSight developer could use SQL to alter the underlying rules. Because of this, our model requires that the row-level security rules come from a physical table in the database rather than from custom SQL. It's still possible to use a database view to provide the rules; a view is acceptable because its definition (in our scenario) is governed by a change management process, unlike custom SQL, which a QuickSight developer can create on their own.

The rules you implement for your own organization will be different. You may need to connect to a database directly from your Python script to validate that the dataset was created in a secure manner. Regardless of your actual rules, the describe_data_set API method provides the details you need to begin validating the dataset.
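For example, if your model also required datasets to read only from an approved schema, a check along the following lines could be added to the script (the schema name and the rule itself are assumptions for illustration):

# Illustrative extra check: require that every physical table reads from an approved schema.
APPROVED_SCHEMAS = {'ledger'}

physical_table_map = describe_response['DataSet']['PhysicalTableMap']
for table in physical_table_map.values():
    relational_table = table.get('RelationalTable')
    if relational_table is None:
        raise Exception("data set must be sourced from a relational table!")
    if relational_table['Schema'] not in APPROVED_SCHEMAS:
        raise Exception("data set reads from an unapproved schema!")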

Column-level security

Our model for column-level security indicates that any database field name that ends in _sensitive should only be accessible to members of a QuickSight group named restricted. Instead of validating that the dataset has the column-level security rules applied correctly, we simply enforce the rules directly in two steps:

  1. Identify the sensitive fields.
  2. Create a dictionary and add it to our dataset with the key ColumnLevelPermissionRules.

To identify the sensitive fields, we create a list and iterate through the input columns of the physical table:

sensitive_fields = []
input_columns = physical_table_map[list(physical_table_map.keys())[0]]["RelationalTable"]["InputColumns"]
for input_column in input_columns:
    field_name = input_column['Name']
    if field_name.endswith('_sensitive'):
        sensitive_fields.append(field_name)

The result is a list of sensitive fields. We can then take this list and integrate it into the dataset through the use of a dictionary:

if len(sensitive_fields) > 0:
    data_set["ColumnLevelPermissionRules"] = [
        {
            "Principals": [
                {"Ref": "RestrictedUserGroupArn"}
            ],
            "ColumnNames": sensitive_fields
        }
    ]

Instead of specifying a specific principal, we reference the CloudFormation template parameter RestrictedUserGroupArn. The ARN for the restricted group is likely to vary, especially if you’re deploying to another AWS account. Using a template parameter allows us to specify the ARN at the time of dataset creation in the new environment.

Access to the QuickSight dataset resources

The Permissions structure is added to the definition for each dataset:

"Permissions": [
    {
        "Principal": {
            "Ref": "QuickSightAdminPrincipal"
        },
        "Actions": [
            "quicksight:DescribeDataSet",
            "quicksight:DescribeDataSetPermissions",
            "quicksight:PassDataSet",
            "quicksight:DescribeIngestion",
            "quicksight:ListIngestions",
            "quicksight:UpdateDataSet",
            "quicksight:DeleteDataSet",
            "quicksight:CreateIngestion",
            "quicksight:CancelIngestion",
            "quicksight:UpdateDataSetPermissions"
        ]
    }
]

A value for the QuickSightAdminPrincipal CloudFormation template parameter is provided at the time of stack creation. The preceding structure provides the principal access to manage the QuickSight dataset resource itself. Note that this is not the same as data access (though an admin user could manually remove the row-level security rules). Row-level and column-level security rules indicate whether a given user has access to specific data, whereas these permissions allow for actions on the definition of the dataset, such as the following:

  • Updating or deleting the dataset
  • Changing the security permissions
  • Initiating and monitoring SPICE refreshes

End-users don’t require this access in order to use a dashboard created from the dataset.

Run the script

Our script requires you to specify the dataset ID, which is not the same as the dataset name. To determine the ID, use the AWS CLI list-data-sets command.
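If you prefer to stay in Python, you can perform the same lookup with Boto3 (the region and account ID here are placeholders):

import boto3

# Print dataset IDs and names so you can find the ID of the dataset to export.
quicksight = boto3.client('quicksight', region_name='us-west-2')
paginator = quicksight.get_paginator('list_data_sets')
for page in paginator.paginate(AwsAccountId='<YOUR_ACCOUNT_ID>'):
    for summary in page['DataSetSummaries']:
        print(summary['DataSetId'], '->', summary['Name'])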

To set the script parameters, you can edit the following lines to match your environment:

# parameters for the source data set
region_name = 'us-west-2'
aws_account_id = "<YOUR_ACCOUNT_ID>"
source_data_set_id = "<SOURCE_DATA_SET_ID>"

# parameters for the target data set
target_data_set_name = "DATA_SET_PRESENTATION_NAME"
target_data_set_id = "NEW_DATA_SET_ID"

The following snippet runs the Python script:

quicksight_security % python3 data_set_to_cf.py
row level security validated!
the following sensitive fields were found: ['discount_sensitive']
cloudformation template written to dataset.json
cli-input-json file written to params.json

CloudFormation template

Now that the security rules have been validated, our script can generate the CloudFormation template. The describe_response_to_cf_data_set method accepts a describe_data_set response as input (along with a few other parameters) and returns a dictionary that reflects the structure of an AWS::QuickSight::DataSet CloudFormation resource. Our code uses this method once for the primary dataset, and again for the _rls rules dataset. This method handles selecting values from the response, pruning some unnecessary items (such as empty tag lists), and replacing a few values with CloudFormation references. These references allow us to provide parameter values to the template, such as QuickSight principals and the data source ARN.

You can view the template using the cat command:

quicksight_security % cat dataset.json
{
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Creates a QuickSight Data Set",
    "Parameters": {
        "DataSourceArn": {
            "Type": "String",
            "Description": "ARN for Postgres data source resource"
        },
        "QuickSightOwnerPrincipal": {
            "Type": "String",
            "Description": "ARN for a QuickSight principal who will be granted API access to the datasets"
        },
        "RestrictedUserGroupArn": {
            "Type": "String",
            "Description": "ARN for a QuickSight principal who will be granted access to sensitive fields"
        }
    },
    "Resources": {
        "NewDataSet": {
            "Type": "AWS::QuickSight::DataSet",
            "Properties": {
                "DataSetId": "NEW_DATA_SET_ID",
                "Name": "DATA_SET_PRESENTATION_NAME",
                "AwsAccountId": {
                    "Ref": "AWS::AccountId"
                },
                "Permissions": [
                    {
                        "Principal": {
                            "Ref": "QuickSightAdminPrincipal"
                        },
                        "Actions": [
                            "quicksight:DescribeDataSet",
                            "quicksight:DescribeDataSetPermissions",
                            "quicksight:PassDataSet",
                            "quicksight:DescribeIngestion",
                            "quicksight:ListIngestions",
                            "quicksight:UpdateDataSet",
                            "quicksight:DeleteDataSet",
                            "quicksight:CreateIngestion",
                            "quicksight:CancelIngestion",
                            "quicksight:UpdateDataSetPermissions"
                        ]
                    }
                ],
                "FieldFolders": {},
                "ImportMode": "DIRECT_QUERY",
                "LogicalTableMap": {
                    "e2305db4-2c79-4ac4-aff5-224b8c809767": {
                        "Alias": "transactions",
                        "DataTransforms": [
                            {
                                "ProjectOperation": {
                                    "ProjectedColumns": [
                                        "txn_id",
                                        "txn_type",
                                        "txn_desc",
                                        "txn_amt",
                                        "department",
                                        "discount_sensitive"
                                    ]
                                }
                            }
                        ],
                        "Source": {
                            "PhysicalTableId": "someguid-2c79-4ac4-aff5-224b8c809767"
                        }
                    }
                },
                "PhysicalTableMap": {
                    "e2305db4-2c79-4ac4-aff5-224b8c809767": {
                        "RelationalTable": {
                            "DataSourceArn": {
                                "Ref": "DataSourceArn"
                            },
                            "Schema": "ledger",
                            "Name": "transactions",
                            "InputColumns": [
                                {
                                    "Name": "txn_id",
                                    "Type": "INTEGER"
                                },
                                {
                                    "Name": "txn_type",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "txn_desc",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "txn_amt",
                                    "Type": "DECIMAL"
                                },
                                {
                                    "Name": "department",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "discount_sensitive",
                                    "Type": "DECIMAL"
                                }
                            ]
                        }
                    }
                },
                "RowLevelPermissionDataSet": {
                    "Namespace": "default",
                    "Arn": {
                        "Fn::GetAtt": [
                            "NewDataSetRLS",
                            "Arn"
                        ]
                    },
                    "PermissionPolicy": "GRANT_ACCESS",
                    "FormatVersion": "VERSION_1"
                },
                "ColumnLevelPermissionRules": [
                    {
                        "Principals": [
                            {
                                "Ref": "RestrictedUserGroupArn"
                            }
                        ],
                        "ColumnNames": [
                            "discount_sensitive"
                        ]
                    }
                ]
            }
        },
        "NewDataSetRLS": {
            "Type": "AWS::QuickSight::DataSet",
            "Properties": {
                "DataSetId": "NEW_DATA_SET_ID_rls",
                "Name": "DATA_SET_PRESENTATION_NAME_rls",
                "AwsAccountId": {
                    "Ref": "AWS::AccountId"
                },
                "Permissions": [
                    {
                        "Principal": {
                            "Ref": "QuickSightAdminPrincipal"
                        },
                        "Actions": [
                            "quicksight:DescribeDataSet",
                            "quicksight:DescribeDataSetPermissions",
                            "quicksight:PassDataSet",
                            "quicksight:DescribeIngestion",
                            "quicksight:ListIngestions",
                            "quicksight:UpdateDataSet",
                            "quicksight:DeleteDataSet",
                            "quicksight:CreateIngestion",
                            "quicksight:CancelIngestion",
                            "quicksight:UpdateDataSetPermissions"
                        ]
                    }
                ],
                "FieldFolders": {},
                "ImportMode": "SPICE",
                "LogicalTableMap": {
                    "someguid-51d7-43c4-9f8c-c60a286b0507": {
                        "Alias": "transactions_rls",
                        "DataTransforms": [
                            {
                                "ProjectOperation": {
                                    "ProjectedColumns": [
                                        "groupname",
                                        "department"
                                    ]
                                }
                            }
                        ],
                        "Source": {
                            "PhysicalTableId": "someguid-51d7-43c4-9f8c-c60a286b0507"
                        }
                    }
                },
                "PhysicalTableMap": {
                    "someguid-51d7-43c4-9f8c-c60a286b0507": {
                        "RelationalTable": {
                            "DataSourceArn": {
                                "Ref": "DataSourceArn"
                            },
                            "Schema": "ledger",
                            "Name": "transactions_rls",
                            "InputColumns": [
                                {
                                    "Name": "groupname",
                                    "Type": "STRING"
                                },
                                {
                                    "Name": "department",
                                    "Type": "STRING"
                                }
                            ]
                        }
                    }
                }
            }
        }
    }
}

You can deploy this template directly into AWS via the CloudFormation console. You are required to provide the following parameters:

  • DataSourceArn – A QuickSight dataset is a reference to a table or other database object. In order for this object to be accessed, we need to specify a QuickSight data source resource that facilitates the connection.
  • QuickSightAdminPrincipal – The QuickSight principal (user or group ARN) that is granted API access to the dataset resources. You can exclude these permissions from the script and template if your existing security policies automatically provide access to the appropriate users and groups.
  • RestrictedUserGroupArn – The ARN of the QuickSight group that is granted access to the sensitive columns.

You can also deploy the template using the AWS CLI. Although it’s possible to pass in all the parameters directly via the command line, you may find it a bit clunky when entering long values. To simplify this, our script generates a params.json file structured to capture all the parameters required by the template:

{
    "Parameters": [
        {
            "ParameterKey": "DataSourceArn",
            "ParameterValue": "YOUR_DATA_SOURCE_ARN_HERE"
        },
        {
            "ParameterKey": "QuickSightAdminPrincipal",
            "ParameterValue": "YOUR_ADMIN_GROUP_PRINCIPAL_HERE"
        },
        {
            "ParameterKey": "RestrictedUserGroupArn",
            "ParameterValue": "YOUR_RESTRICTED_USER_GROUP_ARN_HERE"
        }
    ]
}

Use the following command to build the stack, with params.json as input:

aws cloudformation create-stack \
--stack-name SecuredDataSet \
--template-body file://dataset.json \
--cli-input-json file://params.json

You can use the AWS CloudFormation console to monitor the stack progress. When the creation is complete, you should see your new dataset in QuickSight!
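If you're scripting the deployment end to end, you can also wait for the stack and confirm the new dataset from Python (the region, account ID, and dataset ID are the same placeholders used earlier):

import boto3

cloudformation = boto3.client('cloudformation', region_name='us-west-2')
quicksight = boto3.client('quicksight', region_name='us-west-2')

# Block until CloudFormation finishes creating the stack.
cloudformation.get_waiter('stack_create_complete').wait(StackName='SecuredDataSet')

# Confirm that the new dataset exists and that row-level security is attached.
response = quicksight.describe_data_set(
    AwsAccountId='<YOUR_ACCOUNT_ID>',
    DataSetId='NEW_DATA_SET_ID'
)
print(response['DataSet']['Name'])
print(response['DataSet'].get('RowLevelPermissionDataSet'))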

Conclusion

Though the functionality is relatively new, I consider the API and AWS CloudFormation capabilities to be one of QuickSight’s biggest strengths. Automated validation and enforcement of security rules allows for scale and better security. Being able to manage dataset definitions using AWS CloudFormation provides repeatability, and all of this sets you up for automation. The API and AWS CloudFormation provide tooling to customize QuickSight to suit your workflow, bringing BI into your organization’s cloud management strategy.

If you are looking for related information about dashboard management and migration in QuickSight, refer to Migrate Amazon QuickSight across AWS accounts.


About the Author

Jeremy Winters is an Architect in the AWS Data Lab, where he helps customers design and build data applications to meet their business needs. Prior to AWS, Jeremy built cloud and data applications for consulting customers across a variety of industries.

Visualize MongoDB data from Amazon QuickSight using Amazon Athena Federated Query

Post Syndicated from Soujanya Konka original https://aws.amazon.com/blogs/big-data/visualize-mongodb-data-from-amazon-quicksight-using-amazon-athena-federated-query/

In this post, you will learn how to use Amazon Athena Federated Query to connect a MongoDB database to Amazon QuickSight in order to build dashboards and visualizations.

Amazon Athena is a serverless interactive query service, based on Presto, that provides full ANSI SQL support to query a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet, that are stored on Amazon Simple Storage Service (Amazon S3). For data that isn’t stored on Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store it in Amazon S3. With Athena Federated Query, you can run SQL queries across data that is stored in relational, non-relational, object, and custom data sources.

MongoDB is a popular NoSQL database option for websites and API endpoints. You can choose to deploy MongoDB as a self-hosted or fully managed database. These databases are a popular choice for UI applications that manage user profiles, product catalogs, profile views, clickstream events, events from connected devices, and so on. QuickSight is a serverless business analytics service with built-in machine learning (ML) capabilities that can automatically look for patterns and outliers, and has the flexibility to embed dashboards in applications for a data-driven experience. You can also use QuickSight Q to allow users to ask questions using natural language and find answers to business questions immediately.

Overview of Athena Federated Query

Athena Federated Query uses data source connectors that run on AWS Lambda to run federated queries against other data sources. Prebuilt data source connectors are available for native stores like Amazon Timestream, Amazon CloudWatch Logs, and Amazon DynamoDB, and for external sources like Vertica and SAP HANA. You can also write a connector by using the Athena Query Federation SDK. You can customize Athena's prebuilt connectors for your own use, or modify a copy of the source code to create your own AWS Serverless Application Repository package.

Solution overview

The following architecture diagram shows the components of the Athena Federated Query MongoDB connector. It contains the following components:

  • A virtual private cloud (VPC) configured with public and private subnets across three Availability Zones.
  • A MongoDB cluster with customizable Amazon Elastic Block Store (Amazon EBS) storage deployed in private subnets and NAT gateways in a public subnet for outbound internet connectivity for MongoDB instances.
  • Bastion hosts in an auto scaling group with Elastic IP addresses to allow inbound SSH access.
  • An AWS Identity and Access Management (IAM) MongoDBnode role with Amazon Elastic Compute Cloud (Amazon EC2) and Amazon S3 permissions.
  • Security groups to enable communication within the VPC.
  • Lambda functions deployed in a private subnet accessing S3 buckets. Athena invokes the Lambda function, which in turn fetches the data from MongoDB and maps the response back to Athena.
  • AWS Secrets Manager through a VPC endpoint.

Prerequisites

To implement the solution, you need the following:

  • An AWS account to access AWS services.
  • An IAM user with permission to CreateRole, ListRoles, GetPolicy, and AttachRolePolicy.
  • An IAM user with an access key and secret key to configure an integrated development environment (IDE).
  • A MongoDB database. You can deploy a hosted MongoDB on Amazon EC2 or MongoDB Atlas in a VPC.
  • If you don’t have a QuickSight subscription configured, sign up for one. You can access the QuickSight free trial as part of the AWS Free Tier option.
  • A new secret in Secrets Manager to store your MongoDB user name and password.
  • Data loaded into your MongoDB database. For this example, we used an airline dataset. Load the sample data either from the MongoDB command line or the MongoDB Atlas user interface, if using MongoDB Atlas.

Configure a Lambda connector

The first step in the deployment is to set up the connector environment. Athena uses data source connectors that run on Lambda to run federated queries. To connect with MongoDB, use the Amazon Athena DocumentDB Connector, which also works with any endpoint that is compatible with MongoDB.

To configure a Lambda connector, complete the following steps:

  1. On the Athena console, choose Data sources in the navigation pane.
  2. To view a published list of data sources for Athena, select Amazon DocumentDB.
  3. Choose Next.
  4. In the Data source details section, give your data source a unique name; for example, ds_mongo.
    This will be the connection name that appears under Data sources for Athena.
  5. Choose Create Lambda function.
    This launches the Create function page in Lambda. The connector is deployed by using AWS Serverless Application Repository.
  6. For SecretNameOrPrefix, enter mongo.
  7. For SpillBucket, enter spl-mongo-athena-test.
  8. For AthenaCatalogName, enter us-west-mongo-cat.
  9. For DocDBConnectionString (the MongoDB connection), enter the following:
    mongodb://${docdb_instance_1_creds}@replace_with_mongodb_private_ip:27017/?authSource=admin&readPreference=secondaryPreferred&retryWrites=false; 

  10. For SecurityGroupIds, choose the security group that you want to associate with the function. Make sure that the security group of the MongoDB instance allows traffic from the Lambda function.
  11. For SpillPrefix, enter athena-spill.
  12. For Subnetids, enter the subnet IDs of subnets with MongoDB instances.
    In this case, LambdaMemory and LambdaTimeout have been set to the maximum values, but these can vary depending on the query run and memory requirements. SpillBucket is an S3 bucket in your account to store data that exceeds the Lambda function response size limits.
  13. Keep the rest as defaults.
  14. Select the acknowledgement check box and choose Deploy.
    The connection function is launched based on the given parameters.
  15. Create a VPC endpoint to allow the Lambda function to access Amazon S3 through an endpoint.
    This is for the spill bucket. The spill bucket is a staging area for copying the results of the queries that are performed on MongoDB via Athena federation. This is so that the Lambda function in the VPC can access Amazon S3.
  16. Go back to the Athena console.
  17. Under Connection details, for Lambda function, choose the newly created Lambda function.
  18. Choose Next.

  19. To verify the connection, on the Athena console, choose Data sources, then choose ds_mongo.
    Associated databases from the connection should be listed.

    You should now be able to query the datasets from the Athena query editor by using SQL.
  20. In the query editor, for Data Source, choose ds_mongo.

Athena federates the query using the connector, which invokes the Lambda function. The function then runs the query against MongoDB, and the results are translated back to Athena. For example, a simple SELECT against the airlines dataset in the Athena query editor returns documents from the MongoDB collection.
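If you prefer to run such a query programmatically, the following Boto3 sketch issues it through Athena (the database name and output location are assumptions; the catalog matches the ds_mongo data source created earlier):

import boto3

athena = boto3.client('athena', region_name='us-west-2')

# Run a query against the federated MongoDB catalog; Athena invokes the
# Lambda connector behind the scenes.
query = athena.start_query_execution(
    QueryString='SELECT * FROM airlines LIMIT 10',
    QueryExecutionContext={'Catalog': 'ds_mongo', 'Database': 'sample_airlines'},
    ResultConfiguration={'OutputLocation': 's3://<YOUR_QUERY_RESULTS_BUCKET>/athena-results/'}
)
print(query['QueryExecutionId'])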

Create a dataset on QuickSight to read the data from MongoDB

Before you launch QuickSight for the first time in an AWS account, you must set up an account. For instructions, see Signing in to Amazon QuickSight.

After the initial setup, you can create a dataset with Athena as the source. The QuickSight service role needs permission to invoke the Lambda function that connects MongoDB. The aws-quicksight-service-role-v0 service role is automatically created with the QuickSight account.

To create a dataset in QuickSight, complete the following steps:

  1. On the IAM console, in the navigation pane, choose Roles.
  2. Search for the role aws-quicksight-service-role-v0 and add the Lambda full access permission.
    In an organization, there could be different data stores based on data load and consumption patterns. Examples include catalog or manual data that is associated with products in a MongoDB or key-value index store, transactions or sales data in a SQL database, and images or video clips that are associated with the product in an object store.
    In this case, an airlines table from MongoDB is joined with a flat file that contains information on the airports.
  3. Use the QuickSight cross-data store feature to join data from different sources on common fields.
  4. We then update the data types for our geographic fields, such as city, country, latitude, and longitude, so we can build maps later.
  5. You can also create calculated fields while preparing your dataset, which allows you to reuse them in other QuickSight analyses.

With a few clicks, you should be able to create a dashboard with the published dataset. For instance, you can plot your data on a map, show trends in a line chart, and add autonarratives from the list of Suggested Insights to create the analysis shown in the following screenshot.

Clean up

Make sure to clean up your resources to avoid resource spend and associated costs. You need to delete the EC2 instances with MongoDB. In the case of MongoDB Atlas, you can delete the databases and tables. Delete the Athena data source ds_mongo and unsubscribe your QuickSight account from the Manage QuickSight admin page.

Conclusion

With QuickSight and Athena Federated Query, organizations can access additional data sources beyond those already supported by QuickSight. If you have data in sources other than Amazon S3, you can use Athena Federated Query to analyze the data in place or build pipelines that extract and store data in Amazon S3. Athena now also supports cross-account federated queries to enable teams of analysts, data scientists, and data engineers to query data stored in other AWS accounts. Try connecting to proprietary data formats and sources, or build new user-defined functions, with the Athena Query Federation SDK.


About the Author

Soujanya Konka is a Solutions Architect and Analytics specialist at AWS, focused on helping customers build their ideas on the cloud. She has expertise in the design and implementation of business information systems and data warehousing solutions. Before joining AWS, Soujanya had stints with companies such as HSBC and Cognizant.

Nilesh Parekh is a Partner Solutions Architect with the ISV India segment. Nilesh helps partners review and remediate their workloads running on AWS based on the AWS Well-Architected and Foundational Technical Review best practices. He also helps partners with application modernization and delivering POCs.

Enable Amazon QuickSight federation with Google Workspace

Post Syndicated from Sriharsh Adari original https://aws.amazon.com/blogs/big-data/enable-amazon-quicksight-federation-with-google-workspace/

Amazon QuickSight is a scalable, serverless, embeddable, machine learning (ML)-powered business intelligence (BI) service built for the cloud that supports identity federation in both Standard and Enterprise editions. Organizations are working towards centralizing their identity and access strategy across all of their applications, including on-premises, third-party, and applications on AWS. Many organizations use Google Workspace to control and manage user authentication and authorization centrally. You can enable federation to QuickSight accounts without needing to create and manage users. This authorizes users to access QuickSight assets—analyses, dashboards, folders, and datasets—through centrally managed Google Workspace Identities.

In this post, we go through the steps to configure federated single sign-on (SSO) between a Google Workspace instance and QuickSight account. We demonstrate registering an SSO application in Google Workspace, and map QuickSight roles (admin, author, and reader) to Google Workspace Identities. These QuickSight roles represent three different personas supported in QuickSight. Administrators can publish the QuickSight app in a Google Workspace Dashboard to enable users to SSO to QuickSight using their Google Workspace credentials.

Solution overview

In your organization, the portal is typically a function of your identity provider (IdP), which handles the exchange of trust between your organization and QuickSight.

On the Google Workspace Dashboard, you can review a list of apps. This post shows you how to configure the custom app for AWS.

The user flow consists of the following steps:

  1. The user logs in to your organization’s portal and chooses the option to go to the QuickSight console.
  2. The portal verifies the user’s identity in your organization.
  3. The portal generates a SAML authentication response that includes assertions that identify the user and include attributes about the user. The portal sends this response to the client browser. Although not discussed here, you can also configure your IdP to include a SAML assertion attribute called SessionDuration that specifies how long the console session is valid.
  4. The client browser is redirected to the AWS single sign-on endpoint and posts the SAML assertion.
  5. The endpoint requests temporary security credentials on behalf of the user, and creates a QuickSight sign-in URL that uses those credentials.
  6. AWS sends the sign-in URL back to the client as a redirect.
  7. The client browser is redirected to the QuickSight console. If the SAML authentication response includes attributes that map to multiple AWS Identity and Access Management (IAM) roles, the user is first prompted to select the role for accessing the console.

The following diagram illustrates the solution architecture.

The following are the high-level steps to set up federated single sign-on access via Google Workspace:

  1. Download the Google IdP information.
  2. Create an IAM IdP with Google as SAML IdP.
  3. Configure IAM policies for QuickSight roles.
  4. Configure IAM QuickSight roles for federated users.
  5. Create a custom user attribute in Google Workspace.
  6. Add the AWS SAML attributes to your Google Workspace user profile.
  7. Set up the AWS SAML app in Google Workspace.
  8. Grant access to users in Google Workspace.
  9. Verify federated access to your QuickSight instance.

Detailed procedures for each of these steps comprise the remainder of this post.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • A Google Workspace subscription
  • An AWS account with QuickSight subscription
  • Basic understanding of QuickSight roles—admin, author, and reader
  • Basic understanding of IAM and privileges required to create an IAM identity provider, roles, policies, and users

Download the Google IdP information

First, let’s get the SAML metadata that contains essential information to enable your AWS account to authenticate the IdP and locate the necessary communication endpoint locations. Complete the following steps:

  1. Log in to the Google Workspace Admin console.
  2. On the Admin console home page, under Security in the navigation pane, choose Authentication and SSO with SAML applications.
  3. Under IdP metadata, choose Download Metadata.

Create an IAM IdP with Google as SAML IdP

You now configure Google Workspace as your SAML IdP via the IAM console. Complete the following steps:

  1. On the IAM console, choose Identity providers in the navigation pane.
  2. Choose Add provider.
  3. For Configure provider, select SAML.
  4. For Provider name, enter a name for the IdP (such as Google).
  5. For Metadata document, choose Choose file and specify the SAML metadata document that you downloaded.
  6. Choose Add provider.
  7. Document the Amazon Resource Name (ARN) by viewing the IdP you just created.

The ARN should look similar to arn:aws:iam::<YOURACCOUNTNUMBER>:saml-provider/Google. We need this ARN to configure claim rules later in this post.

Configure IAM policies for QuickSight roles

In this step, we create three IAM policies for different role permissions in QuickSight:

  • QuickSight-Federated-Admin
  • QuickSight-Federated-Author
  • QuickSight-Federated-Reader

Use the following steps to set up the QuickSight-Federated-Admin policy. This policy grants admin privileges in QuickSight to the federated user:

  1. On the IAM console, choose Policies.
  2. Choose Create policy.
  3. Choose JSON and replace the existing text with the following code:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateAdmin",
                "Resource": "*"
            }
        ]
    }

  4. Choose Review policy.
  5. For Name, enter QuickSight-Federated-Admin.
  6. Choose Create policy.
  7. Repeat these steps to create QuickSight-Federated-Author, and use the following policy to grant author privileges in QuickSight to the federated user:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateUser",
                "Resource": "*"
            }
        ]
    }

  8. Repeat the steps to create QuickSight-Federated-Reader, and use the following policy to grant reader privileges in QuickSight to the federated user:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "quicksight:CreateReader",
                "Resource": "*"
            }
        ]
    }

Configure IAM QuickSight roles for federated users

Next, create the roles that Google IdP users assume when federating into QuickSight. The following steps set up the admin role:

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. For Trusted entity type, choose SAML 2.0 federation.
  4. For SAML provider, choose the provider you created earlier (Google).
  5. For Attribute, choose SAML:aud.
  6. For Value, enter https://signin.aws.amazon.com/saml.
  7. Choose Next.
  8. On the Add permissions page, select the QuickSight-Federated-Admin IAM policy you created earlier.
  9. Choose Next.
  10. For Role name, enter QuickSight-Admin-Role.
  11. For Role description, enter a description.
  12. Choose Create role.
  13. On the IAM console, in the navigation pane, choose Roles.
  14. Choose the QuickSight-Admin-Role role you created to open the role’s properties.
  15. On the Trust relationships tab, choose Edit trust relationship.
  16. Under Trusted entities, verify that the IdP you created is listed.
  17. Under Condition, verify that SAML:aud with a value of https://signin.aws.amazon.com/saml is present.
  18. Repeat these steps to create author and reader roles and attach the appropriate policies:
    1. For QuickSight-Author-Role, use the policy QuickSight-Federated-Author.
    2. For QuickSight-Reader-Role, use the policy QuickSight-Federated-Reader.
  19. Navigate to the newly created roles and note the ARNs for them.

We use these ARNs to configure claims rules later in this post. They are in the following format:

  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Admin-Role
  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Author-Role
  • arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Reader-Role
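If you prefer to script these IAM steps rather than use the console, the following Boto3 sketch mirrors them for the admin role (the metadata file name is an assumption; repeat the policy and role creation for the author and reader roles):

import json
import boto3

iam = boto3.client('iam')

# Create the SAML identity provider from the downloaded Google metadata.
with open('GoogleIDPMetadata.xml') as f:
    saml_provider = iam.create_saml_provider(Name='Google', SAMLMetadataDocument=f.read())

# Trust policy that lets users federated through the Google IdP assume the role.
trust_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Federated': saml_provider['SAMLProviderArn']},
        'Action': 'sts:AssumeRoleWithSAML',
        'Condition': {'StringEquals': {'SAML:aud': 'https://signin.aws.amazon.com/saml'}}
    }]
}

admin_policy = iam.create_policy(
    PolicyName='QuickSight-Federated-Admin',
    PolicyDocument=json.dumps({
        'Version': '2012-10-17',
        'Statement': [{'Effect': 'Allow', 'Action': 'quicksight:CreateAdmin', 'Resource': '*'}]
    })
)

iam.create_role(RoleName='QuickSight-Admin-Role',
                AssumeRolePolicyDocument=json.dumps(trust_policy))
iam.attach_role_policy(RoleName='QuickSight-Admin-Role',
                       PolicyArn=admin_policy['Policy']['Arn'])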

Create a custom user attribute in Google Workspace

Now let’s create a custom user attribute in your Google Workspace. This allows us to add the SAML attributes that the AWS Management Console expects in order to allow a SAML-based authentication.

  1. Log in to Google Admin console with admin credentials.
  2. Under Directory, choose Users.
  3. On the More options menu, choose Manage custom attributes.
  4. Choose Add Custom Attribute.
  5. Configure the custom attribute as follows:
    1. Category: Amazon
    2. Description: Amazon Custom Attributes
  6. For Custom fields, enter the following:
    1. Name: Role
    2. Info type: Text
    3. Visibility: Visible to user and admin
    4. No. of values: Multi-value
  7. Choose Add.

The new category appears on the Manage user attributes page.

Add the AWS SAML attributes to the Google Workspace user profile

Now that we have configured a custom user attribute, let’s add the SAML attributes that we noted earlier to the Google Workspace user profile.

  1. While logged in to the Google Admin console with admin credentials, navigate to the Users page.
  2. In the Users list, find the user. If you need help, see Find a user account.
  3. Choose the user’s name to open their account page.
  4. Choose User information.
  5. Choose the custom attribute you recently created, named Amazon.
  6. Add a value to this custom attribute using the ARNs you noted earlier, in the format <AWS Role ARN>,<AWS provider/IdP ARN>; for example, arn:aws:iam::<YOURACCOUNTNUMBER>:role/QuickSight-Admin-Role,arn:aws:iam::<YOURACCOUNTNUMBER>:saml-provider/Google.
  7. Choose Save.

Set up the AWS SAML app in Google Workspace

Now that we have everything in place, we’re ready to create a SAML app within our Google Workspace account and provide the QuickSight instance starting URL. This provides the entry point for Google Workspace users to SSO into the QuickSight instance.

  1. While logged in to Google Admin console with admin credentials, under Apps, choose Web and mobile apps.
  2. Choose Add App, and Search for apps.
  3. Enter Amazon Web Services in the search field.
  4. In the search results, hover over the Amazon Web Services SAML app and choose Select.
  5. On the Google Identity Provider details page, choose Continue.
  6. On the Service provider details page, the ACS URL and Entity ID values for Amazon Web Services are configured by default.
  7. For Start URL, enter https://quicksight.aws.amazon.com.
  8. On the Attribute Mapping page, choose the Select field menu and map the following Google directory attributes to their corresponding Amazon Web Services attributes:

    • Map the Google directory attribute Basic Information > Primary Email to the Amazon Web Services attribute https://aws.amazon.com/SAML/Attributes/RoleSessionName
    • Map the Google directory attribute Amazon > Role to the Amazon Web Services attribute https://aws.amazon.com/SAML/Attributes/Role

  9. Choose Finish.

Grant access to users in Google Workspace

When the SAML app is created in Google Workspace, it's turned off by default. This means the SAML app isn't visible to users logged in to their Google Workspace account. We now enable the AWS SAML app for your Google Workspace users.

  1. While logged in to the Google Admin console with admin credentials, navigate to the Web and mobile apps page.
  2. Choose Amazon Web Services.

  3. Choose User access.
  4. To turn on a service for everyone in your organization, choose ON for everyone.
  5. Choose Save.

If you don’t want to activate this application for all users, you can alternatively grant access to a subset of users by using Google Workspace organizational units.

Verify federated access to the QuickSight instance

To test your SAML 2.0-based authentication with QuickSight for users in your existing IdP (Google Workspace), complete the following steps:

  1. Open a new browser session, for example, using Chrome, in a new incognito window.
  2. Log in to your Google Workspace account (for the purpose of this demo, we use the Google Workspace admin account).
  3. Choose Amazon Web Services from the list of Google apps.

Conclusion

This post provided a step-by-step guide for configuring Google Workspace as your IdP, and using IAM roles to enable SSO to QuickSight. Now your users have a seamless sign-in experience to QuickSight and have the appropriate level of access related to their role.

Although this post demonstrated the integration of IAM and Google Workspace, you can replicate this solution using your choice of SAML 2.0 IdPs. For other supported federation options, see Using identity federation and single sign-on (SSO) with Amazon QuickSight.

To get answers to your questions related to QuickSight, refer to the QuickSight Community.

If you have any questions or feedback, please leave a comment.


About the Authors

Sriharsh Adari is a Senior Solutions Architect at Amazon Web Services (AWS), where he helps customers work backwards from business outcomes to develop innovative solutions on AWS. Over the years, he has helped multiple customers on data platform transformations across industry verticals. His core areas of expertise include Technology Strategy, Data Analytics, and Data Science. In his spare time, he enjoys playing sports, binge-watching TV shows, and playing Tabla.

Srikanth Baheti is a Specialized World Wide Sr. Solutions Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences & eResearch Technology Inc, where he was responsible for designing and developing high-traffic web applications and highly scalable, maintainable data pipelines for reporting platforms using AWS services and serverless computing.

Tips and tricks for high-performant dashboards in Amazon QuickSight

Post Syndicated from Shekhar Kopuri original https://aws.amazon.com/blogs/big-data/tips-and-tricks-for-high-performant-dashboards-in-amazon-quicksight/

Amazon QuickSight is a cloud-native business intelligence (BI) service. QuickSight automatically optimizes queries and execution to help dashboards load quickly, but you can make your dashboard loads even faster and make sure you're getting the best possible performance by following the tips and tricks outlined in this post.

Data flow and execution of QuickSight dashboard loads

The data flow in QuickSight starts from the client browser to the web server and then flows to the QuickSight engine, which in some cases executes queries against SPICE—a Super-fast, Parallel, In-memory Calculation Engine—or in other cases directly against the database. SPICE uses a combination of columnar storage, in-memory technologies enabled through the latest hardware innovations, and machine code generation to run interactive queries on large datasets and get rapid responses.

The web server, QuickSight engine, and SPICE are auto scaled by QuickSight. This is a fully managed service—you don’t need to worry about provisioning or managing infrastructure when you want to scale up a particular dashboard from tens to thousands of users on SPICE. Dashboards built against direct query data sources may require provisioning or managing infrastructure on the customer side.

The following diagram illustrates the data flow:

Let’s look at the general execution process to understand the implications:

  • A request is triggered in the browser, leading to several static assets such as JavaScript, fonts, and images being downloaded.
  • All the metadata (such as visual configurations and layout) is fetched for the dashboard.
  • Queries are performed, which may include setting up row-level and column-level security, or fetching dynamic control values, default parameters, and all values of drop-downs in filter controls.
  • Up to your concurrency limit, the queries to render your visuals run in a specific sequence (described later in this post). If you’re using SPICE, the concurrency of queries is much higher. Pagination within visuals may lead to additional queries.

The actual execution is more complex and depends on how dashboards are configured, as well as other factors such as the data source types, direct query vs. SPICE, the cardinality of fields, and how often data is refreshed. Many operations run in parallel, and all visual-related queries run via WebSocket, as shown in the following screenshot. Many of the steps run in the end-user's browser, so there are limitations on the number of sequences and workloads that can be pushed onto the browser. Performance may also differ slightly based on the browser type, because each browser handles contention differently.

Now let’s look at many great tips that can improve your dashboard’s performance!

SPICE

Utilizing the capabilities of SPICE when possible is a great way to boost overall performance because SPICE manages scaling as well as caching results for you. We recommend using SPICE whenever possible.

Metadata

As seen in the preceding execution sequence, QuickSight fetches metadata up front for a given dashboard during the initial load. We recommend the following actions regarding metadata.

Remove unused datasets from analysis

Datasets that were used in the past but no longer have any visual associated with them in the dashboard add to the metadata payload unnecessarily and are likely to impact dashboard performance.

Make sure your row-level and column-level security is performant

Row-level security, column-level security, and dynamic default parameters each require lookups to take place before the visual queries are issued. When possible, try to limit the number and complexity of your rules datasets to help these lookups run faster. Use SPICE for your rules dataset when possible. If you must use a direct query, make sure that the queries are optimal and that the data source you're querying is scaled appropriately up front.

For embedded dashboards, a great way to optimize row-level security lookups is by utilizing session tags for row-level security paired with an anonymous identity. Similarly, dynamic default parameters, if used, can be evaluated in the host application up front and passed using the embedding SDK.
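For example, an embedding backend can pass the row-level security values as session tags when it generates an anonymous embed URL. The following Boto3 sketch illustrates the idea (the account ID, dashboard ID, and tag key are assumptions, and the dashboard's rules dataset must be configured for tag-based rules):

import boto3

quicksight = boto3.client('quicksight', region_name='us-west-2')

# Pass the row-level security value at embed time, so no per-user rules
# lookup is needed when the dashboard loads.
dashboard_arn = 'arn:aws:quicksight:us-west-2:111122223333:dashboard/sales-dashboard'
response = quicksight.generate_embed_url_for_anonymous_user(
    AwsAccountId='111122223333',
    Namespace='default',
    AuthorizedResourceArns=[dashboard_arn],
    ExperienceConfiguration={'Dashboard': {'InitialDashboardId': 'sales-dashboard'}},
    SessionTags=[{'Key': 'region', 'Value': 'us-west'}],
    SessionLifetimeInMinutes=60
)
print(response['EmbedUrl'])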

Calculated functions

In this section, we offer tips regarding calculated functions.

Move calculations to the data prep stage

QuickSight allows you to add calculated fields in the data prep or analysis experiences. We strongly encourage you to move as many calculations as possible to the data prep stage, which allows QuickSight to materialize calculations that don't contain aggregations or parameters into the SPICE dataset. Materializing calculated fields in the dataset reduces runtime calculations, which improves query performance. Even if you're using aggregation or parameters in your calculation, it might still be possible to move parts of the calculation to data prep. For instance, if you have a formula that wraps an ifelse() inside a sum(), you can remove the sum() and keep just the ifelse(), which allows QuickSight to materialize (precompute) it and save it as a real field in your SPICE dataset. Then you can either add another calculation that sums it up, or simply use the sum aggregation when you add the field to your visuals.

Generally, materializing calculations that use complex ifelse logic or do string manipulation or lookups results in the greatest improvements in dashboard performance.

Implement the simplified ifelse syntax

The ifelse function supports simplified statements: instead of nesting a new ifelse() for each additional condition, you can pass multiple condition-and-result pairs, followed by a single else value, to one ifelse() call. The simplified, non-nested statement is more performant.

Use the toString() function judiciously

The toString() function has much lower performance and is much heavier on the database engine than simple integer or number-based arithmetic calculations. Therefore, you should use it sparingly.

Know when nulls are returned by the system and use null value customization

Most authors make sure that null conditions on calculated fields are handled gracefully. QuickSight often handles nulls gracefully for you, and you can use that to your advantage to keep calculations simpler. For example, division by 0 is already handled by QuickSight and returns null, so you can write the division directly instead of wrapping it in an ifelse() that checks for a zero denominator.

If you need to represent nulls on visuals with a static string, QuickSight allows you to set custom values when a null value is returned in a visual configuration. In the preceding example, you could just set a custom value of 0 in the formatting option. Removing such handling from the calculated fields can significantly help query performance.

On-sheet filters vs. parameters

Parameters are seemingly a very simple construct, but they can quickly get complicated, especially when used in nested calculation functions or in controls. Parameters are all evaluated on the fly, forcing all the dependencies to be handled in real time. Ask yourself if each parameter is really required. In some cases, you may be able to replace one with a simple dropdown filter control, as in the case of a $market parameter that only filters the data by market.

Instead of creating a control parameter to use in a calculated field, you might be able to use the field with a dropdown filter control.

Text field vs. Dropdown (or List) filter controls

When you design an analysis, you can add a filter control for the visuals you want to filter. If the data type of the field is string, you have several choices for the type of filter control. A Text field control, which displays a text box where you can enter one or more entries, is suggested for better performance. A Dropdown (or List) control has to fetch all the values up front to populate the list from which you select one or more values, which is more expensive.

On-sheet controls

The control panel at the top of the dashboard is collapsible by default, but a setting allows you to publish the dashboard with the panel in an expanded state. If this setting is enabled, QuickSight prioritizes the calls that fetch the controls' values before the visuals load. If any of the controls have high cardinality, this could impact how quickly the dashboard loads. Evaluate this need against the fact that QuickSight persists last-used control values, so the reader might not actually need to adjust controls as a first step.

Visual types: Charts

In this section, we provide advice when using Charts.

Use ‘Hide the “other” category’ when your dimension has fewer values than the cutoff limit

You can choose to limit how many data points to display in your visual before they are added to the other category. This category contains the aggregated data for all the data beyond the cutoff limit for the visual type you are using – either the limit you impose or the one based on display limits. If you know your dimension has fewer values than the cutoff limit, use this option; it improves your dashboard performance.

The other category does not show on scatter plots, heat maps, maps, tables (tabular reports), or key performance indicators (KPIs). It also doesn’t show on line charts when the x-axis is a date.

Visual types: Tables and pivot tables

In this section, we provide advice when using tables and pivot tables.

Use the Values field well when displaying a raw table view

If you want to output all the raw data into a table, you can use Group by fields, Values fields, or a mix of them. The most performant approach is to place every field into Values. When you use Group by, a query first runs under the hood, followed by the Group by operation, so all the data is pulled from the database, which is expensive.

Deploy a minimal set of rows, columns, metrics, and table calculations

If you include too many combinations of rows, columns, metrics, and table calculations in one pivot table, you risk overwhelming the viewer. You can also run into the computational limitations of the underlying database. To reduce the level of complexity and potential errors, you can take the following actions:

  • Apply filters to reduce the data included in the visual
  • Use fewer fields in the Row and Column field wells
  • Use as few fields as possible in the Values field well
  • Create additional pivot tables so that each displays fewer metrics
  • Reduce subtotals, totals and conditional formatting when possible

Uncollapsed columns are the simplest case and will likely remain more performant in all but a few cases.

Visual queries sequence

Individual visuals run left to right, then top to bottom. Understanding this sequence can be helpful: you can rearrange visuals on your dashboard without losing context. Place heavier visuals further down in the dashboard, and place lightweight KPI and insight visuals near the top to display "above-the-fold" content sooner, which improves your readers' perception of the dashboard's performance.

Embedding

Our final set of recommendations are in regards to embedding.

Remove user management flows from the critical path

Most times, user management and authentication flows (such as DescribeUser and RegisterUser APIs) can run asynchronously on the host application.

Consider registering the user in advance before the actual embedding, so that the overhead is removed from every analytics page visit.
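For instance, a backend job can pre-register federated users ahead of their first visit, along the lines of the following sketch (the account ID, role ARN, email, and session name are assumptions):

import boto3

quicksight = boto3.client('quicksight', region_name='us-west-2')

# Pre-register a federated reader so the registration cost isn't paid
# during the first dashboard page load.
quicksight.register_user(
    AwsAccountId='111122223333',
    Namespace='default',
    IdentityType='IAM',
    IamArn='arn:aws:iam::111122223333:role/QuickSight-Embedded-Reader-Role',
    SessionName='user@example.com',
    Email='user@example.com',
    UserRole='READER'
)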

Authenticate the user on your website in advance, and acquire any Amazon Cognito or AWS Security Token Service (Amazon STS) session tokens (if required) in advance (for example, at user login time or home page visit). This reduces additional runtime latency overhead when a user visits an analytics page.

Move workloads from clients to the web server or backend services

If a QuickSight dashboard is embedded on a webpage in the host application that also performs other activities, pay close attention to the sequence of API calls on the host. The QuickSight dashboard load might be gated by other heavy API calls on the host application. Move the logic to the web server or backend services as much as possible to limit contention in the browser.

Don’t tear down the embedding iFrame when the user navigates away from analytics section

When the user moves temporarily to a non-analytics page of your web application (especially in single-page applications), instead of removing the embedding iframe from the DOM, you can hide it from the user while keeping the iframe in the page's DOM. This allows you to resume the same session when the user navigates back to the analytics section of your application, so they don't need to wait for a reload.

Use navigateToDashboard() and navigateToSheet() whenever possible

If you have multiple dashboards on your host application that don’t need to load concurrently, you can optimize the authentication flow by utilizing two APIs we expose, navigateToDashboard() or navigateToSheet(), in our JavaScript SDK. These APIs reuse the same iFrame for each load, while reusing the authentication token.

This technique has proven to be very effective for many of our embedding users.

For more information about these APIs, refer to Amazon QuickSight Embedding SDK.

Conclusion

In this post, we shared some tips and tricks for tuning the performance of your QuickSight dashboards. In 2021, we doubled our SPICE data limits to 500 million rows of data per dataset. In addition, incremental data refresh is available for SQL-based data sources such as Amazon Redshift, Amazon Athena, Amazon RDS, Amazon Aurora, PostgreSQL, MySQL, Oracle, SQL Server, MariaDB, Presto, Teradata or Snowflake up to every 15 minutes, which cuts down time between data updates by 75%. In 2022, we continue to innovate on your behalf to make QuickSight dashboard loads even more performant.

We look forward to your feedback on how these tips and tricks helped your dashboards load faster.


About the Authors

Shekhar Kopuri is a Senior Software Development Manager for Amazon QuickSight. He leads the frontend platform engineering team that focuses on various aspects of the frontend experience, including website performance. Before joining AWS, Shekhar led development of multiple provisioning and activation network OSS applications for a large global telecommunications service provider.

Blake Carroll is a Senior Frontend Engineer for Amazon QuickSight. He works with the frontend platform engineering team with a focus on website performance and has previously been the frontend lead for initial reporting and theming functionality in QuickSight. Prior to joining Amazon, Blake was a co-founder in the digital interactive agency space working with national brands to produce creative web experiences.

Vijay Chaudhari is a Senior Software Development Engineer for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Vijay started his career with IBM, writing software for the Information Management group. At Amazon, he has built backend applications for retail systems, and near real-time data pre-computation, reporting and analytics systems at Amazon scale. He is passionate about learning and solving new customer problems, and helping them adopt cloud native technologies.

Wakana Vilquin-Sakashita is a Specialist Solutions Architect for Amazon QuickSight. She works closely with customers to help them make sense of their data through visualization. Previously, Wakana worked for S&P Global, assisting customers in accessing data, insights, and research relevant to their business.

Coming June 2022: An updated Amazon QuickSight dashboard experience

Post Syndicated from Rushabh Vora original https://aws.amazon.com/blogs/big-data/coming-june-2022-an-updated-amazon-quicksight-dashboard-experience/

Starting June 30, 2022, Amazon QuickSight is introducing the new look and feel for your dashboards. In this post, we walk through the changes to expect with the new look. The new dashboard experience includes the following improvements:

  • Simplified toolbar
  • Discoverable visual menu
  • Polished controls, menu, and submenus
  • Non-blocking right pane for secondary experiences like filters, threshold alerts, and downloads

Simplified toolbar

The simplified toolbar experience offers updated icons for key actions for better visual clarity.

The following screenshot shows the old look.

The following screenshot shows the new look with updated icons.

Discoverable visual menu

The visual menu is visible on-hover to improve the discoverability of drills, export, and filter restatements.

To use the visual menu with the old version, you needed to select the visual.

With the new look, you can view the menu by hovering over the visual.

Polished controls, menu, and submenus

The new look features improved controls, menus, submenus, toast notifications, and other dashboard elements to provide a more polished visual experience.

For example, the following screenshot shows the old look for calendar controls.

The following screenshot shows the new look.

The following screenshot shows the old look for the menu and submenus.

The following screenshot shows the new look.

Non-blocking right pane

The new look features a non-blocking right pane for secondary experiences like filters, threshold alerts, and downloads, to improve focus on the content of the dashboard.

The following animation shows the old look for ad-hoc filtering.

The filters are now moved to the right pane.

The following animation shows the old look for creating threshold alerts.

Threshold alert creation is now in the right pane.

The following animation shows the old look for the downloading experience.

The new look offers a downloads pane.

Summary

The new look for the QuickSight dashboard experience will be available starting June 30, 2022. If you have any questions or feedback, please reach out to us by leaving a comment.


About the Author

Rushabh Vora is a Senior Technical Product Manager for Amazon QuickSight, Amazon Web Services’ cloud-native, fully managed BI service. He is passionate about data visualization. Prior to QuickSight, he worked with Amazon Business as a Product Manager.

Top Amazon QuickSight features and updates launched Q1 2022

Post Syndicated from Mia Heard original https://aws.amazon.com/blogs/big-data/top-amazon-quicksight-features-and-updates-launched-q1-2022/

Amazon QuickSight is a serverless, cloud-based business intelligence (BI) service that brings data insights to your teams and end users through machine learning (ML)-powered dashboards and data visualizations, which can be accessed via QuickSight or embedded in apps and portals that your users access. This post shares the top QuickSight features and updates launched in Q1 2022.

Amazon QuickSight Community

In the new Amazon QuickSight Community, you can ask and answer questions, network with and learn from other BI users from across the globe, access learning content, and stay up to date with what’s new on QuickSight all in one place!

Join the global QuickSight Community today!

Groups Management UI

QuickSight now provides a user interface to manage user groups, allowing admins to efficiently and easily manage user groups via the QuickSight admin console. Groups Management UI is available to administrators with access to QuickSight admin console pages via AWS Identity and Access Management (IAM) credentials.

To learn more, see Creating and managing groups in Amazon QuickSight.
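If you prefer to automate group management instead of using the admin console, the same operations are available through the QuickSight API. The following boto3 sketch is illustrative only; the account ID, group name, and member name are made up for the example.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")
account_id = "111122223333"  # placeholder account ID

# Create a group in the default namespace
quicksight.create_group(
    AwsAccountId=account_id,
    Namespace="default",
    GroupName="finance-readers",
    Description="Readers for finance dashboards",
)

# Add an existing QuickSight user to the group
quicksight.create_group_membership(
    AwsAccountId=account_id,
    Namespace="default",
    GroupName="finance-readers",
    MemberName="jdoe",
)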

Comparative and cumulative date and time calculations

Amazon QuickSight authors can now quickly implement advanced date/time calculations without having to use complicated row offsets or pre-computed columns. You can add these calculations in regular business reporting, trend analysis, and time series analysis.

To learn more about the new period functions and their capabilities in various use cases, see Add comparative and cumulative date/time calculations in Amazon QuickSight.

Rich text formatting on visual titles and subtitles 

QuickSight authors can now add rich context to their visuals by choosing from various formatting options such as font type, size, style, and color. You can also better organize the text by choosing from various alignment and ordering options. Visual titles and subtitles now also support hyperlinks as well as parameter-based dynamic text.

To learn more, see Formatting a visual title and subtitle.

Custom subtotals at all levels on pivot table

QuickSight allows you to customize how subtotals are displayed in pivot tables, with options for display at multiple levels and for both rows and columns.

To learn more, see Displaying Totals and Subtotals.

Auto refresh direct query controls

QuickSight now supports automatic refreshes of values displayed in drop-down, multi-select and other controls in dashboards that are in direct query mode. Values within controls are updated every 24 hours to ensure the data is automatically updated without any end-user intervention.

For further details, see Refreshing data in Amazon QuickSight.

Conclusion

QuickSight serves millions of dashboard views weekly, enabling data-driven decision-making in organizations of all sizes, including customers like the NFL, 3M, Accenture, and more.

To stay up to date on all things new with QuickSight, visit What’s New with Analytics!


About the Author

Mia Heard is a product marketing manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service.

Detecting data drift using Amazon SageMaker

Post Syndicated from Shibu Nair original https://aws.amazon.com/blogs/architecture/detecting-data-drift-using-amazon-sagemaker/

As companies continue to embrace the cloud and digital transformation, they use historical data in order to identify trends and insights. This data is foundational to power tools, such as data analytics and machine learning (ML), in order to achieve high-quality results.

This is a time when major disruptions are not only lasting longer, but also happening more frequently, as discussed in a McKinsey article on risk and resilience. Any disruption—a pandemic, hurricane, or even blocked sailing routes—has a major impact on the patterns of data and can create anomalous behavior.

ML models are dependent on data insights to help plan and support production-ready applications. With any disruption, data drift can occur. Data drift is unexpected and undocumented changes to data structure, semantics, and/or infrastructure. If there is data drift, model performance will degrade and no longer provide accurate guidance. To mitigate the effects of the disruption, data drift needs to be detected and the ML models quickly retrained and adjusted accordingly.

This blog post explains how to approach changing data patterns in the age of disruption and how to mitigate its effects on ML models. We also discuss the steps of building a feedback loop to capture the request data in the production environment and create a data pipeline to store the data for profiling and baselining. Then, we explain how Amazon SageMaker Clarify can help detect data drift.

How to detect data drift

There are three stages to detecting data drift: data quality monitoring, model quality monitoring, and drift evaluation (see Figure 1).

Figure 1. Stages in detecting data drift

Data quality monitoring establishes a profile of the input data during model training, and then continuously compares incoming data with the profile. Deviations in the data profile signal a drift in the input data.
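To make the idea concrete, the following toy Python sketch compares an incoming feature distribution against its training baseline with a two-sample Kolmogorov-Smirnov test. The data, feature, and threshold are invented for illustration and are not part of SageMaker Model Monitor itself.

import numpy as np
from scipy.stats import ks_2samp

# Stand-ins for the profiled training data and a window of production data
baseline = np.random.normal(loc=100, scale=15, size=10_000)
incoming = np.random.normal(loc=120, scale=15, size=1_000)

result = ks_2samp(baseline, incoming)
if result.pvalue < 0.01:  # arbitrary significance threshold for the example
    print(f"Possible drift detected (KS statistic={result.statistic:.3f}, p={result.pvalue:.4f})")
else:
    print("No significant deviation from the baseline profile")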

You can also detect drift through model quality monitoring, which requires capturing actual values that can be compared with the predictions. For example, using weekly demand forecasting, you can compare the forecast quantities one week later with the actual demand. Some use cases can require extra steps to collect actual values. For example, product recommendations may require you to ask a selected group of consumers for their feedback on the recommendations.

SageMaker Clarify provides insights into your trained models, including the importance of model features and any biases towards certain segments of the input data. Changes in these attributes between retrained models also signal drift. Drift evaluation comprises the monitoring data and mechanisms that detect changes and trigger consequent actions. With Amazon CloudWatch, you can define rules and thresholds that prompt drift notifications.

Figure 2 illustrates a basic architecture with the data sources for training and production (on the left) and the observed data concerning drift (on the right). You can use Amazon SageMaker Data Wrangler, a visual data preparation tool, to clean and normalize your input data for your ML task. You can store the features that you defined for your models in the Amazon SageMaker Feature Store, a fully managed, purpose-built repository to store, update, retrieve, and share ML features.

The white, rectangular boxes in the architecture diagram represent the tasks for detecting data and model drift. You can integrate those tasks into your ML workflow with Amazon SageMaker Pipelines.

Figure 2. Basic architecture on how data drift is detected using Amazon SageMaker

The drift observation data can be captured in tabular format, such as comma-separated values or Parquet, on Amazon Simple Storage Service (Amazon S3) and analyzed with Amazon Athena and Amazon QuickSight.

How to build a feedback loop

The baselining task establishes a data profile from training data. It uses Amazon SageMaker Model Monitor and runs before training or re-training the model. The baseline profile is stored on Amazon S3 to be referenced by the data drift monitoring job.

The data drift monitoring task continuously profiles the input data, compares it with the baseline, and captures the results in CloudWatch. This task runs on its own compute resources using Deequ, which ensures that the monitoring job doesn’t slow down your ML inference flow and scales with the data. The frequency of running this task can be adjusted to control cost, depending on how rapidly you anticipate that the data may change.
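As a hedged sketch of how the baselining and monitoring tasks might be wired together with the SageMaker Python SDK, the following code suggests a baseline from training data and schedules an hourly data quality monitor. The bucket paths, endpoint name, instance type, and schedule are placeholder assumptions rather than the exact configuration used here.

from sagemaker import get_execution_role
from sagemaker.model_monitor import DefaultModelMonitor, CronExpressionGenerator
from sagemaker.model_monitor.dataset_format import DatasetFormat

role = get_execution_role()

monitor = DefaultModelMonitor(
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    volume_size_in_gb=20,
    max_runtime_in_seconds=3600,
)

# Baseline task: profile the training data and store statistics and constraints on S3
monitor.suggest_baseline(
    baseline_dataset="s3://my-bucket/training/train.csv",   # placeholder path
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://my-bucket/monitoring/baseline",
)

# Data drift monitoring task: profile captured inference data on a schedule
monitor.create_monitoring_schedule(
    monitor_schedule_name="data-drift-monitor",
    endpoint_input="my-endpoint",                            # placeholder endpoint
    output_s3_uri="s3://my-bucket/monitoring/reports",
    statistics=monitor.baseline_statistics(),
    constraints=monitor.suggested_constraints(),
    schedule_cron_expression=CronExpressionGenerator.hourly(),
)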

The model quality monitoring task computes model performance metrics from actuals and predicted values. The origin of these data points depends on the use case. Demand forecasting use cases naturally capture actuals that can be used to validate past predictions. Other use cases can require extra steps to acquire ground-truth data.

CloudWatch is a monitoring and observability service with which you can define rules to act on deviations in model performance or data drift. With CloudWatch, you can set up alerts to users via email or SMS, and it can automatically start the ML model re-training process.
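For example, a drift metric published to CloudWatch could drive an alarm that notifies an Amazon SNS topic, which in turn emails users or starts a re-training workflow. The metric name, namespace, threshold, and topic ARN below are assumptions for illustration.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="feature-drift-detected",
    Namespace="Custom/ModelMonitoring",          # hypothetical custom namespace
    MetricName="feature_baseline_drift_total",   # hypothetical drift metric
    Statistic="Maximum",
    Period=3600,
    EvaluationPeriods=1,
    Threshold=0.2,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:model-retraining-topic"],
)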

Run the baseline task on your updated dataset before re-training your model. Use the SageMaker model registry to catalog your ML models for production, manage model versions, and control the associated training metrics.

Gaining insight into data and models

SageMaker Clarify provides greater visibility into your training data and models, helping identify and limit bias and explain predictions. For example, the trained models may consider some features more strongly than others when generating predictions. Compare the feature importance and bias between model versions for a better understanding of the changes.

Conclusion

As companies continue to use data analytics and ML to inform daily activity, data drift may become a more common occurrence. Recognizing that drift can have a direct impact on models and production-ready applications, it is important to architect your systems to identify potential data drift and avoid degrading your models and negatively impacting results. Failure to capture changes in data can result in loss of process confidence, degraded model accuracy, or a bottom-line impact to the business.

Amazon QuickSight 1-click public embedding available in preview

Post Syndicated from Kareem Syed-Mohammed original https://aws.amazon.com/blogs/big-data/amazon-quicksight-1-click-public-embedding-available-in-preview/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service that makes it easy to connect to your data, create interactive dashboards, and share these with tens of thousands of users, either directly within a QuickSight application, or embedded in web apps and portals.

QuickSight Enterprise Edition now supports 1-click public embedding, a feature that allows you to embed your dashboards into public applications, wikis, and portals without any coding or development needed. Anyone on the internet can start accessing these embedded dashboards with up-to-date information instantly, without any server deployments or infrastructure licensing needed! 1-click public embedding allows you to empower your end-users with access to insights.

In this post, we walk you through the steps to use this feature, demonstrate the end-user experience, and share sample use cases.

Solution overview

1-click public embedding requires an administrator of the QuickSight account to enable this feature and use session-based pricing. After you complete the prerequisites (see the following section), the 1-click embedding process involves three simple steps:

  1. Enable public access on the dashboard
  2. Allow list the domain where you want to embed the dashboard in QuickSight
  3. Embed the dashboard

Prerequisites

  1. The account needs to have session capacity pricing enabled
  2. Make sure your QuickSight account is enabled for dashboards to be shared publicly. Because this feature allows dashboards to be accessed by any user, it must be enabled by an administrator of your QuickSight account with the following AWS Identity and Access Management (IAM) permissions:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": "quicksight:UpdatePublicSharingSettings",
                "Resource": "*",
                "Effect": "Allow"
            }
         ]
    }

After this policy is in place, you can activate the account-level setting.

  1. Sign in to your QuickSight account.
  2. On the user name drop-down menu, choose Manage QuickSight.
  3. Choose Security & permissions in the navigation pane.
  4. Under Public access to dashboards, choose Manage.
  5. Enable the setting Everyone on the internet to allow public access.
  6. Confirm this setting and choose Accept.

Enabling this setting doesn’t automatically enable all the dashboards to be accessed by anyone on the internet. It gives the ability for authors of the dashboards to individually enable the dashboard to be accessed by anyone on the internet via the share link or when embedded.
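If you manage settings as code, the account-level toggle can also be scripted using the UpdatePublicSharingSettings operation referenced in the earlier IAM policy. The following boto3 sketch assumes a recent SDK version and uses a placeholder account ID.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

# Enable public sharing at the account level (same effect as the console steps above)
quicksight.update_public_sharing_settings(
    AwsAccountId="111122223333",  # placeholder account ID
    PublicSharingEnabled=True,
)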

Enable public access permissions on the dashboard

After you create a QuickSight dashboard, to enable public access, complete the following steps:

  1. On the QuickSight dashboard, on the Share menu, choose Share dashboard.

Only owners and co-owners of the dashboard can perform this action.

  2. Under Enable access for, turn on Anyone on the internet (public).

This setting allows you to share this dashboard with anyone on the internet via the share link or when embedded.

  3. Accept this setting and choose Confirm.

You now have the option to copy the link or the embed code to share the dashboard. Note that when this setting is enabled, the dashboard can only be accessed using the link or when embedded using the embed code.

After you enable the dashboard for public access, you can see badges on the dashboard as follows.

The dashboard toolbar has a PUBLIC badge.

The dashboards grid view has an eye icon for each dashboard.

The dashboards list view has an eye icon for each dashboard.

Disable the public access

You can disable the public access of your dashboards in two ways:

  1. Disable the Anyone on the internet setting for the dashboard (to be done by the owners and co-owners of the dashboard).
  2. Disable the Public access to dashboard setting at the account level (to be done by the IAM admin). You can disable the account-level settings even if the dashboard-level setting is enabled. This still prevents the dashboards from being accessed publicly.

Allow list the domain where you want to embed the dashboard in QuickSight

The domain where the dashboard is to be embedded must be allow listed in QuickSight. For instructions, see Adding domains for embedded users.

Embed the dashboard

After you set your desired access to the dashboard, you can choose Copy embed code, which copies the embed code for that dashboard. This code embeds the dashboard when added to your application.

The copied embed code is similar to the following code:

    <iframe
        width="960"
        height="720"
        src="https://quicksight.aws.amazon.com/sn/embed/share/accounts/accountid/dashboards/dashboardid">
    </iframe>

To embed the dashboard in an HTML page, open the HTML of the page where you want to embed the dashboard and enter the copied embed code into the HTML code.

If your public-facing applications are built on Google Sites, to embed your dashboard, open the page on Google Sites, then choose Insert and Embed. A pop-up window appears with a prompt to enter a URL or embed code. Choose Embed code and enter the copied embed code in the text box.

Make sure to allow list the following domains in QuickSight when embedding in Google Sites: https://googleusercontent.com (enable subdomains), https://www.gstatic.com, and https://sites.google.com.

After you embed the dashboard in your application, anyone who can access your application can now access the embedded dashboard.

Sample use cases

1-click public embedding enables you to embed your dashboards into public applications, wikis, and portals without any coding or development needed. In this section, we present two sample use cases.

For our first use case, a fictitious school district uses 1-click public embedding to report on the teachers’ enrollment in the district. They built a dashboard and used this feature to embed it on their public-facing site.

For our second use case, a fictitious fintech that provides investment solutions is using 1-click public embedding to show how their investment compares against other well-known indexes and commodities. They used this feature to add this comparison dashboard on their public-facing marketing pages.

Try out 1-click public embedding

To try out this feature, see Embed Amazon QuickSight dashboard in seconds. In this demo, you can change the dashboard between a logistics or sales dashboard by choosing Change Dashboard and entering the embed code for the dashboard you want to render on the site.

Conclusion

With 1-click public embedding, you can now embed rich and interactive QuickSight dashboards quickly and easily. Enable your end-users to dive deeper into their data through embedded dashboards with the click of a button—and with no infrastructure setup or management, scale to millions of users. 1-click public embedding is now in preview; to access this feature, please contact [email protected].

QuickSight also supports embedding in SaaS apps without any user management needed. For more information, refer to Embed multi-tenant dashboards in SaaS apps using Amazon QuickSight without provisioning or managing users.

To stay up to date on QuickSight embedded analytics, check out what’s new with the QuickSight User Guide.


About the Authors

Kareem Syed-Mohammed is a Product Manager at Amazon QuickSight. He focuses on embedded analytics, APIs, and developer experience. Prior to QuickSight he has been with AWS Marketplace and Amazon retail as a PM. Kareem started his career as a developer and then PM for call center technologies, Local Expert and Ads for Expedia. He worked as a consultant with McKinsey and Company for a short while.

Srikanth Baheti is a Specialized World Wide Sr. Solutions Architect for Amazon QuickSight. He started his career as a consultant and worked for multiple private and government organizations. Later he worked for PerkinElmer Health and Sciences and eResearch Technology Inc., where he was responsible for designing and developing high-traffic web applications and highly scalable, maintainable data pipelines for reporting platforms using AWS services and serverless computing.

Enhance analytics with Google Trends data using AWS Glue, Amazon Athena, and Amazon QuickSight

Post Syndicated from Drew Philip original https://aws.amazon.com/blogs/big-data/enhance-analytics-with-google-trends-data-using-aws-glue-amazon-athena-and-amazon-quicksight/

In today’s market, business success often lies in the ability to glean accurate insights and predictions from data. However, data scientists and analysts often find that the data they have at their disposal isn’t enough to help them make accurate predictions for their use cases. A variety of factors might alter an outcome and should be taken into account when making a prediction model. Google Trends is an available option, presenting a broad source of data that reflects global trends more comprehensively. This can help enrich a dataset to yield a better model.

You can use Google Trends data for a variety of analytical use cases. For example, you can use it to learn about how your products or brands are faring among targeted audiences. You can also use it to monitor competitors and see how well they’re performing against your brand.

In this post, we show how to get Google Trends data programmatically, integrate it into a data pipeline, and use it to analyze data, using Amazon Simple Storage Service (Amazon S3), AWS Glue, Amazon Athena, and Amazon QuickSight. We use an example dataset of movies and TV shows and demonstrate how to get the search queries from Google Trends to analyze the popularity of movies and TV shows.

Solution overview

The following diagram shows a high-level architecture of the solution using Amazon S3, AWS Glue, the Google Trends API, Athena, and QuickSight.

The solution consists of the following components:

  1. Amazon S3 – The storage layer that stores the list of topics for which Google Trends data has to be gathered. It also stores the results returned by Google Trends.
  2. AWS Glue – The serverless data integration service that calls Google Trends for the list of topics to get the search results, aggregates the data, and loads it to Amazon S3.
  3. Athena – The query engine that allows you to query the data stored in Amazon S3. You can use it for supporting one-time SQL queries on Google Trends data and for building dashboards using tools like QuickSight.
  4. QuickSight – The reporting tool used for building visualizations.

In the following sections, we walk through the steps to set up the environment, download the libraries, create and run the AWS Glue job, and explore the data.

Set up your environment

Complete the following steps to set up your environment:

  1. Create an S3 bucket where you upload the list of movies and TV shows. For this post, we use a Netflix Movies and TV Shows public dataset from Kaggle.
  2. Create an AWS Identity and Access Management (IAM) service role that allows AWS Glue to read and write data to the S3 buckets you just created.
  3. Create a new QuickSight account with the admin/author role and access granted to Athena and Amazon S3.

Download the external libraries and dependencies for the AWS Glue Job

The AWS Glue job needs the following two external Python libraries: pytrends and awswrangler. pytrends is a library that provides a simple interface for automating the downloading of reports from Google Trends. awswrangler is a library provided by AWS to integrate data between a Pandas DataFrame and AWS repositories like Amazon S3.

Download the following .whl files for the libraries and upload them to Amazon S3:

Create and configure an AWS Glue job

To set up your AWS Glue job, complete the following steps:

  1. On the AWS Glue console, under ETL in the navigation pane, choose Jobs – New.
  2. For Create job, select Python Shell script editor.
  3. For Options, select Create a new script with boilerplate code.
  4. Choose Create.
  5. On the Script tab, enter the following script, replacing the source and target buckets with your bucket names:
    # Import external library TrendReq needed to connect to Google Trends API and library awswrangler to read/write from pandas to Amazon S3.
    
    from pytrends.request import TrendReq
    pytrend = TrendReq(hl='en-US', tz=360, timeout=10) 
    import pandas as pd
    import awswrangler as wr
    
    # Function get_gtrend, accepts a list of terms as input, calls Google Trends API for each term to get the search trends 
    def get_gtrend(terms):
      trends =[]
      for term in terms:
    # Normalizing the data using popular movie Titanic as baseline to get trends over time.
        pytrend.build_payload(kw_list=["Titanic",term.lower()])
        df = pytrend.interest_over_time()
        df["google_trend"] = round((df[term.lower()] /df['Titanic']) *100)
        
    # Transforming and filtering trends results to align with Analytics use case
        df_trend = df.loc[df.index >= "2018-1-1", "google_trend"].resample(rule="M").max().to_frame()
        df_trend["movie"] = term
        trends.append(df_trend.reset_index())
    
    # Last step in function to concatenate the results for each term and return an aggregated dataset 
      concat_df = pd.concat(trends)
      return concat_df
    
    def main():
      
    # Change the bucket and prefix name to Amazon S3 location where movie titles file from Kaggle has been downloaded. 
      source_bucket = "source_bucket"
      source_prefix = "source_prefix"
    
    # Awswrangler method s3.read_csv is called to load the titles from S3 location into a DataFrame and convert it to a list.
      df = wr.s3.read_csv(f's3://{source_bucket}/{source_prefix}/')
      movies = df['title'].head(20).values.tolist()
    
    # Call the get_gtrend function and pass the list of movies as input. A pandas DataFrame is returned with trend data for the movies.
      df = get_gtrend(terms=movies)
    
    # Change the prefix name to location where you want to store results. 
      target_bucket = "target_bucket" 
      target_prefix = "target_prefix" 
    
    # Use awswrangler to save pandas dataframe to Amazon S3. 
      wr.s3.to_csv(df,f's3://{target_bucket}/{target_prefix}/trends.csv',index= False)
    
    
    # Invoke the main function
    main()

  6. On the Job details tab, for Name, enter the name of the AWS Glue job.
  7. For IAM Role, choose the role that you created earlier with permissions to run the job and access Amazon S3.
  8. For Type, enter Python Shell to run the Python code.
  9. For Python Version, specify the Python version as Python 3.6.
  10. For Data processing units, choose 1 DPU.
  11. For Number of retries, enter .
  12. Expand Advanced properties and under Libraries, enter the location of the S3 bucket where the pytrends and awswrangler files were downloaded.
  13. Choose Save to save the job.

Run the AWS Glue job

Navigate to the AWS Glue console and run the AWS Glue job you created. When the job is complete, a CSV file with the Google Trends values is created in the target S3 bucket with the prefix specified in the main() function. In the next step, we create an AWS Glue table referring to the target bucket and prefix to allow queries to be run against the Google Trends data.
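You can also start and poll the job programmatically. The following boto3 sketch assumes the job is named google-trends-job; substitute the name you entered on the Job details tab.

import time
import boto3

glue = boto3.client("glue")

# Start the job and poll until it reaches a terminal state
run_id = glue.start_job_run(JobName="google-trends-job")["JobRunId"]

while True:
    state = glue.get_job_run(JobName="google-trends-job", RunId=run_id)["JobRun"]["JobRunState"]
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        print(f"Job finished with state: {state}")
        break
    time.sleep(30)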

Create an AWS Glue table on the Google Trends data

In this step, we create a table in the AWS Glue Data Catalog using Athena. The table is created on top of the Google Trends data saved in the target S3 bucket.

In the Athena query editor, select default as the database and enter the following DDL command to create a table named trends. Replace the target bucket and prefix with your own values.

CREATE EXTERNAL TABLE `trends`(
  `date` date, 
  `google_trend` double, 
  `title` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://<< target_bucket >>/<<target_prefix >>/'
TBLPROPERTIES (
  'has_encrypted_data'='false', 
  'skip.header.line.count'='1')

This table has three columns:

  • date – The time dimension for aggregating the data. In this example, the time period is monthly.
  • google_trend – The count of Google Trends values normalized on a scale of 0–100.
  • title – The name of the movie or TV show.

Query the data using Athena

Now you can run one-time queries to find the popularity of movies and TV shows.

In the first example, we find the top 10 most popular movies and TV shows for November 2021. In the Athena query editor, enter the following SQL command to query the trends table created in the previous step:

select title,google_trend
from trends 
where date = date_parse('2021-11-30','%Y-%m-%d')
order by google_trend desc
limit 10

In the following example, we find the top 10 most popular movies and TV shows that have grown most in popularity in 2021 until November 30. In the Athena query editor, enter the following SQL command to query the trends table:

select  title,max(google_trend)-min(google_trend) trend_diff
from trends
where date between date_parse('2021-01-31','%Y-%m-%d') and date_parse('2021-11-30','%Y-%m-%d')
group by title
order by 2 desc
limit 10
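If you prefer to run these ad hoc queries from a script or notebook instead of the Athena console, the awswrangler library already used in the Glue job can execute the same SQL. This is a sketch under the assumption that your Athena workgroup has a query result location configured.

import awswrangler as wr

df = wr.athena.read_sql_query(
    sql="""
        select title, google_trend
        from trends
        where date = date_parse('2021-11-30','%Y-%m-%d')
        order by google_trend desc
        limit 10
    """,
    database="default",
    ctas_approach=False,  # run as a plain query; no temporary CTAS table needed
)
print(df)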

Build a dashboard to visualize the data using QuickSight

We can use QuickSight to build a dashboard on the data downloaded from Google Trends to identify top movies and TV shows. Complete the following steps:

  1. Sign in to your QuickSight account.
  2. On the QuickSight console, choose Datasets and choose New dataset.
  3. Choose Athena as your data source.
  4. For Data source name, enter a name.
  5. For Athena workgroup, choose [primary].
  6. Choose Create data source.
  7. For Database, choose default.
  8. For Tables, select the trends table.
  9. Choose Select.
  10. Select Directly query your data.
  11. Choose Visualize.

For the first visual, we create a bar chart of the top movies or TV shows by title sorted in ascending order of aggregated Google Trends values.

  1. Choose the horizontal bar chart visual type.
  2. For Y axis, choose title.
  3. For Value, choose google_trend (Average).

Next, we create a time series plot of Google Trends count by month for titles.

  1. Add a new visual and choose the autograph visual type.
  2. For X axis, choose date.
  3. For Value, choose google_trend (Sum).
  4. For Color, choose title.

Clean up

To avoid incurring future charges, delete the resources you created for AWS Glue, Amazon S3, IAM, and QuickSight.

  1. AWS Glue Catalog table
    • On the AWS Glue console, choose Tables under Databases in the navigation pane.
    • Select the AWS Glue Data Catalog table that you created.
    • On the Actions drop-down menu, choose Delete.
    • Choose Delete to confirm.
  2. AWS Glue Job
    • Choose Jobs in the navigation pane.
    • Select the AWS Glue job you created.
    • On the Actions drop-down menu, choose Delete.
  3. S3 bucket
    • On the Amazon S3 console, choose Buckets in the navigation pane.
    • Choose the bucket you created.
    • Choose Empty and enter your bucket name.
    • Choose Confirm.
    • Choose Delete and enter your bucket name.
    • Choose Delete bucket.
  4. IAM Role
    • On the IAM console, choose Roles in the navigation pane.
    • Choose the role you attached to AWS Glue job.
    • Choose Delete role.
    • Choose Yes.
  5. Amazon QuickSight
    • If you created a QuickSight user for trying out this blog and do not want to retain that access, please ask your QuickSight admin to delete your user.
    • If you created the QuickSight account itself just for trying this blog and no longer want to retain it, use the following steps to delete it.
    • Choose your user name on the application bar, and then choose Manage QuickSight.
    • Choose Account settings.
    • Choose Delete Account.

You can only have one QuickSight account active for each AWS account. Make sure that other users aren’t using QuickSight before you delete the account.

Conclusion

Integrating external data sources such as Google Trends via AWS Glue, Athena, and QuickSight can help you enrich your datasets to yield greater insights. You can use it in a data science context when the model is under-fit and requires more relevant data in order to make better predictions. In this post, we used movies as an example, but the solution extends to a wide breadth of industries, such as products in a retail context or commodities in a finance context. If the simple inventory histories or the transaction dates are available, you may find little correlation to future demand or prices. But with an integrated data pipeline using external data, new relationships in the dataset make the model more reliable.

In a business context, whether your team wants to test out a machine learning (ML) proof of concept more quickly or have limited access to pertinent data, Google Trends integration is a relatively quick way to enrich your data for the purposes of ML and data insights.

You can also extend this concept to other third-party datasets, such as social media sentiment, as your team’s expertise grows and your ML and analytics operations mature. Integrating external datasets such as Google Trends is just one part of the feature and data engineering process, but it’s a great place to start and, in our experience, most often leads to better models that businesses can innovate from.


About the Authors

Drew Philip is a Sr. Solutions Architect with AWS Private Equity. He has held senior technical leadership positions within key AWS partners such as Microsoft, Oracle, and Rackspace. Drew focuses on applied engineering that leverages AI-enabled digital innovation and development, application modernization, resiliency, and operational excellence for workloads at scale in the public and private sector. He sits on the board of Calvin University’s computer science department and is a contributing member of the AWS Machine Learning Technical Focus Community.

Gautam Prothia is a Senior Solutions Architect within AWS dedicated to Strategic Accounts. Gautam has more than 15 years of experience designing and implementing large-scale data management and analytical solutions. He has worked with many clients across industries to help them modernize their data platforms on the cloud.

Simon Zamarin is an AI/ML Solutions Architect whose main focus is helping customers extract value from their data assets. In his spare time, Simon enjoys spending time with family, reading sci-fi, and working on various DIY house projects.

ProLink uses Amazon QuickSight to enable states to deliver housing assistance to those in need

Post Syndicated from Ryan Kim original https://aws.amazon.com/blogs/big-data/prolink-uses-amazon-quicksight-to-enable-states-to-deliver-housing-assistance-to-those-in-need/

This is a joint post by ProLink Solutions and AWS. ProLink Solutions builds software solutions for emergency fund deployment to help state agencies distribute funds to homeowners in need. Over the past 20 years, ProLink Solutions has developed software for the affordable housing industry, designed to make the experience less complicated and easy to report on.

The COVID-19 pandemic has impacted homeowners across the United States who were unable to pay their mortgages, resulting in delinquencies, defaults, and foreclosures. The federal government acted quickly by establishing the Homeowner Assistance Fund (HAF) under the American Rescue Plan Act of 2021, granting nearly $10 billion to states to distribute to homeowners experiencing COVID-related financial hardships.

Distributing these funds quickly and efficiently required states to rapidly deploy new programs, workflows, and reporting. ProLink Solutions rose to the occasion with a new software as a service (SaaS) solution called ProLink+. Consisting of two parts—a homeowner portal that makes the funding application process easy for homeowners, and a back-office system to help state agencies review and approve funding applications—ProLink+ is a turnkey solution for state agencies looking to distribute their HAF dollars fast. For state agencies responsible for HAF programs, data reporting is key because it reinforces the organizational mission of the agency and helps shape public perception of how the program is progressing. Due to the emergency nature of the funding program, state agencies are continually in the public eye, and therefore access to real-time reporting is a must. As a result, ProLink uses Amazon QuickSight as their business intelligence (BI) solution to create and embed dashboards into the ProLink+ solution.

In this post, we share how ProLink+ uses QuickSight to enhance states’ capabilities to analyze and assess their fund deployment status.

Building the solution with AWS

Data-driven decision-making is critical in any industry or business today, and the affordable housing industry is no exception. As a primary technology player in the affordable housing industry for over two decades, ProLink Solutions supports state housing finance agencies by providing a comprehensive suite of software products. ProLink has been an AWS customer since 2012, using AWS services in their in-house software development to maximize the agility, scale, functionality, and speed to market of their solutions.

The ProLink+ SaaS solution is built using multiple AWS resources, including but not limited to Amazon Elastic Compute Cloud (Amazon EC2), Elastic Load Balancing, Amazon Simple Storage Service (Amazon S3), Amazon Relational Database Service (Amazon RDS), AWS Lambda, and Amazon Cognito. It also utilizes Amazon CloudFront for a secure and high-performance content delivery system to the end-user.

QuickSight’s inherent integrations with their AWS resources and microservices make it a logical choice for ProLink. In addition, QuickSight allows ProLink to deliver dashboard functionality efficiently and securely without incurring significant costs. Due to the intuitive design of QuickSight, ProLink business analysts are able to build rich, informative dashboards without writing any code or requiring engineering input, thereby shortening the time to client delivery and increasing the efficiency of the decision-making process. With its simple setup and ability to easily create embedded dashboards and visualization tools, QuickSight is yet another flexible and powerful tool that ProLink Solutions uses to deliver high-quality products and services to the market.

Intuitive and effective data visualization is key

The integration of QuickSight into ProLink+ offers a unique opportunity to create a seamless embedded solution for users. For example, the reporting integration isn’t a separate redirect to a different system. The solution exists in the main user interface, with visualizations directly associated with the relevant activities. Relevant data can be displayed without adding unnecessary complexity to the solution. This experience adds additional value by reducing the learning curve for new customers.

State agencies use QuickSight’s embedded dashboard capabilities in ProLink+ for internal analytical purposes, as well as for real-time reporting to the public. The agencies are proactively thinking about what information needs to be made available to the public and how to best present it. These are big decisions that impact how the public sees the work of the government.

The Percent of Funds Disbursed chart in the following screenshot illustrates how much of the allocation the agency received from the US Department of the Treasury has been disbursed to homeowners in the state.

Blazing a trail for more easily accessible funding

Federal funding programs have traditionally faced challenges with distribution to citizens in need. ProLink Solutions seeks to provide an easy-to-adopt, easy-to-use, and repeatable solution for governments, powered by modern technology. ProLink+ simplifies the process of distributing the funds to the public through an intuitive interface. The dashboard capabilities of QuickSight are an asset to both ProLink Solutions as a solutions provider and state government agencies as end-users. Intuitive, effective data reporting and visualization provides critical insights that help governments communicate their work on behalf of the public, while continuing to improve delivery of their services.

“State agencies across the board are looking for visual reporting tools to tell their stories more effectively. I’m glad QuickSight was readily available to us and we were able to quickly develop a dashboard in our ProLink+ deployment.” – Shawn McKenna, CEO, ProLink Solutions

Learn more about how ProLink Solutions is helping states distribute housing assistance quickly to those in need.

_______________________________________________________________________

About the Authors

Ryan Kim is the Director of Product Marketing at ProLink Solutions. Ryan leads industry partnerships and initiatives and the positioning of all of ProLink Solutions’ technology products that serve the affordable housing industry.

Scott Kirn is the Chief Information Officer at ProLink Solutions. Scott leads the Information Technology group at ProLink Solutions and drives all aspects of product development and delivery.

Walter McCain II is a Solutions Architect at Amazon Web Services, helping customers build operational best practices, application products, and technical solutions in the AWS Cloud. Walter is involved in evangelizing AWS Cloud computing architectures and development for various technologies such as serverless, media and entertainment, migration strategies, and security, to name a few.

Announcing the new Amazon QuickSight Community

Post Syndicated from Lillie Atkins original https://aws.amazon.com/blogs/big-data/announcing-the-new-amazon-quicksight-community/

On February 22, 2022, we launched our new Amazon QuickSight Community. Here you can ask and answer questions, network with and learn from other Business Intelligence (BI) users from across the globe, access learning content, and stay up to date with what’s new on Amazon QuickSight—all in one place!

In this post, we discuss some of the features of the QuickSight Community and show you how to sign up, start posting, create a profile, and set up notifications.

Content on the Community

The QuickSight Community has three main sections:

  1. Question and Answer – A discussion forum where you can ask your QuickSight questions and get answers from community experts. You can also showcase your QuickSight expertise by sharing your knowledge with others.
  2. Learning Center – A hub for on-demand QuickSight content, including how-to videos, workshop videos, articles, and additional educational resources.
  3. Announcements – Stay up-to-date on the latest QuickSight launches, blogs, feature demo videos, and monthly newsletters. We continuously update this section as new content related to QuickSight is added.

Join the QuickSight Community

This QuickSight Community does not require any login to search or browse existing content. You only need to create an account if you want to interact with the community (such as liking posts, replying to posts, and posting your own questions).

This is a public community, so be careful not to post any confidential or private information.

Sign up for the Community

To sign up, complete the following steps:

  1. On the QuickSight Community home page, choose Sign Up.
  2. You can either use your existing Amazon.com account, or create a new login.
  3. Read the Community guidelines.

You can now contribute to the Community. Welcome!

Post a question

You need to be logged in with an account to start asking questions. Before you post a question, search the recommended questions to make sure your question hasn’t already been answered.

To post a question, complete the following steps:

  1. On the Question and answer page, choose New Question.
  2. Enter the information to create your post, including a title and tag.
  3. Choose New Question.

Set up your profile

To view your profile and change the settings, complete the following steps:

  1. Choose your user icon at the top of the page and choose the person icon.
  2. Select the section you are interested in.
  3. If you chose Preferences, you can control security and notification preferences, and update your profile details.

Set up notifications

You can set up notification preferences to be alerted on a specific post, channel, or tag.

To set up notifications on a post, choose the post and then choose the notification icon to set your notification preferences.

To get notified on a specific channel, go to the channel (in this case the Question and answer channel) and choose the notification bell to set your notification preferences.

To get notified on a specific tag, complete the following steps:

  1. Choose the additional options icon.
  2. Choose Tags.
  3. Choose the tag you are interested in.
  4. Choose the notification icon and set your notification preferences.

Conclusion

In this post, we discussed the new Amazon QuickSight Community, and how you can sign up for it, create a post, edit your profile, and set up notifications. The QuickSight Community is a one-stop shop for all of your QuickSight learning needs, and a place to network with other BI users from around the globe. Start exploring today!


About the Authors

Lillie Atkins is a Product Manager for Amazon QuickSight, Amazon Web Services’ cloud-native, fully managed BI service.

Mia Heard is a Product Marketing Manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service.

Enable users to ask questions about data using natural language within your applications by embedding Amazon QuickSight Q

Post Syndicated from Deepak Murthy original https://aws.amazon.com/blogs/big-data/enable-users-to-ask-questions-about-data-using-natural-language-within-your-applications-by-embedding-amazon-quicksight-q/

Amazon QuickSight Q is a new machine learning-based capability in Amazon QuickSight that enables users to ask business questions in natural language and receive answers with relevant visualizations instantly to gain insights from data. QuickSight Q doesn’t depend on prebuilt dashboards or reports to answer questions, which removes the need for business intelligence (BI) teams to create or update dashboards every time a new business question arises. Users can ask questions and receive visual answers in seconds directly from within QuickSight or from web applications and portals. In this post, we look at how you can embed Q in your web applications or portals.

Solution overview

You can now embed Q in your application without any custom development. Q is a fully managed cloud-native BI offering that you can easily embed without requiring expertise within your team to develop and maintain this capability. You can easily democratize your data and scale your insights to a growing user base, while ensuring you only pay for usage with Q’s unique pay-per-question pricing model.

Applications can authenticate users with any identity provider of choice (such as Active Directory, Amazon Cognito, or any SAML-based federated SSO provider that your organization uses) and act on behalf of the user to get access to the Q question bar. This means that every user receives a secure, personalized question answering experience while requiring no user-facing QuickSight-specific authentication. This enables a novel experience to provide insights within your application with minimal upfront work and allows you to focus on your core application functionality! QuickSight Q embedding is available in QuickSight Enterprise Edition and Q-supported Regions.

To facilitate an easy embedding experience, AWS has also launched the Amazon QuickSight Embedding SDK (JavaScript) and a rich set of Q-specific functionalities. The QuickSight Embedding SDK lets you efficiently integrate Q in your application pages, set a default topic, enable topic selection, set themes, and control Q search bar behavior. This helps you roll out Q to your users faster.

To embed Q in your application, you must complete the following high-level steps:

  1. Set up permissions to generate embedded Q URLs.
  2. Generate a URL with the authentication code attached.
  3. Embed the Q search bar URL.

Set up permissions to generate embedded Q URLs

In this step, you set up permissions for your backend application or web server to embed the Q search bar. This task requires administrative access to AWS Identity and Access Management (IAM). Each user who accesses the Q search bar assumes a role that gives them QuickSight permissions to retrieve a Q-embedded URL.

To make this possible, create an IAM role in your AWS account. Associate an IAM policy with the role to provide permissions to any user who assumes it. The IAM role needs to provide permissions to retrieve embedding URLs for a specific user pool. With the help of the wildcard character *, you can grant the permissions to generate a URL for all users in a specific namespace. Or you can grant permissions to generate a URL for a subset of users in specific namespaces. For this, you add quicksight:GenerateEmbedUrlForRegisteredUser.

The following sample policy provides these permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "quicksight:GenerateEmbedUrlForRegisteredUser"
            ],
            "Resource": "arn:partition:quicksight:region:accountId:user/namespace/userName"
        }
    ]
}

Also, if you’re creating first-time users who will be QuickSight readers, make sure to add the quicksight:RegisterUser permission in the policy.

The following sample policy provides permission to retrieve an embedding URL for first-time users who will be QuickSight readers:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "quicksight:RegisterUser",
            "Resource": "*",
            "Effect": "Allow"
        },
        {
            "Effect": "Allow",
            "Action": [
              "quicksight:GenerateEmbedUrlForRegisteredUser"
            ],
            "Resource": [
              "arn:partition:quicksight:region:accountId:user/namespace/userName"
            ]
        }
    ]
}

Finally, your application’s IAM identity must have a trust policy associated with it to allow access to the role that you just created. This means that when a user accesses your application, your application assumes the role on the user’s behalf and provisions the user in QuickSight.

The following example policy, attached to your application’s IAM identity, allows it to assume a role called embedding_quicksight_q_search_bar_role, which has the preceding sample policy attached:

{
    "Version": "2012-10-17",
    "Statement": {
        "Effect": "Allow",
        "Action": "sts:AssumeRole",
        "Resource": "arn:aws:iam::11112222333:role/embedding_quicksight_q_search_bar_role"
    }
}

Generate a URL with the authentication code attached

In this step, you authenticate your user and get the embeddable Q topic URL on your application server. If you plan to embed the Q bar for IAM or QuickSight identity types, share the Q topic with the users. When a user accesses your app, the app assumes the IAM role of the user. If that user is new, the app adds the user to QuickSight, then passes an identifier as the unique role session ID.

These steps make sure that each viewer of the Q topic is uniquely provisioned in QuickSight. It also enforces per-user settings, such as the row-level security and dynamic defaults for parameters.

The following example code performs the IAM authentication on the user’s behalf. This code runs on your app server:

import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.auth.AWSCredentialsProvider;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.quicksight.AmazonQuickSight;
import com.amazonaws.services.quicksight.AmazonQuickSightClientBuilder;
import com.amazonaws.services.quicksight.model.GenerateEmbedUrlForRegisteredUserRequest;
import com.amazonaws.services.quicksight.model.GenerateEmbedUrlForRegisteredUserResult;
import com.amazonaws.services.quicksight.model.RegisteredUserEmbeddingExperienceConfiguration;
import com.amazonaws.services.quicksight.model.RegisteredUserQSearchBarEmbeddingConfiguration;

/**
 * Class to call QuickSight AWS SDK to get url for embedding the Q search bar.
 */
public class QSearchBarEmbedUrlGenerator {

    private final AmazonQuickSight quickSightClient;

    public QSearchBarEmbedUrlGenerator() {
        this.quickSightClient = AmazonQuickSightClientBuilder
                .standard()
                .withRegion(Regions.US_EAST_1.getName())
                .withCredentials(new AWSCredentialsProvider() {
                                     @Override
                                     public AWSCredentials getCredentials() {
                                         // provide actual IAM access key and secret key here
                                         return new BasicAWSCredentials("access-key", "secret-key");
                                     }

                                     @Override
                                     public void refresh() {}
                                 }
                )
                .build();
    }

    public String getQuicksightEmbedUrl(
            final String accountId, // AWS Account ID
            final String topicId, // Topic ID to embed
            final String userArn // Registered user arn to use for embedding. Refer to Get Embed Url section in developer portal to find how to get user arn for a QuickSight user.
    ) throws Exception {
        final RegisteredUserEmbeddingExperienceConfiguration experienceConfiguration = new RegisteredUserEmbeddingExperienceConfiguration()
                .withQSearchBar(new RegisteredUserQSearchBarEmbeddingConfiguration().withInitialTopicId(topicId));
        final GenerateEmbedUrlForRegisteredUserRequest generateEmbedUrlForRegisteredUserRequest = new GenerateEmbedUrlForRegisteredUserRequest();
        generateEmbedUrlForRegisteredUserRequest.setAwsAccountId(accountId);
        generateEmbedUrlForRegisteredUserRequest.setUserArn(userArn);
        generateEmbedUrlForRegisteredUserRequest.setExperienceConfiguration(experienceConfiguration);

        final GenerateEmbedUrlForRegisteredUserResult generateEmbedUrlForRegisteredUserResult = quickSightClient.generateEmbedUrlForRegisteredUser(generateEmbedUrlForRegisteredUserRequest);

        return generateEmbedUrlForRegisteredUserResult.getEmbedUrl();
    }
}

To generate the URL that you can embed in your app, call the GenerateEmbedUrlForRegisteredUser API operation. This URL is valid for 5 minutes, and the resulting session is valid for up to 10 hours. You can configure the session validity by setting the SessionLifetimeInMinutes parameter of the GenerateEmbedUrlForRegisteredUser API. The API operation returns the URL with an auth_code value that enables a single sign-on session. The following code shows an example response from generate-embed-url-for-registered-user:

//The URL returned is over 900 characters. For this example, we've shortened the string for
//readability and added ellipsis to indicate that it's incomplete.
{
 "Status": "200",
 "EmbedUrl": "https: //dashboards.example.com/embed/620bef10822743fab329fb3751187d2d...",
 "RequestId": "7bee030e-f191-45c4-97fe-d9faf0e03713"
} 
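
If your application server is written in Python rather than Java, the same URL generation might look like the following boto3 sketch; the account ID, topic ID, and user ARN are placeholders, and SessionLifetimeInMinutes is where the session validity is configured.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

def get_q_search_bar_embed_url(account_id, topic_id, user_arn):
    """Return a single-use embed URL for the Q search bar."""
    response = quicksight.generate_embed_url_for_registered_user(
        AwsAccountId=account_id,
        UserArn=user_arn,
        SessionLifetimeInMinutes=600,  # the resulting session lasts up to 10 hours
        ExperienceConfiguration={"QSearchBar": {"InitialTopicId": topic_id}},
    )
    return response["EmbedUrl"]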

Embed the Q search bar URL

In this step, you embed the Q search bar URL in your website or application page. You can do this with the QuickSight Embedding SDK, which allows you to do the following:

  • Place the Q search bar on an HTML page
  • Pass parameters into the Q search bar
  • Handle error states with messages that are customized to your application

Embed the Q search bar in your webpage by using the QuickSight Embedding SDK or by adding the URL to an iFrame. If you set a fixed height and width (in pixels), QuickSight keeps those dimensions and doesn’t change the visual as your window resizes. If you set a relative percentage for height and width, QuickSight provides a responsive layout that adjusts as your window size changes. When you use the QuickSight Embedding SDK, the Q search bar on your page is dynamically resized based on its state. The SDK also lets you control parameters within the Q search bar and receive callbacks for page load completion and errors.

The following example code shows how to use the generated URL. Your app server generates this HTML page, and the embedded Q search bar loads in the user’s browser:

<!DOCTYPE html>
         <html>
     
         <head>
             <title>QuickSight Q Search Bar Embedding</title>
             <script src="https://unpkg.com/[email protected]/dist/quicksight-embedding-js-sdk.min.js"></script>
             <script type="text/javascript">
                 var session
     
                 function onError(payload) {
                     console.log("Do something when the session fails loading");
                 }
     
                 function onOpen() {
                     console.log("Do something when the Q search bar opens");
                 }
     
                 function onClose() {
                     console.log("Do something when the Q search bar closes");
                 }
     
                 function embedQSearchBar() {
                     var containerDiv = document.getElementById("embeddingContainer");
                     var options = {
                         url: "https://us-east-1.quicksight.aws.amazon.com/sn/dashboards/dashboardId?isauthcode=true&identityprovider=quicksight&code=authcode", // replace this dummy url with the one generated via embedding API
                         container: containerDiv,
                         width: "1000px",
                         locale: "en-US",
                         qSearchBarOptions: {
                             expandCallback: onOpen,
                             collapseCallback: onClose,
                             iconDisabled: false,
                             topicNameDisabled: false, 
                             themeId: 'theme12345',
                             allowTopicSelection: true
                         }
                     };
                     session = QuickSightEmbedding.embedQSearchBar(options);
                     session.on("error", onError);
                 }
     
                 function onCountryChange(obj) {
                     session.setParameters({country: obj.value});
                 }
             </script>
         </head>
     
         <body onload="embedQSearchBar()">
             <div id="embeddingContainer"></div>
         </body>
     
         </html>

For this example to work, make sure to use the QuickSight Embedding SDK to load the embedded Q search bar on your website using JavaScript. You can get the SDK in the following ways:

  • Download the QuickSight Embedding SDK from GitHub. This repository is maintained by a group of QuickSight developers.
  • Download the latest QuickSight Embedding SDK version from npmjs.com.
  • If you use npm for JavaScript dependencies, download and install it by running the following command:
    npm install amazon-quicksight-embedding-sdk

The following screenshot is an example of the embedded Q question bar and a QuickSight dashboard to support natural language questions and analysis of pharmaceutical clinical trial data. You can try out this topic and other such topics in an embedded application demo.

Summary

Enterprises can empower users to ask questions about data in plain English within their applications by embedding the QuickSight Q question bar. Embedding Q into your application is straightforward and requires only minimal development effort from your team.

Get started with a free trial of QuickSight Q.


About the Authors

Deepak Murthy is a Senior Product Manager for Amazon QuickSight, AWS’s cloud-native, fully managed BI service. Deepak started his career with Staples, developing enterprise data warehouse solutions. Later, he was the architect of data warehouse and analytics solutions at Wells Fargo, AMC, and Blackhawk Network. Deepak is excited about the potential of self-service analytics and improving data accessibility by enabling new natural language interactions with data, and looks forward to helping customers leverage these latest analytics innovations.

Rob Foley is a Software Development Engineer for Amazon QuickSight, AWS’s cloud-native, fully managed BI service. Rob began his career with AWS, and has been a member of the QuickSight team for over 1.5 years. He has development experience in a breadth of services and stacks, primarily having worked on data-centric applications like Q.

Use AnalyticsIQ with Amazon QuickSight to gain insights for your business

Post Syndicated from Sumitha AP original https://aws.amazon.com/blogs/big-data/use-analyticsiq-with-amazon-quicksight-to-gain-insights-for-your-business/

Decisions are made every day in your organization that impact your business. Making the right decision at the right moment can deeply impact your organization’s growth and your customers. Likewise, having the right data and tools that generate insights into the data can empower your organization’s leaders to make the right decisions.

In the healthcare industry, where decisions directly impact an individual’s wellness, having the right data to generate insight into the individual experience through the lens of social determinants of health can greatly improve health outcomes and save lives. Understanding the unique social situations of the individuals they serve, from access to transportation and technology to economic stability, food security, and more, allows healthcare providers to address disparities and give all their patients an equal opportunity to achieve their desired level of health.

For example, let’s say a healthcare organization or government agency wants to better understand the factors that affect public health in order to improve the quality of life for various ethnic groups, based on data.

In this post, we show you how to use AnalyticsIQ datasets and Amazon QuickSight to generate valuable insights that can improve your organization’s decision-making. We use the AnalyticsIQ Social Determinants of Health Sample Data dataset to gain insights into the relationship between ethnicity and health, as well as how social determinants impact the health and wellness of individuals.

Solution overview

The following architecture diagram outlines the components of this solution:

The solution consists of the following components:

To implement the solution, you complete the following high-level steps:

  1. Export the dataset to an S3 bucket.
  2. Sign up for a QuickSight subscription.
  3. Create a QuickSight dataset.
  4. Create visualizations in QuickSight.

Prerequisites

To run this solution, you must have an AWS account. If you don’t already have one, you can create one.

Export the dataset to an S3 bucket

To start working with your dataset, you must subscribe to the dataset and then export the data to an S3 bucket. Complete the following steps:

  1. If you don’t already have a bucket, navigate to the Amazon S3 console, and choose Create bucket.
  2. Give a unique name for your bucket.

Make sure that you create the bucket in the us-east-1 Region.

  3. To subscribe to the sample dataset, follow this link. On the AWS Data Exchange console, choose Continue to subscribe.
  4. On the Complete subscription page, choose Subscribe.
  5. For Select Amazon S3 bucket folder destination, choose your S3 bucket.

The subscription process can take up to 2 minutes to complete.

  6. On the AWS Data Exchange console, under My subscriptions in the navigation pane, choose Entitled Data.
  7. Under Products, expand Social Determinants of Health Sample Data – Offline, and choose the AnalyticsIQ sample dataset.
  8. On the Revisions tab, select the revision and choose Export to Amazon S3.
  9. Enter the name of the S3 bucket you created for this dataset.
  10. Leave the other options as default.
  11. Choose Export.

You can view the dataset in your S3 bucket under the prefix Sample-Data.
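
If you prefer to script the export instead of using the console, the AWS Data Exchange API exposes the same operation. The following boto3 sketch assumes you have already subscribed to the product; the data set ID, revision ID, and bucket name are placeholders you would look up in your own account.

import boto3

dx = boto3.client("dataexchange", region_name="us-east-1")

def export_revision_to_s3(data_set_id, revision_id, bucket):
    """Start a job that exports a revision's assets to the given S3 bucket."""
    job = dx.create_job(
        Type="EXPORT_REVISIONS_TO_S3",
        Details={
            "ExportRevisionsToS3": {
                "DataSetId": data_set_id,
                "RevisionDestinations": [{"RevisionId": revision_id, "Bucket": bucket}],
            }
        },
    )
    dx.start_job(JobId=job["Id"])
    return job["Id"]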

Sign up for a QuickSight subscription

To sign up for a QuickSight subscription, complete the following steps:

  1. On the AWS Management Console, open QuickSight.
  2. Choose Sign up for QuickSight and choose Enterprise.
  3. For QuickSight account name, enter a unique name.
  4. Enter a valid email.
  5. Under Allow access and autodiscovery for these resources, select Amazon S3 and choose Select S3 buckets.
  6. Choose the S3 bucket that you created earlier, and choose Finish.
  7. After your QuickSight account is created, choose Go to QuickSight account.

Create a QuickSight dataset

To create your dataset, complete the following steps:

  1. Using a local text editor, create a JSON file. Copy the following content and replace the placeholder with the name of the bucket that you created earlier:
    {
          "fileLocations": [
              {
                  "URIPrefixes": [
                           "https://<Your BucketName>.s3.amazonaws.com/Sample-Data/"
                  ]
              }
           ],
         "globalUploadSettings": {
                  "format": "TSV"
            }
    }

  2. On the QuickSight console, choose New data set on the Datasets page.
  3. Choose S3.
  4. For DataSource, enter a name.
  5. Choose Upload and upload the JSON file.
  6. Choose Connect.
  7. Choose Visualize.

The following screenshot shows your imported sample data:
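
You can also create the S3 data source programmatically instead of uploading the manifest through the console. The following boto3 sketch is illustrative only; the account ID, bucket, manifest key, and QuickSight user ARN are placeholders, and you would still create a dataset on top of this data source afterward.

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

quicksight.create_data_source(
    AwsAccountId="111122223333",
    DataSourceId="analyticsiq-s3-source",
    Name="AnalyticsIQ sample data",
    Type="S3",
    DataSourceParameters={
        "S3Parameters": {
            "ManifestFileLocation": {
                "Bucket": "<your-bucket-name>",
                "Key": "manifest.json",  # the manifest shown earlier, uploaded to S3
            }
        }
    },
    Permissions=[
        {
            "Principal": "arn:aws:quicksight:us-east-1:111122223333:user/default/<your-user>",
            "Actions": [
                "quicksight:DescribeDataSource",
                "quicksight:DescribeDataSourcePermissions",
                "quicksight:PassDataSource",
            ],
        }
    ],
)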

Create visualizations in QuickSight

Let’s visualize the average number of cars by various ethnic groups. For more information about the fields, refer to the Key Data Points section on the AWS Marketplace listing.

  1. Choose the sheet and choose the vertical bar chart under Visual types.
  2. From the Fields list, drag EthnicIQ_v2 to X axis and Number_of_Autos to Value.
  3. Choose Aggregate as Average.

Now you can create a visualization for urgent care visits by ethnic groups.

  1. Choose +Add, and choose Add Visual.
  2. Choose a pivot table under Visual types.
  3. From the Fields list, drag EthnicIQ_v2 to Rows and HW_Urgent_Care_Visits_SC to Values.
  4. Choose Aggregate as Average.
  5. Choose the HW_Urgent_Care_Visits_SC field in the pivot table, and choose Sort descending.

Similarly, you can add more visualizations as shown in the following images.

From these visualizations created from sample data, you can see that a person’s use of healthcare services decreases when they have less access to transportation. The AA ethnic group has fewer cars compared to the other groups, and its wellness score is low compared to the others. Transportation barriers could be a major factor here. Job satisfaction also contributes to wellness levels. Furthermore, the sample data indicates that the Hispanic community has the highest likelihood of recent urgent care visits. Does this mean these groups aren’t getting enough preventative care, leading to more urgent care visits?

Sleep and job satisfaction play a critical role in affecting stress levels, as well as overall health. This would be a critical factor for people who work shifts. What measures can be taken to increase the sleep quality for that set of people?

These are just a few of the many valuable analyses that you can create from the AnalyticsIQ Social Determinants of Health Sample Data dataset. These insights are valuable for health professionals, preventative care and employee wellness programs, scientists, and governments working to empower communities and build better public health and social determinant solutions.

Clean up

To avoid incurring ongoing charges, complete the following steps to clean up your resources:

  1. On the QuickSight console, on the Analyses page, choose the details icon on the analysis you created, and choose Delete.
  2. On the QuickSight start page, on the Datasets page, choose the dataset that you created earlier, then choose Delete Data Set.
  3. On the Amazon S3 console, on the Buckets page, select the option next to the name of your bucket, and then choose Delete at the top of the page.
  4. Confirm that you want to delete the bucket by entering the bucket name into the text field, then choose Delete bucket.

Conclusion

In this post, we showed you how you can use the AnalyticsIQ Social Determinants of Health Sample Data dataset to gain insights into society’s health and wellness. We also showed you how you can generate easy-to-understand visualizations using QuickSight. Amazon QuickSight allows dashboards to be shared with thousands of users without any servers, and with pay-per-session pricing. QuickSight dashboards can also be easily embedded in SaaS applications or corporate portals for sharing insights with all users. You can explore the AnalyticsIQ dataset more on the AWS Data Exchange console. For queries related to the AnalyticsIQ dataset, you can reach out directly to the support team at [email protected]. To learn more about the features of QuickSight, refer to Amazon QuickSight Features.


About the Author

Sumitha AP is an AWS Solutions Architect based in Washington, DC. She works with SMB customers to help them design secure, scalable, reliable, and cost-effective solutions in the AWS Cloud.

How to build a multi-Region AWS Security Hub analytic pipeline and visualize Security Hub data

Post Syndicated from David Hessler original https://aws.amazon.com/blogs/security/how-to-build-a-multi-region-aws-security-hub-analytic-pipeline/

AWS Security Hub is a service that gives you aggregated visibility into your security and compliance posture across multiple Amazon Web Services (AWS) accounts. By joining Security Hub with Amazon QuickSight—a scalable, serverless, embeddable, machine learning-powered business intelligence (BI) service built for the cloud—your senior leaders and decision-makers can use dashboards to empower data-driven decisions and facilitate a secure configuration of AWS resources.

In organizations that operate at cloud scale, being able to summarize and perform trend analysis is key to identifying and remediating problems early, which leads to the overall success of the organization. Additionally, QuickSight dashboards can be embedded in dashboard and reporting platforms that leaders are already familiar with, making the dashboards even more user friendly.

With the solution in this blog post, you can provide leaders with cross-Region views of data to enable decision-makers to assess the health and status of an organization’s IT infrastructure at a glance. You also can enrich the dashboard with data sources not available to Security Hub. Finally, this solution gives you the flexibility to have multiple administrator accounts across several AWS organizations and combine them into a single view.

In this blog post, you will learn how to build an analytics pipeline of your Security Hub findings, summarize the data with Amazon Athena, and visualize the data via QuickSight using the following steps:

  • Deploy an AWS Cloud Development Kit (AWS CDK) stack that builds the infrastructure you need to get started.
  • Create an Athena view that summarizes the raw findings.
  • Visualize the summary of findings in QuickSight.
  • Secure QuickSight using best practices.

For a high-level discussion without code examples, see Visualize AWS Security Hub Findings using Analytics and Business Intelligence Tools.

Prerequisites

This blog post assumes that you:

  • Have a basic understanding of how to authenticate and access your AWS account.
  • Are able to run commands via a command line prompt on your local machine.
  • Have a basic understanding of Structured Query Language (SQL).

Solution overview

Figure 1 shows the flow of events and a high-level architecture diagram of the solution.

Figure 1. High level architecture diagram

The steps shown in Figure 1 include:

  • Detect
  • Collect
  • Aggregate
  • Ingest
  • Transform
  • Analyze
  • Visualize

Detect

AWS offers a number of tools to help detect security findings continuously. These tools fall into three types:

In this blog, you will use two built-in security standards of Security Hub—CIS AWS Foundations Benchmark controls and AWS Foundational Security Best Practices Standard—and a serverless Prowler scanner that acts as a third-party partner product. In cases where AWS Organizations is used, member accounts send these findings to the member account’s Security Hub.

Collect

Within a region, security findings are centralized into a single administrator account using Security Hub.

Aggregate

Using the cross-Region aggregation feature within Security Hub, findings within each administrator account can be aggregated and continuously synchronized across multiple regions.
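
Cross-Region aggregation is a one-time configuration in the administrator account. If you want to set it up with code rather than the console, a boto3 sketch might look like the following; the linked Regions are examples, and the CDK stack in this post may already handle this for you.

import boto3

securityhub = boto3.client("securityhub", region_name="us-east-1")

# Aggregate findings from the listed Regions into the Region this client targets.
securityhub.create_finding_aggregator(
    RegionLinkingMode="SPECIFIED_REGIONS",
    Regions=["us-east-2", "us-west-1"],
)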

Ingest

Security Hub not only provides a comprehensive view of security alerts and security posture across your AWS accounts, it also acts as a data sink for your security tools. Any tool that can expose data via AWS Security Finding Format (ASFF) can use the BatchImportFindings API action to push data to Security Hub. For more details, see Using custom product integration to send findings to AWS Security Hub and Available AWS service integrations in the Security Hub User Guide.
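
As an illustration of how a custom tool could push a finding, the following boto3 sketch sends a single ASFF record with BatchImportFindings; the IDs, ARNs, and field values are purely illustrative.

import boto3
from datetime import datetime, timezone

securityhub = boto3.client("securityhub", region_name="us-east-1")
now = datetime.now(timezone.utc).isoformat()

securityhub.batch_import_findings(
    Findings=[
        {
            "SchemaVersion": "2018-10-08",
            "Id": "custom-scanner/finding-0001",
            "ProductArn": "arn:aws:securityhub:us-east-1:111122223333:product/111122223333/default",
            "GeneratorId": "custom-scanner",
            "AwsAccountId": "111122223333",
            "Types": ["Software and Configuration Checks"],
            "CreatedAt": now,
            "UpdatedAt": now,
            "Severity": {"Label": "MEDIUM"},
            "Title": "Example finding from a custom scanner",
            "Description": "Illustrative ASFF record pushed via BatchImportFindings.",
            "Resources": [{"Type": "AwsAccount", "Id": "AWS::::Account:111122223333"}],
        }
    ]
)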

Transform

Data coming out of Security Hub is exposed via Amazon EventBridge. Unfortunately, it’s not quite in a form that Athena can consume. EventBridge streams the data through Amazon Kinesis Data Firehose directly to Amazon Simple Storage Service (Amazon S3). An AWS Lambda function then flattens the records and fixes some of the column names, such as by removing special characters that Athena can’t recognize, and saves the results back to S3. Finally, an AWS Glue crawler dynamically discovers the schema of the data and creates or updates an Athena table.
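
The exact transformation logic ships with the CDK application described later in this post; the following Python sketch only illustrates the general idea of flattening nested finding records and sanitizing column names, with simplified field handling.

import re

def sanitize(name):
    """Lowercase a key and replace characters that Athena can't use in column names."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())

def flatten(record, parent_key=""):
    """Flatten nested dictionaries into a single level of Athena-friendly columns."""
    flat = {}
    for key, value in record.items():
        column = sanitize(f"{parent_key}_{key}" if parent_key else key)
        if isinstance(value, dict):
            flat.update(flatten(value, column))
        else:
            flat[column] = value
    return flat

# Example: {"Severity": {"Label": "HIGH"}} becomes {"severity_label": "HIGH"}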

Analyze

You will aggregate the raw findings data and create metrics along various grains or pivots by creating a simple yet meaningful Athena view. With Athena, you also can use views to join the data with other data sources, such as your organization’s configuration management database (CMDB) or IT service management (ITSM) system.

Visualize

Using QuickSight, you will register the data sources and build visualizations that can be used to identify areas where security can be improved or reduce risk. This post shares steps detailing how to do this in the Build QuickSight visualizations section below.

Use AWS CDK to deploy the infrastructure

In order to analyze and visualize security-related findings, you need to deploy the infrastructure required to detect, ingest, and transform those findings. You will use an AWS CDK stack to deploy that infrastructure to your account. To begin, review the prerequisites to make sure you have everything you need. After the infrastructure has been deployed, you will build an Athena view and a QuickSight visualization.

Install the software to deploy the solution

For the solution in this blog post, you must have the following tools installed:

  • The solution in this blog post is written in Python, so you must install Python in addition to the AWS CDK. Instructions on how to install Python version 3.x can be found on the Python downloads page.
  • AWS CDK requires Node.js. Directions on how to install Node.js can be found on the Node.js downloads page.
  • This CDK application uses Docker for local bundling. Directions for using Docker can be found at Get Docker.
  • AWS CDK—a software-development framework for defining cloud infrastructure in code and provisioning it through AWS CloudFormation. To install CDK, visit AWS CDK Toolkit page.

To confirm you have everything you need

  1. Confirm you are running version 1.108.0 or later of CDK.

    $ cdk --version

  2. Download the code from GitHub by cloning the repository, then cd into the cloned directory.

    $ git clone git@github.com:aws-samples/aws-security-hub-analytic-pipeline.git

    $ cd aws-security-hub-analytic-pipeline

  3. Manually create a virtualenv.

    $ python3 -m venv .venv

  4. After the initialization process completes and the virtualenv is created, use the following command to activate your virtualenv.

    $ source .venv/bin/activate

  5. If you’re using a Windows platform, use the following command to activate virtualenv:

    % .venv/Scripts/activate.bat

  6. Once the virtualenv is activated, you can install the required dependencies.

    $ pip install -r requirements.txt

Use AWS CDK to deploy the infrastructure into your account

The following steps use AWS CDK to deploy the infrastructure. This infrastructure includes the various scanners, Security Hub, EventBridge, and Kinesis Firehose streams. When complete, the raw Security Hub data will already be stored in an S3 bucket.

To deploy the infrastructure using AWS CDK

  1. If you’ve never used AWS CDK in the account you’re using or if you’ve never used CDK in the us-east-1, us-east-2, or us-west-1 Regions, you must bootstrap the regions via the command prompt.

    $ cdk bootstrap

  2. At this point, you can deploy the stack to your default AWS account via the command prompt.

    $ cdk deploy --all

  3. While cdk deploy is running, you will see the output in Figure 2. This is a prompt to ensure you’re aware that you’re making a security-relevant change and creating AWS Identity and Access Management (IAM) roles. Enter y when prompted to continue the deployment process:

    Figure 2. CDK approval prompt to create IAM roles

  4. Confirm cdk deploy is finished. When the deployment is finished, you should see three stack ARNs. It will look similar to Figure 3.

    Figure 3. Final output of CDK deploy

As a result of the deployed CDK code, Security Hub and the Prowler scanner will automatically scan your account, process the data, and send it to S3. While it takes less than an hour for some data to be processed and searchable in Athena, we recommend waiting 24 hours before proceeding to the next steps, to ensure enough data is processed to generate useful visualizations. This is because the remaining steps roll-up findings by the hour. Also, it takes several minutes to get initial results from the Security Hub standards and up to an hour to get initial results from Prowler.

Build an Athena view

Now that you’re deployed the infrastructure to detect, ingest, and transform security related findings, it’s time to use an Athena view to accomplish the analyze portion of the solution. The following view aggregates the number of findings for a given day. Athena views can be used to summarize data or enrich it with data from other sources. Use the following steps to build a simple example view. For more information on creating Athena views, see Working with Views.

To build an Athena view

  1. Open the AWS Management Console and ensure that the Region is set to us-east-1 (Northern Virginia).
  2. Navigate to the Athena service. If you’ve never used this service, choose Get Started to navigate to the Query Editor screen. Otherwise, the Query Editor screen is the default view.
  3. If you’re new to Athena, you also need to set up a query result location.
    1. Choose Settings in the top right of the Query Editor screen to open the settings panel.
    2. Choose Select to select a query result location.

      Figure 4. Athena settings

    3. Locate an S3 bucket in the list that starts with analyticsink-queryresults and choose the right-arrow icon.
    4. Choose Select to select a query results bucket.

      Figure 5. Select S3 location confirmation

  4. Select AwsDataCatalog as the Data source and security_hub_database as the Database. The Query Editor screen should look like Figure 6.

    Figure 6. Empty query editor

  5. Copy and paste the following SQL in the query window:

    CREATE OR REPLACE VIEW "security-hub-rolled-up-finding" AS
    SELECT
      "date_format"("from_iso8601_timestamp"(updatedat), '%Y-%m-%d %H:00') year_month_day
    , region
    , compliance_status
    , workflowstate
    , severity_label
    , COUNT(DISTINCT title) as cnt
    FROM security_hub_database."security-hub-crawled-findings"
    GROUP BY "date_format"("from_iso8601_timestamp"(updatedat), '%Y-%m-%d %H:00'), compliance_status, workflowstate, severity_label, region

  6. Choose the Run query button.

If everything is correct, you should see Query successful in the Results, as shown in Figure 7.

Figure 7. Creating an Athena view
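
If you want to create the view from a script instead of the console, you can submit the same SQL through the Athena API. The following boto3 sketch assumes the query results bucket created by the CDK stack; the bucket suffix is a placeholder.

import boto3

athena = boto3.client("athena", region_name="us-east-1")

create_view_sql = """CREATE OR REPLACE VIEW "security-hub-rolled-up-finding" AS ..."""  # full statement from step 5

athena.start_query_execution(
    QueryString=create_view_sql,
    QueryExecutionContext={"Database": "security_hub_database"},
    ResultConfiguration={"OutputLocation": "s3://analyticsink-queryresults-<suffix>/"},
)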

Build QuickSight visualizations

Now that you’ve deployed the infrastructure to detect, ingest, and transform security related findings, and have created an Athena view to analyze those findings, it’s time to use QuickSight to visualize the findings. To use QuickSight, you must first grant QuickSight permissions to access S3 and Athena. Next you create a QuickSight data source. Third, you will create a QuickSight analysis. (Optional) When complete, you can publish the analysis.

You will build a simple visualization that shows counts of findings over time separated by severity, though it’s also possible to use QuickSight to tell rich and compelling visual stories.

In order to use QuickSight, you need to sign up for a QuickSight subscription. Steps to do so can be found in Signing Up for an Amazon QuickSight Subscription.

The first thing you need to do once logged in to QuickSight is create the data source. If this is your first time logging in to the service, you will be greeted with an initial QuickSight page as shown in Figure 8.

Figure 8. Initial QuickSight page

Grant QuickSight access to S3 and Athena

While creating the Athena data source enables QuickSight to query data through Athena, you also need to enable QuickSight to read the underlying data from S3.

To grant QuickSight access to S3 and Athena

  1. Inside QuickSight, select your profile name (upper right). Choose Manage QuickSight, and then choose Security & permissions.
  2. Choose Add or remove.
  3. Ensure the checkbox next to Athena is selected.
  4. Ensure the checkbox next to Amazon S3 is selected.
  5. Choose Details and then choose Select S3 Buckets.
  6. Locate an S3 bucket in the list that starts with analyticsink-bucket and ensure the checkbox is selected.
    Figure 9. Example permissions

  7. Choose Finish to save changes.

Create a QuickSight dataset

Once you’ve given QuickSight the necessary permissions, you can create a new dataset.

To create a QuickSight dataset

  1. Choose Datasets from the navigation pane at left. Then choose New Dataset.

    Figure 10. Dataset page

  2. To create a new Athena connection profile, use the following steps:
    1. In the FROM NEW DATA SOURCES section, choose the Athena data source card.
    2. For Data source name, enter a descriptive name. For example: security-hub-rolled-up-finding.
    3. For Athena workgroup choose [ primary ].
    4. Choose Validate connection to test the connection. This also confirms encryption at rest.
    5. Choose Create data source.
  3. On the Choose your table screen, select:
    Catalog: AwsDataCatalog
    Database: security_hub_database
    Table: security-hub-rolled-up-finding
  4. Finally, select the Import to SPICE for quicker analytics option and choose Visualize.

Once you’re finished, the page to create your first analysis will automatically open. Figure 11 shows an example of the page.

Figure 11. Create an analysis page

Create a QuickSight analysis

A QuickSight analysis is more than just a visualization—it helps you uncover hidden insights and trends in your data, identify key drivers, and forecast business metrics. You can create rich analytic experiences with QuickSight. For more information, visit Working with Visuals in the QuickSight User Guide.

For simplicity, you’ll build a visualization that summarizes findings categories by severity and aggregated by hour.

To create a QuickSight analysis

  1. Choose Line Chart from the Visual Types.

    Figure 12. Visual types

  2. Select Fields. Figure 13 shows what your field wells should look like at the end of this step.
    1. Locate the year_month_day_hour field in the field list and drag it over to the X axis field well.
    2. Locate the cnt field in the field list and drag it over to the Value field well.
    3. Locate the severity_label field in the field list and drag it over to Color field well.

      Figure 13. Field wells

  3. Add Filters.
    1. Select Filter in the left navigation panel.

      Figure 14. Filters panel

    2. Choose Create one… and select the compliance_status field.
    3. Expand the filter and clear NOT_AVAILABLE and PASSED (Note: depending on your data, you might not have all of these statuses).
    4. Choose Apply to apply the filter.

      Figure 15. Filtering out findings that are not failing

You should now see a visualization that looks like Figure 16, which shows a summary count of events and their severity.

Figure 16. Example visualization (note: this visualization has five days’ worth of data.)

Publish a QuickSight analysis dashboard (optional)

Publishing a dashboard is a great way to share reports with leaders. This two-step process allows you to share visualizations as a dashboard.

To publish a QuickSight analysis

  1. Choose Share on the application bar, then choose Publish dashboard.
  2. Select Publish new dashboard as, and then enter a dashboard name, such as Security Hub Findings by Severity.

You can also embed dashboards into web applications. This requires using the AWS SDK or the AWS Command Line Interface (AWS CLI). For more information, see Embedding QuickSight Data Dashboards for Everyone.

Encouraged security posture in QuickSight

QuickSight has a number of security features. For a summary of the standards that apply to this scenario and further details, see AWS security in Amazon QuickSight in the QuickSight User Guide.

Clean up (optional)

When done, you can clean up QuickSight by removing the Athena view and the CDK stack. Follow the detailed steps below to clean up everything.

To clean up QuickSight

  1. Open the console and choose Datasets in the left navigation pane.
  2. Select security-hub-rolled-up-finding then choose Delete dataset.
  3. Confirm dataset deletion by choosing Delete.
  4. Choose Analyses from the left navigation pane.
  5. Choose the menu in the lower right corner of the security-hub-rolled-up-finding card.

    Figure 17. Example analysis card

  6. Select Delete and confirm Delete.

To remove the Athena view

  1. Paste the following SQL in the query window:

    DROP VIEW “security-hub-rolled-up-finding”

  2. Choose the Run query button.

To remove the CDK stack

  1. Run the following command in your terminal:

    cdk destroy

    Note: If you experience errors, you might need to reactivate your Python virtual environment by completing steps 3–5 of Use AWS CDK to deploy the infrastructure.

Conclusion

In this blog, you used Security Hub and QuickSight to deploy a scalable analytic pipeline for your security tools. Security Hub allowed you to join and collect security findings from multiple sources. With QuickSight, you summarized data for your senior leaders and decision-makers to give them the right data in real time.

You ensured that your sensitive data remained protected by explicitly granting QuickSight the ability to read from a specific S3 bucket. By authorizing access only to the data sources needed to visualize your data, you ensure least privilege access. QuickSight supports many other AWS data sources, including Amazon RDS, Amazon Redshift, Lake Formation, and Amazon OpenSearch Service (successor to Amazon Elasticsearch Service). Because the data doesn’t live inside an Amazon Virtual Private Cloud (Amazon VPC), you didn’t need to grant access to any specific VPCs. Limiting access to VPCs is another great way to improve the security of your environment.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Security Hub forum. To start your 30-day free trial of Security Hub, visit AWS Security Hub.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

David Hessler

David is a senior cloud consultant with AWS Professional Services. He has over a decade of technical experience helping customers tackle their most challenging technical problems and providing tailor-made solutions using AWS services. He is passionate about DevOps, security automation, and how the two work together to allow customers to focus on what matters: their mission.

Add comparative and cumulative date/time calculations in Amazon QuickSight

Post Syndicated from Emily Zhu original https://aws.amazon.com/blogs/big-data/add-comparative-and-cumulative-date-time-calculations-in-amazon-quicksight/

Amazon QuickSight recently added native support for comparative (such as year-over-year) and cumulative (such as year-to-date) period functions, which allow you to easily introduce these calculations into business reporting, trend analysis, and time series analysis. Authors in QuickSight can now implement advanced date/time-aware comparisons without having to use complicated date offsets in their calculations.

In this post, we introduce the new period functions and their capabilities, and demonstrate several typical use cases. We also discuss several scenarios to extend the usage of the period functions, which will be useful in more advanced situations.

New period functions

Before we demonstrate use cases, let’s go over the new period function suite and see what new functions we now support. We can divide period functions into two main groups: comparative (period over period) functions and cumulative (period to date) functions.

Comparative (period over period) functions

You can use period over period functions to compare measures at different time periods, such as year, quarter, and month. For example, you can compute a year-over-year increase in sales, or week-over-week percentage revenue changes.

A typical comparative period function has the syntax periodOverPeriodDifference(measure, date, period, offset), with two optional arguments: period and offset.

You can use the period argument in the function to define the period granularity of the calculation. The granularity of YEAR means year-over-year computation, Quarter means quarter-over-quarter, and so on. If the period argument is left empty, the calculation changes based on the period granularity that is chosen (in the field well) to be displayed in the visual.

You can also use the offset argument to specify how many periods apart you want to compute the comparison. For instance, a period of a quarter with an offset of 2 means comparing against the previous two quarters.

Note that period and offset have to be both specified or both left empty. You can’t specify just one of them.

The following table summarizes the three available period over period functions.

Function Name | Function Type | Description
periodOverPeriodDifference | Table Calculation | Calculates the difference of a measure over two different time periods as specified by period granularity and offset.
periodOverPeriodLastValue | Table Calculation | Calculates the last (previous) value of a measure from a previous time period as specified by period granularity and offset.
periodOverPeriodPercentDifference | Table Calculation | Calculates the percent difference of a measure over two different time periods as specified by period granularity and offset.

Cumulative (period to date) functions

You can use period to date functions to calculate metrics within a given period-to-date window. There are two main types of cumulative functions.

  • “OverTime” functions, which are table calculations and return outputs for each row in the visual.

E.g., you can use periodToDateCountOverTime with WEEK granularity to compute a series of week-to-date new customer counts to track the fluctuation of customer engagement.

  • Aggregation functions, which output an aggregated value for a fixed period-to-date time range.

E.g., periodToDateSum with YEAR granularity returns a single value for the total of the metric from the beginning of the year to the endDate provided in the formula. If left blank, the default endDate is now(), which is the moment when the user loads the dashboard.

The following table summarizes the period to date functions.

Function Name | Function Type | Description
periodToDateAvgOverTime | Table Calculation | Calculates the average of a measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateCountOverTime | Table Calculation | Calculates the count of a dimension or measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateMaxOverTime | Table Calculation | Calculates the maximum of a measure or date for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateMinOverTime | Table Calculation | Calculates the minimum of a measure or date for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateSumOverTime | Table Calculation | Calculates the sum of a measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateAvg | Aggregation | Averages the set of numbers in the specified measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateCount | Aggregation | Calculates the number of values in a dimension or measure for a given time granularity (for instance, a quarter) up to a point in time, including duplicates.
periodToDateMax | Aggregation | Returns the maximum value of the specified measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateMedian | Aggregation | Returns the median value of the specified measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateMin | Aggregation | Returns the minimum value of the specified measure or date for a given time granularity (for instance, a quarter) up to a point in time.
periodToDatePercentile | Aggregation | Calculates the percentile based on the actual numbers in the measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDatePercentileCont | Aggregation | Calculates the percentile based on a continuous distribution of the numbers in the measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateStDev | Aggregation | Calculates the standard deviation of the set of numbers in the specified measure for a given time granularity (for instance, a quarter) up to a point in time, based on a sample.
periodToDateStDevP | Aggregation | Calculates the population standard deviation of the set of numbers in the specified measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateSum | Aggregation | Adds the set of numbers in the specified measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateVar | Aggregation | Calculates the sample variance of the set of numbers in the specified measure for a given time granularity (for instance, a quarter) up to a point in time.
periodToDateVarP | Aggregation | Calculates the population variance of the set of numbers in the specified measure for a given time granularity (for instance, a quarter) up to a point in time.

Use case 1: Using a period over period function to analyze sales growth in fixed and dynamic granularity

Let’s dive into how period over period functions can support typical business and financial use cases. The following example uses periodOverPeriodDifference to calculate YoY sales growth. Because we have Segment and Region in the visual, the YoY sales is calculated for each segment and region.

We define the measure of YoYSales with the following formula: YoYSales=periodOverPeriodDifference(sum(Sales),{Order Date},YEAR,1)

The first argument, sum(Sales), tells the function to calculate based on this measure. The second argument, Order Date, specifies the date/time column from which Year information is extracted. The third argument, YEAR, fixes the granularity of this calculation. When this optional argument is specified, this measure always returns YoY (not QoQ or MoM) no matter how Order Date is selected (in the field well) to be displayed in the visual. The fourth argument, 1, specifies the offset of the comparison. In this example, it means we want to compare the sales of each order date with the same date of the previous year. The measure returns empty for order dates of 2018, because no previous periods exist to be compared with.

The period functions are working with totals and subtotals. By adding the total for columns into the visual, you can see the total sales and total YoYSales for each region.

If you leave the optional argument of period granularity empty, meaning you change the formula to PoPSales=periodOverPeriodDifference(sum(Sales),{Order Date}) as shown in the following example, the time period of the calculation is then determined by the granularity of Order Date displayed on the visual. In the following example, Order Date is chosen to display at the quarter level (in the field well), so PoPSales dynamically calculates the QoQ sales growth. Changing Order Date to the monthly level updates the measure to calculate MoM. For PoPSales, only Q1 2018 returns empty because that’s the only quarter that doesn’t have a previous quarter to compare with.

If we add YoYSales from the previous example to this visual, it calculates YoY sales growth at the quarter level (compares sales of Q1 2019 with Q1 2018). This demonstrates the difference between a fixed granularity and a dynamic granularity of period over period functions.

The period over period functions can differentiate between a positive change (increase) and negative change (decrease). Therefore, when we add the conditional formatting to the visual, it’s very straightforward to see the financial performance of each period (green is good, red is bad).

Similarly, you can use periodOverPeriodPercentDifference to calculate relative sales growth over time. You can add dimensions into the visual to dive further into business insights, such as analyzing the breakdown of each business segment’s sales change by quarter, and their contribution to the total sales increase. We use the formula PoPSales%=periodOverPeriodPercentDifference(sum(Sales),{Order Date}).

Use case 2: Using a period to date function to track YTD sales in table calculations and aggregations

Similar to period over period functions, the period to date function suite provides a quick and easy way to calculate year-to-date (YTD) or quarter-to-date (QTD) metrics. In the following example, we use the formula of YTDSumoverSales=periodToDateSumOverTime(sum(Sales),{Order Date},YEAR), and YTDtotalcountSales=periodToDateSumOverTime(count(Sales),{Order Date},YEAR) to calculate YTD sales and YTD total number of transactions.

Contrary to period over period functions, the third argument of period to date functions, period, isn’t optional. Therefore, the calculation granularity is always fixed. In this example, with the granularity defined as YEAR, this measure always calculates YTD, instead of QTD or MTD. Because Order Date is displayed at the monthly level, this calculation outputs the YTD sales of each month, and starts over again in January for the next year. As shown in the result table, YTDSumoverSales of January 2018 is the monthly sales of January 2018, and YTDSumoverSales of February 2018 is the monthly sales of January 2018 plus that of February 2018. And YTDSumoverSales of January 2019 goes back to the monthly sales of January 2019.

You can further dive into the details by populating the calculations in a line chart, and adding more dimensions into the analysis. The following example shows the YTD weekly sales growth trend for each region along the past four years, and uncovers some interesting sales competition between AMER and EMEA in year 2021.

In addition to the table calculations, the aggregation period functions are particularly useful when you need to build KPI charts to evaluate YTD metrics in real time. In the following example, we use the aggregation period to date functions to build two KPI charts to track the YTD total sales and the YTD total number of transactions. For December 26, 2021, the KPI results match the corresponding table calculation values for that date. The following table summarizes the formulas.

Formula | Formula Type
YTDSumoverSales=periodToDateSumOverTime(sum(Sales),{Order Date},YEAR) | Table Calculation
YTDtotalcountSales=periodToDateSumOverTime(count(Sales),{Order Date},YEAR) | Table Calculation
YTDSumSales=periodToDateSum(Sales,{Order Date},YEAR) | Aggregation (KPI chart)
YTDCountSales=periodToDateCount(Sales,{Order Date},YEAR) | Aggregation (KPI chart)

Advanced use case 1: Date/time awareness with period functions

Period functions are not only easier to define and read, they’re also date/time-aware, meaning the functions are calculated based on a date/time-based offset instead of a fixed number of rows. It can solve two major problems that were not possible to be addressed before.

Period functions can handle varying period duration

If you want to calculate the daily MoM sales increase, you can’t use a fixed offset on each month because the number of days of each month are different (31 days for January and 28 or 29 days for February).

Period functions are calculated based on calendar dates instead of a fixed offset. In the following example, we use the formula MoMsalesDiff=periodOverPeriodDifference(sum(Sales),{Order Date},MONTH,1). The daily MoM increase is calculated correctly based on the day of the month. The sales of the first day of the month are compared with the first day of the previous month, and the same applies to all other days. (Visuals are duplicated for demonstration purposes.)

Period functions can handle sparse (missing) data points

Not all datasets can guarantee a complete set of dates. In the previous example, sales data of January 1, 2018, is missing. Using the workaround based on a fixed offset can cause a problem here because we compare February 1, 2018, with a different date instead of January 1, 2018. Period functions always compare measures by date/time offsets so that only desired dates are compared. In the previous example, MoMsalesDiff shows empty for February 1, 2018, because of the missing data of January 1, 2018.

Advanced use case 2: Nesting period functions with other calculations

Now that we can use period over period and period to date functions to create calculated fields, we can nest these functions with other calculations to drive more advanced analysis.

For example, you may want to know for each year, what are the top 10 weeks of the year in regards to week-over-week sales growth. You can do this by calculating WoWSales first: WoWSales=periodOverPeriodDifference(sum(Sales), {Order Date}, WEEK, 1). Then you nest it with the denseRank window function: RankWoWSales=denseRank([WoWSales DESC],[{YEAR}]). This wouldn’t be possible using the fixed-row based workaround, which is implemented using on-visual calculations instead of calculated fields. In the following visual, the top 10 weeks of each year with the highest sales growth are fetched by a simple filter on RankWoWSales.

You can even nest the period functions with other period functions to generate interesting insights. For example, you can calculate a monthly YoY growth based on the monthly YTD number of transactions. The following formula demonstrates the capability of nesting a YTD calculated field inside a YoY calculated field:

YTDtotalcountSales=periodToDateSumOverTime(count(Sales),{Order Date},YEAR)
YoYYTDSalesCount=periodOverPeriodDifference(YTDtotalcountSales,{Order Date},YEAR,1)

The results in the following visual show a YoY growth based on a YTD accumulated number of transactions instead of the absolute monthly numbers.

Advanced use case 3: Partial period comparisons

Finally, we discuss a third advanced use case: partial period comparison. Imagine it’s November 15, 2021 (which is the 46th day of the last quarter of 2021), and you want to compare the performance of this quarter with past quarters, but only using the first 46 days of each quarter instead of the whole quarter. This requires a calculated field using periodOverPeriodDifference nested with the sumIf() window function.

The following example demonstrates using a nested calculated field to address this use case:

QuarterToDate=periodToDateSumOverTime(sum(Sales), {Order Date}, QUARTER)
PartialQTDSales=sumIf(Sales, dateDiff(truncDate("Q", {Order Date}), {Order Date}, "HH") <= dateDiff(truncDate("Q", now()), now(), "HH"))
PartialQoQQTDSales=periodOverPeriodDifference(sumif(Sales, {Order Date} <= addDateTime(dateDiff(truncDate("Q", now()), now(), "HH"), "HH", truncDate("Q", {Order Date}))), {Order Date}, QUARTER, 1)

PartialQTDSales computes how many hours have elapsed from the beginning of this quarter to the current date and uses sumIf() to calculate the total sales of that period for each quarter. PartialQoQQTDSales then nests the periodOverPeriodDifference function with PartialQTDSales to find the partial QoQ differences. Such a comparison based on a partial period isn’t feasible without the new date/time-aware period functions.

Conclusion

In this blog, we introduced new QuickSight period functions which enable quick and powerful date/time-based calculations. We reviewed comparative and cumulative period functions (i.e., period over period and period to date), discussed two major use cases (fixed vs. dynamic granularity and table calculation vs. aggregation), and extended the usage to three advanced use cases. Period functions are now generally available in all supported QuickSight Regions.

We look forward to your feedback and stories about how you apply these calculations to your business needs.


About the Authors

Emily Zhu is a Senior Product Manager at Amazon QuickSight, AWS’s cloud-native, fully managed SaaS BI service. She leads the development of QuickSight’s core analytics and calculations capability. Before joining AWS, she worked in the Amazon Prime Air drone delivery program and at the Boeing Company as a senior strategist for several years. Emily is passionate about the potential of cloud-based BI solutions and looks forward to helping customers advance their data-driven decision-making.

Rajkumar Haridoss is a Senior Software Development Engineer for AWS QuickSight. He is the lead engineer on the Query Generation team and works on back-end calculations, query planning, and the query generation layer in QuickSight. Outside of work, he likes spending quality time with his family and his 4-year-old.

How Ribbon Communications Built a Scalable, Resilient Robocall Mitigation Platform

Post Syndicated from Siva Rajamani original https://aws.amazon.com/blogs/architecture/how-ribbon-communications-built-a-scalable-resilient-robocall-mitigation-platform/

Ribbon Communications provides communications software and end-to-end IP and optical networking solutions that deliver innovation, unparalleled scale, performance, and agility to service providers and enterprises.

Ribbon Communications is helping customers modernize their networks. In today’s data-hungry, 24/7 world, this equates to improved competitive positioning and business outcomes. Companies are migrating from on-premises equipment for telephony services and looking for equivalent as a service (aaS) offerings. But these solutions must still meet the stringent resiliency, availability, performance, and regulatory requirements of a telephony service.

The telephony world is inundated with robocalls. In the United States alone, there were an estimated 50.5 billion robocalls in 2021! In this blog post, we describe the Ribbon Identity Hub, a holistic solution for robocall mitigation. The Ribbon Identity Hub enables services that sign and verify caller identity, compliant with the ATIS standards under the STIR/SHAKEN framework. It also evaluates and scores calls for the probability of nuisance and fraud.

Ribbon Identity Hub is implemented in Amazon Web Services (AWS). It is a fully managed service for telephony service providers and enterprises. The solution is secure, multi-tenant, automatically scaling, and multi-Region, and it enables Ribbon to offer managed services to a wide range of telephony customers. Ribbon ensures resiliency and performance with efficient use of resources in the telephony environment, where load ratios between busy and idle time can exceed 10:1.

Ribbon Identity Hub

The Ribbon Identity Hub services are separated into a data (call-transaction) plane, and a control plane.

Data plane (call-transaction)

The call-transaction processing is typically invoked on a per-call-setup basis where availability, resilience, and performance predictability are paramount. Additionally, due to high variability in load, automatic scaling is a prerequisite.

Figure 1. Data plane architecture

Several AWS services come together in a solution that meets all these important objectives:

  1. Amazon Elastic Container Service (ECS): The ECS services are set up for automatic scaling and span two Availability Zones. This provides the horizontal scaling capability, the self-healing capacity, and the resiliency across Availability Zones.
  2. Elastic Load Balancing – Application Load Balancer (ALB): This provides the ability to distribute incoming traffic to ECS services as the target. In addition, it also offers:
    • Seamless integration with the ECS Auto Scaling group. As the group grows, traffic is directed to the new instances only when they are ready. As traffic drops, traffic is drained from the target instances for graceful scale down.
    • Full support for canary and linear upgrades with zero downtime. Maintains full-service availability without any changes or even perception for the client devices.
  3. Amazon Simple Storage Service (S3): Transaction detail records associated with call-related requests must be securely and reliably maintained for over a year due to billing and other contractual obligations. Amazon S3 simplifies this task with high durability, lifecycle rules, and varied controls for retention.
  4. Amazon DynamoDB: Building resilient services is significantly easier when the compute processing can be stateless. Amazon DynamoDB facilitates such stateless architectures without compromise. Coupled with the availability of the Amazon DynamoDB Accelerator (DAX) caching layer, the solution can meet the extreme low latency operation requirements.
  5. AWS Key Management Service (KMS): Certain tenant configuration is highly confidential and requires elevated protection. Furthermore, the data is part of the state that must be recovered across Regions in disaster recovery scenarios. To meet the security requirements, KMS is used for envelope encryption with per-tenant keys (a minimal sketch of this pattern follows the list). Multi-Region KMS keys facilitate the secure availability of this state across Regions without application-level intervention when replicating encrypted data.
  6. Amazon Route 53: For telephony services, any non-transient service failure is unacceptable. In addition to providing a high degree of resiliency through its Multi-AZ architecture, Identity Hub also provides Region-level high availability through its multi-Region active-active architecture. Route 53 with health checks dynamically reroutes requests to alternate Regions within minutes.
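
The post doesn’t include Ribbon’s code, but the per-tenant envelope encryption described in item 5 generally follows the pattern in this boto3 sketch. The key alias, field names, and payload structure are hypothetical and shown only to illustrate the approach:

import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
TENANT_KEY_ID = "alias/tenant-1234"  # hypothetical per-tenant multi-Region key alias

def encrypt_tenant_config(plaintext: bytes) -> dict:
    # Request a fresh data key: the plaintext copy encrypts the payload locally,
    # and only the KMS-encrypted copy is stored next to the ciphertext.
    data_key = kms.generate_data_key(KeyId=TENANT_KEY_ID, KeySpec="AES_256")
    nonce = os.urandom(12)
    ciphertext = AESGCM(data_key["Plaintext"]).encrypt(nonce, plaintext, None)
    return {"ciphertext": ciphertext, "nonce": nonce, "encrypted_key": data_key["CiphertextBlob"]}

def decrypt_tenant_config(record: dict) -> bytes:
    # KMS unwraps the stored data key; with a multi-Region key replica, the same
    # call works in the failover Region without re-encrypting stored records.
    key = kms.decrypt(CiphertextBlob=record["encrypted_key"])["Plaintext"]
    return AESGCM(key).decrypt(record["nonce"], record["ciphertext"], None)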

Control plane

The Identity Hub control plane is used for customer configuration, status, and monitoring. The API is REST-based. Since this is not used on a call-by-call basis, the requirements around latency and performance are less stringent, though the requirements around high resiliency and dynamic scaling still apply. In this area, ease of implementation and maintainability are key.

Figure 2. Control plane architecture

The following AWS services implement our control plane:

  1. Amazon API Gateway: Coupled with a custom authenticator, API Gateway handles all the REST API credential verification and routing. Implementing the API then reduces to implementing a handler for each resource, and these handlers form the application core of the API.
  2. AWS Lambda: All the REST API handlers are written as Lambda functions. By using Lambda’s serverless and concurrency features, the application automatically gains self-healing and auto scaling capabilities. There is also a significant cost advantage because billing is per millisecond of actual compute time used, which matters for a control plane where usage is typically sparse and unpredictable.
  3. Amazon DynamoDB: In a stateless architecture with Lambda and API Gateway, all persistent state must be stored in an external database. The database must match the resilience and auto scaling characteristics of the rest of the control plane, and DynamoDB easily fits these requirements (a minimal handler sketch follows this list).
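
As an illustration of this stateless pattern, a control plane handler might look like the following sketch. The table name, key schema, and resource paths are hypothetical placeholders, not Ribbon’s actual API:

import json
import os
import boto3

# Hypothetical table; the handler keeps no local state, so any concurrent
# Lambda execution can serve any request.
TABLE = boto3.resource("dynamodb").Table(os.environ.get("CONFIG_TABLE", "tenant-config"))

def handler(event, context):
    # With an API Gateway proxy integration, the HTTP method and path
    # parameters arrive in the event payload.
    tenant_id = event["pathParameters"]["tenantId"]
    if event["httpMethod"] == "GET":
        item = TABLE.get_item(Key={"tenantId": tenant_id}).get("Item")
        return {"statusCode": 200 if item else 404, "body": json.dumps(item or {}, default=str)}
    if event["httpMethod"] == "PUT":
        TABLE.put_item(Item={"tenantId": tenant_id, **json.loads(event["body"])})
        return {"statusCode": 204, "body": ""}
    return {"statusCode": 405, "body": ""}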

The customer portal, in addition to providing the user interface for control plane REST APIs, also delivers a rich set of user-customizable dashboards and reporting capability. Here again, the availability of various AWS services simplifies the implementation, and remains non-intrusive to the central call-transaction processing.

Services used here include:

  1. AWS Glue: Enables extraction and transformation of raw transaction data into a format useful for reporting and dashboarding. AWS Glue is particularly useful here because the available data is continually expanding and the reporting and dashboarding use cases keep increasing.
  2. Amazon QuickSight: Provides all the business intelligence (BI) functionality, including the ability for Ribbon to offer separate author and reader access to their users, and implements tenant-based access separation.

Conclusion

Ribbon has successfully deployed Identity Hub to enable cloud hosted telephony services to mitigate robocalls. Telephony requirements around resiliency, performance, and capacity were not compromised. Identity Hub offers the benefits of a 24/7 fully managed service requiring no additional customer on-premises equipment.

Choosing AWS services for Identity Hub gives Ribbon the ability to scale and meet future growth. The ability to dynamically scale the service in and out also brings significant cost advantages in telephony applications where busy hour traffic is significantly higher than idle time traffic. In addition, the availability of global AWS services facilitates the deployment of services in customer-local geographic locations to meet performance requirements or local regulatory compliance.

Enrich datasets for descriptive analytics with AWS Glue DataBrew

Post Syndicated from Daniel Rozo original https://aws.amazon.com/blogs/big-data/enrich-datasets-for-descriptive-analytics-with-aws-glue-databrew/

Data analytics remains a consistently hot topic. More and more businesses are beginning to understand the potential their data has to help them serve customers more effectively and gain a competitive advantage. However, for many small to medium businesses, gaining insight from their data can be challenging because they often lack in-house data engineering skills and knowledge.

Data enrichment is another challenge. Businesses that focus on analytics using only their internal datasets miss the opportunity to gain better insights by using reliable and credible public datasets. Small to medium businesses are no exception to this shortcoming, where obstacles such as not having sufficient data diminish their ability to make well-informed decisions based on accurate analytical insights.

In this post, we demonstrate how AWS Glue DataBrew enables businesses of all sizes to get started with data analytics with no prior coding knowledge. DataBrew is a visual data preparation tool that makes it easy for data analysts and scientists to clean and normalize data in preparation for analytics or machine learning. It includes more than 350 pre-built transformations for common data preparation use cases, enabling you to get started with cleaning, preparing, and combining your datasets without writing code.

For this post, we assume the role of a fictitious small Dutch solar panel distribution and installation company named OurCompany. We demonstrate how this company can prepare, combine, and enrich an internal dataset with publicly available data from the Dutch public entity, the Centraal Bureau voor de Statistiek (CBS), or in English, Statistics Netherlands. Ultimately, OurCompany desires to know how well they’re performing compared to the official reported values by the CBS across two important key performance indicators (KPIs): the amount of solar panel installations, and total energy capacity in kilowatt (kW) per region.

Solution overview

The architecture uses DataBrew for data preparation and transformation, Amazon Simple Storage Service (Amazon S3) as the storage layer of the entire data pipeline, and the AWS Glue Data Catalog for storing the datasets’ business and technical metadata. Following modern data architecture best practices, this solution adheres to the foundational logical layers of the Lake House Architecture.

The solution includes the following steps:

  1. We set up the storage layer using Amazon S3 by creating the following folders: raw-data, transformed-data, and curated-data. We use these folders to track the different stages of our data pipeline consumption readiness.
  2. Three CSV raw data files containing unprocessed data of solar panels as well as the external datasets from the CBS are ingested into the raw-data S3 folder.
  3. This part of the architecture incorporates both processing and cataloging capabilities:
    1. We use AWS Glue crawlers to automatically populate the initial schema definition tables for the raw dataset. For the remaining two stages of the data pipeline (transformed-data and curated-data), we use DataBrew to create schema definition tables directly in the Data Catalog. Each table provides an up-to-date schema definition of the datasets we store on Amazon S3.
    2. We work with DataBrew projects as the centerpiece of our data analysis and transformation efforts. Here, we set up no-code data preparation and transformation steps, and visualize them through a highly interactive, intuitive user interface. Finally, we define DataBrew jobs to apply these steps and store the transformation outputs on Amazon S3.
  4. To gain the benefits of granular access control and easily visualize data from Amazon S3, we take advantage of the seamless integration between Amazon Athena and Amazon QuickSight. This provides a SQL interface to query all the information we need from the curated dataset stored on Amazon S3 without the need to create and maintain manifest files.
  5. Finally, we construct an interactive dashboard with QuickSight to depict the final curated dataset alongside our two critical KPIs.

Prerequisites

Before beginning this tutorial, make sure you have the required Identity and Access Management (IAM) permissions to create the resources required as part of the solution. Your AWS account should also have an active subscription to QuickSight to create the visualization on processed data. If you don’t have a QuickSight account, you can sign up for an account.

The following sections provide a step-by-step guide to create and deploy the entire data pipeline for OurCompany without the use of code.

Data preparation steps

We work with the following files:

  • CBS Dutch municipalities and provinces (Gemeentelijke indeling op 1 januari 2021) – Holds all the municipalities and provinces names and codes of the Netherlands. Download the file gemeenten alfabetisch 2021. Open the file and save it as cbs_regions_nl.csv. Remember to change the format to CSV (comma-delimited).
  • CBS Solar power dataset (Zonnestroom; vermogen bedrijven en woningen, regio, 2012-2018) – This file contains the installed capacity in kilowatts and total number of installations for businesses and private homes across the Netherlands from 2012–2018. To download the file, go to the dataset page, choose the Onbewerkte dataset, and download the CSV file. Rename the file to cbs_sp_cap_nl.csv.
  • OurCompany’s solar panel historical data – Contains the reported energy capacity from all solar panel installations of OurCompany across the Netherlands from 2012 until 2018. Download the file.

As a result, the following are the expected input files we use to work with the data analytics pipeline:

  • cbs_regions_nl.csv
  • cbs_sp_cap_nl.csv
  • sp_data.csv

Set up the storage layer

We first need to create the storage layer for our solution to store all raw, transformed, and curated datasets. We use Amazon S3 as the storage layer of our entire data pipeline. If you prefer to script this setup, a minimal AWS SDK sketch follows the steps.

  1. Create an S3 bucket in the AWS Region where you want to build this solution. In our case, the bucket is named cbs-solar-panel-data. You can use the same name followed by a unique identifier.
  2. Create the following three prefixes (folders) in your S3 bucket by choosing Create folder:
    1. curated-data/
    2. raw-data/
    3. transformed-data/

  3. Upload the three raw files to the raw-data/ prefix.
  4. Create two prefixes within the transformed-data/ prefix named cbs_data/ and sp_data/.
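
The following boto3 sketch performs the same setup programmatically. The bucket name placeholder and the eu-west-1 Region are assumptions; adjust them to your environment:

import boto3

s3 = boto3.client("s3")
bucket = "cbs-solar-panel-data-<unique-suffix>"  # replace with your bucket name

# Outside us-east-1, the bucket Region is passed as a location constraint.
s3.create_bucket(
    Bucket=bucket,
    CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},  # assumed Region
)

# S3 has no real folders; zero-byte objects whose keys end in "/" make the
# prefixes visible in the console, matching the manual steps above.
for prefix in ["raw-data/", "transformed-data/", "transformed-data/cbs_data/",
               "transformed-data/sp_data/", "curated-data/"]:
    s3.put_object(Bucket=bucket, Key=prefix)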

Create a Data Catalog database

After we set up the storage layer of our data pipeline, we need to create the Data Catalog to store all the metadata of the datasets hosted in Amazon S3. To do so, follow these steps:

  1. Open the AWS Glue console in the same Region of your newly created S3 bucket.
  2. In the navigation pane, choose Databases.
  3. Choose Add database.
  4. Enter sp_catalog_db as the name of the database that stores all the dataset metadata.

Create AWS Glue data crawlers

Now that we created the catalog database, it’s time to crawl the raw-data prefix to automatically retrieve the metadata associated with each input file.

  1. On the AWS Glue console, choose Crawlers in the navigation pane.
  2. Add a crawler with the name crawler_raw and choose Next.
  3. For S3 path, select the raw-data folder of the cbs-solar-panel-data bucket.
  4. Create an IAM role and name it AWSGlueServiceRole-cbsdata.
  5. Leave the frequency as Run on demand.
  6. Choose the sp_catalog_db database created in the previous section, and enter the prefix raw_ to identify the tables that belong to the raw data folder.
  7. Review the parameters of the crawler and then choose Finish.
  8. After the crawler is created, select it and choose Run crawler.

After the crawler runs successfully, three tables are created in the sp_catalog_db database: raw_sp_data_csv, raw_cbs_regions_nl_csv, and raw_cbs_sp_cap_nl_csv.
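
If you prefer to script this stage as well, the database and crawler can be created with boto3 along roughly these lines (the IAM role is assumed to already exist with S3 read access and the AWSGlueServiceRole managed policy):

import boto3

glue = boto3.client("glue")

# Catalog database used throughout the pipeline.
glue.create_database(DatabaseInput={"Name": "sp_catalog_db"})

# Crawler over the raw-data prefix; discovered tables get the raw_ prefix.
glue.create_crawler(
    Name="crawler_raw",
    Role="AWSGlueServiceRole-cbsdata",
    DatabaseName="sp_catalog_db",
    TablePrefix="raw_",
    Targets={"S3Targets": [{"Path": "s3://<your-S3-bucket-name>/raw-data/"}]},
)
glue.start_crawler(Name="crawler_raw")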

Create DataBrew raw datasets

To utilize the power of DataBrew, we need to connect datasets that point to the Data Catalog S3 tables we just created. Follow these steps to connect the datasets:

  1. On the DataBrew console, choose Datasets in the navigation pane.
  2. Choose Connect new dataset.
  3. Name the dataset cbs-sp-cap-nl-dataset.
  4. For Connect to new dataset, choose Data Catalog S3 tables.
  5. Select the sp_catalog_db database and the raw_cbs_sp_cap_nl_csv table.
  6. Choose Create dataset.

We need to create two more datasets following the same process. The following table summarizes the names and catalog tables required for the new datasets; a scripted equivalent for all three datasets follows the table.

Dataset name Data catalog table
sp-dataset raw_sp_data_csv
cbs-regions-nl-dataset raw_cbs_regions_nl_csv
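
The following boto3 sketch connects the same three datasets programmatically, assuming the crawler has already populated the raw_ tables:

import boto3

databrew = boto3.client("databrew")

# One DataBrew dataset per raw Data Catalog table created by the crawler.
datasets = {
    "cbs-sp-cap-nl-dataset": "raw_cbs_sp_cap_nl_csv",
    "sp-dataset": "raw_sp_data_csv",
    "cbs-regions-nl-dataset": "raw_cbs_regions_nl_csv",
}
for name, table in datasets.items():
    databrew.create_dataset(
        Name=name,
        Input={"DataCatalogInputDefinition": {
            "DatabaseName": "sp_catalog_db",
            "TableName": table,
        }},
    )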

Import DataBrew recipes

A recipe is a set of data transformation steps. These transformations are applied to one or multiple datasets of your DataBrew project. For more information about recipes, see Creating and using AWS Glue DataBrew recipes.

We have prepared three DataBrew recipes, which contain the set of data transformation steps we need for this data pipeline. Some of these transformation steps include: renaming columns (from Dutch to English), removing null or missing values, aggregating rows based on specific attributes, and combining datasets in the transformation stage.

To import the recipes, follow these instructions:

  1. On the DataBrew console, choose Recipes in the navigation pane.
  2. Choose Upload recipe.
  3. Enter the name of the recipe: recipe-1-transform-cbs-data.
  4. Upload the following JSON recipe.
  5. Choose Create recipe.

Now we need to upload two more recipes that we use for the transformation and aggregation projects in DataBrew. A scripted alternative for uploading all three recipes follows these steps.

  1. Follow the same procedure to import the following recipes:
Recipe name Recipe source file
recipe-2-transform-sp-data Download
recipe-3-curate-sp-cbs-data Download
  2. Make sure the recipes are listed in the Recipes section filtered by All recipes.
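
As an alternative to the console upload, the downloaded recipe files can be registered with boto3 along these lines. The local file paths are hypothetical, and the sketch assumes each downloaded file contains the recipe’s list of steps in the shape the CreateRecipe API expects:

import json
import boto3

databrew = boto3.client("databrew")

# Hypothetical local paths to the recipe files downloaded in the steps above.
recipes = {
    "recipe-1-transform-cbs-data": "recipe-1-transform-cbs-data.json",
    "recipe-2-transform-sp-data": "recipe-2-transform-sp-data.json",
    "recipe-3-curate-sp-cbs-data": "recipe-3-curate-sp-cbs-data.json",
}
for name, path in recipes.items():
    with open(path) as f:
        steps = json.load(f)  # assumed to be the recipe's Steps list
    databrew.create_recipe(Name=name, Steps=steps)
    databrew.publish_recipe(Name=name)  # publish a version that jobs can reference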

Set up DataBrew projects and jobs

After we successfully create the Data Catalog database, crawlers, and DataBrew datasets, and import the DataBrew recipes, we can create the first transformation project.

CBS external data transformation project

The first project takes care of transforming, cleaning, and preparing cbs-sp-cap-nl-dataset. To create the project, follow these steps:

  1. On the DataBrew console, choose Projects in the navigation pane.
  2. Create a new project with the name 1-transform-cbs-data.
  3. In the Recipe details section, choose Edit existing recipe and choose the recipe recipe-1-transform-cbs-data.
  4. Select the newly created cbs-sp-cap-nl-dataset under Select a dataset.
  5. In the Permissions section, choose Create a new IAM role.
  6. For the role suffix, enter sp-project.
  7. Choose Create project.

After you create the project, a preview dataset is displayed as a result of applying the selected recipe. When you choose 10 more recipe steps, the service shows the entire set of transformation steps.

After you create the project, you need to grant put and delete S3 object permissions to the created role AWSGlueDataBrewServiceRole-sp-project on IAM. Add an inline policy using the following JSON and replace the resource with your S3 bucket name:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:DeleteObject"
            ],
            "Resource": "arn:aws:s3:::<your-S3-bucket-name>/*"
        }
    ]
}

This role also needs permissions to access the Data Catalog. To grant these permissions, add the managed policy AWSGlueServiceRole to the role.

CBS external data transformation job

After we define the project, we need to configure and run a job to apply the transformation across the entire raw dataset stored in the raw-data folder of your S3 bucket. To do so, complete the following steps:

  1. On the DataBrew project page, choose Create job.
  2. For Job name, enter 1-transform-cbs-data-job.
  3. For Output to, choose Data Catalog S3 tables.
  4. For File type, choose Parquet.
  5. For Database name, choose sp_catalog_db.
  6. For Table name, choose Create new table.
  7. For Catalog table name, enter transformed_cbs_data.
  8. For S3 location, enter s3://<your-S3-bucket-name>/transformed-data/cbs_data/.
  9. In the job output settings section, choose Settings.
  10. Select Replace output files for each job run and then choose Save.
  11. In the permissions section, choose the automatically created role with the sp-project suffix; for example, AWSGlueDataBrewServiceRole-sp-project.
  12. Review the job details once more and then choose Create and run job.
  13. Back in the main project view, choose Job details.

After a few minutes, the job status changes from Running to Successful. Choose the output to go to the S3 location where all the generated Parquet files are stored.
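
For reference, a similar job can be defined and started with boto3. The sketch below uses the plain S3 output variant with the placeholder names from this walkthrough; the console steps above additionally register the output as a Data Catalog table, which the CreateRecipeJob API also supports:

import boto3

databrew = boto3.client("databrew")

databrew.create_recipe_job(
    Name="1-transform-cbs-data-job",
    ProjectName="1-transform-cbs-data",  # the project already binds dataset and recipe
    RoleArn="arn:aws:iam::<account-id>:role/AWSGlueDataBrewServiceRole-sp-project",
    Outputs=[{
        "Format": "PARQUET",
        "Location": {"Bucket": "<your-S3-bucket-name>", "Key": "transformed-data/cbs_data/"},
        "Overwrite": True,
    }],
)

# Start a run and keep its ID for polling with describe_job_run.
run_id = databrew.start_job_run(Name="1-transform-cbs-data-job")["RunId"]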

Solar panels data transformation stage

We now create the second phase of the data pipeline. We create a project and a job using the same procedure described in the previous section.

  1. Create a DataBrew project with the following parameters:
    1. Project name – 2-transform-sp-data
    2. Imported recipe – recipe-2-transform-sp-data
    3. Dataset – sp-dataset
    4. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  2. Create and run another DataBrew job with the following parameters:
    1. Job name – 2-transform-sp-data-job
    2. Output to – Data Catalog S3 tables
    3. File type – Parquet
    4. Database name – sp_catalog_db
    5. Create new table with table name – transformed_sp_data
    6. S3 location – s3://<your-S3-bucket-name>/transformed-data/sp_data/
    7. Settings – Replace output files for each job run
    8. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  3. After the job is complete, create the DataBrew datasets with the following parameters:
Dataset name Data catalog table
transformed-cbs-dataset awsgluedatabrew_transformed_cbs_data
transformed-sp-dataset awsgluedatabrew_transformed_sp_data

You should now see five datasets listed on the DataBrew Datasets page.

Data curation and aggregation stage

We now create the final DataBrew project and job.

  1. Create a DataBrew project with the following parameters:
    1. Project name – 3-curate-sp-cbs-data
    2. Imported recipe – recipe-3-curate-sp-cbs-data
    3. Dataset – transformed-sp-dataset
    4. Permissions role – AWSGlueDataBrewServiceRole-sp-project
  2. Create a DataBrew job with the following parameters:
    1. Job name – 3-curate-sp-cbs-data-job
    2. Output to – Data Catalog S3 tables
    3. File type – Parquet
    4. Database name – sp_catalog_db
    5. Create new table with table name – curated_data
    6. S3 location – s3://<your-S3-bucket-name>/curated-data/
    7. Settings – Replace output files for each job run
    8. Permissions role – AWSGlueDataBrewServiceRole-sp-project

The last project defines a single transformation step: the join between the transformed-cbs-dataset and the transformed-sp-dataset, based on the municipality code and the year.

The DataBrew job should take a few minutes to complete.

Next, check your sp_catalog_db database. You should now have raw, transformed, and curated tables in your database. DataBrew automatically adds the prefix awsgluedatabrew_ to both the transformed and curated tables in the catalog.

Consume curated datasets for descriptive analytics

We’re now ready to build the consumption layer for descriptive analytics with QuickSight. In this section, we build a business intelligence dashboard that reflects OurCompany’s share of solar panel installations and energy capacity in contrast to the values reported by the CBS from 2012–2018.

To complete this section, you need to have the default primary workgroup already set up on Athena in the same Region where you implemented the data pipeline. If it’s your first time setting up workgroups on Athena, follow the instructions in Setting up Workgroups.

Also make sure that QuickSight has the right permissions to access Athena and your S3 bucket. Then complete the following steps:

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose Create a new dataset.
  3. Select Athena as the data source.
  4. For Data source name, enter sp_data_source.
  5. Choose Create data source.
  6. Choose AWSDataCatalog as the catalog and sp_catalog_db as the database.
  7. Select the table curated_data.
  8. Choose Select.
  9. In the Finish dataset creation section, choose Directly query your data and choose Visualize.
  10. Choose the clustered bar combo chart from the Visual types list.
  11. Expand the field wells section and then drag and drop the following fields into each section as shown in the following screenshot.
  12. Rename the visualization as you like, and optionally filter the report by sp_year using the Filter option.

From this graph, we can already benchmark OurCompany against the regional values reported by the CBS across two dimensions: the total number of installations and the total kW capacity generated by solar panels.
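
If you want to sanity-check the same figures outside QuickSight, you can run the equivalent aggregation directly in Athena. In the following sketch, apart from sp_year, the column names are hypothetical placeholders; adjust them, and the table name if DataBrew added the awsgluedatabrew_ prefix, to match your curated table:

import boto3

athena = boto3.client("athena")

# Column names other than sp_year are hypothetical; align them with your schema.
query = """
SELECT sp_year,
       SUM(our_installations) AS ourcompany_installations,
       SUM(cbs_installations) AS cbs_installations,
       SUM(our_capacity_kw)   AS ourcompany_capacity_kw,
       SUM(cbs_capacity_kw)   AS cbs_capacity_kw
FROM curated_data
GROUP BY sp_year
ORDER BY sp_year
"""

execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "sp_catalog_db"},
    WorkGroup="primary",  # assumes the primary workgroup has a query result location set
)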

We went one step further and created two KPI visualizations to empower our descriptive analytics capabilities. The following is our final dashboard that we can use to enhance our decision-making process.

Clean up resources

To clean all the resources we created for the data pipeline, complete the following steps:

  1. Remove the QuickSight analyses you created.
  2. Delete the dataset curated_data.
  3. Delete all the DataBrew projects with their associated recipes.
  4. Delete all the DataBrew datasets.
  5. Delete all the AWS Glue crawlers you created.
  6. Delete the sp_catalog_db catalog database; this removes all the tables.
  7. Empty the contents of your S3 bucket and delete it.

Conclusion

In this post, we demonstrated how you can begin your data analytics journey. With DataBrew, you can prepare and combine the data you already have with publicly available datasets such as those from the Dutch CBS (Centraal Bureau voor de Statistiek) without needing to write a single line of code. Start using DataBrew today and enrich key datasets in AWS for enhanced descriptive analytics capabilities.


About the Authors

Daniel Rozo is a Solutions Architect with Amazon Web Services based out of Amsterdam, The Netherlands. He is devoted to working with customers and engineering simple data and analytics solutions on AWS. In his free time, he enjoys playing tennis and taking tours around the beautiful Dutch canals.

Maurits de Groot is an intern Solutions Architect at Amazon Web Services. He does research on startups with a focus on FinTech. Besides working, Maurits enjoys skiing and playing squash.


Terms of use: Gemeentelijke indeling op 1 januari 2021, Zonnestroom; vermogen bedrijven en woningen, regio (indeling 2018), 2012-2018, and copies of these datasets redistributed by AWS, are licensed under the Creative Commons 4.0 license (CC BY 4.0), sourced from Centraal Bureau voor de Statistiek (CBS). The datasets used in this solution are modified to rename columns from Dutch to English, remove null or missing values, aggregate rows based on specific attributes, and combine the datasets in the final transformation. Refer to the CC BY 4.0 use, adaptation, and attribution requirements for additional information.

ConexED uses Amazon QuickSight to empower its institutional partners by unifying and curating powerful insights using engagement data

Post Syndicated from Michael Gorham original https://aws.amazon.com/blogs/big-data/conexed-uses-amazon-quicksight-to-empower-its-institutional-partners-by-unifying-and-curating-powerful-insights-using-engagement-data/

This post was co-written with Michael Gorham, Co-Founder and CTO of ConexED.

ConexED is one of the country’s fastest-growing EdTech companies designed specifically for education to enhance the student experience and elevate student success. Founded as a startup in 2008 to remove obstacles that hinder student persistence and access to student services, ConexED provides advisors, counselors, faculty, and staff in all departments across campus the tools necessary to meet students where they are.

ConexED offers a student success and case management platform, HUB Kiosk – Queuing System, and now a business intelligence (BI) dashboard powered by Amazon QuickSight to empower its institutional partners.

ConexED strives to make education more accessible by providing tools that make it easy and convenient for all students to connect with the academic support services that are vital to their success in today’s challenging and ever-evolving educational environment. ConexED’s student- and user-friendly interface makes online academic communications intuitive and as personalized as face-to-face encounters, while also making on-campus meetings as streamlined and well reported as online meetings.

One of the biggest obstacles facing school administrators is getting meaningful data quickly so that informed, data-driven decisions can be made. Reports can be time-consuming to produce, so they are often generated infrequently, which leads to outdated data. In addition, reporting often lacks customization, and data is typically captured in spreadsheets, which don’t provide a visual representation of the information that is easy to interpret. ConexED has always offered robust reporting features, but the problem was that in providing this kind of data for our partners, our development team was spending more than half its time creating custom reporting for the constantly increasing breadth of data the ConexED system generates.

Every new feature we build requires at least two or three new reports, and therefore more of our development team’s time. After we implemented QuickSight, not only can ConexED’s development team focus its energy on creating competitive features and accelerating the rollout of new products, but the reporting and data visualization are now capabilities our customers can control and customize. QuickSight features such as drill-down filtering, predictive forecasting, and aggregation insights have given us the competitive edge that our customers expect from a modern, cloud-based solution.

New technology enables strategic planning

With QuickSight, we’re able to focus on building customer-facing solutions that capture data rather than spending a large portion of our development time solving data visualization and custom report problems. Our development team no longer has to spend its time creating reports for all the data generated, and our customers don’t need to wait. Partnering with QuickSight has enabled ConexED to develop its business intelligence dashboard, which is designed to create operational efficiencies, identify opportunities, and empower institutions by uniting critical data insights to cross-campus student support services. The QuickSight data used in ConexED’s BI dashboard analyzes collected information in real time, allowing our partners to properly project trends in the coming school year using predictive analytics to improve staff efficiency, enhance the student experience, and increase rates of retention and graduation.

The following image demonstrates heat mapping, which displays the recurring days and times when student requests for support services are most frequent, with the busiest hour segments appearing more saturated in color. This enables leadership to utilize staff efficiently so that students have the support services they need when they need it on their pathway to graduation. ConexED’s BI dashboard powered by QuickSight makes this kind of information possible so that our partners can plan strategically.

QuickSight dashboards allow our customers to drill down on the data to glean even more insights of what is happening on their campus. In the following example, the pie chart depicts a whole-campus view of meetings by department, but leadership can choose one of the colored segments to drill down further for more information about a specific department. Whatever the starting point, leadership now has the ability to access more specific, real-time data to understand what’s happening on their campus or any part of it.

Dashboards provide data visualization

Our customers have been extremely impressed with our QuickSight dashboards because they provide data visualizations that make the information easier to comprehend and parse. The dynamic, interactive nature of the dashboards allows ConexED’s partners to go deeper into the data with just a click of the mouse, which immediately generates new data based on what was clicked and therefore new visuals.

With QuickSight, not only can we programmatically display boilerplate dashboards based on role type, but we can also allow our clients to branch off these dashboards and customize the reporting to their liking. The development team is now able to move quickly to build interesting features that ingest data and easily provide insightful visualizations and reports on the gathered data. ConexED’s BI dashboard powered by QuickSight enables leadership at our partner institutions to understand how users engage with support services on their campus – when they meet, why they meet, how they meet – so that they can make informed decisions to improve student engagement and services.

The right people with the right information

In education, giving the right level of data access to the right people is essential. With intuitive row- and column-level security and anonymous tagging in QuickSight, the ConexED development team was able to quickly build visualizations that correctly display partitioned data to thousands of different users with varying levels of access across our client base.
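
The post doesn’t share ConexED’s implementation, but tag-based row-level security for embedded, unregistered viewers generally follows the QuickSight API pattern sketched below. The account ID, namespace, dashboard ID, and the institution_id tag key are hypothetical, and the dataset’s row-level security rules must be configured with the same tag key:

import boto3

quicksight = boto3.client("quicksight")

response = quicksight.generate_embed_url_for_anonymous_user(
    AwsAccountId="111122223333",
    Namespace="default",
    AuthorizedResourceArns=[
        "arn:aws:quicksight:us-east-1:111122223333:dashboard/engagement-dashboard"
    ],
    ExperienceConfiguration={
        "Dashboard": {"InitialDashboardId": "engagement-dashboard"}
    },
    # The session tag restricts every query in this session to the viewer's rows.
    SessionTags=[{"Key": "institution_id", "Value": "campus-042"}],
    SessionLifetimeInMinutes=60,
)
embed_url = response["EmbedUrl"]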

At ConexED, student success is paramount, and with QuickSight powering our BI dashboard, the right people get the right data, and our institutional customers can now easily analyze vast amounts of data to identify trends in student acquisition, retention, and completion rates. They can also solve student support staffing allocation problems and improve the student experience at their institutions.

QuickSight does the heavy lifting

The ability to securely pull and aggregate data from disparate sources with very little setup work has given ConexED a head start in the predictive analytics space in the EdTech market. Now building visualizations is intuitive, insightful, and fun. In fact, the development team built an internal QuickSight dashboard to view our own customers’ QuickSight usage in only 1 day. The data visualization combinations are seemingly endless and infinitely valuable to our customers.

ConexED’s partnership with AWS has enabled us to use QuickSight to drive our BI dashboard and provide our customers with the power and information needed for today’s dynamic modern student support services teams.


About the Author

Michael Gorham is Co-Founder and CTO of ConexED. Michael is a multidisciplinary software architect with over 20 years’ experience